Removing epsilon productions from context free grammars
February 8th, 2010 at 6:53 amBackground
epsilon productions are very useful to express many grammars in a compact way. For example, take these simple function call productions in some imaginary C-like language:
func_call:: identifier '(' arguments_opt ')'
arguments_opt:: arguments_list | eps
arguments_list:: argument | argument ',' arguments_list
When composing grammars by hand, simplicity matters. It’s very useful to be able to look at arguments_opt and know that it’s an optional list of arguments. The same non-terminal can be reused in several other productions.
However, epsilon productions pose a problem for several algorithms that act on grammars. Therefore, prior to running these algorithms, epsilon productions have to be removed. Fortunately, this can be done relatively effortlessly in an automatic way.
Here I want to present an algorithm and a simple implementation for epsilon production removal.
The algorithm
Intuitively, it’s quite simple to remove epsilon productions. Consider the grammar for function calls presented above. The argument_opt nonterminal in func_call is just a short way of saying that there either is an argument list inside those parens or nothing. In other words, it can be rewritten as follows:
func_call:: identifier '(' arguments_opt ')'
| identifier '(' ')'
arguments_opt:: arguments_list
arguments_list:: argument | argument ',' arguments_list
This duplication of productions for func_call will have to be repeated for every other production that had arguments_opt in it. This grammar looks somewhat strange, as arguments_opt is now identical to arguments_list. It is correct, however.
A more interesting case occurs when the epsilon production is in a nonterminal that appears more than once in some other production [1]. Consider:
B:: A z A
A:: a | eps
When we remove the epsilon production from A, we have to duplicate the productions that have A in them, but the production for B has two A. Since either of the A instances in the production can be empty, the only proper way to do this is go over all the combinations:
B:: z | A z | z A | A z A
A:: a
In the general case, if A appears k times in some production, this production will be replicated 2^k times, each time with a different combination [2].
This leads us to the algorithm:
- Pick a nonterminal A with an epsilon production
- Remove that epsilon production
- For each production containing A: Replicate it 2^k times where k is the number of A instances in the production, such that all combinations of A being there or not will be represented.
- If there are still epsilon productions in the grammar, go back to step 1.
A couple of points to pay attention to:
- It’s obvious that a step of the algorithm can create new epsilon productions [3]. This is handled correctly, as it works iteratively until all epsilon productions are removed.
- The only place where an epsilon production cannot be removed is at the start symbol. If the grammar can generate an empty string, we can’t ruin that. A special case will have to handle this case.
Implementation
Here’s an implementation of this algorithm in Python:
from collections import defaultdict
class CFG(object):
def __init__(self):
self.prod = defaultdict(list)
self.start = None
def set_start_symbol(self, start):
""" Set the start symbol of the grammar.
"""
self.start = start
def add_prod(self, lhs, rhs):
""" Add production to the grammar. 'rhs' can
be several productions separated by '|'.
Each production is a sequence of symbols
separated by whitespace.
Empty strings are interpreted as an eps-production.
Usage:
grammar.add_prod('NT', 'VP PP')
grammar.add_prod('Digit', '1|2|3|4')
# Optional Digit: digit or eps
grammar.add_prod('Digit_opt', Digit |')
"""
# The internal data-structure representing productions.
# maps a nonterminal name to a list of productions, each
# a list of symbols. An empty list [] specifies an
# eps-production.
#
prods = rhs.split('|')
for prod in prods:
self.prod[lhs].append(prod.split())
def remove_eps_productions(self):
""" Removes epsilon productions from the grammar.
The algorithm:
1. Pick a nonterminal p_eps with an epsilon production
2. Remove that epsilon production
3. For each production containing p_eps, replace it
with several productions such that all the
combinations of p_eps being there or not will be
represented.
4. If there are still epsilon productions in the
grammar, go back to step 1
The replication can be demonstrated with an example.
Suppose that A contains an epsilon production, and
we've found a production B:: [A, k, A]
Then this production of B will be replaced with these:
[A, k], [k], [k, A], [A, k, A]
"""
while True:
# Find an epsilon production
#
p_eps, index = self._find_eps_production()
# No epsilon productions? Then we're done...
#
if p_eps is None:
break
# Remove the epsilon production
#
del self.prod[p_eps][index]
# Now find all the productions that contain the
# production that removed.
# For each such production, replicate it with all
# the combinations of the removed production.
#
for lhs in self.prod:
prods = []
for lhs_prod in self.prod[lhs]:
num_p_eps = lhs_prod.count(p_eps)
if num_p_eps == 0:
prods.append(lhs_prod)
else:
prods.extend(self._create_prod_combinations(
prod=lhs_prod,
nt=p_eps,
count=num_p_eps))
# Remove duplicates
#
prods = sorted(prods)
prods = [prods[i] for i in xrange(len(prods))
if i == 0 or prods[i] != prods[i-1]]
self.prod[lhs] = prods
def _find_eps_production(self):
""" Finds an epsilon production in the grammar. If such
a production is found, returns the pair (lhs, index):
the name of the non-terminal that has an epsilon
production and its index in lhs's list of productions.
If no epsilon productions were found, returns the
pair (None, None).
Note: eps productions in the start symbol will be
ignored, because we don't want to remove them.
"""
for lhs in self.prod:
if not self.start is None and lhs == self.start:
continue
for i, p in enumerate(self.prod[lhs]):
if len(p) == 0:
return lhs, i
return None, None
def _create_prod_combinations(self, prod, nt, count):
""" prod:
A production (list) that contains at least one
instance of 'nt'
nt:
The non-terminal which should be replicated
count:
The amount of times 'nt' appears in 'lhs_prod'.
Assumed to be >= 1
Returns the generated list of productions.
"""
# The combinations are a kind of a powerset. Membership
# in a powerset can be checked by using the binary
# representation of a number.
# There are 2^count possibilities in total.
#
numset = 1 << count
new_prods = []
for i in xrange(numset):
nth_nt = 0
new_prod = []
for s in prod:
if s == nt:
if i & (1 << nth_nt):
new_prod.append(s)
nth_nt += 1
else:
new_prod.append(s)
new_prods.append(new_prod)
return new_prods
And here are the results with some of the sample grammars presented earlier in the article:
cfg = CFG()
cfg.add_prod('identifier', '( arguments_opt )')
cfg.add_prod('arguments_opt', 'arguments_list | ')
cfg.add_prod('arguments_list', 'argument | argument , arguments_list')
cfg.remove_eps_productions()
for p in cfg.prod:
print p, ':: ', [' '.join(pr) for pr in cfg.prod[p]]
Produces:
func_call :: ['identifier ( )', 'identifier ( arguments_opt )']
arguments_list :: ['argument', 'argument , arguments_list']
arguments_opt :: ['arguments_list']
As expected. And:
cfg = CFG()
cfg.add_prod('B', 'A z A')
cfg.add_prod('A', 'a | ')
cfg.remove_eps_productions()
for p in cfg.prod:
print p, ':: ', [' '.join(pr) for pr in cfg.prod[p]]
Produces:
A :: ['a']
B :: ['A z', 'A z A', 'z', 'z A']
The implementation isn’t tuned for efficiency, but for simplicity. Luckily, CFGs are usually small enough to make the runtime of this implementation manageable. Note that the preservation of epsilon productions in the start rule is implemented in the _find_eps_production method.

| [1] | From here on, lowercase letters early in the alphabet (a, b, c…) are terminals. Early uppercase letters (A, B, C…) are nonterminals, and letters late in the alphabet (z, y, x…) are arbitrary strings of terminals and nonterminals. |
| [2] | If this sounds like generating a powerset, you’re right. |
| [3] | Consider the productions: |
A:: a | eps
B:: b | A
After removing the epsilon production from A we’ll have:
A:: a
B:: b | A | eps
Related posts:

February 8th, 2010 at 16:14
I probably misunderstand your intention here, but it looks like you are taking a reasonable grammar allowing zero, one, or more arguments to a function, and replacing it with a grammar that only allows two options: either a function that takes no arguments, or a function that takes an infinite number. Because, if
arguments_optis the path chosen to build a sentence, then you have to start adding arguments forever without any way to stop:func_call:: identifier '(' arguments_opt ')'| identifier '(' ')'
arguments_opt:: arguments_list
arguments_list:: argument ',' arguments_list
Surely this was not the intention?
February 8th, 2010 at 18:25
@Brandon: It seems, to me at least, that the original grammar has the same problem you mention: once you get to the arguments_opt rule you either choose the epsilon production, yielding a function with no arguments, or the arguments_list which goes on forever.
Replacing this:
with this:
would solve the problem in the first grammar but not in the second while adding an alternative production in arguments_list would solve the problem in both, something like this:
February 8th, 2010 at 20:26
This is not a new or innovative algorithm. You are performing one step of the conversion of a CFG to Chomsky Normal Form. Please see Sipser’s “Theory of Computation,” page 107.
February 8th, 2010 at 21:17
@Brandon and Dan,
You’re right, the
arguments_listis indeed unusual. I didn’t notice, because it really doesn’t matter for this article. But now I’ve fixed it to make more sense. Thanks for noticing.February 8th, 2010 at 21:20
@Someone Who Reads,
I do not claim originality or innovation here. Like in most articles I post, I’m just sharing something interesting and trying to explain it in simple enough terms for myself to understand (and keep a useful reference for the future). I suppose you’ve misunderstood my intentions.
February 9th, 2010 at 14:36
How will you handle semantic actions? Can you transform the resulting parse tree back into the form of the original grammar, so that the hand-written semantic actions will still apply?
February 10th, 2010 at 06:13
@Steve,
This is an interesting question – I didn’t consider it. I suppose it’s possible to preserve the semantic actions on epsilon rules because we generally know where they were removed from. However, I did not really think about this seriously – the algorithm I’m using this for doesn’t require semantic actions at all.
May 11th, 2012 at 16:11
The algorithm doesn’t work for this grammar
cfg = CFG()
cfg.add_prod(‘S’, ‘a B | A C | a’)
cfg.add_prod(‘A’, ‘a | A a S a C b | B C’)
cfg.add_prod(‘B’, ‘b | C’)
cfg.add_prod(‘C’, ‘B A | ‘)
May 13th, 2012 at 16:49
Oleg,
Can you give an example of why you think it won’t work?
May 14th, 2012 at 11:14
@eliben, For that example, you might have seen it continues to create new eps productions in each iteration. Have you had a chance to run it?
However this should be the right answer, but I haven’t got anything because it got stuck in a while loop.
S –> aB
S –> AC
S –> A
S –> C
A –> a
A –> A a S a C b
A –> a S a C b
A –> A a a C b
A –> A a S a b
A –> a a C b
A –> a S a b
A –> A a a b
A –> a a b
A –> BC
A –> B
a –> C
B –> b
B –> C
C –> BA
C –> B
C –> A
S –> eps