Removing epsilon productions from context free grammars

February 8th, 2010 at 6:53 am

Background

epsilon productions are very useful to express many grammars in a compact way. For example, take these simple function call productions in some imaginary C-like language:

func_call:: identifier '(' arguments_opt ')'
arguments_opt:: arguments_list | eps
arguments_list:: argument | argument ',' arguments_list

When composing grammars by hand, simplicity matters. It’s very useful to be able to look at arguments_opt and know that it’s an optional list of arguments. The same non-terminal can be reused in several other productions.

However, epsilon productions pose a problem for several algorithms that act on grammars. Therefore, prior to running these algorithms, epsilon productions have to be removed. Fortunately, this can be done relatively effortlessly in an automatic way.

Here I want to present an algorithm and a simple implementation for epsilon production removal.

The algorithm

Intuitively, it’s quite simple to remove epsilon productions. Consider the grammar for function calls presented above. The argument_opt nonterminal in func_call is just a short way of saying that there either is an argument list inside those parens or nothing. In other words, it can be rewritten as follows:

func_call:: identifier '(' arguments_opt ')'
          | identifier '(' ')'
arguments_opt:: arguments_list
arguments_list:: argument | argument ',' arguments_list

This duplication of productions for func_call will have to be repeated for every other production that had arguments_opt in it. This grammar looks somewhat strange, as arguments_opt is now identical to arguments_list. It is correct, however.

A more interesting case occurs when the epsilon production is in a nonterminal that appears more than once in some other production [1]. Consider:

B:: A z A
A:: a | eps

When we remove the epsilon production from A, we have to duplicate the productions that have A in them, but the production for B has two A. Since either of the A instances in the production can be empty, the only proper way to do this is go over all the combinations:

B:: z | A z | z A | A z A
A:: a

In the general case, if A appears k times in some production, this production will be replicated 2^k times, each time with a different combination [2].

This leads us to the algorithm:

  1. Pick a nonterminal A with an epsilon production
  2. Remove that epsilon production
  3. For each production containing A: Replicate it 2^k times where k is the number of A instances in the production, such that all combinations of A being there or not will be represented.
  4. If there are still epsilon productions in the grammar, go back to step 1.

A couple of points to pay attention to:

  • It’s obvious that a step of the algorithm can create new epsilon productions [3]. This is handled correctly, as it works iteratively until all epsilon productions are removed.
  • The only place where an epsilon production cannot be removed is at the start symbol. If the grammar can generate an empty string, we can’t ruin that. A special case will have to handle this case.

Implementation

Here’s an implementation of this algorithm in Python:

from collections import defaultdict


class CFG(object):
    def __init__(self):
        self.prod = defaultdict(list)
        self.start = None

    def set_start_symbol(self, start):
        """ Set the start symbol of the grammar.
        """
        self.start = start

    def add_prod(self, lhs, rhs):
        """ Add production to the grammar. 'rhs' can
            be several productions separated by '|'.
            Each production is a sequence of symbols
            separated by whitespace.
            Empty strings are interpreted as an eps-production.

            Usage:
                grammar.add_prod('NT', 'VP PP')
                grammar.add_prod('Digit', '1|2|3|4')

                # Optional Digit: digit or eps
                grammar.add_prod('Digit_opt', Digit |')
        """
        # The internal data-structure representing productions.
        # maps a nonterminal name to a list of productions, each
        # a list of symbols. An empty list [] specifies an
        # eps-production.
        #
        prods = rhs.split('|')
        for prod in prods:
            self.prod[lhs].append(prod.split())

    def remove_eps_productions(self):
        """ Removes epsilon productions from the grammar.

            The algorithm:

            1. Pick a nonterminal p_eps with an epsilon production
            2. Remove that epsilon production
            3. For each production containing p_eps, replace it
               with several productions such that all the
               combinations of p_eps being there or not will be
               represented.
            4. If there are still epsilon productions in the
               grammar, go back to step 1

            The replication can be demonstrated with an example.
            Suppose that A contains an epsilon production, and
            we've found a production B:: [A, k, A]
            Then this production of B will be replaced with these:
            [A, k], [k], [k, A], [A, k, A]
        """
        while True:
            # Find an epsilon production
            #
            p_eps, index = self._find_eps_production()

            # No epsilon productions? Then we're done...
            #
            if p_eps is None:
                break

            # Remove the epsilon production
            #
            del self.prod[p_eps][index]

            # Now find all the productions that contain the
            # production that removed.
            # For each such production, replicate it with all
            # the combinations of the removed production.
            #
            for lhs in self.prod:
                prods = []

                for lhs_prod in self.prod[lhs]:
                    num_p_eps = lhs_prod.count(p_eps)
                    if num_p_eps == 0:
                        prods.append(lhs_prod)
                    else:
                        prods.extend(self._create_prod_combinations(
                            prod=lhs_prod,
                            nt=p_eps,
                            count=num_p_eps))

                # Remove duplicates
                #
                prods = sorted(prods)
                prods = [prods[i] for i in xrange(len(prods))
                                  if i == 0 or prods[i] != prods[i-1]]

                self.prod[lhs] = prods

    def _find_eps_production(self):
        """ Finds an epsilon production in the grammar. If such
            a production is found, returns the pair (lhs, index):
            the name of the non-terminal that has an epsilon
            production and its index in lhs's list of productions.
            If no epsilon productions were found, returns the
            pair (None, None).

            Note: eps productions in the start symbol will be
            ignored, because we don't want to remove them.
        """
        for lhs in self.prod:
            if not self.start is None and lhs == self.start:
                continue

            for i, p in enumerate(self.prod[lhs]):
                if len(p) == 0:
                    return lhs, i

        return None, None

    def _create_prod_combinations(self, prod, nt, count):
        """ prod:
                A production (list) that contains at least one
                instance of 'nt'
            nt:
                The non-terminal which should be replicated
            count:
                The amount of times 'nt' appears in 'lhs_prod'.
                Assumed to be >= 1

            Returns the generated list of productions.
        """
        # The combinations are a kind of a powerset. Membership
        # in a powerset can be checked by using the binary
        # representation of a number.
        # There are 2^count possibilities in total.
        #
        numset = 1 << count
        new_prods = []

        for i in xrange(numset):
            nth_nt = 0
            new_prod = []

            for s in prod:
                if s == nt:
                    if i & (1 << nth_nt):
                        new_prod.append(s)
                    nth_nt += 1
                else:
                    new_prod.append(s)

            new_prods.append(new_prod)

        return new_prods

And here are the results with some of the sample grammars presented earlier in the article:

cfg = CFG()
cfg.add_prod('identifier', '( arguments_opt )')
cfg.add_prod('arguments_opt', 'arguments_list | ')
cfg.add_prod('arguments_list', 'argument | argument , arguments_list')

cfg.remove_eps_productions()
for p in cfg.prod:
    print p, ':: ', [' '.join(pr) for pr in cfg.prod[p]]

Produces:

func_call ::  ['identifier ( )', 'identifier ( arguments_opt )']
arguments_list ::  ['argument', 'argument , arguments_list']
arguments_opt ::  ['arguments_list']

As expected. And:

cfg = CFG()
cfg.add_prod('B', 'A z A')
cfg.add_prod('A', 'a | ')

cfg.remove_eps_productions()
for p in cfg.prod:
    print p, ':: ', [' '.join(pr) for pr in cfg.prod[p]]

Produces:

A ::  ['a']
B ::  ['A z', 'A z A', 'z', 'z A']

The implementation isn’t tuned for efficiency, but for simplicity. Luckily, CFGs are usually small enough to make the runtime of this implementation manageable. Note that the preservation of epsilon productions in the start rule is implemented in the _find_eps_production method.

http://eli.thegreenplace.net/wp-content/uploads/hline.jpg
[1] From here on, lowercase letters early in the alphabet (a, b, c…) are terminals. Early uppercase letters (A, B, C…) are nonterminals, and letters late in the alphabet (z, y, x…) are arbitrary strings of terminals and nonterminals.
[2] If this sounds like generating a powerset, you’re right.
[3] Consider the productions:
A:: a | eps
B:: b | A

After removing the epsilon production from A we’ll have:

A:: a
B:: b | A | eps

Related posts:

  1. Generating random sentences from a context free grammar
  2. The context sensitivity of C’s grammar
  3. The context sensitivity of C’s grammar, revisited
  4. Book review: “Free to choose” by Milton and Rose Friedman
  5. Recursive descent, LL and predictive parsers

11 Responses to “Removing epsilon productions from context free grammars”

  1. Brandon Craig RhodesNo Gravatar Says:

    I probably misunderstand your intention here, but it looks like you are taking a reasonable grammar allowing zero, one, or more arguments to a function, and replacing it with a grammar that only allows two options: either a function that takes no arguments, or a function that takes an infinite number. Because, if arguments_opt is the path chosen to build a sentence, then you have to start adding arguments forever without any way to stop:
    func_call:: identifier '(' arguments_opt ')'
    | identifier '(' ')'
    arguments_opt:: arguments_list
    arguments_list:: argument ',' arguments_list

    Surely this was not the intention?

  2. DanNo Gravatar Says:

    @Brandon: It seems, to me at least, that the original grammar has the same problem you mention: once you get to the arguments_opt rule you either choose the epsilon production, yielding a function with no arguments, or the arguments_list which goes on forever.

    Replacing this:

    arguments_list:: argument ',' arguments_list

    with this:

    arguments_list:: argument ',' arguments_opt

    would solve the problem in the first grammar but not in the second while adding an alternative production in arguments_list would solve the problem in both, something like this:

    arguments_list:: argument ',' arguments_list | argument
  3. Someone Who ReadsNo Gravatar Says:

    This is not a new or innovative algorithm. You are performing one step of the conversion of a CFG to Chomsky Normal Form. Please see Sipser’s “Theory of Computation,” page 107.

  4. elibenNo Gravatar Says:

    @Brandon and Dan,

    You’re right, the arguments_list is indeed unusual. I didn’t notice, because it really doesn’t matter for this article. But now I’ve fixed it to make more sense. Thanks for noticing.

  5. elibenNo Gravatar Says:

    @Someone Who Reads,

    I do not claim originality or innovation here. Like in most articles I post, I’m just sharing something interesting and trying to explain it in simple enough terms for myself to understand (and keep a useful reference for the future). I suppose you’ve misunderstood my intentions.

  6. Steve HanovNo Gravatar Says:

    How will you handle semantic actions? Can you transform the resulting parse tree back into the form of the original grammar, so that the hand-written semantic actions will still apply?

  7. elibenNo Gravatar Says:

    @Steve,

    This is an interesting question – I didn’t consider it. I suppose it’s possible to preserve the semantic actions on epsilon rules because we generally know where they were removed from. However, I did not really think about this seriously – the algorithm I’m using this for doesn’t require semantic actions at all.

  8. OlegNo Gravatar Says:

    The algorithm doesn’t work for this grammar

    cfg = CFG()
    cfg.add_prod(‘S’, ‘a B | A C | a’)
    cfg.add_prod(‘A’, ‘a | A a S a C b | B C’)
    cfg.add_prod(‘B’, ‘b | C’)
    cfg.add_prod(‘C’, ‘B A | ‘)

  9. elibenNo Gravatar Says:

    Oleg,

    Can you give an example of why you think it won’t work?

  10. OlegNo Gravatar Says:

    @eliben, For that example, you might have seen it continues to create new eps productions in each iteration. Have you had a chance to run it?

    However this should be the right answer, but I haven’t got anything because it got stuck in a while loop.

    S –> aB
    S –> AC
    S –> A
    S –> C
    A –> a
    A –> A a S a C b
    A –> a S a C b
    A –> A a a C b
    A –> A a S a b
    A –> a a C b
    A –> a S a b
    A –> A a a b
    A –> a a b
    A –> BC
    A –> B
    a –> C
    B –> b
    B –> C
    C –> BA
    C –> B
    C –> A
    S –> eps

  11. RaviNo Gravatar Says:

    Please Convert and post the code in C Language for removing epsilon production from the grammer

Leave a Reply

To post code with preserved formatting, enclose it in `backticks` (even multiple lines)