Removing epsilon productions from context free grammars

Background

epsilon productions are very useful to express many grammars in a compact way. For example, take these simple function call productions in some imaginary C-like language:

func_call:: identifier '(' arguments_opt ')'
arguments_opt:: arguments_list | eps
arguments_list:: argument | argument ',' arguments_list

When composing grammars by hand, simplicity matters. It's very useful to be able to look at arguments_opt and know that it's an optional list of arguments. The same non-terminal can be reused in several other productions.

However, epsilon productions pose a problem for several algorithms that act on grammars. Therefore, prior to running these algorithms, epsilon productions have to be removed. Fortunately, this can be done relatively effortlessly in an automatic way.

Here I want to present an algorithm and a simple implementation for epsilon production removal.

The algorithm

Intuitively, it's quite simple to remove epsilon productions. Consider the grammar for function calls presented above. The argument_opt nonterminal in func_call is just a short way of saying that there either is an argument list inside those parens or nothing. In other words, it can be rewritten as follows:

func_call:: identifier '(' arguments_opt ')'
          | identifier '(' ')'
arguments_opt:: arguments_list
arguments_list:: argument | argument ',' arguments_list

This duplication of productions for func_call will have to be repeated for every other production that had arguments_opt in it. This grammar looks somewhat strange, as arguments_opt is now identical to arguments_list. It is correct, however.

A more interesting case occurs when the epsilon production is in a nonterminal that appears more than once in some other production [1]. Consider:

B:: A z A
A:: a | eps

When we remove the epsilon production from A, we have to duplicate the productions that have A in them, but the production for B has two A. Since either of the A instances in the production can be empty, the only proper way to do this is go over all the combinations:

B:: z | A z | z A | A z A
A:: a

In the general case, if A appears k times in some production, this production will be replicated 2^k times, each time with a different combination [2].

This leads us to the algorithm:

Pick a nonterminal A with an epsilon production
Remove that epsilon production
For each production containing A: Replicate it 2^k times where k is the number of A instances in the production, such that all combinations of A being there or not will be represented.
If there are still epsilon productions in the grammar, go back to step 1.

A couple of points to pay attention to:

It's obvious that a step of the algorithm can create new epsilon productions [3]. This is handled correctly, as it works iteratively until all epsilon productions are removed.
The only place where an epsilon production cannot be removed is at the start symbol. If the grammar can generate an empty string, we can't ruin that. A special case will have to handle this case.

Implementation

Here's an implementation of this algorithm in Python:

from collections import defaultdict


class CFG(object):
    def __init__(self):
        self.prod = defaultdict(list)
        self.start = None

    def set_start_symbol(self, start):
        """ Set the start symbol of the grammar.
        """
        self.start = start

    def add_prod(self, lhs, rhs):
        """ Add production to the grammar. 'rhs' can
            be several productions separated by '|'.
            Each production is a sequence of symbols
            separated by whitespace.
            Empty strings are interpreted as an eps-production.

            Usage:
                grammar.add_prod('NT', 'VP PP')
                grammar.add_prod('Digit', '1|2|3|4')

                # Optional Digit: digit or eps
                grammar.add_prod('Digit_opt', Digit |')
        """
        # The internal data-structure representing productions.
        # maps a nonterminal name to a list of productions, each
        # a list of symbols. An empty list [] specifies an
        # eps-production.
        #
        prods = rhs.split('|')
        for prod in prods:
            self.prod[lhs].append(prod.split())

    def remove_eps_productions(self):
        """ Removes epsilon productions from the grammar.

            The algorithm:

            1. Pick a nonterminal p_eps with an epsilon production
            2. Remove that epsilon production
            3. For each production containing p_eps, replace it
               with several productions such that all the
               combinations of p_eps being there or not will be
               represented.
            4. If there are still epsilon productions in the
               grammar, go back to step 1

            The replication can be demonstrated with an example.
            Suppose that A contains an epsilon production, and
            we've found a production B:: [A, k, A]
            Then this production of B will be replaced with these:
            [A, k], [k], [k, A], [A, k, A]
        """
        while True:
            # Find an epsilon production
            #
            p_eps, index = self._find_eps_production()

            # No epsilon productions? Then we're done...
            #
            if p_eps is None:
                break

            # Remove the epsilon production
            #
            del self.prod[p_eps][index]

            # Now find all the productions that contain the
            # production that removed.
            # For each such production, replicate it with all
            # the combinations of the removed production.
            #
            for lhs in self.prod:
                prods = []

                for lhs_prod in self.prod[lhs]:
                    num_p_eps = lhs_prod.count(p_eps)
                    if num_p_eps == 0:
                        prods.append(lhs_prod)
                    else:
                        prods.extend(self._create_prod_combinations(
                            prod=lhs_prod,
                            nt=p_eps,
                            count=num_p_eps))

                # Remove duplicates
                #
                prods = sorted(prods)
                prods = [prods[i] for i in xrange(len(prods))
                                  if i == 0 or prods[i] != prods[i-1]]

                self.prod[lhs] = prods

    def _find_eps_production(self):
        """ Finds an epsilon production in the grammar. If such
            a production is found, returns the pair (lhs, index):
            the name of the non-terminal that has an epsilon
            production and its index in lhs's list of productions.
            If no epsilon productions were found, returns the
            pair (None, None).

            Note: eps productions in the start symbol will be
            ignored, because we don't want to remove them.
        """
        for lhs in self.prod:
            if not self.start is None and lhs == self.start:
                continue

            for i, p in enumerate(self.prod[lhs]):
                if len(p) == 0:
                    return lhs, i

        return None, None

    def _create_prod_combinations(self, prod, nt, count):
        """ prod:
                A production (list) that contains at least one
                instance of 'nt'
            nt:
                The non-terminal which should be replicated
            count:
                The amount of times 'nt' appears in 'lhs_prod'.
                Assumed to be >= 1

            Returns the generated list of productions.
        """
        # The combinations are a kind of a powerset. Membership
        # in a powerset can be checked by using the binary
        # representation of a number.
        # There are 2^count possibilities in total.
        #
        numset = 1 << count
        new_prods = []

        for i in xrange(numset):
            nth_nt = 0
            new_prod = []

            for s in prod:
                if s == nt:
                    if i & (1 << nth_nt):
                        new_prod.append(s)
                    nth_nt += 1
                else:
                    new_prod.append(s)

            new_prods.append(new_prod)

        return new_prods

And here are the results with some of the sample grammars presented earlier in the article:

cfg = CFG()
cfg.add_prod('identifier', '( arguments_opt )')
cfg.add_prod('arguments_opt', 'arguments_list | ')
cfg.add_prod('arguments_list', 'argument | argument , arguments_list')

cfg.remove_eps_productions()
for p in cfg.prod:
    print p, ':: ', [' '.join(pr) for pr in cfg.prod[p]]

Produces:

func_call ::  ['identifier ( )', 'identifier ( arguments_opt )']
arguments_list ::  ['argument', 'argument , arguments_list']
arguments_opt ::  ['arguments_list']

As expected. And:

cfg = CFG()
cfg.add_prod('B', 'A z A')
cfg.add_prod('A', 'a | ')

cfg.remove_eps_productions()
for p in cfg.prod:
    print p, ':: ', [' '.join(pr) for pr in cfg.prod[p]]

Produces:

A ::  ['a']
B ::  ['A z', 'A z A', 'z', 'z A']

The implementation isn't tuned for efficiency, but for simplicity. Luckily, CFGs are usually small enough to make the runtime of this implementation manageable. Note that the preservation of epsilon productions in the start rule is implemented in the _find_eps_production method.

[1]	From here on, lowercase letters early in the alphabet (a, b, c...) are terminals. Early uppercase letters (A, B, C...) are nonterminals, and letters late in the alphabet (z, y, x...) are arbitrary strings of terminals and nonterminals.

[2]	If this sounds like generating a powerset, you're right.

[3]	Consider the productions:

A:: a | eps
B:: b | A

After removing the epsilon production from A we'll have:

A:: a
B:: b | A | eps