Real Computer Science begins where we almost stop reading ...: Chomsky Normal Form

Tuesday, 15 January 2013

Chomsky Normal Form

It is convenient to assume that every context-free grammar can without loss of generality be put
into a special format, called a normal form. One such format is Chomsky Normal Form. In CNF, we
expect every production to be of the from A ? BC or D ? d, where A, B, C and D are non-terminal
symbols and d is a non-lambda terminal symbols. If lambda (the empty string) is actually part of the
language, then S ? ? is allowed.
We will use CNF in three different places:
1. A proof of a pumping lemma for CFG’s.
2. A proof that every language generated by a CFG can be accepted by a nondeterministic
pushdown machine.
3. An algorithm (dynamic programming style) for determining whether a given string is
generated by a given context free grammar.
There are many steps needed to turn an arbitrary CFG into CNF. The steps are listed below:
1. Get rid of Useless symbols.
2. Get rid of lambda-productions.
3. Get rid of Unit Productions.
4. Get rid of Long Productions.
5. Get rid of Terminal symbols.
The steps are completely algorithmic with step one repeatable after each of the other steps if
necessary. We will do a complete example in class.
Useless Symbols –
Delete all productions containing non-terminal symbols that cannot generate terminal strings.
Delete all productions containing non-terminal symbols that cannot be reached by S.
The details of the two steps for finding useless symbols are very similar and each is a bottomup
style algorithm. To find all non-terminal symbols that generate terminal strings, we do it inductively
starting with all non-terminal symbols that generate a single terminal string in one step. Call this set T.
Then we iterate again looking for productions whose right sides are combinations of terminal symbols
and non-terminal from T. The non-terminals on the left sides of these productions are added to T, and
we repeat. This continues until T remains the same through an iteration.
To find all non-terminal symbols that can be reached by S, we do a similar thing but we start
from S and check which non-terminals appear on the right side of its productions. Call this set T.
Then we check which non-terminals appear on the right side of productions whose left side is a nonterminal
in T. This continues until T remains the same through an iteration.
The steps need to be done in this order. For example, if you do it in the opposite order, then
the grammar S ?AB, S ? 0, A ? 0, would result in the grammar S ? 0, A ? 0. If we do it in the
correct order, then we get the right answer, namely just S ? 0.
Subsequent steps may introduce new useless symbols. Hence Useless symbol removal can be
done after each of the upcoming steps to ensure that we don’t waste time carrying Useless symbols
forward.
Lambda-Production Removal
The basic idea is to find all non-terminals that can eventually produce a lambda (nullable nonterminals),
and then take every production in the grammar and substitute lambdas for each subset of
such non-terminals. We add all these new productions. We can then delete the actual lambda
productions. For example A ? 0N1N0 N ? ?. We add A ? 0N10 | 01N0 | 010, and delete N
? ?
The problem with this strategy is that it must be done for all nullable non-terminals
simultaneously. If not, here is a problem scenario: S ? 0 | X1 | 0Y0 X ? Y | ? Y ? 1 | X. In
this case, when we try to substitute for X ? ?, we add S ? 1 and Y ? ?, and then we sub for Y ?
?, we add S ? 00 and X ? ?. The trick is to calculate all nullable non-terminals at the start which
include X and Y, and then sub for all simultaneously, deleting all resulting lambda productions, except
perhaps for S ? ?. If lambda is actually in the language, then we add a special start symbol S’ ? S |
?, where S remains the old start symbol.
Unit Productions
To get rid of Unit productions like A ? B, we simply add A? anything, for every production
of the form B ? anything, and then delete A ? B. The only problem with this is that B might itself
have Unit productions. Hence, like we did in lambda productions, we first calculate all Unit nonterminals
that A can generate in one or more steps. Then the A ? anything productions are added
for all the Unit non-terminals X in the list, where X ? anything, as long as anything is not a Unit
production. The original Unit productions are then deleted. The Unit non-terminals that can be
generated by A can be computed in a straightforward bottom-up manner similar to what we did earlier
for lambda productions.
For example, consider S ? A | 11 A ? B | 1 B ? S | 0. The Unit non-terminals
that can be generated by S, A and B are A,B and B,S and A,S respectively. So we add S ? 1 and
S ? 0; A ? 11 and A ? 0; and B ? 1 and B ? 11. Then we delete S ? A, A ? B and B ?
S.
Long Productions
Now that we have gotten rid of all length zero and length one productions, we need to
concentrate on length > 2 productions. Let A ? ABC, then we simply replace this with A ? XC and
X ? AB, where X is a new non-terminal symbol. We can do this same trick inductively (recursively)
for longer productions.
If we have long terminal productions or mixed terminal/non-terminal productions, then for each
terminal symbol, say our alphabet is {0,1}, we add productions M ? 0 and N ? 1. Then all 0’s and
1’s are replaced by M’s and N’s, where M and N are new non-terminal symbols.

Real Computer Science begins where we almost stop reading ...

Tuesday, 15 January 2013

Chomsky Normal Form

No comments:

Post a Comment