Real Computer Science begins where we almost stop reading ...: Building Top-Down Parsers

Sunday, 23 June 2013

Building Top-Down Parsers

The major class of grammars used for top-down parsing is the class of LL(1) grammars. LL(1) means that the parser we build from the grammar is processing the input string from left to right (that's the first L) using leftmost derivations (the second L). This is exactly what the parsers we wish to design shall do. They will create a leftmost derivation and check it against the input string. In addition, they are allowed to look ahead one symbol. This means that the parser can examine the next symbol on its input string without advancing past it. The major piece of information that the parsers need to do this is called the select set.

Definition. The select set for the production A ® a (denoted SELECT(A ® a )) where A is a nonterminal and a is a string of terminal and nonterminial symbols is the set of all terminal symbols which begin strings derived from a .

Consider the grammar S ® e | aSb. The select set for the second production is {a}. For the first production, we cannot apply the above definition, so we have to think about things a little. Note that if the parser is looking at a symbol which follows S, then we probably should apply S ® e whenever we see that symbol. In this case the symbol b always follows S. The parser built from this appears in figure 1.

Figure 1 - A Parser for S ® e | aSb

It is important to note that in the second row of the table, the parser looked ahead one symbol to determine that S ® e should be applied.

Here is the formal definition of LL(1) grammars, a class of deterministic context free grammars based upon select sets.

Definition. A context free grammar is an LL(1) grammar if and only if any two productions with the same left-hand sides have different select sets.

We need now to fully define select sets for arbitrary productions. Intuitively, they will contain all of the terminal symbols first produced by the production, which we call its FIRST set. If the production can lead to the empty string, then the FOLLOW set of the left-hand side nonterminal must be included. Here are the formal definitions.

Definition. FIRST(A ® a ) is the set of all terminal symbols a such that some string of the form ab can be derived from a .

The FIRST set for S ® aSb is of course the set {a}.

Definition. The follow set for the nonterminal A (denoted FOLLOW(A)) is the set of all terminals a for which some string a Aab can be derived from the starting symbol S. (Where a and b are possibly empty strings of both terminals and nonterminals.)

The FOLLOW set for S in the grammar S ® e | aSb is the set {b}.

Definition. SELECT(A ® a ) contains FIRST(A ® a ). If e can be derived from a then it also contains FOLLOW(A).

We begin by determining which nonterminals can generate the empty string e using the algorithm of figure 2.

Figure 2 - Finding Nullable Nonterminals

Let's pause a moment and make sure that the algorithm for finding nonterminals which generate the null string is correct. Two steps are needed for this.

We claim that it terminates because one nonterminal is deleted every time the loop is executed and sooner or later the loop will terminate since there are no more epsilon rules or we shall run out of nonterminals.

As an example, let us apply the algorithm of figure 2 to the following grammar.

First we remove all of the productions containing terminals. Only the following four productions remain.

S ® ABA

A ® e

B ® e

C ® AD

Executing the loop one time places A and B into the set of nullable nonterminals and leaves the two following productions after removing the epsilon rules and substituting e for A and B in the above productions.

S ® e

C ® D

The loop is executed once more and S is added to the nullable nonterminal set and then the algorithm halts since C ® D is not an epsilon rule.

Correctness comes from thinking about just how a nonterminal A could generate the null string e . There must be a production A ® a where all of the nonterminals in a in turn generate e . And one of them must generate e directly.

Now we must work out computing procedures for FIRST, FOLLOW, and SELECT. We have defined them, but have not demonstrated how to compute them.

(Note: In the following, capital Roman letters are nonterminals, small Roman letters from the beginning of the alphabet are terminals, x and y can be either terminals or nonterminals. All Greek letters (except for e ) are possibly empty strings of terminals and nonterminals.)

In order to compute FIRST sets several steps are necessary. We start by identifying all of the terminal and nonterminal symbols that will be generated immediately from productions and place them in a set named BEGIN for the nonterminal on the right-hand side.

Computing FIRST(A), which we recall is the set of symbols appearing first in strings generated from A, involves taking the reflexive, transitive closure of BEGIN(A). This is done easily with a queue as follows.

Using the last grammar as an example we find the BEGIN and FIRST sets shown below for all of the nonterminal symbols.

Recall that A and c are in BEGIN(S) since they began productions and B had to join BEGIN(S) since A can generate the empty string.

Now we have all of the terminals that show up at the beginning of strings generated by each nonterminal. For completeness we state that for each terminal symbol a, FIRST(a) = {a}.

At last we have all of the information needed to compute FIRST(A ® a ) for each production. These sets are basically the union of the FIRST sets for all the symbols which can start a . Since these FIRST sets for productions will be used only to find SELECT sets they need only contain terminal symbols.

Here are the terminal symbols for the FIRST sets for the two productions that do not have right-hand sides beginning with terminals.

S ® ABA

{ a, b }

C ® AD

{ a, c }

Some more relations are needed to build the FOLLOW sets for each productions. We need to detect symbols that come after others and which end derivations. AFTER(A) is the set of all of symbols that immediately follow A and END(A) is the set of all symbols that are last in productions with A as a left-hand side.

Note that A can follow immediately after itself since B may generate the empty string. Here are the AFTER and END sets for the nonterminals.

If we set LAST(A) to be the reflexive, transitive closure of END(A) then it is the set of symbols which end strings generated from A. The same algorithm used to calculate the FIRST sets can be used here.

The next part is a little intricate. FOLLOW(N₁) is the set of all terminal symbols t such that there exist symbols N₂ and x where:

N₁ Î LAST(N₂) {N₁ ends a string derived from N₂}
and x Î AFTER(N₂) {x directly follows N₂ in some production}
and t Î FIRST(x) {t is the first terminal in string generated by x}

To compute the FOLLOW sets for the nonterminals, we shall fill in a chart with the symbols that correspond to the three relationships above.

N₁	N₂	x	t
S	S
A	S,A,C,D	A,B,D	a,b,c
B	S,B	A	a
C	S,C
D	S,C,D

At this point we have all of the information we need in order to define SELECT sets for each production. The definition is the following.

This provides the following select sets for the example grammar. After all of this work, we now know that it is not a LL(1) grammar because the two A productions do not have different select sets.

S ® ABA	{a,b}	S ® cC	{c}
A ® e	{a,b,c}	A ® a	{a}
B ® e	{a}	B ® bD	{b}
C ® AD	{a,c}	C ® b	{b}
D ® aA	{a}	D ® c	{c}

Here is another example, the grammar for arithmetic expressions. It appears in factored form on the right.

E ® T \| T+E		E ® TA
		A ® +E \| e
T ® F \| F* T		T ® FB
		B ® * T \| e
F ® x \| (E)		F ® x \| (E)

Our first task is to find the nonterminals that can generate the empty string. This is easy since only A and B can do this and it was obvious. Having no surprises is, by the way, a sign of good grammar design!

The chart below provides all of the relations that were described above for the nonterminals of the factored grammar.

	E	A	T	B	F
BEGIN	T	+	F	*	x, (
FIRST	E, T, F, x, (	A, +	T, F, x, (	B, *	F, x, (
AFTER	)		A		B
END	A,T	E	B, F	T	x, )
LAST	E,A,T,B,F,x,)	E,A,T,B,F,x,)	T,B,F,x,)	T,B,F,x,)	F,x,)

The FIRST sets for the interesting productions are the following.

E ® TA

{ x, ( }

T ® FB

{ x, ( }

Using the above information we can construct the chart for the FOLLOW sets (in the t column) as shown below.

N₁	N₂	x	t
E	E,A	)	)
A	E,A	)	)
T	E,A,T,B	A, )	+, )
B	E,A,T,B	A, )	+, )
F	E,A,T,B,F	A, B, )	+, * , )

Putting this all together we arrive at the following select sets and are able to construct a parser for the grammar. The general rule for LL(1) parsers is to pop a production's left-hand side and push the right hand side when reading the select set. Here is our example as figure 3.

Figure 3 - A Parser for Arithmetic Expressions

That wasn't so bad after all. We did a lot of work, but came up with a deterministic parser for a LL(1) grammar that describes arithmetic expressions for a programming language.

One more note on top-down parsers is needed. They're usually presented with instructions to either pop the top stack symbol (for an epsilon rule) or to replace the top stack symbol with a string (for other productions). No advance of the input is made in either case. It is always implicitly assumed that when the top of the stack matches the input then the parser pops the symbol and advances. These are called predict and verify operations based on leftmost derivations. Also, these parsers are often presented in a slightly different format. All of the match or verify operations with terminals on the input and stack are understood and a replace table is provided as the parser. This table shows what to place on the stack when the top symbol and the next input symbol are given. Here is our last example in this form.

See how the two formats describe the same parser? In the new form we just examine the input and the stack and replace the stack symbol by the string in the appropriate box of the parser table. By the way, the blank boxes depict configurations that should not occur unless there is an error. In a real parser one might have these boxes point to error message routines.

At this point we have defined a subclass (in fact a proper one) of the deterministic context free languages for which we know how to build parsers. We also have some tools that can be used to place grammars in the correct form if needed. These tools are:

omitting useless nonterminals,
omitting unreachable nonterminals,
factoring,
substitution, and
recursion removal.

A further word about recursion is needed. It should be obvious that left recursion (immediate or cyclic) is not compatible with top-down parsing. The following result provides some confort for those who make sure that the grammars they design are not recursive.

Theorem. Every LL(1) language has a nonrecursive grammar.

Finally, if we wish to translate or parse something the following recipe should help to streamline the task.

Develop a grammar for it,
Convert the grammar to LL(1) form, and
Construct the parser.

This promises success since LL(1) grammars provide parsers which are nonrecursive and deterministic.

E ® T \| T+E		E ® TA
		A ® +E \| e
T ® F \| F* T		T ® FB
		B ® * T \| e
F ® x \| (E)		F ® x \| (E)

Real Computer Science begins where we almost stop reading ...

Sunday, 23 June 2013

Building Top-Down Parsers

No comments:

Post a Comment