Real Computer Science begins where we almost stop reading ...: Top-down Parsing and LL(1) Languages

Sunday, 23 June 2013

Top-down Parsing and LL(1) Languages

We wish to examine the role of context free languages and pushdown automata in translation, compiling, and parsing. The reason for this is that most of the syntax for programming languages is context free and we shall use them to build compilers. In other words, if we can construct grammars for programming languages, we should be able to use pushdown automata to recognize correct programs and aid in their translation into assembly language.

In order to recognize programs with correct syntax all we need to do is write down a grammar that is able to generate all correct programs and then build a pushdown machine that will recognize this language. The following theoretical computer science result provides a mechanism for this.

Theorem 1. (Greibach Normal Form). Every context free language can be generated by a grammar with productions of the form A ® a a where a is a terminal and a is a (possibly empty) string of nonterminals.

Now we know exactly how to build the machine. We convert the grammar to Greibach normal form and jot down the one state pushdown machine that recognizes strings from the language which are generated by the grammar. The machine is based exactly on the productions. For every production of the form A ® aa , the pushdown machine:

reads an a,

pops the A, and

pushes a.

Unfortunately, there is one small problem. Machines produced from Greibach normal form grammars are often nondeterministic. This removes the utility from the process since we cannot convert these machines into the kind of programs we are used to writing. (After all, we normally do not do nondeterministic programming on purpose.)

Here is a small aside. We should note that we know that the context free languages are recursive and thus recognizable by programs which always halt. So if we desired, we could have the program do some backtracking and go through all possible derivations described by the grammar. But unfortunately this makes our computation time somewhere between n² and n³ steps when recognizing strings of length n. And, we really do not want to use the entire class of context free languages, only the deterministic ones.

This is exactly what we shall do. By resorting to the same strategy we used to get away from unsolvability we can eliminate nondeterministic languages by simplifying the grammars we design. Here is the first step.

Definition. A context free grammar is an s-grammar if and only if every production's right hand side begins with a terminal symbol and this terminal is different for any two productions with the same left-hand side.

This looks good. We can easily build machines for these grammars and can count on them being deterministic. Let us try an example, our old and trusted friend:

S ® aSb
S ® ab

It is of course not in the necessary form, but with the operation presented in figure 1, we can fix that.

Figure 1 - Factoring

We shall now try this on the above grammar for aⁿbⁿ. The a is of course just the terminal symbol a while the b 's are Sb and b. After factoring we come up with the following grammar.

S ® aA
A ® Sb
A ® b

This is not quite what we want, but substituting for S will allow us to write down the following equivalent s-grammar.

S ® aA
A ® aAb
A ® b

Designing machines for these grammars is something we have done before. For a production of the form A ® aa we just:

read the a,

pop the A, and

push a onto the stack.

The machine accepts by empty stack at the end of the input. There is one minor adjustment that must be made. We do not wish terminal symbols to be pushed onto the stack. We can prevent this by presenting the s-grammar in Greibach normal form by turning all terminals that do not begin the right-hand side into nonterminals. Or, we could modify the pushdown machine so that it just pops a terminal whenever it is on top of the stack and on the input string.

Thus in designing the pushdown machine, we always push the tail of a production's right hand side and pop the nonterminal on the left whenever we read the proper symbol. We also pop terminals when they appear on top of the stack and under the reading head at the same time.

Let us continue. At times factoring produces productions which are not compatible with s-grammars. For example, the fragment from a grammar for arithmetic assignment statements

E ® T+E
E ® T

becomes the following when we factor it.

E ® TZ
Z ® +E
Z ® e

and we can substitute for the T until we get terminals leading all of the right-hand sides of the productions. Our above example contains something we have not considered, an epsilon rule (Z ® e ). This was not exactly what we wished to see. In fact it messes up our parser. After all, just how do we read nothing and pop a Z? And when do we know to do it? The following definitions lead to an answer to this question.

Definition. The select set for the production A ® aa (denoted SELECT(A ® aa )) where A is a nonterminal and a is a possibly empty string of terminal and nonterminial symbols is the set {a}.

Definition. A context free grammar is a q-grammar if and only if every production's right hand side begins with a terminal symbol or is e , and whenever two productions possess the same left-hand side they have different select sets.

Thus if we have productions with matching left-hand sides, they must behave like s-grammars, and the select sets guarantee this. Select sets solve that problem, but what exactly is the select set for an epsilon rule? It should be just the terminal symbol we expect to see next. If we knew what should follow the A in the epsilon rule A ® e then we would know when to pop the A without using up an input symbol. The next definitions quickly provide the precise way to start setting this up.

Definition. The follow set for the nonterminal A (denoted FOLLOW(A)) is the set of all terminals a for which some string a Aab can be derived from the starting symbol S. (Where a and b are possibly empty strings of both terminals and nonterminals.)

Definition. SELECT(A ® e ) = FOLLOW(A).

That was not so difficult. We merely apply an epsilon rule whenever a symbol that follows the left-hand side nonterminal comes along. This makes a lot of sense. There is one small catch though. We must not advance past this symbol on the input string at this time. Here is our first example in a new, elegant form since it now contains an epsilon rule:

S ® aSb
S ® e

It is easy to see that the terminal symbol b must follow the S since S remains between the a's and b's until it disappears via the epsilon rule. A machine constructed from this grammar appears as figure 2.

read pop	push advance?
a S b S b b	Sb yes no yes

Figure 2 - A Parser for a q-grammar

Note that technically we are no longer designing pushdown machines. Actually we have designed a top-down parser. The difference is that parsers can advance along the input string (or not) as they desire. But we should note that they still are really pushdown automata.

Since our last example was a bit different than a pushdown automaton, let us examine the computation it goes through when presented with the string aabb. In this example, the stack bottom is to the left and the blue, italic input symbols have been read.

Stack	Input	Action
S	aabb	Apply S ® aSb
bS	a abb	Apply S ® aSb
bbS	aa bb	Apply S ® e
bb	aa bb	Verify b
b	aab b	Verify b
	aabb	Accept

Another way to look at this computation is to note that a leftmost derivation of aabb took place. In fact, if we were to concatenate the input read and the stack up to the point where the b's were verified, we would have the following leftmost derivation.

S Þ aSb Þ aaSbb Þ aabb

So, a top-down parser executes and verifies a leftmost derivation based on the input and stack symbols it sees along the way. We shall explore this phenomenon later.

Let us now go one step farther. Why require a production to have a right hand beginning with a terminal? Why not allow productions of the form:

E ® TZ

that we got when we factored the above some productions for arithmetic expressions earlier? This seems more intuitive and natural. Besides, it is no fun transforming grammars into Greibach Normal Form. Things would be far easier if we had a way of predicting when to apply the rule as we did for epsilon rules. Modifying our select set definition a little helps since it allows select sets to indicate when to apply productions.

Definition. The select set for the production A ® a (denoted SELECT(A ® a )) where A is a nonterminal and a is a string of terminal and nonterminial symbols is the set of all terminal symbols which begin strings derived from a .

This tells us that if our pushdown machine has an E on its stack and is reading a symbol from SELECT(E ® TZ), then that is the production that should be applied by pushing TZ on the stack. Here is a new class of deterministic context free grammars based upon select sets.

Definition. A context free grammar is an LL(1) grammar if and only if any two productions with the same left-hand sides have different select sets.

Two items need to be cleared up. First, LL(1) means that the parser we build from the grammar is processing the input string from left to right (that's the first L) using leftmost derivations (the second L). This is exactly what the parsers we have been designing have been doing. They create a leftmost derivation and check it against the input string. In addition we look ahead one symbol. By that we mean that the parser can examine the next symbol on its input string without advancing past it. The parser in figure 2 is an example of this.

Next, we have neglected to fully define select sets for arbitrary productions. Intuitively, this will be all of the terminal symbols first produced by the production, which we call its FIRST set. If the production can lead to the empty string, then the FOLLOW set of the left-hand side nonterminal must be included. Here are the formal definitions.

Definition. FIRST(A ® a ) is the set of all terminal symbols a such that some string of the form ab can be derived from a .

Definition. SELECT(A ® a ) contains FIRST(A ® a ). If e can be derived from a then it also contains FOLLOW(A).

It is time for another theoretical aside. The proof of this last theorem is left as an exercise in careful substitution.

Theorem 2. The class of languages generated by q-grammars is the same as the class of LL(1) languages.

Real Computer Science begins where we almost stop reading ...

Sunday, 23 June 2013

Top-down Parsing and LL(1) Languages

No comments:

Post a Comment