One of the most
straightforward forms of parsing is recursive descent parsing. This is a
top-down process in which the parser attempts to verify that the syntax
of the input stream is correct as it is read from left to right. A
basic operation necessary for this involves reading characters from the
input stream and matching them with terminals from the grammar that
describes the syntax of the input. Our recursive descent parsers will
look ahead one character and advance the input stream reading pointer
when proper matches occur. The routine presented in figure 1
accomplishes this matching and reading process.
Figure 1 - Symbol Matching Routine
Note that the
variable called 'next' looks ahead and always provides the next
character that will be read from the input stream. This feature is
essential if we wish our parsers to be able to predict what is due to
arrive as input. Note also that an error indicator is returned.
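As a rough illustration of such a routine, here is a minimal sketch in Python (the Input class and the '$' end-of-input marker are assumptions of this sketch, not part of the original figure):

    # Sketch of the symbol matching routine: 'next' is a one-character
    # lookahead; match() advances the reading pointer only on a proper
    # match and returns an error indicator otherwise.
    class Input:
        def __init__(self, text):
            self.text = text
            self.pos = 0

        @property
        def next(self):
            # the next character that will be read ('$' marks end of input)
            return self.text[self.pos] if self.pos < len(self.text) else '$'

        def match(self, terminal):
            if self.next == terminal:
                self.pos += 1    # advance the input reading pointer
                return True      # no error
            return False         # error indicator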
What a recursive descent parser actually does is to
perform a depth-first search of the derivation tree for the string being
parsed. This provides the 'descent' portion of the name. The
'recursive' portion comes from the parser's form, a collection of
recursive procedures.
As our first example, consider the simple grammar
E → x+T
T → (E)
T → x
and the derivation tree in figure 2 for the expression x+(x+x)
Figure 2 - Derivation Tree for x+(x+x)
A recursive
descent parser traverses the tree by first calling a procedure to
recognize an E. This procedure reads an 'x' and a '+' and then calls a
procedure to recognize a T. This would look like the following routine.
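A minimal sketch of such an E-recognizer, assuming the production E → x+T and the Input class sketched earlier, might look like this (here errorhandler simply raises an exception):

    # Sketch of the procedure that recognizes an E for E -> x+T.
    def errorhandler():
        # notify the user that a syntax error has been made; here we simply stop
        raise SyntaxError("syntax error in input")

    def expression(inp):
        if not inp.match('x'):
            errorhandler()
        if not inp.match('+'):
            errorhandler()
        term(inp)    # recognize the T that follows 'x+' (term is sketched below)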
Note that 'errorhandler'
is a procedure that notifies the user that a syntax error has been made
and then possibly terminates execution.
In order to recognize a T, the parser must figure out
which of the productions to execute. This is not difficult and is done
in the procedure that appears below.
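Here is a matching sketch of the T-recognizer, which uses the one-character lookahead to decide which production applies (again assuming the Input class and errorhandler from the sketches above):

    # Sketch of the procedure that recognizes a T for T -> (E) and T -> x.
    def term(inp):
        if inp.next == '(':          # T -> (E)
            inp.match('(')
            expression(inp)
            if not inp.match(')'):
                errorhandler()
        elif inp.next == 'x':        # T -> x
            inp.match('x')
        else:
            errorhandler()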
In the above routine, the parser determines whether T has the form (E) or x. If it has neither form, the error routine is called; otherwise the appropriate terminals and nonterminals are recognized.
So, all one needs to write a recursive descent parser
is a nice grammar. But, what exactly is a 'nice' grammar? The remainder
of this essay will be devoted to answering this question.
First of all, a grammar must be deterministic. Examine
the grammar for conditional statements presented below. (S is the
nonterminal that generates statements and B is one that generates
Boolean expressions.)
S → if B then S;
S → if B then S else S;
Both of these
productions begin the same way, so it is not clear from the first
symbols exactly which production is to be used in recognizing a
conditional statement. The solution to this problem is called factoring.
To factor, we set things up so that we recognize the common prefix (in
this case 'if B then S') and then decide which suffix to recognize. The
following grammar does exactly this.
S → if B then S Z
Z → ;
Z → else S;
Writing a recursive descent parser for this is now quite simple. The general factoring algorithm is presented in figure 3 with α as the common prefix and the βᵢ as the suffixes mentioned above.
Figure 3 - Factoring
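To illustrate how simple the parser becomes after factoring, here is a sketch of the routine for the new nonterminal Z; the helpers keyword() (which matches a multi-character terminal such as 'else') and statement() (the S-recognizer) are assumptions of this sketch, not routines defined in the original:

    # Sketch of the Z-recognizer: one-symbol lookahead picks the suffix.
    def suffix(inp):
        if inp.next == ';':          # Z -> ;
            inp.match(';')
        else:                        # Z -> else S;
            keyword(inp, 'else')
            statement(inp)
            if not inp.match(';'):
                errorhandler()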
Let us attempt another example, namely the old favorite, a grammar that generates strings of the form aⁿbⁿ, shown below.
S → aZ
Z → Sb
Z → b
which is not quite the form we want, since the production Z → Sb begins with a nonterminal and it is therefore unclear from the next input symbol when to apply the Z-productions. Substituting for S solves this problem and produces the following productions, which are exactly what we want to have.
Z → aZb
Z → b
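A recognizer for these rewritten Z-productions shows why they are now suitable for recursive descent: each right-hand side begins with a distinct terminal, so the lookahead decides immediately. A sketch, again using the Input class and errorhandler assumed earlier:

    # Sketch of a recognizer for Z -> aZb and Z -> b.
    def z(inp):
        if inp.next == 'a':          # Z -> aZb
            inp.match('a')
            z(inp)
            if not inp.match('b'):
                errorhandler()
        elif inp.next == 'b':        # Z -> b
            inp.match('b')
        else:
            errorhandler()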
The general
algorithm for substitution appears as figure 4. Capital Roman letters
are nonterminals and Greek letters represent (possibly empty) strings of
terminals and nonterminals.
Figure 4 - Substitution
All this really means is that when you substitute for a nonterminal occurring in a production, you must substitute using every production that has that nonterminal on its left-hand side.
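Here is a rough sketch of the substitution step on a grammar represented as a Python dictionary mapping each nonterminal to a list of right-hand sides (each a list of symbols); this representation is an assumption of the sketch, not the notation of figure 4:

    # Replace the first occurrence of `target` in each right-hand side of
    # `owner` by every right-hand side of `target`.
    def substitute(grammar, owner, target):
        new_rhs = []
        for rhs in grammar[owner]:
            if target in rhs:
                i = rhs.index(target)
                for repl in grammar[target]:
                    new_rhs.append(rhs[:i] + repl + rhs[i + 1:])
            else:
                new_rhs.append(rhs)
        grammar[owner] = new_rhs

    # Example: substituting for S in the Z-productions of the a^n b^n grammar.
    g = {'S': [['a', 'Z']], 'Z': [['S', 'b'], ['b']]}
    substitute(g, 'Z', 'S')
    print(g['Z'])    # [['a', 'Z', 'b'], ['b']]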
Here are three things that should not occur in nice
grammars along with algorithms for their removal. First, we have the
grammatical equivalent of spinning one's wheels.
Definition. A chain rule (or singleton production) is one of the form A → B where both A and B are nonterminals.
Ridding a grammar of chain rules merely involves substitution and makes the resultant parser more efficient. For example, given the chain rule A → B and the productions B → b | c, substitution yields A → b | c and the chain rule can be discarded. Now we examine bad nonterminals.
Definition. A useless nonterminal symbol is one that generates no strings of terminals.
Consider the following grammar for aⁿbⁿ.
S → aZ
Z → B | AB
A → a
B → b
It
is obvious that A and B can generate strings of terminals. So, replace
them in each production of the grammar by a terminal symbol (any one
will do). Also delete all of the productions with A and B on the
left-hand side since they have been certified to be useful. Now the
remaining productions are S → aZ, Z → b, and Z → ab. If we certify Z to be useful and replace it with a terminal in the
remaining productions, it is clear that S is useful too. Here is a
formal algorithm.
Figure 5 - Useless Nonterminal Detection
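A Python sketch of this marking process, using the dictionary representation from the substitution sketch above, is given here; it is one way to realize the idea, not necessarily the exact formulation of figure 5:

    # Repeatedly certify a nonterminal as useful once some right-hand side
    # of it contains only terminals and already-certified nonterminals.
    def useful_nonterminals(grammar, terminals):
        useful = set()
        changed = True
        while changed:
            changed = False
            for nt, rhss in grammar.items():
                if nt in useful:
                    continue
                if any(all(s in terminals or s in useful for s in rhs) for rhs in rhss):
                    useful.add(nt)
                    changed = True
        return useful

    g = {'S': [['a', 'Z']],
         'Z': [['B'], ['A', 'B']],
         'A': [['a']],
         'B': [['b']]}
    print(useful_nonterminals(g, {'a', 'b'}))    # all four nonterminals are useful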
Finally we have another type of nonterminal which takes up space in grammars and thus is not acceptable in nice grammars.
Definition. An unreachable nonterminal symbol is one that cannot be reached in any derivation from the starting symbol.
A transition diagram is a graph with nonterminals as vertices and a directed edge from one nonterminal to another whenever the second appears on the right-hand side of a production whose left-hand side is the first. (The direction is from left to right across the production.) In other words, an edge in the graph means that one nonterminal directly generates another. Figure 6 illustrates this for another grammar that generates strings of the form aⁿbⁿ.
Figure 6 - Grammar and Transition Diagram
A depth-first search of the diagram reveals that all nonterminals are reachable from S.
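Here is a sketch of that reachability check as a depth-first search over the same dictionary representation (an illustration only, not the grammar of figure 6):

    # Edge from a nonterminal to every nonterminal on one of its right-hand sides.
    def reachable_nonterminals(grammar, start):
        seen = set()
        stack = [start]
        while stack:
            nt = stack.pop()
            if nt in seen:
                continue
            seen.add(nt)
            for rhs in grammar.get(nt, ()):
                for sym in rhs:
                    if sym in grammar:
                        stack.append(sym)
        return seen

    g = {'S': [['a', 'Z']], 'Z': [['S', 'b'], ['b']]}
    print(reachable_nonterminals(g, 'S'))    # {'S', 'Z'}: everything is reachable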
Next, let us examine the more general grammar for very simple arithmetic expressions which follows.
E → x | E+x
Note that this
grammar generates strings of alternating x's and plus signs. The second
production does not lend itself to recursive descent since when we build
the routine for E, the first thing it will do is to call itself. This
unfortunately leads to infinite looping. Here is the formal definition
for this kind of production.
Definition. A production of the form A → Aα is left recursive.
If we look carefully at derivations using the production E → E+x we note that they could progress like this:
E ⇒ E+x ⇒ E+x+x ⇒ E+x+x+x ⇒ E+x+x+x+x ⇒ x+x+x+x+x
That is, a sequence of '+x' strings is produced before E finally generates an 'x' using the production E → x.
Another way to say it is that the string '+x' keeps being added to the
front of the derivation string until finally 'x' is generated.
At this point we note that we could have produced the above derivation in
a different manner. We could have generated 'x' first and then
generated the '+x' strings. Suppose E' is a new nonterminal that generates a
sequence of '+x' strings. The following productions do exactly this.
E → xE'
E' → ε | +xE'
We now have a
grammar that does not contain left recursion. But it does contain a
production featuring the empty word. This is called an epsilon rule
and we shall have to be rather clever in dealing with it. The mechanism
needed though is actually in place. What we shall do is figure out
whether another E' is to be recognized or not. Looking at the
productions we observe that strings generated by E' always begin with a
'+' sign. So, all we do is look ahead and see if a '+' is coming up. If
so, we look for another E'. Here is the routine for the nonterminal E'.
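A sketch of that E' routine, using the Input class and errorhandler assumed in the earlier sketches:

    # Sketch of the E' routine: the epsilon rule is handled by lookahead.
    def eprime(inp):
        if inp.next == '+':          # E' -> +xE'
            inp.match('+')
            if not inp.match('x'):
                errorhandler()
            eprime(inp)
        # otherwise E' -> epsilon: recognize nothing and simply return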
The formal algorithm for left recursion removal appears as figure 7.
Figure 7 - Left Recursion Removal
In the algorithm, note that since neither the αⱼ nor the βᵢ can begin with an A', none of the productions in the new groups are left recursive. Noticing that A now generates strings of the form αⱼβᵢ* provides the proof of correctness for the algorithm.
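Here is a Python sketch of removing immediate left recursion from the productions of a single nonterminal, in the same dictionary representation as before. It is one formulation of the idea; like the worked example below, it retains the original A → α productions alongside the new ones.

    # Productions A -> A b_i (left recursive) and A -> a_j (the rest) become
    # A -> a_j | a_j A'  and  A' -> b_i A' | epsilon (the empty right-hand side).
    def remove_immediate_left_recursion(grammar, a):
        a_prime = a + "'"
        betas = [rhs[1:] for rhs in grammar[a] if rhs and rhs[0] == a]
        alphas = [rhs for rhs in grammar[a] if not rhs or rhs[0] != a]
        if not betas:
            return    # no immediate left recursion to remove
        grammar[a] = alphas + [alpha + [a_prime] for alpha in alphas]
        grammar[a_prime] = [beta + [a_prime] for beta in betas] + [[]]

    g = {'E': [['E', '+', 'x'], ['x']]}
    remove_immediate_left_recursion(g, 'E')
    print(g)    # {'E': [['x'], ['x', "E'"]], "E'": [['+', 'x', "E'"], []]}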
Here is another example. Suppose we begin with the grammar fragment:
A → AB
A → AC
A → DA
A → a
and divide it into the two groups indicated by the construction in the left recursion removal algorithm. We now have:

A → Aβ          A → α
A → AB          A → DA
A → AC          A → a
The βᵢ mentioned in the proof are B and C, and the αⱼ are DA and a. Now we retain the A → α productions, add A' → ε, and build the two new groups mentioned in the algorithm to obtain:
A → α           A → αA'          A' → βA'
A → DA          A → DAA'         A' → BA'
A → a           A → aA'          A' → CA'
That removes immediate left recursion. But we're not out of the woods quite yet. There are more kinds of recursion. Consider the grammar:
A → BD
B → CC
C → AB
Now examine the following derivation from the nonterminal A.
A ⇒ BD ⇒ CCD ⇒ ABCD
It seems that recursion has once more caused a difficulty. This is cyclic left recursion
and it must go. The process is not too difficult. The first thing we do
is to order the nonterminals A, B, C. If we can make sure that B does
not generate a string beginning with A and that C does not generate
strings beginning with A or B, we have accomplished our task.
Let us begin. The B-production is fine as it
generates a string beginning with C and C appears later in the ordering
than B. The C-production is a problem though. Consider the following
sequence of substitutions:
C → AB        (original production)
C → BDB       (substitute for A)
C → CCDB      (substitute for B)
Now there is no cyclic
recursion but we do have immediate left recursion. This is easily taken
care of via the algorithm in figure 7 and the problem is solved. Here is
the complete algorithm as figure 8.
Figure 8 - Removing Cyclic Left Recursion
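Here is a sketch of the overall procedure in Python, following the description above: fix an ordering of the nonterminals, substitute so that no right-hand side begins with an earlier nonterminal, then remove whatever immediate left recursion this exposes. It reuses the remove_immediate_left_recursion sketch given earlier and is only one way to realize the idea, not necessarily the exact formulation of figure 8.

    def remove_cyclic_left_recursion(grammar, ordering):
        for i, a in enumerate(ordering):
            for b in ordering[:i]:
                # replace each production a -> b gamma by substituting for b
                new_rhs = []
                for rhs in grammar[a]:
                    if rhs and rhs[0] == b:
                        for repl in grammar[b]:
                            new_rhs.append(repl + rhs[1:])
                    else:
                        new_rhs.append(rhs)
                grammar[a] = new_rhs
            remove_immediate_left_recursion(grammar, a)

    # Run on the grammar of the next example, this reproduces the final
    # grammar derived by hand below:
    g = {'S': [['A', 'B']],
         'A': [['a'], ['S', 'A']],
         'B': [['b'], ['S', 'B']]}
    remove_cyclic_left_recursion(g, ['S', 'A', 'B'])
    # S -> AB,  A -> a | aA',  A' -> BAA' | epsilon,  B -> b | aBB | aA'BB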
Here is another example. Consider the following grammar.
S → AB
A → a | SA
B → b | SB
We first order the nonterminals S, A, and B. The first step is to arrange things so that no right-hand side begins with a nonterminal that comes earlier in the ordering. S → AB is fine, as is A → a. The next production, A → SA, is not what we wish to have, so we follow the steps in the for loop and transform this production as indicated below:
Original        ⇒    Substitute        ⇒    Remove Recursion
A → SA               A → ABA                A → aA'
                                            A' → ε
                                            A' → BAA'
To continue, B → b is what we want, but B → SB needs some substitution.
Original        ⇒    Substitute        ⇒    Substitute
B → SB               B → ABB                B → aBB
                                            B → aA'BB
The final grammar is the following.
S → AB
A → a | aA'
A' → ε | BAA'
B → b | aBB | aA'BB
The grammar is now
in a form that will lend itself to recursive descent parser construction
after some factoring and substitution.