One of the most
straightforward forms of parsing is recursive descent parsing. This is a
top-down process in which the parser attempts to verify that the syntax
of the input stream is correct as it is read from left to right. A
basic operation necessary for this involves reading characters from the
input stream and matching them with terminals from the grammar that
describes the syntax of the input. Our recursive descent parsers will
look ahead one character and advance the input stream reading pointer
when proper matches occur. The routine presented in figure 1
accomplishes this matching and reading process.
Figure 1 - Symbol Matching Routine
Note that the
variable called 'next' looks ahead and always provides the next
character that will be read from the input stream. This feature is
essential if we wish our parsers to be able to predict what is due to
arrive as input. Note also that an error indicator is returned.
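As a rough illustration of such a routine, here is a minimal sketch in Python (the Input class and the '$' end-of-input marker are assumptions of this sketch, not part of the original figure):

    # Sketch of the symbol matching routine: 'next' is a one-character
    # lookahead; match() advances the reading pointer only on a proper
    # match and returns an error indicator otherwise.
    class Input:
        def __init__(self, text):
            self.text = text
            self.pos = 0

        @property
        def next(self):
            # the next character that will be read ('$' marks end of input)
            return self.text[self.pos] if self.pos < len(self.text) else '$'

        def match(self, terminal):
            if self.next == terminal:
                self.pos += 1    # advance the input reading pointer
                return True      # no error
            return False         # error indicator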
What a recursive descent parser actually does is to
perform a depth-first search of the derivation tree for the string being
parsed. This provides the 'descent' portion of the name. The
'recursive' portion comes from the parser's form, a collection of
recursive procedures.
As our first example, consider the simple grammar
E → x+T
T → (E)
T → x
and the derivation tree in figure 2 for the expression x+(x+x)
Figure 2 - Derivation Tree for x+(x+x)
A recursive
descent parser traverses the tree by first calling a procedure to
recognize an E. This procedure reads an 'x' and a '+' and then calls a
procedure to recognize a T. This would look like the following routine.
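A minimal sketch of such an E-recognizer, assuming the production E → x+T and the Input class sketched earlier, might look like this (here errorhandler simply raises an exception):

    # Sketch of the procedure that recognizes an E for E -> x+T.
    def errorhandler():
        # notify the user that a syntax error has been made; here we simply stop
        raise SyntaxError("syntax error in input")

    def expression(inp):
        if not inp.match('x'):
            errorhandler()
        if not inp.match('+'):
            errorhandler()
        term(inp)    # recognize the T that follows 'x+' (term is sketched below)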
Note that 'errorhandler'
is a procedure that notifies the user that a syntax error has been made
and then possibly terminates execution.
In order to recognize a T, the parser must figure out
which of the productions to execute. This is not difficult and is done
in the procedure that appears below.
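Here is a matching sketch of the T-recognizer, which uses the one-character lookahead to decide which production applies (again assuming the Input class and errorhandler from the sketches above):

    # Sketch of the procedure that recognizes a T for T -> (E) and T -> x.
    def term(inp):
        if inp.next == '(':          # T -> (E)
            inp.match('(')
            expression(inp)
            if not inp.match(')'):
                errorhandler()
        elif inp.next == 'x':        # T -> x
            inp.match('x')
        else:
            errorhandler()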
In the above routine, the parser determines whether T has the form (E) or x. If it has neither form, the error routine is called; otherwise the appropriate terminals and nonterminals are recognized.
So, all one needs to write a recursive descent parser
is a nice grammar. But, what exactly is a 'nice' grammar? The remainder
of this essay will be devoted to answering this question.
First of all, a grammar must be deterministic. Examine
the grammar for conditional statements presented below. (S is the
nonterminal that generates statements and B is one that generates
Boolean expressions.)
S → if B then S;
S → if B then S else S;
Both of these
productions begin the same way, so it is not clear from the first
symbols exactly which production is to be used in recognizing a
conditional statement. The solution to this problem is called factoring.
To factor, we set things up so that we recognize the common prefix (in
this case 'if B then S') and then decide which suffix to recognize. The
following grammar does exactly this.
S → if B then S Z
Z → ;
Z → else S;
Writing a recursive descent parser for this is now quite simple. The general factoring algorithm is presented in figure 3 with α as the common prefix and the βᵢ as the suffixes mentioned above.
Figure 3 - Factoring
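To illustrate how simple the parser becomes after factoring, here is a sketch of the routine for the new nonterminal Z; the helpers keyword() (which matches a multi-character terminal such as 'else') and statement() (the S-recognizer) are assumptions of this sketch, not routines defined in the original:

    # Sketch of the Z-recognizer: one-symbol lookahead picks the suffix.
    def suffix(inp):
        if inp.next == ';':          # Z -> ;
            inp.match(';')
        else:                        # Z -> else S;
            keyword(inp, 'else')
            statement(inp)
            if not inp.match(';'):
                errorhandler()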
Let us attempt another example, namely the old favorite, a grammar that generates strings of the form aⁿbⁿ, shown below.
S → aZ
Z → Sb
Z → b
which is not quite the form we want, since the production Z → Sb begins with a nonterminal and it is therefore unclear from the next input symbol when to apply the Z-productions. Substituting for S solves this problem and produces the following productions, which are exactly what we want to have.
Z → aZb
Z → b
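A recognizer for these rewritten Z-productions shows why they are now suitable for recursive descent: each right-hand side begins with a distinct terminal, so the lookahead decides immediately. A sketch, again using the Input class and errorhandler assumed earlier:

    # Sketch of a recognizer for Z -> aZb and Z -> b.
    def z(inp):
        if inp.next == 'a':          # Z -> aZb
            inp.match('a')
            z(inp)
            if not inp.match('b'):
                errorhandler()
        elif inp.next == 'b':        # Z -> b
            inp.match('b')
        else:
            errorhandler()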
The general
algorithm for substitution appears as figure 4. Capital Roman letters
are nonterminals and Greek letters represent (possibly empty) strings of
terminals and nonterminals.
Figure 4 - Substitution
All this really means is that when you substitute for a nonterminal occurring in a production, you must substitute using every production that has that nonterminal on its left-hand side.
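Here is a rough sketch of the substitution step on a grammar represented as a Python dictionary mapping each nonterminal to a list of right-hand sides (each a list of symbols); this representation is an assumption of the sketch, not the notation of figure 4:

    # Replace the first occurrence of `target` in each right-hand side of
    # `owner` by every right-hand side of `target`.
    def substitute(grammar, owner, target):
        new_rhs = []
        for rhs in grammar[owner]:
            if target in rhs:
                i = rhs.index(target)
                for repl in grammar[target]:
                    new_rhs.append(rhs[:i] + repl + rhs[i + 1:])
            else:
                new_rhs.append(rhs)
        grammar[owner] = new_rhs

    # Example: substituting for S in the Z-productions of the a^n b^n grammar.
    g = {'S': [['a', 'Z']], 'Z': [['S', 'b'], ['b']]}
    substitute(g, 'Z', 'S')
    print(g['Z'])    # [['a', 'Z', 'b'], ['b']]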
Here are three things that should not occur in nice
grammars along with algorithms for their removal. First, we have the
grammatical equivalent of spinning one's wheels.
Definition. A chain rule (or singleton production) is one of the form A → B where both A and B are nonterminals.
Ridding a grammar of chain rules merely involves substitution and makes the resultant parser more efficient. For example, given the chain rule A → B and the productions B → b | c, substitution yields A → b | c and the chain rule can be discarded. Now we examine bad nonterminals.
Definition. A useless nonterminal symbol is one that generates no strings of terminals.
Consider the following grammar for aⁿbⁿ.
S → aZ
Z → B | AB
A → a
B → b
It
is obvious that A and B can generate strings of terminals. So, replace
them in each production of the grammar by a terminal symbol (any one
will do). Also delete all of the productions with A and B on the
left-hand side since they have been certified to be useful. Now the
remaining productions are S → aZ, Z → b, and Z → ab. If we certify Z to be useful and replace it with a terminal in the
remaining productions, it is clear that S is useful too. Here is a
formal algorithm.
Figure 5 - Useless Nonterminal Detection
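A Python sketch of this marking process, using the dictionary representation from the substitution sketch above, is given here; it is one way to realize the idea, not necessarily the exact formulation of figure 5:

    # Repeatedly certify a nonterminal as useful once some right-hand side
    # of it contains only terminals and already-certified nonterminals.
    def useful_nonterminals(grammar, terminals):
        useful = set()
        changed = True
        while changed:
            changed = False
            for nt, rhss in grammar.items():
                if nt in useful:
                    continue
                if any(all(s in terminals or s in useful for s in rhs) for rhs in rhss):
                    useful.add(nt)
                    changed = True
        return useful

    g = {'S': [['a', 'Z']],
         'Z': [['B'], ['A', 'B']],
         'A': [['a']],
         'B': [['b']]}
    print(useful_nonterminals(g, {'a', 'b'}))    # all four nonterminals are useful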
Finally we have another type of nonterminal which takes up space in grammars and thus is not acceptable in nice grammars.
Definition. An unreachable nonterminal symbol is one that cannot be reached in any derivation from the starting symbol.
A transition diagram is a graph with nonterminals as vertices and a directed edge from one nonterminal to another whenever the second appears on the right-hand side of a production whose left-hand side is the first. (The direction is from left to right across the production.) In other words, an edge in the graph means that one nonterminal directly generates another. Figure 6 illustrates this for another grammar that generates strings of the form aⁿbⁿ.
Figure 6 - Grammar and Transition Diagram
A depth-first search of the diagram reveals that all nonterminals are reachable from S.
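Here is a sketch of that reachability check as a depth-first search over the same dictionary representation (an illustration only, not the grammar of figure 6):

    # Edge from a nonterminal to every nonterminal on one of its right-hand sides.
    def reachable_nonterminals(grammar, start):
        seen = set()
        stack = [start]
        while stack:
            nt = stack.pop()
            if nt in seen:
                continue
            seen.add(nt)
            for rhs in grammar.get(nt, ()):
                for sym in rhs:
                    if sym in grammar:
                        stack.append(sym)
        return seen

    g = {'S': [['a', 'Z']], 'Z': [['S', 'b'], ['b']]}
    print(reachable_nonterminals(g, 'S'))    # {'S', 'Z'}: everything is reachable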
Next, let us examine the more general grammar for very simple arithmetic expressions which follows.
E → x | E+x
Note that this
grammar generates strings of alternating x's and plus signs. The second
production does not lend itself to recursive descent since when we build
the routine for E, the first thing it will do is to call itself. This
unfortunately leads to infinite looping. Here is the formal definition
for this kind of production.
Definition. A production of the form A → Aα is left recursive.
If we look carefully at derivations using the production E → E+x we note that they could progress like this:
E ⇒ E+x ⇒ E+x+x ⇒ E+x+x+x ⇒ E+x+x+x+x ⇒ x+x+x+x+x
That is, a sequence of '+x' strings is produced before E finally generates an 'x' using the production E → x.
Another way to say it is that the string '+x' keeps being added to the
front of the derivation string until finally 'x' is generated.
At this point we note that we could have produced the above derivation in
a different manner. We could have generated 'x' first and then
generated the '+x' strings. Suppose E' is a new nonterminal that generates a
sequence of '+x' strings. The following productions do exactly this.
E → xE'
E' → ε | +xE'
We now have a
grammar that does not contain left recursion. But it does contain a
production featuring the empty word. This is called an epsilon rule
and we shall have to be rather clever in dealing with it. The mechanism
needed though is actually in place. What we shall do is figure out
whether another E' is to be recognized or not. Looking at the
productions we observe that strings generated by E' always begin with a
'+' sign. So, all we do is look ahead and see if a '+' is coming up. If
so, we look for another E'. Here is the routine for the nonterminal E'.
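A sketch of that E' routine, using the Input class and errorhandler assumed in the earlier sketches:

    # Sketch of the E' routine: the epsilon rule is handled by lookahead.
    def eprime(inp):
        if inp.next == '+':          # E' -> +xE'
            inp.match('+')
            if not inp.match('x'):
                errorhandler()
            eprime(inp)
        # otherwise E' -> epsilon: recognize nothing and simply return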
The formal algorithm for left recursion removal appears as figure 7.
Figure 7 - Left Recursion Removal
In the algorithm, note that since neither the αⱼ nor the βᵢ can begin with an A', none of the productions in the new groups are left recursive. Noticing that A now generates strings of the form αⱼβᵢ* provides the proof of correctness for the algorithm.
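Here is a Python sketch of removing immediate left recursion from the productions of a single nonterminal, in the same dictionary representation as before. It is one formulation of the idea; like the worked example below, it retains the original A → α productions alongside the new ones.

    # Productions A -> A b_i (left recursive) and A -> a_j (the rest) become
    # A -> a_j | a_j A'  and  A' -> b_i A' | epsilon (the empty right-hand side).
    def remove_immediate_left_recursion(grammar, a):
        a_prime = a + "'"
        betas = [rhs[1:] for rhs in grammar[a] if rhs and rhs[0] == a]
        alphas = [rhs for rhs in grammar[a] if not rhs or rhs[0] != a]
        if not betas:
            return    # no immediate left recursion to remove
        grammar[a] = alphas + [alpha + [a_prime] for alpha in alphas]
        grammar[a_prime] = [beta + [a_prime] for beta in betas] + [[]]

    g = {'E': [['E', '+', 'x'], ['x']]}
    remove_immediate_left_recursion(g, 'E')
    print(g)    # {'E': [['x'], ['x', "E'"]], "E'": [['+', 'x', "E'"], []]}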
Here is another example. Suppose we begin with the grammar fragment:
A → AB
A → AC
A → DA
A → a
and divide it into the two groups indicated by the construction in the left recursion removal algorithm. We now have:

A → Aβ          A → α
A → AB          A → DA
A → AC          A → a
The βᵢ mentioned in the proof are B and C, and the αⱼ are DA and a. Now we retain the A → α productions, add A' → ε, and build the two new groups mentioned in the algorithm to obtain:
A → α           A → αA'          A' → βA'
A → DA          A → DAA'         A' → BA'
A → a           A → aA'          A' → CA'
That removes immediate left recursion. But we're not out of the woods quite yet. There are more kinds of recursion. Consider the grammar:
A → BD
B → CC
C → AB
Now examine the following derivation from the nonterminal A.
A ⇒ BD ⇒ CCD ⇒ ABCD
It seems that recursion has once more caused a difficulty. This is cyclic left recursion
and it must go. The process is not too difficult. The first thing we do
is to order the nonterminals A, B, C. If we can make sure that B does
not generate a string beginning with A and that C does not generate
strings beginning with A or B, we have accomplished our task.
Let us begin. The B-production is fine as it
generates a string beginning with C and C appears later in the ordering
than B. The C-production is a problem though. Consider the following
sequence of substitutions:
C → AB        (original production)
C → BDB       (substitute for A)
C → CCDB      (substitute for B)
Now there is no cyclic
recursion but we do have immediate left recursion. This is easily taken
care of via the algorithm in figure 7 and the problem is solved. Here is
the complete algorithm as figure 8.
Figure 8 - Removing Cyclic Left Recursion
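Here is a sketch of the overall procedure in Python, following the description above: fix an ordering of the nonterminals, substitute so that no right-hand side begins with an earlier nonterminal, then remove whatever immediate left recursion this exposes. It reuses the remove_immediate_left_recursion sketch given earlier and is only one way to realize the idea, not necessarily the exact formulation of figure 8.

    def remove_cyclic_left_recursion(grammar, ordering):
        for i, a in enumerate(ordering):
            for b in ordering[:i]:
                # replace each production a -> b gamma by substituting for b
                new_rhs = []
                for rhs in grammar[a]:
                    if rhs and rhs[0] == b:
                        for repl in grammar[b]:
                            new_rhs.append(repl + rhs[1:])
                    else:
                        new_rhs.append(rhs)
                grammar[a] = new_rhs
            remove_immediate_left_recursion(grammar, a)

    # Run on the grammar of the next example, this reproduces the final
    # grammar derived by hand below:
    g = {'S': [['A', 'B']],
         'A': [['a'], ['S', 'A']],
         'B': [['b'], ['S', 'B']]}
    remove_cyclic_left_recursion(g, ['S', 'A', 'B'])
    # S -> AB,  A -> a | aA',  A' -> BAA' | epsilon,  B -> b | aBB | aA'BB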
Here is another example. Consider the following grammar.
S → AB
A → a | SA
B → b | SB
We first order the nonterminals S, A, and B. The first step is to arrange things so that no right-hand side begins with a nonterminal that comes earlier in the ordering. S → AB is fine, as is A → a. The next production, A → SA, is not what we wish to have, so we follow the steps in the for loop and transform this production as indicated below:
Original        ⇒    Substitute        ⇒    Remove Recursion
A → SA               A → ABA                A → aA'
                                            A' → ε
                                            A' → BAA'
To continue, B → b is what we want, but B → SB needs some substitution.
Original        ⇒    Substitute        ⇒    Substitute
B → SB               B → ABB                B → aBB
                                            B → aA'BB
The final grammar is the following.
S → AB
A → a | aA'
A' → ε | BAA'
B → b | aBB | aA'BB
The grammar is now
in a form that will lend itself to recursive descent parser construction
after some factoring and substitution.