We shall now turn from
top-down or predictive parsing to the reverse. Instead of predicting
what is to come and verifying it from the input we shall use a bottom-up
approach. This means that rather than beginning with the starting
symbol and generating an input string, we shall examine the string and
attempt to work our way back to the starting symbol. In other words, we
shall reconstruct the parse. We will process the input and decide how it
was generated using our parser stack as a notebook. Let us begin by
examining a string generated by the grammar:
A → aB
B → Ab
B → b
The following chart provides an intuitive overview of
this new approach to parsing. Here the string aabb is parsed in a
bottom-up manner. We shall discuss the steps after presenting the chart.
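Step   Stack     Examining   Input left   What happened
 1     a                     abb          examined the first a and pushed it
 2     aa                    bb           examined the second a and pushed it
 3     aa        B           b            replaced the b with B via B → b
 4     a         A           b            applied A → aB to the stacked a and the B
 5     aA                    b            pushed the A
 6     a         B                        applied B → Ab to the stacked A and the b
 7               A                        applied A → aB to the stacked a and the B
 8               A                        accepted
(Each row shows the configuration after that step's action.)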
In step 1 we moved the first input symbol (a) into the examination area and guessed that we might be working on A → aB. But, since we had not seen a B, we put off making any decisions and pushed the a onto the stack to save it for a while. In step 2 we did the same. In step 3 we encountered a b. We knew that it could come from applying the production B → b, so we substituted a B for the b. (Remember that we are working backwards.) In step 4 we looked at the stack and sure enough discovered an a. This meant that we could surely apply A → aB, and so we did. (We moved the new A in to be examined after getting rid of the B which was there, as well as the a on the stack, since we used them to make the new A.) In step 5 we looked at the stack and the examination area, could not decide what was going on (except possibly something such as B → Ab), so we just put the A onto the stack. Then in step 6 we looked at the b in the examination area and the A on the stack and used them both to make a B via the production B → Ab. This B entered the examination area. Looking at this B and the stacked a in step 7, we applied A → aB to them and placed an A in the examination area. Since nothing was left to do in step 8 and we were examining our starting symbol, we accepted.
See what happened? We looked at our input string and
whenever we could figure out how a symbol was generated, we applied the
production that did it. We in essence worked our way up the
derivation tree. And, we used the stack to save parts of the tree to our
left that we needed to tie in later. Since our grammar was unambiguous
and deterministic, we were able to do it.
Now let us do it all over with some new terminology and some mixing up of the above columns. When we push an input symbol onto the stack we shall call it a shift, and when we apply a production we shall call it a reduce operation. We shall shift our guesses onto the stack with input symbols. For example, if we see an a and guess that we're seeing the results of applying the production A → aB, we shift the pair (a, aB) onto the stack. After we reduce, we shall place a guess pair on the stack with the nonterminal we just produced. Here we go.
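Step   Stack              Examining   Input left   What happened
 1     (a, aB)                         abb          shifted (a, aB)
 2     (a, aB)(a, aB)                  bb           shifted (a, aB)
 3     (a, aB)(a, aB)      B           b            reduced via B → b
 4     (a, aB)             A           b            reduced via A → aB
 5     (a, aB)(A, Ab)                  b            shifted (A, Ab)
 6     (a, aB)             B                        reduced via B → Ab
 7                         A                        reduced via A → aB
 8                         A                        accepted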
Our new parsing technique
involves keeping notes on past input on the stack. For instance, in step
5 we have an a (which might be part of an aB) at the bottom of our
stack, and an A (which we hope shall be part of an Ab) on top of the
stack. We then use these notes to try to work backwards to the starting
symbol. This is what happens when we do reduce operations. This is the
standard bottom-up approach we have always seen in computer science. Our
general method is to do a rightmost derivation except that we do it
backwards! Neat. What we did at each step was to examine the stack and
see if we could do a reduction by applying a production to the top
elements of the stack. If so, then we replaced the right-hand side
symbols (which were at the top of the stack) with the left-hand side
nonterminal.
After doing a reduction we put the new nonterminal on
the stack along with a guess of what was being built. We also did this
when we shifted a terminal onto the stack. Let us examine these guesses.
We tried to make them as accurate as possible by looking at the stack
before pushing the (symbol, guess) pair. We should also note that the
pair (a,aB) means that we have placed the a on the stack and think that
maybe a B will come along. On the other hand, the pair (b, Ab) indicates
that the top two symbols on the stack are A and b, and we have seen the
entire right-hand side of a production. Thus we always keep track of
what is in the stack.
Now for another enhancement. We shall get rid of some
duplication. Instead of placing (a, aB) on the stack we shall just put
a|B on it. This means that we have seen the part of aB which comes
before the vertical line - the symbol a. Putting aB| on the stack means
that we have a and B as our top two stack symbols. Here is the same
computation with our new stack symbols.
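Step   Stack              Input left
 1     a|B                abb
 2     a|B a|B            bb
 3     a|B a|B b|         b
 4     a|B a|B aB|        b
 5     a|B A|b            b
 6     a|B A|b Ab|
 7     a|B aB|
 8     A|                 (accept)
(The examination area has been folded into the stack: whatever we were examining now goes onto the stack as part of a marked symbol.)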
Let's pause a bit and examine these things we are placing on the stack. They are often called states, and they do indicate the state of the input string we have read and partially parsed. States are made up of items, which are just productions with an indicator that tells us how far along the right-hand side we have progressed. The set of items for the previous grammar is:
A → |aB
B → |Ab
B → |b
A → a|B
B → A|b
B → b|
A → aB|
B → Ab|
Recall what an item means. A → a|B means that we have seen an a and hope to see a B and apply the production. Traditionally we also invent a new starting symbol and add a production where it goes to the old starting symbol. In this case this means adding the items:
S0 → |A
S0 → A|
to our collection of items.
There are lots and lots of items in a grammar. Some
are almost the same. Now it is time to group equivalent items together.
We take a closure of an item and get all of the equivalent ones. These closures shall form the stack symbols (or states) of our parser. These are computed according to the following procedure.
Figure 5 - Closure Computation for Items
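The procedure amounts to this: place the item itself in the closure, and whenever the marker stands just before a nonterminal, add that nonterminal's productions as items with the marker at the far left, repeating until nothing new appears. Here is a minimal sketch in Python, assuming a hypothetical representation where productions are (lefthand side, righthand side) pairs and items are (lefthand side, righthand side, marker position) triples:

def closure(items, productions):
    # An item (lhs, rhs, dot) stands for the production lhs -> rhs with
    # the vertical bar sitting just before position dot of rhs.
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(result):
            if dot < len(rhs):                    # bar is not at the far right
                for (head, body) in productions:  # every production of the
                    if head == rhs[dot]:          # symbol after the bar adds
                        item = (head, body, 0)    # the item head -> |body
                        if item not in result:
                            result.add(item)
                            changed = True
    return result

# With the productions A -> aB, B -> Ab and B -> b of our grammar:
prods = [("A", ("a", "B")), ("B", ("A", "b")), ("B", ("b",))]
closure({("B", ("A", "b"), 0)}, prods)   # yields {B -> |Ab, A -> |aB}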
We should compute a few closures for the items in our grammar. The only time we get past the first step above is when the vertical bar is to the left of a nonterminal, such as in B → |Ab. Let's do that one. We place B → |Ab in CLOSURE(B → |Ab) first. Then we look at all A → α productions and put A → |α in CLOSURE(B → |Ab) also. This gives us:
CLOSURE(B → |Ab) = {B → |Ab, A → |aB}
Some more closures are:
CLOSURE(S0 → |A) = {S0 → |A, A → |aB}
CLOSURE(S0 → A|) = {S0 → A|}
CLOSURE(A → a|B) = {A → a|B, B → |Ab, B → |b, A → |aB}
Thus the closure of an item is a collection of all
items which represent the same sequence of things placed upon the stack
recently. These items in the set are what we have seen on the input
string and processed. The productions represented are all those which
might be applied soon. The last closure presented above is particularly
interesting since it tells us that we have seen an a and should be about
to see a B. Thus either Ab, b, or aB could be arriving shortly. States
will be built presently by combining closures of items.
Let's return to our last table where we did a recognition of aabb. Note that in step 2 a|B was on top of the stack and the next input was b. We then placed b| on the stack. Traditionally, sets of items called states are placed upon the stack, and so the process of putting the next state on the stack is referred to as a GOTO. Thus from step 2 to step 3 in the recognition of aabb we execute:
GOTO(a|B, b) = b|.
In step 3 we reduced with the production B → b and got a B. We then placed aB| on the stack. In our new terminology this is:
GOTO(a|B, B) = aB|.
It is time now to precisely define the GOTO operation. For a set of items (or state) Q and symbol x this is:
GOTO(Q, x) = {CLOSURE(A → αx|β)} for all A → α|xβ ∈ Q
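Continuing the Python sketch from above (same hypothetical item representation), this is just a matter of moving the bar over x and closing the result:

def goto(state, x, productions):
    # Advance the bar over x in every item of state that expects x next,
    # then take the closure of whatever results.
    moved = {(lhs, rhs, dot + 1)
             for (lhs, rhs, dot) in state
             if dot < len(rhs) and rhs[dot] == x}
    return closure(moved, productions)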
Check out the operations we
looked at above and those in the previous acceptance table. Several more examples
are:
GOTO({S0 → |A, A → |aB}, A) = CLOSURE(S0 → A|) = {S0 → A|}
GOTO({S0 → |A, A → |aB}, a) = CLOSURE(A → a|B)
                            = {A → a|B, B → |Ab, B → |b, A → |aB}
So, all we need do is add a new starting production (S0 → S) to a grammar and execute the following state construction algorithm in order to generate all the states we require for parsing.
Figure 6 - State Construction Algorithm
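A rough Python rendering of this construction, building on the closure and goto sketches above (the fresh starting symbol is hard-coded as S0):

def build_states(productions, start):
    # Begin with the closure of S0 -> |start; then repeatedly apply GOTO
    # to every state and symbol until no new states turn up.
    symbols = {s for (_, rhs) in productions for s in rhs}
    q0 = frozenset(closure({("S0", (start,), 0)}, productions))
    states, worklist, transitions = {q0}, [q0], {}
    while worklist:
        q = worklist.pop()
        for x in symbols:
            nxt = frozenset(goto(q, x, productions))
            if nxt:
                transitions[(q, x)] = nxt
                if nxt not in states:
                    states.add(nxt)
                    worklist.append(nxt)
    return q0, states, transitions

# build_states(prods, "A") produces exactly the seven states below.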
The seven states determined for our example grammar using the above algorithm are:
Q0 = {S0 → |A, A → |aB}
Q1 = {S0 → A|}
Q2 = {A → a|B, B → |Ab, B → |b, A → |aB}
Q3 = {A → aB|}
Q4 = {B → A|b}
Q5 = {B → b|}
Q6 = {B → Ab|}
and the relationships formed from the GOTO(Q, x) relationship are:
GOTO(Q0, a) = Q2    GOTO(Q2, a) = Q2    GOTO(Q2, A) = Q4
GOTO(Q0, A) = Q1    GOTO(Q2, b) = Q5    GOTO(Q4, b) = Q6
                    GOTO(Q2, B) = Q3
Note that not all state-symbol pairs are represented. We do know why there are no states to go to from Q1, Q3, Q5, and Q6 - right? Because they are states that contain items we shall reduce. After that we shall place another state on the stack as part of the reduction process.
All that remains is to build a parser that will carry
out the sample computation we presented above. It is easy. Here are the
rules for building the parser table. (Recall that the stack symbols or
states are along the left side while grammar symbols lie across the
top.)
Figure 7 - Parser Table Construction
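In outline, for each state Qi the entries are filled in as follows.
1. If GOTO(Qi, a) = Qj for a terminal a, the entry for Qi under a is shift Qj.
2. If Qi contains a completed item A → α|, the entry for Qi under every symbol is reduce by A → α.
3. If Qi contains the completed item S0 → S|, the parser accepts when the input is exhausted.
4. If GOTO(Qi, A) = Qj for a nonterminal A, the entry for Qi under A is Qj.
Everything left blank signals an error (rejection).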
That is all
there is to it. Quite simple. Another note - we shall attach a GOTO
table to the right side of our parser table so that we know what to
place on the stack after a reduction. The parser for our sample grammar
is provided below. The words shift and reduce have been
omitted because they always refer to states and productions respectively,
and there should be no problem telling which is which. (Aliases for the
states have been provided so that the table is readable.)
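Worked out for our grammar (blank entries signal errors, end stands for the end of the input, and the A and B columns form the attached GOTO table):

        a        b        end       A     B
Q0      Q2                          Q1
Q1                        accept
Q2      Q2       Q5                 Q4    Q3
Q3      A → aB   A → aB   A → aB
Q4               Q6
Q5      B → b    B → b    B → b
Q6      B → Ab   B → Ab   B → Ab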
We know intuitively how these parsers work, but we need to specify some things precisely. Shift operations merely push the indicated state onto the stack. A reduce operation has two parts. For a reduction by A → α where the length of α is k, first pop k states off the stack. (These correspond to the right-hand side symbols of the production.) Then, if Qi is on top of the stack, push GOTO(Qi, A) onto the stack. So, what we are doing is to examine the stack and push the proper state depending upon what was at the top and what was about to be processed. And last, begin with Q0 on the stack. Try out our last example and note that exactly the same sequence of moves results.
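Pulling the Python sketches together, a driver loop along these lines carries out exactly those moves:

def parse(tokens, q0, transitions):
    # Shift pushes the indicated state; reducing by A -> alpha pops
    # len(alpha) states and then pushes GOTO(new top of stack, A).
    stack, pos = [q0], 0                    # begin with Q0 on the stack
    while True:
        completed = [(l, r) for (l, r, d) in stack[-1] if d == len(r)]
        if completed:                       # a finished right-hand side
            lhs, rhs = completed[0]
            if lhs == "S0":                 # reduced to the new start symbol
                return pos == len(tokens)   # accept if the input is used up
            del stack[len(stack) - len(rhs):]            # pop k states
            stack.append(transitions[(stack[-1], lhs)])  # push GOTO(Q, A)
        elif pos < len(tokens) and (stack[-1], tokens[pos]) in transitions:
            stack.append(transitions[(stack[-1], tokens[pos])])   # shift
            pos += 1
        else:
            return False                    # no move possible: reject

# q0, states, trans = build_states(prods, "A")
# parse(list("aabb"), q0, trans) accepts; parse(list("aab"), q0, trans) rejects.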
Now let us label what we have been doing. Since we
have been processing the input from left to right and doing rightmost
derivations, this is called LR parsing. And the following theorem ties the LR languages into our framework.
Theorem. The following classes are equivalent.
a) Deterministic context-free languages.
b) LR(1) languages.
c) LR(0) languages with endmarkers.
d) Languages accepted by deterministic pushdown automata.