Real Computer Science begins where we almost stop reading ...: Bottom-Up Parsing

Sunday, 23 June 2013

Bottom-Up Parsing

We shall now turn from top-down or predictive parsing to the reverse. Instead of predicting what is to come and verifying it from the input we shall use a bottom-up approach. This means that rather than beginning with the starting symbol and generating an input string, we shall examine the string and attempt to work our way back to the starting symbol. In other words, we shall reconstruct the parse. We will process the input and decide how it was generated using our parser stack as a notebook. Let us begin by examining a string generated by the grammar:

A ® aB
B ® Ab
B ® b

The following chart provides an intuitive overview of this new approach to parsing. Here the string aabb is parsed in a bottom-up manner. We shall discuss the steps after presenting the chart.

In step 1 we moved the first input symbol (a) into the examination area and guessed that we might be working on A ® aB. But, since we had not seen a B we put off making any decisions and pushed the a onto the stack to save it for awhile. In step 2 we did the same. In step three we encountered a b. We knew that it could come from applying the production B ® b. So we substituted a B for the b. (Remember that we are working backwards.) In step 4 we looked at the stack and sure enough discovered an a. This meant that we could surely apply A ® aB, and so we did. (We moved the new A in to be examined after getting rid of the B which was there as well as the a on the stack since we used them to make the new A.) In step 5 we looked at the stack and the examination area, could not decide what was going on (except something such as B ® Ab possibly), so we just put the A onto the stack. Then in step 6 we looked at the b in the examination area and the A on the stack and used them both to make a B via the production B ® Ab. This B entered the examination area. Looking at this B and the stacked a in step 7 we applied A ® aB to them and placed an A in the examination area. Since nothing was left to do in step 8 and we were examining our starting symbol we accepted.

See what happened? We looked at our input string and whenever we could figure out how a symbol was generated, we applied the production that did it. We in essence worked our way up the derivation tree. And, we used the stack to save parts of the tree to our left that we needed to tie in later. Since our grammar was unambiguous and deterministic, we were able of do it.

Now let us do it all over with some new terminology and some mixing up of the above columns. When we push an input symbol into the stack we shall call it a shift. And when we apply a production we shall call it a reduce operation. We shall shift our guesses onto the stack with input symbols. For example, if we see an a and guess that we're seeing the results of applying the production A ® aB, we shift the pair (a, aB) onto the stack. After we reduce, we shall place a guess pair on the stack with the nonterminal we just produced. Here we go.

Our new parsing technique involves keeping notes on past input on the stack. For instance, in step 5 we have an a (which might be part of an aB) at the bottom of our stack, and an A (which we hope shall be part of an Ab) on top of the stack. We then use these notes to try to work backwards to the starting symbol. This is what happens when we do reduce operations. This is the standard bottom-up approach we have always seen in computer science. Our general method is to do a rightmost derivation except that we do it backwards! Neat. What we did at each step was to examine the stack and see if we could do a reduction by applying a production to the top elements of the stack. If so, then we replaced the right hand side symbols (which were at the top of the stack) with the left-hand side nonterminal.

After doing a reduction we put the new nonterminal on the stack along with a guess of what was being built. We also did this when we shifted a terminal onto the stack. Let us examine these guesses. We tried to make them as accurate as possible by looking at the stack before pushing the (symbol, guess) pair. We should also note that the pair (a,aB) means that we have placed the a on the stack and think that maybe a B will come along. On the other hand, the pair (b,Ab) indicates that the top two symbols on the stack are A and b, and, we have seen the entire right hand side of a production. Thus we always keep track of what is in the stack.

Now for another enhancement. We shall get rid of some duplication. Instead of placing (a, aB) on the stack we shall just put a|B on it. This means that we have seen the part of aB which comes before the vertical line - the symbol a. Putting aB| on the stack means that we have a and B as our top two stack symbols. Here is the same computation with our new stack symbols.

Let's pause a bit and examine these things we are placing on the stack. They are often called states and do indicate the state of the input string we have read and partially parsed. States are made up of items that are just productions with an indicator that tells us how far on the right hand side we have progressed. The set of items for the previous grammar is:

A ® \|aB	B ® \|Ab	B ® \|b
A ® a\|B	B ® A\|b	B ® b\|
A ® aB\|	B ® Ab\|

Recall what an item means. A ® a|B means that we have seen an a and hope to see a B and apply the production. Traditionally we also invent a new starting symbol and add a production where it goes to the old starting symbol. In this case this means adding the items:

S₀ ® |A

S₀ ® A|

to our collection of items.

There are lots and lots of items in a grammar. Some are almost the same. Now it is time to group equivalent items together. We take a closure of an item and get all of the equivalent ones. These closures shall form the stack symbols (or states) of our parser. These are computed according to the following procedure.

Figure 5 - Closure Computation for Items

We should compute a few closures for the items in our grammar. The only time we get past the first step above is when the vertical bar is to the left of a nonterminal. Such as in B ® |Ab. Let's do that one. We place B ® |Ab in CLOSURE(B ® |Ab) first. Then we look at all A ® a productions and put A ® |a in CLOSURE(B ® |Ab) also. This gives us:

CLOSURE(B ® |Ab) = {B ® |Ab, A ® |aB}

Some more closures are:

CLOSURE(S ® |A) = {S ® |A, A ® |aB}
CLOSURE(S ® A|) = {S ® A|}
CLOSURE(A ® a|B) = {A ® a|B, B ® |Ab, B ® |b, A ® |aB}

Thus the closure of an item is a collection of all items which represent the same sequence of things placed upon the stack recently. These items in the set are what we have seen on the input string and processed. The productions represented are all those which might be applied soon. The last closure presented above is particularly interesting since it tells us that we have seen an a and should be about to see a B. Thus either Ab, b, or aB could be arriving shortly. States will built presently by combining closures of items.

Let's return to our last table where we did a recognition of aabb. Note that in step 2 a|B was on top to the stack and the next input was b. We then placed b| on the stack. Traditionally sets of items called states are placed upon the stack and so the process of putting the next state on the stack is referred to as a GOTO. Thus from step 2 to step 3 in the recognition of aabb we execute:

GOTO(a|B, b) = b|.

In step 3 we reduced with the production B ® b and got a B. We then placed aB| on the stack. In our new terminology this is:

GOTO(a|B, B) = aB|.

It is time now to precisely define the GOTO operation. For a set of items (or state) Q and symbol x this is:

GOTO(Q, x) = {CLOSURE(A ® a x|b )} for all A ® a |xb Î Q

Check out the operations we looked at above and those in the previous acceptance table. Several more examples are:

GOTO({S ® \|A, A ® \|aB}, A)	= CLOSURE(S ® A\|) = {S ® A\|}
GOTO({S ® \|A, A ® \|aB}, a)	= CLOSURE(A ® a\|B) = {A ® a\|B, B ® \|Ab, B ® \|b, A ® \|aB}

So, all we need do is add a new starting production (S₀ ® S) to a grammar and execute the following state construction algorithm in order to generate all the states we require in order to do parsing.

Figure 6 - State Construction Algorithm

The seven states determined for our example grammar using the above algorithm are:

Q₀ = {S ® |A, A ® |aB}
Q₁ = {S ® A|}
Q₂ = {A ® a|B, B ® |Ab, B ® |b, A ® |aB}
Q₃ = {A ® aB|}
Q₄ = {B ® A|b}
Q₅ = {B ® b|}
Q₆ = {B ® Ab|}

and the relationships formed from the GOTO(Q, x) relationship are:

Note that not all state-symbol pairs are represented. We do know why there are no states to go to from Q₁, Q₃, Q₅, and Q₆ - right? Because they are states that contain items we shall reduce. After that we shall place another item on the stack as part of the reduction process.

All that remains is to build a parser that will carry out the sample computation we presented above. It is easy. Here are the rules for building the parser table. (Recall that the stack symbols or states are along the left side while grammar symbols lie across the top.)

Figure 7 - Parser Table Construction

That is all there is to it. Quite simple. Another note - we shall attach a GOTO table to the right side of our parser table so that we know what to place on the stack after a reduction. The parser for our sample grammar is provided below. The words shift and reduce have been omitted because they refer always to states and productions respectively and there should be no problem telling which is which. (Aliases for the states have been provided so that the table is readable.)

We know intuitively how these parsers work, but need to specify some things precisely. Shift operations merely push the indicated state on the stack. A reduce operation has two parts. For a reduction of A ® a where the length of a is k, first pop k states off the stack. (These are the right hand side symbols for the production.) Then if Q_i is on top of the stack, push GOTO(Q_i ,A) onto the stack. So, what we are doing is to examine the stack and push the proper state depending upon what was at the top and what was about to be processed. And last, begin with Q₀ on the stack. Try out our last example and note that exactly the same sequence of moves results.

Now let us label what we have been doing. Since we have been processing the input from left to right and doing rightmost derivations, this is called LR parsing. And the following theorem ties the LR languages into our framework.

Theorem. The following classes are equivalent.

a) Deterministic context free languages.
b) LR(1) languages.
c) LR(0) languages with endmarkers.
d) Languages accepted by deterministic pushdown automata.

GOTO({S ® \|A, A ® \|aB}, A)	= CLOSURE(S ® A\|) = {S ® A\|}
GOTO({S ® \|A, A ® \|aB}, a)	= CLOSURE(A ® a\|B) = {A ® a\|B, B ® \|Ab, B ® \|b, A ® \|aB}

Real Computer Science begins where we almost stop reading ...

Sunday, 23 June 2013

Bottom-Up Parsing

No comments:

Post a Comment