We wish to examine the
role of context free languages and pushdown automata in translation,
compiling, and parsing. The reason for this is that most of the syntax
for programming languages is context free and we shall use them to build
compilers. In other words, if we can construct grammars for
programming languages, we should be able to use pushdown automata to
recognize correct programs and aid in their translation into assembly
language.
In order to recognize programs with correct syntax all
we need to do is write down a grammar that is able to generate all correct
programs and then build a pushdown machine that will recognize this language.
The following theoretical computer science result provides a mechanism for this.
Theorem 1. (Greibach Normal Form). Every context free language can be
generated by a grammar with productions of the form A
®
a
a
where a is a terminal
and a
is a (possibly empty)
string of nonterminals.
Now
we know exactly how to build the machine. We convert the grammar to
Greibach normal form and jot down the one state pushdown machine that
recognizes strings from the language which are generated by the grammar.
The machine is based exactly on the productions. For every production
of the form A ® aa , the pushdown machine:
- reads an a,
- pops the A, and
- pushes a.
Unfortunately, there is one small problem. Machines produced from Greibach normal form grammars are often nondeterministic.
This removes the utility from the process since we cannot convert these
machines into the kind of programs we are used to writing. (After all,
we normally do not do nondeterministic programming on purpose.)
Here is a small aside. We should note that we know that the context free languages are
recursive and thus recognizable by programs which always halt. So if we
desired, we could have the program do some backtracking and go through
all possible derivations described by the grammar. But unfortunately
this makes our computation time somewhere between n2 and n3
steps when recognizing strings of length n. And, we really do not want
to use the entire class of context free languages, only the
deterministic ones.
This is exactly what we shall do. By resorting to the
same strategy we used to get away from unsolvability we can eliminate
nondeterministic languages by simplifying the grammars we design. Here
is the first step.
Definition. A context free grammar is an s-grammar
if and only if every production's right hand side begins with a
terminal symbol and this terminal is different for any two productions
with the same left-hand side.
This looks good. We can
easily build machines for these grammars and can count on them being deterministic.
Let us try an example, our old and trusted friend:
S ® ab
It is of course not in the necessary form, but with the operation presented in figure 1, we can fix that.
Figure 1 - Factoring
We shall now try this on the above grammar for anbn. The a is of course just the terminal symbol a while the b 's are Sb and b. After factoring we come up with the following grammar.
A ® Sb
A ® b
This is not quite what we want, but substituting for S will allow us to write down the following equivalent s-grammar.
A ® aAb
A ® b
Designing machines for these grammars is something we have done before. For a production of the form A ® aa we just:
- read the a,
- pop the A, and
- push a onto the stack.
The machine accepts by empty stack at the end of the
input. There is one minor adjustment that must be made. We do not wish
terminal symbols to be pushed onto the stack. We can prevent this by
presenting the s-grammar in Greibach normal form by turning all
terminals that do not begin the right-hand side into nonterminals. Or,
we could modify the pushdown machine so that it just pops a terminal
whenever it is on top of the stack and on the input string.
Thus in designing the pushdown machine, we always
push the tail of a production's right hand side and pop the nonterminal
on the left whenever we read the proper symbol. We also pop terminals
when they appear on top of the stack and under the reading head at the
same time.
Let us continue. At times factoring produces
productions which are not compatible with s-grammars. For example, the
fragment from a grammar for arithmetic assignment statements
E ® T
becomes the following when we factor it.
Z ® +E
Z ® e
and we can
substitute for the T until we get terminals leading all of the
right-hand sides of the productions. Our above example contains
something we have not considered, an epsilon rule (Z ® e
). This was not exactly what we wished to see. In fact it messes up our
parser. After all, just how do we read nothing and pop a Z? And when do
we know to do it? The following definitions lead to an answer to this
question.
Definition. The select set for the production A ® aa (denoted SELECT(A ® aa )) where A is a nonterminal and a is a possibly empty string of terminal and nonterminial symbols is the set {a}.
Definition. A context free grammar is a q-grammar if and only if every production's right hand side begins with a terminal symbol or is e , and whenever two productions possess the same left-hand side they have different select sets.
Thus
if we have productions with matching left-hand sides, they must behave
like s-grammars, and the select sets guarantee this. Select sets solve
that problem, but what exactly is the select set for an epsilon rule? It
should be just the terminal symbol we expect to see next. If we knew
what should follow the A in the epsilon rule A ® e
then we would know when to pop the A without using up an input symbol.
The next definitions quickly provide the precise way to start setting
this up.
Definition. The follow set for the nonterminal A (denoted FOLLOW(A)) is the set of all terminals a for which some string a Aab can be derived from the starting symbol S. (Where a and b are possibly empty strings of both terminals and nonterminals.)
Definition. SELECT(A ® e ) = FOLLOW(A).
That was not so
difficult. We merely apply an epsilon rule whenever a symbol that
follows the left-hand side nonterminal comes along. This makes a lot of
sense. There is one small catch though. We must not advance past
this symbol on the input string at this time. Here is our first example
in a new, elegant form since it now contains an epsilon rule:
S ® e
It is easy to see that the terminal symbol b must
follow the S since S remains between the a's and b's until it disappears
via the epsilon rule. A machine constructed from this grammar appears
as figure 2.
read pop | push advance? |
a S b S b b |
Sb yes no yes |
Figure 2 - A Parser for a q-grammar
Note that technically we are no longer designing pushdown machines. Actually we have designed a top-down parser.
The difference is that parsers can advance along the input string (or
not) as they desire. But we should note that they still are really
pushdown automata.
Since our last example was a bit different than a
pushdown automaton, let us examine the computation it goes through when
presented with the string aabb. In this example, the stack bottom is to
the left and the blue, italic input symbols have been read.
Stack
|
Input |
Action
|
S | aabb |
Apply S ® aSb
|
bS | a abb |
Apply S ® aSb
|
bbS | aa bb |
Apply S ® e
|
bb | aa bb |
Verify b
|
b | aab b |
Verify b
|
aabb |
Accept
|
Another way to look at
this computation is to note that a leftmost derivation of aabb took
place. In fact, if we were to concatenate the input read and the stack
up to the point where the b's were verified, we would have the following
leftmost derivation.
S Þ aSb Þ aaSbb Þ aabb
So, a top-down parser executes and verifies a
leftmost derivation based on the input and stack symbols it sees along
the way. We shall explore this phenomenon later.
Let us now go one step farther. Why require a
production to have a right hand beginning with a terminal? Why not allow
productions of the form:
E ® TZ
that we got when we factored the above some
productions for arithmetic expressions earlier? This seems more
intuitive and natural. Besides, it is no fun transforming grammars into
Greibach Normal Form. Things would be far easier if we had a way of
predicting when to apply the rule as we did for epsilon rules. Modifying
our select set definition a little helps since it allows select sets to
indicate when to apply productions.
Definition. The select set for the production A ® a (denoted SELECT(A ® a )) where A is a nonterminal and a is a string of terminal and nonterminial symbols is the set of all terminal symbols which begin strings derived from a .
This tells us that if our pushdown machine has an E on its stack and is reading a symbol from SELECT(E ®
TZ), then that is the production that should be applied by pushing TZ
on the stack. Here is a new class of deterministic context free grammars
based upon select sets.
Definition. A context free grammar is an LL(1) grammar if and only if any two productions with the same left-hand sides have different select sets.
Two items need
to be cleared up. First, LL(1) means that the parser we build from the
grammar is processing the input string from left to right (that's the first L) using leftmost derivations
(the second L). This is exactly what the parsers we have been designing
have been doing. They create a leftmost derivation and check it against
the input string. In addition we look ahead one symbol. By that
we mean that the parser can examine the next symbol on its input string
without advancing past it. The parser in figure 2 is an example of this.
Next, we have neglected to fully define select sets
for arbitrary productions. Intuitively, this will be all of the terminal
symbols first produced by the production, which we call its FIRST set.
If the production can lead to the empty string, then the FOLLOW set of
the left-hand side nonterminal must be included. Here are the formal
definitions.
Definition. FIRST(A ® a ) is the set of all terminal symbols a such that some string of the form ab can be derived from a .
Definition. SELECT(A ® a ) contains FIRST(A ® a ). If e can be derived from a then it also contains FOLLOW(A).
It is time for another theoretical aside. The proof of this last theorem is left as an exercise in careful substitution.
Theorem 2. The class of languages generated by q-grammars is the same as the class of LL(1) languages.
No comments:
Post a Comment