The major class of
grammars used for top-down parsing is the class of LL(1) grammars. LL(1)
means that the parser we build from the grammar is processing the input
string from left to right (that's the first L) using leftmost derivations
(the second L). This is exactly what the parsers we wish to design
shall do. They will create a leftmost derivation and check it against
the input string. In addition, they are allowed to look ahead one
symbol. This means that the parser can examine the next symbol on its
input string without advancing past it. The major piece of information
that the parsers need to do this is called the select set.
Definition. The select set for the production A ® a (denoted SELECT(A ® a )) where A is a nonterminal and a is a string of terminal and nonterminial symbols is the set of all terminal symbols which begin strings derived from a .
Consider the grammar S ® e
| aSb. The select set for the second production is {a}. For the first
production, we cannot apply the above definition, so we have to think
about things a little. Note that if the parser is looking at a symbol
which follows S, then we probably should apply S ® e whenever we see that symbol. In this case the symbol b always follows S. The parser built from this appears in figure 1.
Figure 1 - A Parser for S ® e | aSb
It is important to note that in the second row of the table, the parser looked ahead one symbol to determine that S ® e should be applied.
Here is the formal definition of LL(1) grammars, a class of deterministic context free grammars based upon select sets.
Definition. A context free grammar is an LL(1) grammar if and only if any two productions with the same left-hand sides have different select sets.
We need now to
fully define select sets for arbitrary productions. Intuitively, they
will contain all of the terminal symbols first produced by the
production, which we call its FIRST set. If the production can lead to
the empty string, then the FOLLOW set of the left-hand side nonterminal
must be included. Here are the formal definitions.
Definition. FIRST(A ® a ) is the set of all terminal symbols a such that some string of the form ab can be derived from a .
The FIRST set for S ® aSb is of course the set {a}.
Definition. The follow set for the nonterminal A (denoted FOLLOW(A)) is the set of all terminals a for which some string a Aab can be derived from the starting symbol S. (Where a and b are possibly empty strings of both terminals and nonterminals.)
The FOLLOW set for S in the grammar S ® e | aSb is the set {b}.
Definition. SELECT(A ® a ) contains FIRST(A ® a ). If e can be derived from a then it also contains FOLLOW(A).
We begin by determining which nonterminals can generate the empty string e using the algorithm of figure 2.
Figure 2 - Finding Nullable Nonterminals
Let's pause a
moment and make sure that the algorithm for finding nonterminals which
generate the null string is correct. Two steps are needed for this.
We claim that it terminates because one nonterminal
is deleted every time the loop is executed and sooner or later the loop
will terminate since there are no more epsilon rules or we shall run out
of nonterminals.
As an example, let us apply the algorithm of figure 2 to the following grammar.
A ® e | a
B ® e | bD
C ® AD | b
D ® aA | c
First we remove all of the productions containing terminals. Only the following four productions remain.
S ® ABA
|
A ® e
|
B ® e
|
C ® AD
|
Executing
the loop one time places A and B into the set of nullable nonterminals
and leaves the two following productions after removing the epsilon
rules and substituting e for A and B in the above productions.
S ® e
|
C ® D
|
The loop is executed once more and S is added to the nullable nonterminal set and then the algorithm halts since C ® D is not an epsilon rule.
Correctness comes from thinking about just how a nonterminal A could generate the null string e . There must be a production A ® a where all of the nonterminals in a in turn generate e . And one of them must generate e directly.
Now we must work out computing procedures for FIRST,
FOLLOW, and SELECT. We have defined them, but have not demonstrated how
to compute them.
(Note: In
the following, capital Roman letters are nonterminals, small Roman
letters from the beginning of the alphabet are terminals, x and y can be
either terminals or nonterminals. All Greek letters (except for e ) are possibly empty strings of terminals and nonterminals.)
In order to
compute FIRST sets several steps are necessary. We start by identifying
all of the terminal and nonterminal symbols that will be generated
immediately from productions and place them in a set named BEGIN for the
nonterminal on the right-hand side.
Computing FIRST(A), which
we recall is the set of symbols appearing first in strings generated
from A, involves taking the reflexive, transitive closure of BEGIN(A).
This is done easily with a queue as follows.
Using the last grammar as an example we find the BEGIN and FIRST sets shown below for all of the nonterminal symbols.
Recall that A and c are in
BEGIN(S) since they began productions and B had to join BEGIN(S) since A
can generate the empty string.
Now we have all of the terminals that show up at the
beginning of strings generated by each nonterminal. For completeness we
state that for each terminal symbol a, FIRST(a) = {a}.
At last we have all of the information needed to compute FIRST(A ® a ) for each production. These sets are basically the union of the FIRST sets for all the symbols which can start a . Since these FIRST sets for productions will be used only to find SELECT sets they need only contain terminal symbols.
Here are the terminal
symbols for the FIRST sets for the two productions that do not have
right-hand sides beginning with terminals.
S ® ABA
|
{ a, b }
|
C ® AD
|
{ a, c }
|
Some more relations are
needed to build the FOLLOW sets for each productions. We need to detect
symbols that come after others and which end derivations. AFTER(A) is
the set of all of symbols that immediately follow A and END(A) is the
set of all symbols that are last in productions with A as a left-hand
side.
Note that A can follow
immediately after itself since B may generate the empty string. Here are
the AFTER and END sets for the nonterminals.
If we set LAST(A) to be
the reflexive, transitive closure of END(A) then it is the set of
symbols which end strings generated from A. The same algorithm used to
calculate the FIRST sets can be used here.
The next part is a little intricate. FOLLOW(N1) is the set of all terminal symbols t such that there exist symbols N2 and x where:
and x Î AFTER(N2) {x directly follows N2 in some production}
and t Î FIRST(x) {t is the first terminal in string generated by x}
To compute the
FOLLOW sets for the nonterminals, we shall fill in a chart with the
symbols that correspond to the three relationships above.
N1
|
N2
|
x
|
t
|
S
|
S
|
||
A
|
S,A,C,D
|
A,B,D
|
a,b,c
|
B
|
S,B
|
A
|
a
|
C
|
S,C
|
||
D
|
S,C,D
|
At this point we have all
of the information we need in order to define SELECT sets for each
production. The definition is the following.
This provides the
following select sets for the example grammar. After all of this work,
we now know that it is not a LL(1) grammar because the two A productions
do not have different select sets.
S ® ABA
|
{a,b}
|
S ® cC
|
{c}
|
A ® e
|
{a,b,c}
|
A ® a
|
{a}
|
B ® e
|
{a}
|
B ® bD
|
{b}
|
C ® AD
|
{a,c}
|
C ® b
|
{b}
|
D ® aA
|
{a}
|
D ® c
|
{c}
|
Here is another example, the grammar for arithmetic expressions. It appears in factored form on the right.
E ® T | T+E | E ® TA | |
A ® +E | e | ||
T ® F | F* T | T ® FB | |
B ® * T | e | ||
F ® x | (E) | F ® x | (E) |
Our first task is to find
the nonterminals that can generate the empty string. This is easy since
only A and B can do this and it was obvious. Having no surprises is, by
the way, a sign of good grammar design!
The chart below provides all of the relations that were described above for the nonterminals of the factored grammar.
E
|
A
|
T
|
B
|
F
|
|
BEGIN
|
T
|
+
|
F
|
*
|
x, (
|
FIRST
|
E, T, F, x, (
|
A, +
|
T, F, x, (
|
B, *
|
F, x, (
|
AFTER
|
)
|
A
|
B
|
||
END
|
A,T
|
E
|
B, F
|
T
|
x, )
|
LAST
|
E,A,T,B,F,x,)
|
E,A,T,B,F,x,)
|
T,B,F,x,)
|
T,B,F,x,)
|
F,x,)
|
The FIRST sets for the interesting productions are the following.
E ® TA |
{ x, ( }
|
T ® FB |
{ x, ( }
|
Using the above information we can construct the chart for the FOLLOW sets (in the t column) as shown below.
N1
|
N2
|
x
|
t
|
E
|
E,A
|
)
|
)
|
A
|
E,A
|
)
|
)
|
T
|
E,A,T,B
|
A, )
|
+, )
|
B
|
E,A,T,B
|
A, )
|
+, )
|
F
|
E,A,T,B,F
|
A, B, )
|
+, * , )
|
Putting this all together
we arrive at the following select sets and are able to construct a
parser for the grammar. The general rule for LL(1) parsers is to pop a
production's left-hand side and push the right hand side when reading
the select set. Here is our example as figure 3.
Figure 3 - A Parser for Arithmetic Expressions
That wasn't so
bad after all. We did a lot of work, but came up with a deterministic
parser for a LL(1) grammar that describes arithmetic expressions for a
programming language.
One more note on top-down parsers is needed. They're
usually presented with instructions to either pop the top stack symbol
(for an epsilon rule) or to replace the top stack symbol with a string
(for other productions). No advance of the input is made in either case.
It is always implicitly assumed that when the top of the stack matches
the input then the parser pops the symbol and advances. These are called
predict and verify operations based on leftmost derivations. Also, these parsers are often presented in a slightly different format. All of the match or verify operations with terminals on the input and stack are understood and a replace table
is provided as the parser. This table shows what to place on the stack
when the top symbol and the next input symbol are given. Here is our
last example in this form.
See how the two formats
describe the same parser? In the new form we just examine the input and
the stack and replace the stack symbol by the string in the appropriate
box of the parser table. By the way, the blank boxes depict
configurations that should not occur unless there is an error. In a real
parser one might have these boxes point to error message routines.
At this point we have defined a subclass (in fact a
proper one) of the deterministic context free languages for which we
know how to build parsers. We also have some tools that can be used to
place grammars in the correct form if needed. These tools are:
- omitting useless nonterminals,
- omitting unreachable nonterminals,
- factoring,
- substitution, and
- recursion removal.
A further word about
recursion is needed. It should be obvious that left recursion (immediate
or cyclic) is not compatible with top-down parsing. The following
result provides some confort for those who make sure that the grammars
they design are not recursive.
Theorem. Every LL(1) language has a nonrecursive grammar.
Finally, if we wish to translate or parse something the following recipe should help to streamline the task.
- Develop a grammar for it,
- Convert the grammar to LL(1) form, and
- Construct the parser.
This promises success since LL(1) grammars provide parsers which are nonrecursive and deterministic.
No comments:
Post a Comment