Grammars that have the same languages as DFA's
A grammar is defined as G = (V, T, P, S) where
V is a set of variables. We usually use capital letters for variables.
T is a set of terminal symbols. This is the same as Sigma for a machine.
P is a list of productions (rules) of the form:
variable -> concatenation of variables and terminals
S is the starting variable. S is in V.
A string z is accepted by a grammar G if some sequence of productions from P
can be applied to z, in reverse, reducing z to exactly the start variable S.
We say that L(G) is the language generated (accepted) by the grammar G.
To start, we restrict the productions P to be of the form
A -> w w is a concatenation of terminal symbols
B -> wC w is a concatenation of terminal symbols
A, B and C are variables in V
and thus get a grammar that generates (accepts) a regular language.
Suppose we are given a machine M = (Q, Sigma, delta, q0, F) with
Q = { S }
Sigma = { 0, 1 }
q0 = S
F = { S }
delta | 0 | 1 |
---+---+---+
S | S | S |
---+---+---+
This looks strange because we would normally use q0 in place of S.
The regular expression for M is (0+1)*
We can write the corresponding grammar for this machine as
G = (V, T, P, S) where
V = { S } the set of states in the machine
T = { 0, 1 } same as Sigma for the machine
P =
S -> epsilon | 0S | 1S
S = S the q0 state from the machine
the construction of the rules for P is directly from M's delta
If delta has an entry from state S with input symbol 0 go to state S,
the rule is S -> 0S.
If delta has an entry from state S with input symbol 1 go to state S,
the rule is S -> 1S.
There is a rule generated for every entry in delta.
delta(qi,a) = qj yields a rule qi -> a qj
An additional rule is generated for each final state, i.e. S -> epsilon
(An optional encoding is to generate an extra rule for every transition
to a final state: delta(qi,a) = any final state, qi -> a
with this option, if the start state is a final state, the production
S -> epsilon is still required. )
The shorthand notation S -> epsilon | 0S | 1S is the same as writing
the three rules. Read "|" as "or".
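The construction above can be sketched in code. A minimal sketch, assuming
the delta table is given as a dictionary; the function name dfa_to_grammar
and the (variable, right side) pair representation of productions are
illustrative, not from the notes:

```python
# Build a right-linear grammar from a DFA, following the rules above:
#   delta(qi, a) = qj        yields  qi -> a qj
#   each final state qf      yields  qf -> epsilon  (written "" here)
def dfa_to_grammar(states, sigma, delta, q0, finals):
    """delta is a dict mapping (state, symbol) -> state."""
    productions = []
    for (qi, a), qj in delta.items():
        productions.append((qi, a + qj))
    for qf in finals:
        productions.append((qf, ""))
    return productions

# The one-state machine from the example: delta(S,0)=S, delta(S,1)=S, S final.
delta = {("S", "0"): "S", ("S", "1"): "S"}
rules = dfa_to_grammar({"S"}, {"0", "1"}, delta, "S", {"S"})
# rules contains S -> 0S, S -> 1S and S -> epsilon
```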
Grammars can be more powerful (read: accept a larger class of languages)
than finite state machines (DFA's, NFA's, NFA-epsilon's) and regular expressions.
For example the language L = { 0^i 1^i | i = 0, 1, 2, ... } is not a regular
language. Yet, this language has a simple grammar
S -> epsilon | 0S1
Note that this grammar violates the restriction needed to make the grammars
language a regular language, i.e. rules can only have terminal symbols
and then one variable. This rule has a terminal after the variable.
A grammar for matching parenthesis might be
G = (V, T, P, S)
V = { S }
T = { ( , ) }
P = S -> epsilon | (S) | SS
S = S
We can check this by rewriting an input string
( ( ( ) ( ) ( ( ) ) ) )
( ( ( ) ( ) ( S ) ) ) S -> (S) where the inside S is epsilon
( ( ( ) ( ) S ) ) S -> (S)
( ( ( ) S S ) ) S -> (S) where the inside S is epsilon
( ( ( ) S ) ) S -> SS
( ( S S ) ) S -> (S) where the inside S is epsilon
( ( S ) ) S -> SS
( S ) S -> (S)
S S -> (S)
Thus the string ((()()(()))) is accepted by G because the rewriting
produced exactly S, the start variable.
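The same check can be automated. This sketch does not rewrite with the grammar
directly; it uses the counter test for balanced parentheses, which accepts
exactly the strings generated by S -> epsilon | (S) | SS (that equivalence is
assumed here, not proved):

```python
# Membership test for the balanced-parenthesis language of
# S -> epsilon | (S) | SS, via the running-depth counter.
def balanced(s):
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:          # a ")" with no matching "(" to its left
            return False
    return depth == 0          # every "(" must also be closed

assert balanced("((()()(())))")    # the string checked above
assert not balanced("(()")
```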
More examples of constructing grammars from language descriptions:
Construct a CFG for non empty Palindromes over T = { 0, 1 }
The strings in this language read the same forward and backward.
G = ( V, T, P, S)  T = { 0, 1 }, V = { S }, S = S, P is below:
S -> 0 | 1 | 00 | 11 | 0S0 | 1S1
We started the construction with S -> 0 and S -> 1
the shortest strings in the language.
S -> 0S0 is a palindrome with a zero added to either end
S -> 1S1 is a palindrome with a one added to either end
But, we needed S -> 00 and S -> 11 to get the even length
palindromes started.
"Non empty" means there can be no rule S -> epsilon.
Construct the grammar for the language L = { a^n b^n | n > 0 }
G = ( V, T, P, S ) T = { a, b } V = { S } S = S P is:
S -> ab | aSb
Because n>0 there can be no S -> epsilon
The shortest string in the language is ab
a's have to be on the front, b's have to be on the back.
When either an "a" or a "b" is added the other must be added
in order to keep the count the same. Thus S -> aSb.
The toughest decision is when to stop adding rules.
In this case start "generating" strings in the language
S -> ab ab for n=1
S -> aSb aabb for n=2
S -> aaSbb aaabbb for n=3 etc.
Thus, no more rules needed.
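The "generating" list above can be mimicked directly. A sketch that applies
S -> aSb n-1 times and then finishes with S -> ab (the function name derive
is illustrative):

```python
# Derive a^n b^n from S -> ab | aSb, exactly as in the list above.
def derive(n):
    s = "S"
    for _ in range(n - 1):
        s = s.replace("S", "aSb")   # apply S -> aSb
    return s.replace("S", "ab")     # finish with S -> ab

print(derive(1))   # ab
print(derive(3))   # aaabbb
```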
"Generating" the strings in a language defined by a grammar
is also called "derivation" of the strings in a language.
CFG Derivation Tree
Given a grammar with the usual representation G = (V, T, P, S)
with variables V, terminal symbols T, set of productions P and
the start symbol from V called S.
A derivation tree is constructed with
1) each tree vertex is a variable or terminal or epsilon
2) the root vertex is S
3) interior vertices are from V, leaf vertices are from T or epsilon
4) an interior vertex A has children, in order, left to right,
X1, X2, ... , Xk when there is a production in P of the
form A -> X1 X2 ... Xk
5) a leaf can be epsilon only when there is a production A -> epsilon
and the leaf's parent can have only this child.
Watch out! A grammar may have an unbounded number of derivation trees.
It just depends on which production is expanded at each vertex.
For any valid derivation tree, reading the leaves from left to right
gives one string in the language defined by the grammar.
There may be many derivation trees for a single string in the language.
If the grammar is a CFG then a leftmost derivation tree exists
for every string in the corresponding CFL. There may be more than one
leftmost derivation tree for some string. See example below and
((()())()) example in previous lecture.
If the grammar is a CFG then a rightmost derivation tree exists
for every string in the corresponding CFL. There may be more than one
rightmost derivation tree for some string.
The grammar is called "ambiguous" if the leftmost (rightmost) derivation
tree is not unique for every string in the language defined by the grammar.
The leftmost and rightmost derivations are usually distinct but might
be the same.
Given a grammar and a string in the language represented by the grammar,
a leftmost derivation tree is constructed bottom up by finding a
production in the grammar that has the leftmost character of the string
(possibly more than one may have to be tried) and building the tree
towards the root. Then work on the second character of the string.
After much trial and error, you should get a derivation tree with a root S.
We will get to the CYK algorithm that does the parsing in a few lectures.
Examples: Construct a grammar for L = { x 0^n y 1^n z | n > 0 }
Recognize that 0^n y 1^n is a base language, say B
B -> y | 0B1 (The base y, the recursion 0B1 )
Then, the language is completed S -> xBz
using the prefix, base language and suffix.
(Note that x, y and z could be any strings not involving n)
G = ( V, T, P, S ) where
V = { B, S } T = { x, y, z, 0, 1 } S = S
P = S -> xBz
B -> y | 0B1
Now construct an arbitrary derivation showing  S =>* x00y11z  over G.
A derivation always starts with the start variable, S.
The "=>*" stands for "derives in any number of steps", and a
subscript G means "over the grammar G".
The intermediate terms, called sentential forms, may contain
both variable and terminal symbols.
Any variable, say B, can be replaced by the right side of any
production of the form B -> <right side>
A leftmost derivation always replaces the leftmost variable
in the sentential form. (In general there are many possible
replacements, the process is nondeterministic.)
One possible derivation using the grammar above is
S => xBz => x0B1z => x00B11z => x00y11z
The derivation must obviously stop when the sentential form
has only terminal symbols. (No more substitutions possible.)
The final string is in the language of the grammar. But this
is a very poor way to generate all the strings in the language!
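The derivation above can be replayed mechanically by always substituting the
leftmost occurrence of a variable (the names here are illustrative):

```python
# Replay S => xBz => x0B1z => x00B11z => x00y11z for the grammar
# S -> xBz, B -> y | 0B1, always rewriting the leftmost variable.
def apply_leftmost(sentential, var, rhs):
    i = sentential.index(var)                  # leftmost occurrence of var
    return sentential[:i] + rhs + sentential[i + 1:]

form = "S"
for var, rhs in [("S", "xBz"), ("B", "0B1"), ("B", "0B1"), ("B", "y")]:
    form = apply_leftmost(form, var, rhs)
print(form)   # x00y11z
```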
A "derivation tree" sometimes called a "parse tree" uses the
rules above: start with the starting symbol, expand the tree
by creating branches using any right side of a starting
symbol rule, etc.
S
/ | \
/ | \
/ | \
/ | \
/ | \
x B z
/ | \
/ | \
/ | \
/ | \
0 B 1
/ | \
/ | \
0 B 1
|
y
Derivation ends x 0 0 y 1 1 z with all leaves
terminal symbols, a string in the language generated by the grammar.
More examples of grammars are:
G(L) for L = { x a^n y b^k z | k > n > 0 }
note that there must be more b's than a's thus
B -> aybb | aBb | Bb
G = ( V, T, P, S ) where
V = { B, S } T = { a, b, x, y, z } S = S
P = S -> xBz B -> aybb | aBb | Bb
Incremental changes for "n > k > 0" B -> aayb | aBb | aB
Incremental changes for "n >= k >= 0" B -> y | aBb | aB
Independent exponents do not cause a problem when nested
equivalent to nesting parenthesis.
G(L) for L = { a^i b^j c^j d^i e^k f^k | i>=0, j>=0, k>=0 }
| | | | | |
| +---+ | +---+
+-----------+
G = ( V, T , P, S )
V = { I, J, K, S } T = { a, b, c, d, e, f } S = S
P = S -> IK
I -> J | aId
J -> epsilon | bJc
K -> epsilon | eKf
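A quick sanity check: each production keeps its own pair of exponents matched,
so any string of the language can be assembled by expanding J, wrapping with I,
and appending K (the helper name word is made up for illustration):

```python
# Assemble a^i b^j c^j d^i e^k f^k following the grammar
# S -> IK, I -> J | aId, J -> epsilon | bJc, K -> epsilon | eKf.
def word(i, j, k):
    inner = "b" * j + "c" * j           # J derives b^j c^j
    left = "a" * i + inner + "d" * i    # I wraps J with a...d
    return left + "e" * k + "f" * k     # S = I K, K derives e^k f^k

print(word(1, 2, 1))   # abbccdef
```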
G(L) for L = { a^i b^j c^k | any unbounded relation such as i=j=k>0 or 0<i<k<j }
the G(L) cannot be a context free grammar. Try it.
This will be seen intuitively with push down automata and is provable
with the pumping lemma for context free languages.
How is a leftmost derivation tree for some string constructed?
One process looks at the string left to right and
runs the productions backwards.
Here is an example, time starts at top and moves down.
Given G = (V, T, P, S) V={S, E, I} T={a, b, c} S=S P=
I -> a | b | c
E -> I | E+E | E*E
S -> E (a subset of grammar from book)
Given a string a + b * c
I
E
S derived but not used
I
E
[E + E]
E
S derived but not used
I
E
[E * E]
E
S done! Have S and no more input.
Leftmost derivation tree: just turn upside down, delete unused.
S
|
E
/ | \
/ | \
/ | \
E * E
/ | \ |
E + E I
| | |
I I c
| |
a b
Check: Read leaves left to right; must be the initial string, all in T.
Interior nodes must be variables, all in V.
Every vertical connection must be traceable to a production.
CFG simplification algorithm
The goal here is to take an arbitrary Context Free Grammar
G = (V, T, P, S) and perform transformations on the grammar that
preserve the language generated by the grammar but reach a
specific format for the productions.
Overview: Step 1a) Eliminate useless variables that can not become terminals
Step 1b) Eliminate useless variables that can not be reached
Step 2) Eliminate epsilon productions
Step 3) Eliminate unit productions
Step 4) Make productions Chomsky Normal Form
Step 5) Make productions Greibach Normal Form
The CYK parsing uses Chomsky Normal Form as input
The CFG to NPDA uses Greibach Normal Form as input
Details: one step at a time
1a) Eliminate useless variables that can not become terminals
See 1st Ed. book p88, Lemma 4.1, figure 4.7
2nd Ed. section 7.1
Basically: Build the set NEWV from productions of the form
V -> w where V is a variable and w is one or more terminals.
Insert V into the set NEWV.
Then iterate over the productions, now accepting any variable
in w as a terminal if it is in NEWV. Thus NEWV is all the
variables that can be reduced to all terminals.
Now, all productions containing a variable not in NEWV
can be thrown away. Thus T is unchanged, S is unchanged,
V=NEWV and P may become the same or smaller.
The new grammar G=(V,T,P,S) represents the same language.
1b) Eliminate useless variables that can not be reached from S
See 1st Ed. book p89, Lemma 4.2, 2nd Ed. book 7.1.
Set V'={S}, T'=phi, mark all productions as unused.
Iterate repeatedly through all productions until no change
in V' or T'. For any production A -> w, with A in V'
insert the terminals from w into the set T' and insert
the variables from w into the set V' and mark the
production as used.
Now, delete all productions from P that are marked unused.
V=V', T=T', S is unchanged.
The new grammar G=(V,T,P,S) represents the same language.
2) Eliminate epsilon productions.
See 1st Ed. book p90, Theorem 4.3, 2nd Ed. book 7.1
This is complex. If the language of the grammar contains
the null string, epsilon, then in principle remove epsilon
from the grammar, eliminate epsilon productions, then put
S -> epsilon back into the grammar later.
The new grammar G=(V,T,P,S) represents the same language except
the new language does not contain epsilon.
3) Eliminate unit productions.
See 1st Ed. book p91, Theorem 4.4, 2nd Ed. 7.1
Iterate through productions finding A -> B type "unit productions".
Delete this production from P.
Make a copy of all productions B -> gamma, replacing B with A.
Be careful of A -> B, B -> C, C -> D type cases,
there needs to be copies of B -> gamma, C -> gamma, D -> gamma for A.
Delete duplicate productions. (sort and remove adjacent duplicate)
The new grammar G=(V,T,P,S) represents the same language.
Briefly, some pseudo code for the above steps.
Step 1a) The set V' = phi
loop through the productions, P, to find:
A -> w where w is all terminals
union V' with A
n := 0
while n /= |V'|
n := |V'|
loop through productions to find:
A -> alpha where alpha is only terminals and variables in V'
union V' with A
end while
Eliminate := V - V'
loop through productions
delete any production containing a variable in Eliminate,
V := V'
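The Step 1a pseudo code can be made runnable. A sketch assuming productions
are (variable, right-side-string) pairs over single-character symbols; those
representation choices are assumptions, not from the notes:

```python
# Step 1a: keep only variables that can derive a string of all terminals.
def generating_vars(P, V, T):
    newv = set()
    changed = True
    while changed:                       # iterate until V' stops growing
        changed = False
        for A, w in P:
            # accept a variable in w as "terminal-like" if already in newv
            if A not in newv and all(c in T or c in newv for c in w):
                newv.add(A)
                changed = True
    return newv

def drop_nongenerating(P, V, T):
    newv = generating_vars(P, V, T)
    # throw away every production mentioning a variable outside newv
    keep = [(A, w) for A, w in P
            if A in newv and all(c in T or c in newv for c in w)]
    return keep, newv

P = [("S", "AB"), ("S", "a"), ("A", "b")]     # B can never become terminals
P2, V2 = drop_nongenerating(P, {"S", "A", "B"}, {"a", "b"})
# S -> AB is thrown away; S -> a and A -> b survive
```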
Step 1b) The set V' = {S}
The set T' = phi
n := 0
while n /= |V'| + |T'|
n := |V'| + |T'|
loop through productions to find:
A -> alpha where A in V'
union V' with variables in alpha
union T' with terminals in alpha
end while
loop through productions
delete any production containing anything outside V' T' and epsilon
V := V'
T := T'
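Step 1b in the same assumed representation (productions as
(variable, right-side-string) pairs over single-character symbols):

```python
# Step 1b: keep only variables and terminals reachable from S.
def reachable(P, S, V, T):
    vprime, tprime = {S}, set()
    changed = True
    while changed:                       # iterate until V' and T' stabilize
        changed = False
        for A, w in P:
            if A in vprime:
                for c in w:
                    target = vprime if c in V else tprime
                    if c not in target:
                        target.add(c)
                        changed = True
    keep = [(A, w) for A, w in P if A in vprime]
    return keep, vprime, tprime

P = [("S", "aA"), ("A", "b"), ("B", "c")]     # B is unreachable from S
P2, V2, T2 = reachable(P, "S", {"S", "A", "B"}, {"a", "b", "c"})
# B -> c is deleted; c drops out of the terminal set
```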
Step 2) The set N = phi
n := -1
while n /= |N|
n := |N|
loop through productions to find:
A -> epsilon
union N with A
delete production
A -> alpha where no terminals in alpha and
all variables in alpha are in N
union N with A
delete production
end while
if S in N set null string accepted
loop through productions
A -> alpha where at least one variable in alpha in N
generate rules A -> alpha' where alpha'
is all combinations of eliminating the
variables in N
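Step 2 in the same assumed representation; the helper names are illustrative:

```python
from itertools import product

# Step 2: find the nullable set N, then rewrite each production into all
# variants with the nullable variables optionally removed.
def nullable_set(P):
    N = set()
    changed = True
    while changed:
        changed = False
        for A, w in P:
            # w == "" (A -> epsilon) or all symbols of w already nullable
            if A not in N and all(c in N for c in w):
                N.add(A)
                changed = True
    return N

def remove_epsilon(P):
    N = nullable_set(P)
    newP = set()
    for A, w in P:
        # each nullable symbol may be kept or dropped; others must be kept
        slots = [[c] if c not in N else [c, ""] for c in w]
        for choice in product(*slots):
            rhs = "".join(choice)
            if rhs:                      # do not emit new epsilon bodies
                newP.add((A, rhs))
    return newP, N

P = [("S", "AB"), ("A", "a"), ("A", ""), ("B", "b")]
P2, N = remove_epsilon(P)
# A is nullable, so S -> AB gains the variant S -> B
```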
Step 3) P' := all non unit productions ( not A -> B )
U := all unit productions
loop through productions in U, |U| times, to find:
A -> A
ignore this
A -> B
loop through productions in P'
copy/substitute B -> gamma to A -> gamma in P'
P := P'
eliminate duplicate productions (e.g. sort and check entry i+1 against entry i)
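Step 3 in the same assumed representation; the closure over unit chains
handles the A -> B, B -> C, C -> D cases mentioned above:

```python
# Step 3: for every pair (A, B) with A =>* B through unit productions,
# copy B's non-unit bodies up to A, then keep only non-unit productions.
def remove_units(P, V):
    units = {(A, w) for A, w in P if w in V}       # A -> B, B a variable
    nonunits = {(A, w) for A, w in P if w not in V}
    pairs = set(units)
    changed = True
    while changed:                                 # close chains A->B->C...
        changed = False
        for A, B in list(pairs):
            for B2, C in units:
                if B2 == B and (A, C) not in pairs:
                    pairs.add((A, C))
                    changed = True
    newP = set(nonunits)                           # sets drop duplicates
    for A, B in pairs:
        for B2, w in nonunits:
            if B2 == B:
                newP.add((A, w))                   # copy B -> w as A -> w
    return newP

V = {"S", "A", "B"}
P = [("S", "A"), ("A", "B"), ("B", "b"), ("A", "aA")]
P2 = remove_units(P, V)
# S and A each gain b; S also gains aA through the chain S -> A
```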