### Grammars that have the same languages as DFAs

A grammar is defined as G = (V, T, P, S) where

- V is a set of variables. We usually use capital letters for variables.
- T is a set of terminal symbols. This is the same as Sigma for a machine.
- P is a list of productions (rules) of the form:
  variable -> concatenation of variables and terminals
- S is the starting variable. S is in V.

A string z is accepted by a grammar G if some sequence of rules from P can be applied to z with a result that is exactly the variable S. We say that L(G) is the language generated (accepted) by the grammar G.

To start, we restrict the productions P to be of the form

    A -> w     w is a concatenation of terminal symbols
    B -> wC    w is a concatenation of terminal symbols

where A, B and C are variables in V, and thus get a grammar that generates (accepts) a regular language.

Suppose we are given a machine M = (Q, Sigma, delta, q0, F) with

    Q = { S }
    Sigma = { 0, 1 }
    q0 = S
    F = { S }

    delta | 0 | 1 |
    ------+---+---+
      S   | S | S |
    ------+---+---+

This looks strange because we would normally use q0 in place of S. The regular expression for M is (0+1)*.

We can write the corresponding grammar for this machine as G = (V, T, P, S) where

    V = { S }                     the set of states in the machine
    T = { 0, 1 }                  same as Sigma for the machine
    P = S -> epsilon | 0S | 1S
    S = S                         the q0 state from the machine

The construction of the rules for P is directly from M's delta. If delta has an entry from state S with input symbol 0 go to state S, the rule is S -> 0S. If delta has an entry from state S with input symbol 1 go to state S, the rule is S -> 1S. There is a rule generated for every entry in delta:

    delta(qi, a) = qj   yields the rule   qi -> a qj

An additional rule is generated for each final state, i.e. S -> epsilon. (An optional encoding is to generate an extra rule for every transition to a final state: delta(qi, a) = any final state yields qi -> a. With this option, if the start state is a final state, the production S -> epsilon is still required.)

The shorthand notation S -> epsilon | 0S | 1S is the same as writing the three rules. Read "|" as "or".

Grammars can be more powerful (read: accept a larger class of languages) than finite state machines (DFAs, NFAs, NFA-epsilons, regular expressions). For example, the language L = { 0^i 1^i | i = 0, 1, 2, ... } is not a regular language. Yet this language has a simple grammar:

    S -> epsilon | 0S1

Note that this grammar violates the restriction needed to make the grammar's language a regular language, i.e. rules can only have terminal symbols and then one variable. This rule has a terminal after the variable.

A grammar for matching parentheses might be

    G = (V, T, P, S)
    V = { S }
    T = { ( , ) }
    P = S -> epsilon | (S) | SS
    S = S

We can check this by rewriting an input string:

    ( ( ( ) ( ) ( ( ) ) ) )
    ( ( ( ) ( ) ( S ) ) )      S -> (S)   where the inside S is epsilon
    ( ( ( ) ( ) S ) )          S -> (S)
    ( ( ( ) S S ) )            S -> (S)   where the inside S is epsilon
    ( ( ( ) S ) )              S -> SS
    ( ( S S ) )                S -> (S)   where the inside S is epsilon
    ( ( S ) )                  S -> SS
    ( S )                      S -> (S)
    S                          S -> (S)

Thus the string ((()()(()))) is accepted by G because the rewriting produced exactly S, the start variable.

More examples of constructing grammars from language descriptions:

Construct a CFG for non-empty palindromes over T = { 0, 1 }. The strings in this language read the same forward and backward.

    G = (V, T, P, S)
    T = { 0, 1 },  V = { S },  S = S,  P is below:
    S -> 0 | 1 | 00 | 11 | 0S0 | 1S1

We started the construction with S -> 0 and S -> 1, the shortest strings in the language. S -> 0S0 is a palindrome with a zero added to either end; S -> 1S1 is a palindrome with a one added to either end. But we needed S -> 00 and S -> 11 to get the even length palindromes started. "Non-empty" means there can be no rule S -> epsilon.
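The delta-to-production construction described earlier (delta(qi, a) = qj yields qi -> a qj, plus an epsilon rule per final state) can be sketched in Python. This is a hypothetical helper, not code from the notes; `dfa_to_grammar` is an assumed name, and epsilon is represented by the empty string.

```python
def dfa_to_grammar(states, sigma, delta, q0, finals):
    """Build a right-linear grammar from a DFA.

    delta is a dict mapping (state, symbol) -> state.
    Returns a dict mapping each variable (state) to its right sides.
    """
    productions = {q: [] for q in states}
    for (qi, a), qj in delta.items():
        productions[qi].append(a + qj)   # rule qi -> a qj
    for qf in finals:
        productions[qf].append("")       # rule qf -> epsilon
    return productions

# The one-state machine above: delta(S,0) = S, delta(S,1) = S, S final.
P = dfa_to_grammar({"S"}, {"0", "1"},
                   {("S", "0"): "S", ("S", "1"): "S"}, "S", {"S"})
print(P)   # {'S': ['0S', '1S', '']}  i.e. S -> 0S | 1S | epsilon
```

The returned dictionary is exactly the production list P of the grammar; the start variable is the machine's q0.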
Construct the grammar for the language L = { a^n b^n | n > 0 }

    G = (V, T, P, S)
    T = { a, b },  V = { S },  S = S
    P: S -> ab | aSb

Because n > 0 there can be no S -> epsilon. The shortest string in the language is ab. a's have to be on the front, b's have to be on the back. When either an "a" or a "b" is added, the other must be added in order to keep the count the same. Thus S -> aSb.

The toughest decision is when to stop adding rules. In this case, start "generating" strings in the language:

    S => ab                         ab       for n=1
    S => aSb => aabb                aabb     for n=2
    S => aSb => aaSbb => aaabbb     aaabbb   for n=3
    etc.

Thus, no more rules are needed. "Generating" the strings in a language defined by a grammar is also called "derivation" of the strings in a language.
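The generation trace above can be replayed mechanically: apply S -> aSb some number of times, then finish with S -> ab. A small sketch (the function name `derive` is an assumption, not from the notes):

```python
def derive(n):
    """Return the derivation of a^n b^n (n > 0) as a list of
    sentential forms, using S -> ab | aSb."""
    form = "S"
    steps = [form]
    for _ in range(n - 1):
        form = form.replace("S", "aSb")   # apply S -> aSb
        steps.append(form)
    form = form.replace("S", "ab")        # finish with S -> ab
    steps.append(form)
    return steps

print(derive(3))   # ['S', 'aSb', 'aaSbb', 'aaabbb']
```

Each list is one derivation: the start variable, the intermediate sentential forms, and finally a terminal string in the language.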

**CFG Derivation Tree**

Given a grammar with the usual representation G = (V, T, P, S), with variables V, terminal symbols T, set of productions P and the start symbol from V called S. A derivation tree is constructed with:

1) each tree vertex is a variable, a terminal, or epsilon
2) the root vertex is S
3) interior vertices are from V, leaf vertices are from T or epsilon
4) an interior vertex A has children, in order, left to right, X1, X2, ..., Xk when there is a production in P of the form A -> X1 X2 ... Xk
5) a leaf can be epsilon only when there is a production A -> epsilon, and the leaf's parent can have only this child.

Watch out! A grammar may have an unbounded number of derivation trees. It just depends on which production is expanded at each vertex. For any valid derivation tree, reading the leaves from left to right gives one string in the language defined by the grammar. There may be many derivation trees for a single string in the language.

If the grammar is a CFG then a leftmost derivation tree exists for every string in the corresponding CFL. There may be more than one leftmost derivation tree for some string. See the example below and the ((()())()) example in the previous lecture. If the grammar is a CFG then a rightmost derivation tree exists for every string in the corresponding CFL. There may be more than one rightmost derivation tree for some string.

The grammar is called "ambiguous" if the leftmost (rightmost) derivation tree is not unique for every string in the language defined by the grammar. The leftmost and rightmost derivations are usually distinct but might be the same.

Given a grammar and a string in the language represented by the grammar, a leftmost derivation tree is constructed bottom up by finding a production in the grammar that has the leftmost character of the string (possibly more than one may have to be tried) and building the tree towards the root. Then work on the second character of the string.
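The rule that reading the leaves left to right yields a string in the language can be illustrated with a tiny sketch. Here a derivation tree is a nested tuple (variable, children...), a leaf is a terminal string, and "" stands for epsilon; this representation is an assumption for illustration, not from the notes.

```python
def tree_yield(node):
    """Read the leaves of a derivation tree left to right."""
    if isinstance(node, str):      # leaf: a terminal, or "" for epsilon
        return node
    _, *children = node            # interior vertex: (variable, children...)
    return "".join(tree_yield(c) for c in children)

# A tree for (()) using S -> epsilon | (S) | SS from the parenthesis grammar:
# the root S -> (S), the inner S -> (S), the innermost S -> epsilon.
t = ("S", "(", ("S", "(", ("S", ""), ")"), ")")
print(tree_yield(t))   # (())
```

Interior vertices carry variables, leaves carry terminals or epsilon, and each vertex's children come from one production, exactly as in rules 1) through 5) above.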
After much trial and error, you should get a derivation tree with root S. We will get to the CYK algorithm that does the parsing in a few lectures.

Examples:

Construct a grammar for L = { x 0^n y 1^n z | n > 0 }

Recognize that 0^n y 1^n is a base language, say B:

    B -> y | 0B1       (the base y, the recursion 0B1)

Then the language is completed:

    S -> xBz           using the prefix, base language and suffix

(Note that x, y and z could be any strings not involving n.)

    G = (V, T, P, S) where
    V = { B, S }
    T = { x, y, z, 0, 1 }
    S = S
    P = S -> xBz
        B -> y | 0B1

Now construct an arbitrary derivation

      *
    S => x00y11z
      G

A derivation always starts with the start variable, S. The "=>", "*" and "G" stand for "derivation", "any number of steps", and "over the grammar G" respectively. The intermediate terms, called sentential forms, may contain variable and terminal symbols. Any variable, say B, can be replaced by the right side of any production of the form B -> <right side>. A leftmost derivation always replaces the leftmost variable in the sentential form. (In general there are many possible replacements; the process is nondeterministic.) One possible derivation using the grammar above is

    S => xBz => x0B1z => x00B11z => x00y11z

The derivation must obviously stop when the sentential form has only terminal symbols. (No more substitutions are possible.) The final string is in the language of the grammar. But this is a very poor way to generate all strings in the grammar!

A "derivation tree", sometimes called a "parse tree", uses the rules above: start with the starting symbol, expand the tree by creating branches using any right side of a starting symbol rule, etc.

            S
          / | \
         x  B  z
          / | \
         0  B  1
          / | \
         0  B  1
            |
            y

    x  0  0  y  1  1  z

The derivation ends with all leaves terminal symbols; read left to right, the leaves form a string in the language generated by the grammar.
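The leftmost derivation S => xBz => x0B1z => x00B11z => x00y11z can be replayed step by step. A small sketch (the helper name `apply_leftmost` is an assumption); each step replaces the leftmost occurrence of a variable with a legal right side:

```python
RULES = {"S": ["xBz"], "B": ["y", "0B1"]}

def apply_leftmost(form, var, rhs):
    """Replace the leftmost occurrence of variable var with right side rhs."""
    assert rhs in RULES[var]        # only productions of the grammar may be used
    return form.replace(var, rhs, 1)

form = "S"
for var, rhs in [("S", "xBz"), ("B", "0B1"), ("B", "0B1"), ("B", "y")]:
    form = apply_leftmost(form, var, rhs)
    print(form)
# xBz
# x0B1z
# x00B11z
# x00y11z
```

The derivation stops at x00y11z because the sentential form then contains only terminal symbols.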
More examples of grammars are:

G(L) for L = { x a^n y b^k z | k > n > 0 }

Note that there must be more b's than a's, thus

    B -> aybb | aBb | Bb

    G = (V, T, P, S) where
    V = { B, S }
    T = { a, b, x, y, z }
    S = S
    P = S -> xBz
        B -> aybb | aBb | Bb

Incremental changes for "n > k > 0":

    B -> aayb | aBb | aB

Incremental changes for "n >= k >= 0":

    B -> y | aBb | aB

Independent exponents do not cause a problem when nested, equivalent to nesting parentheses.

G(L) for L = { a^i b^j c^j d^i e^k f^k | i>=0, j>=0, k>=0 }

    a^i  b^j  c^j  d^i  e^k  f^k
     |    |    |    |    |    |
     |    +----+    |    +----+
     +--------------+

    G = (V, T, P, S)
    V = { I, J, K, S }
    T = { a, b, c, d, e, f }
    S = S
    P = S -> IK
        I -> J | aId
        J -> epsilon | bJc
        K -> epsilon | eKf

G(L) for L = { a^i b^j c^k | any unbounded relation such as i=j=k>0 or 0<i<k<j } can not be a context free grammar. Try it. This will be intuitively seen with push down automata and provable with the pumping lemma for context free languages.

What is a leftmost derivation tree for some string? It is built by a process that looks at the string left to right and runs the productions backwards. Here is an example; time starts at the top and moves down.

Given G = (V, T, P, S) with V = { S, E, I }, T = { a, b, c, +, * }, S = S, and

    P = I -> a | b | c
        E -> I | E+E | E*E
        S -> E

(a subset of a grammar from the book). Given the string a + b * c:

    a            I   E                  S derived but not used
    a + b        I   E   [E + E]   E    S derived but not used
    a + b * c    I   E   [E * E]   E    S    done! Have S and no more input.

For the left derivation tree, just turn this upside down and delete the unused derivations:

            S
            |
            E
          / | \
         E  *  E
       / | \   |
      E  +  E  I
      |     |  |
      I     I  c
      |     |
      a     b

Check: read the leaves left to right; the result must be the initial string, all in T. Interior nodes must be variables, all in V. Every vertical connection must be traceable to a production.
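The expression grammar above is ambiguous: a + b * c has two leftmost derivations, one for the grouping (a+b)*c and one for a+(b*c). A brute-force sketch (not from the notes) can confirm this by enumerating all leftmost derivations; the pruning is sound because every right side has length at least one, so sentential forms never shrink.

```python
PROD = {"S": ["E"], "E": ["I", "E+E", "E*E"], "I": ["a", "b", "c"]}
TARGET = "a+b*c"

def count_leftmost(form="S"):
    """Count the distinct leftmost derivations of TARGET from `form`."""
    # index of the leftmost variable, or None if the form is all terminals
    i = next((k for k, ch in enumerate(form) if ch in PROD), None)
    if i is None:
        return 1 if form == TARGET else 0
    if form[:i] != TARGET[:i] or len(form) > len(TARGET):
        return 0                      # prune: wrong prefix or too long
    return sum(count_leftmost(form[:i] + rhs + form[i + 1:])
               for rhs in PROD[form[i]])

print(count_leftmost())   # 2 -- one tree for (a+b)*c, one for a+(b*c)
```

A count greater than one for any string is exactly the definition of an ambiguous grammar given earlier.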

### CFG simplification algorithm

The goal here is to take an arbitrary Context Free Grammar G = (V, T, P, S) and perform transformations on the grammar that preserve the language generated by the grammar but reach a specific format for the productions.

Overview:

    Step 1a) Eliminate useless variables that can not become terminals
    Step 1b) Eliminate useless variables that can not be reached
    Step 2)  Eliminate epsilon productions
    Step 3)  Eliminate unit productions
    Step 4)  Make productions Chomsky Normal Form
    Step 5)  Make productions Greibach Normal Form

CYK parsing uses Chomsky Normal Form as input. The CFG to NPDA construction uses Greibach Normal Form as input.

Details, one step at a time:

1a) Eliminate useless variables that can not become terminals.
(See 1st Ed. book p88, Lemma 4.1, figure 4.7; 2nd Ed. section 7.1.)

Basically: build the set NEWV from productions of the form V -> w where V is a variable and w is one or more terminals; insert V into the set NEWV. Then iterate over the productions, now accepting any variable in w as a terminal if it is in NEWV. Thus NEWV is all the variables that can be reduced to all terminals. Now all productions containing a variable not in NEWV can be thrown away. Thus T is unchanged, S is unchanged, V = NEWV, and P may stay the same or become smaller. The new grammar G = (V, T, P, S) represents the same language.

1b) Eliminate useless variables that can not be reached from S.
(See 1st Ed. book p89, Lemma 4.2; 2nd Ed. section 7.1.)

Set V' = {S}, T' = phi, and mark all productions as unused. Iterate repeatedly through all productions until no change in V' or T': for any production A -> w with A in V', insert the terminals from w into the set T', insert the variables from w into the set V', and mark the production as used. Now delete all productions from P that are marked unused. V = V', T = T', S is unchanged. The new grammar G = (V, T, P, S) represents the same language.

2) Eliminate epsilon productions.
(See 1st Ed. book p90, Theorem 4.3; 2nd Ed. section 7.1.)

This is complex.
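Before tackling step 2, here is a sketch of steps 1a and 1b in Python (hypothetical helper code, not from the notes). Productions are pairs (A, alpha), where A is a single-letter variable and alpha is a string over variables and terminals.

```python
def generating(V, P):
    """Step 1a: the set NEWV of variables that can become all terminals."""
    newv = set()
    changed = True
    while changed:                     # iterate to a fixed point
        changed = False
        for A, alpha in P:
            # every symbol in alpha is a terminal, or a variable already in newv
            if A not in newv and all(x not in V or x in newv for x in alpha):
                newv.add(A)
                changed = True
    return newv

def reachable(V, P, S):
    """Step 1b: the set of variables reachable from S."""
    seen = {S}
    changed = True
    while changed:
        changed = False
        for A, alpha in P:
            if A in seen:
                for x in alpha:
                    if x in V and x not in seen:
                        seen.add(x)
                        changed = True
    return seen

# Example: B can never become all terminals; C is unreachable from S.
V = {"S", "A", "B", "C"}
P = [("S", "aA"), ("A", "a"), ("B", "bB"), ("C", "c")]
print(sorted(generating(V, P)))     # ['A', 'C', 'S']
print(sorted(reachable(V, P, "S"))) # ['A', 'S']
```

After computing each set, every production mentioning a variable outside the set is deleted, exactly as described above.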
If the language of the grammar contains the null string, epsilon, then in principle remove epsilon from the grammar, eliminate the epsilon productions, then put S -> epsilon back into the grammar later. The new grammar G = (V, T, P, S) represents the same language, except that the new language does not contain epsilon.

3) Eliminate unit productions.
(See 1st Ed. book p91, Theorem 4.4; 2nd Ed. section 7.1.)

Iterate through the productions finding A -> B type "unit productions". Delete each such production from P and make a copy of all productions B -> gamma, replacing B with A. Be careful of A -> B, B -> C, C -> D type chains: there need to be copies of B -> gamma, C -> gamma and D -> gamma for A. Delete duplicate productions (sort and remove adjacent duplicates). The new grammar G = (V, T, P, S) represents the same language.

Briefly, some pseudo code for the above steps:

Step 1a)

    V' := phi
    loop through the productions, P, to find:
        A -> w   where w is all terminals
            union V' with A
    n := 0
    while n /= |V'|
        n := |V'|
        loop through productions to find:
            A -> alpha   where alpha is only terminals and variables in V'
                union V' with A
    end while
    Eliminate := V - V'
    loop through productions
        delete any production containing a variable in Eliminate
    V := V'

Step 1b)

    V' := {S}
    T' := phi
    n := 0
    while n /= |V'| + |T'|
        n := |V'| + |T'|
        loop through productions to find:
            A -> alpha   where A in V'
                union V' with the variables in alpha
                union T' with the terminals in alpha
    end while
    loop through productions
        delete any production containing anything outside V', T' and epsilon
    V := V'
    T := T'

Step 2)

    N := phi
    n := -1
    while n /= |N|
        n := |N|
        loop through productions to find:
            A -> epsilon
                union N with A, delete the production
            A -> alpha   where alpha has no terminals and
                         all variables in alpha are in N
                union N with A, delete the production
    end while
    if S in N, record that the null string is accepted
    loop through productions to find:
        A -> alpha   where at least one variable in alpha is in N
            generate the rules A -> alpha' where alpha' is every combination
            of eliminating the variables in N (but never the empty alpha')

Step 3)

    P' := all non-unit productions (not of the form A -> B)
    U  := all unit productions
    loop through the productions in U, |U| times, to find:
        A -> A   ignore this production
        A -> B   loop through the productions in P'
                     copy/substitute each B -> gamma to A -> gamma in P'
    P := P'
    eliminate duplicate productions
        (e.g. sort and check entry i against entry i+1)
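Steps 2 and 3 can also be sketched in Python (hypothetical helper code, not from the notes). As before, productions are (A, alpha) pairs with single-letter variables, and alpha == "" means epsilon.

```python
from itertools import combinations

def nullable(P):
    """Step 2, first half: the set N of variables that can derive epsilon."""
    N, changed = set(), True
    while changed:
        changed = False
        for A, alpha in P:
            if A not in N and all(x in N for x in alpha):
                N.add(A)
                changed = True
    return N

def remove_epsilon(P):
    """Step 2, second half: for each right side, add every combination of
    dropping nullable variables; emit no new epsilon productions."""
    N = nullable(P)
    newP = set()
    for A, alpha in P:
        spots = [i for i, x in enumerate(alpha) if x in N]
        for r in range(len(spots) + 1):
            for drop in combinations(spots, r):
                rest = "".join(x for i, x in enumerate(alpha) if i not in drop)
                if rest:                      # never the empty alpha'
                    newP.add((A, rest))
    return sorted(newP)

def remove_units(V, P):
    """Step 3: delete A -> B unit productions, copying in B's right sides
    (following chains like A -> B, B -> C)."""
    reach = {A: {A} for A in V}   # reach[A]: variables A reaches by unit rules
    changed = True
    while changed:
        changed = False
        for A, alpha in P:
            if len(alpha) == 1 and alpha in V:     # a unit production A -> B
                for C in V:
                    if A in reach[C] and alpha not in reach[C]:
                        reach[C].add(alpha)
                        changed = True
    return sorted({(A, rhs) for A in V for B in reach[A]
                   for (B2, rhs) in P
                   if B2 == B and not (len(rhs) == 1 and rhs in V)})

# Example: S -> aSb | B, B -> b | epsilon
V = {"S", "B"}
P = [("S", "aSb"), ("S", "B"), ("B", "b"), ("B", "")]
P1 = remove_epsilon(P)
print(P1)                   # [('B', 'b'), ('S', 'B'), ('S', 'aSb'), ('S', 'ab')]
print(remove_units(V, P1))  # [('B', 'b'), ('S', 'aSb'), ('S', 'ab'), ('S', 'b')]
```

Note how S -> ab appears because S was nullable inside aSb, and how S -> b is copied in from B when the unit production S -> B is removed.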