Monday, 24 June 2013

Context Free Grammars, CFG

 Grammars that, when suitably restricted, generate the same languages as DFA's

  A grammar is defined as  G = (V, T, P, S) where
  V is a set of variables. We usually use capital letters for variables.
  T is a set of terminal symbols. This is the same as Sigma for a machine.
  P is a list of productions (rules) of the form:
      variable  ->  concatenation of variables and terminals
  S is the starting variable. S is in V. 

  A string z is accepted by a grammar G if some sequence of rules from P,
  applied in reverse, rewrites z to exactly the start variable S.

  We say that L(G) is the language generated (accepted) by the grammar G.

  To start, we restrict the productions P to be of the form
      A -> w         w is a concatenation of terminal symbols
      B -> wC        w is a concatenation of terminal symbols
                     A, B and C are variables in V
  and thus get a grammar that generates (accepts) a regular language.

  Suppose we are given a machine M = (Q, Sigma, delta, q0, F) with
  Q = { S }
  Sigma = { 0, 1 }
  q0 = S
  F = { S }
            delta    | 0 | 1 |
                  ---+---+---+
                   S | S | S |
                  ---+---+---+

  this looks strange because we would normally use q0 in place of S

  The regular expression for M is  (0+1)*

  We can write the corresponding grammar for this machine as
  G = (V, T, P, S) where
  V = { S }     the set of states in the machine
  T = { 0, 1 }  same as Sigma for the machine
  P =
       S -> epsilon | 0S | 1S

  S = S         the q0 state from the machine

  the construction of the rules for P is directly from M's delta
  If delta has an entry  from state S with input symbol 0 go to state S,
  the rule is   S -> 0S.
  If delta has an entry  from state S with input symbol 1 go to state S,
  the rule is   S -> 1S.

  There is a rule generated for every entry in delta.
  delta(qi,a) = qj  yields a rule  qi -> a qj

  An additional rule is generated for each final state, i.e. S -> epsilon
  (An optional encoding is to generate an extra rule for every transition
   to a final state: delta(qi,a) = any final state,  qi -> a
   with this option, if the start state is a final state, the production
   S -> epsilon is still required. )

  The shorthand notation S -> epsilon | 0S | 1S is the same as writing
  the three rules.  Read "|" as "or".
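  The delta-to-productions construction above can be sketched in Python.
  This is an illustration only: the function name, and encoding epsilon as
  the empty string, are choices of this sketch, not from the notes.

```python
# Sketch: build a right-linear grammar from a DFA's delta table.
# Assumptions of this sketch: states and input symbols are one-character
# strings, and "" encodes epsilon.

def dfa_to_grammar(states, sigma, delta, start, finals):
    """Return productions as a dict: variable -> list of right sides."""
    prods = {q: [] for q in states}
    # delta(qi, a) = qj yields the rule  qi -> a qj
    for (qi, a), qj in delta.items():
        prods[qi].append(a + qj)
    # each final state gets the extra rule  qi -> epsilon
    for q in finals:
        prods[q].append("")
    return prods

# the one-state machine M from above
delta = {("S", "0"): "S", ("S", "1"): "S"}
g = dfa_to_grammar({"S"}, {"0", "1"}, delta, "S", {"S"})
print(g)   # {'S': ['0S', '1S', '']}
```

  The result is exactly the production set P = { S -> 0S, S -> 1S, S -> epsilon }.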

  Grammars can be more powerful (read: accept a larger class of languages)
  than finite state machines (DFA's, NFA's, NFA-epsilon, regular expressions).

  For example the language L = { 0^i 1^i | i = 0, 1, 2, ... } is not a
  regular language. Yet, this language has a simple grammar
                 S -> epsilon | 0S1

  Note that this grammar violates the restriction needed to make the
  grammar's language a regular language, i.e. rules can only have terminal
  symbols and then one variable. The rule 0S1 has a terminal after the variable.

  A grammar for matching parenthesis might be
  G = (V, T, P, S)
  V = { S }
  T = { ( , ) }
  P = S -> epsilon | (S) | SS
  S = S

  We can check this by rewriting an input string

   ( ( ( ) ( ) ( ( ) ) ) )
   ( ( ( ) ( ) (  S  ) ) )    S -> (S) where the inside S is epsilon
   ( ( ( ) ( )    S    ) )    S -> (S)
   ( ( ( )  S     S    ) )    S -> (S) where the inside S is epsilon
   ( ( ( )     S       ) )    S -> SS
   ( (  S      S       ) )    S -> (S) where the inside S is epsilon
   ( (     S           ) )    S -> SS
   (         S           )    S -> (S)
               S              S -> (S)

   Thus the string ((()()(()))) is accepted by G because the rewriting
   produced exactly S, the start variable.
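   The rewriting check above can be mechanized. A small Python sketch,
   assuming the rules S -> epsilon | (S) | SS: repeatedly erasing an
   adjacent "()" pair mirrors reducing (S) with an epsilon inside, and
   if everything cancels the string rewrites to S.

```python
# Sketch: check matched parentheses by running the rules backwards.

def matched(s):
    while "()" in s:
        s = s.replace("()", "")   # reduce (S) with an epsilon inside
    return s == ""                # empty remainder means we reached S

print(matched("((()()(())))"))    # True
print(matched("(()"))             # False
```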

   More examples of constructing grammars from language descriptions:

   Construct a CFG for non empty Palindromes over T = { 0, 1 }
   The strings in this language read the same forward and backward.
     G = ( V, T, P, S)  T = { 0, 1 }, V = { S }, S = S, P is below:
     S -> 0 | 1 | 00 | 11 | 0S0 | 1S1
       We started the construction with S -> 0  and  S -> 1 
       the shortest strings in the language.
       S -> 0S0  is a palindrome with a zero added to either end
       S -> 1S1  is a palindrome with a one added to either end
       But, we needed  S -> 00  and  S -> 11  to get the even length
       palindromes started.
       "Non empty" means there can be no rule  S -> epsilon.
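       A quick membership test for this palindrome grammar can be sketched
       in Python (the function name is illustrative); it checks the base
       rules first, then peels matching symbols off both ends, mirroring
       S -> 0S0 and S -> 1S1.

```python
# Sketch: membership in the non-empty palindromes over {0, 1}.

def is_palindrome(s):
    if s in ("0", "1", "00", "11"):       # the base rules
        return True
    if len(s) > 2 and s[0] == s[-1]:      # S -> 0S0 or S -> 1S1
        return is_palindrome(s[1:-1])
    return False

print(is_palindrome("01010"))   # True
print(is_palindrome("0110"))    # True
print(is_palindrome("01"))      # False
```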
 
  Construct the grammar for the language L = { a^n b^n | n > 0 }
    G = ( V, T, P, S )  T = { a, b }  V = { S }  S = S  P is:
    S -> ab | aSb
    Because n>0 there can be no S -> epsilon
    The shortest string in the language is  ab
    a's have to be on the front, b's have to be on the back.
    When either an "a" or a "b" is added the other must be added
    in order to keep the count the same. Thus  S -> aSb.
    The toughest decision is when to stop adding rules.
    In this case start "generating" strings in the language
        S -> ab             ab      for n=1
        S -> aSb           aabb     for n=2
        S -> aaSbb        aaabbb    for n=3  etc.
    Thus, no more rules needed.
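     The "generating" steps above can be sketched in Python (an
     illustrative sketch; the function name is not from the notes):

```python
# Sketch: generate a^n b^n by expanding S -> ab | aSb, taking
# aSb for the first n-1 steps and ab for the last step.

def anbn(n):
    s = "S"
    for _ in range(n - 1):
        s = s.replace("S", "aSb")   # apply S -> aSb
    return s.replace("S", "ab")     # finish with S -> ab

print([anbn(n) for n in (1, 2, 3)])   # ['ab', 'aabb', 'aaabbb']
```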

    "Generating" the strings in a language defined by a grammar
    is also called "derivation" of the strings in a language.
 
CFG Derivation Tree 
Given a grammar with the usual representation  G = (V, T, P, S)
  with variables V, terminal symbols T, set of productions P and
  the start symbol from V called S.

  A derivation tree is constructed with
  1) each tree vertex is a variable or terminal or epsilon
  2) the root vertex is S
  3) interior vertices are from V, leaf vertices are from T or epsilon
  4) an interior vertex A has children, in order, left to right,
     X1, X2, ... , Xk when there is a production in P of the
     form  A -> X1 X2 ... Xk
  5) a leaf can be epsilon only when there is a production A -> epsilon
     and the leaf's parent can have only this child.

  Watch out! A grammar may have an unbounded number of derivation trees.
  It just depends on which production is expanded at each vertex.

  For any valid derivation tree, reading the leaves from left to right
  gives one string in the language defined by the grammar.
  There may be many derivation trees for a single string in the language.

  If the grammar is a CFG then a leftmost derivation tree exists
  for every string in the corresponding CFL. There may be more than one
  leftmost derivation tree for some string. See example below and
  ((()())()) example in previous lecture.

  If the grammar is a CFG then a rightmost derivation tree exists
  for every string in the corresponding CFL. There may be more than one
  rightmost derivation tree for some string.

  The grammar is called "ambiguous" if the leftmost (rightmost) derivation
  tree is not unique for every string in the language defined by the grammar.

  The leftmost and rightmost derivations are usually distinct but might
  be the same.

  Given a grammar and a string in the language represented by the grammar,
  a leftmost derivation tree is constructed bottom up by finding a
  production in the grammar that has the leftmost character of the string
  (possibly more than one may have to be tried) and building the tree
  towards the root. Then work on the second character of the string.
  After much trial and error, you should get a derivation tree with a root S.
  We will get to the CYK algorithm that does the parsing in a few lectures.

  Examples: Construct a grammar for L = { x 0^n y 1^n z   n>0 }
            Recognize that  0^n y 1^n is a base language, say B
            B -> y | 0B1    (The base y, the recursion 0B1 )
            Then, the language is completed  S -> xBz
            using the prefix, base language and suffix.
            (Note that x, y and z could be any strings not involving n)

            G = ( V, T, P, S ) where
            V = { B, S }     T = { x, y, z, 0, 1 }  S = S
            P =    S -> xBz
                   B -> y | 0B1
                                               *
  Now construct an arbitrary derivation for  S =>  x00y11z
                                               G

  A derivation always starts with the start variable, S.
  The "=>", "*" and "G" stand for "derivation", "any number of
  steps", and "over the grammar G" respectively.

  The intermediate terms, called sentential forms, may contain
  variables and terminal symbols.
  
  Any variable, say B, can be replaced by the right side of any
  production of the form  B -> <right side>

  A leftmost derivation always replaces the leftmost variable
  in the sentential form. (In general there are many possible
  replacements, the process is nondeterministic.)

  One possible derivation using the grammar above is
       S => xBz => x0B1z => x00B11z => x00y11z
  The derivation must obviously stop when the sentential form
  has only terminal symbols. (No more substitutions possible.)
  The final string is in the language of the grammar. But, this
  is a very poor way to generate all strings in the grammar!
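  The leftmost replacement rule can be sketched in Python for this
  grammar. An assumption of this sketch: upper-case characters stand for
  variables, everything else is a terminal.

```python
# Sketch: a leftmost derivation for the grammar S -> xBz, B -> y | 0B1.

prods = {"S": ["xBz"], "B": ["y", "0B1"]}

def leftmost_step(sentential, choice):
    """Replace the leftmost variable using production index `choice`."""
    for i, ch in enumerate(sentential):
        if ch.isupper():                   # found the leftmost variable
            return sentential[:i] + prods[ch][choice] + sentential[i+1:]
    return sentential                      # all terminals: derivation done

s = "S"
for pick in (0, 1, 1, 0):   # S => xBz => x0B1z => x00B11z => x00y11z
    s = leftmost_step(s, pick)
print(s)    # x00y11z
```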

  A "derivation tree", sometimes called a "parse tree", uses the
  rules above: start with the starting symbol, then expand the tree
  by creating branches from each variable using the right side of
  any of that variable's productions, etc.

                               S
                             / | \
                           /   |   \
                         /     |     \
                       /       |       \
                     /         |         \
                    x          B          z
                             / | \
                           /   |   \
                         /     |     \
                       /       |       \
                      0        B        1
                             / | \
                           /   |   \
                          0    B    1
                               |
                               y

  The derivation ends with  x 0 0 y 1 1 z : all leaves are
  terminal symbols, and reading them left to right gives a string
  in the language generated by the grammar.
 
  More examples of grammars are:
  G(L) for L = { x a^n  y b^k z   k > n > 0 }
       note that there must be more b's than a's  thus
       B -> aybb | aBb | Bb
  G = ( V, T, P, S )  where
  V = { B, S }   T = { a, b, x, y, z }  S = S
  P =   S -> xBz     B -> aybb | aBb | Bb

  Incremental changes for "n > k > 0"     B -> aayb | aBb | aB
  Incremental changes for "n >= k >= 0"   B -> y | aBb | aB

  Independent exponents do not cause a problem when they are nested,
  equivalent to nesting parentheses.
  G(L) for L = { a^i b^j c^j d^i e^k f^k  i>=0, j>=0, k>=0 }
                   |   |   |   |   |   |
                   |   +---+   |   +---+
                   +-----------+

  G = ( V, T , P, S )
  V = { I, J, K, S }  T = { a, b, c, d, e, f }  S = S
  P =   S -> IK
        I -> J | aId
        J -> epsilon | bJc
        K -> epsilon | eKf
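  A quick spot-check of this grammar in Python, generating sample strings
  directly from the exponent pattern rather than by derivation (an
  illustrative sketch; the helper name is not from the notes):

```python
# Sketch: strings of { a^i b^j c^j d^i e^k f^k } built from exponents.

def word(i, j, k):
    return "a"*i + "b"*j + "c"*j + "d"*i + "e"*k + "f"*k

samples = [word(i, j, k) for i in range(2) for j in range(2) for k in range(2)]
print(samples)
# ['', 'ef', 'bc', 'bcef', 'ad', 'adef', 'abcd', 'abcdef']
```

  Each sample can also be reached by expanding S -> IK and then the
  I, J, K rules the matching number of times.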

 G(L) for L = { a^i b^j c^k | a relation linking all three exponents,
                such as i=j=k>0 or 0<i<k<j }
 cannot be a context free grammar. Try it.
 This will be seen intuitively with the push down automata and is provable
 with the pumping lemma for context free languages.


 What is a leftmost derivation tree for some string?
 It is a process that looks at the string left to right and
 runs the productions backwards.
 Here is an example, time starts at top and moves down.

 Given G = (V, T, P, S)  V={S, E, I} T={a, b, c} S=S   P=
       I -> a | b | c
       E -> I | E+E | E*E
       S -> E                (a subset of grammar from book)

 Given a string   a + b * c
                  I
                  E
                  S            derived but not used
                      I
                      E
                 [E + E]
                    E
                    S          derived but not used
                          I
                          E
                   [E   * E]
                      E 
                      S        done! Have S and no more input.

  Leftmost derivation tree: just turn upside down, delete unused.

                      S
                      |
                      E
                    / | \
                   /  |  \
                  /   |   \
                 E    *    E
               / | \       |
              E  +  E      I
              |     |      |
              I     I      c
              |     |
              a     b

Check: Read leaves left to right, must be initial string, all in T.
       Interior nodes must be variables, all in V.
       Every vertical connection must be traceable to a production.
 

CFG simplification algorithm

 The goal here is to take an arbitrary Context Free Grammar
  G = (V, T, P, S) and perform transformations on the grammar that
  preserve the language generated by the grammar but reach a
  specific format for the productions.

  Overview: Step 1a) Eliminate useless variables that can not become terminals
            Step 1b) Eliminate useless variables that can not be reached
            Step 2)  Eliminate epsilon productions
            Step 3)  Eliminate unit productions
            Step 4)  Make productions Chomsky Normal Form
            Step 5)  Make productions Greibach Normal Form

            The CYK parsing uses Chomsky Normal Form as input
            The CFG to NPDA uses Greibach Normal Form as input

  Details:  one step at a time

  1a) Eliminate useless variables that can not become terminals
      See 1st Ed. book p88, Lemma 4.1, figure 4.7
          2nd Ed. section 7.1
      Basically: Build the set NEWV from productions of the form
      V -> w  where V is a variable and w is one or more terminals.
      Insert V into the set NEWV.
      Then iterate over the productions, now accepting any variable
      in w as a terminal if it is in NEWV. Thus NEWV is all the
      variables that can derive a string of all terminal symbols.

      Now, all productions containing a variable not in NEWV
      can be thrown away. Thus T is unchanged, S is unchanged,
      V=NEWV and P may become the same or smaller.
      The new grammar G=(V,T,P,S) represents the same language.

  1b) Eliminate useless variables that can not be reached from S
      See 1st Ed. book p89, Lemma 4.2, 2nd Ed. book 7.1.
      Set V'={S}, T'=phi, mark all productions as unused.
      Iterate repeatedly through all productions until no change
      in V' or T'. For any production A -> w, with A in V' 
      insert the terminals from w into the set T' and insert
      the variables from w into the set V' and mark the
      production as used.

      Now, delete all productions from P that are marked unused.
      V=V', T=T', S is unchanged. 
      The new grammar G=(V,T,P,S) represents the same language.


  2)  Eliminate epsilon productions.
      See 1st Ed. book p90, Theorem 4.3, 2nd Ed. book 7.1
      This is complex. If the language of the grammar contains
      the null string, epsilon, then in principle remove epsilon
      from the grammar, eliminate epsilon productions, then put
      S -> epsilon  back into the grammar later.

      The new grammar G=(V,T,P,S) represents the same language except
      the new language does not contain epsilon.


  3)  Eliminate unit productions.
      See 1st Ed. book p91, Theorem 4.4, 2nd Ed. 7.1
      Iterate through productions finding A -> B type "unit productions".
      Delete this production from P.
      Make a copy of all productions  B -> gamma, replacing B with A.
      Be careful of  A -> B,  B -> C, C -> D type cases,
      there needs to be copies of B -> gamma, C -> gamma, D -> gamma for A.

      Delete duplicate productions. (sort and remove adjacent duplicate)
      The new grammar G=(V,T,P,S) represents the same language.


  Briefly, some pseudo code for the above steps.

  Step 1a) The set V' = phi
           loop through the productions, P, to find:
             A -> w  where w is all terminals
                     union V' with A
           n := 0
           while n /= |V'|
             n := |V'|
             loop through productions to find:
               A -> alpha where alpha is only terminals and variables in V'
                    union V' with A
           end while
           Eliminate := V - V'
           loop through productions
             delete any production containing a variable in Eliminate,
           V := V'
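
   Step 1a's pseudo code translates to Python roughly as follows (a
   sketch, assuming single upper-case characters are variables and
   productions are stored as a dict of right-side strings):

```python
# Sketch of step 1a: keep only variables that can derive terminal strings.

def generating_vars(prods):
    V = set()
    changed = True
    while changed:
        changed = False
        for A, rights in prods.items():
            if A in V:
                continue
            for w in rights:
                # accept a variable already in V as if it were a terminal
                if all((not c.isupper()) or c in V for c in w):
                    V.add(A)
                    changed = True
                    break
    return V

prods = {"S": ["AB", "a"], "A": ["aA"], "B": ["b"]}
print(generating_vars(prods))   # A never derives all terminals
```

   Here A is useless: every production containing A would then be deleted.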
           
  Step 1b) The set V' = {S}
           The set T' = phi
           n := 0
           while n /= |V'| + |T'|
             n := |V'| + |T'|
             loop through productions to find:
               A -> alpha  where A in V'
                           union V' with variables in alpha
                           union T' with terminals in alpha
           end while
           loop through productions
             delete any production containing anything outside V' T' and epsilon
           V := V'
           T := T'
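
   Step 1b can be sketched the same way (same dict encoding; lower-case
   characters are assumed to be terminals in this sketch):

```python
# Sketch of step 1b: keep only symbols reachable from the start symbol.

def reachable(prods, start="S"):
    V, T = {start}, set()
    changed = True
    while changed:
        changed = False
        for A, rights in prods.items():
            if A not in V:
                continue
            for w in rights:
                for c in w:
                    # upper case goes to V', everything else to T'
                    target = V if c.isupper() else T
                    if c not in target:
                        target.add(c)
                        changed = True
    return V, T

prods = {"S": ["aB"], "B": ["b"], "C": ["c"]}   # C is unreachable
print(reachable(prods))    # ({'S', 'B'}, {'a', 'b'})
```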
           
  Step 2)  The set N = phi
           n := -1
           while n /= |N|
             n = |N|
             loop through productions to find:
               A -> epsilon
                            union N with A
                            delete production
                            
               A -> alpha   where no terminals in alpha and
                            all variables in alpha are in N
                            union N with A
                            delete production
           end while
           if S in N set null string accepted
           loop through productions
             A -> alpha   where at least one variable in alpha in N
                          generate rules A -> alpha'  where alpha'
                          is all combinations of eliminating the
                          variables in N
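
   Step 2's combination-generating loop can be sketched as follows (same
   dict encoding as above; "" stands for epsilon; an illustrative sketch):

```python
# Sketch of step 2: find nullable variables, then rewrite each production
# with every combination of the nullable occurrences dropped.

from itertools import product

def nullable_vars(prods):
    N = set()
    changed = True
    while changed:
        changed = False
        for A, rights in prods.items():
            if A not in N and any(all(c in N for c in w) for w in rights):
                N.add(A)        # w == "" (epsilon) also lands here
                changed = True
    return N

def drop_epsilons(prods):
    N = nullable_vars(prods)
    new = {}
    for A, rights in prods.items():
        out = set()
        for w in rights:
            # each nullable occurrence may stay or be erased
            options = [(c, "") if c in N else (c,) for c in w]
            for combo in product(*options):
                out.add("".join(combo))
        out.discard("")          # remove the epsilon productions
        new[A] = sorted(out)
    return new

prods = {"S": ["aBb"], "B": ["", "bB"]}
print(drop_epsilons(prods))   # {'S': ['aBb', 'ab'], 'B': ['b', 'bB']}
```

   If S itself is nullable, remember to record that the null string is
   accepted before discarding S -> epsilon.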
                          
  Step 3) P' := all non unit productions ( not A -> B )
          U  := all unit productions
          loop through productions in U, |U| times, to find:
            A -> A   
                      ignore this
                      
            A -> B
                      loop through productions in P'
                      copy/substitute  B -> gamma to A -> gamma in P'
          P := P'
           eliminate duplicate productions (e.g. sort and compare adjacent entries)
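
  Step 3 can be sketched by first closing each variable over its unit
  chains (handling the A -> B, B -> C cases), then copying up the
  non-unit productions (same dict encoding; an illustrative sketch):

```python
# Sketch of step 3: eliminate unit productions A -> B.

def remove_units(prods):
    variables = set(prods)
    is_unit = lambda w: len(w) == 1 and w in variables
    # unit_reach[A] = every variable reachable from A by unit steps
    unit_reach = {A: {A} for A in prods}
    changed = True
    while changed:
        changed = False
        for A in prods:
            for B in list(unit_reach[A]):
                for w in prods[B]:
                    if is_unit(w) and w not in unit_reach[A]:
                        unit_reach[A].add(w)
                        changed = True
    # copy every non-unit production of each reachable B up to A;
    # the set comprehension also removes duplicates
    return {A: sorted({w for B in unit_reach[A] for w in prods[B]
                       if not is_unit(w)})
            for A in prods}

prods = {"A": ["B"], "B": ["C"], "C": ["c", "cc"]}
print(remove_units(prods))
# {'A': ['c', 'cc'], 'B': ['c', 'cc'], 'C': ['c', 'cc']}
```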