Real Computer Science begins where we almost stop reading ...: Chomsky Normal Form

Monday, 24 June 2013

Chomsky Normal Form

 Chomsky Normal Form is used by the CYK algorithm to determine
  if a string is accepted by a Context Free Grammar.

  Step 4) in the overall grammar "simplification" process is
  to convert the grammar to Chomsky Normal Form.
  Productions can be one of two formats A -> a or A -> BC
  The right side of the production is either exactly
  one terminal symbol or exactly two variables.

  The grammars must have the "simplification" steps 1), 2) and 3) out
  of the way, that is 1) No useless variables, 2) No nullable variables and
  3) no unit productions.


 Step 4) of "simplification" is the following algorithm:

   'length' refers to the number of variables plus terminal
   symbols on the right side of a production.

   Loop through the productions
     For each production with length greater than 1 do
       Replace each terminal symbol with a new variable and
       add a production   new variable -> terminal symbol.

   Loop through the productions
     For each production with length grater than 2 do
       Replace two rightmost variables with a new variable and
       add a production    new variable -> two rightmost variables.
       (Repeat - either on a production or loop until no replacements.)

   Now the grammar, as represented by the productions, is in Chomsky
   Normal Form. proceed with CYK. 

   An optimization is possible but not required, for any two productions
   with the same right side, delete the second production and
   replace all occurrences of the second productions left variable
   with the left variable of the first production in all productions.

 Example grammar:

  G = (V, T, P, S)  V={S,A} T={a,b} S=S
 
  S -> aAS
  S -> a
  A -> SbA
  A -> SS
  A -> ba 

  First loop through productions (Check n>1)

  S -> aAS becomes  S -> BAS   (B is the next unused variable name)
                    B -> a
  S -> a   stays    S -> a
  A -> SbA becomes  A -> SCA   (C is the next unused variable name)
                    C -> b
  A -> SS  stays    A -> SS
  A -> ba  becomes  A -> DE
                    D -> b
                    E -> a

  Second loop through productions (Check n>2)

  S -> BAS becomes  S -> BF    (F is the next unused variable)
                    F -> AS
  B -> a   stays    B -> a
  S -> a   stays    S -> a
  A -> SCA becomes  A -> SG
                    G -> CA
  C -> b   stays    C -> b
  A -> SS  stays    A -> SS
  A -> DE  stays    A -> DE
  D -> b   stays    D -> b
  E -> a   stays    E -> a


  Optimization is possible, B -> a, S -> a, E -> a can be replaced
  by the single production  S -> a  (just to keep 'S')
  and all occurrences of 'B' and 'E' get replaced by 'S'.

  Similarly D -> b can be deleted, keeping the C -> b production and
  substituting 'C' for 'D'.

  Giving the reduced Chomsky Normal Form:

  S -> SF
  F -> AS
  S -> a
  A -> CG
  G -> CA
  C -> b
  A -> SS
  A -> CS

For a computer generated reduction, a different naming convention
was chosen (to aid in debugging). First, a terminal symbol "a"
was replaced the prefixing "T_", thus "a" becomes "T_a".
Once a substitution is made, that substitution is remembered
so that there will be at most |T| rules generated of the form
  T_a -> a

When there are more that two variables on the right hand side of
a production, the new production is named "C_" concatenated with
the last two variables separated by an underscore.
This provides an easy reduction if the same two variables are
replaced more than once. The productions will be sorted and
duplicates will sort together and can be detected and eliminated quickly.
(In order to be completely safe, this algorithm requires that
 no underscores were used in the initial grammar.)

after eliminate, sorted productions:
A -> a 
B -> a 
S -> A B 
S -> a A 
S -> a S a S a 
S -> a S a a 
S -> a a S a 
S -> a a a 

Chomsky 1, replace terminal with variable
Chomsky part 1, sorted productions:
A   ->  a
B   ->  a
S   ->  A    B
S   ->  T_a  A
S   ->  T_a  S    T_a  S  T_a
S   ->  T_a  S    T_a  T_a
S   ->  T_a  T_a  S    T_a
S   ->  T_a  T_a  T_a
T_a ->  a

Chomsky 2, new production for each pair over two
Chomsky Part 2 generated productions
C_ST_a   ->  S    T_a
C_T_aS   ->  T_a  S
C_ST_a   ->  S    T_a
C_T_aT_a ->  T_a  T_a
C_ST_a   ->  S    T_a
C_ST_a   ->  S    T_a
C_T_aS   ->  T_a  S
C_T_aT_a ->  T_a  T_a

after Chomsky, sorted productions:
A        ->  a
B        ->  a
C_ST_a   ->  S    T_a
C_T_aS   ->  T_a  S
C_T_aT_a ->  T_a  T_a
S        ->  A    B
S        ->  T_a  A
S        ->  T_a  C_ST_a
S        ->  T_a  C_T_aS
S        ->  T_a  C_T_aT_a
T_a      ->  a

after Chomsky, Variables:
A  B  C_ST_a  C_T_aS  C_T_aT_a  S  T_a  

Terminals unchanged, S unchanged.

This simple structure for the productions makes
possible an efficient parser.
Real Computer Science begins where we almost stop reading ...

Monday, 24 June 2013

Chomsky Normal Form

No comments:

Post a Comment