General information about grammars
We define all grammars using the four tuple G = (V, T, P, S)
V is a finite set of variables, typically upper case letters,
and may have additional characters.
T is a finite set of terminal symbols, typically lower case letters,
and may be digits such as {0, 1} or other letters and digits.
Terminal symbols would be the tape alphabet in automata.
The intersection of V and T must be empty.
The zero length string, epsilon, may be used as a terminal symbol.
S is a distinguished element of V, the starting variable.
P are a finite set of productions written as alpha -> beta
where alpha and beta are sequences made up of the
elements of V and T. The types of grammars defined
below are restrictions on the content of alpha and beta.
Productions may be written one per line or may use
a short hand notation with the vertical bar | meaning "or".
S -> a
S -> aA
S -> BC
may be written as S -> a | aA | BC
Type 0 Recursive Grammar
The Chomsky Hierarchy of Formal Languages 0, 1, 2, 3.
The class of recursive grammars has no restriction on
alpha and beta. An arbitrary example is:
S -> a | bb | A | aAbCcc | AABBcC
aSAB -> c | C | BaA | aAbBBccCCCd
ab -> c | ba | C | cBaAA | A
C -> epsilon | CC | AaBb
A -> B
Simplification rules may be available to eliminate
useless productions.
There may be variables that can not be reached from S.
There may be variables that can not become terminal symbols.
A Recursive Grammar defines the same class of languages as:
Recursive Languages
Turing Machines
Type 1 Context Sensitive Grammar
A context sensitive grammar implies some alpha may not
be a single variable.
The restriction on a Type 0 grammar is that the length
of beta, right hand side, shall not be greater than
the length of alpha, left hand side.
For every production alpha -> beta
|beta| <= |alpha|
S -> a | A
AB -> cB | B | epsilon
aAb -> AaA | AA | A | a
Type 2 Context Free Grammar
A context free grammar implies alpha is a single variable.
S -> a | A | AB | aBbAc
A -> AB | ccBBAAaa
B -> BBB
All Context Free Grammars can be parsed by CYKP,
Cocke Younger Kasami parser
All Context Free Grammars can be converted to:
Chomsky Normal Form, productions V -> a V -> AB exactly two variables
Greibach Normal Form, productions V -> a V -> aW
a exactly one terminal, W any number of variables
A Context Free Grammar defines the same class of languages as:
NPDA Nondeterministic Push Down Automata
Type 3 Regular Grammar
The productions of a regular grammar are restricted to:
V -> a a single terminal
V -> aA a single terminal followed by a single variable
Conversion from a DFA to a regular grammar uses the states
of the DFA as variables and the alphabet of the DFA as
terminal symbols. The DFA delta transition table becomes
the productions.
a b
---+----+---
q0 | q1 | q2
q1 | q1 | q2
q2 | q2 | q1
renaming q0 to S, q1 to A, q2 to B
S -> aA
S -> bB
A -> aA | bB
B -> aB | bA
Additional rules are needed for final states.
If q2 is a file state, any transition to q2 gets a production
with only a terminal symbol.
S -> b
A -> b
B -> a
A Regular Grammar defines the same class of languages as:
Regular Languages
Regular Expressions
Finite Automata DFA, NFA, NFA-epsilon
Grammar Derivation Tree
The top-down tree for context free grammar:
The non leaf nodes of the derivation tree are labeled with
the variables of the grammar. The starting, root, node of
the tree is S. Unreachable variables will not appear in
the derivation tree.
The leaf nodes of the derivation tree are labeled with
the terminals of the grammar. Variables that can not
become terminals may appear as leaf nodes, yet may be
eliminated.
Great care must be taken to not keep repeating a
production. There may be a string of infinite length
in the grammar, and you would be drawing forever.
The goal is to end the tree with all leaf nodes
being terminal symbols.
There can be many different derivation trees for any
given grammar, based on the choice and order the
productions are drawn in the tree.
The leaf nodes, reading left to right, top to bottom,
give one string of terminal symbols that is in
the language of the grammar.
To start, chose one production of S -> ,
usually the longest or the one that has the most
variables.
Example: (numbering the productions to annotate the tree)
1 S -> 0A
2 S -> 1S
3 A -> 0A
4 A -> 1B
5 B -> 0B
6 B -> 1B
7 B -> 0
8 B -> 1
9 B -> epsilon
This would have a DFA and regular expression definition:
| 0 | 1 use Q0 = S, Q1 = A, Q2 = B
---+----+--- start Q0, final Q2, sigma {0,1}
Q0 | Q1 | Q0
Q1 | Q1 | Q2 1* 0 0* 1 (0+1)* or
Q2 | Q2 | Q2 (0+1)* 01 (0+1)*
The smallest tree can be:
A different derivation tree works bottom up,
starting with a string of terminal symbols from
the grammar as leaf nodes. If the string is in the
language of the grammar, there is a derivation tree
with a root node S. The tree is not necessarily unique.
Typically, find the left most terminal or string
of terminals that may be replaced by a variable.
Work up the tree, aiming for a root node of S.
Another typical method, is to work from the
right most terminal. If either method works,
it is provable that both will work.
This bottom up tree can be interpreted as a parse tree
when read top down. Given string 1101001 build
the derivation tree:
Grammar Parsing
There are many parsing algorithms. For context free grammars,
that are converted to Chomsky Normal Form, the CYK algorithm
is typically used.
CYK algorithm on Wikipedia
No comments:
Post a Comment