Real Computer Science begins where we almost stop reading ...: TOC Chapter 8

Saturday, 26 January 2013

TOC Chapter 8

Chapter 8. Context-Free Grammars–Properties and Parsing

The pumping lemma for regular sets, presented in an earlier chapter, was used to prove that some languages are non-regular. We now give a pumping lemma for context-free languages (CFLs), whose main application is to show that some languages are not context-free. The idea behind this lemma is that every sufficiently long string in a CFL has substrings that can be pumped to obtain infinitely many strings in the language.

8.1. Pumping Lemma for CFL

Theorem 8.1

Let L be a CFL. Then there exists a number k (the pumping length) such that if w is a string in L of length at least k, then w can be written as w = uυxyz satisfying the following conditions:

|υy| > 0
|υxy| ≤ k
For each i ≥ 0, uυⁱxyⁱz ∊ L

Proof Let G be a context-free grammar (CFG) in Chomsky normal form (CNF) generating L. Let n be the number of non-terminals of G. Take k = 2ⁿ. Let s be a string in L such that |s| ≥ k. Any parse tree in G for s must be of depth at least n + 1. This can be seen as follows:

If the parse tree has depth n, it has no path of length greater than n; then the maximum length of the word derived is 2ⁿ⁻¹. This statement can be proved by induction. If n = 1, the tree consists of the root S with a single terminal child (S → a), so the word has length 1 = 2⁰. If n = 2, the root S has two non-terminal children, each with a single terminal child (S → AB, A → a, B → b), so the word has length 2 = 2¹. Assuming that the result holds up to i − 1, consider a tree T with depth i. No path in this tree is of length greater than i. The root of T has two subtrees, T₁ and T₂.

T₁ and T₂ have depth i − 1 and the maximum length of the word derivable in each is 2ⁱ⁻², and so the maximum length of the string derivable in T is 2ⁱ⁻² + 2ⁱ⁻² = 2ⁱ⁻¹.

Choose a parse tree for s that has the least number of nodes. Consider the longest path in this tree. This path is of length at least n + 1, so there are at least n + 1 occurrences of non-terminals along it. Consider the nodes in this path starting from the leaf node and going up towards the root. Since G has only n non-terminals, by the pigeonhole principle some non-terminal on this path must repeat. Consider the first pair of occurrences of a repeating non-terminal, say A, met while reading along the path from bottom to top. As shown in Figure 8.1, this repetition allows us to replace the subtree under one occurrence of A with the subtree under the other occurrence of A. The resulting legal parse trees are given in Figure 8.1.

Figure 8.1. Derivation trees showing pumping property

We divide s as uυxyz as in Figure 8.1(i). Each occurrence of A has a subtree under it generating a substring of s. The upper occurrence of A (nearer the root) generates the string υxy, and the lower occurrence of A generates x. Hence, one can replace the lower occurrence of A, which produces x, by a copy of the subtree that produces υxy, as shown in Figure 8.1(ii). Repeating this, strings of the form uυⁱxyⁱz for i > 0 are generated. One can also replace the subtree rooted at the upper A, which produces υxy, by the subtree that produces x, as in Figure 8.1(iii). Hence, the string uxz is generated. In essence, we have

S ⇒* uAz, A ⇒* υAy, A ⇒* x.

Hence,

S ⇒* uAz ⇒* uυAyz ⇒* uυxyz = s.

Therefore, we have

S ⇒* uAz ⇒* uυⁱAyⁱz ⇒* uυⁱxyⁱz ∊ L for each i ≥ 0.

Both υ and y simultaneously cannot be empty as we consider the grammar in CNF. The lower A will occur in the left or right subtree. If it occurs in the left subtree, y cannot be ε and if it occurs in the right subtree, υ cannot be ε.

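The length of υxy is at most k. The upper occurrence of A generates υxy and the lower occurrence generates x. Because the repeated pair was chosen as the first one met from the bottom, the path from the upper A down to the leaf has length at most n + 1. Hence the subtree rooted at the upper A has depth at most n + 1, and by the bound proved above the length of υxy is at most 2ⁿ (= k). This completes the proof.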

One can use pumping lemma for showing that some languages are not context-free. The method of proof will be similar to that of application of pumping lemma for regular sets.
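As an illustration of how the lemma is used, the following minimal Python sketch searches every decomposition w = uυxyz allowed by the lemma and checks whether any of them survives pumping. The function names (`pumpable`, `in_anbncn`, `in_anbn`), the sample languages, and the cut-off `max_i` are assumptions made for this example; a finite search of this kind only illustrates the argument for one string and one value of k and is not a proof.

```python
def in_anbncn(s: str) -> bool:
    """Membership test for {a^n b^n c^n | n >= 0}."""
    n = len(s) // 3
    return len(s) == 3 * n and s == "a" * n + "b" * n + "c" * n


def in_anbn(s: str) -> bool:
    """Membership test for {a^n b^n | n >= 0}."""
    n = len(s) // 2
    return len(s) == 2 * n and s == "a" * n + "b" * n


def pumpable(w: str, k: int, member, max_i: int = 3) -> bool:
    """Return True if some split w = u v x y z with |vy| > 0 and |vxy| <= k
    keeps u v^i x y^i z in the language for every i in 0..max_i."""
    n = len(w)
    for start in range(n + 1):                       # v x y = w[start:end]
        for end in range(start + 1, min(start + k, n) + 1):
            for a in range(start, end + 1):          # v = w[start:a]
                for b in range(a, end + 1):          # x = w[a:b], y = w[b:end]
                    u, v, x = w[:start], w[start:a], w[a:b]
                    y, z = w[b:end], w[end:]
                    if not v and not y:
                        continue                     # condition |vy| > 0
                    if all(member(u + v * i + x + y * i + z)
                           for i in range(max_i + 1)):
                        return True
    return False


print(pumpable("aaaabbbb", 4, in_anbn))        # a valid split exists
print(pumpable("aaaabbbbcccc", 4, in_anbncn))  # no split survives pumping
```

For aⁿbⁿ the split υ = a, x = ε, y = b around the middle of the string can be pumped, whereas for a⁴b⁴c⁴ every admissible split already fails at i = 0, which mirrors the usual proof that {aⁿbⁿcⁿ} is not context-free.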

8.2. Closure Properties of CFL

In this section, we investigate the closure of CFLs under operations such as union, intersection, difference, substitution, homomorphism, and inverse homomorphism. The first result that we prove is closure under substitution, from which we establish closure under union, catenation, catenation closure (*), catenation +, and homomorphism.

Theorem 8.2

Let L be a CFL over T and σ be a substitution on T such that σ(a) is a CFL for each a in T. Then σ(L) is a CFL.

Proof Let G = (N, T, P, S) be a CFG generating L. Since σ(a) is a CFL, let G_a = (N_a, T_a, P_a, S_a) be a CFG generating σ(a) for each a ∊ T. Without loss of generality, N_a ∩ N_b = φ and N_a ∩ N = φ for a ≠ b, a, b ∊T. We now construct a CFG G′= (N′, T′, P′, S′) which generates σ (L) as follows:

1.	N′ is the union of N and the sets N_a, a ∊ T.
2.	T′ is the union of the sets T_a, a ∊ T, and S′ = S.
3.	P′ consists of all productions in P_a for each a ∊ T, together with all productions in P in which each terminal a occurring in a rule is replaced by S_a; i.e., in A → α, every occurrence of a (∊ T) in α is replaced by S_a.
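The construction of G′ can be sketched in a few lines of Python. The representation is an assumption made for this example: a grammar is a tuple (N, T, P, S), a production is a pair (left-hand side, tuple of right-hand-side symbols), and `sigma` maps each terminal a of G to a grammar G_a whose non-terminals are already disjoint from all the others. The sample grammars are hypothetical.

```python
def substitute(G, sigma):
    """Build G' generating sigma(L(G)) from G and the grammars sigma[a] = G_a."""
    N, T, P, S = G
    new_N, new_T, new_P = set(N), set(), []
    for a in T:
        N_a, T_a, P_a, S_a = sigma[a]
        new_N |= N_a            # N' collects N and every N_a
        new_T |= T_a            # T' collects every T_a
        new_P.extend(P_a)       # P' keeps all productions of every G_a
    for lhs, rhs in P:
        # In each production of G, replace terminal a by the start symbol S_a.
        new_rhs = tuple(sigma[s][3] if s in T else s for s in rhs)
        new_P.append((lhs, new_rhs))
    return new_N, new_T, new_P, S


# Hypothetical example: L = {a^n b^n | n >= 1}, sigma(a) = 0+, sigma(b) = {1}.
G = ({"S"}, {"a", "b"}, [("S", ("a", "S", "b")), ("S", ("a", "b"))], "S")
G_a = ({"X"}, {"0"}, [("X", ("0", "X")), ("X", ("0",))], "X")
G_b = ({"Y"}, {"1"}, [("Y", ("1",))], "Y")

N2, T2, P2, S2 = substitute(G, {"a": G_a, "b": G_b})
for lhs, rhs in P2:
    print(lhs, "->", " ".join(rhs))
```

In the result, the productions of G appear as S → X S Y and S → X Y, while the productions of G_a and G_b are carried over unchanged.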

Any derivation tree of G′ will typically look as in the following figure (Figure 8.2).

Figure 8.2. A derivation tree showing a string obtained by substitution

Here ab... k is a string of L and x_ax_b... x_k is a string of σ(L). To understand the working of G′ producing σ(L), we have the following discussion:

A string w is in L(G′) if and only if w is in σ(L). Suppose w is in σ(L). Then, there is some string x = a₁ ... a_k in L and strings x_i in σ(a_i), 1 ≤ i ≤ k, such that w = x₁ ... x_k. By the construction of G′, the string S_{a₁} ... S_{a_k} is derived from S in G′ (since a₁ ... a_k ∊ L). From each S_{a_i}, the string x_i ∊ σ(a_i) is generated, because G′ includes the productions of G_{a_i}. This is clear from the derivation tree pictured above. Hence w = x₁ ... x_k belongs to L(G′).

Conversely, suppose w ∊ L(G′); we argue with the help of the parse tree above. The start symbol of both G and G′ is S, and the non-terminal sets of G and of the grammars G_a are pairwise disjoint. Hence any parse tree T of w in G′ has a top part that uses only the modified productions of P and yields S_{a₁} ... S_{a_k}; the corresponding derivation in G yields a₁ ... a_k, so one can identify a string a₁a₂ ... a_k in L(G). Below each S_{a_i}, only productions of G_{a_i} are used, yielding a string x_i in σ(a_i), with w = x₁ ... x_k. Since x₁ ... x_k is formed by substituting the strings x_i for the a_i, we conclude w ∊ σ(L).

8.3. Decidability Results for CFL

The three decision problems that we studied for regular languages are emptiness, finiteness, and membership problems. The same can be studied for CFLs also. The discussion of the results under this section is based on either the representation of a CFL as in a PDA form or in simplified CFG form. Hence, we will be using CFG in CNF or a PDA which accepts by empty stack or final state.

Theorem 8.8

Given a CFL L, there exists an algorithm to test whether L is empty, finite or infinite.

Proof To test whether L is empty, one checks whether the start symbol S of the CFG G = (N, T, P, S) which generates L is useful. S is useful if and only if it derives some terminal string, that is, if and only if L ≠ φ.
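The emptiness test can be sketched as a fixed-point computation of the non-terminals that derive some terminal string. The grammar representation (productions as pairs of a left-hand side and a tuple of right-hand-side symbols) and the names `generating` and `is_empty` are assumptions made for this example.

```python
def generating(T, P):
    """Return the set of non-terminals that derive at least one terminal string."""
    gen = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in P:
            # lhs is generating once every symbol on the right-hand side is a
            # terminal or an already known generating non-terminal.
            if lhs not in gen and all(s in T or s in gen for s in rhs):
                gen.add(lhs)
                changed = True
    return gen


def is_empty(T, P, S):
    """L(G) is empty exactly when the start symbol is not generating."""
    return S not in generating(T, P)


# Hypothetical grammars: in the first one B never terminates, so L is empty.
P1 = [("S", ("A", "B")), ("A", ("a",)), ("B", ("B", "B"))]
P2 = [("S", ("A", "B")), ("A", ("a",)), ("B", ("b",))]
print(is_empty({"a"}, P1, "S"))        # True
print(is_empty({"a", "b"}, P2, "S"))   # False
```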

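To see whether the given CFL L is infinite, we argue as follows. By the pumping lemma for CFL, if L contains a word t with |t| ≥ k for the constant k (the pumping length), then t = uυxyz with |υy| > 0 and uυⁱxyⁱz ∊ L for all i ≥ 0, so L is infinite.

Conversely, if L is infinite, it contains arbitrarily long words and hence a word of length at least k; otherwise L is finite. Hence, we have to test whether L contains a word of length at least k. It suffices to examine the finitely many lengths l with k ≤ l < 2k: a shortest word of length at least k cannot have length 2k or more, since pumping down (i = 0) removes at most k symbols and would give a shorter word that still has length at least k.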
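A minimal sketch of the finiteness test for a grammar in CNF follows. It relies on a standard refinement that is an assumption of this example: with k = 2ⁿ, L is infinite exactly when it contains a word whose length l satisfies k ≤ l < 2k. The sketch computes, for each non-terminal, the set of word lengths it can derive up to 2k − 1. The names `derivable_lengths` and `is_infinite` and the sample grammars are hypothetical.

```python
def derivable_lengths(N, P, bound):
    """For a CNF grammar, map each non-terminal to the set of lengths <= bound
    of terminal strings it derives. Productions are (A, (a,)) or (A, (B, C))."""
    lens = {A: set() for A in N}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in P:
            if len(rhs) == 1:                  # A -> a
                new = {1}
            else:                              # A -> B C
                B, C = rhs
                new = {p + q for p in lens[B] for q in lens[C]
                       if p + q <= bound}
            if not new <= lens[lhs]:
                lens[lhs] |= new
                changed = True
    return lens


def is_infinite(N, P, S):
    """Test for a word of length l with k <= l < 2k, where k = 2^|N|."""
    k = 2 ** len(N)
    lens = derivable_lengths(N, P, 2 * k - 1)
    return any(k <= l < 2 * k for l in lens[S])


finite_P = [("S", ("A", "B")), ("A", ("a",)), ("B", ("b",))]
infinite_P = [("S", ("S", "S")), ("S", ("a",))]
print(is_infinite({"S", "A", "B"}, finite_P, "S"))   # False: L = {ab}
print(is_infinite({"S"}, infinite_P, "S"))           # True:  L = a+
```

Because k is exponential in the number of non-terminals, this is only practical for small grammars; it is meant to show that the test is effective, not that it is efficient.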

8.4. SubFamilies of CFL

In this section, we consider the special cases of CFLs.

Definition 8.1

A CFG G = (N, T, P, S) is said to be linear if all rules are of the form A → xBy or A → x, where x, y ∊ T* and A, B ∊ N; i.e., the right-hand side contains at most one non-terminal.
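The definition translates directly into a check on the productions. In this small sketch, the representation of productions as pairs (left-hand side, tuple of right-hand-side symbols) and the name `is_linear` are assumptions made for illustration.

```python
def is_linear(N, P):
    """A grammar is linear if every right-hand side has at most one non-terminal."""
    return all(sum(1 for s in rhs if s in N) <= 1 for _, rhs in P)


# S -> aSb | ab is linear; S -> SS | a is not.
print(is_linear({"S"}, [("S", ("a", "S", "b")), ("S", ("a", "b"))]))  # True
print(is_linear({"S"}, [("S", ("S", "S")), ("S", ("a",))]))           # False
```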

8.5. Parikh Mapping and Parikh’s Theorem

We present in this section a result which connects CFLs to semi-linear sets.

8.6. Self-embedding Property

In this section, we consider the self-embedding property, which makes CFLs more powerful than regular sets. The pumping lemma for CFL makes use of this property: it makes it possible to pump equally on both sides of a substring, which is not possible in regular sets.

Definition 8.17

Let G = (N, T, P, S) be a CFG. A non-terminal A ∊ N is said to be self-embedding if

A ⇒* xAy

where x, y ∊ (N ∪ T)⁺. A grammar G is self-embedding if it has a self-embedding non-terminal.

8.7. Homomorphic Characterization

Earlier, we saw that the family of CFL is a full AFL. For an AFL F, if there exists a language L₀ ∊ F such that any language L in F can be obtained from L₀ by means of some of these six operations, then L₀ is called a generator for the AFL F. Any regular set is a generator for the family of regular sets. Let R be any regular set. Any other regular set R′ can be obtained as (Σ* ∪ R) ∩ R′. Next, we show that the Dyck set is a generator for the family of CFL.

Definition 8.18

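Consider the CFG

G = ({S}, T, P, S), where T = {a₁, ..., a_n, a₁′, ..., a_n′} and P = {S → SS, S → ε} ∪ {S → a_iSa_i′ | 1 ≤ i ≤ n}, n ≥ 1.

The language L(G) is called the Dyck set over T and is usually denoted by D_n.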
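Under the usual reading of D_n as the set of well-balanced strings over n pairs of matching brackets (including the empty string, which is an assumption of this sketch), membership can be tested with a stack. The name `in_dyck` and the concrete bracket symbols are hypothetical choices for the example.

```python
def in_dyck(word, pairs):
    """pairs maps each opening symbol a_i to its matching closing symbol a_i'."""
    closers = set(pairs.values())
    stack = []                      # closing symbols still expected, innermost last
    for sym in word:
        if sym in pairs:
            stack.append(pairs[sym])
        elif sym in closers:
            if not stack or stack.pop() != sym:
                return False        # unmatched or wrongly nested closing symbol
        else:
            return False            # symbol outside the alphabet T
    return not stack                # every opening symbol must be closed


pairs = {"(": ")", "[": "]"}        # D_2 with a1 = (, a1' = ), a2 = [, a2' = ]
print(in_dyck("([])()", pairs))     # True
print(in_dyck("([)]", pairs))       # False
```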

Problems and Solutions

Prove that the following languages are not context-free.

L₁ = {a^p|p is a prime}.
L₂ = {a, b}* − {aⁿb^{n²} | n ≥ 0}

Solution.

L₁ = {a^p|p is a prime}.

Suppose L₁ is context-free.

Then, by the pumping lemma there exists k such that every w ∊ L₁ with |w| ≥ k can be written in the form w = uυxyz with |υy| ≥ 1 such that uυⁱxyⁱz ∊ L₁ for all i ≥ 0. Consider some prime p > k, so that a^p ∊ L₁ and

a^p = uυxyz

Now u, υ, x, y, z ∊ a*. Therefore, by the pumping lemma:

uxz(υy)ⁱ ∊ L₁ for all i ≥ 0

Let |υy| = r, where r ≥ 1. Then

uxz(a^r)ⁱ ∊ L₁ for all i ≥ 0

or, since uxz(a^r) = a^p,

a^p(a^r)^{i − 1} ∊ L₁ for all i ≥ 1

a^{p + r(i − 1)} ∊ L₁ for all i ≥ 1

Choose i such that p + r(i − 1) is not a prime. Select i − 1 = p. Therefore, i = p + 1.

a^{p + rp} ∊ L₁.

But, a^{p + rp} = a^{p(r + 1)}.

p(r + 1) is not a prime, since p ≥ 2 and r + 1 ≥ 2. So, we come to the conclusion that a^s, where s is not a prime, belongs to L₁. This is a contradiction.

Therefore, L₁ is not context-free.
L₂ = {a, b}* − {aⁿb^{n²} | n ≥ 0}.

Suppose L₂ is context-free.

Since the family of context-free languages is closed under intersection with regular sets, L₂ ∩ a*b* is context-free.

This intersection is L₃ = {aⁿb^m | m ≠ n²}.

We shall show that this is not context-free. If L₃ is context-free, then by the pumping lemma there is a constant k which satisfies the conditions of the pumping lemma. Choose w = aⁿb^m where n > k, m ≠ n².

w = uυxyz where |υxy| ≤ k, |υy| ≥ 1 such that uυⁱxyⁱz ∊ L₃ for all i ≥ 0. If υ or y consists of both a and b, then by pumping we get a string which is not of the form aⁱb^j. If υ ∊ a*, y ∊ b*, then we have to show by pumping that we can get a string not in L₃. Let υ = a^p and y = b^q. Then a^{n − p}b^{m − q} ∊ L₃ (i = 0).

Choose m = (n − p)² + q, i.e., we started with aⁿb^{(n − p)² + q}. Then a^{n − p}b^{(n − p)² + q − q} = a^{n − p}b^{(n − p)²} ∊ L₃.

This is a contradiction, since in this string the number of b's is the square of the number of a's.

L₃ is not context-free and hence L₂ is not context-free.

Which of the following sets are context-free and which are not? Justify your answer.

a. L₁ = {aⁿb^mc^k|n, m, k ≥ 1 and 2n = 3k, or 5n = 7m}.

Solution.

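L₁ is context-free. The productions for A handle the case 2n = 3k, and those for B and D handle the case 5n = 7m:

S → A | BD
A → a³Ac² | a³Cc²
C → bC | b
B → a⁷Bb⁵ | a⁷b⁵
D → cD | c

This CFG generates L₁.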

L₂ = {aⁱb^jc^kd^l | i, j, k, l ≥ 1, i = l, j = k}.

Solution.

L₂ is CFL generated by:

S → aSd
S → aAd
A → bAc
A → bc

L₃ = {x ∊ {a, b, c}* | #_a(x) = #_b(x) = #_c(x)}.

Solution.

L₃ is not context-free.

Suppose L₃ is context-free.

Then, since the family of CFL is closed under intersection with regular sets, L₃ ∩ a*b*c* is context-free.

This is {aⁿbⁿcⁿ | n ≥ 0}.

We have shown that this is not context-free, which is a contradiction.

Hence, L₃ is not context-free.

L₄ = {a^mbⁿ|n, m ≥ 0, 5m − 3n = 24}.

Solution.

It can be seen that:

a⁶b² ∊ L₄
a⁹b⁷ ∊ L₄
a¹²b¹² ∊ L₄
a¹⁵b¹⁷ ∊ L₄

L₄ is context-free: it is generated by S → a³Sb⁵, S → a⁶b².
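As a sanity check on this grammar, the following sketch compares the exponent pairs (m, n) produced by the two productions with a brute-force enumeration of the solutions of 5m − 3n = 24 up to a bound. The function names and the bound 60 are assumptions made for this example, and agreement up to a bound is evidence, not a proof.

```python
def solutions(bound):
    """All pairs (m, n) with 0 <= m, n <= bound and 5m - 3n = 24."""
    return {(m, n) for m in range(bound + 1) for n in range(bound + 1)
            if 5 * m - 3 * n == 24}


def generated(bound):
    """Exponent pairs of the words a^m b^n derived from S -> a^3 S b^5 | a^6 b^2."""
    pairs = set()
    m, n = 6, 2                       # base production S -> a^6 b^2
    while m <= bound and n <= bound:
        pairs.add((m, n))
        m, n = m + 3, n + 5           # one more use of S -> a^3 S b^5
    return pairs


print(sorted(generated(20)))          # [(6, 2), (9, 7), (12, 12), (15, 17)]
print(solutions(60) == generated(60))
```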

L₅ = {a^mbⁿ |n ≠ m}.

Solution.

Worked out earlier in Chapter 2.

Let NL₂ be the set of non-context-free languages. Determine whether or not:

a. NL₂ is closed under union.

Solution.

No.

L₁ = {a^p | p is a prime}.
L₂ = {a^p | p is not a prime}.
L₁ ∪ L₂ = {aⁿ | n ≥ 1} is CFL whereas L₁ and L₂ are not.

NL₂ is closed under complementation.

Solution.

No.

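L = {x | x ∊ {a, b}* and x is not of the form ww} is CFL.
Its complement L̄ = {ww | w ∊ {a, b}*} is not a CFL. So L̄ is in NL₂, but the complement of L̄, namely L, is not.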

NL₂ is closed under intersection.

Solution.

No.

L₁ = {a^p | p is a prime}.
L₂ = {a^{2ⁿ} | n ≥ 0}.
L₁ and L₂ are not CFL. L₁ ∩ L₂ = {a²} is a singleton and hence a CFL.

NL₂ is closed under catenation.

Solution.

No.

L₁	=	{a^p | p is a prime}.
L₂	=	{a^p | p is not a prime}.
L₁	=	{a², a³, a⁵, a⁷, a¹¹, ...}.
L₂	=	{a, a⁴, a⁶, a⁸, a⁹, a¹⁰, a¹², ...}.
L₁L₂	=	{a³, a⁴, a⁶, a⁷, a⁸, a⁹, a¹⁰, a¹¹, ...}
	=	a* − {ε, a, a², a⁵}, which is regular and hence CFL.
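The set of lengths occurring in L₁L₂ can be checked by brute force. This sketch lists every length up to a bound that is not the sum of a prime and a positive non-prime; the helper names and the bound 200 are assumptions made for this example.

```python
def is_prime(n):
    """Trial-division primality test, adequate for small n."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))


def catenation_lengths(limit):
    """Lengths p + q <= limit with p prime and q >= 1 not prime."""
    primes = [p for p in range(1, limit + 1) if is_prime(p)]
    nonprimes = [q for q in range(1, limit + 1) if not is_prime(q)]
    return {p + q for p in primes for q in nonprimes if p + q <= limit}


limit = 200
missing = sorted(set(range(limit + 1)) - catenation_lengths(limit))
print(missing)   # [0, 1, 2, 5]
```

The reason behind the pattern is that for n ≥ 6 the numbers n − 2 and n − 3 cannot both be prime, so n is either 2 or 3 plus a non-prime.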

NL₂ is closed under Kleene closure.

Solution.

No.

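For example, L₁ = {a^p | p is a prime} is not a CFL, but L₁* = a* − {a} is CFL, since every n ≥ 2 is a sum of 2's and 3's.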

Is the language {x^myⁿ|m, n ∊ N, m ≤ n ≤ 2m} context-free? Justify your answer.

Solution.

Yes.

S → xSy
S → xSyy
S → ε

Is the union of a collection of context-free languages always context-free? Justify your answer.

Solution.

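A finite union of CFLs is a CFL.

Let L₁, L₂, L₃, ..., L_k be CFLs.

Let L_i be generated by G_i = (N_i, T, P_i, S_i) and let N_i ∩ N_j = φ for i ≠ j.

Then G = (N₁ ∪ ... ∪ N_k ∪ {S}, T, P₁ ∪ ... ∪ P_k ∪ {S → S_i | 1 ≤ i ≤ k}, S), where S is a new non-terminal, generates L₁ ∪ L₂ ∪ ... ∪ L_k.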

An infinite union of CFLs need not be a CFL.

L_i = {aⁱ}, where i is a particular prime.

Each L_i is a CFL.

But the union of the L_i over all primes i, which is {a^p | p is a prime}, is not a CFL.

Consider the following context-free grammar:

S → AA | AS | b
A → SA | AS | a

For the strings abaab, bab, and bbb:

Construct the CYK parsing table.

Are these strings in L(G)?

Solution.
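The table construction can be sketched in Python as follows. The encoding of the grammar (a dictionary `unit` for rules A → a and a dictionary `binary` for rules A → BC) and the function names are assumptions made for this example; entry `table[i][l]` holds the non-terminals that derive the substring of length l starting at position i.

```python
def cyk_table(word, unit, binary):
    """Fill the CYK table for a grammar in CNF.
    unit:   terminal -> set of A with A -> terminal
    binary: (B, C)   -> set of A with A -> B C"""
    n = len(word)
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][1] = set(unit.get(ch, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                for B in table[i][split]:
                    for C in table[i + split][length - split]:
                        table[i][length] |= binary.get((B, C), set())
    return table


def accepts(word, unit, binary, start="S"):
    """A non-empty word is in L(G) if the start symbol covers the whole word."""
    return bool(word) and start in cyk_table(word, unit, binary)[0][len(word)]


# S -> AA | AS | b,  A -> SA | AS | a
unit = {"a": {"A"}, "b": {"S"}}
binary = {("A", "A"): {"S"}, ("A", "S"): {"S", "A"}, ("S", "A"): {"A"}}

for w in ["abaab", "bab", "bbb"]:
    print(w, accepts(w, unit, binary))
```

With this encoding, abaab and bab are accepted and bbb is rejected: for bbb every entry of length 2 is empty, because the grammar has no rule with right-hand side SS.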

Exercises

Consider L = {y ∊ {0, 1}*| |y|₀ = |y|₁}. Prove or disprove that L is context-free.

Let G be the following grammar:

S → CF | DE | AB
D → BA | 0
E → SD
A → AA | 1
B → BB | 0
C → FS
F → AB | 1

Use CYK algorithm to determine which of the following strings are in L(G).

10110, 0111010, 0110110.

For each of the above strings present the final table, state if the string is in L(G) and if it is, then give the derivation.

Let G be defined S → AS|b, A → SA|a. Construct CYK parsing tables for:

bbaab
ababab
aabba

Are these strings in L(G)?

S → SS|AA|b, A → AS|AA|a

Give CYK parsing tables for:

aabb, and bbaba.

Are these strings in L(G)?

For the CFG

S → AB | BC
A → BA | a
B → CC | b
C → AB | a

Construct CYK parsing tables for:

ababb
bbbaaa

Let DCFL be the collection of languages accepted by deterministic PDAs. Give examples to show that even if L₁ and L₂ are in DCFL:

L₁ · L₂ need not be in DCFL
L₁ − L₂ need not be in DCFL

Given a DPDA M show that it is decidable whether L(M) is a regular set.

Let L be in DCFL and R a regular set. Show that it is decidable whether R is contained in L.

Using a cardinality argument show that there must be languages that are not context-free.

10.

Let LIN be the family of linear languages. The pumping lemma for linear languages can be stated as follows:

Let L ⊆ Σ* be in LIN. Then, there exists a constant p > 0 such that every word z in L with |z| ≥ p can be expressed as z = uxwyυ for some u, υ, x, y, w ∊ Σ* such that:

|uxyυ| < p
|xy| ≥ 1
for all i ≥ 0, uxⁱwyⁱυ ∊ L

Prove the LIN language pumping lemma.
Using the LIN language pumping lemma, prove that the following languages are not linear.

{aⁱbⁱc^jd^j|i, j ≥ 1}
{x|x is in {a, b}* and #_a(x) = #_b(x)}
{aⁱbⁱcⁱ|i ≥ 1}

11.

Prove or disprove the following claims:

Family of deterministic context-free languages (DCF) is closed under complementation.
Family of DCF is closed under union.
Family of DCF is closed under regular intersection.
Family of DCF is closed under reversal.
Family of LIN is closed under union.
Family of LIN is closed under intersection.
Family of LIN is closed under reversal.

12.

A language L consists of all words w over the alphabet {a, b, c, d} which satisfy each of the following conditions:

#_a(w) + #_b(w) = 2(#_c(w) + #_d(w)).
aaa is a subword of w but abc is not a subword of w.
The third letter of w is not c.

Prove that L is context-free.

13.

Compare the family of minimal linear languages with the family of regular languages. Characterize the languages belonging to the intersection of these two families.

14.

Prove that there is a linear language which is not generated by any deterministic linear grammar.

15.

A Parikh mapping ψ depends on the enumeration of the basic alphabet; another enumeration gives a different mapping ψ′. Prove that if ψ (L) is semi-linear for some ψ, then ψ′(L) is semi-linear for any ψ′.

16.

Consider languages over a fixed alphabet T with at least two letters. Prove that, for any natural number n, there is a CFL L_n which is not generated by any type-2 grammar containing fewer than n non-terminals.

17.

Consider the grammar G determined by the productions:

X₀ → adX₁da | aX₀a | aca,
X₁ → bX₁b | bdX₀db.

Prove that L(G) is not sequential. This shows that not all linear languages are sequential. Conversely, give an example of a sequential language which is not metalinear.

18.

Let G be a CFG with the production S → AB, A → a, B → AB|b. Run the CYK algorithm for the string aab.

19.

Modify the CYK algorithm to count the number of parse trees of a given string and to construct one if the number is non-zero.
Test your algorithm of part (i) above on the following grammar:

S → ST |a

T → BS

B → +

and string a + a + a.

20.

Use closure under union to show that the following languages are CFL.

{a^mbⁿ | m ≠ n}
{a, b}* − {aⁿbⁿ|n ≥ 0}
{w ∊ {a, b}*|w = w^R}
{a^mbⁿc^pd^q|n = q or m ≤ p or m + n = p + q}

21.

Prove the following stronger version of the pumping lemma.

Let G be a CFG. Then, there are numbers K and k such that any string w ∊ L(G) with |w| ≥ K can be re-written as w = uυxyz with |υxy| ≤ k in such a way that either υ or y is non-empty and uυⁿxyⁿz ∊ L(G) for every n ≥ 0.

22.

Show that the class of DCFL is not closed under homomorphism.
