Study Material
Semester-05
TOC
Unit-03

Unit 3: Context-Free Grammar and Language

Introduction to Context-Free Grammar and Language

Context-Free Grammar (CFG) plays a central role in the theory of computation, forming the basis for languages that require a more complex structure than regular languages. CFG is extensively used in programming languages, natural language processing, and compilers. This unit covers various essential topics such as grammar simplification, derivation trees, Chomsky and Greibach normal forms, closure properties, and the Pumping Lemma for context-free languages (CFL).

Grammar: Introduction and Representation

A grammar is a formal mechanism to describe the syntax of a language. It consists of a set of production rules that define how strings in a language can be generated. A grammar is usually denoted by the 4-tuple:

G=(V,Σ,R,S)G = (V, \Sigma, R, S)

Where:

  • VV is a finite set of non-terminal symbols (variables).
  • Σ\Sigma is a finite set of terminal symbols (alphabet of the language).
  • RR is a finite set of production rules, where each rule has the form AαA \rightarrow \alpha (where AVA \in V and α(VΣ)\alpha \in (V \cup \Sigma)^*).
  • SS is the start symbol, a non-terminal that generates strings in the language.

Chomsky Hierarchy

The Chomsky Hierarchy classifies languages based on their generative power:

  1. Type 0 (Recursively Enumerable Languages): Generated by unrestricted grammars, where no constraints are imposed on the production rules.
  2. Type 1 (Context-Sensitive Languages): Generated by context-sensitive grammars, where the left-hand side of production rules can have more than one symbol, and the length of the right-hand side is at least as long as the left-hand side.
  3. Type 2 (Context-Free Languages): Generated by context-free grammars, where each production rule has a single non-terminal on the left-hand side.
  4. Type 3 (Regular Languages): Generated by regular grammars, where production rules are of the form AaBA \rightarrow aB or AaA \rightarrow a (left or right linear).

Formal Definition of Regular Grammar (RG)

A Regular Grammar (RG) is a Type 3 grammar where the production rules are restricted to either left-linear or right-linear forms. The formal definition of an RG is:

  • Right-linear grammar (RLG): All production rules are of the form AaBA \rightarrow aB or AaA \rightarrow a, where A,BA, B are non-terminals, and aa is a terminal symbol.
  • Left-linear grammar (LLG): All production rules are of the form ABaA \rightarrow Ba or AaA \rightarrow a.

Conversions Between RG and FA

Regular grammars and finite automata (FA) are equivalent in the languages they generate. The conversion between them follows these steps:

LRG to RLG Conversion

Converting a left-linear grammar to a right-linear grammar (or vice versa) requires reversing the rules and adjusting the ordering of terminals and non-terminals in the production rules.

RG to FA Conversion

To convert a regular grammar into a finite automaton (FA):

  1. Create a state for each non-terminal.
  2. Define transitions between states based on the production rules.
  3. The start state corresponds to the start symbol, and accepting states are determined by productions that generate a terminal string.

FA to RG Conversion

The reverse process involves creating a non-terminal for each state of the FA, and the transition function defines the production rules.

Context-Free Grammar (CFG)

A Context-Free Grammar (CFG) is a grammar where every production rule has a single non-terminal on the left-hand side. CFGs can generate context-free languages (CFL), which include many programming languages and are widely used in compiler design.

Formal Definition of CFG

A CFG is defined as a 4-tuple:

G=(V,Σ,R,S)G = (V, \Sigma, R, S)

Where VV, Σ\Sigma, RR, and SS are defined similarly to regular grammars, with the restriction that every production rule must have a single non-terminal on the left-hand side.

Derivation Tree

A derivation tree (or parse tree) visually represents the process of generating a string from the start symbol by applying production rules. The root is the start symbol, internal nodes are non-terminals, and leaves are terminal symbols or empty strings.

Leftmost and Rightmost Derivations

  • Leftmost Derivation: In each step, the leftmost non-terminal is replaced by one of its production rules.
  • Rightmost Derivation: In each step, the rightmost non-terminal is replaced.

Sentential Forms

A sentential form is any intermediate string derived during the derivation process that contains both terminal and non-terminal symbols.

Ambiguous and Unambiguous Grammars

A grammar is ambiguous if there exists a string that has more than one distinct derivation tree or parse tree. Otherwise, it is unambiguous.

Example of Ambiguous Grammar

Consider the grammar:

SS+S  SS  aS \rightarrow S + S \ | \ S * S \ | \ a

For the string a + a * a, both of the following parse trees are valid:

  • First interpretation: (a + (a * a))
  • Second interpretation: ((a + a) * a)

Grammar Simplification

Grammar simplification involves reducing the complexity of a CFG without changing the language it generates. The simplification process includes:

  1. Elimination of useless symbols: Remove non-terminals that do not appear in any derivation of terminal strings.
  2. Elimination of epsilon (ε) productions: Remove production rules of the form AεA \rightarrow ε, except when AA is the start symbol.
  3. Elimination of unit productions: Remove productions where a non-terminal produces another non-terminal, i.e., ABA \rightarrow B.

Normal Forms

A CFG can be transformed into two standard forms: Chomsky Normal Form (CNF) and Greibach Normal Form (GNF).

Chomsky Normal Form (CNF)

A CFG is in Chomsky Normal Form if all of its production rules are of the form:

ABCorAaA \rightarrow BC \quad \text{or} \quad A \rightarrow a

Where AA, BB, and CC are non-terminals and aa is a terminal symbol. CNF is useful in parsing algorithms like the CYK algorithm.

Conversion to CNF

To convert a CFG into CNF:

  1. Eliminate epsilon-productions, unit productions, and useless symbols.
  2. Convert all remaining rules into the desired forms by introducing new non-terminals where necessary.

Greibach Normal Form (GNF)

A CFG is in Greibach Normal Form if every production rule is of the form:

AaαA \rightarrow a\alpha

Where AA is a non-terminal, aa is a terminal, and α\alpha is a string of non-terminals.

Conversion to GNF

The conversion to GNF involves eliminating left recursion and ensuring that every production begins with a terminal symbol followed by non-terminals.

Closure Properties of CFL

CFLs are closed under certain operations but not others. Specifically, CFLs are closed under:

  1. Union: If L1L_1 and L2L_2 are CFLs, then L1L2L_1 \cup L_2 is also a CFL.
  2. Concatenation: If L1L_1 and L2L_2 are CFLs, then L1L2L_1L_2 is a CFL.
  3. Kleene Star: If LL is a CFL, then LL^* is also a CFL.
  4. Intersection with a regular language: The intersection of a CFL with a regular language is a CFL.

However, CFLs are not closed under:

  1. Intersection: The intersection of two CFLs is not necessarily a CFL.
  2. Complement: The complement of a CFL is not necessarily a CFL.

Pumping Lemma for CFL

The Pumping Lemma for CFLs is a property used to prove that a language is not context-free. It states that for any CFL LL, there exists a constant pp (the pumping length) such that any string ss in LL with length sp|s| \geq p can be split into five parts:

s=uvwxys = uvwxy

Such that:

  1. vwxp|vwx| \leq p
  2. vxεvx \neq ε
  3. For all n0n \geq 0, the string uvnwxnyuv^nwx^ny is in LL

Application of Pumping Lemma

To prove that a language LL is not context-free, assume it is, and use the pumping lemma to derive a contradiction.

Example:

Consider the language L={anbncnn1}L = \{ a^n b^n c^n \mid n \geq 1 \}. Assume LL is a CFL and apply the pumping lemma to derive a contradiction, showing that it is impossible to pump the string while keeping the counts of aa, bb, and cc balanced.