Unit 3: Context-Free Grammar and Language
Introduction to Context-Free Grammar and Language
Context-Free Grammar (CFG) plays a central role in the theory of computation, forming the basis for languages that require a more complex structure than regular languages. CFG is extensively used in programming languages, natural language processing, and compilers. This unit covers various essential topics such as grammar simplification, derivation trees, Chomsky and Greibach normal forms, closure properties, and the Pumping Lemma for context-free languages (CFL).
Grammar: Introduction and Representation
A grammar is a formal mechanism to describe the syntax of a language. It consists of a set of production rules that define how strings in a language can be generated. A grammar is usually denoted by the 4-tuple:
Where:
- is a finite set of non-terminal symbols (variables).
- is a finite set of terminal symbols (alphabet of the language).
- is a finite set of production rules, where each rule has the form (where and ).
- is the start symbol, a non-terminal that generates strings in the language.
Chomsky Hierarchy
The Chomsky Hierarchy classifies languages based on their generative power:
- Type 0 (Recursively Enumerable Languages): Generated by unrestricted grammars, where no constraints are imposed on the production rules.
- Type 1 (Context-Sensitive Languages): Generated by context-sensitive grammars, where the left-hand side of production rules can have more than one symbol, and the length of the right-hand side is at least as long as the left-hand side.
- Type 2 (Context-Free Languages): Generated by context-free grammars, where each production rule has a single non-terminal on the left-hand side.
- Type 3 (Regular Languages): Generated by regular grammars, where production rules are of the form or (left or right linear).
Formal Definition of Regular Grammar (RG)
A Regular Grammar (RG) is a Type 3 grammar where the production rules are restricted to either left-linear or right-linear forms. The formal definition of an RG is:
- Right-linear grammar (RLG): All production rules are of the form or , where are non-terminals, and is a terminal symbol.
- Left-linear grammar (LLG): All production rules are of the form or .
Conversions Between RG and FA
Regular grammars and finite automata (FA) are equivalent in the languages they generate. The conversion between them follows these steps:
LRG to RLG Conversion
Converting a left-linear grammar to a right-linear grammar (or vice versa) requires reversing the rules and adjusting the ordering of terminals and non-terminals in the production rules.
RG to FA Conversion
To convert a regular grammar into a finite automaton (FA):
- Create a state for each non-terminal.
- Define transitions between states based on the production rules.
- The start state corresponds to the start symbol, and accepting states are determined by productions that generate a terminal string.
FA to RG Conversion
The reverse process involves creating a non-terminal for each state of the FA, and the transition function defines the production rules.
Context-Free Grammar (CFG)
A Context-Free Grammar (CFG) is a grammar where every production rule has a single non-terminal on the left-hand side. CFGs can generate context-free languages (CFL), which include many programming languages and are widely used in compiler design.
Formal Definition of CFG
A CFG is defined as a 4-tuple:
Where , , , and are defined similarly to regular grammars, with the restriction that every production rule must have a single non-terminal on the left-hand side.
Derivation Tree
A derivation tree (or parse tree) visually represents the process of generating a string from the start symbol by applying production rules. The root is the start symbol, internal nodes are non-terminals, and leaves are terminal symbols or empty strings.
Leftmost and Rightmost Derivations
- Leftmost Derivation: In each step, the leftmost non-terminal is replaced by one of its production rules.
- Rightmost Derivation: In each step, the rightmost non-terminal is replaced.
Sentential Forms
A sentential form is any intermediate string derived during the derivation process that contains both terminal and non-terminal symbols.
Ambiguous and Unambiguous Grammars
A grammar is ambiguous if there exists a string that has more than one distinct derivation tree or parse tree. Otherwise, it is unambiguous.
Example of Ambiguous Grammar
Consider the grammar:
For the string a + a * a
, both of the following parse trees are valid:
- First interpretation:
(a + (a * a))
- Second interpretation:
((a + a) * a)
Grammar Simplification
Grammar simplification involves reducing the complexity of a CFG without changing the language it generates. The simplification process includes:
- Elimination of useless symbols: Remove non-terminals that do not appear in any derivation of terminal strings.
- Elimination of epsilon (ε) productions: Remove production rules of the form , except when is the start symbol.
- Elimination of unit productions: Remove productions where a non-terminal produces another non-terminal, i.e., .
Normal Forms
A CFG can be transformed into two standard forms: Chomsky Normal Form (CNF) and Greibach Normal Form (GNF).
Chomsky Normal Form (CNF)
A CFG is in Chomsky Normal Form if all of its production rules are of the form:
Where , , and are non-terminals and is a terminal symbol. CNF is useful in parsing algorithms like the CYK algorithm.
Conversion to CNF
To convert a CFG into CNF:
- Eliminate epsilon-productions, unit productions, and useless symbols.
- Convert all remaining rules into the desired forms by introducing new non-terminals where necessary.
Greibach Normal Form (GNF)
A CFG is in Greibach Normal Form if every production rule is of the form:
Where is a non-terminal, is a terminal, and is a string of non-terminals.
Conversion to GNF
The conversion to GNF involves eliminating left recursion and ensuring that every production begins with a terminal symbol followed by non-terminals.
Closure Properties of CFL
CFLs are closed under certain operations but not others. Specifically, CFLs are closed under:
- Union: If and are CFLs, then is also a CFL.
- Concatenation: If and are CFLs, then is a CFL.
- Kleene Star: If is a CFL, then is also a CFL.
- Intersection with a regular language: The intersection of a CFL with a regular language is a CFL.
However, CFLs are not closed under:
- Intersection: The intersection of two CFLs is not necessarily a CFL.
- Complement: The complement of a CFL is not necessarily a CFL.
Pumping Lemma for CFL
The Pumping Lemma for CFLs is a property used to prove that a language is not context-free. It states that for any CFL , there exists a constant (the pumping length) such that any string in with length can be split into five parts:
Such that:
- For all , the string is in
Application of Pumping Lemma
To prove that a language is not context-free, assume it is, and use the pumping lemma to derive a contradiction.
Example:
Consider the language . Assume is a CFL and apply the pumping lemma to derive a contradiction, showing that it is impossible to pump the string while keeping the counts of , , and balanced.