virtualgate's picture

Lexical Analysis and Parsing 



A compiler is a program that reads a program written in one language (source language ) and converts into the equivalent program in a different language or the target language.

As an important part of this translation, compilers also reports users the presence of errors in programs.

Compilers are sometimes classified as single pass and multipass, load -and - go

depending on how they are constructed air what operation they have to perform.

The Analysis - synthesis mode of compilation:

The analysis part breaks up the source program into constituents pieces and creates an intermedeite representation of the source program.

The synthesis part constructs the desired target program form intermediate representation.


Contributor's Info

Created: Edited:
The Phases of a Complier

A compiler has the following phases:



let us assume the given grammar is :

S -> id

E -> E + T /T

T -> T*F / F

F -> ID

and statement is : x= a +b *c 

So task performed by each phase is like:

// ICG : Intermediate code generator

    CO:   Code optimization

    TCG: Tagert code generator


Contributor's Info

Introduction to Lexical Analyzer

The lexical analyzer is the first phase of a compiler. Its main tasks is to read the input character and produce a sequence of tokens as output.

later this sequence of tokens is used by Syntax Analyzer.

Role of lexical Analyzer:

  • Remove comments and white spaces  
  • Macros expansion. 
  • Read input characters from the source program.
  • Group them into lexemes.
  • Produce as output a sequence of tokens.
  • Interact with the symbol table.
  • Correlate error messages generated by the compiler with the source program.

It reads program line by line, so it displays errors present in program line by line. 

Some Definitions:

Token: a pair consisting of –

             Token name: the abstract symbol representing the lexical unit [affects                        parsing decision] –

             Optional attribute value [influences translations after parsing] •

Pattern: a description of the form that different lexemes take •

Lexeme: a sequence of characters in source program matching a pattern


Tokens Specification :

We need a formal way to specify patterns: regular expressions 

Alphabet: any finite set of symbols.

String over an alphabet: finite sequence of symbols drawn from that alphabet • Language: the countable set of strings over some fixed alphabet.


Some Basic Definition:




Contributor's Info

Created: Edited:
Counting No. of tokens(1)
Content covered: 
Here we will find how to count the number of tokens in a statement.
More Less
Counting No. of tokens(2)
Content covered: 
Second video example for token counting.
More Less
Errors and their recovery in Lexical Analysis

Errors detected in lexical Analysis:

1. Numeric literals that are too long.

2. Long identifiers.

3. Ill-formed numeric literals ex: int X= $ 12345.

4. Input character that is not part of the source of language.

Recovery techniques:

1. Panic mode recovery: Delete the unknown character.example: "chhar" is corrected as "char".

2. Transpose: bases on certain rules, we transpose two characters. example: "while "if written as" wheil". here e and i are transposed.

3. Replace: it is based on replacing one character by another character .examp;e: "chhr "can  be corrected as "char "by replacing second 'h' with 'a'

4. Insert: An extra or missing character is inserted to form a meaningful word. example: "cha" is corrected as "char "by inserting 'a'.

Contributor's Info

Symbol Table

Symbol Table:

A compiler uses a symbol table to keep track of scope and binding information about names. The symbol table is searched every time any name is encountered in the source text. Changes to the table occur if any new name or new information about an existing name is discovered.

Definition: Symbol Table is an important data structure created and maintained by the compiler in order to keep track of semantics of variable i.e. it stores information about the scope and binding information about names, information about instances of various entities such as variable and function names, classes, objects, etc.

  • The information is going to be collected by analysis phase and used by syntheses phase.
  • A symbol table mechanism must allow us to add new entries and find the existing entries efficiently.
  • It is useful for a compiler to be able to grow the symbol table dynamically, if necessary at compile time. If the size of the symbol table is fixed when the compiler is written, then the size chosen must be large enough to handle any source program that might be presented.

What should be in a symbol table?

  •  variable names
  •  procedure and function names
  • literal constants and strings
  •  source text labels

What information might compiler need?

  • textual name
  • data type
  • dimension information (for aggregates )
  • declaring procedure
  •  lexical level of declaration
  • storage class (base address )
  • offset in storage
  • if the record, a pointer to structure table
  • if parameter, by-reference or by-value?
  • can it be aliased? to what other names?
  • number and type of arguments to functions.

  Various Implementation techniques of symbol table:-

1. Unordered list

  •  Use an array or linked list.
  •  O(n) time search.
  •  O(1) time insertion.
  •  Simple & compact, but too slow in practice.

2. Ordered list

  •   O(log n) time with a binary search. 
  •   O(n) time insertion.
  •   Simple, good if the tables are known in advance. 

3. Binary search tree

  •   O(log n) time expected to search.
  •   O(n) worst-case.
  •   O(log n) time expected insertion.
  •   a less simple approach.

4.Hash table

  •   O(1) time expected for search.
  •   O(1) time expected for insertion.
  •   worst case is very unlikely. 
  •   subtle design issues, more programmer effort.
  •   this approach is taken in most compilers.


Contributor's Info

Elimination of left recursion and left factoring the grammars(1)

1. Left-Recursive Grammar:-

Any production which is in the form of

A --> A α / B  is called Left-Recursive.

Many parsers fail when they see left-recursive grammars.

So convert  left Recursive grammar to right recursive do the following steps:

A --> A α / B

converted as

A --> B A'

A' --> α  A' / ε​


There is a formal technique for eliminating left-recursion from productions

Step One: Direct-Recursion

For each rule which contains a left-recursive option,

A --> A  α| B

introduce a new nonterminal A' and rewrite the rule as

A -->  B A' 
A' -->  α  A' / ε​

Thus the production:

E --> E + T | T

is left-recursive with "E" playing the role of "A"," + T" playing the role of α   , and "T" playing the role of  A'. Introducing the new nonterminal E', the production can be replaced by:

E --> T E' 
E' -->  ε​ | + T E'

Of course, there may be more than one left-recursive part on the right-hand side. The general rule is to replace:

A --> A α1 | A α2   |.....A αn  |B1   |B 2  ...... | B n


A -->  B1A' |  B2A'| .......|  BnA'
A' -->  ε​ |  α1A' | α2  A' | .......|  αnA'



Contributor's Info

Elimination of left recursion and left factoring the grammars(2)
Content covered: 

We will cover here, how to remove Left Recursion and Left factoring.

More Less
Derivation Tree(1)

In a derivation tree, the root is the start variable, all internal nodes are labeled with variables, while all leaves are labeled with terminals. The children of an internal node are labeled from left to right with the right-hand side of the production used.

Example: S → 0S1S | 1S0S | ε

The derivation for 011100  for leftmost derivation is : S =⇒ 0S1S =⇒ 01S =⇒ 011S0S =⇒ 0111S0S0S =⇒ 01110S0S =⇒ 011100S =⇒ 011100



Contributor's Info

Derivation Tree(2)
Content covered: 

Derivation Tree (Left & Right Derivation Trees) video example.

More Less
Ambiguous grammar(1)

Ambiguous grammar:

=> If for a given grammer, for a given string if we get more than one parse tree, more than one Left Most Derivate or more than one Right most derivate, then the grammar is called as Ambiguous grammar.

=> Most of the parser does not allow Ambiguity. 

=> There is no algorithm to check ambiguity So you can say that Ambiguity problem is undecidable.

=> Ambiguity is a Property of Grammars, not Languages.


A → A + A | A − A | A * A | a

is ambiguous since there are two leftmost derivations for the string a + a + a:

    A→ A + A                         A→ A + A

      → a + A                         → A + A + A (First A is replaced by A+A. Replacement of the second A would yield a similar derivation)

    → a + A + A                  → a + A + A

    → a + a + A                   → a + a + A

    → a + a + a                   → a + a + a


Contributor's Info

Ambiguous Grammar(2)
Content covered: 

Derivation of String

a + a *b

More Less
Introduction to parsers and LL(1) parsing
Content covered: 

Introduction to parsers and LL(1) parsing

More Less
First and Follow
Content covered: 
We will learn how to find First and Follow .
More Less
First and Follow Examples(1)

The construction of a predictive parser is aided by two functions associated with a grammar G. These functions, FIRST and FOLLOW, allow us to fill in the entries of a predictive parsing table for G, whenever possible. Sets of tokens yielded by the FOLLOW function can also be used as synchronizing tokens during panic-mode error recovery.

1. FIRST(a)

First(a) = set of terminals that start a string of terminals derived from a.

Apply following rules until no terminal or ε can be added

1. If t  ∈ T, then First( t ) = { t }.  

2. If X ∈ N and X → ε exists (nullable), then add ε to First( X ).

3. If X ∈ N and X →  Y1Y2Y3… Ym, where Y1, Y2, Y3, ... Ym are non‐terminals, then:     for each, I from 1 to m     if Y1 … Yi‐1 are all nullable (or if i = 1)      First( X ) = First( X ) ∪ First( Yi )

2. FOLLOW (a) 

Apply the following rules until no terminal or e can be added

1. $  ∈ Follow( S ), where S is the start symbol.

2. Look at the occurrence of a non‐terminal on the right-hand side of a production which is followed by something If A → a B b, then First( b ) ‐ {e} ⊆ Follow( B )

3. Look at N on the RHS that is not followed by anything, if (A → a B) or (A → a B b and ε  First( b )), then Follow( A ) ⊆  Follow( B )








Contributor's Info

LL(1) Parser and LL(1) Parsing Table Notes

LL(1) Parsers:

A top-down parser that uses a one-token lookahead is called an LL(1) parser.

  • The first L indicates that the input is read from left to right.
  • The second L says that it produces a left-to-right derivation.
  • And the 1 says that it uses one lookahead token. (Some parsers look ahead at the next 2 tokens or even more than that.)

Notes on LL(1) Parsing Tables:

If any entry is multiply defined then G is not LL(1). Possible reasons:

  •  If G is ambiguous
  •  If G is left recursive
  •  If G is not left-factored
  •  Grammar is not LL(1)
  •  Most programming language grammars are not LL(1)

=> Some can be made LL(1) though but others can’t There are tools that build LL(1) tables  E.g. LLGen

Contributor's Info

Created: Edited:
LL(1) Parsier and LL(1) Parsing Table Video Example
Content covered: 

Here we will learn how to make LL(1) Parsing Table and how to find whether the given grammar is LL(1) or not? 

More Less
Recursive Descent Parser

A recursive descent parser is a type of parsing tool that works on a recursive basis, in other words, on the basis of using one instance of a command or event to generate another. Recursive descent parsers can be used to parse different types of code, such as XML, or other inputs. They are a specific type of parsing technology that can involve nested or built-in subsequent operations.

=> Key Points:

  • Recursive descent parser is a top-down parser.
  •  The parsing program consists of a set of procedures, one for each non-terminal.
  •  The process begins with the procedure for start symbol.
  • The start symbol is placed at the root node and on encountering each non- terminal, the procedure concerned is called to expand the non-terminal with its corresponding production.
  •  The procedure is called recursively until all non-terminals are expanded.
  •  Successful completion occurs when the scan over the entire input string is done. ie., all terminals in the sentence are derived by parse tree.
  • General recursive-descent may require backtracking; that is, it may require
    repeated scans over the input. However, backtracking is rarely needed to parse programming language constructs, so backtracking parsers are not seen frequently.

Reference is Compiler tools technique by A.V.Aho.

Contributor's Info

Created: Edited:
Bottom Up Parsing Introduction

Bottom-up: Start from the leaves and work up to the root We have to decide when rule A → β should be applied and we should find neighbour nodes labelled by symbols in β Reducing rule A → β results in adding a node A to the tree. A’s children are the nodes labelled by the symbols in β.


  • Constructs a parse tree for an input string beginning at the leaves (the bottom) and working up towards the root (the top).
  • Reduction of a string w to the start symbol of a grammar.
  • At each reduction step, a particular substring matching the right side of a production is replaced by the symbol on the left of that production. 
  • If the substring is chosen correctly at each step, a rightmost derivation is traced out in reverse


A handle of a string is a substring that matches the right side of a production and whose reduction to the nonterminal on the left side of the production represents one step along with the reverse of a rightmost derivation.


S → aABe     A → Abc | b      B → d

The sentence abbcde can be reduced as follows:

  • abbcde
  • aAbcde
  • aAde
  • aABe
  • S


Contributor's Info

Handles in Bottom- Up Parsing


A substring α of the tree’s upper frontier that matches some production A → α where reducing α to A is one step in the reverse of a rightmost derivation. We call such a string a handle.


a handle of a right-sentential form γ is a production A → β and a position in γ where β may be found and replaced by A to produce the previous right-sentential form in a rightmost derivation of γ i.e.,

if S ⇒∗ rm αAw ⇒rm αβw  then A → β  in the position following α is a handle of αβw
Because γ is a right-sentential form, the substring to the right of a handle contains only terminal symbols.

This all is the formal definition.


Informally you can understand handle like this:

for a given production, if a string is given to reduce, you start replacing the string with the right-hand side of production, the item that is removed or replaced is called handle. Reduce the given string until you get the start symbol.


1. S -> aABe

    A->  Abc /b

    B-> d

The string is abbcde find the handles.



aAbcde   // after replacing the second 'b'  with 'A' from left So b is one of handle

aAde        // after replacing 'Abc' with A So Abc is the second handle.

aABe        //  after replacing 'd' with 'B' So, 'd' is third handle.

S              // after replacing 'aABe' with 'S', So 'aABe' is fourth handle.



2. E -> E+n / E * n / n

find the handles for n+n*n



E+n*n    // replace first n with E so n is first handle.

E*n        // replace E+n with E so E+n  is the second handle.

E            // replace E*n with E so E*n is third handle.

So handles are n, E+n , E*n 

Contributor's Info

LR Parsers

They can be used to parse a large class of CFG and sometimes they are referred  as LR(k) parsing.

  • L: left-to-right scanning of the input.
  • R: Construction of the rightmost derivation in reverse.
  • k: Number of input symbols of lookahead.

 LR parsing is attractive for a variety of reasons:

  • LR parsers can be constructed to recognize virtually all programming language constructs for which CFG can be written.
  • The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive parsers.
  • The LR parsing method can be implemented efficiently.


  • Too much complicated to construct an LR parser by hand for a typical programming language grammar.
  • One needs a specialized tool - an LR parser generator (as Yacc)

An LR parser consists of:

  • An input
  •  An output
  •  A stack
  •  A driver program
  •  A parsing table that has two parts: action and goto



Contributor's Info

LR(0) and SLR(1) Parsers
Content covered: 

Here we will be covering LR(0) and SLR(1) parsing table and how they differ from each other.

key Points:


LR(0) parsing table

SLR(1) parsing table

Action and GOTO field.


More Less
LR(0) and SLR(1) Parsers (2)
Content covered: 

LR(0) and SLR(1) Parsing table and Shift Reduce moves.

More Less
LR(0) and SLR(1) Points
  • It is LR(0) because we don't care what is the next symbol.
  • Blank spaces always contain an error.
  • The only difference between LR(0) and SLR(1) is in the placement of reducing moves.
  • The action and goto entries are the same for both.
  • SLR(1) is more powerful than LR(0).

Is every grammar is LR(0)??

1. S-R conflict:-

let us say you at some random state Iwe have something like this :


So at state I we have two moves A-->d. and there is a shift move with terminal c, and since it is LR(0), then reduce move has to put throughout the row, so it will end up with two entries in a particular row at the same time. This is called an S-R conflict.

2. R-R conflict:-

let us say at state I11, we have two reduce move, again this is a violation of the rule that says that at a particular instance, we can have at most one entry for one field of a row.


  • If grammar is LR(0), it will definitely SLR(1).
  • If grammar is not LR(0), it may or may not be SLR(1).
  • You need not construct a full parsing table to detect conflicts, just look at the state that contains the final item.
  • Any left recursive grammar is not LL(1).


Contributor's Info

Some Related Terms asked in Gate Exam

Lexical Anaysis                -->      Finite Automata / DFA mimization

Code optimization        -->      DAG(subtree minimization)

Code generation            -->      Syntax tree(parse tree)

Abelian Groups              -->       Push Down Automata

Parsing                              -->       Production tree

Data Flow Analysis       -->       Code Optimization

Register Allocation       -->      Graph coloring / Code generation 

Regular Grammer         -->      Lexical Analysis

Backus Normal Form   -->     Context free Grammar

YAcc                                    -->      LALR(1)




Contributor's Info

CLR(1) and LALR(1) Parsers Example
Content covered: 

This is a video example covering CLR(1) or LR(1) and LALR(1) parsers.


LR(1) canonical item.

CLR(1) Parsing Table.

LALR(1) Parsing Table.


More Less
CLR(1) and LALR(1) Parsers Notes


  • They are bottom-up parsers, and they use the reverse of the rightmost derivation.
  • They use LR(1) canonical item.

LR(1) canonical items=  LR(0) itenms + look ahead.

Purpose of look ahead: Reduce moves are going to place only in lookahead symbol.

Difference between CLR(1) and LALR(1):-

  • No. of states in CLR(1) is greater than or equal to LALR(1) parser.
  • If grammar is CLR(1), it may or may not be LALR(1), because conflict may arise when you will merge the states for constructing LALR(1) table.
  • If grammar is not CLR(1), it will not be LALR(1), because already conflict happened in CLR(1) so the conflict will always present in LALR(1).
  • If there is no SR conflict in CLR(1), there will be no SR conflict in LALR(1).
  • If there is no RR conflict in CLR(1), we can't say about RR conflict in LALR(1) because there is a chance that when we merge the states, RR conflict may arise.

Some Key Points:

1. The no of states in various parser varies like this:

 CLR(1) >= ( LALR(1)=  SLR(1)  =  LR(0) )

2. The power of Parsers varies like this:

 CLR(1) > LALR(1) > SLR(1) > LR(0)

3. YAcc is a compiler that is used for LALR(1).

4. While constructing the parsing table for various parsers GOTO entry and Shift entries are identical, the only difference arises in reducing entries and error entries.



Contributor's Info


  • This quiz contains 10 questions on the topic Parsers
  • Lean well before you attempt the quiz
  • You can attempt the quiz unlimited number of times.

Difficulty Level:  intermediate

  • This quiz contains 10 questions on the topic Grammars and Lexical Analysis
  • Lean well before you attempt the quiz
  • You can attempt the quiz unlimited number of times.

Difficulty Level:  basic
Ambikesh Kumar Singh @ambikeshkumarsingh 24 Jun 2019 02:09 pm
A quiz will be live on 25 June 2019.