Errors and their recovery in Lexical Analysis

Errors detected in lexical analysis:

1. Numeric literals that are too long.

2. Identifiers that are too long.

3. Ill-formed numeric literals, e.g. int X = $12345;

4. Input characters that are not part of the source language's character set.

Recovery techniques:

1. Panic mode recovery: the unknown or extra character is deleted. Example: "chhar" is corrected to "char" by deleting the extra 'h'.

2. Transpose: based on certain rules, two characters are transposed. Example: "while" written as "wheil"; here 'e' and 'i' are transposed.

3. Replace: one character is replaced by another. Example: "chhr" can be corrected to "char" by replacing the second 'h' with 'a'.

4. Insert: a missing character is inserted to form a meaningful word. Example: "cha" is corrected to "char" by inserting 'r'.
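The four recovery techniques above can be sketched as a single-edit repair search: generate every string reachable from the bad lexeme by one deletion, transposition, replacement, or insertion, and keep those that match a known keyword. This is a minimal illustration, not how any particular compiler implements recovery; the keyword list and alphabet are assumptions for the example.

```python
# Illustrative sketch of single-edit error recovery.
# KEYWORDS and ALPHABET are assumptions for this example.
KEYWORDS = {"char", "int", "while", "if", "for", "return"}
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def single_edit_repairs(lexeme):
    candidates = set()
    for i in range(len(lexeme)):
        # 1. delete character i (panic-mode style deletion)
        candidates.add(lexeme[:i] + lexeme[i + 1:])
        # 2. transpose characters i and i+1
        if i + 1 < len(lexeme):
            candidates.add(lexeme[:i] + lexeme[i + 1] + lexeme[i] + lexeme[i + 2:])
        # 3. replace character i with every letter
        for c in ALPHABET:
            candidates.add(lexeme[:i] + c + lexeme[i + 1:])
    # 4. insert a letter at every position
    for i in range(len(lexeme) + 1):
        for c in ALPHABET:
            candidates.add(lexeme[:i] + c + lexeme[i:])
    # keep only candidates that are real keywords
    return candidates & KEYWORDS

print(single_edit_repairs("chhar"))  # delete  -> {'char'}
print(single_edit_repairs("chhr"))   # replace -> {'char'}
print(single_edit_repairs("cha"))    # insert  -> {'char'}
```

A real lexer would restrict the search (for example, only repairing near-misses of keywords) rather than enumerating all edits.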

Counting the number of tokens

The following example shows how to count the number of tokens in a given statement:
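As a minimal sketch, token counting can be done by matching the statement against regular expressions for each token class and counting the matches. The token classes and regexes below are illustrative assumptions, not a complete C lexer (string literals and comments, for instance, are not handled).

```python
# Sketch: count tokens in a C-like statement with regular expressions.
import re

# Illustrative token classes; order matters (numbers before identifiers,
# two-character operators before one-character ones).
TOKEN_RE = re.compile(r"""
    (?P<number>\d+)
  | (?P<identifier>[A-Za-z_]\w*)
  | (?P<operator>==|!=|<=|>=|[-+*/=<>])
  | (?P<symbol>[;(),{}])
""", re.VERBOSE)

def count_tokens(statement):
    # finditer skips characters (like spaces) that match no class
    return sum(1 for _ in TOKEN_RE.finditer(statement))

print(count_tokens("int value = 100;"))  # int, value, =, 100, ;   -> 5
print(count_tokens("x = x + 1;"))        # x, =, x, +, 1, ;        -> 6
```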
Introduction to Lexical Analyzer

The lexical analyzer is the first phase of a compiler. Its main task is to read the input characters and produce a sequence of tokens as output.

This sequence of tokens is later used by the syntax analyzer.

Role of the lexical analyzer:

  • Remove comments and white space.
  • Macro expansion.
  • Read input characters from the source program.
  • Group them into lexemes.
  • Produce as output a sequence of tokens.
  • Interact with the symbol table.
  • Correlate error messages generated by the compiler with the source program.

It reads the program line by line, so it reports the errors present in the program line by line.

Some Definitions:

Token: a pair consisting of:

  • Token name: the abstract symbol representing the lexical unit (affects parsing decisions).
  • Optional attribute value (influences translation after parsing).

Pattern: a description of the form that the different lexemes of a token may take.

Lexeme: a sequence of characters in the source program matching a pattern.


Token Specification:

We need a formal way to specify patterns: regular expressions.

Alphabet: any finite set of symbols.

String over an alphabet: a finite sequence of symbols drawn from that alphabet.

Language: a countable set of strings over some fixed alphabet.
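To make the idea of patterns concrete, here is a small sketch that specifies a pattern per token class as a regular expression and classifies a lexeme by the first pattern it fully matches. The pattern set is an assumption for illustration; listing keywords before identifiers mirrors how real lexers resolve that ambiguity.

```python
# Sketch: token patterns as regular expressions (illustrative set).
import re

PATTERNS = {
    "keyword":    r"int|char|while|if|return",
    "identifier": r"[A-Za-z_][A-Za-z0-9_]*",
    "number":     r"\d+(\.\d+)?",
}

def classify(lexeme):
    # first matching pattern wins, so keywords shadow identifiers
    for name, pattern in PATTERNS.items():
        if re.fullmatch(pattern, lexeme):
            return name
    return "error"  # lexeme matches no pattern

print(classify("while"))   # keyword
print(classify("count1"))  # identifier
print(classify("3.14"))    # number
print(classify("$"))       # error
```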



Lexical Analysis

Lexical analysis is the first phase of a compiler. It takes the modified source code from language preprocessors, written in the form of sentences. The lexical analyzer breaks this text into a series of tokens, removing any whitespace and comments in the source code. If the lexical analyzer finds an invalid token, it generates an error. The lexical analyzer works closely with the syntax analyzer: it reads the character stream from the source code, checks for legal tokens, and passes the data to the syntax analyzer when it demands.


A lexeme is a sequence of (alphanumeric) characters in a token. There are predefined rules for every lexeme to be identified as a valid token. These rules are defined by grammar rules, by means of a pattern. A pattern explains what can be a token, and these patterns are defined by means of regular expressions. In a programming language, keywords, constants, identifiers, strings, numbers, operators, and punctuation symbols can be considered tokens.
For example, in C language, the variable declaration line

int value = 100;

contains the tokens:

int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).
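The breakdown above can be reproduced with a small tokenizer sketch that emits (token-name, lexeme) pairs. The token classes and regexes are assumptions chosen to match this one example, not a full C lexer.

```python
# Sketch: tokenize "int value = 100;" into (token-name, lexeme) pairs.
import re

# Illustrative spec; keywords listed before identifiers so "int"
# is classified as a keyword, and a "skip" class absorbs whitespace.
SPEC = [
    ("keyword",    r"\b(?:int|char|float|return)\b"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("constant",   r"\d+"),
    ("operator",   r"="),
    ("symbol",     r";"),
    ("skip",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in SPEC))

def tokenize(code):
    tokens = []
    for m in MASTER.finditer(code):
        if m.lastgroup != "skip":          # drop whitespace
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("int value = 100;"))
# [('keyword', 'int'), ('identifier', 'value'), ('operator', '='),
#  ('constant', '100'), ('symbol', ';')]
```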