##### IEEE 754 Single and Double Precision

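In a purely normalised floating-point representation, the value 0 is not representable, because a normalised mantissa always has a nonzero leading digit. The IEEE 754 standard solves this problem by reserving special exponent values.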

IEEE 754 Standard

IEEE 754 is the standard for floating-point representation. Some of its features are:

1. The base of the system is 2.
2. There is provision for the values ±0 and ±∞.
3. A floating-point number is stored either in single precision (32 bits) or in double precision (64 bits).
4. A floating-point number can be represented in either of two forms:
   a) Fractional form
   b) Implicit normalised form

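Single Precision (32 bits)

| Field | S (sign) | E (biased exponent) | M (mantissa) |
|---|---|---|---|
| Bits | 1 | 8 | 23 |

Excess-127 is used as the bias.

$V = (-1)^S \times (1.M)_2 \times 2^{E-127}$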

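| S (1) | E (8) | M (23) | Value |
|---|---|---|---|
| 0 or 1 | E = 0 | M = 0 | ±0 |
| 0 or 1 | E = 255 | M = 0 | ±∞ |
| 0 or 1 | 1 ≤ E ≤ 254 | M = xxxx…xxxx | Implicit normalised form |
| 0 or 1 | E = 0 | M ≠ 0 | Fractional form |
| 0 or 1 | E = 255 | M ≠ 0 | Not a number (NaN) |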
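The field layout and the five cases above can be checked on real numbers. The following is a minimal sketch using Python's standard `struct` module; the helper names `decode_single` and `value_single` are hypothetical, chosen only for this illustration, and `value_single` covers only the implicit normalised case (1 ≤ E ≤ 254).

```python
import struct

def decode_single(x: float) -> dict:
    """Round x to single precision and split the 32 bits into S, E and M."""
    # Pack as big-endian IEEE 754 single precision, then read the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31             # 1 sign bit
    e = (bits >> 23) & 0xFF    # 8 biased-exponent bits
    m = bits & 0x7FFFFF        # 23 mantissa bits

    if e == 0 and m == 0:
        kind = "zero"
    elif e == 255 and m == 0:
        kind = "infinity"
    elif e == 255:
        kind = "not a number"
    elif e == 0:
        kind = "fractional form"
    else:
        kind = "implicit normalised form"
    return {"S": s, "E": e, "M": m, "kind": kind}

def value_single(s: int, e: int, m: int) -> float:
    """Evaluate V = (-1)^S * (1.M)_2 * 2^(E-127), valid for 1 <= E <= 254."""
    return (-1) ** s * (1 + m / 2**23) * 2.0 ** (e - 127)

fields = decode_single(-6.5)   # -6.5 = -(1.101)_2 * 2^2
print(fields)
print(value_single(fields["S"], fields["E"], fields["M"]))
```

Here -6.5 is stored with S = 1 and E = 129 (true exponent 2 plus the bias 127), and evaluating the formula on the decoded fields gives -6.5 back.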

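Double Precision (64 bits)

| Field | S (sign) | E (biased exponent) | M (mantissa) |
|---|---|---|---|
| Bits | 1 | 11 | 52 |

Excess-1023 is used as the bias.

$V = (-1)^S \times (1.M)_2 \times 2^{E-1023}$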

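| S (1) | E (11) | M (52) | Value |
|---|---|---|---|
| 0 or 1 | E = 0 | M = 0 | ±0 |
| 0 or 1 | E = 2047 | M = 0 | ±∞ |
| 0 or 1 | 1 ≤ E ≤ 2046 | M = xxxx…xxxx | Implicit normalised form |
| 0 or 1 | E = 0 | M ≠ 0 | Fractional form |
| 0 or 1 | E = 2047 | M ≠ 0 | Not a number (NaN) |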
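Going the other way, a double can be assembled from chosen S, E and M fields to see which value each row of the table produces. This is a small sketch, assuming the usual case where a Python `float` is an IEEE 754 double; `make_double` is a hypothetical helper name.

```python
import math
import struct

def make_double(s: int, e: int, m: int) -> float:
    """Build a 64-bit pattern from S (1 bit), E (11 bits), M (52 bits) and read it as a double."""
    bits = (s << 63) | (e << 52) | m
    (value,) = struct.unpack(">d", struct.pack(">Q", bits))
    return value

print(make_double(0, 0, 0))             # E = 0,    M = 0  -> zero
print(make_double(1, 2047, 0))          # E = 2047, M = 0  -> negative infinity
print(make_double(0, 2047, 1))          # E = 2047, M != 0 -> not a number
print(make_double(0, 1023, 0))          # (1.0)_2 * 2^(1023-1023)
print(make_double(0, 1024, 1 << 51))    # (1.1)_2 * 2^(1024-1023)
print(make_double(0, 0, 1))             # E = 0, M != 0 -> smallest fractional-form value
```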

##### Theory on floating point representation
• Floating-point numbers are stored in mantissa (M) and exponent (E) form.
• Most notations represent the mantissa as a normalised sign-magnitude fraction, for example:

0.1101 (sign-magnitude fraction)

• The normalisation can be explicit or implicit.
• The exponent is stored in biased form.

The biased exponent is an unsigned number that represents the signed exponent of the original number.

True exponent = Biased exponent − Bias

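A floating-point number is stored in the following form:

| Field | Meaning |
|---|---|
| S | Sign |
| E | Biased exponent (k bits), $0 \le E \le 2^k - 1$ |
| M | Mantissa |

Bias = $2^{k-1}$ (IEEE 754 uses $2^{k-1} - 1$ instead, for example 127 for k = 8)

The value of the number is given by

$V = (-1)^S \times (0.M)_2 \times 2^{E - \text{Bias}}$

for explicit normalisation (the default), and by

$V = (-1)^S \times (1.M)_2 \times 2^{E - \text{Bias}}$

for implicit normalisation, where the 1 before the radix point is not stored.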
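The two value formulas can be compared on the same stored fields. The sketch below assumes the bias $2^{k-1}$ used in this section (not the IEEE 754 bias); `float_value` and its parameters are hypothetical names for illustration.

```python
def float_value(s: int, e: int, mantissa_bits: str, k: int, implicit: bool = False) -> float:
    """Evaluate (-1)^S * (0.M or 1.M)_2 * 2^(E - Bias) with Bias = 2^(k-1)."""
    bias = 2 ** (k - 1)
    # (0.M)_2: the i-th mantissa bit after the radix point has weight 2^-(i+1).
    fraction = sum(int(bit) * 2.0 ** -(i + 1) for i, bit in enumerate(mantissa_bits))
    significand = 1 + fraction if implicit else fraction
    return (-1) ** s * significand * 2.0 ** (e - bias)

# k = 4 exponent bits, so Bias = 8; E = 10 means a true exponent of 2.
print(float_value(0, 10, "1101", k=4))                  # (0.1101)_2 * 2^2
print(float_value(0, 10, "1101", k=4, implicit=True))   # (1.1101)_2 * 2^2
```

With the same bits, the explicit form evaluates to 3.25 and the implicit form to 7.25, which shows the effect of the unstored leading 1.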

##### Ranges of signed integer representation

Three ways to represent signed numbers are:
1) Signed magnitude representation
2) 1's complement representation
3) 2's complement representation

The ranges of n-bit numbers in these three signed representations are:

| Representation | Max | Min |
|---|---|---|
| Signed magnitude (SM) | $2^{n-1} - 1$ | $-(2^{n-1} - 1)$ |
| 1's complement (1's) | $2^{n-1} - 1$ | $-(2^{n-1} - 1)$ |
| 2's complement (2's) | $2^{n-1} - 1$ | $-2^{n-1}$ |
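These ranges can be confirmed by brute force: decode every n-bit pattern under each scheme and take the minimum and maximum. The following is a minimal sketch; `decode` and `value_range` are hypothetical helper names.

```python
def decode(bits: int, n: int, scheme: str) -> int:
    """Interpret an n-bit pattern (given as an unsigned int) under one signed scheme."""
    sign = bits >> (n - 1)                     # most significant bit
    magnitude = bits & ((1 << (n - 1)) - 1)    # remaining n-1 bits
    if scheme == "SM":
        return -magnitude if sign else magnitude
    if scheme == "1's":
        return bits - ((1 << n) - 1) if sign else bits
    if scheme == "2's":
        return bits - (1 << n) if sign else bits
    raise ValueError(f"unknown scheme: {scheme}")

def value_range(n: int, scheme: str) -> tuple:
    values = [decode(pattern, n, scheme) for pattern in range(1 << n)]
    return min(values), max(values)

for scheme in ("SM", "1's", "2's"):
    print(scheme, value_range(8, scheme))
```

For n = 8 this gives -127 to 127 for signed magnitude and 1's complement, and -128 to 127 for 2's complement, in agreement with the table.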

Fixed point unsigned integer

With n bits there are $2^n$ distinct numbers.

For unsigned integers the range is 0 to $2^n - 1$.

The maximum number that can be represented by an n-bit fraction is $1 - 2^{-n}$,

i.e. 0.111…1 (n ones):

$2^{-1} + 2^{-2} + 2^{-3} + 2^{-4} + \dots + 2^{-n}$

$= \dfrac{a(1 - r^n)}{1 - r}$, with $a = \tfrac{1}{2}$ and $r = \tfrac{1}{2}$

$= 1 - 2^{-n}$
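The closed form can be checked against the direct sum using exact rational arithmetic. This is a small sketch with a hypothetical helper `max_fraction`:

```python
from fractions import Fraction

def max_fraction(n: int) -> Fraction:
    """Sum 2^-1 + 2^-2 + ... + 2^-n exactly, i.e. the value of 0.11...1 with n ones."""
    return sum(Fraction(1, 2 ** i) for i in range(1, n + 1))

for n in (1, 4, 8):
    print(n, max_fraction(n), 1 - Fraction(1, 2 ** n))
```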

##### A few examples on conversion from one form to another

Example 1

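Find the base X for which $\sqrt{(144)_X} = (12)_X$.

a) 8   b) 10   c) 12   d) >4

sol:

$(144)_8 = 1 \times 8^2 + 4 \times 8^1 + 4 \times 8^0 = (100)_{10}$

$(12)_8 = 1 \times 8^1 + 2 \times 8^0 = (10)_{10}$

$(144)_{12} = 1 \times 12^2 + 4 \times 12^1 + 4 \times 12^0 = (196)_{10}$

$(12)_{12} = 1 \times 12^1 + 2 \times 12^0 = (14)_{10}$

The relation holds in both cases ($10^2 = 100$ and $14^2 = 196$). In general, $(144)_X = X^2 + 4X + 4 = (X + 2)^2 = \big((12)_X\big)^2$, and the digit 4 is valid only when X > 4.

∴ it is true for every base > 4.

Option D is correct.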
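The result can be checked over a range of bases. This sketch relies on Python's built-in `int(text, base)`, which raises `ValueError` when a digit is not valid in the given base; `holds` is a hypothetical helper name.

```python
import math

def holds(base: int) -> bool:
    """True if sqrt((144)_base) == (12)_base, False if the digits are invalid in that base."""
    try:
        number = int("144", base)
        root = int("12", base)
    except ValueError:        # the digit 4 does not exist in bases 2, 3 and 4
        return False
    return math.isqrt(number) == root and root * root == number

valid = [base for base in range(2, 17) if holds(base)]
print(valid)
```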

Example 2

Consider the quadratic equation $X^2 - (12)_r X + (37)_r = 0$. Solving it gives the decimal roots 8 and 5. What is the value of r?

a) 10   b) 11   c) 12   d) 8

sol:

$X^2 - (12)_r X + (37)_r = 0$

Since the two roots are given in decimal, the equation is

$(X - 8)(X - 5) = 0$

$X^2 - 13X + 40 = 0$

Comparing coefficients:

$(12)_r = 13$

$(37)_r = 40$

$3r + 7 = 40$

$r = 11$

Check: $(12)_{11} = 1 \times 11 + 2 = 13$.

Option B is correct.
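The same comparison of coefficients can be done by a simple search over candidate bases. A minimal sketch, where `find_base` is a hypothetical helper and the search starts at 8 because the digit 7 must be valid:

```python
def find_base(roots=(8, 5)) -> list:
    """Return every base r for which (12)_r is the sum and (37)_r the product of the roots."""
    total = roots[0] + roots[1]      # 13
    product = roots[0] * roots[1]    # 40
    return [r for r in range(8, 37)
            if int("12", r) == total and int("37", r) == product]

print(find_base())
```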

Example 3

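How many digits are required to represent a 126-bit binary number in decimal?

sol:

The largest 126-bit number is $2^{126} - 1$ and the largest d-digit decimal number is $10^d - 1$, so

$10^d - 1 \ge 2^{126} - 1$

$10^d \ge 2^{126}$

$d \ge 126 \log_{10} 2$

$d \ge 37.93$

$d = 38$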
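Because Python integers have arbitrary precision, the answer can be confirmed directly by counting the decimal digits of the largest 126-bit number. A short sketch:

```python
import math

largest = 2 ** 126 - 1                           # largest 126-bit binary number
digits_by_counting = len(str(largest))           # count its decimal digits
digits_by_formula = math.ceil(126 * math.log10(2))

print(126 * math.log10(2))
print(digits_by_counting, digits_by_formula)
```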

##### Introduction to number system

In digital electronics, the number system is used for representing the information. The number system has different bases and the most common of them are the decimal, binary, octal, and hexadecimal.

A number N in base (radix) b can be written as:
$(N)_b = d_{n-1}\, d_{n-2} \dots d_1\, d_0 \,.\, d_{-1}\, d_{-2} \dots d_{-m}$

In the above, $d_{n-1}$ to $d_0$ is the integer part, then follows the radix point, and then $d_{-1}$ to $d_{-m}$ is the fractional part.

$d_{n-1}$ = Most significant bit (MSB)

$d_{-m}$ = Least significant bit (LSB)

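Conversion from one base to another

1. Decimal to Binary

$(10.25)_{10}$

Integer part: $(10)_{10} = (1010)_2$

Fractional part (multiply by 2 repeatedly and collect the integer parts):

$0.25 \times 2 = 0.5$ → 0

$0.5 \times 2 = 1.0$ → 1

$(0.25)_{10} = (0.01)_2$

$(10.25)_{10} = (1010.01)_2$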
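The procedure above (repeated division for the integer part, repeated multiplication for the fractional part) can be sketched as follows. The function `to_base` and its `max_frac_digits` cut-off are hypothetical, it handles only non-negative inputs, and it uses `Fraction` so that the decimal input is treated exactly.

```python
from fractions import Fraction

DIGITS = "0123456789ABCDEF"

def to_base(value: str, base: int, max_frac_digits: int = 12) -> str:
    """Convert a non-negative decimal string such as '10.25' to the given base (2 to 16)."""
    number = Fraction(value)
    integer = int(number)          # integer part
    frac = number - integer        # fractional part, 0 <= frac < 1

    # Integer part: divide by the base repeatedly; the remainders are the digits, last one first.
    int_digits = ""
    while integer > 0:
        integer, remainder = divmod(integer, base)
        int_digits = DIGITS[remainder] + int_digits
    int_digits = int_digits or "0"

    # Fractional part: multiply by the base repeatedly; the integer parts are the digits.
    frac_digits = ""
    while frac and len(frac_digits) < max_frac_digits:
        frac *= base
        digit = int(frac)
        frac_digits += DIGITS[digit]
        frac -= digit

    return int_digits + ("." + frac_digits if frac_digits else "")

print(to_base("10.25", 2))
print(to_base("10.25", 8))
print(to_base("0.1", 2, max_frac_digits=8))
```

The cut-off matters because some fractions never terminate in the target base: 0.1 in decimal has a repeating binary expansion, so the third call returns only its first 8 fractional bits.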

2. Binary to Decimal

$(1010.01)_2$

$1 \times 2^3 + 0 \times 2^2 + 1 \times 2^1 + 0 \times 2^0 + 0 \times 2^{-1} + 1 \times 2^{-2} = 8 + 0 + 2 + 0 + 0 + 0.25 = 10.25$

$= (10.25)_{10}$
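The positional expansion used here works for any base. A minimal sketch, with `from_base` as a hypothetical helper name; it returns a float, which is exact for short inputs like these:

```python
def from_base(digits: str, base: int) -> float:
    """Evaluate a digit string such as '1010.01' in the given base using positional weights."""
    integer, _, fraction = digits.partition(".")
    total = 0.0
    # Integer digits have weights base^0, base^1, ... counted from the radix point leftwards.
    for position, ch in enumerate(reversed(integer)):
        total += int(ch, base) * base ** position
    # Fractional digits have weights base^-1, base^-2, ... counted rightwards.
    for position, ch in enumerate(fraction, start=1):
        total += int(ch, base) * base ** -position
    return total

print(from_base("1010.01", 2))
print(from_base("12.2", 8))
```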

3. Decimal to Octal

$(10.25)_{10}$

Integer part: $(10)_{10} = (12)_8$

Fractional part:

$0.25 \times 8 = 2.00$

$(10.25)_{10} = (12.2)_8$

4. Octal to Decimal

$(12.2)_8$

$1 \times 8^1 + 2 \times 8^0 + 2 \times 8^{-1} = 8 + 2 + 0.25 = 10.25$

$(12.2)_8 = (10.25)_{10}$

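| Binary | Hexadecimal |
|---|---|
| 0000 | 0 |
| 0001 | 1 |
| 0010 | 2 |
| 0011 | 3 |
| 0100 | 4 |
| 0101 | 5 |
| 0110 | 6 |
| 0111 | 7 |
| 1000 | 8 |
| 1001 | 9 |
| 1010 | A |
| 1011 | B |
| 1100 | C |
| 1101 | D |
| 1110 | E |
| 1111 | F |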
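The table is typically used by splitting a binary integer into groups of four bits, starting from the right, and replacing each group by its hexadecimal digit. A short sketch of that grouping, with `binary_to_hex` as a hypothetical helper name:

```python
def binary_to_hex(bits: str) -> str:
    """Convert a binary integer string to hexadecimal, four bits at a time."""
    # Pad on the left with zeros so the length is a multiple of 4.
    width = -(-len(bits) // 4) * 4
    padded = bits.zfill(width)
    groups = [padded[i:i + 4] for i in range(0, len(padded), 4)]
    # Each 4-bit group maps to exactly one hexadecimal digit.
    return "".join(format(int(group, 2), "X") for group in groups)

print(binary_to_hex("1010"))
print(binary_to_hex("11111010"))
print(binary_to_hex("101101"))
```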
