# Matlab Numbers

We have seen that computers can store integers easily in binary notation. Because each binary digit is stored electronically or magnetically as 'on' or 'off', the size of the integer a computer can represent is limited only by the number of binary digits the computer reserves for it.

In mathematical computations, it is much more common to
require rational or irrational numbers - the number of
applied problems that can be solved using only integers
is limited.
One way to handle such numbers is through a *fixed point representation*.
The idea is that a certain number of binary digits is reserved for
the fractional part of a number, and the rest are used for the part of the
number to the left of the decimal point. In a very simple example, we
might consider a representation of numbers using 16 binary digits. The
first eight digits represent the number to the left of the decimal point,
while the last seven represent the number to the right of the
decimal point. Since eight binary digits can only represent
numbers from 0 to 255, the integer part carries little more than two
decimal digits, so we keep only two decimal digits to the right of the
decimal point as well. Those two digits (00 to 99) require only seven binary digits.
That leaves the first digit of the second byte free to represent the
sign of the number: 0 for positive, 1 for negative.
In such a system we might get representations
such as those below.

00000001 00000001 ⇒ 1.01
10000000 00010000 ⇒ 128.16
11111111 01100011 ⇒ 255.99
11111111 11100011 ⇒ -255.99

We stress that computers do not use this system; it is purely hypothetical. The reason they do not use it is that it can only represent numbers between -256 and 256, and only to two decimal digits. If a computation were to require very large numbers or very small numbers, this system would fail miserably.
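The toy fixed point layout above can be decoded in a few lines of Matlab. This is only a sketch under the layout described in the text (eight digits for the integer part, one sign digit, seven digits holding a two-digit decimal fraction); the name `decodeFixed` is a hypothetical helper, not a built-in function.

```matlab
% Sketch: decode the toy 16-bit fixed point format from the text.
% Assumed layout: digits 1-8 integer part, digit 9 sign,
% digits 10-16 a two-digit decimal fraction (00 to 99).
% decodeFixed is a hypothetical helper name.
decodeFixed = @(b) (1 - 2*(b(9) == '1')) * ...
    (bin2dec(b(1:8)) + bin2dec(b(10:16))/100);

examples = {'0000000100000001', '1000000000010000', ...
            '1111111101100011', '1111111111100011'};
for k = 1:numel(examples)
    fprintf('%s => %.2f\n', examples{k}, decodeFixed(examples{k}));
end
```

Seven binary digits can hold values up to 127, so patterns whose last seven digits exceed 99 have no sensible meaning in this scheme, which is one more sign of how wasteful it is.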

Instead of the fixed point system described above, computers almost always
use a *floating point* system for representing non-integers.
The idea is that instead of using fixed numbers of binary digits to
store integer and decimal parts of the number, we will use fixed
numbers of binary digits to store a coefficient and an exponent.
Again using an oversimplified system for an example,
suppose that we have 16 binary digits to store our number.
We could reserve the first digit for the sign of the number,
the next ten binary digits for the coefficient, the next digit for
the sign of the exponent, and the last four digits for
the value of the exponent. In the examples below, the coefficient sign bit
is the first digit of the first byte and the exponent sign bit is the fourth digit of the second byte.

00000000 01000001 ⇒ 2×2^{1} = 4
00000000 01010001 ⇒ 2×2^{-1} = 1
00010000 01010011 ⇒ (2^{7}+2^{1})×2^{-3} = 16.25
01111111 11101111 ⇒ 1023×2^{15} = 3.3521664×10^{7}
00000000 00111111 ⇒ 1×2^{-15} = 3.0517578×10^{-5}
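As a rough check of the layout just described, the following Matlab sketch decodes a 16-digit pattern into its value. It assumes exactly the toy format from the text (sign, ten coefficient digits, exponent sign, four exponent digits); `decodeFloat` is a hypothetical helper name.

```matlab
% Sketch: decode the toy 16-bit floating point format from the text.
% Assumed layout: digit 1 coefficient sign, digits 2-11 coefficient,
% digit 12 exponent sign, digits 13-16 exponent value.
% decodeFloat is a hypothetical helper name.
decodeFloat = @(b) (1 - 2*(b(1) == '1')) * bin2dec(b(2:11)) * ...
    2^((1 - 2*(b(12) == '1')) * bin2dec(b(13:16)));

patterns = {'0000000001000001', '0000000001010001', ...
            '0111111111101111', '0000000000111111'};
for k = 1:numel(patterns)
    fprintf('%s => %g\n', patterns{k}, decodeFloat(patterns{k}));
end
```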

The last two bit patterns listed above represent the largest and smallest positive numbers that can be represented in this system. Indeed, there are several things worth noting about floating point number systems.

- The set of floating point numbers is finite, and in particular, there is a largest and a smallest floating point number in any such system.
- Floating point numbers are densely packed near zero and become increasingly sparse far from zero. What stays roughly constant is the number of significant digits, that is, the relative accuracy.
- When we multiply two floating point numbers, we get a number with more nonzero digits - a number which is probably not in the set of floating point numbers. We have to 'round' it off to get a number with a floating point representation. Thus floating point arithmetic is not exact.
- The exponent can be thought of as giving the position of the binary point (the base-2 counterpart of the decimal point) for the number. Multiplying a number by two corresponds to shifting the coefficient left one digit, or to increasing the exponent by one.
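The points in this list can be observed directly in Matlab's own double precision numbers. The sketch below uses the built-in functions `eps` and `log2`; the specific test values (1, 10^{10}, 0.1, 0.2, 10, 20) are chosen only for illustration.

```matlab
% Spacing between neighboring doubles grows with magnitude.
gapNear1    = eps(1);       % distance from 1 to the next larger double
gapNear1e10 = eps(1e10);    % distance from 1e10 to the next larger double
fprintf('gap near 1: %g, gap near 1e10: %g\n', gapNear1, gapNear1e10);

% Results are rounded, so arithmetic is not exact.
a = 0.1 + 0.2;
fprintf('0.1 + 0.2 == 0.3 ? %d\n', a == 0.3);   % prints 0 (false)
fprintf('difference: %g\n', a - 0.3);

% Doubling a number leaves the coefficient alone and adds 1 to the exponent.
[f1, e1] = log2(10);    % 10 = f1 * 2^e1, with 0.5 <= f1 < 1
[f2, e2] = log2(20);    % 20 = f2 * 2^e2
fprintf('10 = %g * 2^%d, 20 = %g * 2^%d\n', f1, e1, f2, e2);
```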

Note that in the example above there is a bias toward larger
numbers - i.e. the largest number available is over 10^{7},
but the smallest is not 10^{-7}. The accepted standard
for floating point number systems, IEEE 754, deals with this issue
in reasonably clever ways. Most computing systems use the IEEE 754
standard, which uses four bytes (32 bits) for ordinary (*single precision*) floating point numbers.
The standard also provides for 64 bit floating point numbers.

By default, Matlab does all computations using floating point numbers.
It uses *double precision* floating point numbers, meaning
that it uses the 64 bit standard discussed above. In practice this means
that you can assume that Matlab carries around something like 16
decimal digits of significance, the smallest positive number available
is around 2.2×10^{-308},
and the largest number in Matlab is around
1.8×10^{308}.
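These limits can be inspected from the command line. The sketch below uses the built-in functions `class`, `realmin` and `realmax`; the variable names are just for illustration.

```matlab
% Numeric literals are double precision unless you ask for something else.
disp(class(1));                        % displays: double

% Smallest positive normalized double and largest finite double.
smallest = realmin;                    % about 2.2e-308
largest  = realmax;                    % about 1.8e+308
fprintf('realmin = %g, realmax = %g\n', smallest, largest);

% Going past the largest number overflows to Inf.
overflowed = 2 * largest;
fprintf('2 * realmax = %g\n', overflowed);

% For comparison, the 32 bit (single precision) range is much smaller.
fprintf('single realmax = %g\n', realmax('single'));
```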

The fact that Matlab can represent a number as small as 10^{-300} does not mean that it should. Think about it: if we are doing computations with numbers of magnitude around 10^{0}, and can carry only 16 digits of precision, then in floating point arithmetic 1 + 10^{-16} = 1. The number 10^{-16} is a 1 with 16 zeros in front of it: it is too small to make any change in the representation of 1, so it is lost to roundoff error. For this, there is a number called *machine epsilon*, or eps for short, which is considered to be the smallest computationally useful number, relative to 1. In Matlab it is around 2.2×10^{-16}. When you are doing computations, you should probably think about the largest number that arises in the results, and call it *L*. In that case, *L* × eps is effectively zero for your computation; it is the number below which you lose significance. In this class, you can usually suppose that numbers smaller than about 10^{-14} are equivalent to zero in most of our computations.
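A short Matlab sketch of this idea follows. It relies only on the built-in `eps`; the scale `L = 1e6` and the test expression are made-up values for illustration, not something prescribed by the course.

```matlab
% Adding something much smaller than eps to 1 changes nothing.
fprintf('(1 + 1e-16) == 1 ? %d\n', (1 + 1e-16) == 1);   % prints 1 (true)
fprintf('(1 + eps)   == 1 ? %d\n', (1 + eps) == 1);     % prints 0 (false)
fprintf('eps = %g\n', eps);                             % 2^-52

% A quantity that is exactly zero on paper but not in floating point.
L = 1e6;                      % assumed size of the largest number involved
r = (L + 0.1) - L - 0.1;      % roundoff leaves a small nonzero residue
fprintf('r = %g, L*eps = %g\n', r, L*eps);

% Treat r as zero if it is below the significance threshold L*eps.
isEffectivelyZero = abs(r) < L*eps;
fprintf('effectively zero? %d\n', isEffectivelyZero);
```

Comparing against `L*eps` rather than against exact zero is the practical consequence of the paragraph above: an `r == 0` test would fail here even though the exact answer is zero.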

