# Matlab Numbers

We have seen that computers can store integers easily in binary notation. Since each binary digit is simply electronically or magnetically 'on' or 'off', the size of the integer a computer can represent is limited only by the number of binary digits the computer reserves for it.

In mathematical computations, it is much more common to require rational or irrational numbers - the number of applied problems that can be solved using only integers is limited. One way to handle such numbers is through a fixed point representation. The idea is that a certain number of binary digits is reserved for the fractional part of a number, and the rest are used for the part of the number to the left of the decimal point. In a very simple example, we might consider a representation of numbers using 16 binary digits. The first eight digits (the first byte) represent the number to the left of the decimal point, while the last seven digits represent the number to the right of the decimal point. Eight binary digits can only represent the numbers from 0 to 255, so a full second byte would still give only two complete decimal digits (00 to 99), and those require only seven binary digits (0 to 127). That frees the first digit of the second byte to represent the sign of the number: 0 for positive, 1 for negative. In such a system we might get representations such as those below.

00000001 00000001 ⇒ 1.01
10000000 00010000 ⇒ 128.16
11111111 01100011 ⇒ 255.99
11111111 11100011 ⇒ -255.99
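The decoding rule for this toy format can be sketched in a few lines of Matlab. This is only an illustration of the hypothetical layout described above (8 bits for the whole part, 1 sign bit, 7 bits for the hundredths); the variable names are made up for this example.

```matlab
% Decode the toy 16-bit fixed point format (illustrative only).
% Assumed layout: bits 1-8 whole part, bit 9 sign, bits 10-16 hundredths.
patterns = {'00000001 00000001', '10000000 00010000', ...
            '11111111 01100011', '11111111 11100011'};
values = zeros(1, numel(patterns));
for k = 1:numel(patterns)
    bits = strrep(patterns{k}, ' ', '');   % drop the space between bytes
    whole = bin2dec(bits(1:8));            % part left of the decimal point
    sgn = 1 - 2*(bits(9) == '1');          % sign bit: 0 -> +1, 1 -> -1
    hundredths = bin2dec(bits(10:16));     % part right of the decimal point
    values(k) = sgn * (whole + hundredths/100);
    fprintf('%s => %.2f\n', patterns{k}, values(k));
end
```

Note that seven bits can hold values up to 127, so patterns with a "hundredths" field above 99 would be meaningless in this scheme; the sketch does not check for that.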

We stress that computers do not use this system - it is purely illustrative. The reason they do not use it is that it can only represent numbers between -256 and 256, and only to two decimal places. If a computation were to require very large numbers or very small numbers, this system would fail miserably.

Instead of the fixed point system described above, computers almost always use a floating point system for representing non-integers. The idea is that instead of using fixed numbers of binary digits to store the integer and fractional parts of the number, we use fixed numbers of binary digits to store a coefficient and an exponent. Again using an oversimplified system for an example, suppose that we have 16 binary digits to store our number. We could reserve the first digit for the sign of the number, the next ten binary digits for the coefficient, the next digit for the sign of the exponent, and the last four digits for the value of the exponent. In the examples below, the first digit is the coefficient sign bit and the twelfth digit is the exponent sign bit.

00000000 01000001 ⇒ $2 \times 2^{1} = 4$
00000000 01010001 ⇒ $2 \times 2^{-1} = 1$
00010000 01010011 ⇒ $(2^{7} + 2^{1}) \times 2^{-3} = 16.25$
01111111 11101111 ⇒ $1023 \times 2^{15} = 3.3521664 \times 10^{7}$
00000000 00111111 ⇒ $1 \times 2^{-15} \approx 3.0517578 \times 10^{-5}$
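As a check on the layout just described, the bit patterns can be decoded mechanically. The following is a minimal Matlab sketch, assuming exactly the toy layout from the text (1 coefficient sign bit, 10 coefficient bits, 1 exponent sign bit, 4 exponent bits); it is not how Matlab or IEEE 754 stores numbers, and the variable names are invented for this example.

```matlab
% Decode the toy 16-bit floating point format (illustrative only).
% Assumed layout: bit 1 coefficient sign, bits 2-11 coefficient,
% bit 12 exponent sign, bits 13-16 exponent magnitude.
patterns = {'00000000 01000001', '00000000 01010001', ...
            '00010000 01010011', '01111111 11101111', ...
            '00000000 00111111'};
values = zeros(1, numel(patterns));
for k = 1:numel(patterns)
    bits = strrep(patterns{k}, ' ', '');   % drop the space between bytes
    csign = 1 - 2*(bits(1) == '1');        % coefficient sign: +1 or -1
    coeff = bin2dec(bits(2:11));           % coefficient, 0..1023
    esign = 1 - 2*(bits(12) == '1');       % exponent sign: +1 or -1
    expo = bin2dec(bits(13:16));           % exponent magnitude, 0..15
    values(k) = csign * coeff * 2^(esign * expo);
    fprintf('%s => %d x 2^%d = %.8g\n', ...
            patterns{k}, csign*coeff, esign*expo, values(k));
end
```

Working through a pattern by hand and comparing with the printed coefficient and exponent is a good way to make sure the bit positions are being read correctly.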

The last two numbers in the example are the largest number and the smallest positive number that can be represented in this system. Indeed, there are several things worth noting about floating point number systems.

• The set of floating point numbers is finite, and in particular, there is a largest and a smallest floating point number in any such system.
• There are very many floating point numbers near zero, while they become increasingly sparse far from zero. The important thing is the number of significant digits.
• When we multiply two floating point numbers, we get a number with more nonzero digits - a number which is probably not in the set of floating point numbers. We have to 'round' it off to get a number with a floating point representation. Thus floating point arithmetic is not exact.
• The exponent could be thought of as giving the position of the decimal point for the number. Multiplying a number by two corresponds to shifting the coefficient left one digit, or to increasing the exponent by one.
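These properties can be observed directly in Matlab's own double precision numbers. The sketch below is only illustrative; the sample values and variable names are arbitrary choices, and it relies on the built-in functions `eps` (with an argument) and `log2` (with two outputs).

```matlab
% Spacing between adjacent floating point numbers grows with magnitude.
x = [1e-10, 1, 1e10];
gaps = eps(x);                % distance from each x to the next larger double
relative_gaps = gaps ./ x;    % roughly constant: the significant digits

% Arithmetic results are rounded to a nearby floating point number,
% so identities that hold for real numbers can fail.
sum_is_exact = (0.1 + 0.2 == 0.3);   % false in double precision

% Multiplying by two changes only the exponent, so it is exact.
[f1, e1] = log2(0.75);        % 0.75     = f1 * 2^e1, with 0.5 <= f1 < 1
[f2, e2] = log2(2 * 0.75);    % 2 * 0.75 = f2 * 2^e2, same f, e one larger

disp(gaps);
disp(relative_gaps);
```

The absolute gaps differ by many orders of magnitude, while the relative gaps all stay close to the same size.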

Note that in the example above there is a bias toward larger numbers - i.e. the largest number available is over $10^{7}$, but the smallest positive number is only about $3 \times 10^{-5}$, not $10^{-7}$. The accepted standard for floating point number systems, IEEE 754, deals with this issue in reasonably clever ways. Most computing systems use the IEEE 754 standard, which uses four bytes (32 bits) for ordinary (single precision) floating point numbers. The standard also provides for 64 bit (double precision) floating point numbers.

By default, Matlab does all computations using floating point numbers. It uses double precision floating point numbers, meaning that it uses the 64 bit IEEE 754 format. In practice this means that you can assume that Matlab carries around something like 16 decimal digits of significance, the smallest positive number available at full precision is around $2.2 \times 10^{-308}$, and the largest number in Matlab is around $1.8 \times 10^{308}$.
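These limits can be inspected from the command line. The following short sketch uses the built-in functions `class`, `realmin`, `realmax`, and `eps`; the variable names are just for illustration.

```matlab
% Inspect Matlab's default number type and its limits.
x = 1/3;
class_of_x = class(x);         % 'double': the default numeric class
fprintf('%.20f\n', x);         % digits past about the 16th are not meaningful

smallest = realmin;            % about 2.2e-308
largest = realmax;             % about 1.8e308
overflowed = 2 * realmax;      % too large to represent: becomes Inf

single_eps = eps('single');    % 32 bit numbers carry only about 7 digits
fprintf('%g  %g  %g\n', smallest, largest, overflowed);
```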

The fact that Matlab can form numbers as small as $10^{-300}$ does not mean that it should. Think about it: if we are doing computations with numbers of magnitude around $10^{0}$, and can carry only 16 digits of precision, then in floating point arithmetic
$$1 + 10^{-16} = 1$$
In other words, the number 1 is stored with 15 zeros after it, while $10^{-16}$ is a 1 with 16 zeros in front of it: it is too small to make any change in the representation of 1, so it is lost to roundoff error. For this reason, there is a number called machine epsilon, or eps for short, which is considered to be the smallest computationally useful number, relative to 1. In Matlab it is around $2.2 \times 10^{-16}$. When you are doing computations, you should think about the largest number that arises in the results, and call it L. Then L * eps is effectively zero for your computation; it is the size below which you lose significance. In this class, you can usually suppose that numbers smaller than about $10^{-14}$ in magnitude are equivalent to zero in most of our computations.
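The effect of roundoff, and the rule of thumb about L * eps, can be tried out directly. This is a minimal sketch; the vector `results` is made-up data standing in for the output of some computation, and the other variable names are likewise illustrative.

```matlab
% Adding something much smaller than eps to 1 changes nothing.
lost = (1 + 1e-16 == 1);       % true: 1e-16 is lost to roundoff
kept = (1 + eps == 1);         % false: eps is just large enough to register

% Rule of thumb: values below L*eps are effectively zero, where L is
% the largest magnitude appearing in the results.
results = [1e6, 3.2, 4e-12];   % hypothetical computed values
L = max(abs(results));
tol = L * eps;                 % about 2.2e-10 for L = 1e6
negligible = abs(results) < tol;

disp(lost); disp(kept);
disp(negligible);
```

Here only the last entry of `results` falls below the tolerance, so it would be treated as zero relative to the others.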

Department of Mathematics, PO Box 643113, Neill Hall 103, Washington State University, Pullman WA 99164-3113, 509-335-3926