Math 300: File Types

The title of this section raises the question: "since all files on computers are ultimately composed of ones and zeros, how can they be different?" The answer lies in the programs that interpret the files for the user. Yes - all files are sequences of bits, but the computer interprets them differently for you depending on the program used.

There are many ways to distinguish among files. The simplest and most basic concerns whether they are considered to be text or binary. Text files (in English) are usually coded in ASCII (the American Standard Code for Information Interchange) - a seven-bit scheme for representing characters as binary numbers. We won't go into too much detail here except to give some examples.

The space character is number 32 in ASCII. There are several other special characters in the range 33-47, and then the Arabic numerals appear as characters 48-57. In other words, character 48 is zero, character 49 is one, and so on. The upper case letters begin at character number 65 ("A"), and the lower case letters begin at character 97 ("a").
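The codes quoted above can be checked with a short Python sketch. The helper name ascii_codes is only an illustration and is not part of the course material; ord and chr are Python built-ins.

```python
def ascii_codes(text):
    """Return the numeric code of each character in text."""
    return [ord(ch) for ch in text]


# Codes for space, '0', '9', 'A', and 'a'.
print(ascii_codes(" 09Aa"))      # [32, 48, 57, 65, 97]

# Going the other way: the character whose code is 49.
print(chr(49))                   # 1

# 'A' (code 65) written as a seven-bit binary number.
print(format(ord("A"), "07b"))   # 1000001
```

Note that the character "1" and the number 1 are different things: the character is stored as code 49.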

Many operating systems extend ASCII to an eight-bit code. This gives 256 symbols - usually enough to include some Greek letters as well as the cedillas and umlauts needed for European languages. There is no single extended ASCII, however: different systems assign the upper 128 codes differently.
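As a rough illustration of eight-bit codes, the sketch below decodes the same byte values under two eight-bit encodings that ship with Python, "latin-1" and "cp437". These two were picked only as examples; the helper decode_byte is a made-up name. The sketch prints character names rather than the characters themselves so that it runs on any terminal.

```python
import unicodedata


def decode_byte(value, encoding):
    """Interpret a single byte value (0-255) under the given encoding."""
    return bytes([value]).decode(encoding)


# Eight bits give this many distinct codes.
print(2 ** 8)

# Below 128 the two encodings agree with ASCII; above 127 they differ.
for value in (0x41, 0xE0, 0xFC):
    for encoding in ("latin-1", "cp437"):
        ch = decode_byte(value, encoding)
        print(hex(value), encoding, unicodedata.name(ch))
```

The byte 0x41 is "A" in both, while 0xE0 is an accented Latin letter in one encoding and a Greek letter in the other.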

Text files may be viewed using simple commands. The command to list the contents of a text file in both Unix and Windows is called "more". If you need to modify the file, then you should use a text editor. These are programs that perform the simple mapping between the characters of the file and the shapes of letters, numbers, and other characters on the screen. You have used text editors such as Notepad, but may be unaware of edit (in Windows) or emacs, vi, pico, xedit, and gedit in Unix.

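Of course, English letters are not the only characters we have to represent. The need to allow computers to handle all human languages has led to the adoption of Unicode. While ASCII characters can be represented in seven bits, Unicode was originally designed around sixteen-bit codes. Sixteen bits allow 512 times as many characters as seven (65,536 versus 128), at a cost of twice the space of an eight-bit code. Unicode has since grown beyond sixteen bits, and common encodings such as UTF-8 use a variable number of bytes per character, so plain ASCII text still takes one byte per character.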
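The arithmetic in this paragraph is easy to check, and Python's built-in codecs show how much space a character actually takes. This is a minimal sketch: encoded_size is a made-up helper name, and "utf-8" and "utf-16-le" are two standard Unicode encodings chosen here as examples (the "le" variant is used so that no byte-order mark is added to the count).

```python
def encoded_size(text, encoding):
    """Number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))


# Sixteen bits versus seven bits: how many times more codes?
print(2 ** 16 // 2 ** 7)   # 512

# An English letter, an accented letter, and a Greek letter.
for ch in ("A", "\u00e9", "\u03b1"):
    print(ascii(ch), encoded_size(ch, "utf-8"), encoded_size(ch, "utf-16-le"))
```

In the sixteen-bit encoding every one of these characters takes two bytes, while in UTF-8 the plain English letter takes only one.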

All non-ASCII files on a computer (except on a few proprietary machines such as IBM mainframes) are considered to be "binary" files. Such files may be programs - sequences of instructions that the computer can run directly - or they may require other programs to interpret them. For example, an MS Word document cannot be viewed well using a simple text editor because it contains non-ASCII character codes. Other files are composed of ASCII characters, but nonetheless contain instructions to a program inside them - such instructions are called markup. All of these files can be associated by an operating system with the program required to interpret them through a MIME type.
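The text/binary distinction described above can be sketched as a simple test: a file counts as ASCII text if every byte in it is below 128. The function names and the sample byte strings are made up for illustration, and real tools use more careful heuristics than this.

```python
def looks_like_ascii(data):
    """Return True if every byte in data is a seven-bit ASCII code."""
    return all(byte < 128 for byte in data)


def file_looks_like_ascii(path):
    """Apply the same test to the contents of a file, read as raw bytes."""
    with open(path, "rb") as handle:
        return looks_like_ascii(handle.read())


# Plain text and markup are both made of ASCII characters.
print(looks_like_ascii(b"Hello, world\n"))      # True
print(looks_like_ascii(b"<b>bold</b>"))         # True

# Made-up sample containing byte values above 127.
print(looks_like_ascii(bytes([72, 200, 255])))  # False
```

Note that the markup sample passes the test even though it contains instructions for a program, which matches the description of markup above.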

MIME stands for Multipurpose Internet Mail Extensions, but its usefulness now extends far beyond email attachments. It basically comprises a triplet associated with a file: the actual MIME type in the form "type/subtype", an extension that identifies a file as falling within the classification, and a program that is to be used to interpret the file. Thus, the operating system can look at the extension of a filename (the part after the last dot) and use it to decide what program should be used to open the file. This is why MS Word documents end with ".doc", for example.
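The triplet described above can be sketched as a small lookup table. The table ASSOCIATIONS, the program names in it, and the function lookup are hypothetical and only illustrate the idea; a real operating system keeps this information in its own configuration. The three MIME types shown are standard ones.

```python
import os

# Hypothetical association table: extension -> (MIME type, program).
ASSOCIATIONS = {
    ".txt": ("text/plain", "notepad"),
    ".html": ("text/html", "web browser"),
    ".doc": ("application/msword", "MS Word"),
}


def lookup(filename):
    """Return (MIME type, program) for a filename, or None if unknown."""
    # splitext splits off the part after the last dot, e.g. ".doc".
    _, extension = os.path.splitext(filename)
    return ASSOCIATIONS.get(extension.lower())


print(lookup("report.doc"))   # ('application/msword', 'MS Word')
print(lookup("index.HTML"))   # ('text/html', 'web browser')
print(lookup("notes"))        # None
```

Python's standard library also has a mimetypes module that handles the extension-to-type part of this lookup.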