Before putting file I/O into practice, we think a general introduction to file organization and management on the different operating platforms is needed. This chapter covers general aspects of file formats and access methods.
Computers organize data in what are commonly called files. Files can contain all sorts of data, ranging from a list of names and related telephone numbers (a telephone directory), through an electronic note or help manual, to program source code or even executable program instructions.
In the rest of this course, we call a text file any file that contains only readable characters (such as our electronic note or program source)(footnote 1). In contrast, the term binary file is widely used when the contents of the file look like garbage when printed or displayed on the screen without being pre-processed by an appropriate program. Examples of such files are executable programs, but also files containing, for example, the data for spreadsheet programs (Lotus 123, Excel...). Indeed, these files will not produce meaningful output when printed from a plain DOS or OS/2 session.
Another way to define text files is to state that these files are readable when edited on the screen using a line or full-screen editor (such as XEDIT, ISPF/PDF EDIT, EPM, EDLIN, Personal Editor, KEDIT, E3, VI, etc.). In contrast to these editors are the so-called WYSIWYG (What You See Is What You Get) word processors (e.g. Lotus WordPro, Word for Windows, DisplayWrite, etc.). Documents produced by these programs must in most cases be considered binary files, as they contain, along with the text, lots of unreadable control codes defining the document characteristics, such as page layout, font definitions, etc.
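The distinction between text and binary files can be illustrated with a small Python sketch. It is only a rough heuristic under an assumption of ours: "readable" is taken to mean printable ASCII characters plus the usual whitespace controls. The function name and the sample data are hypothetical.

```python
import string

# Assumption: "readable" means printable ASCII plus common whitespace controls.
READABLE = set(string.printable.encode("ascii"))

def looks_like_text(data: bytes) -> bool:
    """Return True if every byte in data is a readable character."""
    return all(byte in READABLE for byte in data)

note = b"Call Ann at 555-0100\r\n"          # an electronic note
program = bytes([0x4D, 0x5A, 0x90, 0x00])   # made-up bytes standing in for machine code

print(looks_like_text(note))
print(looks_like_text(program))
```

A real editor or file utility would have to take the codepage in use into account; this sketch deliberately ignores accented characters.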
A computer deals with this data through programs (e.g. the telephone list can be printed by a print program, the source code can be translated into another file containing machine instructions by a compiler, an executable program file can be loaded into storage by the program loader). The programs have to know the organization and the meaning of the data in the files in order to produce correct output. You all know that it makes no sense to print an executable program file. Similarly, it is impossible to execute a telephone directory.
Files can be organized in several different ways. The most common file organizations are:
Sequential | The data is a continuous list of bytes or records. The data records may have been sorted before they were stored in the file. In most cases, the files are processed sequentially from the first byte or record up to the last byte or record (hence the name). We will see that in some cases direct access to the data is possible. Text files are sequential, but lots of other files are sequential too. |
Direct access | A special form of sequential files, where records can be accessed directly by means of their sequence number in the file (record number or position on DASD). An example of direct access organization is the RRDS (Relative Record Data Set) organization in VSAM. In most cases, the records in these files must be all of the same length (fixed record length, or fixed records for short). |
Indexed | A separate index file or control block is related to a sequential base file. This index contains the key information used to extract the data (e.g. client number) and a pointer to, or the record number of, the base record. Multiple indexes can be created for the same base file, allowing the data to be accessed through different views. Our telephone directory could have a name index (all names are stored in the index in alphabetical order) to allow retrieval by name, and a phone index (all phone numbers in sort order) to allow finding the name corresponding to a phone number. The KSDS (Key Sequenced Data Set) organization of the VSAM access method is one example of indexed files. The .NDX files of dBase III on the PC are another example of index files containing the pointers to the base .DBF file. |
Relational | We could say that this is a more elaborate form of indexed or direct access files. It would take us too far to try to explain relational file organizations, and we will not use them in this course. Examples of programs working with such files are DB/2 or Oracle. |
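The direct access and indexed organizations described in the table can be sketched in Python as follows. This is an illustration only: the record length of 32 bytes, the record layout and the function names are assumptions of ours, and an in-memory buffer stands in for a real file.

```python
import io

RECORD_LENGTH = 32  # hypothetical fixed record length in bytes

def make_record(name: str, phone: str) -> bytes:
    """Pad 'name,phone' with blanks to exactly RECORD_LENGTH bytes."""
    return f"{name},{phone}".encode("ascii").ljust(RECORD_LENGTH)

def read_record(f, number: int) -> bytes:
    """Direct access: position straight at record `number` (0-based)."""
    f.seek(number * RECORD_LENGTH)
    return f.read(RECORD_LENGTH)

entries = [("Smith", "555-0101"), ("Jones", "555-0102"), ("Adams", "555-0103")]

# The base file: fixed-length records stored one after the other.
base = io.BytesIO(b"".join(make_record(n, p) for n, p in entries))

# The index: key (name) -> record number in the base file.
name_index = {name: number for number, (name, _) in enumerate(entries)}

record = read_record(base, name_index["Jones"])
print(record.decode("ascii").rstrip())
```

Because all records have the same length, the position of a record is simply its number multiplied by the record length, which is why direct access files usually require fixed records.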
In this course we will focus only on sequential files, although some forms of direct access may be possible.
In this document, we use the term host system to designate systems running mainframe operating systems, such as OS/390, VM/ESA, VSE/ESA or OS/400. On the other side, we use the term personal systems as a generic name for operating systems such as OS/2, Windows and Mac-OS, but also all flavors of Unix.
This distinction is made only for the purpose of this course, as the two sides deal with files in several different ways. We could have spoken of EBCDIC versus ASCII systems instead, but those acronyms first need to be explained, which is why we opted for the generic names above.
The first difference that comes to mind is the different bit-pattern assignments, or codepages, used on the systems. In the early days, two sets were agreed upon: IBM hosts chose the EBCDIC character set (8 bits), while personal systems opted for the ASCII character set (originally 7 bits).
EBCDIC in turn was the successor of the BCDIC character set (6 bits), while, with the advent of the IBM Personal Computer in the early 1980s, the 7-bit ASCII set was extended to an 8-bit set to provide for foreign language characters (accented characters) and other graphics.
The character A, for example, has the value 'C1'X in EBCDIC, while it has a value '41'X in ASCII.
Nowadays, both the EBCDIC and ASCII character sets have multiple versions, called codepages, to adapt to different national languages.
This means that when data is transferred between a host system and a personal system, character translations have to be performed. Whenever this document shows translations from ASCII to EBCDIC or vice versa, we assume that the host uses codepage 500 (international EBCDIC codepage), while the personal system uses codepage 850 (international ASCII codepage).
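As an illustration of such a translation, here is a minimal Python sketch. It relies on the cp500 and cp850 codecs that ship with Python; the function names are ours and are not part of any transfer tool discussed in this course.

```python
def host_to_personal(data: bytes) -> bytes:
    """Translate EBCDIC (codepage 500) bytes to ASCII (codepage 850) bytes."""
    return data.decode("cp500").encode("cp850")

def personal_to_host(data: bytes) -> bytes:
    """Translate ASCII (codepage 850) bytes to EBCDIC (codepage 500) bytes."""
    return data.decode("cp850").encode("cp500")

# The character A has a different bit pattern on each side.
print("A".encode("cp500").hex().upper())   # C1
print("A".encode("cp850").hex().upper())   # 41

ebcdic = bytes.fromhex("C8C5D3D3D6")       # "HELLO" in EBCDIC
print(host_to_personal(ebcdic))            # b'HELLO'
```

Only character data should be translated this way; applying it to binary data would corrupt the file.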
A second difference between host and personal systems is that host systems are called record oriented, while the personal systems are called byte or stream oriented.
This difference has historical reasons. When IBM announced the System/360 hardware (back in 1964), the input/output of these systems was built around card readers/punches and line printers. The unit of data was therefore a punch card image (80 bytes) or a print line (121 characters), the so-called records. At the same time, magnetic DASDs were introduced (2311...). In order to preserve this notion of record, the DASDs were organized in records too, and to avoid losing too much space to inter-record gaps, records were often grouped in blocks on tape or DASD. In general, when a program on a host system deals with data files, it deals with records, not single bytes.
This raised the complexity of parts of the systems, but had the major advantage that buffering techniques could be used whereby the processor is only interrupted when one or more records (a buffer) have been transferred, thereby allowing a much higher multiprogramming level.
The host systems also implemented, right from the beginning, the access methods, pieces of software that we now commonly call middleware. These relieve the application programs of the burden of accessing the data. Examples of access methods are SAM (Sequential Access Method), VSAM (Virtual Storage Access Method), ISAM (Indexed Sequential Access Method), DAM (Direct Access Method) and DB/2. Programs call for the services of these access methods, which in turn understand the organization of their files on external media.
Personal computers, however, were (and still are) built around cheap components. Whereas DASDs have a similar physical organization on both mainframes and PCs (e.g. tracks are organized in sectors), PC software does not have those standard access methods implemented in the system. The PC operating system only provides information on file allocations (e.g. the starting sector and the number of sectors), so that files are to be considered merely as a long string of bytes.
On personal computers, the internal organization of the files is normally managed by the programs themselves. A document produced by the now popular Word product can only be managed by that product itself. Competing word processors have their own internal organization, but in order to gain market share, they may have routines that understand other formats. If you want to look at or print the document, you first have to start the program and open the file from there. If you browse or print the file from the operating system's command line, you will get only unreadable garbage.
The only case where the PC operating system (and, as we will see, also utilities like REXX) helps the user in managing the structure of the files is the case of text files, as we have called them. These are sequential or flat files, containing a series of lines or records. To separate these records from each other, special control characters are inserted in between them. On personal systems, these control characters are '0D0A'x or CRLF characters. The original choice for these characters comes from the fact that code '0D'x means Carriage Return (CR) for the screen and the printer, and hence instructs those devices to return the cursor or print head to the left margin. The '0A'x character stands for the Line Feed (LF) code, instructing the devices to advance to the next line.
So, when such files are printed, or displayed to the screen, the devices interpret the CRLF characters to separate the records. Text editors (e.g. NOTEPAD, EPM, EDLIN) automatically add the CRLF characters when writing the records to disk.
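The role of the CRLF characters can be sketched in Python as follows. The two function names are hypothetical; they only illustrate how a string of bytes is cut into records and how an editor might put the separators back when writing.

```python
CRLF = b"\x0d\x0a"  # Carriage Return followed by Line Feed

def split_records(data: bytes) -> list:
    """Cut a byte string into records at each CRLF."""
    records = data.split(CRLF)
    # A file that ends with CRLF yields one empty trailing piece; drop it.
    if records and records[-1] == b"":
        records.pop()
    return records

def join_records(records) -> bytes:
    """Write records back, adding CRLF after each one as an editor would."""
    return b"".join(record + CRLF for record in records)

data = b"first line\r\nsecond line\r\n"
print(split_records(data))
print(join_records([b"first line", b"second line"]) == data)
```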
Due to the differences in hardware implementation, host systems always buffer data in blocks or records (multiple bytes). Personal systems on the contrary treat data one byte at a time. To try to explain why this distinction is made, let's look at how the user interacts with the system through the keyboard and the screen.
On the host, when you edit a file, you can change the complete screen surface without the computer knowing what you are doing. Even more, the actions you take on the keyboard are not seen by the system, unless you hit the Enter, a PF- or PA-key or the Clear key. At that time, the data that is on your screen is transferred to the host. In fact, the characters you see on the screen are in the storage buffer of a control unit (or a 3270 emulator) and the Enter-key signals the control unit to transfer the data to the host. So, one buffer (record) full of data is sent to the host in one operation. The hardware architecture on the host was designed in such a way that the processor is not busy while the user is editing the file. In a similar way, the input/output to a DASD is for the most part handled asynchronously from the processor by the channel subsystem and DASD control units.
The personal computer architecture, on the other hand, was designed in a totally different way. Most elements of the system (screen, keyboard, processor, drives and storage) are linked together via common wiring, called a bus. The screen controller, for example, has direct memory access (DMA) to one specific part of the system's storage. On the other side, your editing program can write into that same storage. The result is that as soon as you write a character to the screen storage area, this character will appear on the editor screen. Similarly, as soon as you hit a character key on the keyboard, the character can be read by the processor and stored in the screen storage buffer. Hence it appears on the screen immediately(footnote 2).
Another way of explaining would be to compare mail and telephone. Hosts tend to work as the mail office works. You write your letter, put it in an envelope (record) and post it. Then it is picked up by the postman, carried to another town or country by road, rail or air and distributed again by a postman. Computers tend to do this at electronic speed of course.
If you phone somebody, the words (continuous series of bytes) are heard at the other end of the line at (almost) the same time you pronounce them. This is the personal computer way of communicating.
Each operating system has its own way of organizing files. We will study here how this is implemented for sequential files.
On personal systems, files are stored as one long string of bytes. For text files, the logical records are separated by CRLF, as we have seen earlier. We should add that some editors or programs dealing with sequential text files consider the X'1A' character to be the end-of-file indicator.
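A program that wants to honor the X'1A' convention could do so along the lines of this Python sketch. The function name is hypothetical, and whether a given program stops at X'1A' at all depends on that program.

```python
EOF_MARK = b"\x1a"  # the end-of-file indicator used by some programs

def strip_eof_marker(data: bytes) -> bytes:
    """Return the data up to, but not including, the first X'1A' byte."""
    position = data.find(EOF_MARK)
    return data if position == -1 else data[:position]

raw = b"line 1\r\nline 2\r\n\x1aleftover bytes"
print(strip_eof_marker(raw))
```

Note that this must only be applied to text files: in a binary file a byte with the value X'1A' is just data.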
The files are stored on diskettes or hard disks one sector at a time, and no two files share the same sector. The file identifications are stored in a separate area on the external medium (e.g. the FAT in PC DOS).
All files stored on CMS formatted minidisks could be considered to be sequential, but have also direct access possibilities as we will see.
In CMS we can distinguish between files with fixed length and variable length records.
When you format a minidisk in CMS, blocks with a fixed size (512, 1K, 2K or 4K) are created. The files are stored in the blocks, and a file record can span more than one block. One block however contains data of only one file.
The file identifications are stored in separate control blocks on the minidisk. We already mentioned the File Status Table on several occasions. It merely contains what you get when you issue the FILELIST command, but in addition, it contains a pointer to the index blocks for the file. These index blocks contain:
For variable files, CMS has to keep the length of each record. This length is stored in a 2-byte field, which limits the length of a record to 64K. Fixed length records don't have that limit (their actual limit is 2 Gigabytes per record, as imposed by the architectural addressability limit of 31 bits).
What's most important to remember here is that CMS can directly access each record individually without having to read all preceding blocks.
On the other host systems, files are defined in the VTOC (Volume Table Of Contents). The record length is however not stored in that table; instead, each record is prefixed with a 2-byte field indicating the length of the record.
Furthermore, files are stored in a pre-allocated area (extent) on the DASD and the records are physically consecutive. If records are grouped in blocks, then the length of the block is also indicated in a 2-byte prefix and blocks are limited to 32K.
As a consequence, these systems cannot access specific records directly, unless they first read all preceding blocks.
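Why a length prefix forces sequential reading can be shown with a simplified Python sketch. We assume here that the 2-byte prefix is a big-endian binary number counting only the data bytes that follow; the real on-disk layout differs in its details, and the function names are hypothetical.

```python
import struct

def pack_records(records) -> bytes:
    """Prefix each record with its length in a 2-byte big-endian field."""
    return b"".join(struct.pack(">H", len(r)) + r for r in records)

def read_record(data: bytes, wanted: int) -> bytes:
    """Return record `wanted` (0-based) by skipping over all preceding records."""
    offset = 0
    for _ in range(wanted):
        # The start of the next record is only known after reading this prefix.
        (length,) = struct.unpack_from(">H", data, offset)
        offset += 2 + length
    (length,) = struct.unpack_from(">H", data, offset)
    return data[offset + 2 : offset + 2 + length]

stored = pack_records([b"short", b"a somewhat longer record", b"end"])
print(read_record(stored, 2))
```

Since the records differ in length, the position of record 2 cannot be computed in advance; the loop has to visit the prefixes of records 0 and 1 first.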
| Hosts | Personal |
---|---|---|
File access | record oriented | byte oriented |
Character set | EBCDIC | ASCII |
This ends our general introduction on file organizations. Chapter 13 introduces the different tools that allow reading and writing files from REXX.
(1)
With VM/ESA, the term text is also used when referring to the
object decks, outputs of compilers. It comes from the fact that the files
that contain this object have a filetype TEXT.
In this document, we don't mean this type
of files when we speak about text files.
(2)
This is a somewhat simplified view of what happens in reality, but this is how
it is perceived by the user.
(3)
For more information on the file control blocks of CMS, see the
CMS Diagnosis Reference (LY24-5244).