Slide Ads

Wednesday, July 15, 2009

Data File Metaformats -DSV Style-RFC 822 Format

A data file metaformat is a set of syntactic and lexical conventions that is either formally standardized or sufficiently well established by practice that there are standard service libraries to handle marshaling and unmarshaling it.

Unix has evolved or adopted metaformats suitable for a wide range of applications. It is good practice to use one of these (rather than an idiosyncratic custom format) wherever possible. The benefits begin with the amount of custom parsing and generation code that you may be able to avoid writing by using a service library. But the most important benefit is that developers and even many users will instantly recognize these formats and feel comfortable with them, which reduces the friction costs of learning new programs.

In the following discussion, when we refer to "traditional Unix tools" we are intending the combination of grep(1), sed(1), awk(1), tr(1), and cut(1) for doing text searches and transformations. Perl and other scripting languages tend to have good native support for parsing the line-oriented formats that these tools encourage.

Here, then, are the standard formats that can serve you as models.

5.2.1 DSV Style

DSV stands for Delimiter-Separated Values. Our first case study in textual metaformats was the /etc/passwd file, which is a DSV format with colon as the value separator. Under Unix, colon is the default separator for DSV formats in which the field values may contain whitespace.

/etc/passwd format (one record per line, colon-separated fields) is very traditional under Unix and frequently used for tabular data. Other classic examples include the /etc/group file describing security groups and the /etc/inittab file used to control startup and shutdown of Unix service programs at different run levels of the operating system.

Data files in this style are expected to support inclusion of colons in the data fields by backslash escaping. More generally, code that reads them is expected to support record continuation by ignoring backslash-escaped newlines, and to allow embedding nonprintable character data by C-style backslash escapes.

This format is most appropriate when the data is tabular, keyed by a name (in the first field), and records are typically short (less than 80 characters long). It works well with traditional Unix tools.

One occasionally sees field separators other than the colon, such as the pipe character | or even an ASCII NUL. Old-school Unix practice used to favor tabs, a preference reflected in the defaults for cut(1) and paste(1); but this has gradually changed as format designers became aware of the many small irritations that ensue from the fact that tabs and spaces are not visually distinguishable.

This format is to Unix what CSV (comma-separated value) format is under Microsoft Windows and elsewhere outside the Unix world. CSV (fields separated by commas, double quotes used to escape commas, no continuation lines) is rarely found under Unix.

In fact, the Microsoft version of CSV is a textbook example of how not to design a textual file format. Its problems begin with the case in which the separator character (in this case, a comma) is found inside a field. The Unix way would be to simply escape the separator with a backslash, and have a double escape represent a literal backslash. This design gives us a single special case (the escape character) to check for when parsing the file, and only a single action when the escape is found (treat the following character as a literal). The latter conveniently not only handles the separator character, but gives us a way to handle the escape character and newlines for free. CSV, on the other hand, encloses the entire field in double quotes if it contains the separator. If the field contains double quotes, it must also be enclosed in double quotes, and the individual double quotes in the field must themselves be repeated twice to indicate that they don't end the field.

The bad results of proliferating special cases are twofold. First, the complexity of the parser (and its vulnerability to bugs) is increased. Second, because the format rules are complex and underspecified, different implementations diverge in their handling of edge cases. Sometimes continuation lines are supported, by starting the last field of the line with an unterminated double quote—but only in some products! Microsoft has incompatible versions of CSV files between its own applications, and in some cases between different versions of the same application (Excel being the obvious example here).

5.2.2 RFC 822 Format

The RFC 822 metaformat derives from the textual format of Internet electronic mail messages; RFC 822 is the principal Internet RFC describing this format (since superseded by RFC 2822). MIME (Multipurpose Internet Media Extension) provides a way to embed typed binary data within RFC-822-format messages. (Web searches on either of these names will turn up the relevant standards.)

In this metaformat, record attributes are stored one per line, named by tokens resembling mail header-field names and terminated with a colon followed by whitespace. Field names do not contain whitespace; conventionally a dash is substituted instead. The attribute value is the entire remainder of the line, exclusive of trailing whitespace and newline. A physical line that begins with tab or whitespace is interpreted as a continuation of the current logical line. A blank line may be interpreted either as a record terminator or as an indication that unstructured text follows.

Under Unix, this is the traditional and preferred textual metaformat for attributed messages or anything that can be closely analogized to electronic mail. More generally, it's appropriate for records with a varying set of fields in which the hierarchy of data is flat (no recursion or tree structure).

Usenet news uses it; so do the HTTP 1.1 (and later) formats used by the World Wide Web. It is very convenient for editing by humans. Traditional Unix search tools are still good for attribute searches, though finding record boundaries will be a little more work than in a record-per-line format.

One weakness of RFC 822 format is that when more than one RFC 822 message or record is put in a file, the record boundaries may not be obvious—how is a poor literal-minded computer to know where the unstructured text body of a message ends and the next header begins? Historically, there have been several different conventions for delimiting messages in mailboxes. The oldest and most widely supported, leading each message with a line that begins with the string "From " and sender information, is not appropriate for other kinds of records; it also requires that lines in message text beginning with "From " be escaped (typically with >)—a practice which not infrequently leads to confusion.

Some mail systems use delimiter lines consisting of control characters unlikely to appear in messages, such as several ASCII 01 (control-A) characters in succession. The MIME standard gets around the problem by including an explicit message length in the header, but this is a fragile solution which is very likely to break if messages are ever manually edited. For a somewhat better solution, see the record-jar style described later in this chapter.

For examples of RFC 822 format, look in your mailbox.

No comments: