DAT Format (Papyrus)
The DAT format is used to store data for some early games developed by Papyrus Design Group, including Nomad and several racing simulators. Unfortunately, the format has no signature and a generic filename extension (.DAT), so automatic detection of these files is difficult.
There are two versions of the Papyrus DAT format that differ slightly, as explained below. Version 1 is used by Riders of Rohan and Nomad, and Version 2 by all the racing simulators.
Because there is no signature, one method to identify .DAT files is to read in the contained file entries (failing if the end of the file is reached early) and confirm that each filename consists only of ASCII characters or nulls, and that each file offset plus its size does not exceed the total size of the DAT file. It is extremely unlikely that this method would incorrectly identify a file.
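The heuristic above can be sketched in Python roughly as follows. This is only an illustration: the function name `looks_like_papyrus_dat` is hypothetical, the caller is assumed to supply the version to test (1 or 2), the entry layout is the one given in the index table in this article, and "size" is taken to mean the stored (compressed) size.

```python
import struct

def looks_like_papyrus_dat(data: bytes, version: int) -> bool:
    """Heuristic check: True if `data` parses as a plausible Papyrus DAT index."""
    name_len = 14 if version == 1 else 13
    # flags, uncompressed_size, compressed_size, filename, offset
    entry = struct.Struct(f"<HII{name_len}sI")
    if len(data) < 2:
        return False
    (num_files,) = struct.unpack_from("<H", data, 0)
    if 2 + num_files * entry.size > len(data):
        return False  # the index itself would run past the end of the file
    for i in range(num_files):
        _flags, _usize, csize, name, offset = entry.unpack_from(
            data, 2 + i * entry.size)
        if any(b > 0x7F for b in name):
            return False  # filename field holds a non-ASCII byte
        if offset + csize > len(data):
            return False  # stored data would extend past the end of the file
    return True
```

A caller that does not know the version in advance could try Version 1 and then Version 2 and accept whichever passes.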
The file starts with a single field indicating the number of files present.
|UINT16LE numFiles||File count|
After the two-byte header, the following structure is repeated numFiles times. As each entry includes both an offset and a size, it is possible (in theory) for two different files to share the same data, and to insert “hidden” data between files.
Note that the total length of this structure is either 28 bytes (for Version 1 of this format) or 27 bytes (for Version 2).
|UINT16LE flags||For Version 1 of the format: Indicates whether compression is used, and whether a compressed file is prefixed by uncompressed header words. In Version 2, neither of these features is supported.|
|UINT32LE uncompressed_size||The total size of the contained file after it is decompressed|
|UINT32LE compressed_size||The size of the contained file as it is stored compressed in the archive. For Version 2 archives, this will always match the uncompressed_size field.|
|BYTE filename[13 or 14]||Filename (8.3, 12 chars including dot, null padded to 14 chars in Version 1, and null padded to 13 chars in Version 2)|
|UINT32LE offset||Offset within the .DAT archive at which this file's data starts|
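As an illustration of the layout above, here is a minimal Python sketch that reads the file count and the index entries. The names `DatEntry` and `parse_index` are hypothetical, and the caller is assumed to know which version of the format is in use.

```python
import struct
from typing import NamedTuple

class DatEntry(NamedTuple):
    flags: int
    uncompressed_size: int
    compressed_size: int
    filename: str
    offset: int

def parse_index(data: bytes, version: int) -> list[DatEntry]:
    """Parse the two-byte header and the index entries that follow it."""
    name_len = 14 if version == 1 else 13
    entry = struct.Struct(f"<HII{name_len}sI")  # little-endian, no padding
    # Sanity check against the documented entry sizes (28 or 27 bytes).
    assert entry.size == (28 if version == 1 else 27)
    (num_files,) = struct.unpack_from("<H", data, 0)
    entries = []
    for i in range(num_files):
        flags, usize, csize, raw_name, offset = entry.unpack_from(
            data, 2 + i * entry.size)
        # The filename is null padded; keep only the part before the first null.
        name = raw_name.split(b"\0", 1)[0].decode("ascii")
        entries.append(DatEntry(flags, usize, csize, name, offset))
    return entries
```

For an uncompressed entry `e`, the file content is then simply `data[e.offset:e.offset + e.compressed_size]`.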
Compression (Version 1 only)
At least two of the bits in flags are important for indicating the compression method used to store the file:
- flags.8 indicates whether the file is stored with the LZ compression algorithm described below. If this bit is clear, the uncompressed_size and compressed_size fields must match. In this case, the file is uncompressed and its content may be copied byte-for-byte from the .DAT. It appears that Riders of Rohan and Nomad make use of compression, while later games do not.
- flags.2 indicates whether the file is stored with uncompressed header data. If this bit is set, the entirety of the file's data is stored with LZ compression. If this bit is clear, the first two words (four bytes) are uncompressed and may be copied directly from the .DAT, leaving the remaining data to pass through LZ decompression. Note that the compressed_size field does not include this four-byte header, and as a result, the next file in the .DAT archive will start at offset + compressed_size + 4.
- This uncompressed-header feature seems to only be used for the raw VGA fullscreen images (which are stored with .lbm extensions). In this case, the two words of header data contain the width and height of the image file (320 x 200, or 40 01 C8 00).
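The flag handling described in this list might be sketched as follows. Several things here are assumptions rather than facts from the format itself: "flags.8" and "flags.2" are read as zero-based bit numbers (masks 0100h and 0004h), the four-byte header is assumed to exist only when the file is also compressed, and the names `COMPRESSED`, `NO_HEADER` and `stored_layout` are hypothetical.

```python
COMPRESSED = 1 << 8  # "flags.8", assuming zero-based bit numbering
NO_HEADER = 1 << 2   # "flags.2": set means the whole file is LZ compressed

def stored_layout(flags: int, offset: int, compressed_size: int):
    """Return (is_compressed, header_len, next_offset) for a Version 1 entry."""
    is_compressed = bool(flags & COMPRESSED)
    # Assumption: the uncompressed header words only occur on compressed files.
    header_len = 4 if is_compressed and not (flags & NO_HEADER) else 0
    # compressed_size excludes the header, so the next file starts after both.
    next_offset = offset + header_len + compressed_size
    return is_compressed, header_len, next_offset
```

With this layout, the header bytes (if any) sit at `offset`, and the LZ stream starts at `offset + header_len`.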
The data for the first contained file starts immediately after the last index entry.
The compressed files themselves are individually compressed with a modified 8-bit LZ algorithm. Data is grouped into chunks, and each chunk consists of either a literal byte or a back-reference to a string of bytes that was previously encountered during decompression. In this way, byte sequences that repeat can be represented by an abbreviated codeword, thereby saving space.
In addition to the output stream, the algorithm maintains a 4096-byte ring buffer that provides the source of data for the back-references. This buffer must be initialized with all bytes set to 20h, and the offset pointer into the buffer set to FEEh. This pointer value was chosen because it is 18 bytes from the end of the buffer, and 18 is the maximum length of a single back-reference sequence (as described below). Therefore, at least one chunk will be decoded before the pointer wraps back to the start.
In the compressed source data, each sequence of eight chunks is preceded by a flag byte, in which each bit indicates the nature of one of the following chunks:
- 1: the chunk is a literal single byte, to be copied directly from the compressed source
- 0: the chunk is a two-byte coded back-reference into the ring buffer
These bits are checked in order of LSB to MSB. For example, a flag byte of F7h means that the next three chunks are literal, followed by a single reference, then four more literals.
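The bit order can be illustrated with a short Python sketch; the helper name `chunk_kinds` is hypothetical.

```python
def chunk_kinds(flag_byte: int) -> list[str]:
    """List the kind of each of the eight chunks, checking bits LSB first."""
    return ["literal" if (flag_byte >> bit) & 1 else "reference"
            for bit in range(8)]

# F7h is 11110111 in binary, so only bit 3 is clear.
print(chunk_kinds(0xF7))
# ['literal', 'literal', 'literal', 'reference',
#  'literal', 'literal', 'literal', 'literal']
```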
As each byte is copied to the output, it is also copied to the 4K ring buffer. Whenever data is being read from or written to this buffer, the pointer is reset to position 0 when it reaches position 1000h.
To decode a back-reference, read the two-byte sequence as a little-endian 16-bit word. It contains two fields:
- Byte count (bits 15:12): This four-bit count is actually the length minus three. Because the coded sequence itself occupies two bytes, there is no gain in encoding any fewer than three source bytes. A value of 0 in this field indicates a length of 3, and the max value of Fh indicates a length of 18 (12h).
- Offset into ring buffer (bits 11:0): A zero-based offset from the start of the 4096-byte ring buffer at which to begin reading the number of bytes specified in the byte count field. As noted, the decoder must start filling this buffer from offset FEEh, so the first few back-references seen in a compressed file may be in this higher range (FEEh to FFFh).
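Putting the pieces together, a decoder along these lines might look like the following Python sketch. It is based only on the description in this article and has not been checked against real game data. The function name `lz_decompress` is hypothetical, and the stopping rule (stop once uncompressed_size bytes have been produced or the input runs out) is an assumption.

```python
def lz_decompress(src: bytes, uncompressed_size: int) -> bytes:
    """Decode one LZ stream as described above (sketch, not verified on real files)."""
    ring = bytearray(b"\x20" * 0x1000)  # 4096-byte ring buffer filled with 20h
    pos = 0xFEE                         # initial write pointer
    out = bytearray()
    i = 0

    def emit(byte: int) -> None:
        nonlocal pos
        out.append(byte)
        ring[pos] = byte
        pos = (pos + 1) & 0xFFF         # wrap to 0 on reaching 1000h

    while len(out) < uncompressed_size and i < len(src):
        flag_byte = src[i]
        i += 1
        for bit in range(8):            # LSB first
            if len(out) >= uncompressed_size or i >= len(src):
                break
            if (flag_byte >> bit) & 1:  # literal byte
                emit(src[i])
                i += 1
            else:                       # two-byte back-reference
                if i + 1 >= len(src):
                    raise ValueError("truncated back-reference")
                word = src[i] | (src[i + 1] << 8)
                i += 2
                length = (word >> 12) + 3   # bits 15:12 hold length minus 3
                offset = word & 0xFFF       # bits 11:0 hold the ring offset
                for k in range(length):
                    if len(out) >= uncompressed_size:
                        break
                    emit(ring[(offset + k) & 0xFFF])
    return bytes(out)
```

For example, the six input bytes 07 41 42 43 EE 0F decode to "ABCABC": three literals are written to the ring buffer at FEEh through FF0h, and the back-reference 0FEEh then copies three bytes starting at FEEh.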