id Software RLEW compression

From ModdingWiki
Jump to navigation Jump to search

id Software RLEW compression was used in a number of id's early games. It is an RLE Compression algorithm that works at the word level, where each word is a group of two bytes (16 bits). The algorithm is used on files that themselves use 16-bit words as their basic units, such as game levels, where a single 16-bit value such as "12 34" might be repeated multiple times in a row. A byte-level RLE algorithm would not be able to compress this data at all, as each byte is different to the previous, however the same RLE algorithm operating at the 16-bit word level can achieve good compression.

The use of 16-bit values over 8-bit also yields other benefits, such as the maximum repetition count increasing from around 256 to 65,536, allowing even greater compression of data with large numbers of repeated values. The RLE code word can also be chosen to be less likely to appear in the data, minimising the number of escapes required. While 8-bit RLE has a 1 in 256 chance of a given byte needing to be escaped, 16-bit RLE reduces that to 1 in 65,536.

RLEW compression is almost exclusively used to compress level data. Most revisions of the early game engines by id Software used arrays of 16 bit values to store the level data.

Algorithm

The algorithm is very straightforward. Every 16-bit value is passed through unchanged until an RLE code value is encountered. Typically this is 0xFEFE or 0xABCD (the latter stored as the two bytes 0xCD 0xAB in little-endian order). When this code word is read, instead of writing it out like the other values, two further words are read. The first is the number of times to repeat the word, and the second is the value to be repeated.

All 16-bit words are in little-endian order, although in practice only the code word and the repeat count need to be treated as little-endian. The other values can be read in either endian so long as they are read and written unchanged.

Data is usually read from the start of a file to the end, but many implementations have the size of the decompressed data as a UINT16LE at the beginning of the file. This should be used where possible so that trailing data at the end of the file does not inadvertently get read as input data.

A simple algorithm for decompressing this format is as follows;

  1. If implemented, get the first word in the file (here called finalLength)
  2. While the length of the output data is less than finalLength:
    1. Read a word from the input data
    2. Is this word the RLE code word? (e.g. 0xFEFE)
      • If yes;
        1. Get the next two words, the first as Count and the second as Value.
        2. Write Value as a word, Count times.
        3. Move forward three words and go to 2.
      • If no;
        1. Write the word to the output.
        2. Move forward a word and go to 2.

Minimal compression

As only the code word needs to be handled specially, a quick compression function can be implemented by passing through all values as-is, and if the code word has just been written then it is followed by the additional words 0x0001 and the code word again. For example if the code word is 0xFEFE, then it will end up in the output as FEFE 0001 FEFE, meaning "RLE sequence begins: write 0xFEFE once".

Inefficiencies and hidden data

It is interesting to note that as only the RLEW code word needs to be escaped, no other word would ever need to be repeated only once. This means the escape sequence FEFE 0001 FEFE could have been shortened to FEFE 0001 by omitting the field containing the word to repeat, with the assumption that the only word ever written just once is the RLEW code word itself.

Furthermore, the Count field in the RLE sequence is set to 1 to write the word once. This means it can be set to 0 to write the word zero times, effectively discarding that word. In theory this could allow "hidden" data to be embedded in the compressed data stream that would be skipped over by the decompressor and not affect the output decompressed data.

Most RLE algorithms avoid this inefficiency by adding 1 (or more) to the count value before using it, meaning a count of 0 writes the word once, a count of 1 writes it twice, and so on. In this case, as the RLE sequence takes up three words, no space is saved until at least four words are converted into an RLE code. This means adding 3 to the count would have been most efficient, with a count of 0 reserved as a special case for handling the escaping of the code word itself. Thus 0 would write the code word, 1 would repeat the word 4 times, 2 would repeat it 5 times, and so on. Words appearing 2 or 3 times in a row would be written out as-is without incurring any space penalty, with the sole exception of the RLEW code word appearing 2 or 3 times in a row, which would incur a small space penalty.

RLEB compression

This is almost exactly the same as RLEW compression. The only difference is that it operates on bytes instead of words. The flag value for this compression is $FE in most cases.

Dangerous Dave 2, Rescue Rover and Shadow Knights use this method to compress some of their full-screen images. Rescue Rover and Shadow Knights also use this method to compress the game's sprite graphics.

Example code

  • Javascript: the cmp-rlew-id algorithm in gamecomp.js (full compression and decompression)