Carmack compression

From ModdingWiki
Jump to navigation Jump to search
Carmack compression
Format typeCompression algorithm
TypeStream
I/O unit size1-byte
Games

Carmack compression is used in the GAMEMAPS file in a number of games to further shrink the levels down beyond what RLEW compression can achieve. Its basic idea is somewhat like LZ (Lempel-Ziv) compression in that it contains pointers back to previous data.

As in RLEW compression, the first word in the Carmack compressed data is the number of bytes (not words) in the decompressed data. This is typically the number of bytes in the compressed RLEW data, as Carmack compression is performed after RLEW compression.

The format specifies some values in words - these are 16-bits in length (UINT16LE).

Carmack compression contains two types of references to previous data: near pointers and far pointers.

Near Pointers

Near pointers occupy three bytes in the compressed data. The first is the number of words in the referenced sequence, the second is the signal byte of 0xA7, and the third is the number of words to the start of the reference (counting backwards from the current location). As a concrete example, the three bytes 0x05 0xA7 0x0A mean 'repeat the 5 words starting 10 words ago'.

Notice that near pointers only let one refer to the last 255 words. To refer to sequences further back, one must use far pointers.

Far Pointers

Far pointers occupy four bytes in the compressed data:

  • The first byte is the number of words in the sequence to copy from the previously generated output;
  • The second byte is the tag '0xa8'
  • The fourth and fifth bytes are an UINT16LE specifying the number of words from the beginning of output to the start of the sequence to copy. This is different from the near pointers, which are offsets from the "current" output position!

As a concrete example, the four bytes 0x10 0xA8 0x01 0x02 mean 'copy to the current output position the 16 words (or 32 bytes) starting 513 words (or 1026 bytes) from the start of the output'.

Words with a high byte of 0xA7 or 0xA8

Words whose high (second) byte is 0xA7 or 0xA8 would appear to be issue, as they would be confused with near or far pointers. These are handled by representing them as the three bytes: 00 Ax yy, where yy is the lower 8 bits of the word being escaped. This sequence is recognized as an exception, as repeating zero words would make no sense.

For example:

   00 A7 12 EE FF 00 A8 34 CC DD

Decodes to:

   12 A7 EE FF 34 A8 CC DD

Example code