Titus Interactive SQZ Compression

From ModdingWiki
Jump to navigation Jump to search
Titus Interactive SQZ Compression
Format typeCompression algorithm
TypeStream
I/O unit size1-bit out
Games

Titus Interactive SQZ Compression is an implementation of Huffman Compression and LZW compression. It is present in Titus the Fox and The Blues Brothers, developed by Titus Interactive.

Check out The Blues Brothers Huffman Compression for similar information.

File format

Data type Description
BYTE[1] sizeH bits 0..3: high nibble of uncompressed size, bits 4..7 unused
BYTE[1] compression 0x10: LZW is used, something else (0): Huffman+RLE is used
UINT16LE sizeL low word of uncompressed size
If Huffman:
UINT16LE HuffmanTreeSize Size of Huffman tree.
UINT16LE[HuffmanTreeSize / 2] HuffmanTree Huffman tree. A node is a leaf node when the most significant bit is set, otherwise it's an internal node.
BYTE[] compressedData Compressed data. The most significant bit of a byte is processed first, then the second most significant bit etc.
If LZW:
BYTE[] compressedData Compressed data.

LZW decoding

The bitstream consists of variable-length codewords; from here on, their word size is named 'nbit'. Initially, nbit == 9.

LZW makes use of a dictionary; the codewords are used as indices to perform lookups. This dictionary is not specified explicitly in the compressed stream though, instead it is derived during decompression. Initially the dictionary contains 258 entries. For entries 0..255, the value of each entry is simply the ASCII binary form of its index: dict[i] = chr(i). The remaining two entries, 0x100 and 0x101, have a special purpose and aren't used for lookups. The dictionary has a maximum size of 0x1000 entries.

The bitstream codewords can be divided in two categories. The first category is formed by the two special values mentioned above, CLEAR_CODE (0x100) and END_CODE (0x101). Whenever CLEAR_CODE is encountered, the decoder should reset its state to the initial condition, that is, set nbit to 9 and reset the dictionary to its initial 258 entries. END_CODE, as expected, marks the end of the stream.

The second category is formed by all other codeword values. For each codeword, first a new entry is optionally added to the dictionary. No entry is added immediately after a CLEAR_CODE, and neither when the dictionary is full. Next, the codeword's value is used as an index in the dictionary: the corresponding entry is output.

The value of the new dictionary entry is the output generated by the previous codeword, plus one more byte. This byte is normally the first byte of the current codeword's output. However, sometimes the current codeword points to a not-yet-existing dictionary entry. In this case, the extra byte's value is the first byte of the previous output. So, if the previous codeword generated an output of 'ABC', and the current codeword points beyond the dictionary, the new dictionary entry would be 'ABCA'.

For this to work correctly, the input stream must fulfill a few conditions, which all apply only when the current codeword points to a not-yet-existing entry. First, the previous codeword can't be a CLEAR_CODE, as there is no previous output in that case. Secondly, the dictionary cannot be full, as otherwise there is no room to add a new entry, which is needed to lookup the output. Finally, the current codeword must point to the to-be-added entry, for the same reason. It is the responsibility of the compressor to make sure these conditions hold, and they do for all LZW-encoded SQZ files in both Moktar and Titus the Fox.

If after adding an entry the number of elements in the dictionary is equal to 2^nbit, the value of nbit is incremented; unless nbit is equal to 12.

The number of unused input bits following END_CODE is at least one and at most eight, so sometimes an entirely unused input byte is present at the end.

Regarding bit-order: consider the first three input bytes of a file, and label the bits like this, with most-significant bits on the left in each byte: 76543210 FEDCBA98 NMLKJIHG then, the first two codewords (most-significant bit left) are: 76543210F EDCBA98NM

Constants: CLEAR_CODE = 0x100 (0x101 for CDRUN.COM) END_CODE = 0x101 (0x100 for CDRUN.COM) MAX_TABLE = 0x1000

	int nbit
	stringarray dict

	int prev := CLEAR_CODE
	while (prev != END_CODE) {
		if (prev == CLEAR_CODE) {
			nbit := 9
			dict := [chr(0)..chr(255), ?, ?] // 258 entries, ?=unused
		}
		int cw := next nbit-sized codeword
		if (cw != CLEAR_CODE && cw != END_CODE) {
			int newbyte
			if (cw < num_elem(dict)) {
				newbyte := first_byte(dict[cw])
			} else {
				// Assumption 1: prev != CLEAR_CODE
				// Assumption 2: num_elem(dict) < MAX_TABLE
				// Assumption 3: cw == num_elem(dict)
				newbyte := first_byte(dict[prev])
			}
			if ((prev != CLEAR_CODE) && (num_elem(dict) < MAX_TABLE)) {
				append dict[prev] ++ newbyte to dict
				if (num_elem(dict) == 2**nbit && nbit < 12) {
					nbit++
				}
			}
			output dict[cw]
		}
		prev := cw
	}

Huffman decoding

In Huffman coding a fixed binary tree structure is used, with output codewords stored at its leaves. The Huffman bitstream sequentially addresses these leaf nodes. The tree is navigated node for node, starting at the root. One bit is then read from the stream. This bit determines the next node to visit: a bit value of 0 means visit the first (left) child, where a value of 1 means take the second (right) child. Whenever a leaf node is reached, its value is output and the current node position is reset to the root.

In most Huffman implementations the tree is stored efficiently in what is known as "canonical" form. In the SQZ files however, this is not the case and the tree is stored in a very straightforward fashion, making it easy to use. The binary Huffman tree HT is stored as array of 16-bit words, where each word represents a node; each node is either an internal node or a leaf node. The value stored in the array for a node has two parts: bit 15 is set if it is a leaf node, bits 0..14 contain the node's value. For leaf nodes this value is a codeword, for internal nodes it is 2*(index of first child), or put another way: the byte offset in the array of the first child. The second child of an internal node is stored immediately after its first child.

Bit order is such that for each byte, the most significant bit is processed first.

Pseudocode for reading codewords (Huffman decoding):

	int node := 0
	while (bool b := next bit) {
		if (b) {
			node++
		}
		if (HT[node] < 0x8000) {
			// internal node
			node := HT[node] / 2
		} else {
			// leaf node
			int cw := HT[node] & 0x7FFF
			process codeword cw
			node := 0
		}
	}

This results in a sequence of RLE codewords. Each codeword 'cw' is processed in the following way:

Let L == cw mod 256 Let H == cw div 256

Codeword What to do H==0 last := L; output 'last' H!=0 && L==0 read next codeword; output cw times 'last' H!=0 && L==1 read next codeword; count := L*256; read next codeword; count += L; output count times 'last' H!=0 && L>=2 output L times 'last'

This is easily implemented using a state machine: the following pseudocode can be directly inserted in the loop above.

The pseudocode uses three pre-existing variables: int state := 0 byte last int count

Pseudocode for processing a codeword cw (RLE decoding):

	byte L := cw & 255
	byte H := cw >> 8
	if (state == 0) {
		if (H == 0) {
			last := L
			output 'last'
		} else if (L == 0) {
			state := 1
		} else if (L == 1) {
			state := 2
		} else {
			output L times 'last'
		}
	} else if (state == 1) { // cw == repeat count
		output cw times 'last'
		state := 0
	} else if (state == 2) { // L == high byte of 'count'
		count := L*256
		state := 3
	} else if (state == 3) { // L == low byte of 'count'
		count += L
		output count times 'last'
		state := 0
	}

Implementations

Source code

Pseudocode

Perl implementation

Source code in C (sqz.c)

The Blues Brothers Huffman Compression - QuickBASIC implementation (Huffman only)

UnSQZ

OpenTitus UnSQZ-tool extracts SQZ-compressed files.

Credits

SQZ information: Jesses.

OpenTitus SQZ extractor: Eirik.

QuickBasic implementation: Frenkel.

If you find this information helpful in a project you're working on, please give credit where credit is due. (A link back to this wiki would be nice too!)