Titus Interactive SQZ Compression
Format type | Compression algorithm |
---|---|
Type | Stream |
I/O unit size | 1-bit out |
Games |
Titus Interactive SQZ Compression is an implementation of Huffman Compression and LZW compression. It is present in Titus the Fox and The Blues Brothers, developed by Titus Interactive.
Check out The Blues Brothers Huffman Compression for similar information.
File format
Data type | Description |
---|---|
BYTE[1] sizeH | bits 0..3: high nibble of uncompressed size, bits 4..7 unused |
BYTE[1] compression | 0x10: LZW is used, something else (0): Huffman+RLE is used |
UINT16LE sizeL | low word of uncompressed size |
If Huffman: | |
UINT16LE HuffmanTreeSize | Size of Huffman tree. |
UINT16LE[HuffmanTreeSize / 2] HuffmanTree | Huffman tree. A node is a leaf node when the most significant bit is set, otherwise it's an internal node. |
BYTE[] compressedData | Compressed data. The most significant bit of a byte is processed first, then the second most significant bit etc. |
If LZW: | |
BYTE[] compressedData | Compressed data. |
LZW decoding
The bitstream consists of variable-length codewords; from here on, their word size is named 'nbit'. Initially, nbit == 9.
LZW makes use of a dictionary; the codewords are used as indices to perform lookups. This dictionary is not specified explicitly in the compressed stream though, instead it is derived during decompression. Initially the dictionary contains 258 entries. For entries 0..255, the value of each entry is simply the ASCII binary form of its index: dict[i] = chr(i). The remaining two entries, 0x100 and 0x101, have a special purpose and aren't used for lookups. The dictionary has a maximum size of 0x1000 entries.
The bitstream codewords can be divided in two categories. The first category is formed by the two special values mentioned above, CLEAR_CODE (0x100) and END_CODE (0x101). Whenever CLEAR_CODE is encountered, the decoder should reset its state to the initial condition, that is, set nbit to 9 and reset the dictionary to its initial 258 entries. END_CODE, as expected, marks the end of the stream.
The second category is formed by all other codeword values. For each codeword, first a new entry is optionally added to the dictionary. No entry is added immediately after a CLEAR_CODE, and neither when the dictionary is full. Next, the codeword's value is used as an index in the dictionary: the corresponding entry is output.
The value of the new dictionary entry is the output generated by the previous codeword, plus one more byte. This byte is normally the first byte of the current codeword's output. However, sometimes the current codeword points to a not-yet-existing dictionary entry. In this case, the extra byte's value is the first byte of the previous output. So, if the previous codeword generated an output of 'ABC', and the current codeword points beyond the dictionary, the new dictionary entry would be 'ABCA'.
For this to work correctly, the input stream must fulfill a few conditions, which all apply only when the current codeword points to a not-yet-existing entry. First, the previous codeword can't be a CLEAR_CODE, as there is no previous output in that case. Secondly, the dictionary cannot be full, as otherwise there is no room to add a new entry, which is needed to lookup the output. Finally, the current codeword must point to the to-be-added entry, for the same reason. It is the responsibility of the compressor to make sure these conditions hold, and they do for all LZW-encoded SQZ files in both Moktar and Titus the Fox.
If after adding an entry the number of elements in the dictionary is equal to 2^nbit, the value of nbit is incremented; unless nbit is equal to 12.
The number of unused input bits following END_CODE is at least one and at most eight, so sometimes an entirely unused input byte is present at the end.
Regarding bit-order: consider the first three input bytes of a file, and label the bits like this, with most-significant bits on the left in each byte: 76543210 FEDCBA98 NMLKJIHG then, the first two codewords (most-significant bit left) are: 76543210F EDCBA98NM
Constants: CLEAR_CODE = 0x100 (0x101 for CDRUN.COM) END_CODE = 0x101 (0x100 for CDRUN.COM) MAX_TABLE = 0x1000
int nbit
stringarray dict
int prev := CLEAR_CODE
while (prev != END_CODE) {
if (prev == CLEAR_CODE) {
nbit := 9
dict := [chr(0)..chr(255), ?, ?] // 258 entries, ?=unused
}
int cw := next nbit-sized codeword
if (cw != CLEAR_CODE && cw != END_CODE) {
int newbyte
if (cw < num_elem(dict)) {
newbyte := first_byte(dict[cw])
} else {
// Assumption 1: prev != CLEAR_CODE
// Assumption 2: num_elem(dict) < MAX_TABLE
// Assumption 3: cw == num_elem(dict)
newbyte := first_byte(dict[prev])
}
if ((prev != CLEAR_CODE) && (num_elem(dict) < MAX_TABLE)) {
append dict[prev] ++ newbyte to dict
if (num_elem(dict) == 2**nbit && nbit < 12) {
nbit++
}
}
output dict[cw]
}
prev := cw
}
Huffman decoding
In Huffman coding a fixed binary tree structure is used, with output codewords stored at its leaves. The Huffman bitstream sequentially addresses these leaf nodes. The tree is navigated node for node, starting at the root. One bit is then read from the stream. This bit determines the next node to visit: a bit value of 0 means visit the first (left) child, where a value of 1 means take the second (right) child. Whenever a leaf node is reached, its value is output and the current node position is reset to the root.
In most Huffman implementations the tree is stored efficiently in what is known as "canonical" form. In the SQZ files however, this is not the case and the tree is stored in a very straightforward fashion, making it easy to use. The binary Huffman tree HT is stored as array of 16-bit words, where each word represents a node; each node is either an internal node or a leaf node. The value stored in the array for a node has two parts: bit 15 is set if it is a leaf node, bits 0..14 contain the node's value. For leaf nodes this value is a codeword, for internal nodes it is 2*(index of first child), or put another way: the byte offset in the array of the first child. The second child of an internal node is stored immediately after its first child.
Bit order is such that for each byte, the most significant bit is processed first.
Pseudocode for reading codewords (Huffman decoding):
int node := 0
while (bool b := next bit) {
if (b) {
node++
}
if (HT[node] < 0x8000) {
// internal node
node := HT[node] / 2
} else {
// leaf node
int cw := HT[node] & 0x7FFF
process codeword cw
node := 0
}
}
This results in a sequence of RLE codewords. Each codeword 'cw' is processed in the following way:
Let L == cw mod 256 Let H == cw div 256
Codeword What to do H==0 last := L; output 'last' H!=0 && L==0 read next codeword; output cw times 'last' H!=0 && L==1 read next codeword; count := L*256; read next codeword; count += L; output count times 'last' H!=0 && L>=2 output L times 'last'
This is easily implemented using a state machine: the following pseudocode can be directly inserted in the loop above.
The pseudocode uses three pre-existing variables: int state := 0 byte last int count
Pseudocode for processing a codeword cw (RLE decoding):
byte L := cw & 255
byte H := cw >> 8
if (state == 0) {
if (H == 0) {
last := L
output 'last'
} else if (L == 0) {
state := 1
} else if (L == 1) {
state := 2
} else {
output L times 'last'
}
} else if (state == 1) { // cw == repeat count
output cw times 'last'
state := 0
} else if (state == 2) { // L == high byte of 'count'
count := L*256
state := 3
} else if (state == 3) { // L == low byte of 'count'
count += L
output count times 'last'
state := 0
}
Implementations
Source code
The Blues Brothers Huffman Compression - QuickBASIC implementation (Huffman only)
UnSQZ
OpenTitus UnSQZ-tool extracts SQZ-compressed files.
Credits
SQZ information: Jesses.
OpenTitus SQZ extractor: Eirik.
QuickBasic implementation: Frenkel.
If you find this information helpful in a project you're working on, please give credit where credit is due. (A link back to this wiki would be nice too!)