Softdisk Library Format

From ModdingWiki
Softdisk Library Format
Format type: Archive
Max files: 65,535
File Allocation Table (FAT): Beginning
Filenames? Yes, 8.3
Metadata? None
Supports compression? Yes
Supports encryption? No
Supports subdirectories? No
Hidden data? Yes
Games

SLIB, or Softdisk LIBrary, is a container file format used by Softdisk to compress various files used by their games, most notably the Commander Keen Dreams series of games (including Dangerous Dave 3 and Dangerous Dave 4), where it stores the title images shown at the beginning of each game. It was created in 1992 by Jim Row.

The data held in the file can be stored in one of three ways: uncompressed, LZW or LZH. The compression used is primitive and rather different from later or traditional versions of LZW/LZH. SLIB files were created by the program SOFTLIB.EXE, so any game that uses this format shares various segments of decompression code with SOFTLIB.

There is a closely related format, the SHL or Softdisk Help Library format. SHL files contain only a single file. Their header is slightly different: its file signature is 'CMP1' (CoMPression of 1 file), while that of SLIB files is 'SLIB'. The validity of both file types can be confirmed by checking for a word of value 2 at offset 4 in the file. The actual files have been given a number of extensions: .CMP (CoMPressed), .SHL (Softdisk Help Library) or the game extension.

The SLIB file can roughly be broken into two parts: the header, which contains data about the various data chunks, and the data chunks themselves, each containing a single file. Each chunk also has a short header.

Header

The file header is found only in SLIB files and is absent in SHL files, which are loaded into memory in their entirety. The SLIB header allows individual data chunks to be loaded into memory separately.

The SLIB header is a variable-length header that contains information about how many chunks there are in a file, as well as their location in the file and their size. It is used by the game to load chunks into memory and by SOFTLIB to extract compressed files.

The first part of the header has a fixed length of 8 bytes. It allows the game to identify the file as SLIB and gives the total length of the header (30 * the value at offset 6, plus 8). The second part is a series of 30-byte chunk headers that hold information about what file is held in each chunk. (The last six bytes of each chunk header are repeated at the start of its data chunk.)

FILE HEADER:
0 CHAR[4]	fID		Signature, 'SLIB' (Softdisk LIBrary)
4 UINT16LE	Version		Version number, always $0002
6 UINT16LE	Chunks		Number of data chunks in file
8 CHAR[30x]	Chunk headers	Chunk headers
IMAGE HEADERS:
?   CHAR[16]	Name         Name of compressed image (Max len 12 chars) padded with nuls
+16 UINT32LE	Dat st       Start of image data in file (From start of first image chunk)
+20 UINT32LE	Dat end      End of image data in file (From start of first image chunk)
+24 UINT32LE	iOriginalSize	Decompressed data size
+28 UINT16LE	iCompression	Compression used, 0 = none, 1 = LZW, 2 = LZH
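
As an illustration of the layout above, the following is a minimal Python sketch that reads the fixed 8-byte header and the 30-byte chunk headers. The names parse_slib_header and ChunkEntry are made up for this example, and the start and end values are returned exactly as stored (relative to the first data chunk, as described in the table).

```python
import struct
from collections import namedtuple

# Hypothetical record type for one 30-byte chunk header.
ChunkEntry = namedtuple("ChunkEntry", "name start end original_size compression")

def parse_slib_header(data):
    """Return (header_length, [ChunkEntry, ...]) for an SLIB file image."""
    signature, version, count = struct.unpack_from("<4sHH", data, 0)
    if signature != b"SLIB" or version != 2:
        raise ValueError("not an SLIB file")
    entries = []
    for index in range(count):
        # 16-byte name, three UINT32LE fields, one UINT16LE field = 30 bytes.
        raw_name, start, end, size, method = struct.unpack_from(
            "<16sIIIH", data, 8 + 30 * index)
        # The name is padded with nuls; keep only the part before the first one.
        name = raw_name.split(b"\0", 1)[0].decode("ascii", errors="replace")
        entries.append(ChunkEntry(name, start, end, size, method))
    # Total header length: 30 bytes per chunk plus the 8 fixed bytes.
    return 8 + 30 * count, entries
```

Since the offsets are stored relative to the first data chunk, a reader would presumably add the returned header length to them to get absolute file positions.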

Data Chunks

The data chunk has a short header followed by the actual data itself. The format differs slightly between SLIB chunks and SHL files. Both formats must identify themselves to the game, and they do so in different ways.

Notice that in the case of SLIB chunks, the six bytes at offsets 4 to 9 of the chunk are identical to the six bytes at offsets 24 to 29 of that chunk's header.

SLIB CHUNK FORMAT
?   CHAR[4]	cID		Signature, 'CUNK' (Chunk UNKompressed size)
+4  UINT32LE	iOriginalSize	Decompressed data size
+8  UINT16LE	iCompression	Compression method
+10 CHAR[?]	Data		Image data
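
A short Python sketch of reading this 10-byte chunk header follows. The function name parse_slib_chunk is hypothetical, and the caller is assumed to already know where the chunk starts in the file.

```python
import struct

def parse_slib_chunk(data, offset=0):
    """Return (original_size, compression, data_offset) for a chunk at offset."""
    signature, original_size, compression = struct.unpack_from("<4sIH", data, offset)
    if signature != b"CUNK":
        raise ValueError("missing 'CUNK' signature")
    # The chunk's data begins immediately after the 10-byte header.
    return original_size, compression, offset + 10
```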

SHL files are slightly trickier: an uncompressed SHL file is simply raw data, whereas compressed data must identify itself as such, so its header is slightly longer. It can be considered a combination of the chunk header and the SLIB header described above.

SHL FILE FORMAT
0  CHAR[4]	cID		Signature, 'CMP1' (CoMPression of 1 file)
4  UINT16LE	iCompression	Compression method (as above)
6  UINT32LE	iOriginalSize	Decompressed data size
10 UINT32LE	iCompressedSize	Compressed data size
14 CHAR[?]	Data		Image data
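
The distinction between raw and compressed SHL files can be sketched in Python as follows. The function name parse_shl is hypothetical, and treating every file that lacks the 'CMP1' signature as raw data is an assumption based on the description above.

```python
import struct

def parse_shl(data):
    """Return (compression, original_size, payload) for an SHL file image."""
    if data[:4] != b"CMP1":
        # Assumption: no signature means the file is plain uncompressed data.
        return 0, len(data), data
    compression, original_size, compressed_size = struct.unpack_from("<HII", data, 4)
    # The compressed data starts at offset 14, after the 14-byte header.
    payload = data[14:14 + compressed_size]
    return compression, original_size, payload
```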

Compression

As noted above, there are three forms of compression. The first is simple enough: no compression at all, the data is simply stored in the file. (This actually increases the size, because of the added headers.) The other two are called 'LZW' and 'LZH' by the programmers, but both differ from what are now standard implementations of those formats.

LZW

A compression value of 1 means the chunk is LZW compressed. The format of the compression used is different from the more 'usual' implementation of LZW. Most LZW works by building a 'dictionary', but this variant is in essence just referring back to data that has already been read.

The core of the implementation is that if a sequence is encountered that has already been read, it is replaced with a pointer to the earlier copy. There are three types of data: flag bytes, pointers and literals.

Flag bytes divide the datastream into segments of eight 'values', each of which can be either a literal or a pointer. Pointers are 2 bytes long, literals 1 byte. (Therefore there will be a flag byte every 8 to 16 bytes of data.) Each bit, read from the least significant bit upwards, indicates whether the corresponding value is a literal (1) or a pointer (0). Thus a value of 199 (11000111 in binary) indicates three literals, three pointers and two literals, in that order (11 bytes in total).
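
The way a flag byte is read can be sketched in a few lines of Python. The helper names flag_kinds and segment_size are made up for illustration; the bit order (least significant bit first) and the meaning of each bit follow the $FF and $09 flag bytes in the worked example further down.

```python
def flag_kinds(flag):
    """List the eight values announced by a flag byte, in stream order."""
    # Bit 0 describes the first value; a set bit means a literal.
    return ["literal" if (flag >> bit) & 1 else "pointer" for bit in range(8)]

def segment_size(flag):
    """Number of data bytes that follow the flag byte in a full segment."""
    # Literals take 1 byte each, pointers take 2 bytes each.
    return sum(1 if kind == "literal" else 2 for kind in flag_kinds(flag))
```

For $FF this gives eight literals and a segment of 8 bytes; for $09 the first four values are a literal, two pointers and a literal.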

Literals are bytes that have not been seen in the datastream before. They cannot be compressed and are thus the same in the compressed and decompressed datastreams. (If the data is text they become quite obvious.) Any string less than 3 bytes long, or any string that has not been read before or cannot be pointed to (see below), will be stored as literals.

Pointers are references to data that has already been read. They are two bytes long: 12 bits give the location to read data from and the remaining 4 bits give the length of data to read.

The lower nybble (4 bits) of the second pointer byte holds the length of repeat data to read minus three. (This makes sense: the shortest sequence worth coding is three bytes, which can be given the value 0.) It follows that the maximum length of repeated data that can be stored as a pointer is 15 + 3 = 18 bytes.

The high nybble of the second byte is multiplied by 256 then added to the first byte to give the location of the data to read in the 'sliding window' minus 18. In hexadecimal this is as simple as prefixing the first byte with the nybble. For example, pointer bytes of $EA $F6 point to location $FEA, with the '6' giving a length of 6 + 3 = 9 bytes to read.
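
The bit manipulation described above can be written as a small Python helper. The name decode_pointer is hypothetical; the returned location is the raw 12-bit value, before the adjustment of 18 is applied.

```python
def decode_pointer(first, second):
    """Split two pointer bytes into (location, length)."""
    # High nybble of the second byte times 256, plus the first byte.
    location = ((second & 0xF0) << 4) | first
    # Low nybble of the second byte stores the length minus three.
    length = (second & 0x0F) + 3
    return location, length
```

Calling decode_pointer(0xEA, 0xF6) returns location $FEA and a length of 9.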

The 'sliding window' in this case is the region of decompressed data that the compressed data can point to. A 12-bit location can take 4096 distinct values, so the window covers at most 4 KB. While less than that has been decompressed, locations (after the adjustment of 18 described below) are counted from the start of the data; once the data grows beyond the window, locations wrap around and refer to the most recently decompressed bytes. (This is the origin of the term 'sliding window'; it is a window that slides along the datastream as it gets bigger.)

It will be noted that it is probable that the compressed datastream will not be perfectly divisible by flag bytes. In this case the unused bits are set to 0. The decompressor stops when the decompressed data size is equal to the value given in the chunk header.

The last thing to keep in mind is that the location obtained from a pointer requires 18 to be added to it. This is due to how the decompressor is set up in memory; it needs a buffer of 18 bytes (the maximum length of data that a single pointer can decompress) between where it is working and the start of the datastream proper. These bytes are initially filled with blank spaces ($20, not $00) and are considered part of the datastream, but not a part that is output to the final file. (Their locations are negative.)

This has an odd result: it is entirely possible for a pointer to point to a negative location and output one or more characters that were never part of the uncompressed data. To illustrate this we will look at the text string 'I am Sam. Sam I am!' as compressed by softlib.exe. Note that on line 6 it requests 5 bytes from address -1. This is the string ' I am'. The last four bytes are from the start of the sentence, but the first, a blank space, comes from the buffer.

FF				Flag byte, 8 literals follow
49 20 61 6D 20 53 61 6D		'I am Sam' as literals
09				Flag byte: 1 literal, 2 pointers, 1 literal, 4 unused bits ($09 = 00001001)
2E				'.' as literal
F2 F1				Pointer, read 1 + 3 = 4 bytes from $FF2, or -14 + 18 = 4 in the data. This is ' Sam'
ED F2				Pointer, read 2 + 3 = 5 bytes from $FED or -19 + 18 = -1 in the data. This is ' I am'
21				'!' as literal
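
Putting the pieces together, the following Python sketch decompresses the example above. It is not taken from SOFTLIB.EXE: it assumes the window is a 4096-byte ring buffer (the full range of a 12-bit location) pre-filled with spaces, with writing starting 18 bytes before its end, at $FEE. Those assumptions reproduce the worked example but have not been checked against other files. The function name slib_lzw_decompress is hypothetical.

```python
WINDOW = 4096    # assumed ring buffer size (range of a 12-bit location)
MAX_MATCH = 18   # longest run a single pointer can encode

def slib_lzw_decompress(src, original_size):
    """Decompress SLIB 'LZW' data until original_size bytes are produced."""
    ring = bytearray(b" " * WINDOW)   # buffer initially holds blank spaces
    write = WINDOW - MAX_MATCH        # assumed start position, $FEE
    out = bytearray()
    pos = 0
    while len(out) < original_size:
        if pos >= len(src):
            raise ValueError("compressed data ended early")
        flags = src[pos]
        pos += 1
        for bit in range(8):
            if len(out) >= original_size or pos >= len(src):
                break                 # remaining flag bits are unused
            if (flags >> bit) & 1:
                run = src[pos:pos + 1]            # literal: one byte as-is
                pos += 1
            else:
                if pos + 1 >= len(src):
                    raise ValueError("truncated pointer")
                location = src[pos] | ((src[pos + 1] & 0xF0) << 4)
                length = (src[pos + 1] & 0x0F) + 3
                pos += 2
                run = None
            if run is None:
                # Copy byte by byte so a run may overlap what it is writing.
                for i in range(length):
                    value = ring[(location + i) % WINDOW]
                    out.append(value)
                    ring[write] = value
                    write = (write + 1) % WINDOW
            else:
                out += run
                ring[write] = run[0]
                write = (write + 1) % WINDOW
    return bytes(out[:original_size])

example = bytes.fromhex("FF 49 20 61 6D 20 53 61 6D 09 2E F2 F1 ED F2 21")
print(slib_lzw_decompress(example, 19))   # b'I am Sam. Sam I am!'
```

In this model the location $FED is the last pre-filled space just before the first written byte at $FEE, which is where the leading blank of ' I am' comes from.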

LZH

LZH (also known as LZHUF) is a combination of LZSS compression and Huffman Compression originally developed by Haruyasu Yoshizaki for his LHarc compressor, based on the prior LZARI algorithm from Haruhiko Okumura.

Compressing data with LZHUF can be considered a two-stage process (though, in practice, both stages are run simultaneously).

First, the data is LZ-compressed, converting it to a series of literals (uncompressed bytes) and matches (references to a contiguous run of bytes which appeared earlier: these references consist of a length and an offset).

These literals and matches are then Huffman compressed using two separate Huffman tables.

Literals (as well as the lengths of matches) are compressed using the main adaptive Huffman tree. This Huffman tree has 256 + 28 = 284 leaf nodes: the first 256 represent literal bytes, the next 28 represent matches of length 3 to length 30. After every entry is encoded or decoded, a probability for that value is updated and, if necessary, the Huffman tree is reconstructed to be optimal for the new probability distribution.

If the encoded value is a match length, the offset (which is 12 bits long) is then split into two 6-bit halves. The upper 6 bits are encoded with a fixed Huffman dictionary. The bottom 6 bits are stored as-is.
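
The symbol layout and the offset split described above can be illustrated with a small Python sketch. It covers only these two bookkeeping steps, not the Huffman coding itself, and the names classify_symbol, split_offset and join_offset are made up for this example.

```python
def classify_symbol(symbol):
    """Interpret one of the 284 main-tree symbols."""
    if not 0 <= symbol < 284:
        raise ValueError("symbol out of range")
    if symbol < 256:
        return "literal", symbol          # the literal byte value itself
    return "match", symbol - 256          # index 0..27 of the match-length code

def split_offset(offset):
    """Split a 12-bit offset into (upper 6 bits, lower 6 bits)."""
    # The upper half goes through the fixed Huffman table, the lower is stored raw.
    return offset >> 6, offset & 0x3F

def join_offset(upper, lower):
    """Rebuild the 12-bit offset from its two 6-bit halves."""
    return (upper << 6) | lower
```

Splitting the offset this way means only 64 distinct values ever need a Huffman code, while the low bits, which are close to random, are stored without any coding overhead.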

The Planet Strike source code release contains C code that handles LZH compression and decompression in JM_LZH.C. The code was written by Jim T. Row, who apparently also wrote the Softlib utility, so chances are it is the same implementation of LZH.

The same compressor/decompressor can also be found in the open-sourced Keen Dreams source, in LZHUF.C.

An account of the development of this algorithm can be found in A History of Data Compression in Japan. Okumura further describes the algorithms in this page. Note however, that the Softdisk library implementation uses Jim T. Row's implementation, not the original by Okumura and Yoshizaki.

Softlib

Softlib (Softdisk Library Creator) is a DOS program that can be used to create SLIB files or extract files from them. It can work with both forms of compression used in SLIB files. Notably, for some reason files shorter than 24 bytes often fail to be compressed correctly. (Softlib outputs a library with an empty chunk in it.) Softlib can be downloaded from the tools section of this page.

Data contained in libraries

Keen Dreams uses SLIB to compress its title screen and also comes with a number of LZH-compressed .SHL files containing text. The title screen is in LBM Format. Notably, the game does not read most of the LBM chunks, focusing instead on the FORM, BMHD and BODY chunks. This is because, while the files were designed to be viewed and edited in a standalone program, the game did not need things such as the LBM palette.

Dangerous Dave 3 and 4 use an additional SLIB file to store their digital sound effects, which are separate from their PC speaker/AdLib sounds.

Tools

The following tools are able to work with files in this format.

Name	Platform	Extract files?	Decompress on extract?	Create new?	Modify?	Compress on insert?	Access hidden data?	Edit metadata?	Notes
SOFTLIB.EXE	DOS	Yes	Yes	Yes	Yes	Yes	No	N/A	Original DOS program that can create and modify Softdisk Libraries
Titlebuild	Windows	No	No	Yes	No	Yes	No	N/A	Windows program to turn a 320x200 bitmap into KDREAMS.CMP for Keen Dreams

Credits

This format was partially reverse engineered by User:Lemm and User:Levellass.