ModdingWiki - User contributions [en-gb]

GameMaps Format

2015-08-09T17:23:35Z

Fleexy: /* Huffman compression */ filled in TODO's (source: FleexCore2 code)

{{Map Infobox
| Type = 2D tile-based
| Layers = 3
| Viewport = ''Varies by game''
| Game1 = Bio Menace
| Game2 = Blake Stone
| Game3 = Catacomb 3-D
| Game4 = Catacomb Abyss
| Game5 = Commander Keen
| Game6 = Corridor 7 Alien Invasion
| Game7 = Dangerous Dave 3
| Game8 = Dangerous Dave 4
| Game9 = Noah's Ark 3D
| Game10 = Operation Body Count
| Game11 = Spear of Destiny
| Game12 = Wolfenstein 3-D
}}

The '''GameMaps Format''' stores levels in a number of [[:Category:id Software]] games. The filenames and compression vary somewhat across different games, but all files stored in this format were produced by the [[TED5]] level editor.

There are three main varieties of the file format: uncompressed, [[Carmack compression|carmackized]], and [[Huffman Compression|Huffman compressed]]. Each variation has its own file naming scheme and pattern of external/internal files.

There are two main components to the format: the game maps proper, which contain the actual level data, and the map headers, which contain both the location of each level's data within the game maps file and the tile info for the game.

== Structure of uncompressed data ==

The uncompressed format is used by several games including [[Bio Menace]] and [[Wolfenstein 3-D]]. In this format map data is stored in <tt>MAPTEMP.xxx</tt> and map headers in <tt>MAPTHEAD.xxx</tt>. This is the working format saved by [[TED5]] when maps are being edited and can be directly accessed and edited by this utility, allowing changes to be made to the game.

Note that filenames differ and that Wolfenstein 3-D uses <tt>GAMEMAPS.WLx</tt> and <tt>MAPHEAD.WLx</tt> files to store data (except for v1.0, which uses <tt>MAPTEMP.WLx</tt> in place of <tt>GAMEMAPS.WLx</tt>).

The main indicator of this format being used is that in the unaltered game the map header file is external.

Only the latest Id Software games lacked compression, when file size was no longer an issue.

=== Map headers (MAPHEAD) ===

{|class="wikitable"
! Offset !! Type !! Name !! Description
|-
| 0 || [[UINT16LE]] || magic || Magic word signalling [[RLEW compression]]
|-
| 2 || [[UINT32LE]][100] || ptr || 100 pointers to start of level 0-99 data in the game maps file
|-
| 402 || {{TODO|Unknown}} || tileinfo || Tileinfo data
|}

The map header file (MAPHEAD) is of varying length and contains three main types of data. The first is the magic word or flag used for RLEW compression, which is always $ABCD. The second is 100 level pointers which give the location of the start of level data in the GAMEMAPS file, relative to the start of that file. A value of 0 or less indicates no level (generally 0, but occasionally -1 ($FFFFFFFF) is used). The third is the tileinfo data, which contains tile properties for each tile used in level creation. (These are masked and unmasked and either 8x8, 16x16 or 32x32.)

Many programs treat the tileinfo as a separate file from the MAPHEAD and it is possible to modify a game in this manner. Indeed, some games such as Wolfenstein 3-D do not have any tileinfo data at all in the map header file (giving a total file length of 402 bytes.) However TED5 works with any tileinfo data in the MAPHEAD.
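The layout above can be read with a few lines of Python. This is only a minimal sketch: the function name <tt>parse_maphead</tt> and its return shape are illustrative assumptions, and the pointers are read as signed values purely so that $FFFFFFFF appears as -1.

```python
import struct

RLEW_MAGIC = 0xABCD   # magic word given in the table above
NUM_LEVELS = 100      # number of pointer slots given in the table above


def parse_maphead(data: bytes):
    """Split raw MAPHEAD bytes into (magic, level offsets, tileinfo).

    Hypothetical helper, not taken from TED5 or any game source.
    """
    if len(data) < 2 + 4 * NUM_LEVELS:
        raise ValueError("MAPHEAD is shorter than 402 bytes")
    (magic,) = struct.unpack_from("<H", data, 0)
    # Read the pointers as signed so that $FFFFFFFF shows up as -1.
    pointers = struct.unpack_from("<%di" % NUM_LEVELS, data, 2)
    # 0 (or occasionally -1) marks a slot with no level.
    offsets = [p if p > 0 else None for p in pointers]
    # Anything after byte 402 is tileinfo; it may be empty.
    tileinfo = data[2 + 4 * NUM_LEVELS:]
    return magic, offsets, tileinfo
```

A caller would normally compare the returned magic word against <tt>RLEW_MAGIC</tt> before trusting the pointers.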

=== Map data (GAMEMAPS) ===

The GAMEMAPS file consists of the string "TED5v1.0" and a number of RLEW compressed chunks of varying length. Each level in the file will have from two to four chunks (usually four) depending on the game, with all levels in a given game having the same number of chunks. These are the level header and 1-3 planes (foreground, background and sprite/info.) The chunks are in no particular order and it is possible to read through the entire file decompressing chunks as they're found.

Chunks are ordered by the MAPHEAD file, which will point to the GAMEMAPS level header chunks which in turn contain pointers to the other GAMEMAPS chunks used by that level.

All level data is in the form of [[UINT16LE]] values (or in the case of pointers, [[UINT32LE]].)

==== Level headers ====

The header for each level inside the GAMEMAPS file (which is pointed to by MAPHEAD) is 38 bytes long (42 bytes when the 4-byte signature described below follows the name) and RLEW compressed, though this is difficult to see since the data is so short and non-repetitive. For the offsets to level planes, a value of 0 indicates the plane does not exist.

Plane 0 is background using unmasked tiles, plane 1 is foreground and uses masked tiles, and plane 2 is sprite/info. Levels must contain a background plane and usually an infoplane.

{|class="wikitable"
! Offset !! Type !! Name !! Description
|-
| 0 || UINT32LE || offPlane0 || Offset in GAMEMAPS to beginning of compressed plane 0 data (or 0 if plane is not present)
|-
| 4 || UINT32LE || offPlane1 || Offset in GAMEMAPS to beginning of compressed plane 1 data (or 0 if plane is not present)
|-
| 8 || UINT32LE || offPlane2 || Offset in GAMEMAPS to beginning of compressed plane 2 data (or 0 if plane is not present)
|-
| 12 || UINT16LE || lenPlane0 || Length of compressed plane 0 data (in bytes)
|-
| 14 || UINT16LE || lenPlane1 || Length of compressed plane 1 data (in bytes)
|-
| 16 || UINT16LE || lenPlane2 || Length of compressed plane 2 data (in bytes)
|-
| 18 || UINT16LE || width || Width of level (in tiles)
|-
| 20 || UINT16LE || height || Height of level (in tiles)
|-
| 22 || char[16] || name || Internal name for level (used only by editor, not displayed in-game. null-terminated)
|}

Note that for Wolfenstein 3D, a 4-byte signature string ("!ID!") will normally be present directly after the level name. The signature does not appear to be used anywhere, but is useful for distinguishing between v1.0 files (the signature string is missing), and files for v1.1 and later (includes the signature string).
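As an illustration, the level header table can be decoded as below. This is a sketch that assumes the header bytes are already available in plain form; the names <tt>LEVEL_HEADER</tt> and <tt>parse_level_header</tt>, and the dictionary keys, are made up for this example.

```python
import struct

# offPlane0-2 (3 x UINT32LE), lenPlane0-2, width, height (5 x UINT16LE),
# then the 16-byte name: 38 bytes in total.
LEVEL_HEADER = struct.Struct("<3I5H16s")


def parse_level_header(data: bytes, offset: int = 0):
    """Decode one level header (hypothetical helper for illustration)."""
    fields = LEVEL_HEADER.unpack_from(data, offset)
    plane_offsets, plane_lengths = fields[0:3], fields[3:6]
    width, height, raw_name = fields[6], fields[7], fields[8]
    end = offset + LEVEL_HEADER.size
    return {
        # An offset of 0 means the plane does not exist.
        "planes": [(off, length) if off else None
                   for off, length in zip(plane_offsets, plane_lengths)],
        "width": width,
        "height": height,
        # The name is null-terminated inside its 16-byte field.
        "name": raw_name.split(b"\0", 1)[0].decode("ascii", "replace"),
        # Wolfenstein 3-D v1.1 and later place "!ID!" after the name.
        "has_signature": data[end:end + 4] == b"!ID!",
    }
```

Checking <tt>has_signature</tt> gives a quick way to tell v1.0 map files from later ones, as noted above.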

== Carmack compression ==

[[Carmack compression]] is the method used to compress later ''id Software'' games, when file size was still a concern. It is the most efficient and complex compression method and was created specifically to work with the 16-bit word structure of the GameMaps file. The compression is detailed on its [[Carmack compression|own page.]]

Carmackized game maps files are external <tt>GAMEMAPS.xxx</tt> files and the map header is stored internally in the executable. The map header must be extracted and the game maps decompressed before TED5 can access them. TED5 itself can produce carmackized files and external <tt>MAPHEAD.xxx</tt> files. Carmackization does not replace the RLEW compression used in 'uncompressed' files; it is applied on top of the RLEW compressed data, so the data is doubly compressed.

Note that for Wolfenstein 3D v1.0, map files are not carmackized, only RLEW compression is applied.

== Huffman compression ==

[[Huffman Compression]] was probably used by earlier versions of TED5 (but possibly not TED5 at all) before carmackization was introduced. It uses the same method to compress its data as is used by ''id Software'' games to compress their graphics and sounds. Again this compression method works with RLEW compressed data and has its [[Huffman Compression|own page.]]

Huffman compression is easily detected since it works on the bit level and thus disrupts the word structure of the game data. This is easily seen in a hex editor. Compressed data will not contain the string $00 $00 or indeed even $00 very often. (In contrast, even carmackized data contains both strings hundreds of times.)

There will be two internal files for this format: the map header and the Huffman dictionary (which is always the first dictionary in the executable). The map header format is also slightly different, being 502 bytes long rather than 402. The extra 100 bytes give the lengths of the compressed level headers in the game maps data, and occur immediately after the normal level header offsets and before the tileinfo. Each entry is one octet giving the compressed header length in bytes, or zero if the level does not exist. (These can be ignored when decompressing, since Huffman data can be read until the decompressed level header's fixed size is reached, but if they are omitted when writing the MAPHEAD, the game may experience a buffer overflow when reading the maps.)
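The 502-byte variant can be split up in the same way as the plain map header. The following is a sketch under the layout described above; <tt>parse_huffman_maphead</tt> is a hypothetical name, not a function from any existing tool.

```python
import struct


def parse_huffman_maphead(data: bytes):
    """Split the 502-byte MAPHEAD variant into its four parts.

    Hypothetical helper: returns (magic, offsets, header_sizes, tileinfo).
    """
    if len(data) < 502:
        raise ValueError("expected at least 502 bytes")
    (magic,) = struct.unpack_from("<H", data, 0)
    # 100 level header offsets, read as signed so $FFFFFFFF becomes -1.
    offsets = struct.unpack_from("<100i", data, 2)
    # One length byte per level, between the offsets and the tileinfo.
    header_sizes = list(data[402:502])
    tileinfo = data[502:]
    return magic, offsets, header_sizes, tileinfo
```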

== Location of internal files ==

The GAMEMAPS file itself is always external, but where the maps are compressed the MAPHEAD is stored internally in the main .exe file. Executables are themselves usually compressed with either LZEXE or PKLite. Once the .exe has been decompressed it is a trivial task to find the MAPHEAD, as it will start with the UINT16LE value $ABCD (i.e. the byte $CD followed by the byte $AB). For level editing purposes only the first 402 (or 502) bytes of the file need to be extracted, though it is possible to read the MAPHEAD file to calculate its length.
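Searching for the magic word can be automated. The sketch below scans an already decompressed executable for the bytes $CD $AB; the extra plausibility check (every pointer is 0, -1 or a position inside the GAMEMAPS file) is an assumption added here to filter out chance matches, not something the games themselves do.

```python
import struct


def find_maphead_candidates(exe: bytes, gamemaps_size: int):
    """Yield offsets in a decompressed .exe that could start a MAPHEAD.

    gamemaps_size is the length of the matching GAMEMAPS file.
    Hypothetical helper for illustration only.
    """
    start = 0
    while True:
        pos = exe.find(b"\xCD\xAB", start)
        # Stop when no match is left or 402 bytes no longer fit.
        if pos < 0 or pos + 402 > len(exe):
            return
        pointers = struct.unpack_from("<100i", exe, pos + 2)
        plausible = all(p in (0, -1) or 0 < p < gamemaps_size
                        for p in pointers)
        # Require at least one real level so runs of zeros are ignored.
        if plausible and any(p > 0 for p in pointers):
            yield pos
        start = pos + 1
```

Any offset found this way should still be checked by hand, for example against the table of known offsets.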

The following table lists the offsets of the MAPHEAD file for various games, relative to the start of the decompressed game .exe file.

{{TODO|TODO: Add all known versions of all games}}

{|class="wikitable"
! Game !! Version !! Location !! Filename !! Offset !! Notes
|-
| [[Bio Menace]] || Freeware || External || <tt>MAPHEAD.BM[123]</tt> || - ||
|-
| [[Blake Stone]]: Aliens of Gold || Shareware || External || <tt>MAPHEAD.BS1</tt> || - ||
|-
| [[Blake Stone]] 2: Planet Strike || All || External || ? || - ||
|-
| [[Catacomb 3-D]] (3) || 1.00 || Internal || <tt>CAT3D.EXE</tt> || $1C570 || After UNLZEXE
|-
| [[Catacomb Abyss]] (4) || 1.13 || Internal || <tt>CATABYSS.EXE</tt> || $1C510 || After UNLZEXE
|-
| [[Catacomb Armageddon]] (5) || 1.01a || Internal || <tt>CATARMA.EXE</tt> || $1D900 || After UNLZEXE
|-
| [[Catacomb Apocalypse]] (6) || 1.00b || Internal || <tt>CATAPOC.EXE</tt> || $1DD50 || After UNLZEXE
|-
|rowspan=3| [[Corridor 7 Alien Invasion]] || Demo || External || ? || - ||
|-
| CD || Internal || <tt>CORR7CD.EXE</tt> || $30D50 || File is not compressed
|-
| Floppy || Internal || <tt>C7.EXE</tt> || $24BF0 || File is not compressed
|-
|rowspan=2| [[Commander Keen 4-6|Keen 4]] || Special Demo || ? || ? || ? || File is PKLite compressed
|-
| 1.4 EGA || Internal || <tt>KEEN4E.EXE</tt> || $24830 || After UNLZEXE
|-
| [[Commander Keen 4-6|Keen 5]] || 1.4 EGA || Internal || <tt>KEEN5E.EXE</tt> || $25990 || After UNLZEXE
|-
| [[Commander Keen 4-6|Keen 6]] || 1.4 EGA || Internal || <tt>KEEN6.EXE</tt> || $25080 || After UNLZEXE
|-
| [[Commander Keen Dreams]] || 1.13 || Internal || ? || $1FA50 || After UNLZEXE
|-
| [[Noah's Ark 3D]] || All || External || ? || - ||
|-
| [[Operation Body Count]] || All || External || ? || - ||
|-
| [[Spear of Destiny]] || All || External || ? || - ||
|-
|rowspan=2| [[Wolfenstein 3-D]] || Shareware || External || <tt>MAPHEAD.WL1</tt> || - ||
|-
| Registered || External || <tt>MAPHEAD.WL6</tt> || - ||
|}

== Utilities ==

* [[TED5]] can edit the <tt>GAMEMAPS</tt> format of any games that use it. It is the original editor used to create these files.

== Credits ==

This file format was reverse engineered by Andrew Durdin (adurdin). If you find this information helpful in a project you're working on, please give credit where credit is due. (A link back to this wiki would be nice too!)

Talk:Keen 4-6 Tileinfo Format

2015-06-28T17:14:54Z

Fleexy: answered own question

What does the "sprite path arrow" special property actually do? I can't find any tiles that use it in Keen 5 or 6. [[User:Fleexy|Fleexy]] ([[User talk:Fleexy|talk]]) 17:06, 28 June 2015 (UTC)
: Oh wait, apparently the '''foreground tiles''' that hold the icons for the platform path arrows get that property. That's kind of strange, because those icons never actually appear as foreground tiles in levels. Does anybody know if the property is actually used/checked? [[User:Fleexy|Fleexy]] ([[User talk:Fleexy|talk]]) 17:14, 28 June 2015 (UTC)

Talk:Keen 4-6 Tileinfo Format

2015-06-28T17:06:28Z

Fleexy: asked about prop. 17

What does the "sprite path arrow" special property actually do? I can't find any tiles that use it in Keen 5 or 6. [[User:Fleexy|Fleexy]] ([[User talk:Fleexy|talk]]) 17:06, 28 June 2015 (UTC)

Talk:EGAGraph Format

2015-06-20T19:34:08Z

Fleexy: /* CGAGraph format */ new section

== xSPRITES.TXT ==

What is the actual content of the sprite table? The following is from the modkeen document, which does not agree with what is posted on the main page of this article. It would suggest that the "shifts" are stored in here as well? Are they stored somewhere else?

''Episodes 4, 5, and 6 only: This file contains extra information about each sprite. Each line in the file has the sprite number, followed by the four clipping rectangle co-ordinates in square brackets [top, left, bottom, right], followed by the sprite origin in square brackets [top, left], followed by the number of shifts the sprite uses. The origin of the sprite image is the point from which its location is calculated. For example, the hand sprite in Keen 5 (5SPR0291.BMP) has several images. The origin for each of these images is in the centre of the "eye", so that as the hand rotates, the different sprite images all appear to rotate about the eye. The origin coordinates are given in pixels from the top-left corner of the sprite image. The shifts is the number of different copies of the sprite image that are stored in memory, and can be 1, 2, or 4. As a general rule, the more shifts a sprite has, the smoother it moves, but the more memory it takes up. If you are making a very large sprite, you can reduce the number of shifts to save memory. But if you have a small sprite and want it to move more smoothly, increase the number of shifts.''

Also from modkeen source:

 typedef struct
 {
     unsigned short Width;
     unsigned short Height;
     signed short OrgX;
     signed short OrgY;
     signed short Rx1, Ry1;
     signed short Rx2, Ry2;
     unsigned short Shifts;
 } SpriteHeadStruct;

Which would make a 9-word structure.

20:16, 3 September 2010 (GMT)

: Well from studying Keen 1-3 type games, my guess would be two words of height / 8 and width, four words giving hitbox co-ords and two words giving h\v offsets. However Keen 4-6 is more complex so I'll have to investigate this later on when I work with those files --[[User:Levellass|Endian? What are you on about?]] 04:40, 16 September 2010 (GMT)

:: I can confirm that each entry is 9 words (or 18 bytes) long and matches the above structure. And the last value (shifts) is actually the same value that you can see at the end of each line in the <tt>xSPRITES.TXT</tt> file. --[[User:K1n9 Duk3|K1n9 Duk3]] 00:30, 7 February 2011 (GMT)

== CGAGraph format ==

Almost everything in this article seems to apply to the CGA graphics resources as well. There are two main differences: the fonts are stored in version 1 of the [[EGA Font format]], and dividing by 8 for EGA dimensions is equivalent to dividing by 4 in CGA. (Generalization: divide by [bpp * 2].) Keep in mind that both CGA and EGA always store the mask before the actual image. --[[User:Fleexy|Fleexy]] ([[User talk:Fleexy|talk]]) 19:34, 20 June 2015 (UTC)

Commander Keen 4-6

2014-12-21T21:33:42Z

Fleexy: Updated Abiathar capabilities for the v2.2 release

{{Game Infobox
| Title = Commander Keen 4.png
| Levels = Edit
| Tiles = Edit
| Sprites = Edit
| Fullscreen = Edit
| Sound = Edit
| Music = Edit
| Text = Edit
| Story = Edit
| Interface = Edit
}}
'''Commander Keen: Goodbye Galaxy''' is a very popular platform scroller, which set the benchmark for this style of game on a PC.

There is a large amount of modding info in the [[keenwiki:Galaxy Tools|modding section of the KeenWiki]] and there is an extensive [[keenwiki:Category:Patches|patch library]] allowing the game to be extended and changed considerably.

== Tools ==

{{BeginFileFormatTools|Type=game}}
{{FileFormatTool
| Name = [[keenwiki:Abiathar|Abiathar]]
| Platform = Windows
| grp = N/A
| map = Edit
| gfx = No
| mus = Replace
| sfx = Edit
| txt = No
| sav = No
| exe = No
}}
{{FileFormatTool
| Name = [[CKPatch]]
| Platform = DOS
| grp = N/A
| map = No
| gfx = No
| mus = No
| sfx = No
| txt = No
| sav = No
| exe = Yes
}}
{{FileFormatTool
| Name = [[keenwiki:IMF Creator|IMF Creator]]
| Platform = Windows
| grp = N/A
| map = No
| gfx = No
| mus = Create
| sfx = No
| txt = No
| sav = No
| exe = Yes
}}
{{FileFormatTool
| Name = [[keenwiki:Keen: Next (level editor)|Keen: Next]]
| Platform = Windows
| grp = N/A
| map = Edit
| gfx = No
| mus = No
| sfx = No
| txt = No
| sav = No
| exe = No
}}
{{FileFormatTool
| Name = [[keenwiki:Keengraph|Keengraph]]
| Platform = DOS/Win/Linux
| grp = N/A
| map = No
| gfx = Edit
| mus = No
| sfx = No
| txt = No
| sav = No
| exe = No
}}
{{FileFormatTool
| Name = [[keenwiki:KeenWave|KeenWave]]
| Platform = DOS/Win/Linux
| grp = N/A
| map = No
| gfx = No
| mus = Replace
| sfx = Edit
| txt = No
| sav = No
| exe = No
}}
{{FileFormatTool
| Name = EModKeen/LModKeen
| Platform = Win/Linux
| grp = N/A
| map = No
| gfx = Edit
| mus = No
| sfx = No
| txt = No
| sav = No
| exe = No
}}
{{FileFormatTool
| Name = [[keenwiki:ModKeen|ModKeen]]
| Platform = DOS
| grp = N/A
| map = No
| gfx = Edit
| mus = No
| sfx = No
| txt = No
| sav = No
| exe = No
}}
{{FileFormatTool
| Name = [[keenwiki:The Omegamatic|The Omegamatic]]
| Platform = Windows
| grp = N/A
| map = Edit
| gfx = No
| mus = No
| sfx = No
| txt = No
| sav = No
| exe = No
}}
{{FileFormatTool
| Name = [[keenwiki:The Photachyon Transceiver|The Photachyon Transceiver]]
| Platform = Windows
| grp = N/A
| map = Edit
| gfx = No
| mus = No
| sfx = No
| txt = No
| sav = No
| exe = No
}}
{{FileFormatTool
| Name = [[keenwiki:Startext|Startext]]
| Platform = DOS
| grp = N/A
| map = No
| gfx = No
| mus = No
| sfx = No
| txt = Edit
| sav = No
| exe = No
}}
{{FileFormatTool
| Name = [[TED5]]
| Platform = DOS
| grp = N/A
| map = Edit
| gfx = No
| mus = No
| sfx = No
| txt = No
| sav = No
| exe = No
}}
{{EndFileFormatTools}}

 
{{BeginGameFileList}}
{{GameFile
| Name = audio.ck[456]
| Format = [[AudioT Format]], [[IMF Format]]
| KnownFormat = Yes
| Desc = Sound and music
}}
{{GameFile
| Name = config.ck[456]
| Format = [[Keen 4-6 Configuration File]]
| KnownFormat = No
| Desc = Settings and high scores
}}
{{GameFile
| Name = egagraph.ck[456]
| Format = [[EGAGraph Format]]
| KnownFormat = Yes
| Desc = Graphics, text and miscellaneous data
}}
{{GameFile
| Name = gamemaps.ck[456]
| Format = [[GameMaps Format]]
| KnownFormat = Yes
| Desc = Level maps
}}
{{GameFile
| Name = keen[456]e.exe
| Format = [[B800 Text]], [[Keen 4-6 Action Format]]
| KnownFormat = Yes
| Desc = There are a couple of text screens in the main .EXE file (including the text displayed on the screen when you quit)
}}
{{GameFile
| Name = keen[456]e.exe
| Format = [[Keen 4-6 Action Format]]
| KnownFormat = Yes
| Desc = Sprite behaviours are governed by this
}}
{{GameFile
| Name = keen[456]e.exe
| Format = [[Keen 4-6 Tileinfo Format]]
| KnownFormat = Yes
| Desc = Tile information is governed by this
}}
{{EndGameFileList}}

=== AudioT Format Details ===

Keen 4, 5 and 6 use the compressed version of the [[AudioT Format]] to store music and sound. The main audio file for each game, the AUDIOT.xxx file, is called AUDIO.CKx, where x is the number of the Keen game (4, 5 or 6). For example, the file for Keen 4 is AUDIO.CK4.

The two secondary files needed for this format, the AUDIOHED.xxx and AUDIODCT.xxx, are embedded inside the main .exe file of each game (KEEN4E.EXE, KEEN5E.EXE and KEEN6.EXE for the EGA versions of the game).

== Related Links ==

* [[Commander Keen 1-3]]
* [[Commander Keen Dreams]]
* The [[KeenWiki:Main Page|Commander Keen Wiki]]

[[Category:Apogee]]
[[Category:id Software]]
[[Category:Sidescroller]]

Bio Menace

2014-11-11T20:18:05Z

Fleexy: Added notes to entries, fixed Abiathar link

{{Stub}}
{{Game Infobox
| Levels = Edit
| Tiles = No
| Sprites = No
| Fullscreen = No
| Sound = Some
| Music = Some
| Text = No
| Story = No
| Interface = No
}}
'''Bio Menace''' is a series of three games known simply as Bio Menace 1, 2 and 3. They use a slightly altered version of the [[Commander Keen 4-6]] engine, so the file formats used are almost identical.

The game follows the adventures of Snake Logan, a top secret operative at the CIA, who takes on missions others would see as suicidal, as he battles hordes of strange mutant monsters that have suddenly appeared and are taking over cities. The goals of the game vary, from rescuing trapped prisoners to destroying an evil computer. Notably, in the second game, if you do not get the portable nuclear device in one of the levels, the game becomes unwinnable and you are treated to the story of Snake's eventual death.

{{BeginFileFormatTools|Type=game}}
{{FileFormatTool
| Name = [http://www.shikadi.net/keenwiki/Abiathar Abiathar]
| Platform = Windows
| grp = N/A
| map = Edit
| gfx = No
| mus = No
| sfx = No
| txt = No
| sav = No
| exe = No
| notes = Offers templates for Bio Menace projects
}}
{{FileFormatTool
| Name = [[TED5]]
| Platform = DOS GUI
| grp = N/A
| map = Edit
| gfx = No
| mus = No
| sfx = No
| txt = No
| sav = No
| exe = No
| notes = Must be [http://people.inf.elte.hu/szgrahi/bmenace_mod.zip configured to work with Bio Menace]
}}
{{EndFileFormatTools}}

 
== File formats ==

{|class="wikitable"
! File name !! Description
|-
| <tt>audiohed.bm* audiot.bm*</tt>
| style="background: #CCFFCC;" | Sound effects and music in [[AudioT Format]] (and within that, music in [[IMF Format]])
|-
| <tt>bmenace[123].exe</tt>
| style="background: #FFFFCC;" | Exit text screens in [[B800 Text]] format
|-
| <tt>egadict.bm* egahead.bm* egagraph.bm*</tt>
| style="background: #CCFFCC;" | Graphics, text and miscellaneous data in [[EGAGraph Format]]
|-
| <tt>gamemaps.bm* maphead.bm* maptemp.bm* mapthead.bm*</tt>
| style="background: #CCFFCC;" | Levels in [[GameMaps Format]]
|}

* Sprite behaviours are governed by data structures in the .exe, arranged in [[Keen 4-6 Action Format]]

== Music ==

=== AudioT Format Details ===
Bio Menace uses the uncompressed version of the [[AudioT Format]] to store music and sound. The main audio file for each game, the AUDIOT.xxx file, is called AUDIOT.BMx, where x is the number of the game (1, 2 or 3). For example, the file for Bio Menace 2 is AUDIOT.BM2. The secondary file needed for this format, AUDIOHED.BMx, is external.

=== Levels Songs Assignment ===
There is also a place inside the executables where each level in the game is assigned a song from the AUDIOT.BMx file. From that address, the following bytes indicate the song for each level. Every 2 bytes is a value corresponding to a level: the first 2 bytes correspond to level 0, the next 2 bytes to level 1, and so on. Each 2-byte value is an integer giving the song number used in that level, starting from $00 $00, and going up to $00 $01, and so on. These numbers refer to the order of the songs in the AUDIOT.BMx file; $00 $00 is the first one, $00 $01 is the second one, and so on.

It is notable that Biomenace has many more songs in it than the Keen 4-6 games upon which it was based.

== Obtaining the game ==

The full version of this game [http://www.3drealms.com/news/2005/12/bio_menace_released_as_freewar.html has been released as freeware].

* Three different versions are available from [http://www.classicdosgames.com/apogee.html Classic DOS Games]

[[Category:Apogee]]
[[Category:Sidescroller]]
[[Category:Freeware]]

Bio Menace

2014-11-11T20:11:09Z

Fleexy: Added Abiathar to editing program list

{{Stub}}
{{Game Infobox
| Levels = Edit
| Tiles = No
| Sprites = No
| Fullscreen = No
| Sound = Some
| Music = Some
| Text = No
| Story = No
| Interface = No
}}
'''Bio Menace''' is a series of three games known simply as Bio Menace 1, 2 and 3. They use a slightly altered version of the [[Commander Keen 4-6]] engine, so the file formats used are almost identical.

The game follows the adventures of Snake Logan, a top secret operative at the CIA, who takes on missions others would see as suicidal, as he battles hordes of strange mutant monsters that have suddenly appeared and are taking over cities. The goals of the game vary, from rescuing trapped prisoners to destroying an evil computer. Notably, in the second game, if you do not get the portable nuclear device in one of the levels, the game becomes unwinnable and you are treated to the story of Snake's eventual death.

{{BeginFileFormatTools|Type=game}}
{{FileFormatTool
| Name = [[TED5]]
| Platform = DOS GUI
| grp = N/A
| map = Edit
| gfx = No
| mus = No
| sfx = No
| txt = No
| sav = No
| exe = No
}}
{{FileFormatTool
| Name = [[Abiathar]]
| Platform = Windows
| grp = N/A
| map = Edit
| gfx = No
| mus = No
| sfx = No
| txt = No
| sav = No
| exe = No
}}
{{EndFileFormatTools}}

* TED5 must be [http://people.inf.elte.hu/szgrahi/bmenace_mod.zip configured to work with Bio Menace]

 
== File formats ==

{|class="wikitable"
! File name !! Description
|-
| <tt>audiohed.bm* audiot.bm*</tt>
| style="background: #CCFFCC;" | Sound effects and music in [[AudioT Format]] (and within that, music in [[IMF Format]])
|-
| <tt>bmenace[123].exe</tt>
| style="background: #FFFFCC;" | Exit text screens in [[B800 Text]] format
|-
| <tt>egadict.bm* egahead.bm* egagraph.bm*</tt>
| style="background: #CCFFCC;" | Graphics, text and miscellaneous data in [[EGAGraph Format]]
|-
| <tt>gamemaps.bm* maphead.bm* maptemp.bm* mapthead.bm*</tt>
| style="background: #CCFFCC;" | Levels in [[GameMaps Format]]
|}

* Sprite behaviours are governed by data structures in the .exe, arranged in [[Keen 4-6 Action Format]]

== Music ==

=== AudioT Format Details ===
Bio Menace uses the uncompressed version of the [[AudioT Format]] to store music and sound. The main audio file for each game, the AUDIOT.xxx file, is called AUDIOT.BMx, where x is the number of the game (1, 2 or 3). For example, the file for Bio Menace 2 is AUDIOT.BM2. The secondary file needed for this format, AUDIOHED.BMx, is external.

=== Levels Songs Assignment ===
There is also a place inside the executables where each level in the game is assigned a song from the AUDIOT.BMx file. From that address, the following bytes indicate the song for each level. Every 2 bytes is a value corresponding to a level: the first 2 bytes correspond to level 0, the next 2 bytes to level 1, and so on. Each 2-byte value is an integer giving the song number used in that level, starting from $00 $00, and going up to $00 $01, and so on. These numbers refer to the order of the songs in the AUDIOT.BMx file; $00 $00 is the first one, $00 $01 is the second one, and so on.

It is notable that Biomenace has many more songs in it than the Keen 4-6 games upon which it was based.

== Obtaining the game ==

The full version of this game [http://www.3drealms.com/news/2005/12/bio_menace_released_as_freewar.html has been released as freeware].

* Three different versions are available from [http://www.classicdosgames.com/apogee.html Classic DOS Games]

[[Category:Apogee]]
[[Category:Sidescroller]]
[[Category:Freeware]]

Commander Keen Dreams

2014-08-12T19:59:18Z

Fleexy: Noted the possibility of modding levels and graphics

{{Stub}}

{{Game Infobox
| Levels = Edit
| Tiles = Edit
| Sprites = Edit
| Fullscreen = No
| Sound = No
| Music = No
| Text = No
| Story = No
| Interface = No
}}
== File formats ==

* [[AudioT Format]] - used for storing sound and music data
* [[B800 Text]] - there are a couple of text screens in the main .EXE file (including the text displayed on the screen when you quit)
* [[EGAGraph Format]] - used for storing graphics, text and miscellaneous data
* [[Keen Dreams level Format]] - used for storing level maps
* [[Keen 4-6 Action Format]] - sprite behaviors are governed by this
* [[Keen 4-6 Tileinfo Format]] - tile information is governed by this
* [[SHL data]] - the start menu is in this format
* [[SLIB compression]] - the title screen is compressed with this

Compression used in Keen Dreams is complex; most of the 'registered' (non-shareware) executables are compressed using [[PKLite compression]], and demo versions have additional files or filenames. Uniquely, the game maps are compressed using [[Huffman Compression]] instead of [[Carmack compression]], as they were made with an earlier version of [[TED5]] (TED3).

== Versions ==

There are no fewer than six versions of Keen Dreams available; these differ very little in actual gameplay, but markedly in file structure. Most of these are listed at [http://www.shikadi.net/keenwiki/Keen_Dreams_Versions the Commander Keen Wiki]. (They omit version 1.0 and the Id Anthology.)

The main moddable version is version 1.13, which is not the latest version of the game. Notably it has less in-game help and text, requires the player to run a program from Gamer's Edge before playing and names its files differently from the later Keen 4-6 manner. It also uses a separate file for its start screen, which is unusual.

== Related Links ==
* [[Commander Keen 1-3]]
* [[Commander Keen 4-6]]
* The [[KeenWiki:Main Page|Commander Keen Wiki]]

[[Category:Softdisk]]
[[Category:id Software]]
[[Category:Sidescroller]]

TED5

2014-06-15T13:36:14Z

Fleexy: /* GFXINFOE.xxx */ Filled in fields. Source: https://github.com/Fleex255/TED5/blob/master/TED5.H

{{Tool Infobox
| Image = TED5-Keen4.png
| Platform = DOS
| Edit1 = Map
}}

Tile Editor v5.0 (TED5) is a level editor written by [[:Category:id Software|id Software]] that has been used to create the levels for many of their games. It edits the [[GameMaps Format]] seen in many older id Software games and is also apparently able to edit the [[Commander Keen 1-3 Level format]], though how is not known.

== Files used by TED5 ==

TED5 is compatible with any id Software game that has a <tt>GAMEMAPS.xxx</tt> file. However, the <tt>GAMEMAPS.xxx</tt> file is only the final product; many other supporting files are required for TED5 to function.

If the game is 'compressed', it will not have most of these files. [[Commander Keen 4-6]] is one such game. In contrast 'uncompressed' games such as [[Bio Menace]] will have the full complement of files. Sadly, most games are compressed, as this is most efficient.

=== EGAGRAPH.xxx, EGAHEAD.xxx and EGADCT.xxx ===

These are explained fully under [[EGAGraph Format]]; they are respectively the game graphics, the graphics header and the [[Huffman Compression]] dictionary for decompressing graphics. Compressed games will just have the <tt>EGAGRAPH</tt>; with the HEAD and DCT files stored internally in the executable. Uncompressed games will have an external <tt>EGAHEAD</tt> and no need for an <tt>EGADCT</tt> file.

Like all Huffman dictionaries, the <tt>EGADCT</tt> file can be located in the executable by looking for the string <tt>$FD $01 $00 $00 $00 $00</tt>, which is found at the file's end. (The file will be 1024 bytes long.) The <tt>EGAHEAD</tt> file can be located by looking for a 4 (or 3) byte string which is the <tt>EGAGRAPH</tt> file size; this is always the last entry. (If you then move backwards until you reach a zero value entry, you will have the start of the file.)
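The dictionary search described above might look like this in Python. It is a sketch that assumes the six-byte string forms the very last bytes of the 1024-byte dictionary; <tt>find_huffman_dictionaries</tt> is an illustrative name only.

```python
DICT_TAIL = bytes([0xFD, 0x01, 0x00, 0x00, 0x00, 0x00])
DICT_SIZE = 1024  # dictionary length stated in the text


def find_huffman_dictionaries(exe: bytes):
    """Return start offsets of possible Huffman dictionaries in exe.

    Hypothetical helper; assumes DICT_TAIL is the dictionary's last 6 bytes.
    """
    found = []
    pos = exe.find(DICT_TAIL)
    while pos >= 0:
        # Step back from the end of the tail to the dictionary start.
        start = pos + len(DICT_TAIL) - DICT_SIZE
        if start >= 0:
            found.append(start)
        pos = exe.find(DICT_TAIL, pos + 1)
    return found
```

An executable may hold several dictionaries (graphics, audio, maps), so the function returns every match rather than only the first.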

You will receive an error if these files are absent. If the DCT is missing then TED5 will assume graphics are uncompressed.


=== Gamemaps ===

The <tt>GAMEMAPS</tt> file is not used, but is produced by TED5. It contains the compressed game maps and is produced from the <tt>MAPTEMP</tt> file when the user selects 'Carmackize maps' in TED5. Carmackization is complex and takes a long time to do. Because of this it is recommended that modders use Instant Carma! (See below)

=== GFXINFOE.xxx ===

This is a 46 byte file that contains data allowing TED5 to extract the graphics used in levels. It consists of the counts and starting locations of the various tile and graphics types.

Levels can be made of 8x8, 16x16 or 32x32 tiles (always 16 color EGA). Levels have a foreground plane, and may also have a background and info (sprite) plane. The structure of the <tt>GFXINFOE</tt> file is as follows:

{|class="wikitable"
!Offset!!Type!!Description
|-
|0||[[UINT16LE]]||Number of 8x8 background tiles
|-
|2||[[UINT16LE]]||Number of 8x8 foreground tiles
|-
|4||[[UINT16LE]]||Number of 16x16 background tiles
|-
|6||[[UINT16LE]]||Number of 16x16 foreground tiles
|-
|8||[[UINT16LE]]||Number of 32x32 background tiles
|-
|10||[[UINT16LE]]||Number of 32x32 foreground tiles
|-
|12||[[UINT16LE]]||8x8 back tile start
|-
|14||[[UINT16LE]]||8x8 fore tile start
|-
|16||[[UINT16LE]]||16x16 back start
|-
|18||[[UINT16LE]]||16x16 fore start
|-
|20||[[UINT16LE]]||32x32 back start
|-
|22||[[UINT16LE]]||32x32 fore start
|-
|24||[[UINT16LE]]||Number of pictures
|-
|26||[[UINT16LE]]||Number of masked pictures
|-
|28||[[UINT16LE]]||Number of sprites
|-
|30||[[UINT16LE]]||Picture start
|-
|32||[[UINT16LE]]||Mask picture start
|-
|34||[[UINT16LE]]||Sprite start
|-
|36||[[UINT16LE]]||"offpicstr"
|-
|38||[[UINT16LE]]||"offpicmstr"
|-
|40||[[UINT16LE]]||"offstrstr"
|-
|42||[[UINT16LE]]||Number of extra EGA resources
|-
|44||[[UINT16LE]]||Extra EGA resource start
|}

The location (start) given for tiles is the entry number in the <tt>EGAHEAD.xxx</tt> file. Most <tt>EGAHEAD.xxx</tt> files consist of 3 byte entries, so the actual location in the file, in bytes, is three times this, though older games (such as [[Commander Keen Dreams]]) have 4 byte entries. (It is not known if or how TED5 tells the difference between these.) Each tile has its own entry in the header.
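
As a rough sketch of that calculation, the following C function reads one header entry given its entry number and the entry width. The function name <tt>egahead_entry</tt> is hypothetical, and the little-endian byte order is an assumption that is not stated on this page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read entry 'index' from an EGAHEAD image held in memory.
   entry_size is 3 for most games, 4 for older ones.
   Assumption: entries are stored little-endian. */
uint32_t egahead_entry(const unsigned char *head, size_t index, size_t entry_size)
{
    assert(head != NULL);
    assert(entry_size == 3 || entry_size == 4);
    /* Byte position in the file = entry number * entry width. */
    const unsigned char *p = head + index * entry_size;
    uint32_t value = 0;
    for (size_t i = 0; i < entry_size; i++)
        value |= (uint32_t)p[i] << (8 * i);
    return value;
}
```

For example, the "16x16 back start" value from <tt>GFXINFOE</tt> could be passed as <tt>index</tt> to fetch the header entry of the first 16x16 background tile.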

The icons used by TED5 (For sprites, etc) are calculated as follows; icons follow directly after background tiles, then foreground tiles. There will thus be some 'extra' slots between the back and fore tiles. These are used as the number of icons.

This file is found with uncompressed games, but has to be created for compressed games. It is not currently known if it can be extracted somehow from the executable or if it is hard coded for each game.

=== Maptemp and Mapthead ===

The <tt>MAPTEMP</tt> file contains the uncarmackized level maps. Carmackization is a form of compression related to [[Keen 1-3 LZW compression]] in that it employs a sliding window. <tt>MAPTEMP</tt> is, however, still [[RLE Compression]] compressed to avoid excess size; this form remains easy for TED5 to edit. This file is changed each time levels are saved in TED5 and is not used by the game. To make usable levels you must select the item 'Carmackize maps' in the file menu, which will produce the <tt>GAMEMAPS</tt> file.
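
For illustration, a word-based RLE (RLEW) expander might look like the sketch below. The run layout assumed here (flag word, then a repeat count, then the value to repeat) is how RLEW is commonly described for these games, but it is not spelled out on this page, so treat it as an assumption; the function name <tt>rlew_expand</tt> is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand RLEW data. src holds src_words 16-bit words; at most dst_words
   words are written to dst. Returns the number of words produced.
   Assumed run layout: flag, count, value. */
size_t rlew_expand(const uint16_t *src, size_t src_words,
                   uint16_t *dst, size_t dst_words, uint16_t flag)
{
    assert(src != NULL && dst != NULL);
    size_t in = 0, out = 0;
    while (in < src_words && out < dst_words) {
        uint16_t w = src[in++];
        if (w == flag && in + 1 < src_words) {
            uint16_t count = src[in++];
            uint16_t value = src[in++];
            while (count-- > 0 && out < dst_words)
                dst[out++] = value;      /* repeat the value 'count' times */
        } else {
            dst[out++] = w;              /* literal word (or truncated run) */
        }
    }
    return out;
}
```

The <tt>flag</tt> argument would be the RLEW flag word stored in the map header (default $ABCD).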

<tt>MAPTHEAD</tt> is the header for the <tt>MAPTEMP</tt> file. When maps are carmackized it is used to create the <tt>MAPHEAD</tt> file. <tt>MAPTHEAD</tt> contains several variables used by TED5 and some older games. Notably it has information for id's tileinfo program to work with. Its structure is as follows:

{|class="wikitable"
!Offset!!Type!!Description
|-
|0||[[UINT16LE]]||Bit field for level planes (+1 unmasked, +2 masked, +4 infoplane)
|-
|2||[[UINT16LE]]||Type of tile (8x8 = 1, 16x16 = 2, 32x32 = 3)
|-
|4||[[UINT16LE]]||Number of unmasked tileinfo (TILEINFO) planes in file. Most games 2, speed and offset. Max 10
|-
|6||[[UINT16LE]]||Number of unmasked tiles
|-
|8||[[UINT32LE]][10]||Pointers to tileinfo planes in <tt>MAPTHEAD</tt> file
|-
|48||[[UINT16LE]][10]||Size of TILEINFO planes data.
|-
|68||[[char]][10][8]||TILEINFO plane names
|-
|148||[[UINT16LE]]||Number of masked tileinfo (TILEINFOM) planes in file. Usually 7. Max 10
|-
|150||[[UINT16LE]]||Number of masked tiles
|-
|152||[[UINT32LE]][10]||Pointers to TILEINFOM planes in file
|-
|192||[[UINT16LE]][10]||Size of TILEINFOM planes data.
|-
|212||[[char]][10][8]||TILEINFOM plane names
|-
|292||[[UINT16LE]]||RLEW flag, default $ABCD
|-
|294||[[UINT32LE]][100]||Pointers to level headers in <tt>MAPTEMP</tt>. 100 pointers, null values are -1 ($FFFFFFFF)
|-
|694||[[UINT32LE]][100]||Level header sizes in <tt>MAPTEMP</tt>. Older games Huffman-compressed the headers and used this. Newer games have $26 here, the (uncompressed) header size, and don't use it
|-
|1094||[[UINT16LE]]||Number of ICON rows TED5 sets aside from masked tiles to display icons. For most games this is 5
|-
|1096||[[char]][x]||Optional TILEINFOM and TILEINFO data
|}
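
A minimal C sketch of reading a few of these fields from a <tt>MAPTHEAD</tt> image held in memory is given below. The offsets are taken from the table above; the function and macro names are hypothetical and only serve the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAPTHEAD_RLEW_OFFSET   292          /* RLEW flag */
#define MAPTHEAD_LEVELS_OFFSET 294          /* 100 level header pointers */
#define MAPTHEAD_MAX_LEVELS    100
#define MAPTHEAD_NO_LEVEL      0xFFFFFFFFu  /* null pointer value (-1) */

static uint16_t get_u16le(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_u32le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

uint16_t mapthead_rlew_flag(const unsigned char *head)
{
    return get_u16le(head + MAPTHEAD_RLEW_OFFSET);
}

/* Offset of level 'level' inside MAPTEMP, or MAPTHEAD_NO_LEVEL if unused. */
uint32_t mapthead_level_offset(const unsigned char *head, int level)
{
    assert(level >= 0 && level < MAPTHEAD_MAX_LEVELS);
    return get_u32le(head + MAPTHEAD_LEVELS_OFFSET + 4 * level);
}

/* Count the level slots that hold a real pointer. */
int mapthead_count_levels(const unsigned char *head, size_t len)
{
    assert(len >= (size_t)(MAPTHEAD_LEVELS_OFFSET + 4 * MAPTHEAD_MAX_LEVELS));
    int used = 0;
    for (int i = 0; i < MAPTHEAD_MAX_LEVELS; i++)
        if (mapthead_level_offset(head, i) != MAPTHEAD_NO_LEVEL)
            used++;
    return used;
}
```

Reading byte by byte like this keeps the code independent of the host's own byte order.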

=== Tedinfo ===

This file is used by TED5 to store various details of the level format. Important here is the number of planes in each level, and the number of icons. Icons (used for sprites) are taken from the foreground tiles. The known values are:

{|class="wikitable"
!Offset!!Type!!Description
|-
|0||[[UINT16LE]]||Level TED opens in (last edited)
|-
|2||[[UINT16LE]]||Screen view, close (1) or distant (2)
|-
|colspan=3|...
|-
|8||[[UINT16LE]]||Number of tile planes in levels (usually 2, fore and back)
|-
|colspan=3|...
|-
|35||[[BYTE]]||Planes editable/visible in TED; +1 view icon, +2 view fore, +4 view back, +16 edit icon, +32 edit fore, +64 edit back
|-
|36||[[UINT16LE]]||h loc in level
|-
|38||[[UINT16LE]]||v loc in level
|-
|colspan=3|...
|-
|119||[[char]][64]||Import map path
|}

== Backup files ==

TED5 backs up all its files when levels are saved or carmackized or TED5 is exited. The old files are saved as <tt>*.BAK</tt> files and a simple renaming will suffice to undo the latest change. Only one backup is made so it is wise to save these files occasionally.

== Setting up TED5 ==

For uncompressed games, such as [[Bio Menace]], all that is required is to copy the file <tt>TED5.EXE</tt> into the game's directory. For compressed games the situation is slightly more complex.

Some games, such as [[Commander Keen 4-6]], have a TED setup utility to either extract or create the necessary files. If this is the case then a two-step process is required. The first step involves copying TED5 and the setup utility and running the setup to obtain the required files. The second step involves using a patching utility (such as [[CKPatch]]) to patch the modified files back into the executable.

Finally, some games, such as [[Catacomb 3D]], have no utilities at present. While it is not impossible to modify the levels in these games, it takes a bit more work, since the required files must be extracted manually and the executables illegally modified. It has already been mentioned how the graphics files can be located in an executable, but other files will need to be hard coded until somebody automates the procedure.

== Modifying levels ==

If TED5 is set up correctly then it should run (DOSBox may be required for this) and go to the first level in the game, usually a world map. There are a number of commands and actions that can be used, which will not be covered here. After the desired changes have been made the user must select 'Carmackize maps' from the file menu to produce a modified <tt>GAMEMAPS</tt> and <tt>MAPHEAD</tt> file. For compressed games such as Keen this file will then need to be patched into the executable.

A level can be completely replaced with one from another GAMEMAPS file by using the 'Import levels' command under the file menu. (You will need to specify a path to a valid GAMEMAPS file.) This allows a person to copy levels between backups, etc.

The program Instant Carma! skips Carmack compression by doing the very minimal amount of work. It takes a fraction of the time of TED5 and can be used while TED5 is running and is thus very useful. Sadly it is only available for Commander Keen 4-6.



== Other utilities ==

* CKPatch: A set of utilities that allow Keen 4-6 executables to be modified temporarily and legally by patching a copy of the executable into memory and modifying that. This is vital for using new levels in Keen 4-6. There are several versions available, with the latest having the most features. http://www.bipship.com/CKPatch
* Fixmhead: This converts the <tt>MAPTHEAD</tt> file to <tt>MAPHEAD</tt>. It is not really required for anything, but is included with a lot of TED5 packs.
* Galaxymk: This pack contains everything needed to set Commander Keen 4-6 games up for editing with TED5, including the setups and patchers. http://keenmodding.org/search.php?search_author=The_Fosti&sid=92cde18022face91063ec52109b6c2ec
* Instant Carma!: Written by CK Guy, this externalizes the carmackization of Keen 4-6 maps, saving time and improving efficiency. http://www.keenmodding.org/viewtopic.php?t=997
* Tedsetup: A program that sets up certain games to be editable by TED5, automatically extracting or creating required files. Often comes with TED5.

== Supported Games ==

This is a list of games that have been successfully edited with TED5: (though many of them have dedicated editors that are more user friendly)

* [[Bio Menace]]
* [[Catacomb 3D]]
* [[Commander Keen 4-6]]
* [[Commander Keen Dreams]]
* [[Dangerous Dave 3]]
* [[Dangerous Dave 4]] (Dave Goes Nutz)
* [[Rise of the Triad]]
* [[Wolfenstein 3-D]]

== Download ==

TED5 has been released as open source freeware. TED5 and its source code can be downloaded from [http://www.3drealms.com/downloads.html#rott 3D Realms]. It can be used to edit the levels of Bio Menace and to create new levels for Rise of the Triad. It is also possible to edit the levels of Commander Keen 4-6, but this is a bit harder to [http://www.keenmodding.org/viewtopic.php?t=899 set up].

[[Category:id Software]]



[[Category:Bio Menace]]
[[Category:Catacomb 3D]]
[[Category:Commander Keen 4-6]]
[[Category:Commander Keen Dreams]]
[[Category:Dangerous Dave 3]]
[[Category:Dangerous Dave 4]]
[[Category:Rise of the Triad]]
[[Category:Wolfenstein 3-D]]

Talk:TED5

2014-06-14T19:26:30Z

Fleexy: Gave links to TED5 source

We need more details on how Dave 3&4 have been edited with TED, since these use a different level format to that of other games.~!Levellass;20-9-2009
:Different even to Keen Dreams? Maybe we should have a bunch of instructions explaining how to configure TED5 for each game. -- [[User:Malvineous|Malvineous]] 22:32, 20 September 2009 (GMT)

I'm working on filling in the format specs, you can check out the TED5 source code at https://github.com/Fleex255/TED5/ or look at [https://github.com/Fleex255/TED5/blob/a602282af71d9c085c1270211889313131e4d24c/TED5.H the TED5.H file that contains most of the typedefs]. [[User:Fleexy|Fleexy]] ([[User talk:Fleexy|talk]]) 19:26, 14 June 2014 (GMT)

TED5

2014-06-14T19:23:20Z

Fleexy: /* Maptemp and Mapthead */ Corrected and filled in the MAPTHEAD format. Source: https://github.com/Fleex255/TED5/blob/a602282af71d9c085c1270211889313131e4d24c/TED5.H

{{Tool Infobox
| Image = TED5-Keen4.png
| Platform = DOS
| Edit1 = Map
}}

Tile Editor v5.0 (TED5) is a level editor written by [[:Category:id Software|id Software]] that has been used to create the levels for many of their games. It edits the [[GameMaps Format]] seen in many older id Software games and is also apparently able to edit the [[Commander Keen 1-3 Level format]], though how is not known.

== Files used by TED5 ==

TED5 is compatible with any id game that has a <tt>GAMEMAPS.xxx</tt> file. However, the <tt>GAMEMAPS.xxx</tt> file is only the final product; many other supporting files are required for TED5 to function.

If the game is 'compressed', it will not have most of these files. [[Commander Keen 4-6]] is one such game. In contrast 'uncompressed' games such as [[Bio Menace]] will have the full complement of files. Sadly, most games are compressed, as this is most efficient.

=== EGAGRAPH.xxx, EGAHEAD.xxx and EGADCT.xxx ===

These are explained fully under [[EGAGraph Format]]; they are respectively the game graphics, the graphics header and the [[Huffman Compression]] dictionary for decompressing graphics. Compressed games will have only the <tt>EGAGRAPH</tt> file, with the HEAD and DCT files stored internally in the executable. Uncompressed games will have an external <tt>EGAHEAD</tt> and no need for an <tt>EGADCT</tt> file.

Like all Huffman dictionaries, the <tt>EGADCT</tt> file can be located in the executable by looking for the string <tt>$FD $01 $00 $00 $00 $00</tt>, which is found at the file's end. (The file will be 1024 bytes long.) The <tt>EGAHEAD</tt> file can be located by looking for a 4 (or 3) byte string which is the <tt>EGAGRAPH</tt> file size; this is always the last entry. (If you then move backwards until you reach a zero-value entry, you will have the start of the file.)

You will receive an error if these files are absent. If the DCT is missing then TED5 will assume graphics are uncompressed.


=== Gamemaps ===

The <tt>GAMEMAPS</tt> file is not used by TED5 for editing, but is produced by it. It contains the compressed game maps and is produced from the <tt>MAPTEMP</tt> file when the user selects 'Carmackize maps' in TED5. Carmackization is complex and takes a long time to do. Because of this it is recommended that modders use Instant Carma! (see below).

=== GFXINFOE.xxx ===

This is a 46 byte file that contains data allowing TED5 to extract the graphics used in levels. It consists of the counts and locations of the various tile types.

Levels can be made of 8x8, 16x16 or 32x32 tiles (always 16 color EGA). Levels have a foreground plane, and may also have a background and info (sprite) plane. The structure of the <tt>GFXINFOE</tt> file is as follows:

{|class="wikitable"
!Offset!!Type!!Description
|-
|0||[[UINT16LE]]||Number of 8x8 background tiles
|-
|2||[[UINT16LE]]||Number of 8x8 foreground tiles
|-
|4||[[UINT16LE]]||Number of 16x16 background tiles
|-
|6||[[UINT16LE]]||Number of 16x16 foreground tiles
|-
|8||[[UINT16LE]]||Number of 32x32 background tiles
|-
|10||[[UINT16LE]]||Number of 32x32 foreground tiles
|-
|12||[[UINT16LE]]||8x8 back tile start
|-
|14||[[UINT16LE]]||8x8 fore tile start
|-
|16||[[UINT16LE]]||16x16 back start
|-
|18||[[UINT16LE]]||16x16 fore start
|-
|20||[[UINT16LE]]||32x32 back start
|-
|22||[[UINT16LE]]||32x32 fore start
|-
|26||[[char]][20]||??? Usually blank, probably has some function
|}

The location (start) given for tiles is the entry number in the <tt>EGAHEAD.xxx</tt> file. Most <tt>EGAHEAD.xxx</tt> files consist of 3 byte entries, so the actual location in the file, in bytes, is three times this, though older games (such as [[Commander Keen Dreams]]) have 4 byte entries. (It is not known if or how TED5 tells the difference between these.) Each tile has its own entry in the header.

The icons used by TED5 (For sprites, etc) are calculated as follows; icons follow directly after background tiles, then foreground tiles. There will thus be some 'extra' slots between the back and fore tiles. These are used as the number of icons.

This file is found with uncompressed games, but has to be created for compressed games. It is not currently known if it can be extracted somehow from the executable or if it is hard coded for each game.

=== Maptemp and Mapthead ===

The <tt>MAPTEMP</tt> file contains the uncarmackized level maps. Carmackization is a form of compression related to [[Keen 1-3 LZW compression]] in that it employs a sliding window. <tt>MAPTEMP</tt> is, however, still [[RLE Compression]] compressed to avoid excess size; this form remains easy for TED5 to edit. This file is changed each time levels are saved in TED5 and is not used by the game. To make usable levels you must select the item 'Carmackize maps' in the file menu, which will produce the <tt>GAMEMAPS</tt> file.

<tt>MAPTHEAD</tt> is the header for the <tt>MAPTEMP</tt> file. When maps are carmackized it is used to create the <tt>MAPHEAD</tt> file. <tt>MAPTHEAD</tt> contains several variables used by TED5 and some older games. Notably it has information for id's tileinfo program to work with. Its structure is as follows:

{|class="wikitable"
!Offset!!Type!!Description
|-
|0||[[UINT16LE]]||Bit field for level planes (+1 unmasked, +2 masked, +4 infoplane)
|-
|2||[[UINT16LE]]||Type of tile (8x8 = 1, 16x16 = 2, 32x32 = 3)
|-
|4||[[UINT16LE]]||Number of unmasked tileinfo (TILEINFO) planes in file. Most games 2, speed and offset. Max 10
|-
|6||[[UINT16LE]]||Number of unmasked tiles
|-
|8||[[UINT32LE]][10]||Pointers to tileinfo planes in <tt>MAPTHEAD</tt> file
|-
|48||[[UINT16LE]][10]||Size of TILEINFO planes data.
|-
|68||[[char]][10][8]||TILEINFO plane names
|-
|148||[[UINT16LE]]||Number of masked tileinfo (TILEINFOM) planes in file. Usually 7. Max 10
|-
|150||[[UINT16LE]]||Number of masked tiles
|-
|152||[[UINT32LE]][10]||Pointers to TILEINFOM planes in file
|-
|192||[[UINT16LE]][10]||Size of TILEINFOM planes data.
|-
|212||[[char]][10][8]||TILEINFOM plane names
|-
|292||[[UINT16LE]]||RLEW flag, default $ABCD
|-
|294||[[UINT32LE]][100]||Pointers to level headers in <tt>MAPTEMP</tt>. 100 pointers, null values are -1 ($FFFFFFFF)
|-
|694||[[UINT32LE]][100]||Level header sizes in <tt>MAPTEMP</tt>. Older games Huffman-compressed the headers and used this. Newer games have $26 here, the (uncompressed) header size, and don't use it
|-
|1094||[[UINT16LE]]||Number of ICON rows TED5 sets aside from masked tiles to display icons. For most games this is 5
|-
|1096||[[char]][x]||Optional TILEINFOM and TILEINFO data
|}

=== Tedinfo ===

This file is used by TED5 to store various details of the level format. Important here is the number of planes in each level, and the number of icons. Icons (used for sprites) are taken from the foreground tiles. The known values are:

{|class="wikitable"
!Offset!!Type!!Description
|-
|0||[[UINT16LE]]||Level TED opens in (last edited)
|-
|2||[[UINT16LE]]||Screen view, close (1) or distant (2)
|-
|colspan=3|...
|-
|8||[[UINT16LE]]||Number of tile planes in levels (usually 2, fore and back)
|-
|colspan=3|...
|-
|35||[[BYTE]]||Planes editable/visible in TED; +1 view icon, +2 view fore, +4 view back, +16 edit icon, +32 edit fore, +64 edit back
|-
|36||[[UINT16LE]]||h loc in level
|-
|38||[[UINT16LE]]||v loc in level
|-
|colspan=3|...
|-
|119||[[char]][64]||Import map path
|}

== Backup files ==

TED5 backs up all its files when levels are saved or carmackized or TED5 is exited. The old files are saved as <tt>*.BAK</tt> files and a simple renaming will suffice to undo the latest change. Only one backup is made so it is wise to save these files occasionally.

== Setting up TED5 ==

For uncompressed games, such as [[Bio Menace]], all that is required is to copy the file <tt>TED5.EXE</tt> into the game's directory. For compressed games the situation is slightly more complex.

Some games, such as [[Commander Keen 4-6]], have a TED setup utility to either extract or create the necessary files. If this is the case then a two-step process is required. The first step involves copying TED5 and the setup utility and running the setup to obtain the required files. The second step involves using a patching utility (such as [[CKPatch]]) to patch the modified files back into the executable.

Finally, some games, such as [[Catacomb 3D]], have no utilities at present. While it is not impossible to modify the levels in these games, it takes a bit more work, since the required files must be extracted manually and the executables illegally modified. It has already been mentioned how the graphics files can be located in an executable, but other files will need to be hard coded until somebody automates the procedure.

== Modifying levels ==

If TED5 is set up correctly then it should run (DOSBox may be required for this) and go to the first level in the game, usually a world map. There are a number of commands and actions that can be used, which will not be covered here. After the desired changes have been made the user must select 'Carmackize maps' from the file menu to produce a modified <tt>GAMEMAPS</tt> and <tt>MAPHEAD</tt> file. For compressed games such as Keen this file will then need to be patched into the executable.

A level can be completely replaced with one from another GAMEMAPS file by using the 'Import levels' command under the file menu. (You will need to specify a path to a valid GAMEMAPS file.) This allows a person to copy levels between backups, etc.

The program Instant Carma! skips Carmack compression by doing the very minimal amount of work. It takes a fraction of the time of TED5 and can be used while TED5 is running and is thus very useful. Sadly it is only available for Commander Keen 4-6.



== Other utilities ==

* CKPatch: A set of utilities that allow Keen 4-6 executables to be modified temporarily and legally by patching a copy of the executable into memory and modifying that. This is vital for using new levels in Keen 4-6. There are several versions available, with the latest having the most features. http://www.bipship.com/CKPatch
* Fixmhead: This converts the <tt>MAPTHEAD</tt> file to <tt>MAPHEAD</tt>. It is not really required for anything, but is included with a lot of TED5 packs.
* Galaxymk: This pack contains everything needed to set Commander Keen 4-6 games up for editing with TED5, including the setups and patchers. http://keenmodding.org/search.php?search_author=The_Fosti&sid=92cde18022face91063ec52109b6c2ec
* Instant Carma!: Written by CK Guy, this externalizes the carmackization of Keen 4-6 maps, saving time and improving efficiency. http://www.keenmodding.org/viewtopic.php?t=997
* Tedsetup: A program that sets up certain games to be editable by TED5, automatically extracting or creating required files. Often comes with TED5.

== Supported Games ==

This is a list of games that have been successfully edited with TED5: (though many of them have dedicated editors that are more user friendly)

* [[Bio Menace]]
* [[Catacomb 3D]]
* [[Commander Keen 4-6]]
* [[Commander Keen Dreams]]
* [[Dangerous Dave 3]]
* [[Dangerous Dave 4]] (Dave Goes Nutz)
* [[Rise of the Triad]]
* [[Wolfenstein 3-D]]

== Download ==

TED5 has been released as open source freeware. TED5 and its source code can be downloaded from [http://www.3drealms.com/downloads.html#rott 3D Realms]. It can be used to edit the levels of Bio Menace and to create new levels for Rise of the Triad. It is also possible to edit the levels of Commander Keen 4-6, but this is a bit harder to [http://www.keenmodding.org/viewtopic.php?t=899 set up].

[[Category:id Software]]



[[Category:Bio Menace]]
[[Category:Catacomb 3D]]
[[Category:Commander Keen 4-6]]
[[Category:Commander Keen Dreams]]
[[Category:Dangerous Dave 3]]
[[Category:Dangerous Dave 4]]
[[Category:Rise of the Triad]]
[[Category:Wolfenstein 3-D]]

File format data types

2014-01-03T20:26:32Z

Fleexy: Changed source language parameter on VB .NET stuff to be dot-netty

This is a list of all the data types used in the file format descriptions on the wiki. They are loosely based on common C/C++ data types, and should be used throughout the wiki for consistency.

== Type list ==

==== Numeric values ====

{|class="wikitable"
!Data type!!Description
|-
|UINT8||Unsigned 8-bit integer
|-
|UINT16LE||Unsigned 16-bit integer in little-endian format
|-
|UINT16BE||Unsigned 16-bit integer in big-endian format
|-
|UINT32LE||Unsigned 32-bit integer in little-endian format
|-
|UINT32BE||Unsigned 32-bit integer in big-endian format
|}

Signed equivalents are the same without the leading U, i.e. <tt>INT8</tt>, <tt>INT16LE</tt>, etc. Unless otherwise stated, the format is in [[wikipedia:Two's complement|two's complement]] (where a UINT8 value of 255 is -1 as an INT8, for example.)

==== Character strings ====
{|class="wikitable"
!Data type!!Description
|-
|char[x]||String ''x'' characters long
|-
|char||Single 8-bit character
|-
|ASCIIZ||A C-style string (variable-length, terminated with a single NULL/0x00 value)
|}

==== Misc data types ====
{|class="wikitable"
!Data type!!Description
|-
|BYTE||Same as UINT8 but conceptually for generic data rather than numeric values (e.g. UINT8 would be used for a number, while a BYTE would be used for a bitfield)
|-
|BYTE[x]||Block of data ''x'' bytes long
|}

== Big endian vs little endian ==

For numeric values larger than a single byte, the endianness specifies how the values are split over multiple bytes. For example a hex value of 0x1234AABB when written to a file will take up four bytes, as follows:

{|class="wikitable"
!Endian!!Bytes in file
|-
|Big||<code>12 34 AA BB</code>
|-
|Little||<code>BB AA 34 12</code>
|}

For those languages that allow direct memory access such as C/C++, converting an integer value to a byte array will reveal the value stored in-memory in the same order as the table above.

Normally when reading or writing a variable to a file a programmer will simply pass the memory address of the variable, resulting in the file mirroring the byte order in memory. This is no problem when reading the variable back in on the same system, as the byte order will match. However when reading data from a different system (for example using an Intel PC to read files from a PowerPC Mac) the byte order will be opposite to what the system expects and the programmer must convert the values manually.
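
One way to sidestep the problem is to build each value from individual bytes rather than copying memory directly, so the same code gives the same answer on any host. The helper names below (<tt>read_u16le</tt>, <tt>read_u32le</tt>, <tt>read_u32be</tt>) are hypothetical and only serve as a sketch of that approach:

```c
#include <assert.h>
#include <stdint.h>

/* Assemble a UINT16LE from two bytes as they appear in the file. */
uint16_t read_u16le(const unsigned char *p)
{
    assert(p != 0);
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Assemble a UINT32LE: the first byte in the file is the least significant. */
uint32_t read_u32le(const unsigned char *p)
{
    assert(p != 0);
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Assemble a UINT32BE: the first byte in the file is the most significant. */
uint32_t read_u32be(const unsigned char *p)
{
    assert(p != 0);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```

With the file bytes <code>BB AA 34 12</code> from the table above, <tt>read_u32le</tt> yields 0x1234AABB on both little-endian and big-endian machines.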

=== Conversion examples ===

If a value is being read on the same system (little to little or big to big) then no action is required. If the systems are different, then the values must be swapped. The following sections list examples for different programming languages.

==== C/C++ ====

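<source lang="c">
#include <stdint.h>

// 16-bit: swap the two bytes
uint16_t swap16(uint16_t in)
{
    return (uint16_t)(((in & 0xFF) << 8) | (in >> 8));
}
// swap16(0x1234) should be 0x3412

// 32-bit: unsigned, so the right shifts cannot drag in sign bits
uint32_t swap32(uint32_t in)
{
    return ((in & 0xFF) << 24) |
           ((in & 0xFF00) << 8) |
           ((in & 0xFF0000) >> 8) |
           (in >> 24);
}
// swap32(0x1234AABB) should be 0xBBAA3412
</source>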

==== Visual Basic .NET ====

<source lang="vbnet">
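Function ByteSwap16(ByVal InValue as Int16) as Int16
' >> on a signed type copies the sign bit, so mask the shifted-down byte
Return ((InValue And 255) << 8) Or ((InValue >> 8) And 255)
End Function

Function ByteSwap32(ByVal InValue as Int32) as Int32
' >> on a signed type copies the sign bit, so mask the shifted-down byte
Return ((InValue And &HFF) << 24) Or _
((InValue And &HFF00) << 8) Or _
((InValue And &HFF0000) >> 8) Or _
((InValue >> 24) And &HFF)
End Function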
</source>

[[Category:File Formats]]

User talk:Levellass

2014-01-03T17:32:51Z

Fleexy: Asked question about EGADICT

Hi Levellass - not sure if you've seen the question on [[Talk:Softdisk Library Format]] but since you wrote it you're probably best placed to answer! -- [[User:Malvineous|Malvineous]] 02:26, 6 January 2012 (GMT)

Hi Levellass - I reverted your change to [[IMF Format]] (removing [[:Category:File Formats]]) because I am trying to remove all the pages from that category, instead putting them into more specific ones. So in this case [[IMF Format]] will appear in [[:Category:All file formats]] as well as [[:Category:All music formats]]. And these categories should apply automatically when an infobox is added to the article page, so hopefully there will be no need to add categories manually! So if you spot any pages missing categories, please add infoboxes to them instead of just the bare categories. Thanks! -- [[User:Malvineous|Malvineous]] 09:03, 8 March 2013 (GMT)
: Nifty, that should certainly simplify things. -[[User:Levellass|Endian? What are you on about?]]

Hey, LL! Your new example/explanation of Huffman helped a lot in writing a decompressor, but I'm having an issue using it on EGAGRAPH. I verified that it is identical to the EGADICT created by KG and that I am reading it correctly. I get a tree full of reversed bit sequences, just like you said. However, the EGAGRAPH does not appear to use reversed bit sequences; it is completely uncompressed. (It does still have the UInt32LE at the front of the appropriate chunks specifying decompressed length.) Keen can read it and KeenGraph can read it, but using the provided dictionary, I can't. Can you help me? [[User:Fleexy|Fleexy]] ([[User talk:Fleexy|talk]]) 17:32, 3 January 2014 (GMT)

Huffman Compression

2014-01-02T16:24:31Z

Fleexy: Added HasValue property to BinaryTreeNode<T>

[[Huffman Compression]] is a compression algorithm used in many classic games. It is about as efficient as LZW or RLE compression, with many games using the three formats simultaneously.

Huffman compression involves making/reading a ''dictionary'' of 256 entries, one for each possible 1-byte character. This is organised in the form of a ''binary tree'' which is basically formed by taking the two lowest frequency characters and combining them into a new entry with their added frequencies and repeating until all entries are reduced to one ''root node''. After the dictionary is created, every character of data is replaced with the corresponding bit representation from the tree.

Any Huffman compressed data will thus be associated with a dictionary file (internal or external) used to decompress it.

==Nodes==

The binary tree must be stored in some form, usually called a ''dictionary''. The format varies from game to game, but is usually broadly along these lines. The tree will have 254 nodes (for all the one byte characters in game data) stored as two 'branches' of that node, usually taking up 4 bytes each, but sometimes 3. The second part is either 0 or 1, and says whether that branch goes to a character (0) or another node (1). The first part is the value of either that character or that node. (For the usual 4-byte implementation each part is a byte.)

In the Huffman table then we can expect to see each possible character TWICE, once as a character, once as a node reference. Note that the nodes are numbered from 0-253, NOT 1-254. The order of the nodes doesn't matter, only how they are connected.

The root node is ALWAYS node 254 (number 253!) and should always be something like $xx $yy $00 $00 (Since 0 is the most common character it is closest to the root.) It will also tend to be at the end of the dictionary, since most programs find it easier to write their dictionaries from bottom to top. This can often be used to find Huffman dictionaries in files, if you know what you are looking for.

==Example 1==

Compress the word "HUFFMAN". For simplicity, we will not use the usual 256 character tree, but rather something a bit smaller. (Huffman trees can be any size, the concept is the same.)

<ol>
<li>Frequencies are: 1: H,U,M,A,N 2: F</li>
<li>Make a binary tree: (here left = 0 and right = 1)
<pre>
'Root node'
*
/ \
/ *
/ / \
* * *
/ \ / \ / \
F N M A H U
</pre>
</li>
<li>The first letter is 'H'; this on the tree is <tt>110</tt>. The next letter is 'U', which is <tt>111</tt>. The third letter is 'F', which is <tt>00</tt>, and so on (notice that common letters have shorter strings). The final output in bits is <tt>110111000010010101</tt>.</li>
</ol>

This is not of course the optimum Huffman tree, but that doesn't matter, ANY tree will do. As bytes the output is thus: <tt>11011100 00100101 01000000</tt> (the end bit is padded with nulls) or <tt>$DC $25 $40</tt>. We have a reduction in size from 7 bytes to 3, over 50%! Sadly, we then need to include the Huffman table (the ''dictionary'') in some form.
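
The encoding step of this example can be sketched in C as below. The code table is simply read off the example tree, bits are packed most significant bit first to match the bytes quoted above, and the names <tt>code_for</tt> and <tt>huff_encode</tt> are hypothetical. Real game data may use a different bit order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bit strings for each letter, read off the example tree (left = 0, right = 1). */
static const char *code_for(char c)
{
    switch (c) {
    case 'F': return "00";
    case 'N': return "01";
    case 'M': return "100";
    case 'A': return "101";
    case 'H': return "110";
    case 'U': return "111";
    default:  return NULL;   /* letter not in the tree */
    }
}

/* Encode text into out, packing bits MSB first and padding the last
   byte with zero bits. Returns the number of bytes used. */
size_t huff_encode(const char *text, unsigned char *out, size_t out_size)
{
    size_t nbits = 0;
    memset(out, 0, out_size);
    for (; *text != '\0'; text++) {
        const char *code = code_for(*text);
        assert(code != NULL);
        for (; *code != '\0'; code++, nbits++) {
            assert(nbits / 8 < out_size);
            if (*code == '1')
                out[nbits / 8] |= (unsigned char)(0x80u >> (nbits % 8));
        }
    }
    return (nbits + 7) / 8;
}
```

Encoding "HUFFMAN" with this table gives the three bytes $DC $25 $40 stated above.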

A possibility for storing the dictionary might be: <tt>$00 $46 $00 $4E $00 $4D $00 $41 $00 $48 $00 $55 $01 $02 $01 $03 $01 $01 $01 $00</tt>. Here we have numbered the nodes starting at the root as 254 and labelled every other node from 0-4 going left to right and top to bottom. (If this is hard to see, try drawing things out on a piece of paper, the first 12 bytes are the bottom 3 nodes, all pointing to characters, and the last 8 bytes are the top two nodes, all pointing to other nodes.) This is the format used in Keen 4-6 executables.

To decompress the data we use the tree again, starting at the root node and reading the data bit by bit. So the first three bits are <tt>110</tt>, which leads us down the tree to character 'H', then <tt>111</tt>, which leads us to 'U', and so on.
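
A minimal C sketch of this tree walk follows. The in-memory node layout (<tt>Branch</tt>, <tt>Node</tt>), the node numbering in <tt>EXAMPLE_TREE</tt> and the function name <tt>huff_decode</tt> are all assumptions made for the example, not the on-disk format of any particular game. Because the last byte is padded with zero bits, the caller has to say how many characters to expect.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned char value;    /* character, or index of the child node */
    unsigned char is_node;  /* 0 = character, 1 = another node */
} Branch;

typedef struct {
    Branch bit0, bit1;      /* branch taken on a 0 bit / on a 1 bit */
} Node;

/* The example tree: nodes 0-2 hold the letters, node 4 is the root. */
static const Node EXAMPLE_TREE[5] = {
    {{'F', 0}, {'N', 0}},
    {{'M', 0}, {'A', 0}},
    {{'H', 0}, {'U', 0}},
    {{1, 1}, {2, 1}},
    {{0, 1}, {3, 1}},
};

/* Decode up to out_len characters from in (bits read MSB first).
   Returns the number of characters produced. */
size_t huff_decode(const Node *tree, size_t root,
                   const unsigned char *in, size_t in_len,
                   char *out, size_t out_len)
{
    assert(tree != NULL && in != NULL && out != NULL);
    size_t node = root, produced = 0;
    for (size_t i = 0; i < in_len * 8 && produced < out_len; i++) {
        int bit = (in[i / 8] >> (7 - i % 8)) & 1;
        Branch b = bit ? tree[node].bit1 : tree[node].bit0;
        if (b.is_node) {
            node = b.value;                   /* keep descending */
        } else {
            out[produced++] = (char)b.value;  /* reached a character */
            node = root;                      /* restart at the root */
        }
    }
    return produced;
}
```

Feeding it the bytes $DC $25 $40 with an expected length of 7 walks the tree exactly as described and recovers "HUFFMAN".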

==Example 2==

The following 1020 bytes constitute the complete 'trivial' Huffman dictionary, that is, one that does not compress the data at all:

$00 $00 $80 $00 $40 $00 $C0 $00 $20 $00 $A0 $00 $60 $00 $E0 $00
$10 $00 $90 $00 $50 $00 $D0 $00 $30 $00 $B0 $00 $70 $00 $F0 $00
$08 $00 $88 $00 $48 $00 $C8 $00 $28 $00 $A8 $00 $68 $00 $E8 $00
$18 $00 $98 $00 $58 $00 $D8 $00 $38 $00 $B8 $00 $78 $00 $F8 $00
$04 $00 $84 $00 $44 $00 $C4 $00 $24 $00 $A4 $00 $64 $00 $E4 $00
$14 $00 $94 $00 $54 $00 $D4 $00 $34 $00 $B4 $00 $74 $00 $F4 $00
$0C $00 $8C $00 $4C $00 $CC $00 $2C $00 $AC $00 $6C $00 $EC $00
$1C $00 $9C $00 $5C $00 $DC $00 $3C $00 $BC $00 $7C $00 $FC $00
$02 $00 $82 $00 $42 $00 $C2 $00 $22 $00 $A2 $00 $62 $00 $E2 $00
$12 $00 $92 $00 $52 $00 $D2 $00 $32 $00 $B2 $00 $72 $00 $F2 $00
$0A $00 $8A $00 $4A $00 $CA $00 $2A $00 $AA $00 $6A $00 $EA $00
$1A $00 $9A $00 $5A $00 $DA $00 $3A $00 $BA $00 $7A $00 $FA $00
$06 $00 $86 $00 $46 $00 $C6 $00 $26 $00 $A6 $00 $66 $00 $E6 $00
$16 $00 $96 $00 $56 $00 $D6 $00 $36 $00 $B6 $00 $76 $00 $F6 $00
$0E $00 $8E $00 $4E $00 $CE $00 $2E $00 $AE $00 $6E $00 $EE $00
$1E $00 $9E $00 $5E $00 $DE $00 $3E $00 $BE $00 $7E $00 $FE $00
$01 $00 $81 $00 $41 $00 $C1 $00 $21 $00 $A1 $00 $61 $00 $E1 $00
$11 $00 $91 $00 $51 $00 $D1 $00 $31 $00 $B1 $00 $71 $00 $F1 $00
$09 $00 $89 $00 $49 $00 $C9 $00 $29 $00 $A9 $00 $69 $00 $E9 $00
$19 $00 $99 $00 $59 $00 $D9 $00 $39 $00 $B9 $00 $79 $00 $F9 $00
$05 $00 $85 $00 $45 $00 $C5 $00 $25 $00 $A5 $00 $65 $00 $E5 $00
$15 $00 $95 $00 $55 $00 $D5 $00 $35 $00 $B5 $00 $75 $00 $F5 $00
$0D $00 $8D $00 $4D $00 $CD $00 $2D $00 $AD $00 $6D $00 $ED $00
$1D $00 $9D $00 $5D $00 $DD $00 $3D $00 $BD $00 $7D $00 $FD $00
$03 $00 $83 $00 $43 $00 $C3 $00 $23 $00 $A3 $00 $63 $00 $E3 $00
$13 $00 $93 $00 $53 $00 $D3 $00 $33 $00 $B3 $00 $73 $00 $F3 $00
$0B $00 $8B $00 $4B $00 $CB $00 $2B $00 $AB $00 $6B $00 $EB $00
$1B $00 $9B $00 $5B $00 $DB $00 $3B $00 $BB $00 $7B $00 $FB $00
$07 $00 $87 $00 $47 $00 $C7 $00 $27 $00 $A7 $00 $67 $00 $E7 $00
$17 $00 $97 $00 $57 $00 $D7 $00 $37 $00 $B7 $00 $77 $00 $F7 $00
$0F $00 $8F $00 $4F $00 $CF $00 $2F $00 $AF $00 $6F $00 $EF $00
$1F $00 $9F $00 $5F $00 $DF $00 $3F $00 $BF $00 $7F $00 $FF $00
$00 $01 $01 $01 $02 $01 $03 $01 $04 $01 $05 $01 $06 $01 $07 $01
$08 $01 $09 $01 $0A $01 $0B $01 $0C $01 $0D $01 $0E $01 $0F $01
$10 $01 $11 $01 $12 $01 $13 $01 $14 $01 $15 $01 $16 $01 $17 $01
$18 $01 $19 $01 $1A $01 $1B $01 $1C $01 $1D $01 $1E $01 $1F $01
$20 $01 $21 $01 $22 $01 $23 $01 $24 $01 $25 $01 $26 $01 $27 $01
$28 $01 $29 $01 $2A $01 $2B $01 $2C $01 $2D $01 $2E $01 $2F $01
$30 $01 $31 $01 $32 $01 $33 $01 $34 $01 $35 $01 $36 $01 $37 $01
$38 $01 $39 $01 $3A $01 $3B $01 $3C $01 $3D $01 $3E $01 $3F $01
$40 $01 $41 $01 $42 $01 $43 $01 $44 $01 $45 $01 $46 $01 $47 $01
$48 $01 $49 $01 $4A $01 $4B $01 $4C $01 $4D $01 $4E $01 $4F $01
$50 $01 $51 $01 $52 $01 $53 $01 $54 $01 $55 $01 $56 $01 $57 $01
$58 $01 $59 $01 $5A $01 $5B $01 $5C $01 $5D $01 $5E $01 $5F $01
$60 $01 $61 $01 $62 $01 $63 $01 $64 $01 $65 $01 $66 $01 $67 $01
$68 $01 $69 $01 $6A $01 $6B $01 $6C $01 $6D $01 $6E $01 $6F $01
$70 $01 $71 $01 $72 $01 $73 $01 $74 $01 $75 $01 $76 $01 $77 $01
$78 $01 $79 $01 $7A $01 $7B $01 $7C $01 $7D $01 $7E $01 $7F $01
$80 $01 $81 $01 $82 $01 $83 $01 $84 $01 $85 $01 $86 $01 $87 $01
$88 $01 $89 $01 $8A $01 $8B $01 $8C $01 $8D $01 $8E $01 $8F $01
$90 $01 $91 $01 $92 $01 $93 $01 $94 $01 $95 $01 $96 $01 $97 $01
$98 $01 $99 $01 $9A $01 $9B $01 $9C $01 $9D $01 $9E $01 $9F $01
$A0 $01 $A1 $01 $A2 $01 $A3 $01 $A4 $01 $A5 $01 $A6 $01 $A7 $01
$A8 $01 $A9 $01 $AA $01 $AB $01 $AC $01 $AD $01 $AE $01 $AF $01
$B0 $01 $B1 $01 $B2 $01 $B3 $01 $B4 $01 $B5 $01 $B6 $01 $B7 $01
$B8 $01 $B9 $01 $BA $01 $BB $01 $BC $01 $BD $01 $BE $01 $BF $01
$C0 $01 $C1 $01 $C2 $01 $C3 $01 $C4 $01 $C5 $01 $C6 $01 $C7 $01
$C8 $01 $C9 $01 $CA $01 $CB $01 $CC $01 $CD $01 $CE $01 $CF $01
$D0 $01 $D1 $01 $D2 $01 $D3 $01 $D4 $01 $D5 $01 $D6 $01 $D7 $01
$D8 $01 $D9 $01 $DA $01 $DB $01 $DC $01 $DD $01 $DE $01 $DF $01
$E0 $01 $E1 $01 $E2 $01 $E3 $01 $E4 $01 $E5 $01 $E6 $01 $E7 $01
$E8 $01 $E9 $01 $EA $01 $EB $01 $EC $01 $ED $01 $EE $01 $EF $01
$F0 $01 $F1 $01 $F2 $01 $F3 $01 $F4 $01 $F5 $01 $F6 $01 $F7 $01
$F8 $01 $F9 $01 $FA $01 $FB $01 $FC $01 $FD $01

This is a useful example to use since it has a number of unique features. Firstly the paths to any given terminal node are all the same length, 8 bits. Secondly the path to each terminal node is the ''reverse'' of the character it represents. And finally the nodes are arranged in a logical order with an easily seen pattern; the first half of the tree consists of terminal nodes, the second half of branch nodes. (Of the second half the first half of ''that'' consists of branch nodes that lead to terminal nodes while the second half consists of branch nodes to two branch nodes and so on.)

As an example the character $80 (128 or 10000000) can be expected to be represented by the path '00000001' and as such be the second entry in the dictionary. Starting at the root node and following the leftmost path until the last step takes us through the following nodes: 254 (root) -> 252 -> 248 -> 240 -> 224 -> 192 -> 128 -> 0 (the terminal node for characters $00 and $80).

== Source code ==

Some example code is available in various languages showing how to decompress (and in some cases compress) files using the Huffman algorithm.

=== QuickBasic ===

<source lang="qbasic">
'
' DANGEROUS DAVE 2 - IN THE HAUNTED MANSION - Huffman Decompressor
' - by Napalm with thanks to Adurdin's work on ModKeen
'
' This source is Public Domain, please credit me if you use it.
'
'
DECLARE SUB HUFFDECOMPRESS (INNAME AS STRING, OUTNAME AS STRING)

TYPE NODE
BIT0 AS INTEGER
BIT1 AS INTEGER
END TYPE

' Test Function
HUFFDECOMPRESS "TITLE1.DD2", "TITLE1.PIC"

SUB HUFFDECOMPRESS (INNAME AS STRING, OUTNAME AS STRING) ' by Napalm
DIM INFILE AS INTEGER, OUTFILE AS INTEGER, I AS INTEGER
DIM SIG AS LONG, OUTLEN AS LONG, BITMASK AS INTEGER
DIM CURNODE AS INTEGER, NEXTNODE AS INTEGER
DIM CHRIN AS STRING * 1, CHROUT AS STRING * 1
DIM NODES(0 TO 254) AS NODE

' Open input file
INFILE = FREEFILE
OPEN INNAME FOR BINARY ACCESS READ AS INFILE

' Check file signature
GET INFILE, , SIG
IF SIG <> &H46465548 THEN ' Hex for: HUFF in little endian
PRINT "INVALID FILE!"
CLOSE INFILE
EXIT SUB
END IF

' Get output length
OUTLEN = 0
GET INFILE, , OUTLEN

' Read in the huffman binary tree
FOR I = 0 TO 254
GET INFILE, , NODES(I).BIT0
GET INFILE, , NODES(I).BIT1
NEXT I

' Open output file
OUTFILE = FREEFILE
OPEN OUTNAME FOR BINARY ACCESS WRITE AS OUTFILE

' Decompress input data using binary tree
CURNODE = 254
DO
BITMASK = 0
GET INFILE, , CHRIN
DO
' Decide which node to travel down depending on
' input bits from CHRIN.
IF ASC(CHRIN) AND 2 ^ BITMASK THEN
NEXTNODE = NODES(CURNODE).BIT1
ELSE
NEXTNODE = NODES(CURNODE).BIT0
END IF

' Is this next node another part of the tree or
' is it a end node? Less than 256 mean end node.
IF NEXTNODE < 256 THEN

' Get output char from end node and save.
CHROUT = CHR$(NEXTNODE AND &HFF)
PUT OUTFILE, , CHROUT

' Amend output length and start from top of
' binary tree.
OUTLEN = OUTLEN - 1
CURNODE = 254

ELSE
' Travel to next node
CURNODE = (NEXTNODE AND &HFF)

END IF

' Move to next input bit
BITMASK = BITMASK + 1
LOOP WHILE BITMASK < 8 AND OUTLEN > 0
' Loop while we still need to output data
LOOP WHILE OUTLEN > 0

' Clean up
CLOSE OUTFILE
CLOSE INFILE

END SUB
</source>

----

<source lang="qbasic">
SUB MAKHUF 'Make a degenerate huffman tree, store as string huffq
OPEN "HUFF.DD2" FOR BINARY AS #8
aq = "HUFF"
PUT #8, 1, aq
x = 9
FOR t = 0 TO 255
b = t
va = 0
vb = 0
vc = 0
vd = 0
ve = 0
vf = 0
vg = 0
vh = 0
IF b > 127 THEN LET va = va + 1
b = b MOD 128
IF b > 63 THEN LET vb = vb + 1
b = b MOD 64
IF b > 31 THEN LET vc = vc + 1
b = b MOD 32
IF b > 15 THEN LET vd = vd + 1
b = b MOD 16
IF b > 7 THEN LET ve = ve + 1
b = b MOD 8
IF b > 3 THEN LET vf = vf + 1
b = b MOD 4
IF b > 1 THEN LET vg = vg + 1
b = b MOD 2
IF b = 1 THEN LET vh = vh + 1
b = (vh * 128) + (vg * 64) + (vf * 32) + (16 * ve) + (8 * vd) + (4 * vc) + (2 * vb) + va
aq = MKI$(b)
PUT #8, x, aq
x = x + 2
NEXT t
FOR t = 0 TO 253
aq = MKI$(t + 256)
PUT #8, x, aq
x = x + 2
NEXT t
GET #8, 1, huffq
CLOSE #8
KILL "HUFF.DD2"
END SUB
</source>

=== Visual Basic .NET ===

==== Huffman tree representation ====

This class, BinaryTreeNode, represents a binary tree whose branch nodes carry no value, like a Huffman dictionary tree.
<source lang="vbnet">Public Class BinaryTreeNode(Of T)
' By Fleexy, public domain, credit where credit is due
Private Branch As Boolean
Private Children As BinaryTreeNode(Of T)()
Private HoldValue As T
Public Sub New(LeafValue As T)
Branch = False
HoldValue = LeafValue
End Sub
Public Sub New(LeftChild As BinaryTreeNode(Of T), RightChild As BinaryTreeNode(Of T))
Branch = True
Children = {LeftChild, RightChild}
End Sub
Public ReadOnly Property HasValue As Boolean
Get
Return Not Branch
End Get
End Property
Public Property Value As T
Get
If Branch Then Throw New InvalidOperationException
Return HoldValue
End Get
Set(value As T)
If Branch Then Throw New InvalidOperationException
HoldValue = value
End Set
End Property
Public Property Child(Side As ChildSide) As BinaryTreeNode(Of T)
Get
If Not Branch Then Throw New InvalidOperationException
Return Children(Side)
End Get
Set(value As BinaryTreeNode(Of T))
If Not Branch Then Throw New InvalidOperationException
Children(Side) = value
End Set
End Property
Public Enum ChildSide As Byte
Left = 0
Right = 1
End Enum
Public ReadOnly Property Count As Integer
Get
If Not Branch Then Return 1
Return Children(0).Count + Children(1).Count
End Get
End Property
Public ReadOnly Property Depth As Integer
Get
If Not Branch Then Return 1
Return Math.Max(Children(0).Depth, Children(1).Depth) + 1
End Get
End Property
Public Overrides Function ToString() As String
Return "{Count = " & Count & ", Depth = " & Depth & "}"
End Function
End Class</source>

==== Huffman tree reading ====

This piece of code will read in a stored Huffman dictionary in the format described at the top of this article and store it in a BinaryTreeNode(Of Byte) as shown above.
<source lang="vbnet"> ' By Fleexy, public domain, credit where credit is due
Dim fDict As New IO.FileStream(DictionaryFile, IO.FileMode.Open)
Dim raw(254) As Tuple(Of UShort, UShort)
For x = 0 To 254
raw(x) = Tuple.Create(ReadUShort(fDict), ReadUShort(fDict))
Next
fDict.Close()
Dim GenerateTree As Func(Of UShort, BinaryTreeNode(Of Byte))
GenerateTree = Function(NextNode As UShort) As BinaryTreeNode(Of Byte)
Dim n As Tuple(Of UShort, UShort) = raw(NextNode)
Dim a, b As BinaryTreeNode(Of Byte)
If n.Item1 < 256 Then
a = New BinaryTreeNode(Of Byte)(n.Item1)
Else
a = GenerateTree(n.Item1 - 256)
End If
If n.Item2 < 256 Then
b = New BinaryTreeNode(Of Byte)(n.Item2)
Else
b = GenerateTree(n.Item2 - 256)
End If
Return New BinaryTreeNode(Of Byte)(a, b)
End Function
Dim dict As BinaryTreeNode(Of Byte) = GenerateTree(254)</source>

[[Category:File Formats]]
[[Category:Compressed Files]]
[[Category:Huffman Compression]]
[[Category:Compression Algorithms]]
[[Category:Code examples]]
[[Category:Dangerous Dave 2]]

Huffman Compression

2014-01-02T15:26:56Z

Fleexy: Corrected BinaryTreeNode<T>.ChildSide enumeration values

[[Huffman Compression]] is a compression algorithm used in many classic games. It is about as efficient as LZW or RLE compression, with many games using the three formats simultaneously.

Huffman compression involves making/reading a ''dictionary'' of 256 entries, one for each possible 1-byte character. This is organised in the form of a ''binary tree'' which is basically formed by taking the two lowest frequency characters and combining them into a new entry with their added frequencies and repeating until all entries are reduced to one ''root node''. After the dictionary is created, every character of data is replaced with the corresponding bit representation from the tree.

Any Huffman compressed data will thus be associated with a dictionary file (internal or external) used to decompress it.

==Nodes==

The binary tree must be stored in some form, usually called a ''dictionary''. The format varies from game to game, but is usually broadly along these lines. The tree has 255 nodes (the number of branch nodes needed to reach all 256 one-byte characters in game data), each stored as the two 'branches' of that node. A node usually takes up 4 bytes, 2 per branch, but sometimes 3. The second part of a branch is either 0 or 1, and says whether that branch goes to a character (0) or another node (1). The first part is the value of either that character or that node. (For the usual 4-byte implementation each part is a byte.)

In the Huffman table we can thus expect to see most byte values TWICE, once as a character and once as a node reference. Note that the nodes are numbered from 0 to 254, NOT 1 to 255. The order of the nodes doesn't matter, only how they are connected.

The root node is ALWAYS node 254 (the 255th node when counting from one) and will usually be something like $xx $yy $00 $00. (Since 0 is the most common character it is closest to the root.) It will also tend to be at the end of the dictionary, since most programs find it easier to write their dictionaries from bottom to top. This can often be used to find Huffman dictionaries in files, if you know what you are looking for.
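
The node layout described above can be illustrated with a minimal Python sketch. It assumes the 4-byte variant used by the examples further down this page (two little-endian 16-bit words per node, where a high byte of 0 marks a character and 1 marks a node reference); the function name <tt>parse_dictionary</tt> is purely illustrative and other games may differ.

<source lang="python">
import struct

def parse_dictionary(data, node_count=255):
    """Split raw dictionary bytes into one (branch0, branch1) pair per node.

    Each branch is returned as (is_node, value). Assumption: every branch
    is a little-endian 16-bit word, value in the low byte, flag in the high.
    """
    nodes = []
    for i in range(node_count):
        words = struct.unpack_from("<HH", data, i * 4)
        nodes.append(tuple((w >= 256, w & 0xFF) for w in words))
    return nodes
</source>

With a full dictionary, <tt>parse_dictionary(data)[254]</tt> would then be the root node.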

==Example 1==

Compress the word "HUFFMAN". For simplicity, we will not use the usual 256 character tree, but rather something a bit smaller. (Huffman trees can be any size, the concept is the same.)

<ol>
<li>Frequencies are: 1: H,U,M,A,N 2: F</li>
<li>Make a binary tree: (here left = 0 and right = 1)
<pre>
'Root node'
*
/ \
/ *
/ / \
* * *
/ \ / \ / \
F N M A H U
</pre>
</li>
<li>The first letter is 'H', which on the tree is <tt>110</tt>. The next letter is 'U', which is <tt>111</tt>. The third letter is 'F', which is <tt>00</tt>, and so on. (Notice how the most common letter, 'F', gets one of the shortest strings.) The final output in bits is <tt>110111000010010101</tt>.</li>
</ol>

This is only one of several possible Huffman trees for this word, but that doesn't matter; ANY tree will do as long as the decompressor uses the same one. As bytes the output is thus <tt>11011100 00100101 01000000</tt> (the last byte is padded with zero bits) or <tt>$DC $25 $40</tt>. We have a reduction in size from 7 bytes to 3, over 50%! Sadly, we then need to include the Huffman table (the ''dictionary'') in some form.
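
The encoding steps above can be reproduced with a short Python sketch. The code table is read off the example tree (left = 0, right = 1), and bits are packed most significant bit first so that the bytes match the ones quoted here. The names <tt>CODES</tt> and <tt>encode</tt> are illustrative, and a real game may pack its bits in a different order.

<source lang="python">
# Bit strings read off the example tree: left = 0, right = 1.
CODES = {"F": "00", "N": "01", "M": "100", "A": "101", "H": "110", "U": "111"}

def encode(text, codes=CODES):
    """Return the raw bit string and the zero-padded packed bytes."""
    bits = "".join(codes[ch] for ch in text)
    # Pad the final byte with zero bits (assumption: MSB-first packing).
    padded = bits + "0" * (-len(bits) % 8)
    packed = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return bits, packed

bits, packed = encode("HUFFMAN")
print(bits)                  # 110111000010010101
print(packed.hex().upper())  # DC2540
</source>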

A possibility for storing the dictionary might be: <tt>$00 $46 $00 $4E $00 $4D $00 $41 $00 $48 $00 $55 $01 $02 $01 $03 $01 $01 $01 $00</tt>. Here we have numbered the nodes starting at the root as 254 and labelled every other node from 0-4 going left to right and top to bottom. (If this is hard to see, try drawing things out on a piece of paper, the first 12 bytes are the bottom 3 nodes, all pointing to characters, and the last 8 bytes are the top two nodes, all pointing to other nodes.) This is the format used in Keen 4-6 executables.

To decompress the data we use the tree again, starting at the root node and reading the data bit by bit. The first three bits are <tt>110</tt>, which lead us down the tree to the character 'H'; the next three are <tt>111</tt>, which lead us to 'U', and so on.
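
As a rough illustration of this walk, the Python sketch below decodes the three example bytes using the example tree written as nested tuples. It assumes the same most-significant-bit-first order as the bytes shown above (the QuickBasic decompressor further down reads each byte starting from its least significant bit instead), and the names <tt>TREE</tt> and <tt>decode</tt> are made up for this example.

<source lang="python">
# The example tree as nested tuples: index 0 is the left branch, 1 the right.
TREE = (("F", "N"), (("M", "A"), ("H", "U")))

def decode(packed, length, tree=TREE):
    """Walk the tree bit by bit until `length` characters are produced."""
    out = []
    node = tree
    for byte in packed:
        for shift in range(7, -1, -1):      # assumption: MSB first
            node = node[(byte >> shift) & 1]
            if isinstance(node, str):       # reached a character
                out.append(node)
                node = tree                 # restart at the root
                if len(out) == length:
                    return "".join(out)
    return "".join(out)

print(decode(bytes([0xDC, 0x25, 0x40]), 7))  # HUFFMAN
</source>

The output length has to be known in advance; otherwise the zero bits padding the last byte would be decoded as extra 'F' characters.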

==Example 2==

The following 1020 bytes constitute the complete 'trivial' Huffman dictionary, that is, one that does not compress the data at all:

$00 $00 $80 $00 $40 $00 $C0 $00 $20 $00 $A0 $00 $60 $00 $E0 $00
$10 $00 $90 $00 $50 $00 $D0 $00 $30 $00 $B0 $00 $70 $00 $F0 $00
$08 $00 $88 $00 $48 $00 $C8 $00 $28 $00 $A8 $00 $68 $00 $E8 $00
$18 $00 $98 $00 $58 $00 $D8 $00 $38 $00 $B8 $00 $78 $00 $F8 $00
$04 $00 $84 $00 $44 $00 $C4 $00 $24 $00 $A4 $00 $64 $00 $E4 $00
$14 $00 $94 $00 $54 $00 $D4 $00 $34 $00 $B4 $00 $74 $00 $F4 $00
$0C $00 $8C $00 $4C $00 $CC $00 $2C $00 $AC $00 $6C $00 $EC $00
$1C $00 $9C $00 $5C $00 $DC $00 $3C $00 $BC $00 $7C $00 $FC $00
$02 $00 $82 $00 $42 $00 $C2 $00 $22 $00 $A2 $00 $62 $00 $E2 $00
$12 $00 $92 $00 $52 $00 $D2 $00 $32 $00 $B2 $00 $72 $00 $F2 $00
$0A $00 $8A $00 $4A $00 $CA $00 $2A $00 $AA $00 $6A $00 $EA $00
$1A $00 $9A $00 $5A $00 $DA $00 $3A $00 $BA $00 $7A $00 $FA $00
$06 $00 $86 $00 $46 $00 $C6 $00 $26 $00 $A6 $00 $66 $00 $E6 $00
$16 $00 $96 $00 $56 $00 $D6 $00 $36 $00 $B6 $00 $76 $00 $F6 $00
$0E $00 $8E $00 $4E $00 $CE $00 $2E $00 $AE $00 $6E $00 $EE $00
$1E $00 $9E $00 $5E $00 $DE $00 $3E $00 $BE $00 $7E $00 $FE $00
$01 $00 $81 $00 $41 $00 $C1 $00 $21 $00 $A1 $00 $61 $00 $E1 $00
$11 $00 $91 $00 $51 $00 $D1 $00 $31 $00 $B1 $00 $71 $00 $F1 $00
$09 $00 $89 $00 $49 $00 $C9 $00 $29 $00 $A9 $00 $69 $00 $E9 $00
$19 $00 $99 $00 $59 $00 $D9 $00 $39 $00 $B9 $00 $79 $00 $F9 $00
$05 $00 $85 $00 $45 $00 $C5 $00 $25 $00 $A5 $00 $65 $00 $E5 $00
$15 $00 $95 $00 $55 $00 $D5 $00 $35 $00 $B5 $00 $75 $00 $F5 $00
$0D $00 $8D $00 $4D $00 $CD $00 $2D $00 $AD $00 $6D $00 $ED $00
$1D $00 $9D $00 $5D $00 $DD $00 $3D $00 $BD $00 $7D $00 $FD $00
$03 $00 $83 $00 $43 $00 $C3 $00 $23 $00 $A3 $00 $63 $00 $E3 $00
$13 $00 $93 $00 $53 $00 $D3 $00 $33 $00 $B3 $00 $73 $00 $F3 $00
$0B $00 $8B $00 $4B $00 $CB $00 $2B $00 $AB $00 $6B $00 $EB $00
$1B $00 $9B $00 $5B $00 $DB $00 $3B $00 $BB $00 $7B $00 $FB $00
$07 $00 $87 $00 $47 $00 $C7 $00 $27 $00 $A7 $00 $67 $00 $E7 $00
$17 $00 $97 $00 $57 $00 $D7 $00 $37 $00 $B7 $00 $77 $00 $F7 $00
$0F $00 $8F $00 $4F $00 $CF $00 $2F $00 $AF $00 $6F $00 $EF $00
$1F $00 $9F $00 $5F $00 $DF $00 $3F $00 $BF $00 $7F $00 $FF $00
$00 $01 $01 $01 $02 $01 $03 $01 $04 $01 $05 $01 $06 $01 $07 $01
$08 $01 $09 $01 $0A $01 $0B $01 $0C $01 $0D $01 $0E $01 $0F $01
$10 $01 $11 $01 $12 $01 $13 $01 $14 $01 $15 $01 $16 $01 $17 $01
$18 $01 $19 $01 $1A $01 $1B $01 $1C $01 $1D $01 $1E $01 $1F $01
$20 $01 $21 $01 $22 $01 $23 $01 $24 $01 $25 $01 $26 $01 $27 $01
$28 $01 $29 $01 $2A $01 $2B $01 $2C $01 $2D $01 $2E $01 $2F $01
$30 $01 $31 $01 $32 $01 $33 $01 $34 $01 $35 $01 $36 $01 $37 $01
$38 $01 $39 $01 $3A $01 $3B $01 $3C $01 $3D $01 $3E $01 $3F $01
$40 $01 $41 $01 $42 $01 $43 $01 $44 $01 $45 $01 $46 $01 $47 $01
$48 $01 $49 $01 $4A $01 $4B $01 $4C $01 $4D $01 $4E $01 $4F $01
$50 $01 $51 $01 $52 $01 $53 $01 $54 $01 $55 $01 $56 $01 $57 $01
$58 $01 $59 $01 $5A $01 $5B $01 $5C $01 $5D $01 $5E $01 $5F $01
$60 $01 $61 $01 $62 $01 $63 $01 $64 $01 $65 $01 $66 $01 $67 $01
$68 $01 $69 $01 $6A $01 $6B $01 $6C $01 $6D $01 $6E $01 $6F $01
$70 $01 $71 $01 $72 $01 $73 $01 $74 $01 $75 $01 $76 $01 $77 $01
$78 $01 $79 $01 $7A $01 $7B $01 $7C $01 $7D $01 $7E $01 $7F $01
$80 $01 $81 $01 $82 $01 $83 $01 $84 $01 $85 $01 $86 $01 $87 $01
$88 $01 $89 $01 $8A $01 $8B $01 $8C $01 $8D $01 $8E $01 $8F $01
$90 $01 $91 $01 $92 $01 $93 $01 $94 $01 $95 $01 $96 $01 $97 $01
$98 $01 $99 $01 $9A $01 $9B $01 $9C $01 $9D $01 $9E $01 $9F $01
$A0 $01 $A1 $01 $A2 $01 $A3 $01 $A4 $01 $A5 $01 $A6 $01 $A7 $01
$A8 $01 $A9 $01 $AA $01 $AB $01 $AC $01 $AD $01 $AE $01 $AF $01
$B0 $01 $B1 $01 $B2 $01 $B3 $01 $B4 $01 $B5 $01 $B6 $01 $B7 $01
$B8 $01 $B9 $01 $BA $01 $BB $01 $BC $01 $BD $01 $BE $01 $BF $01
$C0 $01 $C1 $01 $C2 $01 $C3 $01 $C4 $01 $C5 $01 $C6 $01 $C7 $01
$C8 $01 $C9 $01 $CA $01 $CB $01 $CC $01 $CD $01 $CE $01 $CF $01
$D0 $01 $D1 $01 $D2 $01 $D3 $01 $D4 $01 $D5 $01 $D6 $01 $D7 $01
$D8 $01 $D9 $01 $DA $01 $DB $01 $DC $01 $DD $01 $DE $01 $DF $01
$E0 $01 $E1 $01 $E2 $01 $E3 $01 $E4 $01 $E5 $01 $E6 $01 $E7 $01
$E8 $01 $E9 $01 $EA $01 $EB $01 $EC $01 $ED $01 $EE $01 $EF $01
$F0 $01 $F1 $01 $F2 $01 $F3 $01 $F4 $01 $F5 $01 $F6 $01 $F7 $01
$F8 $01 $F9 $01 $FA $01 $FB $01 $FC $01 $FD $01

This is a useful example to use since it has a number of unique features. Firstly the paths to any given terminal node are all the same length, 8 bits. Secondly the path to each terminal node is the ''reverse'' of the character it represents. And finally the nodes are arranged in a logical order with an easily seen pattern; the first half of the tree consists of terminal nodes, the second half of branch nodes. (Of the second half the first half of ''that'' consists of branch nodes that lead to terminal nodes while the second half consists of branch nodes to two branch nodes and so on.)

As an example the character $80 (128 or 10000000) can be expected to be represented by the path '00000001' and as such be the second entry in the dictionary. Starting at the root node and following the leftmost path until the last step takes us through the following nodes: 254 (root) -> 252 -> 248 -> 240 -> 224 -> 192 -> 128 -> 0 (the terminal node for characters $00 and $80).
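
The pattern can be checked with a small Python sketch that rebuilds the listing above and follows a path through it. It assumes each branch is a little-endian 16-bit word, as in the listing; <tt>trivial_dictionary</tt> and <tt>follow</tt> are illustrative names only.

<source lang="python">
import struct

def trivial_dictionary():
    """Rebuild the 1020-byte trivial dictionary shown above."""
    # Nodes 0-127: 256 character branches, each the bit-reversed index.
    words = [int(f"{i:08b}"[::-1], 2) for i in range(256)]
    # Nodes 128-254: 254 branches referring to nodes 0-253 in order.
    words += [256 + n for n in range(254)]
    return struct.pack("<510H", *words)

def follow(dictionary, path, root=254):
    """Follow a string of '0'/'1' bits; return (visited nodes, character)."""
    node, visited = root, [root]
    for bit in path:
        (word,) = struct.unpack_from("<H", dictionary, node * 4 + 2 * int(bit))
        if word < 256:
            return visited, word
        node = word - 256
        visited.append(node)
    return visited, None

d = trivial_dictionary()
print(len(d))                 # 1020
print(follow(d, "00000001"))  # ([254, 252, 248, 240, 224, 192, 128, 0], 128)
</source>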

== Source code ==

Some example code is available in various languages showing how to decompress (and in some cases compress) files using the Huffman algorithm.

=== QuickBasic ===

<source lang="qbasic">
'
' DANGEROUS DAVE 2 - IN THE HAUNTED MANSION - Huffman Decompressor
' - by Napalm with thanks to Adurdin's work on ModKeen
'
' This source is Public Domain, please credit me if you use it.
'
'
DECLARE SUB HUFFDECOMPRESS (INNAME AS STRING, OUTNAME AS STRING)

TYPE NODE
BIT0 AS INTEGER
BIT1 AS INTEGER
END TYPE

' Test Function
HUFFDECOMPRESS "TITLE1.DD2", "TITLE1.PIC"

SUB HUFFDECOMPRESS (INNAME AS STRING, OUTNAME AS STRING) ' by Napalm
DIM INFILE AS INTEGER, OUTFILE AS INTEGER, I AS INTEGER
DIM SIG AS LONG, OUTLEN AS LONG, BITMASK AS INTEGER
DIM CURNODE AS INTEGER, NEXTNODE AS INTEGER
DIM CHRIN AS STRING * 1, CHROUT AS STRING * 1
DIM NODES(0 TO 254) AS NODE

' Open input file
INFILE = FREEFILE
OPEN INNAME FOR BINARY ACCESS READ AS INFILE

' Check file signature
GET INFILE, , SIG
IF SIG <> &H46465548 THEN ' Hex for: HUFF in little endian
PRINT "INVALID FILE!"
CLOSE INFILE
EXIT SUB
END IF

' Get output length
OUTLEN = 0
GET INFILE, , OUTLEN

' Read in the huffman binary tree
FOR I = 0 TO 254
GET INFILE, , NODES(I).BIT0
GET INFILE, , NODES(I).BIT1
NEXT I

' Open output file
OUTFILE = FREEFILE
OPEN OUTNAME FOR BINARY ACCESS WRITE AS OUTFILE

' Decompress input data using binary tree
CURNODE = 254
DO
BITMASK = 0
GET INFILE, , CHRIN
DO
' Decide which node to travel down depending on
' input bits from CHRIN.
IF ASC(CHRIN) AND 2 ^ BITMASK THEN
NEXTNODE = NODES(CURNODE).BIT1
ELSE
NEXTNODE = NODES(CURNODE).BIT0
END IF

' Is this next node another part of the tree or
' is it a end node? Less than 256 mean end node.
IF NEXTNODE < 256 THEN

' Get output char from end node and save.
CHROUT = CHR$(NEXTNODE AND &HFF)
PUT OUTFILE, , CHROUT

' Amend output length and start from top of
' binary tree.
OUTLEN = OUTLEN - 1
CURNODE = 254

ELSE
' Travel to next node
CURNODE = (NEXTNODE AND &HFF)

END IF

' Move to next input bit
BITMASK = BITMASK + 1
LOOP WHILE BITMASK < 8 AND OUTLEN > 0
' Loop while we still need to output data
LOOP WHILE OUTLEN > 0

' Clean up
CLOSE OUTFILE
CLOSE INFILE

END SUB
</source>

----

<source lang="qbasic">
SUB MAKHUF 'Make a degenerate huffman tree, store as string huffq
OPEN "HUFF.DD2" FOR BINARY AS #8
aq = "HUFF"
PUT #8, 1, aq
x = 9
FOR t = 0 TO 255
b = t
va = 0
vb = 0
vc = 0
vd = 0
ve = 0
vf = 0
vg = 0
vh = 0
IF b > 127 THEN LET va = va + 1
b = b MOD 128
IF b > 63 THEN LET vb = vb + 1
b = b MOD 64
IF b > 31 THEN LET vc = vc + 1
b = b MOD 32
IF b > 15 THEN LET vd = vd + 1
b = b MOD 16
IF b > 7 THEN LET ve = ve + 1
b = b MOD 8
IF b > 3 THEN LET vf = vf + 1
b = b MOD 4
IF b > 1 THEN LET vg = vg + 1
b = b MOD 2
IF b = 1 THEN LET vh = vh + 1
b = (vh * 128) + (vg * 64) + (vf * 32) + (16 * ve) + (8 * vd) + (4 * vc) + (2 * vb) + va
aq = MKI$(b)
PUT #8, x, aq
x = x + 2
NEXT t
FOR t = 0 TO 253
aq = MKI$(t + 256)
PUT #8, x, aq
x = x + 2
NEXT t
GET #8, 1, huffq
CLOSE #8
KILL "HUFF.DD2"
END SUB
</source>

=== Visual Basic .NET ===

==== Huffman tree representation ====

This class, BinaryTreeNode, represents a binary tree whose branch nodes carry no value, like a Huffman dictionary tree.
<source lang="vbnet">Public Class BinaryTreeNode(Of T)
' By Fleexy, public domain, credit where credit is due
Private Branch As Boolean
Private Children As BinaryTreeNode(Of T)()
Private HoldValue As T
Public Sub New(LeafValue As T)
Branch = False
HoldValue = LeafValue
End Sub
Public Sub New(LeftChild As BinaryTreeNode(Of T), RightChild As BinaryTreeNode(Of T))
Branch = True
Children = {LeftChild, RightChild}
End Sub
Public Property Value As T
Get
If Branch Then Throw New InvalidOperationException
Return HoldValue
End Get
Set(value As T)
If Branch Then Throw New InvalidOperationException
HoldValue = value
End Set
End Property
Public Property Child(Side As ChildSide) As BinaryTreeNode(Of T)
Get
If Not Branch Then Throw New InvalidOperationException
Return Children(Side)
End Get
Set(value As BinaryTreeNode(Of T))
If Not Branch Then Throw New InvalidOperationException
Children(Side) = value
End Set
End Property
Public Enum ChildSide As Byte
Left = 0
Right = 1
End Enum
Public ReadOnly Property Count As Integer
Get
If Not Branch Then Return 1
Return Children(0).Count + Children(1).Count
End Get
End Property
Public ReadOnly Property Depth As Integer
Get
If Not Branch Then Return 1
Return Math.Max(Children(0).Depth, Children(1).Depth) + 1
End Get
End Property
Public Overrides Function ToString() As String
Return "{Count = " & Count & ", Depth = " & Depth & "}"
End Function
End Class</source>

==== Huffman tree reading ====

This piece of code will read in a stored Huffman dictionary in the format described at the top of this article and store it in a BinaryTreeNode(Of Byte) as shown above.
<source lang="vbnet"> ' By Fleexy, public domain, credit where credit is due
Dim fDict As New IO.FileStream(DictionaryFile, IO.FileMode.Open)
Dim raw(254) As Tuple(Of UShort, UShort)
For x = 0 To 254
raw(x) = Tuple.Create(ReadUShort(fDict), ReadUShort(fDict))
Next
fDict.Close()
Dim GenerateTree As Func(Of UShort, BinaryTreeNode(Of Byte))
GenerateTree = Function(NextNode As UShort) As BinaryTreeNode(Of Byte)
Dim n As Tuple(Of UShort, UShort) = raw(NextNode)
Dim a, b As BinaryTreeNode(Of Byte)
If n.Item1 < 256 Then
a = New BinaryTreeNode(Of Byte)(n.Item1)
Else
a = GenerateTree(n.Item1 - 256)
End If
If n.Item2 < 256 Then
b = New BinaryTreeNode(Of Byte)(n.Item2)
Else
b = GenerateTree(n.Item2 - 256)
End If
Return New BinaryTreeNode(Of Byte)(a, b)
End Function
Dim dict As BinaryTreeNode(Of Byte) = GenerateTree(254)</source>

[[Category:File Formats]]
[[Category:Compressed Files]]
[[Category:Huffman Compression]]
[[Category:Compression Algorithms]]
[[Category:Code examples]]
[[Category:Dangerous Dave 2]]

Huffman Compression

2014-01-02T01:24:03Z

Fleexy: Fixed variable type mismatch in the newly added VB.NET code

Huffman Compression

2014-01-02T01:22:29Z

Fleexy: Made the encoding example's bits correct, added VB.NET code

Commander Keen EGA Header

2013-12-31T23:18:40Z

Fleexy: Fixed compression entry in the EGAHEAD, cleared stuff up

Many early (1989-1993) [[ID Software]]/[[Softdisk]] games use an <tt>EGAHEAD.*</tt> file to read their graphics files. While the number and function of the graphics files vary substantially between games, the header format itself is rather constant, with only a few minor variations. (Indeed it is possible to see a progressive development of the format over the years by tracing changes across several games.) In [[Commander Keen 1-3]] itself two [[Raw EGA data]] files contain all the tiles, sprites and in-game images. In other games more files are usually required.

There are four types of graphics: 8x8 fonts, 16x16 tiles, unmasked images and masked sprites. Sprites are almost always stored separately because they usually require 5 planes of EGA data, not 4.

The EGAHEAD file may be external or internal and is divided into three parts. The first section deals with the number and location of fonts, tiles, images and sprites. The second section is a list of the names of unmasked images (each entry 16 bytes long) and the third (and usually largest) section is a list of the names of sprites (each entry 128 bytes long).

Note that the addresses of graphics are offsets from where an EGA plane starts. So with a plane size of 100, the address 40 would have us looking for data at 40, 140, 240... in the graphics file. If graphics are stored separately, then this address is usually blank (since the data will start at the start of the file).
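
The plane arithmetic above amounts to the following, shown as a tiny Python sketch; the function name <tt>plane_offsets</tt> is illustrative, and the number of planes (4 for latch data, 5 for sprites) has to be supplied by the caller.

<source lang="python">
def plane_offsets(address, plane_size, planes=4):
    """File offsets at which one graphic's data starts in each EGA plane."""
    return [address + plane * plane_size for plane in range(planes)]

print(plane_offsets(40, 100))  # [40, 140, 240, 340]
</source>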

== Commander Keen format ==

Section 1:
0 4 Latplansiz Size of 4-plane EGA plane; should be one quarter of the size of the UNCOMPRESSED EGA
file size. In Keen this is for the EGALATCH file.
4 4 Sprplansiz Size of 5-plane EGA plane; should be one fifth of the size of the UNCOMPRESSED EGA
file size. In Keen this is EGASPRIT. This may be blank in games that store their
sprites in multiple files.
8 4 Imgdatstart Where in the EGAHEAD the entries for unmasked graphics (Excluding font and tiles.)
start. If there are none, this is blank.
12 4 Sprdatstart Where in the EGAHEAD the entries for masked graphics (Sprites) start; by default
this is right after the unmasked graphics.
16 2 Fontnum Number of 8x8 font entries; note that many games have this for
drawing windows, with the actual black and white font stored in the executable.
18 4 Fontloc Offset in memory where font data is placed. Should be zero since font is
first. In Keen 1-3 also the location in LATCH file where font data starts.
22 2 Unknum Used for the ending screen until this was removed. Number of screen graphics.
24 4 Unkloc Used for screen graphics until removal. Offset in memory where screen data placed.
28 2 Tilenum Number of 16x16 tiles
30 4 Tileloc Offset in memory\LATCH where tile data placed\starts.

34 2 Bmpnum Number of unmasked bitmaps
36 4 Bmploc Offset in memory\plane where unmasked bitmap data starts\placed.
40 2 Spritenum Number of sprite images
42 4 Spriteloc Offset in EGASPRIT plane of start of sprite data. Is, of course, zero. Also relates
to memory
46 2 Compression Add 2 to this value if EGALATCH is compressed, add 1 to it if EGASPRIT is compressed.
Thus uncompressed graphics have this set at 0 and fully compressed at 3. The only
game that compresses its graphics like this is Commander Keen 1. Attempting to set
any compression for the other games will result in garbage graphics.
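
As a rough guide to reading Section 1, the table above maps onto a 48-byte record. The Python sketch below unpacks it and splits the compression value into its two flags; it assumes little-endian unsigned fields, and the lower-case field names and <tt>parse_section1</tt> are simply derived from the table for this example.

<source lang="python">
import struct

# Four 4-byte sizes/offsets, five (count, location) pairs, then the
# 2-byte compression value. Assumption: little-endian, unsigned, unpadded.
HEADER_FORMAT = "<4L" + "HL" * 5 + "H"
HEADER_FIELDS = (
    "latplansiz", "sprplansiz", "imgdatstart", "sprdatstart",
    "fontnum", "fontloc", "unknum", "unkloc", "tilenum", "tileloc",
    "bmpnum", "bmploc", "spritenum", "spriteloc", "compression",
)

def parse_section1(data):
    """Unpack Section 1 of an EGAHEAD into a dictionary."""
    values = struct.unpack_from(HEADER_FORMAT, data, 0)
    header = dict(zip(HEADER_FIELDS, values))
    header["latch_compressed"] = bool(header["compression"] & 2)
    header["sprite_compressed"] = bool(header["compression"] & 1)
    return header
</source>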

Section 2:
0 2 Size h The width of the graphic divided by 8
2 2 Size v The height of the graphic in pixels; if this cannot be
divided into neat 16 byte pieces, the extra data,
usually 8 bytes, is added to the size.
4 4 Loc When added to the graphic offset in the header, gives
the location of the start of the graphic data in the
plane. For the first graphic this is thus zero.
8 8 Name Name of the graphic, padded with nulls.

Section 3:
0 2 Width The width of the graphic divided by 8
2 2 Height The height of the graphic in pixels; the same rule
applies as for unmasked graphics except now we have:
4 2 Loc offset Usually 8, this is the number of bytes 'extra' that
must be added to the location to reach the start of
the sprite data. This appears when a sprite is so
small, usually 8x8 pixels, that it doesn't fill a
multiple of 16 bytes. This will affect ALL sprites
after the aberration until another one occurs to fix
the shortfall.
6 2 Location Multiplying this by 16 bytes gives the location of
the start of the sprite data in the EGA plane.
8 4 Hitbox ul Location of the upper-left corner of the sprite's
hitbox or collision rectangle, in pixels, starting
from 0,0. First two bytes are the h location and the
next two are the v location. In the EGAHEAD, both
values are multiplied by 256.
12 4 Hitbox br Same as above, for the bottom-right corner.
16 12 Name The sprite name, usually includes a number and is
usually only 10 bytes long. May spill into the next
field in games that do not use that.
28 4 h/v off The horizontal and vertical offset of the sprite frame from the sprite
event. In early games this is blank, but in later ones such as
[[Shadow Knights]] it is utilized. This can be used to 'line up'
frames of different heights and so on.
32 3*32 Copies Games use these entries for smooth movement; each 32 byte entry is
a copy of the initial sprite shifted 2 pixels left. The game will
automatically generate the graphics, but the preceding information
must be duplicated. In the copies, 1 is added to the width field.
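
The first 32 bytes of a sprite entry could be read along these lines. This Python sketch follows the Section 3 table but makes several assumptions: fields are little-endian, the hitbox values are unsigned, the h/v offset is two signed 16-bit values, and the data start is the location times 16 plus the location offset. The name <tt>parse_sprite_entry</tt> is illustrative.

<source lang="python">
import struct

# width, height, loc offset, location, four hitbox words, name, h/v offset
SPRITE_FORMAT = "<4H4H12s2h"   # 32 bytes in total

def parse_sprite_entry(data, offset=0):
    """Decode one 32-byte sprite entry (the first of the four copies)."""
    (width, height, loc_offset, location,
     ul_h, ul_v, br_h, br_v, name, h_off, v_off) = struct.unpack_from(
        SPRITE_FORMAT, data, offset)
    return {
        "width_px": width * 8,                      # stored divided by 8
        "height_px": height,
        "data_start": location * 16 + loc_offset,   # assumed combination
        # Hitbox values are stored multiplied by 256.
        "hitbox": (ul_h // 256, ul_v // 256, br_h // 256, br_v // 256),
        "name": name.split(b"\0", 1)[0].decode("ascii", "replace"),
        "offset": (h_off, v_off),
    }
</source>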

== Other games ==

There are several differences between other games and Commander Keen. The most notable is usually the presence of more than two graphics files. Sprites are often split into various <tt>S_*</tt> files, each of which is compressed, containing, as its first word, the decompressed file size (Including the word itself and any header the raw data may have.) It is usually not hard to calculate a plane size from this.

Addresses in the header will point to the right location in whatever file is being used at the time. Thus is sprites have been split up into several files, it is possible to see when files are 'switched' whenever the pointer for a sprite is less (usually 0) than the pointer to the previous sprite. (A new file is being opened and data read from the start of it, instead of the end of the old file.) Note however that the order sprite files are opened and how many sprites they contain is usually hard-coded.

Another effect of this is that pointers in the EGA header are used for memory, NOT files. In Keen they are the same, since the EGALATCH is copied directly into memory in segments; essentially the whole file is copied as-is. In other games with more files the game is hard-coded to calculate where to read from the number of graphics entries, their size, or whether or not a file has been opened, so the header pointers are not a reliable guide.

Aside from Commander Keen, no other games are [[Keen 1-3 LZW Compression|LZW compressed]] and so don't use that feature in the header. Other forms of compression are used which may or may not be indicated in the header somehow.

As games progressed an additional modification was added to the header format; the last four bytes of a sprite's name were set aside to be h and v offsets of the sprite image (as used in [[Commander Keen 4-6]] games.) This left a sprite with a maximum name length of 11 bytes. Since the last four bytes are seldom used (are zero), it can often be assumed that all games use this with few problems.

Finally there is the issue of 'sprite blitting'. The header's sprite entries are 128 bytes long, being essentially four copies of a 32 byte entry. In earlier games each was in fact a separate entry, so there were in effect four times the number of sprites, one for each case of a sprite moving 0, 2, 4, and 6 pixels left or right. Later games automated the creation of these sprites, shrinking file size. For simplicity the header format was left as-is.

=== [[Dangerous Dave in Copyright Infringement]] ===

TODO

=== [[Dangerous Dave 2]] ===

Dangerous Dave 2 stores the header and <tt>LATCH</tt> internally in the executable at 74896 and 101096 respectively (in the [[UNLZEXE]]'d executable.) Because the <tt>EGALATCH.DD2</tt> file is stored internally there is an interesting situation with 8x8 tiles and unmasked bitmaps; the <tt>EGAHEAD</tt> points to 32 8x8 tiles, but the last tile is overwritten by the start of the unmasked bitmap data. This means that 8 bytes from each EGA plane are copied into memory twice, once as the end of 8x8 tiles and again as the first unmasked bitmap. It also means that the 'location' of the bitmaps (e.g. $100) is 8 bytes larger than the location of the EGA data in the internal file.

Otherwise the format is nearly identical. The number of unmasked bitmaps and their location are at 40\42 in the header instead of 34\36. There is no sprite plane size at 4 since the sprites are stored in several compressed files (each containing its own plane size). Tiles are stored in a separate file, <tt>EGATILES.DD2</tt>. Since each tile may or may not be loaded in a level depending on whether it's needed, the tile file is composed of 858 128-byte entries, each containing 4 EGA planes for the tile. There is a 'copy' of the EGA plane size at byte 40, and bytes 22-34 and 52-64 are blank or contain nonsense data.
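As an illustration of the per-tile layout, the following Python sketch turns one 128-byte entry into a grid of colour indices. Several things here are assumptions rather than facts from the format description: that a tile is 16x16 pixels (32 bytes per plane), that the four planes are stored one after another inside the entry, that plane ''n'' supplies bit ''n'' of the colour, and that the most significant bit of each byte is the leftmost pixel. The name <tt>decode_tile</tt> is hypothetical.

```python
TILE_BYTES = 128   # one entry: 4 planes
PLANE_BYTES = 32   # assumed 16x16 pixels at 1 bit per pixel


def decode_tile(tiles: bytes, index: int):
    """Return tile `index` as 16 rows of 16 colour indices (0-15)."""
    entry = tiles[index * TILE_BYTES:(index + 1) * TILE_BYTES]
    if len(entry) != TILE_BYTES:
        raise ValueError("tile index outside the file")
    rows = []
    for y in range(16):
        row = []
        for x in range(16):
            offset = y * 2 + x // 8     # 2 bytes per row in each plane
            mask = 0x80 >> (x % 8)      # assumed: MSB is the leftmost pixel
            colour = 0
            for plane in range(4):      # assumed: plane n gives colour bit n
                if entry[plane * PLANE_BYTES + offset] & mask:
                    colour |= 1 << plane
            row.append(colour)
        rows.append(row)
    return rows
```

Because each entry stands alone, a single tile can be decoded without touching the rest of the file.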

=== [[Rescue Rover]] ===

Rescue Rover has two sprite files, <tt>S_PLAY.ROV</tt> and <tt>S_DEMO.ROV</tt>, used in that order. Bitmaps and the 8x8 font are stored in <tt>EGAPLANE.ROV</tt>; tiles are 32x32 and stored in <tt>EGABTILE.ROV</tt> (TED 16x16 tiles are stored in a file not used by the game, <tt>EGATILES.ROV</tt>). Finally, a monochrome font is stored in <tt>EGAFONT0.ROV</tt>. There are also several compressed screen images, <tt>*PIC.ROV</tt>.

The number\location-in-file of the sprites are stored at slightly different places, at bytes 47 and 49 instead of 41 and 43. In practice the location of tile and sprite data is 0, since they do not share files with any other EGA graphics.

=== [[Shadow Knights]] ===

Shadow Knights uses the same (subtly different) EGA header as Rescue Rover. The document at http://levellord.rewound.net/Index/File%20Formats/Shadow%20Knights/Shadow%20EGA.txt describes how the compression used for the sprites works, along with the slightly odd layout of the tile data.

=== [[Slordax]] ===

Slordax uses a format exactly identical to Commander Keen.

==Notes==

This format has been reverse engineered several times, mostly by the Commander Keen community.

[[Category: File Formats]]
[[Category: Graphics Files]]

Carmack compression

2013-12-31T23:12:36Z

Fleexy: Made it even clearer that the xA8 address is in words

[[Carmack compression]] is used in the [[GameMaps Format|GAMEMAPS file]] in [[Commander Keen 4-6]], [[Catacomb 3D]], [[Wolfenstein 3D]], and [[Noah's Ark 3D]] to further shrink the levels down beyond what [[RLEW compression]] can achieve. Its basic idea is somewhat like LZ (Lempel-Ziv) compression in that it contains pointers back to previous data.

As in RLEW compression, the first word in the Carmack compressed data is the number of bytes (not words) in the decompressed data. This is typically the number of bytes in the compressed RLEW data, as Carmack compression is performed after RLEW compression.

Carmack compression contains two types of references to previous data: near pointers and far pointers.

= Near Pointers =

Near pointers occupy three bytes in the compressed data. The first is the number of words in the referenced sequence, the second is the signal byte of xA7, and the third is the number of words to the start of the reference (counting backwards from the current location). As a concrete example, the three bytes x05 xA7 x0A mean 'repeat the 5 words starting 10 words ago'.

Notice that near pointers only let one refer to the last 255 words. To refer to sequences further back, one must use far pointers.

= Far Pointers =

Far pointers occupy four bytes in the compressed data. The first is, again, the number of words in the referenced sequence, the second is xA8, and the third and fourth are interpreted as a little-endian word: a 0-based pointer to the start of the reference '''in words''', so the address must be multiplied by two to reach the correct byte location. As a concrete example, the four bytes x10 xA8 x01 x20 mean 'repeat the 16 words starting at word number 8193' (the pointer bytes x01 x20 form the word x2001).

= Words with a high byte of $A7 or $A8 =

Words whose high (second) byte is xA7 or xA8 would appear to be an issue, as they would be confused with near or far pointers. These are handled by representing them as three bytes: a count of x00, the signal byte (xA7 or xA8), and then the low byte of the word. The zero count marks this as an exception, since repeating zero words would make no sense.
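The three cases above (near pointer, far pointer, escaped literal word) can be put together in a short Python sketch. It follows the description on this page and is not taken from any game's source; the function name <tt>carmack_expand</tt> is made up for the example, and it assumes the input begins with the length word described at the top of the article.

```python
import struct

NEAR_TAG = 0xA7
FAR_TAG = 0xA8


def carmack_expand(data: bytes) -> bytes:
    """Expand Carmack-compressed data that starts with its output size."""
    (out_len,) = struct.unpack_from("<H", data, 0)  # size in bytes, not words
    words = []
    pos = 2
    while len(words) * 2 < out_len:
        count, tag = data[pos], data[pos + 1]       # low byte, high byte
        pos += 2
        if tag in (NEAR_TAG, FAR_TAG) and count == 0:
            # Escaped literal: the signal byte is the real high byte and
            # the next byte is the real low byte.
            words.append((tag << 8) | data[pos])
            pos += 1
        elif tag == NEAR_TAG:
            start = len(words) - data[pos]          # distance back, in words
            pos += 1
            for i in range(count):                  # word by word: may overlap
                words.append(words[start + i])
        elif tag == FAR_TAG:
            (start,) = struct.unpack_from("<H", data, pos)  # absolute, in words
            pos += 2
            for i in range(count):
                words.append(words[start + i])
        else:
            words.append((tag << 8) | count)        # ordinary literal word
    return struct.pack("<%dH" % len(words), *words)[:out_len]
```

The output would normally still be RLEW compressed, since Carmack compression is applied after RLEW compression.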

[[Category:File Formats]]
[[Category:Compressed Files]]
[[Category:Compression Algorithms]]
[[Category:Commander Keen 4-6]]
[[Category:Catacomb 3D]]
[[Category:Wolfenstein 3D]]
[[Category:Noah's Ark 3D]]

Talk:LZW Compression

2013-12-31T23:10:19Z

Fleexy: Responded to Malv

The article states that some games are limited to 12-bit codes to save space, but it doesn't say what happens when this limit is reached (which I am having trouble trying to reverse engineer.) I can't figure out whether it resets the dictionary or just keeps adjusting the last entry. Any ideas? -- [[User:Malvineous|Malvineous]] 07:24, 19 September 2010 (GMT)

When a 12-bit dictionary is filled, it simply stops adding codes. Therefore, the compressor just continues using the longest string in the dictionary for output, but doesn't add any new codes. See my VB.NET implementation for details. [[User:Fleexy|Fleexy]] ([[User talk:Fleexy|talk]]) 23:10, 31 December 2013 (GMT)

LZW Compression

2013-12-31T23:05:09Z

Fleexy: Made the separate dictionary general algorithm billions of times clearer

LZW compression is a common form of compression used in some early games to compress data and by most early games to compress their executables. It is notable in being one of the first compression methods to not compress on the byte level (Along with [[Huffman Compression]]) and for its efficiency.

The basic concept for LZW is universal, though the implementations differ. In essence it involves replacing data strings that have been encountered before with references to already decompressed data (known as a 'dictionary'). This can be done in a number of ways, the two main approaches differing on whether the dictionary is separate or integrated.

= Separate Dictionary Approach =

In this approach the dictionary is separate from the data being decompressed, that is it is stored in a separate location in memory. In this it behaves more like one would expect a dictionary to work; when a codeword is found in the data, it is looked up in the dictionary and the corresponding string copied to output. (As an example dictionary entry 42 could represent the string 'life', thus whenever the code '42' is encountered the string 'life' is added to the decompressed data.)

The advantage of this method is that the efficiency of compression increases as the amount of data to compress increases. The following points differ between implementations:

* The initial dictionary. Just how large the initial dictionary is varies. Some implementations start with no dictionary at all, others set a number of entries, usually 256, covering all possible 1-byte values.

* The maximum size of the dictionary. Many older implementations with fewer resources were forced to cap the dictionary at a certain size, usually a power of two entries long (512, 1024...). Unlimited implementations are rare as modern methods (e.g. the DEFLATE algorithm) usually rely on several compression methods at once. Sometimes the dictionary is 'reset' when it reaches too large a size.

* Whether the codestream is made partly or entirely of codewords. Often the compressed data is made entirely of codewords, even non-repeating strings, which means that initially compression can sometimes be rather poor. Other implementations use codewords only for repeating strings. Differences in how codewords vs literals are indicated and how dictionaries are built up may occur.

== Decoding ==

This is a general decoding algorithm for separate dictionary LZW. It will need to be altered slightly when dealing with different implementations. Notably it assumes that the codestream is composed entirely of codewords and that the dictionary can keep growing indefinitely.

Add all roots to the dictionary. Code 0 corresponds to $00, code 1 is $01, etc to $FF;
Add error, clear, and end-mark flags to the dictionary as appropriate;
FirstCode [as unknown length binary number] = the first code in the codestream;
CurMatch [as byte array] = the dictionary entry for FirstCode;
Output CurMatch;
Loop until end of codestream {
CurCode [as unknown length binary number] = next code in the codestream;
TempMatch [as byte array] allocate;
If there is an entry for CurCode in the dictionary {
TempMatch = the dictionary entry for CurCode;
} If not {
TempMatch = CurMatch;
Concatenate the first byte of CurMatch to the end of TempMatch;
}
Output TempMatch;
NewDictEntry [as byte array] = CurMatch;
Concatenate the first byte of TempMatch to the end of NewDictEntry;
Try to add NewDictEntry to the dictionary for the first empty key;
CurMatch = TempMatch;
}
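The pseudocode above translates almost line for line into Python. The sketch below makes the same simplifying assumptions (every item in the codestream is a codeword, the dictionary grows without limit) and additionally assumes that no error, clear or end-mark codes are reserved, so the first free code is 256. It takes codes that have already been unpacked into integers; the name <tt>lzw_decode</tt> is just a label for this example.

```python
def lzw_decode(codes):
    """Decode an iterable of integer LZW codes into bytes."""
    table = {i: bytes([i]) for i in range(256)}  # roots $00-$FF
    next_code = 256                              # assumed: no reserved codes
    stream = iter(codes)
    cur_match = table[next(stream)]              # FirstCode is always a root
    out = bytearray(cur_match)
    for code in stream:
        if code in table:
            temp_match = table[code]
        else:
            # Code not defined yet: it can only be the entry about to be made.
            temp_match = cur_match + cur_match[:1]
        out += temp_match
        table[next_code] = cur_match + temp_match[:1]
        next_code += 1
        cur_match = temp_match
    return bytes(out)
```

For instance, the codes 65, 66, 256, 258, 66 expand to the eight bytes <tt>ABABABAB</tt>; code 258 is used before it exists, which exercises the 'If not' branch.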

== [[Commander Keen 1-3|Commander Keen 1-3 LZW]] ==

In [[Commander Keen 1-3]] LZW is used to compress the <tt>EGALATCH</tt> and <tt>EGASPRIT</tt> files in episode 1 (It can also be used in episodes 2 and 3, but isn't.) The game uses two error-checking methods in this implementation, firstly it reserves two dictionary values, $100 to indicate an error (This is written by the compression program and will make the executable abort.) and $101 to indicate the end of data. (If the program reaches the end of the data without encountering this it will also abort.) The compressed data is also prefixed with a dword giving the decompressed data size, so this can be compared with the output.

This method is a typical separate dictionary approach. It starts with a dictionary of 256 9-bit codewords representing the 8-bit strings $00-$FF (Plus some special cases.) The dictionary is allowed to grow to 4096 entries. The following is the initial dictionary:

0000 - 00 (character)
0001 - 01 (character)
...
00FE - FE (character)
00FF - FF (character)
0100 - Reserved for errors...
0101 - Reserved for end of compressed data...
0102 - (not set)
0103 - (not set)

It will be immediately noticed that 4096 entries cannot be represented by 9-bit codes but at the least by 12-bit codes. To further conserve space the length of the codewords is increased every time the dictionary grows too large. Thus when it reaches $01FF entries codewords become 10 bits long, at $03FF they are 11 bits and finally at $07FF 12 bits. At $0FFF entries the dictionary stops growing.

The following data is taken from the EGALATCH file from Keen 1. Notice that the first six bytes are ignored. (The first four give the decompressed data size, the next two are the maximum number of bits the LZW decoder will use.) The first few steps of decompression follow.

0000 80 D3 01 00 0C 00 00 40 - A0 70 31 E9 F8 F8 78 38
0010 08 08 00 07 FC 39 FF 04 - 5E 41 E1 30 B3 C4 5A 2F

The first code word encountered is 000000000 (First 9 bits) and thus outputs the string $00 - the first dictionary entry. This has set us up to step 4 and now things work slightly differently.

The second code word is the next 9 bits, 100000010, which would point to entry $102. Since this entry is NOT found in the dictionary yet, we will create this entry then output it. Entry $102 is created by taking the previous codeword's string and adding to it the first byte of that string. In this case the previous code word's (0) string is $00. $00 + $00 is $00 $00. Entry $102 thus represents the string $00 $00.

The next code word is 100000011, which is entry $103, which again doesn't exist. Entry $103 is created just like with $102, except now since the previous codeword is $00 $00, entry $103 is $00 $00 $00.

The next code word is again $103, this IS found in the dictionary and is outputted ($00 $00 $00) however we now create dictionary entry $104 just like $103. (It is $00 $00 $00 $00.) Note that the previous codeword is still $103.

The next code word is 000111101, which is $3D. It is outputted and entry $105 created ($00 $00 $00 $3D). Now the previous codeword is $3D. This pattern continues.
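The walkthrough above can be turned into a compact Python decoder. This is a sketch based on the description in this section, not a port of the game's routine: the name <tt>keen_lzw_decode</tt> and its parameters are invented for the example, it assumes codes are packed most significant bit first (as in the bit strings shown above), and it does not compare the output against the size stored in the first four bytes.

```python
ERROR_CODE = 0x100  # reserved: written by the compressor to signal an error
END_CODE = 0x101    # reserved: end of compressed data


def keen_lzw_decode(data: bytes, header_size: int = 6, max_bits: int = 12) -> bytes:
    """Decode Keen 1-3 style LZW data, skipping the 6-byte header."""
    table = [bytes([i]) for i in range(256)] + [b"", b""]  # $100/$101 placeholders
    width = 9
    bitpos = header_size * 8

    def read_code() -> int:
        nonlocal bitpos
        code = 0
        for _ in range(width):
            if bitpos >= len(data) * 8:
                raise ValueError("ran out of data before the end code")
            byte = data[bitpos >> 3]
            code = (code << 1) | ((byte >> (7 - (bitpos & 7))) & 1)
            bitpos += 1
        return code

    out = bytearray()
    prev = None
    while True:
        code = read_code()
        if code == END_CODE:
            break
        if code == ERROR_CODE:
            raise ValueError("error code $100 found in stream")
        if prev is None:
            cur = table[code]                    # first code: just output it
        else:
            cur = table[code] if code < len(table) else prev + prev[:1]
            if len(table) < (1 << max_bits):     # stop growing at 4096 entries
                table.append(prev + cur[:1])
                # Codes widen once the next free entry is $1FF, $3FF or $7FF.
                if len(table) == (1 << width) - 1 and width < max_bits:
                    width += 1
        out += cur
        prev = cur
    return bytes(out)
```

A caller would typically read the whole file into memory, pass it in unchanged, and then check the length of the result against the dword at the start of the file.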

= Integrated Dictionary Approach =

In this approach the dictionary is the decompressed data itself. The codewords do not represent an entry in the dictionary structure, but rather directions in the decompressed code as to where a repeated string is located. That is, a repeated string may well be represented by a codeword that states 'copy seven bytes from byte 132 in the decompressed data'.

The advantage of this approach is that it doesn't require a separate construct for the dictionary and can just use the already decompressed data. The downside is that it will always need some method to distinguish literals and codewords, and, since codewords are nearly always of a fixed length, there is an inherent limit both to how long the copied string can be and where it can be read from. This means that eventually the compression efficiency will level off and stop improving.

The region that can be read from is often called the 'sliding window'; it represents the data that can be 'reached' by the codewords. It is named as such because it is of a fixed length and 'slides' along the output stream as it gets longer. The following features differ between implementations:

* Differentiating codewords from literals. There must be a way to tell codewords apart from data that is just to be read and outputted. Sometimes this is integrated into the codewords themselves, but more often the 'flag' precedes a codeword. The flag may indicate only codewords or both codewords and literals. ('Following data is made of two codewords and six literals', etc.)

* Codeword format. Most implementations use codewords of a fixed format that must encode both the length of data to copy and the location to copy it from. Codewords are usually two or four bytes long.

* Zero point location. Different implementations use different places in the output as zero. If the start of the code is used as zero only the first x bytes of the output can be used as a reference. Most implementations use a more complex, but seldom less effective 'sliding window'; the zero point is the start of the data until the data becomes too long at which point it moves forward so that the most recent x bytes of output can be used. It is also possible for zero to be the most recent byte with all locations being 'x bytes from the end', which also produces a sliding window. Finally it may be possible to have both negative and positive locations.

* Sliding window. The nature of this is dictated by the format of the codewords. Common sizes are 1KB, 2KB or 4KB. The window will always be present, but it may not 'slide' if the implementation uses a fixed location as the zero point.

== LZEXE ==

Many vintage executables are compressed with the program LZEXE, interesting in that the compressed file contains its own decompressor, meaning that it is in essence a self-extracting archive. However the unique feature of LZEXE executables is that they extract the compressed data to memory and run it. To the user this is indistinguishable from the decompressed executable, though it takes slightly longer to start up and takes up much less space.

[[UNLZEXE]] can be used to extract the decompressed executables from this which will run perfectly with other game files. It can be obtained here: http://www.dosclassics.com/download/198 It is currently unknown specifically how the LZW compression is implemented in this case, but the source code for decompression is available.

The LZEXE compression is similar to the SoftDisk Library Approach described below, but it uses UINT16LE values instead of byte values to store the flag bits and the sliding window has a size of 8192 (0x2000) bytes. Also, the flag bits have different meanings (you could argue that they are in fact Huffman codes). Each code below is written with the first bit read on the right, so a literal (1) is never confused with the longer codes, which all begin with a 0 bit:

1 -> copy 1 literal byte
10 -> next two bytes contain length and distance
0000 -> length is 2, next byte contains distance
1000 -> length is 3, next byte contains distance
0100 -> length is 4, next byte contains distance
1100 -> length is 5, next byte contains distance

The real distance value is always a signed 16 bit integer (you can use unsigned but then you have to bitwise-and the resulting index value with 0xFFFF). If the length is given by the flag bits/Huffman code, the real distance is <tt>b | 0xFF00</tt> or <tt>b - 256</tt> where <tt>b</tt> is the byte value read from the file. If the length value is not given, read the byte values <tt>b0</tt> and <tt>b1</tt> and calculate length and distance like this:

length = (b1 mod 8)+2
distance = b0 + (b1/8)*256 - 8192
or
length = (b1 & 0x07)+2
distance = b0 | ((b1 & 0xF8)<<5) | 0xE000

If the <tt>length</tt> value calculated from <tt>b1</tt> is 2, this indicates that another byte value <tt>b2</tt> must be read. Depending on the value of b2, one of three things can happen:
b2 = 0:
end of compressed data - stop decompressing
b2 = 1:
end of segment
(decompressor may write contents of buffer to output)
set length to 0 or jump to the part where the decompressor reads the next flag bits/Huffman code
otherwise:
set length to b2+1

Now the decompressor must only add <tt>distance</tt> to the current buffer index (since <tt>distance</tt> is negative, the index goes backwards) and copy <tt>length</tt> bytes from there to the current index:

WHILE length > 0
buffer[index] = buffer[index+distance]
index = index + 1
length = length - 1
END WHILE
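The distance arithmetic and the copy loop can be checked with a small Python sketch. It covers only the calculations described above, not the flag-bit reader or the <tt>b2</tt> special cases, and the three function names are hypothetical. Python integers are not limited to 16 bits, so the sketch subtracts instead of OR-ing with 0xFF00 or 0xE000, which gives the same signed result.

```python
def short_distance(b: int) -> int:
    """Distance for the 4-bit codes: same as b | 0xFF00 read as signed 16-bit."""
    return b - 256


def long_match(b0: int, b1: int):
    """Length and distance for the two-byte form (before any b2 handling)."""
    length = (b1 & 0x07) + 2
    distance = (b0 | ((b1 & 0xF8) << 5)) - 0x2000  # 13-bit value minus 8192
    return length, distance


def copy_match(buffer: bytearray, length: int, distance: int) -> None:
    """Append `length` bytes read `distance` (negative) bytes back."""
    for _ in range(length):
        source = len(buffer) + distance
        if source < 0:
            raise ValueError("match starts before the beginning of the buffer")
        buffer.append(buffer[source])  # byte by byte, so matches may overlap
```

Copying one byte at a time matters: a match with a distance of -1 and a length of 5 repeats the last byte five times, which a block copy would not do.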

Please refer to the UNLZEXE source code for further information.

== SoftDisk Library Approach ==

This is used as the first form of compression in the [[Softdisk Library Format]]. Flags are 1 byte long and divide the datastream into segments of eight 'values' which can be either literals or codewords. Codewords are 2 bytes long, literals 1 byte. (Therefore there will be a flag byte every 8 to 16 bytes of data.) The value of each bit, starting from the least significant, indicates whether a value will be a literal (1) or codeword (0). Thus a value of 199 (11000111 in binary) indicates three literals, three codewords and two literals in that order. (Total of 11 bytes.)

Literals are sequences that have never been seen in the datastream before, they cannot be compressed and are thus the same in the compressed and decompressed datastreams. (If the data is text they become quite obvious.) Any string less than 3 bytes long that has not been read before or cannot be pointed to (See below) will be stored as literals.

Codewords are references to data that has already been read. They are two bytes long, with the first 12 bits giving the location to read data from and the last 4 bits giving the length of data to read.

The lower nybble (4 bits) of the second codeword byte holds the length of repeat data to read minus three. (This makes sense, the shortest sequence it makes sense to code is three bytes which can be given the value 0.) It will be immediately apparent that the maximum length of repeated data that can be stored as a codeword is 18 bytes.

The high nybble of the second byte is multiplied by 16 then added to the first byte to give the location of the data to read in the 'sliding window' minus 19. (This is due to the way the decompression is set up in memory.)

It will be immediately obvious that the codewords can encode values between +-2048, or about 2KB. If the decompressed data is less than 2KB in size then zero is the start of the data; if it is larger, then it is 2048 bytes from the data end.

It will be noted that it is probable that the compressed datastream will not be perfectly divisible by flag bytes. In this case the unused bits are set to 0. The decompressor stops when the decompressed data size is equal to the value given in the chunk header. (If it runs out of data it will abort.)

As a simple example the sentence 'I am Sam. Sam I am!' will be compressed to:

FF Flag byte, 8 literals follow
49 20 61 6D 20 53 61 6D 'I am Sam' as literals
2B Flag byte, 2L, P, L, P, L 2Blanks ($2B = 43 = 00101011)
2E 20 ' .' as literals
F2 F0 codeword, read 0 + 3 = 3 bytes from $FF2, or -14 + 19 = 5 in the data. This is 'Sam'
20 ' ' as literal
ED F1 codeword, read 1 + 3 = 4 bytes from $FED or -19 + 19 = 0 in the data. This is 'I am'
21 '!' as literal
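The worked example can be reproduced with a short Python sketch. It follows the rules given in this section and nothing else: the name <tt>softdisk_lz_decode</tt> is invented, the 4096-byte ring buffer with its write position starting at $FED (4096 minus 19) is inferred from the example above, and the initial contents of the ring (zeros here) are an assumption, since no codeword in the example reads data that has not been written yet.

```python
WINDOW = 4096
START = WINDOW - 19  # $FED: where the first output byte lands in the ring


def softdisk_lz_decode(data: bytes, out_size: int) -> bytes:
    """Decode until `out_size` bytes (taken from the chunk header) exist."""
    ring = bytearray(WINDOW)   # assumed to start zero-filled
    write_pos = START
    out = bytearray()
    pos = 0

    def emit(value: int) -> None:
        nonlocal write_pos
        out.append(value)
        ring[write_pos] = value
        write_pos = (write_pos + 1) % WINDOW

    while len(out) < out_size:
        flags = data[pos]
        pos += 1
        for bit in range(8):                       # least significant bit first
            if len(out) >= out_size:
                break                              # leftover flag bits are unused
            if flags & (1 << bit):                 # 1 = literal
                emit(data[pos])
                pos += 1
            else:                                  # 0 = two-byte codeword
                b0, b1 = data[pos], data[pos + 1]
                pos += 2
                source = b0 | ((b1 & 0xF0) << 4)   # 12-bit ring location
                length = (b1 & 0x0F) + 3           # 3 to 18 bytes
                for i in range(length):
                    emit(ring[(source + i) % WINDOW])
    return bytes(out[:out_size])
```

Feeding it the 18 compressed bytes listed above with an output size of 19 gives back the original sentence.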

= Source code =

Some example code is available in various languages showing how to decompress (and in some cases compress) files using Keen's LZW algorithm in its various implementations.

== Keen 1-3 Implementation ==

These segments of code work with the Keen 1-3 implementation only and will not, for example, decompress LZEXE compressed executables.

=== QuickBasic ===

<source lang="qbasic">
DECLARE FUNCTION READBITS% (FILE AS INTEGER, NUMBITS AS INTEGER)
DECLARE SUB LZWDECOMPRESS (INNAME AS STRING, OUTNAME AS STRING)
DECLARE SUB LZWOUTPUT (FILE AS INTEGER, DIC AS INTEGER, CHAR AS INTEGER)
'
' KEEN1 Compatible LZW Decompressor (Lempel-Ziv-Welch)
' - by Napalm with thanks to Adurdin's work on ModKeen
'
' This source is Public Domain
'
'

' Allocate dictionary
DIM LZDIC(0 TO 4095) AS INTEGER
DIM LZCHR(0 TO 4095) AS INTEGER

' Test Function
LZWDECOMPRESS "EGALATCH.CK1", "EGALATCH.DAT"

SUB LZWDECOMPRESS (INNAME AS STRING, OUTNAME AS STRING)
SHARED LZDIC() AS INTEGER, LZCHR() AS INTEGER
DIM INFILE AS INTEGER, OUTFILE AS INTEGER, I AS INTEGER
DIM BITLEN AS INTEGER, CURPOS AS INTEGER
DIM CW AS INTEGER, PW AS INTEGER, C AS INTEGER, P AS INTEGER
DIM CHECK AS INTEGER

' Open files for input and output
INFILE = FREEFILE
OPEN INNAME FOR BINARY ACCESS READ AS INFILE
OUTFILE = FREEFILE
OPEN OUTNAME FOR BINARY ACCESS WRITE AS OUTFILE
SEEK INFILE, 7

' Fill dictionary with starting values
FOR I = 0 TO 4095
LZDIC(I) = -1
IF I < 256 THEN
LZCHR(I) = I
ELSE
LZCHR(I) = -1
END IF
NEXT I

' Decompress input stream to output stream
BITLEN = 9
CURPOS = 258
CW = READBITS(INFILE, BITLEN)
LZWOUTPUT OUTFILE, LZDIC(CW), LZCHR(CW)

WHILE CW <> &H100 AND CW <> &H101
PW = CW
CW = READBITS(INFILE, BITLEN)
IF CW <> &H100 AND CW <> &H101 THEN
P = PW
CHECK = (LZCHR(CW) <> -1)

IF CHECK THEN
TMP = CW
ELSE
TMP = PW
END IF
WHILE LZDIC(TMP) <> -1
TMP = LZDIC(TMP)
WEND
C = LZCHR(TMP)

IF CHECK THEN
LZWOUTPUT OUTFILE, LZDIC(CW), LZCHR(CW)
ELSE
LZWOUTPUT OUTFILE, P, C
END IF

IF CURPOS < 4096 THEN
LZDIC(CURPOS) = P
LZCHR(CURPOS) = C
CURPOS = CURPOS + 1
IF CURPOS = (2 ^ BITLEN - 1) AND BITLEN < 12 THEN
BITLEN = BITLEN + 1
END IF
END IF

END IF
WEND

' Close files
CLOSE OUTFILE
CLOSE INFILE
END SUB

SUB LZWOUTPUT (FILE AS INTEGER, DIC AS INTEGER, CHAR AS INTEGER)
SHARED LZDIC() AS INTEGER, LZCHR() AS INTEGER
DIM LZSTK(0 TO 127) AS STRING * 1
DIM X AS INTEGER, SP AS INTEGER
DIM LDIC AS INTEGER, LCHAR AS INTEGER

LCHAR = CHAR
LDIC = DIC
SP = 0
X = 1

DO
IF SP >= 128 THEN
PRINT "LZW: Stack Overflow!"
END
END IF
LZSTK(SP) = CHR$(LCHAR)
SP = SP + 1
IF LDIC <> -1 THEN
LCHAR = LZCHR(LDIC)
LDIC = LZDIC(LDIC)
ELSE
X = 0
END IF
LOOP WHILE X

WHILE SP <> 0
SP = SP - 1
PUT FILE, , LZSTK(SP)
WEND
END SUB

FUNCTION READBITS% (FILE AS INTEGER, NUMBITS AS INTEGER)
STATIC BITDAT AS STRING * 1, BITPOS AS INTEGER
DIM BITVAL AS INTEGER, BIT AS INTEGER

BITVAL = 0
FOR BIT = (NUMBITS - 1) TO 0 STEP -1
IF BITPOS = 0 THEN
GET FILE, , BITDAT
BITPOS = 7
ELSE
BITPOS = BITPOS - 1
END IF
IF ASC(BITDAT) AND 2 ^ BITPOS THEN
BITVAL = BITVAL OR 2 ^ BIT
END IF
NEXT BIT

READBITS% = BITVAL
END FUNCTION</source>

=== FreeBasic ===

This code does not suffer from the 64K memory limit imposed by QuickBasic and so is less efficient, but runs faster. It can be compiled with FreeBasic compiler using the -lang=qb switch. Aside from memory concerns, all code here is compatible with QuickBasic.

The code before the subroutine is used to make a string containing the bit expansion of all values from 0 to 255. The subroutine takes a filename, reads the entire file into memory then expands each bit of data to a byte using the aforesaid string as Basic cannot deal with bits directly. cw$ is codeword, pw$ is the previous codeword, lun is the lowest dictionary entry that is empty, p is the location in the compressed data stream and bl is the length of codes in bits (Starting at nine bits increasing to 12)

The dictionary is set before decompression. The first 258 are the starting dictionary, the remainder are cleared. (It is vital to reset the dictionary for each file.) An error occurs if entry 256 is found in the data. 'Disrupt!' is printed when the newest dictionary entry is not the lowest possible entry. (This shouldn't happen but is possible.) Decompression ends at encountering entry 257, or when there is no more data to read.

<source lang="qbasic">
DECLARE SUB LZWDEC (lfn AS STRING)

x$ = ""
FOR l = 0 TO 255
IF (l AND 128) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
IF (l AND 64) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
IF (l AND 32) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
IF (l AND 16) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
IF (l AND 8) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
IF (l AND 4) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
IF (l AND 2) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
IF (l AND 1) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
NEXT l
LZWDEC "EGALATCH.CK1"
END

'_________________________________________________
SUB LZWDEC (lfn AS STRING) ' Decompress LZW data
'_________________________________________________
DIM lzw(0 TO 4095) AS STRING
PRINT lfn; " is LZW compressed, decompressing...";
OPEN folder + lfn FOR BINARY AS #9
y$ = SPACE$(LOF(9))
GET #9, 1, y$
CLOSE #9
z$ = ""
FOR l = 7 TO LEN(y$)
z$ = z$ + MID$(x$, (ASC(MID$(y$, l, 1)) * 8) + 1, 8)
NEXT l

bl = 9
lun = 258
p = 1
cw$ = ""
y$ = ""
FOR l = 0 TO 4095
IF l < 256 THEN lzw(l) = CHR$(l) ELSE lzw(l) = ""
NEXT l
DO
IF lun = 511 THEN bl = 10
IF lun = 1023 THEN bl = 11
IF lun = 2047 THEN bl = 12
pw$ = cw$
u$ = MID$(z$, p, bl)
p = p + bl
y = 0
FOR l = 1 TO bl
IF MID$(u$, bl - l + 1, 1) = "1" THEN y = y + (2 ^ (l - 1))
NEXT l
IF y = 256 THEN
PRINT "LZW error in Keen data!"
OPEN "ERROR.DAT" FOR OUTPUT AS #9
PRINT #9, y$;
CLOSE
END
END IF
IF y = 257 THEN EXIT DO
IF cw$ = "" THEN
cw$ = lzw(y)
y$ = y$ + cw$
ELSE
IF lun < 4096 THEN
IF lzw(y) = "" THEN
cw$ = pw$ + LEFT$(pw$, 1)
lzw(y) = cw$
y$ = y$ + cw$
IF y <> lun THEN PRINT "Disrupt!"
lun = y + 1
ELSE
cw$ = lzw(y)
y$ = y$ + cw$
lzw(lun) = pw$ + LEFT$(cw$, 1)
lun = lun + 1
END IF
ELSE
IF lzw(y) = "" THEN
y$ = y$ + cw$
ELSE
cw$ = lzw(y)
y$ = y$ + cw$
END IF
END IF
END IF
LOOP WHILE p < LEN(z$)
IF y = 257 THEN PRINT "done" ELSE PRINT "out of data."
OPEN folder + LEFT$(lfn, 4) + extq FOR OUTPUT AS #9
PRINT #9, y$;
CLOSE #9
END SUB</source>

=== Visual Basic .NET ===

This implementation uses high-level elements such as lambdas, anonymous arrays, and strict types. It must be compiled for the Microsoft .NET Framework v4.5 in Visual Studio 2012. It has the advantages of running in non-DosBox Windows and using .NET streams for simple reusability. It can also be used for higher-bit LZW.

==== Decompression ====

Decompressing is very fast.
<source lang="vbnet"> Sub DecompressLZW(Data As IO.Stream, MaxBits As Byte, Output As IO.Stream)
' This source is by Fleexy and is in the public domain. If used, please note it as such.
Dim dict As New List(Of Byte())
For x = 0 To 255
dict.Add({x})
Next
dict.Add({})
dict.Add(Nothing)
Dim usebits As Byte = 9
Dim bpos As Long
Dim bits As New List(Of Byte)
Do Until Data.Position = Data.Length
Dim b, ub As Byte
b = Data.ReadByte
ub = b
For x = 7 To 0 Step -1
If ub - (2 ^ x) >= 0 Then
ub -= (2 ^ x)
bits.Add(1)
Else
bits.Add(0)
End If
Next
Loop
Dim GetCode = Function() As UInteger
Dim u As UInteger
For x = usebits To 1 Step -1
If bits(bpos) = 1 Then u += (2 ^ (x - 1))
bpos += 1
Next
Return u
End Function
Dim OutputCode = Sub(DecompData As Byte())
Dim n As UInteger = DecompData.Length
Output.Write(DecompData, 0, n)
End Sub
Dim AddToDict = Sub(Entry As Byte())
If dict.Count < (2 ^ MaxBits) Then
dict.Add(Entry)
If dict.Count = (2 ^ usebits) - 1 Then usebits = Math.Min(usebits + 1, MaxBits)
End If
End Sub
Dim fcode As UInteger = GetCode()
Dim match As Byte() = dict(fcode)
OutputCode(match)
Do
Dim ncode As UInteger = GetCode()
If ncode = 257 Then Exit Do
If ncode = 256 Then Throw New Exception
Dim nmatch As Byte()
If ncode < dict.Count Then
nmatch = dict(ncode)
Else
nmatch = match.Concat({match(0)}).ToArray
End If
OutputCode(nmatch)
AddToDict(match.Concat({nmatch(0)}).ToArray)
match = nmatch
Loop
End Sub</source>

==== Compression ====

Compression is more difficult; consulting the dictionary for a byte array takes more time. The speed of this algorithm may be unacceptable.
<source lang="vbnet"> Sub CompressLZW(Data As IO.Stream, MaxBits As Byte, Output As IO.Stream)
' This source is by Fleexy and is in the public domain.
Dim bits As New List(Of Byte)
Dim dict As New List(Of Byte())
For x = 0 To 255
dict.Add({x})
Next
dict.Add({})
dict.Add({})
Dim usebits As Byte = 9
Dim PutCode = Sub(Code As UInteger)
For x = usebits To 1 Step -1
If Code - (2 ^ (x - 1)) >= 0 Then
Code -= (2 ^ (x - 1))
bits.Add(1)
Else
bits.Add(0)
End If
Next
End Sub
Dim AddToDict = Function(Entry As Byte()) As Boolean
If dict.Count < 2 ^ MaxBits Then
dict.Add(Entry)
If dict.Count = 2 ^ usebits Then usebits = Math.Min(usebits + 1, MaxBits)
Return True
Else
Return False
End If
End Function
Dim FindCode = Function(Bytes As Byte()) As UInteger
For x = 1 To dict.Count
If dict(x - 1).Count = Bytes.Count Then
If dict(x - 1).SequenceEqual(Bytes) Then Return x - 1
End If
Next
Throw New NotFiniteNumberException
End Function
Dim DictContains = Function(Bytes As Byte()) As Boolean
For x = 1 To dict.Count
If dict(x - 1).Length = Bytes.Length Then
If dict(x - 1).SequenceEqual(Bytes) Then Return True
End If
Next
Return False
End Function
Dim match As Byte() = {}
Do Until Data.Position = Data.Length
Dim nbyte As Byte = Data.ReadByte
Dim nmatch As Byte() = match.Concat({nbyte}).ToArray
If DictContains(nmatch) Then
match = nmatch
Else
PutCode(FindCode(match))
AddToDict(nmatch)
match = {nbyte}
End If
Loop
PutCode(FindCode(match))
PutCode(257)
Do Until bits.LongCount Mod 8L = 0L
bits.Add(0)
Loop
For x = 1 To CInt(bits.LongCount / 8L)
Dim b As Byte = 0
For y = 0 To 7
b += bits((x - 1) * 8 + y) * (2 ^ (7 - y))
Next
Output.WriteByte(b)
Next
End Sub</source>
----

[[Category:Commander Keen 1-3]]
[[Category:File Formats]]
[[Category:Compressed Files]]

LZW Compression

2013-12-31T22:42:52Z

Fleexy: Forgot </source>

LZW compression is a common form of compression, used by some early games to compress data and by most early games to compress their executables. It is notable for its efficiency and for being one of the first compression methods (along with [[Huffman Compression]]) that does not compress on the byte level.

The basic concept of LZW is universal, though implementations differ. In essence it replaces data strings that have been encountered before with references to already decompressed data (known as a 'dictionary'). This can be done in a number of ways; the two main approaches differ in whether the dictionary is separate or integrated.

= Separate Dictionary Approach =

In this approach the dictionary is separate from the data being decompressed, that is it is stored in a separate location in memory. In this it behaves more like one would expect a dictionary to work; when a codeword is found in the data, it is looked up in the dictionary and the corresponding string copied to output. (As an example dictionary entry 42 could represent the string 'life', thus whenever the code '42' is encountered the string 'life' is added to the decompressed data.)

The advantage of this method is that the efficiency of compression increases as the amount of data to compress increases. The following points differ between implementations:

* The initial dictionary. Just how large the initial dictionary is varies. Some implementations start with no dictionary at all; others preset a number of entries, usually 256, covering all possible 1-byte values.

* The maximum size of the dictionary. Many older implementations with fewer resources were forced to cap the dictionary at a certain size, usually a power of two entries long (512, 1024...). Unlimited implementations are rare, as modern methods (e.g. the DEFLATE algorithm) usually rely on several compression methods at once. Sometimes the dictionary is 'reset' when it grows too large.

* Whether the codestream is made partly or entirely of codewords. Often the compressed data is made entirely of codewords, even for non-repeating strings, which means that initially compression can be rather poor. Other implementations use codewords only for repeating strings. Implementations may also differ in how codewords and literals are told apart and in how the dictionary is built up.

== Decoding ==

This is a general decoding algorithm for separate dictionary LZW. It will need to be altered slightly when dealing with different implementations. Notably it assumes that the codestream is composed entirely of codewords and that the dictionary can keep growing indefinitely.

 1 At the start the dictionary contains all possible roots;
 2 cW := the first code word in the codestream (it denotes a root);
 3 output the string.cW to the charstream;
 4 pW := cW;
 5 cW := next code word in the codestream;
 6 Is the string.cW present in the dictionary?
   a if it is,
     i   output the string.cW to the charstream;
     ii  P := string.pW;
     iii C := the first character of the string.cW;
     iv  add the string P+C to the dictionary;
   b if not,
     i   P := string.pW;
     ii  C := the first character of the string.pW;
     iii output the string P+C to the charstream;
     iv  add the string P+C to the dictionary (now it corresponds to the cW);
 7 Are there more code words in the codestream?
   a if yes, go back to step 4;
   b if not, END.
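
The steps above can be sketched in Python as follows. This is only an illustration of the general algorithm, not of any particular game's format: it assumes the codes have already been unpacked into integers, that the roots are the 256 single-byte strings, and that new entries are numbered from 256 upwards. The name <tt>lzw_decode</tt> is made up for this example.

```python
def lzw_decode(codes):
    """Decode a sequence of integer LZW codes into bytes (generic sketch)."""
    # Step 1: the dictionary starts with all possible roots (single bytes).
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256  # assumption: new entries are numbered from 256 upwards

    codes = iter(codes)
    try:
        prev = table[next(codes)]  # steps 2-3: first code word denotes a root
    except StopIteration:
        return b""
    out = bytearray(prev)

    for code in codes:  # steps 4-7
        if code in table:
            entry = table[code]  # step 6a: known code word
        elif code == next_code:
            entry = prev + prev[:1]  # step 6b: the entry about to be created
        else:
            raise ValueError(f"unexpected code {code}")
        out += entry
        # In both branches the new entry is P (previous string) + C.
        table[next_code] = prev + entry[:1]
        next_code += 1
        prev = entry
    return bytes(out)


print(lzw_decode([65, 66, 256, 258]))  # b'ABABABA'
```

The last code in the usage line (258) is not yet in the dictionary when it is read, so it exercises branch 6b.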

== [[Commander Keen 1-3|Commander Keen 1-3 LZW]] ==

In [[Commander Keen 1-3]] LZW is used to compress the <tt>EGALATCH</tt> and <tt>EGASPRIT</tt> files in episode 1. (It can also be used in episodes 2 and 3, but isn't.) The game uses two error-checking methods in this implementation. Firstly, it reserves two dictionary values: $100 to indicate an error (this is written by the compression program and will make the executable abort) and $101 to indicate the end of data (if the program reaches the end of the data without encountering this it will also abort). Secondly, the compressed data is prefixed with a dword giving the decompressed data size, so this can be compared with the output.

This method is a typical separate dictionary approach. It starts with a dictionary of 256 9-bit codewords representing the 8-bit strings $00-$FF (Plus some special cases.) The dictionary is allowed to grow to 4096 entries. The following is the initial dictionary:

0000 - 00 (character)
0001 - 01 (character)
...
00FE - FE (character)
00FF - FF (character)
0100 - Reserved for errors...
0101 - Reserved for end of compressed data...
0102 - (not set)
0103 - (not set)

Clearly 4096 entries cannot be represented by 9-bit codes; at least 12 bits are needed. To conserve space, codewords start at 9 bits and their length is increased each time the dictionary outgrows the current length. Thus when the dictionary reaches $01FF entries codewords become 10 bits long, at $03FF they become 11 bits, and finally at $07FF 12 bits. At $0FFF entries the dictionary stops growing.

The following data is taken from the EGALATCH file from Keen 1. Notice that the first six bytes are ignored. (The first four give the decompressed data size; the next two are an executable variable of unknown use.) The first few steps of decompression follow.

0000 80 D3 01 00 0C 00 00 40 - A0 70 31 E9 F8 F8 78 38
0010 08 08 00 07 FC 39 FF 04 - 5E 41 E1 30 B3 C4 5A 2F

The first code word encountered is 000000000 (First 9 bits) and thus outputs the string $00 (First dictionary entry.) This has set us up to step 4 and now things work slightly differently.

The second code word is 100000010 (next 9 bits), which is entry $102. Since this entry is NOT found in the dictionary yet, we create it and then output it. Entry $102 is created by taking the previous codeword's string and adding to it the first byte of that string. In this case the previous codeword's ($0) string is $00, and $00 + $00 is $00 $00. Entry $102 thus represents the string $00 $00.

The next code word is 100000011, which is entry $103 and again doesn't exist yet. Entry $103 is created just like $102, except that the previous codeword's string is now $00 $00, so entry $103 is $00 $00 $00.

The next code word is again $103. This time it IS found in the dictionary and is outputted ($00 $00 $00); however, we now also create dictionary entry $104 from the previous string plus the first byte of the current one. (It is $00 $00 $00 $00.)

The next code word is 000111101, which is $3D. It is outputted and entry $105 is created from the previous codeword's string ($103, i.e. $00 $00 $00) plus this byte, giving $00 $00 $00 $3D. Note that the previous codeword is now $3D. This pattern continues.
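
The walkthrough can be reproduced with a short Python sketch. It is written under the assumptions stated in this section (six header bytes skipped, bits read most significant first, entries from $102, code length growing at $1FF, $3FF and $7FF); the function name <tt>keen_lzw_decode</tt> and the <tt>max_codes</tt> parameter are invented for the example and are not part of any existing tool.

```python
def keen_lzw_decode(data, max_bits=12, max_codes=None):
    """Return (codes, output) for Keen 1-3 style LZW data (sketch)."""
    # Expand everything after the 6-byte header into a string of bits.
    bits = "".join(f"{byte:08b}" for byte in data[6:])
    table = {i: bytes([i]) for i in range(256)}
    next_code = 0x102  # $100 and $101 are reserved
    width = 9
    pos = 0
    prev = None
    codes = []
    out = bytearray()
    while pos + width <= len(bits):
        code = int(bits[pos:pos + width], 2)
        pos += width
        if code == 0x101:  # end of compressed data
            break
        if code == 0x100:
            raise ValueError("error code $100 found in data")
        codes.append(code)
        if prev is None:
            entry = table[code]  # the first code word is always a root
        else:
            entry = table[code] if code in table else prev + prev[:1]
            if next_code < (1 << max_bits):
                table[next_code] = prev + entry[:1]
                next_code += 1
                if next_code == (1 << width) - 1 and width < max_bits:
                    width += 1
        out += entry
        prev = entry
        if max_codes is not None and len(codes) >= max_codes:
            break
    return codes, bytes(out)


sample = bytes.fromhex(
    "80 D3 01 00 0C 00 00 40 A0 70 31 E9 F8 F8 78 38"
    "08 08 00 07 FC 39 FF 04 5E 41 E1 30 B3 C4 5A 2F"
)
codes, output = keen_lzw_decode(sample, max_codes=5)
print([f"${code:03X}" for code in codes])
print(output.hex())
```

Only the first five code words are decoded here because the sample is just the start of the file and contains no $101 end marker.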

= Integrated Dictionary Approach =

In this approach the dictionary is the decompressed data itself. The codewords do not represent an entry in a dictionary structure, but rather directions as to where in the decompressed data a repeated string is located. That is, a repeated string may well be represented by a codeword that states 'copy seven bytes from byte 132 in the decompressed data'.

The advantage of this approach is that it doesn't require a separate construct for the dictionary and can just use the already decompressed data. The downside is that it always needs some method to distinguish literals from codewords and, since codewords are nearly always of a fixed length, there is an inherent limit both to how long the copied string can be and to where it can be read from. This means that eventually the compression efficiency will level off and stop improving.

The range of data that can be 'reached' by the codewords is often called the 'sliding window'. It is named as such because it is of a fixed length and 'slides' along the output stream as the output gets longer. The following features differ between implementations:

* Differentiating codewords from literals. There must be a way to tell codewords apart from data that is just to be read and outputted. Sometimes this is integrated into the codewords themselves, but more often the 'flag' precedes a codeword. The flag may indicate only codewords or both codewords and literals. ('Following data is made of two codewords and six literals', etc.)

* Codeword format. Most implementations use codewords of a fixed format that must encode both the length of data to copy and the location to copy it from. Codewords are usually two or four bytes long.

* Zero point location. Different implementations use different places in the output as zero. If the start of the data is used as zero, only the first x bytes of the output can be used as a reference. Most implementations use a more complex, but rarely less effective, 'sliding window': the zero point is the start of the data until the data becomes too long, at which point it moves forward so that the most recent x bytes of output can be used. It is also possible for zero to be the most recent byte, with all locations being 'x bytes from the end', which also produces a sliding window. Finally, it may be possible to have both negative and positive locations.

* Sliding window. The nature of this is dictated by the format of the codewords. Common sizes are 1KB, 2KB or 4KB. The window will always be present, but it may not 'slide' if the implementation uses a fixed location as the zero point.

== LZEXE ==

Many vintage executables are compressed with the program LZEXE, interesting in that the compressed file contains its own decompressor, meaning that it is in essence a self-extracting archive. However the unique feature of LZEXE executables is that they extract the compressed data to memory and run it. To the user this is indistinguishable from the decompressed executable, though it takes slightly longer to start up and takes up much less space.

[[UNLZEXE]] can be used to extract the decompressed executables, which will run perfectly with other game files. It can be obtained here: http://www.dosclassics.com/download/198 It is currently unknown specifically how the LZW compression is implemented in this case, but the source code for decompression is available.

The LZEXE compression is similar to the SoftDisk Library Approach described below, but it uses UINT16LE values instead of byte values to store the flag bits and the sliding window has a size of 8192 (0x2000) bytes. Also, the flag bits have different meanings (you could argue that they are in fact Huffman codes):

1 -> copy 1 literal byte
10 -> next two bytes contain length and distance
0000 -> length is 2, next byte contains distance
1000 -> length is 3, next byte contains distance
0100 -> length is 4, next byte contains distance
1100 -> length is 5, next byte contains distance

The real distance value is always a signed 16 bit integer (you can use unsigned but then you have to bitwise-and the resulting index value with 0xFFFF). If the length is given by the flag bits/Huffman code, the real distance is <tt>b | 0xFF00</tt> or <tt>b - 256</tt> where <tt>b</tt> is the byte value read from the file. If the length value is not given, read the byte values <tt>b0</tt> and <tt>b1</tt> and calculate length and distance like this:

length = (b1 mod 8)+2
distance = b0 + (b1/8)*256 - 8192
or
length = (b1 & 0x07)+2
distance = b0 | ((b1 & 0xF8)<<5) | 0xE000
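
As a quick check that the two forms above describe the same values, here is a small Python sketch. The helper names (<tt>to_signed16</tt>, <tt>short_distance</tt>, <tt>long_match</tt>) are invented for illustration; the division in the first form is taken to be integer division.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a signed integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def short_distance(b):
    """Distance when the length came from the flag bits (one byte read)."""
    return to_signed16(b | 0xFF00)


def long_match(b0, b1):
    """Length and distance when two bytes b0, b1 were read (bitwise form)."""
    length = (b1 & 0x07) + 2
    distance = to_signed16(b0 | ((b1 & 0xF8) << 5) | 0xE000)
    return length, distance


# The arithmetic form gives the same result for every possible byte pair.
for b0 in range(256):
    for b1 in range(256):
        assert long_match(b0, b1) == ((b1 % 8) + 2, b0 + (b1 // 8) * 256 - 8192)

print(long_match(0xFF, 0xFF))  # (9, -1)
print(short_distance(0x00))    # -256
```

A returned length of 2 is the special case in which a further byte has to be read; this sketch does not handle it.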

If the <tt>length</tt> value calculated from <tt>b1</tt> is 2, this indicates that another byte value <tt>b2</tt> must be read. Depending on the value of <tt>b2</tt>, one of three things can happen:
b2 = 0:
end of compressed data - stop decompressing
b2 = 1:
end of segment
(decompressor may write contents of buffer to output)
set length to 0 or jump to the part where the decompressor reads the next flag bits/Huffman code
otherwise:
set length to b2+1

Now the decompressor must only add <tt>distance</tt> to the current buffer index (since <tt>distance</tt> is negative, the index goes backwards) and copy <tt>length</tt> bytes from there to the current index:

WHILE length > 0
buffer[index] = buffer[index+distance]
index = index + 1
length = length - 1
END WHILE
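
The copy loop above deliberately works one byte at a time, so a match may overlap the bytes it is producing. A minimal Python sketch, assuming for simplicity that the whole output so far is kept in one growing buffer (the name <tt>copy_match</tt> is hypothetical):

```python
def copy_match(buffer, distance, length):
    """Append length bytes to buffer, each read from distance bytes back."""
    if distance >= 0 or len(buffer) + distance < 0:
        raise ValueError("distance must be negative and stay inside the buffer")
    index = len(buffer)
    for _ in range(length):
        # Copy byte by byte: the source may be a byte appended a moment ago.
        buffer.append(buffer[index + distance])
        index += 1


buf = bytearray(b"ab")
copy_match(buf, -2, 4)
print(buf)  # bytearray(b'ababab')
```

Here a distance of -2 with a length of 4 repeats the two-byte pattern twice, which a block copy of the original two bytes could not do.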

Please refer to the UNLZEXE source code for further information.

== SoftDisk Library Approach ==

This is used as the first form of compression in the [[Softdisk Library Format]]. Flags are 1 byte long and divide the datastream into segments of eight 'values', which can be either literals or codewords. Codewords are 2 bytes long, literals 1 byte. (Therefore there will be a flag byte every 8 to 16 bytes of data.) Each bit of the flag, read from the least significant bit upwards, indicates whether the corresponding value is a literal (1) or a codeword (0). Thus a flag value of 199 (11000111 in binary) indicates three literals, three codewords and two literals, in that order (a total of 11 bytes).

Literals are sequences that have never been seen in the datastream before, they cannot be compressed and are thus the same in the compressed and decompressed datastreams. (If the data is text they become quite obvious.) Any string less than 3 bytes long that has not been read before or cannot be pointed to (See below) will be stored as literals.

Codewords are references to data that has already been read. They are two bytes long, with 12 bits giving the location to read data from and 4 bits giving the length of data to read.

The lower nybble (4 bits) of the second codeword byte holds the length of repeat data to read minus three. (This makes sense, the shortest sequence it makes sense to code is three bytes which can be given the value 0.) It will be immediately apparent that the maximum length of repeated data that can be stored as a codeword is 18 bytes.

The high nybble of the second byte supplies the top four bits of the location: the second byte is masked with $F0, multiplied by 16 and then added to the first byte. The result is the location of the data to read in the 'sliding window' minus 19. (This is due to the way the decompression is set up in memory.)

The 12-bit location can encode values between -2048 and +2047, or about 2KB either way. If the decompressed data is less than 2KB in size then zero is the start of the data; if it is larger, then zero is 2048 bytes from the end of the data.

It will be noted that it is probable that the compressed datastream will not be perfectly divisible by flag bytes. In this case the unused bits are set to 0. The decompressor stops when the decompressed data size is equal to the value given in the chunk header. (If it runs out of data it will abort.)

As a simple example the sentence 'I am Sam. Sam I am!' will be compressed to:

FF Flag byte, 8 literals follow
49 20 61 6D 20 53 61 6D 'I am Sam' as literals
2B Flag byte, 2L, P, L, P, L 2Blanks ($2B = 43 = 00101011)
2E 20 '. ' as literals
F2 F0 codeword, read 0 + 3 = 3 bytes from $FF2, or -14 + 19 = 5 in the data. This is 'Sam'
20 ' ' as literal
ED F1 codeword, read 1 + 3 = 4 bytes from $FED, or -19 + 19 = 0 in the data. This is 'I am'
21 '!' as literal
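
The example can be decoded with the following Python sketch. It is not taken from any Softdisk tool: it assumes a 4096-byte circular window whose write position starts at $FED (inferred from the 'minus 19' offset described above), flag bits read from the least significant bit upwards, and a known decompressed size. The name <tt>softdisk_decompress</tt> is made up for the example.

```python
def softdisk_decompress(data, out_size):
    """Decode flag-byte/codeword data as described above (sketch)."""
    WINDOW = 4096
    window = bytearray(WINDOW)
    wpos = 0xFED  # assumption: output byte 0 lands at window location $FED
    out = bytearray()
    i = 0

    def put(byte):
        nonlocal wpos
        out.append(byte)
        window[wpos] = byte
        wpos = (wpos + 1) % WINDOW

    while len(out) < out_size:
        flags = data[i]  # an IndexError here means the data ran out
        i += 1
        for bit in range(8):
            if len(out) >= out_size:
                break
            if flags & (1 << bit):  # 1 = literal
                put(data[i])
                i += 1
            else:  # 0 = two-byte codeword
                b0, b1 = data[i], data[i + 1]
                i += 2
                location = b0 | ((b1 & 0xF0) << 4)
                length = (b1 & 0x0F) + 3
                for k in range(length):
                    put(window[(location + k) % WINDOW])
    return bytes(out[:out_size])


packed = bytes.fromhex("FF 49 20 61 6D 20 53 61 6D 2B 2E 20 F2 F0 20 ED F1 21")
print(softdisk_decompress(packed, 19))  # b'I am Sam. Sam I am!'
```

The output size of 19 stands in for the value that would normally come from the chunk header.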

= Source code =

Some example code is available in various languages showing how to decompress (and in some cases compress) files using Keen's LZW algorithm in its various implementations.

== Keen 1-3 Implementation ==

These segments of code work with the Keen 1-3 implementation only and will not, for example, decompress LZEXE compressed executables.

=== QuickBasic ===

<source lang="qbasic">
DECLARE FUNCTION READBITS% (FILE AS INTEGER, NUMBITS AS INTEGER)
DECLARE SUB LZWDECOMPRESS (INNAME AS STRING, OUTNAME AS STRING)
DECLARE SUB LZWOUTPUT (FILE AS INTEGER, DIC AS INTEGER, CHAR AS INTEGER)
'
' KEEN1 Compatible LZW Decompressor (Lempel-Ziv-Welch)
' - by Napalm with thanks to Adurdin's work on ModKeen
'
' This source is Public Domain
'
'

' Allocate dictionary
DIM LZDIC(0 TO 4095) AS INTEGER
DIM LZCHR(0 TO 4095) AS INTEGER

' Test Function
LZWDECOMPRESS "EGALATCH.CK1", "EGALATCH.DAT"

SUB LZWDECOMPRESS (INNAME AS STRING, OUTNAME AS STRING)
SHARED LZDIC() AS INTEGER, LZCHR() AS INTEGER
DIM INFILE AS INTEGER, OUTFILE AS INTEGER, I AS INTEGER
DIM BITLEN AS INTEGER, CURPOS AS INTEGER
DIM CW AS INTEGER, PW AS INTEGER, C AS INTEGER, P AS INTEGER
DIM CHECK AS INTEGER

' Open files for input and output
INFILE = FREEFILE
OPEN INNAME FOR BINARY ACCESS READ AS INFILE
OUTFILE = FREEFILE
OPEN OUTNAME FOR BINARY ACCESS WRITE AS OUTFILE
SEEK INFILE, 7

' Fill dictionary with starting values
FOR I = 0 TO 4095
LZDIC(I) = -1
IF I < 256 THEN
LZCHR(I) = I
ELSE
LZCHR(I) = -1
END IF
NEXT I

' Decompress input stream to output stream
BITLEN = 9
CURPOS = 258
CW = READBITS(INFILE, BITLEN)
LZWOUTPUT OUTFILE, LZDIC(CW), LZCHR(CW)

WHILE CW <> &H100 AND CW <> &H101
PW = CW
CW = READBITS(INFILE, BITLEN)
IF CW <> &H100 AND CW <> &H101 THEN
P = PW
CHECK = (LZCHR(CW) <> -1)

IF CHECK THEN
TMP = CW
ELSE
TMP = PW
END IF
WHILE LZDIC(TMP) <> -1
TMP = LZDIC(TMP)
WEND
C = LZCHR(TMP)

IF CHECK THEN
LZWOUTPUT OUTFILE, LZDIC(CW), LZCHR(CW)
ELSE
LZWOUTPUT OUTFILE, P, C
END IF

IF CURPOS < 4096 THEN
LZDIC(CURPOS) = P
LZCHR(CURPOS) = C
CURPOS = CURPOS + 1
IF CURPOS = (2 ^ BITLEN - 1) AND BITLEN < 12 THEN
BITLEN = BITLEN + 1
END IF
END IF

END IF
WEND

' Close files
CLOSE OUTFILE
CLOSE INFILE
END SUB

SUB LZWOUTPUT (FILE AS INTEGER, DIC AS INTEGER, CHAR AS INTEGER)
SHARED LZDIC() AS INTEGER, LZCHR() AS INTEGER
DIM LZSTK(0 TO 127) AS STRING * 1
DIM X AS INTEGER, SP AS INTEGER
DIM LDIC AS INTEGER, LCHAR AS INTEGER

LCHAR = CHAR
LDIC = DIC
SP = 0
X = 1

DO
IF SP >= 128 THEN
PRINT "LZW: Stack Overflow!"
END
END IF
LZSTK(SP) = CHR$(LCHAR)
SP = SP + 1
IF LDIC <> -1 THEN
LCHAR = LZCHR(LDIC)
LDIC = LZDIC(LDIC)
ELSE
X = 0
END IF
LOOP WHILE X

WHILE SP <> 0
SP = SP - 1
PUT FILE, , LZSTK(SP)
WEND
END SUB

FUNCTION READBITS% (FILE AS INTEGER, NUMBITS AS INTEGER)
STATIC BITDAT AS STRING * 1, BITPOS AS INTEGER
DIM BITVAL AS INTEGER, BIT AS INTEGER

BITVAL = 0
FOR BIT = (NUMBITS - 1) TO 0 STEP -1
IF BITPOS = 0 THEN
GET FILE, , BITDAT
BITPOS = 7
ELSE
BITPOS = BITPOS - 1
END IF
IF ASC(BITDAT) AND 2 ^ BITPOS THEN
BITVAL = BITVAL OR 2 ^ BIT
END IF
NEXT BIT

READBITS% = BITVAL
END FUNCTION</source>

=== FreeBasic ===

This code does not suffer from the 64K memory limit imposed by QuickBasic, so it uses memory less efficiently but runs faster. It can be compiled with FreeBasic compiler using the -lang=qb switch. Aside from memory concerns, all code here is compatible with QuickBasic.

The code before the subroutine is used to make a string containing the bit expansion of all values from 0 to 255. The subroutine takes a filename, reads the entire file into memory, then expands each bit of data to a byte using the aforesaid string, as Basic cannot deal with bits directly. cw$ is the codeword, pw$ is the previous codeword, lun is the lowest dictionary entry that is empty, p is the location in the compressed data stream and bl is the length of codes in bits (starting at 9 bits and increasing to 12).

The dictionary is set before decompression. The first 258 entries are the starting dictionary; the remainder are cleared. (It is vital to reset the dictionary for each file.) An error occurs if entry 256 is found in the data. 'Disrupt!' is printed when the newest dictionary entry is not the lowest empty entry. (This shouldn't happen but is possible.) Decompression ends on encountering entry 257, or when there is no more data to read.

<source lang="qbasic">
DECLARE SUB LZWDEC (lfn AS STRING)

x$ = ""
FOR l = 0 TO 255
IF (l AND 128) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
IF (l AND 64) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
IF (l AND 32) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
IF (l AND 16) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
IF (l AND 8) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
IF (l AND 4) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
IF (l AND 2) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
IF (l AND 1) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
NEXT l
LZWDEC "EGALATCH.CK1"
END

'_________________________________________________
SUB LZWDEC (lfn AS STRING) ' Decompress LZW data
'_________________________________________________
DIM lzw(0 TO 4095) AS STRING
PRINT lfn; " is LZW compressed, decompressing...";
OPEN folder + lfn FOR BINARY AS #9
y$ = SPACE$(LOF(9))
GET #9, 1, y$
CLOSE #9
z$ = ""
FOR l = 7 TO LEN(y$)
z$ = z$ + MID$(x$, (ASC(MID$(y$, l, 1)) * 8) + 1, 8)
NEXT l

bl = 9
lun = 258
p = 1
cw$ = ""
y$ = ""
FOR l = 0 TO 4095
IF l < 256 THEN lzw(l) = CHR$(l) ELSE lzw(l) = ""
NEXT l
DO
IF lun = 511 THEN bl = 10
IF lun = 1023 THEN bl = 11
IF lun = 2047 THEN bl = 12
pw$ = cw$
u$ = MID$(z$, p, bl)
p = p + bl
y = 0
FOR l = 1 TO bl
IF MID$(u$, bl - l + 1, 1) = "1" THEN y = y + (2 ^ (l - 1))
NEXT l
IF y = 256 THEN
PRINT "LZW error in Keen data!"
OPEN "ERROR.DAT" FOR OUTPUT AS #9
PRINT #9, y$;
CLOSE
END
END IF
IF y = 257 THEN EXIT DO
IF cw$ = "" THEN
cw$ = lzw(y)
y$ = y$ + cw$
ELSE
IF lun < 4096 THEN
IF lzw(y) = "" THEN
cw$ = pw$ + LEFT$(pw$, 1)
lzw(y) = cw$
y$ = y$ + cw$
IF y <> lun THEN PRINT "Disrupt!"
lun = y + 1
ELSE
cw$ = lzw(y)
y$ = y$ + cw$
lzw(lun) = pw$ + LEFT$(cw$, 1)
lun = lun + 1
END IF
ELSE
IF lzw(y) = "" THEN
y$ = y$ + cw$
ELSE
cw$ = lzw(y)
y$ = y$ + cw$
END IF
END IF
END IF
LOOP WHILE p < LEN(z$)
IF y = 257 THEN PRINT "done" ELSE PRINT "out of data."
OPEN folder + LEFT$(lfn, 4) + extq FOR OUTPUT AS #9
PRINT #9, y$;
CLOSE #9
END SUB</source>

=== Visual Basic .NET ===

This implementation uses high-level elements such as lambdas, anonymous arrays, and strict types. It must be compiled for the Microsoft .NET Framework v4.5 in Visual Studio 2012. It has the advantages of running in non-DosBox Windows and using .NET streams for simple reusability. It can also be used for higher-bit LZW.

==== Decompression ====

Decompressing is very fast.
<source lang="vbnet"> Sub DecompressLZW(Data As IO.Stream, MaxBits As Byte, Output As IO.Stream)
' This source is by Fleexy and is in the public domain. If used, please note it as such.
Dim dict As New List(Of Byte())
For x = 0 To 255
dict.Add({x})
Next
dict.Add({})
dict.Add(Nothing)
Dim usebits As Byte = 9
Dim bpos As Long
Dim bits As New List(Of Byte)
Do Until Data.Position = Data.Length
Dim b, ub As Byte
b = Data.ReadByte
ub = b
For x = 7 To 0 Step -1
If ub - (2 ^ x) >= 0 Then
ub -= (2 ^ x)
bits.Add(1)
Else
bits.Add(0)
End If
Next
Loop
Dim GetCode = Function() As UInteger
Dim u As UInteger
For x = usebits To 1 Step -1
If bits(bpos) = 1 Then u += (2 ^ (x - 1))
bpos += 1
Next
Return u
End Function
Dim OutputCode = Sub(DecompData As Byte())
Dim n As UInteger = DecompData.Length
Output.Write(DecompData, 0, n)
End Sub
Dim AddToDict = Sub(Entry As Byte())
If dict.Count < (2 ^ MaxBits) Then
dict.Add(Entry)
If dict.Count = (2 ^ usebits) - 1 Then usebits = Math.Min(usebits + 1, MaxBits)
End If
End Sub
Dim fcode As UInteger = GetCode()
Dim match As Byte() = dict(fcode)
OutputCode(match)
Do
Dim ncode As UInteger = GetCode()
If ncode = 257 Then Exit Do
If ncode = 256 Then Throw New Exception
Dim nmatch As Byte()
If ncode < dict.Count Then
nmatch = dict(ncode)
Else
nmatch = match.Concat({match(0)}).ToArray
End If
OutputCode(nmatch)
AddToDict(match.Concat({nmatch(0)}).ToArray)
match = nmatch
Loop
End Sub</source>

==== Compression ====

Compression is more difficult; consulting the dictionary for a byte array takes more time. The speed of this algorithm may be unacceptable.
<source lang="vbnet"> Sub CompressLZW(Data As IO.Stream, MaxBits As Byte, Output As IO.Stream)
' This source is by Fleexy and is in the public domain.
Dim bits As New List(Of Byte)
Dim dict As New List(Of Byte())
For x = 0 To 255
dict.Add({x})
Next
dict.Add({})
dict.Add({})
Dim usebits As Byte = 9
Dim PutCode = Sub(Code As UInteger)
For x = usebits To 1 Step -1
If Code - (2 ^ (x - 1)) >= 0 Then
Code -= (2 ^ (x - 1))
bits.Add(1)
Else
bits.Add(0)
End If
Next
End Sub
Dim AddToDict = Function(Entry As Byte()) As Boolean
If dict.Count < 2 ^ MaxBits Then
dict.Add(Entry)
If dict.Count = 2 ^ usebits Then usebits = Math.Min(usebits + 1, MaxBits)
Return True
Else
Return False
End If
End Function
Dim FindCode = Function(Bytes As Byte()) As UInteger
For x = 1 To dict.Count
If dict(x - 1).Count = Bytes.Count Then
If dict(x - 1).SequenceEqual(Bytes) Then Return x - 1
End If
Next
Throw New NotFiniteNumberException
End Function
Dim DictContains = Function(Bytes As Byte()) As Boolean
For x = 1 To dict.Count
If dict(x - 1).Length = Bytes.Length Then
If dict(x - 1).SequenceEqual(Bytes) Then Return True
End If
Next
Return False
End Function
Dim match As Byte() = {}
Do Until Data.Position = Data.Length
Dim nbyte As Byte = Data.ReadByte
Dim nmatch As Byte() = match.Concat({nbyte}).ToArray
If DictContains(nmatch) Then
match = nmatch
Else
PutCode(FindCode(match))
AddToDict(nmatch)
match = {nbyte}
End If
Loop
PutCode(FindCode(match))
PutCode(257)
Do Until bits.LongCount Mod 8L = 0L
bits.Add(0)
Loop
For x = 1 To CInt(bits.LongCount / 8L)
Dim b As Byte = 0
For y = 0 To 7
b += bits((x - 1) * 8 + y) * (2 ^ (7 - y))
Next
Output.WriteByte(b)
Next
End Sub</source>
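
For comparison, the dictionary lookup does not have to be a linear search. The following Python sketch keeps the dictionary in a hash map keyed by byte string, so each lookup takes roughly constant time. It is an illustration only: the name <tt>keen_lzw_compress</tt> is invented, the six header bytes are not written, and the code lengths are chosen to mirror the decompressors listed on this page rather than being checked against the original compressor.

```python
def keen_lzw_compress(data, max_bits=12):
    """Compress bytes to Keen 1-3 style LZW codes, without header (sketch)."""
    table = {bytes([i]): i for i in range(256)}  # byte string -> code
    next_code = 0x102  # $100 and $101 are reserved
    width = 9
    bits = []

    def emit(code):
        bits.append(format(code, f"0{width}b"))  # most significant bit first

    match = b""
    for value in data:
        candidate = match + bytes([value])
        if candidate in table:
            match = candidate
            continue
        emit(table[match])
        if next_code < (1 << max_bits):
            table[candidate] = next_code
            next_code += 1
            if next_code == (1 << width) and width < max_bits:
                width += 1
        match = bytes([value])

    if match:
        emit(table[match])
        # The decoders on this page add one more entry after the last data
        # code, which can lengthen the codes just before the end marker.
        if next_code == (1 << width) - 1 and width < max_bits:
            width += 1
    emit(0x101)  # end of compressed data

    bitstring = "".join(bits)
    bitstring += "0" * (-len(bitstring) % 8)  # pad the last byte with zeros
    return bytes(int(bitstring[i:i + 8], 2) for i in range(0, len(bitstring), 8))


print(keen_lzw_compress(bytes(9) + b"\x3d").hex())
```

The usage line compresses nine zero bytes followed by $3D.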
----

[[Category:Commander Keen 1-3]]
[[Category:File Formats]]
[[Category:Compressed Files]]

LZW Compression

2013-12-31T22:41:57Z

Fleexy: Fixed source language; I expected it to not be displayed

LZW compression is a common form of compression, used by some early games to compress data and by most early games to compress their executables. It is notable for its efficiency and for being one of the first compression methods (along with [[Huffman Compression]]) that does not compress on the byte level.

The basic concept of LZW is universal, though implementations differ. In essence it replaces data strings that have been encountered before with references to already decompressed data (known as a 'dictionary'). This can be done in a number of ways; the two main approaches differ in whether the dictionary is separate or integrated.

= Separate Dictionary Approach =

In this approach the dictionary is separate from the data being decompressed, that is it is stored in a separate location in memory. In this it behaves more like one would expect a dictionary to work; when a codeword is found in the data, it is looked up in the dictionary and the corresponding string copied to output. (As an example dictionary entry 42 could represent the string 'life', thus whenever the code '42' is encountered the string 'life' is added to the decompressed data.)

The advantage of this method is that the efficiency of compression increases as the amount of data to compress increases. The following points differ between implementations:

* The initial dictionary. Just how large the initial dictionary is varies. Some implementations start with no dictionary at all; others preset a number of entries, usually 256, covering all possible 1-byte values.

* The maximum size of the dictionary. Many older implementations with fewer resources were forced to cap the dictionary at a certain size, usually a power of two entries long (512, 1024...). Unlimited implementations are rare, as modern methods (e.g. the DEFLATE algorithm) usually rely on several compression methods at once. Sometimes the dictionary is 'reset' when it grows too large.

* Whether the codestream is made partly or entirely of codewords. Often the compressed data is made entirely of codewords, even for non-repeating strings, which means that initially compression can be rather poor. Other implementations use codewords only for repeating strings. Implementations may also differ in how codewords and literals are told apart and in how the dictionary is built up.

== Decoding ==

This is a general decoding algorithm for separate dictionary LZW. It will need to be altered slightly when dealing with different implementations. Notably it assumes that the codestream is composed entirely of codewords and that the dictionary can keep growing indefinitely.

 1 At the start the dictionary contains all possible roots;
 2 cW := the first code word in the codestream (it denotes a root);
 3 output the string.cW to the charstream;
 4 pW := cW;
 5 cW := next code word in the codestream;
 6 Is the string.cW present in the dictionary?
   a if it is,
     i   output the string.cW to the charstream;
     ii  P := string.pW;
     iii C := the first character of the string.cW;
     iv  add the string P+C to the dictionary;
   b if not,
     i   P := string.pW;
     ii  C := the first character of the string.pW;
     iii output the string P+C to the charstream;
     iv  add the string P+C to the dictionary (now it corresponds to the cW);
 7 Are there more code words in the codestream?
   a if yes, go back to step 4;
   b if not, END.

== [[Commander Keen 1-3|Commander Keen 1-3 LZW]] ==

In [[Commander Keen 1-3]] LZW is used to compress the <tt>EGALATCH</tt> and <tt>EGASPRIT</tt> files in episode 1. (It can also be used in episodes 2 and 3, but isn't.) The game uses two error-checking methods in this implementation. Firstly, it reserves two dictionary values: $100 to indicate an error (this is written by the compression program and will make the executable abort) and $101 to indicate the end of data (if the program reaches the end of the data without encountering this it will also abort). Secondly, the compressed data is prefixed with a dword giving the decompressed data size, so this can be compared with the output.

This method is a typical separate dictionary approach. It starts with a dictionary of 256 9-bit codewords representing the 8-bit strings $00-$FF (Plus some special cases.) The dictionary is allowed to grow to 4096 entries. The following is the initial dictionary:

0000 - 00 (character)
0001 - 01 (character)
...
00FE - FE (character)
00FF - FF (character)
0100 - Reserved for errors...
0101 - Reserved for end of compressed data...
0102 - (not set)
0103 - (not set)

Clearly 4096 entries cannot be represented by 9-bit codes; at least 12 bits are needed. To conserve space, codewords start at 9 bits and their length is increased each time the dictionary outgrows the current length. Thus when the dictionary reaches $01FF entries codewords become 10 bits long, at $03FF they become 11 bits, and finally at $07FF 12 bits. At $0FFF entries the dictionary stops growing.

The following data is taken from the EGALATCH file from Keen 1. Notice that the first six bytes are ignored. (The first four give the decompressed data size; the next two are an executable variable of unknown use.) The first few steps of decompression follow.

0000 80 D3 01 00 0C 00 00 40 - A0 70 31 E9 F8 F8 78 38
0010 08 08 00 07 FC 39 FF 04 - 5E 41 E1 30 B3 C4 5A 2F

The first code word encountered is 000000000 (First 9 bits) and thus outputs the string $00 (First dictionary entry.) This has set us up to step 4 and now things work slightly differently.

The second code word is 100000010 (next 9 bits), which is entry $102. Since this entry is NOT found in the dictionary yet, we create it and then output it. Entry $102 is created by taking the previous codeword's string and adding to it the first byte of that string. In this case the previous codeword's ($0) string is $00, and $00 + $00 is $00 $00. Entry $102 thus represents the string $00 $00.

The next code word is 100000011, which is entry $103 and again doesn't exist yet. Entry $103 is created just like $102, except that the previous codeword's string is now $00 $00, so entry $103 is $00 $00 $00.

The next code word is again $103. This time it IS found in the dictionary and is outputted ($00 $00 $00); however, we now also create dictionary entry $104 from the previous string plus the first byte of the current one. (It is $00 $00 $00 $00.)

The next code word is 000111101, which is $3D. It is outputted and entry $105 is created from the previous codeword's string ($103, i.e. $00 $00 $00) plus this byte, giving $00 $00 $00 $3D. Note that the previous codeword is now $3D. This pattern continues.

= Integrated Dictionary Approach =

In this approach the dictionary is the decompressed data itself. The codewords do not represent an entry in the dictionary structure, but rather directions in the decompressed code as to where a repeated string is located. That is a repeat string may well be represented by a codeword that states 'copy seven bytes from byte 132 in the decompressed data'

The advantage of this approach is that it doesn't require a separate construct for the dictionary and can just use the already decompressed data. The downside is that it will always need some method to distinguish literals and codewords, and, since codewords are nearly always of a fixed length there is an inherent limit both to how long the copied string can be and where it can be read from. this means that eventually the compression efficiency will level off and stop improving.

This is often called the 'sliding window' and it represents the data that can be 'reached' by the codewords. It is named as such because it is of a fixed length and 'slides' along the output stream as it gets longer. The following features differ between implementations:

* Differentiating codewords from literals. There must be a way to tell codewords apart from data that is just to be read and outputted. Sometimes this is integrated into the codewords themselves, but more often the 'flag' precedes a codeword. The flag may indicate only codewords or both codewords and literals. ('Following data is made of two codewords and six literals', etc.)

* Codeword format. Most implementations use codewords of a fixed format that must encode both the length of data to copy and the location to copy it from. Codewords are usually two or four bytes long.

* Zero point location. Different implementations use different places in the output as zero. If the start of the code is used as zero only the first x bytes of the output can be used as a reference. Most implementations use a more complex, but seldom less effective 'sliding window'; the zero point is the start of the data until the data becomes too long at which point it moves forward so that the most recent x bytes of output can be used. It is also possible for zero to be the most recent byte with all locations being 'x bytes from the end', which also produces a sliding window. Finally it may be possible to have both negative and positive locations.

* Sliding window. The nature of this is dictated by the format of the codewords. Common sizes are 1KB, 2KB or 4KB. The window will always be present, but it may not 'slide' if the implementation uses a fixed location as the zero point.

== LZEXE ==

Many vintage executables are compressed with the program LZEXE, interesting in that the compressed file contains its own decompressor, meaning that it is in essence a self-extracting archive. However the unique feature of LZEXE executables is that they extract the compressed data to memory and run it. To the user this is indistinguishable from the decompressed executable, though it takes slightly longer to start up and takes up much less space.

[[UNLZEXE]] can be used to extract the decompressed executables from these files, which will run perfectly with the other game files. It can be obtained here: http://www.dosclassics.com/download/198 It is currently unknown specifically how the LZW compression is implemented in this case, but the source code for decompression is available.

The LZEXE compression is similar to the SoftDisk Library Approach described below, but it uses UINT16LE values instead of byte values to store the flag bits and the sliding window has a size of 8192 (0x2000) bytes. Also, the flag bits have different meanings (you could argue that they are in fact Huffman codes):

1 -> copy 1 literal byte
10 -> next two bytes contain length and distance
0000 -> length is 2, next byte contains distance
1000 -> length is 3, next byte contains distance
0100 -> length is 4, next byte contains distance
1100 -> length is 5, next byte contains distance

The real distance value is always a signed 16 bit integer (you can use unsigned but then you have to bitwise-and the resulting index value with 0xFFFF). If the length is given by the flag bits/Huffman code, the real distance is <tt>b | 0xFF00</tt> or <tt>b - 256</tt> where <tt>b</tt> is the byte value read from the file. If the length value is not given, read the byte values <tt>b0</tt> and <tt>b1</tt> and calculate length and distance like this:

length = (b1 mod 8)+2
distance = b0 + (b1/8)*256 - 8192
or
length = (b1 & 0x07)+2
distance = b0 | ((b1 & 0xF8)<<5) | 0xE000
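
As a minimal sketch in Python, the two formulations above can be checked against each other. The function names <tt>lzexe_long_match</tt> and <tt>lzexe_short_distance</tt> are invented for this illustration and are not taken from LZEXE or UNLZEXE; only the arithmetic is covered, not the reading of the flag bits.

```python
def lzexe_long_match(b0, b1):
    """Return (length, distance) for a two-byte match, using the bitwise form."""
    length = (b1 & 0x07) + 2
    raw = b0 | ((b1 & 0xF8) << 5) | 0xE000   # unsigned 16-bit value
    distance = raw - 0x10000                 # reinterpret as signed 16-bit
    return length, distance


def lzexe_short_distance(b):
    """Return the distance for a match whose length came from the flag bits."""
    return (b | 0xFF00) - 0x10000            # same as b - 256


# Compare the bitwise form with the arithmetic form for every byte pair.
for b0 in range(256):
    for b1 in range(256):
        length, distance = lzexe_long_match(b0, b1)
        assert length == (b1 % 8) + 2
        assert distance == b0 + (b1 // 8) * 256 - 8192
```

Because the result is always negative, the distance always points backwards into data that has already been written.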

If the <tt>length</tt> value calculated from <tt>b1</tt> is 2, this indicates that another byte value <tt>b2</tt> must be read. Depending on the value of <tt>b2</tt>, one of three things can happen:
b2 = 0:
end of compressed data - stop decompressing
b2 = 1:
end of segment
(decompressor may write contents of buffer to output)
set length to 0 or jump to the part where the decompressor reads the next flag bits/Huffman code
otherwise:
set length to b2+1

Now the decompressor must only add <tt>distance</tt> to the current buffer index (since <tt>distance</tt> is negative, the index goes backwards) and copy <tt>length</tt> bytes from there to the current index:

WHILE length > 0
buffer[index] = buffer[index+distance]
index = index + 1
length = length - 1
END WHILE
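
The loop above translates almost directly into Python. This is only a sketch: <tt>copy_match</tt> is a hypothetical helper name, and it assumes that <tt>index + distance</tt> never drops below zero (Python would otherwise wrap around to the end of the buffer). Copying one byte at a time matters, because the source and destination may overlap.

```python
def copy_match(buffer, index, distance, length):
    """Copy `length` bytes from index+distance to index; return the new index."""
    while length > 0:
        buffer[index] = buffer[index + distance]  # distance is negative
        index += 1
        length -= 1
    return index


# Overlapping copy: a distance of -2 with a length of 6 repeats "ab" three times.
buf = bytearray(b"ab" + bytes(6))
end = copy_match(buf, 2, -2, 6)
print(buf.decode("ascii"), end)
```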

Please refer to the UNLZEXE source code for further information.

== SoftDisk Library Approach ==

This is used as the first form of compression in the [[Softdisk Library Format]]. Flags are 1 byte long and divide the datastream into segments of eight 'values' which can be either literals or codewords. Codewords are 2 bytes long, literals 1 byte. (Therefore there will be a flag byte every 8 to 16 bytes of data.) The value of each bit (read from the least significant bit first) indicates whether a value will be a literal (1) or codeword (0). Thus a value of 199 (11000111 in binary) indicates three literals, three codewords and two literals in that order. (Total of 11 bytes.)

Literals are sequences that have never been seen in the datastream before. They cannot be compressed and are thus the same in the compressed and decompressed datastreams. (If the data is text they become quite obvious.) Any string less than 3 bytes long, or any string that has not been read before or cannot be pointed to (see below), will be stored as literals.

Codewords are references to data that has already been read. They are two bytes long and together hold a 12-bit location to read data from and a 4-bit length of data to read.

The lower nybble (4 bits) of the second codeword byte holds the length of repeat data to read minus three. (This makes sense, as the shortest sequence worth coding is three bytes, which can be given the value 0.) It will be immediately apparent that the maximum length of repeated data that can be stored as a codeword is 18 bytes.

The high nybble of the second byte supplies the top four bits of the 12-bit location and the first byte supplies the low eight bits, so the bytes F2 F0 give the location $FF2. Read as a signed 12-bit number, this is the location of the data to read in the 'sliding window' minus 19. (This is due to the way the decompression is set up in memory.)

It will be immediately obvious that the codewords can encode values between +-2048, or about 2KB. If the decompressed data is less than 2KB in size then zero is the start of the data; if it is larger, then it is 2048 bytes from the data end.

It will be noted that it is probable that the compressed datastream will not be perfectly divisible by flag bytes. In this case the unused bits are set to 0. The decompressor stops when the decompressed data size is equal to the value given in the chunk header. (If it runs out of data it will abort.)

As a simple example the sentence 'I am Sam. Sam I am!' will be compressed to:

FF Flag byte, 8 literals follow
49 20 61 6D 20 53 61 6D 'I am Sam' as literals
2B Flag byte, 2L, P, L, P, L, 2 blanks ($2B = 43 = 00101011)
2E 20 '. ' as literals
F2 F0 codeword, read 0 + 3 = 3 bytes from $FF2, or -14 + 19 = 5 in the data (counting from 0). This is 'Sam'
20 ' ' as literal
ED F1 codeword, read 1 + 3 = 4 bytes from $FED, or -19 + 19 = 0 in the data (counting from 0). This is 'I am'
21 '!' as literal
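
The example can be replayed with a short Python sketch. It is not taken from any Softdisk tool: the function name <tt>softdisk_lz_decode</tt> is hypothetical, and the 4096-byte ring buffer with an initial write position of $FED is an assumption inferred from the example itself (location $FED maps to the first byte of output). Real files may need a different starting position.

```python
WINDOW = 4096   # 12-bit locations address a 4096-byte ring buffer
START = 0xFED   # ASSUMED initial write position, inferred from the example


def softdisk_lz_decode(data, out_size):
    """Decode flag-byte LZ data until out_size bytes have been produced."""
    window = bytearray(WINDOW)
    wpos = START
    out = bytearray()
    i = 0
    while len(out) < out_size:
        flags = data[i]
        i += 1
        for bit in range(8):                 # least significant bit first
            if len(out) >= out_size:
                break
            if flags & (1 << bit):           # 1 = literal byte
                chunk = data[i:i + 1]
                i += 1
            else:                            # 0 = two-byte codeword
                b0, b1 = data[i], data[i + 1]
                i += 2
                loc = b0 | ((b1 & 0xF0) << 4)
                length = (b1 & 0x0F) + 3
                chunk = bytes(window[(loc + k) % WINDOW] for k in range(length))
            for b in chunk:                  # output and update the window
                out.append(b)
                window[wpos] = b
                wpos = (wpos + 1) % WINDOW
    return bytes(out)


packed = bytes.fromhex("FF 49 20 61 6D 20 53 61 6D 2B 2E 20 F2 F0 20 ED F1 21")
print(softdisk_lz_decode(packed, 19))
```

The output size (19 here) stands in for the value that would normally come from the chunk header.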

= Source code =

Some example code is available in various languages showing how to decompress (and in some cases compress) files using Keen's LZW algorithm in its various implementations.

== Keen 1-3 Implementation ==

These segments of code work with the Keen 1-3 implementation only and will not, for example, decompress LZEXE compressed executables.

=== QuickBasic ===

<source lang="qbasic">
DECLARE FUNCTION READBITS% (FILE AS INTEGER, NUMBITS AS INTEGER)
DECLARE SUB LZWDECOMPRESS (INNAME AS STRING, OUTNAME AS STRING)
DECLARE SUB LZWOUTPUT (FILE AS INTEGER, DIC AS INTEGER, CHAR AS INTEGER)
'
' KEEN1 Compatible LZW Decompressor (Lempel-Ziv-Welch)
' - by Napalm with thanks to Adurdin's work on ModKeen
'
' This source is Public Domain
'
'

' Allocate dictionary
DIM LZDIC(0 TO 4095) AS INTEGER
DIM LZCHR(0 TO 4095) AS INTEGER

' Test Function
LZWDECOMPRESS "EGALATCH.CK1", "EGALATCH.DAT"

SUB LZWDECOMPRESS (INNAME AS STRING, OUTNAME AS STRING)
SHARED LZDIC() AS INTEGER, LZCHR() AS INTEGER
DIM INFILE AS INTEGER, OUTFILE AS INTEGER, I AS INTEGER
DIM BITLEN AS INTEGER, CURPOS AS INTEGER
DIM CW AS INTEGER, PW AS INTEGER, C AS INTEGER, P AS INTEGER
DIM CHECK AS INTEGER

' Open files for input and output
INFILE = FREEFILE
OPEN INNAME FOR BINARY ACCESS READ AS INFILE
OUTFILE = FREEFILE
OPEN OUTNAME FOR BINARY ACCESS WRITE AS OUTFILE
SEEK INFILE, 7

' Fill dictionary with starting values
FOR I = 0 TO 4095
LZDIC(I) = -1
IF I < 256 THEN
LZCHR(I) = I
ELSE
LZCHR(I) = -1
END IF
NEXT I

' Decompress input stream to output stream
BITLEN = 9
CURPOS = 258
CW = READBITS(INFILE, BITLEN)
LZWOUTPUT OUTFILE, LZDIC(CW), LZCHR(CW)

WHILE CW <> &H100 AND CW <> &H101
PW = CW
CW = READBITS(INFILE, BITLEN)
IF CW <> &H100 AND CW <> &H101 THEN
P = PW
CHECK = (LZCHR(CW) <> -1)

IF CHECK THEN
TMP = CW
ELSE
TMP = PW
END IF
WHILE LZDIC(TMP) <> -1
TMP = LZDIC(TMP)
WEND
C = LZCHR(TMP)

IF CHECK THEN
LZWOUTPUT OUTFILE, LZDIC(CW), LZCHR(CW)
ELSE
LZWOUTPUT OUTFILE, P, C
END IF

IF CURPOS < 4096 THEN
LZDIC(CURPOS) = P
LZCHR(CURPOS) = C
CURPOS = CURPOS + 1
IF CURPOS = (2 ^ BITLEN - 1) AND BITLEN < 12 THEN
BITLEN = BITLEN + 1
END IF
END IF

END IF
WEND

' Close files
CLOSE OUTFILE
CLOSE INFILE
END SUB

SUB LZWOUTPUT (FILE AS INTEGER, DIC AS INTEGER, CHAR AS INTEGER)
SHARED LZDIC() AS INTEGER, LZCHR() AS INTEGER
DIM LZSTK(0 TO 127) AS STRING * 1
DIM X AS INTEGER, SP AS INTEGER
DIM LDIC AS INTEGER, LCHAR AS INTEGER

LCHAR = CHAR
LDIC = DIC
SP = 0
X = 1

DO
IF SP >= 128 THEN
PRINT "LZW: Stack Overflow!"
END
END IF
LZSTK(SP) = CHR$(LCHAR)
SP = SP + 1
IF LDIC <> -1 THEN
LCHAR = LZCHR(LDIC)
LDIC = LZDIC(LDIC)
ELSE
X = 0
END IF
LOOP WHILE X

WHILE SP <> 0
SP = SP - 1
PUT FILE, , LZSTK(SP)
WEND
END SUB

FUNCTION READBITS% (FILE AS INTEGER, NUMBITS AS INTEGER)
STATIC BITDAT AS STRING * 1, BITPOS AS INTEGER
DIM BITVAL AS INTEGER, BIT AS INTEGER

BITVAL = 0
FOR BIT = (NUMBITS - 1) TO 0 STEP -1
IF BITPOS = 0 THEN
GET FILE, , BITDAT
BITPOS = 7
ELSE
BITPOS = BITPOS - 1
END IF
IF ASC(BITDAT) AND 2 ^ BITPOS THEN
BITVAL = BITVAL OR 2 ^ BIT
END IF
NEXT BIT

READBITS% = BITVAL
END FUNCTION</source>

=== FreeBasic ===

This code does not suffer from the 64K memory limit imposed by QuickBasic and so is less efficient, but runs faster. It can be compiled with FreeBasic compiler using the -lang=qb switch. Aside from memory concerns, all code here is compatible with QuickBasic.

The code before the subroutine is used to make a string containing the bit expansion of all values from 0 to 255. The subroutine takes a filename, reads the entire file into memory then expands each bit of data to a byte using the aforesaid string as Basic cannot deal with bits directly. cw$ is codeword, pw$ is the previous codeword, lun is the lowest dictionary entry that is empty, p is the location in the compressed data stream and bl is the length of codes in bits (Starting at nine bits increasing to 12)

The dictionary is set before decompression. The first 258 are the starting dictionary, the remainder are cleared. (It is vital to reset the dictionary for each file.) An error occurs if entry 256 is found in the data. 'Disrupt!' is printed when the newest dictionary entry is not the lowest possible entry. (This shouldn't happen but is possible.) Decompression ends on encountering entry 257, or when there is no more data to read.

<source lang="qbasic">
DECLARE SUB LZWDEC (lfn AS STRING)

x$ = ""
FOR l = 0 TO 255
IF (l AND 128) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
IF (l AND 64) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
IF (l AND 32) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
IF (l AND 16) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
IF (l AND 8) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
IF (l AND 4) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
IF (l AND 2) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
IF (l AND 1) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
NEXT l
LZWDEC "EGALATCH.CK1"
END

'_________________________________________________
SUB LZWDEC (lfn AS STRING) ' Decompress LZW data
'_________________________________________________
DIM lzw(0 TO 4095) AS STRING
PRINT lfn; " is LZW compressed, decompressing...";
OPEN folder + lfn FOR BINARY AS #9
y$ = SPACE$(LOF(9))
GET #9, 1, y$
CLOSE #9
z$ = ""
FOR l = 7 TO LEN(y$)
z$ = z$ + MID$(x$, (ASC(MID$(y$, l, 1)) * 8) + 1, 8)
NEXT l

bl = 9
lun = 258
p = 1
cw$ = ""
y$ = ""
FOR l = 0 TO 4095
IF l < 256 THEN lzw(l) = CHR$(l) ELSE lzw(l) = ""
NEXT l
DO
IF lun = 511 THEN bl = 10
IF lun = 1023 THEN bl = 11
IF lun = 2047 THEN bl = 12
pw$ = cw$
u$ = MID$(z$, p, bl)
p = p + bl
y = 0
FOR l = 1 TO bl
IF MID$(u$, bl - l + 1, 1) = "1" THEN y = y + (2 ^ (l - 1))
NEXT l
IF y = 256 THEN
PRINT "LZW error in Keen data!"
OPEN "ERROR.DAT" FOR OUTPUT AS #9
PRINT #9, y$;
CLOSE
END
END IF
IF y = 257 THEN EXIT DO
IF cw$ = "" THEN
cw$ = lzw(y)
y$ = y$ + cw$
ELSE
IF lun < 4096 THEN
IF lzw(y) = "" THEN
cw$ = pw$ + LEFT$(pw$, 1)
lzw(y) = cw$
y$ = y$ + cw$
IF y <> lun THEN PRINT "Disrupt!"
lun = y + 1
ELSE
cw$ = lzw(y)
y$ = y$ + cw$
lzw(lun) = pw$ + LEFT$(cw$, 1)
lun = lun + 1
END IF
ELSE
IF lzw(y) = "" THEN
y$ = y$ + cw$
ELSE
cw$ = lzw(y)
y$ = y$ + cw$
END IF
END IF
END IF
LOOP WHILE p < LEN(z$)
IF y = 257 THEN PRINT "done" ELSE PRINT "out of data."
OPEN folder + LEFT$(lfn, 4) + extq FOR OUTPUT AS #9
PRINT #9, y$;
CLOSE #9
END SUB</source>

=== Visual Basic .NET ===

This implementation uses high-level elements such as lambdas, anonymous arrays, and strict types. It must be compiled for the Microsoft .NET Framework v4.5 in Visual Studio 2012. It has the advantages of running in non-DosBox Windows and using .NET streams for simple reusability. It can also be used for higher-bit LZW.

==== Decompression ====

Decompressing is very fast.
<source lang="vbnet"> Sub DecompressLZW(Data As IO.Stream, MaxBits As Byte, Output As IO.Stream)
' This source is by Fleexy and is in the public domain. If used, please note it as such.
Dim dict As New List(Of Byte())
For x = 0 To 255
dict.Add({x})
Next
dict.Add({})
dict.Add(Nothing)
Dim usebits As Byte = 9
Dim bpos As Long
Dim bits As New List(Of Byte)
Do Until Data.Position = Data.Length
Dim b, ub As Byte
b = Data.ReadByte
ub = b
For x = 7 To 0 Step -1
If ub - (2 ^ x) >= 0 Then
ub -= (2 ^ x)
bits.Add(1)
Else
bits.Add(0)
End If
Next
Loop
Dim GetCode = Function() As UInteger
Dim u As UInteger
For x = usebits To 1 Step -1
If bits(bpos) = 1 Then u += (2 ^ (x - 1))
bpos += 1
Next
Return u
End Function
Dim OutputCode = Sub(DecompData As Byte())
Dim n As UInteger = DecompData.Length
Output.Write(DecompData, 0, n)
End Sub
Dim AddToDict = Sub(Entry As Byte())
If dict.Count < (2 ^ MaxBits) Then
dict.Add(Entry)
If dict.Count = (2 ^ usebits) - 1 Then usebits = Math.Min(usebits + 1, MaxBits)
End If
End Sub
Dim fcode As UInteger = GetCode()
Dim match As Byte() = dict(fcode)
OutputCode(match)
Do
Dim ncode As UInteger = GetCode()
If ncode = 257 Then Exit Do
If ncode = 256 Then Throw New Exception
Dim nmatch As Byte()
If ncode < dict.Count Then
nmatch = dict(ncode)
Else
nmatch = match.Concat({match(0)}).ToArray
End If
OutputCode(nmatch)
AddToDict(match.Concat({nmatch(0)}).ToArray)
match = nmatch
Loop
End Sub</source>

==== Compression ====

Compression is more difficult; consulting the dictionary for a byte array takes more time. The speed of this algorithm may be unacceptable.
<source lang="vbnet"> Sub CompressLZW(Data As IO.Stream, MaxBits As Byte, Output As IO.Stream)
' This source is by Fleexy and is in the public domain.
Dim bits As New List(Of Byte)
Dim dict As New List(Of Byte())
For x = 0 To 255
dict.Add({x})
Next
dict.Add({})
dict.Add({})
Dim usebits As Byte = 9
Dim PutCode = Sub(Code As UInteger)
For x = usebits To 1 Step -1
If Code - (2 ^ (x - 1)) >= 0 Then
Code -= (2 ^ (x - 1))
bits.Add(1)
Else
bits.Add(0)
End If
Next
End Sub
Dim AddToDict = Function(Entry As Byte()) As Boolean
If dict.Count < 2 ^ MaxBits Then
dict.Add(Entry)
If dict.Count = 2 ^ usebits Then usebits = Math.Min(usebits + 1, MaxBits)
Return True
Else
Return False
End If
End Function
Dim FindCode = Function(Bytes As Byte()) As UInteger
For x = 1 To dict.Count
If dict(x - 1).Count = Bytes.Count Then
If dict(x - 1).SequenceEqual(Bytes) Then Return x - 1
End If
Next
Throw New NotFiniteNumberException
End Function
Dim DictContains = Function(Bytes As Byte()) As Boolean
For x = 1 To dict.Count
If dict(x - 1).Length = Bytes.Length Then
If dict(x - 1).SequenceEqual(Bytes) Then Return True
End If
Next
Return False
End Function
Dim match As Byte() = {}
Do Until Data.Position = Data.Length
Dim nbyte As Byte = Data.ReadByte
Dim nmatch As Byte() = match.Concat({nbyte}).ToArray
If DictContains(nmatch) Then
match = nmatch
Else
PutCode(FindCode(match))
AddToDict(nmatch)
match = {nbyte}
End If
Loop
PutCode(FindCode(match))
PutCode(257)
Do Until bits.LongCount Mod 8L = 0L
bits.Add(0)
Loop
For x = 1 To CInt(bits.LongCount / 8L)
Dim b As Byte = 0
For y = 0 To 7
b += bits((x - 1) * 8 + y) * (2 ^ (7 - y))
Next
Output.WriteByte(b)
Next
End Sub</source>
----

[[Category:Commander Keen 1-3]]
[[Category:File Formats]]
[[Category:Compressed Files]]

LZW Compression

2013-12-31T22:40:31Z

Fleexy: Added Keen LZW implementation in VB.NET

LZW compression is a common form of compression used in some early games to compress data and by most early games to compress their executables. It is notable in being one of the first compression methods to not compress on the byte level (Along with [[Huffman Compression]]) and for its efficiency.

The basic concept for LZW is universal, though the implementations differ. In essence it involves replacing data strings that have been encountered before with references to already decompressed data (known as a 'dictionary'). This can be done in a number of ways, the two main approaches differing on whether the dictionary is separate or integrated.

= Separate Dictionary Approach =

In this approach the dictionary is separate from the data being decompressed, that is it is stored in a separate location in memory. In this it behaves more like one would expect a dictionary to work; when a codeword is found in the data, it is looked up in the dictionary and the corresponding string copied to output. (As an example dictionary entry 42 could represent the string 'life', thus whenever the code '42' is encountered the string 'life' is added to the decompressed data.)

The advantage of this method is that the efficiency of compression increases as the amount of data to compress increases. The following points differ between implementations:

* The initial dictionary. Just how large the initial dictionary is varies. Some implementations start with no dictionary at all, others set a number of entries, usually 256, covering all possible 1-byte values.

* The maximum size of the dictionary. Many older implementations with fewer resources were forced to cap the dictionary at a certain size, usually a power of two entries long. (512, 1024...) Unlimited implementations are rare as modern methods (e.g. the DEFLATE algorithm) usually rely on several compression methods at once. Sometimes the dictionary is 'reset' when it reaches too large a size.

* Whether the codestream is made partly or entirely of codewords. Often the compressed data is made entirely of codewords, even for non-repeating strings, which means that initially compression can sometimes be rather poor. Other implementations use codewords only for repeating strings. Differences in how codewords and literals are indicated and how dictionaries are built up may occur.

== Decoding ==

This is a general decoding algorithm for separate dictionary LZW. It will need to be altered slightly when dealing with different implementations. Notably it assumes that the codestream is composed entirely of codewords and that the dictionary can keep growing indefinitely.

1 At the start the dictionary contains all possible roots;
2 cW := the first code word in the codestream (it denotes a root);
3 output the string.cW to the charstream;
4 pW := cW;
5 cW := next code word in the codestream;
6 Is the string.cW present in the dictionary?
a if it is,
i output the string.cW to the charstream;
ii P := string.pW;
iii C := the first character of the string.cW;
iv add the string P+C to the dictionary;
b if not,
i P := string.pW;
ii C := the first character of the string.pW;
iii output the string P+C to the charstream
iv add the string P+C to the dictionary (now it corresponds to the cW);
7 Are there more code words in the codestream?
a if yes, go back to step 4;
b if not, END.
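
The steps above can be sketched in Python as follows. This is an illustration of the general algorithm only, under the same assumptions (a codestream made entirely of codewords, an unbounded dictionary, no reserved entries); the name <tt>lzw_decode</tt> and its parameters are invented for the example.

```python
def lzw_decode(codes, roots=256):
    """Decode a sequence of integer code words into bytes."""
    # Step 1: the dictionary starts with all possible roots.
    dictionary = [bytes([i]) for i in range(roots)]
    stream = iter(codes)
    cw = next(stream)                      # step 2: first code word (a root)
    out = bytearray(dictionary[cw])        # step 3: output string.cW
    pw = cw                                # step 4
    for cw in stream:                      # steps 5 and 7
        p = dictionary[pw]
        if cw < len(dictionary):           # step 6a: string.cW is present
            entry = dictionary[cw]
            c = entry[:1]
        elif cw == len(dictionary):        # step 6b: not present yet
            c = p[:1]
            entry = p + c
        else:
            raise ValueError("code word %d skips ahead of the dictionary" % cw)
        out += entry
        dictionary.append(p + c)           # add P+C to the dictionary
        pw = cw
    return bytes(out)


# 65 = 'A', 66 = 'B', 256 = 'AB', 258 = 'ABA' (not yet defined when it is read)
print(lzw_decode([65, 66, 256, 258]))
```

The last code word in the usage line exercises branch 6b: it refers to the entry that is being created at that very step.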

== [[Commander Keen 1-3|Commander Keen 1-3 LZW]] ==

In [[Commander Keen 1-3]] LZW is used to compress the <tt>EGALATCH</tt> and <tt>EGASPRIT</tt> files in episode 1 (It can also be used in episodes 2 and 3, but isn't.) The game uses two error-checking methods in this implementation, firstly it reserves two dictionary values, $100 to indicate an error (This is written by the compression program and will make the executable abort.) and $101 to indicate the end of data. (If the program reaches the end of the data without encountering this it will also abort.) The compressed data is also prefixed with a dword giving the decompressed data size, so this can be compared with the output.

This method is a typical separate dictionary approach. It starts with a dictionary of 256 9-bit codewords representing the 8-bit strings $00-$FF (Plus some special cases.) The dictionary is allowed to grow to 4096 entries. The following is the initial dictionary:

0000 - 00 (character)
0001 - 01 (character)
...
00FE - FE (character)
00FF - FF (character)
0100 - Reserved for errors...
0101 - Reserved for end of compressed data...
0102 - (not set)
0103 - (not set)

It will be immediately noticed that 4096 entries cannot be represented by 9-bit codes but at the least by 12-bit codes. To further conserve space the length of the codewords is increased every time the dictionary grows too large. Thus when it reaches $01FF entries codewords become 10 bits long, at $03FF they are 11 bits and finally at $07FF 12 bits. At $0FFF entries the dictionary stops growing.
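
The thresholds in the paragraph above can be written as a small Python helper. This is a sketch of the rule as described here, with the entry count taken to include the 258 initial entries; <tt>keen_code_width</tt> is a hypothetical name, not something found in the game or its tools.

```python
def keen_code_width(entry_count):
    """Return the code word length in bits for a given dictionary size."""
    if entry_count >= 0x07FF:
        return 12
    if entry_count >= 0x03FF:
        return 11
    if entry_count >= 0x01FF:
        return 10
    return 9


for count in (0x0102, 0x01FF, 0x03FF, 0x07FF, 0x0FFF):
    print(hex(count), keen_code_width(count))
```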

The following data is taken from the EGALATCH file from Keen 1. Notice that the first six bytes are ignored. (The first four give the decompressed data size, the next two are an executable variable of unknown use.) The first few steps of decompression follow.

0000 80 D3 01 00 0C 00 00 40 - A0 70 31 E9 F8 F8 78 38
0010 08 08 00 07 FC 39 FF 04 - 5E 41 E1 30 B3 C4 5A 2F

The first code word encountered is 000000000 (First 9 bits) and thus outputs the string $00 (First dictionary entry.) This has set us up to step 4 and now things work slightly differently.

The second code word is 100000010 (Next 9 bits) which is entry $102. Since this entry is NOT found in the dictionary yet, we will create this entry then output it. Entry $102 is created by taking the previous codeword's string and adding to it the first byte of that string. In this case the previous codeword's (0) string is $00. $00 + $00 is $00 $00. Entry $102 thus represents the string $00 $00.

The next code word is 100000011 which is entry $103, which again doesn't exist. Entry $103 is created just like with $102, except now since the previous codeword is $00 $00, entry $103 is $00 $00 $00

The next code word is again $103, this IS found in the dictionary and is outputted ($00 $00 $00) however we now create dictionary entry $104 just like $103. (It is $00 $00 $00 $00)

The next code word is $3B. It is outputted and entry $105 is created from the previous codeword's string ($00 $00 $00) plus the first byte of the current string, giving $00 $00 $00 $3B. Note that now the previous codeword is $3B. This pattern continues.
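
Extracting the 9-bit code words from the hex dump can be sketched in Python like this. The helper name <tt>read_codes</tt> is invented for the illustration; it reads bits most significant first, as the walkthrough does, and uses a fixed code length, so it is only suitable for the first few code words before the length grows.

```python
def read_codes(data, width, count):
    """Read `count` code words of `width` bits each, most significant bit first."""
    codes = []
    bitpos = 0
    for _ in range(count):
        value = 0
        for _ in range(width):
            byte = data[bitpos >> 3]
            bit = (byte >> (7 - (bitpos & 7))) & 1
            value = (value << 1) | bit
            bitpos += 1
        codes.append(value)
    return codes


# First row of the dump shown earlier; the first six bytes are skipped.
row = bytes.fromhex("80 D3 01 00 0C 00 00 40 A0 70 31 E9 F8 F8 78 38")
print([hex(c) for c in read_codes(row[6:], 9, 4)])
```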

= Integrated Dictionary Approach =

In this approach the dictionary is the decompressed data itself. The codewords do not represent an entry in the dictionary structure, but rather directions in the decompressed code as to where a repeated string is located. That is a repeat string may well be represented by a codeword that states 'copy seven bytes from byte 132 in the decompressed data'

The advantage of this approach is that it doesn't require a separate construct for the dictionary and can just use the already decompressed data. The downside is that it will always need some method to distinguish literals and codewords and, since codewords are nearly always of a fixed length, there is an inherent limit both to how long the copied string can be and where it can be read from. This means that eventually the compression efficiency will level off and stop improving.

This is often called the 'sliding window' and it represents the data that can be 'reached' by the codewords. It is named as such because it is of a fixed length and 'slides' along the output stream as it gets longer. The following features differ between implementations:

* Differentiating codewords from literals. There must be a way to tell codewords apart from data that is just to be read and outputted. Sometimes this is integrated into the codewords themselves, but more often the 'flag' precedes a codeword. The flag may indicate only codewords or both codewords and literals. ('Following data is made of two codewords and six literals', etc.)

* Codeword format. Most implementations use codewords of a fixed format that must encode both the length of data to copy and the location to copy it from. Codewords are usually two or four bytes long.

* Zero point location. Different implementations use different places in the output as zero. If the start of the code is used as zero only the first x bytes of the output can be used as a reference. Most implementations use a more complex, but seldom less effective 'sliding window'; the zero point is the start of the data until the data becomes too long at which point it moves forward so that the most recent x bytes of output can be used. It is also possible for zero to be the most recent byte with all locations being 'x bytes from the end', which also produces a sliding window. Finally it may be possible to have both negative and positive locations.

* Sliding window. The nature of this is dictated by the format of the codewords. Common sizes are 1KB, 2KB or 4KB. The window will always be present, but it may not 'slide' if the implementation uses a fixed location as the zero point.

== LZEXE ==

Many vintage executables are compressed with the program LZEXE, interesting in that the compressed file contains its own decompressor, meaning that it is in essence a self-extracting archive. However the unique feature of LZEXE executables is that they extract the compressed data to memory and run it. To the user this is indistinguishable from the decompressed executable, though it takes slightly longer to start up and takes up much less space.

[[UNLZEXE]] can be used to extract the decompressed executables from these files, which will run perfectly with the other game files. It can be obtained here: http://www.dosclassics.com/download/198 It is currently unknown specifically how the LZW compression is implemented in this case, but the source code for decompression is available.

The LZEXE compression is similar to the SoftDisk Library Approach described below, but it uses UINT16LE values instead of byte values to store the flag bits and the sliding window has a size of 8192 (0x2000) bytes. Also, the flag bits have different meanings (you could argue that they are in fact Huffman codes):

1 -> copy 1 literal byte
10 -> next two bytes contain length and distance
0000 -> length is 2, next byte contains distance
1000 -> length is 3, next byte contains distance
0100 -> length is 4, next byte contains distance
1100 -> length is 5, next byte contains distance

The real distance value is always a signed 16 bit integer (you can use unsigned but then you have to bitwise-and the resulting index value with 0xFFFF). If the length is given by the flag bits/Huffman code, the real distance is <tt>b | 0xFF00</tt> or <tt>b - 256</tt> where <tt>b</tt> is the byte value read from the file. If the length value is not given, read the byte values <tt>b0</tt> and <tt>b1</tt> and calculate length and distance like this:

length = (b1 mod 8)+2
distance = b0 + (b1/8)*256 - 8192
or
length = (b1 & 0x07)+2
distance = b0 | ((b1 & 0xF8)<<5) | 0xE000

If the <tt>length</tt> value calculated from <tt>b1</tt> is 2, this indicates that another byte value <tt>b2</tt> must be read. Depending on the value of <tt>b2</tt>, one of three things can happen:
b2 = 0:
end of compressed data - stop decompressing
b2 = 1:
end of segment
(decompressor may write contents of buffer to output)
set length to 0 or jump to the part where the decompressor reads the next flag bits/Huffman code
otherwise:
set length to b2+1

Now the decompressor must only add <tt>distance</tt> to the current buffer index (since <tt>distance</tt> is negative, the index goes backwards) and copy <tt>length</tt> bytes from there to the current index:

WHILE length > 0
buffer[index] = buffer[index+distance]
index = index + 1
length = length - 1
END WHILE

Please refer to the UNLZEXE source code for further information.

== SoftDisk Library Approach ==

This is used as the first form of compression in the [[Softdisk Library Format]]. Flags are 1 byte long and divide the datastream into segments of eight 'values' which can be either literals or codewords. Codewords are 2 bytes long, literals 1 byte. (Therefore there will be a flag byte every 8 to 16 bytes of data.) The value of each bit (read from the least significant bit first) indicates whether a value will be a literal (1) or codeword (0). Thus a value of 199 (11000111 in binary) indicates three literals, three codewords and two literals in that order. (Total of 11 bytes.)

Literals are sequences that have never been seen in the datastream before. They cannot be compressed and are thus the same in the compressed and decompressed datastreams. (If the data is text they become quite obvious.) Any string less than 3 bytes long, or any string that has not been read before or cannot be pointed to (see below), will be stored as literals.

Codewords are references to data that has already been read. They are two bytes long and together hold a 12-bit location to read data from and a 4-bit length of data to read.

The lower nybble (4 bits) of the second codeword byte holds the length of repeat data to read minus three. (This makes sense, as the shortest sequence worth coding is three bytes, which can be given the value 0.) It will be immediately apparent that the maximum length of repeated data that can be stored as a codeword is 18 bytes.

The high nybble of the second byte supplies the top four bits of the 12-bit location and the first byte supplies the low eight bits, so the bytes F2 F0 give the location $FF2. Read as a signed 12-bit number, this is the location of the data to read in the 'sliding window' minus 19. (This is due to the way the decompression is set up in memory.)

It will be immediately obvious that the codewords can encode values between +-2048, or about 2KB. If the decompressed data is less than 2KB in size then zero is the start of the data; if it is larger, then it is 2048 bytes from the data end.

It will be noted that it is probable that the compressed datastream will not be perfectly divisible by flag bytes. In this case the unused bits are set to 0. The decompressor stops when the decompressed data size is equal to the value given in the chunk header. (If it runs out of data it will abort.)

As a simple example the sentence 'I am Sam. Sam I am!' will be compressed to:

FF Flag byte, 8 literals follow
49 20 61 6D 20 53 61 6D 'I am Sam' as literals
2B Flag byte, 2L, P, L, P, L, 2 blanks ($2B = 43 = 00101011)
2E 20 '. ' as literals
F2 F0 codeword, read 0 + 3 = 3 bytes from $FF2, or -14 + 19 = 5 in the data (counting from 0). This is 'Sam'
20 ' ' as literal
ED F1 codeword, read 1 + 3 = 4 bytes from $FED, or -19 + 19 = 0 in the data (counting from 0). This is 'I am'
21 '!' as literal

= Source code =

Some example code is available in various languages showing how to decompress (and in some cases compress) files using Keen's LZW algorithm in its various implementations.

== Keen 1-3 Implementation ==

These segments of code work with the Keen 1-3 implementation only and will not, for example, decompress LZEXE-compressed executables.

=== QuickBasic ===

<source lang="qbasic">
DECLARE FUNCTION READBITS% (FILE AS INTEGER, NUMBITS AS INTEGER)
DECLARE SUB LZWDECOMPRESS (INNAME AS STRING, OUTNAME AS STRING)
DECLARE SUB LZWOUTPUT (FILE AS INTEGER, DIC AS INTEGER, CHAR AS INTEGER)
'
' KEEN1 Compatible LZW Decompressor (Lempel-Ziv-Welch)
'  - by Napalm with thanks to Adurdin's work on ModKeen
'
' This source is Public Domain
'
'

' Allocate dictionary
DIM LZDIC(0 TO 4095) AS INTEGER
DIM LZCHR(0 TO 4095) AS INTEGER

' Test Function
LZWDECOMPRESS "EGALATCH.CK1", "EGALATCH.DAT"

SUB LZWDECOMPRESS (INNAME AS STRING, OUTNAME AS STRING)
    SHARED LZDIC() AS INTEGER, LZCHR() AS INTEGER
    DIM INFILE AS INTEGER, OUTFILE AS INTEGER, I AS INTEGER
    DIM BITLEN AS INTEGER, CURPOS AS INTEGER
    DIM CW AS INTEGER, PW AS INTEGER, C AS INTEGER, P AS INTEGER
    DIM CHECK AS INTEGER

    ' Open files for input and output
    INFILE = FREEFILE
    OPEN INNAME FOR BINARY ACCESS READ AS INFILE
    OUTFILE = FREEFILE
    OPEN OUTNAME FOR BINARY ACCESS WRITE AS OUTFILE
    SEEK INFILE, 7

    ' Fill dictionary with starting values
    FOR I = 0 TO 4095
        LZDIC(I) = -1
        IF I < 256 THEN
            LZCHR(I) = I
        ELSE
            LZCHR(I) = -1
        END IF
    NEXT I

    ' Decompress input stream to output stream
    BITLEN = 9
    CURPOS = 258
    CW = READBITS(INFILE, BITLEN)
    LZWOUTPUT OUTFILE, LZDIC(CW), LZCHR(CW)

    WHILE CW <> &H100 AND CW <> &H101
        PW = CW
        CW = READBITS(INFILE, BITLEN)
        IF CW <> &H100 AND CW <> &H101 THEN
            P = PW
            CHECK = (LZCHR(CW) <> -1)

            IF CHECK THEN
                TMP = CW
            ELSE
                TMP = PW
            END IF
            WHILE LZDIC(TMP) <> -1
                TMP = LZDIC(TMP)
            WEND
            C = LZCHR(TMP)

            IF CHECK THEN
                LZWOUTPUT OUTFILE, LZDIC(CW), LZCHR(CW)
            ELSE
                LZWOUTPUT OUTFILE, P, C
            END IF

            IF CURPOS < 4096 THEN
                LZDIC(CURPOS) = P
                LZCHR(CURPOS) = C
                CURPOS = CURPOS + 1
                IF CURPOS = (2 ^ BITLEN - 1) AND BITLEN < 12 THEN
                    BITLEN = BITLEN + 1
                END IF
            END IF

        END IF
    WEND

    ' Close files
    CLOSE OUTFILE
    CLOSE INFILE
END SUB

SUB LZWOUTPUT (FILE AS INTEGER, DIC AS INTEGER, CHAR AS INTEGER)
    SHARED LZDIC() AS INTEGER, LZCHR() AS INTEGER
    DIM LZSTK(0 TO 127) AS STRING * 1
    DIM X AS INTEGER, SP AS INTEGER
    DIM LDIC AS INTEGER, LCHAR AS INTEGER

    LCHAR = CHAR
    LDIC = DIC
    SP = 0
    X = 1

    DO
        IF SP >= 128 THEN
            PRINT "LZW: Stack Overflow!"
            END
        END IF
        LZSTK(SP) = CHR$(LCHAR)
        SP = SP + 1
        IF LDIC <> -1 THEN
            LCHAR = LZCHR(LDIC)
            LDIC = LZDIC(LDIC)
        ELSE
            X = 0
        END IF
    LOOP WHILE X

    WHILE SP <> 0
        SP = SP - 1
        PUT FILE, , LZSTK(SP)
    WEND
END SUB

FUNCTION READBITS% (FILE AS INTEGER, NUMBITS AS INTEGER)
    STATIC BITDAT AS STRING * 1, BITPOS AS INTEGER
    DIM BITVAL AS INTEGER, BIT AS INTEGER

    BITVAL = 0
    FOR BIT = (NUMBITS - 1) TO 0 STEP -1
        IF BITPOS = 0 THEN
            GET FILE, , BITDAT
            BITPOS = 7
        ELSE
            BITPOS = BITPOS - 1
        END IF
        IF ASC(BITDAT) AND 2 ^ BITPOS THEN
            BITVAL = BITVAL OR 2 ^ BIT
        END IF
    NEXT BIT

    READBITS% = BITVAL
END FUNCTION</source>

=== FreeBasic ===

This code does not suffer from the 64K memory limit imposed by QuickBasic; as a result it uses memory less efficiently, but runs faster. It can be compiled with the FreeBasic compiler using the <tt>-lang qb</tt> switch. Aside from memory concerns, all code here is compatible with QuickBasic.

The code before the subroutine builds a string containing the bit expansion of all values from 0 to 255. The subroutine takes a filename, reads the entire file into memory, then expands each bit of data to a byte using that string, as Basic cannot deal with bits directly. <tt>cw$</tt> is the current codeword, <tt>pw$</tt> is the previous codeword, <tt>lun</tt> is the lowest dictionary entry that is empty, <tt>p</tt> is the location in the compressed data stream and <tt>bl</tt> is the length of codes in bits (starting at nine bits and increasing to 12). <tt>folder</tt> (a directory prefix) and <tt>extq</tt> (the output file extension) are strings supplied by the surrounding program.

The dictionary is set up before decompression. The first 258 entries are the starting dictionary (the 256 single-byte values plus the two reserved codes 256 and 257); the remainder are cleared. (It is vital to reset the dictionary for each file.) An error occurs if entry 256 is found in the data. 'Disrupt!' is printed when the newest dictionary entry is not the lowest possible entry. (This shouldn't happen but is possible.) Decompression ends on encountering entry 257, or when there is no more data to read.

<source lang="qbasic">
DECLARE SUB LZWDEC (lfn AS STRING)

' x$, folder and extq are used inside LZWDEC and so must be shared with it.
' folder and extq were not defined in the original listing; the values
' below are placeholders (assumptions) and should be adjusted as needed.
DIM SHARED x$
DIM SHARED folder AS STRING, extq AS STRING
folder = ""      ' directory prefix for the input and output files
extq = ".DAT"    ' extension given to the decompressed file

' Build a string holding the 8-character bit pattern of every byte value
x$ = ""
FOR l = 0 TO 255
    IF (l AND 128) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
    IF (l AND 64) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
    IF (l AND 32) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
    IF (l AND 16) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
    IF (l AND 8) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
    IF (l AND 4) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
    IF (l AND 2) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
    IF (l AND 1) > 0 THEN x$ = x$ + "1" ELSE x$ = x$ + "0"
NEXT l
LZWDEC "EGALATCH.CK1"
END

'_________________________________________________
SUB LZWDEC (lfn AS STRING) ' Decompress LZW data
'_________________________________________________
    DIM lzw(0 TO 4095) AS STRING
    PRINT lfn; " is LZW compressed, decompressing...";
    OPEN folder + lfn FOR BINARY AS #9
    y$ = SPACE$(LOF(9))
    GET #9, 1, y$
    CLOSE #9

    ' Expand every data byte after the 6-byte header into 8 "0"/"1" characters
    z$ = ""
    FOR l = 7 TO LEN(y$)
        z$ = z$ + MID$(x$, (ASC(MID$(y$, l, 1)) * 8) + 1, 8)
    NEXT l

    bl = 9
    lun = 258
    p = 1
    cw$ = ""
    y$ = ""
    FOR l = 0 TO 4095
        IF l < 256 THEN lzw(l) = CHR$(l) ELSE lzw(l) = ""
    NEXT l
    DO
        IF lun = 511 THEN bl = 10
        IF lun = 1023 THEN bl = 11
        IF lun = 2047 THEN bl = 12
        pw$ = cw$

        ' Read the next bl-bit code into y
        u$ = MID$(z$, p, bl)
        p = p + bl
        y = 0
        FOR l = 1 TO bl
            IF MID$(u$, bl - l + 1, 1) = "1" THEN y = y + (2 ^ (l - 1))
        NEXT l
        IF y = 256 THEN
            PRINT "LZW error in Keen data!"
            OPEN "ERROR.DAT" FOR OUTPUT AS #9
            PRINT #9, y$;
            CLOSE
            END
        END IF
        IF y = 257 THEN EXIT DO
        IF cw$ = "" THEN
            cw$ = lzw(y)
            y$ = y$ + cw$
        ELSE
            IF lun < 4096 THEN
                IF lzw(y) = "" THEN
                    cw$ = pw$ + LEFT$(pw$, 1)
                    lzw(y) = cw$
                    y$ = y$ + cw$
                    IF y <> lun THEN PRINT "Disrupt!"
                    lun = y + 1
                ELSE
                    cw$ = lzw(y)
                    y$ = y$ + cw$
                    lzw(lun) = pw$ + LEFT$(cw$, 1)
                    lun = lun + 1
                END IF
            ELSE
                IF lzw(y) = "" THEN
                    y$ = y$ + cw$
                ELSE
                    cw$ = lzw(y)
                    y$ = y$ + cw$
                END IF
            END IF
        END IF
    LOOP WHILE p < LEN(z$)
    IF y = 257 THEN PRINT "done" ELSE PRINT "out of data."
    OPEN folder + LEFT$(lfn, 4) + extq FOR OUTPUT AS #9
    PRINT #9, y$;
    CLOSE #9
END SUB</source>
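
=== Python ===

The same steps can be sketched compactly in Python. This is a minimal illustration based on the behaviour of the two Basic listings above, not an independent reference: the names <tt>read_code</tt> and <tt>lzw_decompress</tt> are made up for this example. Like those listings it skips the first six bytes of the file, reads codes most significant bit first, widens the codes from 9 to 12 bits when the next free entry reaches 511, 1023 and 2047, treats code 256 as an error and stops at code 257.

<source lang="python">
def read_code(data, bitpos, width):
    """Read `width` bits starting at bit `bitpos`, most significant first."""
    value = 0
    for i in range(bitpos, bitpos + width):
        bit = (data[i >> 3] >> (7 - (i & 7))) & 1
        value = (value << 1) | bit
    return value


def lzw_decompress(data, header_size=6):
    """Decompress Keen 1-3 style LZW data held in a bytes object."""
    table = {i: bytes([i]) for i in range(256)}  # codes 256/257 are reserved
    next_code = 258
    width = 9
    pos = header_size * 8
    out = bytearray()
    prev = None
    while pos + width <= len(data) * 8:
        code = read_code(data, pos, width)
        pos += width
        if code == 257:  # end of data
            break
        if code == 256:
            raise ValueError("code 256 found in data")
        if prev is None:
            entry = table[code]  # first code must be a plain byte value
        else:
            if code in table:
                entry = table[code]
            elif code == next_code:
                entry = prev + prev[:1]  # code refers to the entry being built
            else:
                raise ValueError("code %d not in dictionary" % code)
            if next_code < 4096:
                table[next_code] = prev + entry[:1]
                next_code += 1
                if next_code == (1 << width) - 1 and width < 12:
                    width += 1
        out += entry
        prev = entry
    return bytes(out)


# Six dummy header bytes, then the codes 65, 66, 258, 257 packed as 9 bits each
sample = bytes(6) + bytes([0x20, 0x90, 0xA0, 0x50, 0x10])
print(lzw_decompress(sample))  # b'ABAB'
</source>

Storing whole byte strings in a Python dictionary keeps the sketch short at the cost of memory, much like the string array in the FreeBasic version.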

=== Visual Basic .NET ===

This implementation uses high-level elements such as lambdas, anonymous arrays, and strict types. It must be compiled for the Microsoft .NET Framework v4.5 in Visual Studio 2012. It has the advantages of running in non-DosBox Windows and using .NET streams for simple reusability. It can also be used for higher-bit LZW.

==== Decompression ====

Decompressing is very fast.
<source lang="vbnet">
Sub DecompressLZW(Data As IO.Stream, MaxBits As Byte, Output As IO.Stream)
    ' This source is by Fleexy and is in the public domain. If used, please note it as such.
    Dim dict As New List(Of Byte())
    For x = 0 To 255
        dict.Add({x})
    Next
    dict.Add({})
    dict.Add(Nothing)
    Dim usebits As Byte = 9
    Dim bpos As Long
    Dim bits As New List(Of Byte)
    Do Until Data.Position = Data.Length
        Dim b, ub As Byte
        b = Data.ReadByte
        ub = b
        For x = 7 To 0 Step -1
            If ub - (2 ^ x) >= 0 Then
                ub -= (2 ^ x)
                bits.Add(1)
            Else
                bits.Add(0)
            End If
        Next
    Loop
    Dim GetCode = Function() As UInteger
        Dim u As UInteger
        For x = usebits To 1 Step -1
            If bits(bpos) = 1 Then u += (2 ^ (x - 1))
            bpos += 1
        Next
        Return u
    End Function
    Dim OutputCode = Sub(DecompData As Byte())
        Dim n As UInteger = DecompData.Length
        Output.Write(DecompData, 0, n)
    End Sub
    Dim AddToDict = Sub(Entry As Byte())
        If dict.Count < (2 ^ MaxBits) Then
            dict.Add(Entry)
            If dict.Count = (2 ^ usebits) - 1 Then usebits = Math.Min(usebits + 1, MaxBits)
        End If
    End Sub
    Dim fcode As UInteger = GetCode()
    Dim match As Byte() = dict(fcode)
    OutputCode(match)
    Do
        Dim ncode As UInteger = GetCode()
        If ncode = 257 Then Exit Do
        If ncode = 256 Then Throw New Exception
        Dim nmatch As Byte()
        If ncode < dict.Count Then
            nmatch = dict(ncode)
        Else
            nmatch = match.Concat({match(0)}).ToArray
        End If
        OutputCode(nmatch)
        AddToDict(match.Concat({nmatch(0)}).ToArray)
        match = nmatch
    Loop
End Sub</source>

==== Compression ====

Compression is more difficult; consulting the dictionary for a byte array takes more time. The speed of this algorithm may be unacceptable.
<source lang="vbnet">
Sub CompressLZW(Data As IO.Stream, MaxBits As Byte, Output As IO.Stream)
    ' This source is by Fleexy and is in the public domain.
    Dim bits As New List(Of Byte)
    Dim dict As New List(Of Byte())
    For x = 0 To 255
        dict.Add({x})
    Next
    dict.Add({})
    dict.Add({})
    Dim usebits As Byte = 9
    Dim PutCode = Sub(Code As UInteger)
        For x = usebits To 1 Step -1
            If Code - (2 ^ (x - 1)) >= 0 Then
                Code -= (2 ^ (x - 1))
                bits.Add(1)
            Else
                bits.Add(0)
            End If
        Next
    End Sub
    Dim AddToDict = Function(Entry As Byte()) As Boolean
        If dict.Count < 2 ^ MaxBits Then
            dict.Add(Entry)
            If dict.Count = 2 ^ usebits Then usebits = Math.Min(usebits + 1, MaxBits)
            Return True
        Else
            Return False
        End If
    End Function
    Dim FindCode = Function(Bytes As Byte()) As UInteger
        For x = 1 To dict.Count
            If dict(x - 1).Count = Bytes.Count Then
                If dict(x - 1).SequenceEqual(Bytes) Then Return x - 1
            End If
        Next
        Throw New NotFiniteNumberException
    End Function
    Dim DictContains = Function(Bytes As Byte()) As Boolean
        For x = 1 To dict.Count
            If dict(x - 1).Length = Bytes.Length Then
                If dict(x - 1).SequenceEqual(Bytes) Then Return True
            End If
        Next
        Return False
    End Function
    Dim match As Byte() = {}
    Do Until Data.Position = Data.Length
        Dim nbyte As Byte = Data.ReadByte
        Dim nmatch As Byte() = match.Concat({nbyte}).ToArray
        If DictContains(nmatch) Then
            match = nmatch
        Else
            PutCode(FindCode(match))
            AddToDict(nmatch)
            match = {nbyte}
        End If
    Loop
    PutCode(FindCode(match))
    PutCode(257)
    Do Until bits.LongCount Mod 8L = 0L
        bits.Add(0)
    Loop
    For x = 1 To CInt(bits.LongCount / 8L)
        Dim b As Byte = 0
        For y = 0 To 7
            b += bits((x - 1) * 8 + y) * (2 ^ (7 - y))
        Next
        Output.WriteByte(b)
    Next
End Sub</source>
----

[[Category:Commander Keen 1-3]]
[[Category:File Formats]]
[[Category:Compressed Files]]