Reverse engineering guide

From ModdingWiki
Jump to navigation Jump to search

This tutorial aims to provide a basic introduction to reverse engineering games. It is focused on DOS games, but the core principles apply equally to any software. Note that this tutorial only covers reverse engineering game data (not code) as the intention is to discover and document file formats to aid in the creation of modding tools.

Reverse engineering primer

Some basic skills are required in order to reverse engineer. You must be familiar with the types of data used by games, and how this data is stored inside files.

The integer (a whole number, such as 1000) is extremely common. Since data is usually examined in a hex editor, the values are written in hex. Thus the value 1000 is 3E8 in hex (in order to distinguish between hex and decimal, the rest of this guide will prefix hex numbers with "0x", like the C programming language does - therefore 1000 is the same as 0x3E8.)

There are different variants of integers, but the most common are 16-bit and 32-bit. 16-bit integers occupy two bytes of space, and 32-bit integers occupy four bytes of space. In hex, every two digits are a byte. So 0x12 is one byte (8-bit), 0x1234 takes up two bytes (16-bit) and 0x12345678 takes up four bytes (32-bit). If you were to store 0x12 in a 32-bit integer, it would be 0x00000012 (the leading zeroes are ignored, like in decimal - 1000 is the same as 001000.)

When these integers are written to a file however, the order of each byte is different. The value 0x1234 is written as two bytes, 0x12 and 0x34. However because DOS is little-endian, the order is reversed. Thus when looking at a file with a hex editor, the two bytes will appear as 0x34 followed by 0x12. A 32-bit number might look like this in a hex editor:

78 56 34 12

Reversing the order of those bytes reveals the number 0x12345678. Likewise, this might be another number seen in a hex editor:

23 67 01 00

This is where reverse engineering skills come into play. Is this a single 32-bit integer (0x00016723) or two 16-bit integers (0x6723 and 0x0001)? Being able to determine which is which is a skill a reverse engineer must learn. (In this case it is most likely to be a 32-bit integer, because the first 16-bit integer is so large and the second is so small. Usually sequential numbers are similar in value.)

Hex editor view

In order to examine game data, a hex editor must be used. It is important to know how to read data in a hex editor, as well as how to edit data. When opening a file, the data will appear something like this:

00000000  61 62 63 64 65 66 67 68  abcdefgh

Here, the first number is the offset. The very first byte in the file is at offset 0. The second byte is at offset 1, and so on. The first number tells you where in the file the data is positioned.

The next set of numbers (61 to 68 above) are hex values of the data. In this case, the first byte in the file is value 0x61. The third byte in the file is value 0x63.

After the numbers come ASCII (text) representations of the *same* numbers. In the example above, the first character is a lowercase "a" because 0x61 is the ASCII code for a lowercase letter A.

It is important to note that the numbers (61 to 68 above) and the letters (a-h above) are different representations of the same data. This is because sometimes it is easier to look at the data in numeric form, and sometimes it is easier to look at it in text. Having both views of the same data on the screen at the same time makes it easy to switch between the two.

By way of example, if you were trying to decode a 32-bit integer (as described in the previous section) you'd be looking at the hex numbers, but if you were trying to read some filenames it would be much easier to look at the text. After all, reading "6D 61 70 73 2E 64 61 74" is a lot harder than reading "maps.dat" but they are both the same data!

This is why, when editing data using a hex editor, if you change the numbers the text will also change, and likewise if you type over the text the numbers will change accordingly.

Where is the game data stored?

Before any reverse engineering can begin, the layout of the data must be known. Does the game contain many separate files, or a few large ones? Do the files have descriptive names, are many of them the same size, or is there some other attribute that could be used to identify them?

Hints

  • If the game stores its data in separate files (even if they are packaged up into a single large file) you can often deduce the purpose of many files by simply renaming them. For example, in Jill of the Jungle each level is stored in a separate file. This was discovered by swapping the filenames of two files, and finding that two levels had swapped order inside the game. If the game files are stored in an archive file of unknown format, the filenames and/or offsets can be modified using a hex editor.

Archive files

Many games store their data in separate files, but all the files are packed together in a single large file. This is exactly how a .zip file works (one .zip file contains many other files) except many games use unique file formats for their archives. Sometimes they may even store the data inside the .exe file itself, but this makes little difference in how it is examined.

To reverse engineer an archive file, two things must be discovered - the location of the File Allocation Table (FAT) and the location of the data (the FAT is simply a list of filenames and the locations of each file's data within the archive.) The data location is easy, as it is usually the largest file that ships with the game. The FAT can be harder to locate, particularly if the file format does not include filenames. Quite often, the FAT is in the same file as the data.

If the FAT does contain filenames, it is very easy to locate. If it does not, a set of file offsets need to be located. File offsets are numbers, usually 32-bit integers, which indicate where each file starts inside the main archive. Typically the first file offset is either zero, or it points to the end of the FAT (if the FAT is stored at the beginning of the archive file.) Picking offsets at random and jumping to that offset in the hex editor will, if the offset is correct, will reveal the file contents.

Here is a view of DUKE3D.GRP in a hex editor. This is the main archive file for the game Duke Nukem 3D. It is also the largest file that ships with the game.

Reveng duke3d grp fat.png

From the screenshot it is quite clear that this file starts with a FAT, because of all the filenames. The first line reading "KenSilverman" (the programmer of the Build engine) is a signature used to identify files and is not part of the filename list (this is obvious as it is the only line that is not entirely in uppercase - so the fact that it doesn't fit a pattern of nearby data is a good hint that it is different.)

The gibberish characters after the filename are a visual representation of the file size. The characters are meaningless - here, the corresponding hex numbers are used instead. After the filename FLYBY.VOC for example, there are some blanks (hex numbers "00 00 00") which pad out the filename to 12 characters, so all filenames take up the same space. Then there are the numbers 53 93 00 00. As mentioned in the primer above, this is a 32-bit integer of value 0x00009353. This is the size of FLYBY.VOC (0x9353 in decimal is 37715, so the file is 37,715 bytes long.)

Since this archive only stores file sizes (and not file offsets), some calculation is required to work out where each file's data starts. The first file starts immediately after the FAT stops. Each file then appears one after the other. To work out where FLYBY.VOC starts, add together the sizes of every single file before FLYBY.VOC, plus the size of the FAT itself. This will give you the starting position of the file's data.

What does the data mean?

Once it has been established where the game stores its data, the reverse engineering can begin. Typically this involves making small changes to the game data in a hex editor, then loading the game and seeing what effect the change had. If game levels are being edited for instance, playing the game may reveal some unexpected change in part of the level.

Using the game Dangerous Dave as an example, looking at the map data in a hex editor looks like this:

Reveng ddave map.png

It may seem overwhelming, but that's because at this point nothing is known about the file format. The first steps are to make some small changes and see if they have any effect. Here, one byte has been changed to be the same as the byte next to it: (the white block behind the letter A is the cursor - that is the byte that has been changed.)

Reveng ddave map changed.png

Saving the file and loading the game immediately reveals that this change had a noticeable effect. There are now two trophies in the level, where previously there was only one!

Reveng ddave map changed-ingame.png

This simple change has revealed a substantial amount of information about the map file format. We now know:

  • Each byte seems to represent a single block in the game level
  • The value 0x0A represents an exit trophy
  • The old value 0x00 represents an empty space

Knowing this information (particularly the bit about each byte being a single map block) we could adjust the width of our hex editor to see if we can display the data in a recognisable form. In this case, after resizing the hex editor to display 100 bytes on each row, we can see the map above displayed directly in the hex editor:

Reveng ddave map wide.png

This shows that the map is 100 bytes/blocks wide, and 10 bytes/blocks high. It also reveals more mappings, such as the character "/" being the code for a blue gem.

At this point, only one thing is left before a map editor can be made - figuring out which codes map to which images. It might be tempting to just write down a list of mappings by hand (like "0x0A means exit trophy") but programming all these mappings into a level editor is very time consuming. It also makes it difficult to change, if the game supports it. For this reason, it is important to work out how to automatically convert these map codes into images. This is covered in more detail below, but often there is a simple relationship, such as map code 0x01 referring to the first image in a tileset, 0x02 referring to the second image, etc.