SimCity 2000 DOS Data Formats::02.28.2015+18:15
(I’ve been meaning to write this up for a while.) Around a year and a half ago I was bored and felt like digging around in some game engines because it’s interesting to see how people have solved various problems, what formats they use, and also what libraries they use. I ended up focusing on SimCity 2000 for DOS because it’s pretty old and I’m not familiar with the limitations of DOS programming. I’m going to include bits of my thought process, so feel free to skim if you want spoilers.
The DAT File
Understanding the SC2000.DAT file is the meat of this post. The GOG version of the game also includes a SC2000SE.DAT file. This is actually a modified ISO of what’s on the Special Edition CD-ROM (it doesn’t include the Windows version, sadly). ISOs are boring and well documented, so we’ll ignore it.
After opening up the file in a hex editor, I noticed that there was no header (lack of any identifying words/bytes) and a large portion of the beginning of the file seemed to have a uniform format. Basically, some letters (which looked like filenames) and two shorts; clearly it was an index of some sort. This was a DOS game, so the filenames were all in 8.3 format, which put them at 12 bytes each. They were not C strings, making extracting the index a lot easier. The format is exactly as follows:
struct Entry {
    char filename[12];
    uint16_t someNumber;
    uint16_t otherNumber;
};
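A minimal sketch of decoding one 16-byte record, for anyone following along. It reads the fields byte by byte instead of casting to the struct, which avoids any dependence on compiler padding or host byte order. The names `ParsedEntry`, `read_u16le`, and `parse_entry` are made up for this example. Two things are assumptions rather than facts from the file: that the numbers are little-endian (likely for an x86 DOS game) and that names shorter than 12 bytes are padded with NULs or spaces.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ENTRY_SIZE 16   /* 12-byte name + two 16-bit numbers */
#define NAME_LEN   12   /* 8.3 name: 8 + '.' + 3 */

/* Hypothetical in-memory form of one index record. */
typedef struct {
    char     filename[NAME_LEN + 1]; /* NUL-terminated copy of the name */
    uint16_t someNumber;
    uint16_t otherNumber;
} ParsedEntry;

/* Read a 16-bit value stored low byte first (assumed byte order). */
static uint16_t read_u16le(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Decode one raw 16-byte record from the start of the DAT. */
static void parse_entry(const unsigned char raw[ENTRY_SIZE], ParsedEntry *out)
{
    assert(raw != NULL && out != NULL);

    /* The on-disk name has no terminator, so add one after copying. */
    memcpy(out->filename, raw, NAME_LEN);
    out->filename[NAME_LEN] = '\0';

    /* Assumption: short names are padded with NULs or spaces.
     * strlen stops at NUL padding; the loop strips space padding. */
    size_t len = strlen(out->filename);
    while (len > 0 && out->filename[len - 1] == ' ')
        out->filename[--len] = '\0';

    out->someNumber  = read_u16le(raw + NAME_LEN);
    out->otherNumber = read_u16le(raw + NAME_LEN + 2);
}
```

Calling `parse_entry` on each consecutive 16-byte slice of the index gives a printable name plus the two raw numbers, which is enough to start eyeballing patterns in them.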
I scrubbed the file, looking for some indication of how many entries there were in the index, and as far as I can tell there’s nothing to explicitly tell the game that. While writing this post, however, I came to the realization that you can calculate the number of entries from the first entry in the index (the first file’s data should begin right where the index ends, so that file’s position divided by the 16-byte entry size gives the count). At the time, I just hardcoded how many files there were in the short program I wrote to dump the contents (a nearly 20-year-old game isn’t likely to change).
The next important bit was understanding what the two numbers after the filename meant. My initial guess was that maybe they were the size and offset of the file in the archive. The first number looked plausibly enough like it could be size, but the second number was confusing. It was really small (0 for the first couple entries), only ever increased, and was the same for a bunch of consecutive entries. I added up the first number for all of the entries and ended up with something much smaller than the 2.5 MB that the file is. I was wrong on both counts.
My next guess about the second number was that it was some sort of block number. One might think that it was just the 20-bit addressing scheme of segment:offset. That’s not right for a number of reasons:
- 20-bit addressing only handles one megabyte of memory
- The data file is 2.5 MB
- 20-bit addressing segments are only 16 bits each. The potential offset values were much larger than that.
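What does work is treating the second number as a 64 KB block index, so a file’s position in the DAT is offset + (block * 64 * 1024).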
The final file entry structure looks like this:
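struct Entry {
    char filename[12];  /* 8.3 name, not NUL-terminated */
    uint16_t offset;    /* byte offset within the 64 KB block */
    uint16_t block;     /* index of the 64 KB block */
};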
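As a quick sketch of the arithmetic, the position formula and the entry-count trick fit in two small functions. The names `entry_position` and `entry_count` are invented for this example, and `entry_count` rests on an assumption: that the first entry's data starts immediately after the index, so its position equals the size of the index in bytes.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE UINT32_C(65536)  /* 64 * 1024 */
#define ENTRY_SIZE UINT32_C(16)     /* 12-byte name + two 16-bit fields */

/* Absolute byte position of a file inside the DAT:
 * offset + (block * 64 * 1024). */
static uint32_t entry_position(uint16_t offset, uint16_t block)
{
    return (uint32_t)block * BLOCK_SIZE + offset;
}

/* Number of index entries, derived from the first entry alone.
 * Assumption: the first file's data begins right where the index ends. */
static uint32_t entry_count(uint16_t first_offset, uint16_t first_block)
{
    uint32_t index_bytes = entry_position(first_offset, first_block);

    /* An index made of whole 16-byte records must have a size
     * that is a multiple of 16. */
    assert(index_bytes % ENTRY_SIZE == 0);
    return index_bytes / ENTRY_SIZE;
}
```

For example, an offset of 0x1234 in block 2 lands at byte 135732, and block 39 with offset 0xFFFF lands at byte 2621439, which is roughly where a 2.5 MB file ends. A first entry with offset 0x0640 in block 0 would imply an index of 100 entries (that value is made up for illustration, not read from the real file).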
Dumping the Contents of the DAT
Now that I’d figured out the format, I needed to dump the files. The DAT is tightly packed, so you don’t have to worry about alignment or anything like that. Dumping each file is basically just slicing out the bytes from the beginning offset until the offset of the next file (or the end of the DAT if you’re on the last entry). The code I wrote to do this is trivial, so this is left as an exercise for the reader.
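For readers who would rather not do the exercise, here is one possible sketch of the slicing described above. It is not the code I originally wrote. The function names are invented for this example, and it leans on several assumptions: the fields are little-endian, short names are padded with NULs or spaces, the first file's data starts right after the index (which is how the entry count is derived), and entries appear in the index in the same order as their data. Files are written into the current directory under the names found in the index, so only run it on a DAT you trust.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ENTRY_SIZE 16
#define NAME_LEN   12

/* Absolute position encoded in one raw 16-byte index record
 * (two little-endian 16-bit fields after the 12-byte name). */
static long entry_pos(const unsigned char *rec)
{
    assert(rec != NULL);
    long offset = rec[12] | ((long)rec[13] << 8);
    long block  = rec[14] | ((long)rec[15] << 8);
    return block * 65536L + offset;
}

/* Copy bytes [start, end) of `in` into a new file called `name`. */
static int copy_range(FILE *in, long start, long end, const char *name)
{
    if (fseek(in, start, SEEK_SET) != 0) return -1;
    FILE *out = fopen(name, "wb");
    if (!out) return -1;
    for (long i = start; i < end; i++) {
        int c = fgetc(in);
        if (c == EOF || fputc(c, out) == EOF) { fclose(out); return -1; }
    }
    return fclose(out) == 0 ? 0 : -1;
}

/* Extract every file into the current directory.
 * Returns the number of files written, or -1 on error. */
static int dat_extract(const char *dat_path)
{
    FILE *in = fopen(dat_path, "rb");
    if (!in) return -1;

    unsigned char first[ENTRY_SIZE];
    long dat_size = -1, index_bytes = 0;
    if (fread(first, 1, ENTRY_SIZE, in) == ENTRY_SIZE &&
        fseek(in, 0, SEEK_END) == 0) {
        dat_size = ftell(in);
        index_bytes = entry_pos(first); /* index ends where file 0 begins */
    }
    /* The index must be a whole number of records and fit in the file. */
    if (index_bytes < ENTRY_SIZE || index_bytes % ENTRY_SIZE != 0 ||
        index_bytes > dat_size) {
        fclose(in);
        return -1;
    }

    unsigned char *index = malloc((size_t)index_bytes);
    int written = -1;
    if (index && fseek(in, 0, SEEK_SET) == 0 &&
        fread(index, 1, (size_t)index_bytes, in) == (size_t)index_bytes) {
        long count = index_bytes / ENTRY_SIZE;
        written = 0;
        for (long i = 0; i < count; i++) {
            const unsigned char *rec = index + i * ENTRY_SIZE;
            char name[NAME_LEN + 1];
            memcpy(name, rec, NAME_LEN);
            name[NAME_LEN] = '\0';
            /* Assumption: short names are padded with NULs or spaces. */
            for (size_t n = strlen(name); n > 0 && name[n - 1] == ' '; n--)
                name[n - 1] = '\0';

            /* A file runs until the next file starts, or to the end
             * of the DAT for the last entry. */
            long start = entry_pos(rec);
            long end = (i + 1 < count) ? entry_pos(rec + ENTRY_SIZE) : dat_size;
            if (start > end || end > dat_size ||
                copy_range(in, start, end, name) != 0) {
                written = -1;
                break;
            }
            written++;
        }
    }
    free(index);
    fclose(in);
    return written;
}
```

Calling `dat_extract("SC2000.DAT")` from a scratch directory should leave one output file per index entry. Copying byte by byte is slow but keeps the sketch short, and for a file this small it hardly matters.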
What’s Inside
Part of my initial motivation was getting at the tasty music files inside the archive, so I was hoping they were in a sane, somewhat standard format and not something like an XM or MOD file that had been stripped and rewritten into some other binary format or something similarly custom. As luck would have it, they’re run-of-the-mill XMI files which can be easily converted to MID.
The file formats inside of the DAT are (in no particular order):
- PAL - palette
- RAW - bitmap
- FNT - font
- HED - tileset header
- DAT - tileset data
- XMI - music in Extended MIDI format
- VOC - sound effect in Creative Voice format
- TXT - strings
- GM.(AD|OPL) - General MIDI sound fonts for AdLib and Yamaha OPL
Conclusion
I hope this was as interesting to read as it was for me to discover. My biggest unanswered question at this point is why the index doesn’t use a 32-bit unsigned int for the offset from the start of the file. I’ve fumbled around the Watcom C/C++ docs, and I can’t find anything to shed light on this (the game uses DOS/4GW, which was distributed with Watcom). The DOS/4G docs are behind a $49 paywall and I’m not that interested in finding out the answer.