这里会显示出您选择的修订版和当前版本之间的差别。
— |
python-files:chm-format [2011/02/24 14:09] (当前版本) |
||
---|---|---|---|
行 1: | 行 1: | ||
+ | ====== CHM文件的文件格式 ====== | ||
+ | <html> | ||
+ | <h1> Microsoft's HTML Help (.chm) format</h1> | ||
+ | <h2 id="preface"> Preface</h2> | ||
+ | <p> | ||
+ | This is documentation on the .chm format used by Microsoft HTML Help. | ||
+ | This format has been reverse engineered in the past, but as far as I | ||
+ | know this is the first freely available documentation on it. One | ||
+ | Usenet message indicates that these .chm files are actually IStorage | ||
+ | files documented in the Microsoft Platform SDK. However, I have not | ||
+ | been able to locate such documentation. | ||
+ | </p> | ||
+ | <h3>Note</h3> | ||
+ | |||
+ | <p> | ||
+ | The word "section" is badly overloaded in this document. Sorry about | ||
+ | that.</p> | ||
+ | |||
+ | <p> | ||
+ | All numbers are in hexadecimal unless otherwise indicated in the | ||
+ | text. Except in tabular listings, this will be indicated by | ||
+ | $ or 0x as appropriate. All values within the file are Intel byte | ||
+ | order (little endian) unless indicated otherwise. | ||
+ | </p> | ||
+ | |||
+ | <h2 id="overview"> The overall format of a .chm file</h2> | ||
+ | |||
+ | <p> | ||
+ | The .chm file begins with a short ($38 byte) initial header. This is | ||
+ | followed by the header section table and the offset to the | ||
+ | content. Collectively, this is the "header". | ||
+ | |||
+ | </p> | ||
+ | |||
+ | <p> | ||
+ | The header is followed by the header sections. There are two header | ||
+ | sections. One header section is the file directory, the other | ||
+ | contains the file length and some unknown data. Immediately following | ||
+ | the header sections is the content. | ||
+ | </p> | ||
+ | |||
+ | <h2 id="header"> The Header</h2> | ||
+ | <p> | ||
+ | The header starts with the initial header, which has the following format | ||
+ | </p> | ||
+ | <pre> | ||
+ | 0000: char[4] 'ITSF' | ||
+ | 0004: DWORD 3 (Version number) | ||
+ | 0008: DWORD Total header length, including header section table and | ||
+ | following data. | ||
+ | 000C: DWORD 1 (unknown) | ||
+ | 0010: DWORD a timestamp. | ||
+ | Considered as a big-endian DWORD, it appears to contain | ||
+ | seconds (MSB) and fractional seconds (second byte). | ||
+ | The third and fourth bytes may contain even more fractional | ||
+ | bits. The 4 least significant bits in the last byte are | ||
+ | constant. | ||
+ | 0014: DWORD Windows Language ID. The two I've seen | ||
+ | $0409 = LANG_ENGLISH/SUBLANG_ENGLISH_US | ||
+ | $0407 = LANG_GERMAN/SUBLANG_GERMAN | ||
+ | 0018: GUID {7C01FD10-7BAA-11D0-9E0C-00A0-C922-E6EC} | ||
+ | 0028: GUID {7C01FD11-7BAA-11D0-9E0C-00A0-C922-E6EC} | ||
+ | </pre> | ||
+ | <p>Note: a GUID is $10 bytes, arranged as 1 DWORD, 2 WORDs, and 8 BYTEs.</p> | ||
+ | |||
+ | <p> | ||
+ | It is followed by the header section table, which is 2 entries, where | ||
+ | each entry is $10 bytes long and has this format: | ||
+ | </p> | ||
+ | |||
+ | <pre> | ||
+ | 0000: QWORD Offset of section from beginning of file | ||
+ | 0008: QWORD Length of section | ||
+ | </pre> | ||
+ | |||
+ | <p> | ||
+ | |||
+ | Following the header section table is 8 bytes of additional header | ||
+ | data. In Version 2 files, this data is not there and the content | ||
+ | section starts immediately after the directory. | ||
+ | </p> | ||
+ | <pre> | ||
+ | 0000: QWORD Offset within file of content section 0 | ||
+ | </pre> | ||
+ | |||
+ | <h2 id="headersections">The Header Sections</h2> | ||
+ | <h3>Header Section 0</h3> | ||
+ | <p> | ||
+ | This section contains the total size of the file, and not much else | ||
+ | </p> | ||
+ | <pre> | ||
+ | 0000: DWORD $01FE (unknown) | ||
+ | 0004: DWORD 0 (unknown) | ||
+ | 0008: QWORD File Size | ||
+ | 0010: DWORD 0 (unknown) | ||
+ | 0014: DWORD 0 (unknown) | ||
+ | </pre> | ||
+ | |||
+ | <h3>Header Section 1: The Directory Listing</h3> | ||
+ | <p> | ||
+ | The central part of the .chm file: A directory of the files and information it | ||
+ | contains. | ||
+ | </p> | ||
+ | <h4>Directory header</h4> | ||
+ | |||
+ | <p> | ||
+ | The directory starts with a header; its format is as follows: | ||
+ | </p> | ||
+ | <pre> | ||
+ | 0000: char[4] 'ITSP' | ||
+ | 0004: DWORD Version number 1 | ||
+ | 0008: DWORD Length of the directory header | ||
+ | 000C: DWORD $0a (unknown) | ||
+ | 0010: DWORD $1000 Directory chunk size | ||
+ | 0014: DWORD "Density" of quickref section, usually 2. | ||
+ | 0018: DWORD Depth of the index tree | ||
+ | 1 there is no index, 2 if there is one level of PMGI | ||
+ | chunks. | ||
+ | 001C: DWORD Chunk number of root index chunk, -1 if there is none | ||
+ | (though at least one file has 0 despite there being no | ||
+ | index chunk, probably a bug.) | ||
+ | 0020: DWORD Chunk number of first PMGL (listing) chunk | ||
+ | 0024: DWORD Chunk number of last PMGL (listing) chunk | ||
+ | 0028: DWORD -1 (unknown) | ||
+ | 002C: DWORD Number of directory chunks (total) | ||
+ | 0030: DWORD Windows language ID | ||
+ | 0034: GUID {5D02926A-212E-11D0-9DF9-00A0C922E6EC} | ||
+ | 0044: DWORD $54 (This is the length again) | ||
+ | 0048: DWORD -1 (unknown) | ||
+ | 004C: DWORD -1 (unknown) | ||
+ | 0050: DWORD -1 (unknown) | ||
+ | </pre> | ||
+ | <h4>The Listing Chunks</h4> | ||
+ | <p> | ||
+ | The header is directly followed by the directory chunks. There are two | ||
+ | types of directory chunks -- index chunks, and listing chunks. The | ||
+ | index chunk will be omitted if there is only one listing chunk. A | ||
+ | listing chunk has the following format: | ||
+ | </p> | ||
+ | <pre> | ||
+ | 0000: char[4] 'PMGL' | ||
+ | 0004: DWORD Length of free space and/or quickref area at end of | ||
+ | directory chunk | ||
+ | 0008: DWORD Always 0. | ||
+ | 000C: DWORD Chunk number of previous listing chunk when reading | ||
+ | directory in sequence (-1 if this is the first listing chunk) | ||
+ | 0010: DWORD Chunk number of next listing chunk when reading | ||
+ | directory in sequence (-1 if this is the last listing chunk) | ||
+ | 0014: Directory listing entries (to quickref area) Sorted by | ||
+ | filename; the sort is case-insensitive. | ||
+ | </pre> | ||
+ | <p> | ||
+ | The quickref area is written backwards from the end of the chunk. One | ||
+ | quickref entry exists for every n entries in the file, where n is | ||
+ | calculated as 1 + (1 << quickref density). So for density = 2, n = 5. | ||
+ | |||
+ | </p> | ||
+ | <pre> | ||
+ | Chunklen-0002: WORD Number of entries in the chunk | ||
+ | Chunklen-0004: WORD Offset of entry n from entry 0 | ||
+ | Chunklen-0006: WORD Offset of entry 2n from entry 0 | ||
+ | Chunklen-0008: WORD Offset of entry 3n from entry 0 | ||
+ | ... | ||
+ | </pre> | ||
+ | <p> | ||
+ | The format of a directory listing entry is as follows | ||
+ | </p> | ||
+ | <pre> | ||
+ | ENCINT: length of name | ||
+ | BYTEs: name (UTF-8 encoded) | ||
+ | ENCINT: content section | ||
+ | ENCINT: offset | ||
+ | ENCINT: length | ||
+ | </pre> | ||
+ | |||
+ | <p> | ||
+ | The offset is from the beginning of the content section the file is | ||
+ | in, after the section has been decompressed (if appropriate). The | ||
+ | length also refers to length of the file in the section after decompression. | ||
+ | </p> | ||
+ | <p> | ||
+ | There are two kinds of file represented in the directory: user data and | ||
+ | format related files. The files which are format-related have names which begin | ||
+ | with '::', the user data files have names which begin with "/". | ||
+ | |||
+ | </p> | ||
+ | |||
+ | <h4>The Index Chunk</h4> | ||
+ | <p> | ||
+ | An index chunk has the following format | ||
+ | </p> | ||
+ | <pre> | ||
+ | 0000: char[4] 'PMGI' | ||
+ | 0004: DWORD Length of quickref/free area at end of directory chunk | ||
+ | 0008: Directory index entries (to quickref/free area) | ||
+ | </pre> | ||
+ | <p> | ||
+ | The quickref area in an PMGI is the same as in an PMGL | ||
+ | </p> | ||
+ | <p> | ||
+ | The format of a directory index entry is as follows | ||
+ | </p> | ||
+ | <pre> | ||
+ | |||
+ | ENCINT: length of name | ||
+ | BYTEs: name (UTF-8 encoded) | ||
+ | ENCINT: directory listing chunk which starts with name | ||
+ | </pre> | ||
+ | <p> | ||
+ | When higher-level indexes exist (when the depth of the index tree is 3 or | ||
+ | higher), presumably the upper-level indexes will contain the numbers | ||
+ | of lower-level index chunks rather than listing chunks | ||
+ | </p> | ||
+ | |||
+ | <h4>Encoded Integers</h4> | ||
+ | <p> | ||
+ | An ENCINT is a variable-length integer. The high bit of each byte | ||
+ | indicates "continued to the next byte". Bytes are stored most | ||
+ | significant to least significant. So, for example, $EA $15 is | ||
+ | (((0xEA&0x7F)<<7)|0x15) = 0x3515. | ||
+ | </p> | ||
+ | |||
+ | <h2 id="content">The Content</h2> | ||
+ | |||
+ | <p> | ||
+ | In Version 3, the content typically immediately follows the header | ||
+ | sections, and is at the location indicated by the DWORD following the | ||
+ | header section table. In Version 2, the content immediately follows | ||
+ | the header. | ||
+ | |||
+ | All content section 0 locations in the directory are relative to that | ||
+ | point. The other content sections are stored WITHIN content section | ||
+ | 0. | ||
+ | </p> | ||
+ | <h3>The Namelist file</h3> | ||
+ | <p> | ||
+ | There exists in content section 0 and in the directory a file called | ||
+ | "::DataSpace/NameList". This file contains the names of all the | ||
+ | content sections. The format is as follows: | ||
+ | </p> | ||
+ | <pre> | ||
+ | 0000: WORD Length of file, in words | ||
+ | 0002: WORD Number of entries in file | ||
+ | |||
+ | Each entry: | ||
+ | 0000: WORD Length of name in words, excluding terminating null | ||
+ | 0002: WORD Double-byte characters | ||
+ | xxxx: WORD 0 | ||
+ | </pre> | ||
+ | <p> | ||
+ | Yes, the names have a length word AND are null terminated; sort of a | ||
+ | belt-and-suspenders approach. The coding system is likely UTF-16 (little endian). | ||
+ | </p> | ||
+ | <p> | ||
+ | The section names seen so far are | ||
+ | </p> | ||
+ | |||
+ | <ul> | ||
+ | <li>Uncompressed</li> | ||
+ | <li>MSCompressed</li> | ||
+ | </ul> | ||
+ | <p> | ||
+ | "Uncompressed" is self-explanatory. The section "MSCompressed" is | ||
+ | compressed with Microsoft's LZX algorithm. | ||
+ | </p> | ||
+ | <h3>The Section Data</h3> | ||
+ | <p> | ||
+ | For each section other than 0, there exists a file called | ||
+ | '::DataSpace/Storage/<Section Name>/Content'. This file contains the | ||
+ | compressed data for the section. So, conceptually, | ||
+ | getting a file from a nonzero section is a multi-step process. First | ||
+ | you must get the content file from section 0. Then you decompress (if | ||
+ | appropriate) the section. Then you get the desired file from your decompressed section. | ||
+ | </p> | ||
+ | <h3>Other section format-related files</h3> | ||
+ | |||
+ | <p> | ||
+ | There are several other files associated with the sections | ||
+ | </p> | ||
+ | <ul> | ||
+ | <li> | ||
+ | ::DataSpace/Storage/<SectionName>/ControlData | ||
+ | <p> | ||
+ | This file contains $20 bytes of information on the compression. | ||
+ | |||
+ | The information is partially known: | ||
+ | </p> | ||
+ | <pre> | ||
+ | 0000: DWORD Number of DWORDs following 'LZXC', must be 6 if version is 2 | ||
+ | 0004: ASCII 'LZXC' Compression type identifier | ||
+ | 0008: DWORD Version (Must be <=2) | ||
+ | 000C: DWORD The LZX reset interval | ||
+ | 0010: DWORD The window size | ||
+ | 0014: DWORD The cache size | ||
+ | 0018: DWORD 0 (unknown) | ||
+ | </pre> | ||
+ | <p> | ||
+ | Reset interval, window size, and cache size are in bytes if version is | ||
+ | 1, $8000-byte blocks if version is 2. | ||
+ | |||
+ | </p> | ||
+ | </li> | ||
+ | <li>::DataSpace/Storage/<SectionName>/SpanInfo | ||
+ | <p> | ||
+ | This file contains a quadword containing the uncompressed length of | ||
+ | the section. | ||
+ | </p> | ||
+ | </li> | ||
+ | <li>::DataSpace/Storage/<SectionName>/Transform/List | ||
+ | <p> | ||
+ | It appears this file was intended to contain a list of GUIDs belonging | ||
+ | to methods of decompressing (or otherwise transforming) the section. | ||
+ | However, it actually contains only half of the string representation | ||
+ | of a GUID, apparently because it was sized for characters but contains | ||
+ | wide characters. | ||
+ | </p> | ||
+ | </li> | ||
+ | </ul> | ||
+ | |||
+ | <h2 id="compression">Appendix: The Compression</h2> | ||
+ | <p> | ||
+ | The compressed sections are compressed using LZX, a compression | ||
+ | method Microsoft also uses for its cabinet files. To ensure this, check the | ||
+ | second DWORD of compression info in the ControlData file for the | ||
+ | section — it should be 'LZXC'. To decompress, first read the file | ||
+ | "::DataSpace/Storage/<SectionName>/Transform/{7FC28940-9D31-11D0-9B27-00A0C91E9C7C}/InstanceData/ResetTable". | ||
+ | This reset table has the following format | ||
+ | </p> | ||
+ | <pre> | ||
+ | 0000: DWORD 2 unknown (possibly a version number) | ||
+ | 0004: DWORD Number of entries in reset table | ||
+ | 0008: DWORD 8 Size of table entry (bytes) | ||
+ | 000C: DWORD $28 Length of table header (area before table entries) | ||
+ | 0010: QWORD Uncompressed Length | ||
+ | 0018: QWORD Compressed Length | ||
+ | 0020: QWORD 0x8000 block size for locations below | ||
+ | 0028: QWORD 0 (zeroth entry of table) | ||
+ | 0030: QWORD location in compressed data of 1st block boundary in | ||
+ | uncompressed data | ||
+ | |||
+ | Repeat to end of file | ||
+ | </pre> | ||
+ | <p> | ||
+ | Now you can finally obtain the section (from its Content file). The | ||
+ | window size for the LZX compression is 16 (decimal) on all the files | ||
+ | seen so far. This is specified by the DWORD at $10 in the ControlData | ||
+ | file (but note that DWORD gives the window size in 0x8000-byte blocks, | ||
+ | not the LZX code for the window size) | ||
+ | </p> | ||
+ | <p> | ||
+ | |||
+ | The rule that the input bit-stream is to be re-aligned to a 16-bit boundary | ||
+ | after $8000 output characters have been processed IS in effect, | ||
+ | despite this LZX not being part of a CAB file. The reset table tells | ||
+ | you when this was done, though there is no need for that | ||
+ | during decompression; you can just keep track of the number of output | ||
+ | characters. Furthermore, while this does not appear to be documented | ||
+ | in the LZX format, the uncompressed stream is padded to an $8000 | ||
+ | byte boundary. | ||
+ | </p> | ||
+ | <p> | ||
+ | There is one change from LZX as defined by Microsoft: After each | ||
+ | LZX reset interval (defined in the ControlData file, but in | ||
+ | practice equal to the window size) of compressed data is | ||
+ | processed, the LZX state is fully reset, as if an entirely new file | ||
+ | was being encoded. This allows semi-random access to the compressed | ||
+ | data; you can start reading on any reset interval boundary using the | ||
+ | reset interval size and the reset table. | ||
+ | </p> | ||
+ | <p> | ||
+ | <em>Note:</em><br> | ||
+ | Earlier versions of this document stated that the reset interval only | ||
+ | reset the Huffman tables and required outputting the 1-bit header | ||
+ | again. This was erroneous. The Lempel Ziv state is reset as well. In | ||
+ | practice, a decoder works just as well with the incorrect assumption, | ||
+ | but encoding a file with match positions which refer to a time before | ||
+ | the most recent LZX reset causes crashes on decoding. | ||
+ | </p> | ||
+ | <hr /> | ||
+ | <h2>Acknowledgements</h2> | ||
+ | <p>The following people in (no particular order) have submitted | ||
+ | information which has helped correct and close the gaps in this | ||
+ | document.</p> | ||
+ | <ul> | ||
+ | <li>Peter Ferrie (peter_ferrie at hotmail.com) <a | ||
+ | href="http://pferrie.tripod.com">Web Site</a></li> | ||
+ | |||
+ | <li>Pabs (pabs at zip.to) who also runs the <a href="http://savannah.nongnu.org/projects/chmspec">CHM Spec page</a>.</li> | ||
+ | </ul> | ||
+ | <p>And others I have not been able to reach.</p> | ||
+ | <hr /> | ||
+ | <p> | ||
+ | Copyright 2001-2003 Matthew T. Russotto | ||
+ | <br /> | ||
+ | </p> | ||
+ | <p> | ||
+ | You may freely copy and distribute unmodified copies of this file, or | ||
+ | copies where the only modification is a change in line endings, | ||
+ | padding after the html end tag, coding system, or any combination | ||
+ | thereof. The original is in ASCII with Unix line endings. | ||
+ | </p> | ||
+ | |||
+ | </html> | ||
+ | ===== 参考 ===== | ||
+ | * http://www.russotto.net/chm/chmformat.html |