差别

这里会显示出您选择的修订版和当前版本之间的差别。

@@ 行 1: / 行 1: @@
+====== CHM文件的文件格式 ======
+<html>
+<h1> Microsoft's HTML Help (.chm) format</h1>
+<h2 id="preface"> Preface</h2>
+<p>
+This is documentation on the .chm format used by Microsoft HTML Help.
+This format has been reverse engineered in the past, but as far as I
+know this is the first freely available documentation on it.  One
+Usenet message indicates that these .chm files are actually IStorage
+files documented in the Microsoft Platform SDK.  However, I have not
+been able to locate such documentation.
+</p>
+<h3>Note</h3>
+<p>
+The word "section" is badly overloaded in this document.  Sorry about
+that.</p>
+<p>
+All numbers are in hexadecimal unless otherwise indicated in the
+text.  Except in tabular listings, this will be indicated by
+$ or 0x as appropriate.  All values within the file are Intel byte
+order (little endian) unless indicated otherwise.
+</p>
+<h2 id="overview"> The overall format of a .chm file</h2>
+<p>
+The .chm file begins with a short ($38 byte) initial header.  This is
+followed by the header section table and the offset to the
+content. Collectively, this is the "header".
+</p>
+<p>
+The header is followed by the header sections.  There are two header
+sections.  One header section is the file directory, the other
+contains the file length and some unknown data. Immediately following
+the header sections is the content.
+</p>
+<h2 id="header"> The Header</h2>
+<p>
+The header starts with the initial header, which has the following format
+</p>
+<pre>
+: char[4]  'ITSF'
+: DWORD    3 (Version number)
+: DWORD    Total header length, including header section table and
+               following data.
+C: DWORD    1 (unknown)
+: DWORD    a timestamp.
+               Considered as a big-endian DWORD, it appears to contain
+               seconds (MSB) and fractional seconds (second byte).
+	       The third and fourth bytes may contain even more fractional
+               bits.  The 4 least significant bits in the last byte are
+               constant.
+: DWORD    Windows Language ID.  The two I've seen
+               $0409 = LANG_ENGLISH/SUBLANG_ENGLISH_US
+               $0407 = LANG_GERMAN/SUBLANG_GERMAN
+: GUID     {7C01FD10-7BAA-11D0-9E0C-00A0-C922-E6EC}
+: GUID     {7C01FD11-7BAA-11D0-9E0C-00A0-C922-E6EC}
+</pre>
+<p>Note: a GUID is $10 bytes, arranged as 1 DWORD, 2 WORDs, and 8 BYTEs.</p>
+<p>
+It is followed by the header section table, which is 2 entries, where
+each entry is $10 bytes long and has this format:
+</p>
+<pre>
+: QWORD    Offset of section from beginning of file
+: QWORD    Length of section
+</pre>
+<p>
+Following the header section table is 8 bytes of additional header
+data.  In Version 2 files, this data is not there and the content
+section starts immediately after the directory.
+</p>
+<pre>
+: QWORD    Offset within file of content section 0
+</pre>
+<h2 id="headersections">The Header Sections</h2>
+<h3>Header Section 0</h3>
+<p>
+This section contains the total size of the file, and not much else
+</p>
+<pre>
+: DWORD    $01FE (unknown)
+: DWORD    0 (unknown)
+: QWORD    File Size
+: DWORD    0 (unknown)
+: DWORD    0 (unknown)
+</pre>
+<h3>Header Section 1: The Directory Listing</h3>
+<p>
+The central part of the .chm file: A directory of the files and information it
+contains.
+</p>
+<h4>Directory header</h4>
+<p>
+The directory starts with a header; its format is as follows:
+</p>
+<pre>
+: char[4]  'ITSP'
+: DWORD    Version number 1
+: DWORD    Length of the directory header
+C: DWORD    $0a (unknown)
+: DWORD    $1000    Directory chunk size
+: DWORD    "Density" of quickref section, usually 2.
+: DWORD    Depth of the index tree
+there is no index, 2 if there is one level of PMGI
+	       chunks.
+C: DWORD    Chunk number of root index chunk, -1 if there is none
+               (though at least one file has 0 despite there being no
+	       index chunk, probably a bug.)
+: DWORD    Chunk number of first PMGL (listing) chunk
+: DWORD    Chunk number of last PMGL (listing) chunk
+: DWORD    -1 (unknown)
+C: DWORD    Number of directory chunks (total)
+: DWORD    Windows language ID
+: GUID     {5D02926A-212E-11D0-9DF9-00A0C922E6EC}
+: DWORD    $54 (This is the length again)
+: DWORD    -1 (unknown)
+C: DWORD    -1 (unknown)
+: DWORD    -1 (unknown)
+</pre>
+<h4>The Listing Chunks</h4>
+<p>
+The header is directly followed by the directory chunks.  There are two
+types of directory chunks -- index chunks, and listing chunks.  The
+index chunk will be omitted if there is only one listing chunk.  A
+listing chunk has the following format:
+</p>
+<pre>
+: char[4]  'PMGL'
+: DWORD    Length of free space and/or quickref area at end of
+               directory chunk
+: DWORD    Always 0.
+C: DWORD    Chunk number of previous listing chunk when reading
+               directory in sequence (-1 if this is the first listing chunk)
+: DWORD    Chunk number of next listing chunk when reading
+               directory in sequence (-1 if this is the last listing chunk)
+: Directory listing entries (to quickref area)  Sorted by
+      filename; the sort is case-insensitive.
+</pre>
+<p>
+The quickref area is written backwards from the end of the chunk.  One
+quickref entry exists for every n entries in the file, where n is
+calculated as 1 + (1 << quickref density).  So for density = 2, n = 5.
+</p>
+<pre>
+Chunklen-0002: WORD     Number of entries in the chunk
+Chunklen-0004: WORD     Offset of entry n from entry 0
+Chunklen-0006: WORD     Offset of entry 2n from entry 0
+Chunklen-0008: WORD     Offset of entry 3n from entry 0
+...
+</pre>
+<p>
+The format of a directory listing entry is as follows
+</p>
+<pre>
+      ENCINT: length of name
+      BYTEs: name  (UTF-8 encoded)
+      ENCINT: content section
+      ENCINT: offset
+      ENCINT: length
+</pre>
+<p>
+The offset is from the beginning of the content section the file is
+in, after the section has been decompressed (if appropriate).  The
+length also refers to length of the file in the section after decompression.
+</p>
+<p>
+There are two kinds of file represented in the directory: user data and
+format related files.  The files which are format-related have names which begin
+with '::', the user data files have names which begin with "/".
+</p>
+<h4>The Index Chunk</h4>
+<p>
+An index chunk has the following format
+</p>
+<pre>
+: char[4]  'PMGI'
+: DWORD    Length of quickref/free area at end of directory chunk
+: Directory index entries (to quickref/free area)
+</pre>
+<p>
+The quickref area in an PMGI is the same as in an PMGL
+</p>
+<p>
+The format of a directory index entry is as follows
+</p>
+<pre>
+      ENCINT: length of name
+      BYTEs: name  (UTF-8 encoded)
+      ENCINT: directory listing chunk which starts with name
+</pre>
+<p>
+When higher-level indexes exist (when the depth of the index tree is 3 or
+higher), presumably the upper-level indexes will contain the numbers
+of lower-level index chunks rather than listing chunks
+</p>
+<h4>Encoded Integers</h4>
+<p>
+An ENCINT is a variable-length integer.  The high bit of each byte
+indicates "continued to the next byte".  Bytes are stored most
+significant to least significant.  So, for example, $EA $15 is
+(((0xEA&amp;0x7F)&lt;&lt;7)|0x15) = 0x3515.
+</p>
+<h2 id="content">The Content</h2>
+<p>
+In Version 3, the content typically immediately follows the header
+sections, and is at the location indicated by the DWORD following the
+header section table. In Version 2, the content immediately follows
+the header.
+All content section 0 locations in the directory are relative to that
+point.  The other content sections are stored WITHIN content section
+.
+</p>
+<h3>The Namelist file</h3>
+<p>
+There exists in content section 0 and in the directory a file called
+"::DataSpace/NameList".  This file contains the names of all the
+content sections.  The format is as follows:
+</p>
+<pre>
+: WORD     Length of file, in words
+: WORD     Number of entries in file
+Each entry:
+: WORD     Length of name in words, excluding terminating null
+: WORD     Double-byte characters
+xxxx: WORD     0
+</pre>
+<p>
+Yes, the names have a length word AND are null terminated; sort of a
+belt-and-suspenders approach.  The coding system is likely UTF-16 (little endian).
+</p>
+<p>
+The section names seen so far are
+</p>
+<ul>
+<li>Uncompressed</li>
+<li>MSCompressed</li>
+</ul>
+<p>
+"Uncompressed" is self-explanatory.  The section "MSCompressed" is
+compressed with Microsoft's LZX algorithm.
+</p>
+<h3>The Section Data</h3>
+<p>
+For each section other than 0, there exists a file called
+'::DataSpace/Storage/&lt;Section Name&gt;/Content'.  This file contains the
+compressed data for the section.  So, conceptually,
+getting a file from a nonzero section is a multi-step process.  First
+you must get the content file from section 0.  Then you decompress (if
+appropriate) the section.  Then you get the desired file from your decompressed section.
+</p>
+<h3>Other section format-related files</h3>
+<p>
+There are several other files associated with the sections
+</p>
+<ul>
+<li>
+::DataSpace/Storage/&lt;SectionName&gt;/ControlData
+<p>
+This file contains $20 bytes of information on the compression.
+The information is partially known:
+</p>
+<pre>
+: DWORD    Number of DWORDs following 'LZXC', must be 6 if version is 2
+: ASCII    'LZXC'  Compression type identifier
+: DWORD    Version (Must be <=2)
+C: DWORD    The LZX reset interval
+: DWORD    The window size
+: DWORD    The cache size
+: DWORD    0 (unknown)
+</pre>
+<p>
+Reset interval, window size, and cache size are in bytes if version is
+, $8000-byte blocks if version is 2.
+</p>
+</li>
+<li>::DataSpace/Storage/&lt;SectionName&gt;/SpanInfo
+<p>
+This file contains a quadword containing the uncompressed length of
+the section.
+</p>
+</li>
+<li>::DataSpace/Storage/&lt;SectionName&gt;/Transform/List
+<p>
+It appears this file was intended to contain a list of GUIDs belonging
+to methods of decompressing (or otherwise transforming) the section.
+However, it actually contains only half of the string representation
+of a GUID, apparently because it was sized for characters but contains
+wide characters.
+</p>
+</li>
+</ul>
+<h2 id="compression">Appendix: The Compression</h2>
+<p>
+The compressed sections are compressed using LZX, a compression
+method Microsoft also uses for its cabinet files.  To ensure this, check the
+second DWORD of compression info in the ControlData file for the
+section &mdash; it should be 'LZXC'.  To decompress, first read the file
+"::DataSpace/Storage/&lt;SectionName&gt;/Transform/{7FC28940-9D31-11D0-9B27-00A0C91E9C7C}/InstanceData/ResetTable".
+This reset table has the following format
+</p>
+<pre>
+: DWORD    2     unknown (possibly a version number)
+: DWORD    Number of entries in reset table
+: DWORD    8     Size of table entry (bytes)
+C: DWORD    $28   Length of table header (area before table entries)
+: QWORD    Uncompressed Length
+: QWORD    Compressed Length
+: QWORD    0x8000 block size for locations below
+: QWORD    0 (zeroth entry of table)
+: QWORD    location in compressed data of 1st block boundary in
+               uncompressed data
+Repeat to end of file
+</pre>
+<p>
+Now you can finally obtain the section (from its Content file).  The
+window size for the LZX compression is 16 (decimal) on all the files
+seen so far.  This is specified by the DWORD at $10 in the ControlData
+file (but note that DWORD gives the window size in 0x8000-byte blocks,
+not the LZX code for the window size)
+</p>
+<p>
+The rule that the input bit-stream is to be re-aligned to a 16-bit boundary
+after $8000 output characters have been processed IS in effect,
+despite this LZX not being part of a CAB file.  The reset table tells
+you when this was done, though there is no need for that
+during decompression; you can just keep track of the number of output
+characters.  Furthermore, while this does not appear to be documented
+in the LZX format, the uncompressed stream is padded to an $8000
+byte boundary.
+</p>
+<p>
+There is one change from LZX as defined by Microsoft: After each
+LZX reset interval (defined in the ControlData file, but in
+practice equal to the window size) of compressed data is
+processed, the LZX state is fully reset, as if an entirely new file
+was being encoded.  This allows semi-random access to the compressed
+data; you can start reading on any reset interval boundary using the
+reset interval size and the reset table.
+</p>
+<p>
+<em>Note:</em><br>
+Earlier versions of this document stated that the reset interval only
+reset the Huffman tables and required outputting the 1-bit header
+again.  This was erroneous.  The Lempel Ziv state is reset as well.  In
+practice, a decoder works just as well with the incorrect assumption,
+but encoding a file with match positions which refer to a time before
+the most recent LZX reset causes crashes on decoding.
+</p>
+<hr />
+<h2>Acknowledgements</h2>
+<p>The following people in (no particular order) have submitted
+information which has helped correct and close the gaps in this
+document.</p>
+<ul>
+<li>Peter Ferrie (peter_ferrie at hotmail.com) <a
+href="http://pferrie.tripod.com">Web Site</a></li>
+<li>Pabs (pabs at zip.to) who also runs the <a href="http://savannah.nongnu.org/projects/chmspec">CHM Spec page</a>.</li>
+</ul>
+<p>And others I have not been able to reach.</p>
+<hr />
+<p>
+Copyright 2001-2003 Matthew T. Russotto
+<br />
+</p>
+<p>
+You may freely copy and distribute unmodified copies of this file, or
+copies where the only modification is a change in line endings,
+padding after the html end tag, coding system, or any combination
+thereof.  The original is in ASCII with Unix line endings.
+</p>
+</html>
+===== 参考 =====
+  * http://www.russotto.net/chm/chmformat.html

Python 俱乐部

用户工具

站点工具

差别

页面工具