Differences

This shows the differences between the revision you selected and the current version.

@@ Line 1: / Line 1: @@
+====== StarDict Dictionary Format ======
+<code>
+Format for StarDict dictionary files
+------------------------------------
+StarDict homepage: http://stardict.sourceforge.net
+StarDict on-line dictionary: http://www.stardict.org
+{0}. Number and Byte-order Conventions
+When you record the numbers that identify sizes, offsets, etc., you
+should use 32-bit numbers, such as you might represent with a glong.
+In order to make StarDict work on different platforms, these numbers
+must be in network byte order.  You can ensure the correct byte order
+by using the g_htonl() function when creating dictionary files.
+Conversely, you should use g_ntohl() when reading dictionary files.
+Strings should be encoded in UTF-8.
+{1}. Files
+Every dictionary consists of these files:
+(1). somedict.ifo
+(2). somedict.idx or somedict.idx.gz
+(3). somedict.dict or somedict.dict.dz
+(4). somedict.syn (optional)
+You can use gzip -9 to compress the .idx file. If the .idx file is not
+compressed, it loads quickly and uses less memory. If it is compressed,
+StarDict loads the whole .idx file into memory, which makes queries
+faster.
+You can use dictzip to compress the .dict file.
+"dictzip" uses the same compression algorithm and file format as does gzip,
+but provides a table that can be used to randomly access compressed blocks
+in the file.  The use of 50-64kB blocks for compression typically degrades
+compression by less than 10%, while maintaining acceptable random access
+capabilities for all data in the file.  As an added benefit, files
+compressed with dictzip can be decompressed with gunzip.
+For more information about dictzip, see the DICT project:
+http://www.dict.org
+When you create a dictionary, you should normally use .idx and .dict.dz.
+StarDict searches for the .ifo file, then opens the .idx or .idx.gz file
+and the .dict.dz or .dict file that are in the same directory and have
+the same base name.
+{2}. The ".ifo" file's format.
+The .ifo file has the following format:
+StarDict's dict ifo file
+version=2.4.2
+[options]
+Note that the current "version" string must be "2.4.2" or "3.0.0".  If it's not,
+then StarDict will refuse to read the file.
+If version is "3.0.0", StarDict will parse the "idxoffsetbits" option.
+[options]
+---------
+In the example above, [options] expands to any of the following lines
+specifying information about the dictionary.  Each option is a keyword
+followed by an equal sign, then the value of that option, then a
+newline.  The options may appear in any order.
+Note that the dictionary must have at least a bookname, a wordcount and an
+idxfilesize, or the load will fail.  All other information is optional.  All
+strings should be encoded in UTF-8.
+Available options:
+bookname=      // required
+wordcount=     // required
+synwordcount=  // required if ".syn" file exists.
+idxfilesize=   // required
+idxoffsetbits= // New in 3.0.0
+author=
+email=
+website=
+description=	// You can use <br> for new line.
+date=
+sametypesequence= // very important.
+wordcount is the number of word entries in the .idx file; it must be correct.
+idxfilesize is the size (in bytes) of the .idx file. Even if the .idx file is
+compressed to a .idx.gz file, this entry must record the original .idx file's
+size, and it must be correct too. The .gz file doesn't contain its original
+size information, but knowing the original size speeds up extraction to
+memory, as you don't need to call realloc() many times.
+idxoffsetbits can be 64 or 32. If "idxoffsetbits=64", the offset field of the
+.idx file will be 64 bits.
+The "sametypesequence" option is described in further detail below.
+***
+sametypesequence
+You should first familiarize yourself with the .dict file format
+described in the next section so that you can understand what effect
+this option has on the .dict file.
+If the sametypesequence option is set, it tells StarDict that each
+word's data in the .dict file will have the same sequence of datatypes.
+In this case, we expect a .dict file that's been optimized in two
+ways: the type identifiers should be omitted, and the size marker for
+the last data entry of each word should be omitted.
+Let's consider some concrete examples of the sametypesequence option.
+Suppose that a dictionary records many .wav files, and so sets:
+        sametypesequence=W
+In this case, each word's entry in the .dict file consists solely of a
+wav file.  In the .dict file, you would leave out the 'W' character
+before each entry, and you would also omit the 32-bit integer at the
+front of each .wav entry that would normally give the entry's length.
+You can do this since the length is known from the information in the
+idx file.
+As another example, suppose a dictionary contains phonetic information
+and a meaning for each word.  The sametypesequence option for this
+dictionary would be:
+        sametypesequence=tm
+Once again, you can omit the 't' and 'm' characters before each data
+entry in the .dict file.  In addition, you should omit the terminating
+'\0' for the 'm' entry for each word in the .dict file, as the length
+of the meaning string can be inferred from the length of the phonetic
+string (still indicated by a terminating '\0') and the length of the
+entire word entry (listed in the .idx file).
+So for cases where the last data entry for each word normally requires
+a terminating '\0' character, you should omit this character in the
+dict file.  And for cases where the last data entry for each word
+normally requires an initial 32-bit number giving the length of the
+field (such as WAV and PNG entries), you must omit this number in the
+dictionary.
+Every dictionary should try to use the sametypesequence feature to
+save disk space.
+***
+{3}. The ".idx" file's format.
+The .idx file is just a word list.
+The word list is a sorted list of word entries.
+Each entry in the word list contains three fields, one after the other:
+     word_str;  // a utf-8 string terminated by '\0'.
+     word_data_offset;  // word data's offset in .dict file
+     word_data_size;  // word data's total size in .dict file
+word_str gives the string representing this word.  It's the string
+that is "looked up" by StarDict.
+Two or more entries may have the same "word_str" with different
+word_data_offset and word_data_size. This may be useful for some
+dictionaries. But this feature is only well supported by
+StarDict-2.4.8 and newer.
+The length of "word_str" should be less than 256. In other words,
+(strlen(word) < 256).
+If the version is "3.0.0" and "idxoffsetbits=64", word_data_offset will
+be a 64-bit unsigned number in network byte order. Otherwise it will be
+32 bits.
+word_data_size should be a 32-bit unsigned number in network byte order.
+Different word_str values may have the same word_data_offset and
+word_data_size, so that multiple index entries point to the same definition.
+But this is not recommended. When multiple words have the same definition,
+you can create a ".syn" file for them; see section 4 below.
+The word list must be sorted by calling stardict_strcmp() on the "word_str"
+fields.  If the word list order is wrong, StarDict will fail to function
+correctly!
+============
+gint stardict_strcmp(const gchar *s1, const gchar *s2)
+{
+	gint a;
+	a = g_ascii_strcasecmp(s1, s2);
+	if (a == 0)
+		return strcmp(s1, s2);
+	else
+		return a;
+}
+============
+g_ascii_strcasecmp() is a glib function:
+Unlike the BSD strcasecmp() function, this only recognizes standard
+ASCII letters and ignores the locale, treating all non-ASCII characters
+as if they are not letters.
+stardict_strcmp() works fine with English characters, but it sorts
+characters from other locales poorly. In that case, you can enable
+the collation feature; see section 6.
+{4}. The ".syn" file's format.
+This file is optional, and note that a tree dictionary does not need it.
+Only StarDict-2.4.8 and newer support this file.
+The .syn file contains information about synonyms: when you input a
+synonym, StarDict will look up the word that it refers to.
+The format is simple. Each item contains one string and one number:
+synonym_word;  // a utf-8 string terminated by '\0'.
+original_word_index; // original word's index in .idx file.
+Further items follow without any separator.
+When you input synonym_word, StarDict will search for original_word.
+The length of "synonym_word" should be less than 256. In other
+words, (strlen(word) < 256).
+original_word_index is a 32-bit unsigned number in network byte order.
+Two or more items may have the same "synonym_word" with different
+original_word_index.
+The items must be sorted by synonym_word using stardict_strcmp().
+{5}. The offset cache file's format.
+StarDict-2.4.8 and newer support cache files. This feature speeds up
+loading and saves memory, because the cache file is accessed with mmap().
+The cache file names are .idx.oft and .syn.oft. The format is:
+First a utf-8 string terminated by '\0', then many 32-bit numbers that
+form the word offset index. This index is sparse, with "ENTR_PER_PAGE=32",
+and the numbers are not stored in network byte order.
+The string must begin with:
+=====
+StarDict's oft file
+version=2.4.8
+=====
+Then a line like this:
+url=/usr/share/stardict/dic/stardict-somedict-2.4.2/somedict.idx
+This line should end with '\n'.
+StarDict will first try to create the .oft file in the same directory as
+the .ifo file. If that fails, it will try to create it in
+~/.cache/stardict/, where ~/.cache is obtained from g_get_user_cache_dir().
+If two or more dictionaries have the same file name, StarDict will
+create somedict.idx.oft, somedict(2).idx.oft, somedict(3).idx.oft,
+etc. for them respectively, each with a different "url=" in the
+beginning string.
+{6}. The collation file's format.
+StarDict-2.4.8 and newer support collation, which sorts the word
+list with a collate function. StarDict creates collation files
+named .idx.clt and .syn.clt. The format is similar to that of the
+offset cache file:
+First a utf-8 string terminated by '\0', then many 32-bit numbers that
+form the index sorted by the collate function. They are not stored
+in network byte order.
+The string must begin with:
+=====
+StarDict's clt file
+version=2.4.8
+=====
+Then two lines like this:
+url=/usr/share/stardict/dic/stardict-somedict-2.4.2/somedict.idx
+func=0
+The second line should end with '\n' too.
+StarDict currently supports these collate functions:
+typedef enum {
+	UTF8_GENERAL_CI = 0,
+	UTF8_UNICODE_CI,
+	UTF8_BIN,
+	UTF8_CZECH_CI,
+	UTF8_DANISH_CI,
+	UTF8_ESPERANTO_CI,
+	UTF8_ESTONIAN_CI,
+	UTF8_HUNGARIAN_CI,
+	UTF8_ICELANDIC_CI,
+	UTF8_LATVIAN_CI,
+	UTF8_LITHUANIAN_CI,
+	UTF8_PERSIAN_CI,
+	UTF8_POLISH_CI,
+	UTF8_ROMAN_CI,
+	UTF8_ROMANIAN_CI,
+	UTF8_SLOVAK_CI,
+	UTF8_SLOVENIAN_CI,
+	UTF8_SPANISH_CI,
+	UTF8_SPANISH2_CI,
+	UTF8_SWEDISH_CI,
+	UTF8_TURKISH_CI,
+	COLLATE_FUNC_NUMS
+} CollateFunctions;
+These UTF8_*_CI functions in fact come from MySQL.
+The file is located in the same way as the .oft file.
+Note that for a "somedict.idx.gz" file, the corresponding collation
+file is somedict.idx.clt, not somedict.idx.gz.clt, and the
+"url=" is somedict.idx, not somedict.idx.gz. So after you gzip
+the .idx file, StarDict does not need to create the .clt file again.
+{7}. The ".dict" file's format.
+The .dict file is a pure data sequence, as the offset and size of each
+word is recorded in the corresponding .idx file.
+If the "sametypesequence" option is not used in the .ifo file, then
+the .dict file has fields in the following order:
+==============
+word_1_data_1_type; // a single char identifying the data type
+word_1_data_1_data; // the data
+word_1_data_2_type;
+word_1_data_2_data;
+...... // the number of data entries for each word is determined by
+       // word_data_size in .idx file
+word_2_data_1_type;
+word_2_data_1_data;
+......
+==============
+It's important to note that each field in each word indicates its
+own length, as described below.  The number of possible fields per
+word is also not fixed, and is determined by simply reading data until
+you've read word_data_size bytes for that word.
+Suppose the "sametypesequence" option is used in the .ifo file, and
+the option is set like this:
+sametypesequence=tm
+Then the .dict file will look like this:
+==============
+word_1_data_1_data
+word_1_data_2_data
+word_2_data_1_data
+word_2_data_2_data
+......
+==============
+The first data entry for each word will have a terminating '\0', but
+the second entry will not have a terminating '\0'.  The omissions of
+the type chars and of the last field's size information are the
+optimizations required by the "sametypesequence" option described
+above.
+If "idxoffsetbits=64", the .dict file will be larger than 4G. Because we
+often need to mmap this large file, and a process on a 32-bit computer has
+a 4G maximum virtual memory space, the mapping would fail. So an
+"idxoffsetbits=64" dictionary cannot in fact be loaded on a 32-bit machine;
+StarDict will simply print a warning in this case when loading. 64-bit
+computers should not have this limit.
+Type identifiers
+----------------
+Here are the single-character type identifiers that may be used with
+the "sametypesequence" option in the .ifo file, or may appear in the
+dict file itself if the "sametypesequence" option is not used.
+Lower-case characters signify that a field's size is determined by a
+terminating '\0', while upper-case characters indicate that the data
+begins with a network byte-ordered guint32 that gives the size of the
+data that follows (NOT the whole size, which is 4 bytes larger).
+'m'
+Word's pure text meaning.
+The data should be a utf-8 string ending with '\0'.
+'l'
+Word's pure text meaning.
+The data is NOT a utf-8 string, but is instead a string in locale
+encoding, ending with '\0'.  Sometimes using this type will save disk
+space, but its use is discouraged.
+'g'
+A utf-8 string which is marked up with the Pango text markup language.
+For more information about this markup language, see the "Pango
+Reference Manual."
+You might have it installed locally at:
+file:///usr/share/gtk-doc/html/pango/PangoMarkupFormat.html
+'t'
+English phonetic string.
+The data should be a utf-8 string ending with '\0'.
+Here are some utf-8 phonetic characters:
+θʃŋʧðʒæıʌʊɒɛəɑɜɔˌˈːˑṃṇḷ
+æɑɒʌәєŋvθðʃʒɚːɡˏˊˋ
+'x'
+A utf-8 string which is marked up with the xdxf language.
+See http://xdxf.sourceforge.net
+StarDict has these extensions:
+<rref> can have "type" attribute, it can be "image", "sound", "video"
+and "attach".
+<kref> can have "k" attribute.
+'y'
+Chinese YinBiao or Japanese KANA.
+The data should be a utf-8 string ending with '\0'.
+'k'
+KingSoft PowerWord's data. The data is a utf-8 string ending with '\0'.
+It is in XML format.
+'w'
+MediaWiki markup language.
+See http://meta.wikimedia.org/wiki/Help:Editing#The_wiki_markup
+'h'
+HTML code.
+'r'
+Resource file list.
+The content can be:
+img:pic/example.jpg	// Image file
+snd:apple.wav		// Sound file
+vdo:film.avi		// Video file
+att:file.bin		// Attachment file
+More than one line is supported as a list of available files.
+StarDict will find the files in the Resource Storage.
+The image will be shown, the sound file will have a play button.
+You can "save as" the attachment file and so on.
+'W'
+wav file.
+The data begins with a network byte-ordered guint32 to identify the wav
+file's size, immediately followed by the file's content.
+'P'
+Picture file.
+The data begins with a network byte-ordered guint32 to identify the picture
+file's size, immediately followed by the file's content.
+'X'
+This type identifier is reserved for experimental extensions.
+{8}. Resource Storage
+Resource Storage stores the external files named in an 'r' resource file
+list, the images referenced in HTML code, and the images, media and other
+files referenced in wiki tags.
+It has two forms:
+1. Direct directory and files in the "res" sub-directory.
+2. The res.rifo, res.ridx and res.rdic database.
+Direct files may have file name encoding problems, as Linux uses UTF-8 and
+Windows uses the local encoding, so you had better use only ASCII file names,
+or use the database to store UTF-8 file names.
+The database may need to extract a file (such as a .wav file) to a temporary
+file, so it is less efficient than direct files. But the database has the
+advantage of compression.
+You can convert between the res directory and the res database with
+the dir2resdatabse and resdatabase2dir tools.
+StarDict will try to load the storage database first, then try the direct
+files form.
+The format of the res.rifo file:
+StarDict's storage ifo file
+version=3.0.0
+filecount=	// required.
+idxoffsetbits=	// optional.
+The format of the res.ridx file:
+filename;	// A string ending with '\0'.
+offset;		// 32 or 64 bits unsigned number in network byte order.
+size;		// 32 bits unsigned number in network byte order.
+filename can include a path too, such as "pic/example.png". filename is
+case sensitive, and no two entries may have the same filename.
+If "idxoffsetbits=64", then offset is 64 bits.
+These three items are repeated for each entry.
+The entries are sorted by the filename field using strcmp().
+It is possible that different filenames have the same offset and size.
+The format of the res.rdic file:
+It is just the concatenation of all the resource files.
+You can dictzip this file as res.rdic.dz.
+{9}. Tree Dictionary
+The tree dictionary support is used for information viewing, etc.
+A tree dictionary contains three files: sometreedict.ifo, sometreedict.tdx.gz
+and sometreedict.dict.dz.
+It is better to compress the .tdx file, as it is always loaded into memory.
+The .ifo file has the following format:
+StarDict's treedict ifo file
+version=2.4.2
+[options]
+Available options:
+bookname=      // required
+tdxfilesize=   // required
+wordcount=
+author=
+email=
+website=
+description=
+date=
+sametypesequence=
+wordcount is only used for info view in the dict manage dialog, so it is not
+important in tree dictionary.
+The .tdx file is just the word list.
+-----------
+The word list is a tree list of word entries.
+Each entry in the word list contains four fields, one after the other:
+     word_str;  // a utf-8 string terminated by '\0'.
+     word_data_offset;  // word data's offset in .dict file
+     word_data_size;  // word data's total size in .dict file. it can be 0.
+     word_subentry_count; //how many sub word this entry has, 0 means none.
+Subentries immediately follow their parent entry. This makes the order the
+same as that of a tree list with all its nodes expanded, read from top to
+bottom.
+word_data_offset, word_data_size and word_subentry_count should be 32-bit
+unsigned numbers in network byte order.
+The .dict file's format is the same as the normal dictionary.
+{10}. More information.
+You can read "src/lib.cpp", "src/dictmanagedlg.cpp" and
+"src/tools/*.cpp" for more information.
+After you have built a dictionary, you can use "stardict_verify" to verify the
+dictionary files. You can find it at "src/tools/".
+If you have any questions, email me. :)
+Thanks to Will Robinson <wsr23@stanford.edu> for cleaning up this file's
+English.
+Hu Zheng <huzheng_001@163.com>
+http://forlinux.yeah.net
+.4.24
+</code>
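The .idx layout from section 3 can be illustrated with a short Python sketch. It is not taken from the StarDict sources: the names parse_idx and stardict_key are hypothetical, and the input is assumed to be an uncompressed .idx file already read into memory. The sort key is intended to give the same order as stardict_strcmp(), on the assumption that both g_ascii_strcasecmp() and strcmp() compare bytes as unsigned values.

```python
import struct


def parse_idx(data, offset_bits=32):
    """Return a list of (word, offset, size) tuples from raw .idx bytes.

    offset_bits mirrors the "idxoffsetbits" option in the .ifo file.
    """
    # ">" selects network (big-endian) byte order with no padding.
    fmt = ">QI" if offset_bits == 64 else ">II"
    tail = struct.calcsize(fmt)
    entries = []
    pos = 0
    while pos < len(data):
        end = data.index(b"\0", pos)          # word_str ends at the first '\0'
        word = data[pos:end].decode("utf-8")
        offset, size = struct.unpack_from(fmt, data, end + 1)
        entries.append((word, offset, size))
        pos = end + 1 + tail
    return entries


def stardict_key(word):
    """Sort key meant to match stardict_strcmp(): ASCII-only case folding
    first, then a plain byte comparison to break ties."""
    raw = word.encode("utf-8")
    # bytes.lower() only folds ASCII letters, like g_ascii_strcasecmp().
    return (raw.lower(), raw)
```

A writer could sort its word list with sorted(words, key=stardict_key) before emitting entries, and a reader could pass offset_bits=64 when the .ifo file sets "idxoffsetbits=64".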
+Reference:
+  * http://stardict.sourceforge.net/StarDictFileFormat
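The field rules from section 7 and the "sametypesequence" optimizations can be sketched in the same way. This is a minimal illustration, not StarDict's own code: split_entry and _read_field are hypothetical names, and data is assumed to hold exactly the word_data_size bytes that the .idx entry points at.

```python
import struct


def _read_field(data, pos, kind, is_last_of_sequence):
    """Read one field starting at pos; return (payload, next_pos)."""
    if is_last_of_sequence:
        # With sametypesequence, the last field has no '\0' and no size
        # prefix: it simply runs to the end of the word's data.
        return data[pos:], len(data)
    if kind.isupper():
        # Upper-case types start with a network byte-ordered 32-bit size.
        (size,) = struct.unpack_from(">I", data, pos)
        start = pos + 4
        return data[start:start + size], start + size
    # Lower-case types are terminated by '\0'.
    end = data.index(b"\0", pos)
    return data[pos:end], end + 1


def split_entry(data, sametypesequence=""):
    """Split one word's raw .dict bytes into (type, payload) pairs."""
    fields = []
    pos = 0
    if sametypesequence:
        last = len(sametypesequence) - 1
        for i, kind in enumerate(sametypesequence):
            payload, pos = _read_field(data, pos, kind, i == last)
            fields.append((kind, payload))
        return fields
    # No sametypesequence: each field is preceded by its type character.
    while pos < len(data):
        kind = chr(data[pos])
        payload, pos = _read_field(data, pos + 1, kind, False)
        fields.append((kind, payload))
    return fields
```

For a dictionary with sametypesequence=tm, split_entry(data, "tm") would return the phonetic string and the meaning as separate byte strings.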
