Diffstat (limited to 'doc')
-rw-r--r-- doc/algorithm.txt 209
-rw-r--r-- doc/rfc1950.txt 619
-rw-r--r-- doc/rfc1951.txt 955
-rw-r--r-- doc/rfc1952.txt 675
-rw-r--r-- doc/txtvsbin.txt 107
5 files changed, 2565 insertions, 0 deletions
diff --git a/doc/algorithm.txt b/doc/algorithm.txt
new file mode 100644
index 0000000..b022dde
--- /dev/null
+++ b/doc/algorithm.txt
@@ -0,0 +1,209 @@
1. Compression algorithm (deflate)

The deflation algorithm used by gzip (also zip and zlib) is a variation of
LZ77 (Lempel-Ziv 1977, see reference below). It finds duplicated strings in
the input data. The second occurrence of a string is replaced by a
pointer to the previous string, in the form of a pair (distance,
length). Distances are limited to 32K bytes, and lengths are limited
to 258 bytes. When a string does not occur anywhere in the previous
32K bytes, it is emitted as a sequence of literal bytes. (In this
description, `string' must be taken as an arbitrary sequence of bytes,
and is not restricted to printable characters.)
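
A decoder can expand one (distance, length) pair with a plain byte-by-byte
copy. The following is a minimal sketch in C, not code taken from zlib; the
name lz77_copy, its argument list, and its bounds checks are assumptions made
for illustration.

```c
#include <stddef.h>

#define MAX_DISTANCE 32768u   /* 32K limit from the text above */
#define MAX_LENGTH   258u     /* length limit from the text above */

/* Hypothetical helper: out[0..pos-1] already holds decoded bytes and cap is
   the buffer capacity. Appends `length` bytes copied from `distance` bytes
   back and returns the new position, or 0 if the pair is invalid. */
static size_t lz77_copy(unsigned char *out, size_t pos, size_t cap,
                        size_t distance, size_t length)
{
    if (pos > cap || length == 0 || length > MAX_LENGTH ||
        length > cap - pos ||
        distance == 0 || distance > pos || distance > MAX_DISTANCE)
        return 0;
    /* Copy one byte at a time: when distance < length the source and
       destination overlap, which repeats the last `distance` bytes. */
    for (size_t i = 0; i < length; i++, pos++)
        out[pos] = out[pos - distance];
    return pos;
}
```

With out holding "ab" and pos = 2, the pair (2, 6) produces "abababab": a
distance shorter than the length simply repeats the most recent bytes.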

Literals or match lengths are compressed with one Huffman tree, and
match distances are compressed with another tree. The trees are stored
in a compact form at the start of each block. The blocks can have any
size (except that the compressed data for one block must fit in
available memory). A block is terminated when deflate() determines that
it would be useful to start another block with fresh trees. (This is
somewhat similar to the behavior of LZW-based _compress_.)

Duplicated strings are found using a hash table. All input strings of
length 3 are inserted in the hash table. A hash index is computed for
the next 3 bytes. If the hash chain for this index is not empty, all
strings in the chain are compared with the current input string, and
the longest match is selected.

The hash chains are searched starting with the most recent strings, to
favor small distances and thus take advantage of the Huffman encoding.
The hash chains are singly linked. There are no deletions from the
hash chains; the algorithm simply discards matches that are too old.

To avoid a worst-case situation, very long hash chains are arbitrarily
truncated at a certain length, determined by a runtime option (level
parameter of deflateInit). So deflate() does not always find the longest
possible match but generally finds a match which is long enough.
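
The hash chain search described above can be sketched as follows. This is
an illustration under stated assumptions, not zlib's implementation: the
hash function, the table sizes, and the names chains, hash3, chains_init,
chains_insert and chains_longest are all invented for this example, and the
input is capped at 4096 bytes so that no match can be older than 32K.

```c
#define HASH_BITS 12
#define HASH_SIZE (1 << HASH_BITS)
#define MAX_INPUT 4096           /* illustrative cap; all distances < 32K */
#define MIN_MATCH 3
#define MAX_MATCH 258
#define NIL (-1)                 /* end of chain */

typedef struct {
    int head[HASH_SIZE];         /* most recent position for each hash value */
    int prev[MAX_INPUT];         /* older position with the same hash, or NIL */
} chains;

/* Hash of the 3 bytes at p (an arbitrary mixing function for this sketch). */
static unsigned hash3(const unsigned char *p)
{
    return (((unsigned)p[0] << 8) ^ ((unsigned)p[1] << 4) ^ p[2])
           & (HASH_SIZE - 1);
}

static void chains_init(chains *c)
{
    for (int i = 0; i < HASH_SIZE; i++)
        c->head[i] = NIL;
}

/* Insert the 3-byte string at s[pos]; the newest entry goes to the front
   of its singly linked chain. Caller ensures pos + 3 <= input length. */
static void chains_insert(chains *c, const unsigned char *s, int pos)
{
    unsigned h = hash3(s + pos);
    c->prev[pos] = c->head[h];
    c->head[h] = pos;
}

/* Walk the chain for s[pos..], newest first, visiting at most max_chain
   candidates. Returns the best match length (0 if below MIN_MATCH) and
   stores its distance in *dist. */
static int chains_longest(const chains *c, const unsigned char *s, int n,
                          int pos, int max_chain, int *dist)
{
    int best = 0;
    *dist = 0;
    if (pos + MIN_MATCH > n)
        return 0;
    for (int cand = c->head[hash3(s + pos)];
         cand != NIL && max_chain-- > 0; cand = c->prev[cand]) {
        int len = 0;
        while (pos + len < n && len < MAX_MATCH &&
               s[cand + len] == s[pos + len])
            len++;
        if (len > best) {        /* strict: ties keep the smaller distance */
            best = len;
            *dist = pos - cand;
        }
    }
    return best >= MIN_MATCH ? best : 0;
}
```

The max_chain argument plays the role of the chain-length limit: a small
value gives up early and may miss a longer, older match.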

deflate() also defers the selection of matches with a lazy evaluation
mechanism. After a match of length N has been found, deflate() searches for
a longer match at the next input byte. If a longer match is found, the
previous match is truncated to a length of one (thus producing a single
literal byte) and the process of lazy evaluation begins again. Otherwise,
the original match is kept, and the next match search is attempted only N
steps later.
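
The lazy evaluation rule can be sketched in C as below. This is a
simplified illustration, not deflate()'s actual code: it replaces the hash
chain search with a brute-force scan, ignores the 32K distance limit and all
level-dependent tuning, and the names token, longest_match and lazy_parse
are assumptions made for the example.

```c
#define MIN_MATCH 3
#define MAX_MATCH 258

typedef struct {
    int is_match;            /* 1: (dist, len) pair, 0: literal byte */
    int dist, len;
    unsigned char lit;
} token;

/* Brute-force stand-in for the hash chain search: longest earlier match
   for s[pos..n-1], preferring the smallest distance on ties. */
static int longest_match(const unsigned char *s, int n, int pos, int *dist)
{
    int best = 0;
    *dist = 0;
    for (int cand = pos - 1; cand >= 0; cand--) {
        int len = 0;
        while (pos + len < n && len < MAX_MATCH &&
               s[cand + len] == s[pos + len])
            len++;
        if (len > best) {
            best = len;
            *dist = pos - cand;
        }
    }
    return best >= MIN_MATCH ? best : 0;
}

/* Tokenize s[0..n-1] with lazy matching; out needs room for n tokens.
   Returns the number of tokens written. */
static int lazy_parse(const unsigned char *s, int n, token *out)
{
    int pos = 0, count = 0;
    while (pos < n) {
        int dist, next_dist;
        int len = longest_match(s, n, pos, &dist);
        /* Defer: if the next byte starts a longer match, emit a single
           literal here and restart the evaluation one byte later. */
        if (len > 0 && pos + 1 < n &&
            longest_match(s, n, pos + 1, &next_dist) > len)
            len = 0;
        if (len > 0) {
            out[count++] = (token){1, dist, len, 0};
            pos += len;      /* next search only len steps later */
        } else {
            out[count++] = (token){0, 0, 0, s[pos]};
            pos++;
        }
    }
    return count;
}
```

On the input "xabc_bcde_abcde", a greedy parser would take the 3-byte match
"abc" at offset 10 and then emit "d" and "e" as literals. The lazy version
emits "a" as a literal and takes the 4-byte match "bcde" instead, saving one
token.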

The lazy match evaluation is also subject to a runtime parameter. If
the current match is long enough, deflate() reduces the search for a longer
match, thus speeding up the whole process. If compression ratio is more
important than speed, deflate() attempts a complete second search even if
the first match is already long enough.

The lazy match evaluation is not performed for the fastest compression
modes (level parameter 1 to 3). For these fast modes, new strings
are inserted in the hash table only when no match was found, or
when the match is not too long. This degrades the compression ratio
but saves time since there are both fewer insertions and fewer searches.


2. Decompression algorithm (inflate)

2.1 Introduction

The key question is how to represent a Huffman code (or any prefix code) so
that you can decode fast. The most important characteristic is that shorter
codes are much more common than longer codes, so pay attention to decoding the
short codes fast, and let the long codes take longer to decode.

inflate() sets up a first level table that covers some number of bits of
input less than the length of the longest code. It gets that many bits from
the stream, and looks them up in the table. The table will tell if the next
code is that many bits or less and how many, and if it is, it will tell
the value, else it will point to the next level table for which inflate()
grabs more bits and tries to decode a longer code.

How many bits to make the first lookup is a tradeoff between the time it
takes to decode and the time it takes to build the table. If building the
table took no time (and if you had infinite memory), then there would only
be a first level table to cover all the way to the longest code. However,
building the table ends up taking a lot longer for more bits since short
codes are replicated many times in such a table. What inflate() does is
simply to make the number of bits in the first table a variable, and then
to set that variable for the maximum speed.

For inflate, which has 286 possible codes for the literal/length tree, the size
of the first table is nine bits. Also the distance trees have 30 possible
values, and the size of the first table is six bits. Note that for each of
those cases, the table ended up one bit longer than the ``average'' code
length, i.e. the code length of an approximately flat code which would be a
little more than eight bits for 286 symbols and a little less than five bits
for 30 symbols.


2.2 More details on the inflate table lookup

Ok, you want to know what this cleverly obfuscated inflate tree actually
looks like. You are correct that it's not a Huffman tree. It is simply a
lookup table for the first, let's say, nine bits of a Huffman symbol. The
symbol could be as short as one bit or as long as 15 bits. If a particular
symbol is shorter than nine bits, then that symbol's translation is duplicated
in all those entries that start with that symbol's bits. For example, if the
symbol is four bits, then it's duplicated 32 times in a nine-bit table. If a
symbol is nine bits long, it appears in the table once.

If the symbol is longer than nine bits, then that entry in the table points
to another similar table for the remaining bits. Again, there are duplicated
entries as needed. The idea is that most of the time the symbol will be short
and there will only be one table lookup. (That's the whole idea behind data
compression in the first place.) For the less frequent long symbols, there
will be two lookups. If you had a compression method with really long
symbols, you could have as many levels of lookups as is efficient. For
inflate, two is enough.

So a table entry either points to another table (in which case nine bits in
the above example are gobbled), or it contains the translation for the symbol
and the number of bits to gobble. Then you start again with the next
ungobbled bit.

You may wonder: why not just have one lookup table for however many bits the
longest symbol is? The reason is that if you do that, you end up spending
more time filling in duplicate symbol entries than you do actually decoding,
at least for deflate's output, which generates new trees every several tens
of kbytes. You can imagine that filling in a 2^15 entry table for a 15-bit
code would take too long if you're only decoding several thousand symbols. At
the other extreme, you could make a new table for every bit in the code. In
fact, that's essentially a Huffman tree. But then you spend too much time
traversing the tree while decoding, even for short symbols.

So the number of bits for the first lookup table is a tradeoff between the
time to fill out the table and the time spent looking at the second level and
above of the table.

Here is an example, scaled down:

The code being decoded, with 10 symbols, from 1 to 6 bits long:

A: 0
B: 10
C: 1100
D: 11010
E: 11011
F: 11100
G: 11101
H: 11110
I: 111110
J: 111111

Let's make the first table three bits long (eight entries):

000: A,1
001: A,1
010: A,1
011: A,1
100: B,2
101: B,2
110: -> table X (gobble 3 bits)
111: -> table Y (gobble 3 bits)

Each entry is what the bits decode as and how many bits that is, i.e. how
many bits to gobble. Or the entry points to another table, with the number of
bits to gobble implicit in the size of the table.

Table X is two bits long since the longest code starting with 110 is five bits
long:

00: C,1
01: C,1
10: D,2
11: E,2

Table Y is three bits long since the longest code starting with 111 is six
bits long:

000: F,2
001: F,2
010: G,2
011: G,2
100: H,2
101: H,2
110: I,3
111: J,3

So what we have here are three tables with a total of 20 entries that had to
be constructed. That's compared to 64 entries for a single table. Or
compared to 16 entries for a Huffman tree (six two entry tables and one four
entry table). Assuming that the code ideally represents the probability of
the symbols, it takes on the average 1.25 lookups per symbol. That's compared
to one lookup for the single table, or 1.66 lookups per symbol for the
Huffman tree.
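
The three tables above can be exercised directly. The sketch below
hard-codes them and decodes a bit string while counting table lookups. It
is only an illustration of the lookup scheme: the entry layout, the names
entry, peek and decode, and the use of a string of '0' and '1' characters as
input are assumptions of this example, not the data structures built by
inftrees.c.

```c
#include <stddef.h>
#include <string.h>

#define ROOT_BITS 3

typedef struct entry {
    char sym;                  /* decoded symbol, or 0 for a link */
    int bits;                  /* symbol: bits to gobble; link: index width */
    const struct entry *sub;   /* second-level table, NULL for a symbol */
} entry;

static const entry table_x[4] = {
    {'C', 1, NULL}, {'C', 1, NULL}, {'D', 2, NULL}, {'E', 2, NULL}
};
static const entry table_y[8] = {
    {'F', 2, NULL}, {'F', 2, NULL}, {'G', 2, NULL}, {'G', 2, NULL},
    {'H', 2, NULL}, {'H', 2, NULL}, {'I', 3, NULL}, {'J', 3, NULL}
};
static const entry root[8] = {
    {'A', 1, NULL}, {'A', 1, NULL}, {'A', 1, NULL}, {'A', 1, NULL},
    {'B', 2, NULL}, {'B', 2, NULL}, {0, 2, table_x}, {0, 3, table_y}
};

/* Read `count` bits starting at pos, first bit most significant;
   bits past the end of the input read as zero. */
static int peek(const char *bits, size_t pos, int count)
{
    size_t n = strlen(bits);
    int v = 0;
    for (int i = 0; i < count; i++)
        v = (v << 1) | (pos + i < n && bits[pos + i] == '1');
    return v;
}

/* Decode up to cap-1 symbols into out (NUL-terminated). Returns the number
   of symbols and stores the number of table lookups in *lookups. */
static int decode(const char *bits, char *out, int cap, int *lookups)
{
    size_t pos = 0, n = strlen(bits);
    int count = 0;
    *lookups = 0;
    while (pos < n && count < cap - 1) {
        const entry *e = &root[peek(bits, pos, ROOT_BITS)];
        ++*lookups;
        if (e->sub != NULL) {  /* long code: gobble 3 bits, look again */
            pos += ROOT_BITS;
            e = &e->sub[peek(bits, pos, e->bits)];
            ++*lookups;
        }
        pos += e->bits;        /* gobble only the bits the symbol used */
        out[count++] = e->sym;
    }
    out[count] = '\0';
    return count;
}
```

Decoding "010110011010111111" (A, B, C, D, J concatenated) takes one lookup
each for A and B and two each for C, D and J. The 1.25 average quoted above
follows the same way: codes starting with 110 or 111 have total probability
1/4, so the expected cost is 1 + 1/4 lookups per symbol.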

There, I think that gives you a picture of what's going on. For inflate, the
meaning of a particular symbol is often more than just a letter. It can be a
byte (a "literal"), or it can be either a length or a distance which
indicates a base value and a number of bits to fetch after the code that is
added to the base value. Or it might be the special end-of-block code. The
data structures created in inftrees.c try to encode all that information
compactly in the tables.


Jean-loup Gailly        Mark Adler
jloup@gzip.org          madler@alumni.caltech.edu


References:

[LZ77] Ziv J., Lempel A., ``A Universal Algorithm for Sequential Data
Compression,'' IEEE Transactions on Information Theory, Vol. 23, No. 3,
pp. 337-343.

``DEFLATE Compressed Data Format Specification'' available in
http://www.ietf.org/rfc/rfc1951.txt
diff --git a/doc/rfc1950.txt b/doc/rfc1950.txt
new file mode 100644
index 0000000..ce6428a
--- /dev/null
+++ b/doc/rfc1950.txt
@@ -0,0 +1,619 @@
Network Working Group                                         P. Deutsch
Request for Comments: 1950                           Aladdin Enterprises
Category: Informational                                      J-L. Gailly
                                                                Info-ZIP
                                                                May 1996


         ZLIB Compressed Data Format Specification version 3.3

Status of This Memo

   This memo provides information for the Internet community. This memo
   does not specify an Internet standard of any kind. Distribution of
   this memo is unlimited.

IESG Note:

   The IESG takes no position on the validity of any Intellectual
   Property Rights statements contained in this document.

Notices

   Copyright (c) 1996 L. Peter Deutsch and Jean-Loup Gailly

   Permission is granted to copy and distribute this document for any
   purpose and without charge, including translations into other
   languages and incorporation into compilations, provided that the
   copyright notice and this notice are preserved, and that any
   substantive changes or deletions from the original are clearly
   marked.

   A pointer to the latest version of this and related documentation in
   HTML format can be found at the URL
   <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>.

Abstract

   This specification defines a lossless compressed data format. The
   data can be produced or consumed, even for an arbitrarily long
   sequentially presented input data stream, using only an a priori
   bounded amount of intermediate storage. The format presently uses
   the DEFLATE compression method but can be easily extended to use
   other compression methods. It can be implemented readily in a manner
   not covered by patents. This specification also defines the ADLER-32
   checksum (an extension and improvement of the Fletcher checksum),
   used for detection of data corruption, and provides an algorithm for
   computing it.

Deutsch & Gailly             Informational                      [Page 1]

RFC 1950       ZLIB Compressed Data Format Specification        May 1996

Table of Contents

   1. Introduction ................................................... 2
      1.1. Purpose ................................................... 2
      1.2. Intended audience ......................................... 3
      1.3. Scope ..................................................... 3
      1.4. Compliance ................................................ 3
      1.5. Definitions of terms and conventions used ................ 3
      1.6. Changes from previous versions ............................ 3
   2. Detailed specification ......................................... 3
      2.1. Overall conventions ....................................... 3
      2.2. Data format ............................................... 4
      2.3. Compliance ................................................ 7
   3. References ..................................................... 7
   4. Source code .................................................... 8
   5. Security Considerations ........................................ 8
   6. Acknowledgements ............................................... 8
   7. Authors' Addresses ............................................. 8
   8. Appendix: Rationale ............................................ 9
   9. Appendix: Sample code ..........................................10

1. Introduction

   1.1. Purpose

      The purpose of this specification is to define a lossless
      compressed data format that:

          * Is independent of CPU type, operating system, file system,
            and character set, and hence can be used for interchange;

          * Can be produced or consumed, even for an arbitrarily long
            sequentially presented input data stream, using only an a
            priori bounded amount of intermediate storage, and hence can
            be used in data communications or similar structures such as
            Unix filters;

          * Can use a number of different compression methods;

          * Can be implemented readily in a manner not covered by
            patents, and hence can be practiced freely.

      The data format defined by this specification does not attempt to
      allow random access to compressed data.

   1.2. Intended audience

      This specification is intended for use by implementors of software
      to compress data into zlib format and/or decompress data from zlib
      format.

      The text of the specification assumes a basic background in
      programming at the level of bits and other primitive data
      representations.

   1.3. Scope

      The specification specifies a compressed data format that can be
      used for in-memory compression of a sequence of arbitrary bytes.

   1.4. Compliance

      Unless otherwise indicated below, a compliant decompressor must be
      able to accept and decompress any data set that conforms to all
      the specifications presented here; a compliant compressor must
      produce data sets that conform to all the specifications presented
      here.

   1.5. Definitions of terms and conventions used

      byte: 8 bits stored or transmitted as a unit (same as an octet).
      (For this specification, a byte is exactly 8 bits, even on
      machines which store a character on a number of bits different
      from 8.) See below, for the numbering of bits within a byte.

   1.6. Changes from previous versions

      Version 3.1 was the first public release of this specification.
      In version 3.2, some terminology was changed and the Adler-32
      sample code was rewritten for clarity. In version 3.3, the
      support for a preset dictionary was introduced, and the
      specification was converted to RFC style.

2. Detailed specification

   2.1. Overall conventions

      In the diagrams below, a box like this:

         +---+
         |   | <-- the vertical bars might be missing
         +---+

      represents one byte; a box like this:

         +==============+
         |              |
         +==============+

      represents a variable number of bytes.

      Bytes stored within a computer do not have a "bit order", since
      they are always treated as a unit. However, a byte considered as
      an integer between 0 and 255 does have a most- and least-
      significant bit, and since we write numbers with the most-
      significant digit on the left, we also write bytes with the most-
      significant bit on the left. In the diagrams below, we number the
      bits of a byte so that bit 0 is the least-significant bit, i.e.,
      the bits are numbered:

         +--------+
         |76543210|
         +--------+

      Within a computer, a number may occupy multiple bytes. All
      multi-byte numbers in the format described here are stored with
      the MOST-significant byte first (at the lower memory address).
      For example, the decimal number 520 is stored as:

             0        1
         +--------+--------+
         |00000010|00001000|
         +--------+--------+
          ^        ^
          |        |
          |        + less significant byte = 8
          + more significant byte = 2 x 256
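
The byte order convention above can be written out in C. This is a small
sketch, not code from the zlib sources; the helper names put_u16_msb,
get_u16_msb, put_u32_msb and get_u32_msb are invented for this example.

```c
/* Store a 16-bit value with the most-significant byte first. */
static void put_u16_msb(unsigned char *p, unsigned v)
{
    p[0] = (unsigned char)((v >> 8) & 0xff);
    p[1] = (unsigned char)(v & 0xff);
}

/* Read a 16-bit value stored most-significant byte first. */
static unsigned get_u16_msb(const unsigned char *p)
{
    return ((unsigned)p[0] << 8) | p[1];
}

/* Same convention for 32-bit values such as DICTID and ADLER32. */
static void put_u32_msb(unsigned char *p, unsigned long v)
{
    p[0] = (unsigned char)((v >> 24) & 0xff);
    p[1] = (unsigned char)((v >> 16) & 0xff);
    p[2] = (unsigned char)((v >> 8) & 0xff);
    p[3] = (unsigned char)(v & 0xff);
}

static unsigned long get_u32_msb(const unsigned char *p)
{
    return ((unsigned long)p[0] << 24) | ((unsigned long)p[1] << 16) |
           ((unsigned long)p[2] << 8) | p[3];
}
```

Calling put_u16_msb on the value 520 stores the bytes 2 and 8, matching the
diagram. Using shifts rather than a cast of the buffer pointer keeps the
result independent of the host CPU's own byte order.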

   2.2. Data format

      A zlib stream has the following structure:

           0   1
         +---+---+
         |CMF|FLG|   (more-->)
         +---+---+

      (if FLG.FDICT set)

           0   1   2   3
         +---+---+---+---+
         |     DICTID    |   (more-->)
         +---+---+---+---+

         +=====================+---+---+---+---+
         |...compressed data...|    ADLER32    |
         +=====================+---+---+---+---+

      Any data which may appear after ADLER32 are not part of the zlib
      stream.

      CMF (Compression Method and flags)
         This byte is divided into a 4-bit compression method and a
         4-bit information field depending on the compression method.

            bits 0 to 3  CM     Compression method
            bits 4 to 7  CINFO  Compression info

      CM (Compression method)
         This identifies the compression method used in the file. CM = 8
         denotes the "deflate" compression method with a window size up
         to 32K. This is the method used by gzip and PNG (see
         references [1] and [2] in Chapter 3, below, for the reference
         documents). CM = 15 is reserved. It might be used in a future
         version of this specification to indicate the presence of an
         extra field before the compressed data.

      CINFO (Compression info)
         For CM = 8, CINFO is the base-2 logarithm of the LZ77 window
         size, minus eight (CINFO=7 indicates a 32K window size). Values
         of CINFO above 7 are not allowed in this version of the
         specification. CINFO is not defined in this specification for
         CM not equal to 8.

      FLG (FLaGs)
         This flag byte is divided as follows:

            bits 0 to 4  FCHECK  (check bits for CMF and FLG)
            bit  5       FDICT   (preset dictionary)
            bits 6 to 7  FLEVEL  (compression level)

         The FCHECK value must be such that CMF and FLG, when viewed as
         a 16-bit unsigned integer stored in MSB order (CMF*256 + FLG),
         is a multiple of 31.

      FDICT (Preset dictionary)
         If FDICT is set, a DICTID dictionary identifier is present
         immediately after the FLG byte. The dictionary is a sequence of
         bytes which are initially fed to the compressor without
         producing any compressed output. DICTID is the Adler-32
         checksum of this sequence of bytes (see the definition of
         ADLER32 below). The decompressor can use this identifier to
         determine which dictionary has been used by the compressor.

      FLEVEL (Compression level)
         These flags are available for use by specific compression
         methods. The "deflate" method (CM = 8) sets these flags as
         follows:

            0 - compressor used fastest algorithm
            1 - compressor used fast algorithm
            2 - compressor used default algorithm
            3 - compressor used maximum compression, slowest algorithm

         The information in FLEVEL is not needed for decompression; it
         is there to indicate if recompression might be worthwhile.

      compressed data
         For compression method 8, the compressed data is stored in the
         deflate compressed data format as described in the document
         "DEFLATE Compressed Data Format Specification" by L. Peter
         Deutsch. (See reference [3] in Chapter 3, below.)

         Other compressed data formats are not specified in this version
         of the zlib specification.

      ADLER32 (Adler-32 checksum)
         This contains a checksum value of the uncompressed data
         (excluding any dictionary data) computed according to the
         Adler-32 algorithm. This algorithm is a 32-bit extension and
         improvement of the Fletcher algorithm, used in the ITU-T X.224 /
         ISO 8073 standard. (See references [4] and [5] in Chapter 3,
         below.)

         Adler-32 is composed of two sums accumulated per byte: s1 is
         the sum of all bytes, s2 is the sum of all s1 values. Both sums
         are done modulo 65521. s1 is initialized to 1, s2 to zero. The
         Adler-32 checksum is stored as s2*65536 + s1 in
         most-significant-byte first (network) order.
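
The CMF and FLG layout and the FCHECK rule described above can be combined
into a few lines of C. This is a sketch under the assumption CM = 8; the
function name zlib_header and its parameter order are invented for this
example and are not part of any zlib API.

```c
/* Build the two header bytes for CM = 8 (deflate).
   cinfo: 0..7, flevel: 0..3, fdict: 0 or 1 (caller ensures the ranges). */
static void zlib_header(unsigned cinfo, unsigned flevel, unsigned fdict,
                        unsigned char hdr[2])
{
    unsigned cmf = (cinfo << 4) | 8;              /* CINFO in bits 4..7, CM in 0..3 */
    unsigned flg = (flevel << 6) | (fdict << 5);  /* FCHECK bits still zero */
    unsigned rem = (cmf * 256 + flg) % 31;

    if (rem != 0)
        flg += 31 - rem;   /* FCHECK, at most 30, fits in bits 0..4 */
    hdr[0] = (unsigned char)cmf;
    hdr[1] = (unsigned char)flg;
}
```

For CINFO = 7 (32K window) and FLEVEL = 2 with no preset dictionary this
gives the bytes 0x78 0x9C, since 0x789C = 30876 = 31 x 996.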

   2.3. Compliance

      A compliant compressor must produce streams with correct CMF, FLG
      and ADLER32, but need not support preset dictionaries. When the
      zlib data format is used as part of another standard data format,
      the compressor may use only preset dictionaries that are specified
      by this other data format. If this other format does not use the
      preset dictionary feature, the compressor must not set the FDICT
      flag.

      A compliant decompressor must check CMF, FLG, and ADLER32, and
      provide an error indication if any of these have incorrect values.
      A compliant decompressor must give an error indication if CM is
      not one of the values defined in this specification (only the
      value 8 is permitted in this version), since another value could
      indicate the presence of new features that would cause subsequent
      data to be interpreted incorrectly. A compliant decompressor must
      give an error indication if FDICT is set and DICTID is not the
      identifier of a known preset dictionary. A decompressor may
      ignore FLEVEL and still be compliant. When the zlib data format
      is being used as a part of another standard format, a compliant
      decompressor must support all the preset dictionaries specified by
      the other format. When the other format does not use the preset
      dictionary feature, a compliant decompressor must reject any
      stream in which the FDICT flag is set.
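
The header checks required of a decompressor can be sketched as follows.
This is an illustrative fragment only: the name zlib_check_header and its
error codes are assumptions of this example, and the DICTID lookup and the
final ADLER32 comparison are left to the caller.

```c
/* Validate the two header bytes of a zlib stream.
   Returns 0 on success and stores the FDICT flag in *fdict;
   returns a negative value (codes local to this sketch) on error. */
static int zlib_check_header(const unsigned char hdr[2], int *fdict)
{
    unsigned cmf = hdr[0], flg = hdr[1];

    if ((cmf & 0x0f) != 8)
        return -1;                 /* CM: only 8 is permitted */
    if ((cmf >> 4) > 7)
        return -2;                 /* CINFO above 7 is not allowed */
    if ((cmf * 256 + flg) % 31 != 0)
        return -3;                 /* FCHECK mismatch */
    *fdict = (int)((flg >> 5) & 1);
    return 0;
}
```

A caller that gets fdict = 1 still has to read the four DICTID bytes and
reject the stream if they do not identify a known preset dictionary, or if
the enclosing format does not use preset dictionaries at all.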

3. References

   [1] Deutsch, L.P., "GZIP Compressed Data Format Specification",
       available in ftp://ftp.uu.net/pub/archiving/zip/doc/

   [2] Thomas Boutell, "PNG (Portable Network Graphics) specification",
       available in ftp://ftp.uu.net/graphics/png/documents/

   [3] Deutsch, L.P., "DEFLATE Compressed Data Format Specification",
       available in ftp://ftp.uu.net/pub/archiving/zip/doc/

   [4] Fletcher, J. G., "An Arithmetic Checksum for Serial
       Transmissions," IEEE Transactions on Communications, Vol. COM-30,
       No. 1, January 1982, pp. 247-252.

   [5] ITU-T Recommendation X.224, Annex D, "Checksum Algorithms,"
       November, 1993, pp. 144, 145. (Available from
       gopher://info.itu.ch). ITU-T X.224 is also the same as ISO 8073.

4. Source code

   Source code for a C language implementation of a "zlib" compliant
   library is available at ftp://ftp.uu.net/pub/archiving/zip/zlib/.

5. Security Considerations

   A decoder that fails to check the ADLER32 checksum value may be
   subject to undetected data corruption.

6. Acknowledgements

   Trademarks cited in this document are the property of their
   respective owners.

   Jean-Loup Gailly and Mark Adler designed the zlib format and wrote
   the related software described in this specification. Glenn
   Randers-Pehrson converted this document to RFC and HTML format.

7. Authors' Addresses

   L. Peter Deutsch
   Aladdin Enterprises
   203 Santa Margarita Ave.
   Menlo Park, CA 94025

   Phone: (415) 322-0103 (AM only)
   FAX:   (415) 322-1734
   EMail: <ghost@aladdin.com>


   Jean-Loup Gailly

   EMail: <gzip@prep.ai.mit.edu>

   Questions about the technical content of this specification can be
   sent by email to

   Jean-Loup Gailly <gzip@prep.ai.mit.edu> and
   Mark Adler <madler@alumni.caltech.edu>

   Editorial comments on this specification can be sent by email to

   L. Peter Deutsch <ghost@aladdin.com> and
   Glenn Randers-Pehrson <randeg@alumni.rpi.edu>

8. Appendix: Rationale

   8.1. Preset dictionaries

      A preset dictionary is especially useful to compress short input
      sequences. The compressor can take advantage of the dictionary
      context to encode the input in a more compact manner. The
      decompressor can be initialized with the appropriate context by
      virtually decompressing a compressed version of the dictionary
      without producing any output. However, for certain compression
      algorithms such as the deflate algorithm this operation can be
      achieved without actually performing any decompression.

      The compressor and the decompressor must use exactly the same
      dictionary. The dictionary may be fixed or may be chosen among a
      certain number of predefined dictionaries, according to the kind
      of input data. The decompressor can determine which dictionary has
      been chosen by the compressor by checking the dictionary
      identifier. This document does not specify the contents of
      predefined dictionaries, since the optimal dictionaries are
      application specific. Standard data formats using this feature of
      the zlib specification must precisely define the allowed
      dictionaries.

   8.2. The Adler-32 algorithm

      The Adler-32 algorithm is much faster than the CRC32 algorithm yet
      still provides an extremely low probability of undetected errors.

      The modulo on unsigned long accumulators can be delayed for 5552
      bytes, so the modulo operation time is negligible. If the bytes
      are a, b, c, the second sum is 3a + 2b + c + 3, and so is position
      and order sensitive, unlike the first sum, which is just a
      checksum. That 65521 is prime is important to avoid a possible
      large class of two-byte errors that leave the check unchanged.
      (The Fletcher checksum uses 255, which is not prime and which also
      makes the Fletcher check insensitive to single byte changes 0 <->
      255.)
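
The delayed modulo mentioned above can be sketched in C as follows. This is
an illustration, not the zlib source: the function name adler32_deferred is
invented for this example, and NMAX is the 5552-byte bound quoted in the
text, which keeps both sums within a 32-bit unsigned long between
reductions.

```c
#include <stddef.h>

#define BASE 65521UL   /* largest prime smaller than 65536 */
#define NMAX 5552      /* bytes that can be summed before a modulo is needed */

/* Update a running Adler-32 value with buf[0..len-1], reducing modulo
   BASE only once per chunk of at most NMAX bytes. Start with adler = 1. */
static unsigned long adler32_deferred(unsigned long adler,
                                      const unsigned char *buf, size_t len)
{
    unsigned long s1 = adler & 0xffff;
    unsigned long s2 = (adler >> 16) & 0xffff;

    while (len > 0) {
        size_t k = len < NMAX ? len : NMAX;
        len -= k;
        while (k-- > 0) {
            s1 += *buf++;   /* no modulo inside the inner loop */
            s2 += s1;
        }
        s1 %= BASE;
        s2 %= BASE;
    }
    return (s2 << 16) | s1;
}
```

For the three bytes "abc" (97, 98, 99) this gives s1 = 1 + 97 + 98 + 99 =
295 and s2 = 3*97 + 2*98 + 99 + 3 = 589, i.e. the checksum 0x024D0127,
which agrees with the 3a + 2b + c + 3 formula above.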

      The sum s1 is initialized to 1 instead of zero to make the length
      of the sequence part of s2, so that the length does not have to be
      checked separately. (Any sequence of zeroes has a Fletcher
      checksum of zero.)

9. Appendix: Sample code

   The following C code computes the Adler-32 checksum of a data buffer.
   It is written for clarity, not for speed. The sample code is in the
   ANSI C programming language. Non C users may find it easier to read
   with these hints:

      &      Bitwise AND operator.
      >>     Bitwise right shift operator. When applied to an
             unsigned quantity, as here, right shift inserts zero bit(s)
             at the left.
      <<     Bitwise left shift operator. Left shift inserts zero
             bit(s) at the right.
      ++     "n++" increments the variable n.
      %      modulo operator: a % b is the remainder of a divided by b.

      #define BASE 65521 /* largest prime smaller than 65536 */

      /*
         Update a running Adler-32 checksum with the bytes buf[0..len-1]
         and return the updated checksum. The Adler-32 checksum should be
         initialized to 1.

         Usage example:

           unsigned long adler = 1L;

           while (read_buffer(buffer, length) != EOF) {
             adler = update_adler32(adler, buffer, length);
           }
           if (adler != original_adler) error();
      */
      unsigned long update_adler32(unsigned long adler,
         unsigned char *buf, int len)
      {
        unsigned long s1 = adler & 0xffff;
        unsigned long s2 = (adler >> 16) & 0xffff;
        int n;

        for (n = 0; n < len; n++) {
          s1 = (s1 + buf[n]) % BASE;
          s2 = (s2 + s1)     % BASE;
        }
        return (s2 << 16) + s1;
      }

      /* Return the adler32 of the bytes buf[0..len-1] */

      unsigned long adler32(unsigned char *buf, int len)
      {
        return update_adler32(1L, buf, len);
      }
diff --git a/doc/rfc1951.txt b/doc/rfc1951.txt
new file mode 100644
index 0000000..403c8c7
--- /dev/null
+++ b/doc/rfc1951.txt
@@ -0,0 +1,955 @@
Network Working Group                                         P. Deutsch
Request for Comments: 1951                           Aladdin Enterprises
Category: Informational                                         May 1996


        DEFLATE Compressed Data Format Specification version 1.3

Status of This Memo

   This memo provides information for the Internet community. This memo
   does not specify an Internet standard of any kind. Distribution of
   this memo is unlimited.

IESG Note:

   The IESG takes no position on the validity of any Intellectual
   Property Rights statements contained in this document.

Notices

   Copyright (c) 1996 L. Peter Deutsch

   Permission is granted to copy and distribute this document for any
   purpose and without charge, including translations into other
   languages and incorporation into compilations, provided that the
   copyright notice and this notice are preserved, and that any
   substantive changes or deletions from the original are clearly
   marked.

   A pointer to the latest version of this and related documentation in
   HTML format can be found at the URL
   <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>.

Abstract

   This specification defines a lossless compressed data format that
   compresses data using a combination of the LZ77 algorithm and Huffman
   coding, with efficiency comparable to the best currently available
   general-purpose compression methods. The data can be produced or
   consumed, even for an arbitrarily long sequentially presented input
   data stream, using only an a priori bounded amount of intermediate
   storage. The format can be implemented readily in a manner not
   covered by patents.

Deutsch                      Informational                      [Page 1]

RFC 1951      DEFLATE Compressed Data Format Specification      May 1996

Table of Contents

   1. Introduction ................................................... 2
      1.1. Purpose ................................................... 2
      1.2. Intended audience ......................................... 3
      1.3. Scope ..................................................... 3
      1.4. Compliance ................................................ 3
      1.5. Definitions of terms and conventions used ................ 3
      1.6. Changes from previous versions ............................ 4
   2. Compressed representation overview ............................. 4
   3. Detailed specification ......................................... 5
      3.1. Overall conventions ....................................... 5
          3.1.1. Packing into bytes .................................. 5
      3.2. Compressed block format ................................... 6
          3.2.1. Synopsis of prefix and Huffman coding ............... 6
          3.2.2. Use of Huffman coding in the "deflate" format ....... 7
          3.2.3. Details of block format ............................. 9
          3.2.4. Non-compressed blocks (BTYPE=00) ................... 11
          3.2.5. Compressed blocks (length and distance codes) ...... 11
          3.2.6. Compression with fixed Huffman codes (BTYPE=01) .... 12
          3.2.7. Compression with dynamic Huffman codes (BTYPE=10) .. 13
      3.3. Compliance ............................................... 14
   4. Compression algorithm details ................................. 14
   5. References .................................................... 16
   6. Security Considerations ....................................... 16
   7. Source code ................................................... 16
   8. Acknowledgements .............................................. 16
   9. Author's Address .............................................. 17
91
921. Introduction
93
94 1.1. Purpose
95
96 The purpose of this specification is to define a lossless
97 compressed data format that:
98 * Is independent of CPU type, operating system, file system,
99 and character set, and hence can be used for interchange;
100 * Can be produced or consumed, even for an arbitrarily long
101 sequentially presented input data stream, using only an a
102 priori bounded amount of intermediate storage, and hence
103 can be used in data communications or similar structures
104 such as Unix filters;
105 * Compresses data with efficiency comparable to the best
106 currently available general-purpose compression methods,
107 and in particular considerably better than the "compress"
108 program;
109 * Can be implemented readily in a manner not covered by
110 patents, and hence can be practiced freely;
119 * Is compatible with the file format produced by the current
120 widely used gzip utility, in that conforming decompressors
121 will be able to read data produced by the existing gzip
122 compressor.
123
124 The data format defined by this specification does not attempt to:
125
126 * Allow random access to compressed data;
127 * Compress specialized data (e.g., raster graphics) as well
128 as the best currently available specialized algorithms.
129
130 A simple counting argument shows that no lossless compression
131 algorithm can compress every possible input data set. For the
132 format defined here, the worst case expansion is 5 bytes per 32K-
133 byte block, i.e., a size increase of 0.015% for large data sets.
134 English text usually compresses by a factor of 2.5 to 3;
135 executable files usually compress somewhat less; graphical data
136 such as raster images may compress much more.
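
The 5-byte figure can be turned into a quick upper bound. The following is a minimal sketch in C, assuming one stored block per 32K bytes of input, each costing one byte for the padded header bits plus four bytes for LEN and NLEN. The function name is hypothetical and not part of any library.

```c
#include <stddef.h>

/* Upper bound on the output size when every 32K-byte chunk of input
   is emitted as a stored block with 5 bytes of overhead (the figure
   quoted in the text).  Illustrative only. */
size_t worst_case_size(size_t input_len)
{
    const size_t chunk = 32768;   /* 32K bytes of input per block */
    const size_t overhead = 5;    /* bytes added per stored block */
    size_t blocks = (input_len + chunk - 1) / chunk;

    if (blocks == 0)
        blocks = 1;               /* empty input still needs one block */
    return input_len + blocks * overhead;
}
```

For one full chunk the ratio is 5/32768, or about 0.015%, which matches the figure in the text.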
137
138 1.2. Intended audience
139
140 This specification is intended for use by implementors of software
141 to compress data into "deflate" format and/or decompress data from
142 "deflate" format.
143
144 The text of the specification assumes a basic background in
145 programming at the level of bits and other primitive data
146 representations. Familiarity with the technique of Huffman coding
147 is helpful but not required.
148
149 1.3. Scope
150
151 The specification specifies a method for representing a sequence
152 of bytes as a (usually shorter) sequence of bits, and a method for
153 packing the latter bit sequence into bytes.
154
155 1.4. Compliance
156
157 Unless otherwise indicated below, a compliant decompressor must be
158 able to accept and decompress any data set that conforms to all
159 the specifications presented here; a compliant compressor must
160 produce data sets that conform to all the specifications presented
161 here.
162
163 1.5. Definitions of terms and conventions used
164
      Byte: 8 bits stored or transmitted as a unit (same as an octet).
      For this specification, a byte is exactly 8 bits, even on machines
      that store a character in a number of bits different from eight.
      See below for the numbering of bits within a byte.
177
178 String: a sequence of arbitrary bytes.
179
180 1.6. Changes from previous versions
181
182 There have been no technical changes to the deflate format since
183 version 1.1 of this specification. In version 1.2, some
184 terminology was changed. Version 1.3 is a conversion of the
185 specification to RFC style.
186
1872. Compressed representation overview
188
189 A compressed data set consists of a series of blocks, corresponding
190 to successive blocks of input data. The block sizes are arbitrary,
191 except that non-compressible blocks are limited to 65,535 bytes.
192
193 Each block is compressed using a combination of the LZ77 algorithm
194 and Huffman coding. The Huffman trees for each block are independent
195 of those for previous or subsequent blocks; the LZ77 algorithm may
196 use a reference to a duplicated string occurring in a previous block,
197 up to 32K input bytes before.
198
199 Each block consists of two parts: a pair of Huffman code trees that
200 describe the representation of the compressed data part, and a
201 compressed data part. (The Huffman trees themselves are compressed
202 using Huffman encoding.) The compressed data consists of a series of
203 elements of two types: literal bytes (of strings that have not been
204 detected as duplicated within the previous 32K input bytes), and
205 pointers to duplicated strings, where a pointer is represented as a
206 pair <length, backward distance>. The representation used in the
207 "deflate" format limits distances to 32K bytes and lengths to 258
208 bytes, but does not limit the size of a block, except for
209 uncompressible blocks, which are limited as noted above.
210
211 Each type of value (literals, distances, and lengths) in the
212 compressed data is represented using a Huffman code, using one code
213 tree for literals and lengths and a separate code tree for distances.
214 The code trees for each block appear in a compact form just before
215 the compressed data for that block.
216
2313. Detailed specification
232
   3.1. Overall conventions

      In the diagrams below, a box like this:
234
235 +---+
236 | | <-- the vertical bars might be missing
237 +---+
238
239 represents one byte; a box like this:
240
241 +==============+
242 | |
243 +==============+
244
245 represents a variable number of bytes.
246
247 Bytes stored within a computer do not have a "bit order", since
248 they are always treated as a unit. However, a byte considered as
249 an integer between 0 and 255 does have a most- and least-
250 significant bit, and since we write numbers with the most-
251 significant digit on the left, we also write bytes with the most-
252 significant bit on the left. In the diagrams below, we number the
253 bits of a byte so that bit 0 is the least-significant bit, i.e.,
254 the bits are numbered:
255
256 +--------+
257 |76543210|
258 +--------+
259
260 Within a computer, a number may occupy multiple bytes. All
261 multi-byte numbers in the format described here are stored with
262 the least-significant byte first (at the lower memory address).
263 For example, the decimal number 520 is stored as:
264
             0        1
         +--------+--------+
         |00001000|00000010|
         +--------+--------+
          ^        ^
          |        |
          |        + more significant byte = 2 x 256
          + less significant byte = 8
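
The byte order above can be sketched in C as follows. The function name is hypothetical; the only assumption is that the two bytes are already in memory in stream order.

```c
#include <stdint.h>

/* Read a 16-bit unsigned integer stored least-significant byte first. */
uint16_t read_u16_le(const uint8_t *p)
{
    return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}
```

Applied to the two bytes in the diagram (0x08 followed by 0x02), this yields 8 + 2 x 256 = 520.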
273
274 3.1.1. Packing into bytes
275
         This document does not address the issue of the order in which
         bits of a byte are transmitted on a bit-sequential medium,
         since the final data format described here is byte- rather than
         bit-oriented.  However, we describe the compressed block format
         below as a sequence of data elements of various bit lengths,
         not a sequence of bytes.  We must therefore specify how to pack
         these data elements into bytes to form the final compressed
         byte sequence:
292
293 * Data elements are packed into bytes in order of
294 increasing bit number within the byte, i.e., starting
295 with the least-significant bit of the byte.
296 * Data elements other than Huffman codes are packed
297 starting with the least-significant bit of the data
298 element.
299 * Huffman codes are packed starting with the most-
300 significant bit of the code.
301
302 In other words, if one were to print out the compressed data as
303 a sequence of bytes, starting with the first byte at the
304 *right* margin and proceeding to the *left*, with the most-
305 significant bit of each byte on the left as usual, one would be
306 able to parse the result from right to left, with fixed-width
307 elements in the correct MSB-to-LSB order and Huffman codes in
308 bit-reversed order (i.e., with the first bit of the code in the
309 relative LSB position).
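
The packing rules can be illustrated with a small bit reader. This is a sketch under stated assumptions, not the reader used by any particular implementation: the struct and function names are hypothetical, the input is a complete in-memory buffer, and elements are limited to 16 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit reader over an in-memory buffer. */
struct bit_reader {
    const uint8_t *data;
    size_t len;      /* number of bytes in data */
    size_t bitpos;   /* absolute index of the next bit to read */
};

/* Return the next bit (bit 0 of each byte comes first),
   or -1 at end of input. */
int next_bit(struct bit_reader *br)
{
    size_t byte = br->bitpos >> 3;
    int bit;

    if (byte >= br->len)
        return -1;
    bit = (br->data[byte] >> (br->bitpos & 7)) & 1;
    br->bitpos++;
    return bit;
}

/* Read an n-bit fixed-width element (n <= 16): the first bit read is
   the least-significant bit of the value.  Returns -1 at end of input. */
long read_bits(struct bit_reader *br, int n)
{
    long value = 0;

    for (int i = 0; i < n; i++) {
        int bit = next_bit(br);
        if (bit < 0)
            return -1;
        value |= (long)bit << i;
    }
    return value;
}

/* Read n bits of a Huffman code (n <= 16): the first bit read is the
   most-significant bit of the code.  Returns -1 at end of input. */
long read_code_bits(struct bit_reader *br, int n)
{
    long code = 0;

    for (int i = 0; i < n; i++) {
        int bit = next_bit(br);
        if (bit < 0)
            return -1;
        code = (code << 1) | bit;
    }
    return code;
}
```

The two readers differ only in where each new bit is placed, which is the whole distinction between fixed-width elements and Huffman codes in the rules above.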
310
311 3.2. Compressed block format
312
313 3.2.1. Synopsis of prefix and Huffman coding
314
315 Prefix coding represents symbols from an a priori known
316 alphabet by bit sequences (codes), one code for each symbol, in
317 a manner such that different symbols may be represented by bit
318 sequences of different lengths, but a parser can always parse
319 an encoded string unambiguously symbol-by-symbol.
320
321 We define a prefix code in terms of a binary tree in which the
322 two edges descending from each non-leaf node are labeled 0 and
323 1 and in which the leaf nodes correspond one-for-one with (are
324 labeled with) the symbols of the alphabet; then the code for a
325 symbol is the sequence of 0's and 1's on the edges leading from
326 the root to the leaf labeled with that symbol. For example:
327
                          /\              Symbol    Code
                         0  1             ------    ----
                        /    \                A      00
                       /\     B               B       1
                      0  1                    C     011
                     /    \                   D     010
                    A     /\
                         0  1
                        /    \
                       D      C
353
354 A parser can decode the next symbol from an encoded input
355 stream by walking down the tree from the root, at each step
356 choosing the edge corresponding to the next input bit.
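
The tree walk can be sketched in C for the example code (A=00, B=1, C=011, D=010). The node layout, the table, and the function name are illustrative assumptions, and the input is given as a string of '0' and '1' characters to keep the sketch readable.

```c
#include <stddef.h>

/* Node of a binary prefix-code tree.  Leaves have symbol != 0. */
struct node {
    int child[2];   /* indices of the 0-edge and 1-edge children */
    char symbol;    /* 0 for internal nodes */
};

/* Tree for the example code: A=00, B=1, C=011, D=010. */
static const struct node example_tree[] = {
    /* 0: root      */ {{1, 2}, 0},
    /* 1: prefix 0  */ {{3, 4}, 0},
    /* 2: leaf B    */ {{0, 0}, 'B'},
    /* 3: leaf A    */ {{0, 0}, 'A'},
    /* 4: prefix 01 */ {{5, 6}, 0},
    /* 5: leaf D    */ {{0, 0}, 'D'},
    /* 6: leaf C    */ {{0, 0}, 'C'},
};

/* Decode a string of '0'/'1' characters into out (capacity cap,
   including the terminating NUL).  Returns the number of symbols
   decoded, or -1 on bad input, a truncated code, or overflow. */
int decode_bits(const char *bits, char *out, size_t cap)
{
    size_t n = 0;
    int cur = 0;                       /* start at the root */

    for (; *bits; bits++) {
        if (*bits != '0' && *bits != '1')
            return -1;
        cur = example_tree[cur].child[*bits - '0'];
        if (example_tree[cur].symbol) {
            if (n + 1 >= cap)
                return -1;
            out[n++] = example_tree[cur].symbol;
            cur = 0;                   /* back to the root */
        }
    }
    if (cur != 0 || cap == 0)
        return -1;                     /* input ended in mid-code */
    out[n] = '\0';
    return (int)n;
}
```

Decoding the bit string 001011010 with this table yields the four symbols A, B, C, D in that order.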
357
         Given an alphabet with known symbol frequencies, the Huffman
         algorithm allows the construction of an optimal prefix code
         (one which represents strings with those symbol frequencies
         using the fewest bits of any possible prefix code for that
         alphabet).  Such a code is called a Huffman code.  (See
         reference [1] in Chapter 5, References, for additional
         information on Huffman codes.)

         Note that in the "deflate" format, the Huffman codes for the
         various alphabets must not exceed certain maximum code lengths.
         This constraint complicates the algorithm for computing code
         lengths from symbol frequencies.  Again, see Chapter 5,
         References, for details.
371
372 3.2.2. Use of Huffman coding in the "deflate" format
373
374 The Huffman codes used for each alphabet in the "deflate"
375 format have two additional rules:
376
377 * All codes of a given bit length have lexicographically
378 consecutive values, in the same order as the symbols
379 they represent;
380
381 * Shorter codes lexicographically precede longer codes.
382
         We could recode the example above to follow these rules as
         follows, assuming that the order of the alphabet is ABCD:
401
            Symbol  Code
            ------  ----
            A       10
            B       0
            C       110
            D       111
408
409 I.e., 0 precedes 10 which precedes 11x, and 110 and 111 are
410 lexicographically consecutive.
411
         Given these rules, we can define the Huffman code for an alphabet
413 just by giving the bit lengths of the codes for each symbol of
414 the alphabet in order; this is sufficient to determine the
415 actual codes. In our example, the code is completely defined
416 by the sequence of bit lengths (2, 1, 3, 3). The following
417 algorithm generates the codes as integers, intended to be read
418 from most- to least-significant bit. The code lengths are
419 initially in tree[I].Len; the codes are produced in
420 tree[I].Code.
421
422 1) Count the number of codes for each code length. Let
423 bl_count[N] be the number of codes of length N, N >= 1.
424
425 2) Find the numerical value of the smallest code for each
426 code length:
427
                code = 0;
                bl_count[0] = 0;
                for (bits = 1; bits <= MAX_BITS; bits++) {
                    code = (code + bl_count[bits-1]) << 1;
                    next_code[bits] = code;
                }
434
435 3) Assign numerical values to all codes, using consecutive
436 values for all codes of the same length with the base
437 values determined at step 2. Codes that are never used
438 (which have a bit length of zero) must not be assigned a
439 value.
440
                for (n = 0;  n <= max_code; n++) {
                    len = tree[n].Len;
                    if (len != 0) {
                        tree[n].Code = next_code[len];
                        next_code[len]++;
                    }
                }
456
457 Example:
458
459 Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3,
460 3, 2, 4, 4). After step 1, we have:
461
            N      bl_count[N]
            -      -----------
            2      1
            3      5
            4      2

         Step 2 computes the following next_code values:

            N      next_code[N]
            -      ------------
            1      0
            2      0
            3      2
            4      14

         Step 3 produces the following code values:

            Symbol Length   Code
            ------ ------   ----
            A       3        010
            B       3        011
            C       3        100
            D       3        101
            E       3        110
            F       2         00
            G       4       1110
            H       4       1111
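
Steps 1 through 3 can be combined into one self-contained routine. The following is a sketch, assuming a maximum code length of 15 bits and plain arrays in place of the tree[] structure used in the fragments above. The function name is hypothetical, and the routine does not check whether the set of lengths is over-subscribed or incomplete.

```c
#include <stdint.h>

#define MAX_BITS 15   /* assumed longest code length */

/* Given n code lengths (0 = symbol unused), fill codes[] with the
   code values produced by steps 1-3.
   Returns 0 on success, -1 if a length exceeds MAX_BITS. */
int assign_codes(const uint8_t *lengths, int n, uint16_t *codes)
{
    unsigned bl_count[MAX_BITS + 1] = {0};
    unsigned next_code[MAX_BITS + 1] = {0};
    unsigned code = 0;

    /* Step 1: count the codes of each length. */
    for (int i = 0; i < n; i++) {
        if (lengths[i] > MAX_BITS)
            return -1;
        bl_count[lengths[i]]++;
    }
    bl_count[0] = 0;   /* unused symbols take no code space */

    /* Step 2: smallest code value for each length. */
    for (int bits = 1; bits <= MAX_BITS; bits++) {
        code = (code + bl_count[bits - 1]) << 1;
        next_code[bits] = code;
    }

    /* Step 3: hand out consecutive values within each length. */
    for (int i = 0; i < n; i++) {
        int len = lengths[i];

        codes[i] = 0;
        if (len != 0)
            codes[i] = (uint16_t)next_code[len]++;
    }
    return 0;
}
```

Called with the lengths (3, 3, 3, 3, 3, 2, 4, 4), the routine produces the values 2, 3, 4, 5, 6, 0, 14, 15, which are the codes in the table above read as integers.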
489
490 3.2.3. Details of block format
491
492 Each block of compressed data begins with 3 header bits
493 containing the following data:
494
495 first bit BFINAL
496 next 2 bits BTYPE
497
498 Note that the header bits do not necessarily begin on a byte
499 boundary, since a block does not necessarily occupy an integral
500 number of bytes.
501
511 BFINAL is set if and only if this is the last block of the data
512 set.
513
514 BTYPE specifies how the data are compressed, as follows:
515
516 00 - no compression
517 01 - compressed with fixed Huffman codes
518 10 - compressed with dynamic Huffman codes
519 11 - reserved (error)
520
521 The only difference between the two compressed cases is how the
522 Huffman codes for the literal/length and distance alphabets are
523 defined.
524
525 In all cases, the decoding algorithm for the actual data is as
526 follows:
527
            do
               read block header from input stream.
               if stored with no compression
                  skip any remaining bits in current partially
                     processed byte
                  read LEN and NLEN (see next section)
                  copy LEN bytes of data to output
               otherwise
                  if compressed with dynamic Huffman codes
                     read representation of code trees (see
                        subsection below)
                  loop (until end of block code recognized)
                     decode literal/length value from input stream
                     if value < 256
                        copy value (literal byte) to output stream
                     otherwise
                        if value = end of block (256)
                           break from loop
                        otherwise (value = 257..285)
                           decode distance from input stream

                           move backwards distance bytes in the output
                           stream, and copy length bytes from this
                           position to the output stream.
                  end loop
            while not last block
554
         Note that a duplicated string reference may refer to a string
         in a previous block; i.e., the backward distance may cross one
         or more block boundaries.  However, a distance cannot refer past
         the beginning of the output stream.  (An application using a
         preset dictionary might discard part of the output stream; a
         distance can refer to that part of the output stream anyway.)
         Note also that the referenced string may overlap the current
         position; for example, if the last 2 bytes decoded have values
         X and Y, a string reference with <length = 5, distance = 2>
         adds X,Y,X,Y,X to the output stream.
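
The overlapping case is the reason a decoder cannot simply use a block copy such as memcpy for string references. The following is a minimal sketch, assuming the whole output is held in one buffer (no sliding window and no preset dictionary); the function name and parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Append `length` bytes to out, which currently holds *out_len bytes
   and has room for `cap`, copying from `distance` bytes back.
   Copying one byte at a time lets the source overlap the bytes being
   written.  Returns 0 on success, -1 on an invalid reference. */
int copy_match(uint8_t *out, size_t *out_len, size_t cap,
               size_t length, size_t distance)
{
    if (*out_len > cap)
        return -1;                 /* inconsistent arguments */
    if (distance == 0 || distance > *out_len)
        return -1;                 /* refers before start of output */
    if (length > cap - *out_len)
        return -1;                 /* not enough room */
    for (size_t i = 0; i < length; i++) {
        out[*out_len] = out[*out_len - distance];
        (*out_len)++;
    }
    return 0;
}
```

With the output holding X,Y, a call with length 5 and distance 2 appends X,Y,X,Y,X, as in the example in the text.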
573
574 We now specify each compression method in turn.
575
576 3.2.4. Non-compressed blocks (BTYPE=00)
577
578 Any bits of input up to the next byte boundary are ignored.
579 The rest of the block consists of the following information:
580
              0   1   2   3   4...
            +---+---+---+---+================================+
            |  LEN  | NLEN  |... LEN bytes of literal data...|
            +---+---+---+---+================================+
585
586 LEN is the number of data bytes in the block. NLEN is the
587 one's complement of LEN.
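
A decoder can use NLEN as a consistency check on LEN. The following is a sketch, assuming the four header bytes have already been read from the byte-aligned position; the function name is hypothetical.

```c
#include <stdint.h>

/* Parse the 4-byte LEN/NLEN header of a stored block.
   On success stores the payload length in *len and returns 0;
   returns -1 if NLEN is not the one's complement of LEN. */
int parse_stored_header(const uint8_t hdr[4], uint16_t *len)
{
    /* Both fields are stored least-significant byte first. */
    uint16_t l  = (uint16_t)(hdr[0] | ((uint16_t)hdr[1] << 8));
    uint16_t nl = (uint16_t)(hdr[2] | ((uint16_t)hdr[3] << 8));

    if ((uint16_t)~l != nl)
        return -1;
    *len = l;
    return 0;
}
```

For a 5-byte payload the header bytes would be 05 00 FA FF.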
588
589 3.2.5. Compressed blocks (length and distance codes)
590
591 As noted above, encoded data blocks in the "deflate" format
592 consist of sequences of symbols drawn from three conceptually
593 distinct alphabets: either literal bytes, from the alphabet of
594 byte values (0..255), or <length, backward distance> pairs,
595 where the length is drawn from (3..258) and the distance is
596 drawn from (1..32,768). In fact, the literal and length
597 alphabets are merged into a single alphabet (0..285), where
598 values 0..255 represent literal bytes, the value 256 indicates
599 end-of-block, and values 257..285 represent length codes
600 (possibly in conjunction with extra bits following the symbol
601 code) as follows:
602
                 Extra               Extra               Extra
            Code Bits Length(s) Code Bits Length(s) Code Bits Length(s)
            ---- ---- --------- ---- ---- --------- ---- ---- ---------
             257   0     3       267   1   15,16     277   4   67-82
             258   0     4       268   1   17,18     278   4   83-98
             259   0     5       269   2   19-22     279   4   99-114
             260   0     6       270   2   23-26     280   4  115-130
             261   0     7       271   2   27-30     281   5  131-162
             262   0     8       272   2   31-34     282   5  163-194
             263   0     9       273   3   35-42     283   5  195-226
             264   0    10       274   3   43-50     284   5  227-257
             265   1  11,12      275   3   51-58     285   0    258
             266   1  13,14      276   3   59-66
636
637 The extra bits should be interpreted as a machine integer
638 stored with the most-significant bit first, e.g., bits 1110
639 represent the value 14.
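
The length table can be applied as a pair of lookup arrays. In this sketch the arrays are transcribed from the table above; the function name is hypothetical, and the extra-bit value is assumed to have been read from the stream already.

```c
/* Base length and number of extra bits for length symbols 257..285,
   transcribed from the table above. */
static const unsigned short length_base[29] = {
    3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
    35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258
};
static const unsigned char length_extra[29] = {
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
    3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 0
};

/* Return the match length for a length symbol and the value of its
   extra bits, or -1 if the symbol or the extra value is out of range.
   Note: the table lists 227-257 for code 284; this sketch does not
   reject extra = 31 (length 258) for that code. */
int decode_length(int symbol, unsigned extra)
{
    int i;

    if (symbol < 257 || symbol > 285)
        return -1;
    i = symbol - 257;
    if (extra >= (1u << length_extra[i]))
        return -1;
    return length_base[i] + (int)extra;
}
```

For example, symbol 265 with an extra-bit value of 1 gives a length of 12, and symbol 285 always gives 258.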
640
                  Extra           Extra               Extra
             Code Bits Dist  Code Bits   Dist     Code Bits Distance
             ---- ---- ----  ---- ----  ------    ---- ---- --------
               0   0    1     10   4     33-48    20    9   1025-1536
               1   0    2     11   4     49-64    21    9   1537-2048
               2   0    3     12   5     65-96    22   10   2049-3072
               3   0    4     13   5     97-128   23   10   3073-4096
               4   1   5,6    14   6    129-192   24   11   4097-6144
               5   1   7,8    15   6    193-256   25   11   6145-8192
               6   2   9-12   16   7    257-384   26   12  8193-12288
               7   2  13-16   17   7    385-512   27   12 12289-16384
               8   3  17-24   18   8    513-768   28   13 16385-24576
               9   3  25-32   19   8   769-1024   29   13 24577-32768
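
The distance table can be handled the same way as the length table. The arrays below are transcribed from the table above; the function name is hypothetical, and the extra-bit value is assumed to have been read already.

```c
/* Base distance and number of extra bits for distance codes 0..29,
   transcribed from the table above. */
static const unsigned short dist_base[30] = {
    1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
    257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145,
    8193, 12289, 16385, 24577
};
static const unsigned char dist_extra[30] = {
    0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6,
    7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13
};

/* Return the backward distance for a distance code and the value of
   its extra bits, or -1 if either is out of range. */
long decode_distance(int code, unsigned extra)
{
    if (code < 0 || code > 29)
        return -1;
    if (extra >= (1u << dist_extra[code]))
        return -1;
    return (long)dist_base[code] + (long)extra;
}
```

The largest value, code 29 with all 13 extra bits set, gives 24577 + 8191 = 32768, the distance limit of the format.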
654
655 3.2.6. Compression with fixed Huffman codes (BTYPE=01)
656
657 The Huffman codes for the two alphabets are fixed, and are not
658 represented explicitly in the data. The Huffman code lengths
659 for the literal/length alphabet are:
660
                   Lit Value    Bits        Codes
                   ---------    ----        -----
                     0 - 143     8          00110000 through
                                            10111111
                   144 - 255     9          110010000 through
                                            111111111
                   256 - 279     7          0000000 through
                                            0010111
                   280 - 287     8          11000000 through
                                            11000111
671
679 The code lengths are sufficient to generate the actual codes,
680 as described above; we show the codes in the table for added
681 clarity. Literal/length values 286-287 will never actually
682 occur in the compressed data, but participate in the code
683 construction.
684
         Distance codes 0-31 are represented by (fixed-length) 5-bit
         codes, with possible additional bits as shown in the table in
         Paragraph 3.2.5, above.  Note that distance codes 30-31 will
         never actually occur in the compressed data.
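
The statement that the lengths alone determine the fixed codes can be checked mechanically. The following sketch fills in the fixed literal/length code lengths and derives the code of one symbol with the construction from Paragraph 3.2.2. Both function names are hypothetical, and the per-symbol computation is deliberately simple rather than efficient.

```c
#include <stdint.h>

/* Fill lengths[0..287] with the fixed literal/length code lengths. */
void fixed_litlen_lengths(uint8_t lengths[288])
{
    int i = 0;

    for (; i <= 143; i++) lengths[i] = 8;
    for (; i <= 255; i++) lengths[i] = 9;
    for (; i <= 279; i++) lengths[i] = 7;
    for (; i <= 287; i++) lengths[i] = 8;
}

/* Return the fixed code for `symbol` as an integer and store its bit
   length in *len_out.  Returns -1 if symbol is outside 0..287. */
int fixed_litlen_code(int symbol, int *len_out)
{
    uint8_t lengths[288];
    unsigned bl_count[10] = {0};
    unsigned next_code[10] = {0};
    unsigned code = 0;

    if (symbol < 0 || symbol > 287)
        return -1;
    fixed_litlen_lengths(lengths);

    /* Count codes per length, then find the first code of each length. */
    for (int i = 0; i < 288; i++)
        bl_count[lengths[i]]++;
    for (int bits = 1; bits <= 9; bits++) {
        code = (code + bl_count[bits - 1]) << 1;
        next_code[bits] = code;
    }

    /* Symbols of equal length get consecutive codes in symbol order,
       so skip one code for each earlier symbol of each length. */
    for (int i = 0; i < symbol; i++)
        next_code[lengths[i]]++;

    *len_out = lengths[symbol];
    return (int)next_code[lengths[symbol]];
}
```

The results agree with the table: symbol 0 maps to 00110000, symbol 144 to 110010000, symbol 256 to 0000000, and symbol 280 to 11000000.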
689
690 3.2.7. Compression with dynamic Huffman codes (BTYPE=10)
691
692 The Huffman codes for the two alphabets appear in the block
693 immediately after the header bits and before the actual
694 compressed data, first the literal/length code and then the
695 distance code. Each code is defined by a sequence of code
696 lengths, as discussed in Paragraph 3.2.2, above. For even
697 greater compactness, the code length sequences themselves are
698 compressed using a Huffman code. The alphabet for code lengths
699 is as follows:
700
               0 - 15: Represent code lengths of 0 - 15
                   16: Copy the previous code length 3 - 6 times.
                       The next 2 bits indicate repeat length
                             (0 = 3, ... , 3 = 6)
                          Example:  Codes 8, 16 (+2 bits 11),
                                    16 (+2 bits 10) will expand to
                                    12 code lengths of 8 (1 + 6 + 5)
                   17: Repeat a code length of 0 for 3 - 10 times.
                       (3 bits of length)
                   18: Repeat a code length of 0 for 11 - 138 times
                       (7 bits of length)
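
The repeat codes can be expanded with a short loop. This sketch assumes the code length symbols and their extra-bit values have already been decoded into an array; that intermediate representation, the struct, and the function name are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One decoded code length symbol plus the value of its extra bits
   (extra is ignored for symbols 0..15).  Hypothetical representation. */
struct cl_sym {
    uint8_t sym;     /* 0..18 */
    uint8_t extra;   /* raw extra-bit value */
};

/* Expand nsyms code length symbols into out[0..cap-1].
   Returns the number of code lengths written, or -1 on error. */
int expand_code_lengths(const struct cl_sym *in, size_t nsyms,
                        uint8_t *out, size_t cap)
{
    size_t n = 0;

    for (size_t i = 0; i < nsyms; i++) {
        uint8_t value;
        size_t repeat;

        switch (in[i].sym) {
        case 16:                       /* copy previous length 3-6 times */
            if (n == 0 || in[i].extra > 3)
                return -1;
            value = out[n - 1];
            repeat = 3 + (size_t)in[i].extra;
            break;
        case 17:                       /* 3-10 zeros */
            if (in[i].extra > 7)
                return -1;
            value = 0;
            repeat = 3 + (size_t)in[i].extra;
            break;
        case 18:                       /* 11-138 zeros */
            if (in[i].extra > 127)
                return -1;
            value = 0;
            repeat = 11 + (size_t)in[i].extra;
            break;
        default:                       /* literal code length 0-15 */
            if (in[i].sym > 15)
                return -1;
            value = in[i].sym;
            repeat = 1;
            break;
        }
        if (repeat > cap - n)
            return -1;                 /* output would overflow */
        while (repeat--)
            out[n++] = value;
    }
    return (int)n;
}
```

The example from the table, symbols 8, 16 (extra 3), 16 (extra 2), expands to 1 + 6 + 5 = 12 code lengths of 8.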
712
713 A code length of 0 indicates that the corresponding symbol in
714 the literal/length or distance alphabet will not occur in the
715 block, and should not participate in the Huffman code
716 construction algorithm given earlier. If only one distance
717 code is used, it is encoded using one bit, not zero bits; in
718 this case there is a single code length of one, with one unused
719 code. One distance code of zero bits means that there are no
720 distance codes used at all (the data is all literals).
721
722 We can now define the format of the block:
723
               5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286)
               5 Bits: HDIST, # of Distance codes - 1        (1 - 32)
               4 Bits: HCLEN, # of Code Length codes - 4     (4 - 19)

               (HCLEN + 4) x 3 bits: code lengths for the code length
                  alphabet given just above, in the order: 16, 17, 18,
                  0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15

                  These code lengths are interpreted as 3-bit integers
                  (0-7); as above, a code length of 0 means the
                  corresponding symbol (literal/length or distance code
                  length) is not used.

               HLIT + 257 code lengths for the literal/length alphabet,
                  encoded using the code length Huffman code

               HDIST + 1 code lengths for the distance alphabet,
                  encoded using the code length Huffman code

               The actual compressed data of the block,
                  encoded using the literal/length and distance Huffman
                  codes

               The literal/length symbol 256 (end of data),
                  encoded using the literal/length Huffman code
756
757 The code length repeat codes can cross from HLIT + 257 to the
758 HDIST + 1 code lengths. In other words, all code lengths form
759 a single sequence of HLIT + HDIST + 258 values.
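
The first fields of a dynamic block can be unpacked as follows. This sketch assumes the 14 bits that follow the block header have been gathered into an integer with the first stream bit in bit 0, and that the 3-bit code length values have been read into an array in stored order. All names are hypothetical.

```c
#include <stdint.h>

/* Order in which the code length code lengths are stored. */
static const uint8_t cl_order[19] = {
    16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15
};

struct dyn_counts {
    int nlit;    /* HLIT + 257 */
    int ndist;   /* HDIST + 1 */
    int nclen;   /* HCLEN + 4 */
};

/* Unpack HLIT, HDIST and HCLEN.  Returns 0 on success, -1 if the
   literal/length count exceeds the 286 allowed by the format. */
int parse_dyn_counts(uint16_t bits, struct dyn_counts *c)
{
    c->nlit  = (bits & 0x1F) + 257;
    c->ndist = ((bits >> 5) & 0x1F) + 1;
    c->nclen = ((bits >> 10) & 0x0F) + 4;
    return c->nlit > 286 ? -1 : 0;
}

/* Scatter nclen stored 3-bit values into a table indexed by code
   length symbol; symbols not covered keep a length of 0. */
void place_cl_lengths(const uint8_t *stored, int nclen, uint8_t cl_len[19])
{
    for (int i = 0; i < 19; i++)
        cl_len[i] = 0;
    for (int i = 0; i < nclen && i < 19; i++)
        cl_len[cl_order[i]] = stored[i] & 7;
}
```

The permuted order places the symbols judged least likely to be needed at the end, so a compressor can often send fewer than 19 values.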
760
761 3.3. Compliance
762
763 A compressor may limit further the ranges of values specified in
764 the previous section and still be compliant; for example, it may
765 limit the range of backward pointers to some value smaller than
766 32K. Similarly, a compressor may limit the size of blocks so that
767 a compressible block fits in memory.
768
769 A compliant decompressor must accept the full range of possible
770 values defined in the previous section, and must accept blocks of
771 arbitrary size.
772
7734. Compression algorithm details
774
775 While it is the intent of this document to define the "deflate"
776 compressed data format without reference to any particular
777 compression algorithm, the format is related to the compressed
778 formats produced by LZ77 (Lempel-Ziv 1977, see reference [2] below);
779 since many variations of LZ77 are patented, it is strongly
780 recommended that the implementor of a compressor follow the general
781 algorithm presented here, which is known not to be patented per se.
   The material in this section is not part of the definition of the
   specification per se, and a compressor need not follow it in order to
   be compliant.
793
794 The compressor terminates a block when it determines that starting a
795 new block with fresh trees would be useful, or when the block size
796 fills up the compressor's block buffer.
797
798 The compressor uses a chained hash table to find duplicated strings,
799 using a hash function that operates on 3-byte sequences. At any
800 given point during compression, let XYZ be the next 3 input bytes to
801 be examined (not necessarily all different, of course). First, the
802 compressor examines the hash chain for XYZ. If the chain is empty,
803 the compressor simply writes out X as a literal byte and advances one
804 byte in the input. If the hash chain is not empty, indicating that
805 the sequence XYZ (or, if we are unlucky, some other 3 bytes with the
806 same hash function value) has occurred recently, the compressor
807 compares all strings on the XYZ hash chain with the actual input data
808 sequence starting at the current point, and selects the longest
809 match.
810
811 The compressor searches the hash chains starting with the most recent
812 strings, to favor small distances and thus take advantage of the
813 Huffman encoding. The hash chains are singly linked. There are no
814 deletions from the hash chains; the algorithm simply discards matches
815 that are too old. To avoid a worst-case situation, very long hash
816 chains are arbitrarily truncated at a certain length, determined by a
817 run-time parameter.
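
The chained hash table search can be sketched as follows. This is an illustration under stated assumptions, not the code of any existing compressor: the hash function is an arbitrary one chosen for the example, the whole input is held in memory, the caller supplies the prev array (one entry per input byte), and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define HASH_BITS 12
#define HASH_SIZE (1u << HASH_BITS)
#define WINDOW    32768u          /* maximum match distance */
#define MIN_MATCH 3
#define MAX_MATCH 258
#define NIL       ((size_t)-1)    /* end-of-chain marker */

struct matcher {
    const uint8_t *data;
    size_t len;
    size_t head[HASH_SIZE];  /* most recent position for each hash */
    size_t *prev;            /* prev[pos]: older position, same hash */
};

/* Arbitrary illustrative hash of 3 bytes (an assumption). */
static unsigned hash3(const uint8_t *p)
{
    return ((unsigned)(p[0] << 8) ^ (unsigned)(p[1] << 4) ^ p[2])
           & (HASH_SIZE - 1);
}

void matcher_init(struct matcher *m, const uint8_t *data, size_t len,
                  size_t *prev)
{
    m->data = data;
    m->len = len;
    m->prev = prev;
    for (size_t i = 0; i < HASH_SIZE; i++)
        m->head[i] = NIL;
}

/* Put the 3-byte string at pos at the front of its chain. */
void matcher_insert(struct matcher *m, size_t pos)
{
    unsigned h;

    if (pos + MIN_MATCH > m->len)
        return;
    h = hash3(m->data + pos);
    m->prev[pos] = m->head[h];
    m->head[h] = pos;
}

/* Search the chain for the string at pos, most recent entry first,
   visiting at most max_chain entries.  Returns the best length
   (0 if shorter than MIN_MATCH) and stores the distance in *dist. */
size_t longest_match(const struct matcher *m, size_t pos,
                     size_t max_chain, size_t *dist)
{
    size_t best = 0, limit, cand;

    if (pos + MIN_MATCH > m->len)
        return 0;
    limit = m->len - pos;
    if (limit > MAX_MATCH)
        limit = MAX_MATCH;
    cand = m->head[hash3(m->data + pos)];
    for (; cand != NIL && max_chain > 0; cand = m->prev[cand], max_chain--) {
        size_t n = 0;

        if (cand >= pos)
            continue;              /* only earlier positions count */
        if (pos - cand > WINDOW)
            break;                 /* too old; the rest are older */
        while (n < limit && m->data[cand + n] == m->data[pos + n])
            n++;
        if (n >= MIN_MATCH && n > best) {
            best = n;              /* ties keep the nearer match */
            *dist = pos - cand;
        }
    }
    return best;
}
```

On the input "abcabcabc", after positions 0 through 2 have been inserted, a search at position 3 finds a match of length 6 at distance 3. The max_chain argument plays the role of the run-time parameter that truncates long chains.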
818
819 To improve overall compression, the compressor optionally defers the
820 selection of matches ("lazy matching"): after a match of length N has
821 been found, the compressor searches for a longer match starting at
822 the next input byte. If it finds a longer match, it truncates the
823 previous match to a length of one (thus producing a single literal
824 byte) and then emits the longer match. Otherwise, it emits the
825 original match, and, as described above, advances N bytes before
826 continuing.
827
828 Run-time parameters also control this "lazy match" procedure. If
829 compression ratio is most important, the compressor attempts a
830 complete second search regardless of the length of the first match.
831 In the normal case, if the current match is "long enough", the
832 compressor reduces the search for a longer match, thus speeding up
833 the process. If speed is most important, the compressor inserts new
834 strings in the hash table only when no match was found, or when the
835 match is not "too long". This degrades the compression ratio but
836 saves time since there are both fewer insertions and fewer searches.
837
8475. References
848
849 [1] Huffman, D. A., "A Method for the Construction of Minimum
850 Redundancy Codes", Proceedings of the Institute of Radio
851 Engineers, September 1952, Volume 40, Number 9, pp. 1098-1101.
852
   [2] Ziv, J., and Lempel, A., "A Universal Algorithm for Sequential
       Data Compression", IEEE Transactions on Information Theory,
       Vol. 23, No. 3, pp. 337-343.
856
857 [3] Gailly, J.-L., and Adler, M., ZLIB documentation and sources,
858 available in ftp://ftp.uu.net/pub/archiving/zip/doc/
859
860 [4] Gailly, J.-L., and Adler, M., GZIP documentation and sources,
861 available as gzip-*.tar in ftp://prep.ai.mit.edu/pub/gnu/
862
863 [5] Schwartz, E. S., and Kallick, B. "Generating a canonical prefix
864 encoding." Comm. ACM, 7,3 (Mar. 1964), pp. 166-169.
865
866 [6] Hirschberg and Lelewer, "Efficient decoding of prefix codes,"
867 Comm. ACM, 33,4, April 1990, pp. 449-459.
868
8696. Security Considerations
870
871 Any data compression method involves the reduction of redundancy in
872 the data. Consequently, any corruption of the data is likely to have
873 severe effects and be difficult to correct. Uncompressed text, on
874 the other hand, will probably still be readable despite the presence
875 of some corrupted bytes.
876
877 It is recommended that systems using this data format provide some
878 means of validating the integrity of the compressed data. See
879 reference [3], for example.
880
8817. Source code
882
883 Source code for a C language implementation of a "deflate" compliant
884 compressor and decompressor is available within the zlib package at
885 ftp://ftp.uu.net/pub/archiving/zip/zlib/.
886
8878. Acknowledgements
888
889 Trademarks cited in this document are the property of their
890 respective owners.
891
892 Phil Katz designed the deflate format. Jean-Loup Gailly and Mark
893 Adler wrote the related software described in this specification.
894 Glenn Randers-Pehrson converted this document to RFC and HTML format.
895
9039. Author's Address
904
905 L. Peter Deutsch
906 Aladdin Enterprises
907 203 Santa Margarita Ave.
908 Menlo Park, CA 94025
909
910 Phone: (415) 322-0103 (AM only)
911 FAX: (415) 322-1734
912 EMail: <ghost@aladdin.com>
913
914 Questions about the technical content of this specification can be
915 sent by email to:
916
917 Jean-Loup Gailly <gzip@prep.ai.mit.edu> and
918 Mark Adler <madler@alumni.caltech.edu>
919
920 Editorial comments on this specification can be sent by email to:
921
922 L. Peter Deutsch <ghost@aladdin.com> and
923 Glenn Randers-Pehrson <randeg@alumni.rpi.edu>
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954Deutsch Informational [Page 17]
955
diff --git a/doc/rfc1952.txt b/doc/rfc1952.txt
new file mode 100644
index 0000000..a8e51b4
--- /dev/null
+++ b/doc/rfc1952.txt
@@ -0,0 +1,675 @@
1
2
3
4
5
6
Network Working Group                                         P. Deutsch
Request for Comments: 1952                           Aladdin Enterprises
Category: Informational                                         May 1996


               GZIP file format specification version 4.3
13
14Status of This Memo
15
16 This memo provides information for the Internet community. This memo
17 does not specify an Internet standard of any kind. Distribution of
18 this memo is unlimited.
19
20IESG Note:
21
22 The IESG takes no position on the validity of any Intellectual
23 Property Rights statements contained in this document.
24
25Notices
26
27 Copyright (c) 1996 L. Peter Deutsch
28
29 Permission is granted to copy and distribute this document for any
30 purpose and without charge, including translations into other
31 languages and incorporation into compilations, provided that the
32 copyright notice and this notice are preserved, and that any
33 substantive changes or deletions from the original are clearly
34 marked.
35
36 A pointer to the latest version of this and related documentation in
37 HTML format can be found at the URL
38 <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>.
39
40Abstract
41
42 This specification defines a lossless compressed data format that is
43 compatible with the widely used GZIP utility. The format includes a
44 cyclic redundancy check value for detecting data corruption. The
45 format presently uses the DEFLATE method of compression but can be
46 easily extended to use other compression methods. The format can be
47 implemented readily in a manner not covered by patents.
48
49
50
51
52
53
54
55
56
57
58Deutsch Informational [Page 1]
59
60RFC 1952 GZIP File Format Specification May 1996
61
62
63Table of Contents
64
65 1. Introduction ................................................... 2
66 1.1. Purpose ................................................... 2
67 1.2. Intended audience ......................................... 3
68 1.3. Scope ..................................................... 3
69 1.4. Compliance ................................................ 3
70 1.5. Definitions of terms and conventions used ................. 3
71 1.6. Changes from previous versions ............................ 3
72 2. Detailed specification ......................................... 4
73 2.1. Overall conventions ....................................... 4
74 2.2. File format ............................................... 5
75 2.3. Member format ............................................. 5
76 2.3.1. Member header and trailer ........................... 6
77 2.3.1.1. Extra field ................................... 8
78 2.3.1.2. Compliance .................................... 9
79 3. References .................................................. 9
80 4. Security Considerations .................................... 10
81 5. Acknowledgements ........................................... 10
82 6. Author's Address ........................................... 10
83 7. Appendix: Jean-Loup Gailly's gzip utility .................. 11
84 8. Appendix: Sample CRC Code .................................. 11
85
861. Introduction
87
88 1.1. Purpose
89
90 The purpose of this specification is to define a lossless
91 compressed data format that:
92
93 * Is independent of CPU type, operating system, file system,
94 and character set, and hence can be used for interchange;
95 * Can compress or decompress a data stream (as opposed to a
96 randomly accessible file) to produce another data stream,
97 using only an a priori bounded amount of intermediate
98 storage, and hence can be used in data communications or
99 similar structures such as Unix filters;
100 * Compresses data with efficiency comparable to the best
101 currently available general-purpose compression methods,
102 and in particular considerably better than the "compress"
103 program;
104 * Can be implemented readily in a manner not covered by
105 patents, and hence can be practiced freely;
106 * Is compatible with the file format produced by the current
107 widely used gzip utility, in that conforming decompressors
108 will be able to read data produced by the existing gzip
109 compressor.
110
111
112
113
114Deutsch Informational [Page 2]
115
116RFC 1952 GZIP File Format Specification May 1996
117
118
119 The data format defined by this specification does not attempt to:
120
121 * Provide random access to compressed data;
122 * Compress specialized data (e.g., raster graphics) as well as
123 the best currently available specialized algorithms.
124
125 1.2. Intended audience
126
127 This specification is intended for use by implementors of software
128 to compress data into gzip format and/or decompress data from gzip
129 format.
130
131 The text of the specification assumes a basic background in
132 programming at the level of bits and other primitive data
133 representations.
134
135 1.3. Scope
136
137 The specification specifies a compression method and a file format
138 (the latter assuming only that a file can store a sequence of
139 arbitrary bytes). It does not specify any particular interface to
140 a file system or anything about character sets or encodings
141 (except for file names and comments, which are optional).
142
143 1.4. Compliance
144
145 Unless otherwise indicated below, a compliant decompressor must be
146 able to accept and decompress any file that conforms to all the
147 specifications presented here; a compliant compressor must produce
148 files that conform to all the specifications presented here. The
149 material in the appendices is not part of the specification per se
150 and is not relevant to compliance.
151
152 1.5. Definitions of terms and conventions used
153
154 byte: 8 bits stored or transmitted as a unit (same as an octet).
155 (For this specification, a byte is exactly 8 bits, even on
156 machines which store a character on a number of bits different
157 from 8.) See below for the numbering of bits within a byte.
158
159 1.6. Changes from previous versions
160
161 There have been no technical changes to the gzip format since
162 version 4.1 of this specification. In version 4.2, some
163 terminology was changed, and the sample CRC code was rewritten for
164 clarity and to eliminate the requirement for the caller to do pre-
165 and post-conditioning. Version 4.3 is a conversion of the
166 specification to RFC style.
167
168
169
170Deutsch Informational [Page 3]
171
172RFC 1952 GZIP File Format Specification May 1996
173
174
1752. Detailed specification
176
177 2.1. Overall conventions
178
179 In the diagrams below, a box like this:
180
181 +---+
182 | | <-- the vertical bars might be missing
183 +---+
184
185 represents one byte; a box like this:
186
187 +==============+
188 | |
189 +==============+
190
191 represents a variable number of bytes.
192
193 Bytes stored within a computer do not have a "bit order", since
194 they are always treated as a unit. However, a byte considered as
195 an integer between 0 and 255 does have a most- and least-
196 significant bit, and since we write numbers with the most-
197 significant digit on the left, we also write bytes with the most-
198 significant bit on the left. In the diagrams below, we number the
199 bits of a byte so that bit 0 is the least-significant bit, i.e.,
200 the bits are numbered:
201
202 +--------+
203 |76543210|
204 +--------+
205
206 This document does not address the issue of the order in which
207 bits of a byte are transmitted on a bit-sequential medium, since
208 the data format described here is byte- rather than bit-oriented.
209
210 Within a computer, a number may occupy multiple bytes. All
211 multi-byte numbers in the format described here are stored with
212 the least-significant byte first (at the lower memory address).
213 For example, the decimal number 520 is stored as:
214
215 0 1
216 +--------+--------+
217 |00001000|00000010|
218 +--------+--------+
219 ^ ^
220 | |
221 | + more significant byte = 2 x 256
222 + less significant byte = 8
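
   The byte order described above can be sketched in C as follows. This
   is only an illustration: the helper names read_le16 and read_le32
   are hypothetical and do not come from the specification.

```c
#include <stdint.h>

/* Read a 16-bit value stored least-significant byte first. */
uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | ((unsigned)p[1] << 8));
}

/* Read a 32-bit value stored least-significant byte first. */
uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

   Assembling the value from individual bytes, rather than casting the
   buffer to an integer pointer, keeps the result independent of the
   host byte order and of alignment.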
223
224
225
226Deutsch Informational [Page 4]
227
228RFC 1952 GZIP File Format Specification May 1996
229
230
231 2.2. File format
232
233 A gzip file consists of a series of "members" (compressed data
234 sets). The format of each member is specified in the following
235 section. The members simply appear one after another in the file,
236 with no additional information before, between, or after them.
237
238 2.3. Member format
239
240 Each member has the following structure:
241
242 +---+---+---+---+---+---+---+---+---+---+
243 |ID1|ID2|CM |FLG| MTIME |XFL|OS | (more-->)
244 +---+---+---+---+---+---+---+---+---+---+
245
246 (if FLG.FEXTRA set)
247
248 +---+---+=================================+
249 | XLEN |...XLEN bytes of "extra field"...| (more-->)
250 +---+---+=================================+
251
252 (if FLG.FNAME set)
253
254 +=========================================+
255 |...original file name, zero-terminated...| (more-->)
256 +=========================================+
257
258 (if FLG.FCOMMENT set)
259
260 +===================================+
261 |...file comment, zero-terminated...| (more-->)
262 +===================================+
263
264 (if FLG.FHCRC set)
265
266 +---+---+
267 | CRC16 |
268 +---+---+
269
270 +=======================+
271 |...compressed blocks...| (more-->)
272 +=======================+
273
274 0 1 2 3 4 5 6 7
275 +---+---+---+---+---+---+---+---+
276 | CRC32 | ISIZE |
277 +---+---+---+---+---+---+---+---+
278
279
280
281
282Deutsch Informational [Page 5]
283
284RFC 1952 GZIP File Format Specification May 1996
285
286
287 2.3.1. Member header and trailer
288
289 ID1 (IDentification 1)
290 ID2 (IDentification 2)
291 These have the fixed values ID1 = 31 (0x1f, \037), ID2 = 139
292 (0x8b, \213), to identify the file as being in gzip format.
293
294 CM (Compression Method)
295 This identifies the compression method used in the file. CM
296 = 0-7 are reserved. CM = 8 denotes the "deflate"
297 compression method, which is the one customarily used by
298 gzip and which is documented elsewhere.
299
300 FLG (FLaGs)
301 This flag byte is divided into individual bits as follows:
302
303 bit 0 FTEXT
304 bit 1 FHCRC
305 bit 2 FEXTRA
306 bit 3 FNAME
307 bit 4 FCOMMENT
308 bit 5 reserved
309 bit 6 reserved
310 bit 7 reserved
311
312 If FTEXT is set, the file is probably ASCII text. This is
313 an optional indication, which the compressor may set by
314 checking a small amount of the input data to see whether any
315 non-ASCII characters are present. In case of doubt, FTEXT
316 is cleared, indicating binary data. For systems which have
317 different file formats for ASCII text and binary data, the
318 decompressor can use FTEXT to choose the appropriate format.
319 We deliberately do not specify the algorithm used to set
320 this bit, since a compressor always has the option of
321 leaving it cleared and a decompressor always has the option
322 of ignoring it and letting some other program handle issues
323 of data conversion.
324
325 If FHCRC is set, a CRC16 for the gzip header is present,
326 immediately before the compressed data. The CRC16 consists
327 of the two least significant bytes of the CRC32 for all
328 bytes of the gzip header up to and not including the CRC16.
329 [The FHCRC bit was never set by versions of gzip up to
330 1.2.4, even though it was documented with a different
331 meaning in gzip 1.2.4.]
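
   As a rough sketch of the CRC16 rule above, the header check value
   can be derived from a CRC-32 of the header bytes. The function names
   crc32_bitwise and gz_header_crc16 are assumptions made for this
   example; the bit-at-a-time loop is a compact stand-in for the
   table-driven sample code in the appendix and should give the same
   results.

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-32 computed one bit at a time (reflected polynomial 0xedb88320),
   with the usual one's-complement pre- and post-conditioning. */
uint32_t crc32_bitwise(const unsigned char *buf, size_t len)
{
    uint32_t c = 0xffffffffu;
    size_t i;
    int k;

    for (i = 0; i < len; i++) {
        c ^= buf[i];
        for (k = 0; k < 8; k++)
            c = (c & 1) ? (0xedb88320u ^ (c >> 1)) : (c >> 1);
    }
    return c ^ 0xffffffffu;
}

/* CRC16 for the header: the two least significant bytes of the CRC32
   over all header bytes that precede the CRC16 field itself. */
uint16_t gz_header_crc16(const unsigned char *hdr, size_t hdr_len)
{
    return (uint16_t)(crc32_bitwise(hdr, hdr_len) & 0xffffu);
}
```

   The caller is assumed to pass exactly the header bytes up to, and
   not including, the CRC16 field.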
332
333 If FEXTRA is set, optional extra fields are present, as
334 described in a following section.
335
336
337
338Deutsch Informational [Page 6]
339
340RFC 1952 GZIP File Format Specification May 1996
341
342
343 If FNAME is set, an original file name is present,
344 terminated by a zero byte. The name must consist of ISO
345 8859-1 (LATIN-1) characters; on operating systems using
346 EBCDIC or any other character set for file names, the name
347 must be translated to the ISO LATIN-1 character set. This
348 is the original name of the file being compressed, with any
349 directory components removed, and, if the file being
350 compressed is on a file system with case insensitive names,
351 forced to lower case. There is no original file name if the
352 data was compressed from a source other than a named file;
353 for example, if the source was stdin on a Unix system, there
354 is no file name.
355
356 If FCOMMENT is set, a zero-terminated file comment is
357 present. This comment is not interpreted; it is only
358 intended for human consumption. The comment must consist of
359 ISO 8859-1 (LATIN-1) characters. Line breaks should be
360 denoted by a single line feed character (10 decimal).
361
362 Reserved FLG bits must be zero.
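
   The flag bits listed above might be represented in C as a set of
   masks. The GZ_ prefix and the function gz_flags_valid are
   illustrative names, not part of the specification.

```c
/* FLG bit masks, named after the fields described in the text. */
enum {
    GZ_FTEXT    = 1 << 0,
    GZ_FHCRC    = 1 << 1,
    GZ_FEXTRA   = 1 << 2,
    GZ_FNAME    = 1 << 3,
    GZ_FCOMMENT = 1 << 4,
    GZ_RESERVED = 0xE0      /* bits 5, 6 and 7 */
};

/* Return 1 if no reserved bit is set in the flag byte, 0 otherwise. */
int gz_flags_valid(unsigned char flg)
{
    return (flg & GZ_RESERVED) == 0;
}
```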
363
364 MTIME (Modification TIME)
365 This gives the most recent modification time of the original
366 file being compressed. The time is in Unix format, i.e.,
367 seconds since 00:00:00 GMT, Jan. 1, 1970. (Note that this
368 may cause problems for MS-DOS and other systems that use
369 local rather than Universal time.) If the compressed data
370 did not come from a file, MTIME is set to the time at which
371 compression started. MTIME = 0 means no time stamp is
372 available.
373
374 XFL (eXtra FLags)
375 These flags are available for use by specific compression
376 methods. The "deflate" method (CM = 8) sets these flags as
377 follows:
378
379 XFL = 2 - compressor used maximum compression,
380 slowest algorithm
381 XFL = 4 - compressor used fastest algorithm
382
383 OS (Operating System)
384 This identifies the type of file system on which compression
385 took place. This may be useful in determining end-of-line
386 convention for text files. The currently defined values are
387 as follows:
388
389
390
391
392
393
394Deutsch Informational [Page 7]
395
396RFC 1952 GZIP File Format Specification May 1996
397
398
399 0 - FAT filesystem (MS-DOS, OS/2, NT/Win32)
400 1 - Amiga
401 2 - VMS (or OpenVMS)
402 3 - Unix
403 4 - VM/CMS
404 5 - Atari TOS
405 6 - HPFS filesystem (OS/2, NT)
406 7 - Macintosh
407 8 - Z-System
408 9 - CP/M
409 10 - TOPS-20
410 11 - NTFS filesystem (NT)
411 12 - QDOS
412 13 - Acorn RISCOS
413 255 - unknown
414
415 XLEN (eXtra LENgth)
416 If FLG.FEXTRA is set, this gives the length of the optional
417 extra field. See below for details.
418
419 CRC32 (CRC-32)
420 This contains a Cyclic Redundancy Check value of the
421 uncompressed data computed according to the CRC-32 algorithm
422 used in the ISO 3309 standard and in section 8.1.1.6.2 of
423 ITU-T recommendation V.42. (See http://www.iso.ch for
424 ordering ISO documents. See gopher://info.itu.ch for an
425 online version of ITU-T V.42.)
426
427 ISIZE (Input SIZE)
428 This contains the size of the original (uncompressed) input
429 data modulo 2^32.
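
   A minimal sketch of reading and checking the 8-byte trailer follows.
   The structure gz_trailer and the functions gz_parse_trailer and
   gz_trailer_matches are hypothetical names; the caller is assumed to
   have computed the CRC-32 and the total length of the uncompressed
   data separately.

```c
#include <stdint.h>

struct gz_trailer {
    uint32_t crc32;   /* CRC-32 of the uncompressed data */
    uint32_t isize;   /* uncompressed size modulo 2^32 */
};

/* Read a 32-bit value stored least-significant byte first. */
uint32_t gz_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the trailer: CRC32 in bytes 0..3, ISIZE in bytes 4..7. */
struct gz_trailer gz_parse_trailer(const unsigned char t[8])
{
    struct gz_trailer out;
    out.crc32 = gz_le32(t);
    out.isize = gz_le32(t + 4);
    return out;
}

/* Compare the trailer with values computed while decompressing.
   Only the low 32 bits of the length take part in the comparison. */
int gz_trailer_matches(const struct gz_trailer *t,
                       uint32_t crc, uint64_t total_len)
{
    return t->crc32 == crc
        && t->isize == (uint32_t)(total_len & 0xffffffffu);
}
```

   Because ISIZE is stored modulo 2^32, it cannot by itself tell a
   9-byte input apart from one that is 4 GiB longer.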
430
431 2.3.1.1. Extra field
432
433 If the FLG.FEXTRA bit is set, an "extra field" is present in
434 the header, with total length XLEN bytes. It consists of a
435 series of subfields, each of the form:
436
437 +---+---+---+---+==================================+
438 |SI1|SI2| LEN |... LEN bytes of subfield data ...|
439 +---+---+---+---+==================================+
440
441 SI1 and SI2 provide a subfield ID, typically two ASCII letters
442 with some mnemonic value. Jean-Loup Gailly
443 <gzip@prep.ai.mit.edu> is maintaining a registry of subfield
444 IDs; please send him any subfield ID you wish to use. Subfield
445 IDs with SI2 = 0 are reserved for future use. The following
446 IDs are currently defined:
447
448
449
450Deutsch Informational [Page 8]
451
452RFC 1952 GZIP File Format Specification May 1996
453
454
455 SI1 SI2 Data
456 ---------- ---------- ----
457 0x41 ('A') 0x70 ('P') Apollo file type information
458
459 LEN gives the length of the subfield data, excluding the 4
460 initial bytes.
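
   The subfield layout above suggests a simple walk over the extra
   field. The following is a sketch only; gz_find_subfield is a
   hypothetical helper, and returning NULL for a subfield that overruns
   the extra field is a design choice of this example rather than a
   rule stated in the specification.

```c
#include <stddef.h>

/* Search an extra field of xlen bytes for the subfield with ID
   (si1, si2). On success, store the data length in *len_out and
   return a pointer to the subfield data. Return NULL if the subfield
   is absent or a subfield length overruns the extra field. */
const unsigned char *gz_find_subfield(const unsigned char *extra,
                                      size_t xlen,
                                      unsigned char si1,
                                      unsigned char si2,
                                      size_t *len_out)
{
    size_t pos = 0;

    while (xlen - pos >= 4) {
        /* LEN is stored least-significant byte first. */
        size_t len = (size_t)extra[pos + 2]
                   | ((size_t)extra[pos + 3] << 8);

        if (len > xlen - pos - 4)
            return NULL;            /* malformed: data runs past XLEN */
        if (extra[pos] == si1 && extra[pos + 1] == si2) {
            *len_out = len;
            return extra + pos + 4;
        }
        pos += 4 + len;             /* skip SI1, SI2, LEN and the data */
    }
    return NULL;
}
```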
461
462 2.3.1.2. Compliance
463
464 A compliant compressor must produce files with correct ID1,
465 ID2, CM, CRC32, and ISIZE, but may set all the other fields in
466 the fixed-length part of the header to default values (255 for
467 OS, 0 for all others). The compressor must set all reserved
468 bits to zero.
469
470 A compliant decompressor must check ID1, ID2, and CM, and
471 provide an error indication if any of these have incorrect
472 values. It must examine FEXTRA/XLEN, FNAME, FCOMMENT and FHCRC
473 at least so it can skip over the optional fields if they are
474 present. It need not examine any other part of the header or
475 trailer; in particular, a decompressor may ignore FTEXT and OS
476 and always produce binary output, and still be compliant. A
477 compliant decompressor must give an error indication if any
478 reserved bit is non-zero, since such a bit could indicate the
479 presence of a new field that would cause subsequent data to be
480 interpreted incorrectly.
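
   The decompressor requirements above can be sketched as a routine
   that validates the fixed header and skips the optional fields. The
   name gz_skip_header and the convention of returning 0 on error are
   assumptions of this example. It does not verify the header CRC16,
   which the text above does not require.

```c
#include <stddef.h>

/* Return the offset of the first byte of compressed data in the
   n-byte buffer p, or 0 if the header is invalid or truncated. */
size_t gz_skip_header(const unsigned char *p, size_t n)
{
    size_t pos = 10;                /* fixed-length part of the header */
    unsigned flg;

    if (n < 10 || p[0] != 0x1f || p[1] != 0x8b || p[2] != 8)
        return 0;                   /* wrong ID1, ID2 or CM */
    flg = p[3];
    if (flg & 0xE0)
        return 0;                   /* a reserved bit is set */

    if (flg & 0x04) {               /* FEXTRA: XLEN, then XLEN bytes */
        size_t xlen;
        if (n - pos < 2)
            return 0;
        xlen = (size_t)p[pos] | ((size_t)p[pos + 1] << 8);
        pos += 2;
        if (n - pos < xlen)
            return 0;
        pos += xlen;
    }
    if (flg & 0x08) {               /* FNAME: zero-terminated */
        while (pos < n && p[pos] != 0)
            pos++;
        if (pos == n)
            return 0;
        pos++;
    }
    if (flg & 0x10) {               /* FCOMMENT: zero-terminated */
        while (pos < n && p[pos] != 0)
            pos++;
        if (pos == n)
            return 0;
        pos++;
    }
    if (flg & 0x02) {               /* FHCRC: two-byte CRC16 */
        if (n - pos < 2)
            return 0;
        pos += 2;
    }
    return pos;
}
```

   Since a valid header is at least 10 bytes long, 0 can never be a
   legitimate offset and is safe to use as the error value here.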
481
4823. References
483
484 [1] "Information Processing - 8-bit single-byte coded graphic
485 character sets - Part 1: Latin alphabet No.1" (ISO 8859-1:1987).
486 The ISO 8859-1 (Latin-1) character set is a superset of 7-bit
487 ASCII. Files defining this character set are available as
488 iso_8859-1.* in ftp://ftp.uu.net/graphics/png/documents/
489
490 [2] ISO 3309
491
492 [3] ITU-T recommendation V.42
493
494 [4] Deutsch, L.P.,"DEFLATE Compressed Data Format Specification",
495 available in ftp://ftp.uu.net/pub/archiving/zip/doc/
496
497 [5] Gailly, J.-L., GZIP documentation, available as gzip-*.tar in
498 ftp://prep.ai.mit.edu/pub/gnu/
499
500 [6] Sarwate, D.V., "Computation of Cyclic Redundancy Checks via Table
501 Look-Up", Communications of the ACM, 31(8), pp.1008-1013.
502
503
504
505
506Deutsch Informational [Page 9]
507
508RFC 1952 GZIP File Format Specification May 1996
509
510
511 [7] Schwaderer, W.D., "CRC Calculation", April 85 PC Tech Journal,
512 pp.118-133.
513
514 [8] ftp://ftp.adelaide.edu.au/pub/rocksoft/papers/crc_v3.txt,
515 describing the CRC concept.
516
5174. Security Considerations
518
519 Any data compression method involves the reduction of redundancy in
520 the data. Consequently, any corruption of the data is likely to have
521 severe effects and be difficult to correct. Uncompressed text, on
522 the other hand, will probably still be readable despite the presence
523 of some corrupted bytes.
524
525 It is recommended that systems using this data format provide some
526 means of validating the integrity of the compressed data, such as by
527 setting and checking the CRC-32 check value.
528
5295. Acknowledgements
530
531 Trademarks cited in this document are the property of their
532 respective owners.
533
534 Jean-Loup Gailly designed the gzip format and wrote, with Mark Adler,
535 the related software described in this specification. Glenn
536 Randers-Pehrson converted this document to RFC and HTML format.
537
5386. Author's Address
539
540 L. Peter Deutsch
541 Aladdin Enterprises
542 203 Santa Margarita Ave.
543 Menlo Park, CA 94025
544
545 Phone: (415) 322-0103 (AM only)
546 FAX: (415) 322-1734
547 EMail: <ghost@aladdin.com>
548
549 Questions about the technical content of this specification can be
550 sent by email to:
551
552 Jean-Loup Gailly <gzip@prep.ai.mit.edu> and
553 Mark Adler <madler@alumni.caltech.edu>
554
555 Editorial comments on this specification can be sent by email to:
556
557 L. Peter Deutsch <ghost@aladdin.com> and
558 Glenn Randers-Pehrson <randeg@alumni.rpi.edu>
559
560
561
562Deutsch Informational [Page 10]
563
564RFC 1952 GZIP File Format Specification May 1996
565
566
5677. Appendix: Jean-Loup Gailly's gzip utility
568
569 The most widely used implementation of gzip compression, and the
570 original documentation on which this specification is based, were
571 created by Jean-Loup Gailly <gzip@prep.ai.mit.edu>. Since this
572 implementation is a de facto standard, we mention some more of its
573 features here. Again, the material in this section is not part of
574 the specification per se, and implementations need not follow it to
575 be compliant.
576
577 When compressing or decompressing a file, gzip preserves the
578 protection, ownership, and modification time attributes on the local
579 file system, since there is no provision for representing protection
580 attributes in the gzip file format itself. Since the file format
581 includes a modification time, the gzip decompressor provides a
582 command line switch that assigns the modification time from the file,
583 rather than the local modification time of the compressed input, to
584 the decompressed output.
585
5868. Appendix: Sample CRC Code
587
588 The following sample code represents a practical implementation of
589 the CRC (Cyclic Redundancy Check). (See also ISO 3309 and ITU-T V.42
590 for a formal specification.)
591
592 The sample code is in the ANSI C programming language. Non-C users
593 may find it easier to read with these hints:
594
595 & Bitwise AND operator.
596 ^ Bitwise exclusive-OR operator.
597 >> Bitwise right shift operator. When applied to an
598 unsigned quantity, as here, right shift inserts zero
599 bit(s) at the left.
600 ! Logical NOT operator.
601 ++ "n++" increments the variable n.
602 0xNNN 0x introduces a hexadecimal (base 16) constant.
603 Suffix L indicates a long value (at least 32 bits).
604
      /* Table of CRCs of all 8-bit messages. */
      unsigned long crc_table[256];

      /* Flag: has the table been computed? Initially false. */
      int crc_table_computed = 0;

      /* Make the table for a fast CRC. */
      void make_crc_table(void)
      {
        unsigned long c;
        int n, k;

        for (n = 0; n < 256; n++) {
          c = (unsigned long) n;
          for (k = 0; k < 8; k++) {
            if (c & 1) {
              c = 0xedb88320L ^ (c >> 1);
            } else {
              c = c >> 1;
            }
          }
          crc_table[n] = c;
        }
        crc_table_computed = 1;
      }

      /*
         Update a running crc with the bytes buf[0..len-1] and return
       the updated crc. The crc should be initialized to zero. Pre- and
       post-conditioning (one's complement) is performed within this
       function so it shouldn't be done by the caller. Usage example:

         unsigned long crc = 0L;

         while (read_buffer(buffer, length) != EOF) {
           crc = update_crc(crc, buffer, length);
         }
         if (crc != original_crc) error();
      */
      unsigned long update_crc(unsigned long crc,
                      unsigned char *buf, int len)
      {
        unsigned long c = crc ^ 0xffffffffL;
        int n;

        if (!crc_table_computed)
          make_crc_table();
        for (n = 0; n < len; n++) {
          c = crc_table[(c ^ buf[n]) & 0xff] ^ (c >> 8);
        }
        return c ^ 0xffffffffL;
      }

      /* Return the CRC of the bytes buf[0..len-1]. */
      unsigned long crc(unsigned char *buf, int len)
      {
        return update_crc(0L, buf, len);
      }
670
671
672
673
674Deutsch Informational [Page 12]
675
diff --git a/doc/txtvsbin.txt b/doc/txtvsbin.txt
new file mode 100644
index 0000000..3d0f063
--- /dev/null
+++ b/doc/txtvsbin.txt
@@ -0,0 +1,107 @@
1A Fast Method for Identifying Plain Text Files
2==============================================
3
4
5Introduction
6------------
7
8Given a file coming from an unknown source, it is sometimes desirable
9to find out whether the format of that file is plain text. Although
10this may appear to be a simple task, a fully accurate detection of the
11file type requires heavy-duty semantic analysis on the file contents.
12It is, however, possible to obtain satisfactory results by employing
13various heuristics.
14
15Previous versions of PKZip and other zip-compatible compression tools
16used a crude detection scheme: if more than 80% (4/5) of the bytes
17found in a certain buffer are within the range [7..127], the file is
18labeled as plain text, otherwise it is labeled as binary. A prominent
19limitation of this scheme is the restriction to Latin-based alphabets.
20Other alphabets, like Greek, Cyrillic or Asian, make extensive use of
21the bytes within the range [128..255], and texts using these alphabets
22are most often misidentified by this scheme; in other words, the rate
23of false negatives is sometimes too high, which means that the recall
24is low. Another weakness of this scheme is a reduced precision, due to
25the false positives that may occur when binary files containing large
26amounts of textual characters are misidentified as plain text.
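
For comparison, the older 80% heuristic described above might look
roughly like this in C. This is a reconstruction from the description,
not code taken from PKZip or any other tool; the name old_looks_like_text
is hypothetical, and treating an empty buffer as binary is an assumption.

```c
#include <stddef.h>

/* Older scheme: label the buffer as text if more than 80% of its
   bytes fall within the range [7..127]. */
int old_looks_like_text(const unsigned char *buf, size_t len)
{
    size_t in_range = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] >= 7 && buf[i] <= 127)
            in_range++;
    }
    /* in_range / len > 4/5, written without division. */
    return len > 0 && in_range * 5 > len * 4;
}
```

A buffer made up mostly of bytes above 127 is rejected even when it is
legitimate text, while a buffer of printable bytes with an occasional
NUL is accepted, which illustrates both weaknesses noted above.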
27
28In this article we propose a new, simple detection scheme that features
29a much increased precision and a near-100% recall. This scheme is
30designed to work on ASCII, Unicode and other ASCII-derived alphabets,
31and it handles single-byte encodings (ISO-8859, MacRoman, KOI8, etc.)
32and variable-sized encodings (ISO-2022, UTF-8, etc.). Wider encodings
33(UCS-2/UTF-16 and UCS-4/UTF-32) are not handled, however.
34
35
36The Algorithm
37-------------
38
39The algorithm works by dividing the set of bytecodes [0..255] into three
40categories:
41- The white list of textual bytecodes:
42 9 (TAB), 10 (LF), 13 (CR), 32 (SPACE) to 255.
43- The gray list of tolerated bytecodes:
44 7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC).
45- The black list of undesired, non-textual bytecodes:
46 0 (NUL) to 6, 14 to 25, 28 to 31.
47
48If a file contains at least one byte that belongs to the white list and
49no byte that belongs to the black list, then the file is categorized as
50plain text; otherwise, it is categorized as binary. (The boundary case,
51when the file is empty, automatically falls into the latter category.)
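
The classification rule above can be sketched in C as follows. The
function name looks_like_text is hypothetical, and this is only one
possible way to express the rule: it scans a single in-memory buffer
and treats gray-listed bytes as neutral.

```c
#include <stddef.h>

/* Return 1 if the buffer is categorized as plain text, 0 if binary.
   Text requires at least one white-listed byte and no black-listed
   byte; gray-listed bytes are tolerated but do not count as text. */
int looks_like_text(const unsigned char *buf, size_t len)
{
    int seen_white = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        unsigned char b = buf[i];

        if (b == 9 || b == 10 || b == 13 || b >= 32) {
            seen_white = 1;         /* white list */
        } else if (b == 7 || b == 8 || b == 11 || b == 12 ||
                   b == 26 || b == 27) {
            continue;               /* gray list */
        } else {
            return 0;               /* black list: stop at once */
        }
    }
    return seen_white;              /* an empty buffer is binary */
}
```

Since the result depends only on the presence or absence of byte
values, the scan can stop at the first black-listed byte.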
52
53
54Rationale
55---------
56
57The idea behind this algorithm relies on two observations.
58
59The first observation is that, although the full range of 7-bit codes
60[0..127] is properly specified by the ASCII standard, most control
61characters in the range [0..31] are not used in practice. The only
62widely-used, almost universally-portable control codes are 9 (TAB),
6310 (LF) and 13 (CR). There are a few more control codes that are
64recognized on a reduced range of platforms and text viewers/editors:
657 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB) and 27 (ESC); but these
66codes are rarely (if ever) used alone, without being accompanied by
67some printable text. Even the newer, portable text formats such as
68XML avoid using control characters outside the list mentioned here.
69
70The second observation is that most of the binary files tend to contain
71control characters, especially 0 (NUL). Even though the older text
72detection schemes observe the presence of non-ASCII codes from the range
73[128..255], the precision rarely has to suffer if this upper range is
74labeled as textual, because the files that are genuinely binary tend to
75contain both control characters and codes from the upper range. On the
76other hand, the upper range needs to be labeled as textual, because it
77is used by virtually all ASCII extensions. In particular, this range is
78used for encoding non-Latin scripts.
79
80Since there is no counting involved, other than simply observing the
81presence or the absence of some byte values, the algorithm produces
82consistent results, regardless of what alphabet encoding is being used.
83(If counting were involved, it could be possible to obtain different
84results on a text encoded, say, using ISO-8859-16 versus UTF-8.)
85
86There is an extra category of plain text files that are "polluted" with
87one or more black-listed codes, either by mistake or by peculiar design
88considerations. In such cases, a scheme that tolerates a small fraction
89of black-listed codes would provide an increased recall (i.e. more true
90positives). This, however, incurs a reduced precision overall, since
91false positives are more likely to appear in binary files that contain
92large chunks of textual data. Furthermore, "polluted" plain text should
93be regarded as binary by general-purpose text detection schemes, because
94general-purpose text processing algorithms might not be applicable.
95Under this premise, it is safe to say that our detection method provides
96a near-100% recall.
97
98Experiments have been run on many files coming from various platforms
99and applications. We tried plain text files, system logs, source code,
100formatted office documents, compiled object code, etc. The results
101confirm the optimistic assumptions about the capabilities of this
102algorithm.
103
104
105--
106Cosmin Truta
107Last updated: 2006-May-28