author     Julian Seward <jseward@acm.org>    1997-08-07 22:13:13 +0200
committer  Julian Seward <jseward@acm.org>    1997-08-07 22:13:13 +0200
commit     33d134030248633ffa7d60c0a35a783c46da034b (patch)
tree       b760dc34185dccc7054989c1472478574223cc31 /bzip2.1
tag        bzip2-0.1

Diffstat (limited to 'bzip2.1'): bzip2.1 | 441
1 file changed, 441 insertions, 0 deletions

@@ -0,0 +1,441 @@
.PU
.TH bzip2 1
.SH NAME
bzip2, bunzip2 \- a block-sorting file compressor, v0.1
.br
bzip2recover \- recovers data from damaged bzip2 files

.SH SYNOPSIS
.ll +8
.B bzip2
.RB [ " \-cdfkstvVL123456789 " ]
[
.I "filenames \&..."
]
.ll -8
.br
.B bunzip2
.RB [ " \-kvsVL " ]
[
.I "filenames \&..."
]
.br
.B bzip2recover
.I "filename"

.SH DESCRIPTION
.I Bzip2
compresses files using the Burrows-Wheeler block-sorting
text compression algorithm, and Huffman coding.
Compression is generally considerably
better than that
achieved by more conventional LZ77/LZ78-based compressors,
and approaches the performance of the PPM family of statistical
compressors.

The command-line options are deliberately very similar to
those of
.I GNU Gzip,
but they are not identical.

.I Bzip2
expects a list of file names to accompany the command-line flags.
Each file is replaced by a compressed version of itself,
with the name "original_name.bz2".
Each compressed file has the same modification date and permissions
as the corresponding original, so that these properties can be
correctly restored at decompression time. File name handling is
naive in the sense that there is no mechanism for preserving
original file names, permissions and dates in filesystems
which lack these concepts, or have serious file name length
restrictions, such as MS-DOS.

.I Bzip2
and
.I bunzip2
will not overwrite existing files; if you want this to happen,
you should delete them first.

If no file names are specified,
.I bzip2
compresses from standard input to standard output.
In this case,
.I bzip2
will decline to write compressed output to a terminal, as
this would be entirely incomprehensible and therefore pointless.

.I Bunzip2
(or
.I bzip2 \-d
) decompresses and restores all specified files whose names
end in ".bz2".
Files without this suffix are ignored.
Again, supplying no filenames
causes decompression from standard input to standard output.

You can also compress or decompress files to
the standard output by giving the \-c flag.
You can decompress multiple files like this, but you may
only compress a single file this way, since it would otherwise
be difficult to separate out the compressed representations of
the original files.

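As a sketch of the \-c usage described above (assuming a bzip2 binary on the PATH; the file name is purely illustrative):

```shell
# Compress a named file to stdout, then decompress the result back to stdout.
printf 'hello, world\n' > greeting.txt
bzip2 -c greeting.txt > greeting.txt.bz2
bzip2 -dc greeting.txt.bz2
```

The last command writes the original text back to standard output, which is the usual way to feed decompressed data into a pipeline.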
Compression is always performed, even if the compressed file is
slightly larger than the original. Files of less than about
one hundred bytes tend to get larger, since the compression
mechanism has a constant overhead in the region of 50 bytes.
Random data (including the output of most file compressors)
is coded at about 8.05 bits per byte, giving an expansion of
around 0.5%.

As a self-check for your protection,
.I bzip2
uses 32-bit CRCs to make sure that the decompressed
version of a file is identical to the original.
This guards against corruption of the compressed data,
and against undetected bugs in
.I bzip2
(hopefully very unlikely).
The chance of data corruption going undetected is
microscopic, about one chance in four billion
for each file processed. Be aware, though, that the check
occurs upon decompression, so it can only tell you
that something is wrong. It can't help you recover the
original uncompressed data.
You can use
.I bzip2recover
to try to recover data from damaged files.

Return values:
0 for a normal exit,
1 for environmental
problems (file not found, invalid flags, I/O errors, etc.),
2 to indicate a corrupt compressed file,
3 for an internal consistency error (eg, a bug) which caused
.I bzip2
to panic.

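The return values can be inspected from the shell; a small sketch (assuming a bzip2 binary on the PATH, file name invented for the example):

```shell
# A file with the wrong magic bytes should be reported as corrupt.
printf 'this is not bzip2 data\n' > bogus.bz2
status=0
bzip2 -t bogus.bz2 2>/dev/null || status=$?
echo "$status"
```

Per the table above, a corrupt compressed file yields status 2, while a missing file would yield 1.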
.SH MEMORY MANAGEMENT
.I Bzip2
compresses large files in blocks. The block size affects both the
compression ratio achieved, and the amount of memory needed both for
compression and decompression. The flags \-1 through \-9
specify the block size to be 100,000 bytes through 900,000 bytes
(the default) respectively. At decompression-time, the block size used for
compression is read from the header of the compressed file, and
.I bunzip2
then allocates itself just enough memory to decompress the file.
Since block sizes are stored in compressed files, it follows that the flags
\-1 to \-9
are irrelevant to, and so ignored during, decompression.
Compression and decompression requirements, in bytes, can be estimated as:

     Compression:   400k + ( 7 x block size )

     Decompression: 100k + ( 5 x block size ), or
.br
                    100k + ( 2.5 x block size ) with \-s

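The two formulas can be turned into a small calculator. This Python sketch (not part of bzip2; the function names are invented here) reproduces the figures in the table in this section, taking k as 1,000 bytes:

```python
# Sketch only: estimate bzip2 0.1 memory use from the man page formulas.
def compress_mem(block_size):
    # Compression: 400k + (7 x block size)
    return 400_000 + 7 * block_size

def decompress_mem(block_size, small=False):
    # Decompression: 100k + (5 x block size), or 2.5x with -s
    factor = 2.5 if small else 5
    return int(100_000 + factor * block_size)

# Default -9 block size of 900,000 bytes:
print(compress_mem(900_000))                # 6700000, i.e. "6700k"
print(decompress_mem(900_000))              # 4600000, i.e. "4600k"
print(decompress_mem(900_000, small=True))  # 2350000, i.e. "2350k"
```

These match the \-9 row of the table; the other rows follow by substituting the corresponding block sizes.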
Larger block sizes give rapidly diminishing marginal returns; most
of the
compression comes from the first two or three hundred k of block size,
a fact worth bearing in mind when using
.I bzip2
on small machines. It is also important to appreciate that the
decompression memory requirement is set at compression-time by the
choice of block size.

For files compressed with the default 900k block size,
.I bunzip2
will require about 4600 kbytes to decompress.
To support decompression of any file on a 4 megabyte machine,
.I bunzip2
has an option to decompress using approximately half this
amount of memory, about 2300 kbytes. Decompression speed is
also halved, so you should use this option only where necessary.
The relevant flag is \-s.

In general, try to use the largest block size
memory constraints allow, since that maximises the compression
achieved. Compression and decompression
speed are virtually unaffected by block size.

Another significant point applies to files which fit in a single
block -- that means most files you'd encounter using a large
block size. The amount of real memory touched is proportional
to the size of the file, since the file is smaller than a block.
For example, compressing a file 20,000 bytes long with the flag
\-9
will cause the compressor to allocate around
6700k of memory, but only touch 400k + 20000 * 7 = 540
kbytes of it. Similarly, the decompressor will allocate 4600k but
only touch 100k + 20000 * 5 = 200 kbytes.

Here is a table which summarises the maximum memory usage for
different block sizes. Also recorded is the total compressed
size for 14 files of the Calgary Text Compression Corpus
totalling 3,141,622 bytes. This column gives some feel for how
compression varies with block size. These figures tend to understate
the advantage of larger block sizes for larger files, since the
Corpus is dominated by smaller files.

           Compress   Decompress   Decompress   Corpus
    Flag     usage      usage       -s usage     Size

     -1      1100k       600k         350k      914704
     -2      1800k      1100k         600k      877703
     -3      2500k      1600k         850k      860338
     -4      3200k      2100k        1100k      846899
     -5      3900k      2600k        1350k      845160
     -6      4600k      3100k        1600k      838626
     -7      5400k      3600k        1850k      834096
     -8      6000k      4100k        2100k      828642
     -9      6700k      4600k        2350k      828642

.SH OPTIONS
.TP
.B \-c --stdout
Compress or decompress to standard output. \-c will decompress
multiple files to stdout, but will only compress a single file to
stdout.
.TP
.B \-d --decompress
Force decompression.
.I Bzip2
and
.I bunzip2
are really the same program, and the decision about whether to
compress or decompress is done on the basis of which name is
used. This flag overrides that mechanism, and forces
.I bzip2
to decompress.
.TP
.B \-f --compress
The complement to \-d: forces compression, regardless of the invocation
name.
.TP
.B \-t --test
Check integrity of the specified file(s), but don't decompress them.
This really performs a trial decompression and throws away the result,
using the low-memory decompression algorithm (see \-s).
.TP
.B \-k --keep
Keep (don't delete) input files during compression or decompression.
.TP
.B \-s --small
Reduce memory usage, both for compression and decompression.
Files are decompressed using a modified algorithm which only
requires 2.5 bytes per block byte. This means any file can be
decompressed in 2300k of memory, albeit somewhat more slowly than
usual.

During compression, \-s selects a block size of 200k, which limits
memory use to around the same figure, at the expense of your
compression ratio. In short, if your machine is low on memory
(8 megabytes or less), use \-s for everything. See
MEMORY MANAGEMENT above.

.TP
.B \-v --verbose
Verbose mode -- show the compression ratio for each file processed.
Further \-v's increase the verbosity level, spewing out lots of
information which is primarily of interest for diagnostic purposes.
.TP
.B \-L --license
Display the software version, license terms and conditions.
.TP
.B \-V --version
Same as \-L.
.TP
.B \-1 to \-9
Set the block size to 100 k, 200 k ... 900 k when
compressing. Has no effect when decompressing.
See MEMORY MANAGEMENT above.
.TP
.B \--repetitive-fast
.I bzip2
injects some small pseudo-random variations
into very repetitive blocks to limit
worst-case performance during compression.
If sorting runs into difficulties, the block
is randomised, and sorting is restarted.
Very roughly,
.I bzip2
persists for three times as long as a well-behaved input
would take before resorting to randomisation.
This flag makes it give up much sooner.

.TP
.B \--repetitive-best
Opposite of \--repetitive-fast; try a lot harder before
resorting to randomisation.

.SH RECOVERING DATA FROM DAMAGED FILES
.I bzip2
compresses files in blocks, usually 900 kbytes long.
Each block is handled independently. If a media or
transmission error causes a multi-block .bz2
file to become damaged,
it may be possible to recover data from the undamaged blocks
in the file.

The compressed representation of each block is delimited by
a 48-bit pattern, which makes it possible to find the block
boundaries with reasonable certainty. Each block also carries
its own 32-bit CRC, so damaged blocks can be
distinguished from undamaged ones.

.I bzip2recover
is a simple program whose purpose is to search for
blocks in .bz2 files, and write each block out into
its own .bz2 file. You can then use
.I bzip2 \-t
to test the integrity of the resulting files,
and decompress those which are undamaged.

.I bzip2recover
takes a single argument, the name of the damaged file,
and writes a number of files "rec0001file.bz2", "rec0002file.bz2",
etc, containing the extracted blocks. The output filenames
are designed so that the use of wildcards in subsequent processing
-- for example, "bzip2 -dc rec*file.bz2 > recovered_data" --
lists the files in the "right" order.

.I bzip2recover
should be of most use dealing with large .bz2 files, as
these will contain many blocks. It is clearly futile to
use it on damaged single-block files, since a damaged
block cannot be recovered. If you wish to minimise
any potential data loss through media or transmission
errors, you might consider compressing with a smaller
block size.

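The claim about wildcard ordering is easy to check without a damaged archive. This shell sketch (the files are empty stand-ins, named the way bzip2recover names its output) shows that the zero-padded names expand in the intended order:

```shell
# Create dummy files mimicking bzip2recover output, deliberately out of order.
mkdir -p recdemo && cd recdemo
touch rec0002file.bz2 rec0010file.bz2 rec0001file.bz2
# Shell globs sort lexicographically, so zero-padding gives numeric order:
printf '%s\n' rec*file.bz2
```

Because the numbers are zero-padded to a fixed width, lexicographic glob expansion coincides with the numeric block order, which is what makes "bzip2 -dc rec*file.bz2" reassemble the data correctly.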
.SH PERFORMANCE NOTES
The sorting phase of compression gathers together similar strings
in the file. Because of this, files containing very long
runs of repeated symbols, like "aabaabaabaab ..." (repeated
several hundred times) may compress extraordinarily slowly.
You can use the
\-vvvvv
option to monitor progress in great detail, if you want.
Decompression speed is unaffected.

Such pathological cases
seem rare in practice, appearing mostly in artificially-constructed
test files, and in low-level disk images. It may be inadvisable to
use
.I bzip2
to compress the latter.
If you do get a file which causes severe slowness in compression,
try making the block size as small as possible, with flag \-1.

Incompressible or virtually-incompressible data may decompress
rather more slowly than one would hope. This is due to
a naive implementation of the move-to-front coder.

.I bzip2
usually allocates several megabytes of memory to operate in,
and then charges all over it in a fairly random fashion. This
means that performance, both for compressing and decompressing,
is largely determined by the speed
at which your machine can service cache misses.
Because of this, small changes
to the code to reduce the miss rate have been observed to give
disproportionately large performance improvements.
I imagine
.I bzip2
will perform best on machines with very large caches.

Test mode (\-t) uses the low-memory decompression algorithm
(\-s). This means test mode does not run as fast as it could;
it could run as fast as the normal decompression machinery.
This could easily be fixed at the cost of some code bloat.

.SH CAVEATS
I/O error messages are not as helpful as they could be.
.I Bzip2
tries hard to detect I/O errors and exit cleanly, but the
details of what the problem is sometimes seem rather misleading.

This manual page pertains to version 0.1 of
.I bzip2.
It may well happen that some future version will
use a different compressed file format. If you try to
decompress, using 0.1, a .bz2 file created with some
future version which uses a different compressed file format,
0.1 will complain that your file "is not a bzip2 file".
If that happens, you should obtain a more recent version
of
.I bzip2
and use that to decompress the file.

Wildcard expansion for Windows 95 and NT
is flaky.

.I bzip2recover
uses 32-bit integers to represent bit positions in
compressed files, so it cannot handle compressed files
more than 512 megabytes long. This could easily be fixed.

.I bzip2recover
sometimes reports a very small, incomplete final block.
This is spurious and can be safely ignored.

.SH RELATIONSHIP TO bzip-0.21
This program is a descendant of the
.I bzip
program, version 0.21, which I released in August 1996.
The primary difference of
.I bzip2
is its avoidance of the possibly patented algorithms
which were used in 0.21.
.I bzip2
also brings various useful refinements (\-s, \-t),
uses less memory, decompresses significantly faster, and
has support for recovering data from damaged files.

Because
.I bzip2
uses Huffman coding to construct the compressed bitstream,
rather than the arithmetic coding used in 0.21,
the compressed representations generated by the two programs
are incompatible, and they will not interoperate. The change
in suffix from .bz to .bz2 reflects this. It would have been
helpful to at least allow
.I bzip2
to decompress files created by 0.21, but this would
defeat the primary aim of having a patent-free compressor.

Huffman coding necessarily involves some coding inefficiency
compared to arithmetic coding. This means that
.I bzip2
compresses about 1% worse than 0.21, an unfortunate but
unavoidable fact of life. On the other hand, decompression
is approximately 50% faster for the same reason, and the
change in file format gave an opportunity to add data-recovery
features. So it is not all bad.

.SH AUTHOR
Julian Seward, jseward@acm.org.

The ideas embodied in
.I bzip
and
.I bzip2
are due to (at least) the following people:
Michael Burrows and David Wheeler (for the block sorting
transformation), David Wheeler (again, for the Huffman coder),
Peter Fenwick (for the structured coding model in 0.21,
and many refinements),
and
Alistair Moffat, Radford Neal and Ian Witten (for the arithmetic
coder in 0.21). I am much indebted for their help, support and advice.
See the file ALGORITHMS in the source distribution for pointers to
sources of documentation.
Christian von Roques encouraged me to look for faster
sorting algorithms, so as to speed up compression.
Bela Lubkin encouraged me to improve the worst-case
compression performance.
Many people sent patches, helped with portability problems,
lent machines, gave advice and were generally helpful.