bzip2(1)                                                      bzip2(1)


NAME
       bzip2, bunzip2 - a block-sorting file compressor, v0.1
       bzip2recover - recovers data from damaged bzip2 files


SYNOPSIS
       bzip2 [ -cdfkstvVL123456789 ] [ filenames ... ]
       bunzip2 [ -kvsVL ] [ filenames ... ]
       bzip2recover filename

DESCRIPTION
       Bzip2 compresses files using the Burrows-Wheeler block-sorting
       text compression algorithm, and Huffman coding.  Compression is
       generally considerably better than that achieved by more
       conventional LZ77/LZ78-based compressors, and approaches the
       performance of the PPM family of statistical compressors.

       The command-line options are deliberately very similar to those
       of GNU Gzip, but they are not identical.

       Bzip2 expects a list of file names to accompany the command-line
       flags.  Each file is replaced by a compressed version of itself,
       with the name "original_name.bz2".  Each compressed file has the
       same modification date and permissions as the corresponding
       original, so that these properties can be correctly restored at
       decompression time.  File name handling is naive in the sense
       that there is no mechanism for preserving original file names,
       permissions and dates in filesystems which lack these concepts,
       or have serious file name length restrictions, such as MS-DOS.

       Bzip2 and bunzip2 will not overwrite existing files; if you want
       this to happen, you should delete them first.

       If no file names are specified, bzip2 compresses from standard
       input to standard output.  In this case, bzip2 will decline to
       write compressed output to a terminal, as this would be entirely
       incomprehensible and therefore pointless.

       Bunzip2 (or bzip2 -d) decompresses and restores all specified
       files whose names end in ".bz2".  Files without this suffix are
       ignored.  Again, supplying no filenames causes decompression
       from standard input to standard output.

       You can also compress or decompress files to the standard output
       by giving the -c flag.  You can decompress multiple files like
       this, but you may only compress a single file this way, since it
       would otherwise be difficult to separate out the compressed
       representations of the original files.
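
       As an illustration of the -c behaviour, the following C sketch
       reads the decompressed contents of a .bz2 file through a pipe
       rather than letting bzip2 touch the filesystem.  It is only a
       sketch: the file name "sample.bz2" is hypothetical, and a POSIX
       popen(3) is assumed.

       /* Hedged sketch: stream the decompressed output of a
        * hypothetical "sample.bz2", using only the documented
        * -d and -c flags.  Assumes a POSIX popen(3). */
       #include <stdio.h>

       int main(void)
       {
           FILE *p = popen("bzip2 -dc sample.bz2", "r");
           if (p == NULL) {
               perror("popen");
               return 1;
           }

           char buf[4096];
           size_t n;
           while ((n = fread(buf, 1, sizeof buf, p)) > 0)
               fwrite(buf, 1, n, stdout);   /* forward decompressed bytes */

           /* pclose() reports the command's exit status; non-zero
            * means bzip2 had a problem (see the return values
            * described below). */
           return pclose(p) == 0 ? 0 : 1;
       }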

       Compression is always performed, even if the compressed file is
       slightly larger than the original.  Files of less than about one
       hundred bytes tend to get larger, since the compression
       mechanism has a constant overhead in the region of 50 bytes.
       Random data (including the output of most file compressors) is
       coded at about 8.05 bits per byte, giving an expansion of around
       0.5%.

       As a self-check for your protection, bzip2 uses 32-bit CRCs to
       make sure that the decompressed version of a file is identical
       to the original.  This guards against corruption of the
       compressed data, and against undetected bugs in bzip2 (hopefully
       very unlikely).  The chance of data corruption going undetected
       is microscopic, about one chance in four billion for each file
       processed.  Be aware, though, that the check occurs upon
       decompression, so it can only tell you that something is wrong.
       It can't help you recover the original uncompressed data.  You
       can use bzip2recover to try to recover data from damaged files.

       Return values: 0 for a normal exit, 1 for environmental problems
       (file not found, invalid flags, I/O errors, &c), 2 to indicate a
       corrupt compressed file, 3 for an internal consistency error
       (eg, bug) which caused bzip2 to panic.
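
       A calling program can act on these return values.  The following
       C sketch is illustrative only: the file name "example.bz2" is
       hypothetical, and a POSIX system(3) whose status is decoded with
       the <sys/wait.h> macros is assumed.

       /* Hedged sketch: run "bzip2 -t" on a hypothetical file and map
        * the documented return values (0, 1, 2, 3) onto messages. */
       #include <stdio.h>
       #include <stdlib.h>
       #include <sys/wait.h>

       int main(void)
       {
           int status = system("bzip2 -t example.bz2");
           if (status == -1) {
               perror("system");
               return 1;
           }
           if (WIFEXITED(status)) {
               switch (WEXITSTATUS(status)) {
               case 0:  puts("file is ok");                 break;
               case 1:  puts("environmental problem");      break;
               case 2:  puts("corrupt compressed file");    break;
               case 3:  puts("internal consistency error"); break;
               default: puts("unexpected exit code");       break;
               }
           }
           return 0;
       }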

MEMORY MANAGEMENT
       Bzip2 compresses large files in blocks.  The block size affects
       both the compression ratio achieved, and the amount of memory
       needed for compression and decompression.  The flags -1 through
       -9 specify the block size to be 100,000 bytes through 900,000
       bytes (the default) respectively.  At decompression time, the
       block size used for compression is read from the header of the
       compressed file, and bunzip2 then allocates itself just enough
       memory to decompress the file.  Since block sizes are stored in
       compressed files, it follows that the flags -1 to -9 are
       irrelevant to, and so ignored during, decompression.
       Compression and decompression requirements, in bytes, can be
       estimated as:

          Compression:   400k + ( 7 x block size )

          Decompression: 100k + ( 5 x block size ), or
                         100k + ( 2.5 x block size )
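
       Expressed as a worked example in C, the same estimates look like
       this.  It is only a sketch of the arithmetic quoted above: "k"
       is taken as 1,000 bytes, block_size is the block size in bytes
       (100,000 to 900,000), and the function names are hypothetical.

       #include <stdio.h>

       /* Hedged sketch of the memory estimates quoted above. */
       static long compress_memory(long block_size)
       {
           return 400000L + 7L * block_size;       /* 400k + 7 x block size */
       }

       static long decompress_memory(long block_size, int small_mode)
       {
           if (small_mode)                         /* bunzip2 -s */
               return 100000L + (5L * block_size) / 2;
           return 100000L + 5L * block_size;       /* 100k + 5 x block size */
       }

       int main(void)
       {
           long bs = 900000L;                      /* -9, the default */
           printf("compress   ~%ldk\n", compress_memory(bs) / 1000);
           printf("decompress ~%ldk\n", decompress_memory(bs, 0) / 1000);
           printf("with -s    ~%ldk\n", decompress_memory(bs, 1) / 1000);
           return 0;
       }

       For the default 900,000 byte block size these give roughly 6700k
       to compress, 4600k to decompress, and 2350k to decompress with
       -s, matching the table below.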

       Larger block sizes give rapidly diminishing marginal returns;
       most of the compression comes from the first two or three
       hundred k of block size, a fact worth bearing in mind when using
       bzip2 on small machines.  It is also important to appreciate
       that the decompression memory requirement is set at compression
       time by the choice of block size.

       For files compressed with the default 900k block size, bunzip2
       will require about 4600 kbytes to decompress.  To support
       decompression of any file on a 4 megabyte machine, bunzip2 has
       an option to decompress using approximately half this amount of
       memory, about 2300 kbytes.  Decompression speed is also halved,
       so you should use this option only where necessary.  The
       relevant flag is -s.

       In general, try to use the largest block size memory constraints
       allow, since that maximises the compression achieved.
       Compression and decompression speed are virtually unaffected by
       block size.

       Another significant point applies to files which fit in a single
       block -- that means most files you'd encounter using a large
       block size.  The amount of real memory touched is proportional
       to the size of the file, since the file is smaller than a block.
       For example, compressing a file 20,000 bytes long with the flag
       -9 will cause the compressor to allocate around 6700k of memory,
       but only touch 400k + 20000 * 7 = 540 kbytes of it.  Similarly,
       the decompressor will allocate 4600k but only touch
       100k + 20000 * 5 = 200 kbytes.

       Here is a table which summarises the maximum memory usage for
       different block sizes.  Also recorded is the total compressed
       size for 14 files of the Calgary Text Compression Corpus
       totalling 3,141,622 bytes.  This column gives some feel for how
       compression varies with block size.  These figures tend to
       understate the advantage of larger block sizes for larger files,
       since the Corpus is dominated by smaller files.

               Compress   Decompress   Decompress   Corpus
        Flag     usage      usage       -s usage     Size

         -1      1100k       600k         350k      914704
         -2      1800k      1100k         600k      877703
         -3      2500k      1600k         850k      860338
         -4      3200k      2100k        1100k      846899
         -5      3900k      2600k        1350k      845160
         -6      4600k      3100k        1600k      838626
         -7      5400k      3600k        1850k      834096
         -8      6000k      4100k        2100k      828642
         -9      6700k      4600k        2350k      828642

OPTIONS
       -c --stdout
              Compress or decompress to standard output.  -c will
              decompress multiple files to stdout, but will only
              compress a single file to stdout.

       -d --decompress
              Force decompression.  Bzip2 and bunzip2 are really the
              same program, and the decision about whether to compress
              or decompress is made on the basis of which name is used.
              This flag overrides that mechanism, and forces bzip2 to
              decompress.  (An illustrative sketch of this name-based
              dispatch appears after this list of options.)

       -f --compress
              The complement to -d: forces compression, regardless of
              the invocation name.

       -t --test
              Check integrity of the specified file(s), but don't
              decompress them.  This really performs a trial
              decompression and throws away the result, using the
              low-memory decompression algorithm (see -s).

       -k --keep
              Keep (don't delete) input files during compression or
              decompression.

       -s --small
              Reduce memory usage, both for compression and
              decompression.  Files are decompressed using a modified
              algorithm which only requires 2.5 bytes per block byte.
              This means any file can be decompressed in 2300k of
              memory, albeit somewhat more slowly than usual.

              During compression, -s selects a block size of 200k,
              which limits memory use to around the same figure, at the
              expense of your compression ratio.  In short, if your
              machine is low on memory (8 megabytes or less), use -s
              for everything.  See MEMORY MANAGEMENT above.

       -v --verbose
              Verbose mode -- show the compression ratio for each file
              processed.  Further -v's increase the verbosity level,
              spewing out lots of information which is primarily of
              interest for diagnostic purposes.

       -L --license
              Display the software version, license terms and
              conditions.

       -V --version
              Same as -L.

       -1 to -9
              Set the block size to 100 k, 200 k ... 900 k when
              compressing.  Has no effect when decompressing.  See
              MEMORY MANAGEMENT above.

       --repetitive-fast
              bzip2 injects some small pseudo-random variations into
              very repetitive blocks to limit worst-case performance
              during compression.  If sorting runs into difficulties,
              the block is randomised, and sorting is restarted.  Very
              roughly, bzip2 persists for three times as long as a
              well-behaved input would take before resorting to
              randomisation.  This flag makes it give up much sooner.

       --repetitive-best
              Opposite of --repetitive-fast; try a lot harder before
              resorting to randomisation.
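
       The following C fragment illustrates the name-based dispatch
       mentioned under -d and -f.  It is only a sketch of the idea, not
       the actual bzip2 source; the function choose_mode and the flag
       variables are hypothetical.

       #include <stdio.h>
       #include <string.h>

       enum mode { COMPRESS, DECOMPRESS };

       /* Hedged sketch: the same binary compresses or decompresses
        * depending on the name it was invoked under, unless -d or -f
        * overrides the choice.  Not the actual bzip2 source. */
       static enum mode choose_mode(const char *argv0, int d_flag,
                                    int f_flag)
       {
           if (d_flag) return DECOMPRESS;      /* -d forces decompression */
           if (f_flag) return COMPRESS;        /* -f forces compression   */

           const char *base = strrchr(argv0, '/');   /* program name only */
           base = base ? base + 1 : argv0;
           return strcmp(base, "bunzip2") == 0 ? DECOMPRESS : COMPRESS;
       }

       int main(int argc, char **argv)
       {
           (void)argc;
           enum mode m = choose_mode(argv[0], 0, 0);
           puts(m == DECOMPRESS ? "would decompress" : "would compress");
           return 0;
       }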

RECOVERING DATA FROM DAMAGED FILES
       bzip2 compresses files in blocks, usually 900 kbytes long.  Each
       block is handled independently.  If a media or transmission
       error causes a multi-block .bz2 file to become damaged, it may
       be possible to recover data from the undamaged blocks in the
       file.

       The compressed representation of each block is delimited by a
       48-bit pattern, which makes it possible to find the block
       boundaries with reasonable certainty.  Each block also carries
       its own 32-bit CRC, so damaged blocks can be distinguished from
       undamaged ones.

       bzip2recover is a simple program whose purpose is to search for
       blocks in .bz2 files, and write each block out into its own .bz2
       file.  You can then use bzip2 -t to test the integrity of the
       resulting files, and decompress those which are undamaged.

       bzip2recover takes a single argument, the name of the damaged
       file, and writes a number of files "rec0001file.bz2",
       "rec0002file.bz2", etc, containing the extracted blocks.  The
       output filenames are designed so that the use of wildcards in
       subsequent processing -- for example, "bzip2 -dc rec*file.bz2 >
       recovered_data" -- lists the files in the "right" order.
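
       A hedged sketch of that post-processing step in C follows.  It
       assumes a POSIX glob(3), which returns matching names in sorted
       (i.e. the "right") order, and reuses the recovered_data name
       from the example above; it tests each piece with bzip2 -t and
       decompresses only the sound ones.

       /* Hedged sketch: test each recovered piece and append the
        * undamaged ones, in order, to "recovered_data". */
       #include <glob.h>
       #include <stdio.h>
       #include <stdlib.h>

       int main(void)
       {
           glob_t g;
           if (glob("rec*file.bz2", 0, NULL, &g) != 0) {
               fprintf(stderr, "no recovered pieces found\n");
               return 1;
           }

           for (size_t i = 0; i < g.gl_pathc; i++) {
               char cmd[1024];

               /* First check the block's integrity ... */
               snprintf(cmd, sizeof cmd, "bzip2 -t %s", g.gl_pathv[i]);
               if (system(cmd) != 0) {
                   fprintf(stderr, "skipping damaged piece %s\n",
                           g.gl_pathv[i]);
                   continue;
               }

               /* ... then append its decompressed contents. */
               snprintf(cmd, sizeof cmd, "bzip2 -dc %s >> recovered_data",
                        g.gl_pathv[i]);
               system(cmd);
           }

           globfree(&g);
           return 0;
       }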

       bzip2recover should be of most use dealing with large .bz2
       files, as these will contain many blocks.  It is clearly futile
       to use it on damaged single-block files, since a damaged block
       cannot be recovered.  If you wish to minimise any potential data
       loss through media or transmission errors, you might consider
       compressing with a smaller block size.


PERFORMANCE NOTES
       The sorting phase of compression gathers together similar
       strings in the file.  Because of this, files containing very
       long runs of repeated symbols, like "aabaabaabaab ..." (repeated
       several hundred times) may compress extraordinarily slowly.  You
       can use the -vvvvv option to monitor progress in great detail,
       if you want.  Decompression speed is unaffected.

       Such pathological cases seem rare in practice, appearing mostly
       in artificially-constructed test files, and in low-level disk
       images.  It may be inadvisable to use bzip2 to compress the
       latter.  If you do get a file which causes severe slowness in
       compression, try making the block size as small as possible,
       with flag -1.

       Incompressible or virtually-incompressible data may decompress
       rather more slowly than one would hope.  This is due to a naive
       implementation of the move-to-front coder.

       bzip2 usually allocates several megabytes of memory to operate
       in, and then charges all over it in a fairly random fashion.
       This means that performance, both for compressing and
       decompressing, is largely determined by the speed at which your
       machine can service cache misses.  Because of this, small
       changes to the code to reduce the miss rate have been observed
       to give disproportionately large performance improvements.  I
       imagine bzip2 will perform best on machines with very large
       caches.

       Test mode (-t) uses the low-memory decompression algorithm (-s).
       This means test mode does not run as fast as it could; it could
       run as fast as the normal decompression machinery.  This could
       easily be fixed at the cost of some code bloat.

CAVEATS
       I/O error messages are not as helpful as they could be.  Bzip2
       tries hard to detect I/O errors and exit cleanly, but the
       details of what the problem is sometimes seem rather misleading.

       This manual page pertains to version 0.1 of bzip2.  It may well
       happen that some future version will use a different compressed
       file format.  If you try to decompress, using 0.1, a .bz2 file
       created with some future version which uses a different
       compressed file format, 0.1 will complain that your file "is not
       a bzip2 file".  If that happens, you should obtain a more recent
       version of bzip2 and use that to decompress the file.

       Wildcard expansion for Windows 95 and NT is flaky.

       bzip2recover uses 32-bit integers to represent bit positions in
       compressed files, so it cannot handle compressed files more than
       512 megabytes long.  This could easily be fixed.

       bzip2recover sometimes reports a very small, incomplete final
       block.  This is spurious and can be safely ignored.

RELATIONSHIP TO bzip-0.21
       This program is a descendant of the bzip program, version 0.21,
       which I released in August 1996.  The primary difference is that
       bzip2 avoids the possibly patented algorithms which were used in
       0.21.  bzip2 also brings various useful refinements (-s, -t),
       uses less memory, decompresses significantly faster, and has
       support for recovering data from damaged files.

       Because bzip2 uses Huffman coding to construct the compressed
       bitstream, rather than the arithmetic coding used in 0.21, the
       compressed representations generated by the two programs are
       incompatible, and they will not interoperate.  The change in
       suffix from .bz to .bz2 reflects this.  It would have been
       helpful to at least allow bzip2 to decompress files created by
       0.21, but this would defeat the primary aim of having a
       patent-free compressor.

       Huffman coding necessarily involves some coding inefficiency
       compared to arithmetic coding.  This means that bzip2 compresses
       about 1% worse than 0.21, an unfortunate but unavoidable fact of
       life.  On the other hand, decompression is approximately 50%
       faster for the same reason, and the change in file format gave
       an opportunity to add data-recovery features.  So it is not all
       bad.

AUTHOR
       Julian Seward, jseward@acm.org.

       The ideas embodied in bzip and bzip2 are due to (at least) the
       following people: Michael Burrows and David Wheeler (for the
       block sorting transformation), David Wheeler (again, for the
       Huffman coder), Peter Fenwick (for the structured coding model
       in 0.21, and many refinements), and Alistair Moffat, Radford
       Neal and Ian Witten (for the arithmetic coder in 0.21).  I am
       much indebted for their help, support and advice.  See the file
       ALGORITHMS in the source distribution for pointers to sources of
       documentation.  Christian von Roques encouraged me to look for
       faster sorting algorithms, so as to speed up compression.  Bela
       Lubkin encouraged me to improve the worst-case compression
       performance.  Many people sent patches, helped with portability
       problems, lent machines, gave advice and were generally helpful.