Diffstat (limited to 'bzip2.txt')
-rw-r--r-- bzip2.txt 292
1 files changed, 91 insertions, 201 deletions
diff --git a/bzip2.txt b/bzip2.txt
index aee8e2b..898dfe8 100644
--- a/bzip2.txt
+++ b/bzip2.txt
@@ -1,22 +1,22 @@
1 1
2
3
4bzip2(1) bzip2(1) 2bzip2(1) bzip2(1)
5 3
6 4
7NAME 5NAME
8 bzip2, bunzip2 - a block-sorting file compressor, v0.1 6 bzip2, bunzip2 - a block-sorting file compressor, v0.9.0
7 bzcat - decompresses files to stdout
9 bzip2recover - recovers data from damaged bzip2 files 8 bzip2recover - recovers data from damaged bzip2 files
10 9
11 10
12SYNOPSIS 11SYNOPSIS
13 bzip2 [ -cdfkstvVL123456789 ] [ filenames ... ] 12 bzip2 [ -cdfkstvzVL123456789 ] [ filenames ... ]
14 bunzip2 [ -kvsVL ] [ filenames ... ] 13 bunzip2 [ -fkvsVL ] [ filenames ... ]
14 bzcat [ -s ] [ filenames ... ]
15 bzip2recover filename 15 bzip2recover filename
16 16
17 17
18DESCRIPTION 18DESCRIPTION
19 Bzip2 compresses files using the Burrows-Wheeler block- 19 bzip2 compresses files using the Burrows-Wheeler block-
20 sorting text compression algorithm, and Huffman coding. 20 sorting text compression algorithm, and Huffman coding.
21 Compression is generally considerably better than that 21 Compression is generally considerably better than that
22 achieved by more conventional LZ77/LZ78-based compressors, 22 achieved by more conventional LZ77/LZ78-based compressors,
@@ -26,7 +26,7 @@ DESCRIPTION
26 The command-line options are deliberately very similar to 26 The command-line options are deliberately very similar to
27 those of GNU Gzip, but they are not identical. 27 those of GNU Gzip, but they are not identical.
28 28
29 Bzip2 expects a list of file names to accompany the com- 29 bzip2 expects a list of file names to accompany the com-
30 mand-line flags. Each file is replaced by a compressed 30 mand-line flags. Each file is replaced by a compressed
31 version of itself, with the name "original_name.bz2". 31 version of itself, with the name "original_name.bz2".
32 Each compressed file has the same modification date and 32 Each compressed file has the same modification date and
@@ -38,8 +38,8 @@ DESCRIPTION
38 cepts, or have serious file name length restrictions, such 38 cepts, or have serious file name length restrictions, such
39 as MS-DOS. 39 as MS-DOS.
40 40
41 Bzip2 and bunzip2 will not overwrite existing files; if 41 bzip2 and bunzip2 will by default not overwrite existing
42 you want this to happen, you should delete them first. 42 files; if you want this to happen, specify the -f flag.
43 43
44 If no file names are specified, bzip2 compresses from 44 If no file names are specified, bzip2 compresses from
45 standard input to standard output. In this case, bzip2 45 standard input to standard output. In this case, bzip2
@@ -47,28 +47,29 @@ DESCRIPTION
47 this would be entirely incomprehensible and therefore 47 this would be entirely incomprehensible and therefore
48 pointless. 48 pointless.
49 49
50 Bunzip2 (or bzip2 -d ) decompresses and restores all spec- 50 bunzip2 (or bzip2 -d ) decompresses and restores all spec-
51 ified files whose names end in ".bz2". Files without this 51 ified files whose names end in ".bz2". Files without this
52 suffix are ignored. Again, supplying no filenames causes 52 suffix are ignored. Again, supplying no filenames causes
53 decompression from standard input to standard output. 53 decompression from standard input to standard output.
54 54
55 You can also compress or decompress files to the standard 55 bunzip2 will correctly decompress a file which is the con-
56 output by giving the -c flag. You can decompress multiple 56 catenation of two or more compressed files. The result is
57 files like this, but you may only compress a single file 57 the concatenation of the corresponding uncompressed files.
58 this way, since it would otherwise be difficult to sepa- 58 Integrity testing (-t) of concatenated compressed files is
59 rate out the compressed representations of the original 59 also supported.
60 files.
61
62
63
64 1
65
66
67
68
69
70bzip2(1) bzip2(1)
71 60
61 You can also compress or decompress files to the standard
62 output by giving the -c flag. Multiple files may be com-
63 pressed and decompressed like this. The resulting outputs
64 are fed sequentially to stdout. Compression of multiple
65 files in this manner generates a stream containing multi-
66 ple compressed file representations. Such a stream can be
67 decompressed correctly only by bzip2 version 0.9.0 or
68 later. Earlier versions of bzip2 will stop after decom-
69 pressing the first file in the stream.
70
71 bzcat (or bzip2 -dc ) decompresses all specified files to
72 the standard output.
72 73
73 Compression is always performed, even if the compressed 74 Compression is always performed, even if the compressed
74 file is slightly larger than the original. Files of less 75 file is slightly larger than the original. Files of less
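The multi-stream behaviour introduced in the hunk above can be illustrated with a minimal sketch. It uses Python's standard bz2 module instead of the bzip2 command-line tool, which is an assumption made purely for illustration and is not part of this diff. bz2.decompress accepts concatenated streams, while a single BZ2Decompressor object stops after the first stream, which resembles what the text says about versions earlier than 0.9.0.

```python
import bz2

# Two independently compressed inputs, joined the way
# `bzip2 -c a b > out.bz2` would join them.
first = bz2.compress(b"first file\n")
second = bz2.compress(b"second file\n")
stream = first + second

# Multi-stream aware decompression yields the concatenation of both inputs.
combined = bz2.decompress(stream)

# A single decompressor object stops at the end of the first stream and
# leaves the remaining bytes untouched in `unused_data`.
single = bz2.BZ2Decompressor()
only_first = single.decompress(stream)
leftover = single.unused_data
```

Here `combined` holds both payloads back to back, whereas `only_first` holds just the first one and `leftover` is the second compressed stream, still unread.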
@@ -108,13 +109,14 @@ MEMORY MANAGEMENT
108 file, and bunzip2 then allocates itself just enough memory 109 file, and bunzip2 then allocates itself just enough memory
109 to decompress the file. Since block sizes are stored in 110 to decompress the file. Since block sizes are stored in
110 compressed files, it follows that the flags -1 to -9 are 111 compressed files, it follows that the flags -1 to -9 are
111 irrelevant to and so ignored during decompression. Com- 112 irrelevant to and so ignored during decompression.
112 pression and decompression requirements, in bytes, can be 113
113 estimated as: 114 Compression and decompression requirements, in bytes, can
115 be estimated as:
114 116
115 Compression: 400k + ( 7 x block size ) 117 Compression: 400k + ( 7 x block size )
116 118
117 Decompression: 100k + ( 5 x block size ), or 119 Decompression: 100k + ( 4 x block size ), or
118 100k + ( 2.5 x block size ) 120 100k + ( 2.5 x block size )
119 121
120 Larger block sizes give rapidly diminishing marginal 122 Larger block sizes give rapidly diminishing marginal
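The revised estimates in the hunk above can be written out as a small calculation. This is a sketch under stated assumptions: the function name and its parameters are hypothetical, the figures are in kbytes, and the `small` argument only switches the decompression factor to 2.5 (it does not model the 200k block size that -s selects during compression).

```python
def estimate_memory_kb(level, small=False):
    """Rough memory estimates in kbytes for the flags -1 to -9.

    Hypothetical helper; it only restates the formulas from the man page.
    """
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    block_kb = 100 * level             # -1 .. -9 select 100k .. 900k blocks
    compress = 400 + 7 * block_kb      # Compression: 400k + (7 x block size)
    factor = 2.5 if small else 4       # -s needs 2.5 bytes per block byte
    decompress = 100 + factor * block_kb
    return compress, decompress
```

For the default -9 this gives 6700k to compress and 3700k to decompress (2350k with `small=True`), in line with the figures quoted in the new text. These are estimates; individual rows of the page's table may differ slightly.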
@@ -125,19 +127,8 @@ MEMORY MANAGEMENT
125 requirement is set at compression-time by the choice of 127 requirement is set at compression-time by the choice of
126 block size. 128 block size.
127 129
128
129
130 2
131
132
133
134
135
136bzip2(1) bzip2(1)
137
138
139 For files compressed with the default 900k block size, 130 For files compressed with the default 900k block size,
140 bunzip2 will require about 4600 kbytes to decompress. To 131 bunzip2 will require about 3700 kbytes to decompress. To
141 support decompression of any file on a 4 megabyte machine, 132 support decompression of any file on a 4 megabyte machine,
142 bunzip2 has an option to decompress using approximately 133 bunzip2 has an option to decompress using approximately
143 half this amount of memory, about 2300 kbytes. Decompres- 134 half this amount of memory, about 2300 kbytes. Decompres-
@@ -157,8 +148,8 @@ bzip2(1) bzip2(1)
157 file 20,000 bytes long with the flag -9 will cause the 148 file 20,000 bytes long with the flag -9 will cause the
158 compressor to allocate around 6700k of memory, but only 149 compressor to allocate around 6700k of memory, but only
159 touch 400k + 20000 * 7 = 540 kbytes of it. Similarly, the 150 touch 400k + 20000 * 7 = 540 kbytes of it. Similarly, the
160 decompressor will allocate 4600k but only touch 100k + 151 decompressor will allocate 3700k but only touch 100k +
161 20000 * 5 = 200 kbytes. 152 20000 * 4 = 180 kbytes.
162 153
163 Here is a table which summarises the maximum memory usage 154 Here is a table which summarises the maximum memory usage
164 for different block sizes. Also recorded is the total 155 for different block sizes. Also recorded is the total
@@ -172,15 +163,15 @@ bzip2(1) bzip2(1)
172 Compress Decompress Decompress Corpus 163 Compress Decompress Decompress Corpus
173 Flag usage usage -s usage Size 164 Flag usage usage -s usage Size
174 165
175 -1 1100k 600k 350k 914704 166 -1 1100k 500k 350k 914704
176 -2 1800k 1100k 600k 877703 167 -2 1800k 900k 600k 877703
177 -3 2500k 1600k 850k 860338 168 -3 2500k 1300k 850k 860338
178 -4 3200k 2100k 1100k 846899 169 -4 3200k 1700k 1100k 846899
179 -5 3900k 2600k 1350k 845160 170 -5 3900k 2100k 1350k 845160
180 -6 4600k 3100k 1600k 838626 171 -6 4600k 2500k 1600k 838626
181 -7 5400k 3600k 1850k 834096 172 -7 5400k 2900k 1850k 834096
182 -8 6000k 4100k 2100k 828642 173 -8 6000k 3300k 2100k 828642
183 -9 6700k 4600k 2350k 828642 174 -9 6700k 3700k 2350k 828642
184 175
185 176
186OPTIONS 177OPTIONS
@@ -189,47 +180,37 @@ OPTIONS
189 decompress multiple files to stdout, but will only 180 decompress multiple files to stdout, but will only
190 compress a single file to stdout. 181 compress a single file to stdout.
191 182
192
193
194
195
196 3
197
198
199
200
201
202bzip2(1) bzip2(1)
203
204
205 -d --decompress 183 -d --decompress
206 Force decompression. Bzip2 and bunzip2 are really 184 Force decompression. bzip2, bunzip2 and bzcat are
207 the same program, and the decision about whether to 185 really the same program, and the decision about
208 compress or decompress is done on the basis of 186 what actions to take is done on the basis of which
209 which name is used. This flag overrides that mech- 187 name is used. This flag overrides that mechanism,
210 anism, and forces bzip2 to decompress. 188 and forces bzip2 to decompress.
211 189
212 -f --compress 190 -z --compress
213 The complement to -d: forces compression, regard- 191 The complement to -d: forces compression, regard-
214 less of the invokation name. 192 less of the invokation name.
215 193
216 -t --test 194 -t --test
217 Check integrity of the specified file(s), but don't 195 Check integrity of the specified file(s), but don't
218 decompress them. This really performs a trial 196 decompress them. This really performs a trial
219 decompression and throws away the result, using the 197 decompression and throws away the result.
220 low-memory decompression algorithm (see -s). 198
199 -f --force
200 Force overwrite of output files. Normally, bzip2
201 will not overwrite existing output files.
221 202
222 -k --keep 203 -k --keep
223 Keep (don't delete) input files during compression 204 Keep (don't delete) input files during compression
224 or decompression. 205 or decompression.
225 206
226 -s --small 207 -s --small
227 Reduce memory usage, both for compression and 208 Reduce memory usage, for compression, decompression
228 decompression. Files are decompressed using a mod- 209 and testing. Files are decompressed and tested
229 ified algorithm which only requires 2.5 bytes per 210 using a modified algorithm which only requires 2.5
230 block byte. This means any file can be decom- 211 bytes per block byte. This means any file can be
231 pressed in 2300k of memory, albeit somewhat more 212 decompressed in 2300k of memory, albeit at about
232 slowly than usual. 213 half the normal speed.
233 214
234 During compression, -s selects a block size of 215 During compression, -s selects a block size of
235 200k, which limits memory use to around the same 216 200k, which limits memory use to around the same
@@ -238,36 +219,21 @@ bzip2(1) bzip2(1)
238 megabytes or less), use -s for everything. See 219 megabytes or less), use -s for everything. See
239 MEMORY MANAGEMENT above. 220 MEMORY MANAGEMENT above.
240 221
241
242 -v --verbose 222 -v --verbose
243 Verbose mode -- show the compression ratio for each 223 Verbose mode -- show the compression ratio for each
244 file processed. Further -v's increase the ver- 224 file processed. Further -v's increase the ver-
245 bosity level, spewing out lots of information which 225 bosity level, spewing out lots of information which
246 is primarily of interest for diagnostic purposes. 226 is primarily of interest for diagnostic purposes.
247 227
248 -L --license 228 -L --license -V --version
249 Display the software version, license terms and 229 Display the software version, license terms and
250 conditions. 230 conditions.
251 231
252 -V --version
253 Same as -L.
254
255 -1 to -9 232 -1 to -9
256 Set the block size to 100 k, 200 k .. 900 k when 233 Set the block size to 100 k, 200 k .. 900 k when
257 compressing. Has no effect when decompressing. 234 compressing. Has no effect when decompressing.
258 See MEMORY MANAGEMENT above. 235 See MEMORY MANAGEMENT above.
259 236
260
261
262 4
263
264
265
266
267
268bzip2(1) bzip2(1)
269
270
271 --repetitive-fast 237 --repetitive-fast
272 bzip2 injects some small pseudo-random variations 238 bzip2 injects some small pseudo-random variations
273 into very repetitive blocks to limit worst-case 239 into very repetitive blocks to limit worst-case
@@ -278,7 +244,6 @@ bzip2(1) bzip2(1)
278 would take before resorting to randomisation. This 244 would take before resorting to randomisation. This
279 flag makes it give up much sooner. 245 flag makes it give up much sooner.
280 246
281
282 --repetitive-best 247 --repetitive-best
283 Opposite of --repetitive-fast; try a lot harder 248 Opposite of --repetitive-fast; try a lot harder
284 before resorting to randomisation. 249 before resorting to randomisation.
@@ -306,10 +271,10 @@ RECOVERING DATA FROM DAMAGED FILES
306 bzip2recover takes a single argument, the name of the dam- 271 bzip2recover takes a single argument, the name of the dam-
307 aged file, and writes a number of files "rec0001file.bz2", 272 aged file, and writes a number of files "rec0001file.bz2",
308 "rec0002file.bz2", etc, containing the extracted blocks. 273 "rec0002file.bz2", etc, containing the extracted blocks.
309 The output filenames are designed so that the use of wild- 274 The output filenames are designed so that the use of
310 cards in subsequent processing -- for example, "bzip2 -dc 275 wildcards in subsequent processing -- for example, "bzip2
311 rec*file.bz2 > recovered_data" -- lists the files in the 276 -dc rec*file.bz2 > recovered_data" -- lists the files in
312 "right" order. 277 the "right" order.
313 278
314 bzip2recover should be of most use dealing with large .bz2 279 bzip2recover should be of most use dealing with large .bz2
315 files, as these will contain many blocks. It is clearly 280 files, as these will contain many blocks. It is clearly
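The remark about wildcard ordering in the hunk above depends on the zero-padded block numbers in the recovered file names. A minimal sketch of why the padding matters follows; the block numbers are invented for illustration, and it assumes the shell expands wildcards in plain sorted order.

```python
# Hypothetical block numbers, formatted like "rec0001file.bz2".
blocks = (10, 2, 1, 9)

# Zero-padded names sort as strings in the same order as the block numbers.
padded = sorted(f"rec{n:04d}file.bz2" for n in blocks)

# Without padding, string sorting would place block 10 before block 1.
unpadded = sorted(f"rec{n}file.bz2" for n in blocks)
```

With padding, `padded` runs from rec0001file.bz2 to rec0010file.bz2 in block order, so "bzip2 -dc rec*file.bz2" would see the blocks in sequence; `unpadded` starts with rec10file.bz2.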
@@ -322,18 +287,6 @@ RECOVERING DATA FROM DAMAGED FILES
322 287
323PERFORMANCE NOTES 288PERFORMANCE NOTES
324 The sorting phase of compression gathers together similar 289 The sorting phase of compression gathers together similar
325
326
327
328 5
329
330
331
332
333
334bzip2(1) bzip2(1)
335
336
337 strings in the file. Because of this, files containing 290 strings in the file. Because of this, files containing
338 very long runs of repeated symbols, like "aabaabaabaab 291 very long runs of repeated symbols, like "aabaabaabaab
339 ..." (repeated several hundred times) may compress 292 ..." (repeated several hundred times) may compress
@@ -348,10 +301,6 @@ bzip2(1) bzip2(1)
348 severe slowness in compression, try making the block size 301 severe slowness in compression, try making the block size
349 as small as possible, with flag -1. 302 as small as possible, with flag -1.
350 303
351 Incompressible or virtually-incompressible data may decom-
352 press rather more slowly than one would hope. This is due
353 to a naive implementation of the move-to-front coder.
354
355 bzip2 usually allocates several megabytes of memory to 304 bzip2 usually allocates several megabytes of memory to
356 operate in, and then charges all over it in a fairly ran- 305 operate in, and then charges all over it in a fairly ran-
357 dom fashion. This means that performance, both for com- 306 dom fashion. This means that performance, both for com-
@@ -362,12 +311,6 @@ bzip2(1) bzip2(1)
362 large performance improvements. I imagine bzip2 will per- 311 large performance improvements. I imagine bzip2 will per-
363 form best on machines with very large caches. 312 form best on machines with very large caches.
364 313
365 Test mode (-t) uses the low-memory decompression algorithm
366 (-s). This means test mode does not run as fast as it
367 could; it could run as fast as the normal decompression
368 machinery. This could easily be fixed at the cost of some
369 code bloat.
370
371 314
372CAVEATS 315CAVEATS
373 I/O error messages are not as helpful as they could be. 316 I/O error messages are not as helpful as they could be.
@@ -375,91 +318,38 @@ CAVEATS
375 but the details of what the problem is sometimes seem 318 but the details of what the problem is sometimes seem
376 rather misleading. 319 rather misleading.
377 320
378 This manual page pertains to version 0.1 of bzip2. It may 321 This manual page pertains to version 0.9.0 of bzip2. Com-
379 well happen that some future version will use a different 322 pressed data created by this version is entirely forwards
380 compressed file format. If you try to decompress, using 323 and backwards compatible with the previous public release,
381 0.1, a .bz2 file created with some future version which 324 version 0.1pl2, but with the following exception: 0.9.0
382 uses a different compressed file format, 0.1 will complain 325 can correctly decompress multiple concatenated compressed
383 that your file "is not a bzip2 file". If that happens, 326 files. 0.1pl2 cannot do this; it will stop after decom-
384 you should obtain a more recent version of bzip2 and use 327 pressing just the first file in the stream.
385 that to decompress the file.
386 328
387 Wildcard expansion for Windows 95 and NT is flaky. 329 Wildcard expansion for Windows 95 and NT is flaky.
388 330
389 bzip2recover uses 32-bit integers to represent bit posi- 331 bzip2recover uses 32-bit integers to represent bit posi-
390 tions in compressed files, so it cannot handle compressed 332 tions in compressed files, so it cannot handle compressed
391 333 files more than 512 megabytes long. This could easily be
392
393
394 6
395
396
397
398
399
400bzip2(1) bzip2(1)
401
402
403 files more than 512 megabytes long. This could easily be
404 fixed. 334 fixed.
405 335
406 bzip2recover sometimes reports a very small, incomplete
407 final block. This is spurious and can be safely ignored.
408
409
410RELATIONSHIP TO bzip-0.21
411 This program is a descendant of the bzip program, version
412 0.21, which I released in August 1996. The primary dif-
413 ference of bzip2 is its avoidance of the possibly patented
414 algorithms which were used in 0.21. bzip2 also brings
415 various useful refinements (-s, -t), uses less memory,
416 decompresses significantly faster, and has support for
417 recovering data from damaged files.
418
419 Because bzip2 uses Huffman coding to construct the com-
420 pressed bitstream, rather than the arithmetic coding used
421 in 0.21, the compressed representations generated by the
422 two programs are incompatible, and they will not interop-
423 erate. The change in suffix from .bz to .bz2 reflects
424 this. It would have been helpful to at least allow bzip2
425 to decompress files created by 0.21, but this would defeat
426 the primary aim of having a patent-free compressor.
427
428 For a more precise statement about patent issues in bzip2,
429 please see the README file in the distribution.
430
431 Huffman coding necessarily involves some coding ineffi-
432 ciency compared to arithmetic coding. This means that
433 bzip2 compresses about 1% worse than 0.21, an unfortunate
434 but unavoidable fact-of-life. On the other hand, decom-
435 pression is approximately 50% faster for the same reason,
436 and the change in file format gave an opportunity to add
437 data-recovery features. So it is not all bad.
438
439 336
440AUTHOR 337AUTHOR
441 Julian Seward, jseward@acm.org. 338 Julian Seward, jseward@acm.org.
442 339 http://www.muraroa.demon.co.uk
443 The ideas embodied in bzip and bzip2 are due to (at least) 340
444 the following people: Michael Burrows and David Wheeler 341 The ideas embodied in bzip2 are due to (at least) the fol-
445 (for the block sorting transformation), David Wheeler 342 lowing people: Michael Burrows and David Wheeler (for the
446 (again, for the Huffman coder), Peter Fenwick (for the 343 block sorting transformation), David Wheeler (again, for
447 structured coding model in 0.21, and many refinements), 344 the Huffman coder), Peter Fenwick (for the structured cod-
448 and Alistair Moffat, Radford Neal and Ian Witten (for the 345 ing model in the original bzip, and many refinements), and
449 arithmetic coder in 0.21). I am much indebted for their 346 Alistair Moffat, Radford Neal and Ian Witten (for the
450 help, support and advice. See the file ALGORITHMS in the 347 arithmetic coder in the original bzip). I am much
451 source distribution for pointers to sources of documenta- 348 indebted for their help, support and advice. See the man-
452 tion. Christian von Roques encouraged me to look for 349 ual in the source distribution for pointers to sources of
453 faster sorting algorithms, so as to speed up compression. 350 documentation. Christian von Roques encouraged me to look
454 Bela Lubkin encouraged me to improve the worst-case com- 351 for faster sorting algorithms, so as to speed up compres-
455 pression performance. Many people sent patches, helped 352 sion. Bela Lubkin encouraged me to improve the worst-case
353 compression performance. Many people sent patches, helped
456 with portability problems, lent machines, gave advice and 354 with portability problems, lent machines, gave advice and
457 were generally helpful. 355 were generally helpful.
458
459
460
461
462
463 7
464
465