Diffstat (limited to 'bzip2.txt')
-rw-r--r-- bzip2.txt 134
1 files changed, 74 insertions, 60 deletions
diff --git a/bzip2.txt b/bzip2.txt
index 4f1ae86..6afe358 100644
--- a/bzip2.txt
+++ b/bzip2.txt
@@ -1,7 +1,6 @@
1 1
2
3 NAME 2 NAME
4 bzip2, bunzip2 - a block-sorting file compressor, v1.0 3 bzip2, bunzip2 - a block-sorting file compressor, v1.0.2
5 bzcat - decompresses files to stdout 4 bzcat - decompresses files to stdout
6 bzip2recover - recovers data from damaged bzip2 files 5 bzip2recover - recovers data from damaged bzip2 files
7 6
@@ -18,20 +17,20 @@ DESCRIPTION
18 sorting text compression algorithm, and Huffman coding. 17 sorting text compression algorithm, and Huffman coding.
19 Compression is generally considerably better than that 18 Compression is generally considerably better than that
20 achieved by more conventional LZ77/LZ78-based compressors, 19 achieved by more conventional LZ77/LZ78-based compressors,
21 and approaches the performance of the PPM family of sta- 20 and approaches the performance of the PPM family of sta­
22 tistical compressors. 21 tistical compressors.
23 22
24 The command-line options are deliberately very similar to 23 The command-line options are deliberately very similar to
25 those of GNU gzip, but they are not identical. 24 those of GNU gzip, but they are not identical.
26 25
27 bzip2 expects a list of file names to accompany the com- 26 bzip2 expects a list of file names to accompany the com­
28 mand-line flags. Each file is replaced by a compressed 27 mand-line flags. Each file is replaced by a compressed
29 version of itself, with the name "original_name.bz2". 28 version of itself, with the name "original_name.bz2".
30 Each compressed file has the same modification date, per- 29 Each compressed file has the same modification date, per­
31 missions, and, when possible, ownership as the correspond- 30 missions, and, when possible, ownership as the correspond­
32 ing original, so that these properties can be correctly 31 ing original, so that these properties can be correctly
33 restored at decompression time. File name handling is 32 restored at decompression time. File name handling is
34 naive in the sense that there is no mechanism for preserv- 33 naive in the sense that there is no mechanism for preserv­
35 ing original file names, permissions, ownerships or dates 34 ing original file names, permissions, ownerships or dates
36 in filesystems which lack these concepts, or have serious 35 in filesystems which lack these concepts, or have serious
37 file name length restrictions, such as MS-DOS. 36 file name length restrictions, such as MS-DOS.
@@ -62,23 +61,23 @@ DESCRIPTION
62 guess the name of the original file, and uses the original 61 guess the name of the original file, and uses the original
63 name with .out appended. 62 name with .out appended.
64 63
65 As with compression, supplying no filenames causes decom- 64 As with compression, supplying no filenames causes decom­
66 pression from standard input to standard output. 65 pression from standard input to standard output.
67 66
68 bunzip2 will correctly decompress a file which is the con- 67 bunzip2 will correctly decompress a file which is the con­
69 catenation of two or more compressed files. The result is 68 catenation of two or more compressed files. The result is
70 the concatenation of the corresponding uncompressed files. 69 the concatenation of the corresponding uncompressed files.
71 Integrity testing (-t) of concatenated compressed files is 70 Integrity testing (-t) of concatenated compressed files is
72 also supported. 71 also supported.
73 72
74 You can also compress or decompress files to the standard 73 You can also compress or decompress files to the standard
75 output by giving the -c flag. Multiple files may be com- 74 output by giving the -c flag. Multiple files may be com­
76 pressed and decompressed like this. The resulting outputs 75 pressed and decompressed like this. The resulting outputs
77 are fed sequentially to stdout. Compression of multiple 76 are fed sequentially to stdout. Compression of multiple
78 files in this manner generates a stream containing multi- 77 files in this manner generates a stream containing multi­
79 ple compressed file representations. Such a stream can be 78 ple compressed file representations. Such a stream can be
80 decompressed correctly only by bzip2 version 0.9.0 or 79 decompressed correctly only by bzip2 version 0.9.0 or
81 later. Earlier versions of bzip2 will stop after decom- 80 later. Earlier versions of bzip2 will stop after decom­
82 pressing the first file in the stream. 81 pressing the first file in the stream.
83 82
84 bzcat (or bzip2 -dc) decompresses all specified files to 83 bzcat (or bzip2 -dc) decompresses all specified files to
@@ -99,7 +98,7 @@ DESCRIPTION
99 98
100 As a self-check for your protection, bzip2 uses 32-bit 99 As a self-check for your protection, bzip2 uses 32-bit
101 CRCs to make sure that the decompressed version of a file 100 CRCs to make sure that the decompressed version of a file
102 is identical to the original. This guards against corrup- 101 is identical to the original. This guards against corrup­
103 tion of the compressed data, and against undetected bugs 102 tion of the compressed data, and against undetected bugs
104 in bzip2 (hopefully very unlikely). The chances of data 103 in bzip2 (hopefully very unlikely). The chances of data
105 corruption going undetected is microscopic, about one 104 corruption going undetected is microscopic, about one
@@ -127,8 +126,8 @@ OPTIONS
127 and forces bzip2 to decompress. 126 and forces bzip2 to decompress.
128 127
129 -z --compress 128 -z --compress
130 The complement to -d: forces compression, regard- 129 The complement to -d: forces compression,
131 less of the invokation name. 130 regardless of the invocation name.
132 131
133 -t --test 132 -t --test
134 Check integrity of the specified file(s), but don't 133 Check integrity of the specified file(s), but don't
@@ -141,6 +140,11 @@ OPTIONS
141 forces bzip2 to break hard links to files, which it 140 forces bzip2 to break hard links to files, which it
142 otherwise wouldn't do. 141 otherwise wouldn't do.
143 142
143 bzip2 normally declines to decompress files which
144 don't have the correct magic header bytes. If
145 forced (-f), however, it will pass such files
146 through unmodified. This is how GNU gzip behaves.
147
144 -k --keep 148 -k --keep
145 Keep (don't delete) input files during compression 149 Keep (don't delete) input files during compression
146 or decompression. 150 or decompression.
@@ -167,7 +171,7 @@ OPTIONS
167 171
168 -v --verbose 172 -v --verbose
169 Verbose mode -- show the compression ratio for each 173 Verbose mode -- show the compression ratio for each
170 file processed. Further -v's increase the ver- 174 file processed. Further -v's increase the ver­
171 bosity level, spewing out lots of information which 175 bosity level, spewing out lots of information which
172 is primarily of interest for diagnostic purposes. 176 is primarily of interest for diagnostic purposes.
173 177
@@ -175,20 +179,24 @@ OPTIONS
175 Display the software version, license terms and 179 Display the software version, license terms and
176 conditions. 180 conditions.
177 181
178 -1 to -9 182 -1 (or --fast) to -9 (or --best)
179 Set the block size to 100 k, 200 k .. 900 k when 183 Set the block size to 100 k, 200 k .. 900 k when
180 compressing. Has no effect when decompressing. 184 compressing. Has no effect when decompressing.
181 See MEMORY MANAGEMENT below. 185 See MEMORY MANAGEMENT below. The --fast and --best
186 aliases are primarily for GNU gzip compatibility.
187 In particular, --fast doesn't make things signifi­
188 cantly faster. And --best merely selects the
189 default behaviour.
182 190
183 -- Treats all subsequent arguments as file names, even 191 -- Treats all subsequent arguments as file names, even
184 if they start with a dash. This is so you can han- 192 if they start with a dash. This is so you can han­
185 dle files with names beginning with a dash, for 193 dle files with names beginning with a dash, for
186 example: bzip2 -- -myfilename. 194 example: bzip2 -- -myfilename.
187 195
188 --repetitive-fast --repetitive-best 196 --repetitive-fast --repetitive-best
189 These flags are redundant in versions 0.9.5 and 197 These flags are redundant in versions 0.9.5 and
190 above. They provided some coarse control over the 198 above. They provided some coarse control over the
191 behaviour of the sorting algorithm in earlier ver- 199 behaviour of the sorting algorithm in earlier ver­
192 sions, which was sometimes useful. 0.9.5 and above 200 sions, which was sometimes useful. 0.9.5 and above
193 have an improved algorithm which renders these 201 have an improved algorithm which renders these
194 flags irrelevant. 202 flags irrelevant.
@@ -199,7 +207,7 @@ MEMORY MANAGEMENT
199 affects both the compression ratio achieved, and the 207 affects both the compression ratio achieved, and the
200 amount of memory needed for compression and decompression. 208 amount of memory needed for compression and decompression.
201 The flags -1 through -9 specify the block size to be 209 The flags -1 through -9 specify the block size to be
202 100,000 bytes through 900,000 bytes (the default) respec- 210 100,000 bytes through 900,000 bytes (the default) respec­
203 tively. At decompression time, the block size used for 211 tively. At decompression time, the block size used for
204 compression is read from the header of the compressed 212 compression is read from the header of the compressed
205 file, and bunzip2 then allocates itself just enough memory 213 file, and bunzip2 then allocates itself just enough memory
@@ -227,13 +235,13 @@ MEMORY MANAGEMENT
227 bunzip2 will require about 3700 kbytes to decompress. To 235 bunzip2 will require about 3700 kbytes to decompress. To
228 support decompression of any file on a 4 megabyte machine, 236 support decompression of any file on a 4 megabyte machine,
229 bunzip2 has an option to decompress using approximately 237 bunzip2 has an option to decompress using approximately
230 half this amount of memory, about 2300 kbytes. Decompres- 238 half this amount of memory, about 2300 kbytes. Decompres­
231 sion speed is also halved, so you should use this option 239 sion speed is also halved, so you should use this option
232 only where necessary. The relevant flag is -s. 240 only where necessary. The relevant flag is -s.
233 241
234 In general, try and use the largest block size memory con- 242 In general, try and use the largest block size memory con­
235 straints allow, since that maximises the compression 243 straints allow, since that maximises the compression
236 achieved. Compression and decompression speed are virtu- 244 achieved. Compression and decompression speed are virtu­
237 ally unaffected by block size. 245 ally unaffected by block size.
238 246
239 Another significant point applies to files which fit in a 247 Another significant point applies to files which fit in a
@@ -249,11 +257,11 @@ MEMORY MANAGEMENT
249 257
250 Here is a table which summarises the maximum memory usage 258 Here is a table which summarises the maximum memory usage
251 for different block sizes. Also recorded is the total 259 for different block sizes. Also recorded is the total
252 compressed size for 14 files of the Calgary Text Compres- 260 compressed size for 14 files of the Calgary Text Compres­
253 sion Corpus totalling 3,141,622 bytes. This column gives 261 sion Corpus totalling 3,141,622 bytes. This column gives
254 some feel for how compression varies with block size. 262 some feel for how compression varies with block size.
255 These figures tend to understate the advantage of larger 263 These figures tend to understate the advantage of larger
256 block sizes for larger files, since the Corpus is domi- 264 block sizes for larger files, since the Corpus is domi­
257 nated by smaller files. 265 nated by smaller files.
258 266
259 Compress Decompress Decompress Corpus 267 Compress Decompress Decompress Corpus
@@ -272,7 +280,7 @@ MEMORY MANAGEMENT
272 280
273 RECOVERING DATA FROM DAMAGED FILES 281 RECOVERING DATA FROM DAMAGED FILES
274 bzip2 compresses files in blocks, usually 900kbytes long. 282 bzip2 compresses files in blocks, usually 900kbytes long.
275 Each block is handled independently. If a media or trans- 283 Each block is handled independently. If a media or trans­
276 mission error causes a multi-block .bz2 file to become 284 mission error causes a multi-block .bz2 file to become
277 damaged, it may be possible to recover data from the 285 damaged, it may be possible to recover data from the
278 undamaged blocks in the file. 286 undamaged blocks in the file.
@@ -289,19 +297,19 @@ RECOVERING DATA FROM DAMAGED FILES
289 the integrity of the resulting files, and decompress those 297 the integrity of the resulting files, and decompress those
290 which are undamaged. 298 which are undamaged.
291 299
292 bzip2recover takes a single argument, the name of the dam- 300 bzip2recover takes a single argument, the name of the dam­
293 aged file, and writes a number of files "rec0001file.bz2", 301 aged file, and writes a number of files
294 "rec0002file.bz2", etc, containing the extracted blocks. 302 "rec00001file.bz2", "rec00002file.bz2", etc, containing
295 The output filenames are designed so that the use of 303 the extracted blocks. The output filenames are
296 wildcards in subsequent processing -- for example, "bzip2 304 designed so that the use of wildcards in subsequent pro­
297 -dc rec*file.bz2 > recovered_data" -- lists the files in 305 cessing -- for example, "bzip2 -dc rec*file.bz2 > recov­
298 the correct order. 306 ered_data" -- processes the files in the correct order.
299 307
300 bzip2recover should be of most use dealing with large .bz2 308 bzip2recover should be of most use dealing with large .bz2
301 files, as these will contain many blocks. It is clearly 309 files, as these will contain many blocks. It is clearly
302 futile to use it on damaged single-block files, since a 310 futile to use it on damaged single-block files, since a
303 damaged block cannot be recovered. If you wish to min- 311 damaged block cannot be recovered. If you wish to min­
304 imise any potential data loss through media or transmis- 312 imise any potential data loss through media or transmis­
305 sion errors, you might consider compressing with a smaller 313 sion errors, you might consider compressing with a smaller
306 block size. 314 block size.
307 315
@@ -315,19 +323,19 @@ PERFORMANCE NOTES
315 better than previous versions in this respect. The ratio 323 better than previous versions in this respect. The ratio
316 between worst-case and average-case compression time is in 324 between worst-case and average-case compression time is in
317 the region of 10:1. For previous versions, this figure 325 the region of 10:1. For previous versions, this figure
318 was more like 100:1. You can use the -vvvv option to mon- 326 was more like 100:1. You can use the -vvvv option to mon­
319 itor progress in great detail, if you want. 327 itor progress in great detail, if you want.
320 328
321 Decompression speed is unaffected by these phenomena. 329 Decompression speed is unaffected by these phenomena.
322 330
323 bzip2 usually allocates several megabytes of memory to 331 bzip2 usually allocates several megabytes of memory to
324 operate in, and then charges all over it in a fairly ran- 332 operate in, and then charges all over it in a fairly ran­
325 dom fashion. This means that performance, both for com- 333 dom fashion. This means that performance, both for com­
326 pressing and decompressing, is largely determined by the 334 pressing and decompressing, is largely determined by the
327 speed at which your machine can service cache misses. 335 speed at which your machine can service cache misses.
328 Because of this, small changes to the code to reduce the 336 Because of this, small changes to the code to reduce the
329 miss rate have been observed to give disproportionately 337 miss rate have been observed to give disproportionately
330 large performance improvements. I imagine bzip2 will per- 338 large performance improvements. I imagine bzip2 will per­
331 form best on machines with very large caches. 339 form best on machines with very large caches.
332 340
333 341
@@ -337,40 +345,46 @@ CAVEATS
337 but the details of what the problem is sometimes seem 345 but the details of what the problem is sometimes seem
338 rather misleading. 346 rather misleading.
339 347
340 This manual page pertains to version 1.0 of bzip2. Com- 348 This manual page pertains to version 1.0.2 of bzip2. Com­
341 pressed data created by this version is entirely forwards 349 pressed data created by this version is entirely forwards
342 and backwards compatible with the previous public 350 and backwards compatible with the previous public
343 releases, versions 0.1pl2, 0.9.0 and 0.9.5, but with the 351 releases, versions 0.1pl2, 0.9.0, 0.9.5, 1.0.0 and 1.0.1,
344 following exception: 0.9.0 and above can correctly decom- 352 but with the following exception: 0.9.0 and above can cor­
345 press multiple concatenated compressed files. 0.1pl2 can- 353 rectly decompress multiple concatenated compressed files.
346 not do this; it will stop after decompressing just the 354 0.1pl2 cannot do this; it will stop after decompressing
347 first file in the stream. 355 just the first file in the stream.
348 356
349 bzip2recover uses 32-bit integers to represent bit posi- 357 bzip2recover versions prior to this one, 1.0.2, used
350 tions in compressed files, so it cannot handle compressed 358 32-bit integers to represent bit positions in compressed
351 files more than 512 megabytes long. This could easily be 359 files, so it could not handle compressed files more than
352 fixed. 360 512 megabytes long. Version 1.0.2 and above uses 64-bit
361 ints on some platforms which support them (GNU supported
362 targets, and Windows). To establish whether or not
363 bzip2recover was built with such a limitation, run it
364 without arguments. In any event you can build yourself an
365 unlimited version if you can recompile it with MaybeUInt64
366 set to be an unsigned 64-bit integer.
353 367
354 368
355 AUTHOR 369 AUTHOR
356 Julian Seward, jseward@acm.org. 370 Julian Seward, jseward@acm.org.
357 371
358 http://sourceware.cygnus.com/bzip2 372 http://sources.redhat.com/bzip2
359 http://www.muraroa.demon.co.uk
360 373
361 The ideas embodied in bzip2 are due to (at least) the fol- 374 The ideas embodied in bzip2 are due to (at least) the fol­
362 lowing people: Michael Burrows and David Wheeler (for the 375 lowing people: Michael Burrows and David Wheeler (for the
363 block sorting transformation), David Wheeler (again, for 376 block sorting transformation), David Wheeler (again, for
364 the Huffman coder), Peter Fenwick (for the structured cod- 377 the Huffman coder), Peter Fenwick (for the structured cod­
365 ing model in the original bzip, and many refinements), and 378 ing model in the original bzip, and many refinements), and
366 Alistair Moffat, Radford Neal and Ian Witten (for the 379 Alistair Moffat, Radford Neal and Ian Witten (for the
367 arithmetic coder in the original bzip). I am much 380 arithmetic coder in the original bzip). I am much
368 indebted for their help, support and advice. See the man- 381 indebted for their help, support and advice. See the man­
369 ual in the source distribution for pointers to sources of 382 ual in the source distribution for pointers to sources of
370 documentation. Christian von Roques encouraged me to look 383 documentation. Christian von Roques encouraged me to look
371 for faster sorting algorithms, so as to speed up compres- 384 for faster sorting algorithms, so as to speed up compres­
372 sion. Bela Lubkin encouraged me to improve the worst-case 385 sion. Bela Lubkin encouraged me to improve the worst-case
373 compression performance. Many people sent patches, helped 386 compression performance. The bz* scripts are derived from
374 with portability problems, lent machines, gave advice and 387 those of GNU gzip. Many people sent patches, helped with
375 were generally helpful. 388 portability problems, lent machines, gave advice and were
389 generally helpful.
376 390
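
Two behaviours described in the man page text above can be checked with a short script: a concatenation of compressed files decompresses to the concatenation of the originals, and the block size chosen by -1 to -9 is recorded in the header of the compressed file. This is a minimal sketch, not part of the patch. It assumes Python 3.3 or later and uses the standard-library bz2 module rather than the bzip2 command-line tool; the sample byte strings and variable names are made up for illustration.

```python
import bz2

# Two independently compressed streams, standing in for the output
# of compressing two separate files (sample data is hypothetical).
first = bz2.compress(b"alpha\n")
second = bz2.compress(b"beta\n")

# Concatenating the compressed streams mimics what "bzip2 -c a b"
# writes to stdout: several compressed representations back to back.
combined = first + second

# Decompressing the multi-stream input yields the concatenation of
# the corresponding uncompressed data.
restored = bz2.decompress(combined)
print(restored)

# The stream header is "BZh" followed by a digit '1'..'9' giving the
# block size in units of 100k; compresslevel=1 corresponds to -1.
header = bz2.compress(b"alpha\n", compresslevel=1)[:4]
print(header)
```

The header check illustrates why a decompressor can size its memory allocation before reading any block data, as the MEMORY MANAGEMENT section describes.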