Diffstat (limited to 'bzip2.txt')
-rw-r--r-- | bzip2.txt | 134 |
1 files changed, 74 insertions, 60 deletions
@@ -1,7 +1,6 @@ | |||
1 | 1 | ||
2 | |||
3 | NAME | 2 | NAME |
4 | bzip2, bunzip2 - a block-sorting file compressor, v1.0 | 3 | bzip2, bunzip2 - a block-sorting file compressor, v1.0.2 |
5 | bzcat - decompresses files to stdout | 4 | bzcat - decompresses files to stdout |
6 | bzip2recover - recovers data from damaged bzip2 files | 5 | bzip2recover - recovers data from damaged bzip2 files |
7 | 6 | ||
@@ -18,20 +17,20 @@ DESCRIPTION | |||
18 | sorting text compression algorithm, and Huffman coding. | 17 | sorting text compression algorithm, and Huffman coding. |
19 | Compression is generally considerably better than that | 18 | Compression is generally considerably better than that |
20 | achieved by more conventional LZ77/LZ78-based compressors, | 19 | achieved by more conventional LZ77/LZ78-based compressors, |
21 | and approaches the performance of the PPM family of sta- | 20 | and approaches the performance of the PPM family of sta |
22 | tistical compressors. | 21 | tistical compressors. |
23 | 22 | ||
24 | The command-line options are deliberately very similar to | 23 | The command-line options are deliberately very similar to |
25 | those of GNU gzip, but they are not identical. | 24 | those of GNU gzip, but they are not identical. |
26 | 25 | ||
27 | bzip2 expects a list of file names to accompany the com- | 26 | bzip2 expects a list of file names to accompany the com |
28 | mand-line flags. Each file is replaced by a compressed | 27 | mand-line flags. Each file is replaced by a compressed |
29 | version of itself, with the name "original_name.bz2". | 28 | version of itself, with the name "original_name.bz2". |
30 | Each compressed file has the same modification date, per- | 29 | Each compressed file has the same modification date, per |
31 | missions, and, when possible, ownership as the correspond- | 30 | missions, and, when possible, ownership as the correspond |
32 | ing original, so that these properties can be correctly | 31 | ing original, so that these properties can be correctly |
33 | restored at decompression time. File name handling is | 32 | restored at decompression time. File name handling is |
34 | naive in the sense that there is no mechanism for preserv- | 33 | naive in the sense that there is no mechanism for preserv |
35 | ing original file names, permissions, ownerships or dates | 34 | ing original file names, permissions, ownerships or dates |
36 | in filesystems which lack these concepts, or have serious | 35 | in filesystems which lack these concepts, or have serious |
37 | file name length restrictions, such as MS-DOS. | 36 | file name length restrictions, such as MS-DOS. |
@@ -62,23 +61,23 @@ DESCRIPTION | |||
62 | guess the name of the original file, and uses the original | 61 | guess the name of the original file, and uses the original |
63 | name with .out appended. | 62 | name with .out appended. |
64 | 63 | ||
65 | As with compression, supplying no filenames causes decom- | 64 | As with compression, supplying no filenames causes decom |
66 | pression from standard input to standard output. | 65 | pression from standard input to standard output. |
67 | 66 | ||
68 | bunzip2 will correctly decompress a file which is the con- | 67 | bunzip2 will correctly decompress a file which is the con |
69 | catenation of two or more compressed files. The result is | 68 | catenation of two or more compressed files. The result is |
70 | the concatenation of the corresponding uncompressed files. | 69 | the concatenation of the corresponding uncompressed files. |
71 | Integrity testing (-t) of concatenated compressed files is | 70 | Integrity testing (-t) of concatenated compressed files is |
72 | also supported. | 71 | also supported. |
73 | 72 | ||
74 | You can also compress or decompress files to the standard | 73 | You can also compress or decompress files to the standard |
75 | output by giving the -c flag. Multiple files may be com- | 74 | output by giving the -c flag. Multiple files may be com |
76 | pressed and decompressed like this. The resulting outputs | 75 | pressed and decompressed like this. The resulting outputs |
77 | are fed sequentially to stdout. Compression of multiple | 76 | are fed sequentially to stdout. Compression of multiple |
78 | files in this manner generates a stream containing multi- | 77 | files in this manner generates a stream containing multi |
79 | ple compressed file representations. Such a stream can be | 78 | ple compressed file representations. Such a stream can be |
80 | decompressed correctly only by bzip2 version 0.9.0 or | 79 | decompressed correctly only by bzip2 version 0.9.0 or |
81 | later. Earlier versions of bzip2 will stop after decom- | 80 | later. Earlier versions of bzip2 will stop after decom |
82 | pressing the first file in the stream. | 81 | pressing the first file in the stream. |
83 | 82 | ||
84 | bzcat (or bzip2 -dc) decompresses all specified files to | 83 | bzcat (or bzip2 -dc) decompresses all specified files to |
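Review note on the hunk above: the concatenation behaviour it describes (unchanged by this diff apart from hyphenation) can be checked without the command-line tool. The following is a minimal sketch using Python's standard bz2 module rather than bzip2 itself; the sample byte strings are made up for illustration, and it assumes Python 3.3 or later, where bz2.decompress accepts multi-stream input.

```python
import bz2

# Two independently compressed streams, joined byte-for-byte,
# as "cat a.bz2 b.bz2" or "bzip2 -c a b" would produce.
first = bz2.compress(b"first file\n")
second = bz2.compress(b"second file\n")
combined = first + second

# Decompressing the joined stream yields the concatenation
# of the two original inputs.
restored = bz2.decompress(combined)
print(restored)
```

A decoder that stops after the first stream, as the text says of versions before 0.9.0, would return only the first input here.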
@@ -99,7 +98,7 @@ DESCRIPTION | |||
99 | 98 | ||
100 | As a self-check for your protection, bzip2 uses 32-bit | 99 | As a self-check for your protection, bzip2 uses 32-bit |
101 | CRCs to make sure that the decompressed version of a file | 100 | CRCs to make sure that the decompressed version of a file |
102 | is identical to the original. This guards against corrup- | 101 | is identical to the original. This guards against corrup |
103 | tion of the compressed data, and against undetected bugs | 102 | tion of the compressed data, and against undetected bugs |
104 | in bzip2 (hopefully very unlikely). The chances of data | 103 | in bzip2 (hopefully very unlikely). The chances of data |
105 | corruption going undetected is microscopic, about one | 104 | corruption going undetected is microscopic, about one |
@@ -127,8 +126,8 @@ OPTIONS | |||
127 | and forces bzip2 to decompress. | 126 | and forces bzip2 to decompress. |
128 | 127 | ||
129 | -z --compress | 128 | -z --compress |
130 | The complement to -d: forces compression, regard- | 129 | The complement to -d: forces compression, |
131 | less of the invokation name. | 130 | regardless of the invocation name. |
132 | 131 | ||
133 | -t --test | 132 | -t --test |
134 | Check integrity of the specified file(s), but don't | 133 | Check integrity of the specified file(s), but don't |
@@ -141,6 +140,11 @@ OPTIONS | |||
141 | forces bzip2 to break hard links to files, which it | 140 | forces bzip2 to break hard links to files, which it |
142 | otherwise wouldn't do. | 141 | otherwise wouldn't do. |
143 | 142 | ||
143 | bzip2 normally declines to decompress files which | ||
144 | don't have the correct magic header bytes. If | ||
145 | forced (-f), however, it will pass such files | ||
146 | through unmodified. This is how GNU gzip behaves. | ||
147 | |||
144 | -k --keep | 148 | -k --keep |
145 | Keep (don't delete) input files during compression | 149 | Keep (don't delete) input files during compression |
146 | or decompression. | 150 | or decompression. |
@@ -167,7 +171,7 @@ OPTIONS | |||
167 | 171 | ||
168 | -v --verbose | 172 | -v --verbose |
169 | Verbose mode -- show the compression ratio for each | 173 | Verbose mode -- show the compression ratio for each |
170 | file processed. Further -v's increase the ver- | 174 | file processed. Further -v's increase the ver |
171 | bosity level, spewing out lots of information which | 175 | bosity level, spewing out lots of information which |
172 | is primarily of interest for diagnostic purposes. | 176 | is primarily of interest for diagnostic purposes. |
173 | 177 | ||
@@ -175,20 +179,24 @@ OPTIONS | |||
175 | Display the software version, license terms and | 179 | Display the software version, license terms and |
176 | conditions. | 180 | conditions. |
177 | 181 | ||
178 | -1 to -9 | 182 | -1 (or --fast) to -9 (or --best) |
179 | Set the block size to 100 k, 200 k .. 900 k when | 183 | Set the block size to 100 k, 200 k .. 900 k when |
180 | compressing. Has no effect when decompressing. | 184 | compressing. Has no effect when decompressing. |
181 | See MEMORY MANAGEMENT below. | 185 | See MEMORY MANAGEMENT below. The --fast and --best |
186 | aliases are primarily for GNU gzip compatibility. | ||
187 | In particular, --fast doesn't make things signifi | ||
188 | cantly faster. And --best merely selects the | ||
189 | default behaviour. | ||
182 | 190 | ||
183 | -- Treats all subsequent arguments as file names, even | 191 | -- Treats all subsequent arguments as file names, even |
184 | if they start with a dash. This is so you can han- | 192 | if they start with a dash. This is so you can han |
185 | dle files with names beginning with a dash, for | 193 | dle files with names beginning with a dash, for |
186 | example: bzip2 -- -myfilename. | 194 | example: bzip2 -- -myfilename. |
187 | 195 | ||
188 | --repetitive-fast --repetitive-best | 196 | --repetitive-fast --repetitive-best |
189 | These flags are redundant in versions 0.9.5 and | 197 | These flags are redundant in versions 0.9.5 and |
190 | above. They provided some coarse control over the | 198 | above. They provided some coarse control over the |
191 | behaviour of the sorting algorithm in earlier ver- | 199 | behaviour of the sorting algorithm in earlier ver |
192 | sions, which was sometimes useful. 0.9.5 and above | 200 | sions, which was sometimes useful. 0.9.5 and above |
193 | have an improved algorithm which renders these | 201 | have an improved algorithm which renders these |
194 | flags irrelevant. | 202 | flags irrelevant. |
@@ -199,7 +207,7 @@ MEMORY MANAGEMENT | |||
199 | affects both the compression ratio achieved, and the | 207 | affects both the compression ratio achieved, and the |
200 | amount of memory needed for compression and decompression. | 208 | amount of memory needed for compression and decompression. |
201 | The flags -1 through -9 specify the block size to be | 209 | The flags -1 through -9 specify the block size to be |
202 | 100,000 bytes through 900,000 bytes (the default) respec- | 210 | 100,000 bytes through 900,000 bytes (the default) respec |
203 | tively. At decompression time, the block size used for | 211 | tively. At decompression time, the block size used for |
204 | compression is read from the header of the compressed | 212 | compression is read from the header of the compressed |
205 | file, and bunzip2 then allocates itself just enough memory | 213 | file, and bunzip2 then allocates itself just enough memory |
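Review note on the hunk above: the text says the block size is read back from the header of the compressed file. As a rough sketch of what that looks like on disk, the snippet below uses Python's standard bz2 module (an assumption of this note, not something the man page mentions) to compress the same made-up data at three levels and inspect the first four bytes. The first three are the magic bytes "BZh"; the fourth is the ASCII digit of the -1 to -9 level.

```python
import bz2

data = b"example data " * 1000  # illustrative input only

headers = {}
for level in (1, 5, 9):
    compressed = bz2.compress(data, compresslevel=level)
    header = compressed[:4]
    headers[level] = header
    # Byte 3 is the level digit; the block size is that digit
    # times 100,000 bytes, matching "-1 through -9" in the text.
    block_size = int(chr(header[3])) * 100_000
    print(header, block_size)
```

The same three magic bytes are what the new -f paragraph refers to when it talks about files lacking the correct header.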
@@ -227,13 +235,13 @@ MEMORY MANAGEMENT | |||
227 | bunzip2 will require about 3700 kbytes to decompress. To | 235 | bunzip2 will require about 3700 kbytes to decompress. To |
228 | support decompression of any file on a 4 megabyte machine, | 236 | support decompression of any file on a 4 megabyte machine, |
229 | bunzip2 has an option to decompress using approximately | 237 | bunzip2 has an option to decompress using approximately |
230 | half this amount of memory, about 2300 kbytes. Decompres- | 238 | half this amount of memory, about 2300 kbytes. Decompres |
231 | sion speed is also halved, so you should use this option | 239 | sion speed is also halved, so you should use this option |
232 | only where necessary. The relevant flag is -s. | 240 | only where necessary. The relevant flag is -s. |
233 | 241 | ||
234 | In general, try and use the largest block size memory con- | 242 | In general, try and use the largest block size memory con |
235 | straints allow, since that maximises the compression | 243 | straints allow, since that maximises the compression |
236 | achieved. Compression and decompression speed are virtu- | 244 | achieved. Compression and decompression speed are virtu |
237 | ally unaffected by block size. | 245 | ally unaffected by block size. |
238 | 246 | ||
239 | Another significant point applies to files which fit in a | 247 | Another significant point applies to files which fit in a |
@@ -249,11 +257,11 @@ MEMORY MANAGEMENT | |||
249 | 257 | ||
250 | Here is a table which summarises the maximum memory usage | 258 | Here is a table which summarises the maximum memory usage |
251 | for different block sizes. Also recorded is the total | 259 | for different block sizes. Also recorded is the total |
252 | compressed size for 14 files of the Calgary Text Compres- | 260 | compressed size for 14 files of the Calgary Text Compres |
253 | sion Corpus totalling 3,141,622 bytes. This column gives | 261 | sion Corpus totalling 3,141,622 bytes. This column gives |
254 | some feel for how compression varies with block size. | 262 | some feel for how compression varies with block size. |
255 | These figures tend to understate the advantage of larger | 263 | These figures tend to understate the advantage of larger |
256 | block sizes for larger files, since the Corpus is domi- | 264 | block sizes for larger files, since the Corpus is domi |
257 | nated by smaller files. | 265 | nated by smaller files. |
258 | 266 | ||
259 | Compress Decompress Decompress Corpus | 267 | Compress Decompress Decompress Corpus |
@@ -272,7 +280,7 @@ MEMORY MANAGEMENT | |||
272 | 280 | ||
273 | RECOVERING DATA FROM DAMAGED FILES | 281 | RECOVERING DATA FROM DAMAGED FILES |
274 | bzip2 compresses files in blocks, usually 900kbytes long. | 282 | bzip2 compresses files in blocks, usually 900kbytes long. |
275 | Each block is handled independently. If a media or trans- | 283 | Each block is handled independently. If a media or trans |
276 | mission error causes a multi-block .bz2 file to become | 284 | mission error causes a multi-block .bz2 file to become |
277 | damaged, it may be possible to recover data from the | 285 | damaged, it may be possible to recover data from the |
278 | undamaged blocks in the file. | 286 | undamaged blocks in the file. |
@@ -289,19 +297,19 @@ RECOVERING DATA FROM DAMAGED FILES | |||
289 | the integrity of the resulting files, and decompress those | 297 | the integrity of the resulting files, and decompress those |
290 | which are undamaged. | 298 | which are undamaged. |
291 | 299 | ||
292 | bzip2recover takes a single argument, the name of the dam- | 300 | bzip2recover takes a single argument, the name of the dam |
293 | aged file, and writes a number of files "rec0001file.bz2", | 301 | aged file, and writes a number of files |
294 | "rec0002file.bz2", etc, containing the extracted blocks. | 302 | "rec00001file.bz2", "rec00002file.bz2", etc, containing |
295 | The output filenames are designed so that the use of | 303 | the extracted blocks. The output filenames are |
296 | wildcards in subsequent processing -- for example, "bzip2 | 304 | designed so that the use of wildcards in subsequent pro |
297 | -dc rec*file.bz2 > recovered_data" -- lists the files in | 305 | cessing -- for example, "bzip2 -dc rec*file.bz2 > recov |
298 | the correct order. | 306 | ered_data" -- processes the files in the correct order. |
299 | 307 | ||
300 | bzip2recover should be of most use dealing with large .bz2 | 308 | bzip2recover should be of most use dealing with large .bz2 |
301 | files, as these will contain many blocks. It is clearly | 309 | files, as these will contain many blocks. It is clearly |
302 | futile to use it on damaged single-block files, since a | 310 | futile to use it on damaged single-block files, since a |
303 | damaged block cannot be recovered. If you wish to min- | 311 | damaged block cannot be recovered. If you wish to min |
304 | imise any potential data loss through media or transmis- | 312 | imise any potential data loss through media or transmis |
305 | sion errors, you might consider compressing with a smaller | 313 | sion errors, you might consider compressing with a smaller |
306 | block size. | 314 | block size. |
307 | 315 | ||
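Review note on the hunk above: the output names change from four digits (rec0001file.bz2) to five (rec00001file.bz2), and the text stresses that a wildcard should process the pieces in order. The diff does not state the reason for the wider field, so the following is only one plausible reading: with four digits, a name-sorted wildcard expansion would misorder files once the block count passes 9999. The block numbers below are hypothetical, and plain string sorting stands in for the shell's ordering.

```python
block_numbers = [1, 2, 9999, 10000]  # hypothetical block indices

# Old scheme from the left-hand side of the diff: four digits.
old_names = [f"rec{i:04d}file.bz2" for i in block_numbers]
# New scheme from the right-hand side: five digits.
new_names = [f"rec{i:05d}file.bz2" for i in block_numbers]

# A wildcard such as rec*file.bz2 is expanded in name order,
# so compare sorted order with numeric order.
print(sorted(old_names) == old_names)
print(sorted(new_names) == new_names)
```

Five digits keeps name order and numeric order aligned up to block 99999.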
@@ -315,19 +323,19 @@ PERFORMANCE NOTES | |||
315 | better than previous versions in this respect. The ratio | 323 | better than previous versions in this respect. The ratio |
316 | between worst-case and average-case compression time is in | 324 | between worst-case and average-case compression time is in |
317 | the region of 10:1. For previous versions, this figure | 325 | the region of 10:1. For previous versions, this figure |
318 | was more like 100:1. You can use the -vvvv option to mon- | 326 | was more like 100:1. You can use the -vvvv option to mon |
319 | itor progress in great detail, if you want. | 327 | itor progress in great detail, if you want. |
320 | 328 | ||
321 | Decompression speed is unaffected by these phenomena. | 329 | Decompression speed is unaffected by these phenomena. |
322 | 330 | ||
323 | bzip2 usually allocates several megabytes of memory to | 331 | bzip2 usually allocates several megabytes of memory to |
324 | operate in, and then charges all over it in a fairly ran- | 332 | operate in, and then charges all over it in a fairly ran |
325 | dom fashion. This means that performance, both for com- | 333 | dom fashion. This means that performance, both for com |
326 | pressing and decompressing, is largely determined by the | 334 | pressing and decompressing, is largely determined by the |
327 | speed at which your machine can service cache misses. | 335 | speed at which your machine can service cache misses. |
328 | Because of this, small changes to the code to reduce the | 336 | Because of this, small changes to the code to reduce the |
329 | miss rate have been observed to give disproportionately | 337 | miss rate have been observed to give disproportionately |
330 | large performance improvements. I imagine bzip2 will per- | 338 | large performance improvements. I imagine bzip2 will per |
331 | form best on machines with very large caches. | 339 | form best on machines with very large caches. |
332 | 340 | ||
333 | 341 | ||
@@ -337,40 +345,46 @@ CAVEATS | |||
337 | but the details of what the problem is sometimes seem | 345 | but the details of what the problem is sometimes seem |
338 | rather misleading. | 346 | rather misleading. |
339 | 347 | ||
340 | This manual page pertains to version 1.0 of bzip2. Com- | 348 | This manual page pertains to version 1.0.2 of bzip2. Com |
341 | pressed data created by this version is entirely forwards | 349 | pressed data created by this version is entirely forwards |
342 | and backwards compatible with the previous public | 350 | and backwards compatible with the previous public |
343 | releases, versions 0.1pl2, 0.9.0 and 0.9.5, but with the | 351 | releases, versions 0.1pl2, 0.9.0, 0.9.5, 1.0.0 and 1.0.1, |
344 | following exception: 0.9.0 and above can correctly decom- | 352 | but with the following exception: 0.9.0 and above can cor |
345 | press multiple concatenated compressed files. 0.1pl2 can- | 353 | rectly decompress multiple concatenated compressed files. |
346 | not do this; it will stop after decompressing just the | 354 | 0.1pl2 cannot do this; it will stop after decompressing |
347 | first file in the stream. | 355 | just the first file in the stream. |
348 | 356 | ||
349 | bzip2recover uses 32-bit integers to represent bit posi- | 357 | bzip2recover versions prior to this one, 1.0.2, used |
350 | tions in compressed files, so it cannot handle compressed | 358 | 32-bit integers to represent bit positions in compressed |
351 | files more than 512 megabytes long. This could easily be | 359 | files, so it could not handle compressed files more than |
352 | fixed. | 360 | 512 megabytes long. Version 1.0.2 and above uses 64-bit |
361 | ints on some platforms which support them (GNU supported | ||
362 | targets, and Windows). To establish whether or not | ||
363 | bzip2recover was built with such a limitation, run it | ||
364 | without arguments. In any event you can build yourself an | ||
365 | unlimited version if you can recompile it with MaybeUInt64 | ||
366 | set to be an unsigned 64-bit integer. | ||
353 | 367 | ||
354 | 368 | ||
355 | AUTHOR | 369 | AUTHOR |
356 | Julian Seward, jseward@acm.org. | 370 | Julian Seward, jseward@acm.org. |
357 | 371 | ||
358 | http://sourceware.cygnus.com/bzip2 | 372 | http://sources.redhat.com/bzip2 |
359 | http://www.muraroa.demon.co.uk | ||
360 | 373 | ||
361 | The ideas embodied in bzip2 are due to (at least) the fol- | 374 | The ideas embodied in bzip2 are due to (at least) the fol |
362 | lowing people: Michael Burrows and David Wheeler (for the | 375 | lowing people: Michael Burrows and David Wheeler (for the |
363 | block sorting transformation), David Wheeler (again, for | 376 | block sorting transformation), David Wheeler (again, for |
364 | the Huffman coder), Peter Fenwick (for the structured cod- | 377 | the Huffman coder), Peter Fenwick (for the structured cod |
365 | ing model in the original bzip, and many refinements), and | 378 | ing model in the original bzip, and many refinements), and |
366 | Alistair Moffat, Radford Neal and Ian Witten (for the | 379 | Alistair Moffat, Radford Neal and Ian Witten (for the |
367 | arithmetic coder in the original bzip). I am much | 380 | arithmetic coder in the original bzip). I am much |
368 | indebted for their help, support and advice. See the man- | 381 | indebted for their help, support and advice. See the man |
369 | ual in the source distribution for pointers to sources of | 382 | ual in the source distribution for pointers to sources of |
370 | documentation. Christian von Roques encouraged me to look | 383 | documentation. Christian von Roques encouraged me to look |
371 | for faster sorting algorithms, so as to speed up compres- | 384 | for faster sorting algorithms, so as to speed up compres |
372 | sion. Bela Lubkin encouraged me to improve the worst-case | 385 | sion. Bela Lubkin encouraged me to improve the worst-case |
373 | compression performance. Many people sent patches, helped | 386 | compression performance. The bz* scripts are derived from |
374 | with portability problems, lent machines, gave advice and | 387 | those of GNU gzip. Many people sent patches, helped with |
375 | were generally helpful. | 388 | portability problems, lent machines, gave advice and were |
389 | generally helpful. | ||
376 | 390 | ||
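Review note on the CAVEATS hunk: the 512 megabyte figure follows directly from storing bit positions in a 32-bit integer. The arithmetic, written out as a small Python sketch (the variable names are this note's own, and "megabyte" is taken to mean 2^20 bytes):

```python
# A 32-bit unsigned integer can index 2**32 distinct bit positions.
addressable_bits_32 = 2 ** 32
limit_bytes_32 = addressable_bits_32 // 8        # 8 bits per byte
limit_mib_32 = limit_bytes_32 // (1024 * 1024)   # bytes to megabytes
print(limit_mib_32)

# With a 64-bit unsigned integer, as in builds where MaybeUInt64
# is a real 64-bit type, the same calculation gives 2**61 bytes.
limit_bytes_64 = (2 ** 64) // 8
print(limit_bytes_64 == 2 ** 61)
```

This is why the new text describes a 64-bit build as effectively unlimited.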