From f93cd82a9a7094ad90fd19bbc6ccf6f4627f8060 Mon Sep 17 00:00:00 2001
From: Julian Seward
Date: Sat, 4 Sep 1999 22:13:13 +0200
Subject: bzip2-0.9.5d
---
 bzip2.txt | 336 +++++++++++++++++++++++++++++++++-----------------------------
 1 file changed, 178 insertions(+), 158 deletions(-)
(limited to 'bzip2.txt')
diff --git a/bzip2.txt b/bzip2.txt
index 898dfe8..da23c64 100644
--- a/bzip2.txt
+++ b/bzip2.txt
@@ -1,22 +1,20 @@ -bzip2(1) bzip2(1) - NAME - bzip2, bunzip2 - a block-sorting file compressor, v0.9.0 + bzip2, bunzip2 - a block-sorting file compressor, v0.9.5 bzcat - decompresses files to stdout bzip2recover - recovers data from damaged bzip2 files SYNOPSIS - bzip2 [ -cdfkstvzVL123456789 ] [ filenames ... ] + bzip2 [ -cdfkqstvzVL123456789 ] [ filenames ... ] bunzip2 [ -fkvsVL ] [ filenames ... ] bzcat [ -s ] [ filenames ... ] bzip2recover filename DESCRIPTION - bzip2 compresses files using the Burrows-Wheeler block- + bzip2 compresses files using the Burrows-Wheeler block sorting text compression algorithm, and Huffman coding. Compression is generally considerably better than that achieved by more conventional LZ77/LZ78-based compressors, @@ -24,22 +22,22 @@ DESCRIPTION tistical compressors. The command-line options are deliberately very similar to - those of GNU Gzip, but they are not identical. + those of GNU gzip, but they are not identical. bzip2 expects a list of file names to accompany the com- mand-line flags. Each file is replaced by a compressed version of itself, with the name "original_name.bz2". - Each compressed file has the same modification date and - permissions as the corresponding original, so that these - properties can be correctly restored at decompression - time. File name handling is naive in the sense that there - is no mechanism for preserving original file names, per- - missions and dates in filesystems which lack these con- - cepts, or have serious file name length restrictions, such - as MS-DOS. 
+ Each compressed file has the same modification date, per- + missions, and, when possible, ownership as the correspond- + ing original, so that these properties can be correctly + restored at decompression time. File name handling is + naive in the sense that there is no mechanism for preserv- + ing original file names, permissions, ownerships or dates + in filesystems which lack these concepts, or have serious + file name length restrictions, such as MS-DOS. bzip2 and bunzip2 will by default not overwrite existing - files; if you want this to happen, specify the -f flag. + files. If you want this to happen, specify the -f flag. If no file names are specified, bzip2 compresses from standard input to standard output. In this case, bzip2 @@ -47,10 +45,25 @@ DESCRIPTION this would be entirely incomprehensible and therefore pointless. - bunzip2 (or bzip2 -d ) decompresses and restores all spec- - ified files whose names end in ".bz2". Files without this - suffix are ignored. Again, supplying no filenames causes - decompression from standard input to standard output. + bunzip2 (or bzip2 -d) decompresses all specified files. + Files which were not created by bzip2 will be detected and + ignored, and a warning issued. bzip2 attempts to guess + the filename for the decompressed file from that of the + compressed file as follows: + + filename.bz2 becomes filename + filename.bz becomes filename + filename.tbz2 becomes filename.tar + filename.tbz becomes filename.tar + anyothername becomes anyothername.out + + If the file does not end in one of the recognised endings, + .bz2, .bz, .tbz2 or .tbz, bzip2 complains that it cannot + guess the name of the original file, and uses the original + name with .out appended. + + As with compression, supplying no filenames causes decom- + pression from standard input to standard output. bunzip2 will correctly decompress a file which is the con- catenation of two or more compressed files. 
The result is @@ -58,19 +71,24 @@ DESCRIPTION Integrity testing (-t) of concatenated compressed files is also supported. - You can also compress or decompress files to the standard - output by giving the -c flag. Multiple files may be com- + You can also compress or decompress files to the standard + output by giving the -c flag. Multiple files may be com- pressed and decompressed like this. The resulting outputs - are fed sequentially to stdout. Compression of multiple - files in this manner generates a stream containing multi- + are fed sequentially to stdout. Compression of multiple + files in this manner generates a stream containing multi- ple compressed file representations. Such a stream can be - decompressed correctly only by bzip2 version 0.9.0 or - later. Earlier versions of bzip2 will stop after decom- + decompressed correctly only by bzip2 version 0.9.0 or + later. Earlier versions of bzip2 will stop after decom- pressing the first file in the stream. - bzcat (or bzip2 -dc ) decompresses all specified files to + bzcat (or bzip2 -dc) decompresses all specified files to the standard output. + bzip2 will read arguments from the environment variables + BZIP2 and BZIP, in that order, and will process them + before any arguments read from the command line. This + gives a convenient way to supply default arguments. + Compression is always performed, even if the compressed file is slightly larger than the original. Files of less than about one hundred bytes tend to get larger, since the @@ -87,98 +105,19 @@ DESCRIPTION corruption going undetected is microscopic, about one chance in four billion for each file processed. Be aware, though, that the check occurs upon decompression, so it - can only tell you that that something is wrong. It can't - help you recover the original uncompressed data. You can - use bzip2recover to try to recover data from damaged - files. + can only tell you that something is wrong. 
It can't help + you recover the original uncompressed data. You can use + bzip2recover to try to recover data from damaged files. - Return values: 0 for a normal exit, 1 for environmental - problems (file not found, invalid flags, I/O errors, &c), + Return values: 0 for a normal exit, 1 for environmental + problems (file not found, invalid flags, I/O errors, &c), 2 to indicate a corrupt compressed file, 3 for an internal consistency error (eg, bug) which caused bzip2 to panic. -MEMORY MANAGEMENT - Bzip2 compresses large files in blocks. The block size - affects both the compression ratio achieved, and the - amount of memory needed both for compression and decom- - pression. The flags -1 through -9 specify the block size - to be 100,000 bytes through 900,000 bytes (the default) - respectively. At decompression-time, the block size used - for compression is read from the header of the compressed - file, and bunzip2 then allocates itself just enough memory - to decompress the file. Since block sizes are stored in - compressed files, it follows that the flags -1 to -9 are - irrelevant to and so ignored during decompression. - - Compression and decompression requirements, in bytes, can - be estimated as: - - Compression: 400k + ( 7 x block size ) - - Decompression: 100k + ( 4 x block size ), or - 100k + ( 2.5 x block size ) - - Larger block sizes give rapidly diminishing marginal - returns; most of the compression comes from the first two - or three hundred k of block size, a fact worth bearing in - mind when using bzip2 on small machines. It is also - important to appreciate that the decompression memory - requirement is set at compression-time by the choice of - block size. - - For files compressed with the default 900k block size, - bunzip2 will require about 3700 kbytes to decompress. To - support decompression of any file on a 4 megabyte machine, - bunzip2 has an option to decompress using approximately - half this amount of memory, about 2300 kbytes. 
Decompres- - sion speed is also halved, so you should use this option - only where necessary. The relevant flag is -s. - - In general, try and use the largest block size memory con- - straints allow, since that maximises the compression - achieved. Compression and decompression speed are virtu- - ally unaffected by block size. - - Another significant point applies to files which fit in a - single block -- that means most files you'd encounter - using a large block size. The amount of real memory - touched is proportional to the size of the file, since the - file is smaller than a block. For example, compressing a - file 20,000 bytes long with the flag -9 will cause the - compressor to allocate around 6700k of memory, but only - touch 400k + 20000 * 7 = 540 kbytes of it. Similarly, the - decompressor will allocate 3700k but only touch 100k + - 20000 * 4 = 180 kbytes. - - Here is a table which summarises the maximum memory usage - for different block sizes. Also recorded is the total - compressed size for 14 files of the Calgary Text Compres- - sion Corpus totalling 3,141,622 bytes. This column gives - some feel for how compression varies with block size. - These figures tend to understate the advantage of larger - block sizes for larger files, since the Corpus is domi- - nated by smaller files. - - Compress Decompress Decompress Corpus - Flag usage usage -s usage Size - - -1 1100k 500k 350k 914704 - -2 1800k 900k 600k 877703 - -3 2500k 1300k 850k 860338 - -4 3200k 1700k 1100k 846899 - -5 3900k 2100k 1350k 845160 - -6 4600k 2500k 1600k 838626 - -7 5400k 2900k 1850k 834096 - -8 6000k 3300k 2100k 828642 - -9 6700k 3700k 2350k 828642 - - OPTIONS -c --stdout - Compress or decompress to standard output. -c will - decompress multiple files to stdout, but will only - compress a single file to stdout. + Compress or decompress to standard output. -d --decompress Force decompression. 
bzip2, bunzip2 and bzcat are @@ -198,7 +137,9 @@ OPTIONS -f --force Force overwrite of output files. Normally, bzip2 - will not overwrite existing output files. + will not overwrite existing output files. Also + forces bzip2 to break hard links to files, which it + otherwise wouldn't do. -k --keep Keep (don't delete) input files during compression @@ -217,7 +158,12 @@ OPTIONS figure, at the expense of your compression ratio. In short, if your machine is low on memory (8 megabytes or less), use -s for everything. See - MEMORY MANAGEMENT above. + MEMORY MANAGEMENT below. + + -q --quiet + Suppress non-essential warning messages. Messages + pertaining to I/O errors and other critical events + will not be suppressed. -v --verbose Verbose mode -- show the compression ratio for each @@ -232,21 +178,96 @@ OPTIONS -1 to -9 Set the block size to 100 k, 200 k .. 900 k when compressing. Has no effect when decompressing. - See MEMORY MANAGEMENT above. + See MEMORY MANAGEMENT below. - --repetitive-fast - bzip2 injects some small pseudo-random variations - into very repetitive blocks to limit worst-case - performance during compression. If sorting runs - into difficulties, the block is randomised, and - sorting is restarted. Very roughly, bzip2 persists - for three times as long as a well-behaved input - would take before resorting to randomisation. This - flag makes it give up much sooner. + -- Treats all subsequent arguments as file names, even + if they start with a dash. This is so you can han- + dle files with names beginning with a dash, for + example: bzip2 -- -myfilename. - --repetitive-best - Opposite of --repetitive-fast; try a lot harder - before resorting to randomisation. + --repetitive-fast --repetitive-best + These flags are redundant in versions 0.9.5 and + above. They provided some coarse control over the + behaviour of the sorting algorithm in earlier ver- + sions, which was sometimes useful. 
0.9.5 and above + have an improved algorithm which renders these + flags irrelevant. + + +MEMORY MANAGEMENT + bzip2 compresses large files in blocks. The block size + affects both the compression ratio achieved, and the + amount of memory needed for compression and decompression. + The flags -1 through -9 specify the block size to be + 100,000 bytes through 900,000 bytes (the default) respec- + tively. At decompression time, the block size used for + compression is read from the header of the compressed + file, and bunzip2 then allocates itself just enough memory + to decompress the file. Since block sizes are stored in + compressed files, it follows that the flags -1 to -9 are + irrelevant to and so ignored during decompression. + + Compression and decompression requirements, in bytes, can + be estimated as: + + Compression: 400k + ( 8 x block size ) + + Decompression: 100k + ( 4 x block size ), or + 100k + ( 2.5 x block size ) + + Larger block sizes give rapidly diminishing marginal + returns. Most of the compression comes from the first two + or three hundred k of block size, a fact worth bearing in + mind when using bzip2 on small machines. It is also + important to appreciate that the decompression memory + requirement is set at compression time by the choice of + block size. + + For files compressed with the default 900k block size, + bunzip2 will require about 3700 kbytes to decompress. To + support decompression of any file on a 4 megabyte machine, + bunzip2 has an option to decompress using approximately + half this amount of memory, about 2300 kbytes. Decompres- + sion speed is also halved, so you should use this option + only where necessary. The relevant flag is -s. + + In general, try and use the largest block size memory con- + straints allow, since that maximises the compression + achieved. Compression and decompression speed are virtu- + ally unaffected by block size. 
+ + Another significant point applies to files which fit in a + single block -- that means most files you'd encounter + using a large block size. The amount of real memory + touched is proportional to the size of the file, since the + file is smaller than a block. For example, compressing a + file 20,000 bytes long with the flag -9 will cause the + compressor to allocate around 7600k of memory, but only + touch 400k + 20000 * 8 = 560 kbytes of it. Similarly, the + decompressor will allocate 3700k but only touch 100k + + 20000 * 4 = 180 kbytes. + + Here is a table which summarises the maximum memory usage + for different block sizes. Also recorded is the total + compressed size for 14 files of the Calgary Text Compres- + sion Corpus totalling 3,141,622 bytes. This column gives + some feel for how compression varies with block size. + These figures tend to understate the advantage of larger + block sizes for larger files, since the Corpus is domi- + nated by smaller files. + + Compress Decompress Decompress Corpus + Flag usage usage -s usage Size + + -1 1200k 500k 350k 914704 + -2 2000k 900k 600k 877703 + -3 2800k 1300k 850k 860338 + -4 3600k 1700k 1100k 846899 + -5 4400k 2100k 1350k 845160 + -6 5200k 2500k 1600k 838626 + -7 6100k 2900k 1850k 834096 + -8 6800k 3300k 2100k 828642 + -9 7600k 3700k 2350k 828642 RECOVERING DATA FROM DAMAGED FILES @@ -273,8 +294,8 @@ RECOVERING DATA FROM DAMAGED FILES "rec0002file.bz2", etc, containing the extracted blocks. The output filenames are designed so that the use of wildcards in subsequent processing -- for example, "bzip2 - -dc rec*file.bz2 > recovered_data" -- lists the files in - the "right" order. + -dc rec*file.bz2 > recovered_data" -- lists the files in + the correct order. bzip2recover should be of most use dealing with large .bz2 files, as these will contain many blocks. It is clearly @@ -289,17 +310,15 @@ PERFORMANCE NOTES The sorting phase of compression gathers together similar strings in the file. 
Because of this, files containing very long runs of repeated symbols, like "aabaabaabaab - ..." (repeated several hundred times) may compress - extraordinarily slowly. You can use the -vvvvv option to - monitor progress in great detail, if you want. Decompres- - sion speed is unaffected. - - Such pathological cases seem rare in practice, appearing - mostly in artificially-constructed test files, and in low- - level disk images. It may be inadvisable to use bzip2 to - compress the latter. If you do get a file which causes - severe slowness in compression, try making the block size - as small as possible, with flag -1. + ..." (repeated several hundred times) may compress more + slowly than normal. Versions 0.9.5 and above fare much + better than previous versions in this respect. The ratio + between worst-case and average-case compression time is in + the region of 10:1. For previous versions, this figure + was more like 100:1. You can use the -vvvv option to mon- + itor progress in great detail, if you want. + + Decompression speed is unaffected by these phenomena. bzip2 usually allocates several megabytes of memory to operate in, and then charges all over it in a fairly ran- @@ -314,42 +333,43 @@ PERFORMANCE NOTES CAVEATS I/O error messages are not as helpful as they could be. - Bzip2 tries hard to detect I/O errors and exit cleanly, + bzip2 tries hard to detect I/O errors and exit cleanly, but the details of what the problem is sometimes seem rather misleading. - This manual page pertains to version 0.9.0 of bzip2. Com- + This manual page pertains to version 0.9.5 of bzip2. Com- pressed data created by this version is entirely forwards - and backwards compatible with the previous public release, - version 0.1pl2, but with the following exception: 0.9.0 - can correctly decompress multiple concatenated compressed - files. 0.1pl2 cannot do this; it will stop after decom- - pressing just the first file in the stream. 
- - Wildcard expansion for Windows 95 and NT is flaky. - - bzip2recover uses 32-bit integers to represent bit posi- - tions in compressed files, so it cannot handle compressed - files more than 512 megabytes long. This could easily be + and backwards compatible with the previous public + releases, versions 0.1pl2 and 0.9.0, but with the follow- + ing exception: 0.9.0 and above can correctly decompress + multiple concatenated compressed files. 0.1pl2 cannot do + this; it will stop after decompressing just the first file + in the stream. + + bzip2recover uses 32-bit integers to represent bit posi- + tions in compressed files, so it cannot handle compressed + files more than 512 megabytes long. This could easily be fixed. AUTHOR Julian Seward, jseward@acm.org. + http://www.muraroa.demon.co.uk The ideas embodied in bzip2 are due to (at least) the fol- - lowing people: Michael Burrows and David Wheeler (for the - block sorting transformation), David Wheeler (again, for + lowing people: Michael Burrows and David Wheeler (for the + block sorting transformation), David Wheeler (again, for the Huffman coder), Peter Fenwick (for the structured cod- ing model in the original bzip, and many refinements), and - Alistair Moffat, Radford Neal and Ian Witten (for the + Alistair Moffat, Radford Neal and Ian Witten (for the arithmetic coder in the original bzip). I am much indebted for their help, support and advice. See the man- - ual in the source distribution for pointers to sources of + ual in the source distribution for pointers to sources of documentation. Christian von Roques encouraged me to look - for faster sorting algorithms, so as to speed up compres- + for faster sorting algorithms, so as to speed up compres- sion. Bela Lubkin encouraged me to improve the worst-case compression performance. Many people sent patches, helped - with portability problems, lent machines, gave advice and + with portability problems, lent machines, gave advice and were generally helpful. 
+ -- cgit v1.2.3-55-g6feb
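
Reviewer's note (illustrative only, not part of the patch): the filename
guessing rules and the memory estimates added to bzip2.txt above can be
sanity-checked with a short sketch. The Python below is not taken from
the bzip2 sources. The function names and the constant K are hypothetical,
and K = 1000 is an assumption inferred from the man page's own worked
example (400k + 20000 * 8 = 560 kbytes).

```python
# Illustrative sketch only; none of these names come from the bzip2 sources.

K = 1000  # assumption: the man page's "k" means 1000 bytes (see its worked example)

# Recognised endings and their replacements, as listed in the DESCRIPTION text.
SUFFIX_MAP = [
    (".bz2", ""),
    (".bz", ""),
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
]


def guess_decompressed_name(name):
    """Return the output name the man page describes for a compressed file."""
    for suffix, replacement in SUFFIX_MAP:
        if name.endswith(suffix):
            return name[: -len(suffix)] + replacement
    # No recognised ending: the man page says ".out" is appended.
    return name + ".out"


def compress_bytes(n):
    """Estimated compression memory: 400k + ( 8 x block size )."""
    return 400 * K + 8 * n


def decompress_bytes(n, small=False):
    """Estimated decompression memory: 100k + 4 x block size,
    or 100k + 2.5 x block size when the -s flag is used."""
    if small:
        return 100 * K + (5 * n) // 2
    return 100 * K + 4 * n


if __name__ == "__main__":
    for example in ("filename.bz2", "filename.tbz", "anyothername"):
        print(example, "becomes", guess_decompressed_name(example))
    # Default block size (flag -9) is 900,000 bytes.
    print(compress_bytes(900_000), decompress_bytes(900_000))
```

For a block size of 900,000 bytes the estimates agree with the -9 row of
the table (7600k, 3700k, and 2350k with -s). The formulas are only
estimates: the -7 row of the table lists 6100k for compression, where the
formula gives 6000k.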