From f93cd82a9a7094ad90fd19bbc6ccf6f4627f8060 Mon Sep 17 00:00:00 2001 From: Julian Seward Date: Sat, 4 Sep 1999 22:13:13 +0200 Subject: bzip2-0.9.5d --- bzip2.1 | 610 +++++++++++++++++++++++++++++++++------------------------------- 1 file changed, 314 insertions(+), 296 deletions(-) (limited to 'bzip2.1') diff --git a/bzip2.1 b/bzip2.1 index a6789a4..99eda9b 100644 --- a/bzip2.1 +++ b/bzip2.1 @@ -1,7 +1,7 @@ .PU .TH bzip2 1 .SH NAME -bzip2, bunzip2 \- a block-sorting file compressor, v0.9.0 +bzip2, bunzip2 \- a block-sorting file compressor, v0.9.5 .br bzcat \- decompresses files to stdout .br @@ -10,7 +10,7 @@ bzip2recover \- recovers data from damaged bzip2 files .SH SYNOPSIS .ll +8 .B bzip2 -.RB [ " \-cdfkstvzVL123456789 " ] +.RB [ " \-cdfkqstvzVL123456789 " ] [ .I "filenames \&..." ] @@ -18,13 +18,13 @@ bzip2recover \- recovers data from damaged bzip2 files .br .B bunzip2 .RB [ " \-fkvsVL " ] -[ +[ .I "filenames \&..." ] -.br +.br .B bzcat .RB [ " \-s " ] -[ +[ .I "filenames \&..." ] .br @@ -33,211 +33,171 @@ bzip2recover \- recovers data from damaged bzip2 files .SH DESCRIPTION .I bzip2 -compresses files using the Burrows-Wheeler block-sorting -text compression algorithm, and Huffman coding. -Compression is generally considerably -better than that -achieved by more conventional LZ77/LZ78-based compressors, -and approaches the performance of the PPM family of statistical -compressors. +compresses files using the Burrows-Wheeler block sorting +text compression algorithm, and Huffman coding. Compression is +generally considerably better than that achieved by more conventional +LZ77/LZ78-based compressors, and approaches the performance of the PPM +family of statistical compressors. The command-line options are deliberately very similar to those of -.I GNU Gzip, +.I GNU gzip, but they are not identical. -.I bzip2 -expects a list of file names to accompany the command-line flags. -Each file is replaced by a compressed version of itself, -with the name "original_name.bz2". -Each compressed file has the same modification date and permissions -as the corresponding original, so that these properties can be -correctly restored at decompression time. File name handling is -naive in the sense that there is no mechanism for preserving -original file names, permissions and dates in filesystems -which lack these concepts, or have serious file name length -restrictions, such as MS-DOS. +.I bzip2 +expects a list of file names to accompany the +command-line flags. Each file is replaced by a compressed version of +itself, with the name "original_name.bz2". +Each compressed file +has the same modification date, permissions, and, when possible, +ownership as the corresponding original, so that these properties can +be correctly restored at decompression time. File name handling is +naive in the sense that there is no mechanism for preserving original +file names, permissions, ownerships or dates in filesystems which lack +these concepts, or have serious file name length restrictions, such as +MS-DOS. .I bzip2 and .I bunzip2 -will by default not overwrite existing files; -if you want this to happen, specify the \-f flag. +will by default not overwrite existing +files. If you want this to happen, specify the \-f flag. If no file names are specified, .I bzip2 -compresses from standard input to standard output. -In this case, +compresses from standard +input to standard output. 
In this case, .I bzip2 -will decline to write compressed output to a terminal, as -this would be entirely incomprehensible and therefore pointless. +will decline to +write compressed output to a terminal, as this would be entirely +incomprehensible and therefore pointless. .I bunzip2 (or -.I bzip2 \-d -) decompresses and restores all specified files whose names -end in ".bz2". -Files without this suffix are ignored. -Again, supplying no filenames -causes decompression from standard input to standard output. - -.I bunzip2 -will correctly decompress a file which is the concatenation -of two or more compressed files. The result is the concatenation -of the corresponding uncompressed files. Integrity testing -(\-t) of concatenated compressed files is also supported. - -You can also compress or decompress files to -the standard output by giving the \-c flag. -Multiple files may be compressed and decompressed like this. -The resulting outputs are fed sequentially to stdout. -Compression of multiple files in this manner generates -a stream containing multiple compressed file representations. -Such a stream can be decompressed correctly only by +.I bzip2 \-d) +decompresses all +specified files. Files which were not created by .I bzip2 -version 0.9.0 or later. Earlier versions of +will be detected and ignored, and a warning issued. .I bzip2 -will stop after decompressing the first file in the stream. +attempts to guess the filename for the decompressed file +from that of the compressed file as follows: + + filename.bz2 becomes filename + filename.bz becomes filename + filename.tbz2 becomes filename.tar + filename.tbz becomes filename.tar + anyothername becomes anyothername.out + +If the file does not end in one of the recognised endings, +.I .bz2, +.I .bz, +.I .tbz2 +or +.I .tbz, +.I bzip2 +complains that it cannot +guess the name of the original file, and uses the original name +with +.I .out +appended. + +As with compression, supplying no +filenames causes decompression from +standard input to standard output. + +.I bunzip2 +will correctly decompress a file which is the +concatenation of two or more compressed files. The result is the +concatenation of the corresponding uncompressed files. Integrity +testing (\-t) +of concatenated +compressed files is also supported. + +You can also compress or decompress files to the standard output by +giving the \-c flag. Multiple files may be compressed and +decompressed like this. The resulting outputs are fed sequentially to +stdout. Compression of multiple files +in this manner generates a stream +containing multiple compressed file representations. Such a stream +can be decompressed correctly only by +.I bzip2 +version 0.9.0 or +later. Earlier versions of +.I bzip2 +will stop after decompressing +the first file in the stream. .I bzcat (or -.I bzip2 \-dc -) decompresses all specified files to the standard output. - -Compression is always performed, even if the compressed file is -slightly larger than the original. Files of less than about -one hundred bytes tend to get larger, since the compression -mechanism has a constant overhead in the region of 50 bytes. -Random data (including the output of most file compressors) -is coded at about 8.05 bits per byte, giving an expansion of -around 0.5%. - -As a self-check for your protection, +.I bzip2 -dc) +decompresses all specified files to +the standard output. + .I bzip2 -uses 32-bit CRCs to make sure that the decompressed -version of a file is identical to the original. 
-This guards against corruption of the compressed data, -and against undetected bugs in +will read arguments from the environment variables +.I BZIP2 +and +.I BZIP, +in that order, and will process them +before any arguments read from the command line. This gives a +convenient way to supply default arguments. + +Compression is always performed, even if the compressed +file is slightly +larger than the original. Files of less than about one hundred bytes +tend to get larger, since the compression mechanism has a constant +overhead in the region of 50 bytes. Random data (including the output +of most file compressors) is coded at about 8.05 bits per byte, giving +an expansion of around 0.5%. + +As a self-check for your protection, +.I +bzip2 +uses 32-bit CRCs to +make sure that the decompressed version of a file is identical to the +original. This guards against corruption of the compressed data, and +against undetected bugs in .I bzip2 -(hopefully very unlikely). -The chances of data corruption going undetected is -microscopic, about one chance in four billion -for each file processed. Be aware, though, that the check -occurs upon decompression, so it can only tell you that -that something is wrong. It can't help you recover the -original uncompressed data. -You can use +(hopefully very unlikely). The +chances of data corruption going undetected is microscopic, about one +chance in four billion for each file processed. Be aware, though, that +the check occurs upon decompression, so it can only tell you that +something is wrong. It can't help you +recover the original uncompressed +data. You can use .I bzip2recover -to try to recover data from damaged files. - -Return values: -0 for a normal exit, -1 for environmental -problems (file not found, invalid flags, I/O errors, &c), -2 to indicate a corrupt compressed file, -3 for an internal consistency error (eg, bug) which caused -.I bzip2 -to panic. +to try to recover data from +damaged files. -.SH MEMORY MANAGEMENT -.I Bzip2 -compresses large files in blocks. The block size affects both the -compression ratio achieved, and the amount of memory needed both for -compression and decompression. The flags \-1 through \-9 -specify the block size to be 100,000 bytes through 900,000 bytes -(the default) respectively. At decompression-time, the block size used for -compression is read from the header of the compressed file, and -.I bunzip2 -then allocates itself just enough memory to decompress the file. -Since block sizes are stored in compressed files, it follows that the flags -\-1 to \-9 -are irrelevant to and so ignored during decompression. -Compression and decompression requirements, in bytes, can be estimated as: - - Compression: 400k + ( 7 x block size ) - - Decompression: 100k + ( 4 x block size ), or -.br - 100k + ( 2.5 x block size ) - -Larger block sizes give rapidly diminishing marginal returns; most -of the -compression comes from the first two or three hundred k of block size, -a fact worth bearing in mind when using +Return values: 0 for a normal exit, 1 for environmental problems (file +not found, invalid flags, I/O errors, &c), 2 to indicate a corrupt +compressed file, 3 for an internal consistency error (eg, bug) which +caused .I bzip2 -on small machines. It is also important to appreciate that the -decompression memory requirement is set at compression-time by the -choice of block size. - -For files compressed with the default 900k block size, -.I bunzip2 -will require about 3700 kbytes to decompress. 
-To support decompression of any file on a 4 megabyte machine, -.I bunzip2 -has an option to decompress using approximately half this -amount of memory, about 2300 kbytes. Decompression speed is -also halved, so you should use this option only where necessary. -The relevant flag is \-s. - -In general, try and use the largest block size -memory constraints allow, since that maximises the compression -achieved. Compression and decompression -speed are virtually unaffected by block size. - -Another significant point applies to files which fit in a single -block -- that means most files you'd encounter using a large -block size. The amount of real memory touched is proportional -to the size of the file, since the file is smaller than a block. -For example, compressing a file 20,000 bytes long with the flag -\-9 -will cause the compressor to allocate around -6700k of memory, but only touch 400k + 20000 * 7 = 540 -kbytes of it. Similarly, the decompressor will allocate 3700k but -only touch 100k + 20000 * 4 = 180 kbytes. - -Here is a table which summarises the maximum memory usage for -different block sizes. Also recorded is the total compressed -size for 14 files of the Calgary Text Compression Corpus -totalling 3,141,622 bytes. This column gives some feel for how -compression varies with block size. These figures tend to understate -the advantage of larger block sizes for larger files, since the -Corpus is dominated by smaller files. - - Compress Decompress Decompress Corpus - Flag usage usage -s usage Size - - -1 1100k 500k 350k 914704 - -2 1800k 900k 600k 877703 - -3 2500k 1300k 850k 860338 - -4 3200k 1700k 1100k 846899 - -5 3900k 2100k 1350k 845160 - -6 4600k 2500k 1600k 838626 - -7 5400k 2900k 1850k 834096 - -8 6000k 3300k 2100k 828642 - -9 6700k 3700k 2350k 828642 +to panic. .SH OPTIONS .TP .B \-c --stdout -Compress or decompress to standard output. \-c will decompress -multiple files to stdout, but will only compress a single file to -stdout. +Compress or decompress to standard output. .TP .B \-d --decompress -Force decompression. -.I bzip2, -.I bunzip2 +Force decompression. +.I bzip2, +.I bunzip2 and -.I bzcat -are really the same program, and the decision about what actions -to take is done on the basis of which name is -used. This flag overrides that mechanism, and forces +.I bzcat +are +really the same program, and the decision about what actions to take is +done on the basis of which name is used. This flag overrides that +mechanism, and forces .I bzip2 to decompress. -.TP +.TP .B \-z --compress -The complement to \-d: forces compression, regardless of the invokation -name. +The complement to \-d: forces compression, regardless of the +invokation name. .TP .B \-t --test Check integrity of the specified file(s), but don't decompress them. @@ -245,25 +205,31 @@ This really performs a trial decompression and throws away the result. .TP .B \-f --force Force overwrite of output files. Normally, -.I bzip2 -will not overwrite existing output files. +.I bzip2 +will not overwrite +existing output files. Also forces +.I bzip2 +to break hard links +to files, which it otherwise wouldn't do. .TP .B \-k --keep -Keep (don't delete) input files during compression or decompression. +Keep (don't delete) input files during compression +or decompression. .TP .B \-s --small -Reduce memory usage, for compression, decompression and -testing. -Files are decompressed and tested using a modified algorithm which only +Reduce memory usage, for compression, decompression and testing. 
Files +are decompressed and tested using a modified algorithm which only requires 2.5 bytes per block byte. This means any file can be -decompressed in 2300k of memory, albeit at about half the normal -speed. - -During compression, -s selects a block size of 200k, which limits -memory use to around the same figure, at the expense of your -compression ratio. In short, if your machine is low on memory -(8 megabytes or less), use -s for everything. See -MEMORY MANAGEMENT above. +decompressed in 2300k of memory, albeit at about half the normal speed. + +During compression, \-s selects a block size of 200k, which limits +memory use to around the same figure, at the expense of your compression +ratio. In short, if your machine is low on memory (8 megabytes or +less), use \-s for everything. See MEMORY MANAGEMENT below. +.TP +.B \-q --quiet +Suppress non-essential warning messages. Messages pertaining to +I/O errors and other critical events will not be suppressed. .TP .B \-v --verbose Verbose mode -- show the compression ratio for each file processed. @@ -273,147 +239,199 @@ information which is primarily of interest for diagnostic purposes. .B \-L --license -V --version Display the software version, license terms and conditions. .TP -.B \-1 to \-9 -Set the block size to 100 k, 200 k .. 900 k when -compressing. Has no effect when decompressing. -See MEMORY MANAGEMENT above. +.B \-1 to \-9 +Set the block size to 100 k, 200 k .. 900 k when compressing. Has no +effect when decompressing. See MEMORY MANAGEMENT below. .TP -.B \--repetitive-fast -.I bzip2 -injects some small pseudo-random variations -into very repetitive blocks to limit -worst-case performance during compression. -If sorting runs into difficulties, the block -is randomised, and sorting is restarted. -Very roughly, +.B \-- +Treats all subsequent arguments as file names, even if they start +with a dash. This is so you can handle files with names beginning +with a dash, for example: bzip2 \-- \-myfilename. +.TP +.B \--repetitive-fast --repetitive-best +These flags are redundant in versions 0.9.5 and above. They provided +some coarse control over the behaviour of the sorting algorithm in +earlier versions, which was sometimes useful. 0.9.5 and above have an +improved algorithm which renders these flags irrelevant. + +.SH MEMORY MANAGEMENT +.I bzip2 +compresses large files in blocks. The block size affects +both the compression ratio achieved, and the amount of memory needed for +compression and decompression. The flags \-1 through \-9 +specify the block size to be 100,000 bytes through 900,000 bytes (the +default) respectively. At decompression time, the block size used for +compression is read from the header of the compressed file, and +.I bunzip2 +then allocates itself just enough memory to decompress +the file. Since block sizes are stored in compressed files, it follows +that the flags \-1 to \-9 are irrelevant to and so ignored +during decompression. + +Compression and decompression requirements, +in bytes, can be estimated as: + + Compression: 400k + ( 8 x block size ) + + Decompression: 100k + ( 4 x block size ), or + 100k + ( 2.5 x block size ) + +Larger block sizes give rapidly diminishing marginal returns. Most of +the compression comes from the first two or three hundred k of block +size, a fact worth bearing in mind when using .I bzip2 -persists for three times as long as a well-behaved input -would take before resorting to randomisation. -This flag makes it give up much sooner. +on small machines. 
+It is also important to appreciate that the decompression memory +requirement is set at compression time by the choice of block size. -.TP -.B \--repetitive-best -Opposite of \--repetitive-fast; try a lot harder before -resorting to randomisation. +For files compressed with the default 900k block size, +.I bunzip2 +will require about 3700 kbytes to decompress. To support decompression +of any file on a 4 megabyte machine, +.I bunzip2 +has an option to +decompress using approximately half this amount of memory, about 2300 +kbytes. Decompression speed is also halved, so you should use this +option only where necessary. The relevant flag is -s. + +In general, try and use the largest block size memory constraints allow, +since that maximises the compression achieved. Compression and +decompression speed are virtually unaffected by block size. + +Another significant point applies to files which fit in a single block +-- that means most files you'd encounter using a large block size. The +amount of real memory touched is proportional to the size of the file, +since the file is smaller than a block. For example, compressing a file +20,000 bytes long with the flag -9 will cause the compressor to +allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560 +kbytes of it. Similarly, the decompressor will allocate 3700k but only +touch 100k + 20000 * 4 = 180 kbytes. + +Here is a table which summarises the maximum memory usage for different +block sizes. Also recorded is the total compressed size for 14 files of +the Calgary Text Compression Corpus totalling 3,141,622 bytes. This +column gives some feel for how compression varies with block size. +These figures tend to understate the advantage of larger block sizes for +larger files, since the Corpus is dominated by smaller files. + + Compress Decompress Decompress Corpus + Flag usage usage -s usage Size + + -1 1200k 500k 350k 914704 + -2 2000k 900k 600k 877703 + -3 2800k 1300k 850k 860338 + -4 3600k 1700k 1100k 846899 + -5 4400k 2100k 1350k 845160 + -6 5200k 2500k 1600k 838626 + -7 6100k 2900k 1850k 834096 + -8 6800k 3300k 2100k 828642 + -9 7600k 3700k 2350k 828642 .SH RECOVERING DATA FROM DAMAGED FILES .I bzip2 -compresses files in blocks, usually 900kbytes long. -Each block is handled independently. If a media or -transmission error causes a multi-block .bz2 -file to become damaged, -it may be possible to recover data from the undamaged blocks -in the file. - -The compressed representation of each block is delimited by -a 48-bit pattern, which makes it possible to find the block -boundaries with reasonable certainty. Each block also carries -its own 32-bit CRC, so damaged blocks can be -distinguished from undamaged ones. +compresses files in blocks, usually 900kbytes long. Each +block is handled independently. If a media or transmission error causes +a multi-block .bz2 +file to become damaged, it may be possible to +recover data from the undamaged blocks in the file. + +The compressed representation of each block is delimited by a 48-bit +pattern, which makes it possible to find the block boundaries with +reasonable certainty. Each block also carries its own 32-bit CRC, so +damaged blocks can be distinguished from undamaged ones. .I bzip2recover -is a simple program whose purpose is to search for -blocks in .bz2 files, and write each block out into -its own .bz2 file. You can then use -.I bzip2 -t -to test the integrity of the resulting files, -and decompress those which are undamaged. 
+is a simple program whose purpose is to search for +blocks in .bz2 files, and write each block out into its own .bz2 +file. You can then use +.I bzip2 +\-t +to test the +integrity of the resulting files, and decompress those which are +undamaged. .I bzip2recover -takes a single argument, the name of the damaged file, -and writes a number of files "rec0001file.bz2", "rec0002file.bz2", -etc, containing the extracted blocks. The output filenames -are designed so that the use of wildcards in subsequent processing --- for example, "bzip2 -dc rec*file.bz2 > recovered_data" -- -lists the files in the "right" order. +takes a single argument, the name of the damaged file, +and writes a number of files "rec0001file.bz2", +"rec0002file.bz2", etc, containing the extracted blocks. +The output filenames are designed so that the use of +wildcards in subsequent processing -- for example, +"bzip2 -dc rec*file.bz2 > recovered_data" -- lists the files in +the correct order. .I bzip2recover -should be of most use dealing with large .bz2 files, as -these will contain many blocks. It is clearly futile to -use it on damaged single-block files, since a damaged -block cannot be recovered. If you wish to minimise -any potential data loss through media or transmission -errors, you might consider compressing with a smaller +should be of most use dealing with large .bz2 +files, as these will contain many blocks. It is clearly +futile to use it on damaged single-block files, since a +damaged block cannot be recovered. If you wish to minimise +any potential data loss through media or transmission errors, +you might consider compressing with a smaller block size. .SH PERFORMANCE NOTES -The sorting phase of compression gathers together similar strings -in the file. Because of this, files containing very long -runs of repeated symbols, like "aabaabaabaab ..." (repeated -several hundred times) may compress extraordinarily slowly. -You can use the -\-vvvvv -option to monitor progress in great detail, if you want. -Decompression speed is unaffected. - -Such pathological cases -seem rare in practice, appearing mostly in artificially-constructed -test files, and in low-level disk images. It may be inadvisable to -use -.I bzip2 -to compress the latter. -If you do get a file which causes severe slowness in compression, -try making the block size as small as possible, with flag \-1. +The sorting phase of compression gathers together similar strings in the +file. Because of this, files containing very long runs of repeated +symbols, like "aabaabaabaab ..." (repeated several hundred times) may +compress more slowly than normal. Versions 0.9.5 and above fare much +better than previous versions in this respect. The ratio between +worst-case and average-case compression time is in the region of 10:1. +For previous versions, this figure was more like 100:1. You can use the +\-vvvv option to monitor progress in great detail, if you want. + +Decompression speed is unaffected by these phenomena. .I bzip2 -usually allocates several megabytes of memory to operate in, -and then charges all over it in a fairly random fashion. This -means that performance, both for compressing and decompressing, -is largely determined by the speed -at which your machine can service cache misses. -Because of this, small changes -to the code to reduce the miss rate have been observed to give -disproportionately large performance improvements. +usually allocates several megabytes of memory to operate +in, and then charges all over it in a fairly random fashion. 
This means +that performance, both for compressing and decompressing, is largely +determined by the speed at which your machine can service cache misses. +Because of this, small changes to the code to reduce the miss rate have +been observed to give disproportionately large performance improvements. I imagine .I bzip2 will perform best on machines with very large caches. .SH CAVEATS I/O error messages are not as helpful as they could be. -.I Bzip2 -tries hard to detect I/O errors and exit cleanly, but the -details of what the problem is sometimes seem rather misleading. +.I bzip2 +tries hard to detect I/O errors and exit cleanly, but the details of +what the problem is sometimes seem rather misleading. -This manual page pertains to version 0.9.0 of +This manual page pertains to version 0.9.5 of .I bzip2. -Compressed data created by this version is entirely forwards and -backwards compatible with the previous public release, version 0.1pl2, -but with the following exception: 0.9.0 can correctly decompress -multiple concatenated compressed files. 0.1pl2 cannot do this; it -will stop after decompressing just the first file in the stream. - -Wildcard expansion for Windows 95 and NT -is flaky. +Compressed +data created by this version is entirely forwards and backwards +compatible with the previous public releases, versions 0.1pl2 and 0.9.0, +but with the following exception: 0.9.0 and above can correctly +decompress multiple concatenated compressed files. 0.1pl2 cannot do +this; it will stop after decompressing just the first file in the +stream. .I bzip2recover uses 32-bit integers to represent bit positions in -compressed files, so it cannot handle compressed files -more than 512 megabytes long. This could easily be fixed. +compressed files, so it cannot handle compressed files more than 512 +megabytes long. This could easily be fixed. .SH AUTHOR Julian Seward, jseward@acm.org. http://www.muraroa.demon.co.uk -The ideas embodied in +The ideas embodied in .I bzip2 -are due to (at least) the following people: -Michael Burrows and David Wheeler (for the block sorting -transformation), David Wheeler (again, for the Huffman coder), -Peter Fenwick (for the structured coding model in the original -.I bzip, -and many refinements), -and -Alistair Moffat, Radford Neal and Ian Witten (for the arithmetic -coder in the original +are due to (at least) the following +people: Michael Burrows and David Wheeler (for the block sorting +transformation), David Wheeler (again, for the Huffman coder), Peter +Fenwick (for the structured coding model in the original +.I bzip, +and many refinements), and Alistair Moffat, Radford Neal and Ian Witten +(for the arithmetic coder in the original .I bzip). -I am much indebted for their help, support and advice. -See the manual in the source distribution for pointers to -sources of documentation. -Christian von Roques encouraged me to look for faster -sorting algorithms, so as to speed up compression. -Bela Lubkin encouraged me to improve the worst-case -compression performance. -Many people sent patches, helped with portability problems, -lent machines, gave advice and were generally helpful. - +I am much +indebted for their help, support and advice. See the manual in the +source distribution for pointers to sources of documentation. Christian +von Roques encouraged me to look for faster sorting algorithms, so as to +speed up compression. Bela Lubkin encouraged me to improve the +worst-case compression performance. 
Many people sent patches, helped
+with portability problems, lent machines, gave advice and were generally
+helpful.
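
The DESCRIPTION section of the patched page above lists how bzip2 guesses the
name of the decompressed file (.bz2 and .bz are stripped, .tbz2 and .tbz become
.tar, anything else gets .out appended).  The short C sketch below only mimics
that table for illustration; it is not bzip2's actual source, and the helper
name guessOutputName and the sample filenames are made up for this example.

   /* Illustrative sketch only -- this is not bzip2's source code.  It just
      mimics the suffix mapping quoted in the man page above. */
   #include <stdio.h>
   #include <string.h>

   static const char* suffixPairs[] = {
      ".bz2",  "",      /* filename.bz2  becomes  filename      */
      ".bz",   "",      /* filename.bz   becomes  filename      */
      ".tbz2", ".tar",  /* filename.tbz2 becomes  filename.tar  */
      ".tbz",  ".tar"   /* filename.tbz  becomes  filename.tar  */
   };

   /* Hypothetical helper, named for this example only. */
   static void guessOutputName ( const char* in, char* out, size_t outLen )
   {
      size_t n = strlen ( in );
      int    i;
      for (i = 0; i < (int)(sizeof suffixPairs / sizeof suffixPairs[0]); i += 2) {
         size_t s = strlen ( suffixPairs[i] );
         if (n > s && strcmp ( in + n - s, suffixPairs[i] ) == 0) {
            /* copy the stem, then append the replacement suffix */
            snprintf ( out, outLen, "%.*s%s", (int)(n - s), in, suffixPairs[i+1] );
            return;
         }
      }
      snprintf ( out, outLen, "%s.out", in );  /* anyothername becomes anyothername.out */
   }

   int main ( void )
   {
      char buf[1024];
      guessOutputName ( "backup.tbz2", buf, sizeof buf );
      printf ( "%s\n", buf );   /* prints: backup.tar */
      guessOutputName ( "notes.txt", buf, sizeof buf );
      printf ( "%s\n", buf );   /* prints: notes.txt.out */
      return 0;
   }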
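
The MEMORY MANAGEMENT section above gives simple formulas for memory use as a
function of block size.  The throwaway C program below is not part of bzip2; it
only evaluates those formulas for each block-size flag, and reproduces the
Compress, Decompress and Decompress -s columns of the table (for example, -9
gives 7600k, 3700k and 2350k).

   /* Illustrative sketch only -- not part of bzip2.  Evaluates the memory
      estimates quoted in the man page above:
         compression:       400k + ( 8 x block size )
         decompression:     100k + ( 4 x block size )
         decompression -s:  100k + ( 2.5 x block size )                    */
   #include <stdio.h>

   int main ( void )
   {
      int flag;
      for (flag = 1; flag <= 9; flag++) {
         int    blockK     = 100 * flag;          /* -1 .. -9 select 100k .. 900k blocks */
         int    compressK  = 400 + 8 * blockK;
         int    decompK    = 100 + 4 * blockK;
         double decompSmlK = 100 + 2.5 * blockK;  /* with the -s (--small) flag */
         printf ( "-%d   compress %5dk   decompress %5dk   decompress -s %5.0fk\n",
                  flag, compressK, decompK, decompSmlK );
      }
      return 0;
   }

These figures are the amounts allocated; as the page explains, the memory
actually touched for a file smaller than one block is proportional to the file
itself, e.g. 400k + 20000 x 8 = 560 kbytes when compressing a 20,000 byte file
with -9.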