From f93cd82a9a7094ad90fd19bbc6ccf6f4627f8060 Mon Sep 17 00:00:00 2001 From: Julian Seward Date: Sat, 4 Sep 1999 22:13:13 +0200 Subject: bzip2-0.9.5d --- manual.texi | 869 ++++++++++++++++++++++++++++++++---------------------------- 1 file changed, 459 insertions(+), 410 deletions(-) (limited to 'manual.texi') diff --git a/manual.texi b/manual.texi index 99ce661..e48e656 100644 --- a/manual.texi +++ b/manual.texi @@ -2,10 +2,10 @@ @setfilename bzip2.info @ignore -This file documents bzip2 version 0.9.0c, and associated library +This file documents bzip2 version 0.9.5, and associated library libbzip2, written by Julian Seward (jseward@acm.org). -Copyright (C) 1996-1998 Julian R Seward +Copyright (C) 1996-1999 Julian R Seward Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice @@ -30,21 +30,21 @@ END-INFO-DIR-ENTRY @titlepage @title bzip2 and libbzip2 @subtitle a program and library for data compression -@subtitle copyright (C) 1996-1998 Julian Seward -@subtitle version 0.9.0c of 18 October 1998 +@subtitle copyright (C) 1996-1999 Julian Seward +@subtitle version 0.9.5d of 4 September 1999 @author Julian Seward @end titlepage -@end iftex - @parindent 0mm @parskip 2mm +@end iftex +@node Top, Overview, (dir), (dir) This program, @code{bzip2}, and associated library @code{libbzip2}, are -Copyright (C) 1996-1998 Julian R Seward. All rights reserved. +Copyright (C) 1996-1999 Julian R Seward. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions @@ -78,13 +78,13 @@ WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -Julian Seward, Guildford, Surrey, UK. +Julian Seward, Cambridge, UK. @code{jseward@@acm.org} @code{http://www.muraroa.demon.co.uk} -@code{bzip2}/@code{libbzip2} version 0.9.0c of 18 October 1998. +@code{bzip2}/@code{libbzip2} version 0.9.5 of 24 May 1999. PATENTS: To the best of my knowledge, @code{bzip2} does not use any patented algorithms. However, I do not have the resources available to carry out @@ -124,360 +124,345 @@ ought to be recorded somewhere. This chapter contains a copy of the @code{bzip2} man page, and nothing else. + +@quotation + +@unnumberedsubsubsec NAME +@itemize +@item @code{bzip2}, @code{bunzip2} +- a block-sorting file compressor, v0.9.5 +@item @code{bzcat} +- decompresses files to stdout +@item @code{bzip2recover} +- recovers data from damaged bzip2 files +@end itemize + +@unnumberedsubsubsec SYNOPSIS +@itemize +@item @code{bzip2} [ -cdfkqstvzVL123456789 ] [ filenames ... ] +@item @code{bunzip2} [ -fkvsVL ] [ filenames ... ] +@item @code{bzcat} [ -s ] [ filenames ... ] +@item @code{bzip2recover} filename +@end itemize + +@unnumberedsubsubsec DESCRIPTION + +@code{bzip2} compresses files using the Burrows-Wheeler block sorting +text compression algorithm, and Huffman coding. Compression is +generally considerably better than that achieved by more conventional +LZ77/LZ78-based compressors, and approaches the performance of the PPM +family of statistical compressors. + +The command-line options are deliberately very similar to those of GNU +@code{gzip}, but they are not identical. + +@code{bzip2} expects a list of file names to accompany the command-line +flags. Each file is replaced by a compressed version of itself, with +the name @code{original_name.bz2}. Each compressed file has the same +modification date, permissions, and, when possible, ownership as the +corresponding original, so that these properties can be correctly +restored at decompression time. File name handling is naive in the +sense that there is no mechanism for preserving original file names, +permissions, ownerships or dates in filesystems which lack these +concepts, or have serious file name length restrictions, such as MS-DOS. + +@code{bzip2} and @code{bunzip2} will by default not overwrite existing +files. If you want this to happen, specify the @code{-f} flag. + +If no file names are specified, @code{bzip2} compresses from standard +input to standard output. In this case, @code{bzip2} will decline to +write compressed output to a terminal, as this would be entirely +incomprehensible and therefore pointless. + +@code{bunzip2} (or @code{bzip2 -d}) decompresses all +specified files. Files which were not created by @code{bzip2} +will be detected and ignored, and a warning issued. +@code{bzip2} attempts to guess the filename for the decompressed file +from that of the compressed file as follows: +@itemize +@item @code{filename.bz2 } becomes @code{filename} +@item @code{filename.bz } becomes @code{filename} +@item @code{filename.tbz2} becomes @code{filename.tar} +@item @code{filename.tbz } becomes @code{filename.tar} +@item @code{anyothername } becomes @code{anyothername.out} +@end itemize +If the file does not end in one of the recognised endings, +@code{.bz2}, @code{.bz}, +@code{.tbz2} or @code{.tbz}, @code{bzip2} complains that it cannot +guess the name of the original file, and uses the original name +with @code{.out} appended. + +As with compression, supplying no +filenames causes decompression from standard input to standard output. + +@code{bunzip2} will correctly decompress a file which is the +concatenation of two or more compressed files. The result is the +concatenation of the corresponding uncompressed files. Integrity +testing (@code{-t}) of concatenated compressed files is also supported. + +You can also compress or decompress files to the standard output by +giving the @code{-c} flag. Multiple files may be compressed and +decompressed like this. The resulting outputs are fed sequentially to +stdout. Compression of multiple files in this manner generates a stream +containing multiple compressed file representations. Such a stream +can be decompressed correctly only by @code{bzip2} version 0.9.0 or +later. Earlier versions of @code{bzip2} will stop after decompressing +the first file in the stream. + +@code{bzcat} (or @code{bzip2 -dc}) decompresses all specified files to +the standard output. + +@code{bzip2} will read arguments from the environment variables +@code{BZIP2} and @code{BZIP}, in that order, and will process them +before any arguments read from the command line. This gives a +convenient way to supply default arguments. + +Compression is always performed, even if the compressed file is slightly +larger than the original. Files of less than about one hundred bytes +tend to get larger, since the compression mechanism has a constant +overhead in the region of 50 bytes. Random data (including the output +of most file compressors) is coded at about 8.05 bits per byte, giving +an expansion of around 0.5%. + +As a self-check for your protection, @code{bzip2} uses 32-bit CRCs to +make sure that the decompressed version of a file is identical to the +original. This guards against corruption of the compressed data, and +against undetected bugs in @code{bzip2} (hopefully very unlikely). The +chances of data corruption going undetected is microscopic, about one +chance in four billion for each file processed. Be aware, though, that +the check occurs upon decompression, so it can only tell you that +something is wrong. It can't help you recover the original uncompressed +data. You can use @code{bzip2recover} to try to recover data from +damaged files. + +Return values: 0 for a normal exit, 1 for environmental problems (file +not found, invalid flags, I/O errors, &c), 2 to indicate a corrupt +compressed file, 3 for an internal consistency error (eg, bug) which +caused @code{bzip2} to panic. + + +@unnumberedsubsubsec OPTIONS +@table @code +@item -c --stdout +Compress or decompress to standard output. +@item -d --decompress +Force decompression. @code{bzip2}, @code{bunzip2} and @code{bzcat} are +really the same program, and the decision about what actions to take is +done on the basis of which name is used. This flag overrides that +mechanism, and forces bzip2 to decompress. +@item -z --compress +The complement to @code{-d}: forces compression, regardless of the +invokation name. +@item -t --test +Check integrity of the specified file(s), but don't decompress them. +This really performs a trial decompression and throws away the result. +@item -f --force +Force overwrite of output files. Normally, @code{bzip2} will not overwrite +existing output files. Also forces @code{bzip2} to break hard links +to files, which it otherwise wouldn't do. +@item -k --keep +Keep (don't delete) input files during compression +or decompression. +@item -s --small +Reduce memory usage, for compression, decompression and testing. Files +are decompressed and tested using a modified algorithm which only +requires 2.5 bytes per block byte. This means any file can be +decompressed in 2300k of memory, albeit at about half the normal speed. + +During compression, @code{-s} selects a block size of 200k, which limits +memory use to around the same figure, at the expense of your compression +ratio. In short, if your machine is low on memory (8 megabytes or +less), use -s for everything. See MEMORY MANAGEMENT below. +@item -q --quiet +Suppress non-essential warning messages. Messages pertaining to +I/O errors and other critical events will not be suppressed. +@item -v --verbose +Verbose mode -- show the compression ratio for each file processed. +Further @code{-v}'s increase the verbosity level, spewing out lots of +information which is primarily of interest for diagnostic purposes. +@item -L --license -V --version +Display the software version, license terms and conditions. +@item -1 to -9 +Set the block size to 100 k, 200 k .. 900 k when compressing. Has no +effect when decompressing. See MEMORY MANAGEMENT below. +@item -- +Treats all subsequent arguments as file names, even if they start +with a dash. This is so you can handle files with names beginning +with a dash, for example: @code{bzip2 -- -myfilename}. +@item --repetitive-fast +@item --repetitive-best +These flags are redundant in versions 0.9.5 and above. They provided +some coarse control over the behaviour of the sorting algorithm in +earlier versions, which was sometimes useful. 0.9.5 and above have an +improved algorithm which renders these flags irrelevant. +@end table + + +@unnumberedsubsubsec MEMORY MANAGEMENT + +@code{bzip2} compresses large files in blocks. The block size affects +both the compression ratio achieved, and the amount of memory needed for +compression and decompression. The flags @code{-1} through @code{-9} +specify the block size to be 100,000 bytes through 900,000 bytes (the +default) respectively. At decompression time, the block size used for +compression is read from the header of the compressed file, and +@code{bunzip2} then allocates itself just enough memory to decompress +the file. Since block sizes are stored in compressed files, it follows +that the flags @code{-1} to @code{-9} are irrelevant to and so ignored +during decompression. + +Compression and decompression requirements, in bytes, can be estimated +as: @example -NAME - bzip2, bunzip2 - a block-sorting file compressor, v0.9.0 - bzcat - decompresses files to stdout - bzip2recover - recovers data from damaged bzip2 files - - -SYNOPSIS - bzip2 [ -cdfkstvzVL123456789 ] [ filenames ... ] - bunzip2 [ -fkvsVL ] [ filenames ... ] - bzcat [ -s ] [ filenames ... ] - bzip2recover filename - - -DESCRIPTION - bzip2 compresses files using the Burrows-Wheeler block- - sorting text compression algorithm, and Huffman coding. - Compression is generally considerably better than that - achieved by more conventional LZ77/LZ78-based compressors, - and approaches the performance of the PPM family of sta- - tistical compressors. - - The command-line options are deliberately very similar to - those of GNU Gzip, but they are not identical. - - bzip2 expects a list of file names to accompany the com- - mand-line flags. Each file is replaced by a compressed - version of itself, with the name "original_name.bz2". - Each compressed file has the same modification date and - permissions as the corresponding original, so that these - properties can be correctly restored at decompression - time. File name handling is naive in the sense that there - is no mechanism for preserving original file names, per- - missions and dates in filesystems which lack these con- - cepts, or have serious file name length restrictions, such - as MS-DOS. - - bzip2 and bunzip2 will by default not overwrite existing - files; if you want this to happen, specify the -f flag. - - If no file names are specified, bzip2 compresses from - standard input to standard output. In this case, bzip2 - will decline to write compressed output to a terminal, as - this would be entirely incomprehensible and therefore - pointless. - - bunzip2 (or bzip2 -d ) decompresses and restores all spec- - ified files whose names end in ".bz2". Files without this - suffix are ignored. Again, supplying no filenames causes - decompression from standard input to standard output. - - bunzip2 will correctly decompress a file which is the con- - catenation of two or more compressed files. The result is - the concatenation of the corresponding uncompressed files. - Integrity testing (-t) of concatenated compressed files is - also supported. - - You can also compress or decompress files to the standard - output by giving the -c flag. Multiple files may be com- - pressed and decompressed like this. The resulting outputs - are fed sequentially to stdout. Compression of multiple - files in this manner generates a stream containing multi- - ple compressed file representations. Such a stream can be - decompressed correctly only by bzip2 version 0.9.0 or - later. Earlier versions of bzip2 will stop after decom- - pressing the first file in the stream. - - bzcat (or bzip2 -dc ) decompresses all specified files to - the standard output. - - Compression is always performed, even if the compressed - file is slightly larger than the original. Files of less - than about one hundred bytes tend to get larger, since the - compression mechanism has a constant overhead in the - region of 50 bytes. Random data (including the output of - most file compressors) is coded at about 8.05 bits per - byte, giving an expansion of around 0.5%. - - As a self-check for your protection, bzip2 uses 32-bit - CRCs to make sure that the decompressed version of a file - is identical to the original. This guards against corrup- - tion of the compressed data, and against undetected bugs - in bzip2 (hopefully very unlikely). The chances of data - corruption going undetected is microscopic, about one - chance in four billion for each file processed. Be aware, - though, that the check occurs upon decompression, so it - can only tell you that that something is wrong. It can't - help you recover the original uncompressed data. You can - use bzip2recover to try to recover data from damaged - files. - - Return values: 0 for a normal exit, 1 for environmental - problems (file not found, invalid flags, I/O errors, &c), - 2 to indicate a corrupt compressed file, 3 for an internal - consistency error (eg, bug) which caused bzip2 to panic. - - -MEMORY MANAGEMENT - Bzip2 compresses large files in blocks. The block size - affects both the compression ratio achieved, and the - amount of memory needed both for compression and decom- - pression. The flags -1 through -9 specify the block size - to be 100,000 bytes through 900,000 bytes (the default) - respectively. At decompression-time, the block size used - for compression is read from the header of the compressed - file, and bunzip2 then allocates itself just enough memory - to decompress the file. Since block sizes are stored in - compressed files, it follows that the flags -1 to -9 are - irrelevant to and so ignored during decompression. - - Compression and decompression requirements, in bytes, can - be estimated as: - - Compression: 400k + ( 7 x block size ) - - Decompression: 100k + ( 4 x block size ), or - 100k + ( 2.5 x block size ) - - Larger block sizes give rapidly diminishing marginal - returns; most of the compression comes from the first two - or three hundred k of block size, a fact worth bearing in - mind when using bzip2 on small machines. It is also - important to appreciate that the decompression memory - requirement is set at compression-time by the choice of - block size. + Compression: 400k + ( 8 x block size ) + + Decompression: 100k + ( 4 x block size ), or + 100k + ( 2.5 x block size ) +@end example +Larger block sizes give rapidly diminishing marginal returns. Most of +the compression comes from the first two or three hundred k of block +size, a fact worth bearing in mind when using @code{bzip2} on small machines. +It is also important to appreciate that the decompression memory +requirement is set at compression time by the choice of block size. + +For files compressed with the default 900k block size, @code{bunzip2} +will require about 3700 kbytes to decompress. To support decompression +of any file on a 4 megabyte machine, @code{bunzip2} has an option to +decompress using approximately half this amount of memory, about 2300 +kbytes. Decompression speed is also halved, so you should use this +option only where necessary. The relevant flag is @code{-s}. + +In general, try and use the largest block size memory constraints allow, +since that maximises the compression achieved. Compression and +decompression speed are virtually unaffected by block size. + +Another significant point applies to files which fit in a single block +-- that means most files you'd encounter using a large block size. The +amount of real memory touched is proportional to the size of the file, +since the file is smaller than a block. For example, compressing a file +20,000 bytes long with the flag @code{-9} will cause the compressor to +allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560 +kbytes of it. Similarly, the decompressor will allocate 3700k but only +touch 100k + 20000 * 4 = 180 kbytes. + +Here is a table which summarises the maximum memory usage for different +block sizes. Also recorded is the total compressed size for 14 files of +the Calgary Text Compression Corpus totalling 3,141,622 bytes. This +column gives some feel for how compression varies with block size. +These figures tend to understate the advantage of larger block sizes for +larger files, since the Corpus is dominated by smaller files. +@example + Compress Decompress Decompress Corpus + Flag usage usage -s usage Size + + -1 1200k 500k 350k 914704 + -2 2000k 900k 600k 877703 + -3 2800k 1300k 850k 860338 + -4 3600k 1700k 1100k 846899 + -5 4400k 2100k 1350k 845160 + -6 5200k 2500k 1600k 838626 + -7 6100k 2900k 1850k 834096 + -8 6800k 3300k 2100k 828642 + -9 7600k 3700k 2350k 828642 +@end example + +@unnumberedsubsubsec RECOVERING DATA FROM DAMAGED FILES - For files compressed with the default 900k block size, - bunzip2 will require about 3700 kbytes to decompress. To - support decompression of any file on a 4 megabyte machine, - bunzip2 has an option to decompress using approximately - half this amount of memory, about 2300 kbytes. Decompres- - sion speed is also halved, so you should use this option - only where necessary. The relevant flag is -s. - - In general, try and use the largest block size memory con- - straints allow, since that maximises the compression - achieved. Compression and decompression speed are virtu- - ally unaffected by block size. - - Another significant point applies to files which fit in a - single block -- that means most files you'd encounter - using a large block size. The amount of real memory - touched is proportional to the size of the file, since the - file is smaller than a block. For example, compressing a - file 20,000 bytes long with the flag -9 will cause the - compressor to allocate around 6700k of memory, but only - touch 400k + 20000 * 7 = 540 kbytes of it. Similarly, the - decompressor will allocate 3700k but only touch 100k + - 20000 * 4 = 180 kbytes. - - Here is a table which summarises the maximum memory usage - for different block sizes. Also recorded is the total - compressed size for 14 files of the Calgary Text Compres- - sion Corpus totalling 3,141,622 bytes. This column gives - some feel for how compression varies with block size. - These figures tend to understate the advantage of larger - block sizes for larger files, since the Corpus is domi- - nated by smaller files. - - Compress Decompress Decompress Corpus - Flag usage usage -s usage Size - - -1 1100k 500k 350k 914704 - -2 1800k 900k 600k 877703 - -3 2500k 1300k 850k 860338 - -4 3200k 1700k 1100k 846899 - -5 3900k 2100k 1350k 845160 - -6 4600k 2500k 1600k 838626 - -7 5400k 2900k 1850k 834096 - -8 6000k 3300k 2100k 828642 - -9 6700k 3700k 2350k 828642 - - -OPTIONS - -c --stdout - Compress or decompress to standard output. -c will - decompress multiple files to stdout, but will only - compress a single file to stdout. - - -d --decompress - Force decompression. bzip2, bunzip2 and bzcat are - really the same program, and the decision about - what actions to take is done on the basis of which - name is used. This flag overrides that mechanism, - and forces bzip2 to decompress. - - -z --compress - The complement to -d: forces compression, regard- - less of the invokation name. - - -t --test - Check integrity of the specified file(s), but don't - decompress them. This really performs a trial - decompression and throws away the result. - - -f --force - Force overwrite of output files. Normally, bzip2 - will not overwrite existing output files. - - -k --keep - Keep (don't delete) input files during compression - or decompression. - - -s --small - Reduce memory usage, for compression, decompression - and testing. Files are decompressed and tested - using a modified algorithm which only requires 2.5 - bytes per block byte. This means any file can be - decompressed in 2300k of memory, albeit at about - half the normal speed. - - During compression, -s selects a block size of - 200k, which limits memory use to around the same - figure, at the expense of your compression ratio. - In short, if your machine is low on memory (8 - megabytes or less), use -s for everything. See - MEMORY MANAGEMENT above. - - -v --verbose - Verbose mode -- show the compression ratio for each - file processed. Further -v's increase the ver- - bosity level, spewing out lots of information which - is primarily of interest for diagnostic purposes. - - -L --license -V --version - Display the software version, license terms and - conditions. - - -1 to -9 - Set the block size to 100 k, 200 k .. 900 k when - compressing. Has no effect when decompressing. - See MEMORY MANAGEMENT above. - - --repetitive-fast - bzip2 injects some small pseudo-random variations - into very repetitive blocks to limit worst-case - performance during compression. If sorting runs - into difficulties, the block is randomised, and - sorting is restarted. Very roughly, bzip2 persists - for three times as long as a well-behaved input - would take before resorting to randomisation. This - flag makes it give up much sooner. - - --repetitive-best - Opposite of --repetitive-fast; try a lot harder - before resorting to randomisation. - - -RECOVERING DATA FROM DAMAGED FILES - bzip2 compresses files in blocks, usually 900kbytes long. - Each block is handled independently. If a media or trans- - mission error causes a multi-block .bz2 file to become - damaged, it may be possible to recover data from the - undamaged blocks in the file. - - The compressed representation of each block is delimited - by a 48-bit pattern, which makes it possible to find the - block boundaries with reasonable certainty. Each block - also carries its own 32-bit CRC, so damaged blocks can be - distinguished from undamaged ones. - - bzip2recover is a simple program whose purpose is to - search for blocks in .bz2 files, and write each block out - into its own .bz2 file. You can then use bzip2 -t to test - the integrity of the resulting files, and decompress those - which are undamaged. - - bzip2recover takes a single argument, the name of the dam- - aged file, and writes a number of files "rec0001file.bz2", - "rec0002file.bz2", etc, containing the extracted blocks. +@code{bzip2} compresses files in blocks, usually 900kbytes long. Each +block is handled independently. If a media or transmission error causes +a multi-block @code{.bz2} file to become damaged, it may be possible to +recover data from the undamaged blocks in the file. + +The compressed representation of each block is delimited by a 48-bit +pattern, which makes it possible to find the block boundaries with +reasonable certainty. Each block also carries its own 32-bit CRC, so +damaged blocks can be distinguished from undamaged ones. + +@code{bzip2recover} is a simple program whose purpose is to search for +blocks in @code{.bz2} files, and write each block out into its own +@code{.bz2} file. You can then use @code{bzip2 -t} to test the +integrity of the resulting files, and decompress those which are +undamaged. + +@code{bzip2recover} +takes a single argument, the name of the damaged file, +and writes a number of files @code{rec0001file.bz2}, + @code{rec0002file.bz2}, etc, containing the extracted blocks. The output filenames are designed so that the use of - wildcards in subsequent processing -- for example, "bzip2 - -dc rec*file.bz2 > recovered_data" -- lists the files in - the "right" order. + wildcards in subsequent processing -- for example, +@code{bzip2 -dc rec*file.bz2 > recovered_data} -- lists the files in + the correct order. - bzip2recover should be of most use dealing with large .bz2 +@code{bzip2recover} should be of most use dealing with large @code{.bz2} files, as these will contain many blocks. It is clearly futile to use it on damaged single-block files, since a - damaged block cannot be recovered. If you wish to min- - imise any potential data loss through media or transmis- - sion errors, you might consider compressing with a smaller + damaged block cannot be recovered. If you wish to minimise +any potential data loss through media or transmission errors, +you might consider compressing with a smaller block size. -PERFORMANCE NOTES - The sorting phase of compression gathers together similar - strings in the file. Because of this, files containing - very long runs of repeated symbols, like "aabaabaabaab - ..." (repeated several hundred times) may compress - extraordinarily slowly. You can use the -vvvvv option to - monitor progress in great detail, if you want. Decompres- - sion speed is unaffected. - - Such pathological cases seem rare in practice, appearing - mostly in artificially-constructed test files, and in low- - level disk images. It may be inadvisable to use bzip2 to - compress the latter. If you do get a file which causes - severe slowness in compression, try making the block size - as small as possible, with flag -1. - - bzip2 usually allocates several megabytes of memory to - operate in, and then charges all over it in a fairly ran- - dom fashion. This means that performance, both for com- - pressing and decompressing, is largely determined by the - speed at which your machine can service cache misses. - Because of this, small changes to the code to reduce the - miss rate have been observed to give disproportionately - large performance improvements. I imagine bzip2 will per- - form best on machines with very large caches. - - -CAVEATS - I/O error messages are not as helpful as they could be. - Bzip2 tries hard to detect I/O errors and exit cleanly, - but the details of what the problem is sometimes seem - rather misleading. - - This manual page pertains to version 0.9.0 of bzip2. Com- - pressed data created by this version is entirely forwards - and backwards compatible with the previous public release, - version 0.1pl2, but with the following exception: 0.9.0 - can correctly decompress multiple concatenated compressed - files. 0.1pl2 cannot do this; it will stop after decom- - pressing just the first file in the stream. - - Wildcard expansion for Windows 95 and NT is flaky. - - bzip2recover uses 32-bit integers to represent bit posi- - tions in compressed files, so it cannot handle compressed - files more than 512 megabytes long. This could easily be - fixed. - - -AUTHOR - Julian Seward, jseward@@acm.org. - - The ideas embodied in bzip2 are due to (at least) the fol- - lowing people: Michael Burrows and David Wheeler (for the - block sorting transformation), David Wheeler (again, for - the Huffman coder), Peter Fenwick (for the structured cod- - ing model in the original bzip, and many refinements), and - Alistair Moffat, Radford Neal and Ian Witten (for the - arithmetic coder in the original bzip). I am much - indebted for their help, support and advice. See the man- - ual in the source distribution for pointers to sources of - documentation. Christian von Roques encouraged me to look - for faster sorting algorithms, so as to speed up compres- - sion. Bela Lubkin encouraged me to improve the worst-case - compression performance. Many people sent patches, helped - with portability problems, lent machines, gave advice and - were generally helpful. -@end example +@unnumberedsubsubsec PERFORMANCE NOTES + +The sorting phase of compression gathers together similar strings in the +file. Because of this, files containing very long runs of repeated +symbols, like "aabaabaabaab ..." (repeated several hundred times) may +compress more slowly than normal. Versions 0.9.5 and above fare much +better than previous versions in this respect. The ratio between +worst-case and average-case compression time is in the region of 10:1. +For previous versions, this figure was more like 100:1. You can use the +@code{-vvvv} option to monitor progress in great detail, if you want. + +Decompression speed is unaffected by these phenomena. + +@code{bzip2} usually allocates several megabytes of memory to operate +in, and then charges all over it in a fairly random fashion. This means +that performance, both for compressing and decompressing, is largely +determined by the speed at which your machine can service cache misses. +Because of this, small changes to the code to reduce the miss rate have +been observed to give disproportionately large performance improvements. +I imagine @code{bzip2} will perform best on machines with very large +caches. +@unnumberedsubsubsec CAVEATS + +I/O error messages are not as helpful as they could be. @code{bzip2} +tries hard to detect I/O errors and exit cleanly, but the details of +what the problem is sometimes seem rather misleading. + +This manual page pertains to version 0.9.5 of @code{bzip2}. Compressed +data created by this version is entirely forwards and backwards +compatible with the previous public releases, versions 0.1pl2 and 0.9.0, +but with the following exception: 0.9.0 and above can correctly +decompress multiple concatenated compressed files. 0.1pl2 cannot do +this; it will stop after decompressing just the first file in the +stream. + +@code{bzip2recover} uses 32-bit integers to represent bit positions in +compressed files, so it cannot handle compressed files more than 512 +megabytes long. This could easily be fixed. + + +@unnumberedsubsubsec AUTHOR +Julian Seward, @code{jseward@@acm.org}. + +The ideas embodied in @code{bzip2} are due to (at least) the following +people: Michael Burrows and David Wheeler (for the block sorting +transformation), David Wheeler (again, for the Huffman coder), Peter +Fenwick (for the structured coding model in the original @code{bzip}, +and many refinements), and Alistair Moffat, Radford Neal and Ian Witten +(for the arithmetic coder in the original @code{bzip}). I am much +indebted for their help, support and advice. See the manual in the +source distribution for pointers to sources of documentation. Christian +von Roques encouraged me to look for faster sorting algorithms, so as to +speed up compression. Bela Lubkin encouraged me to improve the +worst-case compression performance. Many people sent patches, helped +with portability problems, lent machines, gave advice and were generally +helpful. + +@end quotation + @@ -579,7 +564,7 @@ give better @code{zlib} compatibility. These functions are @code{bzerror} and @code{bzlibVersion}. You may find these functions more convenient for simple file reading and writing, than those in the high-level interface. These functions are not (yet) officially part of -the library, and are not further documented here. If they break, you +the library, and are minimally documented here. If they break, you get to keep all the pieces. I hope to document them properly when time permits. @@ -748,25 +733,31 @@ monitoring/debugging output. If the library has been compiled with setting. Parameter @code{workFactor} controls how the compression phase behaves -when presented with worst case, highly repetitive, input data. -If compression runs into difficulties caused by repetitive data, -some pseudo-random variations are inserted into the block, and -compression is restarted. Lower values of @code{workFactor} -reduce the tolerance of compression to repetitive data. -You should set this parameter carefully; too low, and -compression ratio suffers, too high, and your average-to-worst -case compression times can become very large. -The default value of 30 -gives reasonable behaviour over a wide range of circumstances. - -Allowable values range from 0 to 250 inclusive. 0 is a special -case, equivalent to using the default value of 30. - -Note that the randomisation process is entirely transparent. -If the library decides to randomise and restart compression on a -block, it does so without comment. Randomised blocks are -automatically de-randomised during decompression, so data -integrity is never compromised. +when presented with worst case, highly repetitive, input data. If +compression runs into difficulties caused by repetitive data, the +library switches from the standard sorting algorithm to a fallback +algorithm. The fallback is slower than the standard algorithm by +perhaps a factor of three, but always behaves reasonably, no matter how +bad the input. + +Lower values of @code{workFactor} reduce the amount of effort the +standard algorithm will expend before resorting to the fallback. You +should set this parameter carefully; too low, and many inputs will be +handled by the fallback algorithm and so compress rather slowly, too +high, and your average-to-worst case compression times can become very +large. The default value of 30 gives reasonable behaviour over a wide +range of circumstances. + +Allowable values range from 0 to 250 inclusive. 0 is a special case, +equivalent to using the default value of 30. + +Note that the compressed output generated is the same regardless of +whether or not the fallback algorithm is used. + +Be aware also that this parameter may disappear entirely in future +versions of the library. In principle it should be possible to devise a +good way to automatically choose which algorithm to use. Such a +mechanism would render the parameter obsolete. Possible return values: @display @@ -1652,6 +1643,48 @@ Possible return values: +@section @code{zlib} compatibility functions +Yoshioka Tsuneo has contributed some functions to +give better @code{zlib} compatibility. These functions are +@code{bzopen}, @code{bzread}, @code{bzwrite}, @code{bzflush}, +@code{bzclose}, +@code{bzerror} and @code{bzlibVersion}. +These functions are not (yet) officially part of +the library. If they break, you get to keep all the pieces. +Nevertheless, I think they work ok. +@example +typedef void BZFILE; + +const char * bzlibVersion ( void ); +@end example +Returns a string indicating the library version. +@example +BZFILE * bzopen ( const char *path, const char *mode ); +BZFILE * bzdopen ( int fd, const char *mode ); +@end example +Opens a @code{.bz2} file for reading or writing, using either its name +or a pre-existing file descriptor. +Analogous to @code{fopen} and @code{fdopen}. +@example +int bzread ( BZFILE* b, void* buf, int len ); +int bzwrite ( BZFILE* b, void* buf, int len ); +@end example +Reads/writes data from/to a previously opened @code{BZFILE}. +Analogous to @code{fread} and @code{fwrite}. +@example +int bzflush ( BZFILE* b ); +void bzclose ( BZFILE* b ); +@end example +Flushes/closes a @code{BZFILE}. @code{bzflush} doesn't actually do +anything. Analogous to @code{fflush} and @code{fclose}. + +@example +const char * bzerror ( BZFILE *b, int *errnum ) +@end example +Returns a string describing the more recent error status of +@code{b}, and also sets @code{*errnum} to its numerical value. + + @section Using the library in a @code{stdio}-free environment @subsection Getting rid of @code{stdio} @@ -1677,14 +1710,14 @@ was compiled with @code{BZ_NO_STDIO} set. For a normal compile, an assertion failure yields the message @example - bzip2/libbzip2, v0.9.0: internal error number N. - This is a bug in bzip2/libbzip2, v0.9.0. Please report + bzip2/libbzip2, v0.9.5: internal error number N. + This is a bug in bzip2/libbzip2, v0.9.5. Please report it to me at: jseward@@acm.org. If this happened when you were using some program which uses libbzip2 as a component, you should also report this bug to the author(s) of that program. Please make an effort to report this bug; timely and accurate bug reports eventually lead to higher - quality software. Thx. Julian Seward, 27 June 1998. + quality software. Thanks. Julian Seward, 24 May 1999. @end example where @code{N} is some error code number. @code{exit(3)} is then called. @@ -1721,7 +1754,7 @@ If you can't open the project file for some reason, make a new one, naming these files: @code{blocksort.c}, @code{bzlib.c}, @code{compress.c}, @code{crctable.c}, @code{decompress.c}, @code{huffman.c}, @* -@code{randtable.c} and @code{libbz2.def}. You might also need +@code{randtable.c} and @code{libbz2.def}. You will also need to name the header files @code{bzlib.h} and @code{bzlib_private.h}. If you don't use VC++, you may need to define the proprocessor symbol @@ -1730,6 +1763,14 @@ If you don't use VC++, you may need to define the proprocessor symbol Finally, @code{dlltest.c} is a sample program using the DLL. It has a project file, @code{dlltest.dsp}. +If you just want a makefile for Visual C, have a look at +@code{makefile.msc}. + +Be aware that if you compile @code{bzip2} itself on Win32, you must set +@code{BZ_UNIX} to 0 and @code{BZ_LCCWIN32} to 1, in the file +@code{bzip2.c}, before compiling. Otherwise the resulting binary won't +work correctly. + I haven't tried any of this stuff myself, but it all looks plausible. @@ -1740,7 +1781,8 @@ These are just some random thoughts of mine. Your mileage may vary. @section Limitations of the compressed file format -@code{bzip2-0.9.0} uses exactly the same file format as the previous +@code{bzip2-0.9.5} and @code{0.9.0} +use exactly the same file format as the previous version, @code{bzip2-0.1}. This decision was made in the interests of stability. Creating yet another incompatible compressed file format would create further confusion and disruption for users. @@ -1776,12 +1818,11 @@ decompression and, in retrospect, are unnecessary. These are: @code{bzip2-0.1}'s performance on repetitive data, so perhaps it isn't a problem for real inputs. - Probably the best long-term solution + Probably the best long-term solution, + and the one I have incorporated into 0.9.5 and above, is to use the existing sorting algorithm initially, and fall back to a O(N (log N)^2) algorithm if the standard algorithm gets into difficulties. - This can be done without much difficulty; I made - a prototype implementation of it some months now. @item The compressed file format was never designed to be handled by a library, and I have had to jump though some hoops to produce an efficient implementation of @@ -1797,7 +1838,7 @@ It would be fair to say that the @code{bzip2} format was frozen before I properly and fully understood the performance consequences of doing so. -Improvements which I have been able to incorporate into +Improvements which I was able to incorporate into 0.9.0, despite using the same file format, are: @itemize @bullet @item Single array implementation of the inverse BWT. This @@ -1808,8 +1849,7 @@ Improvements which I have been able to incorporate into of values. @item @code{bzip2-0.9.0} now reads and writes files with @code{fread} and @code{fwrite}; version 0.1 used @code{putc} and @code{getc}. - Duh! I'm embarrassed at my own moronicness (moronicity?) on this - one. + Duh! Well, you live and learn. @end itemize Further ahead, it would be nice @@ -1820,7 +1860,7 @@ require some careful design of compressed file formats. @section Portability issues After some consideration, I have decided not to use -GNU @code{autoconf} to configure 0.9.0. +GNU @code{autoconf} to configure 0.9.5. @code{autoconf}, admirable and wonderful though it is, mainly assists with portability problems between Unix-like @@ -1835,8 +1875,10 @@ under Unix straight out-of-the-box, so to speak, especially if you have a version of GNU C available. There are a couple of @code{__inline__} directives in the code. GNU C -(@code{gcc}) should be able to handle them. If your compiler doesn't -like them, just @code{#define} @code{__inline__} to be null. One +(@code{gcc}) should be able to handle them. If you're not using +GNU C, your C compiler shouldn't see them at all. +If your compiler does, for some reason, see them and doesn't +like them, just @code{#define} @code{__inline__} to be @code{/* */}. One easy way to do this is to compile with the flag @code{-D__inline__=}, which should be understood by most Unix compilers. @@ -1853,6 +1895,11 @@ distribution, please try and link it statically (@code{gcc -s}). This avoids all sorts of library-version issues that others may encounter later on. +If you build @code{bzip2} on Win32, you must set @code{BZ_UNIX} to 0 and +@code{BZ_LCCWIN32} to 1, in the file @code{bzip2.c}, before compiling. +Otherwise the resulting binary won't work correctly. + + @section Reporting bugs I tried pretty hard to make sure @code{bzip2} is @@ -1976,45 +2023,39 @@ A record of the tests I've done. First, some data sets: @itemize @bullet -@item B: a directory containing a 6001 files, one for every length in the +@item B: a directory containing 6001 files, one for every length in the range 0 to 6000 bytes. The files contain random lowercase letters. 18.7 megabytes. @item H: my home directory tree. Documents, source code, mail files, compressed data. H contains B, and also a directory of files designed as boundary cases for the sorting; mostly very - repetitive, nasty files. 445 megabytes. + repetitive, nasty files. 565 megabytes. @item A: directory tree holding various applications built from source: - @code{egcs-1.0.2}, @code{gcc-2.8.1}, KDE Beta 4, GTK, Octave, etc. - 827 megabytes. -@item P: directory tree holding large amounts of source code (@code{.tar} - files) of the entire GNU distribution, plus a couple of - Linux distributions. 2400 megabytes. + @code{egcs}, @code{gcc-2.8.1}, KDE, GTK, Octave, etc. + 2200 megabytes. @end itemize The tests conducted are as follows. Each test means compressing (a copy of) each file in the data set, decompressing it and comparing it against the original. -First, a bunch of tests with block sizes, internal buffer -sizes and randomisation lengths set very small, +First, a bunch of tests with block sizes and internal buffer +sizes set very small, to detect any problems with the -blocking, buffering and randomisation mechanisms. +blocking and buffering mechanisms. This required modifying the source code so as to try to break it. @enumerate @item Data set H, with buffer size of 1 byte, and block size of 23 bytes. @item Data set B, buffer sizes 1 byte, block size 1 byte. -@item As (2) but small-mode decompression (first 1700 files). +@item As (2) but small-mode decompression. @item As (2) with block size 2 bytes. @item As (2) with block size 3 bytes. @item As (2) with block size 4 bytes. @item As (2) with block size 5 bytes. @item As (2) with block size 6 bytes and small-mode decompression. -@item H with normal buffer sizes (5000 bytes), normal block - size (up to 900000 bytes), but with randomisation - mechanism running intensely (randomising approximately every - third byte). -@item As (9) with small-mode decompression. +@item H with buffer size of 1 byte, but normal block + size (up to 900000 bytes). @end enumerate Then some tests with unmodified source code. @enumerate @@ -2023,18 +2064,26 @@ Then some tests with unmodified source code. @item H, compress with flag @code{-1}. @item H, compress with flag @code{-s}, decompress with flag @code{-s}. @item Forwards compatibility: H, @code{bzip2-0.1pl2} compressing, - @code{bzip2-0.9.0} decompressing, all settings normal. -@item Backwards compatibility: H, @code{bzip2-0.9.0} compressing, + @code{bzip2-0.9.5} decompressing, all settings normal. +@item Backwards compatibility: H, @code{bzip2-0.9.5} compressing, @code{bzip2-0.1pl2} decompressing, all settings normal. @item Bigger tests: A, all settings normal. -@item P, all settings normal. -@item Misc test: about 100 megabytes of @code{.tar} files with - @code{bzip2} compiled with Purify. +@item As (7), using the fallback (Sadakane-like) sorting algorithm. +@item As (8), compress with flag @code{-1}, decompress with flag + @code{-s}. +@item H, using the fallback sorting algorithm. +@item Forwards compatibility: A, @code{bzip2-0.1pl2} compressing, + @code{bzip2-0.9.5} decompressing, all settings normal. +@item Backwards compatibility: A, @code{bzip2-0.9.5} compressing, + @code{bzip2-0.1pl2} decompressing, all settings normal. +@item Misc test: about 400 megabytes of @code{.tar} files with + @code{bzip2} compiled with Checker (a memory access error + detector, like Purify). @item Misc tests to make sure it builds and runs ok on non-Linux/x86 platforms. @end enumerate -These tests were conducted on a 205 MHz Cyrix 6x86MX machine, running -Linux 2.0.32. They represent nearly a week of continuous computation. +These tests were conducted on a 225 MHz IDT WinChip machine, running +Linux 2.0.36. They represent nearly a week of continuous computation. All tests completed successfully. @@ -2063,7 +2112,7 @@ David J. Wheeler Program bred3.c and accompanying document bred3.ps. This contains the idea behind the multi-table Huffman coding scheme. - ftp://ftp.cl.cam.ac.uk/pub/user/djw3/ + ftp://ftp.cl.cam.ac.uk/users/djw3/ Jon L. Bentley and Robert Sedgewick "Fast Algorithms for Sorting and Searching Strings" -- cgit v1.2.3-55-g6feb