From 33d134030248633ffa7d60c0a35a783c46da034b Mon Sep 17 00:00:00 2001
From: Julian Seward
Date: Thu, 7 Aug 1997 22:13:13 +0200
Subject: bzip2-0.1

---
 bzip2.1 | 441 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 441 insertions(+)
 create mode 100644 bzip2.1

diff --git a/bzip2.1 b/bzip2.1
new file mode 100644
index 0000000..9094c7c
--- /dev/null
+++ b/bzip2.1
@@ -0,0 +1,441 @@
+.PU
+.TH bzip2 1
+.SH NAME
+bzip2, bunzip2 \- a block-sorting file compressor, v0.1
+.br
+bzip2recover \- recovers data from damaged bzip2 files
+
+.SH SYNOPSIS
+.ll +8
+.B bzip2
+.RB [ " \-cdfkstvVL123456789 " ]
+[
+.I "filenames \&..."
+]
+.ll -8
+.br
+.B bunzip2
+.RB [ " \-kvsVL " ]
+[
+.I "filenames \&..."
+]
+.br
+.B bzip2recover
+.I "filename"
+
+.SH DESCRIPTION
+.I Bzip2
+compresses files using the Burrows-Wheeler block-sorting
+text compression algorithm, and Huffman coding.
+Compression is generally considerably
+better than that
+achieved by more conventional LZ77/LZ78-based compressors,
+and approaches the performance of the PPM family of statistical
+compressors.
+
+The command-line options are deliberately very similar to
+those of
+.I GNU Gzip,
+but they are not identical.
+
+.I Bzip2
+expects a list of file names to accompany the command-line flags.
+Each file is replaced by a compressed version of itself,
+with the name "original_name.bz2".
+Each compressed file has the same modification date and permissions
+as the corresponding original, so that these properties can be
+correctly restored at decompression time.  File name handling is
+naive in the sense that there is no mechanism for preserving
+original file names, permissions and dates in filesystems
+which lack these concepts, or have serious file name length
+restrictions, such as MS-DOS.
+
+.I Bzip2
+and
+.I bunzip2
+will not overwrite existing files; if you want this to happen,
+you should delete them first.
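The compress-and-replace behaviour described above can be sketched in Python using the standard-library bz2 module, which implements a modern descendant of this file format. The function name compress_in_place and its details are illustrative assumptions, not bzip2's actual implementation; it mimics the documented semantics: write "name.bz2", carry over the modification date and permissions, refuse to overwrite, and remove the original.

```python
import bz2
import os
import tempfile

def compress_in_place(path):
    # Illustrative sketch of bzip2's documented behaviour, not its real code:
    # replace 'path' with 'path.bz2', preserving modification time and
    # permissions, and refusing to overwrite an existing output file.
    out = path + ".bz2"
    if os.path.exists(out):
        raise FileExistsError(out)  # bzip2 will not overwrite existing files
    st = os.stat(path)
    with open(path, "rb") as f:
        data = f.read()
    with open(out, "wb") as f:
        f.write(bz2.compress(data))
    os.utime(out, (st.st_atime, st.st_mtime))  # same modification date
    os.chmod(out, st.st_mode)                  # same permissions
    os.remove(path)                            # the original is replaced
    return out

# Exercise the sketch on a throwaway file.
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "sample.txt")
with open(original, "wb") as f:
    f.write(b"some text worth compressing\n" * 200)
compressed = compress_in_place(original)
original_gone = not os.path.exists(original)
with open(compressed, "rb") as f:
    round_trip = bz2.decompress(f.read())
```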
+
+If no file names are specified,
+.I bzip2
+compresses from standard input to standard output.
+In this case,
+.I bzip2
+will decline to write compressed output to a terminal, as
+this would be entirely incomprehensible and therefore pointless.
+
+.I Bunzip2
+(or
+.I bzip2 \-d
+) decompresses and restores all specified files whose names
+end in ".bz2".
+Files without this suffix are ignored.
+Again, supplying no filenames
+causes decompression from standard input to standard output.
+
+You can also compress or decompress files to
+the standard output by giving the \-c flag.
+You can decompress multiple files like this, but you may
+only compress a single file this way, since it would otherwise
+be difficult to separate out the compressed representations of
+the original files.
+
+Compression is always performed, even if the compressed file is
+slightly larger than the original.  Files of less than about
+one hundred bytes tend to get larger, since the compression
+mechanism has a constant overhead in the region of 50 bytes.
+Random data (including the output of most file compressors)
+is coded at about 8.05 bits per byte, giving an expansion of
+around 0.5%.
+
+As a self-check for your protection,
+.I bzip2
+uses 32-bit CRCs to make sure that the decompressed
+version of a file is identical to the original.
+This guards against corruption of the compressed data,
+and against undetected bugs in
+.I bzip2
+(hopefully very unlikely).
+The chance of data corruption going undetected is
+microscopic, about one chance in four billion
+for each file processed.  Be aware, though, that the check
+occurs upon decompression, so it can only tell you
+that something is wrong.  It can't help you recover the
+original uncompressed data.
+You can use
+.I bzip2recover
+to try to recover data from damaged files.
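The size effects described above are easy to observe with Python's standard-library bz2 module. Note this is a later implementation of the format (libbzip2), so the exact constant overhead differs slightly from v0.1, but the qualitative behaviour is the same: tiny inputs grow, random data expands slightly, ordinary text shrinks.

```python
import bz2
import os

# A very small input grows: the format's constant overhead
# (stream header, block header, CRCs) dominates.
tiny = b"hello"
tiny_out = bz2.compress(tiny)

# Random data is essentially incompressible and expands slightly.
random_data = os.urandom(100_000)
random_out = bz2.compress(random_data)

# Ordinary repetitive text compresses well.
text = b"the quick brown fox jumps over the lazy dog\n" * 2000
text_out = bz2.compress(text)
```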
+
+Return values:
+0 for a normal exit,
+1 for environmental
+problems (file not found, invalid flags, I/O errors, &c),
+2 to indicate a corrupt compressed file,
+3 for an internal consistency error (eg, bug) which caused
+.I bzip2
+to panic.
+
+.SH MEMORY MANAGEMENT
+.I Bzip2
+compresses large files in blocks.  The block size affects both the
+compression ratio achieved, and the amount of memory needed both for
+compression and decompression.  The flags \-1 through \-9
+specify the block size to be 100,000 bytes through 900,000 bytes
+(the default) respectively.  At decompression-time, the block size used for
+compression is read from the header of the compressed file, and
+.I bunzip2
+then allocates itself just enough memory to decompress the file.
+Since block sizes are stored in compressed files, the flags
+\-1 to \-9
+are irrelevant during decompression, and so are ignored.
+Compression and decompression requirements, in bytes, can be estimated as:
+
+     Compression:   400k + ( 7 x block size )
+
+     Decompression: 100k + ( 5 x block size ), or
+.br
+     100k + ( 2.5 x block size )
+
+Larger block sizes give rapidly diminishing marginal returns; most
+of the
+compression comes from the first two or three hundred k of block size,
+a fact worth bearing in mind when using
+.I bzip2
+on small machines.  It is also important to appreciate that the
+decompression memory requirement is set at compression-time by the
+choice of block size.
+
+For files compressed with the default 900k block size,
+.I bunzip2
+will require about 4600 kbytes to decompress.
+To support decompression of any file on a 4 megabyte machine,
+.I bunzip2
+has an option to decompress using approximately half this
+amount of memory, about 2300 kbytes.  Decompression speed is
+also halved, so you should use this option only where necessary.
+The relevant flag is \-s.
+
+In general, try to use the largest block size
+memory constraints allow, since that maximises the compression
+achieved.
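The memory estimates above can be written out as simple arithmetic. This sketch just encodes the page's own formulas, in kbytes, with the flag digit (1-9) as the block size in units of 100k:

```python
def compression_memory(block_size_100k):
    # 400k + ( 7 x block size ), in kbytes
    return 400 + 7 * 100 * block_size_100k

def decompression_memory(block_size_100k, small=False):
    # 100k + ( 5 x block size ), or
    # 100k + ( 2.5 x block size ) when the -s (small) algorithm is used
    factor = 2.5 if small else 5
    return 100 + factor * 100 * block_size_100k

# The default -9 (900k) block size:
default_compress = compression_memory(9)
default_decompress = decompression_memory(9)
small_decompress = decompression_memory(9, small=True)
```

These reproduce the figures quoted in the text: about 6700k to compress and 4600k to decompress at the default block size, falling to roughly 2300k with \-s.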
Compression and decompression
+speed are virtually unaffected by block size.
+
+Another significant point applies to files which fit in a single
+block -- that means most files you'd encounter using a large
+block size.  The amount of real memory touched is proportional
+to the size of the file, since the file is smaller than a block.
+For example, compressing a file 20,000 bytes long with the flag
+\-9
+will cause the compressor to allocate around
+6700k of memory, but only touch 400k + 20000 * 7 = 540
+kbytes of it.  Similarly, the decompressor will allocate 4600k but
+only touch 100k + 20000 * 5 = 200 kbytes.
+
+Here is a table which summarises the maximum memory usage for
+different block sizes.  Also recorded is the total compressed
+size for 14 files of the Calgary Text Compression Corpus
+totalling 3,141,622 bytes.  This column gives some feel for how
+compression varies with block size.  These figures tend to understate
+the advantage of larger block sizes for larger files, since the
+Corpus is dominated by smaller files.
+
+           Compress   Decompress   Decompress   Corpus
+    Flag     usage      usage       -s usage     Size
+
+     -1      1100k       600k         350k      914704
+     -2      1800k      1100k         600k      877703
+     -3      2500k      1600k         850k      860338
+     -4      3200k      2100k        1100k      846899
+     -5      3900k      2600k        1350k      845160
+     -6      4600k      3100k        1600k      838626
+     -7      5400k      3600k        1850k      834096
+     -8      6000k      4100k        2100k      828642
+     -9      6700k      4600k        2350k      828642
+
+.SH OPTIONS
+.TP
+.B \-c --stdout
+Compress or decompress to standard output.  \-c will decompress
+multiple files to stdout, but will only compress a single file to
+stdout.
+.TP
+.B \-d --decompress
+Force decompression.
+.I Bzip2
+and
+.I bunzip2
+are really the same program, and the decision about whether to
+compress or decompress is made on the basis of which name is
+used.  This flag overrides that mechanism, and forces
+.I bzip2
+to decompress.
+.TP
+.B \-f --compress
+The complement to \-d: forces compression, regardless of the invocation
+name.
+.TP +.B \-t --test +Check integrity of the specified file(s), but don't decompress them. +This really performs a trial decompression and throws away the result, +using the low-memory decompression algorithm (see \-s). +.TP +.B \-k --keep +Keep (don't delete) input files during compression or decompression. +.TP +.B \-s --small +Reduce memory usage, both for compression and decompression. +Files are decompressed using a modified algorithm which only +requires 2.5 bytes per block byte. This means any file can be +decompressed in 2300k of memory, albeit somewhat more slowly than +usual. + +During compression, -s selects a block size of 200k, which limits +memory use to around the same figure, at the expense of your +compression ratio. In short, if your machine is low on memory +(8 megabytes or less), use -s for everything. See +MEMORY MANAGEMENT above. + +.TP +.B \-v --verbose +Verbose mode -- show the compression ratio for each file processed. +Further \-v's increase the verbosity level, spewing out lots of +information which is primarily of interest for diagnostic purposes. +.TP +.B \-L --license +Display the software version, license terms and conditions. +.TP +.B \-V --version +Same as \-L. +.TP +.B \-1 to \-9 +Set the block size to 100 k, 200 k .. 900 k when +compressing. Has no effect when decompressing. +See MEMORY MANAGEMENT above. +.TP +.B \--repetitive-fast +.I bzip2 +injects some small pseudo-random variations +into very repetitive blocks to limit +worst-case performance during compression. +If sorting runs into difficulties, the block +is randomised, and sorting is restarted. +Very roughly, +.I bzip2 +persists for three times as long as a well-behaved input +would take before resorting to randomisation. +This flag makes it give up much sooner. + +.TP +.B \--repetitive-best +Opposite of \--repetitive-fast; try a lot harder before +resorting to randomisation. 
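The effect of the \-1 to \-9 flags can be illustrated with Python's bz2 module, whose compresslevel argument maps to the same 100k-900k block sizes in libbzip2 (a later implementation; the 0.1 sizes may differ marginally). An input larger than 100k is split into several blocks at level 1 but fits fewer, larger blocks at level 9, and the larger block size compresses at least as well:

```python
import bz2

# About 1 MB of moderately repetitive input: split into many 100k
# blocks at level 1, but only a couple of 900k blocks at level 9.
data = (b"abcdefgh" * 128 + b"\n") * 1000  # roughly 1,025,000 bytes

small_blocks = bz2.compress(data, compresslevel=1)  # 100k block size
large_blocks = bz2.compress(data, compresslevel=9)  # 900k block size

# Block size never affects correctness, only size and memory use.
round_trip_ok = bz2.decompress(large_blocks) == data
```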
+ +.SH RECOVERING DATA FROM DAMAGED FILES +.I bzip2 +compresses files in blocks, usually 900kbytes long. +Each block is handled independently. If a media or +transmission error causes a multi-block .bz2 +file to become damaged, +it may be possible to recover data from the undamaged blocks +in the file. + +The compressed representation of each block is delimited by +a 48-bit pattern, which makes it possible to find the block +boundaries with reasonable certainty. Each block also carries +its own 32-bit CRC, so damaged blocks can be +distinguished from undamaged ones. + +.I bzip2recover +is a simple program whose purpose is to search for +blocks in .bz2 files, and write each block out into +its own .bz2 file. You can then use +.I bzip2 -t +to test the integrity of the resulting files, +and decompress those which are undamaged. + +.I bzip2recover +takes a single argument, the name of the damaged file, +and writes a number of files "rec0001file.bz2", "rec0002file.bz2", +etc, containing the extracted blocks. The output filenames +are designed so that the use of wildcards in subsequent processing +-- for example, "bzip2 -dc rec*file.bz2 > recovered_data" -- +lists the files in the "right" order. + +.I bzip2recover +should be of most use dealing with large .bz2 files, as +these will contain many blocks. It is clearly futile to +use it on damaged single-block files, since a damaged +block cannot be recovered. If you wish to minimise +any potential data loss through media or transmission +errors, you might consider compressing with a smaller +block size. + +.SH PERFORMANCE NOTES +The sorting phase of compression gathers together similar strings +in the file. Because of this, files containing very long +runs of repeated symbols, like "aabaabaabaab ..." (repeated +several hundred times) may compress extraordinarily slowly. +You can use the +\-vvvvv +option to monitor progress in great detail, if you want. +Decompression speed is unaffected. 
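The recovery pipeline above can be mimicked with Python's bz2 module: each block that bzip2recover writes out is an independently decompressible stream, concatenating the undamaged ones and decompressing (as "bzip2 -dc rec*file.bz2" does) recovers the data in order, and a damaged block is caught by its CRC. The two literal strings here are stand-ins, not real recovered blocks.

```python
import bz2

# Two independently compressed pieces stand in for the per-block
# files that bzip2recover writes out.
piece1 = bz2.compress(b"first block of recovered data\n")
piece2 = bz2.compress(b"second block of recovered data\n")

# Decompressing the concatenation (like "bzip2 -dc rec*file.bz2")
# yields the pieces back-to-back.
recovered = bz2.decompress(piece1 + piece2)

# Damage is detected: flipping a byte near the end corrupts the
# stream or its CRC, and decompression fails rather than returning
# silently wrong data.
damaged = bytearray(piece2)
damaged[-5] ^= 0xFF
try:
    bz2.decompress(bytes(damaged))
    damage_detected = False
except (OSError, ValueError, EOFError):
    damage_detected = True
```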
+ +Such pathological cases +seem rare in practice, appearing mostly in artificially-constructed +test files, and in low-level disk images. It may be inadvisable to +use +.I bzip2 +to compress the latter. +If you do get a file which causes severe slowness in compression, +try making the block size as small as possible, with flag \-1. + +Incompressible or virtually-incompressible data may decompress +rather more slowly than one would hope. This is due to +a naive implementation of the move-to-front coder. + +.I bzip2 +usually allocates several megabytes of memory to operate in, +and then charges all over it in a fairly random fashion. This +means that performance, both for compressing and decompressing, +is largely determined by the speed +at which your machine can service cache misses. +Because of this, small changes +to the code to reduce the miss rate have been observed to give +disproportionately large performance improvements. +I imagine +.I bzip2 +will perform best on machines with very large caches. + +Test mode (\-t) uses the low-memory decompression algorithm +(\-s). This means test mode does not run as fast as it could; +it could run as fast as the normal decompression machinery. +This could easily be fixed at the cost of some code bloat. + +.SH CAVEATS +I/O error messages are not as helpful as they could be. +.I Bzip2 +tries hard to detect I/O errors and exit cleanly, but the +details of what the problem is sometimes seem rather misleading. + +This manual page pertains to version 0.1 of +.I bzip2. +It may well happen that some future version will +use a different compressed file format. If you try to +decompress, using 0.1, a .bz2 file created with some +future version which uses a different compressed file format, +0.1 will complain that your file "is not a bzip2 file". +If that happens, you should obtain a more recent version +of +.I bzip2 +and use that to decompress the file. + +Wildcard expansion for Windows 95 and NT +is flaky. 
+ +.I bzip2recover +uses 32-bit integers to represent bit positions in +compressed files, so it cannot handle compressed files +more than 512 megabytes long. This could easily be fixed. + +.I bzip2recover +sometimes reports a very small, incomplete final block. +This is spurious and can be safely ignored. + +.SH RELATIONSHIP TO bzip-0.21 +This program is a descendant of the +.I bzip +program, version 0.21, which I released in August 1996. +The primary difference of +.I bzip2 +is its avoidance of the possibly patented algorithms +which were used in 0.21. +.I bzip2 +also brings various useful refinements (\-s, \-t), +uses less memory, decompresses significantly faster, and +has support for recovering data from damaged files. + +Because +.I bzip2 +uses Huffman coding to construct the compressed bitstream, +rather than the arithmetic coding used in 0.21, +the compressed representations generated by the two programs +are incompatible, and they will not interoperate. The change +in suffix from .bz to .bz2 reflects this. It would have been +helpful to at least allow +.I bzip2 +to decompress files created by 0.21, but this would +defeat the primary aim of having a patent-free compressor. + +Huffman coding necessarily involves some coding inefficiency +compared to arithmetic coding. This means that +.I bzip2 +compresses about 1% worse than 0.21, an unfortunate but +unavoidable fact-of-life. On the other hand, decompression +is approximately 50% faster for the same reason, and the +change in file format gave an opportunity to add data-recovery +features. So it is not all bad. + +.SH AUTHOR +Julian Seward, jseward@acm.org. 
+
+The ideas embodied in
+.I bzip
+and
+.I bzip2
+are due to (at least) the following people:
+Michael Burrows and David Wheeler (for the block sorting
+transformation), David Wheeler (again, for the Huffman coder),
+Peter Fenwick (for the structured coding model in 0.21,
+and many refinements),
+and
+Alistair Moffat, Radford Neal and Ian Witten (for the arithmetic
+coder in 0.21).  I am much indebted for their help, support and advice.
+See the file ALGORITHMS in the source distribution for pointers to
+sources of documentation.
+Christian von Roques encouraged me to look for faster
+sorting algorithms, so as to speed up compression.
+Bela Lubkin encouraged me to improve the worst-case
+compression performance.
+Many people sent patches, helped with portability problems,
+lent machines, gave advice and were generally helpful.
+