From 33d134030248633ffa7d60c0a35a783c46da034b Mon Sep 17 00:00:00 2001
From: Julian Seward
Date: Thu, 7 Aug 1997 22:13:13 +0200
Subject: bzip2-0.1

---
 bzip2.1.preformatted | 462 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 462 insertions(+)
 create mode 100644 bzip2.1.preformatted

diff --git a/bzip2.1.preformatted b/bzip2.1.preformatted
new file mode 100644
index 0000000..947dc97
--- /dev/null
+++ b/bzip2.1.preformatted
@@ -0,0 +1,462 @@

bzip2(1)                                                      bzip2(1)


NAME
       bzip2, bunzip2 - a block-sorting file compressor, v0.1
       bzip2recover - recovers data from damaged bzip2 files


SYNOPSIS
       bzip2 [ -cdfkstvVL123456789 ] [ filenames ... ]
       bunzip2 [ -kvsVL ] [ filenames ... ]
       bzip2recover filename


DESCRIPTION
       Bzip2 compresses files using the Burrows-Wheeler block-sorting
       text compression algorithm, and Huffman coding.  Compression is
       generally considerably better than that achieved by more
       conventional LZ77/LZ78-based compressors, and approaches the
       performance of the PPM family of statistical compressors.

       The command-line options are deliberately very similar to those
       of GNU Gzip, but they are not identical.

       Bzip2 expects a list of file names to accompany the command-line
       flags.  Each file is replaced by a compressed version of itself,
       with the name "original_name.bz2".  Each compressed file has the
       same modification date and permissions as the corresponding
       original, so that these properties can be correctly restored at
       decompression time.  File name handling is naive in the sense
       that there is no mechanism for preserving original file names,
       permissions and dates in filesystems which lack these concepts,
       or have serious file name length restrictions, such as MS-DOS.

       Bzip2 and bunzip2 will not overwrite existing files; if you want
       this to happen, you should delete them first.

       If no file names are specified, bzip2 compresses from standard
       input to standard output.  In this case, bzip2 will decline to
       write compressed output to a terminal, as this would be entirely
       incomprehensible and therefore pointless.

       Bunzip2 (or bzip2 -d) decompresses and restores all specified
       files whose names end in ".bz2".  Files without this suffix are
       ignored.  Again, supplying no filenames causes decompression
       from standard input to standard output.

       You can also compress or decompress files to the standard output
       by giving the -c flag.  You can decompress multiple files like
       this, but you may only compress a single file this way, since it
       would otherwise be difficult to separate out the compressed
       representations of the original files.

       Compression is always performed, even if the compressed file is
       slightly larger than the original.  Files of less than about one
       hundred bytes tend to get larger, since the compression
       mechanism has a constant overhead in the region of 50 bytes.
       Random data (including the output of most file compressors) is
       coded at about 8.05 bits per byte, giving an expansion of around
       0.5%.

       As a self-check for your protection, bzip2 uses 32-bit CRCs to
       make sure that the decompressed version of a file is identical
       to the original.  This guards against corruption of the
       compressed data, and against undetected bugs in bzip2 (hopefully
       very unlikely).  The chances of data corruption going undetected
       are microscopic, about one chance in four billion for each file
       processed.  Be aware, though, that the check occurs upon
       decompression, so it can only tell you that something is wrong.
       It can't help you recover the original uncompressed data.  You
       can use bzip2recover to try to recover data from damaged files.

       Return values: 0 for a normal exit, 1 for environmental problems
       (file not found, invalid flags, I/O errors, etc.), 2 to indicate
       a corrupt compressed file, 3 for an internal consistency error
       (e.g. a bug) which caused bzip2 to panic.
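       If you drive bzip2 from another program, these return values are
       what you have to inspect.  The sketch below -- plain C, not part
       of bzip2 itself, with invented messages -- runs "bzip2 -t" (see
       OPTIONS below) on each of its arguments via system() and reports
       the documented meaning of each exit code.

       #include <stdio.h>
       #include <stdlib.h>
       #include <sys/wait.h>

       int main(int argc, char **argv)
       {
           char cmd[1024];
           int i, status;

           for (i = 1; i < argc; i++) {
               /* -t does a trial decompression and discards the result */
               snprintf(cmd, sizeof cmd, "bzip2 -t %s", argv[i]);
               status = system(cmd);
               if (status == -1) { perror("system"); return 1; }
               if (!WIFEXITED(status)) {
                   printf("%s: bzip2 did not exit normally\n", argv[i]);
                   continue;
               }
               switch (WEXITSTATUS(status)) {
               case 0:  printf("%s: ok\n", argv[i]);                      break;
               case 1:  printf("%s: environmental problem\n", argv[i]);   break;
               case 2:  printf("%s: corrupt compressed file\n", argv[i]); break;
               case 3:  printf("%s: internal bzip2 error\n", argv[i]);    break;
               default: printf("%s: unexpected status %d\n",
                               argv[i], WEXITSTATUS(status));             break;
               }
           }
           return 0;
       }

       (File names containing shell metacharacters would need quoting
       before being handed to system(); the sketch ignores that.)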
MEMORY MANAGEMENT
       Bzip2 compresses large files in blocks.  The block size affects
       both the compression ratio achieved, and the amount of memory
       needed both for compression and decompression.  The flags -1
       through -9 specify the block size to be 100,000 bytes through
       900,000 bytes (the default) respectively.  At decompression
       time, the block size used for compression is read from the
       header of the compressed file, and bunzip2 then allocates itself
       just enough memory to decompress the file.  Since block sizes
       are stored in compressed files, it follows that the flags -1 to
       -9 are irrelevant to, and so ignored during, decompression.
       Compression and decompression requirements, in bytes, can be
       estimated as:

            Compression:   400k + ( 7 x block size )

            Decompression: 100k + ( 5 x block size ), or
                           100k + ( 2.5 x block size ) if -s is used

       Larger block sizes give rapidly diminishing marginal returns;
       most of the compression comes from the first two or three
       hundred k of block size, a fact worth bearing in mind when using
       bzip2 on small machines.  It is also important to appreciate
       that the decompression memory requirement is set at compression
       time by the choice of block size.

       For files compressed with the default 900k block size, bunzip2
       will require about 4600 kbytes to decompress.  To support
       decompression of any file on a 4 megabyte machine, bunzip2 has
       an option to decompress using approximately half this amount of
       memory, about 2300 kbytes.  Decompression speed is also halved,
       so you should use this option only where necessary.  The
       relevant flag is -s.

       In general, try to use the largest block size memory constraints
       allow, since that maximises the compression achieved.
       Compression and decompression speed are virtually unaffected by
       block size.

       Another significant point applies to files which fit in a single
       block -- that means most files you'd encounter using a large
       block size.  The amount of real memory touched is proportional
       to the size of the file, since the file is smaller than a block.
       For example, compressing a file 20,000 bytes long with the flag
       -9 will cause the compressor to allocate around 6700k of memory,
       but only touch 400k + 20000 * 7 = 540 kbytes of it.  Similarly,
       the decompressor will allocate 4600k but only touch
       100k + 20000 * 5 = 200 kbytes.
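       For reference, the arithmetic above is easy to wrap in a few
       lines of C.  The helper names below are invented for this sketch
       and are not part of bzip2; the formulas are the ones quoted
       above, with "k" meaning 1000 bytes as in the table that follows.

       #include <stdio.h>

       /* Block size selected by flag -1 .. -9, in bytes. */
       static long block_size(int flag) { return 100000L * flag; }

       /* Compression:   400k + ( 7 x block size )           */
       static long compress_mem(int flag)
       {
           return 400000L + 7 * block_size(flag);
       }

       /* Decompression: 100k + ( 5 x block size ), or       */
       /*                100k + ( 2.5 x block size ) with -s */
       static long decompress_mem(int flag)
       {
           return 100000L + 5 * block_size(flag);
       }

       static long decompress_mem_small(int flag)
       {
           return 100000L + (5 * block_size(flag)) / 2;
       }

       int main(void)
       {
           int flag;
           for (flag = 1; flag <= 9; flag++)
               printf("-%d: compress ~%ldk, decompress ~%ldk (~%ldk with -s)\n",
                      flag, compress_mem(flag) / 1000,
                      decompress_mem(flag) / 1000,
                      decompress_mem_small(flag) / 1000);
           return 0;
       }

       For -9 this prints 6700k, 4600k and 2350k, matching the table
       below.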
       Here is a table which summarises the maximum memory usage for
       different block sizes.  Also recorded is the total compressed
       size for 14 files of the Calgary Text Compression Corpus
       totalling 3,141,622 bytes.  This column gives some feel for how
       compression varies with block size.  These figures tend to
       understate the advantage of larger block sizes for larger files,
       since the Corpus is dominated by smaller files.

                  Compress   Decompress   Decompress   Corpus
           Flag     usage       usage      -s usage      Size

            -1      1100k        600k         350k      914704
            -2      1800k       1100k         600k      877703
            -3      2500k       1600k         850k      860338
            -4      3200k       2100k        1100k      846899
            -5      3900k       2600k        1350k      845160
            -6      4600k       3100k        1600k      838626
            -7      5400k       3600k        1850k      834096
            -8      6000k       4100k        2100k      828642
            -9      6700k       4600k        2350k      828642


OPTIONS
       -c --stdout
              Compress or decompress to standard output.  -c will
              decompress multiple files to stdout, but will only
              compress a single file to stdout.

       -d --decompress
              Force decompression.  Bzip2 and bunzip2 are really the
              same program, and the decision about whether to compress
              or decompress is made on the basis of which name is used.
              This flag overrides that mechanism, and forces bzip2 to
              decompress.

       -f --compress
              The complement to -d: forces compression, regardless of
              the invocation name.

       -t --test
              Check integrity of the specified file(s), but don't
              decompress them.  This really performs a trial
              decompression and throws away the result, using the
              low-memory decompression algorithm (see -s).

       -k --keep
              Keep (don't delete) input files during compression or
              decompression.

       -s --small
              Reduce memory usage, both for compression and
              decompression.  Files are decompressed using a modified
              algorithm which only requires 2.5 bytes per block byte.
              This means any file can be decompressed in 2300k of
              memory, albeit somewhat more slowly than usual.

              During compression, -s selects a block size of 200k,
              which limits memory use to around the same figure, at the
              expense of your compression ratio.  In short, if your
              machine is low on memory (8 megabytes or less), use -s
              for everything.  See MEMORY MANAGEMENT above.

       -v --verbose
              Verbose mode -- show the compression ratio for each file
              processed.  Further -v's increase the verbosity level,
              spewing out lots of information which is primarily of
              interest for diagnostic purposes.

       -L --license
              Display the software version, license terms and
              conditions.

       -V --version
              Same as -L.

       -1 to -9
              Set the block size to 100 k, 200 k ... 900 k when
              compressing.  Has no effect when decompressing.  See
              MEMORY MANAGEMENT above.

       --repetitive-fast
              bzip2 injects some small pseudo-random variations into
              very repetitive blocks to limit worst-case performance
              during compression.  If sorting runs into difficulties,
              the block is randomised, and sorting is restarted.  Very
              roughly, bzip2 persists for three times as long as a
              well-behaved input would take before resorting to
              randomisation.  This flag makes it give up much sooner.

       --repetitive-best
              Opposite of --repetitive-fast; try a lot harder before
              resorting to randomisation.


RECOVERING DATA FROM DAMAGED FILES
       bzip2 compresses files in blocks, usually 900 kbytes long.  Each
       block is handled independently.  If a media or transmission
       error causes a multi-block .bz2 file to become damaged, it may
       be possible to recover data from the undamaged blocks in the
       file.

       The compressed representation of each block is delimited by a
       48-bit pattern, which makes it possible to find the block
       boundaries with reasonable certainty.  Each block also carries
       its own 32-bit CRC, so damaged blocks can be distinguished from
       undamaged ones.
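       The value of that 48-bit delimiter is not given in this manual
       page; later descriptions of the .bz2 format give the block-header
       magic as 0x314159265359, and that assumption is what the sketch
       below uses, along with an assumed MSB-first bit order.  It scans
       a file bit by bit, roughly as bzip2recover must, and prints the
       bit offset at which each candidate block header ends.  It is an
       illustration only, not bzip2recover itself.

       #include <stdio.h>
       #include <stdint.h>

       #define BLOCK_MAGIC 0x314159265359ULL   /* assumed value, see above */
       #define MASK48      0xFFFFFFFFFFFFULL

       int main(int argc, char **argv)
       {
           FILE *f;
           int c, bit;
           uint64_t window = 0;   /* last 48 bits seen */
           uint64_t bitpos = 0;   /* bits read so far  */

           if (argc != 2 || (f = fopen(argv[1], "rb")) == NULL) {
               fprintf(stderr, "usage: findblocks file.bz2\n");
               return 1;
           }
           while ((c = fgetc(f)) != EOF) {
               for (bit = 7; bit >= 0; bit--) {   /* assume MSB-first order */
                   window = ((window << 1) | (uint64_t)((c >> bit) & 1)) & MASK48;
                   bitpos++;
                   if (bitpos >= 48 && window == BLOCK_MAGIC)
                       printf("possible block header ending at bit %llu\n",
                              (unsigned long long)bitpos);
               }
           }
           fclose(f);
           return 0;
       }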
       bzip2recover is a simple program whose purpose is to search for
       blocks in .bz2 files, and write each block out into its own .bz2
       file.  You can then use bzip2 -t to test the integrity of the
       resulting files, and decompress those which are undamaged.

       bzip2recover takes a single argument, the name of the damaged
       file, and writes a number of files "rec0001file.bz2",
       "rec0002file.bz2", etc, containing the extracted blocks.  The
       output filenames are designed so that the use of wildcards in
       subsequent processing -- for example, "bzip2 -dc rec*file.bz2 >
       recovered_data" -- lists the files in the "right" order.

       bzip2recover should be of most use dealing with large .bz2
       files, as these will contain many blocks.  It is clearly futile
       to use it on damaged single-block files, since a damaged block
       cannot be recovered.  If you wish to minimise any potential data
       loss through media or transmission errors, you might consider
       compressing with a smaller block size.


PERFORMANCE NOTES
       The sorting phase of compression gathers together similar
       strings in the file.  Because of this, files containing very
       long runs of repeated symbols, like "aabaabaabaab ..." (repeated
       several hundred times) may compress extraordinarily slowly.  You
       can use the -vvvvv option to monitor progress in great detail,
       if you want.  Decompression speed is unaffected.

       Such pathological cases seem rare in practice, appearing mostly
       in artificially-constructed test files, and in low-level disk
       images.  It may be inadvisable to use bzip2 to compress the
       latter.  If you do get a file which causes severe slowness in
       compression, try making the block size as small as possible,
       with flag -1.

       Incompressible or virtually-incompressible data may decompress
       rather more slowly than one would hope.  This is due to a naive
       implementation of the move-to-front coder.

       bzip2 usually allocates several megabytes of memory to operate
       in, and then charges all over it in a fairly random fashion.
       This means that performance, both for compressing and
       decompressing, is largely determined by the speed at which your
       machine can service cache misses.  Because of this, small
       changes to the code to reduce the miss rate have been observed
       to give disproportionately large performance improvements.  I
       imagine bzip2 will perform best on machines with very large
       caches.

       Test mode (-t) uses the low-memory decompression algorithm (-s),
       so it does not run as fast as the normal decompression machinery
       could.  This could easily be fixed at the cost of some code
       bloat.
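       To experiment with the worst-case behaviour described at the
       start of this section, a few lines of C are enough to generate a
       suitably repetitive test file.  The file name and repeat count
       below are arbitrary choices for illustration, not anything bzip2
       prescribes.

       #include <stdio.h>

       int main(void)
       {
           FILE *f = fopen("repetitive.dat", "w");
           long i;

           if (f == NULL) { perror("repetitive.dat"); return 1; }
           /* ~900,000 bytes of "aab", i.e. about one default-size block */
           for (i = 0; i < 300000; i++)
               fputs("aab", f);
           fclose(f);
           return 0;
       }

       Compressing the result with -1 and with -9, with and without
       --repetitive-fast, gives some feel for how block size and the
       randomisation threshold affect compression time on such input.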
CAVEATS
       I/O error messages are not as helpful as they could be.  Bzip2
       tries hard to detect I/O errors and exit cleanly, but the
       details of what the problem is sometimes seem rather misleading.

       This manual page pertains to version 0.1 of bzip2.  It may well
       happen that some future version will use a different compressed
       file format.  If you try to decompress, using 0.1, a .bz2 file
       created with some future version which uses a different
       compressed file format, 0.1 will complain that your file "is not
       a bzip2 file".  If that happens, you should obtain a more recent
       version of bzip2 and use that to decompress the file.

       Wildcard expansion for Windows 95 and NT is flaky.

       bzip2recover uses 32-bit integers to represent bit positions in
       compressed files, so it cannot handle compressed files more than
       512 megabytes long.  This could easily be fixed.

       bzip2recover sometimes reports a very small, incomplete final
       block.  This is spurious and can be safely ignored.


RELATIONSHIP TO bzip-0.21
       This program is a descendant of the bzip program, version 0.21,
       which I released in August 1996.  The primary difference of
       bzip2 is its avoidance of the possibly patented algorithms which
       were used in 0.21.  bzip2 also brings various useful refinements
       (-s, -t), uses less memory, decompresses significantly faster,
       and has support for recovering data from damaged files.

       Because bzip2 uses Huffman coding to construct the compressed
       bitstream, rather than the arithmetic coding used in 0.21, the
       compressed representations generated by the two programs are
       incompatible, and they will not interoperate.  The change in
       suffix from .bz to .bz2 reflects this.  It would have been
       helpful to at least allow bzip2 to decompress files created by
       0.21, but this would defeat the primary aim of having a
       patent-free compressor.

       Huffman coding necessarily involves some coding inefficiency
       compared to arithmetic coding.  This means that bzip2 compresses
       about 1% worse than 0.21, an unfortunate but unavoidable fact of
       life.  On the other hand, decompression is approximately 50%
       faster for the same reason, and the change in file format gave
       an opportunity to add data-recovery features.  So it is not all
       bad.


AUTHOR
       Julian Seward, jseward@acm.org.

       The ideas embodied in bzip and bzip2 are due to (at least) the
       following people: Michael Burrows and David Wheeler (for the
       block sorting transformation), David Wheeler (again, for the
       Huffman coder), Peter Fenwick (for the structured coding model
       in 0.21, and many refinements), and Alistair Moffat, Radford
       Neal and Ian Witten (for the arithmetic coder in 0.21).  I am
       much indebted for their help, support and advice.  See the file
       ALGORITHMS in the source distribution for pointers to sources of
       documentation.  Christian von Roques encouraged me to look for
       faster sorting algorithms, so as to speed up compression.  Bela
       Lubkin encouraged me to improve the worst-case compression
       performance.  Many people sent patches, helped with portability
       problems, lent machines, gave advice and were generally helpful.