author    Julian Seward <jseward@acm.org>  1997-08-07 22:13:13 +0200
committer Julian Seward <jseward@acm.org>  1997-08-07 22:13:13 +0200
commit    33d134030248633ffa7d60c0a35a783c46da034b (patch)
tree      b760dc34185dccc7054989c1472478574223cc31 /bzip2.1
tag       bzip2-0.1
Diffstat (limited to 'bzip2.1')
 -rw-r--r--  bzip2.1  441
 1 file changed, 441 insertions, 0 deletions
diff --git a/bzip2.1 b/bzip2.1
new file mode 100644
index 0000000..9094c7c
--- /dev/null
+++ b/bzip2.1
@@ -0,0 +1,441 @@
.PU
.TH bzip2 1
.SH NAME
bzip2, bunzip2 \- a block-sorting file compressor, v0.1
.br
bzip2recover \- recovers data from damaged bzip2 files

.SH SYNOPSIS
.ll +8
.B bzip2
.RB [ " \-cdfkstvVL123456789 " ]
[
.I "filenames \&..."
]
.ll -8
.br
.B bunzip2
.RB [ " \-kvsVL " ]
[
.I "filenames \&..."
]
.br
.B bzip2recover
.I "filename"

.SH DESCRIPTION
.I Bzip2
compresses files using the Burrows-Wheeler block-sorting
text compression algorithm, and Huffman coding.
Compression is generally considerably
better than that
achieved by more conventional LZ77/LZ78-based compressors,
and approaches the performance of the PPM family of statistical
compressors.

The command-line options are deliberately very similar to
those of
.I GNU Gzip,
but they are not identical.

.I Bzip2
expects a list of file names to accompany the command-line flags.
Each file is replaced by a compressed version of itself,
with the name "original_name.bz2".
Each compressed file has the same modification date and permissions
as the corresponding original, so that these properties can be
correctly restored at decompression time. File name handling is
naive in the sense that there is no mechanism for preserving
original file names, permissions and dates in filesystems
which lack these concepts, or have serious file name length
restrictions, such as MS-DOS.

.I Bzip2
and
.I bunzip2
will not overwrite existing files; if you want this to happen,
you should delete them first.

If no file names are specified,
.I bzip2
compresses from standard input to standard output.
In this case,
.I bzip2
will decline to write compressed output to a terminal, as
this would be entirely incomprehensible and therefore pointless.

.I Bunzip2
(or
.I bzip2 \-d
) decompresses and restores all specified files whose names
end in ".bz2".
Files without this suffix are ignored.
Again, supplying no filenames
causes decompression from standard input to standard output.

You can also compress or decompress files to
the standard output by giving the \-c flag.
You can decompress multiple files like this, but you may
only compress a single file this way, since it would otherwise
be difficult to separate out the compressed representations of
the original files.

Compression is always performed, even if the compressed file is
slightly larger than the original. Files of less than about
one hundred bytes tend to get larger, since the compression
mechanism has a constant overhead in the region of 50 bytes.
Random data (including the output of most file compressors)
is coded at about 8.05 bits per byte, giving an expansion of
around 0.5%.
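
The expansion figure can be sanity-checked with one line of arithmetic. This sketch is my own illustration, not part of the man page; it works out to slightly over half a percent:

```python
# Random data costs about 8.05 compressed bits per 8-bit input byte,
# so the relative growth is (8.05 - 8) / 8 -- a bit over 0.5%.
BITS_PER_BYTE_COMPRESSED = 8.05

def expansion_percent(bits_per_byte: float = BITS_PER_BYTE_COMPRESSED) -> float:
    """Relative size increase when coding incompressible data."""
    return (bits_per_byte - 8.0) / 8.0 * 100.0

print(f"{expansion_percent():.3f}%")  # 0.625%
```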

As a self-check for your protection,
.I bzip2
uses 32-bit CRCs to make sure that the decompressed
version of a file is identical to the original.
This guards against corruption of the compressed data,
and against undetected bugs in
.I bzip2
(hopefully very unlikely).
The chance of data corruption going undetected is
microscopic, about one chance in four billion
for each file processed. Be aware, though, that the check
occurs upon decompression, so it can only tell you
that something is wrong. It can't help you recover the
original uncompressed data.
You can use
.I bzip2recover
to try to recover data from damaged files.

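The one-in-four-billion figure is simply the number of distinct 32-bit CRC values; a quick check (my own illustration):

```python
# A 32-bit CRC takes 2**32 distinct values, so a corrupted file that
# still happens to produce the recorded CRC does so with probability
# about one in four billion.
crc_values = 2 ** 32
print(crc_values)  # 4294967296, i.e. about four billion
```
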
Return values:
0 for a normal exit,
1 for environmental
problems (file not found, invalid flags, I/O errors, &c),
2 to indicate a corrupt compressed file,
3 for an internal consistency error (e.g., a bug) which caused
.I bzip2
to panic.

.SH MEMORY MANAGEMENT
.I Bzip2
compresses large files in blocks. The block size affects both the
compression ratio achieved, and the amount of memory needed both for
compression and decompression. The flags \-1 through \-9
specify the block size to be 100,000 bytes through 900,000 bytes
(the default) respectively. At decompression time, the block size used for
compression is read from the header of the compressed file, and
.I bunzip2
then allocates itself just enough memory to decompress the file.
Since block sizes are stored in compressed files, it follows that the flags
\-1 to \-9
are irrelevant to, and so ignored during, decompression.
Compression and decompression requirements, in bytes, can be estimated as:

     Compression:   400k + ( 7 x block size )

     Decompression: 100k + ( 5 x block size ), or
.br
     100k + ( 2.5 x block size ) with \-s

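The estimates above can be written as a small helper. This is my own sketch; the function names are invented, but the constants come straight from the formulas:

```python
# Estimated memory use in kbytes, from the formulas above.
# block_100k is the digit of the -1 .. -9 flag (block size / 100k).
def compress_kbytes(block_100k: int) -> int:
    return 400 + 7 * (100 * block_100k)

def decompress_kbytes(block_100k: int, small: bool = False) -> int:
    # with -s, only 2.5 bytes per block byte are needed
    factor = 2.5 if small else 5
    return int(100 + factor * (100 * block_100k))

print(compress_kbytes(9))                # 6700
print(decompress_kbytes(9))              # 4600
print(decompress_kbytes(9, small=True))  # 2350
```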
Larger block sizes give rapidly diminishing marginal returns; most
of the
compression comes from the first two or three hundred k of block size,
a fact worth bearing in mind when using
.I bzip2
on small machines. It is also important to appreciate that the
decompression memory requirement is set at compression time by the
choice of block size.

For files compressed with the default 900k block size,
.I bunzip2
will require about 4600 kbytes to decompress.
To support decompression of any file on a 4 megabyte machine,
.I bunzip2
has an option to decompress using approximately half this
amount of memory, about 2300 kbytes. Decompression speed is
also halved, so you should use this option only where necessary.
The relevant flag is \-s.

In general, try to use the largest block size
memory constraints allow, since that maximises the compression
achieved. Compression and decompression
speed are virtually unaffected by block size.

Another significant point applies to files which fit in a single
block -- that means most files you'd encounter using a large
block size. The amount of real memory touched is proportional
to the size of the file, since the file is smaller than a block.
For example, compressing a file 20,000 bytes long with the flag
\-9
will cause the compressor to allocate around
6700k of memory, but only touch 400k + 20000 * 7 = 540
kbytes of it. Similarly, the decompressor will allocate 4600k but
only touch 100k + 20000 * 5 = 200 kbytes.

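The allocate-versus-touch arithmetic for the 20,000-byte example works out as follows (an illustrative sketch of my own, figures in kbytes):

```python
# Memory actually touched is driven by the file size, not the block
# size: the 400k/100k constants plus 7 (compress) or 5 (decompress)
# bytes per input byte, as in the example above.
file_bytes = 20_000

compress_touched_k = 400 + file_bytes * 7 // 1000
decompress_touched_k = 100 + file_bytes * 5 // 1000

print(compress_touched_k)    # 540
print(decompress_touched_k)  # 200
```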
Here is a table which summarises the maximum memory usage for
different block sizes. Also recorded is the total compressed
size for 14 files of the Calgary Text Compression Corpus
totalling 3,141,622 bytes. This column gives some feel for how
compression varies with block size. These figures tend to understate
the advantage of larger block sizes for larger files, since the
Corpus is dominated by smaller files.

          Compress   Decompress   Decompress   Corpus
   Flag     usage      usage       -s usage     Size

    -1      1100k       600k         350k      914704
    -2      1800k      1100k         600k      877703
    -3      2500k      1600k         850k      860338
    -4      3200k      2100k        1100k      846899
    -5      3900k      2600k        1350k      845160
    -6      4600k      3100k        1600k      838626
    -7      5400k      3600k        1850k      834096
    -8      6000k      4100k        2100k      828642
    -9      6700k      4600k        2350k      828642

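The first three columns of the table follow directly from the formulas in this section; this cross-check is my own sketch, not part of the page:

```python
# Regenerate the usage columns from the 400k + 7b, 100k + 5b and
# 100k + 2.5b formulas; every row matches the table above.
for flag in range(1, 10):
    block_k = 100 * flag
    compress = 400 + 7 * block_k          # compress usage
    decompress = 100 + 5 * block_k        # decompress usage
    small = int(100 + 2.5 * block_k)      # decompress -s usage
    print(f"-{flag}  {compress}k  {decompress}k  {small}k")
```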
.SH OPTIONS
.TP
.B \-c --stdout
Compress or decompress to standard output. \-c will decompress
multiple files to stdout, but will only compress a single file to
stdout.
.TP
.B \-d --decompress
Force decompression.
.I Bzip2
and
.I bunzip2
are really the same program, and the decision about whether to
compress or decompress is done on the basis of which name is
used. This flag overrides that mechanism, and forces
.I bzip2
to decompress.
.TP
.B \-f --compress
The complement to \-d: forces compression, regardless of the invocation
name.
.TP
.B \-t --test
Check integrity of the specified file(s), but don't decompress them.
This really performs a trial decompression and throws away the result,
using the low-memory decompression algorithm (see \-s).
.TP
.B \-k --keep
Keep (don't delete) input files during compression or decompression.
.TP
.B \-s --small
Reduce memory usage, both for compression and decompression.
Files are decompressed using a modified algorithm which only
requires 2.5 bytes per block byte. This means any file can be
decompressed in 2300k of memory, albeit somewhat more slowly than
usual.

During compression, \-s selects a block size of 200k, which limits
memory use to around the same figure, at the expense of your
compression ratio. In short, if your machine is low on memory
(8 megabytes or less), use \-s for everything. See
MEMORY MANAGEMENT above.

.TP
.B \-v --verbose
Verbose mode -- show the compression ratio for each file processed.
Further \-v's increase the verbosity level, spewing out lots of
information which is primarily of interest for diagnostic purposes.
.TP
.B \-L --license
Display the software version, license terms and conditions.
.TP
.B \-V --version
Same as \-L.
.TP
.B \-1 to \-9
Set the block size to 100 k, 200 k .. 900 k when
compressing. Has no effect when decompressing.
See MEMORY MANAGEMENT above.
.TP
.B \--repetitive-fast
.I bzip2
injects some small pseudo-random variations
into very repetitive blocks to limit
worst-case performance during compression.
If sorting runs into difficulties, the block
is randomised, and sorting is restarted.
Very roughly,
.I bzip2
persists for three times as long as a well-behaved input
would take before resorting to randomisation.
This flag makes it give up much sooner.

.TP
.B \--repetitive-best
Opposite of \--repetitive-fast; try a lot harder before
resorting to randomisation.

.SH RECOVERING DATA FROM DAMAGED FILES
.I bzip2
compresses files in blocks, usually 900 kbytes long.
Each block is handled independently. If a media or
transmission error causes a multi-block .bz2
file to become damaged,
it may be possible to recover data from the undamaged blocks
in the file.

The compressed representation of each block is delimited by
a 48-bit pattern, which makes it possible to find the block
boundaries with reasonable certainty. Each block also carries
its own 32-bit CRC, so damaged blocks can be
distinguished from undamaged ones.

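A block-boundary search of this kind can be sketched as a bit-by-bit scan for a 48-bit pattern. This is my own illustration of the idea, not bzip2recover's code, and MAGIC below is a made-up placeholder value, not the real delimiter (which is fixed by the .bz2 format and need not be byte-aligned):

```python
# Scan a byte stream bit-by-bit for a 48-bit delimiter pattern,
# in the spirit of bzip2recover. MAGIC is a hypothetical value.
MAGIC = 0x123456789ABC          # placeholder 48-bit delimiter
MASK = (1 << 48) - 1

def find_delimiters(data: bytes) -> list[int]:
    """Return bit offsets at which the 48-bit MAGIC pattern ends."""
    window = 0
    hits = []
    for bit_pos in range(len(data) * 8):
        # extract bits MSB-first, so the pattern can sit at any offset
        bit = (data[bit_pos // 8] >> (7 - bit_pos % 8)) & 1
        window = ((window << 1) | bit) & MASK
        if bit_pos >= 47 and window == MAGIC:
            hits.append(bit_pos)
    return hits

# MAGIC here starts 4 bits into the stream, i.e. not byte-aligned:
stream = bytes([0x01, 0x23, 0x45, 0x67, 0x89, 0xAB, 0xC0])
print(find_delimiters(stream))  # [51]
```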
.I bzip2recover
is a simple program whose purpose is to search for
blocks in .bz2 files, and write each block out into
its own .bz2 file. You can then use
.I bzip2 \-t
to test the integrity of the resulting files,
and decompress those which are undamaged.

.I bzip2recover
takes a single argument, the name of the damaged file,
and writes a number of files "rec0001file.bz2", "rec0002file.bz2",
etc., containing the extracted blocks. The output filenames
are designed so that the use of wildcards in subsequent processing
-- for example, "bzip2 -dc rec*file.bz2 > recovered_data" --
lists the files in the "right" order.

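The "right" order comes from the zero-padded numbering: lexicographic wildcard expansion then coincides with numeric order. A small sketch of my own to show why:

```python
# Zero-padded counters ("rec0001", "rec0002", ...) sort the same way
# lexicographically as they do numerically, which is what the shell's
# wildcard expansion relies on.
names = [f"rec{i:04d}file.bz2" for i in (10, 2, 1)]
print(sorted(names))
# ['rec0001file.bz2', 'rec0002file.bz2', 'rec0010file.bz2']
```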
.I bzip2recover
should be of most use dealing with large .bz2 files, as
these will contain many blocks. It is clearly futile to
use it on damaged single-block files, since a damaged
block cannot be recovered. If you wish to minimise
any potential data loss through media or transmission
errors, you might consider compressing with a smaller
block size.

.SH PERFORMANCE NOTES
The sorting phase of compression gathers together similar strings
in the file. Because of this, files containing very long
runs of repeated symbols, like "aabaabaabaab ..." (repeated
several hundred times) may compress extraordinarily slowly.
You can use the
\-vvvvv
option to monitor progress in great detail, if you want.
Decompression speed is unaffected.

Such pathological cases
seem rare in practice, appearing mostly in artificially-constructed
test files, and in low-level disk images. It may be inadvisable to
use
.I bzip2
to compress the latter.
If you do get a file which causes severe slowness in compression,
try making the block size as small as possible, with flag \-1.

Incompressible or virtually-incompressible data may decompress
rather more slowly than one would hope. This is due to
a naive implementation of the move-to-front coder.

.I bzip2
usually allocates several megabytes of memory to operate in,
and then charges all over it in a fairly random fashion. This
means that performance, both for compressing and decompressing,
is largely determined by the speed
at which your machine can service cache misses.
Because of this, small changes
to the code to reduce the miss rate have been observed to give
disproportionately large performance improvements.
I imagine
.I bzip2
will perform best on machines with very large caches.

Test mode (\-t) uses the low-memory decompression algorithm
(\-s). This means test mode does not run as fast as it could;
in principle it could run as fast as the normal decompression machinery.
This could easily be fixed at the cost of some code bloat.

.SH CAVEATS
I/O error messages are not as helpful as they could be.
.I Bzip2
tries hard to detect I/O errors and exit cleanly, but the
details of what the problem is sometimes seem rather misleading.

This manual page pertains to version 0.1 of
.I bzip2.
It may well happen that some future version will
use a different compressed file format. If you try to
decompress, using 0.1, a .bz2 file created with some
future version which uses a different compressed file format,
0.1 will complain that your file "is not a bzip2 file".
If that happens, you should obtain a more recent version
of
.I bzip2
and use that to decompress the file.

Wildcard expansion for Windows 95 and NT
is flaky.

.I bzip2recover
uses 32-bit integers to represent bit positions in
compressed files, so it cannot handle compressed files
more than 512 megabytes long. This could easily be fixed.

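The 512-megabyte ceiling follows directly from using 32-bit bit positions; a quick check of the arithmetic (my own sketch):

```python
# A 32-bit bit offset can address 2**32 bits; at 8 bits per byte and
# 2**20 bytes per megabyte, that is exactly 512 megabytes.
max_bits = 2 ** 32
max_megabytes = max_bits // 8 // 2 ** 20
print(max_megabytes)  # 512
```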
.I bzip2recover
sometimes reports a very small, incomplete final block.
This is spurious and can be safely ignored.

.SH RELATIONSHIP TO bzip-0.21
This program is a descendant of the
.I bzip
program, version 0.21, which I released in August 1996.
The primary difference of
.I bzip2
is its avoidance of the possibly patented algorithms
which were used in 0.21.
.I bzip2
also brings various useful refinements (\-s, \-t),
uses less memory, decompresses significantly faster, and
has support for recovering data from damaged files.

Because
.I bzip2
uses Huffman coding to construct the compressed bitstream,
rather than the arithmetic coding used in 0.21,
the compressed representations generated by the two programs
are incompatible, and they will not interoperate. The change
in suffix from .bz to .bz2 reflects this. It would have been
helpful to at least allow
.I bzip2
to decompress files created by 0.21, but this would
defeat the primary aim of having a patent-free compressor.

Huffman coding necessarily involves some coding inefficiency
compared to arithmetic coding. This means that
.I bzip2
compresses about 1% worse than 0.21, an unfortunate but
unavoidable fact of life. On the other hand, decompression
is approximately 50% faster for the same reason, and the
change in file format gave an opportunity to add data-recovery
features. So it is not all bad.

.SH AUTHOR
Julian Seward, jseward@acm.org.

The ideas embodied in
.I bzip
and
.I bzip2
are due to (at least) the following people:
Michael Burrows and David Wheeler (for the block sorting
transformation), David Wheeler (again, for the Huffman coder),
Peter Fenwick (for the structured coding model in 0.21,
and many refinements),
and
Alistair Moffat, Radford Neal and Ian Witten (for the arithmetic
coder in 0.21). I am much indebted for their help, support and advice.
See the file ALGORITHMS in the source distribution for pointers to
sources of documentation.
Christian von Roques encouraged me to look for faster
sorting algorithms, so as to speed up compression.
Bela Lubkin encouraged me to improve the worst-case
compression performance.
Many people sent patches, helped with portability problems,
lent machines, gave advice and were generally helpful.