Diffstat (limited to 'bzip2.txt')
-rw-r--r-- bzip2.txt 336
1 files changed, 178 insertions, 158 deletions
diff --git a/bzip2.txt b/bzip2.txt
index 898dfe8..da23c64 100644
--- a/bzip2.txt
+++ b/bzip2.txt
@@ -1,22 +1,20 @@
1 1
2 bzip2(1) bzip2(1)
3
4 2
5 NAME 3 NAME
6 bzip2, bunzip2 - a block-sorting file compressor, v0.9.0 4 bzip2, bunzip2 - a block-sorting file compressor, v0.9.5
7 bzcat - decompresses files to stdout 5 bzcat - decompresses files to stdout
8 bzip2recover - recovers data from damaged bzip2 files 6 bzip2recover - recovers data from damaged bzip2 files
9 7
10 8
11 SYNOPSIS 9 SYNOPSIS
12 bzip2 [ -cdfkstvzVL123456789 ] [ filenames ... ] 10 bzip2 [ -cdfkqstvzVL123456789 ] [ filenames ... ]
13 bunzip2 [ -fkvsVL ] [ filenames ... ] 11 bunzip2 [ -fkvsVL ] [ filenames ... ]
14 bzcat [ -s ] [ filenames ... ] 12 bzcat [ -s ] [ filenames ... ]
15 bzip2recover filename 13 bzip2recover filename
16 14
17 15
18 DESCRIPTION 16 DESCRIPTION
19 bzip2 compresses files using the Burrows-Wheeler block- 17 bzip2 compresses files using the Burrows-Wheeler block
20 sorting text compression algorithm, and Huffman coding. 18 sorting text compression algorithm, and Huffman coding.
21 Compression is generally considerably better than that 19 Compression is generally considerably better than that
22 achieved by more conventional LZ77/LZ78-based compressors, 20 achieved by more conventional LZ77/LZ78-based compressors,
@@ -24,22 +22,22 @@ DESCRIPTION
24 tistical compressors. 22 tistical compressors.
25 23
26 The command-line options are deliberately very similar to 24 The command-line options are deliberately very similar to
27 those of GNU Gzip, but they are not identical. 25 those of GNU gzip, but they are not identical.
28 26
29 bzip2 expects a list of file names to accompany the com- 27 bzip2 expects a list of file names to accompany the com-
30 mand-line flags. Each file is replaced by a compressed 28 mand-line flags. Each file is replaced by a compressed
31 version of itself, with the name "original_name.bz2". 29 version of itself, with the name "original_name.bz2".
32 Each compressed file has the same modification date and 30 Each compressed file has the same modification date, per-
33 permissions as the corresponding original, so that these 31 missions, and, when possible, ownership as the correspond-
34 properties can be correctly restored at decompression 32 ing original, so that these properties can be correctly
35 time. File name handling is naive in the sense that there 33 restored at decompression time. File name handling is
36 is no mechanism for preserving original file names, per- 34 naive in the sense that there is no mechanism for preserv-
37 missions and dates in filesystems which lack these con- 35 ing original file names, permissions, ownerships or dates
38 cepts, or have serious file name length restrictions, such 36 in filesystems which lack these concepts, or have serious
39 as MS-DOS. 37 file name length restrictions, such as MS-DOS.
40 38
41 bzip2 and bunzip2 will by default not overwrite existing 39 bzip2 and bunzip2 will by default not overwrite existing
42 files; if you want this to happen, specify the -f flag. 40 files. If you want this to happen, specify the -f flag.
43 41
44 If no file names are specified, bzip2 compresses from 42 If no file names are specified, bzip2 compresses from
45 standard input to standard output. In this case, bzip2 43 standard input to standard output. In this case, bzip2
@@ -47,10 +45,25 @@ DESCRIPTION
47 this would be entirely incomprehensible and therefore 45 this would be entirely incomprehensible and therefore
48 pointless. 46 pointless.
49 47
50 bunzip2 (or bzip2 -d ) decompresses and restores all spec- 48 bunzip2 (or bzip2 -d) decompresses all specified files.
51 ified files whose names end in ".bz2". Files without this 49 Files which were not created by bzip2 will be detected and
52 suffix are ignored. Again, supplying no filenames causes 50 ignored, and a warning issued. bzip2 attempts to guess
53 decompression from standard input to standard output. 51 the filename for the decompressed file from that of the
52 compressed file as follows:
53
54 filename.bz2 becomes filename
55 filename.bz becomes filename
56 filename.tbz2 becomes filename.tar
57 filename.tbz becomes filename.tar
58 anyothername becomes anyothername.out
59
60 If the file does not end in one of the recognised endings,
61 .bz2, .bz, .tbz2 or .tbz, bzip2 complains that it cannot
62 guess the name of the original file, and uses the original
63 name with .out appended.
64
65 As with compression, supplying no filenames causes decom-
66 pression from standard input to standard output.
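The suffix table added in this hunk can be sketched as a small lookup. This is only an illustration of the documented mapping, not bzip2's own code; the names `SUFFIX_MAP` and `guess_output_name` are hypothetical.

```python
# Hypothetical helper mirroring the documented suffix table.
# Each recognised ending maps to what replaces it in the output name.
SUFFIX_MAP = [
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
    (".bz2", ""),
    (".bz", ""),
]


def guess_output_name(name: str) -> str:
    """Return the decompressed file name the man page describes."""
    for suffix, replacement in SUFFIX_MAP:
        if name.endswith(suffix):
            return name[: -len(suffix)] + replacement
    # No recognised ending: the man page says ".out" is appended.
    return name + ".out"


for example in ("filename.bz2", "filename.tbz", "anyothername"):
    print(example, "becomes", guess_output_name(example))
```

The sketch covers only the naming rule; the warning bzip2 prints when it cannot guess the original name is not modelled.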
54 67
55 bunzip2 will correctly decompress a file which is the con- 68 bunzip2 will correctly decompress a file which is the con-
56 catenation of two or more compressed files. The result is 69 catenation of two or more compressed files. The result is
@@ -58,19 +71,24 @@ DESCRIPTION
58 Integrity testing (-t) of concatenated compressed files is 71 Integrity testing (-t) of concatenated compressed files is
59 also supported. 72 also supported.
60 73
61 You can also compress or decompress files to the standard 74 You can also compress or decompress files to the standard
62 output by giving the -c flag. Multiple files may be com- 75 output by giving the -c flag. Multiple files may be com-
63 pressed and decompressed like this. The resulting outputs 76 pressed and decompressed like this. The resulting outputs
64 are fed sequentially to stdout. Compression of multiple 77 are fed sequentially to stdout. Compression of multiple
65 files in this manner generates a stream containing multi- 78 files in this manner generates a stream containing multi-
66 ple compressed file representations. Such a stream can be 79 ple compressed file representations. Such a stream can be
67 decompressed correctly only by bzip2 version 0.9.0 or 80 decompressed correctly only by bzip2 version 0.9.0 or
68 later. Earlier versions of bzip2 will stop after decom- 81 later. Earlier versions of bzip2 will stop after decom-
69 pressing the first file in the stream. 82 pressing the first file in the stream.
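The multi-stream behaviour described here can be tried without the command-line tool. The sketch below uses Python's standard `bz2` module as a stand-in (an assumption of this note, not something the man page mentions); `bz2.decompress` accepts concatenated streams in Python 3.3 and later.

```python
import bz2

# Compress two inputs separately, as `bzip2 -c a b` would emit two streams.
part_a = bz2.compress(b"first file\n")
part_b = bz2.compress(b"second file\n")

# Concatenating the compressed streams yields one multi-stream blob.
stream = part_a + part_b

# Decompressing the blob returns the concatenation of the original inputs.
result = bz2.decompress(stream)
print(result)
```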
70 83
71 bzcat (or bzip2 -dc ) decompresses all specified files to 84 bzcat (or bzip2 -dc) decompresses all specified files to
72 the standard output. 85 the standard output.
73 86
87 bzip2 will read arguments from the environment variables
88 BZIP2 and BZIP, in that order, and will process them
89 before any arguments read from the command line. This
90 gives a convenient way to supply default arguments.
91
74 Compression is always performed, even if the compressed 92 Compression is always performed, even if the compressed
75 file is slightly larger than the original. Files of less 93 file is slightly larger than the original. Files of less
76 than about one hundred bytes tend to get larger, since the 94 than about one hundred bytes tend to get larger, since the
@@ -87,98 +105,19 @@ DESCRIPTION
87 corruption going undetected is microscopic, about one 105 corruption going undetected is microscopic, about one
88 chance in four billion for each file processed. Be aware, 106 chance in four billion for each file processed. Be aware,
89 though, that the check occurs upon decompression, so it 107 though, that the check occurs upon decompression, so it
90 can only tell you that that something is wrong. It can't 108 can only tell you that something is wrong. It can't help
91 help you recover the original uncompressed data. You can 109 you recover the original uncompressed data. You can use
92 use bzip2recover to try to recover data from damaged 110 bzip2recover to try to recover data from damaged files.
93 files.
94 111
95 Return values: 0 for a normal exit, 1 for environmental 112 Return values: 0 for a normal exit, 1 for environmental
96 problems (file not found, invalid flags, I/O errors, &c), 113 problems (file not found, invalid flags, I/O errors, &c),
97 2 to indicate a corrupt compressed file, 3 for an internal 114 2 to indicate a corrupt compressed file, 3 for an internal
98 consistency error (eg, bug) which caused bzip2 to panic. 115 consistency error (eg, bug) which caused bzip2 to panic.
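A caller scripting around bzip2 might translate these return values roughly as follows. This is a minimal sketch: `describe_exit_status` and `check_archive` are hypothetical names, and `check_archive` assumes a `bzip2` binary is available on the PATH.

```python
import subprocess

# Meanings taken from the "Return values" paragraph of the man page.
EXIT_MEANINGS = {
    0: "normal exit",
    1: "environmental problem (file not found, invalid flags, I/O error)",
    2: "corrupt compressed file",
    3: "internal consistency error (bzip2 panicked)",
}


def describe_exit_status(status: int) -> str:
    """Map a bzip2 exit status to the documented description."""
    return EXIT_MEANINGS.get(status, f"undocumented status {status}")


def check_archive(path: str) -> str:
    """Run the integrity test (-t) on path and describe the outcome."""
    completed = subprocess.run(["bzip2", "-t", path], capture_output=True)
    return describe_exit_status(completed.returncode)
```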
99 116
100 117
101 MEMORY MANAGEMENT
102 Bzip2 compresses large files in blocks. The block size
103 affects both the compression ratio achieved, and the
104 amount of memory needed both for compression and decom-
105 pression. The flags -1 through -9 specify the block size
106 to be 100,000 bytes through 900,000 bytes (the default)
107 respectively. At decompression-time, the block size used
108 for compression is read from the header of the compressed
109 file, and bunzip2 then allocates itself just enough memory
110 to decompress the file. Since block sizes are stored in
111 compressed files, it follows that the flags -1 to -9 are
112 irrelevant to and so ignored during decompression.
113
114 Compression and decompression requirements, in bytes, can
115 be estimated as:
116
117 Compression: 400k + ( 7 x block size )
118
119 Decompression: 100k + ( 4 x block size ), or
120 100k + ( 2.5 x block size )
121
122 Larger block sizes give rapidly diminishing marginal
123 returns; most of the compression comes from the first two
124 or three hundred k of block size, a fact worth bearing in
125 mind when using bzip2 on small machines. It is also
126 important to appreciate that the decompression memory
127 requirement is set at compression-time by the choice of
128 block size.
129
130 For files compressed with the default 900k block size,
131 bunzip2 will require about 3700 kbytes to decompress. To
132 support decompression of any file on a 4 megabyte machine,
133 bunzip2 has an option to decompress using approximately
134 half this amount of memory, about 2300 kbytes. Decompres-
135 sion speed is also halved, so you should use this option
136 only where necessary. The relevant flag is -s.
137
138 In general, try and use the largest block size memory con-
139 straints allow, since that maximises the compression
140 achieved. Compression and decompression speed are virtu-
141 ally unaffected by block size.
142
143 Another significant point applies to files which fit in a
144 single block -- that means most files you'd encounter
145 using a large block size. The amount of real memory
146 touched is proportional to the size of the file, since the
147 file is smaller than a block. For example, compressing a
148 file 20,000 bytes long with the flag -9 will cause the
149 compressor to allocate around 6700k of memory, but only
150 touch 400k + 20000 * 7 = 540 kbytes of it. Similarly, the
151 decompressor will allocate 3700k but only touch 100k +
152 20000 * 4 = 180 kbytes.
153
154 Here is a table which summarises the maximum memory usage
155 for different block sizes. Also recorded is the total
156 compressed size for 14 files of the Calgary Text Compres-
157 sion Corpus totalling 3,141,622 bytes. This column gives
158 some feel for how compression varies with block size.
159 These figures tend to understate the advantage of larger
160 block sizes for larger files, since the Corpus is domi-
161 nated by smaller files.
162
163 Compress Decompress Decompress Corpus
164 Flag usage usage -s usage Size
165
166 -1 1100k 500k 350k 914704
167 -2 1800k 900k 600k 877703
168 -3 2500k 1300k 850k 860338
169 -4 3200k 1700k 1100k 846899
170 -5 3900k 2100k 1350k 845160
171 -6 4600k 2500k 1600k 838626
172 -7 5400k 2900k 1850k 834096
173 -8 6000k 3300k 2100k 828642
174 -9 6700k 3700k 2350k 828642
175
176
177 OPTIONS 118 OPTIONS
178 -c --stdout 119 -c --stdout
179 Compress or decompress to standard output. -c will 120 Compress or decompress to standard output.
180 decompress multiple files to stdout, but will only
181 compress a single file to stdout.
182 121
183 -d --decompress 122 -d --decompress
184 Force decompression. bzip2, bunzip2 and bzcat are 123 Force decompression. bzip2, bunzip2 and bzcat are
@@ -198,7 +137,9 @@ OPTIONS
198 137
199 -f --force 138 -f --force
200 Force overwrite of output files. Normally, bzip2 139 Force overwrite of output files. Normally, bzip2
201 will not overwrite existing output files. 140 will not overwrite existing output files. Also
141 forces bzip2 to break hard links to files, which it
142 otherwise wouldn't do.
202 143
203 -k --keep 144 -k --keep
204 Keep (don't delete) input files during compression 145 Keep (don't delete) input files during compression
@@ -217,7 +158,12 @@ OPTIONS
217 figure, at the expense of your compression ratio. 158 figure, at the expense of your compression ratio.
218 In short, if your machine is low on memory (8 159 In short, if your machine is low on memory (8
219 megabytes or less), use -s for everything. See 160 megabytes or less), use -s for everything. See
220 MEMORY MANAGEMENT above. 161 MEMORY MANAGEMENT below.
162
163 -q --quiet
164 Suppress non-essential warning messages. Messages
165 pertaining to I/O errors and other critical events
166 will not be suppressed.
221 167
222 -v --verbose 168 -v --verbose
223 Verbose mode -- show the compression ratio for each 169 Verbose mode -- show the compression ratio for each
@@ -232,21 +178,96 @@ OPTIONS
232 -1 to -9 178 -1 to -9
233 Set the block size to 100 k, 200 k .. 900 k when 179 Set the block size to 100 k, 200 k .. 900 k when
234 compressing. Has no effect when decompressing. 180 compressing. Has no effect when decompressing.
235 See MEMORY MANAGEMENT above. 181 See MEMORY MANAGEMENT below.
236 182
237 --repetitive-fast 183 -- Treats all subsequent arguments as file names, even
238 bzip2 injects some small pseudo-random variations 184 if they start with a dash. This is so you can han-
239 into very repetitive blocks to limit worst-case 185 dle files with names beginning with a dash, for
240 performance during compression. If sorting runs 186 example: bzip2 -- -myfilename.
241 into difficulties, the block is randomised, and
242 sorting is restarted. Very roughly, bzip2 persists
243 for three times as long as a well-behaved input
244 would take before resorting to randomisation. This
245 flag makes it give up much sooner.
246 187
247 --repetitive-best 188 --repetitive-fast --repetitive-best
248 Opposite of --repetitive-fast; try a lot harder 189 These flags are redundant in versions 0.9.5 and
249 before resorting to randomisation. 190 above. They provided some coarse control over the
191 behaviour of the sorting algorithm in earlier ver-
192 sions, which was sometimes useful. 0.9.5 and above
193 have an improved algorithm which renders these
194 flags irrelevant.
195
196
197 MEMORY MANAGEMENT
198 bzip2 compresses large files in blocks. The block size
199 affects both the compression ratio achieved, and the
200 amount of memory needed for compression and decompression.
201 The flags -1 through -9 specify the block size to be
202 100,000 bytes through 900,000 bytes (the default) respec-
203 tively. At decompression time, the block size used for
204 compression is read from the header of the compressed
205 file, and bunzip2 then allocates itself just enough memory
206 to decompress the file. Since block sizes are stored in
207 compressed files, it follows that the flags -1 to -9 are
208 irrelevant to and so ignored during decompression.
209
210 Compression and decompression requirements, in bytes, can
211 be estimated as:
212
213 Compression: 400k + ( 8 x block size )
214
215 Decompression: 100k + ( 4 x block size ), or
216 100k + ( 2.5 x block size )
217
218 Larger block sizes give rapidly diminishing marginal
219 returns. Most of the compression comes from the first two
220 or three hundred k of block size, a fact worth bearing in
221 mind when using bzip2 on small machines. It is also
222 important to appreciate that the decompression memory
223 requirement is set at compression time by the choice of
224 block size.
225
226 For files compressed with the default 900k block size,
227 bunzip2 will require about 3700 kbytes to decompress. To
228 support decompression of any file on a 4 megabyte machine,
229 bunzip2 has an option to decompress using approximately
230 half this amount of memory, about 2300 kbytes. Decompres-
231 sion speed is also halved, so you should use this option
232 only where necessary. The relevant flag is -s.
233
234 In general, try and use the largest block size memory con-
235 straints allow, since that maximises the compression
236 achieved. Compression and decompression speed are virtu-
237 ally unaffected by block size.
238
239 Another significant point applies to files which fit in a
240 single block -- that means most files you'd encounter
241 using a large block size. The amount of real memory
242 touched is proportional to the size of the file, since the
243 file is smaller than a block. For example, compressing a
244 file 20,000 bytes long with the flag -9 will cause the
245 compressor to allocate around 7600k of memory, but only
246 touch 400k + 20000 * 8 = 560 kbytes of it. Similarly, the
247 decompressor will allocate 3700k but only touch 100k +
248 20000 * 4 = 180 kbytes.
249
250 Here is a table which summarises the maximum memory usage
251 for different block sizes. Also recorded is the total
252 compressed size for 14 files of the Calgary Text Compres-
253 sion Corpus totalling 3,141,622 bytes. This column gives
254 some feel for how compression varies with block size.
255 These figures tend to understate the advantage of larger
256 block sizes for larger files, since the Corpus is domi-
257 nated by smaller files.
258
259 Compress Decompress Decompress Corpus
260 Flag usage usage -s usage Size
261
262 -1 1200k 500k 350k 914704
263 -2 2000k 900k 600k 877703
264 -3 2800k 1300k 850k 860338
265 -4 3600k 1700k 1100k 846899
266 -5 4400k 2100k 1350k 845160
267 -6 5200k 2500k 1600k 838626
268 -7 6100k 2900k 1850k 834096
269 -8 6800k 3300k 2100k 828642
270 -9 7600k 3700k 2350k 828642
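The revised estimates (8 x block size for compression in 0.9.5, where 0.9.0 used 7 x) can be checked with a few lines of arithmetic. The sketch assumes "k" means 1000 bytes, which matches the worked example of 400k + 20000 * 8 = 560 kbytes; the function names are hypothetical.

```python
def compress_bytes(block_size: int) -> int:
    """Estimated compression memory: 400k + 8 x block size."""
    return 400_000 + 8 * block_size


def decompress_bytes(block_size: int, small: bool = False) -> int:
    """Estimated decompression memory: 100k + 4 x block size,
    or 100k + 2.5 x block size when the -s flag is used."""
    if small:
        # 2.5 x block size, kept in integer arithmetic.
        return 100_000 + 5 * block_size // 2
    return 100_000 + 4 * block_size


# Print the estimates for flags -1 through -9, in units of 1000 bytes.
for level in range(1, 10):
    block = level * 100_000
    print(
        f"-{level}",
        f"{compress_bytes(block) // 1000}k",
        f"{decompress_bytes(block) // 1000}k",
        f"{decompress_bytes(block, small=True) // 1000}k",
    )
```

These are the man page's estimates rather than measurements, and the printed table does not follow the formula in every row: for -7 the formula gives 6000k where the table lists 6100k.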
250 271
251 272
252 RECOVERING DATA FROM DAMAGED FILES 273 RECOVERING DATA FROM DAMAGED FILES
@@ -273,8 +294,8 @@ RECOVERING DATA FROM DAMAGED FILES
273 "rec0002file.bz2", etc, containing the extracted blocks. 294 "rec0002file.bz2", etc, containing the extracted blocks.
274 The output filenames are designed so that the use of 295 The output filenames are designed so that the use of
275 wildcards in subsequent processing -- for example, "bzip2 296 wildcards in subsequent processing -- for example, "bzip2
276 -dc rec*file.bz2 > recovered_data" -- lists the files in 297 -dc rec*file.bz2 > recovered_data" -- lists the files in
277 the "right" order. 298 the correct order.
278 299
279 bzip2recover should be of most use dealing with large .bz2 300 bzip2recover should be of most use dealing with large .bz2
280 files, as these will contain many blocks. It is clearly 301 files, as these will contain many blocks. It is clearly
@@ -289,17 +310,15 @@ PERFORMANCE NOTES
289 The sorting phase of compression gathers together similar 310 The sorting phase of compression gathers together similar
290 strings in the file. Because of this, files containing 311 strings in the file. Because of this, files containing
291 very long runs of repeated symbols, like "aabaabaabaab 312 very long runs of repeated symbols, like "aabaabaabaab
292 ..." (repeated several hundred times) may compress 313 ..." (repeated several hundred times) may compress more
293 extraordinarily slowly. You can use the -vvvvv option to 314 slowly than normal. Versions 0.9.5 and above fare much
294 monitor progress in great detail, if you want. Decompres- 315 better than previous versions in this respect. The ratio
295 sion speed is unaffected. 316 between worst-case and average-case compression time is in
296 317 the region of 10:1. For previous versions, this figure
297 Such pathological cases seem rare in practice, appearing 318 was more like 100:1. You can use the -vvvv option to mon-
298 mostly in artificially-constructed test files, and in low- 319 itor progress in great detail, if you want.
299 level disk images. It may be inadvisable to use bzip2 to 320
300 compress the latter. If you do get a file which causes 321 Decompression speed is unaffected by these phenomena.
301 severe slowness in compression, try making the block size
302 as small as possible, with flag -1.
303 322
304 bzip2 usually allocates several megabytes of memory to 323 bzip2 usually allocates several megabytes of memory to
305 operate in, and then charges all over it in a fairly ran- 324 operate in, and then charges all over it in a fairly ran-
@@ -314,42 +333,43 @@ PERFORMANCE NOTES
314 333
315 CAVEATS 334 CAVEATS
316 I/O error messages are not as helpful as they could be. 335 I/O error messages are not as helpful as they could be.
317 Bzip2 tries hard to detect I/O errors and exit cleanly, 336 bzip2 tries hard to detect I/O errors and exit cleanly,
318 but the details of what the problem is sometimes seem 337 but the details of what the problem is sometimes seem
319 rather misleading. 338 rather misleading.
320 339
321 This manual page pertains to version 0.9.0 of bzip2. Com- 340 This manual page pertains to version 0.9.5 of bzip2. Com-
322 pressed data created by this version is entirely forwards 341 pressed data created by this version is entirely forwards
323 and backwards compatible with the previous public release, 342 and backwards compatible with the previous public
324 version 0.1pl2, but with the following exception: 0.9.0 343 releases, versions 0.1pl2 and 0.9.0, but with the follow-
325 can correctly decompress multiple concatenated compressed 344 ing exception: 0.9.0 and above can correctly decompress
326 files. 0.1pl2 cannot do this; it will stop after decom- 345 multiple concatenated compressed files. 0.1pl2 cannot do
327 pressing just the first file in the stream. 346 this; it will stop after decompressing just the first file
328 347 in the stream.
329 Wildcard expansion for Windows 95 and NT is flaky. 348
330 349 bzip2recover uses 32-bit integers to represent bit posi-
331 bzip2recover uses 32-bit integers to represent bit posi- 350 tions in compressed files, so it cannot handle compressed
332 tions in compressed files, so it cannot handle compressed 351 files more than 512 megabytes long. This could easily be
333 files more than 512 megabytes long. This could easily be
334 fixed. 352 fixed.
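The 512 megabyte figure follows directly from the width of the bit-position counter. The sketch below assumes the 32-bit integers are unsigned and that "megabyte" means 2^20 bytes; neither detail is stated in the man page.

```python
# Assumption: bit positions are stored in an unsigned 32-bit integer.
BIT_POSITION_WIDTH = 32

# Number of distinct bit positions that can be represented.
max_bits = 2 ** BIT_POSITION_WIDTH

# Eight bits per byte gives the largest addressable file size.
max_bytes = max_bits // 8
max_mebibytes = max_bytes // (1024 * 1024)

print(max_bytes, "bytes =", max_mebibytes, "megabytes")
```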
335 353
336 354
337 AUTHOR 355 AUTHOR
338 Julian Seward, jseward@acm.org. 356 Julian Seward, jseward@acm.org.
357
339 http://www.muraroa.demon.co.uk 358 http://www.muraroa.demon.co.uk
340 359
341 The ideas embodied in bzip2 are due to (at least) the fol- 360 The ideas embodied in bzip2 are due to (at least) the fol-
342 lowing people: Michael Burrows and David Wheeler (for the 361 lowing people: Michael Burrows and David Wheeler (for the
343 block sorting transformation), David Wheeler (again, for 362 block sorting transformation), David Wheeler (again, for
344 the Huffman coder), Peter Fenwick (for the structured cod- 363 the Huffman coder), Peter Fenwick (for the structured cod-
345 ing model in the original bzip, and many refinements), and 364 ing model in the original bzip, and many refinements), and
346 Alistair Moffat, Radford Neal and Ian Witten (for the 365 Alistair Moffat, Radford Neal and Ian Witten (for the
347 arithmetic coder in the original bzip). I am much 366 arithmetic coder in the original bzip). I am much
348 indebted for their help, support and advice. See the man- 367 indebted for their help, support and advice. See the man-
349 ual in the source distribution for pointers to sources of 368 ual in the source distribution for pointers to sources of
350 documentation. Christian von Roques encouraged me to look 369 documentation. Christian von Roques encouraged me to look
351 for faster sorting algorithms, so as to speed up compres- 370 for faster sorting algorithms, so as to speed up compres-
352 sion. Bela Lubkin encouraged me to improve the worst-case 371 sion. Bela Lubkin encouraged me to improve the worst-case
353 compression performance. Many people sent patches, helped 372 compression performance. Many people sent patches, helped
354 with portability problems, lent machines, gave advice and 373 with portability problems, lent machines, gave advice and
355 were generally helpful. 374 were generally helpful.
375