author | Julian Seward <jseward@acm.org> | 1999-09-04 22:13:13 +0200 |
---|---|---|
committer | Julian Seward <jseward@acm.org> | 1999-09-04 22:13:13 +0200 |
commit | f93cd82a9a7094ad90fd19bbc6ccf6f4627f8060 (patch) | |
tree | c95407df5665f5a7395683f07552f2b13f2e501f /bzip2.1 | |
parent | 977101ad5f833f5c0a574bfeea408e5301a6b052 (diff) | |
bzip2-0.9.5d
Diffstat (limited to 'bzip2.1')
-rw-r--r-- | bzip2.1 | 610 |
1 files changed, 314 insertions, 296 deletions
@@ -1,7 +1,7 @@ | |||
1 | .PU | 1 | .PU |
2 | .TH bzip2 1 | 2 | .TH bzip2 1 |
3 | .SH NAME | 3 | .SH NAME |
4 | bzip2, bunzip2 \- a block-sorting file compressor, v0.9.0 | 4 | bzip2, bunzip2 \- a block-sorting file compressor, v0.9.5 |
5 | .br | 5 | .br |
6 | bzcat \- decompresses files to stdout | 6 | bzcat \- decompresses files to stdout |
7 | .br | 7 | .br |
@@ -10,7 +10,7 @@ bzip2recover \- recovers data from damaged bzip2 files | |||
10 | .SH SYNOPSIS | 10 | .SH SYNOPSIS |
11 | .ll +8 | 11 | .ll +8 |
12 | .B bzip2 | 12 | .B bzip2 |
13 | .RB [ " \-cdfkstvzVL123456789 " ] | 13 | .RB [ " \-cdfkqstvzVL123456789 " ] |
14 | [ | 14 | [ |
15 | .I "filenames \&..." | 15 | .I "filenames \&..." |
16 | ] | 16 | ] |
@@ -18,13 +18,13 @@ bzip2recover \- recovers data from damaged bzip2 files | |||
18 | .br | 18 | .br |
19 | .B bunzip2 | 19 | .B bunzip2 |
20 | .RB [ " \-fkvsVL " ] | 20 | .RB [ " \-fkvsVL " ] |
21 | [ | 21 | [ |
22 | .I "filenames \&..." | 22 | .I "filenames \&..." |
23 | ] | 23 | ] |
24 | .br | 24 | .br |
25 | .B bzcat | 25 | .B bzcat |
26 | .RB [ " \-s " ] | 26 | .RB [ " \-s " ] |
27 | [ | 27 | [ |
28 | .I "filenames \&..." | 28 | .I "filenames \&..." |
29 | ] | 29 | ] |
30 | .br | 30 | .br |
@@ -33,211 +33,171 @@ bzip2recover \- recovers data from damaged bzip2 files | |||
33 | 33 | ||
34 | .SH DESCRIPTION | 34 | .SH DESCRIPTION |
35 | .I bzip2 | 35 | .I bzip2 |
36 | compresses files using the Burrows-Wheeler block-sorting | 36 | compresses files using the Burrows-Wheeler block sorting |
37 | text compression algorithm, and Huffman coding. | 37 | text compression algorithm, and Huffman coding. Compression is |
38 | Compression is generally considerably | 38 | generally considerably better than that achieved by more conventional |
39 | better than that | 39 | LZ77/LZ78-based compressors, and approaches the performance of the PPM |
40 | achieved by more conventional LZ77/LZ78-based compressors, | 40 | family of statistical compressors. |
41 | and approaches the performance of the PPM family of statistical | ||
42 | compressors. | ||
43 | 41 | ||
44 | The command-line options are deliberately very similar to | 42 | The command-line options are deliberately very similar to |
45 | those of | 43 | those of |
46 | .I GNU Gzip, | 44 | .I GNU gzip, |
47 | but they are not identical. | 45 | but they are not identical. |
48 | 46 | ||
49 | .I bzip2 | 47 | .I bzip2 |
50 | expects a list of file names to accompany the command-line flags. | 48 | expects a list of file names to accompany the |
51 | Each file is replaced by a compressed version of itself, | 49 | command-line flags. Each file is replaced by a compressed version of |
52 | with the name "original_name.bz2". | 50 | itself, with the name "original_name.bz2". |
53 | Each compressed file has the same modification date and permissions | 51 | Each compressed file |
54 | as the corresponding original, so that these properties can be | 52 | has the same modification date, permissions, and, when possible, |
55 | correctly restored at decompression time. File name handling is | 53 | ownership as the corresponding original, so that these properties can |
56 | naive in the sense that there is no mechanism for preserving | 54 | be correctly restored at decompression time. File name handling is |
57 | original file names, permissions and dates in filesystems | 55 | naive in the sense that there is no mechanism for preserving original |
58 | which lack these concepts, or have serious file name length | 56 | file names, permissions, ownerships or dates in filesystems which lack |
59 | restrictions, such as MS-DOS. | 57 | these concepts, or have serious file name length restrictions, such as |
58 | MS-DOS. | ||
60 | 59 | ||
61 | .I bzip2 | 60 | .I bzip2 |
62 | and | 61 | and |
63 | .I bunzip2 | 62 | .I bunzip2 |
64 | will by default not overwrite existing files; | 63 | will by default not overwrite existing |
65 | if you want this to happen, specify the \-f flag. | 64 | files. If you want this to happen, specify the \-f flag. |
66 | 65 | ||
67 | If no file names are specified, | 66 | If no file names are specified, |
68 | .I bzip2 | 67 | .I bzip2 |
69 | compresses from standard input to standard output. | 68 | compresses from standard |
70 | In this case, | 69 | input to standard output. In this case, |
71 | .I bzip2 | 70 | .I bzip2 |
72 | will decline to write compressed output to a terminal, as | 71 | will decline to |
73 | this would be entirely incomprehensible and therefore pointless. | 72 | write compressed output to a terminal, as this would be entirely |
73 | incomprehensible and therefore pointless. | ||
74 | 74 | ||
75 | .I bunzip2 | 75 | .I bunzip2 |
76 | (or | 76 | (or |
77 | .I bzip2 \-d | 77 | .I bzip2 \-d) |
78 | ) decompresses and restores all specified files whose names | 78 | decompresses all |
79 | end in ".bz2". | 79 | specified files. Files which were not created by |
80 | Files without this suffix are ignored. | ||
81 | Again, supplying no filenames | ||
82 | causes decompression from standard input to standard output. | ||
83 | |||
84 | .I bunzip2 | ||
85 | will correctly decompress a file which is the concatenation | ||
86 | of two or more compressed files. The result is the concatenation | ||
87 | of the corresponding uncompressed files. Integrity testing | ||
88 | (\-t) of concatenated compressed files is also supported. | ||
89 | |||
90 | You can also compress or decompress files to | ||
91 | the standard output by giving the \-c flag. | ||
92 | Multiple files may be compressed and decompressed like this. | ||
93 | The resulting outputs are fed sequentially to stdout. | ||
94 | Compression of multiple files in this manner generates | ||
95 | a stream containing multiple compressed file representations. | ||
96 | Such a stream can be decompressed correctly only by | ||
97 | .I bzip2 | 80 | .I bzip2 |
98 | version 0.9.0 or later. Earlier versions of | 81 | will be detected and ignored, and a warning issued. |
99 | .I bzip2 | 82 | .I bzip2 |
100 | will stop after decompressing the first file in the stream. | 83 | attempts to guess the filename for the decompressed file |
84 | from that of the compressed file as follows: | ||
85 | |||
86 | filename.bz2 becomes filename | ||
87 | filename.bz becomes filename | ||
88 | filename.tbz2 becomes filename.tar | ||
89 | filename.tbz becomes filename.tar | ||
90 | anyothername becomes anyothername.out | ||
91 | |||
92 | If the file does not end in one of the recognised endings, | ||
93 | .I .bz2, | ||
94 | .I .bz, | ||
95 | .I .tbz2 | ||
96 | or | ||
97 | .I .tbz, | ||
98 | .I bzip2 | ||
99 | complains that it cannot | ||
100 | guess the name of the original file, and uses the original name | ||
101 | with | ||
102 | .I .out | ||
103 | appended. | ||
104 | |||
105 | As with compression, supplying no | ||
106 | filenames causes decompression from | ||
107 | standard input to standard output. | ||
108 | |||
109 | .I bunzip2 | ||
110 | will correctly decompress a file which is the | ||
111 | concatenation of two or more compressed files. The result is the | ||
112 | concatenation of the corresponding uncompressed files. Integrity | ||
113 | testing (\-t) | ||
114 | of concatenated | ||
115 | compressed files is also supported. | ||
116 | |||
117 | You can also compress or decompress files to the standard output by | ||
118 | giving the \-c flag. Multiple files may be compressed and | ||
119 | decompressed like this. The resulting outputs are fed sequentially to | ||
120 | stdout. Compression of multiple files | ||
121 | in this manner generates a stream | ||
122 | containing multiple compressed file representations. Such a stream | ||
123 | can be decompressed correctly only by | ||
124 | .I bzip2 | ||
125 | version 0.9.0 or | ||
126 | later. Earlier versions of | ||
127 | .I bzip2 | ||
128 | will stop after decompressing | ||
129 | the first file in the stream. | ||
101 | 130 | ||
102 | .I bzcat | 131 | .I bzcat |
103 | (or | 132 | (or |
104 | .I bzip2 \-dc | 133 | .I bzip2 -dc) |
105 | ) decompresses all specified files to the standard output. | 134 | decompresses all specified files to |
106 | 135 | the standard output. | |
107 | Compression is always performed, even if the compressed file is | 136 | |
108 | slightly larger than the original. Files of less than about | ||
109 | one hundred bytes tend to get larger, since the compression | ||
110 | mechanism has a constant overhead in the region of 50 bytes. | ||
111 | Random data (including the output of most file compressors) | ||
112 | is coded at about 8.05 bits per byte, giving an expansion of | ||
113 | around 0.5%. | ||
114 | |||
115 | As a self-check for your protection, | ||
116 | .I bzip2 | 137 | .I bzip2 |
117 | uses 32-bit CRCs to make sure that the decompressed | 138 | will read arguments from the environment variables |
118 | version of a file is identical to the original. | 139 | .I BZIP2 |
119 | This guards against corruption of the compressed data, | 140 | and |
120 | and against undetected bugs in | 141 | .I BZIP, |
142 | in that order, and will process them | ||
143 | before any arguments read from the command line. This gives a | ||
144 | convenient way to supply default arguments. | ||
145 | |||
146 | Compression is always performed, even if the compressed | ||
147 | file is slightly | ||
148 | larger than the original. Files of less than about one hundred bytes | ||
149 | tend to get larger, since the compression mechanism has a constant | ||
150 | overhead in the region of 50 bytes. Random data (including the output | ||
151 | of most file compressors) is coded at about 8.05 bits per byte, giving | ||
152 | an expansion of around 0.5%. | ||
153 | |||
154 | As a self-check for your protection, | ||
155 | .I | ||
156 | bzip2 | ||
157 | uses 32-bit CRCs to | ||
158 | make sure that the decompressed version of a file is identical to the | ||
159 | original. This guards against corruption of the compressed data, and | ||
160 | against undetected bugs in | ||
121 | .I bzip2 | 161 | .I bzip2 |
122 | (hopefully very unlikely). | 162 | (hopefully very unlikely). The |
123 | The chances of data corruption going undetected is | 163 | chances of data corruption going undetected is microscopic, about one |
124 | microscopic, about one chance in four billion | 164 | chance in four billion for each file processed. Be aware, though, that |
125 | for each file processed. Be aware, though, that the check | 165 | the check occurs upon decompression, so it can only tell you that |
126 | occurs upon decompression, so it can only tell you that | 166 | something is wrong. It can't help you |
127 | that something is wrong. It can't help you recover the | 167 | recover the original uncompressed |
128 | original uncompressed data. | 168 | data. You can use |
129 | You can use | ||
130 | .I bzip2recover | 169 | .I bzip2recover |
131 | to try to recover data from damaged files. | 170 | to try to recover data from |
132 | 171 | damaged files. | |
133 | Return values: | ||
134 | 0 for a normal exit, | ||
135 | 1 for environmental | ||
136 | problems (file not found, invalid flags, I/O errors, &c), | ||
137 | 2 to indicate a corrupt compressed file, | ||
138 | 3 for an internal consistency error (eg, bug) which caused | ||
139 | .I bzip2 | ||
140 | to panic. | ||
141 | 172 | ||
142 | .SH MEMORY MANAGEMENT | 173 | Return values: 0 for a normal exit, 1 for environmental problems (file |
143 | .I Bzip2 | 174 | not found, invalid flags, I/O errors, &c), 2 to indicate a corrupt |
144 | compresses large files in blocks. The block size affects both the | 175 | compressed file, 3 for an internal consistency error (eg, bug) which |
145 | compression ratio achieved, and the amount of memory needed both for | 176 | caused |
146 | compression and decompression. The flags \-1 through \-9 | ||
147 | specify the block size to be 100,000 bytes through 900,000 bytes | ||
148 | (the default) respectively. At decompression-time, the block size used for | ||
149 | compression is read from the header of the compressed file, and | ||
150 | .I bunzip2 | ||
151 | then allocates itself just enough memory to decompress the file. | ||
152 | Since block sizes are stored in compressed files, it follows that the flags | ||
153 | \-1 to \-9 | ||
154 | are irrelevant to and so ignored during decompression. | ||
155 | Compression and decompression requirements, in bytes, can be estimated as: | ||
156 | |||
157 | Compression: 400k + ( 7 x block size ) | ||
158 | |||
159 | Decompression: 100k + ( 4 x block size ), or | ||
160 | .br | ||
161 | 100k + ( 2.5 x block size ) | ||
162 | |||
163 | Larger block sizes give rapidly diminishing marginal returns; most | ||
164 | of the | ||
165 | compression comes from the first two or three hundred k of block size, | ||
166 | a fact worth bearing in mind when using | ||
167 | .I bzip2 | 177 | .I bzip2 |
168 | on small machines. It is also important to appreciate that the | 178 | to panic. |
169 | decompression memory requirement is set at compression-time by the | ||
170 | choice of block size. | ||
171 | |||
172 | For files compressed with the default 900k block size, | ||
173 | .I bunzip2 | ||
174 | will require about 3700 kbytes to decompress. | ||
175 | To support decompression of any file on a 4 megabyte machine, | ||
176 | .I bunzip2 | ||
177 | has an option to decompress using approximately half this | ||
178 | amount of memory, about 2300 kbytes. Decompression speed is | ||
179 | also halved, so you should use this option only where necessary. | ||
180 | The relevant flag is \-s. | ||
181 | |||
182 | In general, try and use the largest block size | ||
183 | memory constraints allow, since that maximises the compression | ||
184 | achieved. Compression and decompression | ||
185 | speed are virtually unaffected by block size. | ||
186 | |||
187 | Another significant point applies to files which fit in a single | ||
188 | block -- that means most files you'd encounter using a large | ||
189 | block size. The amount of real memory touched is proportional | ||
190 | to the size of the file, since the file is smaller than a block. | ||
191 | For example, compressing a file 20,000 bytes long with the flag | ||
192 | \-9 | ||
193 | will cause the compressor to allocate around | ||
194 | 6700k of memory, but only touch 400k + 20000 * 7 = 540 | ||
195 | kbytes of it. Similarly, the decompressor will allocate 3700k but | ||
196 | only touch 100k + 20000 * 4 = 180 kbytes. | ||
197 | |||
198 | Here is a table which summarises the maximum memory usage for | ||
199 | different block sizes. Also recorded is the total compressed | ||
200 | size for 14 files of the Calgary Text Compression Corpus | ||
201 | totalling 3,141,622 bytes. This column gives some feel for how | ||
202 | compression varies with block size. These figures tend to understate | ||
203 | the advantage of larger block sizes for larger files, since the | ||
204 | Corpus is dominated by smaller files. | ||
205 | |||
206 | Compress Decompress Decompress Corpus | ||
207 | Flag usage usage -s usage Size | ||
208 | |||
209 | -1 1100k 500k 350k 914704 | ||
210 | -2 1800k 900k 600k 877703 | ||
211 | -3 2500k 1300k 850k 860338 | ||
212 | -4 3200k 1700k 1100k 846899 | ||
213 | -5 3900k 2100k 1350k 845160 | ||
214 | -6 4600k 2500k 1600k 838626 | ||
215 | -7 5400k 2900k 1850k 834096 | ||
216 | -8 6000k 3300k 2100k 828642 | ||
217 | -9 6700k 3700k 2350k 828642 | ||
218 | 179 | ||
219 | .SH OPTIONS | 180 | .SH OPTIONS |
220 | .TP | 181 | .TP |
221 | .B \-c --stdout | 182 | .B \-c --stdout |
222 | Compress or decompress to standard output. \-c will decompress | 183 | Compress or decompress to standard output. |
223 | multiple files to stdout, but will only compress a single file to | ||
224 | stdout. | ||
225 | .TP | 184 | .TP |
226 | .B \-d --decompress | 185 | .B \-d --decompress |
227 | Force decompression. | 186 | Force decompression. |
228 | .I bzip2, | 187 | .I bzip2, |
229 | .I bunzip2 | 188 | .I bunzip2 |
230 | and | 189 | and |
231 | .I bzcat | 190 | .I bzcat |
232 | are really the same program, and the decision about what actions | 191 | are |
233 | to take is done on the basis of which name is | 192 | really the same program, and the decision about what actions to take is |
234 | used. This flag overrides that mechanism, and forces | 193 | done on the basis of which name is used. This flag overrides that |
194 | mechanism, and forces | ||
235 | .I bzip2 | 195 | .I bzip2 |
236 | to decompress. | 196 | to decompress. |
237 | .TP | 197 | .TP |
238 | .B \-z --compress | 198 | .B \-z --compress |
239 | The complement to \-d: forces compression, regardless of the invokation | 199 | The complement to \-d: forces compression, regardless of the |
240 | name. | 200 | invokation name. |
241 | .TP | 201 | .TP |
242 | .B \-t --test | 202 | .B \-t --test |
243 | Check integrity of the specified file(s), but don't decompress them. | 203 | Check integrity of the specified file(s), but don't decompress them. |
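The output-name rules that this hunk adds to DESCRIPTION (the table mapping .bz2, .bz, .tbz2 and .tbz, with .out as the fallback) can be sketched as follows. This is a minimal illustration in Python and not the C code from this commit; the names `SUFFIX_MAP` and `guess_output_name` are hypothetical.

```python
# Hypothetical sketch of the suffix table described in the new man page text.
# It is not bzip2's implementation; names here are invented for illustration.
SUFFIX_MAP = [
    (".bz2", ""),       # filename.bz2  becomes filename
    (".bz", ""),        # filename.bz   becomes filename
    (".tbz2", ".tar"),  # filename.tbz2 becomes filename.tar
    (".tbz", ".tar"),   # filename.tbz  becomes filename.tar
]


def guess_output_name(name: str) -> str:
    """Return the decompressed file name implied by the compressed name."""
    for suffix, replacement in SUFFIX_MAP:
        if name.endswith(suffix):
            return name[: -len(suffix)] + replacement
    # No recognised ending: the man page says the original name is used
    # with ".out" appended (and a complaint is printed).
    return name + ".out"
```

Because each suffix includes its leading dot, a name such as `archive.tbz2` does not match the `.bz2` entry, so the order of the entries does not matter in this sketch.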
@@ -245,25 +205,31 @@ This really performs a trial decompression and throws away the result. | |||
245 | .TP | 205 | .TP |
246 | .B \-f --force | 206 | .B \-f --force |
247 | Force overwrite of output files. Normally, | 207 | Force overwrite of output files. Normally, |
248 | .I bzip2 | 208 | .I bzip2 |
249 | will not overwrite existing output files. | 209 | will not overwrite |
210 | existing output files. Also forces | ||
211 | .I bzip2 | ||
212 | to break hard links | ||
213 | to files, which it otherwise wouldn't do. | ||
250 | .TP | 214 | .TP |
251 | .B \-k --keep | 215 | .B \-k --keep |
252 | Keep (don't delete) input files during compression or decompression. | 216 | Keep (don't delete) input files during compression |
217 | or decompression. | ||
253 | .TP | 218 | .TP |
254 | .B \-s --small | 219 | .B \-s --small |
255 | Reduce memory usage, for compression, decompression and | 220 | Reduce memory usage, for compression, decompression and testing. Files |
256 | testing. | 221 | are decompressed and tested using a modified algorithm which only |
257 | Files are decompressed and tested using a modified algorithm which only | ||
258 | requires 2.5 bytes per block byte. This means any file can be | 222 | requires 2.5 bytes per block byte. This means any file can be |
259 | decompressed in 2300k of memory, albeit at about half the normal | 223 | decompressed in 2300k of memory, albeit at about half the normal speed. |
260 | speed. | 224 | |
261 | 225 | During compression, \-s selects a block size of 200k, which limits | |
262 | During compression, -s selects a block size of 200k, which limits | 226 | memory use to around the same figure, at the expense of your compression |
263 | memory use to around the same figure, at the expense of your | 227 | ratio. In short, if your machine is low on memory (8 megabytes or |
264 | compression ratio. In short, if your machine is low on memory | 228 | less), use \-s for everything. See MEMORY MANAGEMENT below. |
265 | (8 megabytes or less), use -s for everything. See | 229 | .TP |
266 | MEMORY MANAGEMENT above. | 230 | .B \-q --quiet |
231 | Suppress non-essential warning messages. Messages pertaining to | ||
232 | I/O errors and other critical events will not be suppressed. | ||
267 | .TP | 233 | .TP |
268 | .B \-v --verbose | 234 | .B \-v --verbose |
269 | Verbose mode -- show the compression ratio for each file processed. | 235 | Verbose mode -- show the compression ratio for each file processed. |
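The 2300k figure quoted for \-s in this hunk follows from the estimates in the new revision's MEMORY MANAGEMENT section: 400k + (8 x block size) for compression, and 100k + (4 x block size) or 100k + (2.5 x block size) for decompression. A minimal sketch of that arithmetic in Python, assuming "k" means 1000 bytes and using hypothetical function names:

```python
# Hypothetical helpers reproducing the man page's memory estimates.
# Assumption: "k" in the man page is read as 1000 bytes.
K = 1000


def compress_bytes(level: int) -> int:
    """Estimated compression memory for flags -1 .. -9."""
    block = level * 100 * K          # -1 is 100k, -9 is 900k
    return 400 * K + 8 * block


def decompress_bytes(level: int, small: bool = False) -> int:
    """Estimated decompression memory; small=True models the -s flag."""
    block = level * 100 * K
    if small:
        return 100 * K + (5 * block) // 2   # 2.5 bytes per block byte
    return 100 * K + 4 * block
```

Under these assumptions, `decompress_bytes(9, small=True)` gives 2,350,000 bytes, which is the "about 2300 kbytes" the text mentions, and `compress_bytes(2)` gives 2,000,000 bytes for the 200k block size that \-s selects when compressing. These are estimates only; the table in the man page rounds some rows differently.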
@@ -273,147 +239,199 @@ information which is primarily of interest for diagnostic purposes. | |||
273 | .B \-L --license -V --version | 239 | .B \-L --license -V --version |
274 | Display the software version, license terms and conditions. | 240 | Display the software version, license terms and conditions. |
275 | .TP | 241 | .TP |
276 | .B \-1 to \-9 | 242 | .B \-1 to \-9 |
277 | Set the block size to 100 k, 200 k .. 900 k when | 243 | Set the block size to 100 k, 200 k .. 900 k when compressing. Has no |
278 | compressing. Has no effect when decompressing. | 244 | effect when decompressing. See MEMORY MANAGEMENT below. |
279 | See MEMORY MANAGEMENT above. | ||
280 | .TP | 245 | .TP |
281 | .B \--repetitive-fast | 246 | .B \-- |
282 | .I bzip2 | 247 | Treats all subsequent arguments as file names, even if they start |
283 | injects some small pseudo-random variations | 248 | with a dash. This is so you can handle files with names beginning |
284 | into very repetitive blocks to limit | 249 | with a dash, for example: bzip2 \-- \-myfilename. |
285 | worst-case performance during compression. | 250 | .TP |
286 | If sorting runs into difficulties, the block | 251 | .B \--repetitive-fast --repetitive-best |
287 | is randomised, and sorting is restarted. | 252 | These flags are redundant in versions 0.9.5 and above. They provided |
288 | Very roughly, | 253 | some coarse control over the behaviour of the sorting algorithm in |
254 | earlier versions, which was sometimes useful. 0.9.5 and above have an | ||
255 | improved algorithm which renders these flags irrelevant. | ||
256 | |||
257 | .SH MEMORY MANAGEMENT | ||
258 | .I bzip2 | ||
259 | compresses large files in blocks. The block size affects | ||
260 | both the compression ratio achieved, and the amount of memory needed for | ||
261 | compression and decompression. The flags \-1 through \-9 | ||
262 | specify the block size to be 100,000 bytes through 900,000 bytes (the | ||
263 | default) respectively. At decompression time, the block size used for | ||
264 | compression is read from the header of the compressed file, and | ||
265 | .I bunzip2 | ||
266 | then allocates itself just enough memory to decompress | ||
267 | the file. Since block sizes are stored in compressed files, it follows | ||
268 | that the flags \-1 to \-9 are irrelevant to and so ignored | ||
269 | during decompression. | ||
270 | |||
271 | Compression and decompression requirements, | ||
272 | in bytes, can be estimated as: | ||
273 | |||
274 | Compression: 400k + ( 8 x block size ) | ||
275 | |||
276 | Decompression: 100k + ( 4 x block size ), or | ||
277 | 100k + ( 2.5 x block size ) | ||
278 | |||
279 | Larger block sizes give rapidly diminishing marginal returns. Most of | ||
280 | the compression comes from the first two or three hundred k of block | ||
281 | size, a fact worth bearing in mind when using | ||
289 | .I bzip2 | 282 | .I bzip2 |
290 | persists for three times as long as a well-behaved input | 283 | on small machines. |
291 | would take before resorting to randomisation. | 284 | It is also important to appreciate that the decompression memory |
292 | This flag makes it give up much sooner. | 285 | requirement is set at compression time by the choice of block size. |
293 | 286 | ||
294 | .TP | 287 | For files compressed with the default 900k block size, |
295 | .B \--repetitive-best | 288 | .I bunzip2 |
296 | Opposite of \--repetitive-fast; try a lot harder before | 289 | will require about 3700 kbytes to decompress. To support decompression |
297 | resorting to randomisation. | 290 | of any file on a 4 megabyte machine, |
291 | .I bunzip2 | ||
292 | has an option to | ||
293 | decompress using approximately half this amount of memory, about 2300 | ||
294 | kbytes. Decompression speed is also halved, so you should use this | ||
295 | option only where necessary. The relevant flag is -s. | ||
296 | |||
297 | In general, try and use the largest block size memory constraints allow, | ||
298 | since that maximises the compression achieved. Compression and | ||
299 | decompression speed are virtually unaffected by block size. | ||
300 | |||
301 | Another significant point applies to files which fit in a single block | ||
302 | -- that means most files you'd encounter using a large block size. The | ||
303 | amount of real memory touched is proportional to the size of the file, | ||
304 | since the file is smaller than a block. For example, compressing a file | ||
305 | 20,000 bytes long with the flag -9 will cause the compressor to | ||
306 | allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560 | ||
307 | kbytes of it. Similarly, the decompressor will allocate 3700k but only | ||
308 | touch 100k + 20000 * 4 = 180 kbytes. | ||
309 | |||
310 | Here is a table which summarises the maximum memory usage for different | ||
311 | block sizes. Also recorded is the total compressed size for 14 files of | ||
312 | the Calgary Text Compression Corpus totalling 3,141,622 bytes. This | ||
313 | column gives some feel for how compression varies with block size. | ||
314 | These figures tend to understate the advantage of larger block sizes for | ||
315 | larger files, since the Corpus is dominated by smaller files. | ||
316 | |||
317 | Compress Decompress Decompress Corpus | ||
318 | Flag usage usage -s usage Size | ||
319 | |||
320 | -1 1200k 500k 350k 914704 | ||
321 | -2 2000k 900k 600k 877703 | ||
322 | -3 2800k 1300k 850k 860338 | ||
323 | -4 3600k 1700k 1100k 846899 | ||
324 | -5 4400k 2100k 1350k 845160 | ||
325 | -6 5200k 2500k 1600k 838626 | ||
326 | -7 6100k 2900k 1850k 834096 | ||
327 | -8 6800k 3300k 2100k 828642 | ||
328 | -9 7600k 3700k 2350k 828642 | ||
298 | 329 | ||
299 | .SH RECOVERING DATA FROM DAMAGED FILES | 330 | .SH RECOVERING DATA FROM DAMAGED FILES |
300 | .I bzip2 | 331 | .I bzip2 |
301 | compresses files in blocks, usually 900kbytes long. | 332 | compresses files in blocks, usually 900kbytes long. Each |
302 | Each block is handled independently. If a media or | 333 | block is handled independently. If a media or transmission error causes |
303 | transmission error causes a multi-block .bz2 | 334 | a multi-block .bz2 |
304 | file to become damaged, | 335 | file to become damaged, it may be possible to |
305 | it may be possible to recover data from the undamaged blocks | 336 | recover data from the undamaged blocks in the file. |
306 | in the file. | 337 | |
307 | 338 | The compressed representation of each block is delimited by a 48-bit | |
308 | The compressed representation of each block is delimited by | 339 | pattern, which makes it possible to find the block boundaries with |
309 | a 48-bit pattern, which makes it possible to find the block | 340 | reasonable certainty. Each block also carries its own 32-bit CRC, so |
310 | boundaries with reasonable certainty. Each block also carries | 341 | damaged blocks can be distinguished from undamaged ones. |
311 | its own 32-bit CRC, so damaged blocks can be | ||
312 | distinguished from undamaged ones. | ||
313 | 342 | ||
314 | .I bzip2recover | 343 | .I bzip2recover |
315 | is a simple program whose purpose is to search for | 344 | is a simple program whose purpose is to search for |
316 | blocks in .bz2 files, and write each block out into | 345 | blocks in .bz2 files, and write each block out into its own .bz2 |
317 | its own .bz2 file. You can then use | 346 | file. You can then use |
318 | .I bzip2 -t | 347 | .I bzip2 |
319 | to test the integrity of the resulting files, | 348 | \-t |
320 | and decompress those which are undamaged. | 349 | to test the |
350 | integrity of the resulting files, and decompress those which are | ||
351 | undamaged. | ||
321 | 352 | ||
322 | .I bzip2recover | 353 | .I bzip2recover |
323 | takes a single argument, the name of the damaged file, | 354 | takes a single argument, the name of the damaged file, |
324 | and writes a number of files "rec0001file.bz2", "rec0002file.bz2", | 355 | and writes a number of files "rec0001file.bz2", |
325 | etc, containing the extracted blocks. The output filenames | 356 | "rec0002file.bz2", etc, containing the extracted blocks. |
326 | are designed so that the use of wildcards in subsequent processing | 357 | The output filenames are designed so that the use of |
327 | -- for example, "bzip2 -dc rec*file.bz2 > recovered_data" -- | 358 | wildcards in subsequent processing -- for example, |
328 | lists the files in the "right" order. | 359 | "bzip2 -dc rec*file.bz2 > recovered_data" -- lists the files in |
360 | the correct order. | ||
329 | 361 | ||
330 | .I bzip2recover | 362 | .I bzip2recover |
331 | should be of most use dealing with large .bz2 files, as | 363 | should be of most use dealing with large .bz2 |
332 | these will contain many blocks. It is clearly futile to | 364 | files, as these will contain many blocks. It is clearly |
333 | use it on damaged single-block files, since a damaged | 365 | futile to use it on damaged single-block files, since a |
334 | block cannot be recovered. If you wish to minimise | 366 | damaged block cannot be recovered. If you wish to minimise |
335 | any potential data loss through media or transmission | 367 | any potential data loss through media or transmission errors, |
336 | errors, you might consider compressing with a smaller | 368 | you might consider compressing with a smaller |
337 | block size. | 369 | block size. |
338 | 370 | ||
339 | .SH PERFORMANCE NOTES | 371 | .SH PERFORMANCE NOTES |
340 | The sorting phase of compression gathers together similar strings | 372 | The sorting phase of compression gathers together similar strings in the |
341 | in the file. Because of this, files containing very long | 373 | file. Because of this, files containing very long runs of repeated |
342 | runs of repeated symbols, like "aabaabaabaab ..." (repeated | 374 | symbols, like "aabaabaabaab ..." (repeated several hundred times) may |
343 | several hundred times) may compress extraordinarily slowly. | 375 | compress more slowly than normal. Versions 0.9.5 and above fare much |
344 | You can use the | 376 | better than previous versions in this respect. The ratio between |
345 | \-vvvvv | 377 | worst-case and average-case compression time is in the region of 10:1. |
346 | option to monitor progress in great detail, if you want. | 378 | For previous versions, this figure was more like 100:1. You can use the |
347 | Decompression speed is unaffected. | 379 | \-vvvv option to monitor progress in great detail, if you want. |
348 | 380 | ||
349 | Such pathological cases | 381 | Decompression speed is unaffected by these phenomena. |
350 | seem rare in practice, appearing mostly in artificially-constructed | ||
351 | test files, and in low-level disk images. It may be inadvisable to | ||
352 | use | ||
353 | .I bzip2 | ||
354 | to compress the latter. | ||
355 | If you do get a file which causes severe slowness in compression, | ||
356 | try making the block size as small as possible, with flag \-1. | ||
357 | 382 | ||
358 | .I bzip2 | 383 | .I bzip2 |
359 | usually allocates several megabytes of memory to operate in, | 384 | usually allocates several megabytes of memory to operate |
360 | and then charges all over it in a fairly random fashion. This | 385 | in, and then charges all over it in a fairly random fashion. This means |
361 | means that performance, both for compressing and decompressing, | 386 | that performance, both for compressing and decompressing, is largely |
362 | is largely determined by the speed | 387 | determined by the speed at which your machine can service cache misses. |
363 | at which your machine can service cache misses. | 388 | Because of this, small changes to the code to reduce the miss rate have |
364 | Because of this, small changes | 389 | been observed to give disproportionately large performance improvements. |
365 | to the code to reduce the miss rate have been observed to give | ||
366 | disproportionately large performance improvements. | ||
367 | I imagine | 390 | I imagine |
368 | .I bzip2 | 391 | .I bzip2 |
369 | will perform best on machines with very large caches. | 392 | will perform best on machines with very large caches. |
370 | 393 | ||
371 | .SH CAVEATS | 394 | .SH CAVEATS |
372 | I/O error messages are not as helpful as they could be. | 395 | I/O error messages are not as helpful as they could be. |
373 | .I Bzip2 | 396 | .I bzip2 |
374 | tries hard to detect I/O errors and exit cleanly, but the | 397 | tries hard to detect I/O errors and exit cleanly, but the details of |
375 | details of what the problem is sometimes seem rather misleading. | 398 | what the problem is sometimes seem rather misleading. |
376 | 399 | ||
377 | This manual page pertains to version 0.9.0 of | 400 | This manual page pertains to version 0.9.5 of |
378 | .I bzip2. | 401 | .I bzip2. |
379 | Compressed data created by this version is entirely forwards and | 402 | Compressed |
380 | backwards compatible with the previous public release, version 0.1pl2, | 403 | data created by this version is entirely forwards and backwards |
381 | but with the following exception: 0.9.0 can correctly decompress | 404 | compatible with the previous public releases, versions 0.1pl2 and 0.9.0, |
382 | multiple concatenated compressed files. 0.1pl2 cannot do this; it | 405 | but with the following exception: 0.9.0 and above can correctly |
383 | will stop after decompressing just the first file in the stream. | 406 | decompress multiple concatenated compressed files. 0.1pl2 cannot do |
384 | 407 | this; it will stop after decompressing just the first file in the | |
385 | Wildcard expansion for Windows 95 and NT | 408 | stream. |
386 | is flaky. | ||
387 | 409 | ||
388 | .I bzip2recover | 410 | .I bzip2recover |
389 | uses 32-bit integers to represent bit positions in | 411 | uses 32-bit integers to represent bit positions in |
390 | compressed files, so it cannot handle compressed files | 412 | compressed files, so it cannot handle compressed files more than 512 |
391 | more than 512 megabytes long. This could easily be fixed. | 413 | megabytes long. This could easily be fixed. |
392 | 414 | ||
393 | .SH AUTHOR | 415 | .SH AUTHOR |
394 | Julian Seward, jseward@acm.org. | 416 | Julian Seward, jseward@acm.org. |
395 | 417 | ||
396 | http://www.muraroa.demon.co.uk | 418 | http://www.muraroa.demon.co.uk |
397 | 419 | ||
398 | The ideas embodied in | 420 | The ideas embodied in |
399 | .I bzip2 | 421 | .I bzip2 |
400 | are due to (at least) the following people: | 422 | are due to (at least) the following |
401 | Michael Burrows and David Wheeler (for the block sorting | 423 | people: Michael Burrows and David Wheeler (for the block sorting |
402 | transformation), David Wheeler (again, for the Huffman coder), | 424 | transformation), David Wheeler (again, for the Huffman coder), Peter |
403 | Peter Fenwick (for the structured coding model in the original | 425 | Fenwick (for the structured coding model in the original |
404 | .I bzip, | 426 | .I bzip, |
405 | and many refinements), | 427 | and many refinements), and Alistair Moffat, Radford Neal and Ian Witten |
406 | and | 428 | (for the arithmetic coder in the original |
407 | Alistair Moffat, Radford Neal and Ian Witten (for the arithmetic | ||
408 | coder in the original | ||
409 | .I bzip). | 429 | .I bzip). |
410 | I am much indebted for their help, support and advice. | 430 | I am much |
411 | See the manual in the source distribution for pointers to | 431 | indebted for their help, support and advice. See the manual in the |
412 | sources of documentation. | 432 | source distribution for pointers to sources of documentation. Christian |
413 | Christian von Roques encouraged me to look for faster | 433 | von Roques encouraged me to look for faster sorting algorithms, so as to |
414 | sorting algorithms, so as to speed up compression. | 434 | speed up compression. Bela Lubkin encouraged me to improve the |
415 | Bela Lubkin encouraged me to improve the worst-case | 435 | worst-case compression performance. Many people sent patches, helped |
416 | compression performance. | 436 | with portability problems, lent machines, gave advice and were generally |
417 | Many people sent patches, helped with portability problems, | 437 | helpful. |
418 | lent machines, gave advice and were generally helpful. | ||
419 | |||