diff options
Diffstat (limited to 'manual.texi')
-rw-r--r-- | manual.texi | 869 |
1 files changed, 459 insertions, 410 deletions
diff --git a/manual.texi b/manual.texi index 99ce661..e48e656 100644 --- a/manual.texi +++ b/manual.texi | |||
@@ -2,10 +2,10 @@ | |||
2 | @setfilename bzip2.info | 2 | @setfilename bzip2.info |
3 | 3 | ||
4 | @ignore | 4 | @ignore |
5 | This file documents bzip2 version 0.9.0c, and associated library | 5 | This file documents bzip2 version 0.9.5, and associated library |
6 | libbzip2, written by Julian Seward (jseward@acm.org). | 6 | libbzip2, written by Julian Seward (jseward@acm.org). |
7 | 7 | ||
8 | Copyright (C) 1996-1998 Julian R Seward | 8 | Copyright (C) 1996-1999 Julian R Seward |
9 | 9 | ||
10 | Permission is granted to make and distribute verbatim copies of | 10 | Permission is granted to make and distribute verbatim copies of |
11 | this manual provided the copyright notice and this permission notice | 11 | this manual provided the copyright notice and this permission notice |
@@ -30,21 +30,21 @@ END-INFO-DIR-ENTRY | |||
30 | @titlepage | 30 | @titlepage |
31 | @title bzip2 and libbzip2 | 31 | @title bzip2 and libbzip2 |
32 | @subtitle a program and library for data compression | 32 | @subtitle a program and library for data compression |
33 | @subtitle copyright (C) 1996-1998 Julian Seward | 33 | @subtitle copyright (C) 1996-1999 Julian Seward |
34 | @subtitle version 0.9.0c of 18 October 1998 | 34 | @subtitle version 0.9.5d of 4 September 1999 |
35 | @author Julian Seward | 35 | @author Julian Seward |
36 | 36 | ||
37 | @end titlepage | 37 | @end titlepage |
38 | @end iftex | ||
39 | |||
40 | 38 | ||
41 | @parindent 0mm | 39 | @parindent 0mm |
42 | @parskip 2mm | 40 | @parskip 2mm |
43 | 41 | ||
42 | @end iftex | ||
43 | @node Top, Overview, (dir), (dir) | ||
44 | 44 | ||
45 | This program, @code{bzip2}, | 45 | This program, @code{bzip2}, |
46 | and associated library @code{libbzip2}, are | 46 | and associated library @code{libbzip2}, are |
47 | Copyright (C) 1996-1998 Julian R Seward. All rights reserved. | 47 | Copyright (C) 1996-1999 Julian R Seward. All rights reserved. |
48 | 48 | ||
49 | Redistribution and use in source and binary forms, with or without | 49 | Redistribution and use in source and binary forms, with or without |
50 | modification, are permitted provided that the following conditions | 50 | modification, are permitted provided that the following conditions |
@@ -78,13 +78,13 @@ WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING | |||
78 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS | 78 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS |
79 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | 79 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. |
80 | 80 | ||
81 | Julian Seward, Guildford, Surrey, UK. | 81 | Julian Seward, Cambridge, UK. |
82 | 82 | ||
83 | @code{jseward@@acm.org} | 83 | @code{jseward@@acm.org} |
84 | 84 | ||
85 | @code{http://www.muraroa.demon.co.uk} | 85 | @code{http://www.muraroa.demon.co.uk} |
86 | 86 | ||
87 | @code{bzip2}/@code{libbzip2} version 0.9.0c of 18 October 1998. | 87 | @code{bzip2}/@code{libbzip2} version 0.9.5 of 24 May 1999. |
88 | 88 | ||
89 | PATENTS: To the best of my knowledge, @code{bzip2} does not use any patented | 89 | PATENTS: To the best of my knowledge, @code{bzip2} does not use any patented |
90 | algorithms. However, I do not have the resources available to carry out | 90 | algorithms. However, I do not have the resources available to carry out |
@@ -124,360 +124,345 @@ ought to be recorded somewhere. | |||
124 | 124 | ||
125 | This chapter contains a copy of the @code{bzip2} man page, | 125 | This chapter contains a copy of the @code{bzip2} man page, |
126 | and nothing else. | 126 | and nothing else. |
127 | |||
128 | @quotation | ||
129 | |||
130 | @unnumberedsubsubsec NAME | ||
131 | @itemize | ||
132 | @item @code{bzip2}, @code{bunzip2} | ||
133 | - a block-sorting file compressor, v0.9.5 | ||
134 | @item @code{bzcat} | ||
135 | - decompresses files to stdout | ||
136 | @item @code{bzip2recover} | ||
137 | - recovers data from damaged bzip2 files | ||
138 | @end itemize | ||
139 | |||
140 | @unnumberedsubsubsec SYNOPSIS | ||
141 | @itemize | ||
142 | @item @code{bzip2} [ -cdfkqstvzVL123456789 ] [ filenames ... ] | ||
143 | @item @code{bunzip2} [ -fkvsVL ] [ filenames ... ] | ||
144 | @item @code{bzcat} [ -s ] [ filenames ... ] | ||
145 | @item @code{bzip2recover} filename | ||
146 | @end itemize | ||
147 | |||
148 | @unnumberedsubsubsec DESCRIPTION | ||
149 | |||
150 | @code{bzip2} compresses files using the Burrows-Wheeler block sorting | ||
151 | text compression algorithm, and Huffman coding. Compression is | ||
152 | generally considerably better than that achieved by more conventional | ||
153 | LZ77/LZ78-based compressors, and approaches the performance of the PPM | ||
154 | family of statistical compressors. | ||
155 | |||
156 | The command-line options are deliberately very similar to those of GNU | ||
157 | @code{gzip}, but they are not identical. | ||
158 | |||
159 | @code{bzip2} expects a list of file names to accompany the command-line | ||
160 | flags. Each file is replaced by a compressed version of itself, with | ||
161 | the name @code{original_name.bz2}. Each compressed file has the same | ||
162 | modification date, permissions, and, when possible, ownership as the | ||
163 | corresponding original, so that these properties can be correctly | ||
164 | restored at decompression time. File name handling is naive in the | ||
165 | sense that there is no mechanism for preserving original file names, | ||
166 | permissions, ownerships or dates in filesystems which lack these | ||
167 | concepts, or have serious file name length restrictions, such as MS-DOS. | ||
168 | |||
169 | @code{bzip2} and @code{bunzip2} will by default not overwrite existing | ||
170 | files. If you want this to happen, specify the @code{-f} flag. | ||
171 | |||
172 | If no file names are specified, @code{bzip2} compresses from standard | ||
173 | input to standard output. In this case, @code{bzip2} will decline to | ||
174 | write compressed output to a terminal, as this would be entirely | ||
175 | incomprehensible and therefore pointless. | ||
176 | |||
177 | @code{bunzip2} (or @code{bzip2 -d}) decompresses all | ||
178 | specified files. Files which were not created by @code{bzip2} | ||
179 | will be detected and ignored, and a warning issued. | ||
180 | @code{bzip2} attempts to guess the filename for the decompressed file | ||
181 | from that of the compressed file as follows: | ||
182 | @itemize | ||
183 | @item @code{filename.bz2 } becomes @code{filename} | ||
184 | @item @code{filename.bz } becomes @code{filename} | ||
185 | @item @code{filename.tbz2} becomes @code{filename.tar} | ||
186 | @item @code{filename.tbz } becomes @code{filename.tar} | ||
187 | @item @code{anyothername } becomes @code{anyothername.out} | ||
188 | @end itemize | ||
189 | If the file does not end in one of the recognised endings, | ||
190 | @code{.bz2}, @code{.bz}, | ||
191 | @code{.tbz2} or @code{.tbz}, @code{bzip2} complains that it cannot | ||
192 | guess the name of the original file, and uses the original name | ||
193 | with @code{.out} appended. | ||
194 | |||
195 | As with compression, supplying no | ||
196 | filenames causes decompression from standard input to standard output. | ||
197 | |||
198 | @code{bunzip2} will correctly decompress a file which is the | ||
199 | concatenation of two or more compressed files. The result is the | ||
200 | concatenation of the corresponding uncompressed files. Integrity | ||
201 | testing (@code{-t}) of concatenated compressed files is also supported. | ||
202 | |||
203 | You can also compress or decompress files to the standard output by | ||
204 | giving the @code{-c} flag. Multiple files may be compressed and | ||
205 | decompressed like this. The resulting outputs are fed sequentially to | ||
206 | stdout. Compression of multiple files in this manner generates a stream | ||
207 | containing multiple compressed file representations. Such a stream | ||
208 | can be decompressed correctly only by @code{bzip2} version 0.9.0 or | ||
209 | later. Earlier versions of @code{bzip2} will stop after decompressing | ||
210 | the first file in the stream. | ||
211 | |||
212 | @code{bzcat} (or @code{bzip2 -dc}) decompresses all specified files to | ||
213 | the standard output. | ||
214 | |||
215 | @code{bzip2} will read arguments from the environment variables | ||
216 | @code{BZIP2} and @code{BZIP}, in that order, and will process them | ||
217 | before any arguments read from the command line. This gives a | ||
218 | convenient way to supply default arguments. | ||
219 | |||
220 | Compression is always performed, even if the compressed file is slightly | ||
221 | larger than the original. Files of less than about one hundred bytes | ||
222 | tend to get larger, since the compression mechanism has a constant | ||
223 | overhead in the region of 50 bytes. Random data (including the output | ||
224 | of most file compressors) is coded at about 8.05 bits per byte, giving | ||
225 | an expansion of around 0.5%. | ||
226 | |||
227 | As a self-check for your protection, @code{bzip2} uses 32-bit CRCs to | ||
228 | make sure that the decompressed version of a file is identical to the | ||
229 | original. This guards against corruption of the compressed data, and | ||
230 | against undetected bugs in @code{bzip2} (hopefully very unlikely). The | ||
231 | chances of data corruption going undetected is microscopic, about one | ||
232 | chance in four billion for each file processed. Be aware, though, that | ||
233 | the check occurs upon decompression, so it can only tell you that | ||
234 | something is wrong. It can't help you recover the original uncompressed | ||
235 | data. You can use @code{bzip2recover} to try to recover data from | ||
236 | damaged files. | ||
237 | |||
238 | Return values: 0 for a normal exit, 1 for environmental problems (file | ||
239 | not found, invalid flags, I/O errors, &c), 2 to indicate a corrupt | ||
240 | compressed file, 3 for an internal consistency error (eg, bug) which | ||
241 | caused @code{bzip2} to panic. | ||
242 | |||
243 | |||
244 | @unnumberedsubsubsec OPTIONS | ||
245 | @table @code | ||
246 | @item -c --stdout | ||
247 | Compress or decompress to standard output. | ||
248 | @item -d --decompress | ||
249 | Force decompression. @code{bzip2}, @code{bunzip2} and @code{bzcat} are | ||
250 | really the same program, and the decision about what actions to take is | ||
251 | done on the basis of which name is used. This flag overrides that | ||
252 | mechanism, and forces bzip2 to decompress. | ||
253 | @item -z --compress | ||
254 | The complement to @code{-d}: forces compression, regardless of the | ||
255 | invokation name. | ||
256 | @item -t --test | ||
257 | Check integrity of the specified file(s), but don't decompress them. | ||
258 | This really performs a trial decompression and throws away the result. | ||
259 | @item -f --force | ||
260 | Force overwrite of output files. Normally, @code{bzip2} will not overwrite | ||
261 | existing output files. Also forces @code{bzip2} to break hard links | ||
262 | to files, which it otherwise wouldn't do. | ||
263 | @item -k --keep | ||
264 | Keep (don't delete) input files during compression | ||
265 | or decompression. | ||
266 | @item -s --small | ||
267 | Reduce memory usage, for compression, decompression and testing. Files | ||
268 | are decompressed and tested using a modified algorithm which only | ||
269 | requires 2.5 bytes per block byte. This means any file can be | ||
270 | decompressed in 2300k of memory, albeit at about half the normal speed. | ||
271 | |||
272 | During compression, @code{-s} selects a block size of 200k, which limits | ||
273 | memory use to around the same figure, at the expense of your compression | ||
274 | ratio. In short, if your machine is low on memory (8 megabytes or | ||
275 | less), use -s for everything. See MEMORY MANAGEMENT below. | ||
276 | @item -q --quiet | ||
277 | Suppress non-essential warning messages. Messages pertaining to | ||
278 | I/O errors and other critical events will not be suppressed. | ||
279 | @item -v --verbose | ||
280 | Verbose mode -- show the compression ratio for each file processed. | ||
281 | Further @code{-v}'s increase the verbosity level, spewing out lots of | ||
282 | information which is primarily of interest for diagnostic purposes. | ||
283 | @item -L --license -V --version | ||
284 | Display the software version, license terms and conditions. | ||
285 | @item -1 to -9 | ||
286 | Set the block size to 100 k, 200 k .. 900 k when compressing. Has no | ||
287 | effect when decompressing. See MEMORY MANAGEMENT below. | ||
288 | @item -- | ||
289 | Treats all subsequent arguments as file names, even if they start | ||
290 | with a dash. This is so you can handle files with names beginning | ||
291 | with a dash, for example: @code{bzip2 -- -myfilename}. | ||
292 | @item --repetitive-fast | ||
293 | @item --repetitive-best | ||
294 | These flags are redundant in versions 0.9.5 and above. They provided | ||
295 | some coarse control over the behaviour of the sorting algorithm in | ||
296 | earlier versions, which was sometimes useful. 0.9.5 and above have an | ||
297 | improved algorithm which renders these flags irrelevant. | ||
298 | @end table | ||
299 | |||
300 | |||
301 | @unnumberedsubsubsec MEMORY MANAGEMENT | ||
302 | |||
303 | @code{bzip2} compresses large files in blocks. The block size affects | ||
304 | both the compression ratio achieved, and the amount of memory needed for | ||
305 | compression and decompression. The flags @code{-1} through @code{-9} | ||
306 | specify the block size to be 100,000 bytes through 900,000 bytes (the | ||
307 | default) respectively. At decompression time, the block size used for | ||
308 | compression is read from the header of the compressed file, and | ||
309 | @code{bunzip2} then allocates itself just enough memory to decompress | ||
310 | the file. Since block sizes are stored in compressed files, it follows | ||
311 | that the flags @code{-1} to @code{-9} are irrelevant to and so ignored | ||
312 | during decompression. | ||
313 | |||
314 | Compression and decompression requirements, in bytes, can be estimated | ||
315 | as: | ||
127 | @example | 316 | @example |
128 | NAME | 317 | Compression: 400k + ( 8 x block size ) |
129 | bzip2, bunzip2 - a block-sorting file compressor, v0.9.0 | 318 | |
130 | bzcat - decompresses files to stdout | 319 | Decompression: 100k + ( 4 x block size ), or |
131 | bzip2recover - recovers data from damaged bzip2 files | 320 | 100k + ( 2.5 x block size ) |
132 | 321 | @end example | |
133 | 322 | Larger block sizes give rapidly diminishing marginal returns. Most of | |
134 | SYNOPSIS | 323 | the compression comes from the first two or three hundred k of block |
135 | bzip2 [ -cdfkstvzVL123456789 ] [ filenames ... ] | 324 | size, a fact worth bearing in mind when using @code{bzip2} on small machines. |
136 | bunzip2 [ -fkvsVL ] [ filenames ... ] | 325 | It is also important to appreciate that the decompression memory |
137 | bzcat [ -s ] [ filenames ... ] | 326 | requirement is set at compression time by the choice of block size. |
138 | bzip2recover filename | 327 | |
139 | 328 | For files compressed with the default 900k block size, @code{bunzip2} | |
140 | 329 | will require about 3700 kbytes to decompress. To support decompression | |
141 | DESCRIPTION | 330 | of any file on a 4 megabyte machine, @code{bunzip2} has an option to |
142 | bzip2 compresses files using the Burrows-Wheeler block- | 331 | decompress using approximately half this amount of memory, about 2300 |
143 | sorting text compression algorithm, and Huffman coding. | 332 | kbytes. Decompression speed is also halved, so you should use this |
144 | Compression is generally considerably better than that | 333 | option only where necessary. The relevant flag is @code{-s}. |
145 | achieved by more conventional LZ77/LZ78-based compressors, | 334 | |
146 | and approaches the performance of the PPM family of sta- | 335 | In general, try and use the largest block size memory constraints allow, |
147 | tistical compressors. | 336 | since that maximises the compression achieved. Compression and |
148 | 337 | decompression speed are virtually unaffected by block size. | |
149 | The command-line options are deliberately very similar to | 338 | |
150 | those of GNU Gzip, but they are not identical. | 339 | Another significant point applies to files which fit in a single block |
151 | 340 | -- that means most files you'd encounter using a large block size. The | |
152 | bzip2 expects a list of file names to accompany the com- | 341 | amount of real memory touched is proportional to the size of the file, |
153 | mand-line flags. Each file is replaced by a compressed | 342 | since the file is smaller than a block. For example, compressing a file |
154 | version of itself, with the name "original_name.bz2". | 343 | 20,000 bytes long with the flag @code{-9} will cause the compressor to |
155 | Each compressed file has the same modification date and | 344 | allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560 |
156 | permissions as the corresponding original, so that these | 345 | kbytes of it. Similarly, the decompressor will allocate 3700k but only |
157 | properties can be correctly restored at decompression | 346 | touch 100k + 20000 * 4 = 180 kbytes. |
158 | time. File name handling is naive in the sense that there | 347 | |
159 | is no mechanism for preserving original file names, per- | 348 | Here is a table which summarises the maximum memory usage for different |
160 | missions and dates in filesystems which lack these con- | 349 | block sizes. Also recorded is the total compressed size for 14 files of |
161 | cepts, or have serious file name length restrictions, such | 350 | the Calgary Text Compression Corpus totalling 3,141,622 bytes. This |
162 | as MS-DOS. | 351 | column gives some feel for how compression varies with block size. |
163 | 352 | These figures tend to understate the advantage of larger block sizes for | |
164 | bzip2 and bunzip2 will by default not overwrite existing | 353 | larger files, since the Corpus is dominated by smaller files. |
165 | files; if you want this to happen, specify the -f flag. | 354 | @example |
166 | 355 | Compress Decompress Decompress Corpus | |
167 | If no file names are specified, bzip2 compresses from | 356 | Flag usage usage -s usage Size |
168 | standard input to standard output. In this case, bzip2 | 357 | |
169 | will decline to write compressed output to a terminal, as | 358 | -1 1200k 500k 350k 914704 |
170 | this would be entirely incomprehensible and therefore | 359 | -2 2000k 900k 600k 877703 |
171 | pointless. | 360 | -3 2800k 1300k 850k 860338 |
172 | 361 | -4 3600k 1700k 1100k 846899 | |
173 | bunzip2 (or bzip2 -d ) decompresses and restores all spec- | 362 | -5 4400k 2100k 1350k 845160 |
174 | ified files whose names end in ".bz2". Files without this | 363 | -6 5200k 2500k 1600k 838626 |
175 | suffix are ignored. Again, supplying no filenames causes | 364 | -7 6100k 2900k 1850k 834096 |
176 | decompression from standard input to standard output. | 365 | -8 6800k 3300k 2100k 828642 |
177 | 366 | -9 7600k 3700k 2350k 828642 | |
178 | bunzip2 will correctly decompress a file which is the con- | 367 | @end example |
179 | catenation of two or more compressed files. The result is | 368 | |
180 | the concatenation of the corresponding uncompressed files. | 369 | @unnumberedsubsubsec RECOVERING DATA FROM DAMAGED FILES |
181 | Integrity testing (-t) of concatenated compressed files is | ||
182 | also supported. | ||
183 | |||
184 | You can also compress or decompress files to the standard | ||
185 | output by giving the -c flag. Multiple files may be com- | ||
186 | pressed and decompressed like this. The resulting outputs | ||
187 | are fed sequentially to stdout. Compression of multiple | ||
188 | files in this manner generates a stream containing multi- | ||
189 | ple compressed file representations. Such a stream can be | ||
190 | decompressed correctly only by bzip2 version 0.9.0 or | ||
191 | later. Earlier versions of bzip2 will stop after decom- | ||
192 | pressing the first file in the stream. | ||
193 | |||
194 | bzcat (or bzip2 -dc ) decompresses all specified files to | ||
195 | the standard output. | ||
196 | |||
197 | Compression is always performed, even if the compressed | ||
198 | file is slightly larger than the original. Files of less | ||
199 | than about one hundred bytes tend to get larger, since the | ||
200 | compression mechanism has a constant overhead in the | ||
201 | region of 50 bytes. Random data (including the output of | ||
202 | most file compressors) is coded at about 8.05 bits per | ||
203 | byte, giving an expansion of around 0.5%. | ||
204 | |||
205 | As a self-check for your protection, bzip2 uses 32-bit | ||
206 | CRCs to make sure that the decompressed version of a file | ||
207 | is identical to the original. This guards against corrup- | ||
208 | tion of the compressed data, and against undetected bugs | ||
209 | in bzip2 (hopefully very unlikely). The chances of data | ||
210 | corruption going undetected is microscopic, about one | ||
211 | chance in four billion for each file processed. Be aware, | ||
212 | though, that the check occurs upon decompression, so it | ||
213 | can only tell you that that something is wrong. It can't | ||
214 | help you recover the original uncompressed data. You can | ||
215 | use bzip2recover to try to recover data from damaged | ||
216 | files. | ||
217 | |||
218 | Return values: 0 for a normal exit, 1 for environmental | ||
219 | problems (file not found, invalid flags, I/O errors, &c), | ||
220 | 2 to indicate a corrupt compressed file, 3 for an internal | ||
221 | consistency error (eg, bug) which caused bzip2 to panic. | ||
222 | |||
223 | |||
224 | MEMORY MANAGEMENT | ||
225 | Bzip2 compresses large files in blocks. The block size | ||
226 | affects both the compression ratio achieved, and the | ||
227 | amount of memory needed both for compression and decom- | ||
228 | pression. The flags -1 through -9 specify the block size | ||
229 | to be 100,000 bytes through 900,000 bytes (the default) | ||
230 | respectively. At decompression-time, the block size used | ||
231 | for compression is read from the header of the compressed | ||
232 | file, and bunzip2 then allocates itself just enough memory | ||
233 | to decompress the file. Since block sizes are stored in | ||
234 | compressed files, it follows that the flags -1 to -9 are | ||
235 | irrelevant to and so ignored during decompression. | ||
236 | |||
237 | Compression and decompression requirements, in bytes, can | ||
238 | be estimated as: | ||
239 | |||
240 | Compression: 400k + ( 7 x block size ) | ||
241 | |||
242 | Decompression: 100k + ( 4 x block size ), or | ||
243 | 100k + ( 2.5 x block size ) | ||
244 | |||
245 | Larger block sizes give rapidly diminishing marginal | ||
246 | returns; most of the compression comes from the first two | ||
247 | or three hundred k of block size, a fact worth bearing in | ||
248 | mind when using bzip2 on small machines. It is also | ||
249 | important to appreciate that the decompression memory | ||
250 | requirement is set at compression-time by the choice of | ||
251 | block size. | ||
252 | 370 | ||
253 | For files compressed with the default 900k block size, | 371 | @code{bzip2} compresses files in blocks, usually 900kbytes long. Each |
254 | bunzip2 will require about 3700 kbytes to decompress. To | 372 | block is handled independently. If a media or transmission error causes |
255 | support decompression of any file on a 4 megabyte machine, | 373 | a multi-block @code{.bz2} file to become damaged, it may be possible to |
256 | bunzip2 has an option to decompress using approximately | 374 | recover data from the undamaged blocks in the file. |
257 | half this amount of memory, about 2300 kbytes. Decompres- | 375 | |
258 | sion speed is also halved, so you should use this option | 376 | The compressed representation of each block is delimited by a 48-bit |
259 | only where necessary. The relevant flag is -s. | 377 | pattern, which makes it possible to find the block boundaries with |
260 | 378 | reasonable certainty. Each block also carries its own 32-bit CRC, so | |
261 | In general, try and use the largest block size memory con- | 379 | damaged blocks can be distinguished from undamaged ones. |
262 | straints allow, since that maximises the compression | 380 | |
263 | achieved. Compression and decompression speed are virtu- | 381 | @code{bzip2recover} is a simple program whose purpose is to search for |
264 | ally unaffected by block size. | 382 | blocks in @code{.bz2} files, and write each block out into its own |
265 | 383 | @code{.bz2} file. You can then use @code{bzip2 -t} to test the | |
266 | Another significant point applies to files which fit in a | 384 | integrity of the resulting files, and decompress those which are |
267 | single block -- that means most files you'd encounter | 385 | undamaged. |
268 | using a large block size. The amount of real memory | 386 | |
269 | touched is proportional to the size of the file, since the | 387 | @code{bzip2recover} |
270 | file is smaller than a block. For example, compressing a | 388 | takes a single argument, the name of the damaged file, |
271 | file 20,000 bytes long with the flag -9 will cause the | 389 | and writes a number of files @code{rec0001file.bz2}, |
272 | compressor to allocate around 6700k of memory, but only | 390 | @code{rec0002file.bz2}, etc, containing the extracted blocks. |
273 | touch 400k + 20000 * 7 = 540 kbytes of it. Similarly, the | ||
274 | decompressor will allocate 3700k but only touch 100k + | ||
275 | 20000 * 4 = 180 kbytes. | ||
276 | |||
277 | Here is a table which summarises the maximum memory usage | ||
278 | for different block sizes. Also recorded is the total | ||
279 | compressed size for 14 files of the Calgary Text Compres- | ||
280 | sion Corpus totalling 3,141,622 bytes. This column gives | ||
281 | some feel for how compression varies with block size. | ||
282 | These figures tend to understate the advantage of larger | ||
283 | block sizes for larger files, since the Corpus is domi- | ||
284 | nated by smaller files. | ||
285 | |||
286 | Compress Decompress Decompress Corpus | ||
287 | Flag usage usage -s usage Size | ||
288 | |||
289 | -1 1100k 500k 350k 914704 | ||
290 | -2 1800k 900k 600k 877703 | ||
291 | -3 2500k 1300k 850k 860338 | ||
292 | -4 3200k 1700k 1100k 846899 | ||
293 | -5 3900k 2100k 1350k 845160 | ||
294 | -6 4600k 2500k 1600k 838626 | ||
295 | -7 5400k 2900k 1850k 834096 | ||
296 | -8 6000k 3300k 2100k 828642 | ||
297 | -9 6700k 3700k 2350k 828642 | ||
298 | |||
299 | |||
300 | OPTIONS | ||
301 | -c --stdout | ||
302 | Compress or decompress to standard output. -c will | ||
303 | decompress multiple files to stdout, but will only | ||
304 | compress a single file to stdout. | ||
305 | |||
306 | -d --decompress | ||
307 | Force decompression. bzip2, bunzip2 and bzcat are | ||
308 | really the same program, and the decision about | ||
309 | what actions to take is done on the basis of which | ||
310 | name is used. This flag overrides that mechanism, | ||
311 | and forces bzip2 to decompress. | ||
312 | |||
313 | -z --compress | ||
314 | The complement to -d: forces compression, regard- | ||
315 | less of the invokation name. | ||
316 | |||
317 | -t --test | ||
318 | Check integrity of the specified file(s), but don't | ||
319 | decompress them. This really performs a trial | ||
320 | decompression and throws away the result. | ||
321 | |||
322 | -f --force | ||
323 | Force overwrite of output files. Normally, bzip2 | ||
324 | will not overwrite existing output files. | ||
325 | |||
326 | -k --keep | ||
327 | Keep (don't delete) input files during compression | ||
328 | or decompression. | ||
329 | |||
330 | -s --small | ||
331 | Reduce memory usage, for compression, decompression | ||
332 | and testing. Files are decompressed and tested | ||
333 | using a modified algorithm which only requires 2.5 | ||
334 | bytes per block byte. This means any file can be | ||
335 | decompressed in 2300k of memory, albeit at about | ||
336 | half the normal speed. | ||
337 | |||
338 | During compression, -s selects a block size of | ||
339 | 200k, which limits memory use to around the same | ||
340 | figure, at the expense of your compression ratio. | ||
341 | In short, if your machine is low on memory (8 | ||
342 | megabytes or less), use -s for everything. See | ||
343 | MEMORY MANAGEMENT above. | ||
344 | |||
345 | -v --verbose | ||
346 | Verbose mode -- show the compression ratio for each | ||
347 | file processed. Further -v's increase the ver- | ||
348 | bosity level, spewing out lots of information which | ||
349 | is primarily of interest for diagnostic purposes. | ||
350 | |||
351 | -L --license -V --version | ||
352 | Display the software version, license terms and | ||
353 | conditions. | ||
354 | |||
355 | -1 to -9 | ||
356 | Set the block size to 100 k, 200 k .. 900 k when | ||
357 | compressing. Has no effect when decompressing. | ||
358 | See MEMORY MANAGEMENT above. | ||
359 | |||
360 | --repetitive-fast | ||
361 | bzip2 injects some small pseudo-random variations | ||
362 | into very repetitive blocks to limit worst-case | ||
363 | performance during compression. If sorting runs | ||
364 | into difficulties, the block is randomised, and | ||
365 | sorting is restarted. Very roughly, bzip2 persists | ||
366 | for three times as long as a well-behaved input | ||
367 | would take before resorting to randomisation. This | ||
368 | flag makes it give up much sooner. | ||
369 | |||
370 | --repetitive-best | ||
371 | Opposite of --repetitive-fast; try a lot harder | ||
372 | before resorting to randomisation. | ||
373 | |||
374 | |||
375 | RECOVERING DATA FROM DAMAGED FILES | ||
376 | bzip2 compresses files in blocks, usually 900kbytes long. | ||
377 | Each block is handled independently. If a media or trans- | ||
378 | mission error causes a multi-block .bz2 file to become | ||
379 | damaged, it may be possible to recover data from the | ||
380 | undamaged blocks in the file. | ||
381 | |||
382 | The compressed representation of each block is delimited | ||
383 | by a 48-bit pattern, which makes it possible to find the | ||
384 | block boundaries with reasonable certainty. Each block | ||
385 | also carries its own 32-bit CRC, so damaged blocks can be | ||
386 | distinguished from undamaged ones. | ||
387 | |||
388 | bzip2recover is a simple program whose purpose is to | ||
389 | search for blocks in .bz2 files, and write each block out | ||
390 | into its own .bz2 file. You can then use bzip2 -t to test | ||
391 | the integrity of the resulting files, and decompress those | ||
392 | which are undamaged. | ||
393 | |||
394 | bzip2recover takes a single argument, the name of the dam- | ||
395 | aged file, and writes a number of files "rec0001file.bz2", | ||
396 | "rec0002file.bz2", etc, containing the extracted blocks. | ||
397 | The output filenames are designed so that the use of | 391 | The output filenames are designed so that the use of |
398 | wildcards in subsequent processing -- for example, "bzip2 | 392 | wildcards in subsequent processing -- for example, |
399 | -dc rec*file.bz2 > recovered_data" -- lists the files in | 393 | @code{bzip2 -dc rec*file.bz2 > recovered_data} -- lists the files in |
400 | the "right" order. | 394 | the correct order. |
401 | 395 | ||
402 | bzip2recover should be of most use dealing with large .bz2 | 396 | @code{bzip2recover} should be of most use dealing with large @code{.bz2} |
403 | files, as these will contain many blocks. It is clearly | 397 | files, as these will contain many blocks. It is clearly |
404 | futile to use it on damaged single-block files, since a | 398 | futile to use it on damaged single-block files, since a |
405 | damaged block cannot be recovered. If you wish to min- | 399 | damaged block cannot be recovered. If you wish to minimise |
406 | imise any potential data loss through media or transmis- | 400 | any potential data loss through media or transmission errors, |
407 | sion errors, you might consider compressing with a smaller | 401 | you might consider compressing with a smaller |
408 | block size. | 402 | block size. |
409 | 403 | ||
410 | 404 | ||
411 | PERFORMANCE NOTES | 405 | @unnumberedsubsubsec PERFORMANCE NOTES |
412 | The sorting phase of compression gathers together similar | 406 | |
413 | strings in the file. Because of this, files containing | 407 | The sorting phase of compression gathers together similar strings in the |
414 | very long runs of repeated symbols, like "aabaabaabaab | 408 | file. Because of this, files containing very long runs of repeated |
415 | ..." (repeated several hundred times) may compress | 409 | symbols, like "aabaabaabaab ..." (repeated several hundred times) may |
416 | extraordinarily slowly. You can use the -vvvvv option to | 410 | compress more slowly than normal. Versions 0.9.5 and above fare much |
417 | monitor progress in great detail, if you want. Decompres- | 411 | better than previous versions in this respect. The ratio between |
418 | sion speed is unaffected. | 412 | worst-case and average-case compression time is in the region of 10:1. |
419 | 413 | For previous versions, this figure was more like 100:1. You can use the | |
420 | Such pathological cases seem rare in practice, appearing | 414 | @code{-vvvv} option to monitor progress in great detail, if you want. |
421 | mostly in artificially-constructed test files, and in low- | 415 | |
422 | level disk images. It may be inadvisable to use bzip2 to | 416 | Decompression speed is unaffected by these phenomena. |
423 | compress the latter. If you do get a file which causes | 417 | |
424 | severe slowness in compression, try making the block size | 418 | @code{bzip2} usually allocates several megabytes of memory to operate |
425 | as small as possible, with flag -1. | 419 | in, and then charges all over it in a fairly random fashion. This means |
426 | 420 | that performance, both for compressing and decompressing, is largely | |
427 | bzip2 usually allocates several megabytes of memory to | 421 | determined by the speed at which your machine can service cache misses. |
428 | operate in, and then charges all over it in a fairly ran- | 422 | Because of this, small changes to the code to reduce the miss rate have |
429 | dom fashion. This means that performance, both for com- | 423 | been observed to give disproportionately large performance improvements. |
430 | pressing and decompressing, is largely determined by the | 424 | I imagine @code{bzip2} will perform best on machines with very large |
431 | speed at which your machine can service cache misses. | 425 | caches. |
432 | Because of this, small changes to the code to reduce the | ||
433 | miss rate have been observed to give disproportionately | ||
434 | large performance improvements. I imagine bzip2 will per- | ||
435 | form best on machines with very large caches. | ||
436 | |||
437 | |||
438 | CAVEATS | ||
439 | I/O error messages are not as helpful as they could be. | ||
440 | Bzip2 tries hard to detect I/O errors and exit cleanly, | ||
441 | but the details of what the problem is sometimes seem | ||
442 | rather misleading. | ||
443 | |||
444 | This manual page pertains to version 0.9.0 of bzip2. Com- | ||
445 | pressed data created by this version is entirely forwards | ||
446 | and backwards compatible with the previous public release, | ||
447 | version 0.1pl2, but with the following exception: 0.9.0 | ||
448 | can correctly decompress multiple concatenated compressed | ||
449 | files. 0.1pl2 cannot do this; it will stop after decom- | ||
450 | pressing just the first file in the stream. | ||
451 | |||
452 | Wildcard expansion for Windows 95 and NT is flaky. | ||
453 | |||
454 | bzip2recover uses 32-bit integers to represent bit posi- | ||
455 | tions in compressed files, so it cannot handle compressed | ||
456 | files more than 512 megabytes long. This could easily be | ||
457 | fixed. | ||
458 | |||
459 | |||
460 | AUTHOR | ||
461 | Julian Seward, jseward@@acm.org. | ||
462 | |||
463 | The ideas embodied in bzip2 are due to (at least) the fol- | ||
464 | lowing people: Michael Burrows and David Wheeler (for the | ||
465 | block sorting transformation), David Wheeler (again, for | ||
466 | the Huffman coder), Peter Fenwick (for the structured cod- | ||
467 | ing model in the original bzip, and many refinements), and | ||
468 | Alistair Moffat, Radford Neal and Ian Witten (for the | ||
469 | arithmetic coder in the original bzip). I am much | ||
470 | indebted for their help, support and advice. See the man- | ||
471 | ual in the source distribution for pointers to sources of | ||
472 | documentation. Christian von Roques encouraged me to look | ||
473 | for faster sorting algorithms, so as to speed up compres- | ||
474 | sion. Bela Lubkin encouraged me to improve the worst-case | ||
475 | compression performance. Many people sent patches, helped | ||
476 | with portability problems, lent machines, gave advice and | ||
477 | were generally helpful. | ||
478 | @end example | ||
479 | 426 | ||
480 | 427 | ||
428 | @unnumberedsubsubsec CAVEATS | ||
429 | |||
430 | I/O error messages are not as helpful as they could be. @code{bzip2} | ||
431 | tries hard to detect I/O errors and exit cleanly, but the details of | ||
432 | what the problem is sometimes seem rather misleading. | ||
433 | |||
434 | This manual page pertains to version 0.9.5 of @code{bzip2}. Compressed | ||
435 | data created by this version is entirely forwards and backwards | ||
436 | compatible with the previous public releases, versions 0.1pl2 and 0.9.0, | ||
437 | but with the following exception: 0.9.0 and above can correctly | ||
438 | decompress multiple concatenated compressed files. 0.1pl2 cannot do | ||
439 | this; it will stop after decompressing just the first file in the | ||
440 | stream. | ||
441 | |||
442 | @code{bzip2recover} uses 32-bit integers to represent bit positions in | ||
443 | compressed files, so it cannot handle compressed files more than 512 | ||
444 | megabytes long. This could easily be fixed. | ||
445 | |||
446 | |||
447 | @unnumberedsubsubsec AUTHOR | ||
448 | Julian Seward, @code{jseward@@acm.org}. | ||
449 | |||
450 | The ideas embodied in @code{bzip2} are due to (at least) the following | ||
451 | people: Michael Burrows and David Wheeler (for the block sorting | ||
452 | transformation), David Wheeler (again, for the Huffman coder), Peter | ||
453 | Fenwick (for the structured coding model in the original @code{bzip}, | ||
454 | and many refinements), and Alistair Moffat, Radford Neal and Ian Witten | ||
455 | (for the arithmetic coder in the original @code{bzip}). I am much | ||
456 | indebted for their help, support and advice. See the manual in the | ||
457 | source distribution for pointers to sources of documentation. Christian | ||
458 | von Roques encouraged me to look for faster sorting algorithms, so as to | ||
459 | speed up compression. Bela Lubkin encouraged me to improve the | ||
460 | worst-case compression performance. Many people sent patches, helped | ||
461 | with portability problems, lent machines, gave advice and were generally | ||
462 | helpful. | ||
463 | |||
464 | @end quotation | ||
465 | |||
481 | 466 | ||
482 | 467 | ||
483 | 468 | ||
@@ -579,7 +564,7 @@ give better @code{zlib} compatibility. These functions are | |||
579 | @code{bzerror} and @code{bzlibVersion}. You may find these functions | 564 | @code{bzerror} and @code{bzlibVersion}. You may find these functions |
580 | more convenient for simple file reading and writing, than those in the | 565 | more convenient for simple file reading and writing, than those in the |
581 | high-level interface. These functions are not (yet) officially part of | 566 | high-level interface. These functions are not (yet) officially part of |
582 | the library, and are not further documented here. If they break, you | 567 | the library, and are minimally documented here. If they break, you |
583 | get to keep all the pieces. I hope to document them properly when time | 568 | get to keep all the pieces. I hope to document them properly when time |
584 | permits. | 569 | permits. |
585 | 570 | ||
@@ -748,25 +733,31 @@ monitoring/debugging output. If the library has been compiled with | |||
748 | setting. | 733 | setting. |
749 | 734 | ||
750 | Parameter @code{workFactor} controls how the compression phase behaves | 735 | Parameter @code{workFactor} controls how the compression phase behaves |
751 | when presented with worst case, highly repetitive, input data. | 736 | when presented with worst case, highly repetitive, input data. If |
752 | If compression runs into difficulties caused by repetitive data, | 737 | compression runs into difficulties caused by repetitive data, the |
753 | some pseudo-random variations are inserted into the block, and | 738 | library switches from the standard sorting algorithm to a fallback |
754 | compression is restarted. Lower values of @code{workFactor} | 739 | algorithm. The fallback is slower than the standard algorithm by |
755 | reduce the tolerance of compression to repetitive data. | 740 | perhaps a factor of three, but always behaves reasonably, no matter how |
756 | You should set this parameter carefully; too low, and | 741 | bad the input. |
757 | compression ratio suffers, too high, and your average-to-worst | 742 | |
758 | case compression times can become very large. | 743 | Lower values of @code{workFactor} reduce the amount of effort the |
759 | The default value of 30 | 744 | standard algorithm will expend before resorting to the fallback. You |
760 | gives reasonable behaviour over a wide range of circumstances. | 745 | should set this parameter carefully; too low, and many inputs will be |
761 | 746 | handled by the fallback algorithm and so compress rather slowly, too | |
762 | Allowable values range from 0 to 250 inclusive. 0 is a special | 747 | high, and your average-to-worst case compression times can become very |
763 | case, equivalent to using the default value of 30. | 748 | large. The default value of 30 gives reasonable behaviour over a wide |
764 | 749 | range of circumstances. | |
765 | Note that the randomisation process is entirely transparent. | 750 | |
766 | If the library decides to randomise and restart compression on a | 751 | Allowable values range from 0 to 250 inclusive. 0 is a special case, |
767 | block, it does so without comment. Randomised blocks are | 752 | equivalent to using the default value of 30. |
768 | automatically de-randomised during decompression, so data | 753 | |
769 | integrity is never compromised. | 754 | Note that the compressed output generated is the same regardless of |
755 | whether or not the fallback algorithm is used. | ||
756 | |||
757 | Be aware also that this parameter may disappear entirely in future | ||
758 | versions of the library. In principle it should be possible to devise a | ||
759 | good way to automatically choose which algorithm to use. Such a | ||
760 | mechanism would render the parameter obsolete. | ||
770 | 761 | ||
771 | Possible return values: | 762 | Possible return values: |
772 | @display | 763 | @display |
@@ -1652,6 +1643,48 @@ Possible return values: | |||
1652 | 1643 | ||
1653 | 1644 | ||
1654 | 1645 | ||
1646 | @section @code{zlib} compatibility functions | ||
1647 | Yoshioka Tsuneo has contributed some functions to | ||
1648 | give better @code{zlib} compatibility. These functions are | ||
1649 | @code{bzopen}, @code{bzread}, @code{bzwrite}, @code{bzflush}, | ||
1650 | @code{bzclose}, | ||
1651 | @code{bzerror} and @code{bzlibVersion}. | ||
1652 | These functions are not (yet) officially part of | ||
1653 | the library. If they break, you get to keep all the pieces. | ||
1654 | Nevertheless, I think they work ok. | ||
1655 | @example | ||
1656 | typedef void BZFILE; | ||
1657 | |||
1658 | const char * bzlibVersion ( void ); | ||
1659 | @end example | ||
1660 | Returns a string indicating the library version. | ||
1661 | @example | ||
1662 | BZFILE * bzopen ( const char *path, const char *mode ); | ||
1663 | BZFILE * bzdopen ( int fd, const char *mode ); | ||
1664 | @end example | ||
1665 | Opens a @code{.bz2} file for reading or writing, using either its name | ||
1666 | or a pre-existing file descriptor. | ||
1667 | Analogous to @code{fopen} and @code{fdopen}. | ||
1668 | @example | ||
1669 | int bzread ( BZFILE* b, void* buf, int len ); | ||
1670 | int bzwrite ( BZFILE* b, void* buf, int len ); | ||
1671 | @end example | ||
1672 | Reads/writes data from/to a previously opened @code{BZFILE}. | ||
1673 | Analogous to @code{fread} and @code{fwrite}. | ||
1674 | @example | ||
1675 | int bzflush ( BZFILE* b ); | ||
1676 | void bzclose ( BZFILE* b ); | ||
1677 | @end example | ||
1678 | Flushes/closes a @code{BZFILE}. @code{bzflush} doesn't actually do | ||
1679 | anything. Analogous to @code{fflush} and @code{fclose}. | ||
1680 | |||
1681 | @example | ||
1682 | const char * bzerror ( BZFILE *b, int *errnum ) | ||
1683 | @end example | ||
1684 | Returns a string describing the more recent error status of | ||
1685 | @code{b}, and also sets @code{*errnum} to its numerical value. | ||
1686 | |||
1687 | |||
1655 | @section Using the library in a @code{stdio}-free environment | 1688 | @section Using the library in a @code{stdio}-free environment |
1656 | 1689 | ||
1657 | @subsection Getting rid of @code{stdio} | 1690 | @subsection Getting rid of @code{stdio} |
@@ -1677,14 +1710,14 @@ was compiled with @code{BZ_NO_STDIO} set. | |||
1677 | 1710 | ||
1678 | For a normal compile, an assertion failure yields the message | 1711 | For a normal compile, an assertion failure yields the message |
1679 | @example | 1712 | @example |
1680 | bzip2/libbzip2, v0.9.0: internal error number N. | 1713 | bzip2/libbzip2, v0.9.5: internal error number N. |
1681 | This is a bug in bzip2/libbzip2, v0.9.0. Please report | 1714 | This is a bug in bzip2/libbzip2, v0.9.5. Please report |
1682 | it to me at: jseward@@acm.org. If this happened when | 1715 | it to me at: jseward@@acm.org. If this happened when |
1683 | you were using some program which uses libbzip2 as a | 1716 | you were using some program which uses libbzip2 as a |
1684 | component, you should also report this bug to the author(s) | 1717 | component, you should also report this bug to the author(s) |
1685 | of that program. Please make an effort to report this bug; | 1718 | of that program. Please make an effort to report this bug; |
1686 | timely and accurate bug reports eventually lead to higher | 1719 | timely and accurate bug reports eventually lead to higher |
1687 | quality software. Thx. Julian Seward, 27 June 1998. | 1720 | quality software. Thanks. Julian Seward, 24 May 1999. |
1688 | @end example | 1721 | @end example |
1689 | where @code{N} is some error code number. @code{exit(3)} | 1722 | where @code{N} is some error code number. @code{exit(3)} |
1690 | is then called. | 1723 | is then called. |
@@ -1721,7 +1754,7 @@ If you can't | |||
1721 | open the project file for some reason, make a new one, naming these files: | 1754 | open the project file for some reason, make a new one, naming these files: |
1722 | @code{blocksort.c}, @code{bzlib.c}, @code{compress.c}, | 1755 | @code{blocksort.c}, @code{bzlib.c}, @code{compress.c}, |
1723 | @code{crctable.c}, @code{decompress.c}, @code{huffman.c}, @* | 1756 | @code{crctable.c}, @code{decompress.c}, @code{huffman.c}, @* |
1724 | @code{randtable.c} and @code{libbz2.def}. You might also need | 1757 | @code{randtable.c} and @code{libbz2.def}. You will also need |
1725 | to name the header files @code{bzlib.h} and @code{bzlib_private.h}. | 1758 | to name the header files @code{bzlib.h} and @code{bzlib_private.h}. |
1726 | 1759 | ||
1727 | If you don't use VC++, you may need to define the proprocessor symbol | 1760 | If you don't use VC++, you may need to define the proprocessor symbol |
@@ -1730,6 +1763,14 @@ If you don't use VC++, you may need to define the proprocessor symbol | |||
1730 | Finally, @code{dlltest.c} is a sample program using the DLL. It has a | 1763 | Finally, @code{dlltest.c} is a sample program using the DLL. It has a |
1731 | project file, @code{dlltest.dsp}. | 1764 | project file, @code{dlltest.dsp}. |
1732 | 1765 | ||
1766 | If you just want a makefile for Visual C, have a look at | ||
1767 | @code{makefile.msc}. | ||
1768 | |||
1769 | Be aware that if you compile @code{bzip2} itself on Win32, you must set | ||
1770 | @code{BZ_UNIX} to 0 and @code{BZ_LCCWIN32} to 1, in the file | ||
1771 | @code{bzip2.c}, before compiling. Otherwise the resulting binary won't | ||
1772 | work correctly. | ||
1773 | |||
1733 | I haven't tried any of this stuff myself, but it all looks plausible. | 1774 | I haven't tried any of this stuff myself, but it all looks plausible. |
1734 | 1775 | ||
1735 | 1776 | ||
@@ -1740,7 +1781,8 @@ These are just some random thoughts of mine. Your mileage may | |||
1740 | vary. | 1781 | vary. |
1741 | 1782 | ||
1742 | @section Limitations of the compressed file format | 1783 | @section Limitations of the compressed file format |
1743 | @code{bzip2-0.9.0} uses exactly the same file format as the previous | 1784 | @code{bzip2-0.9.5} and @code{0.9.0} |
1785 | use exactly the same file format as the previous | ||
1744 | version, @code{bzip2-0.1}. This decision was made in the interests of | 1786 | version, @code{bzip2-0.1}. This decision was made in the interests of |
1745 | stability. Creating yet another incompatible compressed file format | 1787 | stability. Creating yet another incompatible compressed file format |
1746 | would create further confusion and disruption for users. | 1788 | would create further confusion and disruption for users. |
@@ -1776,12 +1818,11 @@ decompression and, in retrospect, are unnecessary. These are: | |||
1776 | @code{bzip2-0.1}'s performance on repetitive data, so | 1818 | @code{bzip2-0.1}'s performance on repetitive data, so |
1777 | perhaps it isn't a problem for real inputs. | 1819 | perhaps it isn't a problem for real inputs. |
1778 | 1820 | ||
1779 | Probably the best long-term solution | 1821 | Probably the best long-term solution, |
1822 | and the one I have incorporated into 0.9.5 and above, | ||
1780 | is to use the existing sorting | 1823 | is to use the existing sorting |
1781 | algorithm initially, and fall back to a O(N (log N)^2) | 1824 | algorithm initially, and fall back to a O(N (log N)^2) |
1782 | algorithm if the standard algorithm gets into difficulties. | 1825 | algorithm if the standard algorithm gets into difficulties. |
1783 | This can be done without much difficulty; I made | ||
1784 | a prototype implementation of it some months now. | ||
1785 | @item The compressed file format was never designed to be | 1826 | @item The compressed file format was never designed to be |
1786 | handled by a library, and I have had to jump though | 1827 | handled by a library, and I have had to jump though |
1787 | some hoops to produce an efficient implementation of | 1828 | some hoops to produce an efficient implementation of |
@@ -1797,7 +1838,7 @@ It would be fair to say that the @code{bzip2} format was frozen | |||
1797 | before I properly and fully understood the performance | 1838 | before I properly and fully understood the performance |
1798 | consequences of doing so. | 1839 | consequences of doing so. |
1799 | 1840 | ||
1800 | Improvements which I have been able to incorporate into | 1841 | Improvements which I was able to incorporate into |
1801 | 0.9.0, despite using the same file format, are: | 1842 | 0.9.0, despite using the same file format, are: |
1802 | @itemize @bullet | 1843 | @itemize @bullet |
1803 | @item Single array implementation of the inverse BWT. This | 1844 | @item Single array implementation of the inverse BWT. This |
@@ -1808,8 +1849,7 @@ Improvements which I have been able to incorporate into | |||
1808 | of values. | 1849 | of values. |
1809 | @item @code{bzip2-0.9.0} now reads and writes files with @code{fread} | 1850 | @item @code{bzip2-0.9.0} now reads and writes files with @code{fread} |
1810 | and @code{fwrite}; version 0.1 used @code{putc} and @code{getc}. | 1851 | and @code{fwrite}; version 0.1 used @code{putc} and @code{getc}. |
1811 | Duh! I'm embarrassed at my own moronicness (moronicity?) on this | 1852 | Duh! Well, you live and learn. |
1812 | one. | ||
1813 | 1853 | ||
1814 | @end itemize | 1854 | @end itemize |
1815 | Further ahead, it would be nice | 1855 | Further ahead, it would be nice |
@@ -1820,7 +1860,7 @@ require some careful design of compressed file formats. | |||
1820 | 1860 | ||
1821 | @section Portability issues | 1861 | @section Portability issues |
1822 | After some consideration, I have decided not to use | 1862 | After some consideration, I have decided not to use |
1823 | GNU @code{autoconf} to configure 0.9.0. | 1863 | GNU @code{autoconf} to configure 0.9.5. |
1824 | 1864 | ||
1825 | @code{autoconf}, admirable and wonderful though it is, | 1865 | @code{autoconf}, admirable and wonderful though it is, |
1826 | mainly assists with portability problems between Unix-like | 1866 | mainly assists with portability problems between Unix-like |
@@ -1835,8 +1875,10 @@ under Unix straight out-of-the-box, so to speak, especially | |||
1835 | if you have a version of GNU C available. | 1875 | if you have a version of GNU C available. |
1836 | 1876 | ||
1837 | There are a couple of @code{__inline__} directives in the code. GNU C | 1877 | There are a couple of @code{__inline__} directives in the code. GNU C |
1838 | (@code{gcc}) should be able to handle them. If your compiler doesn't | 1878 | (@code{gcc}) should be able to handle them. If you're not using |
1839 | like them, just @code{#define} @code{__inline__} to be null. One | 1879 | GNU C, your C compiler shouldn't see them at all. |
1880 | If your compiler does, for some reason, see them and doesn't | ||
1881 | like them, just @code{#define} @code{__inline__} to be @code{/* */}. One | ||
1840 | easy way to do this is to compile with the flag @code{-D__inline__=}, | 1882 | easy way to do this is to compile with the flag @code{-D__inline__=}, |
1841 | which should be understood by most Unix compilers. | 1883 | which should be understood by most Unix compilers. |
1842 | 1884 | ||
@@ -1853,6 +1895,11 @@ distribution, please try and link it statically (@code{gcc -s}). This | |||
1853 | avoids all sorts of library-version issues that others may encounter | 1895 | avoids all sorts of library-version issues that others may encounter |
1854 | later on. | 1896 | later on. |
1855 | 1897 | ||
1898 | If you build @code{bzip2} on Win32, you must set @code{BZ_UNIX} to 0 and | ||
1899 | @code{BZ_LCCWIN32} to 1, in the file @code{bzip2.c}, before compiling. | ||
1900 | Otherwise the resulting binary won't work correctly. | ||
1901 | |||
1902 | |||
1856 | 1903 | ||
1857 | @section Reporting bugs | 1904 | @section Reporting bugs |
1858 | I tried pretty hard to make sure @code{bzip2} is | 1905 | I tried pretty hard to make sure @code{bzip2} is |
@@ -1976,45 +2023,39 @@ A record of the tests I've done. | |||
1976 | 2023 | ||
1977 | First, some data sets: | 2024 | First, some data sets: |
1978 | @itemize @bullet | 2025 | @itemize @bullet |
1979 | @item B: a directory containing a 6001 files, one for every length in the | 2026 | @item B: a directory containing 6001 files, one for every length in the |
1980 | range 0 to 6000 bytes. The files contain random lowercase | 2027 | range 0 to 6000 bytes. The files contain random lowercase |
1981 | letters. 18.7 megabytes. | 2028 | letters. 18.7 megabytes. |
1982 | @item H: my home directory tree. Documents, source code, mail files, | 2029 | @item H: my home directory tree. Documents, source code, mail files, |
1983 | compressed data. H contains B, and also a directory of | 2030 | compressed data. H contains B, and also a directory of |
1984 | files designed as boundary cases for the sorting; mostly very | 2031 | files designed as boundary cases for the sorting; mostly very |
1985 | repetitive, nasty files. 445 megabytes. | 2032 | repetitive, nasty files. 565 megabytes. |
1986 | @item A: directory tree holding various applications built from source: | 2033 | @item A: directory tree holding various applications built from source: |
1987 | @code{egcs-1.0.2}, @code{gcc-2.8.1}, KDE Beta 4, GTK, Octave, etc. | 2034 | @code{egcs}, @code{gcc-2.8.1}, KDE, GTK, Octave, etc. |
1988 | 827 megabytes. | 2035 | 2200 megabytes. |
1989 | @item P: directory tree holding large amounts of source code (@code{.tar} | ||
1990 | files) of the entire GNU distribution, plus a couple of | ||
1991 | Linux distributions. 2400 megabytes. | ||
1992 | @end itemize | 2036 | @end itemize |
1993 | The tests conducted are as follows. Each test means compressing | 2037 | The tests conducted are as follows. Each test means compressing |
1994 | (a copy of) each file in the data set, decompressing it and | 2038 | (a copy of) each file in the data set, decompressing it and |
1995 | comparing it against the original. | 2039 | comparing it against the original. |
1996 | 2040 | ||
1997 | First, a bunch of tests with block sizes, internal buffer | 2041 | First, a bunch of tests with block sizes and internal buffer |
1998 | sizes and randomisation lengths set very small, | 2042 | sizes set very small, |
1999 | to detect any problems with the | 2043 | to detect any problems with the |
2000 | blocking, buffering and randomisation mechanisms. | 2044 | blocking and buffering mechanisms. |
2001 | This required modifying the source code so as to try to | 2045 | This required modifying the source code so as to try to |
2002 | break it. | 2046 | break it. |
2003 | @enumerate | 2047 | @enumerate |
2004 | @item Data set H, with | 2048 | @item Data set H, with |
2005 | buffer size of 1 byte, and block size of 23 bytes. | 2049 | buffer size of 1 byte, and block size of 23 bytes. |
2006 | @item Data set B, buffer sizes 1 byte, block size 1 byte. | 2050 | @item Data set B, buffer sizes 1 byte, block size 1 byte. |
2007 | @item As (2) but small-mode decompression (first 1700 files). | 2051 | @item As (2) but small-mode decompression. |
2008 | @item As (2) with block size 2 bytes. | 2052 | @item As (2) with block size 2 bytes. |
2009 | @item As (2) with block size 3 bytes. | 2053 | @item As (2) with block size 3 bytes. |
2010 | @item As (2) with block size 4 bytes. | 2054 | @item As (2) with block size 4 bytes. |
2011 | @item As (2) with block size 5 bytes. | 2055 | @item As (2) with block size 5 bytes. |
2012 | @item As (2) with block size 6 bytes and small-mode decompression. | 2056 | @item As (2) with block size 6 bytes and small-mode decompression. |
2013 | @item H with normal buffer sizes (5000 bytes), normal block | 2057 | @item H with buffer size of 1 byte, but normal block |
2014 | size (up to 900000 bytes), but with randomisation | 2058 | size (up to 900000 bytes). |
2015 | mechanism running intensely (randomising approximately every | ||
2016 | third byte). | ||
2017 | @item As (9) with small-mode decompression. | ||
2018 | @end enumerate | 2059 | @end enumerate |
2019 | Then some tests with unmodified source code. | 2060 | Then some tests with unmodified source code. |
2020 | @enumerate | 2061 | @enumerate |
@@ -2023,18 +2064,26 @@ Then some tests with unmodified source code. | |||
2023 | @item H, compress with flag @code{-1}. | 2064 | @item H, compress with flag @code{-1}. |
2024 | @item H, compress with flag @code{-s}, decompress with flag @code{-s}. | 2065 | @item H, compress with flag @code{-s}, decompress with flag @code{-s}. |
2025 | @item Forwards compatibility: H, @code{bzip2-0.1pl2} compressing, | 2066 | @item Forwards compatibility: H, @code{bzip2-0.1pl2} compressing, |
2026 | @code{bzip2-0.9.0} decompressing, all settings normal. | 2067 | @code{bzip2-0.9.5} decompressing, all settings normal. |
2027 | @item Backwards compatibility: H, @code{bzip2-0.9.0} compressing, | 2068 | @item Backwards compatibility: H, @code{bzip2-0.9.5} compressing, |
2028 | @code{bzip2-0.1pl2} decompressing, all settings normal. | 2069 | @code{bzip2-0.1pl2} decompressing, all settings normal. |
2029 | @item Bigger tests: A, all settings normal. | 2070 | @item Bigger tests: A, all settings normal. |
2030 | @item P, all settings normal. | 2071 | @item As (7), using the fallback (Sadakane-like) sorting algorithm. |
2031 | @item Misc test: about 100 megabytes of @code{.tar} files with | 2072 | @item As (8), compress with flag @code{-1}, decompress with flag |
2032 | @code{bzip2} compiled with Purify. | 2073 | @code{-s}. |
2074 | @item H, using the fallback sorting algorithm. | ||
2075 | @item Forwards compatibility: A, @code{bzip2-0.1pl2} compressing, | ||
2076 | @code{bzip2-0.9.5} decompressing, all settings normal. | ||
2077 | @item Backwards compatibility: A, @code{bzip2-0.9.5} compressing, | ||
2078 | @code{bzip2-0.1pl2} decompressing, all settings normal. | ||
2079 | @item Misc test: about 400 megabytes of @code{.tar} files with | ||
2080 | @code{bzip2} compiled with Checker (a memory access error | ||
2081 | detector, like Purify). | ||
2033 | @item Misc tests to make sure it builds and runs ok on non-Linux/x86 | 2082 | @item Misc tests to make sure it builds and runs ok on non-Linux/x86 |
2034 | platforms. | 2083 | platforms. |
2035 | @end enumerate | 2084 | @end enumerate |
2036 | These tests were conducted on a 205 MHz Cyrix 6x86MX machine, running | 2085 | These tests were conducted on a 225 MHz IDT WinChip machine, running |
2037 | Linux 2.0.32. They represent nearly a week of continuous computation. | 2086 | Linux 2.0.36. They represent nearly a week of continuous computation. |
2038 | All tests completed successfully. | 2087 | All tests completed successfully. |
2039 | 2088 | ||
2040 | 2089 | ||
@@ -2063,7 +2112,7 @@ David J. Wheeler | |||
2063 | Program bred3.c and accompanying document bred3.ps. | 2112 | Program bred3.c and accompanying document bred3.ps. |
2064 | This contains the idea behind the multi-table Huffman | 2113 | This contains the idea behind the multi-table Huffman |
2065 | coding scheme. | 2114 | coding scheme. |
2066 | ftp://ftp.cl.cam.ac.uk/pub/user/djw3/ | 2115 | ftp://ftp.cl.cam.ac.uk/users/djw3/ |
2067 | 2116 | ||
2068 | Jon L. Bentley and Robert Sedgewick | 2117 | Jon L. Bentley and Robert Sedgewick |
2069 | "Fast Algorithms for Sorting and Searching Strings" | 2118 | "Fast Algorithms for Sorting and Searching Strings" |