diff options
Diffstat (limited to 'manual.texi')
-rw-r--r-- | manual.texi | 2243 |
1 files changed, 0 insertions, 2243 deletions
diff --git a/manual.texi b/manual.texi deleted file mode 100644 index 5bc27d5..0000000 --- a/manual.texi +++ /dev/null | |||
@@ -1,2243 +0,0 @@ | |||
1 | \input texinfo @c -*- Texinfo -*- | ||
2 | @setfilename bzip2.info | ||
3 | |||
4 | @ignore | ||
5 | This file documents bzip2 version 1.0.2, and associated library | ||
6 | libbzip2, written by Julian Seward (jseward@acm.org). | ||
7 | |||
8 | Copyright (C) 1996-2002 Julian R Seward | ||
9 | |||
10 | Permission is granted to make and distribute verbatim copies of | ||
11 | this manual provided the copyright notice and this permission notice | ||
12 | are preserved on all copies. | ||
13 | |||
14 | Permission is granted to copy and distribute translations of this manual | ||
15 | into another language, under the above conditions for verbatim copies. | ||
16 | @end ignore | ||
17 | |||
18 | @ifinfo | ||
19 | @format | ||
20 | START-INFO-DIR-ENTRY | ||
21 | * Bzip2: (bzip2). A program and library for data compression. | ||
22 | END-INFO-DIR-ENTRY | ||
23 | @end format | ||
24 | |||
25 | @end ifinfo | ||
26 | |||
27 | @iftex | ||
28 | @c @finalout | ||
29 | @settitle bzip2 and libbzip2 | ||
30 | @titlepage | ||
31 | @title bzip2 and libbzip2 | ||
32 | @subtitle a program and library for data compression | ||
33 | @subtitle copyright (C) 1996-2002 Julian Seward | ||
34 | @subtitle version 1.0.2 of 30 December 2001 | ||
35 | @author Julian Seward | ||
36 | |||
37 | @end titlepage | ||
38 | |||
39 | @parindent 0mm | ||
40 | @parskip 2mm | ||
41 | |||
42 | @end iftex | ||
43 | @node Top,,, (dir) | ||
44 | |||
45 | The following text is the License for this software. You should | ||
46 | find it identical to that contained in the file LICENSE in the | ||
47 | source distribution. | ||
48 | |||
49 | @bf{------------------ START OF THE LICENSE ------------------} | ||
50 | |||
51 | This program, @code{bzip2}, | ||
52 | and associated library @code{libbzip2}, are | ||
53 | Copyright (C) 1996-2002 Julian R Seward. All rights reserved. | ||
54 | |||
55 | Redistribution and use in source and binary forms, with or without | ||
56 | modification, are permitted provided that the following conditions | ||
57 | are met: | ||
58 | @itemize @bullet | ||
59 | @item | ||
60 | Redistributions of source code must retain the above copyright | ||
61 | notice, this list of conditions and the following disclaimer. | ||
62 | @item | ||
63 | The origin of this software must not be misrepresented; you must | ||
64 | not claim that you wrote the original software. If you use this | ||
65 | software in a product, an acknowledgment in the product | ||
66 | documentation would be appreciated but is not required. | ||
67 | @item | ||
68 | Altered source versions must be plainly marked as such, and must | ||
69 | not be misrepresented as being the original software. | ||
70 | @item | ||
71 | The name of the author may not be used to endorse or promote | ||
72 | products derived from this software without specific prior written | ||
73 | permission. | ||
74 | @end itemize | ||
75 | THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS | ||
76 | OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED | ||
77 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE | ||
78 | ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY | ||
79 | DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL | ||
80 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE | ||
81 | GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS | ||
82 | INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, | ||
83 | WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING | ||
84 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS | ||
85 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | ||
86 | |||
87 | Julian Seward, Cambridge, UK. | ||
88 | |||
89 | @code{jseward@@acm.org} | ||
90 | |||
91 | @code{bzip2}/@code{libbzip2} version 1.0.2 of 30 December 2001. | ||
92 | |||
93 | @bf{------------------ END OF THE LICENSE ------------------} | ||
94 | |||
95 | Web sites: | ||
96 | |||
97 | @code{http://sources.redhat.com/bzip2} | ||
98 | |||
99 | @code{http://www.cacheprof.org} | ||
100 | |||
101 | PATENTS: To the best of my knowledge, @code{bzip2} does not use any patented | ||
102 | algorithms. However, I do not have the resources available to carry out | ||
103 | a full patent search. Therefore I cannot give any guarantee of the | ||
104 | above statement. | ||
105 | |||
106 | |||
107 | |||
108 | |||
109 | |||
110 | |||
111 | |||
112 | @chapter Introduction | ||
113 | |||
114 | @code{bzip2} compresses files using the Burrows-Wheeler | ||
115 | block-sorting text compression algorithm, and Huffman coding. | ||
116 | Compression is generally considerably better than that | ||
117 | achieved by more conventional LZ77/LZ78-based compressors, | ||
118 | and approaches the performance of the PPM family of statistical compressors. | ||
119 | |||
120 | @code{bzip2} is built on top of @code{libbzip2}, a flexible library | ||
121 | for handling compressed data in the @code{bzip2} format. This manual | ||
122 | describes both how to use the program and | ||
123 | how to work with the library interface. Most of the | ||
124 | manual is devoted to this library, not the program, | ||
125 | which is good news if your interest is only in the program. | ||
126 | |||
127 | Chapter 2 describes how to use @code{bzip2}; this is the only part | ||
128 | you need to read if you just want to know how to operate the program. | ||
129 | Chapter 3 describes the programming interfaces in detail, and | ||
130 | Chapter 4 records some miscellaneous notes which I thought | ||
131 | ought to be recorded somewhere. | ||
132 | |||
133 | |||
134 | @chapter How to use @code{bzip2} | ||
135 | |||
136 | This chapter contains a copy of the @code{bzip2} man page, | ||
137 | and nothing else. | ||
138 | |||
139 | @quotation | ||
140 | |||
141 | @unnumberedsubsubsec NAME | ||
142 | @itemize | ||
143 | @item @code{bzip2}, @code{bunzip2} | ||
144 | - a block-sorting file compressor, v1.0.2 | ||
145 | @item @code{bzcat} | ||
146 | - decompresses files to stdout | ||
147 | @item @code{bzip2recover} | ||
148 | - recovers data from damaged bzip2 files | ||
149 | @end itemize | ||
150 | |||
151 | @unnumberedsubsubsec SYNOPSIS | ||
152 | @itemize | ||
153 | @item @code{bzip2} [ -cdfkqstvzVL123456789 ] [ filenames ... ] | ||
154 | @item @code{bunzip2} [ -fkvsVL ] [ filenames ... ] | ||
155 | @item @code{bzcat} [ -s ] [ filenames ... ] | ||
156 | @item @code{bzip2recover} filename | ||
157 | @end itemize | ||
158 | |||
159 | @unnumberedsubsubsec DESCRIPTION | ||
160 | |||
161 | @code{bzip2} compresses files using the Burrows-Wheeler block sorting | ||
162 | text compression algorithm, and Huffman coding. Compression is | ||
163 | generally considerably better than that achieved by more conventional | ||
164 | LZ77/LZ78-based compressors, and approaches the performance of the PPM | ||
165 | family of statistical compressors. | ||
166 | |||
167 | The command-line options are deliberately very similar to those of GNU | ||
168 | @code{gzip}, but they are not identical. | ||
169 | |||
170 | @code{bzip2} expects a list of file names to accompany the command-line | ||
171 | flags. Each file is replaced by a compressed version of itself, with | ||
172 | the name @code{original_name.bz2}. Each compressed file has the same | ||
173 | modification date, permissions, and, when possible, ownership as the | ||
174 | corresponding original, so that these properties can be correctly | ||
175 | restored at decompression time. File name handling is naive in the | ||
176 | sense that there is no mechanism for preserving original file names, | ||
177 | permissions, ownerships or dates in filesystems which lack these | ||
178 | concepts, or have serious file name length restrictions, such as MS-DOS. | ||
179 | |||
180 | @code{bzip2} and @code{bunzip2} will by default not overwrite existing | ||
181 | files. If you want this to happen, specify the @code{-f} flag. | ||
182 | |||
183 | If no file names are specified, @code{bzip2} compresses from standard | ||
184 | input to standard output. In this case, @code{bzip2} will decline to | ||
185 | write compressed output to a terminal, as this would be entirely | ||
186 | incomprehensible and therefore pointless. | ||
187 | |||
188 | @code{bunzip2} (or @code{bzip2 -d}) decompresses all | ||
189 | specified files. Files which were not created by @code{bzip2} | ||
190 | will be detected and ignored, and a warning issued. | ||
191 | @code{bzip2} attempts to guess the filename for the decompressed file | ||
192 | from that of the compressed file as follows: | ||
193 | @itemize | ||
194 | @item @code{filename.bz2 } becomes @code{filename} | ||
195 | @item @code{filename.bz } becomes @code{filename} | ||
196 | @item @code{filename.tbz2} becomes @code{filename.tar} | ||
197 | @item @code{filename.tbz } becomes @code{filename.tar} | ||
198 | @item @code{anyothername } becomes @code{anyothername.out} | ||
199 | @end itemize | ||
200 | If the file does not end in one of the recognised endings, | ||
201 | @code{.bz2}, @code{.bz}, | ||
202 | @code{.tbz2} or @code{.tbz}, @code{bzip2} complains that it cannot | ||
203 | guess the name of the original file, and uses the original name | ||
204 | with @code{.out} appended. | ||
205 | |||
206 | As with compression, supplying no | ||
207 | filenames causes decompression from standard input to standard output. | ||
208 | |||
209 | @code{bunzip2} will correctly decompress a file which is the | ||
210 | concatenation of two or more compressed files. The result is the | ||
211 | concatenation of the corresponding uncompressed files. Integrity | ||
212 | testing (@code{-t}) of concatenated compressed files is also supported. | ||
213 | |||
214 | You can also compress or decompress files to the standard output by | ||
215 | giving the @code{-c} flag. Multiple files may be compressed and | ||
216 | decompressed like this. The resulting outputs are fed sequentially to | ||
217 | stdout. Compression of multiple files in this manner generates a stream | ||
218 | containing multiple compressed file representations. Such a stream | ||
219 | can be decompressed correctly only by @code{bzip2} version 0.9.0 or | ||
220 | later. Earlier versions of @code{bzip2} will stop after decompressing | ||
221 | the first file in the stream. | ||
222 | |||
223 | @code{bzcat} (or @code{bzip2 -dc}) decompresses all specified files to | ||
224 | the standard output. | ||
225 | |||
226 | @code{bzip2} will read arguments from the environment variables | ||
227 | @code{BZIP2} and @code{BZIP}, in that order, and will process them | ||
228 | before any arguments read from the command line. This gives a | ||
229 | convenient way to supply default arguments. | ||
230 | |||
231 | Compression is always performed, even if the compressed file is slightly | ||
232 | larger than the original. Files of less than about one hundred bytes | ||
233 | tend to get larger, since the compression mechanism has a constant | ||
234 | overhead in the region of 50 bytes. Random data (including the output | ||
235 | of most file compressors) is coded at about 8.05 bits per byte, giving | ||
236 | an expansion of around 0.5%. | ||
237 | |||
238 | As a self-check for your protection, @code{bzip2} uses 32-bit CRCs to | ||
239 | make sure that the decompressed version of a file is identical to the | ||
240 | original. This guards against corruption of the compressed data, and | ||
241 | against undetected bugs in @code{bzip2} (hopefully very unlikely). The | ||
242 | chances of data corruption going undetected is microscopic, about one | ||
243 | chance in four billion for each file processed. Be aware, though, that | ||
244 | the check occurs upon decompression, so it can only tell you that | ||
245 | something is wrong. It can't help you recover the original uncompressed | ||
246 | data. You can use @code{bzip2recover} to try to recover data from | ||
247 | damaged files. | ||
248 | |||
249 | Return values: 0 for a normal exit, 1 for environmental problems (file | ||
250 | not found, invalid flags, I/O errors, &c), 2 to indicate a corrupt | ||
251 | compressed file, 3 for an internal consistency error (eg, bug) which | ||
252 | caused @code{bzip2} to panic. | ||
253 | |||
254 | |||
255 | @unnumberedsubsubsec OPTIONS | ||
256 | @table @code | ||
257 | @item -c --stdout | ||
258 | Compress or decompress to standard output. | ||
259 | @item -d --decompress | ||
260 | Force decompression. @code{bzip2}, @code{bunzip2} and @code{bzcat} are | ||
261 | really the same program, and the decision about what actions to take is | ||
262 | done on the basis of which name is used. This flag overrides that | ||
263 | mechanism, and forces bzip2 to decompress. | ||
264 | @item -z --compress | ||
265 | The complement to @code{-d}: forces compression, regardless of the | ||
266 | invokation name. | ||
267 | @item -t --test | ||
268 | Check integrity of the specified file(s), but don't decompress them. | ||
269 | This really performs a trial decompression and throws away the result. | ||
270 | @item -f --force | ||
271 | Force overwrite of output files. Normally, @code{bzip2} will not overwrite | ||
272 | existing output files. Also forces @code{bzip2} to break hard links | ||
273 | to files, which it otherwise wouldn't do. | ||
274 | |||
275 | @code{bzip2} normally declines to decompress files which don't have the | ||
276 | correct magic header bytes. If forced (@code{-f}), however, it will | ||
277 | pass such files through unmodified. This is how GNU @code{gzip} | ||
278 | behaves. | ||
279 | @item -k --keep | ||
280 | Keep (don't delete) input files during compression | ||
281 | or decompression. | ||
282 | @item -s --small | ||
283 | Reduce memory usage, for compression, decompression and testing. Files | ||
284 | are decompressed and tested using a modified algorithm which only | ||
285 | requires 2.5 bytes per block byte. This means any file can be | ||
286 | decompressed in 2300k of memory, albeit at about half the normal speed. | ||
287 | |||
288 | During compression, @code{-s} selects a block size of 200k, which limits | ||
289 | memory use to around the same figure, at the expense of your compression | ||
290 | ratio. In short, if your machine is low on memory (8 megabytes or | ||
291 | less), use -s for everything. See MEMORY MANAGEMENT below. | ||
292 | @item -q --quiet | ||
293 | Suppress non-essential warning messages. Messages pertaining to | ||
294 | I/O errors and other critical events will not be suppressed. | ||
295 | @item -v --verbose | ||
296 | Verbose mode -- show the compression ratio for each file processed. | ||
297 | Further @code{-v}'s increase the verbosity level, spewing out lots of | ||
298 | information which is primarily of interest for diagnostic purposes. | ||
299 | @item -L --license -V --version | ||
300 | Display the software version, license terms and conditions. | ||
301 | @item -1 (or --fast) to -9 (or --best) | ||
302 | Set the block size to 100 k, 200 k .. 900 k when compressing. Has no | ||
303 | effect when decompressing. See MEMORY MANAGEMENT below. | ||
304 | The @code{--fast} and @code{--best} aliases are primarily for GNU | ||
305 | @code{gzip} compatibility. In particular, @code{--fast} doesn't make | ||
306 | things significantly faster. And @code{--best} merely selects the | ||
307 | default behaviour. | ||
308 | @item -- | ||
309 | Treats all subsequent arguments as file names, even if they start | ||
310 | with a dash. This is so you can handle files with names beginning | ||
311 | with a dash, for example: @code{bzip2 -- -myfilename}. | ||
312 | @item --repetitive-fast | ||
313 | @item --repetitive-best | ||
314 | These flags are redundant in versions 0.9.5 and above. They provided | ||
315 | some coarse control over the behaviour of the sorting algorithm in | ||
316 | earlier versions, which was sometimes useful. 0.9.5 and above have an | ||
317 | improved algorithm which renders these flags irrelevant. | ||
318 | @end table | ||
319 | |||
320 | |||
321 | @unnumberedsubsubsec MEMORY MANAGEMENT | ||
322 | |||
323 | @code{bzip2} compresses large files in blocks. The block size affects | ||
324 | both the compression ratio achieved, and the amount of memory needed for | ||
325 | compression and decompression. The flags @code{-1} through @code{-9} | ||
326 | specify the block size to be 100,000 bytes through 900,000 bytes (the | ||
327 | default) respectively. At decompression time, the block size used for | ||
328 | compression is read from the header of the compressed file, and | ||
329 | @code{bunzip2} then allocates itself just enough memory to decompress | ||
330 | the file. Since block sizes are stored in compressed files, it follows | ||
331 | that the flags @code{-1} to @code{-9} are irrelevant to and so ignored | ||
332 | during decompression. | ||
333 | |||
334 | Compression and decompression requirements, in bytes, can be estimated | ||
335 | as: | ||
336 | @example | ||
337 | Compression: 400k + ( 8 x block size ) | ||
338 | |||
339 | Decompression: 100k + ( 4 x block size ), or | ||
340 | 100k + ( 2.5 x block size ) | ||
341 | @end example | ||
342 | Larger block sizes give rapidly diminishing marginal returns. Most of | ||
343 | the compression comes from the first two or three hundred k of block | ||
344 | size, a fact worth bearing in mind when using @code{bzip2} on small machines. | ||
345 | It is also important to appreciate that the decompression memory | ||
346 | requirement is set at compression time by the choice of block size. | ||
347 | |||
348 | For files compressed with the default 900k block size, @code{bunzip2} | ||
349 | will require about 3700 kbytes to decompress. To support decompression | ||
350 | of any file on a 4 megabyte machine, @code{bunzip2} has an option to | ||
351 | decompress using approximately half this amount of memory, about 2300 | ||
352 | kbytes. Decompression speed is also halved, so you should use this | ||
353 | option only where necessary. The relevant flag is @code{-s}. | ||
354 | |||
355 | In general, try and use the largest block size memory constraints allow, | ||
356 | since that maximises the compression achieved. Compression and | ||
357 | decompression speed are virtually unaffected by block size. | ||
358 | |||
359 | Another significant point applies to files which fit in a single block | ||
360 | -- that means most files you'd encounter using a large block size. The | ||
361 | amount of real memory touched is proportional to the size of the file, | ||
362 | since the file is smaller than a block. For example, compressing a file | ||
363 | 20,000 bytes long with the flag @code{-9} will cause the compressor to | ||
364 | allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560 | ||
365 | kbytes of it. Similarly, the decompressor will allocate 3700k but only | ||
366 | touch 100k + 20000 * 4 = 180 kbytes. | ||
367 | |||
368 | Here is a table which summarises the maximum memory usage for different | ||
369 | block sizes. Also recorded is the total compressed size for 14 files of | ||
370 | the Calgary Text Compression Corpus totalling 3,141,622 bytes. This | ||
371 | column gives some feel for how compression varies with block size. | ||
372 | These figures tend to understate the advantage of larger block sizes for | ||
373 | larger files, since the Corpus is dominated by smaller files. | ||
374 | @example | ||
375 | Compress Decompress Decompress Corpus | ||
376 | Flag usage usage -s usage Size | ||
377 | |||
378 | -1 1200k 500k 350k 914704 | ||
379 | -2 2000k 900k 600k 877703 | ||
380 | -3 2800k 1300k 850k 860338 | ||
381 | -4 3600k 1700k 1100k 846899 | ||
382 | -5 4400k 2100k 1350k 845160 | ||
383 | -6 5200k 2500k 1600k 838626 | ||
384 | -7 6100k 2900k 1850k 834096 | ||
385 | -8 6800k 3300k 2100k 828642 | ||
386 | -9 7600k 3700k 2350k 828642 | ||
387 | @end example | ||
388 | |||
389 | @unnumberedsubsubsec RECOVERING DATA FROM DAMAGED FILES | ||
390 | |||
391 | @code{bzip2} compresses files in blocks, usually 900kbytes long. Each | ||
392 | block is handled independently. If a media or transmission error causes | ||
393 | a multi-block @code{.bz2} file to become damaged, it may be possible to | ||
394 | recover data from the undamaged blocks in the file. | ||
395 | |||
396 | The compressed representation of each block is delimited by a 48-bit | ||
397 | pattern, which makes it possible to find the block boundaries with | ||
398 | reasonable certainty. Each block also carries its own 32-bit CRC, so | ||
399 | damaged blocks can be distinguished from undamaged ones. | ||
400 | |||
401 | @code{bzip2recover} is a simple program whose purpose is to search for | ||
402 | blocks in @code{.bz2} files, and write each block out into its own | ||
403 | @code{.bz2} file. You can then use @code{bzip2 -t} to test the | ||
404 | integrity of the resulting files, and decompress those which are | ||
405 | undamaged. | ||
406 | |||
407 | @code{bzip2recover} | ||
408 | takes a single argument, the name of the damaged file, and writes a | ||
409 | number of files @code{rec00001file.bz2}, @code{rec00002file.bz2}, etc, | ||
410 | containing the extracted blocks. The output filenames are designed so | ||
411 | that the use of wildcards in subsequent processing -- for example, | ||
412 | @code{bzip2 -dc rec*file.bz2 > recovered_data} -- processes the files in | ||
413 | the correct order. | ||
414 | |||
415 | @code{bzip2recover} should be of most use dealing with large @code{.bz2} | ||
416 | files, as these will contain many blocks. It is clearly futile to use | ||
417 | it on damaged single-block files, since a damaged block cannot be | ||
418 | recovered. If you wish to minimise any potential data loss through | ||
419 | media or transmission errors, you might consider compressing with a | ||
420 | smaller block size. | ||
421 | |||
422 | |||
423 | @unnumberedsubsubsec PERFORMANCE NOTES | ||
424 | |||
425 | The sorting phase of compression gathers together similar strings in the | ||
426 | file. Because of this, files containing very long runs of repeated | ||
427 | symbols, like "aabaabaabaab ..." (repeated several hundred times) may | ||
428 | compress more slowly than normal. Versions 0.9.5 and above fare much | ||
429 | better than previous versions in this respect. The ratio between | ||
430 | worst-case and average-case compression time is in the region of 10:1. | ||
431 | For previous versions, this figure was more like 100:1. You can use the | ||
432 | @code{-vvvv} option to monitor progress in great detail, if you want. | ||
433 | |||
434 | Decompression speed is unaffected by these phenomena. | ||
435 | |||
436 | @code{bzip2} usually allocates several megabytes of memory to operate | ||
437 | in, and then charges all over it in a fairly random fashion. This means | ||
438 | that performance, both for compressing and decompressing, is largely | ||
439 | determined by the speed at which your machine can service cache misses. | ||
440 | Because of this, small changes to the code to reduce the miss rate have | ||
441 | been observed to give disproportionately large performance improvements. | ||
442 | I imagine @code{bzip2} will perform best on machines with very large | ||
443 | caches. | ||
444 | |||
445 | |||
446 | @unnumberedsubsubsec CAVEATS | ||
447 | |||
448 | I/O error messages are not as helpful as they could be. @code{bzip2} | ||
449 | tries hard to detect I/O errors and exit cleanly, but the details of | ||
450 | what the problem is sometimes seem rather misleading. | ||
451 | |||
452 | This manual page pertains to version 1.0.2 of @code{bzip2}. Compressed | ||
453 | data created by this version is entirely forwards and backwards | ||
454 | compatible with the previous public releases, versions 0.1pl2, 0.9.0, | ||
455 | 0.9.5, 1.0.0 and 1.0.1, but with the following exception: 0.9.0 and | ||
456 | above can correctly decompress multiple concatenated compressed files. | ||
457 | 0.1pl2 cannot do this; it will stop after decompressing just the first | ||
458 | file in the stream. | ||
459 | |||
460 | @code{bzip2recover} versions prior to this one, 1.0.2, used 32-bit | ||
461 | integers to represent bit positions in compressed files, so it could not | ||
462 | handle compressed files more than 512 megabytes long. Version 1.0.2 and | ||
463 | above uses 64-bit ints on some platforms which support them (GNU | ||
464 | supported targets, and Windows). To establish whether or not | ||
465 | @code{bzip2recover} was built with such a limitation, run it without | ||
466 | arguments. In any event you can build yourself an unlimited version if | ||
467 | you can recompile it with @code{MaybeUInt64} set to be an unsigned | ||
468 | 64-bit integer. | ||
469 | |||
470 | |||
471 | |||
472 | @unnumberedsubsubsec AUTHOR | ||
473 | Julian Seward, @code{jseward@@acm.org}. | ||
474 | |||
475 | @code{http://sources.redhat.com/bzip2} | ||
476 | |||
477 | The ideas embodied in @code{bzip2} are due to (at least) the following | ||
478 | people: Michael Burrows and David Wheeler (for the block sorting | ||
479 | transformation), David Wheeler (again, for the Huffman coder), Peter | ||
480 | Fenwick (for the structured coding model in the original @code{bzip}, | ||
481 | and many refinements), and Alistair Moffat, Radford Neal and Ian Witten | ||
482 | (for the arithmetic coder in the original @code{bzip}). I am much | ||
483 | indebted for their help, support and advice. See the manual in the | ||
484 | source distribution for pointers to sources of documentation. Christian | ||
485 | von Roques encouraged me to look for faster sorting algorithms, so as to | ||
486 | speed up compression. Bela Lubkin encouraged me to improve the | ||
487 | worst-case compression performance. The @code{bz*} scripts are derived | ||
488 | from those of GNU @code{gzip}. Many people sent patches, helped with | ||
489 | portability problems, lent machines, gave advice and were generally | ||
490 | helpful. | ||
491 | |||
492 | @end quotation | ||
493 | |||
494 | |||
495 | |||
496 | |||
497 | @chapter Programming with @code{libbzip2} | ||
498 | |||
499 | This chapter describes the programming interface to @code{libbzip2}. | ||
500 | |||
501 | For general background information, particularly about memory | ||
502 | use and performance aspects, you'd be well advised to read Chapter 2 | ||
503 | as well. | ||
504 | |||
505 | @section Top-level structure | ||
506 | |||
507 | @code{libbzip2} is a flexible library for compressing and decompressing | ||
508 | data in the @code{bzip2} data format. Although packaged as a single | ||
509 | entity, it helps to regard the library as three separate parts: the low | ||
510 | level interface, and the high level interface, and some utility | ||
511 | functions. | ||
512 | |||
513 | The structure of @code{libbzip2}'s interfaces is similar to | ||
514 | that of Jean-loup Gailly's and Mark Adler's excellent @code{zlib} | ||
515 | library. | ||
516 | |||
517 | All externally visible symbols have names beginning @code{BZ2_}. | ||
518 | This is new in version 1.0. The intention is to minimise pollution | ||
519 | of the namespaces of library clients. | ||
520 | |||
521 | @subsection Low-level summary | ||
522 | |||
523 | This interface provides services for compressing and decompressing | ||
524 | data in memory. There's no provision for dealing with files, streams | ||
525 | or any other I/O mechanisms, just straight memory-to-memory work. | ||
526 | In fact, this part of the library can be compiled without inclusion | ||
527 | of @code{stdio.h}, which may be helpful for embedded applications. | ||
528 | |||
529 | The low-level part of the library has no global variables and | ||
530 | is therefore thread-safe. | ||
531 | |||
532 | Six routines make up the low level interface: | ||
533 | @code{BZ2_bzCompressInit}, @code{BZ2_bzCompress}, and @* @code{BZ2_bzCompressEnd} | ||
534 | for compression, | ||
535 | and a corresponding trio @code{BZ2_bzDecompressInit}, @* @code{BZ2_bzDecompress} | ||
536 | and @code{BZ2_bzDecompressEnd} for decompression. | ||
537 | The @code{*Init} functions allocate | ||
538 | memory for compression/decompression and do other | ||
539 | initialisations, whilst the @code{*End} functions close down operations | ||
540 | and release memory. | ||
541 | |||
542 | The real work is done by @code{BZ2_bzCompress} and @code{BZ2_bzDecompress}. | ||
543 | These compress and decompress data from a user-supplied input buffer | ||
544 | to a user-supplied output buffer. These buffers can be any size; | ||
545 | arbitrary quantities of data are handled by making repeated calls | ||
546 | to these functions. This is a flexible mechanism allowing a | ||
547 | consumer-pull style of activity, or producer-push, or a mixture of | ||
548 | both. | ||
549 | |||
550 | |||
551 | |||
552 | @subsection High-level summary | ||
553 | |||
554 | This interface provides some handy wrappers around the low-level | ||
555 | interface to facilitate reading and writing @code{bzip2} format | ||
556 | files (@code{.bz2} files). The routines provide hooks to facilitate | ||
557 | reading files in which the @code{bzip2} data stream is embedded | ||
558 | within some larger-scale file structure, or where there are | ||
559 | multiple @code{bzip2} data streams concatenated end-to-end. | ||
560 | |||
561 | For reading files, @code{BZ2_bzReadOpen}, @code{BZ2_bzRead}, | ||
562 | @code{BZ2_bzReadClose} and @* @code{BZ2_bzReadGetUnused} are supplied. For | ||
563 | writing files, @code{BZ2_bzWriteOpen}, @code{BZ2_bzWrite} and | ||
564 | @code{BZ2_bzWriteFinish} are available. | ||
565 | |||
566 | As with the low-level library, no global variables are used | ||
567 | so the library is per se thread-safe. However, if I/O errors | ||
568 | occur whilst reading or writing the underlying compressed files, | ||
569 | you may have to consult @code{errno} to determine the cause of | ||
570 | the error. In that case, you'd need a C library which correctly | ||
571 | supports @code{errno} in a multithreaded environment. | ||
572 | |||
573 | To make the library a little simpler and more portable, | ||
574 | @code{BZ2_bzReadOpen} and @code{BZ2_bzWriteOpen} require you to pass them file | ||
575 | handles (@code{FILE*}s) which have previously been opened for reading or | ||
576 | writing respectively. That avoids portability problems associated with | ||
577 | file operations and file attributes, whilst not being much of an | ||
578 | imposition on the programmer. | ||
579 | |||
580 | |||
581 | |||
582 | @subsection Utility functions summary | ||
583 | For very simple needs, @code{BZ2_bzBuffToBuffCompress} and | ||
584 | @code{BZ2_bzBuffToBuffDecompress} are provided. These compress | ||
585 | data in memory from one buffer to another buffer in a single | ||
586 | function call. You should assess whether these functions | ||
587 | fulfill your memory-to-memory compression/decompression | ||
588 | requirements before investing effort in understanding the more | ||
589 | general but more complex low-level interface. | ||
590 | |||
591 | Yoshioka Tsuneo (@code{QWF00133@@niftyserve.or.jp} / | ||
592 | @code{tsuneo-y@@is.aist-nara.ac.jp}) has contributed some functions to | ||
593 | give better @code{zlib} compatibility. These functions are | ||
594 | @code{BZ2_bzopen}, @code{BZ2_bzread}, @code{BZ2_bzwrite}, @code{BZ2_bzflush}, | ||
595 | @code{BZ2_bzclose}, | ||
596 | @code{BZ2_bzerror} and @code{BZ2_bzlibVersion}. You may find these functions | ||
597 | more convenient for simple file reading and writing, than those in the | ||
598 | high-level interface. These functions are not (yet) officially part of | ||
599 | the library, and are minimally documented here. If they break, you | ||
600 | get to keep all the pieces. I hope to document them properly when time | ||
601 | permits. | ||
602 | |||
603 | Yoshioka also contributed modifications to allow the library to be | ||
604 | built as a Windows DLL. | ||
605 | |||
606 | |||
607 | @section Error handling | ||
608 | |||
609 | The library is designed to recover cleanly in all situations, including | ||
610 | the worst-case situation of decompressing random data. I'm not | ||
611 | 100% sure that it can always do this, so you might want to add | ||
612 | a signal handler to catch segmentation violations during decompression | ||
613 | if you are feeling especially paranoid. I would be interested in | ||
614 | hearing more about the robustness of the library to corrupted | ||
615 | compressed data. | ||
616 | |||
617 | Version 1.0 is much more robust in this respect than | ||
618 | 0.9.0 or 0.9.5. Investigations with Checker (a tool for | ||
619 | detecting problems with memory management, similar to Purify) | ||
620 | indicate that, at least for the few files I tested, all single-bit | ||
621 | errors in the decompressed data are caught properly, with no | ||
622 | segmentation faults, no reads of uninitialised data and no | ||
623 | out of range reads or writes. So it's certainly much improved, | ||
624 | although I wouldn't claim it to be totally bombproof. | ||
625 | |||
626 | The file @code{bzlib.h} contains all definitions needed to use | ||
627 | the library. In particular, you should definitely not include | ||
628 | @code{bzlib_private.h}. | ||
629 | |||
630 | In @code{bzlib.h}, the various return values are defined. The following | ||
631 | list is not intended as an exhaustive description of the circumstances | ||
632 | in which a given value may be returned -- those descriptions are given | ||
633 | later. Rather, it is intended to convey the rough meaning of each | ||
634 | return value. The first five actions are normal and not intended to | ||
635 | denote an error situation. | ||
636 | @table @code | ||
637 | @item BZ_OK | ||
638 | The requested action was completed successfully. | ||
639 | @item BZ_RUN_OK | ||
640 | @itemx BZ_FLUSH_OK | ||
641 | @itemx BZ_FINISH_OK | ||
642 | In @code{BZ2_bzCompress}, the requested flush/finish/nothing-special action | ||
643 | was completed successfully. | ||
644 | @item BZ_STREAM_END | ||
645 | Compression of data was completed, or the logical stream end was | ||
646 | detected during decompression. | ||
647 | @end table | ||
648 | |||
649 | The following return values indicate an error of some kind. | ||
650 | @table @code | ||
651 | @item BZ_CONFIG_ERROR | ||
652 | Indicates that the library has been improperly compiled on your | ||
653 | platform -- a major configuration error. Specifically, it means | ||
654 | that @code{sizeof(char)}, @code{sizeof(short)} and @code{sizeof(int)} | ||
655 | are not 1, 2 and 4 respectively, as they should be. Note that the | ||
656 | library should still work properly on 64-bit platforms which follow | ||
657 | the LP64 programming model -- that is, where @code{sizeof(long)} | ||
658 | and @code{sizeof(void*)} are 8. Under LP64, @code{sizeof(int)} is | ||
659 | still 4, so @code{libbzip2}, which doesn't use the @code{long} type, | ||
660 | is OK. | ||
661 | @item BZ_SEQUENCE_ERROR | ||
662 | When using the library, it is important to call the functions in the | ||
663 | correct sequence and with data structures (buffers etc) in the correct | ||
664 | states. @code{libbzip2} checks as much as it can to ensure this is | ||
665 | happening, and returns @code{BZ_SEQUENCE_ERROR} if not. Code which | ||
666 | complies precisely with the function semantics, as detailed below, | ||
667 | should never receive this value; such an event denotes buggy code | ||
668 | which you should investigate. | ||
669 | @item BZ_PARAM_ERROR | ||
670 | Returned when a parameter to a function call is out of range | ||
671 | or otherwise manifestly incorrect. As with @code{BZ_SEQUENCE_ERROR}, | ||
672 | this denotes a bug in the client code. The distinction between | ||
673 | @code{BZ_PARAM_ERROR} and @code{BZ_SEQUENCE_ERROR} is a bit hazy, but still worth | ||
674 | making. | ||
675 | @item BZ_MEM_ERROR | ||
676 | Returned when a request to allocate memory failed. Note that the | ||
677 | quantity of memory needed to decompress a stream cannot be determined | ||
678 | until the stream's header has been read. So @code{BZ2_bzDecompress} and | ||
679 | @code{BZ2_bzRead} may return @code{BZ_MEM_ERROR} even though some of | ||
680 | the compressed data has been read. The same is not true for | ||
681 | compression; once @code{BZ2_bzCompressInit} or @code{BZ2_bzWriteOpen} have | ||
682 | successfully completed, @code{BZ_MEM_ERROR} cannot occur. | ||
683 | @item BZ_DATA_ERROR | ||
684 | Returned when a data integrity error is detected during decompression. | ||
685 | Most importantly, this means when stored and computed CRCs for the | ||
686 | data do not match. This value is also returned upon detection of any | ||
687 | other anomaly in the compressed data. | ||
688 | @item BZ_DATA_ERROR_MAGIC | ||
689 | As a special case of @code{BZ_DATA_ERROR}, it is sometimes useful to | ||
690 | know when the compressed stream does not start with the correct | ||
691 | magic bytes (@code{'B' 'Z' 'h'}). | ||
692 | @item BZ_IO_ERROR | ||
693 | Returned by @code{BZ2_bzRead} and @code{BZ2_bzWrite} when there is an error | ||
694 | reading or writing in the compressed file, and by @code{BZ2_bzReadOpen} | ||
695 | and @code{BZ2_bzWriteOpen} for attempts to use a file for which the | ||
696 | error indicator (viz, @code{ferror(f)}) is set. | ||
697 | On receipt of @code{BZ_IO_ERROR}, the caller should consult | ||
698 | @code{errno} and/or @code{perror} to acquire operating-system | ||
699 | specific information about the problem. | ||
700 | @item BZ_UNEXPECTED_EOF | ||
701 | Returned by @code{BZ2_bzRead} when the compressed file finishes | ||
702 | before the logical end of stream is detected. | ||
703 | @item BZ_OUTBUFF_FULL | ||
704 | Returned by @code{BZ2_bzBuffToBuffCompress} and | ||
705 | @code{BZ2_bzBuffToBuffDecompress} to indicate that the output data | ||
706 | will not fit into the output buffer provided. | ||
707 | @end table | ||
708 | |||
709 | |||
710 | |||
711 | @section Low-level interface | ||
712 | |||
713 | @subsection @code{BZ2_bzCompressInit} | ||
714 | @example | ||
715 | typedef | ||
716 | struct @{ | ||
717 | char *next_in; | ||
718 | unsigned int avail_in; | ||
719 | unsigned int total_in_lo32; | ||
720 | unsigned int total_in_hi32; | ||
721 | |||
722 | char *next_out; | ||
723 | unsigned int avail_out; | ||
724 | unsigned int total_out_lo32; | ||
725 | unsigned int total_out_hi32; | ||
726 | |||
727 | void *state; | ||
728 | |||
729 | void *(*bzalloc)(void *,int,int); | ||
730 | void (*bzfree)(void *,void *); | ||
731 | void *opaque; | ||
732 | @} | ||
733 | bz_stream; | ||
734 | |||
735 | int BZ2_bzCompressInit ( bz_stream *strm, | ||
736 | int blockSize100k, | ||
737 | int verbosity, | ||
738 | int workFactor ); | ||
739 | |||
740 | @end example | ||
741 | |||
742 | Prepares for compression. The @code{bz_stream} structure | ||
743 | holds all data pertaining to the compression activity. | ||
744 | A @code{bz_stream} structure should be allocated and initialised | ||
745 | prior to the call. | ||
746 | The fields of @code{bz_stream} | ||
747 | comprise the entirety of the user-visible data. @code{state} | ||
748 | is a pointer to the private data structures required for compression. | ||
749 | |||
750 | Custom memory allocators are supported, via fields @code{bzalloc}, | ||
751 | @code{bzfree}, | ||
752 | and @code{opaque}. The value | ||
753 | @code{opaque} is passed to as the first argument to | ||
754 | all calls to @code{bzalloc} and @code{bzfree}, but is | ||
755 | otherwise ignored by the library. | ||
756 | The call @code{bzalloc ( opaque, n, m )} is expected to return a | ||
757 | pointer @code{p} to | ||
758 | @code{n * m} bytes of memory, and @code{bzfree ( opaque, p )} | ||
759 | should free | ||
760 | that memory. | ||
761 | |||
762 | If you don't want to use a custom memory allocator, set @code{bzalloc}, | ||
763 | @code{bzfree} and | ||
764 | @code{opaque} to @code{NULL}, | ||
765 | and the library will then use the standard @code{malloc}/@code{free} | ||
766 | routines. | ||
767 | |||
768 | Before calling @code{BZ2_bzCompressInit}, fields @code{bzalloc}, | ||
769 | @code{bzfree} and @code{opaque} should | ||
770 | be filled appropriately, as just described. Upon return, the internal | ||
771 | state will have been allocated and initialised, and @code{total_in_lo32}, | ||
772 | @code{total_in_hi32}, @code{total_out_lo32} and | ||
773 | @code{total_out_hi32} will have been set to zero. | ||
774 | These four fields are used by the library | ||
775 | to inform the caller of the total amount of data passed into and out of | ||
776 | the library, respectively. You should not try to change them. | ||
777 | As of version 1.0, 64-bit counts are maintained, even on 32-bit | ||
778 | platforms, using the @code{_hi32} fields to store the upper 32 bits | ||
779 | of the count. So, for example, the total amount of data in | ||
780 | is @code{(total_in_hi32 << 32) + total_in_lo32}. | ||
781 | |||
782 | Parameter @code{blockSize100k} specifies the block size to be used for | ||
783 | compression. It should be a value between 1 and 9 inclusive, and the | ||
784 | actual block size used is 100000 x this figure. 9 gives the best | ||
785 | compression but takes most memory. | ||
786 | |||
787 | Parameter @code{verbosity} should be set to a number between 0 and 4 | ||
788 | inclusive. 0 is silent, and greater numbers give increasingly verbose | ||
789 | monitoring/debugging output. If the library has been compiled with | ||
790 | @code{-DBZ_NO_STDIO}, no such output will appear for any verbosity | ||
791 | setting. | ||
792 | |||
793 | Parameter @code{workFactor} controls how the compression phase behaves | ||
794 | when presented with worst case, highly repetitive, input data. If | ||
795 | compression runs into difficulties caused by repetitive data, the | ||
796 | library switches from the standard sorting algorithm to a fallback | ||
797 | algorithm. The fallback is slower than the standard algorithm by | ||
798 | perhaps a factor of three, but always behaves reasonably, no matter how | ||
799 | bad the input. | ||
800 | |||
801 | Lower values of @code{workFactor} reduce the amount of effort the | ||
802 | standard algorithm will expend before resorting to the fallback. You | ||
803 | should set this parameter carefully; too low, and many inputs will be | ||
804 | handled by the fallback algorithm and so compress rather slowly, too | ||
805 | high, and your average-to-worst case compression times can become very | ||
806 | large. The default value of 30 gives reasonable behaviour over a wide | ||
807 | range of circumstances. | ||
808 | |||
809 | Allowable values range from 0 to 250 inclusive. 0 is a special case, | ||
810 | equivalent to using the default value of 30. | ||
811 | |||
812 | Note that the compressed output generated is the same regardless of | ||
813 | whether or not the fallback algorithm is used. | ||
814 | |||
815 | Be aware also that this parameter may disappear entirely in future | ||
816 | versions of the library. In principle it should be possible to devise a | ||
817 | good way to automatically choose which algorithm to use. Such a | ||
818 | mechanism would render the parameter obsolete. | ||
819 | |||
820 | Possible return values: | ||
821 | @display | ||
822 | @code{BZ_CONFIG_ERROR} | ||
823 | if the library has been mis-compiled | ||
824 | @code{BZ_PARAM_ERROR} | ||
825 | if @code{strm} is @code{NULL} | ||
826 | or @code{blockSize} < 1 or @code{blockSize} > 9 | ||
827 | or @code{verbosity} < 0 or @code{verbosity} > 4 | ||
828 | or @code{workFactor} < 0 or @code{workFactor} > 250 | ||
829 | @code{BZ_MEM_ERROR} | ||
830 | if not enough memory is available | ||
831 | @code{BZ_OK} | ||
832 | otherwise | ||
833 | @end display | ||
834 | Allowable next actions: | ||
835 | @display | ||
836 | @code{BZ2_bzCompress} | ||
837 | if @code{BZ_OK} is returned | ||
838 | no specific action needed in case of error | ||
839 | @end display | ||
840 | |||
841 | @subsection @code{BZ2_bzCompress} | ||
842 | @example | ||
843 | int BZ2_bzCompress ( bz_stream *strm, int action ); | ||
844 | @end example | ||
845 | Provides more input and/or output buffer space for the library. The | ||
846 | caller maintains input and output buffers, and calls @code{BZ2_bzCompress} to | ||
847 | transfer data between them. | ||
848 | |||
849 | Before each call to @code{BZ2_bzCompress}, @code{next_in} should point at | ||
850 | the data to be compressed, and @code{avail_in} should indicate how many | ||
851 | bytes the library may read. @code{BZ2_bzCompress} updates @code{next_in}, | ||
852 | @code{avail_in} and @code{total_in} to reflect the number of bytes it | ||
853 | has read. | ||
854 | |||
855 | Similarly, @code{next_out} should point to a buffer in which the | ||
856 | compressed data is to be placed, with @code{avail_out} indicating how | ||
857 | much output space is available. @code{BZ2_bzCompress} updates | ||
858 | @code{next_out}, @code{avail_out} and @code{total_out} to reflect the | ||
859 | number of bytes output. | ||
860 | |||
861 | You may provide and remove as little or as much data as you like on each | ||
862 | call of @code{BZ2_bzCompress}. In the limit, it is acceptable to supply and | ||
863 | remove data one byte at a time, although this would be terribly | ||
864 | inefficient. You should always ensure that at least one byte of output | ||
865 | space is available at each call. | ||
866 | |||
867 | A second purpose of @code{BZ2_bzCompress} is to request a change of mode of the | ||
868 | compressed stream. | ||
869 | |||
870 | Conceptually, a compressed stream can be in one of four states: IDLE, | ||
871 | RUNNING, FLUSHING and FINISHING. Before initialisation | ||
872 | (@code{BZ2_bzCompressInit}) and after termination (@code{BZ2_bzCompressEnd}), a | ||
873 | stream is regarded as IDLE. | ||
874 | |||
875 | Upon initialisation (@code{BZ2_bzCompressInit}), the stream is placed in the | ||
876 | RUNNING state. Subsequent calls to @code{BZ2_bzCompress} should pass | ||
877 | @code{BZ_RUN} as the requested action; other actions are illegal and | ||
878 | will result in @code{BZ_SEQUENCE_ERROR}. | ||
879 | |||
880 | At some point, the calling program will have provided all the input data | ||
881 | it wants to. It will then want to finish up -- in effect, asking the | ||
882 | library to process any data it might have buffered internally. In this | ||
883 | state, @code{BZ2_bzCompress} will no longer attempt to read data from | ||
884 | @code{next_in}, but it will want to write data to @code{next_out}. | ||
885 | Because the output buffer supplied by the user can be arbitrarily small, | ||
886 | the finishing-up operation cannot necessarily be done with a single call | ||
887 | of @code{BZ2_bzCompress}. | ||
888 | |||
889 | Instead, the calling program passes @code{BZ_FINISH} as an action to | ||
890 | @code{BZ2_bzCompress}. This changes the stream's state to FINISHING. Any | ||
891 | remaining input (ie, @code{next_in[0 .. avail_in-1]}) is compressed and | ||
892 | transferred to the output buffer. To do this, @code{BZ2_bzCompress} must be | ||
893 | called repeatedly until all the output has been consumed. At that | ||
894 | point, @code{BZ2_bzCompress} returns @code{BZ_STREAM_END}, and the stream's | ||
895 | state is set back to IDLE. @code{BZ2_bzCompressEnd} should then be | ||
896 | called. | ||
897 | |||
898 | Just to make sure the calling program does not cheat, the library makes | ||
899 | a note of @code{avail_in} at the time of the first call to | ||
900 | @code{BZ2_bzCompress} which has @code{BZ_FINISH} as an action (ie, at the | ||
901 | time the program has announced its intention to not supply any more | ||
902 | input). By comparing this value with that of @code{avail_in} over | ||
903 | subsequent calls to @code{BZ2_bzCompress}, the library can detect any | ||
904 | attempts to slip in more data to compress. Any calls for which this is | ||
905 | detected will return @code{BZ_SEQUENCE_ERROR}. This indicates a | ||
906 | programming mistake which should be corrected. | ||
907 | |||
908 | Instead of asking to finish, the calling program may ask | ||
909 | @code{BZ2_bzCompress} to take all the remaining input, compress it and | ||
910 | terminate the current (Burrows-Wheeler) compression block. This could | ||
911 | be useful for error control purposes. The mechanism is analogous to | ||
912 | that for finishing: call @code{BZ2_bzCompress} with an action of | ||
913 | @code{BZ_FLUSH}, remove output data, and persist with the | ||
914 | @code{BZ_FLUSH} action until the value @code{BZ_RUN} is returned. As | ||
915 | with finishing, @code{BZ2_bzCompress} detects any attempt to provide more | ||
916 | input data once the flush has begun. | ||
917 | |||
918 | Once the flush is complete, the stream returns to the normal RUNNING | ||
919 | state. | ||
920 | |||
921 | This all sounds pretty complex, but isn't really. Here's a table | ||
922 | which shows which actions are allowable in each state, what action | ||
923 | will be taken, what the next state is, and what the non-error return | ||
924 | values are. Note that you can't explicitly ask what state the | ||
925 | stream is in, but nor do you need to -- it can be inferred from the | ||
926 | values returned by @code{BZ2_bzCompress}. | ||
927 | @display | ||
928 | IDLE/@code{any} | ||
929 | Illegal. IDLE state only exists after @code{BZ2_bzCompressEnd} or | ||
930 | before @code{BZ2_bzCompressInit}. | ||
931 | Return value = @code{BZ_SEQUENCE_ERROR} | ||
932 | |||
933 | RUNNING/@code{BZ_RUN} | ||
934 | Compress from @code{next_in} to @code{next_out} as much as possible. | ||
935 | Next state = RUNNING | ||
936 | Return value = @code{BZ_RUN_OK} | ||
937 | |||
938 | RUNNING/@code{BZ_FLUSH} | ||
939 | Remember current value of @code{next_in}. Compress from @code{next_in} | ||
940 | to @code{next_out} as much as possible, but do not accept any more input. | ||
941 | Next state = FLUSHING | ||
942 | Return value = @code{BZ_FLUSH_OK} | ||
943 | |||
944 | RUNNING/@code{BZ_FINISH} | ||
945 | Remember current value of @code{next_in}. Compress from @code{next_in} | ||
946 | to @code{next_out} as much as possible, but do not accept any more input. | ||
947 | Next state = FINISHING | ||
948 | Return value = @code{BZ_FINISH_OK} | ||
949 | |||
950 | FLUSHING/@code{BZ_FLUSH} | ||
951 | Compress from @code{next_in} to @code{next_out} as much as possible, | ||
952 | but do not accept any more input. | ||
953 | If all the existing input has been used up and all compressed | ||
954 | output has been removed | ||
955 | Next state = RUNNING; Return value = @code{BZ_RUN_OK} | ||
956 | else | ||
957 | Next state = FLUSHING; Return value = @code{BZ_FLUSH_OK} | ||
958 | |||
959 | FLUSHING/other | ||
960 | Illegal. | ||
961 | Return value = @code{BZ_SEQUENCE_ERROR} | ||
962 | |||
963 | FINISHING/@code{BZ_FINISH} | ||
964 | Compress from @code{next_in} to @code{next_out} as much as possible, | ||
965 | but to not accept any more input. | ||
966 | If all the existing input has been used up and all compressed | ||
967 | output has been removed | ||
968 | Next state = IDLE; Return value = @code{BZ_STREAM_END} | ||
969 | else | ||
970 | Next state = FINISHING; Return value = @code{BZ_FINISHING} | ||
971 | |||
972 | FINISHING/other | ||
973 | Illegal. | ||
974 | Return value = @code{BZ_SEQUENCE_ERROR} | ||
975 | @end display | ||
976 | |||
977 | That still looks complicated? Well, fair enough. The usual sequence | ||
978 | of calls for compressing a load of data is: | ||
979 | @itemize @bullet | ||
980 | @item Get started with @code{BZ2_bzCompressInit}. | ||
981 | @item Shovel data in and shlurp out its compressed form using zero or more | ||
982 | calls of @code{BZ2_bzCompress} with action = @code{BZ_RUN}. | ||
983 | @item Finish up. | ||
984 | Repeatedly call @code{BZ2_bzCompress} with action = @code{BZ_FINISH}, | ||
985 | copying out the compressed output, until @code{BZ_STREAM_END} is returned. | ||
986 | @item Close up and go home. Call @code{BZ2_bzCompressEnd}. | ||
987 | @end itemize | ||
988 | If the data you want to compress fits into your input buffer all | ||
989 | at once, you can skip the calls of @code{BZ2_bzCompress ( ..., BZ_RUN )} and | ||
990 | just do the @code{BZ2_bzCompress ( ..., BZ_FINISH )} calls. | ||
991 | |||
992 | All required memory is allocated by @code{BZ2_bzCompressInit}. The | ||
993 | compression library can accept any data at all (obviously). So you | ||
994 | shouldn't get any error return values from the @code{BZ2_bzCompress} calls. | ||
995 | If you do, they will be @code{BZ_SEQUENCE_ERROR}, and indicate a bug in | ||
996 | your programming. | ||
997 | |||
998 | Trivial other possible return values: | ||
999 | @display | ||
1000 | @code{BZ_PARAM_ERROR} | ||
1001 | if @code{strm} is @code{NULL}, or @code{strm->s} is @code{NULL} | ||
1002 | @end display | ||
1003 | |||
1004 | @subsection @code{BZ2_bzCompressEnd} | ||
1005 | @example | ||
1006 | int BZ2_bzCompressEnd ( bz_stream *strm ); | ||
1007 | @end example | ||
1008 | Releases all memory associated with a compression stream. | ||
1009 | |||
1010 | Possible return values: | ||
1011 | @display | ||
1012 | @code{BZ_PARAM_ERROR} if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL} | ||
1013 | @code{BZ_OK} otherwise | ||
1014 | @end display | ||
1015 | |||
1016 | |||
1017 | @subsection @code{BZ2_bzDecompressInit} | ||
1018 | @example | ||
1019 | int BZ2_bzDecompressInit ( bz_stream *strm, int verbosity, int small ); | ||
1020 | @end example | ||
1021 | Prepares for decompression. As with @code{BZ2_bzCompressInit}, a | ||
1022 | @code{bz_stream} record should be allocated and initialised before the | ||
1023 | call. Fields @code{bzalloc}, @code{bzfree} and @code{opaque} should be | ||
1024 | set if a custom memory allocator is required, or made @code{NULL} for | ||
1025 | the normal @code{malloc}/@code{free} routines. Upon return, the internal | ||
1026 | state will have been initialised, and @code{total_in} and | ||
1027 | @code{total_out} will be zero. | ||
1028 | |||
1029 | For the meaning of parameter @code{verbosity}, see @code{BZ2_bzCompressInit}. | ||
1030 | |||
1031 | If @code{small} is nonzero, the library will use an alternative | ||
1032 | decompression algorithm which uses less memory but at the cost of | ||
1033 | decompressing more slowly (roughly speaking, half the speed, but the | ||
1034 | maximum memory requirement drops to around 2300k). See Chapter 2 for | ||
1035 | more information on memory management. | ||
1036 | |||
1037 | Note that the amount of memory needed to decompress | ||
1038 | a stream cannot be determined until the stream's header has been read, | ||
1039 | so even if @code{BZ2_bzDecompressInit} succeeds, a subsequent | ||
1040 | @code{BZ2_bzDecompress} could fail with @code{BZ_MEM_ERROR}. | ||
1041 | |||
1042 | Possible return values: | ||
1043 | @display | ||
1044 | @code{BZ_CONFIG_ERROR} | ||
1045 | if the library has been mis-compiled | ||
1046 | @code{BZ_PARAM_ERROR} | ||
1047 | if @code{(small != 0 && small != 1)} | ||
1048 | or @code{(verbosity < 0 || verbosity > 4)} | ||
1049 | @code{BZ_MEM_ERROR} | ||
1050 | if insufficient memory is available | ||
1051 | @end display | ||
1052 | |||
1053 | Allowable next actions: | ||
1054 | @display | ||
1055 | @code{BZ2_bzDecompress} | ||
1056 | if @code{BZ_OK} was returned | ||
1057 | no specific action required in case of error | ||
1058 | @end display | ||
1059 | |||
1060 | |||
1061 | |||
1062 | @subsection @code{BZ2_bzDecompress} | ||
1063 | @example | ||
1064 | int BZ2_bzDecompress ( bz_stream *strm ); | ||
1065 | @end example | ||
1066 | Provides more input and/out output buffer space for the library. The | ||
1067 | caller maintains input and output buffers, and uses @code{BZ2_bzDecompress} | ||
1068 | to transfer data between them. | ||
1069 | |||
1070 | Before each call to @code{BZ2_bzDecompress}, @code{next_in} | ||
1071 | should point at the compressed data, | ||
1072 | and @code{avail_in} should indicate how many bytes the library | ||
1073 | may read. @code{BZ2_bzDecompress} updates @code{next_in}, @code{avail_in} | ||
1074 | and @code{total_in} | ||
1075 | to reflect the number of bytes it has read. | ||
1076 | |||
1077 | Similarly, @code{next_out} should point to a buffer in which the uncompressed | ||
1078 | output is to be placed, with @code{avail_out} indicating how much output space | ||
1079 | is available. @code{BZ2_bzCompress} updates @code{next_out}, | ||
1080 | @code{avail_out} and @code{total_out} to reflect | ||
1081 | the number of bytes output. | ||
1082 | |||
1083 | You may provide and remove as little or as much data as you like on | ||
1084 | each call of @code{BZ2_bzDecompress}. | ||
1085 | In the limit, it is acceptable to | ||
1086 | supply and remove data one byte at a time, although this would be | ||
1087 | terribly inefficient. You should always ensure that at least one | ||
1088 | byte of output space is available at each call. | ||
1089 | |||
1090 | Use of @code{BZ2_bzDecompress} is simpler than @code{BZ2_bzCompress}. | ||
1091 | |||
1092 | You should provide input and remove output as described above, and | ||
1093 | repeatedly call @code{BZ2_bzDecompress} until @code{BZ_STREAM_END} is | ||
1094 | returned. Appearance of @code{BZ_STREAM_END} denotes that | ||
1095 | @code{BZ2_bzDecompress} has detected the logical end of the compressed | ||
1096 | stream. @code{BZ2_bzDecompress} will not produce @code{BZ_STREAM_END} until | ||
1097 | all output data has been placed into the output buffer, so once | ||
1098 | @code{BZ_STREAM_END} appears, you are guaranteed to have available all | ||
1099 | the decompressed output, and @code{BZ2_bzDecompressEnd} can safely be | ||
1100 | called. | ||
1101 | |||
1102 | If case of an error return value, you should call @code{BZ2_bzDecompressEnd} | ||
1103 | to clean up and release memory. | ||
1104 | |||
1105 | Possible return values: | ||
1106 | @display | ||
1107 | @code{BZ_PARAM_ERROR} | ||
1108 | if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL} | ||
1109 | or @code{strm->avail_out < 1} | ||
1110 | @code{BZ_DATA_ERROR} | ||
1111 | if a data integrity error is detected in the compressed stream | ||
1112 | @code{BZ_DATA_ERROR_MAGIC} | ||
1113 | if the compressed stream doesn't begin with the right magic bytes | ||
1114 | @code{BZ_MEM_ERROR} | ||
1115 | if there wasn't enough memory available | ||
1116 | @code{BZ_STREAM_END} | ||
1117 | if the logical end of the data stream was detected and all | ||
1118 | output in has been consumed, eg @code{s->avail_out > 0} | ||
1119 | @code{BZ_OK} | ||
1120 | otherwise | ||
1121 | @end display | ||
1122 | Allowable next actions: | ||
1123 | @display | ||
1124 | @code{BZ2_bzDecompress} | ||
1125 | if @code{BZ_OK} was returned | ||
1126 | @code{BZ2_bzDecompressEnd} | ||
1127 | otherwise | ||
1128 | @end display | ||
1129 | |||
1130 | |||
1131 | @subsection @code{BZ2_bzDecompressEnd} | ||
1132 | @example | ||
1133 | int BZ2_bzDecompressEnd ( bz_stream *strm ); | ||
1134 | @end example | ||
1135 | Releases all memory associated with a decompression stream. | ||
1136 | |||
1137 | Possible return values: | ||
1138 | @display | ||
1139 | @code{BZ_PARAM_ERROR} | ||
1140 | if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL} | ||
1141 | @code{BZ_OK} | ||
1142 | otherwise | ||
1143 | @end display | ||
1144 | |||
1145 | Allowable next actions: | ||
1146 | @display | ||
1147 | None. | ||
1148 | @end display | ||
1149 | |||
1150 | |||
1151 | @section High-level interface | ||
1152 | |||
1153 | This interface provides functions for reading and writing | ||
1154 | @code{bzip2} format files. First, some general points. | ||
1155 | |||
1156 | @itemize @bullet | ||
1157 | @item All of the functions take an @code{int*} first argument, | ||
1158 | @code{bzerror}. | ||
1159 | After each call, @code{bzerror} should be consulted first to determine | ||
1160 | the outcome of the call. If @code{bzerror} is @code{BZ_OK}, | ||
1161 | the call completed | ||
1162 | successfully, and only then should the return value of the function | ||
1163 | (if any) be consulted. If @code{bzerror} is @code{BZ_IO_ERROR}, | ||
1164 | there was an error | ||
1165 | reading/writing the underlying compressed file, and you should | ||
1166 | then consult @code{errno}/@code{perror} to determine the | ||
1167 | cause of the difficulty. | ||
1168 | @code{bzerror} may also be set to various other values; precise details are | ||
1169 | given on a per-function basis below. | ||
1170 | @item If @code{bzerror} indicates an error | ||
1171 | (ie, anything except @code{BZ_OK} and @code{BZ_STREAM_END}), | ||
1172 | you should immediately call @code{BZ2_bzReadClose} (or @code{BZ2_bzWriteClose}, | ||
1173 | depending on whether you are attempting to read or to write) | ||
1174 | to free up all resources associated | ||
1175 | with the stream. Once an error has been indicated, behaviour of all calls | ||
1176 | except @code{BZ2_bzReadClose} (@code{BZ2_bzWriteClose}) is undefined. | ||
1177 | The implication is that (1) @code{bzerror} should | ||
1178 | be checked after each call, and (2) if @code{bzerror} indicates an error, | ||
1179 | @code{BZ2_bzReadClose} (@code{BZ2_bzWriteClose}) should then be called to clean up. | ||
1180 | @item The @code{FILE*} arguments passed to | ||
1181 | @code{BZ2_bzReadOpen}/@code{BZ2_bzWriteOpen} | ||
1182 | should be set to binary mode. | ||
1183 | Most Unix systems will do this by default, but other platforms, | ||
1184 | including Windows and Mac, will not. If you omit this, you may | ||
1185 | encounter problems when moving code to new platforms. | ||
1186 | @item Memory allocation requests are handled by | ||
1187 | @code{malloc}/@code{free}. | ||
1188 | At present | ||
1189 | there is no facility for user-defined memory allocators in the file I/O | ||
1190 | functions (could easily be added, though). | ||
1191 | @end itemize | ||
1192 | |||
1193 | |||
1194 | |||
1195 | @subsection @code{BZ2_bzReadOpen} | ||
1196 | @example | ||
1197 | typedef void BZFILE; | ||
1198 | |||
1199 | BZFILE *BZ2_bzReadOpen ( int *bzerror, FILE *f, | ||
1200 | int small, int verbosity, | ||
1201 | void *unused, int nUnused ); | ||
1202 | @end example | ||
1203 | Prepare to read compressed data from file handle @code{f}. @code{f} | ||
1204 | should refer to a file which has been opened for reading, and for which | ||
1205 | the error indicator (@code{ferror(f)})is not set. If @code{small} is 1, | ||
1206 | the library will try to decompress using less memory, at the expense of | ||
1207 | speed. | ||
1208 | |||
1209 | For reasons explained below, @code{BZ2_bzRead} will decompress the | ||
1210 | @code{nUnused} bytes starting at @code{unused}, before starting to read | ||
1211 | from the file @code{f}. At most @code{BZ_MAX_UNUSED} bytes may be | ||
1212 | supplied like this. If this facility is not required, you should pass | ||
1213 | @code{NULL} and @code{0} for @code{unused} and n@code{Unused} | ||
1214 | respectively. | ||
1215 | |||
1216 | For the meaning of parameters @code{small} and @code{verbosity}, | ||
1217 | see @code{BZ2_bzDecompressInit}. | ||
1218 | |||
1219 | The amount of memory needed to decompress a file cannot be determined | ||
1220 | until the file's header has been read. So it is possible that | ||
1221 | @code{BZ2_bzReadOpen} returns @code{BZ_OK} but a subsequent call of | ||
1222 | @code{BZ2_bzRead} will return @code{BZ_MEM_ERROR}. | ||
1223 | |||
1224 | Possible assignments to @code{bzerror}: | ||
1225 | @display | ||
1226 | @code{BZ_CONFIG_ERROR} | ||
1227 | if the library has been mis-compiled | ||
1228 | @code{BZ_PARAM_ERROR} | ||
1229 | if @code{f} is @code{NULL} | ||
1230 | or @code{small} is neither @code{0} nor @code{1} | ||
1231 | or @code{(unused == NULL && nUnused != 0)} | ||
1232 | or @code{(unused != NULL && !(0 <= nUnused <= BZ_MAX_UNUSED))} | ||
1233 | @code{BZ_IO_ERROR} | ||
1234 | if @code{ferror(f)} is nonzero | ||
1235 | @code{BZ_MEM_ERROR} | ||
1236 | if insufficient memory is available | ||
1237 | @code{BZ_OK} | ||
1238 | otherwise. | ||
1239 | @end display | ||
1240 | |||
1241 | Possible return values: | ||
1242 | @display | ||
1243 | Pointer to an abstract @code{BZFILE} | ||
1244 | if @code{bzerror} is @code{BZ_OK} | ||
1245 | @code{NULL} | ||
1246 | otherwise | ||
1247 | @end display | ||
1248 | |||
1249 | Allowable next actions: | ||
1250 | @display | ||
1251 | @code{BZ2_bzRead} | ||
1252 | if @code{bzerror} is @code{BZ_OK} | ||
1253 | @code{BZ2_bzClose} | ||
1254 | otherwise | ||
1255 | @end display | ||
1256 | |||
1257 | |||
1258 | @subsection @code{BZ2_bzRead} | ||
1259 | @example | ||
1260 | int BZ2_bzRead ( int *bzerror, BZFILE *b, void *buf, int len ); | ||
1261 | @end example | ||
1262 | Reads up to @code{len} (uncompressed) bytes from the compressed file | ||
1263 | @code{b} into | ||
1264 | the buffer @code{buf}. If the read was successful, | ||
1265 | @code{bzerror} is set to @code{BZ_OK} | ||
1266 | and the number of bytes read is returned. If the logical end-of-stream | ||
1267 | was detected, @code{bzerror} will be set to @code{BZ_STREAM_END}, | ||
1268 | and the number | ||
1269 | of bytes read is returned. All other @code{bzerror} values denote an error. | ||
1270 | |||
1271 | @code{BZ2_bzRead} will supply @code{len} bytes, | ||
1272 | unless the logical stream end is detected | ||
1273 | or an error occurs. Because of this, it is possible to detect the | ||
1274 | stream end by observing when the number of bytes returned is | ||
1275 | less than the number | ||
1276 | requested. Nevertheless, this is regarded as inadvisable; you should | ||
1277 | instead check @code{bzerror} after every call and watch out for | ||
1278 | @code{BZ_STREAM_END}. | ||
1279 | |||
1280 | Internally, @code{BZ2_bzRead} copies data from the compressed file in chunks | ||
1281 | of size @code{BZ_MAX_UNUSED} bytes | ||
1282 | before decompressing it. If the file contains more bytes than strictly | ||
1283 | needed to reach the logical end-of-stream, @code{BZ2_bzRead} will almost certainly | ||
1284 | read some of the trailing data before signalling @code{BZ_SEQUENCE_END}. | ||
1285 | To collect the read but unused data once @code{BZ_SEQUENCE_END} has | ||
1286 | appeared, call @code{BZ2_bzReadGetUnused} immediately before @code{BZ2_bzReadClose}. | ||
1287 | |||
1288 | Possible assignments to @code{bzerror}: | ||
1289 | @display | ||
1290 | @code{BZ_PARAM_ERROR} | ||
1291 | if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0} | ||
1292 | @code{BZ_SEQUENCE_ERROR} | ||
1293 | if @code{b} was opened with @code{BZ2_bzWriteOpen} | ||
1294 | @code{BZ_IO_ERROR} | ||
1295 | if there is an error reading from the compressed file | ||
1296 | @code{BZ_UNEXPECTED_EOF} | ||
1297 | if the compressed file ended before the logical end-of-stream was detected | ||
1298 | @code{BZ_DATA_ERROR} | ||
1299 | if a data integrity error was detected in the compressed stream | ||
1300 | @code{BZ_DATA_ERROR_MAGIC} | ||
1301 | if the stream does not begin with the requisite header bytes (ie, is not | ||
1302 | a @code{bzip2} data file). This is really a special case of @code{BZ_DATA_ERROR}. | ||
1303 | @code{BZ_MEM_ERROR} | ||
1304 | if insufficient memory was available | ||
1305 | @code{BZ_STREAM_END} | ||
1306 | if the logical end of stream was detected. | ||
1307 | @code{BZ_OK} | ||
1308 | otherwise. | ||
1309 | @end display | ||
1310 | |||
1311 | Possible return values: | ||
1312 | @display | ||
1313 | number of bytes read | ||
1314 | if @code{bzerror} is @code{BZ_OK} or @code{BZ_STREAM_END} | ||
1315 | undefined | ||
1316 | otherwise | ||
1317 | @end display | ||
1318 | |||
1319 | Allowable next actions: | ||
1320 | @display | ||
1321 | collect data from @code{buf}, then @code{BZ2_bzRead} or @code{BZ2_bzReadClose} | ||
1322 | if @code{bzerror} is @code{BZ_OK} | ||
1323 | collect data from @code{buf}, then @code{BZ2_bzReadClose} or @code{BZ2_bzReadGetUnused} | ||
1324 | if @code{bzerror} is @code{BZ_SEQUENCE_END} | ||
1325 | @code{BZ2_bzReadClose} | ||
1326 | otherwise | ||
1327 | @end display | ||
1328 | |||
1329 | |||
1330 | |||
1331 | @subsection @code{BZ2_bzReadGetUnused} | ||
1332 | @example | ||
1333 | void BZ2_bzReadGetUnused ( int* bzerror, BZFILE *b, | ||
1334 | void** unused, int* nUnused ); | ||
1335 | @end example | ||
1336 | Returns data which was read from the compressed file but was not needed | ||
1337 | to get to the logical end-of-stream. @code{*unused} is set to the address | ||
1338 | of the data, and @code{*nUnused} to the number of bytes. @code{*nUnused} will | ||
1339 | be set to a value between @code{0} and @code{BZ_MAX_UNUSED} inclusive. | ||
1340 | |||
1341 | This function may only be called once @code{BZ2_bzRead} has signalled | ||
1342 | @code{BZ_STREAM_END} but before @code{BZ2_bzReadClose}. | ||
1343 | |||
1344 | Possible assignments to @code{bzerror}: | ||
1345 | @display | ||
1346 | @code{BZ_PARAM_ERROR} | ||
1347 | if @code{b} is @code{NULL} | ||
1348 | or @code{unused} is @code{NULL} or @code{nUnused} is @code{NULL} | ||
1349 | @code{BZ_SEQUENCE_ERROR} | ||
1350 | if @code{BZ_STREAM_END} has not been signalled | ||
1351 | or if @code{b} was opened with @code{BZ2_bzWriteOpen} | ||
1352 | @code{BZ_OK} | ||
1353 | otherwise | ||
1354 | @end display | ||
1355 | |||
1356 | Allowable next actions: | ||
1357 | @display | ||
1358 | @code{BZ2_bzReadClose} | ||
1359 | @end display | ||
1360 | |||
1361 | |||
1362 | @subsection @code{BZ2_bzReadClose} | ||
1363 | @example | ||
1364 | void BZ2_bzReadClose ( int *bzerror, BZFILE *b ); | ||
1365 | @end example | ||
1366 | Releases all memory pertaining to the compressed file @code{b}. | ||
1367 | @code{BZ2_bzReadClose} does not call @code{fclose} on the underlying file | ||
1368 | handle, so you should do that yourself if appropriate. | ||
1369 | @code{BZ2_bzReadClose} should be called to clean up after all error | ||
1370 | situations. | ||
1371 | |||
1372 | Possible assignments to @code{bzerror}: | ||
1373 | @display | ||
1374 | @code{BZ_SEQUENCE_ERROR} | ||
1375 | if @code{b} was opened with @code{BZ2_bzOpenWrite} | ||
1376 | @code{BZ_OK} | ||
1377 | otherwise | ||
1378 | @end display | ||
1379 | |||
1380 | Allowable next actions: | ||
1381 | @display | ||
1382 | none | ||
1383 | @end display | ||
1384 | |||
1385 | |||
1386 | |||
1387 | @subsection @code{BZ2_bzWriteOpen} | ||
1388 | @example | ||
1389 | BZFILE *BZ2_bzWriteOpen ( int *bzerror, FILE *f, | ||
1390 | int blockSize100k, int verbosity, | ||
1391 | int workFactor ); | ||
1392 | @end example | ||
1393 | Prepare to write compressed data to file handle @code{f}. | ||
1394 | @code{f} should refer to | ||
1395 | a file which has been opened for writing, and for which the error | ||
1396 | indicator (@code{ferror(f)})is not set. | ||
1397 | |||
1398 | For the meaning of parameters @code{blockSize100k}, | ||
1399 | @code{verbosity} and @code{workFactor}, see | ||
1400 | @* @code{BZ2_bzCompressInit}. | ||
1401 | |||
1402 | All required memory is allocated at this stage, so if the call | ||
1403 | completes successfully, @code{BZ_MEM_ERROR} cannot be signalled by a | ||
1404 | subsequent call to @code{BZ2_bzWrite}. | ||
1405 | |||
1406 | Possible assignments to @code{bzerror}: | ||
1407 | @display | ||
1408 | @code{BZ_CONFIG_ERROR} | ||
1409 | if the library has been mis-compiled | ||
1410 | @code{BZ_PARAM_ERROR} | ||
1411 | if @code{f} is @code{NULL} | ||
1412 | or @code{blockSize100k < 1} or @code{blockSize100k > 9} | ||
1413 | @code{BZ_IO_ERROR} | ||
1414 | if @code{ferror(f)} is nonzero | ||
1415 | @code{BZ_MEM_ERROR} | ||
1416 | if insufficient memory is available | ||
1417 | @code{BZ_OK} | ||
1418 | otherwise | ||
1419 | @end display | ||
1420 | |||
1421 | Possible return values: | ||
1422 | @display | ||
1423 | Pointer to an abstract @code{BZFILE} | ||
1424 | if @code{bzerror} is @code{BZ_OK} | ||
1425 | @code{NULL} | ||
1426 | otherwise | ||
1427 | @end display | ||
1428 | |||
1429 | Allowable next actions: | ||
1430 | @display | ||
1431 | @code{BZ2_bzWrite} | ||
1432 | if @code{bzerror} is @code{BZ_OK} | ||
1433 | (you could go directly to @code{BZ2_bzWriteClose}, but this would be pretty pointless) | ||
1434 | @code{BZ2_bzWriteClose} | ||
1435 | otherwise | ||
1436 | @end display | ||
1437 | |||
1438 | |||
1439 | |||
1440 | @subsection @code{BZ2_bzWrite} | ||
1441 | @example | ||
1442 | void BZ2_bzWrite ( int *bzerror, BZFILE *b, void *buf, int len ); | ||
1443 | @end example | ||
1444 | Absorbs @code{len} bytes from the buffer @code{buf}, eventually to be | ||
1445 | compressed and written to the file. | ||
1446 | |||
1447 | Possible assignments to @code{bzerror}: | ||
1448 | @display | ||
1449 | @code{BZ_PARAM_ERROR} | ||
1450 | if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0} | ||
1451 | @code{BZ_SEQUENCE_ERROR} | ||
1452 | if b was opened with @code{BZ2_bzReadOpen} | ||
1453 | @code{BZ_IO_ERROR} | ||
1454 | if there is an error writing the compressed file. | ||
1455 | @code{BZ_OK} | ||
1456 | otherwise | ||
1457 | @end display | ||
1458 | |||
1459 | |||
1460 | |||
1461 | |||
1462 | @subsection @code{BZ2_bzWriteClose} | ||
1463 | @example | ||
1464 | void BZ2_bzWriteClose ( int *bzerror, BZFILE* f, | ||
1465 | int abandon, | ||
1466 | unsigned int* nbytes_in, | ||
1467 | unsigned int* nbytes_out ); | ||
1468 | |||
1469 | void BZ2_bzWriteClose64 ( int *bzerror, BZFILE* f, | ||
1470 | int abandon, | ||
1471 | unsigned int* nbytes_in_lo32, | ||
1472 | unsigned int* nbytes_in_hi32, | ||
1473 | unsigned int* nbytes_out_lo32, | ||
1474 | unsigned int* nbytes_out_hi32 ); | ||
1475 | @end example | ||
1476 | |||
1477 | Compresses and flushes to the compressed file all data so far supplied | ||
1478 | by @code{BZ2_bzWrite}. The logical end-of-stream markers are also written, so | ||
1479 | subsequent calls to @code{BZ2_bzWrite} are illegal. All memory associated | ||
1480 | with the compressed file @code{b} is released. | ||
1481 | @code{fflush} is called on the | ||
1482 | compressed file, but it is not @code{fclose}'d. | ||
1483 | |||
1484 | If @code{BZ2_bzWriteClose} is called to clean up after an error, the only | ||
1485 | action is to release the memory. The library records the error codes | ||
1486 | issued by previous calls, so this situation will be detected | ||
1487 | automatically. There is no attempt to complete the compression | ||
1488 | operation, nor to @code{fflush} the compressed file. You can force this | ||
1489 | behaviour to happen even in the case of no error, by passing a nonzero | ||
1490 | value to @code{abandon}. | ||
1491 | |||
1492 | If @code{nbytes_in} is non-null, @code{*nbytes_in} will be set to be the | ||
1493 | total volume of uncompressed data handled. Similarly, @code{nbytes_out} | ||
1494 | will be set to the total volume of compressed data written. For | ||
1495 | compatibility with older versions of the library, @code{BZ2_bzWriteClose} | ||
1496 | only yields the lower 32 bits of these counts. Use | ||
1497 | @code{BZ2_bzWriteClose64} if you want the full 64 bit counts. These | ||
1498 | two functions are otherwise absolutely identical. | ||
1499 | |||
1500 | |||
1501 | Possible assignments to @code{bzerror}: | ||
1502 | @display | ||
1503 | @code{BZ_SEQUENCE_ERROR} | ||
1504 | if @code{b} was opened with @code{BZ2_bzReadOpen} | ||
1505 | @code{BZ_IO_ERROR} | ||
1506 | if there is an error writing the compressed file | ||
1507 | @code{BZ_OK} | ||
1508 | otherwise | ||
1509 | @end display | ||
1510 | |||
1511 | @subsection Handling embedded compressed data streams | ||
1512 | |||
1513 | The high-level library facilitates use of | ||
1514 | @code{bzip2} data streams which form some part of a surrounding, larger | ||
1515 | data stream. | ||
1516 | @itemize @bullet | ||
1517 | @item For writing, the library takes an open file handle, writes | ||
1518 | compressed data to it, @code{fflush}es it but does not @code{fclose} it. | ||
1519 | The calling application can write its own data before and after the | ||
1520 | compressed data stream, using that same file handle. | ||
1521 | @item Reading is more complex, and the facilities are not as general | ||
1522 | as they could be since generality is hard to reconcile with efficiency. | ||
1523 | @code{BZ2_bzRead} reads from the compressed file in blocks of size | ||
1524 | @code{BZ_MAX_UNUSED} bytes, and in doing so probably will overshoot | ||
1525 | the logical end of compressed stream. | ||
1526 | To recover this data once decompression has | ||
1527 | ended, call @code{BZ2_bzReadGetUnused} after the last call of @code{BZ2_bzRead} | ||
1528 | (the one returning @code{BZ_STREAM_END}) but before calling | ||
1529 | @code{BZ2_bzReadClose}. | ||
1530 | @end itemize | ||
1531 | |||
1532 | This mechanism makes it easy to decompress multiple @code{bzip2} | ||
1533 | streams placed end-to-end. As the end of one stream, when @code{BZ2_bzRead} | ||
1534 | returns @code{BZ_STREAM_END}, call @code{BZ2_bzReadGetUnused} to collect the | ||
1535 | unused data (copy it into your own buffer somewhere). | ||
1536 | That data forms the start of the next compressed stream. | ||
1537 | To start uncompressing that next stream, call @code{BZ2_bzReadOpen} again, | ||
1538 | feeding in the unused data via the @code{unused}/@code{nUnused} | ||
1539 | parameters. | ||
1540 | Keep doing this until @code{BZ_STREAM_END} return coincides with the | ||
1541 | physical end of file (@code{feof(f)}). In this situation | ||
1542 | @code{BZ2_bzReadGetUnused} | ||
1543 | will of course return no data. | ||
1544 | |||
1545 | This should give some feel for how the high-level interface can be used. | ||
1546 | If you require extra flexibility, you'll have to bite the bullet and get | ||
1547 | to grips with the low-level interface. | ||
1548 | |||
1549 | @subsection Standard file-reading/writing code | ||
1550 | Here's how you'd write data to a compressed file: | ||
1551 | @example @code | ||
1552 | FILE* f; | ||
1553 | BZFILE* b; | ||
1554 | int nBuf; | ||
1555 | char buf[ /* whatever size you like */ ]; | ||
1556 | int bzerror; | ||
1557 | int nWritten; | ||
1558 | |||
1559 | f = fopen ( "myfile.bz2", "w" ); | ||
1560 | if (!f) @{ | ||
1561 | /* handle error */ | ||
1562 | @} | ||
1563 | b = BZ2_bzWriteOpen ( &bzerror, f, 9 ); | ||
1564 | if (bzerror != BZ_OK) @{ | ||
1565 | BZ2_bzWriteClose ( b ); | ||
1566 | /* handle error */ | ||
1567 | @} | ||
1568 | |||
1569 | while ( /* condition */ ) @{ | ||
1570 | /* get data to write into buf, and set nBuf appropriately */ | ||
1571 | nWritten = BZ2_bzWrite ( &bzerror, b, buf, nBuf ); | ||
1572 | if (bzerror == BZ_IO_ERROR) @{ | ||
1573 | BZ2_bzWriteClose ( &bzerror, b ); | ||
1574 | /* handle error */ | ||
1575 | @} | ||
1576 | @} | ||
1577 | |||
1578 | BZ2_bzWriteClose ( &bzerror, b ); | ||
1579 | if (bzerror == BZ_IO_ERROR) @{ | ||
1580 | /* handle error */ | ||
1581 | @} | ||
1582 | @end example | ||
1583 | And to read from a compressed file: | ||
1584 | @example | ||
1585 | FILE* f; | ||
1586 | BZFILE* b; | ||
1587 | int nBuf; | ||
1588 | char buf[ /* whatever size you like */ ]; | ||
1589 | int bzerror; | ||
1590 | int nWritten; | ||
1591 | |||
1592 | f = fopen ( "myfile.bz2", "r" ); | ||
1593 | if (!f) @{ | ||
1594 | /* handle error */ | ||
1595 | @} | ||
1596 | b = BZ2_bzReadOpen ( &bzerror, f, 0, NULL, 0 ); | ||
1597 | if (bzerror != BZ_OK) @{ | ||
1598 | BZ2_bzReadClose ( &bzerror, b ); | ||
1599 | /* handle error */ | ||
1600 | @} | ||
1601 | |||
1602 | bzerror = BZ_OK; | ||
1603 | while (bzerror == BZ_OK && /* arbitrary other conditions */) @{ | ||
1604 | nBuf = BZ2_bzRead ( &bzerror, b, buf, /* size of buf */ ); | ||
1605 | if (bzerror == BZ_OK) @{ | ||
1606 | /* do something with buf[0 .. nBuf-1] */ | ||
1607 | @} | ||
1608 | @} | ||
1609 | if (bzerror != BZ_STREAM_END) @{ | ||
1610 | BZ2_bzReadClose ( &bzerror, b ); | ||
1611 | /* handle error */ | ||
1612 | @} else @{ | ||
1613 | BZ2_bzReadClose ( &bzerror ); | ||
1614 | @} | ||
1615 | @end example | ||
1616 | |||
1617 | |||
1618 | |||
1619 | @section Utility functions | ||
1620 | @subsection @code{BZ2_bzBuffToBuffCompress} | ||
1621 | @example | ||
1622 | int BZ2_bzBuffToBuffCompress( char* dest, | ||
1623 | unsigned int* destLen, | ||
1624 | char* source, | ||
1625 | unsigned int sourceLen, | ||
1626 | int blockSize100k, | ||
1627 | int verbosity, | ||
1628 | int workFactor ); | ||
1629 | @end example | ||
1630 | Attempts to compress the data in @code{source[0 .. sourceLen-1]} | ||
1631 | into the destination buffer, @code{dest[0 .. *destLen-1]}. | ||
1632 | If the destination buffer is big enough, @code{*destLen} is | ||
1633 | set to the size of the compressed data, and @code{BZ_OK} is | ||
1634 | returned. If the compressed data won't fit, @code{*destLen} | ||
1635 | is unchanged, and @code{BZ_OUTBUFF_FULL} is returned. | ||
1636 | |||
1637 | Compression in this manner is a one-shot event, done with a single call | ||
1638 | to this function. The resulting compressed data is a complete | ||
1639 | @code{bzip2} format data stream. There is no mechanism for making | ||
1640 | additional calls to provide extra input data. If you want that kind of | ||
1641 | mechanism, use the low-level interface. | ||
1642 | |||
1643 | For the meaning of parameters @code{blockSize100k}, @code{verbosity} | ||
1644 | and @code{workFactor}, @* see @code{BZ2_bzCompressInit}. | ||
1645 | |||
1646 | To guarantee that the compressed data will fit in its buffer, allocate | ||
1647 | an output buffer of size 1% larger than the uncompressed data, plus | ||
1648 | six hundred extra bytes. | ||
1649 | |||
1650 | @code{BZ2_bzBuffToBuffDecompress} will not write data at or | ||
1651 | beyond @code{dest[*destLen]}, even in case of buffer overflow. | ||
1652 | |||
1653 | Possible return values: | ||
1654 | @display | ||
1655 | @code{BZ_CONFIG_ERROR} | ||
1656 | if the library has been mis-compiled | ||
1657 | @code{BZ_PARAM_ERROR} | ||
1658 | if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL} | ||
1659 | or @code{blockSize100k < 1} or @code{blockSize100k > 9} | ||
1660 | or @code{verbosity < 0} or @code{verbosity > 4} | ||
1661 | or @code{workFactor < 0} or @code{workFactor > 250} | ||
1662 | @code{BZ_MEM_ERROR} | ||
1663 | if insufficient memory is available | ||
1664 | @code{BZ_OUTBUFF_FULL} | ||
1665 | if the size of the compressed data exceeds @code{*destLen} | ||
1666 | @code{BZ_OK} | ||
1667 | otherwise | ||
1668 | @end display | ||
1669 | |||
1670 | |||
1671 | |||
1672 | @subsection @code{BZ2_bzBuffToBuffDecompress} | ||
1673 | @example | ||
1674 | int BZ2_bzBuffToBuffDecompress ( char* dest, | ||
1675 | unsigned int* destLen, | ||
1676 | char* source, | ||
1677 | unsigned int sourceLen, | ||
1678 | int small, | ||
1679 | int verbosity ); | ||
1680 | @end example | ||
1681 | Attempts to decompress the data in @code{source[0 .. sourceLen-1]} | ||
1682 | into the destination buffer, @code{dest[0 .. *destLen-1]}. | ||
1683 | If the destination buffer is big enough, @code{*destLen} is | ||
1684 | set to the size of the uncompressed data, and @code{BZ_OK} is | ||
1685 | returned. If the compressed data won't fit, @code{*destLen} | ||
1686 | is unchanged, and @code{BZ_OUTBUFF_FULL} is returned. | ||
1687 | |||
1688 | @code{source} is assumed to hold a complete @code{bzip2} format | ||
1689 | data stream. @* @code{BZ2_bzBuffToBuffDecompress} tries to decompress | ||
1690 | the entirety of the stream into the output buffer. | ||
1691 | |||
1692 | For the meaning of parameters @code{small} and @code{verbosity}, | ||
1693 | see @code{BZ2_bzDecompressInit}. | ||
1694 | |||
1695 | Because the compression ratio of the compressed data cannot be known in | ||
1696 | advance, there is no easy way to guarantee that the output buffer will | ||
1697 | be big enough. You may of course make arrangements in your code to | ||
1698 | record the size of the uncompressed data, but such a mechanism is beyond | ||
1699 | the scope of this library. | ||
1700 | |||
1701 | @code{BZ2_bzBuffToBuffDecompress} will not write data at or | ||
1702 | beyond @code{dest[*destLen]}, even in case of buffer overflow. | ||
1703 | |||
1704 | Possible return values: | ||
1705 | @display | ||
1706 | @code{BZ_CONFIG_ERROR} | ||
1707 | if the library has been mis-compiled | ||
1708 | @code{BZ_PARAM_ERROR} | ||
1709 | if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL} | ||
1710 | or @code{small != 0 && small != 1} | ||
1711 | or @code{verbosity < 0} or @code{verbosity > 4} | ||
1712 | @code{BZ_MEM_ERROR} | ||
1713 | if insufficient memory is available | ||
1714 | @code{BZ_OUTBUFF_FULL} | ||
1715 | if the size of the compressed data exceeds @code{*destLen} | ||
1716 | @code{BZ_DATA_ERROR} | ||
1717 | if a data integrity error was detected in the compressed data | ||
1718 | @code{BZ_DATA_ERROR_MAGIC} | ||
1719 | if the compressed data doesn't begin with the right magic bytes | ||
1720 | @code{BZ_UNEXPECTED_EOF} | ||
1721 | if the compressed data ends unexpectedly | ||
1722 | @code{BZ_OK} | ||
1723 | otherwise | ||
1724 | @end display | ||
1725 | |||
1726 | |||
1727 | |||
1728 | @section @code{zlib} compatibility functions | ||
1729 | Yoshioka Tsuneo has contributed some functions to | ||
1730 | give better @code{zlib} compatibility. These functions are | ||
1731 | @code{BZ2_bzopen}, @code{BZ2_bzread}, @code{BZ2_bzwrite}, @code{BZ2_bzflush}, | ||
1732 | @code{BZ2_bzclose}, | ||
1733 | @code{BZ2_bzerror} and @code{BZ2_bzlibVersion}. | ||
1734 | These functions are not (yet) officially part of | ||
1735 | the library. If they break, you get to keep all the pieces. | ||
1736 | Nevertheless, I think they work ok. | ||
1737 | @example | ||
1738 | typedef void BZFILE; | ||
1739 | |||
1740 | const char * BZ2_bzlibVersion ( void ); | ||
1741 | @end example | ||
1742 | Returns a string indicating the library version. | ||
1743 | @example | ||
1744 | BZFILE * BZ2_bzopen ( const char *path, const char *mode ); | ||
1745 | BZFILE * BZ2_bzdopen ( int fd, const char *mode ); | ||
1746 | @end example | ||
1747 | Opens a @code{.bz2} file for reading or writing, using either its name | ||
1748 | or a pre-existing file descriptor. | ||
1749 | Analogous to @code{fopen} and @code{fdopen}. | ||
1750 | @example | ||
1751 | int BZ2_bzread ( BZFILE* b, void* buf, int len ); | ||
1752 | int BZ2_bzwrite ( BZFILE* b, void* buf, int len ); | ||
1753 | @end example | ||
1754 | Reads/writes data from/to a previously opened @code{BZFILE}. | ||
1755 | Analogous to @code{fread} and @code{fwrite}. | ||
1756 | @example | ||
1757 | int BZ2_bzflush ( BZFILE* b ); | ||
1758 | void BZ2_bzclose ( BZFILE* b ); | ||
1759 | @end example | ||
1760 | Flushes/closes a @code{BZFILE}. @code{BZ2_bzflush} doesn't actually do | ||
1761 | anything. Analogous to @code{fflush} and @code{fclose}. | ||
1762 | |||
1763 | @example | ||
1764 | const char * BZ2_bzerror ( BZFILE *b, int *errnum ) | ||
1765 | @end example | ||
1766 | Returns a string describing the more recent error status of | ||
1767 | @code{b}, and also sets @code{*errnum} to its numerical value. | ||
1768 | |||
1769 | |||
1770 | @section Using the library in a @code{stdio}-free environment | ||
1771 | |||
1772 | @subsection Getting rid of @code{stdio} | ||
1773 | |||
1774 | In a deeply embedded application, you might want to use just | ||
1775 | the memory-to-memory functions. You can do this conveniently | ||
1776 | by compiling the library with preprocessor symbol @code{BZ_NO_STDIO} | ||
1777 | defined. Doing this gives you a library containing only the following | ||
1778 | eight functions: | ||
1779 | |||
1780 | @code{BZ2_bzCompressInit}, @code{BZ2_bzCompress}, @code{BZ2_bzCompressEnd} @* | ||
1781 | @code{BZ2_bzDecompressInit}, @code{BZ2_bzDecompress}, @code{BZ2_bzDecompressEnd} @* | ||
1782 | @code{BZ2_bzBuffToBuffCompress}, @code{BZ2_bzBuffToBuffDecompress} | ||
1783 | |||
1784 | When compiled like this, all functions will ignore @code{verbosity} | ||
1785 | settings. | ||
1786 | |||
1787 | @subsection Critical error handling | ||
1788 | @code{libbzip2} contains a number of internal assertion checks which | ||
1789 | should, needless to say, never be activated. Nevertheless, if an | ||
1790 | assertion should fail, behaviour depends on whether or not the library | ||
1791 | was compiled with @code{BZ_NO_STDIO} set. | ||
1792 | |||
1793 | For a normal compile, an assertion failure yields the message | ||
1794 | @example | ||
1795 | bzip2/libbzip2: internal error number N. | ||
1796 | This is a bug in bzip2/libbzip2, 1.0.2, 30-Dec-2001. | ||
1797 | Please report it to me at: jseward@@acm.org. If this happened | ||
1798 | when you were using some program which uses libbzip2 as a | ||
1799 | component, you should also report this bug to the author(s) | ||
1800 | of that program. Please make an effort to report this bug; | ||
1801 | timely and accurate bug reports eventually lead to higher | ||
1802 | quality software. Thanks. Julian Seward, 30 December 2001. | ||
1803 | @end example | ||
1804 | where @code{N} is some error code number. If @code{N == 1007}, it also | ||
1805 | prints some extra text advising the reader that unreliable memory is | ||
1806 | often associated with internal error 1007. (This is a | ||
1807 | frequently-observed-phenomenon with versions 1.0.0/1.0.1). | ||
1808 | |||
1809 | @code{exit(3)} is then called. | ||
1810 | |||
1811 | For a @code{stdio}-free library, assertion failures result | ||
1812 | in a call to a function declared as: | ||
1813 | @example | ||
1814 | extern void bz_internal_error ( int errcode ); | ||
1815 | @end example | ||
1816 | The relevant code is passed as a parameter. You should supply | ||
1817 | such a function. | ||
1818 | |||
1819 | In either case, once an assertion failure has occurred, any | ||
1820 | @code{bz_stream} records involved can be regarded as invalid. | ||
1821 | You should not attempt to resume normal operation with them. | ||
1822 | |||
1823 | You may, of course, change critical error handling to suit | ||
1824 | your needs. As I said above, critical errors indicate bugs | ||
1825 | in the library and should not occur. All "normal" error | ||
1826 | situations are indicated via error return codes from functions, | ||
1827 | and can be recovered from. | ||
1828 | |||
1829 | |||
1830 | @section Making a Windows DLL | ||
1831 | Everything related to Windows has been contributed by Yoshioka Tsuneo | ||
1832 | @* (@code{QWF00133@@niftyserve.or.jp} / | ||
1833 | @code{tsuneo-y@@is.aist-nara.ac.jp}), so you should send your queries to | ||
1834 | him (but perhaps Cc: me, @code{jseward@@acm.org}). | ||
1835 | |||
1836 | My vague understanding of what to do is: using Visual C++ 5.0, | ||
1837 | open the project file @code{libbz2.dsp}, and build. That's all. | ||
1838 | |||
1839 | If you can't | ||
1840 | open the project file for some reason, make a new one, naming these files: | ||
1841 | @code{blocksort.c}, @code{bzlib.c}, @code{compress.c}, | ||
1842 | @code{crctable.c}, @code{decompress.c}, @code{huffman.c}, @* | ||
1843 | @code{randtable.c} and @code{libbz2.def}. You will also need | ||
1844 | to name the header files @code{bzlib.h} and @code{bzlib_private.h}. | ||
1845 | |||
1846 | If you don't use VC++, you may need to define the proprocessor symbol | ||
1847 | @code{_WIN32}. | ||
1848 | |||
1849 | Finally, @code{dlltest.c} is a sample program using the DLL. It has a | ||
1850 | project file, @code{dlltest.dsp}. | ||
1851 | |||
1852 | If you just want a makefile for Visual C, have a look at | ||
1853 | @code{makefile.msc}. | ||
1854 | |||
1855 | Be aware that if you compile @code{bzip2} itself on Win32, you must set | ||
1856 | @code{BZ_UNIX} to 0 and @code{BZ_LCCWIN32} to 1, in the file | ||
1857 | @code{bzip2.c}, before compiling. Otherwise the resulting binary won't | ||
1858 | work correctly. | ||
1859 | |||
1860 | I haven't tried any of this stuff myself, but it all looks plausible. | ||
1861 | |||
1862 | |||
1863 | |||
1864 | @chapter Miscellanea | ||
1865 | |||
1866 | These are just some random thoughts of mine. Your mileage may | ||
1867 | vary. | ||
1868 | |||
1869 | @section Limitations of the compressed file format | ||
1870 | @code{bzip2-1.0}, @code{0.9.5} and @code{0.9.0} | ||
1871 | use exactly the same file format as the previous | ||
1872 | version, @code{bzip2-0.1}. This decision was made in the interests of | ||
1873 | stability. Creating yet another incompatible compressed file format | ||
1874 | would create further confusion and disruption for users. | ||
1875 | |||
1876 | Nevertheless, this is not a painless decision. Development | ||
1877 | work since the release of @code{bzip2-0.1} in August 1997 | ||
1878 | has shown complexities in the file format which slow down | ||
1879 | decompression and, in retrospect, are unnecessary. These are: | ||
1880 | @itemize @bullet | ||
1881 | @item The run-length encoder, which is the first of the | ||
1882 | compression transformations, is entirely irrelevant. | ||
1883 | The original purpose was to protect the sorting algorithm | ||
1884 | from the very worst case input: a string of repeated | ||
1885 | symbols. But algorithm steps Q6a and Q6b in the original | ||
1886 | Burrows-Wheeler technical report (SRC-124) show how | ||
1887 | repeats can be handled without difficulty in block | ||
1888 | sorting. | ||
1889 | @item The randomisation mechanism doesn't really need to be | ||
1890 | there. Udi Manber and Gene Myers published a suffix | ||
1891 | array construction algorithm a few years back, which | ||
1892 | can be employed to sort any block, no matter how | ||
1893 | repetitive, in O(N log N) time. Subsequent work by | ||
1894 | Kunihiko Sadakane has produced a derivative O(N (log N)^2) | ||
1895 | algorithm which usually outperforms the Manber-Myers | ||
1896 | algorithm. | ||
1897 | |||
1898 | I could have changed to Sadakane's algorithm, but I find | ||
1899 | it to be slower than @code{bzip2}'s existing algorithm for | ||
1900 | most inputs, and the randomisation mechanism protects | ||
1901 | adequately against bad cases. I didn't think it was | ||
1902 | a good tradeoff to make. Partly this is due to the fact | ||
1903 | that I was not flooded with email complaints about | ||
1904 | @code{bzip2-0.1}'s performance on repetitive data, so | ||
1905 | perhaps it isn't a problem for real inputs. | ||
1906 | |||
1907 | Probably the best long-term solution, | ||
1908 | and the one I have incorporated into 0.9.5 and above, | ||
1909 | is to use the existing sorting | ||
1910 | algorithm initially, and fall back to a O(N (log N)^2) | ||
1911 | algorithm if the standard algorithm gets into difficulties. | ||
1912 | @item The compressed file format was never designed to be | ||
1913 | handled by a library, and I have had to jump though | ||
1914 | some hoops to produce an efficient implementation of | ||
1915 | decompression. It's a bit hairy. Try passing | ||
1916 | @code{decompress.c} through the C preprocessor | ||
1917 | and you'll see what I mean. Much of this complexity | ||
1918 | could have been avoided if the compressed size of | ||
1919 | each block of data was recorded in the data stream. | ||
1920 | @item An Adler-32 checksum, rather than a CRC32 checksum, | ||
1921 | would be faster to compute. | ||
1922 | @end itemize | ||
1923 | It would be fair to say that the @code{bzip2} format was frozen | ||
1924 | before I properly and fully understood the performance | ||
1925 | consequences of doing so. | ||
1926 | |||
1927 | Improvements which I was able to incorporate into | ||
1928 | 0.9.0, despite using the same file format, are: | ||
1929 | @itemize @bullet | ||
1930 | @item Single array implementation of the inverse BWT. This | ||
1931 | significantly speeds up decompression, presumably | ||
1932 | because it reduces the number of cache misses. | ||
1933 | @item Faster inverse MTF transform for large MTF values. The | ||
1934 | new implementation is based on the notion of sliding blocks | ||
1935 | of values. | ||
1936 | @item @code{bzip2-0.9.0} now reads and writes files with @code{fread} | ||
1937 | and @code{fwrite}; version 0.1 used @code{putc} and @code{getc}. | ||
1938 | Duh! Well, you live and learn. | ||
1939 | |||
1940 | @end itemize | ||
1941 | Further ahead, it would be nice | ||
1942 | to be able to do random access into files. This will | ||
1943 | require some careful design of compressed file formats. | ||
1944 | |||
1945 | |||
1946 | |||
1947 | @section Portability issues | ||
1948 | After some consideration, I have decided not to use | ||
1949 | GNU @code{autoconf} to configure 0.9.5 or 1.0. | ||
1950 | |||
1951 | @code{autoconf}, admirable and wonderful though it is, | ||
1952 | mainly assists with portability problems between Unix-like | ||
1953 | platforms. But @code{bzip2} doesn't have much in the way | ||
1954 | of portability problems on Unix; most of the difficulties appear | ||
1955 | when porting to the Mac, or to Microsoft's operating systems. | ||
1956 | @code{autoconf} doesn't help in those cases, and brings in a | ||
1957 | whole load of new complexity. | ||
1958 | |||
1959 | Most people should be able to compile the library and program | ||
1960 | under Unix straight out-of-the-box, so to speak, especially | ||
1961 | if you have a version of GNU C available. | ||
1962 | |||
1963 | There are a couple of @code{__inline__} directives in the code. GNU C | ||
1964 | (@code{gcc}) should be able to handle them. If you're not using | ||
1965 | GNU C, your C compiler shouldn't see them at all. | ||
1966 | If your compiler does, for some reason, see them and doesn't | ||
1967 | like them, just @code{#define} @code{__inline__} to be @code{/* */}. One | ||
1968 | easy way to do this is to compile with the flag @code{-D__inline__=}, | ||
1969 | which should be understood by most Unix compilers. | ||
1970 | |||
1971 | If you still have difficulties, try compiling with the macro | ||
1972 | @code{BZ_STRICT_ANSI} defined. This should enable you to build the | ||
1973 | library in a strictly ANSI compliant environment. Building the program | ||
1974 | itself like this is dangerous and not supported, since you remove | ||
1975 | @code{bzip2}'s checks against compressing directories, symbolic links, | ||
1976 | devices, and other not-really-a-file entities. This could cause | ||
1977 | filesystem corruption! | ||
1978 | |||
1979 | One other thing: if you create a @code{bzip2} binary for public | ||
1980 | distribution, please try and link it statically (@code{gcc -s}). This | ||
1981 | avoids all sorts of library-version issues that others may encounter | ||
1982 | later on. | ||
1983 | |||
1984 | If you build @code{bzip2} on Win32, you must set @code{BZ_UNIX} to 0 and | ||
1985 | @code{BZ_LCCWIN32} to 1, in the file @code{bzip2.c}, before compiling. | ||
1986 | Otherwise the resulting binary won't work correctly. | ||
1987 | |||
1988 | |||
1989 | |||
1990 | @section Reporting bugs | ||
1991 | I tried pretty hard to make sure @code{bzip2} is | ||
1992 | bug free, both by design and by testing. Hopefully | ||
1993 | you'll never need to read this section for real. | ||
1994 | |||
1995 | Nevertheless, if @code{bzip2} dies with a segmentation | ||
1996 | fault, a bus error or an internal assertion failure, it | ||
1997 | will ask you to email me a bug report. Experience with | ||
1998 | version 0.1 shows that almost all these problems can | ||
1999 | be traced to either compiler bugs or hardware problems. | ||
2000 | @itemize @bullet | ||
2001 | @item | ||
2002 | Recompile the program with no optimisation, and see if it | ||
2003 | works. And/or try a different compiler. | ||
2004 | I heard all sorts of stories about various flavours | ||
2005 | of GNU C (and other compilers) generating bad code for | ||
2006 | @code{bzip2}, and I've run across two such examples myself. | ||
2007 | |||
2008 | 2.7.X versions of GNU C are known to generate bad code from | ||
2009 | time to time, at high optimisation levels. | ||
2010 | If you get problems, try using the flags | ||
2011 | @code{-O2} @code{-fomit-frame-pointer} @code{-fno-strength-reduce}. | ||
2012 | You should specifically @emph{not} use @code{-funroll-loops}. | ||
2013 | |||
2014 | You may notice that the Makefile runs six tests as part of | ||
2015 | the build process. If the program passes all of these, it's | ||
2016 | a pretty good (but not 100%) indication that the compiler has | ||
2017 | done its job correctly. | ||
2018 | @item | ||
2019 | If @code{bzip2} crashes randomly, and the crashes are not | ||
2020 | repeatable, you may have a flaky memory subsystem. @code{bzip2} | ||
2021 | really hammers your memory hierarchy, and if it's a bit marginal, | ||
2022 | you may get these problems. Ditto if your disk or I/O subsystem | ||
2023 | is slowly failing. Yup, this really does happen. | ||
2024 | |||
2025 | Try using a different machine of the same type, and see if | ||
2026 | you can repeat the problem. | ||
2027 | @item This isn't really a bug, but ... If @code{bzip2} tells | ||
2028 | you your file is corrupted on decompression, and you | ||
2029 | obtained the file via FTP, there is a possibility that you | ||
2030 | forgot to tell FTP to do a binary mode transfer. That absolutely | ||
2031 | will cause the file to be non-decompressible. You'll have to transfer | ||
2032 | it again. | ||
2033 | @end itemize | ||
2034 | |||
2035 | If you've incorporated @code{libbzip2} into your own program | ||
2036 | and are getting problems, please, please, please, check that the | ||
2037 | parameters you are passing in calls to the library, are | ||
2038 | correct, and in accordance with what the documentation says | ||
2039 | is allowable. I have tried to make the library robust against | ||
2040 | such problems, but I'm sure I haven't succeeded. | ||
2041 | |||
2042 | Finally, if the above comments don't help, you'll have to send | ||
2043 | me a bug report. Now, it's just amazing how many people will | ||
2044 | send me a bug report saying something like | ||
2045 | @display | ||
2046 | bzip2 crashed with segmentation fault on my machine | ||
2047 | @end display | ||
2048 | and absolutely nothing else. Needless to say, a such a report | ||
2049 | is @emph{totally, utterly, completely and comprehensively 100% useless; | ||
2050 | a waste of your time, my time, and net bandwidth}. | ||
2051 | With no details at all, there's no way I can possibly begin | ||
2052 | to figure out what the problem is. | ||
2053 | |||
2054 | The rules of the game are: facts, facts, facts. Don't omit | ||
2055 | them because "oh, they won't be relevant". At the bare | ||
2056 | minimum: | ||
2057 | @display | ||
2058 | Machine type. Operating system version. | ||
2059 | Exact version of @code{bzip2} (do @code{bzip2 -V}). | ||
2060 | Exact version of the compiler used. | ||
2061 | Flags passed to the compiler. | ||
2062 | @end display | ||
2063 | However, the most important single thing that will help me is | ||
2064 | the file that you were trying to compress or decompress at the | ||
2065 | time the problem happened. Without that, my ability to do anything | ||
2066 | more than speculate about the cause, is limited. | ||
2067 | |||
2068 | Please remember that I connect to the Internet with a modem, so | ||
2069 | you should contact me before mailing me huge files. | ||
2070 | |||
2071 | |||
2072 | @section Did you get the right package? | ||
2073 | |||
2074 | @code{bzip2} is a resource hog. It soaks up large amounts of CPU cycles | ||
2075 | and memory. Also, it gives very large latencies. In the worst case, you | ||
2076 | can feed many megabytes of uncompressed data into the library before | ||
2077 | getting any compressed output, so this probably rules out applications | ||
2078 | requiring interactive behaviour. | ||
2079 | |||
2080 | These aren't faults of my implementation, I hope, but more | ||
2081 | an intrinsic property of the Burrows-Wheeler transform (unfortunately). | ||
2082 | Maybe this isn't what you want. | ||
2083 | |||
2084 | If you want a compressor and/or library which is faster, uses less | ||
2085 | memory but gets pretty good compression, and has minimal latency, | ||
2086 | consider Jean-loup | ||
2087 | Gailly's and Mark Adler's work, @code{zlib-1.1.3} and | ||
2088 | @code{gzip-1.2.4}. Look for them at | ||
2089 | |||
2090 | @code{http://www.zlib.org} and | ||
2091 | @code{http://www.gzip.org} respectively. | ||
2092 | |||
2093 | For something faster and lighter still, you might try Markus F X J | ||
2094 | Oberhumer's @code{LZO} real-time compression/decompression library, at | ||
2095 | @* @code{http://wildsau.idv.uni-linz.ac.at/mfx/lzo.html}. | ||
2096 | |||
2097 | If you want to use the @code{bzip2} algorithms to compress small blocks | ||
2098 | of data, 64k bytes or smaller, for example on an on-the-fly disk | ||
2099 | compressor, you'd be well advised not to use this library. Instead, | ||
2100 | I've made a special library tuned for that kind of use. It's part of | ||
2101 | @code{e2compr-0.40}, an on-the-fly disk compressor for the Linux | ||
2102 | @code{ext2} filesystem. Look at | ||
2103 | @code{http://www.netspace.net.au/~reiter/e2compr}. | ||
2104 | |||
2105 | |||
2106 | |||
2107 | @section Testing | ||
2108 | |||
2109 | A record of the tests I've done. | ||
2110 | |||
2111 | First, some data sets: | ||
2112 | @itemize @bullet | ||
2113 | @item B: a directory containing 6001 files, one for every length in the | ||
2114 | range 0 to 6000 bytes. The files contain random lowercase | ||
2115 | letters. 18.7 megabytes. | ||
2116 | @item H: my home directory tree. Documents, source code, mail files, | ||
2117 | compressed data. H contains B, and also a directory of | ||
2118 | files designed as boundary cases for the sorting; mostly very | ||
2119 | repetitive, nasty files. 565 megabytes. | ||
2120 | @item A: directory tree holding various applications built from source: | ||
2121 | @code{egcs}, @code{gcc-2.8.1}, KDE, GTK, Octave, etc. | ||
2122 | 2200 megabytes. | ||
2123 | @end itemize | ||
2124 | The tests conducted are as follows. Each test means compressing | ||
2125 | (a copy of) each file in the data set, decompressing it and | ||
2126 | comparing it against the original. | ||
2127 | |||
2128 | First, a bunch of tests with block sizes and internal buffer | ||
2129 | sizes set very small, | ||
2130 | to detect any problems with the | ||
2131 | blocking and buffering mechanisms. | ||
2132 | This required modifying the source code so as to try to | ||
2133 | break it. | ||
2134 | @enumerate | ||
2135 | @item Data set H, with | ||
2136 | buffer size of 1 byte, and block size of 23 bytes. | ||
2137 | @item Data set B, buffer sizes 1 byte, block size 1 byte. | ||
2138 | @item As (2) but small-mode decompression. | ||
2139 | @item As (2) with block size 2 bytes. | ||
2140 | @item As (2) with block size 3 bytes. | ||
2141 | @item As (2) with block size 4 bytes. | ||
2142 | @item As (2) with block size 5 bytes. | ||
2143 | @item As (2) with block size 6 bytes and small-mode decompression. | ||
2144 | @item H with buffer size of 1 byte, but normal block | ||
2145 | size (up to 900000 bytes). | ||
2146 | @end enumerate | ||
2147 | Then some tests with unmodified source code. | ||
2148 | @enumerate | ||
2149 | @item H, all settings normal. | ||
2150 | @item As (1), with small-mode decompress. | ||
2151 | @item H, compress with flag @code{-1}. | ||
2152 | @item H, compress with flag @code{-s}, decompress with flag @code{-s}. | ||
2153 | @item Forwards compatibility: H, @code{bzip2-0.1pl2} compressing, | ||
2154 | @code{bzip2-0.9.5} decompressing, all settings normal. | ||
2155 | @item Backwards compatibility: H, @code{bzip2-0.9.5} compressing, | ||
2156 | @code{bzip2-0.1pl2} decompressing, all settings normal. | ||
2157 | @item Bigger tests: A, all settings normal. | ||
2158 | @item As (7), using the fallback (Sadakane-like) sorting algorithm. | ||
2159 | @item As (8), compress with flag @code{-1}, decompress with flag | ||
2160 | @code{-s}. | ||
2161 | @item H, using the fallback sorting algorithm. | ||
2162 | @item Forwards compatibility: A, @code{bzip2-0.1pl2} compressing, | ||
2163 | @code{bzip2-0.9.5} decompressing, all settings normal. | ||
2164 | @item Backwards compatibility: A, @code{bzip2-0.9.5} compressing, | ||
2165 | @code{bzip2-0.1pl2} decompressing, all settings normal. | ||
2166 | @item Misc test: about 400 megabytes of @code{.tar} files with | ||
2167 | @code{bzip2} compiled with Checker (a memory access error | ||
2168 | detector, like Purify). | ||
2169 | @item Misc tests to make sure it builds and runs ok on non-Linux/x86 | ||
2170 | platforms. | ||
2171 | @end enumerate | ||
2172 | These tests were conducted on a 225 MHz IDT WinChip machine, running | ||
2173 | Linux 2.0.36. They represent nearly a week of continuous computation. | ||
2174 | All tests completed successfully. | ||
2175 | |||
2176 | |||
2177 | @section Further reading | ||
2178 | @code{bzip2} is not research work, in the sense that it doesn't present | ||
2179 | any new ideas. Rather, it's an engineering exercise based on existing | ||
2180 | ideas. | ||
2181 | |||
2182 | Four documents describe essentially all the ideas behind @code{bzip2}: | ||
2183 | @example | ||
2184 | Michael Burrows and D. J. Wheeler: | ||
2185 | "A block-sorting lossless data compression algorithm" | ||
2186 | 10th May 1994. | ||
2187 | Digital SRC Research Report 124. | ||
2188 | ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz | ||
2189 | If you have trouble finding it, try searching at the | ||
2190 | New Zealand Digital Library, http://www.nzdl.org. | ||
2191 | |||
2192 | Daniel S. Hirschberg and Debra A. LeLewer | ||
2193 | "Efficient Decoding of Prefix Codes" | ||
2194 | Communications of the ACM, April 1990, Vol 33, Number 4. | ||
2195 | You might be able to get an electronic copy of this | ||
2196 | from the ACM Digital Library. | ||
2197 | |||
2198 | David J. Wheeler | ||
2199 | Program bred3.c and accompanying document bred3.ps. | ||
2200 | This contains the idea behind the multi-table Huffman | ||
2201 | coding scheme. | ||
2202 | ftp://ftp.cl.cam.ac.uk/users/djw3/ | ||
2203 | |||
2204 | Jon L. Bentley and Robert Sedgewick | ||
2205 | "Fast Algorithms for Sorting and Searching Strings" | ||
2206 | Available from Sedgewick's web page, | ||
2207 | www.cs.princeton.edu/~rs | ||
2208 | @end example | ||
2209 | The following paper gives valuable additional insights into the | ||
2210 | algorithm, but is not immediately the basis of any code | ||
2211 | used in bzip2. | ||
2212 | @example | ||
2213 | Peter Fenwick: | ||
2214 | Block Sorting Text Compression | ||
2215 | Proceedings of the 19th Australasian Computer Science Conference, | ||
2216 | Melbourne, Australia. Jan 31 - Feb 2, 1996. | ||
2217 | ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps | ||
2218 | @end example | ||
2219 | Kunihiko Sadakane's sorting algorithm, mentioned above, | ||
2220 | is available from: | ||
2221 | @example | ||
2222 | http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz | ||
2223 | @end example | ||
2224 | The Manber-Myers suffix array construction | ||
2225 | algorithm is described in a paper | ||
2226 | available from: | ||
2227 | @example | ||
2228 | http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps | ||
2229 | @end example | ||
2230 | Finally, the following paper documents some recent investigations | ||
2231 | I made into the performance of sorting algorithms: | ||
2232 | @example | ||
2233 | Julian Seward: | ||
2234 | On the Performance of BWT Sorting Algorithms | ||
2235 | Proceedings of the IEEE Data Compression Conference 2000 | ||
2236 | Snowbird, Utah. 28-30 March 2000. | ||
2237 | @end example | ||
2238 | |||
2239 | |||
2240 | @contents | ||
2241 | |||
2242 | @bye | ||
2243 | |||