diff options
Diffstat (limited to 'manual.texi')
-rw-r--r-- | manual.texi | 2100 |
1 files changed, 2100 insertions, 0 deletions
diff --git a/manual.texi b/manual.texi new file mode 100644 index 0000000..99ce661 --- /dev/null +++ b/manual.texi | |||
@@ -0,0 +1,2100 @@ | |||
1 | \input texinfo @c -*- Texinfo -*- | ||
2 | @setfilename bzip2.info | ||
3 | |||
4 | @ignore | ||
5 | This file documents bzip2 version 0.9.0c, and associated library | ||
6 | libbzip2, written by Julian Seward (jseward@acm.org). | ||
7 | |||
8 | Copyright (C) 1996-1998 Julian R Seward | ||
9 | |||
10 | Permission is granted to make and distribute verbatim copies of | ||
11 | this manual provided the copyright notice and this permission notice | ||
12 | are preserved on all copies. | ||
13 | |||
14 | Permission is granted to copy and distribute translations of this manual | ||
15 | into another language, under the above conditions for verbatim copies. | ||
16 | @end ignore | ||
17 | |||
18 | @ifinfo | ||
19 | @format | ||
20 | START-INFO-DIR-ENTRY | ||
21 | * Bzip2: (bzip2). A program and library for data compression. | ||
22 | END-INFO-DIR-ENTRY | ||
23 | @end format | ||
24 | |||
25 | @end ifinfo | ||
26 | |||
27 | @iftex | ||
28 | @c @finalout | ||
29 | @settitle bzip2 and libbzip2 | ||
30 | @titlepage | ||
31 | @title bzip2 and libbzip2 | ||
32 | @subtitle a program and library for data compression | ||
33 | @subtitle copyright (C) 1996-1998 Julian Seward | ||
34 | @subtitle version 0.9.0c of 18 October 1998 | ||
35 | @author Julian Seward | ||
36 | |||
37 | @end titlepage | ||
38 | @end iftex | ||
39 | |||
40 | |||
41 | @parindent 0mm | ||
42 | @parskip 2mm | ||
43 | |||
44 | |||
45 | This program, @code{bzip2}, | ||
46 | and associated library @code{libbzip2}, are | ||
47 | Copyright (C) 1996-1998 Julian R Seward. All rights reserved. | ||
48 | |||
49 | Redistribution and use in source and binary forms, with or without | ||
50 | modification, are permitted provided that the following conditions | ||
51 | are met: | ||
52 | @itemize @bullet | ||
53 | @item | ||
54 | Redistributions of source code must retain the above copyright | ||
55 | notice, this list of conditions and the following disclaimer. | ||
56 | @item | ||
57 | The origin of this software must not be misrepresented; you must | ||
58 | not claim that you wrote the original software. If you use this | ||
59 | software in a product, an acknowledgment in the product | ||
60 | documentation would be appreciated but is not required. | ||
61 | @item | ||
62 | Altered source versions must be plainly marked as such, and must | ||
63 | not be misrepresented as being the original software. | ||
64 | @item | ||
65 | The name of the author may not be used to endorse or promote | ||
66 | products derived from this software without specific prior written | ||
67 | permission. | ||
68 | @end itemize | ||
69 | THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS | ||
70 | OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED | ||
71 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE | ||
72 | ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY | ||
73 | DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL | ||
74 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE | ||
75 | GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS | ||
76 | INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, | ||
77 | WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING | ||
78 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS | ||
79 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | ||
80 | |||
81 | Julian Seward, Guildford, Surrey, UK. | ||
82 | |||
83 | @code{jseward@@acm.org} | ||
84 | |||
85 | @code{http://www.muraroa.demon.co.uk} | ||
86 | |||
87 | @code{bzip2}/@code{libbzip2} version 0.9.0c of 18 October 1998. | ||
88 | |||
89 | PATENTS: To the best of my knowledge, @code{bzip2} does not use any patented | ||
90 | algorithms. However, I do not have the resources available to carry out | ||
91 | a full patent search. Therefore I cannot give any guarantee of the | ||
92 | above statement. | ||
93 | |||
94 | |||
95 | |||
96 | |||
97 | |||
98 | |||
99 | |||
100 | @node Overview, Implementation, Top, Top | ||
101 | @chapter Introduction | ||
102 | |||
103 | @code{bzip2} compresses files using the Burrows-Wheeler | ||
104 | block-sorting text compression algorithm, and Huffman coding. | ||
105 | Compression is generally considerably better than that | ||
106 | achieved by more conventional LZ77/LZ78-based compressors, | ||
107 | and approaches the performance of the PPM family of statistical compressors. | ||
108 | |||
109 | @code{bzip2} is built on top of @code{libbzip2}, a flexible library | ||
110 | for handling compressed data in the @code{bzip2} format. This manual | ||
111 | describes both how to use the program and | ||
112 | how to work with the library interface. Most of the | ||
113 | manual is devoted to this library, not the program, | ||
114 | which is good news if your interest is only in the program. | ||
115 | |||
116 | Chapter 2 describes how to use @code{bzip2}; this is the only part | ||
117 | you need to read if you just want to know how to operate the program. | ||
118 | Chapter 3 describes the programming interfaces in detail, and | ||
119 | Chapter 4 records some miscellaneous notes which I thought | ||
120 | ought to be recorded somewhere. | ||
121 | |||
122 | |||
123 | @chapter How to use @code{bzip2} | ||
124 | |||
125 | This chapter contains a copy of the @code{bzip2} man page, | ||
126 | and nothing else. | ||
127 | @example | ||
128 | NAME | ||
129 | bzip2, bunzip2 - a block-sorting file compressor, v0.9.0 | ||
130 | bzcat - decompresses files to stdout | ||
131 | bzip2recover - recovers data from damaged bzip2 files | ||
132 | |||
133 | |||
134 | SYNOPSIS | ||
135 | bzip2 [ -cdfkstvzVL123456789 ] [ filenames ... ] | ||
136 | bunzip2 [ -fkvsVL ] [ filenames ... ] | ||
137 | bzcat [ -s ] [ filenames ... ] | ||
138 | bzip2recover filename | ||
139 | |||
140 | |||
141 | DESCRIPTION | ||
142 | bzip2 compresses files using the Burrows-Wheeler block- | ||
143 | sorting text compression algorithm, and Huffman coding. | ||
144 | Compression is generally considerably better than that | ||
145 | achieved by more conventional LZ77/LZ78-based compressors, | ||
146 | and approaches the performance of the PPM family of sta- | ||
147 | tistical compressors. | ||
148 | |||
149 | The command-line options are deliberately very similar to | ||
150 | those of GNU Gzip, but they are not identical. | ||
151 | |||
152 | bzip2 expects a list of file names to accompany the com- | ||
153 | mand-line flags. Each file is replaced by a compressed | ||
154 | version of itself, with the name "original_name.bz2". | ||
155 | Each compressed file has the same modification date and | ||
156 | permissions as the corresponding original, so that these | ||
157 | properties can be correctly restored at decompression | ||
158 | time. File name handling is naive in the sense that there | ||
159 | is no mechanism for preserving original file names, per- | ||
160 | missions and dates in filesystems which lack these con- | ||
161 | cepts, or have serious file name length restrictions, such | ||
162 | as MS-DOS. | ||
163 | |||
164 | bzip2 and bunzip2 will by default not overwrite existing | ||
165 | files; if you want this to happen, specify the -f flag. | ||
166 | |||
167 | If no file names are specified, bzip2 compresses from | ||
168 | standard input to standard output. In this case, bzip2 | ||
169 | will decline to write compressed output to a terminal, as | ||
170 | this would be entirely incomprehensible and therefore | ||
171 | pointless. | ||
172 | |||
173 | bunzip2 (or bzip2 -d ) decompresses and restores all spec- | ||
174 | ified files whose names end in ".bz2". Files without this | ||
175 | suffix are ignored. Again, supplying no filenames causes | ||
176 | decompression from standard input to standard output. | ||
177 | |||
178 | bunzip2 will correctly decompress a file which is the con- | ||
179 | catenation of two or more compressed files. The result is | ||
180 | the concatenation of the corresponding uncompressed files. | ||
181 | Integrity testing (-t) of concatenated compressed files is | ||
182 | also supported. | ||
183 | |||
184 | You can also compress or decompress files to the standard | ||
185 | output by giving the -c flag. Multiple files may be com- | ||
186 | pressed and decompressed like this. The resulting outputs | ||
187 | are fed sequentially to stdout. Compression of multiple | ||
188 | files in this manner generates a stream containing multi- | ||
189 | ple compressed file representations. Such a stream can be | ||
190 | decompressed correctly only by bzip2 version 0.9.0 or | ||
191 | later. Earlier versions of bzip2 will stop after decom- | ||
192 | pressing the first file in the stream. | ||
193 | |||
194 | bzcat (or bzip2 -dc ) decompresses all specified files to | ||
195 | the standard output. | ||
196 | |||
197 | Compression is always performed, even if the compressed | ||
198 | file is slightly larger than the original. Files of less | ||
199 | than about one hundred bytes tend to get larger, since the | ||
200 | compression mechanism has a constant overhead in the | ||
201 | region of 50 bytes. Random data (including the output of | ||
202 | most file compressors) is coded at about 8.05 bits per | ||
203 | byte, giving an expansion of around 0.5%. | ||
204 | |||
205 | As a self-check for your protection, bzip2 uses 32-bit | ||
206 | CRCs to make sure that the decompressed version of a file | ||
207 | is identical to the original. This guards against corrup- | ||
208 | tion of the compressed data, and against undetected bugs | ||
209 | in bzip2 (hopefully very unlikely). The chances of data | ||
210 | corruption going undetected is microscopic, about one | ||
211 | chance in four billion for each file processed. Be aware, | ||
212 | though, that the check occurs upon decompression, so it | ||
213 | can only tell you that that something is wrong. It can't | ||
214 | help you recover the original uncompressed data. You can | ||
215 | use bzip2recover to try to recover data from damaged | ||
216 | files. | ||
217 | |||
218 | Return values: 0 for a normal exit, 1 for environmental | ||
219 | problems (file not found, invalid flags, I/O errors, &c), | ||
220 | 2 to indicate a corrupt compressed file, 3 for an internal | ||
221 | consistency error (eg, bug) which caused bzip2 to panic. | ||
222 | |||
223 | |||
224 | MEMORY MANAGEMENT | ||
225 | Bzip2 compresses large files in blocks. The block size | ||
226 | affects both the compression ratio achieved, and the | ||
227 | amount of memory needed both for compression and decom- | ||
228 | pression. The flags -1 through -9 specify the block size | ||
229 | to be 100,000 bytes through 900,000 bytes (the default) | ||
230 | respectively. At decompression-time, the block size used | ||
231 | for compression is read from the header of the compressed | ||
232 | file, and bunzip2 then allocates itself just enough memory | ||
233 | to decompress the file. Since block sizes are stored in | ||
234 | compressed files, it follows that the flags -1 to -9 are | ||
235 | irrelevant to and so ignored during decompression. | ||
236 | |||
237 | Compression and decompression requirements, in bytes, can | ||
238 | be estimated as: | ||
239 | |||
240 | Compression: 400k + ( 7 x block size ) | ||
241 | |||
242 | Decompression: 100k + ( 4 x block size ), or | ||
243 | 100k + ( 2.5 x block size ) | ||
244 | |||
245 | Larger block sizes give rapidly diminishing marginal | ||
246 | returns; most of the compression comes from the first two | ||
247 | or three hundred k of block size, a fact worth bearing in | ||
248 | mind when using bzip2 on small machines. It is also | ||
249 | important to appreciate that the decompression memory | ||
250 | requirement is set at compression-time by the choice of | ||
251 | block size. | ||
252 | |||
253 | For files compressed with the default 900k block size, | ||
254 | bunzip2 will require about 3700 kbytes to decompress. To | ||
255 | support decompression of any file on a 4 megabyte machine, | ||
256 | bunzip2 has an option to decompress using approximately | ||
257 | half this amount of memory, about 2300 kbytes. Decompres- | ||
258 | sion speed is also halved, so you should use this option | ||
259 | only where necessary. The relevant flag is -s. | ||
260 | |||
261 | In general, try and use the largest block size memory con- | ||
262 | straints allow, since that maximises the compression | ||
263 | achieved. Compression and decompression speed are virtu- | ||
264 | ally unaffected by block size. | ||
265 | |||
266 | Another significant point applies to files which fit in a | ||
267 | single block -- that means most files you'd encounter | ||
268 | using a large block size. The amount of real memory | ||
269 | touched is proportional to the size of the file, since the | ||
270 | file is smaller than a block. For example, compressing a | ||
271 | file 20,000 bytes long with the flag -9 will cause the | ||
272 | compressor to allocate around 6700k of memory, but only | ||
273 | touch 400k + 20000 * 7 = 540 kbytes of it. Similarly, the | ||
274 | decompressor will allocate 3700k but only touch 100k + | ||
275 | 20000 * 4 = 180 kbytes. | ||
276 | |||
277 | Here is a table which summarises the maximum memory usage | ||
278 | for different block sizes. Also recorded is the total | ||
279 | compressed size for 14 files of the Calgary Text Compres- | ||
280 | sion Corpus totalling 3,141,622 bytes. This column gives | ||
281 | some feel for how compression varies with block size. | ||
282 | These figures tend to understate the advantage of larger | ||
283 | block sizes for larger files, since the Corpus is domi- | ||
284 | nated by smaller files. | ||
285 | |||
286 | Compress Decompress Decompress Corpus | ||
287 | Flag usage usage -s usage Size | ||
288 | |||
289 | -1 1100k 500k 350k 914704 | ||
290 | -2 1800k 900k 600k 877703 | ||
291 | -3 2500k 1300k 850k 860338 | ||
292 | -4 3200k 1700k 1100k 846899 | ||
293 | -5 3900k 2100k 1350k 845160 | ||
294 | -6 4600k 2500k 1600k 838626 | ||
295 | -7 5400k 2900k 1850k 834096 | ||
296 | -8 6000k 3300k 2100k 828642 | ||
297 | -9 6700k 3700k 2350k 828642 | ||
298 | |||
299 | |||
300 | OPTIONS | ||
301 | -c --stdout | ||
302 | Compress or decompress to standard output. -c will | ||
303 | decompress multiple files to stdout, but will only | ||
304 | compress a single file to stdout. | ||
305 | |||
306 | -d --decompress | ||
307 | Force decompression. bzip2, bunzip2 and bzcat are | ||
308 | really the same program, and the decision about | ||
309 | what actions to take is done on the basis of which | ||
310 | name is used. This flag overrides that mechanism, | ||
311 | and forces bzip2 to decompress. | ||
312 | |||
313 | -z --compress | ||
314 | The complement to -d: forces compression, regard- | ||
315 | less of the invokation name. | ||
316 | |||
317 | -t --test | ||
318 | Check integrity of the specified file(s), but don't | ||
319 | decompress them. This really performs a trial | ||
320 | decompression and throws away the result. | ||
321 | |||
322 | -f --force | ||
323 | Force overwrite of output files. Normally, bzip2 | ||
324 | will not overwrite existing output files. | ||
325 | |||
326 | -k --keep | ||
327 | Keep (don't delete) input files during compression | ||
328 | or decompression. | ||
329 | |||
330 | -s --small | ||
331 | Reduce memory usage, for compression, decompression | ||
332 | and testing. Files are decompressed and tested | ||
333 | using a modified algorithm which only requires 2.5 | ||
334 | bytes per block byte. This means any file can be | ||
335 | decompressed in 2300k of memory, albeit at about | ||
336 | half the normal speed. | ||
337 | |||
338 | During compression, -s selects a block size of | ||
339 | 200k, which limits memory use to around the same | ||
340 | figure, at the expense of your compression ratio. | ||
341 | In short, if your machine is low on memory (8 | ||
342 | megabytes or less), use -s for everything. See | ||
343 | MEMORY MANAGEMENT above. | ||
344 | |||
345 | -v --verbose | ||
346 | Verbose mode -- show the compression ratio for each | ||
347 | file processed. Further -v's increase the ver- | ||
348 | bosity level, spewing out lots of information which | ||
349 | is primarily of interest for diagnostic purposes. | ||
350 | |||
351 | -L --license -V --version | ||
352 | Display the software version, license terms and | ||
353 | conditions. | ||
354 | |||
355 | -1 to -9 | ||
356 | Set the block size to 100 k, 200 k .. 900 k when | ||
357 | compressing. Has no effect when decompressing. | ||
358 | See MEMORY MANAGEMENT above. | ||
359 | |||
360 | --repetitive-fast | ||
361 | bzip2 injects some small pseudo-random variations | ||
362 | into very repetitive blocks to limit worst-case | ||
363 | performance during compression. If sorting runs | ||
364 | into difficulties, the block is randomised, and | ||
365 | sorting is restarted. Very roughly, bzip2 persists | ||
366 | for three times as long as a well-behaved input | ||
367 | would take before resorting to randomisation. This | ||
368 | flag makes it give up much sooner. | ||
369 | |||
370 | --repetitive-best | ||
371 | Opposite of --repetitive-fast; try a lot harder | ||
372 | before resorting to randomisation. | ||
373 | |||
374 | |||
375 | RECOVERING DATA FROM DAMAGED FILES | ||
376 | bzip2 compresses files in blocks, usually 900kbytes long. | ||
377 | Each block is handled independently. If a media or trans- | ||
378 | mission error causes a multi-block .bz2 file to become | ||
379 | damaged, it may be possible to recover data from the | ||
380 | undamaged blocks in the file. | ||
381 | |||
382 | The compressed representation of each block is delimited | ||
383 | by a 48-bit pattern, which makes it possible to find the | ||
384 | block boundaries with reasonable certainty. Each block | ||
385 | also carries its own 32-bit CRC, so damaged blocks can be | ||
386 | distinguished from undamaged ones. | ||
387 | |||
388 | bzip2recover is a simple program whose purpose is to | ||
389 | search for blocks in .bz2 files, and write each block out | ||
390 | into its own .bz2 file. You can then use bzip2 -t to test | ||
391 | the integrity of the resulting files, and decompress those | ||
392 | which are undamaged. | ||
393 | |||
394 | bzip2recover takes a single argument, the name of the dam- | ||
395 | aged file, and writes a number of files "rec0001file.bz2", | ||
396 | "rec0002file.bz2", etc, containing the extracted blocks. | ||
397 | The output filenames are designed so that the use of | ||
398 | wildcards in subsequent processing -- for example, "bzip2 | ||
399 | -dc rec*file.bz2 > recovered_data" -- lists the files in | ||
400 | the "right" order. | ||
401 | |||
402 | bzip2recover should be of most use dealing with large .bz2 | ||
403 | files, as these will contain many blocks. It is clearly | ||
404 | futile to use it on damaged single-block files, since a | ||
405 | damaged block cannot be recovered. If you wish to min- | ||
406 | imise any potential data loss through media or transmis- | ||
407 | sion errors, you might consider compressing with a smaller | ||
408 | block size. | ||
409 | |||
410 | |||
411 | PERFORMANCE NOTES | ||
412 | The sorting phase of compression gathers together similar | ||
413 | strings in the file. Because of this, files containing | ||
414 | very long runs of repeated symbols, like "aabaabaabaab | ||
415 | ..." (repeated several hundred times) may compress | ||
416 | extraordinarily slowly. You can use the -vvvvv option to | ||
417 | monitor progress in great detail, if you want. Decompres- | ||
418 | sion speed is unaffected. | ||
419 | |||
420 | Such pathological cases seem rare in practice, appearing | ||
421 | mostly in artificially-constructed test files, and in low- | ||
422 | level disk images. It may be inadvisable to use bzip2 to | ||
423 | compress the latter. If you do get a file which causes | ||
424 | severe slowness in compression, try making the block size | ||
425 | as small as possible, with flag -1. | ||
426 | |||
427 | bzip2 usually allocates several megabytes of memory to | ||
428 | operate in, and then charges all over it in a fairly ran- | ||
429 | dom fashion. This means that performance, both for com- | ||
430 | pressing and decompressing, is largely determined by the | ||
431 | speed at which your machine can service cache misses. | ||
432 | Because of this, small changes to the code to reduce the | ||
433 | miss rate have been observed to give disproportionately | ||
434 | large performance improvements. I imagine bzip2 will per- | ||
435 | form best on machines with very large caches. | ||
436 | |||
437 | |||
438 | CAVEATS | ||
439 | I/O error messages are not as helpful as they could be. | ||
440 | Bzip2 tries hard to detect I/O errors and exit cleanly, | ||
441 | but the details of what the problem is sometimes seem | ||
442 | rather misleading. | ||
443 | |||
444 | This manual page pertains to version 0.9.0 of bzip2. Com- | ||
445 | pressed data created by this version is entirely forwards | ||
446 | and backwards compatible with the previous public release, | ||
447 | version 0.1pl2, but with the following exception: 0.9.0 | ||
448 | can correctly decompress multiple concatenated compressed | ||
449 | files. 0.1pl2 cannot do this; it will stop after decom- | ||
450 | pressing just the first file in the stream. | ||
451 | |||
452 | Wildcard expansion for Windows 95 and NT is flaky. | ||
453 | |||
454 | bzip2recover uses 32-bit integers to represent bit posi- | ||
455 | tions in compressed files, so it cannot handle compressed | ||
456 | files more than 512 megabytes long. This could easily be | ||
457 | fixed. | ||
458 | |||
459 | |||
460 | AUTHOR | ||
461 | Julian Seward, jseward@@acm.org. | ||
462 | |||
463 | The ideas embodied in bzip2 are due to (at least) the fol- | ||
464 | lowing people: Michael Burrows and David Wheeler (for the | ||
465 | block sorting transformation), David Wheeler (again, for | ||
466 | the Huffman coder), Peter Fenwick (for the structured cod- | ||
467 | ing model in the original bzip, and many refinements), and | ||
468 | Alistair Moffat, Radford Neal and Ian Witten (for the | ||
469 | arithmetic coder in the original bzip). I am much | ||
470 | indebted for their help, support and advice. See the man- | ||
471 | ual in the source distribution for pointers to sources of | ||
472 | documentation. Christian von Roques encouraged me to look | ||
473 | for faster sorting algorithms, so as to speed up compres- | ||
474 | sion. Bela Lubkin encouraged me to improve the worst-case | ||
475 | compression performance. Many people sent patches, helped | ||
476 | with portability problems, lent machines, gave advice and | ||
477 | were generally helpful. | ||
478 | @end example | ||
479 | |||
480 | |||
481 | |||
482 | |||
483 | |||
484 | @chapter Programming with @code{libbzip2} | ||
485 | |||
486 | This chapter describes the programming interface to @code{libbzip2}. | ||
487 | |||
488 | For general background information, particularly about memory | ||
489 | use and performance aspects, you'd be well advised to read Chapter 2 | ||
490 | as well. | ||
491 | |||
492 | @section Top-level structure | ||
493 | |||
494 | @code{libbzip2} is a flexible library for compressing and decompressing | ||
495 | data in the @code{bzip2} data format. Although packaged as a single | ||
496 | entity, it helps to regard the library as three separate parts: the low | ||
497 | level interface, and the high level interface, and some utility | ||
498 | functions. | ||
499 | |||
500 | The structure of @code{libbzip2}'s interfaces is similar to | ||
501 | that of Jean-loup Gailly's and Mark Adler's excellent @code{zlib} | ||
502 | library. | ||
503 | |||
504 | @subsection Low-level summary | ||
505 | |||
506 | This interface provides services for compressing and decompressing | ||
507 | data in memory. There's no provision for dealing with files, streams | ||
508 | or any other I/O mechanisms, just straight memory-to-memory work. | ||
509 | In fact, this part of the library can be compiled without inclusion | ||
510 | of @code{stdio.h}, which may be helpful for embedded applications. | ||
511 | |||
512 | The low-level part of the library has no global variables and | ||
513 | is therefore thread-safe. | ||
514 | |||
515 | Six routines make up the low level interface: | ||
516 | @code{bzCompressInit}, @code{bzCompress}, and @* @code{bzCompressEnd} | ||
517 | for compression, | ||
518 | and a corresponding trio @code{bzDecompressInit}, @* @code{bzDecompress} | ||
519 | and @code{bzDecompressEnd} for decompression. | ||
520 | The @code{*Init} functions allocate | ||
521 | memory for compression/decompression and do other | ||
522 | initialisations, whilst the @code{*End} functions close down operations | ||
523 | and release memory. | ||
524 | |||
525 | The real work is done by @code{bzCompress} and @code{bzDecompress}. | ||
526 | These compress/decompress data from a user-supplied input buffer | ||
527 | to a user-supplied output buffer. These buffers can be any size; | ||
528 | arbitrary quantities of data are handled by making repeated calls | ||
529 | to these functions. This is a flexible mechanism allowing a | ||
530 | consumer-pull style of activity, or producer-push, or a mixture of | ||
531 | both. | ||
532 | |||
533 | |||
534 | |||
535 | @subsection High-level summary | ||
536 | |||
537 | This interface provides some handy wrappers around the low-level | ||
538 | interface to facilitate reading and writing @code{bzip2} format | ||
539 | files (@code{.bz2} files). The routines provide hooks to facilitate | ||
540 | reading files in which the @code{bzip2} data stream is embedded | ||
541 | within some larger-scale file structure, or where there are | ||
542 | multiple @code{bzip2} data streams concatenated end-to-end. | ||
543 | |||
544 | For reading files, @code{bzReadOpen}, @code{bzRead}, @code{bzReadClose} | ||
545 | and @code{bzReadGetUnused} are supplied. For writing files, | ||
546 | @code{bzWriteOpen}, @code{bzWrite} and @code{bzWriteFinish} are | ||
547 | available. | ||
548 | |||
549 | As with the low-level library, no global variables are used | ||
550 | so the library is per se thread-safe. However, if I/O errors | ||
551 | occur whilst reading or writing the underlying compressed files, | ||
552 | you may have to consult @code{errno} to determine the cause of | ||
553 | the error. In that case, you'd need a C library which correctly | ||
554 | supports @code{errno} in a multithreaded environment. | ||
555 | |||
556 | To make the library a little simpler and more portable, | ||
557 | @code{bzReadOpen} and @code{bzWriteOpen} require you to pass them file | ||
558 | handles (@code{FILE*}s) which have previously been opened for reading or | ||
559 | writing respectively. That avoids portability problems associated with | ||
560 | file operations and file attributes, whilst not being much of an | ||
561 | imposition on the programmer. | ||
562 | |||
563 | |||
564 | |||
565 | @subsection Utility functions summary | ||
566 | For very simple needs, @code{bzBuffToBuffCompress} and | ||
567 | @code{bzBuffToBuffDecompress} are provided. These compress | ||
568 | data in memory from one buffer to another buffer in a single | ||
569 | function call. You should assess whether these functions | ||
570 | fulfill your memory-to-memory compression/decompression | ||
571 | requirements before investing effort in understanding the more | ||
572 | general but more complex low-level interface. | ||
573 | |||
574 | Yoshioka Tsuneo (@code{QWF00133@@niftyserve.or.jp} / | ||
575 | @code{tsuneo-y@@is.aist-nara.ac.jp}) has contributed some functions to | ||
576 | give better @code{zlib} compatibility. These functions are | ||
577 | @code{bzopen}, @code{bzread}, @code{bzwrite}, @code{bzflush}, | ||
578 | @code{bzclose}, | ||
579 | @code{bzerror} and @code{bzlibVersion}. You may find these functions | ||
580 | more convenient for simple file reading and writing, than those in the | ||
581 | high-level interface. These functions are not (yet) officially part of | ||
582 | the library, and are not further documented here. If they break, you | ||
583 | get to keep all the pieces. I hope to document them properly when time | ||
584 | permits. | ||
585 | |||
586 | Yoshioka also contributed modifications to allow the library to be | ||
587 | built as a Windows DLL. | ||
588 | |||
589 | |||
590 | @section Error handling | ||
591 | |||
592 | The library is designed to recover cleanly in all situations, including | ||
593 | the worst-case situation of decompressing random data. I'm not | ||
594 | 100% sure that it can always do this, so you might want to add | ||
595 | a signal handler to catch segmentation violations during decompression | ||
596 | if you are feeling especially paranoid. I would be interested in | ||
597 | hearing more about the robustness of the library to corrupted | ||
598 | compressed data. | ||
599 | |||
600 | The file @code{bzlib.h} contains all definitions needed to use | ||
601 | the library. In particular, you should definitely not include | ||
602 | @code{bzlib_private.h}. | ||
603 | |||
604 | In @code{bzlib.h}, the various return values are defined. The following | ||
605 | list is not intended as an exhaustive description of the circumstances | ||
606 | in which a given value may be returned -- those descriptions are given | ||
607 | later. Rather, it is intended to convey the rough meaning of each | ||
608 | return value. The first five actions are normal and not intended to | ||
609 | denote an error situation. | ||
610 | @table @code | ||
611 | @item BZ_OK | ||
612 | The requested action was completed successfully. | ||
613 | @item BZ_RUN_OK | ||
614 | @itemx BZ_FLUSH_OK | ||
615 | @itemx BZ_FINISH_OK | ||
616 | In @code{bzCompress}, the requested flush/finish/nothing-special action | ||
617 | was completed successfully. | ||
618 | @item BZ_STREAM_END | ||
619 | Compression of data was completed, or the logical stream end was | ||
620 | detected during decompression. | ||
621 | @end table | ||
622 | |||
623 | The following return values indicate an error of some kind. | ||
624 | @table @code | ||
625 | @item BZ_SEQUENCE_ERROR | ||
626 | When using the library, it is important to call the functions in the | ||
627 | correct sequence and with data structures (buffers etc) in the correct | ||
628 | states. @code{libbzip2} checks as much as it can to ensure this is | ||
629 | happening, and returns @code{BZ_SEQUENCE_ERROR} if not. Code which | ||
630 | complies precisely with the function semantics, as detailed below, | ||
631 | should never receive this value; such an event denotes buggy code | ||
632 | which you should investigate. | ||
633 | @item BZ_PARAM_ERROR | ||
634 | Returned when a parameter to a function call is out of range | ||
635 | or otherwise manifestly incorrect. As with @code{BZ_SEQUENCE_ERROR}, | ||
636 | this denotes a bug in the client code. The distinction between | ||
637 | @code{BZ_PARAM_ERROR} and @code{BZ_SEQUENCE_ERROR} is a bit hazy, but still worth | ||
638 | making. | ||
639 | @item BZ_MEM_ERROR | ||
640 | Returned when a request to allocate memory failed. Note that the | ||
641 | quantity of memory needed to decompress a stream cannot be determined | ||
642 | until the stream's header has been read. So @code{bzDecompress} and | ||
643 | @code{bzRead} may return @code{BZ_MEM_ERROR} even though some of | ||
644 | the compressed data has been read. The same is not true for | ||
645 | compression; once @code{bzCompressInit} or @code{bzWriteOpen} have | ||
646 | successfully completed, @code{BZ_MEM_ERROR} cannot occur. | ||
647 | @item BZ_DATA_ERROR | ||
648 | Returned when a data integrity error is detected during decompression. | ||
649 | Most importantly, this means when stored and computed CRCs for the | ||
650 | data do not match. This value is also returned upon detection of any | ||
651 | other anomaly in the compressed data. | ||
652 | @item BZ_DATA_ERROR_MAGIC | ||
653 | As a special case of @code{BZ_DATA_ERROR}, it is sometimes useful to | ||
654 | know when the compressed stream does not start with the correct | ||
655 | magic bytes (@code{'B' 'Z' 'h'}). | ||
656 | @item BZ_IO_ERROR | ||
657 | Returned by @code{bzRead} and @code{bzRead} when there is an error | ||
658 | reading or writing in the compressed file, and by @code{bzReadOpen} | ||
659 | and @code{bzWriteOpen} for attempts to use a file for which the | ||
660 | error indicator (viz, @code{ferror(f)}) is set. | ||
661 | On receipt of @code{BZ_IO_ERROR}, the caller should consult | ||
662 | @code{errno} and/or @code{perror} to acquire operating-system | ||
663 | specific information about the problem. | ||
664 | @item BZ_UNEXPECTED_EOF | ||
665 | Returned by @code{bzRead} when the compressed file finishes | ||
666 | before the logical end of stream is detected. | ||
667 | @item BZ_OUTBUFF_FULL | ||
668 | Returned by @code{bzBuffToBuffCompress} and | ||
669 | @code{bzBuffToBuffDecompress} to indicate that the output data | ||
670 | will not fit into the output buffer provided. | ||
671 | @end table | ||
672 | |||
673 | |||
674 | |||
675 | @section Low-level interface | ||
676 | |||
677 | @subsection @code{bzCompressInit} | ||
678 | @example | ||
679 | typedef | ||
680 | struct @{ | ||
681 | char *next_in; | ||
682 | unsigned int avail_in; | ||
683 | unsigned int total_in; | ||
684 | |||
685 | char *next_out; | ||
686 | unsigned int avail_out; | ||
687 | unsigned int total_out; | ||
688 | |||
689 | void *state; | ||
690 | |||
691 | void *(*bzalloc)(void *,int,int); | ||
692 | void (*bzfree)(void *,void *); | ||
693 | void *opaque; | ||
694 | @} | ||
695 | bz_stream; | ||
696 | |||
697 | int bzCompressInit ( bz_stream *strm, | ||
698 | int blockSize100k, | ||
699 | int verbosity, | ||
700 | int workFactor ); | ||
701 | |||
702 | @end example | ||
703 | |||
704 | Prepares for compression. The @code{bz_stream} structure | ||
705 | holds all data pertaining to the compression activity. | ||
706 | A @code{bz_stream} structure should be allocated and initialised | ||
707 | prior to the call. | ||
708 | The fields of @code{bz_stream} | ||
709 | comprise the entirety of the user-visible data. @code{state} | ||
710 | is a pointer to the private data structures required for compression. | ||
711 | |||
712 | Custom memory allocators are supported, via fields @code{bzalloc}, | ||
713 | @code{bzfree}, | ||
714 | and @code{opaque}. The value | ||
715 | @code{opaque} is passed to as the first argument to | ||
716 | all calls to @code{bzalloc} and @code{bzfree}, but is | ||
717 | otherwise ignored by the library. | ||
718 | The call @code{bzalloc ( opaque, n, m )} is expected to return a | ||
719 | pointer @code{p} to | ||
720 | @code{n * m} bytes of memory, and @code{bzfree ( opaque, p )} | ||
721 | should free | ||
722 | that memory. | ||
723 | |||
724 | If you don't want to use a custom memory allocator, set @code{bzalloc}, | ||
725 | @code{bzfree} and | ||
726 | @code{opaque} to @code{NULL}, | ||
727 | and the library will then use the standard @code{malloc}/@code{free} | ||
728 | routines. | ||
729 | |||
730 | Before calling @code{bzCompressInit}, fields @code{bzalloc}, | ||
731 | @code{bzfree} and @code{opaque} should | ||
732 | be filled appropriately, as just described. Upon return, the internal | ||
733 | state will have been allocated and initialised, and @code{total_in} and | ||
734 | @code{total_out} will have been set to zero. | ||
735 | These last two fields are used by the library | ||
736 | to inform the caller of the total amount of data passed into and out of | ||
737 | the library, respectively. You should not try to change them. | ||
738 | |||
739 | Parameter @code{blockSize100k} specifies the block size to be used for | ||
740 | compression. It should be a value between 1 and 9 inclusive, and the | ||
741 | actual block size used is 100000 x this figure. 9 gives the best | ||
742 | compression but takes most memory. | ||
743 | |||
744 | Parameter @code{verbosity} should be set to a number between 0 and 4 | ||
745 | inclusive. 0 is silent, and greater numbers give increasingly verbose | ||
746 | monitoring/debugging output. If the library has been compiled with | ||
747 | @code{-DBZ_NO_STDIO}, no such output will appear for any verbosity | ||
748 | setting. | ||
749 | |||
750 | Parameter @code{workFactor} controls how the compression phase behaves | ||
751 | when presented with worst case, highly repetitive, input data. | ||
752 | If compression runs into difficulties caused by repetitive data, | ||
753 | some pseudo-random variations are inserted into the block, and | ||
754 | compression is restarted. Lower values of @code{workFactor} | ||
755 | reduce the tolerance of compression to repetitive data. | ||
756 | You should set this parameter carefully; too low, and | ||
757 | compression ratio suffers, too high, and your average-to-worst | ||
758 | case compression times can become very large. | ||
759 | The default value of 30 | ||
760 | gives reasonable behaviour over a wide range of circumstances. | ||
761 | |||
762 | Allowable values range from 0 to 250 inclusive. 0 is a special | ||
763 | case, equivalent to using the default value of 30. | ||
764 | |||
765 | Note that the randomisation process is entirely transparent. | ||
766 | If the library decides to randomise and restart compression on a | ||
767 | block, it does so without comment. Randomised blocks are | ||
768 | automatically de-randomised during decompression, so data | ||
769 | integrity is never compromised. | ||
770 | |||
771 | Possible return values: | ||
772 | @display | ||
773 | @code{BZ_PARAM_ERROR} | ||
774 | if @code{strm} is @code{NULL} | ||
775 | or @code{blockSize} < 1 or @code{blockSize} > 9 | ||
776 | or @code{verbosity} < 0 or @code{verbosity} > 4 | ||
777 | or @code{workFactor} < 0 or @code{workFactor} > 250 | ||
778 | @code{BZ_MEM_ERROR} | ||
779 | if not enough memory is available | ||
780 | @code{BZ_OK} | ||
781 | otherwise | ||
782 | @end display | ||
783 | Allowable next actions: | ||
784 | @display | ||
785 | @code{bzCompress} | ||
786 | if @code{BZ_OK} is returned | ||
787 | no specific action needed in case of error | ||
788 | @end display | ||
789 | |||
790 | @subsection @code{bzCompress} | ||
791 | @example | ||
792 | int bzCompress ( bz_stream *strm, int action ); | ||
793 | @end example | ||
794 | Provides more input and/or output buffer space for the library. The | ||
795 | caller maintains input and output buffers, and calls @code{bzCompress} to | ||
796 | transfer data between them. | ||
797 | |||
798 | Before each call to @code{bzCompress}, @code{next_in} should point at | ||
799 | the data to be compressed, and @code{avail_in} should indicate how many | ||
800 | bytes the library may read. @code{bzCompress} updates @code{next_in}, | ||
801 | @code{avail_in} and @code{total_in} to reflect the number of bytes it | ||
802 | has read. | ||
803 | |||
804 | Similarly, @code{next_out} should point to a buffer in which the | ||
805 | compressed data is to be placed, with @code{avail_out} indicating how | ||
806 | much output space is available. @code{bzCompress} updates | ||
807 | @code{next_out}, @code{avail_out} and @code{total_out} to reflect the | ||
808 | number of bytes output. | ||
809 | |||
810 | You may provide and remove as little or as much data as you like on each | ||
811 | call of @code{bzCompress}. In the limit, it is acceptable to supply and | ||
812 | remove data one byte at a time, although this would be terribly | ||
813 | inefficient. You should always ensure that at least one byte of output | ||
814 | space is available at each call. | ||
815 | |||
816 | A second purpose of @code{bzCompress} is to request a change of mode of the | ||
817 | compressed stream. | ||
818 | |||
819 | Conceptually, a compressed stream can be in one of four states: IDLE, | ||
820 | RUNNING, FLUSHING and FINISHING. Before initialisation | ||
821 | (@code{bzCompressInit}) and after termination (@code{bzCompressEnd}), a | ||
822 | stream is regarded as IDLE. | ||
823 | |||
824 | Upon initialisation (@code{bzCompressInit}), the stream is placed in the | ||
825 | RUNNING state. Subsequent calls to @code{bzCompress} should pass | ||
826 | @code{BZ_RUN} as the requested action; other actions are illegal and | ||
827 | will result in @code{BZ_SEQUENCE_ERROR}. | ||
828 | |||
829 | At some point, the calling program will have provided all the input data | ||
830 | it wants to. It will then want to finish up -- in effect, asking the | ||
831 | library to process any data it might have buffered internally. In this | ||
832 | state, @code{bzCompress} will no longer attempt to read data from | ||
833 | @code{next_in}, but it will want to write data to @code{next_out}. | ||
834 | Because the output buffer supplied by the user can be arbitrarily small, | ||
835 | the finishing-up operation cannot necessarily be done with a single call | ||
836 | of @code{bzCompress}. | ||
837 | |||
838 | Instead, the calling program passes @code{BZ_FINISH} as an action to | ||
839 | @code{bzCompress}. This changes the stream's state to FINISHING. Any | ||
840 | remaining input (ie, @code{next_in[0 .. avail_in-1]}) is compressed and | ||
841 | transferred to the output buffer. To do this, @code{bzCompress} must be | ||
842 | called repeatedly until all the output has been consumed. At that | ||
843 | point, @code{bzCompress} returns @code{BZ_STREAM_END}, and the stream's | ||
844 | state is set back to IDLE. @code{bzCompressEnd} should then be | ||
845 | called. | ||
846 | |||
847 | Just to make sure the calling program does not cheat, the library makes | ||
848 | a note of @code{avail_in} at the time of the first call to | ||
849 | @code{bzCompress} which has @code{BZ_FINISH} as an action (ie, at the | ||
850 | time the program has announced its intention to not supply any more | ||
851 | input). By comparing this value with that of @code{avail_in} over | ||
852 | subsequent calls to @code{bzCompress}, the library can detect any | ||
853 | attempts to slip in more data to compress. Any calls for which this is | ||
854 | detected will return @code{BZ_SEQUENCE_ERROR}. This indicates a | ||
855 | programming mistake which should be corrected. | ||
856 | |||
857 | Instead of asking to finish, the calling program may ask | ||
858 | @code{bzCompress} to take all the remaining input, compress it and | ||
859 | terminate the current (Burrows-Wheeler) compression block. This could | ||
860 | be useful for error control purposes. The mechanism is analogous to | ||
861 | that for finishing: call @code{bzCompress} with an action of | ||
862 | @code{BZ_FLUSH}, remove output data, and persist with the | ||
863 | @code{BZ_FLUSH} action until the value @code{BZ_RUN} is returned. As | ||
864 | with finishing, @code{bzCompress} detects any attempt to provide more | ||
865 | input data once the flush has begun. | ||
866 | |||
867 | Once the flush is complete, the stream returns to the normal RUNNING | ||
868 | state. | ||
869 | |||
870 | This all sounds pretty complex, but isn't really. Here's a table | ||
871 | which shows which actions are allowable in each state, what action | ||
872 | will be taken, what the next state is, and what the non-error return | ||
873 | values are. Note that you can't explicitly ask what state the | ||
874 | stream is in, but nor do you need to -- it can be inferred from the | ||
875 | values returned by @code{bzCompress}. | ||
876 | @display | ||
877 | IDLE/@code{any} | ||
878 | Illegal. IDLE state only exists after @code{bzCompressEnd} or | ||
879 | before @code{bzCompressInit}. | ||
880 | Return value = @code{BZ_SEQUENCE_ERROR} | ||
881 | |||
882 | RUNNING/@code{BZ_RUN} | ||
883 | Compress from @code{next_in} to @code{next_out} as much as possible. | ||
884 | Next state = RUNNING | ||
885 | Return value = @code{BZ_RUN_OK} | ||
886 | |||
887 | RUNNING/@code{BZ_FLUSH} | ||
888 | Remember current value of @code{next_in}. Compress from @code{next_in} | ||
889 | to @code{next_out} as much as possible, but do not accept any more input. | ||
890 | Next state = FLUSHING | ||
891 | Return value = @code{BZ_FLUSH_OK} | ||
892 | |||
893 | RUNNING/@code{BZ_FINISH} | ||
894 | Remember current value of @code{next_in}. Compress from @code{next_in} | ||
895 | to @code{next_out} as much as possible, but do not accept any more input. | ||
896 | Next state = FINISHING | ||
897 | Return value = @code{BZ_FINISH_OK} | ||
898 | |||
899 | FLUSHING/@code{BZ_FLUSH} | ||
900 | Compress from @code{next_in} to @code{next_out} as much as possible, | ||
901 | but do not accept any more input. | ||
902 | If all the existing input has been used up and all compressed | ||
903 | output has been removed | ||
904 | Next state = RUNNING; Return value = @code{BZ_RUN_OK} | ||
905 | else | ||
906 | Next state = FLUSHING; Return value = @code{BZ_FLUSH_OK} | ||
907 | |||
908 | FLUSHING/other | ||
909 | Illegal. | ||
910 | Return value = @code{BZ_SEQUENCE_ERROR} | ||
911 | |||
912 | FINISHING/@code{BZ_FINISH} | ||
913 | Compress from @code{next_in} to @code{next_out} as much as possible, | ||
914 | but to not accept any more input. | ||
915 | If all the existing input has been used up and all compressed | ||
916 | output has been removed | ||
917 | Next state = IDLE; Return value = @code{BZ_STREAM_END} | ||
918 | else | ||
919 | Next state = FINISHING; Return value = @code{BZ_FINISHING} | ||
920 | |||
921 | FINISHING/other | ||
922 | Illegal. | ||
923 | Return value = @code{BZ_SEQUENCE_ERROR} | ||
924 | @end display | ||
925 | |||
926 | That still looks complicated? Well, fair enough. The usual sequence | ||
927 | of calls for compressing a load of data is: | ||
928 | @itemize @bullet | ||
929 | @item Get started with @code{bzCompressInit}. | ||
930 | @item Shovel data in and shlurp out its compressed form using zero or more | ||
931 | calls of @code{bzCompress} with action = @code{BZ_RUN}. | ||
932 | @item Finish up. | ||
933 | Repeatedly call @code{bzCompress} with action = @code{BZ_FINISH}, | ||
934 | copying out the compressed output, until @code{BZ_STREAM_END} is returned. | ||
935 | @item Close up and go home. Call @code{bzCompressEnd}. | ||
936 | @end itemize | ||
937 | If the data you want to compress fits into your input buffer all | ||
938 | at once, you can skip the calls of @code{bzCompress ( ..., BZ_RUN )} and | ||
939 | just do the @code{bzCompress ( ..., BZ_FINISH )} calls. | ||
940 | |||
941 | All required memory is allocated by @code{bzCompressInit}. The | ||
942 | compression library can accept any data at all (obviously). So you | ||
943 | shouldn't get any error return values from the @code{bzCompress} calls. | ||
944 | If you do, they will be @code{BZ_SEQUENCE_ERROR}, and indicate a bug in | ||
945 | your programming. | ||
946 | |||
947 | Trivial other possible return values: | ||
948 | @display | ||
949 | @code{BZ_PARAM_ERROR} | ||
950 | if @code{strm} is @code{NULL}, or @code{strm->s} is @code{NULL} | ||
951 | @end display | ||
952 | |||
953 | @subsection @code{bzCompressEnd} | ||
954 | @example | ||
955 | int bzCompressEnd ( bz_stream *strm ); | ||
956 | @end example | ||
957 | Releases all memory associated with a compression stream. | ||
958 | |||
959 | Possible return values: | ||
960 | @display | ||
961 | @code{BZ_PARAM_ERROR} if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL} | ||
962 | @code{BZ_OK} otherwise | ||
963 | @end display | ||
964 | |||
965 | |||
966 | @subsection @code{bzDecompressInit} | ||
967 | @example | ||
968 | int bzDecompressInit ( bz_stream *strm, int verbosity, int small ); | ||
969 | @end example | ||
970 | Prepares for decompression. As with @code{bzCompressInit}, a | ||
971 | @code{bz_stream} record should be allocated and initialised before the | ||
972 | call. Fields @code{bzalloc}, @code{bzfree} and @code{opaque} should be | ||
973 | set if a custom memory allocator is required, or made @code{NULL} for | ||
974 | the normal @code{malloc}/@code{free} routines. Upon return, the internal | ||
975 | state will have been initialised, and @code{total_in} and | ||
976 | @code{total_out} will be zero. | ||
977 | |||
978 | For the meaning of parameter @code{verbosity}, see @code{bzCompressInit}. | ||
979 | |||
980 | If @code{small} is nonzero, the library will use an alternative | ||
981 | decompression algorithm which uses less memory but at the cost of | ||
982 | decompressing more slowly (roughly speaking, half the speed, but the | ||
983 | maximum memory requirement drops to around 2300k). See Chapter 2 for | ||
984 | more information on memory management. | ||
985 | |||
986 | Note that the amount of memory needed to decompress | ||
987 | a stream cannot be determined until the stream's header has been read, | ||
988 | so even if @code{bzDecompressInit} succeeds, a subsequent | ||
989 | @code{bzDecompress} could fail with @code{BZ_MEM_ERROR}. | ||
990 | |||
991 | Possible return values: | ||
992 | @display | ||
993 | @code{BZ_PARAM_ERROR} | ||
994 | if @code{(small != 0 && small != 1)} | ||
995 | or @code{(verbosity < 0 || verbosity > 4)} | ||
996 | @code{BZ_MEM_ERROR} | ||
997 | if insufficient memory is available | ||
998 | @end display | ||
999 | |||
1000 | Allowable next actions: | ||
1001 | @display | ||
1002 | @code{bzDecompress} | ||
1003 | if @code{BZ_OK} was returned | ||
1004 | no specific action required in case of error | ||
1005 | @end display | ||
1006 | |||
1007 | |||
1008 | |||
1009 | @subsection @code{bzDecompress} | ||
1010 | @example | ||
1011 | int bzDecompress ( bz_stream *strm ); | ||
1012 | @end example | ||
1013 | Provides more input and/out output buffer space for the library. The | ||
1014 | caller maintains input and output buffers, and uses @code{bzDecompress} | ||
1015 | to transfer data between them. | ||
1016 | |||
1017 | Before each call to @code{bzDecompress}, @code{next_in} | ||
1018 | should point at the compressed data, | ||
1019 | and @code{avail_in} should indicate how many bytes the library | ||
1020 | may read. @code{bzDecompress} updates @code{next_in}, @code{avail_in} | ||
1021 | and @code{total_in} | ||
1022 | to reflect the number of bytes it has read. | ||
1023 | |||
1024 | Similarly, @code{next_out} should point to a buffer in which the uncompressed | ||
1025 | output is to be placed, with @code{avail_out} indicating how much output space | ||
1026 | is available. @code{bzCompress} updates @code{next_out}, | ||
1027 | @code{avail_out} and @code{total_out} to reflect | ||
1028 | the number of bytes output. | ||
1029 | |||
1030 | You may provide and remove as little or as much data as you like on | ||
1031 | each call of @code{bzDecompress}. | ||
1032 | In the limit, it is acceptable to | ||
1033 | supply and remove data one byte at a time, although this would be | ||
1034 | terribly inefficient. You should always ensure that at least one | ||
1035 | byte of output space is available at each call. | ||
1036 | |||
1037 | Use of @code{bzDecompress} is simpler than @code{bzCompress}. | ||
1038 | |||
1039 | You should provide input and remove output as described above, and | ||
1040 | repeatedly call @code{bzDecompress} until @code{BZ_STREAM_END} is | ||
1041 | returned. Appearance of @code{BZ_STREAM_END} denotes that | ||
1042 | @code{bzDecompress} has detected the logical end of the compressed | ||
1043 | stream. @code{bzDecompress} will not produce @code{BZ_STREAM_END} until | ||
1044 | all output data has been placed into the output buffer, so once | ||
1045 | @code{BZ_STREAM_END} appears, you are guaranteed to have available all | ||
1046 | the decompressed output, and @code{bzDecompressEnd} can safely be | ||
1047 | called. | ||
1048 | |||
1049 | If case of an error return value, you should call @code{bzDecompressEnd} | ||
1050 | to clean up and release memory. | ||
1051 | |||
1052 | Possible return values: | ||
1053 | @display | ||
1054 | @code{BZ_PARAM_ERROR} | ||
1055 | if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL} | ||
1056 | or @code{strm->avail_out < 1} | ||
1057 | @code{BZ_DATA_ERROR} | ||
1058 | if a data integrity error is detected in the compressed stream | ||
1059 | @code{BZ_DATA_ERROR_MAGIC} | ||
1060 | if the compressed stream doesn't begin with the right magic bytes | ||
1061 | @code{BZ_MEM_ERROR} | ||
1062 | if there wasn't enough memory available | ||
1063 | @code{BZ_STREAM_END} | ||
1064 | if the logical end of the data stream was detected and all | ||
1065 | output in has been consumed, eg @code{s->avail_out > 0} | ||
1066 | @code{BZ_OK} | ||
1067 | otherwise | ||
1068 | @end display | ||
1069 | Allowable next actions: | ||
1070 | @display | ||
1071 | @code{bzDecompress} | ||
1072 | if @code{BZ_OK} was returned | ||
1073 | @code{bzDecompressEnd} | ||
1074 | otherwise | ||
1075 | @end display | ||
1076 | |||
1077 | |||
1078 | @subsection @code{bzDecompressEnd} | ||
1079 | @example | ||
1080 | int bzDecompressEnd ( bz_stream *strm ); | ||
1081 | @end example | ||
1082 | Releases all memory associated with a decompression stream. | ||
1083 | |||
1084 | Possible return values: | ||
1085 | @display | ||
1086 | @code{BZ_PARAM_ERROR} | ||
1087 | if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL} | ||
1088 | @code{BZ_OK} | ||
1089 | otherwise | ||
1090 | @end display | ||
1091 | |||
1092 | Allowable next actions: | ||
1093 | @display | ||
1094 | None. | ||
1095 | @end display | ||
1096 | |||
1097 | |||
1098 | @section High-level interface | ||
1099 | |||
1100 | This interface provides functions for reading and writing | ||
1101 | @code{bzip2} format files. First, some general points. | ||
1102 | |||
1103 | @itemize @bullet | ||
1104 | @item All of the functions take an @code{int*} first argument, | ||
1105 | @code{bzerror}. | ||
1106 | After each call, @code{bzerror} should be consulted first to determine | ||
1107 | the outcome of the call. If @code{bzerror} is @code{BZ_OK}, | ||
1108 | the call completed | ||
1109 | successfully, and only then should the return value of the function | ||
1110 | (if any) be consulted. If @code{bzerror} is @code{BZ_IO_ERROR}, | ||
1111 | there was an error | ||
1112 | reading/writing the underlying compressed file, and you should | ||
1113 | then consult @code{errno}/@code{perror} to determine the | ||
1114 | cause of the difficulty. | ||
1115 | @code{bzerror} may also be set to various other values; precise details are | ||
1116 | given on a per-function basis below. | ||
1117 | @item If @code{bzerror} indicates an error | ||
1118 | (ie, anything except @code{BZ_OK} and @code{BZ_STREAM_END}), | ||
1119 | you should immediately call @code{bzReadClose} (or @code{bzWriteClose}, | ||
1120 | depending on whether you are attempting to read or to write) | ||
1121 | to free up all resources associated | ||
1122 | with the stream. Once an error has been indicated, behaviour of all calls | ||
1123 | except @code{bzReadClose} (@code{bzWriteClose}) is undefined. | ||
1124 | The implication is that (1) @code{bzerror} should | ||
1125 | be checked after each call, and (2) if @code{bzerror} indicates an error, | ||
1126 | @code{bzReadClose} (@code{bzWriteClose}) should then be called to clean up. | ||
1127 | @item The @code{FILE*} arguments passed to | ||
1128 | @code{bzReadOpen}/@code{bzWriteOpen} | ||
1129 | should be set to binary mode. | ||
1130 | Most Unix systems will do this by default, but other platforms, | ||
1131 | including Windows and Mac, will not. If you omit this, you may | ||
1132 | encounter problems when moving code to new platforms. | ||
1133 | @item Memory allocation requests are handled by | ||
1134 | @code{malloc}/@code{free}. | ||
1135 | At present | ||
1136 | there is no facility for user-defined memory allocators in the file I/O | ||
1137 | functions (could easily be added, though). | ||
1138 | @end itemize | ||
1139 | |||
1140 | |||
1141 | |||
1142 | @subsection @code{bzReadOpen} | ||
1143 | @example | ||
1144 | typedef void BZFILE; | ||
1145 | |||
1146 | BZFILE *bzReadOpen ( int *bzerror, FILE *f, | ||
1147 | int small, int verbosity, | ||
1148 | void *unused, int nUnused ); | ||
1149 | @end example | ||
1150 | Prepare to read compressed data from file handle @code{f}. @code{f} | ||
1151 | should refer to a file which has been opened for reading, and for which | ||
1152 | the error indicator (@code{ferror(f)})is not set. If @code{small} is 1, | ||
1153 | the library will try to decompress using less memory, at the expense of | ||
1154 | speed. | ||
1155 | |||
1156 | For reasons explained below, @code{bzRead} will decompress the | ||
1157 | @code{nUnused} bytes starting at @code{unused}, before starting to read | ||
1158 | from the file @code{f}. At most @code{BZ_MAX_UNUSED} bytes may be | ||
1159 | supplied like this. If this facility is not required, you should pass | ||
1160 | @code{NULL} and @code{0} for @code{unused} and n@code{Unused} | ||
1161 | respectively. | ||
1162 | |||
1163 | For the meaning of parameters @code{small} and @code{verbosity}, | ||
1164 | see @code{bzDecompressInit}. | ||
1165 | |||
1166 | The amount of memory needed to decompress a file cannot be determined | ||
1167 | until the file's header has been read. So it is possible that | ||
1168 | @code{bzReadOpen} returns @code{BZ_OK} but a subsequent call of | ||
1169 | @code{bzRead} will return @code{BZ_MEM_ERROR}. | ||
1170 | |||
1171 | Possible assignments to @code{bzerror}: | ||
1172 | @display | ||
1173 | @code{BZ_PARAM_ERROR} | ||
1174 | if @code{f} is @code{NULL} | ||
1175 | or @code{small} is neither @code{0} nor @code{1} | ||
1176 | or @code{(unused == NULL && nUnused != 0)} | ||
1177 | or @code{(unused != NULL && !(0 <= nUnused <= BZ_MAX_UNUSED))} | ||
1178 | @code{BZ_IO_ERROR} | ||
1179 | if @code{ferror(f)} is nonzero | ||
1180 | @code{BZ_MEM_ERROR} | ||
1181 | if insufficient memory is available | ||
1182 | @code{BZ_OK} | ||
1183 | otherwise. | ||
1184 | @end display | ||
1185 | |||
1186 | Possible return values: | ||
1187 | @display | ||
1188 | Pointer to an abstract @code{BZFILE} | ||
1189 | if @code{bzerror} is @code{BZ_OK} | ||
1190 | @code{NULL} | ||
1191 | otherwise | ||
1192 | @end display | ||
1193 | |||
1194 | Allowable next actions: | ||
1195 | @display | ||
1196 | @code{bzRead} | ||
1197 | if @code{bzerror} is @code{BZ_OK} | ||
1198 | @code{bzClose} | ||
1199 | otherwise | ||
1200 | @end display | ||
1201 | |||
1202 | |||
1203 | @subsection @code{bzRead} | ||
1204 | @example | ||
1205 | int bzRead ( int *bzerror, BZFILE *b, void *buf, int len ); | ||
1206 | @end example | ||
1207 | Reads up to @code{len} (uncompressed) bytes from the compressed file | ||
1208 | @code{b} into | ||
1209 | the buffer @code{buf}. If the read was successful, | ||
1210 | @code{bzerror} is set to @code{BZ_OK} | ||
1211 | and the number of bytes read is returned. If the logical end-of-stream | ||
1212 | was detected, @code{bzerror} will be set to @code{BZ_STREAM_END}, | ||
1213 | and the number | ||
1214 | of bytes read is returned. All other @code{bzerror} values denote an error. | ||
1215 | |||
1216 | @code{bzRead} will supply @code{len} bytes, | ||
1217 | unless the logical stream end is detected | ||
1218 | or an error occurs. Because of this, it is possible to detect the | ||
1219 | stream end by observing when the number of bytes returned is | ||
1220 | less than the number | ||
1221 | requested. Nevertheless, this is regarded as inadvisable; you should | ||
1222 | instead check @code{bzerror} after every call and watch out for | ||
1223 | @code{BZ_STREAM_END}. | ||
1224 | |||
1225 | Internally, @code{bzRead} copies data from the compressed file in chunks | ||
1226 | of size @code{BZ_MAX_UNUSED} bytes | ||
1227 | before decompressing it. If the file contains more bytes than strictly | ||
1228 | needed to reach the logical end-of-stream, @code{bzRead} will almost certainly | ||
1229 | read some of the trailing data before signalling @code{BZ_SEQUENCE_END}. | ||
1230 | To collect the read but unused data once @code{BZ_SEQUENCE_END} has | ||
1231 | appeared, call @code{bzReadGetUnused} immediately before @code{bzReadClose}. | ||
1232 | |||
1233 | Possible assignments to @code{bzerror}: | ||
1234 | @display | ||
1235 | @code{BZ_PARAM_ERROR} | ||
1236 | if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0} | ||
1237 | @code{BZ_SEQUENCE_ERROR} | ||
1238 | if @code{b} was opened with @code{bzWriteOpen} | ||
1239 | @code{BZ_IO_ERROR} | ||
1240 | if there is an error reading from the compressed file | ||
1241 | @code{BZ_UNEXPECTED_EOF} | ||
1242 | if the compressed file ended before the logical end-of-stream was detected | ||
1243 | @code{BZ_DATA_ERROR} | ||
1244 | if a data integrity error was detected in the compressed stream | ||
1245 | @code{BZ_DATA_ERROR_MAGIC} | ||
1246 | if the stream does not begin with the requisite header bytes (ie, is not | ||
1247 | a @code{bzip2} data file). This is really a special case of @code{BZ_DATA_ERROR}. | ||
1248 | @code{BZ_MEM_ERROR} | ||
1249 | if insufficient memory was available | ||
1250 | @code{BZ_STREAM_END} | ||
1251 | if the logical end of stream was detected. | ||
1252 | @code{BZ_OK} | ||
1253 | otherwise. | ||
1254 | @end display | ||
1255 | |||
1256 | Possible return values: | ||
1257 | @display | ||
1258 | number of bytes read | ||
1259 | if @code{bzerror} is @code{BZ_OK} or @code{BZ_STREAM_END} | ||
1260 | undefined | ||
1261 | otherwise | ||
1262 | @end display | ||
1263 | |||
1264 | Allowable next actions: | ||
1265 | @display | ||
1266 | collect data from @code{buf}, then @code{bzRead} or @code{bzReadClose} | ||
1267 | if @code{bzerror} is @code{BZ_OK} | ||
1268 | collect data from @code{buf}, then @code{bzReadClose} or @code{bzReadGetUnused} | ||
1269 | if @code{bzerror} is @code{BZ_SEQUENCE_END} | ||
1270 | @code{bzReadClose} | ||
1271 | otherwise | ||
1272 | @end display | ||
1273 | |||
1274 | |||
1275 | |||
1276 | @subsection @code{bzReadGetUnused} | ||
1277 | @example | ||
1278 | void bzReadGetUnused ( int* bzerror, BZFILE *b, | ||
1279 | void** unused, int* nUnused ); | ||
1280 | @end example | ||
1281 | Returns data which was read from the compressed file but was not needed | ||
1282 | to get to the logical end-of-stream. @code{*unused} is set to the address | ||
1283 | of the data, and @code{*nUnused} to the number of bytes. @code{*nUnused} will | ||
1284 | be set to a value between @code{0} and @code{BZ_MAX_UNUSED} inclusive. | ||
1285 | |||
1286 | This function may only be called once @code{bzRead} has signalled | ||
1287 | @code{BZ_STREAM_END} but before @code{bzReadClose}. | ||
1288 | |||
1289 | Possible assignments to @code{bzerror}: | ||
1290 | @display | ||
1291 | @code{BZ_PARAM_ERROR} | ||
1292 | if @code{b} is @code{NULL} | ||
1293 | or @code{unused} is @code{NULL} or @code{nUnused} is @code{NULL} | ||
1294 | @code{BZ_SEQUENCE_ERROR} | ||
1295 | if @code{BZ_STREAM_END} has not been signalled | ||
1296 | or if @code{b} was opened with @code{bzWriteOpen} | ||
1297 | @code{BZ_OK} | ||
1298 | otherwise | ||
1299 | @end display | ||
1300 | |||
1301 | Allowable next actions: | ||
1302 | @display | ||
1303 | @code{bzReadClose} | ||
1304 | @end display | ||
1305 | |||
1306 | |||
1307 | @subsection @code{bzReadClose} | ||
1308 | @example | ||
1309 | void bzReadClose ( int *bzerror, BZFILE *b ); | ||
1310 | @end example | ||
1311 | Releases all memory pertaining to the compressed file @code{b}. | ||
1312 | @code{bzReadClose} does not call @code{fclose} on the underlying file | ||
1313 | handle, so you should do that yourself if appropriate. | ||
1314 | @code{bzReadClose} should be called to clean up after all error | ||
1315 | situations. | ||
1316 | |||
1317 | Possible assignments to @code{bzerror}: | ||
1318 | @display | ||
1319 | @code{BZ_SEQUENCE_ERROR} | ||
1320 | if @code{b} was opened with @code{bzOpenWrite} | ||
1321 | @code{BZ_OK} | ||
1322 | otherwise | ||
1323 | @end display | ||
1324 | |||
1325 | Allowable next actions: | ||
1326 | @display | ||
1327 | none | ||
1328 | @end display | ||
1329 | |||
1330 | |||
1331 | |||
1332 | @subsection @code{bzWriteOpen} | ||
1333 | @example | ||
1334 | BZFILE *bzWriteOpen ( int *bzerror, FILE *f, | ||
1335 | int blockSize100k, int verbosity, | ||
1336 | int workFactor ); | ||
1337 | @end example | ||
1338 | Prepare to write compressed data to file handle @code{f}. | ||
1339 | @code{f} should refer to | ||
1340 | a file which has been opened for writing, and for which the error | ||
1341 | indicator (@code{ferror(f)})is not set. | ||
1342 | |||
1343 | For the meaning of parameters @code{blockSize100k}, | ||
1344 | @code{verbosity} and @code{workFactor}, see | ||
1345 | @* @code{bzCompressInit}. | ||
1346 | |||
1347 | All required memory is allocated at this stage, so if the call | ||
1348 | completes successfully, @code{BZ_MEM_ERROR} cannot be signalled by a | ||
1349 | subsequent call to @code{bzWrite}. | ||
1350 | |||
1351 | Possible assignments to @code{bzerror}: | ||
1352 | @display | ||
1353 | @code{BZ_PARAM_ERROR} | ||
1354 | if @code{f} is @code{NULL} | ||
1355 | or @code{blockSize100k < 1} or @code{blockSize100k > 9} | ||
1356 | @code{BZ_IO_ERROR} | ||
1357 | if @code{ferror(f)} is nonzero | ||
1358 | @code{BZ_MEM_ERROR} | ||
1359 | if insufficient memory is available | ||
1360 | @code{BZ_OK} | ||
1361 | otherwise | ||
1362 | @end display | ||
1363 | |||
1364 | Possible return values: | ||
1365 | @display | ||
1366 | Pointer to an abstract @code{BZFILE} | ||
1367 | if @code{bzerror} is @code{BZ_OK} | ||
1368 | @code{NULL} | ||
1369 | otherwise | ||
1370 | @end display | ||
1371 | |||
1372 | Allowable next actions: | ||
1373 | @display | ||
1374 | @code{bzWrite} | ||
1375 | if @code{bzerror} is @code{BZ_OK} | ||
1376 | (you could go directly to @code{bzWriteClose}, but this would be pretty pointless) | ||
1377 | @code{bzWriteClose} | ||
1378 | otherwise | ||
1379 | @end display | ||
1380 | |||
1381 | |||
1382 | |||
1383 | @subsection @code{bzWrite} | ||
1384 | @example | ||
1385 | void bzWrite ( int *bzerror, BZFILE *b, void *buf, int len ); | ||
1386 | @end example | ||
1387 | Absorbs @code{len} bytes from the buffer @code{buf}, eventually to be | ||
1388 | compressed and written to the file. | ||
1389 | |||
1390 | Possible assignments to @code{bzerror}: | ||
1391 | @display | ||
1392 | @code{BZ_PARAM_ERROR} | ||
1393 | if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0} | ||
1394 | @code{BZ_SEQUENCE_ERROR} | ||
1395 | if b was opened with @code{bzReadOpen} | ||
1396 | @code{BZ_IO_ERROR} | ||
1397 | if there is an error writing the compressed file. | ||
1398 | @code{BZ_OK} | ||
1399 | otherwise | ||
1400 | @end display | ||
1401 | |||
1402 | |||
1403 | |||
1404 | |||
1405 | @subsection @code{bzWriteClose} | ||
1406 | @example | ||
1407 | int bzWriteClose ( int *bzerror, BZFILE* f, | ||
1408 | int abandon, | ||
1409 | unsigned int* nbytes_in, | ||
1410 | unsigned int* nbytes_out ); | ||
1411 | @end example | ||
1412 | |||
1413 | Compresses and flushes to the compressed file all data so far supplied | ||
1414 | by @code{bzWrite}. The logical end-of-stream markers are also written, so | ||
1415 | subsequent calls to @code{bzWrite} are illegal. All memory associated | ||
1416 | with the compressed file @code{b} is released. | ||
1417 | @code{fflush} is called on the | ||
1418 | compressed file, but it is not @code{fclose}'d. | ||
1419 | |||
1420 | If @code{bzWriteClose} is called to clean up after an error, the only | ||
1421 | action is to release the memory. The library records the error codes | ||
1422 | issued by previous calls, so this situation will be detected | ||
1423 | automatically. There is no attempt to complete the compression | ||
1424 | operation, nor to @code{fflush} the compressed file. You can force this | ||
1425 | behaviour to happen even in the case of no error, by passing a nonzero | ||
1426 | value to @code{abandon}. | ||
1427 | |||
1428 | If @code{nbytes_in} is non-null, @code{*nbytes_in} will be set to be the | ||
1429 | total volume of uncompressed data handled. Similarly, @code{nbytes_out} | ||
1430 | will be set to the total volume of compressed data written. | ||
1431 | |||
1432 | Possible assignments to @code{bzerror}: | ||
1433 | @display | ||
1434 | @code{BZ_SEQUENCE_ERROR} | ||
1435 | if @code{b} was opened with @code{bzReadOpen} | ||
1436 | @code{BZ_IO_ERROR} | ||
1437 | if there is an error writing the compressed file | ||
1438 | @code{BZ_OK} | ||
1439 | otherwise | ||
1440 | @end display | ||
1441 | |||
1442 | @subsection Handling embedded compressed data streams | ||
1443 | |||
1444 | The high-level library facilitates use of | ||
1445 | @code{bzip2} data streams which form some part of a surrounding, larger | ||
1446 | data stream. | ||
1447 | @itemize @bullet | ||
1448 | @item For writing, the library takes an open file handle, writes | ||
1449 | compressed data to it, @code{fflush}es it but does not @code{fclose} it. | ||
1450 | The calling application can write its own data before and after the | ||
1451 | compressed data stream, using that same file handle. | ||
1452 | @item Reading is more complex, and the facilities are not as general | ||
1453 | as they could be since generality is hard to reconcile with efficiency. | ||
1454 | @code{bzRead} reads from the compressed file in blocks of size | ||
1455 | @code{BZ_MAX_UNUSED} bytes, and in doing so probably will overshoot | ||
1456 | the logical end of compressed stream. | ||
1457 | To recover this data once decompression has | ||
1458 | ended, call @code{bzReadGetUnused} after the last call of @code{bzRead} | ||
1459 | (the one returning @code{BZ_STREAM_END}) but before calling | ||
1460 | @code{bzReadClose}. | ||
1461 | @end itemize | ||
1462 | |||
1463 | This mechanism makes it easy to decompress multiple @code{bzip2} | ||
1464 | streams placed end-to-end. As the end of one stream, when @code{bzRead} | ||
1465 | returns @code{BZ_STREAM_END}, call @code{bzReadGetUnused} to collect the | ||
1466 | unused data (copy it into your own buffer somewhere). | ||
1467 | That data forms the start of the next compressed stream. | ||
1468 | To start uncompressing that next stream, call @code{bzReadOpen} again, | ||
1469 | feeding in the unused data via the @code{unused}/@code{nUnused} | ||
1470 | parameters. | ||
1471 | Keep doing this until @code{BZ_STREAM_END} return coincides with the | ||
1472 | physical end of file (@code{feof(f)}). In this situation | ||
1473 | @code{bzReadGetUnused} | ||
1474 | will of course return no data. | ||
1475 | |||
1476 | This should give some feel for how the high-level interface can be used. | ||
1477 | If you require extra flexibility, you'll have to bite the bullet and get | ||
1478 | to grips with the low-level interface. | ||
1479 | |||
1480 | @subsection Standard file-reading/writing code | ||
1481 | Here's how you'd write data to a compressed file: | ||
1482 | @example @code | ||
1483 | FILE* f; | ||
1484 | BZFILE* b; | ||
1485 | int nBuf; | ||
1486 | char buf[ /* whatever size you like */ ]; | ||
1487 | int bzerror; | ||
1488 | int nWritten; | ||
1489 | |||
1490 | f = fopen ( "myfile.bz2", "w" ); | ||
1491 | if (!f) @{ | ||
1492 | /* handle error */ | ||
1493 | @} | ||
1494 | b = bzWriteOpen ( &bzerror, f, 9 ); | ||
1495 | if (bzerror != BZ_OK) @{ | ||
1496 | bzWriteClose ( b ); | ||
1497 | /* handle error */ | ||
1498 | @} | ||
1499 | |||
1500 | while ( /* condition */ ) @{ | ||
1501 | /* get data to write into buf, and set nBuf appropriately */ | ||
1502 | nWritten = bzWrite ( &bzerror, b, buf, nBuf ); | ||
1503 | if (bzerror == BZ_IO_ERROR) @{ | ||
1504 | bzWriteClose ( &bzerror, b ); | ||
1505 | /* handle error */ | ||
1506 | @} | ||
1507 | @} | ||
1508 | |||
1509 | bzWriteClose ( &bzerror, b ); | ||
1510 | if (bzerror == BZ_IO_ERROR) @{ | ||
1511 | /* handle error */ | ||
1512 | @} | ||
1513 | @end example | ||
1514 | And to read from a compressed file: | ||
1515 | @example | ||
1516 | FILE* f; | ||
1517 | BZFILE* b; | ||
1518 | int nBuf; | ||
1519 | char buf[ /* whatever size you like */ ]; | ||
1520 | int bzerror; | ||
1521 | int nWritten; | ||
1522 | |||
1523 | f = fopen ( "myfile.bz2", "r" ); | ||
1524 | if (!f) @{ | ||
1525 | /* handle error */ | ||
1526 | @} | ||
1527 | b = bzReadOpen ( &bzerror, f, 0, NULL, 0 ); | ||
1528 | if (bzerror != BZ_OK) @{ | ||
1529 | bzReadClose ( &bzerror, b ); | ||
1530 | /* handle error */ | ||
1531 | @} | ||
1532 | |||
1533 | bzerror = BZ_OK; | ||
1534 | while (bzerror == BZ_OK && /* arbitrary other conditions */) @{ | ||
1535 | nBuf = bzRead ( &bzerror, b, buf, /* size of buf */ ); | ||
1536 | if (bzerror == BZ_OK) @{ | ||
1537 | /* do something with buf[0 .. nBuf-1] */ | ||
1538 | @} | ||
1539 | @} | ||
1540 | if (bzerror != BZ_STREAM_END) @{ | ||
1541 | bzReadClose ( &bzerror, b ); | ||
1542 | /* handle error */ | ||
1543 | @} else @{ | ||
1544 | bzReadClose ( &bzerror ); | ||
1545 | @} | ||
1546 | @end example | ||
1547 | |||
1548 | |||
1549 | |||
1550 | @section Utility functions | ||
1551 | @subsection @code{bzBuffToBuffCompress} | ||
1552 | @example | ||
1553 | int bzBuffToBuffCompress( char* dest, | ||
1554 | unsigned int* destLen, | ||
1555 | char* source, | ||
1556 | unsigned int sourceLen, | ||
1557 | int blockSize100k, | ||
1558 | int verbosity, | ||
1559 | int workFactor ); | ||
1560 | @end example | ||
1561 | Attempts to compress the data in @code{source[0 .. sourceLen-1]} | ||
1562 | into the destination buffer, @code{dest[0 .. *destLen-1]}. | ||
1563 | If the destination buffer is big enough, @code{*destLen} is | ||
1564 | set to the size of the compressed data, and @code{BZ_OK} is | ||
1565 | returned. If the compressed data won't fit, @code{*destLen} | ||
1566 | is unchanged, and @code{BZ_OUTBUFF_FULL} is returned. | ||
1567 | |||
1568 | Compression in this manner is a one-shot event, done with a single call | ||
1569 | to this function. The resulting compressed data is a complete | ||
1570 | @code{bzip2} format data stream. There is no mechanism for making | ||
1571 | additional calls to provide extra input data. If you want that kind of | ||
1572 | mechanism, use the low-level interface. | ||
1573 | |||
1574 | For the meaning of parameters @code{blockSize100k}, @code{verbosity} | ||
1575 | and @code{workFactor}, @* see @code{bzCompressInit}. | ||
1576 | |||
1577 | To guarantee that the compressed data will fit in its buffer, allocate | ||
1578 | an output buffer of size 1% larger than the uncompressed data, plus | ||
1579 | six hundred extra bytes. | ||
1580 | |||
1581 | @code{bzBuffToBuffDecompress} will not write data at or | ||
1582 | beyond @code{dest[*destLen]}, even in case of buffer overflow. | ||
1583 | |||
1584 | Possible return values: | ||
1585 | @display | ||
1586 | @code{BZ_PARAM_ERROR} | ||
1587 | if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL} | ||
1588 | or @code{blockSize100k < 1} or @code{blockSize100k > 9} | ||
1589 | or @code{verbosity < 0} or @code{verbosity > 4} | ||
1590 | or @code{workFactor < 0} or @code{workFactor > 250} | ||
1591 | @code{BZ_MEM_ERROR} | ||
1592 | if insufficient memory is available | ||
1593 | @code{BZ_OUTBUFF_FULL} | ||
1594 | if the size of the compressed data exceeds @code{*destLen} | ||
1595 | @code{BZ_OK} | ||
1596 | otherwise | ||
1597 | @end display | ||
1598 | |||
1599 | |||
1600 | |||
1601 | @subsection @code{bzBuffToBuffDecompress} | ||
1602 | @example | ||
1603 | int bzBuffToBuffDecompress ( char* dest, | ||
1604 | unsigned int* destLen, | ||
1605 | char* source, | ||
1606 | unsigned int sourceLen, | ||
1607 | int small, | ||
1608 | int verbosity ); | ||
1609 | @end example | ||
1610 | Attempts to decompress the data in @code{source[0 .. sourceLen-1]} | ||
1611 | into the destination buffer, @code{dest[0 .. *destLen-1]}. | ||
1612 | If the destination buffer is big enough, @code{*destLen} is | ||
1613 | set to the size of the uncompressed data, and @code{BZ_OK} is | ||
1614 | returned. If the compressed data won't fit, @code{*destLen} | ||
1615 | is unchanged, and @code{BZ_OUTBUFF_FULL} is returned. | ||
1616 | |||
1617 | @code{source} is assumed to hold a complete @code{bzip2} format | ||
1618 | data stream. @code{bzBuffToBuffDecompress} tries to decompress | ||
1619 | the entirety of the stream into the output buffer. | ||
1620 | |||
1621 | For the meaning of parameters @code{small} and @code{verbosity}, | ||
1622 | see @code{bzDecompressInit}. | ||
1623 | |||
1624 | Because the compression ratio of the compressed data cannot be known in | ||
1625 | advance, there is no easy way to guarantee that the output buffer will | ||
1626 | be big enough. You may of course make arrangements in your code to | ||
1627 | record the size of the uncompressed data, but such a mechanism is beyond | ||
1628 | the scope of this library. | ||
1629 | |||
1630 | @code{bzBuffToBuffDecompress} will not write data at or | ||
1631 | beyond @code{dest[*destLen]}, even in case of buffer overflow. | ||
1632 | |||
1633 | Possible return values: | ||
1634 | @display | ||
1635 | @code{BZ_PARAM_ERROR} | ||
1636 | if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL} | ||
1637 | or @code{small != 0 && small != 1} | ||
1638 | or @code{verbosity < 0} or @code{verbosity > 4} | ||
1639 | @code{BZ_MEM_ERROR} | ||
1640 | if insufficient memory is available | ||
1641 | @code{BZ_OUTBUFF_FULL} | ||
1642 | if the size of the compressed data exceeds @code{*destLen} | ||
1643 | @code{BZ_DATA_ERROR} | ||
1644 | if a data integrity error was detected in the compressed data | ||
1645 | @code{BZ_DATA_ERROR_MAGIC} | ||
1646 | if the compressed data doesn't begin with the right magic bytes | ||
1647 | @code{BZ_UNEXPECTED_EOF} | ||
1648 | if the compressed data ends unexpectedly | ||
1649 | @code{BZ_OK} | ||
1650 | otherwise | ||
1651 | @end display | ||
1652 | |||
1653 | |||
1654 | |||
1655 | @section Using the library in a @code{stdio}-free environment | ||
1656 | |||
1657 | @subsection Getting rid of @code{stdio} | ||
1658 | |||
1659 | In a deeply embedded application, you might want to use just | ||
1660 | the memory-to-memory functions. You can do this conveniently | ||
1661 | by compiling the library with preprocessor symbol @code{BZ_NO_STDIO} | ||
1662 | defined. Doing this gives you a library containing only the following | ||
1663 | eight functions: | ||
1664 | |||
1665 | @code{bzCompressInit}, @code{bzCompress}, @code{bzCompressEnd} @* | ||
1666 | @code{bzDecompressInit}, @code{bzDecompress}, @code{bzDecompressEnd} @* | ||
1667 | @code{bzBuffToBuffCompress}, @code{bzBuffToBuffDecompress} | ||
1668 | |||
1669 | When compiled like this, all functions will ignore @code{verbosity} | ||
1670 | settings. | ||
1671 | |||
1672 | @subsection Critical error handling | ||
1673 | @code{libbzip2} contains a number of internal assertion checks which | ||
1674 | should, needless to say, never be activated. Nevertheless, if an | ||
1675 | assertion should fail, behaviour depends on whether or not the library | ||
1676 | was compiled with @code{BZ_NO_STDIO} set. | ||
1677 | |||
1678 | For a normal compile, an assertion failure yields the message | ||
1679 | @example | ||
1680 | bzip2/libbzip2, v0.9.0: internal error number N. | ||
1681 | This is a bug in bzip2/libbzip2, v0.9.0. Please report | ||
1682 | it to me at: jseward@@acm.org. If this happened when | ||
1683 | you were using some program which uses libbzip2 as a | ||
1684 | component, you should also report this bug to the author(s) | ||
1685 | of that program. Please make an effort to report this bug; | ||
1686 | timely and accurate bug reports eventually lead to higher | ||
1687 | quality software. Thx. Julian Seward, 27 June 1998. | ||
1688 | @end example | ||
1689 | where @code{N} is some error code number. @code{exit(3)} | ||
1690 | is then called. | ||
1691 | |||
1692 | For a @code{stdio}-free library, assertion failures result | ||
1693 | in a call to a function declared as: | ||
1694 | @example | ||
1695 | extern void bz_internal_error ( int errcode ); | ||
1696 | @end example | ||
1697 | The relevant code is passed as a parameter. You should supply | ||
1698 | such a function. | ||
1699 | |||
1700 | In either case, once an assertion failure has occurred, any | ||
1701 | @code{bz_stream} records involved can be regarded as invalid. | ||
1702 | You should not attempt to resume normal operation with them. | ||
1703 | |||
1704 | You may, of course, change critical error handling to suit | ||
1705 | your needs. As I said above, critical errors indicate bugs | ||
1706 | in the library and should not occur. All "normal" error | ||
1707 | situations are indicated via error return codes from functions, | ||
1708 | and can be recovered from. | ||
1709 | |||
1710 | |||
1711 | @section Making a Windows DLL | ||
1712 | Everything related to Windows has been contributed by Yoshioka Tsuneo | ||
1713 | @* (@code{QWF00133@@niftyserve.or.jp} / | ||
1714 | @code{tsuneo-y@@is.aist-nara.ac.jp}), so you should send your queries to | ||
1715 | him (but perhaps Cc: me, @code{jseward@@acm.org}). | ||
1716 | |||
1717 | My vague understanding of what to do is: using Visual C++ 5.0, | ||
1718 | open the project file @code{libbz2.dsp}, and build. That's all. | ||
1719 | |||
1720 | If you can't | ||
1721 | open the project file for some reason, make a new one, naming these files: | ||
1722 | @code{blocksort.c}, @code{bzlib.c}, @code{compress.c}, | ||
1723 | @code{crctable.c}, @code{decompress.c}, @code{huffman.c}, @* | ||
1724 | @code{randtable.c} and @code{libbz2.def}. You might also need | ||
1725 | to name the header files @code{bzlib.h} and @code{bzlib_private.h}. | ||
1726 | |||
1727 | If you don't use VC++, you may need to define the proprocessor symbol | ||
1728 | @code{_WIN32}. | ||
1729 | |||
1730 | Finally, @code{dlltest.c} is a sample program using the DLL. It has a | ||
1731 | project file, @code{dlltest.dsp}. | ||
1732 | |||
1733 | I haven't tried any of this stuff myself, but it all looks plausible. | ||
1734 | |||
1735 | |||
1736 | |||
1737 | @chapter Miscellanea | ||
1738 | |||
1739 | These are just some random thoughts of mine. Your mileage may | ||
1740 | vary. | ||
1741 | |||
1742 | @section Limitations of the compressed file format | ||
1743 | @code{bzip2-0.9.0} uses exactly the same file format as the previous | ||
1744 | version, @code{bzip2-0.1}. This decision was made in the interests of | ||
1745 | stability. Creating yet another incompatible compressed file format | ||
1746 | would create further confusion and disruption for users. | ||
1747 | |||
1748 | Nevertheless, this is not a painless decision. Development | ||
1749 | work since the release of @code{bzip2-0.1} in August 1997 | ||
1750 | has shown complexities in the file format which slow down | ||
1751 | decompression and, in retrospect, are unnecessary. These are: | ||
1752 | @itemize @bullet | ||
1753 | @item The run-length encoder, which is the first of the | ||
1754 | compression transformations, is entirely irrelevant. | ||
1755 | The original purpose was to protect the sorting algorithm | ||
1756 | from the very worst case input: a string of repeated | ||
1757 | symbols. But algorithm steps Q6a and Q6b in the original | ||
1758 | Burrows-Wheeler technical report (SRC-124) show how | ||
1759 | repeats can be handled without difficulty in block | ||
1760 | sorting. | ||
1761 | @item The randomisation mechanism doesn't really need to be | ||
1762 | there. Udi Manber and Gene Myers published a suffix | ||
1763 | array construction algorithm a few years back, which | ||
1764 | can be employed to sort any block, no matter how | ||
1765 | repetitive, in O(N log N) time. Subsequent work by | ||
1766 | Kunihiko Sadakane has produced a derivative O(N (log N)^2) | ||
1767 | algorithm which usually outperforms the Manber-Myers | ||
1768 | algorithm. | ||
1769 | |||
1770 | I could have changed to Sadakane's algorithm, but I find | ||
1771 | it to be slower than @code{bzip2}'s existing algorithm for | ||
1772 | most inputs, and the randomisation mechanism protects | ||
1773 | adequately against bad cases. I didn't think it was | ||
1774 | a good tradeoff to make. Partly this is due to the fact | ||
1775 | that I was not flooded with email complaints about | ||
1776 | @code{bzip2-0.1}'s performance on repetitive data, so | ||
1777 | perhaps it isn't a problem for real inputs. | ||
1778 | |||
1779 | Probably the best long-term solution | ||
1780 | is to use the existing sorting | ||
1781 | algorithm initially, and fall back to a O(N (log N)^2) | ||
1782 | algorithm if the standard algorithm gets into difficulties. | ||
1783 | This can be done without much difficulty; I made | ||
1784 | a prototype implementation of it some months now. | ||
1785 | @item The compressed file format was never designed to be | ||
1786 | handled by a library, and I have had to jump though | ||
1787 | some hoops to produce an efficient implementation of | ||
1788 | decompression. It's a bit hairy. Try passing | ||
1789 | @code{decompress.c} through the C preprocessor | ||
1790 | and you'll see what I mean. Much of this complexity | ||
1791 | could have been avoided if the compressed size of | ||
1792 | each block of data was recorded in the data stream. | ||
1793 | @item An Adler-32 checksum, rather than a CRC32 checksum, | ||
1794 | would be faster to compute. | ||
1795 | @end itemize | ||
1796 | It would be fair to say that the @code{bzip2} format was frozen | ||
1797 | before I properly and fully understood the performance | ||
1798 | consequences of doing so. | ||
1799 | |||
1800 | Improvements which I have been able to incorporate into | ||
1801 | 0.9.0, despite using the same file format, are: | ||
1802 | @itemize @bullet | ||
1803 | @item Single array implementation of the inverse BWT. This | ||
1804 | significantly speeds up decompression, presumably | ||
1805 | because it reduces the number of cache misses. | ||
1806 | @item Faster inverse MTF transform for large MTF values. The | ||
1807 | new implementation is based on the notion of sliding blocks | ||
1808 | of values. | ||
1809 | @item @code{bzip2-0.9.0} now reads and writes files with @code{fread} | ||
1810 | and @code{fwrite}; version 0.1 used @code{putc} and @code{getc}. | ||
1811 | Duh! I'm embarrassed at my own moronicness (moronicity?) on this | ||
1812 | one. | ||
1813 | |||
1814 | @end itemize | ||
1815 | Further ahead, it would be nice | ||
1816 | to be able to do random access into files. This will | ||
1817 | require some careful design of compressed file formats. | ||
1818 | |||
1819 | |||
1820 | |||
1821 | @section Portability issues | ||
1822 | After some consideration, I have decided not to use | ||
1823 | GNU @code{autoconf} to configure 0.9.0. | ||
1824 | |||
1825 | @code{autoconf}, admirable and wonderful though it is, | ||
1826 | mainly assists with portability problems between Unix-like | ||
1827 | platforms. But @code{bzip2} doesn't have much in the way | ||
1828 | of portability problems on Unix; most of the difficulties appear | ||
1829 | when porting to the Mac, or to Microsoft's operating systems. | ||
1830 | @code{autoconf} doesn't help in those cases, and brings in a | ||
1831 | whole load of new complexity. | ||
1832 | |||
1833 | Most people should be able to compile the library and program | ||
1834 | under Unix straight out-of-the-box, so to speak, especially | ||
1835 | if you have a version of GNU C available. | ||
1836 | |||
1837 | There are a couple of @code{__inline__} directives in the code. GNU C | ||
1838 | (@code{gcc}) should be able to handle them. If your compiler doesn't | ||
1839 | like them, just @code{#define} @code{__inline__} to be null. One | ||
1840 | easy way to do this is to compile with the flag @code{-D__inline__=}, | ||
1841 | which should be understood by most Unix compilers. | ||
1842 | |||
1843 | If you still have difficulties, try compiling with the macro | ||
1844 | @code{BZ_STRICT_ANSI} defined. This should enable you to build the | ||
1845 | library in a strictly ANSI compliant environment. Building the program | ||
1846 | itself like this is dangerous and not supported, since you remove | ||
1847 | @code{bzip2}'s checks against compressing directories, symbolic links, | ||
1848 | devices, and other not-really-a-file entities. This could cause | ||
1849 | filesystem corruption! | ||
1850 | |||
1851 | One other thing: if you create a @code{bzip2} binary for public | ||
1852 | distribution, please try and link it statically (@code{gcc -s}). This | ||
1853 | avoids all sorts of library-version issues that others may encounter | ||
1854 | later on. | ||
1855 | |||
1856 | |||
1857 | @section Reporting bugs | ||
1858 | I tried pretty hard to make sure @code{bzip2} is | ||
1859 | bug free, both by design and by testing. Hopefully | ||
1860 | you'll never need to read this section for real. | ||
1861 | |||
1862 | Nevertheless, if @code{bzip2} dies with a segmentation | ||
1863 | fault, a bus error or an internal assertion failure, it | ||
1864 | will ask you to email me a bug report. Experience with | ||
1865 | version 0.1 shows that almost all these problems can | ||
1866 | be traced to either compiler bugs or hardware problems. | ||
1867 | @itemize @bullet | ||
1868 | @item | ||
1869 | Recompile the program with no optimisation, and see if it | ||
1870 | works. And/or try a different compiler. | ||
1871 | I heard all sorts of stories about various flavours | ||
1872 | of GNU C (and other compilers) generating bad code for | ||
1873 | @code{bzip2}, and I've run across two such examples myself. | ||
1874 | |||
1875 | 2.7.X versions of GNU C are known to generate bad code from | ||
1876 | time to time, at high optimisation levels. | ||
1877 | If you get problems, try using the flags | ||
1878 | @code{-O2} @code{-fomit-frame-pointer} @code{-fno-strength-reduce}. | ||
1879 | You should specifically @emph{not} use @code{-funroll-loops}. | ||
1880 | |||
1881 | You may notice that the Makefile runs four tests as part of | ||
1882 | the build process. If the program passes all of these, it's | ||
1883 | a pretty good (but not 100%) indication that the compiler has | ||
1884 | done its job correctly. | ||
1885 | @item | ||
1886 | If @code{bzip2} crashes randomly, and the crashes are not | ||
1887 | repeatable, you may have a flaky memory subsystem. @code{bzip2} | ||
1888 | really hammers your memory hierarchy, and if it's a bit marginal, | ||
1889 | you may get these problems. Ditto if your disk or I/O subsystem | ||
1890 | is slowly failing. Yup, this really does happen. | ||
1891 | |||
1892 | Try using a different machine of the same type, and see if | ||
1893 | you can repeat the problem. | ||
1894 | @item This isn't really a bug, but ... If @code{bzip2} tells | ||
1895 | you your file is corrupted on decompression, and you | ||
1896 | obtained the file via FTP, there is a possibility that you | ||
1897 | forgot to tell FTP to do a binary mode transfer. That absolutely | ||
1898 | will cause the file to be non-decompressible. You'll have to transfer | ||
1899 | it again. | ||
1900 | @end itemize | ||
1901 | |||
1902 | If you've incorporated @code{libbzip2} into your own program | ||
1903 | and are getting problems, please, please, please, check that the | ||
1904 | parameters you are passing in calls to the library, are | ||
1905 | correct, and in accordance with what the documentation says | ||
1906 | is allowable. I have tried to make the library robust against | ||
1907 | such problems, but I'm sure I haven't succeeded. | ||
1908 | |||
1909 | Finally, if the above comments don't help, you'll have to send | ||
1910 | me a bug report. Now, it's just amazing how many people will | ||
1911 | send me a bug report saying something like | ||
1912 | @display | ||
1913 | bzip2 crashed with segmentation fault on my machine | ||
1914 | @end display | ||
1915 | and absolutely nothing else. Needless to say, a such a report | ||
1916 | is @emph{totally, utterly, completely and comprehensively 100% useless; | ||
1917 | a waste of your time, my time, and net bandwidth}. | ||
1918 | With no details at all, there's no way I can possibly begin | ||
1919 | to figure out what the problem is. | ||
1920 | |||
1921 | The rules of the game are: facts, facts, facts. Don't omit | ||
1922 | them because "oh, they won't be relevant". At the bare | ||
1923 | minimum: | ||
1924 | @display | ||
1925 | Machine type. Operating system version. | ||
1926 | Exact version of @code{bzip2} (do @code{bzip2 -V}). | ||
1927 | Exact version of the compiler used. | ||
1928 | Flags passed to the compiler. | ||
1929 | @end display | ||
1930 | However, the most important single thing that will help me is | ||
1931 | the file that you were trying to compress or decompress at the | ||
1932 | time the problem happened. Without that, my ability to do anything | ||
1933 | more than speculate about the cause, is limited. | ||
1934 | |||
1935 | Please remember that I connect to the Internet with a modem, so | ||
1936 | you should contact me before mailing me huge files. | ||
1937 | |||
1938 | |||
1939 | @section Did you get the right package? | ||
1940 | |||
1941 | @code{bzip2} is a resource hog. It soaks up large amounts of CPU cycles | ||
1942 | and memory. Also, it gives very large latencies. In the worst case, you | ||
1943 | can feed many megabytes of uncompressed data into the library before | ||
1944 | getting any compressed output, so this probably rules out applications | ||
1945 | requiring interactive behaviour. | ||
1946 | |||
1947 | These aren't faults of my implementation, I hope, but more | ||
1948 | an intrinsic property of the Burrows-Wheeler transform (unfortunately). | ||
1949 | Maybe this isn't what you want. | ||
1950 | |||
1951 | If you want a compressor and/or library which is faster, uses less | ||
1952 | memory but gets pretty good compression, and has minimal latency, | ||
1953 | consider Jean-loup | ||
1954 | Gailly's and Mark Adler's work, @code{zlib-1.1.2} and | ||
1955 | @code{gzip-1.2.4}. Look for them at | ||
1956 | @code{http://www.cdrom.com/pub/infozip/zlib} and | ||
1957 | @code{http://www.gzip.org} respectively. | ||
1958 | |||
1959 | For something faster and lighter still, you might try Markus F X J | ||
1960 | Oberhumer's @code{LZO} real-time compression/decompression library, at | ||
1961 | @* @code{http://wildsau.idv.uni-linz.ac.at/mfx/lzo.html}. | ||
1962 | |||
1963 | If you want to use the @code{bzip2} algorithms to compress small blocks | ||
1964 | of data, 64k bytes or smaller, for example on an on-the-fly disk | ||
1965 | compressor, you'd be well advised not to use this library. Instead, | ||
1966 | I've made a special library tuned for that kind of use. It's part of | ||
1967 | @code{e2compr-0.40}, an on-the-fly disk compressor for the Linux | ||
1968 | @code{ext2} filesystem. Look at | ||
1969 | @code{http://www.netspace.net.au/~reiter/e2compr}. | ||
1970 | |||
1971 | |||
1972 | |||
1973 | @section Testing | ||
1974 | |||
1975 | A record of the tests I've done. | ||
1976 | |||
1977 | First, some data sets: | ||
1978 | @itemize @bullet | ||
1979 | @item B: a directory containing a 6001 files, one for every length in the | ||
1980 | range 0 to 6000 bytes. The files contain random lowercase | ||
1981 | letters. 18.7 megabytes. | ||
1982 | @item H: my home directory tree. Documents, source code, mail files, | ||
1983 | compressed data. H contains B, and also a directory of | ||
1984 | files designed as boundary cases for the sorting; mostly very | ||
1985 | repetitive, nasty files. 445 megabytes. | ||
1986 | @item A: directory tree holding various applications built from source: | ||
1987 | @code{egcs-1.0.2}, @code{gcc-2.8.1}, KDE Beta 4, GTK, Octave, etc. | ||
1988 | 827 megabytes. | ||
1989 | @item P: directory tree holding large amounts of source code (@code{.tar} | ||
1990 | files) of the entire GNU distribution, plus a couple of | ||
1991 | Linux distributions. 2400 megabytes. | ||
1992 | @end itemize | ||
1993 | The tests conducted are as follows. Each test means compressing | ||
1994 | (a copy of) each file in the data set, decompressing it and | ||
1995 | comparing it against the original. | ||
1996 | |||
1997 | First, a bunch of tests with block sizes, internal buffer | ||
1998 | sizes and randomisation lengths set very small, | ||
1999 | to detect any problems with the | ||
2000 | blocking, buffering and randomisation mechanisms. | ||
2001 | This required modifying the source code so as to try to | ||
2002 | break it. | ||
2003 | @enumerate | ||
2004 | @item Data set H, with | ||
2005 | buffer size of 1 byte, and block size of 23 bytes. | ||
2006 | @item Data set B, buffer sizes 1 byte, block size 1 byte. | ||
2007 | @item As (2) but small-mode decompression (first 1700 files). | ||
2008 | @item As (2) with block size 2 bytes. | ||
2009 | @item As (2) with block size 3 bytes. | ||
2010 | @item As (2) with block size 4 bytes. | ||
2011 | @item As (2) with block size 5 bytes. | ||
2012 | @item As (2) with block size 6 bytes and small-mode decompression. | ||
2013 | @item H with normal buffer sizes (5000 bytes), normal block | ||
2014 | size (up to 900000 bytes), but with randomisation | ||
2015 | mechanism running intensely (randomising approximately every | ||
2016 | third byte). | ||
2017 | @item As (9) with small-mode decompression. | ||
2018 | @end enumerate | ||
2019 | Then some tests with unmodified source code. | ||
2020 | @enumerate | ||
2021 | @item H, all settings normal. | ||
2022 | @item As (1), with small-mode decompress. | ||
2023 | @item H, compress with flag @code{-1}. | ||
2024 | @item H, compress with flag @code{-s}, decompress with flag @code{-s}. | ||
2025 | @item Forwards compatibility: H, @code{bzip2-0.1pl2} compressing, | ||
2026 | @code{bzip2-0.9.0} decompressing, all settings normal. | ||
2027 | @item Backwards compatibility: H, @code{bzip2-0.9.0} compressing, | ||
2028 | @code{bzip2-0.1pl2} decompressing, all settings normal. | ||
2029 | @item Bigger tests: A, all settings normal. | ||
2030 | @item P, all settings normal. | ||
2031 | @item Misc test: about 100 megabytes of @code{.tar} files with | ||
2032 | @code{bzip2} compiled with Purify. | ||
2033 | @item Misc tests to make sure it builds and runs ok on non-Linux/x86 | ||
2034 | platforms. | ||
2035 | @end enumerate | ||
2036 | These tests were conducted on a 205 MHz Cyrix 6x86MX machine, running | ||
2037 | Linux 2.0.32. They represent nearly a week of continuous computation. | ||
2038 | All tests completed successfully. | ||
2039 | |||
2040 | |||
2041 | @section Further reading | ||
2042 | @code{bzip2} is not research work, in the sense that it doesn't present | ||
2043 | any new ideas. Rather, it's an engineering exercise based on existing | ||
2044 | ideas. | ||
2045 | |||
2046 | Four documents describe essentially all the ideas behind @code{bzip2}: | ||
2047 | @example | ||
2048 | Michael Burrows and D. J. Wheeler: | ||
2049 | "A block-sorting lossless data compression algorithm" | ||
2050 | 10th May 1994. | ||
2051 | Digital SRC Research Report 124. | ||
2052 | ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz | ||
2053 | If you have trouble finding it, try searching at the | ||
2054 | New Zealand Digital Library, http://www.nzdl.org. | ||
2055 | |||
2056 | Daniel S. Hirschberg and Debra A. LeLewer | ||
2057 | "Efficient Decoding of Prefix Codes" | ||
2058 | Communications of the ACM, April 1990, Vol 33, Number 4. | ||
2059 | You might be able to get an electronic copy of this | ||
2060 | from the ACM Digital Library. | ||
2061 | |||
2062 | David J. Wheeler | ||
2063 | Program bred3.c and accompanying document bred3.ps. | ||
2064 | This contains the idea behind the multi-table Huffman | ||
2065 | coding scheme. | ||
2066 | ftp://ftp.cl.cam.ac.uk/pub/user/djw3/ | ||
2067 | |||
2068 | Jon L. Bentley and Robert Sedgewick | ||
2069 | "Fast Algorithms for Sorting and Searching Strings" | ||
2070 | Available from Sedgewick's web page, | ||
2071 | www.cs.princeton.edu/~rs | ||
2072 | @end example | ||
2073 | The following paper gives valuable additional insights into the | ||
2074 | algorithm, but is not immediately the basis of any code | ||
2075 | used in bzip2. | ||
2076 | @example | ||
2077 | Peter Fenwick: | ||
2078 | Block Sorting Text Compression | ||
2079 | Proceedings of the 19th Australasian Computer Science Conference, | ||
2080 | Melbourne, Australia. Jan 31 - Feb 2, 1996. | ||
2081 | ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps | ||
2082 | @end example | ||
2083 | Kunihiko Sadakane's sorting algorithm, mentioned above, | ||
2084 | is available from: | ||
2085 | @example | ||
2086 | http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz | ||
2087 | @end example | ||
2088 | The Manber-Myers suffix array construction | ||
2089 | algorithm is described in a paper | ||
2090 | available from: | ||
2091 | @example | ||
2092 | http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps | ||
2093 | @end example | ||
2094 | |||
2095 | |||
2096 | |||
2097 | @contents | ||
2098 | |||
2099 | @bye | ||
2100 | |||