aboutsummaryrefslogtreecommitdiff
path: root/manual.texi
diff options
context:
space:
mode:
Diffstat (limited to 'manual.texi')
-rw-r--r--manual.texi2100
1 files changed, 2100 insertions, 0 deletions
diff --git a/manual.texi b/manual.texi
new file mode 100644
index 0000000..99ce661
--- /dev/null
+++ b/manual.texi
@@ -0,0 +1,2100 @@
1\input texinfo @c -*- Texinfo -*-
2@setfilename bzip2.info
3
4@ignore
5This file documents bzip2 version 0.9.0c, and associated library
6libbzip2, written by Julian Seward (jseward@acm.org).
7
8Copyright (C) 1996-1998 Julian R Seward
9
10Permission is granted to make and distribute verbatim copies of
11this manual provided the copyright notice and this permission notice
12are preserved on all copies.
13
14Permission is granted to copy and distribute translations of this manual
15into another language, under the above conditions for verbatim copies.
16@end ignore
17
18@ifinfo
19@format
20START-INFO-DIR-ENTRY
21* Bzip2: (bzip2). A program and library for data compression.
22END-INFO-DIR-ENTRY
23@end format
24
25@end ifinfo
26
27@iftex
28@c @finalout
29@settitle bzip2 and libbzip2
30@titlepage
31@title bzip2 and libbzip2
32@subtitle a program and library for data compression
33@subtitle copyright (C) 1996-1998 Julian Seward
34@subtitle version 0.9.0c of 18 October 1998
35@author Julian Seward
36
37@end titlepage
38@end iftex
39
40
41@parindent 0mm
42@parskip 2mm
43
44
45This program, @code{bzip2},
46and associated library @code{libbzip2}, are
47Copyright (C) 1996-1998 Julian R Seward. All rights reserved.
48
49Redistribution and use in source and binary forms, with or without
50modification, are permitted provided that the following conditions
51are met:
52@itemize @bullet
53@item
54 Redistributions of source code must retain the above copyright
55 notice, this list of conditions and the following disclaimer.
56@item
57 The origin of this software must not be misrepresented; you must
58 not claim that you wrote the original software. If you use this
59 software in a product, an acknowledgment in the product
60 documentation would be appreciated but is not required.
61@item
62 Altered source versions must be plainly marked as such, and must
63 not be misrepresented as being the original software.
64@item
65 The name of the author may not be used to endorse or promote
66 products derived from this software without specific prior written
67 permission.
68@end itemize
69THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS
70OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
71WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
72ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY
73DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
74DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
75GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
76INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
77WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
78NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
79SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
80
81Julian Seward, Guildford, Surrey, UK.
82
83@code{jseward@@acm.org}
84
85@code{http://www.muraroa.demon.co.uk}
86
87@code{bzip2}/@code{libbzip2} version 0.9.0c of 18 October 1998.
88
89PATENTS: To the best of my knowledge, @code{bzip2} does not use any patented
90algorithms. However, I do not have the resources available to carry out
91a full patent search. Therefore I cannot give any guarantee of the
92above statement.
93
94
95
96
97
98
99
100@node Overview, Implementation, Top, Top
101@chapter Introduction
102
103@code{bzip2} compresses files using the Burrows-Wheeler
104block-sorting text compression algorithm, and Huffman coding.
105Compression is generally considerably better than that
106achieved by more conventional LZ77/LZ78-based compressors,
107and approaches the performance of the PPM family of statistical compressors.
108
109@code{bzip2} is built on top of @code{libbzip2}, a flexible library
110for handling compressed data in the @code{bzip2} format. This manual
111describes both how to use the program and
112how to work with the library interface. Most of the
113manual is devoted to this library, not the program,
114which is good news if your interest is only in the program.
115
116Chapter 2 describes how to use @code{bzip2}; this is the only part
117you need to read if you just want to know how to operate the program.
118Chapter 3 describes the programming interfaces in detail, and
119Chapter 4 records some miscellaneous notes which I thought
120ought to be recorded somewhere.
121
122
123@chapter How to use @code{bzip2}
124
125This chapter contains a copy of the @code{bzip2} man page,
126and nothing else.
127@example
128NAME
129 bzip2, bunzip2 - a block-sorting file compressor, v0.9.0
130 bzcat - decompresses files to stdout
131 bzip2recover - recovers data from damaged bzip2 files
132
133
134SYNOPSIS
135 bzip2 [ -cdfkstvzVL123456789 ] [ filenames ... ]
136 bunzip2 [ -fkvsVL ] [ filenames ... ]
137 bzcat [ -s ] [ filenames ... ]
138 bzip2recover filename
139
140
141DESCRIPTION
142 bzip2 compresses files using the Burrows-Wheeler block-
143 sorting text compression algorithm, and Huffman coding.
144 Compression is generally considerably better than that
145 achieved by more conventional LZ77/LZ78-based compressors,
146 and approaches the performance of the PPM family of sta-
147 tistical compressors.
148
149 The command-line options are deliberately very similar to
150 those of GNU Gzip, but they are not identical.
151
152 bzip2 expects a list of file names to accompany the com-
153 mand-line flags. Each file is replaced by a compressed
154 version of itself, with the name "original_name.bz2".
155 Each compressed file has the same modification date and
156 permissions as the corresponding original, so that these
157 properties can be correctly restored at decompression
158 time. File name handling is naive in the sense that there
159 is no mechanism for preserving original file names, per-
160 missions and dates in filesystems which lack these con-
161 cepts, or have serious file name length restrictions, such
162 as MS-DOS.
163
164 bzip2 and bunzip2 will by default not overwrite existing
165 files; if you want this to happen, specify the -f flag.
166
167 If no file names are specified, bzip2 compresses from
168 standard input to standard output. In this case, bzip2
169 will decline to write compressed output to a terminal, as
170 this would be entirely incomprehensible and therefore
171 pointless.
172
173 bunzip2 (or bzip2 -d ) decompresses and restores all spec-
174 ified files whose names end in ".bz2". Files without this
175 suffix are ignored. Again, supplying no filenames causes
176 decompression from standard input to standard output.
177
178 bunzip2 will correctly decompress a file which is the con-
179 catenation of two or more compressed files. The result is
180 the concatenation of the corresponding uncompressed files.
181 Integrity testing (-t) of concatenated compressed files is
182 also supported.
183
184 You can also compress or decompress files to the standard
185 output by giving the -c flag. Multiple files may be com-
186 pressed and decompressed like this. The resulting outputs
187 are fed sequentially to stdout. Compression of multiple
188 files in this manner generates a stream containing multi-
189 ple compressed file representations. Such a stream can be
190 decompressed correctly only by bzip2 version 0.9.0 or
191 later. Earlier versions of bzip2 will stop after decom-
192 pressing the first file in the stream.
193
194 bzcat (or bzip2 -dc ) decompresses all specified files to
195 the standard output.
196
197 Compression is always performed, even if the compressed
198 file is slightly larger than the original. Files of less
199 than about one hundred bytes tend to get larger, since the
200 compression mechanism has a constant overhead in the
201 region of 50 bytes. Random data (including the output of
202 most file compressors) is coded at about 8.05 bits per
203 byte, giving an expansion of around 0.5%.
204
205 As a self-check for your protection, bzip2 uses 32-bit
206 CRCs to make sure that the decompressed version of a file
207 is identical to the original. This guards against corrup-
208 tion of the compressed data, and against undetected bugs
209 in bzip2 (hopefully very unlikely). The chances of data
210 corruption going undetected is microscopic, about one
211 chance in four billion for each file processed. Be aware,
212 though, that the check occurs upon decompression, so it
213 can only tell you that that something is wrong. It can't
214 help you recover the original uncompressed data. You can
215 use bzip2recover to try to recover data from damaged
216 files.
217
218 Return values: 0 for a normal exit, 1 for environmental
219 problems (file not found, invalid flags, I/O errors, &c),
220 2 to indicate a corrupt compressed file, 3 for an internal
221 consistency error (eg, bug) which caused bzip2 to panic.
222
223
224MEMORY MANAGEMENT
225 Bzip2 compresses large files in blocks. The block size
226 affects both the compression ratio achieved, and the
227 amount of memory needed both for compression and decom-
228 pression. The flags -1 through -9 specify the block size
229 to be 100,000 bytes through 900,000 bytes (the default)
230 respectively. At decompression-time, the block size used
231 for compression is read from the header of the compressed
232 file, and bunzip2 then allocates itself just enough memory
233 to decompress the file. Since block sizes are stored in
234 compressed files, it follows that the flags -1 to -9 are
235 irrelevant to and so ignored during decompression.
236
237 Compression and decompression requirements, in bytes, can
238 be estimated as:
239
240 Compression: 400k + ( 7 x block size )
241
242 Decompression: 100k + ( 4 x block size ), or
243 100k + ( 2.5 x block size )
244
245 Larger block sizes give rapidly diminishing marginal
246 returns; most of the compression comes from the first two
247 or three hundred k of block size, a fact worth bearing in
248 mind when using bzip2 on small machines. It is also
249 important to appreciate that the decompression memory
250 requirement is set at compression-time by the choice of
251 block size.
252
253 For files compressed with the default 900k block size,
254 bunzip2 will require about 3700 kbytes to decompress. To
255 support decompression of any file on a 4 megabyte machine,
256 bunzip2 has an option to decompress using approximately
257 half this amount of memory, about 2300 kbytes. Decompres-
258 sion speed is also halved, so you should use this option
259 only where necessary. The relevant flag is -s.
260
261 In general, try and use the largest block size memory con-
262 straints allow, since that maximises the compression
263 achieved. Compression and decompression speed are virtu-
264 ally unaffected by block size.
265
266 Another significant point applies to files which fit in a
267 single block -- that means most files you'd encounter
268 using a large block size. The amount of real memory
269 touched is proportional to the size of the file, since the
270 file is smaller than a block. For example, compressing a
271 file 20,000 bytes long with the flag -9 will cause the
272 compressor to allocate around 6700k of memory, but only
273 touch 400k + 20000 * 7 = 540 kbytes of it. Similarly, the
274 decompressor will allocate 3700k but only touch 100k +
275 20000 * 4 = 180 kbytes.
276
277 Here is a table which summarises the maximum memory usage
278 for different block sizes. Also recorded is the total
279 compressed size for 14 files of the Calgary Text Compres-
280 sion Corpus totalling 3,141,622 bytes. This column gives
281 some feel for how compression varies with block size.
282 These figures tend to understate the advantage of larger
283 block sizes for larger files, since the Corpus is domi-
284 nated by smaller files.
285
286 Compress Decompress Decompress Corpus
287 Flag usage usage -s usage Size
288
289 -1 1100k 500k 350k 914704
290 -2 1800k 900k 600k 877703
291 -3 2500k 1300k 850k 860338
292 -4 3200k 1700k 1100k 846899
293 -5 3900k 2100k 1350k 845160
294 -6 4600k 2500k 1600k 838626
295 -7 5400k 2900k 1850k 834096
296 -8 6000k 3300k 2100k 828642
297 -9 6700k 3700k 2350k 828642
298
299
300OPTIONS
301 -c --stdout
302 Compress or decompress to standard output. -c will
303 decompress multiple files to stdout, but will only
304 compress a single file to stdout.
305
306 -d --decompress
307 Force decompression. bzip2, bunzip2 and bzcat are
308 really the same program, and the decision about
309 what actions to take is done on the basis of which
310 name is used. This flag overrides that mechanism,
311 and forces bzip2 to decompress.
312
313 -z --compress
314 The complement to -d: forces compression, regard-
315 less of the invokation name.
316
317 -t --test
318 Check integrity of the specified file(s), but don't
319 decompress them. This really performs a trial
320 decompression and throws away the result.
321
322 -f --force
323 Force overwrite of output files. Normally, bzip2
324 will not overwrite existing output files.
325
326 -k --keep
327 Keep (don't delete) input files during compression
328 or decompression.
329
330 -s --small
331 Reduce memory usage, for compression, decompression
332 and testing. Files are decompressed and tested
333 using a modified algorithm which only requires 2.5
334 bytes per block byte. This means any file can be
335 decompressed in 2300k of memory, albeit at about
336 half the normal speed.
337
338 During compression, -s selects a block size of
339 200k, which limits memory use to around the same
340 figure, at the expense of your compression ratio.
341 In short, if your machine is low on memory (8
342 megabytes or less), use -s for everything. See
343 MEMORY MANAGEMENT above.
344
345 -v --verbose
346 Verbose mode -- show the compression ratio for each
347 file processed. Further -v's increase the ver-
348 bosity level, spewing out lots of information which
349 is primarily of interest for diagnostic purposes.
350
351 -L --license -V --version
352 Display the software version, license terms and
353 conditions.
354
355 -1 to -9
356 Set the block size to 100 k, 200 k .. 900 k when
357 compressing. Has no effect when decompressing.
358 See MEMORY MANAGEMENT above.
359
360 --repetitive-fast
361 bzip2 injects some small pseudo-random variations
362 into very repetitive blocks to limit worst-case
363 performance during compression. If sorting runs
364 into difficulties, the block is randomised, and
365 sorting is restarted. Very roughly, bzip2 persists
366 for three times as long as a well-behaved input
367 would take before resorting to randomisation. This
368 flag makes it give up much sooner.
369
370 --repetitive-best
371 Opposite of --repetitive-fast; try a lot harder
372 before resorting to randomisation.
373
374
375RECOVERING DATA FROM DAMAGED FILES
376 bzip2 compresses files in blocks, usually 900kbytes long.
377 Each block is handled independently. If a media or trans-
378 mission error causes a multi-block .bz2 file to become
379 damaged, it may be possible to recover data from the
380 undamaged blocks in the file.
381
382 The compressed representation of each block is delimited
383 by a 48-bit pattern, which makes it possible to find the
384 block boundaries with reasonable certainty. Each block
385 also carries its own 32-bit CRC, so damaged blocks can be
386 distinguished from undamaged ones.
387
388 bzip2recover is a simple program whose purpose is to
389 search for blocks in .bz2 files, and write each block out
390 into its own .bz2 file. You can then use bzip2 -t to test
391 the integrity of the resulting files, and decompress those
392 which are undamaged.
393
394 bzip2recover takes a single argument, the name of the dam-
395 aged file, and writes a number of files "rec0001file.bz2",
396 "rec0002file.bz2", etc, containing the extracted blocks.
397 The output filenames are designed so that the use of
398 wildcards in subsequent processing -- for example, "bzip2
399 -dc rec*file.bz2 > recovered_data" -- lists the files in
400 the "right" order.
401
402 bzip2recover should be of most use dealing with large .bz2
403 files, as these will contain many blocks. It is clearly
404 futile to use it on damaged single-block files, since a
405 damaged block cannot be recovered. If you wish to min-
406 imise any potential data loss through media or transmis-
407 sion errors, you might consider compressing with a smaller
408 block size.
409
410
411PERFORMANCE NOTES
412 The sorting phase of compression gathers together similar
413 strings in the file. Because of this, files containing
414 very long runs of repeated symbols, like "aabaabaabaab
415 ..." (repeated several hundred times) may compress
416 extraordinarily slowly. You can use the -vvvvv option to
417 monitor progress in great detail, if you want. Decompres-
418 sion speed is unaffected.
419
420 Such pathological cases seem rare in practice, appearing
421 mostly in artificially-constructed test files, and in low-
422 level disk images. It may be inadvisable to use bzip2 to
423 compress the latter. If you do get a file which causes
424 severe slowness in compression, try making the block size
425 as small as possible, with flag -1.
426
427 bzip2 usually allocates several megabytes of memory to
428 operate in, and then charges all over it in a fairly ran-
429 dom fashion. This means that performance, both for com-
430 pressing and decompressing, is largely determined by the
431 speed at which your machine can service cache misses.
432 Because of this, small changes to the code to reduce the
433 miss rate have been observed to give disproportionately
434 large performance improvements. I imagine bzip2 will per-
435 form best on machines with very large caches.
436
437
438CAVEATS
439 I/O error messages are not as helpful as they could be.
440 Bzip2 tries hard to detect I/O errors and exit cleanly,
441 but the details of what the problem is sometimes seem
442 rather misleading.
443
444 This manual page pertains to version 0.9.0 of bzip2. Com-
445 pressed data created by this version is entirely forwards
446 and backwards compatible with the previous public release,
447 version 0.1pl2, but with the following exception: 0.9.0
448 can correctly decompress multiple concatenated compressed
449 files. 0.1pl2 cannot do this; it will stop after decom-
450 pressing just the first file in the stream.
451
452 Wildcard expansion for Windows 95 and NT is flaky.
453
454 bzip2recover uses 32-bit integers to represent bit posi-
455 tions in compressed files, so it cannot handle compressed
456 files more than 512 megabytes long. This could easily be
457 fixed.
458
459
460AUTHOR
461 Julian Seward, jseward@@acm.org.
462
463 The ideas embodied in bzip2 are due to (at least) the fol-
464 lowing people: Michael Burrows and David Wheeler (for the
465 block sorting transformation), David Wheeler (again, for
466 the Huffman coder), Peter Fenwick (for the structured cod-
467 ing model in the original bzip, and many refinements), and
468 Alistair Moffat, Radford Neal and Ian Witten (for the
469 arithmetic coder in the original bzip). I am much
470 indebted for their help, support and advice. See the man-
471 ual in the source distribution for pointers to sources of
472 documentation. Christian von Roques encouraged me to look
473 for faster sorting algorithms, so as to speed up compres-
474 sion. Bela Lubkin encouraged me to improve the worst-case
475 compression performance. Many people sent patches, helped
476 with portability problems, lent machines, gave advice and
477 were generally helpful.
478@end example
479
480
481
482
483
484@chapter Programming with @code{libbzip2}
485
486This chapter describes the programming interface to @code{libbzip2}.
487
488For general background information, particularly about memory
489use and performance aspects, you'd be well advised to read Chapter 2
490as well.
491
492@section Top-level structure
493
494@code{libbzip2} is a flexible library for compressing and decompressing
495data in the @code{bzip2} data format. Although packaged as a single
496entity, it helps to regard the library as three separate parts: the low
497level interface, and the high level interface, and some utility
498functions.
499
500The structure of @code{libbzip2}'s interfaces is similar to
501that of Jean-loup Gailly's and Mark Adler's excellent @code{zlib}
502library.
503
504@subsection Low-level summary
505
506This interface provides services for compressing and decompressing
507data in memory. There's no provision for dealing with files, streams
508or any other I/O mechanisms, just straight memory-to-memory work.
509In fact, this part of the library can be compiled without inclusion
510of @code{stdio.h}, which may be helpful for embedded applications.
511
512The low-level part of the library has no global variables and
513is therefore thread-safe.
514
515Six routines make up the low level interface:
516@code{bzCompressInit}, @code{bzCompress}, and @* @code{bzCompressEnd}
517for compression,
518and a corresponding trio @code{bzDecompressInit}, @* @code{bzDecompress}
519and @code{bzDecompressEnd} for decompression.
520The @code{*Init} functions allocate
521memory for compression/decompression and do other
522initialisations, whilst the @code{*End} functions close down operations
523and release memory.
524
525The real work is done by @code{bzCompress} and @code{bzDecompress}.
526These compress/decompress data from a user-supplied input buffer
527to a user-supplied output buffer. These buffers can be any size;
528arbitrary quantities of data are handled by making repeated calls
529to these functions. This is a flexible mechanism allowing a
530consumer-pull style of activity, or producer-push, or a mixture of
531both.
532
533
534
535@subsection High-level summary
536
537This interface provides some handy wrappers around the low-level
538interface to facilitate reading and writing @code{bzip2} format
539files (@code{.bz2} files). The routines provide hooks to facilitate
540reading files in which the @code{bzip2} data stream is embedded
541within some larger-scale file structure, or where there are
542multiple @code{bzip2} data streams concatenated end-to-end.
543
544For reading files, @code{bzReadOpen}, @code{bzRead}, @code{bzReadClose}
545and @code{bzReadGetUnused} are supplied. For writing files,
546@code{bzWriteOpen}, @code{bzWrite} and @code{bzWriteFinish} are
547available.
548
549As with the low-level library, no global variables are used
550so the library is per se thread-safe. However, if I/O errors
551occur whilst reading or writing the underlying compressed files,
552you may have to consult @code{errno} to determine the cause of
553the error. In that case, you'd need a C library which correctly
554supports @code{errno} in a multithreaded environment.
555
556To make the library a little simpler and more portable,
557@code{bzReadOpen} and @code{bzWriteOpen} require you to pass them file
558handles (@code{FILE*}s) which have previously been opened for reading or
559writing respectively. That avoids portability problems associated with
560file operations and file attributes, whilst not being much of an
561imposition on the programmer.
562
563
564
565@subsection Utility functions summary
566For very simple needs, @code{bzBuffToBuffCompress} and
567@code{bzBuffToBuffDecompress} are provided. These compress
568data in memory from one buffer to another buffer in a single
569function call. You should assess whether these functions
570fulfill your memory-to-memory compression/decompression
571requirements before investing effort in understanding the more
572general but more complex low-level interface.
573
574Yoshioka Tsuneo (@code{QWF00133@@niftyserve.or.jp} /
575@code{tsuneo-y@@is.aist-nara.ac.jp}) has contributed some functions to
576give better @code{zlib} compatibility. These functions are
577@code{bzopen}, @code{bzread}, @code{bzwrite}, @code{bzflush},
578@code{bzclose},
579@code{bzerror} and @code{bzlibVersion}. You may find these functions
580more convenient for simple file reading and writing, than those in the
581high-level interface. These functions are not (yet) officially part of
582the library, and are not further documented here. If they break, you
583get to keep all the pieces. I hope to document them properly when time
584permits.
585
586Yoshioka also contributed modifications to allow the library to be
587built as a Windows DLL.
588
589
590@section Error handling
591
592The library is designed to recover cleanly in all situations, including
593the worst-case situation of decompressing random data. I'm not
594100% sure that it can always do this, so you might want to add
595a signal handler to catch segmentation violations during decompression
596if you are feeling especially paranoid. I would be interested in
597hearing more about the robustness of the library to corrupted
598compressed data.
599
600The file @code{bzlib.h} contains all definitions needed to use
601the library. In particular, you should definitely not include
602@code{bzlib_private.h}.
603
604In @code{bzlib.h}, the various return values are defined. The following
605list is not intended as an exhaustive description of the circumstances
606in which a given value may be returned -- those descriptions are given
607later. Rather, it is intended to convey the rough meaning of each
608return value. The first five actions are normal and not intended to
609denote an error situation.
610@table @code
611@item BZ_OK
612The requested action was completed successfully.
613@item BZ_RUN_OK
614@itemx BZ_FLUSH_OK
615@itemx BZ_FINISH_OK
616In @code{bzCompress}, the requested flush/finish/nothing-special action
617was completed successfully.
618@item BZ_STREAM_END
619Compression of data was completed, or the logical stream end was
620detected during decompression.
621@end table
622
623The following return values indicate an error of some kind.
624@table @code
625@item BZ_SEQUENCE_ERROR
626When using the library, it is important to call the functions in the
627correct sequence and with data structures (buffers etc) in the correct
628states. @code{libbzip2} checks as much as it can to ensure this is
629happening, and returns @code{BZ_SEQUENCE_ERROR} if not. Code which
630complies precisely with the function semantics, as detailed below,
631should never receive this value; such an event denotes buggy code
632which you should investigate.
633@item BZ_PARAM_ERROR
634Returned when a parameter to a function call is out of range
635or otherwise manifestly incorrect. As with @code{BZ_SEQUENCE_ERROR},
636this denotes a bug in the client code. The distinction between
637@code{BZ_PARAM_ERROR} and @code{BZ_SEQUENCE_ERROR} is a bit hazy, but still worth
638making.
639@item BZ_MEM_ERROR
640Returned when a request to allocate memory failed. Note that the
641quantity of memory needed to decompress a stream cannot be determined
642until the stream's header has been read. So @code{bzDecompress} and
643@code{bzRead} may return @code{BZ_MEM_ERROR} even though some of
644the compressed data has been read. The same is not true for
645compression; once @code{bzCompressInit} or @code{bzWriteOpen} have
646successfully completed, @code{BZ_MEM_ERROR} cannot occur.
647@item BZ_DATA_ERROR
648Returned when a data integrity error is detected during decompression.
649Most importantly, this means when stored and computed CRCs for the
650data do not match. This value is also returned upon detection of any
651other anomaly in the compressed data.
652@item BZ_DATA_ERROR_MAGIC
653As a special case of @code{BZ_DATA_ERROR}, it is sometimes useful to
654know when the compressed stream does not start with the correct
655magic bytes (@code{'B' 'Z' 'h'}).
656@item BZ_IO_ERROR
657Returned by @code{bzRead} and @code{bzRead} when there is an error
658reading or writing in the compressed file, and by @code{bzReadOpen}
659and @code{bzWriteOpen} for attempts to use a file for which the
660error indicator (viz, @code{ferror(f)}) is set.
661On receipt of @code{BZ_IO_ERROR}, the caller should consult
662@code{errno} and/or @code{perror} to acquire operating-system
663specific information about the problem.
664@item BZ_UNEXPECTED_EOF
665Returned by @code{bzRead} when the compressed file finishes
666before the logical end of stream is detected.
667@item BZ_OUTBUFF_FULL
668Returned by @code{bzBuffToBuffCompress} and
669@code{bzBuffToBuffDecompress} to indicate that the output data
670will not fit into the output buffer provided.
671@end table
672
673
674
675@section Low-level interface
676
677@subsection @code{bzCompressInit}
678@example
679typedef
680 struct @{
681 char *next_in;
682 unsigned int avail_in;
683 unsigned int total_in;
684
685 char *next_out;
686 unsigned int avail_out;
687 unsigned int total_out;
688
689 void *state;
690
691 void *(*bzalloc)(void *,int,int);
692 void (*bzfree)(void *,void *);
693 void *opaque;
694 @}
695 bz_stream;
696
697int bzCompressInit ( bz_stream *strm,
698 int blockSize100k,
699 int verbosity,
700 int workFactor );
701
702@end example
703
704Prepares for compression. The @code{bz_stream} structure
705holds all data pertaining to the compression activity.
706A @code{bz_stream} structure should be allocated and initialised
707prior to the call.
708The fields of @code{bz_stream}
709comprise the entirety of the user-visible data. @code{state}
710is a pointer to the private data structures required for compression.
711
712Custom memory allocators are supported, via fields @code{bzalloc},
713@code{bzfree},
714and @code{opaque}. The value
715@code{opaque} is passed to as the first argument to
716all calls to @code{bzalloc} and @code{bzfree}, but is
717otherwise ignored by the library.
718The call @code{bzalloc ( opaque, n, m )} is expected to return a
719pointer @code{p} to
720@code{n * m} bytes of memory, and @code{bzfree ( opaque, p )}
721should free
722that memory.
723
724If you don't want to use a custom memory allocator, set @code{bzalloc},
725@code{bzfree} and
726@code{opaque} to @code{NULL},
727and the library will then use the standard @code{malloc}/@code{free}
728routines.
729
730Before calling @code{bzCompressInit}, fields @code{bzalloc},
731@code{bzfree} and @code{opaque} should
732be filled appropriately, as just described. Upon return, the internal
733state will have been allocated and initialised, and @code{total_in} and
734@code{total_out} will have been set to zero.
735These last two fields are used by the library
736to inform the caller of the total amount of data passed into and out of
737the library, respectively. You should not try to change them.
738
739Parameter @code{blockSize100k} specifies the block size to be used for
740compression. It should be a value between 1 and 9 inclusive, and the
741actual block size used is 100000 x this figure. 9 gives the best
742compression but takes most memory.
743
744Parameter @code{verbosity} should be set to a number between 0 and 4
745inclusive. 0 is silent, and greater numbers give increasingly verbose
746monitoring/debugging output. If the library has been compiled with
747@code{-DBZ_NO_STDIO}, no such output will appear for any verbosity
748setting.
749
750Parameter @code{workFactor} controls how the compression phase behaves
751when presented with worst case, highly repetitive, input data.
752If compression runs into difficulties caused by repetitive data,
753some pseudo-random variations are inserted into the block, and
754compression is restarted. Lower values of @code{workFactor}
755reduce the tolerance of compression to repetitive data.
756You should set this parameter carefully; too low, and
757compression ratio suffers, too high, and your average-to-worst
758case compression times can become very large.
759The default value of 30
760gives reasonable behaviour over a wide range of circumstances.
761
762Allowable values range from 0 to 250 inclusive. 0 is a special
763case, equivalent to using the default value of 30.
764
765Note that the randomisation process is entirely transparent.
766If the library decides to randomise and restart compression on a
767block, it does so without comment. Randomised blocks are
768automatically de-randomised during decompression, so data
769integrity is never compromised.
770
771Possible return values:
772@display
773 @code{BZ_PARAM_ERROR}
774 if @code{strm} is @code{NULL}
775 or @code{blockSize} < 1 or @code{blockSize} > 9
776 or @code{verbosity} < 0 or @code{verbosity} > 4
777 or @code{workFactor} < 0 or @code{workFactor} > 250
778 @code{BZ_MEM_ERROR}
779 if not enough memory is available
780 @code{BZ_OK}
781 otherwise
782@end display
783Allowable next actions:
784@display
785 @code{bzCompress}
786 if @code{BZ_OK} is returned
787 no specific action needed in case of error
788@end display
789
790@subsection @code{bzCompress}
791@example
792 int bzCompress ( bz_stream *strm, int action );
793@end example
794Provides more input and/or output buffer space for the library. The
795caller maintains input and output buffers, and calls @code{bzCompress} to
796transfer data between them.
797
798Before each call to @code{bzCompress}, @code{next_in} should point at
799the data to be compressed, and @code{avail_in} should indicate how many
800bytes the library may read. @code{bzCompress} updates @code{next_in},
801@code{avail_in} and @code{total_in} to reflect the number of bytes it
802has read.
803
804Similarly, @code{next_out} should point to a buffer in which the
805compressed data is to be placed, with @code{avail_out} indicating how
806much output space is available. @code{bzCompress} updates
807@code{next_out}, @code{avail_out} and @code{total_out} to reflect the
808number of bytes output.
809
810You may provide and remove as little or as much data as you like on each
811call of @code{bzCompress}. In the limit, it is acceptable to supply and
812remove data one byte at a time, although this would be terribly
813inefficient. You should always ensure that at least one byte of output
814space is available at each call.
815
816A second purpose of @code{bzCompress} is to request a change of mode of the
817compressed stream.
818
819Conceptually, a compressed stream can be in one of four states: IDLE,
820RUNNING, FLUSHING and FINISHING. Before initialisation
821(@code{bzCompressInit}) and after termination (@code{bzCompressEnd}), a
822stream is regarded as IDLE.
823
824Upon initialisation (@code{bzCompressInit}), the stream is placed in the
825RUNNING state. Subsequent calls to @code{bzCompress} should pass
826@code{BZ_RUN} as the requested action; other actions are illegal and
827will result in @code{BZ_SEQUENCE_ERROR}.
828
829At some point, the calling program will have provided all the input data
830it wants to. It will then want to finish up -- in effect, asking the
831library to process any data it might have buffered internally. In this
832state, @code{bzCompress} will no longer attempt to read data from
833@code{next_in}, but it will want to write data to @code{next_out}.
834Because the output buffer supplied by the user can be arbitrarily small,
835the finishing-up operation cannot necessarily be done with a single call
836of @code{bzCompress}.
837
838Instead, the calling program passes @code{BZ_FINISH} as an action to
839@code{bzCompress}. This changes the stream's state to FINISHING. Any
840remaining input (ie, @code{next_in[0 .. avail_in-1]}) is compressed and
841transferred to the output buffer. To do this, @code{bzCompress} must be
842called repeatedly until all the output has been consumed. At that
843point, @code{bzCompress} returns @code{BZ_STREAM_END}, and the stream's
844state is set back to IDLE. @code{bzCompressEnd} should then be
845called.
846
847Just to make sure the calling program does not cheat, the library makes
848a note of @code{avail_in} at the time of the first call to
849@code{bzCompress} which has @code{BZ_FINISH} as an action (ie, at the
850time the program has announced its intention to not supply any more
851input). By comparing this value with that of @code{avail_in} over
852subsequent calls to @code{bzCompress}, the library can detect any
853attempts to slip in more data to compress. Any calls for which this is
854detected will return @code{BZ_SEQUENCE_ERROR}. This indicates a
855programming mistake which should be corrected.
856
857Instead of asking to finish, the calling program may ask
858@code{bzCompress} to take all the remaining input, compress it and
859terminate the current (Burrows-Wheeler) compression block. This could
860be useful for error control purposes. The mechanism is analogous to
861that for finishing: call @code{bzCompress} with an action of
862@code{BZ_FLUSH}, remove output data, and persist with the
863@code{BZ_FLUSH} action until the value @code{BZ_RUN} is returned. As
864with finishing, @code{bzCompress} detects any attempt to provide more
865input data once the flush has begun.
866
867Once the flush is complete, the stream returns to the normal RUNNING
868state.
869
870This all sounds pretty complex, but isn't really. Here's a table
871which shows which actions are allowable in each state, what action
872will be taken, what the next state is, and what the non-error return
873values are. Note that you can't explicitly ask what state the
874stream is in, but nor do you need to -- it can be inferred from the
875values returned by @code{bzCompress}.
876@display
877IDLE/@code{any}
878 Illegal. IDLE state only exists after @code{bzCompressEnd} or
879 before @code{bzCompressInit}.
880 Return value = @code{BZ_SEQUENCE_ERROR}
881
882RUNNING/@code{BZ_RUN}
883 Compress from @code{next_in} to @code{next_out} as much as possible.
884 Next state = RUNNING
885 Return value = @code{BZ_RUN_OK}
886
887RUNNING/@code{BZ_FLUSH}
888 Remember current value of @code{next_in}. Compress from @code{next_in}
889 to @code{next_out} as much as possible, but do not accept any more input.
890 Next state = FLUSHING
891 Return value = @code{BZ_FLUSH_OK}
892
893RUNNING/@code{BZ_FINISH}
894 Remember current value of @code{next_in}. Compress from @code{next_in}
895 to @code{next_out} as much as possible, but do not accept any more input.
896 Next state = FINISHING
897 Return value = @code{BZ_FINISH_OK}
898
899FLUSHING/@code{BZ_FLUSH}
900 Compress from @code{next_in} to @code{next_out} as much as possible,
901 but do not accept any more input.
902 If all the existing input has been used up and all compressed
903 output has been removed
904 Next state = RUNNING; Return value = @code{BZ_RUN_OK}
905 else
906 Next state = FLUSHING; Return value = @code{BZ_FLUSH_OK}
907
908FLUSHING/other
909 Illegal.
910 Return value = @code{BZ_SEQUENCE_ERROR}
911
912FINISHING/@code{BZ_FINISH}
913 Compress from @code{next_in} to @code{next_out} as much as possible,
914 but to not accept any more input.
915 If all the existing input has been used up and all compressed
916 output has been removed
917 Next state = IDLE; Return value = @code{BZ_STREAM_END}
918 else
919 Next state = FINISHING; Return value = @code{BZ_FINISHING}
920
921FINISHING/other
922 Illegal.
923 Return value = @code{BZ_SEQUENCE_ERROR}
924@end display
925
926That still looks complicated? Well, fair enough. The usual sequence
927of calls for compressing a load of data is:
928@itemize @bullet
929@item Get started with @code{bzCompressInit}.
930@item Shovel data in and shlurp out its compressed form using zero or more
931calls of @code{bzCompress} with action = @code{BZ_RUN}.
932@item Finish up.
933Repeatedly call @code{bzCompress} with action = @code{BZ_FINISH},
934copying out the compressed output, until @code{BZ_STREAM_END} is returned.
935@item Close up and go home. Call @code{bzCompressEnd}.
936@end itemize
937If the data you want to compress fits into your input buffer all
938at once, you can skip the calls of @code{bzCompress ( ..., BZ_RUN )} and
939just do the @code{bzCompress ( ..., BZ_FINISH )} calls.
940
941All required memory is allocated by @code{bzCompressInit}. The
942compression library can accept any data at all (obviously). So you
943shouldn't get any error return values from the @code{bzCompress} calls.
944If you do, they will be @code{BZ_SEQUENCE_ERROR}, and indicate a bug in
945your programming.
946
947Trivial other possible return values:
948@display
949 @code{BZ_PARAM_ERROR}
950 if @code{strm} is @code{NULL}, or @code{strm->s} is @code{NULL}
951@end display
952
953@subsection @code{bzCompressEnd}
954@example
955int bzCompressEnd ( bz_stream *strm );
956@end example
957Releases all memory associated with a compression stream.
958
959Possible return values:
960@display
961 @code{BZ_PARAM_ERROR} if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL}
962 @code{BZ_OK} otherwise
963@end display
964
965
966@subsection @code{bzDecompressInit}
967@example
968int bzDecompressInit ( bz_stream *strm, int verbosity, int small );
969@end example
970Prepares for decompression. As with @code{bzCompressInit}, a
971@code{bz_stream} record should be allocated and initialised before the
972call. Fields @code{bzalloc}, @code{bzfree} and @code{opaque} should be
973set if a custom memory allocator is required, or made @code{NULL} for
974the normal @code{malloc}/@code{free} routines. Upon return, the internal
975state will have been initialised, and @code{total_in} and
976@code{total_out} will be zero.
977
978For the meaning of parameter @code{verbosity}, see @code{bzCompressInit}.
979
980If @code{small} is nonzero, the library will use an alternative
981decompression algorithm which uses less memory but at the cost of
982decompressing more slowly (roughly speaking, half the speed, but the
983maximum memory requirement drops to around 2300k). See Chapter 2 for
984more information on memory management.
985
986Note that the amount of memory needed to decompress
987a stream cannot be determined until the stream's header has been read,
988so even if @code{bzDecompressInit} succeeds, a subsequent
989@code{bzDecompress} could fail with @code{BZ_MEM_ERROR}.
990
991Possible return values:
992@display
993 @code{BZ_PARAM_ERROR}
994 if @code{(small != 0 && small != 1)}
995 or @code{(verbosity < 0 || verbosity > 4)}
996 @code{BZ_MEM_ERROR}
997 if insufficient memory is available
998@end display
999
1000Allowable next actions:
1001@display
1002 @code{bzDecompress}
1003 if @code{BZ_OK} was returned
1004 no specific action required in case of error
1005@end display
1006
1007
1008
1009@subsection @code{bzDecompress}
1010@example
1011int bzDecompress ( bz_stream *strm );
1012@end example
1013Provides more input and/out output buffer space for the library. The
1014caller maintains input and output buffers, and uses @code{bzDecompress}
1015to transfer data between them.
1016
1017Before each call to @code{bzDecompress}, @code{next_in}
1018should point at the compressed data,
1019and @code{avail_in} should indicate how many bytes the library
1020may read. @code{bzDecompress} updates @code{next_in}, @code{avail_in}
1021and @code{total_in}
1022to reflect the number of bytes it has read.
1023
1024Similarly, @code{next_out} should point to a buffer in which the uncompressed
1025output is to be placed, with @code{avail_out} indicating how much output space
1026is available. @code{bzCompress} updates @code{next_out},
1027@code{avail_out} and @code{total_out} to reflect
1028the number of bytes output.
1029
1030You may provide and remove as little or as much data as you like on
1031each call of @code{bzDecompress}.
1032In the limit, it is acceptable to
1033supply and remove data one byte at a time, although this would be
1034terribly inefficient. You should always ensure that at least one
1035byte of output space is available at each call.
1036
1037Use of @code{bzDecompress} is simpler than @code{bzCompress}.
1038
1039You should provide input and remove output as described above, and
1040repeatedly call @code{bzDecompress} until @code{BZ_STREAM_END} is
1041returned. Appearance of @code{BZ_STREAM_END} denotes that
1042@code{bzDecompress} has detected the logical end of the compressed
1043stream. @code{bzDecompress} will not produce @code{BZ_STREAM_END} until
1044all output data has been placed into the output buffer, so once
1045@code{BZ_STREAM_END} appears, you are guaranteed to have available all
1046the decompressed output, and @code{bzDecompressEnd} can safely be
1047called.
1048
1049If case of an error return value, you should call @code{bzDecompressEnd}
1050to clean up and release memory.
1051
1052Possible return values:
1053@display
1054 @code{BZ_PARAM_ERROR}
1055 if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL}
1056 or @code{strm->avail_out < 1}
1057 @code{BZ_DATA_ERROR}
1058 if a data integrity error is detected in the compressed stream
1059 @code{BZ_DATA_ERROR_MAGIC}
1060 if the compressed stream doesn't begin with the right magic bytes
1061 @code{BZ_MEM_ERROR}
1062 if there wasn't enough memory available
1063 @code{BZ_STREAM_END}
1064 if the logical end of the data stream was detected and all
1065 output in has been consumed, eg @code{s->avail_out > 0}
1066 @code{BZ_OK}
1067 otherwise
1068@end display
1069Allowable next actions:
1070@display
1071 @code{bzDecompress}
1072 if @code{BZ_OK} was returned
1073 @code{bzDecompressEnd}
1074 otherwise
1075@end display
1076
1077
1078@subsection @code{bzDecompressEnd}
1079@example
1080int bzDecompressEnd ( bz_stream *strm );
1081@end example
1082Releases all memory associated with a decompression stream.
1083
1084Possible return values:
1085@display
1086 @code{BZ_PARAM_ERROR}
1087 if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL}
1088 @code{BZ_OK}
1089 otherwise
1090@end display
1091
1092Allowable next actions:
1093@display
1094 None.
1095@end display
1096
1097
1098@section High-level interface
1099
1100This interface provides functions for reading and writing
1101@code{bzip2} format files. First, some general points.
1102
1103@itemize @bullet
1104@item All of the functions take an @code{int*} first argument,
1105 @code{bzerror}.
1106 After each call, @code{bzerror} should be consulted first to determine
1107 the outcome of the call. If @code{bzerror} is @code{BZ_OK},
1108 the call completed
1109 successfully, and only then should the return value of the function
1110 (if any) be consulted. If @code{bzerror} is @code{BZ_IO_ERROR},
1111 there was an error
1112 reading/writing the underlying compressed file, and you should
1113 then consult @code{errno}/@code{perror} to determine the
1114 cause of the difficulty.
1115 @code{bzerror} may also be set to various other values; precise details are
1116 given on a per-function basis below.
1117@item If @code{bzerror} indicates an error
1118 (ie, anything except @code{BZ_OK} and @code{BZ_STREAM_END}),
1119 you should immediately call @code{bzReadClose} (or @code{bzWriteClose},
1120 depending on whether you are attempting to read or to write)
1121 to free up all resources associated
1122 with the stream. Once an error has been indicated, behaviour of all calls
1123 except @code{bzReadClose} (@code{bzWriteClose}) is undefined.
1124 The implication is that (1) @code{bzerror} should
1125 be checked after each call, and (2) if @code{bzerror} indicates an error,
1126 @code{bzReadClose} (@code{bzWriteClose}) should then be called to clean up.
1127@item The @code{FILE*} arguments passed to
1128 @code{bzReadOpen}/@code{bzWriteOpen}
1129 should be set to binary mode.
1130 Most Unix systems will do this by default, but other platforms,
1131 including Windows and Mac, will not. If you omit this, you may
1132 encounter problems when moving code to new platforms.
1133@item Memory allocation requests are handled by
1134 @code{malloc}/@code{free}.
1135 At present
1136 there is no facility for user-defined memory allocators in the file I/O
1137 functions (could easily be added, though).
1138@end itemize
1139
1140
1141
1142@subsection @code{bzReadOpen}
1143@example
1144 typedef void BZFILE;
1145
1146 BZFILE *bzReadOpen ( int *bzerror, FILE *f,
1147 int small, int verbosity,
1148 void *unused, int nUnused );
1149@end example
1150Prepare to read compressed data from file handle @code{f}. @code{f}
1151should refer to a file which has been opened for reading, and for which
1152the error indicator (@code{ferror(f)})is not set. If @code{small} is 1,
1153the library will try to decompress using less memory, at the expense of
1154speed.
1155
1156For reasons explained below, @code{bzRead} will decompress the
1157@code{nUnused} bytes starting at @code{unused}, before starting to read
1158from the file @code{f}. At most @code{BZ_MAX_UNUSED} bytes may be
1159supplied like this. If this facility is not required, you should pass
1160@code{NULL} and @code{0} for @code{unused} and n@code{Unused}
1161respectively.
1162
1163For the meaning of parameters @code{small} and @code{verbosity},
1164see @code{bzDecompressInit}.
1165
1166The amount of memory needed to decompress a file cannot be determined
1167until the file's header has been read. So it is possible that
1168@code{bzReadOpen} returns @code{BZ_OK} but a subsequent call of
1169@code{bzRead} will return @code{BZ_MEM_ERROR}.
1170
1171Possible assignments to @code{bzerror}:
1172@display
1173 @code{BZ_PARAM_ERROR}
1174 if @code{f} is @code{NULL}
1175 or @code{small} is neither @code{0} nor @code{1}
1176 or @code{(unused == NULL && nUnused != 0)}
1177 or @code{(unused != NULL && !(0 <= nUnused <= BZ_MAX_UNUSED))}
1178 @code{BZ_IO_ERROR}
1179 if @code{ferror(f)} is nonzero
1180 @code{BZ_MEM_ERROR}
1181 if insufficient memory is available
1182 @code{BZ_OK}
1183 otherwise.
1184@end display
1185
1186Possible return values:
1187@display
1188 Pointer to an abstract @code{BZFILE}
1189 if @code{bzerror} is @code{BZ_OK}
1190 @code{NULL}
1191 otherwise
1192@end display
1193
1194Allowable next actions:
1195@display
1196 @code{bzRead}
1197 if @code{bzerror} is @code{BZ_OK}
1198 @code{bzClose}
1199 otherwise
1200@end display
1201
1202
1203@subsection @code{bzRead}
1204@example
1205 int bzRead ( int *bzerror, BZFILE *b, void *buf, int len );
1206@end example
1207Reads up to @code{len} (uncompressed) bytes from the compressed file
1208@code{b} into
1209the buffer @code{buf}. If the read was successful,
1210@code{bzerror} is set to @code{BZ_OK}
1211and the number of bytes read is returned. If the logical end-of-stream
1212was detected, @code{bzerror} will be set to @code{BZ_STREAM_END},
1213and the number
1214of bytes read is returned. All other @code{bzerror} values denote an error.
1215
1216@code{bzRead} will supply @code{len} bytes,
1217unless the logical stream end is detected
1218or an error occurs. Because of this, it is possible to detect the
1219stream end by observing when the number of bytes returned is
1220less than the number
1221requested. Nevertheless, this is regarded as inadvisable; you should
1222instead check @code{bzerror} after every call and watch out for
1223@code{BZ_STREAM_END}.
1224
1225Internally, @code{bzRead} copies data from the compressed file in chunks
1226of size @code{BZ_MAX_UNUSED} bytes
1227before decompressing it. If the file contains more bytes than strictly
1228needed to reach the logical end-of-stream, @code{bzRead} will almost certainly
1229read some of the trailing data before signalling @code{BZ_SEQUENCE_END}.
1230To collect the read but unused data once @code{BZ_SEQUENCE_END} has
1231appeared, call @code{bzReadGetUnused} immediately before @code{bzReadClose}.
1232
1233Possible assignments to @code{bzerror}:
1234@display
1235 @code{BZ_PARAM_ERROR}
1236 if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0}
1237 @code{BZ_SEQUENCE_ERROR}
1238 if @code{b} was opened with @code{bzWriteOpen}
1239 @code{BZ_IO_ERROR}
1240 if there is an error reading from the compressed file
1241 @code{BZ_UNEXPECTED_EOF}
1242 if the compressed file ended before the logical end-of-stream was detected
1243 @code{BZ_DATA_ERROR}
1244 if a data integrity error was detected in the compressed stream
1245 @code{BZ_DATA_ERROR_MAGIC}
1246 if the stream does not begin with the requisite header bytes (ie, is not
1247 a @code{bzip2} data file). This is really a special case of @code{BZ_DATA_ERROR}.
1248 @code{BZ_MEM_ERROR}
1249 if insufficient memory was available
1250 @code{BZ_STREAM_END}
1251 if the logical end of stream was detected.
1252 @code{BZ_OK}
1253 otherwise.
1254@end display
1255
1256Possible return values:
1257@display
1258 number of bytes read
1259 if @code{bzerror} is @code{BZ_OK} or @code{BZ_STREAM_END}
1260 undefined
1261 otherwise
1262@end display
1263
1264Allowable next actions:
1265@display
1266 collect data from @code{buf}, then @code{bzRead} or @code{bzReadClose}
1267 if @code{bzerror} is @code{BZ_OK}
1268 collect data from @code{buf}, then @code{bzReadClose} or @code{bzReadGetUnused}
1269 if @code{bzerror} is @code{BZ_SEQUENCE_END}
1270 @code{bzReadClose}
1271 otherwise
1272@end display
1273
1274
1275
1276@subsection @code{bzReadGetUnused}
1277@example
1278 void bzReadGetUnused ( int* bzerror, BZFILE *b,
1279 void** unused, int* nUnused );
1280@end example
1281Returns data which was read from the compressed file but was not needed
1282to get to the logical end-of-stream. @code{*unused} is set to the address
1283of the data, and @code{*nUnused} to the number of bytes. @code{*nUnused} will
1284be set to a value between @code{0} and @code{BZ_MAX_UNUSED} inclusive.
1285
1286This function may only be called once @code{bzRead} has signalled
1287@code{BZ_STREAM_END} but before @code{bzReadClose}.
1288
1289Possible assignments to @code{bzerror}:
1290@display
1291 @code{BZ_PARAM_ERROR}
1292 if @code{b} is @code{NULL}
1293 or @code{unused} is @code{NULL} or @code{nUnused} is @code{NULL}
1294 @code{BZ_SEQUENCE_ERROR}
1295 if @code{BZ_STREAM_END} has not been signalled
1296 or if @code{b} was opened with @code{bzWriteOpen}
1297 @code{BZ_OK}
1298 otherwise
1299@end display
1300
1301Allowable next actions:
1302@display
1303 @code{bzReadClose}
1304@end display
1305
1306
1307@subsection @code{bzReadClose}
1308@example
1309 void bzReadClose ( int *bzerror, BZFILE *b );
1310@end example
1311Releases all memory pertaining to the compressed file @code{b}.
1312@code{bzReadClose} does not call @code{fclose} on the underlying file
1313handle, so you should do that yourself if appropriate.
1314@code{bzReadClose} should be called to clean up after all error
1315situations.
1316
1317Possible assignments to @code{bzerror}:
1318@display
1319 @code{BZ_SEQUENCE_ERROR}
1320 if @code{b} was opened with @code{bzOpenWrite}
1321 @code{BZ_OK}
1322 otherwise
1323@end display
1324
1325Allowable next actions:
1326@display
1327 none
1328@end display
1329
1330
1331
1332@subsection @code{bzWriteOpen}
1333@example
1334 BZFILE *bzWriteOpen ( int *bzerror, FILE *f,
1335 int blockSize100k, int verbosity,
1336 int workFactor );
1337@end example
1338Prepare to write compressed data to file handle @code{f}.
1339@code{f} should refer to
1340a file which has been opened for writing, and for which the error
1341indicator (@code{ferror(f)})is not set.
1342
1343For the meaning of parameters @code{blockSize100k},
1344@code{verbosity} and @code{workFactor}, see
1345@* @code{bzCompressInit}.
1346
1347All required memory is allocated at this stage, so if the call
1348completes successfully, @code{BZ_MEM_ERROR} cannot be signalled by a
1349subsequent call to @code{bzWrite}.
1350
1351Possible assignments to @code{bzerror}:
1352@display
1353 @code{BZ_PARAM_ERROR}
1354 if @code{f} is @code{NULL}
1355 or @code{blockSize100k < 1} or @code{blockSize100k > 9}
1356 @code{BZ_IO_ERROR}
1357 if @code{ferror(f)} is nonzero
1358 @code{BZ_MEM_ERROR}
1359 if insufficient memory is available
1360 @code{BZ_OK}
1361 otherwise
1362@end display
1363
1364Possible return values:
1365@display
1366 Pointer to an abstract @code{BZFILE}
1367 if @code{bzerror} is @code{BZ_OK}
1368 @code{NULL}
1369 otherwise
1370@end display
1371
1372Allowable next actions:
1373@display
1374 @code{bzWrite}
1375 if @code{bzerror} is @code{BZ_OK}
1376 (you could go directly to @code{bzWriteClose}, but this would be pretty pointless)
1377 @code{bzWriteClose}
1378 otherwise
1379@end display
1380
1381
1382
1383@subsection @code{bzWrite}
1384@example
1385 void bzWrite ( int *bzerror, BZFILE *b, void *buf, int len );
1386@end example
1387Absorbs @code{len} bytes from the buffer @code{buf}, eventually to be
1388compressed and written to the file.
1389
1390Possible assignments to @code{bzerror}:
1391@display
1392 @code{BZ_PARAM_ERROR}
1393 if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0}
1394 @code{BZ_SEQUENCE_ERROR}
1395 if b was opened with @code{bzReadOpen}
1396 @code{BZ_IO_ERROR}
1397 if there is an error writing the compressed file.
1398 @code{BZ_OK}
1399 otherwise
1400@end display
1401
1402
1403
1404
1405@subsection @code{bzWriteClose}
1406@example
1407 int bzWriteClose ( int *bzerror, BZFILE* f,
1408 int abandon,
1409 unsigned int* nbytes_in,
1410 unsigned int* nbytes_out );
1411@end example
1412
1413Compresses and flushes to the compressed file all data so far supplied
1414by @code{bzWrite}. The logical end-of-stream markers are also written, so
1415subsequent calls to @code{bzWrite} are illegal. All memory associated
1416with the compressed file @code{b} is released.
1417@code{fflush} is called on the
1418compressed file, but it is not @code{fclose}'d.
1419
1420If @code{bzWriteClose} is called to clean up after an error, the only
1421action is to release the memory. The library records the error codes
1422issued by previous calls, so this situation will be detected
1423automatically. There is no attempt to complete the compression
1424operation, nor to @code{fflush} the compressed file. You can force this
1425behaviour to happen even in the case of no error, by passing a nonzero
1426value to @code{abandon}.
1427
1428If @code{nbytes_in} is non-null, @code{*nbytes_in} will be set to be the
1429total volume of uncompressed data handled. Similarly, @code{nbytes_out}
1430will be set to the total volume of compressed data written.
1431
1432Possible assignments to @code{bzerror}:
1433@display
1434 @code{BZ_SEQUENCE_ERROR}
1435 if @code{b} was opened with @code{bzReadOpen}
1436 @code{BZ_IO_ERROR}
1437 if there is an error writing the compressed file
1438 @code{BZ_OK}
1439 otherwise
1440@end display
1441
1442@subsection Handling embedded compressed data streams
1443
1444The high-level library facilitates use of
1445@code{bzip2} data streams which form some part of a surrounding, larger
1446data stream.
1447@itemize @bullet
1448@item For writing, the library takes an open file handle, writes
1449compressed data to it, @code{fflush}es it but does not @code{fclose} it.
1450The calling application can write its own data before and after the
1451compressed data stream, using that same file handle.
1452@item Reading is more complex, and the facilities are not as general
1453as they could be since generality is hard to reconcile with efficiency.
1454@code{bzRead} reads from the compressed file in blocks of size
1455@code{BZ_MAX_UNUSED} bytes, and in doing so probably will overshoot
1456the logical end of compressed stream.
1457To recover this data once decompression has
1458ended, call @code{bzReadGetUnused} after the last call of @code{bzRead}
1459(the one returning @code{BZ_STREAM_END}) but before calling
1460@code{bzReadClose}.
1461@end itemize
1462
1463This mechanism makes it easy to decompress multiple @code{bzip2}
1464streams placed end-to-end. As the end of one stream, when @code{bzRead}
1465returns @code{BZ_STREAM_END}, call @code{bzReadGetUnused} to collect the
1466unused data (copy it into your own buffer somewhere).
1467That data forms the start of the next compressed stream.
1468To start uncompressing that next stream, call @code{bzReadOpen} again,
1469feeding in the unused data via the @code{unused}/@code{nUnused}
1470parameters.
1471Keep doing this until @code{BZ_STREAM_END} return coincides with the
1472physical end of file (@code{feof(f)}). In this situation
1473@code{bzReadGetUnused}
1474will of course return no data.
1475
1476This should give some feel for how the high-level interface can be used.
1477If you require extra flexibility, you'll have to bite the bullet and get
1478to grips with the low-level interface.
1479
1480@subsection Standard file-reading/writing code
1481Here's how you'd write data to a compressed file:
1482@example @code
1483FILE* f;
1484BZFILE* b;
1485int nBuf;
1486char buf[ /* whatever size you like */ ];
1487int bzerror;
1488int nWritten;
1489
1490f = fopen ( "myfile.bz2", "w" );
1491if (!f) @{
1492 /* handle error */
1493@}
1494b = bzWriteOpen ( &bzerror, f, 9 );
1495if (bzerror != BZ_OK) @{
1496 bzWriteClose ( b );
1497 /* handle error */
1498@}
1499
1500while ( /* condition */ ) @{
1501 /* get data to write into buf, and set nBuf appropriately */
1502 nWritten = bzWrite ( &bzerror, b, buf, nBuf );
1503 if (bzerror == BZ_IO_ERROR) @{
1504 bzWriteClose ( &bzerror, b );
1505 /* handle error */
1506 @}
1507@}
1508
1509bzWriteClose ( &bzerror, b );
1510if (bzerror == BZ_IO_ERROR) @{
1511 /* handle error */
1512@}
1513@end example
1514And to read from a compressed file:
1515@example
1516FILE* f;
1517BZFILE* b;
1518int nBuf;
1519char buf[ /* whatever size you like */ ];
1520int bzerror;
1521int nWritten;
1522
1523f = fopen ( "myfile.bz2", "r" );
1524if (!f) @{
1525 /* handle error */
1526@}
1527b = bzReadOpen ( &bzerror, f, 0, NULL, 0 );
1528if (bzerror != BZ_OK) @{
1529 bzReadClose ( &bzerror, b );
1530 /* handle error */
1531@}
1532
1533bzerror = BZ_OK;
1534while (bzerror == BZ_OK && /* arbitrary other conditions */) @{
1535 nBuf = bzRead ( &bzerror, b, buf, /* size of buf */ );
1536 if (bzerror == BZ_OK) @{
1537 /* do something with buf[0 .. nBuf-1] */
1538 @}
1539@}
1540if (bzerror != BZ_STREAM_END) @{
1541 bzReadClose ( &bzerror, b );
1542 /* handle error */
1543@} else @{
1544 bzReadClose ( &bzerror );
1545@}
1546@end example
1547
1548
1549
1550@section Utility functions
1551@subsection @code{bzBuffToBuffCompress}
1552@example
1553 int bzBuffToBuffCompress( char* dest,
1554 unsigned int* destLen,
1555 char* source,
1556 unsigned int sourceLen,
1557 int blockSize100k,
1558 int verbosity,
1559 int workFactor );
1560@end example
1561Attempts to compress the data in @code{source[0 .. sourceLen-1]}
1562into the destination buffer, @code{dest[0 .. *destLen-1]}.
1563If the destination buffer is big enough, @code{*destLen} is
1564set to the size of the compressed data, and @code{BZ_OK} is
1565returned. If the compressed data won't fit, @code{*destLen}
1566is unchanged, and @code{BZ_OUTBUFF_FULL} is returned.
1567
1568Compression in this manner is a one-shot event, done with a single call
1569to this function. The resulting compressed data is a complete
1570@code{bzip2} format data stream. There is no mechanism for making
1571additional calls to provide extra input data. If you want that kind of
1572mechanism, use the low-level interface.
1573
1574For the meaning of parameters @code{blockSize100k}, @code{verbosity}
1575and @code{workFactor}, @* see @code{bzCompressInit}.
1576
1577To guarantee that the compressed data will fit in its buffer, allocate
1578an output buffer of size 1% larger than the uncompressed data, plus
1579six hundred extra bytes.
1580
1581@code{bzBuffToBuffDecompress} will not write data at or
1582beyond @code{dest[*destLen]}, even in case of buffer overflow.
1583
1584Possible return values:
1585@display
1586 @code{BZ_PARAM_ERROR}
1587 if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL}
1588 or @code{blockSize100k < 1} or @code{blockSize100k > 9}
1589 or @code{verbosity < 0} or @code{verbosity > 4}
1590 or @code{workFactor < 0} or @code{workFactor > 250}
1591 @code{BZ_MEM_ERROR}
1592 if insufficient memory is available
1593 @code{BZ_OUTBUFF_FULL}
1594 if the size of the compressed data exceeds @code{*destLen}
1595 @code{BZ_OK}
1596 otherwise
1597@end display
1598
1599
1600
1601@subsection @code{bzBuffToBuffDecompress}
1602@example
1603 int bzBuffToBuffDecompress ( char* dest,
1604 unsigned int* destLen,
1605 char* source,
1606 unsigned int sourceLen,
1607 int small,
1608 int verbosity );
1609@end example
1610Attempts to decompress the data in @code{source[0 .. sourceLen-1]}
1611into the destination buffer, @code{dest[0 .. *destLen-1]}.
1612If the destination buffer is big enough, @code{*destLen} is
1613set to the size of the uncompressed data, and @code{BZ_OK} is
1614returned. If the compressed data won't fit, @code{*destLen}
1615is unchanged, and @code{BZ_OUTBUFF_FULL} is returned.
1616
1617@code{source} is assumed to hold a complete @code{bzip2} format
1618data stream. @code{bzBuffToBuffDecompress} tries to decompress
1619the entirety of the stream into the output buffer.
1620
1621For the meaning of parameters @code{small} and @code{verbosity},
1622see @code{bzDecompressInit}.
1623
1624Because the compression ratio of the compressed data cannot be known in
1625advance, there is no easy way to guarantee that the output buffer will
1626be big enough. You may of course make arrangements in your code to
1627record the size of the uncompressed data, but such a mechanism is beyond
1628the scope of this library.
1629
1630@code{bzBuffToBuffDecompress} will not write data at or
1631beyond @code{dest[*destLen]}, even in case of buffer overflow.
1632
1633Possible return values:
1634@display
1635 @code{BZ_PARAM_ERROR}
1636 if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL}
1637 or @code{small != 0 && small != 1}
1638 or @code{verbosity < 0} or @code{verbosity > 4}
1639 @code{BZ_MEM_ERROR}
1640 if insufficient memory is available
1641 @code{BZ_OUTBUFF_FULL}
1642 if the size of the compressed data exceeds @code{*destLen}
1643 @code{BZ_DATA_ERROR}
1644 if a data integrity error was detected in the compressed data
1645 @code{BZ_DATA_ERROR_MAGIC}
1646 if the compressed data doesn't begin with the right magic bytes
1647 @code{BZ_UNEXPECTED_EOF}
1648 if the compressed data ends unexpectedly
1649 @code{BZ_OK}
1650 otherwise
1651@end display
1652
1653
1654
1655@section Using the library in a @code{stdio}-free environment
1656
1657@subsection Getting rid of @code{stdio}
1658
1659In a deeply embedded application, you might want to use just
1660the memory-to-memory functions. You can do this conveniently
1661by compiling the library with preprocessor symbol @code{BZ_NO_STDIO}
1662defined. Doing this gives you a library containing only the following
1663eight functions:
1664
1665@code{bzCompressInit}, @code{bzCompress}, @code{bzCompressEnd} @*
1666@code{bzDecompressInit}, @code{bzDecompress}, @code{bzDecompressEnd} @*
1667@code{bzBuffToBuffCompress}, @code{bzBuffToBuffDecompress}
1668
1669When compiled like this, all functions will ignore @code{verbosity}
1670settings.
1671
1672@subsection Critical error handling
1673@code{libbzip2} contains a number of internal assertion checks which
1674should, needless to say, never be activated. Nevertheless, if an
1675assertion should fail, behaviour depends on whether or not the library
1676was compiled with @code{BZ_NO_STDIO} set.
1677
1678For a normal compile, an assertion failure yields the message
1679@example
1680 bzip2/libbzip2, v0.9.0: internal error number N.
1681 This is a bug in bzip2/libbzip2, v0.9.0. Please report
1682 it to me at: jseward@@acm.org. If this happened when
1683 you were using some program which uses libbzip2 as a
1684 component, you should also report this bug to the author(s)
1685 of that program. Please make an effort to report this bug;
1686 timely and accurate bug reports eventually lead to higher
1687 quality software. Thx. Julian Seward, 27 June 1998.
1688@end example
1689where @code{N} is some error code number. @code{exit(3)}
1690is then called.
1691
1692For a @code{stdio}-free library, assertion failures result
1693in a call to a function declared as:
1694@example
1695 extern void bz_internal_error ( int errcode );
1696@end example
1697The relevant code is passed as a parameter. You should supply
1698such a function.
1699
1700In either case, once an assertion failure has occurred, any
1701@code{bz_stream} records involved can be regarded as invalid.
1702You should not attempt to resume normal operation with them.
1703
1704You may, of course, change critical error handling to suit
1705your needs. As I said above, critical errors indicate bugs
1706in the library and should not occur. All "normal" error
1707situations are indicated via error return codes from functions,
1708and can be recovered from.
1709
1710
1711@section Making a Windows DLL
1712Everything related to Windows has been contributed by Yoshioka Tsuneo
1713@* (@code{QWF00133@@niftyserve.or.jp} /
1714@code{tsuneo-y@@is.aist-nara.ac.jp}), so you should send your queries to
1715him (but perhaps Cc: me, @code{jseward@@acm.org}).
1716
1717My vague understanding of what to do is: using Visual C++ 5.0,
1718open the project file @code{libbz2.dsp}, and build. That's all.
1719
1720If you can't
1721open the project file for some reason, make a new one, naming these files:
1722@code{blocksort.c}, @code{bzlib.c}, @code{compress.c},
1723@code{crctable.c}, @code{decompress.c}, @code{huffman.c}, @*
1724@code{randtable.c} and @code{libbz2.def}. You might also need
1725to name the header files @code{bzlib.h} and @code{bzlib_private.h}.
1726
1727If you don't use VC++, you may need to define the proprocessor symbol
1728@code{_WIN32}.
1729
1730Finally, @code{dlltest.c} is a sample program using the DLL. It has a
1731project file, @code{dlltest.dsp}.
1732
1733I haven't tried any of this stuff myself, but it all looks plausible.
1734
1735
1736
1737@chapter Miscellanea
1738
1739These are just some random thoughts of mine. Your mileage may
1740vary.
1741
1742@section Limitations of the compressed file format
1743@code{bzip2-0.9.0} uses exactly the same file format as the previous
1744version, @code{bzip2-0.1}. This decision was made in the interests of
1745stability. Creating yet another incompatible compressed file format
1746would create further confusion and disruption for users.
1747
1748Nevertheless, this is not a painless decision. Development
1749work since the release of @code{bzip2-0.1} in August 1997
1750has shown complexities in the file format which slow down
1751decompression and, in retrospect, are unnecessary. These are:
1752@itemize @bullet
1753@item The run-length encoder, which is the first of the
1754 compression transformations, is entirely irrelevant.
1755 The original purpose was to protect the sorting algorithm
1756 from the very worst case input: a string of repeated
1757 symbols. But algorithm steps Q6a and Q6b in the original
1758 Burrows-Wheeler technical report (SRC-124) show how
1759 repeats can be handled without difficulty in block
1760 sorting.
1761@item The randomisation mechanism doesn't really need to be
1762 there. Udi Manber and Gene Myers published a suffix
1763 array construction algorithm a few years back, which
1764 can be employed to sort any block, no matter how
1765 repetitive, in O(N log N) time. Subsequent work by
1766 Kunihiko Sadakane has produced a derivative O(N (log N)^2)
1767 algorithm which usually outperforms the Manber-Myers
1768 algorithm.
1769
1770 I could have changed to Sadakane's algorithm, but I find
1771 it to be slower than @code{bzip2}'s existing algorithm for
1772 most inputs, and the randomisation mechanism protects
1773 adequately against bad cases. I didn't think it was
1774 a good tradeoff to make. Partly this is due to the fact
1775 that I was not flooded with email complaints about
1776 @code{bzip2-0.1}'s performance on repetitive data, so
1777 perhaps it isn't a problem for real inputs.
1778
1779 Probably the best long-term solution
1780 is to use the existing sorting
1781 algorithm initially, and fall back to a O(N (log N)^2)
1782 algorithm if the standard algorithm gets into difficulties.
1783 This can be done without much difficulty; I made
1784 a prototype implementation of it some months now.
1785@item The compressed file format was never designed to be
1786 handled by a library, and I have had to jump though
1787 some hoops to produce an efficient implementation of
1788 decompression. It's a bit hairy. Try passing
1789 @code{decompress.c} through the C preprocessor
1790 and you'll see what I mean. Much of this complexity
1791 could have been avoided if the compressed size of
1792 each block of data was recorded in the data stream.
1793@item An Adler-32 checksum, rather than a CRC32 checksum,
1794 would be faster to compute.
1795@end itemize
1796It would be fair to say that the @code{bzip2} format was frozen
1797before I properly and fully understood the performance
1798consequences of doing so.
1799
1800Improvements which I have been able to incorporate into
18010.9.0, despite using the same file format, are:
1802@itemize @bullet
1803@item Single array implementation of the inverse BWT. This
1804 significantly speeds up decompression, presumably
1805 because it reduces the number of cache misses.
1806@item Faster inverse MTF transform for large MTF values. The
1807 new implementation is based on the notion of sliding blocks
1808 of values.
1809@item @code{bzip2-0.9.0} now reads and writes files with @code{fread}
1810 and @code{fwrite}; version 0.1 used @code{putc} and @code{getc}.
1811 Duh! I'm embarrassed at my own moronicness (moronicity?) on this
1812 one.
1813
1814@end itemize
1815Further ahead, it would be nice
1816to be able to do random access into files. This will
1817require some careful design of compressed file formats.
1818
1819
1820
1821@section Portability issues
1822After some consideration, I have decided not to use
1823GNU @code{autoconf} to configure 0.9.0.
1824
1825@code{autoconf}, admirable and wonderful though it is,
1826mainly assists with portability problems between Unix-like
1827platforms. But @code{bzip2} doesn't have much in the way
1828of portability problems on Unix; most of the difficulties appear
1829when porting to the Mac, or to Microsoft's operating systems.
1830@code{autoconf} doesn't help in those cases, and brings in a
1831whole load of new complexity.
1832
1833Most people should be able to compile the library and program
1834under Unix straight out-of-the-box, so to speak, especially
1835if you have a version of GNU C available.
1836
1837There are a couple of @code{__inline__} directives in the code. GNU C
1838(@code{gcc}) should be able to handle them. If your compiler doesn't
1839like them, just @code{#define} @code{__inline__} to be null. One
1840easy way to do this is to compile with the flag @code{-D__inline__=},
1841which should be understood by most Unix compilers.
1842
1843If you still have difficulties, try compiling with the macro
1844@code{BZ_STRICT_ANSI} defined. This should enable you to build the
1845library in a strictly ANSI compliant environment. Building the program
1846itself like this is dangerous and not supported, since you remove
1847@code{bzip2}'s checks against compressing directories, symbolic links,
1848devices, and other not-really-a-file entities. This could cause
1849filesystem corruption!
1850
1851One other thing: if you create a @code{bzip2} binary for public
1852distribution, please try and link it statically (@code{gcc -s}). This
1853avoids all sorts of library-version issues that others may encounter
1854later on.
1855
1856
1857@section Reporting bugs
1858I tried pretty hard to make sure @code{bzip2} is
1859bug free, both by design and by testing. Hopefully
1860you'll never need to read this section for real.
1861
1862Nevertheless, if @code{bzip2} dies with a segmentation
1863fault, a bus error or an internal assertion failure, it
1864will ask you to email me a bug report. Experience with
1865version 0.1 shows that almost all these problems can
1866be traced to either compiler bugs or hardware problems.
1867@itemize @bullet
1868@item
1869Recompile the program with no optimisation, and see if it
1870works. And/or try a different compiler.
1871I heard all sorts of stories about various flavours
1872of GNU C (and other compilers) generating bad code for
1873@code{bzip2}, and I've run across two such examples myself.
1874
18752.7.X versions of GNU C are known to generate bad code from
1876time to time, at high optimisation levels.
1877If you get problems, try using the flags
1878@code{-O2} @code{-fomit-frame-pointer} @code{-fno-strength-reduce}.
1879You should specifically @emph{not} use @code{-funroll-loops}.
1880
1881You may notice that the Makefile runs four tests as part of
1882the build process. If the program passes all of these, it's
1883a pretty good (but not 100%) indication that the compiler has
1884done its job correctly.
1885@item
1886If @code{bzip2} crashes randomly, and the crashes are not
1887repeatable, you may have a flaky memory subsystem. @code{bzip2}
1888really hammers your memory hierarchy, and if it's a bit marginal,
1889you may get these problems. Ditto if your disk or I/O subsystem
1890is slowly failing. Yup, this really does happen.
1891
1892Try using a different machine of the same type, and see if
1893you can repeat the problem.
1894@item This isn't really a bug, but ... If @code{bzip2} tells
1895you your file is corrupted on decompression, and you
1896obtained the file via FTP, there is a possibility that you
1897forgot to tell FTP to do a binary mode transfer. That absolutely
1898will cause the file to be non-decompressible. You'll have to transfer
1899it again.
1900@end itemize
1901
1902If you've incorporated @code{libbzip2} into your own program
1903and are getting problems, please, please, please, check that the
1904parameters you are passing in calls to the library, are
1905correct, and in accordance with what the documentation says
1906is allowable. I have tried to make the library robust against
1907such problems, but I'm sure I haven't succeeded.
1908
1909Finally, if the above comments don't help, you'll have to send
1910me a bug report. Now, it's just amazing how many people will
1911send me a bug report saying something like
1912@display
1913 bzip2 crashed with segmentation fault on my machine
1914@end display
1915and absolutely nothing else. Needless to say, a such a report
1916is @emph{totally, utterly, completely and comprehensively 100% useless;
1917a waste of your time, my time, and net bandwidth}.
1918With no details at all, there's no way I can possibly begin
1919to figure out what the problem is.
1920
1921The rules of the game are: facts, facts, facts. Don't omit
1922them because "oh, they won't be relevant". At the bare
1923minimum:
1924@display
1925 Machine type. Operating system version.
1926 Exact version of @code{bzip2} (do @code{bzip2 -V}).
1927 Exact version of the compiler used.
1928 Flags passed to the compiler.
1929@end display
1930However, the most important single thing that will help me is
1931the file that you were trying to compress or decompress at the
1932time the problem happened. Without that, my ability to do anything
1933more than speculate about the cause, is limited.
1934
1935Please remember that I connect to the Internet with a modem, so
1936you should contact me before mailing me huge files.
1937
1938
1939@section Did you get the right package?
1940
1941@code{bzip2} is a resource hog. It soaks up large amounts of CPU cycles
1942and memory. Also, it gives very large latencies. In the worst case, you
1943can feed many megabytes of uncompressed data into the library before
1944getting any compressed output, so this probably rules out applications
1945requiring interactive behaviour.
1946
1947These aren't faults of my implementation, I hope, but more
1948an intrinsic property of the Burrows-Wheeler transform (unfortunately).
1949Maybe this isn't what you want.
1950
1951If you want a compressor and/or library which is faster, uses less
1952memory but gets pretty good compression, and has minimal latency,
1953consider Jean-loup
1954Gailly's and Mark Adler's work, @code{zlib-1.1.2} and
1955@code{gzip-1.2.4}. Look for them at
1956@code{http://www.cdrom.com/pub/infozip/zlib} and
1957@code{http://www.gzip.org} respectively.
1958
1959For something faster and lighter still, you might try Markus F X J
1960Oberhumer's @code{LZO} real-time compression/decompression library, at
1961@* @code{http://wildsau.idv.uni-linz.ac.at/mfx/lzo.html}.
1962
1963If you want to use the @code{bzip2} algorithms to compress small blocks
1964of data, 64k bytes or smaller, for example on an on-the-fly disk
1965compressor, you'd be well advised not to use this library. Instead,
1966I've made a special library tuned for that kind of use. It's part of
1967@code{e2compr-0.40}, an on-the-fly disk compressor for the Linux
1968@code{ext2} filesystem. Look at
1969@code{http://www.netspace.net.au/~reiter/e2compr}.
1970
1971
1972
1973@section Testing
1974
1975A record of the tests I've done.
1976
1977First, some data sets:
1978@itemize @bullet
1979@item B: a directory containing a 6001 files, one for every length in the
1980 range 0 to 6000 bytes. The files contain random lowercase
1981 letters. 18.7 megabytes.
1982@item H: my home directory tree. Documents, source code, mail files,
1983 compressed data. H contains B, and also a directory of
1984 files designed as boundary cases for the sorting; mostly very
1985 repetitive, nasty files. 445 megabytes.
1986@item A: directory tree holding various applications built from source:
1987 @code{egcs-1.0.2}, @code{gcc-2.8.1}, KDE Beta 4, GTK, Octave, etc.
1988 827 megabytes.
1989@item P: directory tree holding large amounts of source code (@code{.tar}
1990 files) of the entire GNU distribution, plus a couple of
1991 Linux distributions. 2400 megabytes.
1992@end itemize
1993The tests conducted are as follows. Each test means compressing
1994(a copy of) each file in the data set, decompressing it and
1995comparing it against the original.
1996
1997First, a bunch of tests with block sizes, internal buffer
1998sizes and randomisation lengths set very small,
1999to detect any problems with the
2000blocking, buffering and randomisation mechanisms.
2001This required modifying the source code so as to try to
2002break it.
2003@enumerate
2004@item Data set H, with
2005 buffer size of 1 byte, and block size of 23 bytes.
2006@item Data set B, buffer sizes 1 byte, block size 1 byte.
2007@item As (2) but small-mode decompression (first 1700 files).
2008@item As (2) with block size 2 bytes.
2009@item As (2) with block size 3 bytes.
2010@item As (2) with block size 4 bytes.
2011@item As (2) with block size 5 bytes.
2012@item As (2) with block size 6 bytes and small-mode decompression.
2013@item H with normal buffer sizes (5000 bytes), normal block
2014 size (up to 900000 bytes), but with randomisation
2015 mechanism running intensely (randomising approximately every
2016 third byte).
2017@item As (9) with small-mode decompression.
2018@end enumerate
2019Then some tests with unmodified source code.
2020@enumerate
2021@item H, all settings normal.
2022@item As (1), with small-mode decompress.
2023@item H, compress with flag @code{-1}.
2024@item H, compress with flag @code{-s}, decompress with flag @code{-s}.
2025@item Forwards compatibility: H, @code{bzip2-0.1pl2} compressing,
2026 @code{bzip2-0.9.0} decompressing, all settings normal.
2027@item Backwards compatibility: H, @code{bzip2-0.9.0} compressing,
2028 @code{bzip2-0.1pl2} decompressing, all settings normal.
2029@item Bigger tests: A, all settings normal.
2030@item P, all settings normal.
2031@item Misc test: about 100 megabytes of @code{.tar} files with
2032 @code{bzip2} compiled with Purify.
2033@item Misc tests to make sure it builds and runs ok on non-Linux/x86
2034 platforms.
2035@end enumerate
2036These tests were conducted on a 205 MHz Cyrix 6x86MX machine, running
2037Linux 2.0.32. They represent nearly a week of continuous computation.
2038All tests completed successfully.
2039
2040
2041@section Further reading
2042@code{bzip2} is not research work, in the sense that it doesn't present
2043any new ideas. Rather, it's an engineering exercise based on existing
2044ideas.
2045
2046Four documents describe essentially all the ideas behind @code{bzip2}:
2047@example
2048Michael Burrows and D. J. Wheeler:
2049 "A block-sorting lossless data compression algorithm"
2050 10th May 1994.
2051 Digital SRC Research Report 124.
2052 ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz
2053 If you have trouble finding it, try searching at the
2054 New Zealand Digital Library, http://www.nzdl.org.
2055
2056Daniel S. Hirschberg and Debra A. LeLewer
2057 "Efficient Decoding of Prefix Codes"
2058 Communications of the ACM, April 1990, Vol 33, Number 4.
2059 You might be able to get an electronic copy of this
2060 from the ACM Digital Library.
2061
2062David J. Wheeler
2063 Program bred3.c and accompanying document bred3.ps.
2064 This contains the idea behind the multi-table Huffman
2065 coding scheme.
2066 ftp://ftp.cl.cam.ac.uk/pub/user/djw3/
2067
2068Jon L. Bentley and Robert Sedgewick
2069 "Fast Algorithms for Sorting and Searching Strings"
2070 Available from Sedgewick's web page,
2071 www.cs.princeton.edu/~rs
2072@end example
2073The following paper gives valuable additional insights into the
2074algorithm, but is not immediately the basis of any code
2075used in bzip2.
2076@example
2077Peter Fenwick:
2078 Block Sorting Text Compression
2079 Proceedings of the 19th Australasian Computer Science Conference,
2080 Melbourne, Australia. Jan 31 - Feb 2, 1996.
2081 ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps
2082@end example
2083Kunihiko Sadakane's sorting algorithm, mentioned above,
2084is available from:
2085@example
2086http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz
2087@end example
2088The Manber-Myers suffix array construction
2089algorithm is described in a paper
2090available from:
2091@example
2092http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps
2093@end example
2094
2095
2096
2097@contents
2098
2099@bye
2100