From f93cd82a9a7094ad90fd19bbc6ccf6f4627f8060 Mon Sep 17 00:00:00 2001
From: Julian Seward <jseward@acm.org>
Date: Sat, 4 Sep 1999 22:13:13 +0200
Subject: bzip2-0.9.5d

---
 manual.texi | 869 ++++++++++++++++++++++++++++++++----------------------------
 1 file changed, 459 insertions(+), 410 deletions(-)

(limited to 'manual.texi')

diff --git a/manual.texi b/manual.texi
index 99ce661..e48e656 100644
--- a/manual.texi
+++ b/manual.texi
@@ -2,10 +2,10 @@
 @setfilename bzip2.info
 
 @ignore
-This file documents bzip2 version 0.9.0c, and associated library
+This file documents bzip2 version 0.9.5, and associated library
 libbzip2, written by Julian Seward (jseward@acm.org).
 
-Copyright (C) 1996-1998 Julian R Seward
+Copyright (C) 1996-1999 Julian R Seward
 
 Permission is granted to make and distribute verbatim copies of
 this manual provided the copyright notice and this permission notice
@@ -30,21 +30,21 @@ END-INFO-DIR-ENTRY
 @titlepage
 @title bzip2 and libbzip2
 @subtitle a program and library for data compression
-@subtitle copyright (C) 1996-1998 Julian Seward
-@subtitle version 0.9.0c of 18 October 1998
+@subtitle copyright (C) 1996-1999 Julian Seward
+@subtitle version 0.9.5d of 4 September 1999
 @author Julian Seward
 
 @end titlepage
-@end iftex
-
 
 @parindent 0mm
 @parskip 2mm
 
+@end iftex
+@node Top, Overview, (dir), (dir)
 
 This program, @code{bzip2}, 
 and associated library @code{libbzip2}, are
-Copyright (C) 1996-1998 Julian R Seward.  All rights reserved.
+Copyright (C) 1996-1999 Julian R Seward.  All rights reserved.
 
 Redistribution and use in source and binary forms, with or without
 modification, are permitted provided that the following conditions
@@ -78,13 +78,13 @@ WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
 NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
 SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
-Julian Seward, Guildford, Surrey, UK.
+Julian Seward, Cambridge, UK.
 
 @code{jseward@@acm.org}
 
 @code{http://www.muraroa.demon.co.uk}
 
-@code{bzip2}/@code{libbzip2} version 0.9.0c of 18 October 1998.
+@code{bzip2}/@code{libbzip2} version 0.9.5 of 24 May 1999.
 
 PATENTS: To the best of my knowledge, @code{bzip2} does not use any patented
 algorithms.  However, I do not have the resources available to carry out
@@ -124,360 +124,345 @@ ought to be recorded somewhere.
 
 This chapter contains a copy of the @code{bzip2} man page,
 and nothing else.
+
+@quotation
+
+@unnumberedsubsubsec NAME
+@itemize
+@item @code{bzip2}, @code{bunzip2}
+- a block-sorting file compressor, v0.9.5
+@item @code{bzcat} 
+- decompresses files to stdout
+@item @code{bzip2recover}
+- recovers data from damaged bzip2 files
+@end itemize
+
+@unnumberedsubsubsec SYNOPSIS
+@itemize
+@item @code{bzip2} [ -cdfkqstvzVL123456789 ] [ filenames ...  ]
+@item @code{bunzip2} [ -fkvsVL ] [ filenames ...  ]
+@item @code{bzcat} [ -s ] [ filenames ...  ]
+@item @code{bzip2recover} filename
+@end itemize
+
+@unnumberedsubsubsec DESCRIPTION
+
+@code{bzip2} compresses files using the Burrows-Wheeler block sorting
+text compression algorithm, and Huffman coding.  Compression is
+generally considerably better than that achieved by more conventional
+LZ77/LZ78-based compressors, and approaches the performance of the PPM
+family of statistical compressors.
+
+The command-line options are deliberately very similar to those of GNU
+@code{gzip}, but they are not identical.
+
+@code{bzip2} expects a list of file names to accompany the command-line
+flags.  Each file is replaced by a compressed version of itself, with
+the name @code{original_name.bz2}.  Each compressed file has the same
+modification date, permissions, and, when possible, ownership as the
+corresponding original, so that these properties can be correctly
+restored at decompression time.  File name handling is naive in the
+sense that there is no mechanism for preserving original file names,
+permissions, ownerships or dates in filesystems which lack these
+concepts, or have serious file name length restrictions, such as MS-DOS.
+
+@code{bzip2} and @code{bunzip2} will by default not overwrite existing
+files.  If you want this to happen, specify the @code{-f} flag.
+
+If no file names are specified, @code{bzip2} compresses from standard
+input to standard output.  In this case, @code{bzip2} will decline to
+write compressed output to a terminal, as this would be entirely
+incomprehensible and therefore pointless.
+
+@code{bunzip2} (or @code{bzip2 -d}) decompresses all
+specified files.  Files which were not created by @code{bzip2}
+will be detected and ignored, and a warning issued.  
+@code{bzip2} attempts to guess the filename for the decompressed file 
+from that of the compressed file as follows:
+@itemize
+@item @code{filename.bz2 } becomes @code{filename}
+@item @code{filename.bz  } becomes @code{filename}
+@item @code{filename.tbz2} becomes @code{filename.tar}
+@item @code{filename.tbz } becomes @code{filename.tar}
+@item @code{anyothername } becomes @code{anyothername.out}
+@end itemize
+If the file does not end in one of the recognised endings, 
+@code{.bz2}, @code{.bz}, 
+@code{.tbz2} or @code{.tbz}, @code{bzip2} complains that it cannot
+guess the name of the original file, and uses the original name
+with @code{.out} appended.
+
+As with compression, supplying no
+filenames causes decompression from standard input to standard output.
+
+@code{bunzip2} will correctly decompress a file which is the
+concatenation of two or more compressed files.  The result is the
+concatenation of the corresponding uncompressed files.  Integrity
+testing (@code{-t}) of concatenated compressed files is also supported.
+
+You can also compress or decompress files to the standard output by
+giving the @code{-c} flag.  Multiple files may be compressed and
+decompressed like this.  The resulting outputs are fed sequentially to
+stdout.  Compression of multiple files in this manner generates a stream
+containing multiple compressed file representations.  Such a stream
+can be decompressed correctly only by @code{bzip2} version 0.9.0 or
+later.  Earlier versions of @code{bzip2} will stop after decompressing
+the first file in the stream.
+
+@code{bzcat} (or @code{bzip2 -dc}) decompresses all specified files to
+the standard output.
+
+@code{bzip2} will read arguments from the environment variables
+@code{BZIP2} and @code{BZIP}, in that order, and will process them
+before any arguments read from the command line.  This gives a 
+convenient way to supply default arguments.
+
+Compression is always performed, even if the compressed file is slightly
+larger than the original.  Files of less than about one hundred bytes
+tend to get larger, since the compression mechanism has a constant
+overhead in the region of 50 bytes.  Random data (including the output
+of most file compressors) is coded at about 8.05 bits per byte, giving
+an expansion of around 0.5%.
+
+As a self-check for your protection, @code{bzip2} uses 32-bit CRCs to
+make sure that the decompressed version of a file is identical to the
+original.  This guards against corruption of the compressed data, and
+against undetected bugs in @code{bzip2} (hopefully very unlikely).  The
+chances of data corruption going undetected is microscopic, about one
+chance in four billion for each file processed.  Be aware, though, that
+the check occurs upon decompression, so it can only tell you that
+something is wrong.  It can't help you recover the original uncompressed
+data.  You can use @code{bzip2recover} to try to recover data from
+damaged files.
+
+Return values: 0 for a normal exit, 1 for environmental problems (file
+not found, invalid flags, I/O errors, &c), 2 to indicate a corrupt
+compressed file, 3 for an internal consistency error (eg, bug) which
+caused @code{bzip2} to panic.
+
+
+@unnumberedsubsubsec OPTIONS
+@table @code
+@item -c  --stdout
+Compress or decompress to standard output.
+@item -d  --decompress
+Force decompression.  @code{bzip2}, @code{bunzip2} and @code{bzcat} are
+really the same program, and the decision about what actions to take is
+done on the basis of which name is used.  This flag overrides that
+mechanism, and forces bzip2 to decompress.
+@item -z --compress
+The complement to @code{-d}: forces compression, regardless of the
+invokation name.
+@item -t --test
+Check integrity of the specified file(s), but don't decompress them.
+This really performs a trial decompression and throws away the result.
+@item -f --force
+Force overwrite of output files.  Normally, @code{bzip2} will not overwrite
+existing output files.  Also forces @code{bzip2} to break hard links
+to files, which it otherwise wouldn't do.
+@item -k --keep
+Keep (don't delete) input files during compression
+or decompression.
+@item -s --small
+Reduce memory usage, for compression, decompression and testing.  Files
+are decompressed and tested using a modified algorithm which only
+requires 2.5 bytes per block byte.  This means any file can be
+decompressed in 2300k of memory, albeit at about half the normal speed.
+
+During compression, @code{-s} selects a block size of 200k, which limits
+memory use to around the same figure, at the expense of your compression
+ratio.  In short, if your machine is low on memory (8 megabytes or
+less), use -s for everything.  See MEMORY MANAGEMENT below.
+@item -q --quiet
+Suppress non-essential warning messages.  Messages pertaining to
+I/O errors and other critical events will not be suppressed.
+@item -v --verbose
+Verbose mode -- show the compression ratio for each file processed.
+Further @code{-v}'s increase the verbosity level, spewing out lots of
+information which is primarily of interest for diagnostic purposes.
+@item -L --license -V --version
+Display the software version, license terms and conditions.
+@item -1 to -9
+Set the block size to 100 k, 200 k ..  900 k when compressing.  Has no
+effect when decompressing.  See MEMORY MANAGEMENT below.
+@item --
+Treats all subsequent arguments as file names, even if they start
+with a dash.  This is so you can handle files with names beginning
+with a dash, for example: @code{bzip2 -- -myfilename}.
+@item --repetitive-fast 
+@item --repetitive-best
+These flags are redundant in versions 0.9.5 and above.  They provided
+some coarse control over the behaviour of the sorting algorithm in
+earlier versions, which was sometimes useful.  0.9.5 and above have an
+improved algorithm which renders these flags irrelevant.
+@end table
+
+
+@unnumberedsubsubsec MEMORY MANAGEMENT
+
+@code{bzip2} compresses large files in blocks.  The block size affects
+both the compression ratio achieved, and the amount of memory needed for
+compression and decompression.  The flags @code{-1} through @code{-9}
+specify the block size to be 100,000 bytes through 900,000 bytes (the
+default) respectively.  At decompression time, the block size used for
+compression is read from the header of the compressed file, and
+@code{bunzip2} then allocates itself just enough memory to decompress
+the file.  Since block sizes are stored in compressed files, it follows
+that the flags @code{-1} to @code{-9} are irrelevant to and so ignored
+during decompression.
+
+Compression and decompression requirements, in bytes, can be estimated
+as:
 @example
-NAME
-       bzip2, bunzip2 - a block-sorting file compressor, v0.9.0
-       bzcat - decompresses files to stdout
-       bzip2recover - recovers data from damaged bzip2 files
-
-
-SYNOPSIS
-       bzip2 [ -cdfkstvzVL123456789 ] [ filenames ...  ]
-       bunzip2 [ -fkvsVL ] [ filenames ...  ]
-       bzcat [ -s ] [ filenames ...  ]
-       bzip2recover filename
-
-
-DESCRIPTION
-       bzip2  compresses  files  using the Burrows-Wheeler block-
-       sorting text compression algorithm,  and  Huffman  coding.
-       Compression  is  generally  considerably  better than that
-       achieved by more conventional LZ77/LZ78-based compressors,
-       and  approaches  the performance of the PPM family of sta-
-       tistical compressors.
-
-       The command-line options are deliberately very similar  to
-       those of GNU Gzip, but they are not identical.
-
-       bzip2  expects  a list of file names to accompany the com-
-       mand-line flags.  Each file is replaced  by  a  compressed
-       version  of  itself,  with  the  name "original_name.bz2".
-       Each compressed file has the same  modification  date  and
-       permissions  as  the corresponding original, so that these
-       properties can  be  correctly  restored  at  decompression
-       time.  File name handling is naive in the sense that there
-       is no mechanism for preserving original file  names,  per-
-       missions  and  dates  in filesystems which lack these con-
-       cepts, or have serious file name length restrictions, such
-       as MS-DOS.
-
-       bzip2  and  bunzip2 will by default not overwrite existing
-       files; if you want this to happen, specify the -f flag.
-
-       If no file names  are  specified,  bzip2  compresses  from
-       standard  input  to  standard output.  In this case, bzip2
-       will decline to write compressed output to a terminal,  as
-       this  would  be  entirely  incomprehensible  and therefore
-       pointless.
-
-       bunzip2 (or bzip2 -d ) decompresses and restores all spec-
-       ified files whose names end in ".bz2".  Files without this
-       suffix are ignored.  Again, supplying no filenames  causes
-       decompression from standard input to standard output.
-
-       bunzip2 will correctly decompress a file which is the con-
-       catenation of two or more compressed files.  The result is
-       the concatenation of the corresponding uncompressed files.
-       Integrity testing (-t) of concatenated compressed files is
-       also supported.
-
-       You  can also compress or decompress files to the standard
-       output by giving the -c flag.  Multiple files may be  com-
-       pressed and decompressed like this.  The resulting outputs
-       are fed sequentially to stdout.  Compression  of  multiple
-       files  in this manner generates a stream containing multi-
-       ple compressed file representations.  Such a stream can be
-       decompressed  correctly  only  by  bzip2  version 0.9.0 or
-       later.  Earlier versions of bzip2 will stop  after  decom-
-       pressing the first file in the stream.
-
-       bzcat  (or bzip2 -dc ) decompresses all specified files to
-       the standard output.
-
-       Compression is always performed, even  if  the  compressed
-       file  is slightly larger than the original.  Files of less
-       than about one hundred bytes tend to get larger, since the
-       compression  mechanism  has  a  constant  overhead  in the
-       region of 50 bytes.  Random data (including the output  of
-       most  file  compressors)  is  coded at about 8.05 bits per
-       byte, giving an expansion of around 0.5%.
-
-       As a self-check for your  protection,  bzip2  uses  32-bit
-       CRCs  to make sure that the decompressed version of a file
-       is identical to the original.  This guards against corrup-
-       tion  of  the compressed data, and against undetected bugs
-       in bzip2 (hopefully very unlikely).  The chances  of  data
-       corruption  going  undetected  is  microscopic,  about one
-       chance in four billion for each file processed.  Be aware,
-       though,  that  the  check occurs upon decompression, so it
-       can only tell you that that something is wrong.  It  can't
-       help  you recover the original uncompressed data.  You can
-       use bzip2recover to  try  to  recover  data  from  damaged
-       files.
-
-       Return  values:  0  for a normal exit, 1 for environmental
-       problems (file not found, invalid flags, I/O errors,  &c),
-       2 to indicate a corrupt compressed file, 3 for an internal
-       consistency error (eg, bug) which caused bzip2 to panic.
-
-
-MEMORY MANAGEMENT
-       Bzip2 compresses large files in blocks.   The  block  size
-       affects  both  the  compression  ratio  achieved,  and the
-       amount of memory needed both for  compression  and  decom-
-       pression.   The flags -1 through -9 specify the block size
-       to be 100,000 bytes through 900,000  bytes  (the  default)
-       respectively.   At decompression-time, the block size used
-       for compression is read from the header of the  compressed
-       file, and bunzip2 then allocates itself just enough memory
-       to decompress the file.  Since block sizes are  stored  in
-       compressed  files,  it follows that the flags -1 to -9 are
-       irrelevant  to  and  so  ignored   during   decompression.
-
-       Compression  and decompression requirements, in bytes, can
-       be estimated as:
-
-             Compression:   400k + ( 7 x block size )
-
-             Decompression: 100k + ( 4 x block size ), or
-                            100k + ( 2.5 x block size )
-
-       Larger  block  sizes  give  rapidly  diminishing  marginal
-       returns;  most of the compression comes from the first two
-       or three hundred k of block size, a fact worth bearing  in
-       mind  when  using  bzip2  on  small  machines.  It is also
-       important to  appreciate  that  the  decompression  memory
-       requirement  is  set  at compression-time by the choice of
-       block size.
+     Compression:   400k + ( 8 x block size )
+
+     Decompression: 100k + ( 4 x block size ), or
+                    100k + ( 2.5 x block size )
+@end example
+Larger block sizes give rapidly diminishing marginal returns.  Most of
+the compression comes from the first two or three hundred k of block
+size, a fact worth bearing in mind when using @code{bzip2} on small machines.
+It is also important to appreciate that the decompression memory
+requirement is set at compression time by the choice of block size.
+
+For files compressed with the default 900k block size, @code{bunzip2}
+will require about 3700 kbytes to decompress.  To support decompression
+of any file on a 4 megabyte machine, @code{bunzip2} has an option to
+decompress using approximately half this amount of memory, about 2300
+kbytes.  Decompression speed is also halved, so you should use this
+option only where necessary.  The relevant flag is @code{-s}.
+
+In general, try and use the largest block size memory constraints allow,
+since that maximises the compression achieved.  Compression and
+decompression speed are virtually unaffected by block size.
+
+Another significant point applies to files which fit in a single block
+-- that means most files you'd encounter using a large block size.  The
+amount of real memory touched is proportional to the size of the file,
+since the file is smaller than a block.  For example, compressing a file
+20,000 bytes long with the flag @code{-9} will cause the compressor to
+allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560
+kbytes of it.  Similarly, the decompressor will allocate 3700k but only
+touch 100k + 20000 * 4 = 180 kbytes.
+
+Here is a table which summarises the maximum memory usage for different
+block sizes.  Also recorded is the total compressed size for 14 files of
+the Calgary Text Compression Corpus totalling 3,141,622 bytes.  This
+column gives some feel for how compression varies with block size.
+These figures tend to understate the advantage of larger block sizes for
+larger files, since the Corpus is dominated by smaller files.
+@example
+          Compress   Decompress   Decompress   Corpus
+   Flag     usage      usage       -s usage     Size
+
+    -1      1200k       500k         350k      914704
+    -2      2000k       900k         600k      877703
+    -3      2800k      1300k         850k      860338
+    -4      3600k      1700k        1100k      846899
+    -5      4400k      2100k        1350k      845160
+    -6      5200k      2500k        1600k      838626
+    -7      6100k      2900k        1850k      834096
+    -8      6800k      3300k        2100k      828642
+    -9      7600k      3700k        2350k      828642
+@end example
+
+@unnumberedsubsubsec RECOVERING DATA FROM DAMAGED FILES
 
-       For files compressed with the  default  900k  block  size,
-       bunzip2  will require about 3700 kbytes to decompress.  To
-       support decompression of any file on a 4 megabyte machine,
-       bunzip2  has  an  option to decompress using approximately
-       half this amount of memory, about 2300 kbytes.  Decompres-
-       sion  speed  is also halved, so you should use this option
-       only where necessary.  The relevant flag is -s.
-
-       In general, try and use the largest block size memory con-
-       straints  allow,  since  that  maximises  the  compression
-       achieved.  Compression and decompression speed are  virtu-
-       ally unaffected by block size.
-
-       Another  significant point applies to files which fit in a
-       single block -- that  means  most  files  you'd  encounter
-       using  a  large  block  size.   The  amount of real memory
-       touched is proportional to the size of the file, since the
-       file  is smaller than a block.  For example, compressing a
-       file 20,000 bytes long with the flag  -9  will  cause  the
-       compressor  to  allocate  around 6700k of memory, but only
-       touch 400k + 20000 * 7 = 540 kbytes of it.  Similarly, the
-       decompressor  will  allocate  3700k  but only touch 100k +
-       20000 * 4 = 180 kbytes.
-
-       Here is a table which summarises the maximum memory  usage
-       for  different  block  sizes.   Also recorded is the total
-       compressed size for 14 files of the Calgary Text  Compres-
-       sion  Corpus totalling 3,141,622 bytes.  This column gives
-       some feel for how  compression  varies  with  block  size.
-       These  figures  tend to understate the advantage of larger
-       block sizes for larger files, since the  Corpus  is  domi-
-       nated by smaller files.
-
-                  Compress   Decompress   Decompress   Corpus
-           Flag     usage      usage       -s usage     Size
-
-            -1      1100k       500k         350k      914704
-            -2      1800k       900k         600k      877703
-            -3      2500k      1300k         850k      860338
-            -4      3200k      1700k        1100k      846899
-            -5      3900k      2100k        1350k      845160
-            -6      4600k      2500k        1600k      838626
-            -7      5400k      2900k        1850k      834096
-            -8      6000k      3300k        2100k      828642
-            -9      6700k      3700k        2350k      828642
-
-
-OPTIONS
-       -c --stdout
-              Compress or decompress to standard output.  -c will
-              decompress multiple files to stdout, but will  only
-              compress a single file to stdout.
-
-       -d --decompress
-              Force  decompression.  bzip2, bunzip2 and bzcat are
-              really the same program,  and  the  decision  about
-              what  actions to take is done on the basis of which
-              name is used.  This flag overrides that  mechanism,
-              and forces bzip2 to decompress.
-
-       -z --compress
-              The  complement  to -d: forces compression, regard-
-              less of the invokation name.
-
-       -t --test
-              Check integrity of the specified file(s), but don't
-              decompress  them.   This  really  performs  a trial
-              decompression and throws away the result.
-
-       -f --force
-              Force overwrite of output files.   Normally,  bzip2
-              will not overwrite existing output files.
-
-       -k --keep
-              Keep  (don't delete) input files during compression
-              or decompression.
-
-       -s --small
-              Reduce memory usage, for compression, decompression
-              and  testing.   Files  are  decompressed and tested
-              using a modified algorithm which only requires  2.5
-              bytes  per  block byte.  This means any file can be
-              decompressed in 2300k of memory,  albeit  at  about
-              half the normal speed.
-
-              During  compression,  -s  selects  a  block size of
-              200k, which limits memory use to  around  the  same
-              figure,  at  the expense of your compression ratio.
-              In short, if your  machine  is  low  on  memory  (8
-              megabytes  or  less),  use  -s for everything.  See
-              MEMORY MANAGEMENT above.
-
-       -v --verbose
-              Verbose mode -- show the compression ratio for each
-              file  processed.   Further  -v's  increase the ver-
-              bosity level, spewing out lots of information which
-              is primarily of interest for diagnostic purposes.
-
-       -L --license -V --version
-              Display  the  software  version,  license terms and
-              conditions.
-
-       -1 to -9
-              Set the block size to 100 k, 200 k ..  900  k  when
-              compressing.   Has  no  effect  when decompressing.
-              See MEMORY MANAGEMENT above.
-
-       --repetitive-fast
-              bzip2 injects some small  pseudo-random  variations
-              into  very  repetitive  blocks  to limit worst-case
-              performance during compression.   If  sorting  runs
-              into  difficulties,  the  block  is randomised, and
-              sorting is restarted.  Very roughly, bzip2 persists
-              for  three  times  as  long as a well-behaved input
-              would take before resorting to randomisation.  This
-              flag makes it give up much sooner.
-
-       --repetitive-best
-              Opposite  of  --repetitive-fast;  try  a lot harder
-              before resorting to randomisation.
-
-
-RECOVERING DATA FROM DAMAGED FILES
-       bzip2 compresses files in blocks, usually 900kbytes  long.
-       Each block is handled independently.  If a media or trans-
-       mission error causes a multi-block  .bz2  file  to  become
-       damaged,  it  may  be  possible  to  recover data from the
-       undamaged blocks in the file.
-
-       The compressed representation of each block  is  delimited
-       by  a  48-bit pattern, which makes it possible to find the
-       block boundaries with reasonable  certainty.   Each  block
-       also  carries its own 32-bit CRC, so damaged blocks can be
-       distinguished from undamaged ones.
-
-       bzip2recover is a  simple  program  whose  purpose  is  to
-       search  for blocks in .bz2 files, and write each block out
-       into its own .bz2 file.  You can then use bzip2 -t to test
-       the integrity of the resulting files, and decompress those
-       which are undamaged.
-
-       bzip2recover takes a single argument, the name of the dam-
-       aged file, and writes a number of files "rec0001file.bz2",
-       "rec0002file.bz2", etc, containing the  extracted  blocks.
+@code{bzip2} compresses files in blocks, usually 900kbytes long.  Each
+block is handled independently.  If a media or transmission error causes
+a multi-block @code{.bz2} file to become damaged, it may be possible to
+recover data from the undamaged blocks in the file.
+
+The compressed representation of each block is delimited by a 48-bit
+pattern, which makes it possible to find the block boundaries with
+reasonable certainty.  Each block also carries its own 32-bit CRC, so
+damaged blocks can be distinguished from undamaged ones.
+
+@code{bzip2recover} is a simple program whose purpose is to search for
+blocks in @code{.bz2} files, and write each block out into its own
+@code{.bz2} file.  You can then use @code{bzip2 -t} to test the
+integrity of the resulting files, and decompress those which are
+undamaged.
+
+@code{bzip2recover} 
+takes a single argument, the name of the damaged file, 
+and writes a number of files @code{rec0001file.bz2},
+       @code{rec0002file.bz2}, etc, containing the  extracted  blocks.
        The  output  filenames  are  designed  so  that the use of
-       wildcards in subsequent processing -- for example,  "bzip2
-       -dc  rec*file.bz2  > recovered_data" -- lists the files in
-       the "right" order.
+       wildcards in subsequent processing -- for example,  
+@code{bzip2 -dc  rec*file.bz2 > recovered_data} -- lists the files in
+       the correct order.
 
-       bzip2recover should be of most use dealing with large .bz2
+@code{bzip2recover} should be of most use dealing with large @code{.bz2}
        files,  as  these will contain many blocks.  It is clearly
        futile to use it on damaged single-block  files,  since  a
-       damaged  block  cannot  be recovered.  If you wish to min-
-       imise any potential data loss through media  or  transmis-
-       sion errors, you might consider compressing with a smaller
+       damaged  block  cannot  be recovered.  If you wish to minimise 
+any potential data loss through media  or  transmission errors, 
+you might consider compressing with a smaller
        block size.
 
 
-PERFORMANCE NOTES
-       The sorting phase of compression gathers together  similar
-       strings  in  the  file.  Because of this, files containing
-       very long runs of  repeated  symbols,  like  "aabaabaabaab
-       ..."   (repeated   several  hundred  times)  may  compress
-       extraordinarily slowly.  You can use the -vvvvv option  to
-       monitor progress in great detail, if you want.  Decompres-
-       sion speed is unaffected.
-
-       Such pathological cases seem rare in  practice,  appearing
-       mostly in artificially-constructed test files, and in low-
-       level disk images.  It may be inadvisable to use bzip2  to
-       compress  the  latter.   If you do get a file which causes
-       severe slowness in compression, try making the block  size
-       as small as possible, with flag -1.
-
-       bzip2  usually  allocates  several  megabytes of memory to
-       operate in, and then charges all over it in a fairly  ran-
-       dom  fashion.   This means that performance, both for com-
-       pressing and decompressing, is largely determined  by  the
-       speed  at  which  your  machine  can service cache misses.
-       Because of this, small changes to the code to  reduce  the
-       miss  rate  have  been observed to give disproportionately
-       large performance improvements.  I imagine bzip2 will per-
-       form best on machines with very large caches.
-
-
-CAVEATS
-       I/O  error  messages  are not as helpful as they could be.
-       Bzip2 tries hard to detect I/O errors  and  exit  cleanly,
-       but  the  details  of  what  the problem is sometimes seem
-       rather misleading.
-
-       This manual page pertains to version 0.9.0 of bzip2.  Com-
-       pressed  data created by this version is entirely forwards
-       and backwards compatible with the previous public release,
-       version  0.1pl2,  but  with the following exception: 0.9.0
-       can correctly decompress multiple concatenated  compressed
-       files.   0.1pl2  cannot do this; it will stop after decom-
-       pressing just the first file in the stream.
-
-       Wildcard expansion for Windows 95 and NT is flaky.
-
-       bzip2recover uses 32-bit integers to represent  bit  posi-
-       tions  in compressed files, so it cannot handle compressed
-       files more than 512 megabytes long.  This could easily  be
-       fixed.
-
-
-AUTHOR
-       Julian Seward, jseward@@acm.org.
-
-       The ideas embodied in bzip2 are due to (at least) the fol-
-       lowing people: Michael Burrows and David Wheeler (for  the
-       block  sorting  transformation), David Wheeler (again, for
-       the Huffman coder), Peter Fenwick (for the structured cod-
-       ing model in the original bzip, and many refinements), and
-       Alistair Moffat, Radford Neal  and  Ian  Witten  (for  the
-       arithmetic  coder  in  the  original  bzip).   I  am  much
-       indebted for their help, support and advice.  See the man-
-       ual  in the source distribution for pointers to sources of
-       documentation.  Christian von Roques encouraged me to look
-       for  faster sorting algorithms, so as to speed up compres-
-       sion.  Bela Lubkin encouraged me to improve the worst-case
-       compression performance.  Many people sent patches, helped
-       with portability problems, lent machines, gave advice  and
-       were generally helpful.
-@end example
+@unnumberedsubsubsec PERFORMANCE NOTES
+
+The sorting phase of compression gathers together similar strings in the
+file.  Because of this, files containing very long runs of repeated
+symbols, like "aabaabaabaab ..."  (repeated several hundred times) may
+compress more slowly than normal.  Versions 0.9.5 and above fare much
+better than previous versions in this respect.  The ratio between
+worst-case and average-case compression time is in the region of 10:1.
+For previous versions, this figure was more like 100:1.  You can use the
+@code{-vvvv} option to monitor progress in great detail, if you want.
+
+Decompression speed is unaffected by these phenomena.
+
+@code{bzip2} usually allocates several megabytes of memory to operate
+in, and then charges all over it in a fairly random fashion.  This means
+that performance, both for compressing and decompressing, is largely
+determined by the speed at which your machine can service cache misses.
+Because of this, small changes to the code to reduce the miss rate have
+been observed to give disproportionately large performance improvements.
+I imagine @code{bzip2} will perform best on machines with very large
+caches.
 
 
+@unnumberedsubsubsec CAVEATS
+
+I/O error messages are not as helpful as they could be.  @code{bzip2}
+tries hard to detect I/O errors and exit cleanly, but the details of
+what the problem is sometimes seem rather misleading.
+
+This manual page pertains to version 0.9.5 of @code{bzip2}.  Compressed
+data created by this version is entirely forwards and backwards
+compatible with the previous public releases, versions 0.1pl2 and 0.9.0,
+but with the following exception: 0.9.0 and above can correctly
+decompress multiple concatenated compressed files.  0.1pl2 cannot do
+this; it will stop after decompressing just the first file in the
+stream.
+
+@code{bzip2recover} uses 32-bit integers to represent bit positions in
+compressed files, so it cannot handle compressed files more than 512
+megabytes long.  This could easily be fixed.
+
+
+@unnumberedsubsubsec AUTHOR
+Julian Seward, @code{jseward@@acm.org}.
+
+The ideas embodied in @code{bzip2} are due to (at least) the following
+people: Michael Burrows and David Wheeler (for the block sorting
+transformation), David Wheeler (again, for the Huffman coder), Peter
+Fenwick (for the structured coding model in the original @code{bzip},
+and many refinements), and Alistair Moffat, Radford Neal and Ian Witten
+(for the arithmetic coder in the original @code{bzip}).  I am much
+indebted for their help, support and advice.  See the manual in the
+source distribution for pointers to sources of documentation.  Christian
+von Roques encouraged me to look for faster sorting algorithms, so as to
+speed up compression.  Bela Lubkin encouraged me to improve the
+worst-case compression performance.  Many people sent patches, helped
+with portability problems, lent machines, gave advice and were generally
+helpful.
+
+@end quotation
+
 
 
 
@@ -579,7 +564,7 @@ give better @code{zlib} compatibility.  These functions are
 @code{bzerror} and @code{bzlibVersion}.  You may find these functions
 more convenient for simple file reading and writing, than those in the
 high-level interface.  These functions are not (yet) officially part of
-the library, and are not further documented here.  If they break, you
+the library, and are minimally documented here.  If they break, you
 get to keep all the pieces.  I hope to document them properly when time
 permits.
 
@@ -748,25 +733,31 @@ monitoring/debugging output.  If the library has been compiled with
 setting.
 
 Parameter @code{workFactor} controls how the compression phase behaves
-when presented with worst case, highly repetitive, input data.
-If compression runs into difficulties caused by repetitive data,
-some pseudo-random variations are inserted into the block, and
-compression is restarted.  Lower values of @code{workFactor}
-reduce the tolerance of compression to repetitive data.
-You should set this parameter carefully; too low, and 
-compression ratio suffers, too high, and your average-to-worst
-case compression times can become very large.  
-The default value of 30
-gives reasonable behaviour over a wide range of circumstances.
-
-Allowable values range from 0 to 250 inclusive.  0 is a special
-case, equivalent to using the default value of 30.
-
-Note that the randomisation process is entirely transparent.  
-If the library decides to randomise and restart compression on a
-block, it does so without comment.  Randomised blocks are
-automatically de-randomised during decompression, so data
-integrity is never compromised.
+when presented with worst case, highly repetitive, input data.  If
+compression runs into difficulties caused by repetitive data, the
+library switches from the standard sorting algorithm to a fallback
+algorithm.  The fallback is slower than the standard algorithm by
+perhaps a factor of three, but always behaves reasonably, no matter how
+bad the input.
+
+Lower values of @code{workFactor} reduce the amount of effort the
+standard algorithm will expend before resorting to the fallback.  You
+should set this parameter carefully; too low, and many inputs will be
+handled by the fallback algorithm and so compress rather slowly, too
+high, and your average-to-worst case compression times can become very
+large.  The default value of 30 gives reasonable behaviour over a wide
+range of circumstances.
+
+Allowable values range from 0 to 250 inclusive.  0 is a special case,
+equivalent to using the default value of 30.
+
+Note that the compressed output generated is the same regardless of
+whether or not the fallback algorithm is used.
+
+Be aware also that this parameter may disappear entirely in future
+versions of the library.  In principle it should be possible to devise a
+good way to automatically choose which algorithm to use.  Such a
+mechanism would render the parameter obsolete.
 
 Possible return values:
 @display
@@ -1652,6 +1643,48 @@ Possible return values:
 
 
 
+@section @code{zlib} compatibility functions
+Yoshioka Tsuneo has contributed some functions to
+give better @code{zlib} compatibility.  These functions are
+@code{bzopen}, @code{bzread}, @code{bzwrite}, @code{bzflush},
+@code{bzclose},
+@code{bzerror} and @code{bzlibVersion}.
+These functions are not (yet) officially part of
+the library.  If they break, you get to keep all the pieces.
+Nevertheless, I think they work ok.
+@example
+typedef void BZFILE;
+
+const char * bzlibVersion ( void );
+@end example
+Returns a string indicating the library version.
+@example
+BZFILE * bzopen  ( const char *path, const char *mode );
+BZFILE * bzdopen ( int        fd,    const char *mode );
+@end example
+Opens a @code{.bz2} file for reading or writing, using either its name
+or a pre-existing file descriptor. 
+Analogous to @code{fopen} and @code{fdopen}.
+@example         
+int bzread  ( BZFILE* b, void* buf, int len );
+int bzwrite ( BZFILE* b, void* buf, int len );
+@end example
+Reads/writes data from/to a previously opened @code{BZFILE}.
+Analogous to @code{fread} and @code{fwrite}.
+@example
+int  bzflush ( BZFILE* b );
+void bzclose ( BZFILE* b );
+@end example
+Flushes/closes a @code{BZFILE}.  @code{bzflush} doesn't actually do
+anything.  Analogous to @code{fflush} and @code{fclose}.
+
+@example 
+const char * bzerror ( BZFILE *b, int *errnum )
+@end example
+Returns a string describing the more recent error status of
+@code{b}, and also sets @code{*errnum} to its numerical value.
+
+
 @section Using the library in a @code{stdio}-free environment
 
 @subsection Getting rid of @code{stdio}
@@ -1677,14 +1710,14 @@ was compiled with @code{BZ_NO_STDIO} set.
 
 For a normal compile, an assertion failure yields the message
 @example
-   bzip2/libbzip2, v0.9.0: internal error number N.
-   This is a bug in bzip2/libbzip2, v0.9.0.  Please report
+   bzip2/libbzip2, v0.9.5: internal error number N.
+   This is a bug in bzip2/libbzip2, v0.9.5.  Please report
    it to me at: jseward@@acm.org.  If this happened when
    you were using some program which uses libbzip2 as a
    component, you should also report this bug to the author(s)
    of that program.  Please make an effort to report this bug;
    timely and accurate bug reports eventually lead to higher
-   quality software.  Thx.  Julian Seward, 27 June 1998.
+   quality software.  Thanks.  Julian Seward, 24 May 1999.
 @end example
 where @code{N} is some error code number.  @code{exit(3)}
 is then called.
@@ -1721,7 +1754,7 @@ If you can't
 open the project file for some reason, make a new one, naming these files:
 @code{blocksort.c}, @code{bzlib.c}, @code{compress.c}, 
 @code{crctable.c}, @code{decompress.c}, @code{huffman.c}, @*
-@code{randtable.c} and @code{libbz2.def}.  You might also need
+@code{randtable.c} and @code{libbz2.def}.  You will also need
 to name the header files @code{bzlib.h} and @code{bzlib_private.h}.
 
 If you don't use VC++, you may need to define the proprocessor symbol
@@ -1730,6 +1763,14 @@ If you don't use VC++, you may need to define the proprocessor symbol
 Finally, @code{dlltest.c} is a sample program using the DLL.  It has a
 project file, @code{dlltest.dsp}.
 
+If you just want a makefile for Visual C, have a look at
+@code{makefile.msc}.
+
+Be aware that if you compile @code{bzip2} itself on Win32, you must set
+@code{BZ_UNIX} to 0 and @code{BZ_LCCWIN32} to 1, in the file
+@code{bzip2.c}, before compiling.  Otherwise the resulting binary won't
+work correctly.
+
 I haven't tried any of this stuff myself, but it all looks plausible.
 
 
@@ -1740,7 +1781,8 @@ These are just some random thoughts of mine.  Your mileage may
 vary.
 
 @section Limitations of the compressed file format
-@code{bzip2-0.9.0} uses exactly the same file format as the previous
+@code{bzip2-0.9.5} and @code{0.9.0}
+use exactly the same file format as the previous
 version, @code{bzip2-0.1}.  This decision was made in the interests of
 stability.  Creating yet another incompatible compressed file format
 would create further confusion and disruption for users.
@@ -1776,12 +1818,11 @@ decompression and, in retrospect, are unnecessary.  These are:
       @code{bzip2-0.1}'s performance on repetitive data, so
       perhaps it isn't a problem for real inputs.
 
-      Probably the best long-term solution
+      Probably the best long-term solution,
+      and the one I have incorporated into 0.9.5 and above,
       is to use the existing sorting
       algorithm initially, and fall back to a O(N (log N)^2)
       algorithm if the standard algorithm gets into difficulties.
-      This can be done without much difficulty; I made
-      a prototype implementation of it some months now.
 @item The compressed file format was never designed to be
       handled by a library, and I have had to jump though
       some hoops to produce an efficient implementation of
@@ -1797,7 +1838,7 @@ It would be fair to say that the @code{bzip2} format was frozen
 before I properly and fully understood the performance
 consequences of doing so.
 
-Improvements which I have been able to incorporate into
+Improvements which I was able to incorporate into
 0.9.0, despite using the same file format, are:
 @itemize @bullet
 @item Single array implementation of the inverse BWT.  This
@@ -1808,8 +1849,7 @@ Improvements which I have been able to incorporate into
       of values.
 @item @code{bzip2-0.9.0} now reads and writes files with @code{fread}
       and @code{fwrite}; version 0.1 used @code{putc} and @code{getc}.
-      Duh! I'm embarrassed at my own moronicness (moronicity?) on this
-      one.
+      Duh!  Well, you live and learn.
 
 @end itemize
 Further ahead, it would be nice 
@@ -1820,7 +1860,7 @@ require some careful design of compressed file formats.
 
 @section Portability issues
 After some consideration, I have decided not to use
-GNU @code{autoconf} to configure 0.9.0.
+GNU @code{autoconf} to configure 0.9.5.
 
 @code{autoconf}, admirable and wonderful though it is, 
 mainly assists with portability problems between Unix-like
@@ -1835,8 +1875,10 @@ under Unix straight out-of-the-box, so to speak, especially
 if you have a version of GNU C available.
 
 There are a couple of @code{__inline__} directives in the code.  GNU C
-(@code{gcc}) should be able to handle them.  If your compiler doesn't
-like them, just @code{#define} @code{__inline__} to be null.  One
+(@code{gcc}) should be able to handle them.  If you're not using
+GNU C, your C compiler shouldn't see them at all.
+If your compiler does, for some reason, see them and doesn't
+like them, just @code{#define} @code{__inline__} to be @code{/* */}.  One
 easy way to do this is to compile with the flag @code{-D__inline__=}, 
 which should be understood by most Unix compilers.
 
@@ -1853,6 +1895,11 @@ distribution, please try and link it statically (@code{gcc -s}).  This
 avoids all sorts of library-version issues that others may encounter
 later on.
 
+If you build @code{bzip2} on Win32, you must set @code{BZ_UNIX} to 0 and
+@code{BZ_LCCWIN32} to 1, in the file @code{bzip2.c}, before compiling.
+Otherwise the resulting binary won't work correctly.
+
+
 
 @section Reporting bugs
 I tried pretty hard to make sure @code{bzip2} is
@@ -1976,45 +2023,39 @@ A record of the tests I've done.
 
 First, some data sets:
 @itemize @bullet
-@item B: a directory containing a 6001 files, one for every length in the
+@item B: a directory containing 6001 files, one for every length in the
       range 0 to 6000 bytes.  The files contain random lowercase
       letters.  18.7 megabytes.
 @item H: my home directory tree.  Documents, source code, mail files,
       compressed data.  H contains B, and also a directory of 
       files designed as boundary cases for the sorting; mostly very
-      repetitive, nasty files.  445 megabytes.
+      repetitive, nasty files.  565 megabytes.
 @item A: directory tree holding various applications built from source:
-      @code{egcs-1.0.2}, @code{gcc-2.8.1}, KDE Beta 4, GTK, Octave, etc.
-      827 megabytes.
-@item P: directory tree holding large amounts of source code (@code{.tar} 
-      files) of the entire GNU distribution, plus a couple of
-      Linux distributions.  2400 megabytes.
+      @code{egcs}, @code{gcc-2.8.1}, KDE, GTK, Octave, etc.
+      2200 megabytes.
 @end itemize
 The tests conducted are as follows.  Each test means compressing 
 (a copy of) each file in the data set, decompressing it and
 comparing it against the original.
 
-First, a bunch of tests with block sizes, internal buffer
-sizes and randomisation lengths set very small, 
+First, a bunch of tests with block sizes and internal buffer
+sizes set very small, 
 to detect any problems with the
-blocking, buffering and randomisation mechanisms.  
+blocking and buffering mechanisms.  
 This required modifying the source code so as to try to 
 break it.
 @enumerate
 @item Data set H, with
       buffer size of 1 byte, and block size of 23 bytes.
 @item Data set B, buffer sizes 1 byte, block size 1 byte.
-@item As (2) but small-mode decompression (first 1700 files).
+@item As (2) but small-mode decompression.
 @item As (2) with block size 2 bytes.
 @item As (2) with block size 3 bytes.
 @item As (2) with block size 4 bytes.
 @item As (2) with block size 5 bytes.
 @item As (2) with block size 6 bytes and small-mode decompression.
-@item H with normal buffer sizes (5000 bytes), normal block
-      size (up to 900000 bytes), but with randomisation
-      mechanism running intensely (randomising approximately every
-      third byte).
-@item As (9) with small-mode decompression.
+@item H with buffer size of 1 byte, but normal block
+      size (up to 900000 bytes).
 @end enumerate
 Then some tests with unmodified source code.
 @enumerate
@@ -2023,18 +2064,26 @@ Then some tests with unmodified source code.
 @item H, compress with flag @code{-1}.
 @item H, compress with flag @code{-s}, decompress with flag @code{-s}.
 @item Forwards compatibility: H, @code{bzip2-0.1pl2} compressing,
-      @code{bzip2-0.9.0} decompressing, all settings normal.
-@item Backwards compatibility:  H, @code{bzip2-0.9.0} compressing,
+      @code{bzip2-0.9.5} decompressing, all settings normal.
+@item Backwards compatibility:  H, @code{bzip2-0.9.5} compressing,
       @code{bzip2-0.1pl2} decompressing, all settings normal.
 @item Bigger tests: A, all settings normal.
-@item P, all settings normal.
-@item Misc test: about 100 megabytes of @code{.tar} files with
-      @code{bzip2} compiled with Purify.
+@item As (7), using the fallback (Sadakane-like) sorting algorithm.
+@item As (8), compress with flag @code{-1}, decompress with flag
+      @code{-s}.
+@item H, using the fallback sorting algorithm.
+@item Forwards compatibility: A, @code{bzip2-0.1pl2} compressing,
+      @code{bzip2-0.9.5} decompressing, all settings normal.
+@item Backwards compatibility:  A, @code{bzip2-0.9.5} compressing,
+      @code{bzip2-0.1pl2} decompressing, all settings normal.
+@item Misc test: about 400 megabytes of @code{.tar} files with
+      @code{bzip2} compiled with Checker (a memory access error
+       detector, like Purify).
 @item Misc tests to make sure it builds and runs ok on non-Linux/x86
       platforms.
 @end enumerate
-These tests were conducted on a 205 MHz Cyrix 6x86MX machine, running
-Linux 2.0.32.  They represent nearly a week of continuous computation.
+These tests were conducted on a 225 MHz IDT WinChip machine, running
+Linux 2.0.36.  They represent nearly a week of continuous computation.
 All tests completed successfully.
 
 
@@ -2063,7 +2112,7 @@ David J. Wheeler
    Program bred3.c and accompanying document bred3.ps.
    This contains the idea behind the multi-table Huffman
    coding scheme.
-   ftp://ftp.cl.cam.ac.uk/pub/user/djw3/
+   ftp://ftp.cl.cam.ac.uk/users/djw3/
 
 Jon L. Bentley and Robert Sedgewick
   "Fast Algorithms for Sorting and Searching Strings"
-- 
cgit v1.2.3-55-g6feb