author     Mark Adler <madler@alumni.caltech.edu>   2011-09-09 23:25:27 -0700
committer  Mark Adler <madler@alumni.caltech.edu>   2011-09-09 23:25:27 -0700
commit     b1c19ca6d82c98a8be6cd9cad7a9c5fa5e8e634e
tree       f0eeb8f52d07819f417411da5104c9d743dd46de
parent     abf180a067223611620dd97dd5681df7c7fa7c9b
    zlib 1.2.3.1 (tag: v1.2.3.1)
Diffstat (limited to 'doc/txtvsbin.txt')

 -rw-r--r--  doc/txtvsbin.txt  107
 1 file changed, 107 insertions, 0 deletions
diff --git a/doc/txtvsbin.txt b/doc/txtvsbin.txt
new file mode 100644
index 0000000..3d0f063
--- /dev/null
+++ b/doc/txtvsbin.txt
@@ -0,0 +1,107 @@
A Fast Method for Identifying Plain Text Files
==============================================


Introduction
------------

Given a file coming from an unknown source, it is sometimes desirable
to find out whether the format of that file is plain text. Although
this may seem like a simple task, a fully accurate detection of the
file type requires heavy-duty semantic analysis of the file contents.
It is, however, possible to obtain satisfactory results by employing
various heuristics.

Previous versions of PKZip and other zip-compatible compression tools
used a crude detection scheme: if more than 80% (4/5) of the bytes
found in a certain buffer are within the range [7..127], the file is
labeled as plain text; otherwise it is labeled as binary. A prominent
limitation of this scheme is its restriction to Latin-based alphabets.
Other alphabets, like Greek, Cyrillic or Asian, make extensive use of
the bytes within the range [128..255], and texts using these alphabets
are most often misidentified by this scheme; in other words, the rate
of false negatives is sometimes too high, which means that the recall
is low. Another weakness of this scheme is its reduced precision, due
to the false positives that may occur when binary files containing
large amounts of textual characters are misidentified as plain text.
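
For illustration, that older heuristic can be sketched in a few lines
of C. The function name and buffer-based interface are assumptions
made for this example; this is not actual PKZip code.

    #include <stddef.h>

    /* Old 80% heuristic: label a buffer as text when more than 4/5
       of its bytes fall within the range [7..127]. */
    static int old_scheme_is_text(const unsigned char *buf, size_t len)
    {
        size_t in_range = 0, i;

        for (i = 0; i < len; i++)
            if (buf[i] >= 7 && buf[i] <= 127)
                in_range++;
        return len > 0 && in_range * 5 > len * 4;   /* strictly > 80% */
    }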

In this article we propose a new, simple detection scheme that features
much higher precision and near-100% recall. This scheme is designed to
work on ASCII, Unicode and other ASCII-derived alphabets, and it handles
single-byte encodings (ISO-8859, MacRoman, KOI8, etc.) and variable-sized
encodings (ISO-2022, UTF-8, etc.). Wider encodings (UCS-2/UTF-16 and
UCS-4/UTF-32) are not handled, however.


The Algorithm
-------------

The algorithm works by dividing the set of bytecodes [0..255] into three
categories:
- The white list of textual bytecodes:
  9 (TAB), 10 (LF), 13 (CR), 32 (SPACE) to 255.
- The gray list of tolerated bytecodes:
  7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC).
- The black list of undesired, non-textual bytecodes:
  0 (NUL) to 6, 14 to 31.

If a file contains at least one byte that belongs to the white list and
no byte that belongs to the black list, then the file is categorized as
plain text; otherwise, it is categorized as binary. (The boundary case,
when the file is empty, automatically falls into the latter category.)
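
This rule translates directly into code. The following C sketch is
only an illustration of the rule above (the function name is an
assumption for this example); it is not the implementation shipped
with zlib itself.

    #include <stddef.h>

    /* Classify a whole file held in buf[0..len-1]:
       return 1 for plain text, 0 for binary. */
    static int is_plain_text(const unsigned char *buf, size_t len)
    {
        int seen_white = 0;             /* any white-listed byte yet? */
        size_t i;

        for (i = 0; i < len; i++) {
            unsigned char c = buf[i];
            if (c == 9 || c == 10 || c == 13 || c >= 32)
                seen_white = 1;         /* white list */
            else if (c == 7 || c == 8 || c == 11 || c == 12 ||
                     c == 26 || c == 27)
                ;                       /* gray list: neither proves
                                           nor disproves text */
            else
                return 0;               /* black list: binary */
        }
        return seen_white;              /* empty file => binary */
    }

Note that a single pass suffices and that the scan may return at the
first black-listed byte, so binary files are typically rejected after
reading only a small prefix.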


Rationale
---------

The idea behind this algorithm relies on two observations.

The first observation is that, although the full range of 7-bit codes
[0..127] is properly specified by the ASCII standard, most control
characters in the range [0..31] are not used in practice. The only
widely-used, almost universally-portable control codes are 9 (TAB),
10 (LF) and 13 (CR). There are a few more control codes that are
recognized on a reduced range of platforms and text viewers/editors:
7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB) and 27 (ESC); but these
codes are rarely (if ever) used alone, without being accompanied by
some printable text. Even the newer, portable text formats such as
XML avoid using control characters outside the list mentioned here.

The second observation is that most binary files tend to contain
control characters, especially 0 (NUL). Even though the older text
detection schemes observe the presence of non-ASCII codes from the
range [128..255], the precision rarely suffers if this upper range
is labeled as textual, because genuinely binary files tend to contain
both control characters and codes from the upper range. On the other
hand, the upper range needs to be labeled as textual, because it is
used by virtually all ASCII extensions. In particular, this range is
used for encoding non-Latin scripts.

Since there is no counting involved, other than simply observing the
presence or the absence of some byte values, the algorithm produces
consistent results regardless of which alphabet encoding is used.
(If counting were involved, it would be possible to obtain different
results on the same text encoded, say, in ISO-8859-16 versus UTF-8,
because UTF-8 represents each non-ASCII character as two or more
upper-range bytes where a single-byte encoding uses just one.)

There is an extra category of plain text files that are "polluted" with
one or more black-listed codes, either by mistake or by peculiar design
considerations. In such cases, a scheme that tolerates a small fraction
of black-listed codes would provide an increased recall (i.e. more true
positives). This, however, incurs a reduced precision overall, since
false positives are more likely to appear in binary files that contain
large chunks of textual data. Furthermore, "polluted" plain text should
be regarded as binary by general-purpose text detection schemes, because
general-purpose text processing algorithms might not be applicable.
Under this premise, it is safe to say that our detection method provides
a near-100% recall.
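
To make that trade-off concrete, such a tolerant variant might look
like the C sketch below. The threshold of one black-listed byte per
256 bytes is an arbitrary assumption chosen purely for illustration;
this variant is not part of the scheme proposed above.

    #include <stddef.h>

    /* Hypothetical tolerant variant: accept a file whose black-listed
       bytes make up at most 1/256 of its length. */
    static int is_text_tolerant(const unsigned char *buf, size_t len)
    {
        size_t black = 0, i;
        int seen_white = 0;

        for (i = 0; i < len; i++) {
            unsigned char c = buf[i];
            if (c == 9 || c == 10 || c == 13 || c >= 32)
                seen_white = 1;             /* white list */
            else if (c != 7 && c != 8 && c != 11 && c != 12 &&
                     c != 26 && c != 27)
                black++;                    /* black list */
        }
        return seen_white && black * 256 <= len;
    }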

Experiments have been run on many files coming from various platforms
and applications. We tried plain text files, system logs, source code,
formatted office documents, compiled object code, etc. The results
confirm the optimistic assumptions about the capabilities of this
algorithm.


--
Cosmin Truta
Last updated: 2006-May-28