summaryrefslogtreecommitdiff
path: root/algorithm.txt
diff options
context:
space:
mode:
Diffstat (limited to '')
-rw-r--r--algorithm.txt72
1 files changed, 36 insertions, 36 deletions
diff --git a/algorithm.txt b/algorithm.txt
index a049765..b022dde 100644
--- a/algorithm.txt
+++ b/algorithm.txt
@@ -91,40 +91,40 @@ for 30 symbols.
91 91
922.2 More details on the inflate table lookup 922.2 More details on the inflate table lookup
93 93
94Ok, you want to know what this cleverly obfuscated inflate tree actually 94Ok, you want to know what this cleverly obfuscated inflate tree actually
95looks like. You are correct that it's not a Huffman tree. It is simply a 95looks like. You are correct that it's not a Huffman tree. It is simply a
96lookup table for the first, let's say, nine bits of a Huffman symbol. The 96lookup table for the first, let's say, nine bits of a Huffman symbol. The
97symbol could be as short as one bit or as long as 15 bits. If a particular 97symbol could be as short as one bit or as long as 15 bits. If a particular
98symbol is shorter than nine bits, then that symbol's translation is duplicated 98symbol is shorter than nine bits, then that symbol's translation is duplicated
99in all those entries that start with that symbol's bits. For example, if the 99in all those entries that start with that symbol's bits. For example, if the
100symbol is four bits, then it's duplicated 32 times in a nine-bit table. If a 100symbol is four bits, then it's duplicated 32 times in a nine-bit table. If a
101symbol is nine bits long, it appears in the table once. 101symbol is nine bits long, it appears in the table once.
102 102
103If the symbol is longer than nine bits, then that entry in the table points 103If the symbol is longer than nine bits, then that entry in the table points
104to another similar table for the remaining bits. Again, there are duplicated 104to another similar table for the remaining bits. Again, there are duplicated
105entries as needed. The idea is that most of the time the symbol will be short 105entries as needed. The idea is that most of the time the symbol will be short
106and there will only be one table look up. (That's whole idea behind data 106and there will only be one table look up. (That's whole idea behind data
107compression in the first place.) For the less frequent long symbols, there 107compression in the first place.) For the less frequent long symbols, there
108will be two lookups. If you had a compression method with really long 108will be two lookups. If you had a compression method with really long
109symbols, you could have as many levels of lookups as is efficient. For 109symbols, you could have as many levels of lookups as is efficient. For
110inflate, two is enough. 110inflate, two is enough.
111 111
112So a table entry either points to another table (in which case nine bits in 112So a table entry either points to another table (in which case nine bits in
113the above example are gobbled), or it contains the translation for the symbol 113the above example are gobbled), or it contains the translation for the symbol
114and the number of bits to gobble. Then you start again with the next 114and the number of bits to gobble. Then you start again with the next
115ungobbled bit. 115ungobbled bit.
116 116
117You may wonder: why not just have one lookup table for how ever many bits the 117You may wonder: why not just have one lookup table for how ever many bits the
118longest symbol is? The reason is that if you do that, you end up spending 118longest symbol is? The reason is that if you do that, you end up spending
119more time filling in duplicate symbol entries than you do actually decoding. 119more time filling in duplicate symbol entries than you do actually decoding.
120At least for deflate's output that generates new trees every several 10's of 120At least for deflate's output that generates new trees every several 10's of
121kbytes. You can imagine that filling in a 2^15 entry table for a 15-bit code 121kbytes. You can imagine that filling in a 2^15 entry table for a 15-bit code
122would take too long if you're only decoding several thousand symbols. At the 122would take too long if you're only decoding several thousand symbols. At the
123other extreme, you could make a new table for every bit in the code. In fact, 123other extreme, you could make a new table for every bit in the code. In fact,
124that's essentially a Huffman tree. But then you spend two much time 124that's essentially a Huffman tree. But then you spend two much time
125traversing the tree while decoding, even for short symbols. 125traversing the tree while decoding, even for short symbols.
126 126
127So the number of bits for the first lookup table is a trade of the time to 127So the number of bits for the first lookup table is a trade of the time to
128fill out the table vs. the time spent looking at the second level and above of 128fill out the table vs. the time spent looking at the second level and above of
129the table. 129the table.
130 130
@@ -154,7 +154,7 @@ Let's make the first table three bits long (eight entries):
154110: -> table X (gobble 3 bits) 154110: -> table X (gobble 3 bits)
155111: -> table Y (gobble 3 bits) 155111: -> table Y (gobble 3 bits)
156 156
157Each entry is what the bits decode as and how many bits that is, i.e. how 157Each entry is what the bits decode as and how many bits that is, i.e. how
158many bits to gobble. Or the entry points to another table, with the number of 158many bits to gobble. Or the entry points to another table, with the number of
159bits to gobble implicit in the size of the table. 159bits to gobble implicit in the size of the table.
160 160
@@ -166,7 +166,7 @@ long:
16610: D,2 16610: D,2
16711: E,2 16711: E,2
168 168
169Table Y is three bits long since the longest code starting with 111 is six 169Table Y is three bits long since the longest code starting with 111 is six
170bits long: 170bits long:
171 171
172000: F,2 172000: F,2
@@ -178,20 +178,20 @@ bits long:
178110: I,3 178110: I,3
179111: J,3 179111: J,3
180 180
181So what we have here are three tables with a total of 20 entries that had to 181So what we have here are three tables with a total of 20 entries that had to
182be constructed. That's compared to 64 entries for a single table. Or 182be constructed. That's compared to 64 entries for a single table. Or
183compared to 16 entries for a Huffman tree (six two entry tables and one four 183compared to 16 entries for a Huffman tree (six two entry tables and one four
184entry table). Assuming that the code ideally represents the probability of 184entry table). Assuming that the code ideally represents the probability of
185the symbols, it takes on the average 1.25 lookups per symbol. That's compared 185the symbols, it takes on the average 1.25 lookups per symbol. That's compared
186to one lookup for the single table, or 1.66 lookups per symbol for the 186to one lookup for the single table, or 1.66 lookups per symbol for the
187Huffman tree. 187Huffman tree.
188 188
189There, I think that gives you a picture of what's going on. For inflate, the 189There, I think that gives you a picture of what's going on. For inflate, the
190meaning of a particular symbol is often more than just a letter. It can be a 190meaning of a particular symbol is often more than just a letter. It can be a
191byte (a "literal"), or it can be either a length or a distance which 191byte (a "literal"), or it can be either a length or a distance which
192indicates a base value and a number of bits to fetch after the code that is 192indicates a base value and a number of bits to fetch after the code that is
193added to the base value. Or it might be the special end-of-block code. The 193added to the base value. Or it might be the special end-of-block code. The
194data structures created in inftrees.c try to encode all that information 194data structures created in inftrees.c try to encode all that information
195compactly in the tables. 195compactly in the tables.
196 196
197 197