author | Diego Nehab <diego@tecgraf.puc-rio.br> | 2007-10-11 21:16:28 +0000 |
---|---|---|
committer | Diego Nehab <diego@tecgraf.puc-rio.br> | 2007-10-11 21:16:28 +0000 |
commit | 52ac60af8132ae7e42151d3012a9607d7cadaf95 (patch) | |
tree | 95085f9f535d20b6686399af1ba284bb74ac89d3 /gem/ltn012.tex | |
parent | e394956cde629e24ecdd285c4c13c206891fcec4 (diff) | |
Tested each sample.
Diffstat (limited to 'gem/ltn012.tex')
-rw-r--r-- | gem/ltn012.tex | 355 |
1 files changed, 190 insertions, 165 deletions
diff --git a/gem/ltn012.tex b/gem/ltn012.tex
index 0f81b86..8eccd46 100644
--- a/gem/ltn012.tex
+++ b/gem/ltn012.tex
@@ -6,7 +6,10 @@ | |||
6 | \DefineVerbatimEnvironment{mime}{Verbatim}{fontsize=\small,commandchars=\$\#\%} | 6 | \DefineVerbatimEnvironment{mime}{Verbatim}{fontsize=\small,commandchars=\$\#\%} |
7 | \newcommand{\stick}[1]{\vbox{\setlength{\parskip}{0pt}#1}} | 7 | \newcommand{\stick}[1]{\vbox{\setlength{\parskip}{0pt}#1}} |
8 | \newcommand{\bl}{\ensuremath{\mathtt{\backslash}}} | 8 | \newcommand{\bl}{\ensuremath{\mathtt{\backslash}}} |
9 | 9 | \newcommand{\CR}{\texttt{CR}} | |
10 | \newcommand{\LF}{\texttt{LF}} | ||
11 | \newcommand{\CRLF}{\texttt{CR~LF}} | ||
12 | \newcommand{\nil}{\texttt{nil}} | ||
10 | 13 | ||
11 | \title{Filters, sources, sinks, and pumps\\ | 14 | \title{Filters, sources, sinks, and pumps\\ |
12 | {\large or Functional programming for the rest of us}} | 15 | {\large or Functional programming for the rest of us}} |
@@ -17,30 +20,31 @@ | |||
17 | \maketitle | 20 | \maketitle |
18 | 21 | ||
19 | \begin{abstract} | 22 | \begin{abstract} |
20 | Certain data processing operations can be implemented in the | 23 | Certain data processing operations can be implemented in the |
21 | form of filters. A filter is a function that can process data | 24 | form of filters. A filter is a function that can process |
22 | received in consecutive function calls, returning partial | 25 | data received in consecutive invocations, returning partial |
23 | results after each invocation. Examples of operations that can be | 26 | results each time it is called. Examples of operations that |
24 | implemented as filters include the end-of-line normalization | 27 | can be implemented as filters include the end-of-line |
25 | for text, Base64 and Quoted-Printable transfer content | 28 | normalization for text, Base64 and Quoted-Printable transfer |
26 | encodings, the breaking of text into lines, SMTP dot-stuffing, | 29 | content encodings, the breaking of text into lines, SMTP |
27 | and there are many others. Filters become even | 30 | dot-stuffing, and there are many others. Filters become |
28 | more powerful when we allow them to be chained together to | 31 | even more powerful when we allow them to be chained together |
29 | create composite filters. In this context, filters can be seen | 32 | to create composite filters. In this context, filters can be |
30 | as the middle links in a chain of data transformations. Sources an sinks | 33 | seen as the internal links in a chain of data transformations. |
31 | are the corresponding end points of these chains. A source | 34 | Sources and sinks are the corresponding end points in these |
32 | is a function that produces data, chunk by chunk, and a sink | 35 | chains. A source is a function that produces data, chunk by |
33 | is a function that takes data, chunk by chunk. In this | 36 | chunk, and a sink is a function that takes data, chunk by |
34 | article, we describe the design of an elegant interface for filters, | 37 | chunk. Finally, pumps are procedures that actively drive |
35 | sources, sinks, and chaining, and illustrate each step | 38 | data from a source to a sink, and indirectly through all |
36 | with concrete examples. | 39 | intervening filters. In this article, we describe the design of an |
40 | elegant interface for filters, sources, sinks, chains, and | ||
41 | pumps, and we illustrate each step with concrete examples. | ||
37 | \end{abstract} | 42 | \end{abstract} |
38 | 43 | ||
39 | |||
40 | \section{Introduction} | 44 | \section{Introduction} |
41 | 45 | ||
42 | Within the realm of networking applications, we are often | 46 | Within the realm of networking applications, we are often |
43 | required apply transformations to streams of data. Examples | 47 | required to apply transformations to streams of data. Examples |
44 | include the end-of-line normalization for text, Base64 and | 48 | include the end-of-line normalization for text, Base64 and |
45 | Quoted-Printable transfer content encodings, breaking text | 49 | Quoted-Printable transfer content encodings, breaking text |
46 | into lines with a maximum number of columns, SMTP | 50 | into lines with a maximum number of columns, SMTP |
@@ -50,11 +54,10 @@ transfer coding, and the list goes on. | |||
50 | Many complex tasks require a combination of two or more such | 54 | Many complex tasks require a combination of two or more such |
51 | transformations, and therefore a general mechanism for | 55 | transformations, and therefore a general mechanism for |
52 | promoting reuse is desirable. In the process of designing | 56 | promoting reuse is desirable. In the process of designing |
53 | \texttt{LuaSocket~2.0}, David Burgess and I were forced to deal with | 57 | \texttt{LuaSocket~2.0}, we repeatedly faced this problem. |
54 | this problem. The solution we reached proved to be very | 58 | The solution we reached proved to be very general and |
55 | general and convenient. It is based on the concepts of | 59 | convenient. It is based on the concepts of filters, sources, |
56 | filters, sources, sinks, and pumps, which we introduce | 60 | sinks, and pumps, which we introduce below. |
57 | below. | ||
58 | 61 | ||
59 | \emph{Filters} are functions that can be repeatedly invoked | 62 | \emph{Filters} are functions that can be repeatedly invoked |
60 | with chunks of input, successively returning processed | 63 | with chunks of input, successively returning processed |
@@ -62,34 +65,33 @@ chunks of output. More importantly, the result of | |||
62 | concatenating all the output chunks must be the same as the | 65 | concatenating all the output chunks must be the same as the |
63 | result of applying the filter to the concatenation of all | 66 | result of applying the filter to the concatenation of all |
64 | input chunks. In fancier language, filters \emph{commute} | 67 | input chunks. In fancier language, filters \emph{commute} |
65 | with the concatenation operator. As a result, chunk | 68 | with the concatenation operator. More importantly, filters |
66 | boundaries are irrelevant: filters correctly handle input | 69 | must handle input data correctly no matter how the stream |
67 | data no matter how it is split. | 70 | has been split into chunks. |
68 | 71 | ||
69 | A \emph{chain} transparently combines the effect of one or | 72 | A \emph{chain} is a function that transparently combines the |
70 | more filters. The interface of a chain is | 73 | effect of one or more filters. The interface of a chain is |
71 | indistinguishable from the interface of its components. | 74 | indistinguishable from the interface of its component |
72 | This allows a chained filter to be used wherever an atomic | 75 | filters. This allows a chained filter to be used wherever |
73 | filter is expected. In particular, chains can be | 76 | an atomic filter is accepted. In particular, chains can be |
74 | themselves chained to create arbitrarily complex operations. | 77 | themselves chained to create arbitrarily complex operations. |
75 | 78 | ||
76 | Filters can be seen as internal nodes in a network through | 79 | Filters can be seen as internal nodes in a network through |
77 | which data will flow, potentially being transformed many | 80 | which data will flow, potentially being transformed many |
78 | times along its way. Chains connect these nodes together. | 81 | times along the way. Chains connect these nodes together. |
79 | To complete the picture, we need \emph{sources} and | 82 | The initial and final nodes of the network are |
80 | \emph{sinks}. These are the initial and final nodes of the | 83 | \emph{sources} and \emph{sinks}, respectively. Less |
81 | network, respectively. Less abstractly, a source is a | 84 | abstractly, a source is a function that produces new data |
82 | function that produces new data every time it is called. | 85 | every time it is invoked. Conversely, sinks are functions |
83 | Conversely, sinks are functions that give a final | 86 | that give a final destination to the data they receive. |
84 | destination to the data they receive. Naturally, sources | 87 | Naturally, sources and sinks can also be chained with |
85 | and sinks can also be chained with filters to produce | 88 | filters to produce filtered sources and sinks. |
86 | filtered sources and sinks. | ||
87 | 89 | ||
88 | Finally, filters, chains, sources, and sinks are all passive | 90 | Finally, filters, chains, sources, and sinks are all passive |
89 | entities: they must be repeatedly invoked in order for | 91 | entities: they must be repeatedly invoked in order for |
90 | anything to happen. \emph{Pumps} provide the driving force | 92 | anything to happen. \emph{Pumps} provide the driving force |
91 | that pushes data through the network, from a source to a | 93 | that pushes data through the network, from a source to a |
92 | sink. | 94 | sink, and indirectly through all intervening filters. |
93 | 95 | ||
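The pump code itself falls outside this diff; as a minimal illustration of the driving role described above, here is my own sketch of a pump, assuming the source and sink conventions the article states later (a source returns a chunk or nil, a sink returns non-nil to accept more data). The names `pump.step` and `pump.all` follow the calls used in the article's examples, but this body is a guess, not the article's code.

```lua
-- Hypothetical pump sketch: pump.step moves one chunk from a source
-- to a sink; pump.all loops until the source reports end of data (nil)
-- or either side reports an error.
local pump = {}

function pump.step(src, snk)
  local chunk, src_err = src()
  local ret, snk_err = snk(chunk, src_err)
  if chunk and ret then return 1
  else return nil, src_err or snk_err end
end

function pump.all(src, snk)
  while true do
    local ret, err = pump.step(src, snk)
    if not ret then return not err, err end
  end
end
```

Note that the pump is the only active party: sources, filters, and sinks are all passive functions it invokes in turn.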
94 | In the following sections, we start with a simplified | 96 | In the following sections, we start with a simplified |
95 | interface, which we later refine. The evolution we present | 97 | interface, which we later refine. The evolution we present |
@@ -99,27 +101,28 @@ concepts within our application domain. | |||
99 | 101 | ||
100 | \subsection{A simple example} | 102 | \subsection{A simple example} |
101 | 103 | ||
102 | Let us use the end-of-line normalization of text as an | 104 | The end-of-line normalization of text is a good |
103 | example to motivate our initial filter interface. | 105 | example to motivate our initial filter interface. |
104 | Assume we are given text in an unknown end-of-line | 106 | Assume we are given text in an unknown end-of-line |
105 | convention (including possibly mixed conventions) out of the | 107 | convention (including possibly mixed conventions) out of the |
106 | commonly found Unix (LF), Mac OS (CR), and DOS (CRLF) | 108 | commonly found Unix (\LF), Mac OS (\CR), and |
107 | conventions. We would like to be able to write code like the | 109 | DOS (\CRLF) conventions. We would like to be able to |
108 | following: | 110 | use the following code to normalize the end-of-line markers:
109 | \begin{quote} | 111 | \begin{quote} |
110 | \begin{lua} | 112 | \begin{lua} |
111 | @stick# | 113 | @stick# |
112 | local in = source.chain(source.file(io.stdin), normalize("\r\n")) | 114 | local CRLF = "\013\010" |
113 | local out = sink.file(io.stdout) | 115 | local input = source.chain(source.file(io.stdin), normalize(CRLF)) |
114 | pump.all(in, out) | 116 | local output = sink.file(io.stdout) |
117 | pump.all(input, output) | ||
115 | % | 118 | % |
116 | \end{lua} | 119 | \end{lua} |
117 | \end{quote} | 120 | \end{quote} |
118 | 121 | ||
119 | This program should read data from the standard input stream | 122 | This program should read data from the standard input stream |
120 | and normalize the end-of-line markers to the canonic CRLF | 123 | and normalize the end-of-line markers to the canonic |
121 | marker, as defined by the MIME standard. Finally, the | 124 | \CRLF\ marker, as defined by the MIME standard. |
122 | normalized text should be sent to the standard output | 125 | Finally, the normalized text should be sent to the standard output |
123 | stream. We use a \emph{file source} that produces data from | 126 | stream. We use a \emph{file source} that produces data from |
124 | standard input, and chain it with a filter that normalizes | 127 | standard input, and chain it with a filter that normalizes |
125 | the data. The pump then repeatedly obtains data from the | 128 | the data. The pump then repeatedly obtains data from the |
@@ -127,27 +130,28 @@ source, and passes it to the \emph{file sink}, which sends | |||
127 | it to the standard output. | 130 | it to the standard output. |
128 | 131 | ||
129 | In the code above, the \texttt{normalize} \emph{factory} is a | 132 | In the code above, the \texttt{normalize} \emph{factory} is a |
130 | function that creates our normalization filter. This filter | 133 | function that creates our normalization filter, which |
131 | will replace any end-of-line marker with the canonic | 134 | replaces any end-of-line marker with the canonic marker. |
132 | `\verb|\r\n|' marker. The initial filter interface is | 135 | The initial filter interface is |
133 | trivial: a filter function receives a chunk of input data, | 136 | trivial: a filter function receives a chunk of input data, |
134 | and returns a chunk of processed data. When there are no | 137 | and returns a chunk of processed data. When there are no |
135 | more input data left, the caller notifies the filter by invoking | 138 | more input data left, the caller notifies the filter by invoking |
136 | it with a \texttt{nil} chunk. The filter responds by returning | 139 | it with a \nil\ chunk. The filter responds by returning |
137 | the final chunk of processed data. | 140 | the final chunk of processed data (which could of course be |
141 | the empty string). | ||
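As a concrete, if artificial, instance of this interface (my example, not from the article): a stateless filter needs no context at all, so an uppercasing filter satisfies the contract trivially.

```lua
-- A trivial filter obeying the interface: each call processes one
-- chunk; a nil chunk signals end of input, and the filter answers
-- with its final (here empty) chunk of processed data.
local function upper(chunk)
  if chunk == nil then return "" end
  return string.upper(chunk)
end
```

Because uppercasing each chunk and then concatenating equals concatenating and then uppercasing, this filter commutes with concatenation, as the definition requires.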
138 | 142 | ||
139 | Although the interface is extremely simple, the | 143 | Although the interface is extremely simple, the |
140 | implementation is not so obvious. A normalization filter | 144 | implementation is not so obvious. A normalization filter |
141 | respecting this interface needs to keep some kind of context | 145 | respecting this interface needs to keep some kind of context |
142 | between calls. This is because a chunk boundary may lie between | 146 | between calls. This is because a chunk boundary may lie between |
143 | the CR and LF characters marking the end of a line. This | 147 | the \CR\ and \LF\ characters marking the end of a single line. This |
144 | need for contextual storage motivates the use of | 148 | need for contextual storage motivates the use of |
145 | factories: each time the factory is invoked, it returns a | 149 | factories: each time the factory is invoked, it returns a |
146 | filter with its own context so that we can have several | 150 | filter with its own context so that we can have several |
147 | independent filters being used at the same time. For | 151 | independent filters being used at the same time. For |
148 | efficiency reasons, we must avoid the obvious solution of | 152 | efficiency reasons, we must avoid the obvious solution of |
149 | concatenating all the input into the context before | 153 | concatenating all the input into the context before |
150 | producing any output. | 154 | producing any output chunks. |
151 | 155 | ||
152 | To that end, we break the implementation into two parts: | 156 | To that end, we break the implementation into two parts: |
153 | a low-level filter, and a factory of high-level filters. The | 157 | a low-level filter, and a factory of high-level filters. The |
@@ -167,10 +171,10 @@ end-of-line normalization filters: | |||
167 | \begin{quote} | 171 | \begin{quote} |
168 | \begin{lua} | 172 | \begin{lua} |
169 | @stick# | 173 | @stick# |
170 | function filter.cycle(low, ctx, extra) | 174 | function filter.cycle(lowlevel, context, extra) |
171 | return function(chunk) | 175 | return function(chunk) |
172 | local ret | 176 | local ret |
173 | ret, ctx = low(ctx, chunk, extra) | 177 | ret, context = lowlevel(context, chunk, extra) |
174 | return ret | 178 | return ret |
175 | end | 179 | end |
176 | end | 180 | end |
@@ -178,27 +182,30 @@ end | |||
178 | 182 | ||
179 | @stick# | 183 | @stick# |
180 | function normalize(marker) | 184 | function normalize(marker) |
181 | return cycle(eol, 0, marker) | 185 | return filter.cycle(eol, 0, marker) |
182 | end | 186 | end |
183 | % | 187 | % |
184 | \end{lua} | 188 | \end{lua} |
185 | \end{quote} | 189 | \end{quote} |
186 | 190 | ||
187 | The \texttt{normalize} factory simply calls a more generic | 191 | The \texttt{normalize} factory simply calls a more generic |
188 | factory, the \texttt{cycle} factory. This factory receives a | 192 | factory, the \texttt{cycle}~factory, passing the low-level |
193 | filter~\texttt{eol}. The \texttt{cycle}~factory receives a | ||
189 | low-level filter, an initial context, and an extra | 194 | low-level filter, an initial context, and an extra |
190 | parameter, and returns a new high-level filter. Each time | 195 | parameter, and returns a new high-level filter. Each time |
191 | the high-level filter is passed a new chunk, it invokes the | 196 | the high-level filter is passed a new chunk, it invokes the
192 | low-level filter with the previous context, the new chunk, | 197 | low-level filter with the previous context, the new chunk, |
193 | and the extra argument. It is the low-level filter that | 198 | and the extra argument. It is the low-level filter that |
194 | does all the work, producing the chunk of processed data and | 199 | does all the work, producing the chunk of processed data and |
195 | a new context. The high-level filter then updates its | 200 | a new context. The high-level filter then replaces its |
196 | internal context, and returns the processed chunk of data to | 201 | internal context, and returns the processed chunk of data to |
197 | the user. Notice that we take advantage of Lua's lexical | 202 | the user. Notice that we take advantage of Lua's lexical |
198 | scoping to store the context in a closure between function | 203 | scoping to store the context in a closure between function |
199 | calls. | 204 | calls. |
200 | 205 | ||
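The context threading can be exercised without the C module. Below, the cycle factory from the article drives a toy low-level filter of my own invention (`numberchunks`, hypothetical) whose context is the number of chunks seen so far; returning the initial context on the nil chunk makes the high-level filter reusable.

```lua
local filter = {}

-- The cycle factory as presented in the article: it hides the
-- low-level context inside a closure.
function filter.cycle(lowlevel, context, extra)
  return function(chunk)
    local ret
    ret, context = lowlevel(context, chunk, extra)
    return ret
  end
end

-- Toy low-level filter: the context counts chunks; each output
-- chunk is prefixed with that count and a separator.
local function numberchunks(count, chunk, sep)
  if chunk == nil then return "", 0 end  -- end of data: reset context
  count = count + 1
  return count .. sep .. chunk, count
end

local numbered = filter.cycle(numberchunks, 0, ":")
```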
201 | Concerning the low-level filter code, we must first accept | 206 | \subsection{The C part of the filter} |
207 | |||
208 | As for the low-level filter, we must first accept | ||
202 | that there is no perfect solution to the end-of-line marker | 209 | that there is no perfect solution to the end-of-line marker |
203 | normalization problem. The difficulty comes from an | 210 | normalization problem. The difficulty comes from an |
204 | inherent ambiguity in the definition of empty lines within | 211 | inherent ambiguity in the definition of empty lines within |
@@ -208,39 +215,39 @@ mixed input. It also does a reasonable job with empty lines | |||
208 | and serves as a good example of how to implement a low-level | 215 | and serves as a good example of how to implement a low-level |
209 | filter. | 216 | filter. |
210 | 217 | ||
211 | The idea is to consider both CR and~LF as end-of-line | 218 | The idea is to consider both \CR\ and~\LF\ as end-of-line |
212 | \emph{candidates}. We issue a single break if any candidate | 219 | \emph{candidates}. We issue a single break if any candidate |
213 | is seen alone, or followed by a different candidate. In | 220 | is seen alone, or if it is followed by a different |
214 | other words, CR~CR~and LF~LF each issue two end-of-line | 221 | candidate. In other words, \CR~\CR~and \LF~\LF\ each issue |
215 | markers, whereas CR~LF~and LF~CR issue only one marker each. | 222 | two end-of-line markers, whereas \CR~\LF~and \LF~\CR\ issue |
216 | This method correctly handles the Unix, DOS/MIME, VMS, and Mac | 223 | only one marker each. It is easy to see that this method |
217 | OS conventions. | 224 | correctly handles the most common end-of-line conventions. |
218 | 225 | ||
219 | \subsection{The C part of the filter} | 226 | With this in mind, we divide the low-level filter into two |
220 | 227 | simple functions. The inner function~\texttt{pushchar} performs the | |
221 | Our low-level filter is divided into two simple functions. | 228 | normalization itself. It takes each input character in turn, |
222 | The inner function performs the normalization itself. It takes | 229 | deciding what to output and how to modify the context. The |
223 | each input character in turn, deciding what to output and | 230 | context tells if the last processed character was an |
224 | how to modify the context. The context tells if the last | 231 | end-of-line candidate, and if so, which candidate it was. |
225 | processed character was an end-of-line candidate, and if so, | 232 | For efficiency, we use Lua's auxiliary library's buffer |
226 | which candidate it was. For efficiency, it uses | 233 | interface: |
227 | Lua's auxiliary library's buffer interface: | ||
228 | \begin{quote} | 234 | \begin{quote} |
229 | \begin{C} | 235 | \begin{C} |
230 | @stick# | 236 | @stick# |
231 | @#define candidate(c) (c == CR || c == LF) | 237 | @#define candidate(c) (c == CR || c == LF) |
232 | static int process(int c, int last, const char *marker, | 238 | static int pushchar(int c, int last, const char *marker, |
233 | luaL_Buffer *buffer) { | 239 | luaL_Buffer *buffer) { |
234 | if (candidate(c)) { | 240 | if (candidate(c)) { |
235 | if (candidate(last)) { | 241 | if (candidate(last)) { |
236 | if (c == last) luaL_addstring(buffer, marker); | 242 | if (c == last) |
243 | luaL_addstring(buffer, marker); | ||
237 | return 0; | 244 | return 0; |
238 | } else { | 245 | } else { |
239 | luaL_addstring(buffer, marker); | 246 | luaL_addstring(buffer, marker); |
240 | return c; | 247 | return c; |
241 | } | 248 | } |
242 | } else { | 249 | } else { |
243 | luaL_putchar(buffer, c); | 250 | luaL_pushchar(buffer, c); |
244 | return 0; | 251 | return 0; |
245 | } | 252 | } |
246 | } | 253 | } |
@@ -248,15 +255,20 @@ static int process(int c, int last, const char *marker, | |||
248 | \end{C} | 255 | \end{C} |
249 | \end{quote} | 256 | \end{quote} |
250 | 257 | ||
251 | The outer function simply interfaces with Lua. It receives the | 258 | The outer function~\texttt{eol} simply interfaces with Lua. |
252 | context and input chunk (as well as an optional | 259 | It receives the context and input chunk (as well as an |
253 | custom end-of-line marker), and returns the transformed | 260 | optional custom end-of-line marker), and returns the |
254 | output chunk and the new context: | 261 | transformed output chunk and the new context. |
262 | Notice that if the input chunk is \nil, the operation | ||
263 | is considered to be finished. In that case, the loop will | ||
264 | not execute a single time and the context is reset to the | ||
265 | initial state. This allows the filter to be reused many | ||
266 | times: | ||
255 | \begin{quote} | 267 | \begin{quote} |
256 | \begin{C} | 268 | \begin{C} |
257 | @stick# | 269 | @stick# |
258 | static int eol(lua_State *L) { | 270 | static int eol(lua_State *L) { |
259 | int ctx = luaL_checkint(L, 1); | 271 | int context = luaL_checkint(L, 1); |
260 | size_t isize = 0; | 272 | size_t isize = 0; |
261 | const char *input = luaL_optlstring(L, 2, NULL, &isize); | 273 | const char *input = luaL_optlstring(L, 2, NULL, &isize); |
262 | const char *last = input + isize; | 274 | const char *last = input + isize; |
@@ -269,24 +281,18 @@ static int eol(lua_State *L) { | |||
269 | return 2; | 281 | return 2; |
270 | } | 282 | } |
271 | while (input < last) | 283 | while (input < last) |
272 | ctx = process(*input++, ctx, marker, &buffer); | 284 | context = pushchar(*input++, context, marker, &buffer); |
273 | luaL_pushresult(&buffer); | 285 | luaL_pushresult(&buffer); |
274 | lua_pushnumber(L, ctx); | 286 | lua_pushnumber(L, context); |
275 | return 2; | 287 | return 2; |
276 | } | 288 | } |
277 | % | 289 | % |
278 | \end{C} | 290 | \end{C} |
279 | \end{quote} | 291 | \end{quote} |
280 | 292 | ||
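For readers without a C toolchain at hand, the candidate logic above can be transliterated into pure Lua. This is my sketch, mirroring the two C functions, with a plain table standing in for the `luaL_Buffer`; it also shows how threading the returned context across calls keeps the filter correct when a chunk boundary splits a \CR~\LF\ pair.

```lua
local CR, LF = string.byte("\r"), string.byte("\n")

local function candidate(c) return c == CR or c == LF end

-- Mirrors the C pushchar: decides what to emit for one byte and
-- returns the new context (the pending candidate, or 0 for none).
local function pushchar(c, last, marker, buffer)
  if candidate(c) then
    if candidate(last) then
      if c == last then buffer[#buffer + 1] = marker end
      return 0
    else
      buffer[#buffer + 1] = marker
      return c
    end
  else
    buffer[#buffer + 1] = string.char(c)
    return 0
  end
end

-- Mirrors the C eol: a nil input chunk finishes the operation and
-- resets the context so the filter can be reused.
local function eol(context, input, marker)
  marker = marker or "\r\n"
  if input == nil then return "", 0 end
  local buffer = {}
  for i = 1, #input do
    context = pushchar(input:byte(i), context, marker, buffer)
  end
  return table.concat(buffer), context
end
```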
281 | Notice that if the input chunk is \texttt{nil}, the operation | ||
282 | is considered to be finished. In that case, the loop will | ||
283 | not execute a single time and the context is reset to the | ||
284 | initial state. This allows the filter to be reused many | ||
285 | times. | ||
286 | |||
287 | When designing your own filters, the challenging part is to | 293 | When designing your own filters, the challenging part is to |
288 | decide what will be in the context. For line breaking, for | 294 | decide what will be in the context. For line breaking, for |
289 | instance, it could be the number of bytes left in the | 295 | instance, it could be the number of bytes that still fit in the |
290 | current line. For Base64 encoding, it could be a string | 296 | current line. For Base64 encoding, it could be a string |
291 | with the bytes that remain after the division of the input | 297 | with the bytes that remain after the division of the input |
292 | into 3-byte atoms. The MIME module in the \texttt{LuaSocket} | 298 | into 3-byte atoms. The MIME module in the \texttt{LuaSocket} |
@@ -294,19 +300,22 @@ distribution has many other examples. | |||
294 | 300 | ||
295 | \section{Filter chains} | 301 | \section{Filter chains} |
296 | 302 | ||
297 | Chains add a lot to the power of filters. For example, | 303 | Chains greatly increase the power of filters. For example, |
298 | according to the standard for Quoted-Printable encoding, | 304 | according to the standard for Quoted-Printable encoding, |
299 | text must be normalized to a canonic end-of-line marker | 305 | text should be normalized to a canonic end-of-line marker |
300 | prior to encoding. To help specifying complex | 306 | prior to encoding. After encoding, the resulting text must |
301 | transformations like this, we define a chain factory that | 307 | be broken into lines of no more than 76 characters, with the |
302 | creates a composite filter from one or more filters. A | 308 | use of soft line breaks (a line terminated by the \texttt{=} |
303 | chained filter passes data through all its components, and | 309 | sign). To help specifying complex transformations like |
304 | can be used wherever a primitive filter is accepted. | 310 | this, we define a chain factory that creates a composite |
311 | filter from one or more filters. A chained filter passes | ||
312 | data through all its components, and can be used wherever a | ||
313 | primitive filter is accepted. | ||
305 | 314 | ||
306 | The chaining factory is very simple. The auxiliary | 315 | The chaining factory is very simple. The auxiliary |
307 | function~\texttt{chainpair} chains two filters together, | 316 | function~\texttt{chainpair} chains two filters together, |
308 | taking special care if the chunk is the last. This is | 317 | taking special care if the chunk is the last. This is |
309 | because the final \texttt{nil} chunk notification has to be | 318 | because the final \nil\ chunk notification has to be |
310 | pushed through both filters in turn: | 319 | pushed through both filters in turn: |
311 | \begin{quote} | 320 | \begin{quote} |
312 | \begin{lua} | 321 | \begin{lua} |
@@ -322,9 +331,9 @@ end | |||
322 | 331 | ||
323 | @stick# | 332 | @stick# |
324 | function filter.chain(...) | 333 | function filter.chain(...) |
325 | local f = arg[1] | 334 | local f = select(1, ...) |
326 | for i = 2, @#arg do | 335 | for i = 2, select('@#', ...) do |
327 | f = chainpair(f, arg[i]) | 336 | f = chainpair(f, select(i, ...)) |
328 | end | 337 | end |
329 | return f | 338 | return f |
330 | end | 339 | end |
@@ -337,11 +346,11 @@ define the Quoted-Printable conversion as such: | |||
337 | \begin{quote} | 346 | \begin{quote} |
338 | \begin{lua} | 347 | \begin{lua} |
339 | @stick# | 348 | @stick# |
340 | local qp = filter.chain(normalize("\r\n"), | 349 | local qp = filter.chain(normalize(CRLF), encode("quoted-printable"), |
341 | encode("quoted-printable")) | 350 | wrap("quoted-printable")) |
342 | local in = source.chain(source.file(io.stdin), qp) | 351 | local input = source.chain(source.file(io.stdin), qp) |
343 | local out = sink.file(io.stdout) | 352 | local output = sink.file(io.stdout) |
344 | pump.all(in, out) | 353 | pump.all(input, output) |
345 | % | 354 | % |
346 | \end{lua} | 355 | \end{lua} |
347 | \end{quote} | 356 | \end{quote} |
@@ -360,14 +369,14 @@ gives a final destination to the data. | |||
360 | \subsection{Sources} | 369 | \subsection{Sources} |
361 | 370 | ||
362 | A source returns the next chunk of data each time it is | 371 | A source returns the next chunk of data each time it is |
363 | invoked. When there is no more data, it simply returns | 372 | invoked. When there is no more data, it simply returns~\nil. |
364 | \texttt{nil}. In the event of an error, the source can inform the | 373 | In the event of an error, the source can inform the |
365 | caller by returning \texttt{nil} followed by an error message. | 374 | caller by returning \nil\ followed by the error message. |
366 | 375 | ||
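Before turning to the factories, a minimal handmade source may make the convention concrete (illustrative only, not from the article): one that doles out a string in fixed-size chunks and then returns nil.

```lua
-- A minimal source: each call returns the next fixed-size chunk of
-- a string; nil signals that there is no more data.
local function string_source(s, chunksize)
  local pos = 1
  return function()
    if pos > #s then return nil end
    local chunk = s:sub(pos, pos + chunksize - 1)
    pos = pos + chunksize
    return chunk
  end
end
```

The closure over `pos` plays the same role here that the context plays for filters: it carries state between invocations.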
367 | Below are two simple source factories. The \texttt{empty} source | 376 | Below are two simple source factories. The \texttt{empty} source |
368 | returns no data, possibly returning an associated error | 377 | returns no data, possibly returning an associated error |
369 | message. The \texttt{file} source works harder, and | 378 | message. The \texttt{file} source yields the contents of a file |
370 | yields the contents of a file in a chunk by chunk fashion: | 379 | in a chunk by chunk fashion: |
371 | \begin{quote} | 380 | \begin{quote} |
372 | \begin{lua} | 381 | \begin{lua} |
373 | @stick# | 382 | @stick# |
@@ -398,7 +407,7 @@ A filtered source passes its data through the | |||
398 | associated filter before returning it to the caller. | 407 | associated filter before returning it to the caller. |
399 | Filtered sources are useful when working with | 408 | Filtered sources are useful when working with |
400 | functions that get their input data from a source (such as | 409 | functions that get their input data from a source (such as |
401 | the pump in our first example). By chaining a source with one or | 410 | the pumps in our examples). By chaining a source with one or |
402 | more filters, the function can be transparently provided | 411 | more filters, the function can be transparently provided |
403 | with filtered data, with no need to change its interface. | 412 | with filtered data, with no need to change its interface. |
404 | Here is a factory that does the job: | 413 | Here is a factory that does the job: |
@@ -406,14 +415,18 @@ Here is a factory that does the job: | |||
406 | \begin{lua} | 415 | \begin{lua} |
407 | @stick# | 416 | @stick# |
408 | function source.chain(src, f) | 417 | function source.chain(src, f) |
409 | return source.simplify(function() | 418 | return function() |
410 | if not src then return nil end | 419 | if not src then |
420 | return nil | ||
421 | end | ||
411 | local chunk, err = src() | 422 | local chunk, err = src() |
412 | if not chunk then | 423 | if not chunk then |
413 | src = nil | 424 | src = nil |
414 | return f(nil) | 425 | return f(nil) |
415 | else return f(chunk) end | 426 | else |
416 | end) | 427 | return f(chunk) |
428 | end | ||
429 | end | ||
417 | end | 430 | end |
418 | % | 431 | % |
419 | \end{lua} | 432 | \end{lua} |
@@ -421,20 +434,20 @@ end | |||
421 | 434 | ||
422 | \subsection{Sinks} | 435 | \subsection{Sinks} |
423 | 436 | ||
424 | Just as we defined an interface a data source, | 437 | Just as we defined an interface for a source of data,
425 | we can also define an interface for a data destination. | 438 | we can also define an interface for a data destination. |
426 | We call any function respecting this | 439 | We call any function respecting this |
427 | interface a \emph{sink}. In our first example, we used a | 440 | interface a \emph{sink}. In our first example, we used a |
428 | file sink connected to the standard output. | 441 | file sink connected to the standard output. |
429 | 442 | ||
430 | Sinks receive consecutive chunks of data, until the end of | 443 | Sinks receive consecutive chunks of data, until the end of |
431 | data is signaled by a \texttt{nil} chunk. A sink can be | 444 | data is signaled by a \nil\ input chunk. A sink can be |
432 | notified of an error with an optional extra argument that | 445 | notified of an error with an optional extra argument that |
433 | contains the error message, following a \texttt{nil} chunk. | 446 | contains the error message, following a \nil\ chunk. |
434 | If a sink detects an error itself, and | 447 | If a sink detects an error itself, and |
435 | wishes not to be called again, it can return \texttt{nil}, | 448 | wishes not to be called again, it can return \nil, |
436 | followed by an error message. A return value that | 449 | followed by an error message. A return value that |
437 | is not \texttt{nil} means the source will accept more data. | 450 | is not \nil\ means the sink will accept more data. |
438 | 451 | ||
439 | Below are two useful sink factories. | 452 | Below are two useful sink factories. |
440 | The table factory creates a sink that stores | 453 | The table factory creates a sink that stores |
@@ -469,7 +482,7 @@ end | |||
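A table sink can be sketched directly from the interface described above. This is a hedged reconstruction, not the original text; the real `ltn12.sink.table` may differ in detail:

```lua
-- Hedged sketch of a table sink factory: stores every chunk it
-- receives into a table, returned alongside the sink function.
local sink = {}

function sink.table(t)
  t = t or {}
  local f = function(chunk, err)
    if chunk then table.insert(t, chunk) end
    return 1                      -- non-nil: willing to accept more data
  end
  return f, t
end

local s, t = sink.table()
s("hello, ")
s("world")
s(nil)                            -- nil chunk signals end of data
print(table.concat(t))  -- hello, world
```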
469 | 482 | ||
470 | Naturally, filtered sinks are just as useful as filtered | 483 | Naturally, filtered sinks are just as useful as filtered |
471 | sources. A filtered sink passes each chunk it receives | 484 | sources. A filtered sink passes each chunk it receives |
472 | through the associated filter before handing it to the | 485 | through the associated filter before handing it down to the |
473 | original sink. In the following example, we use a source | 486 | original sink. In the following example, we use a source |
474 | that reads from the standard input. The input chunks are | 487 | that reads from the standard input. The input chunks are |
475 | sent to a table sink, which has been coupled with a | 488 | sent to a table sink, which has been coupled with a |
@@ -479,10 +492,10 @@ standard out: | |||
479 | \begin{quote} | 492 | \begin{quote} |
480 | \begin{lua} | 493 | \begin{lua} |
481 | @stick# | 494 | @stick# |
482 | local in = source.file(io.stdin) | 495 | local input = source.file(io.stdin) |
483 | local out, t = sink.table() | 496 | local output, t = sink.table() |
484 | out = sink.chain(normalize("\r\n"), out) | 497 | output = sink.chain(normalize(CRLF), output) |
485 | pump.all(in, out) | 498 | pump.all(input, output) |
486 | io.write(table.concat(t)) | 499 | io.write(table.concat(t)) |
487 | % | 500 | % |
488 | \end{lua} | 501 | \end{lua} |
@@ -490,11 +503,11 @@ io.write(table.concat(t)) | |||
490 | 503 | ||
491 | \subsection{Pumps} | 504 | \subsection{Pumps} |
492 | 505 | ||
493 | Adrian Sietsma noticed that, although not on purpose, our | 506 | Although not on purpose, our interface for sources is |
494 | interface for sources is compatible with Lua iterators. | 507 | compatible with Lua iterators. That is, a source can be |
495 | That is, a source can be neatly used in conjunction | 508 | neatly used in conjunction with \texttt{for} loops. Using |
496 | with \texttt{for} loops. Using our file | 509 | our file source as an iterator, we can write the following |
497 | source as an iterator, we can write the following code: | 510 | code: |
498 | \begin{quote} | 511 | \begin{quote} |
499 | \begin{lua} | 512 | \begin{lua} |
500 | @stick# | 513 | @stick# |
@@ -539,20 +552,22 @@ end | |||
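The iterator correspondence mentioned above is easy to demonstrate: a generic `for` loop simply calls its iterator until it returns `nil`, which is exactly the source contract. In this sketch (not from the original text) a toy in-memory source stands in for the file source:

```lua
-- A source is a valid Lua iterator: the for loop calls it repeatedly
-- and stops at the first nil, which is also the source end-of-data mark.
local lines, i = {"one", "two"}, 0
local function src() i = i + 1; return lines[i] end

local out = {}
for chunk in src do
  table.insert(out, chunk)
end
print(table.concat(out, " "))  -- one two
```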
539 | The \texttt{pump.step} function moves one chunk of data from | 552 | The \texttt{pump.step} function moves one chunk of data from |
540 | the source to the sink. The \texttt{pump.all} function takes | 553 | the source to the sink. The \texttt{pump.all} function takes |
541 | an optional \texttt{step} function and uses it to pump all the | 554 | an optional \texttt{step} function and uses it to pump all the |
542 | data from the source to the sink. We can now use everything | 555 | data from the source to the sink. |
543 | we have to write a program that reads a binary file from | 556 | Here is an example that uses the Base64 and the |
557 | line-wrapping filters from the \texttt{LuaSocket} ||
558 | distribution. The program reads a binary file from | ||
544 | disk and stores it in another file, after encoding it to the | 559 | disk and stores it in another file, after encoding it to the |
545 | Base64 transfer content encoding: | 560 | Base64 transfer content encoding: |
546 | \begin{quote} | 561 | \begin{quote} |
547 | \begin{lua} | 562 | \begin{lua} |
548 | @stick# | 563 | @stick# |
549 | local in = source.chain( | 564 | local input = source.chain( |
550 | source.file(io.open("input.bin", "rb")), | 565 | source.file(io.open("input.bin", "rb")), |
551 | encode("base64")) | 566 | encode("base64")) |
552 | local out = sink.chain( | 567 | local output = sink.chain( |
553 | wrap(76), | 568 | wrap(76), |
554 | sink.file(io.open("output.b64", "w"))) | 569 | sink.file(io.open("output.b64", "w"))) |
555 | pump.all(in, out) | 570 | pump.all(input, output) |
556 | % | 571 | % |
557 | \end{lua} | 572 | \end{lua} |
558 | \end{quote} | 573 | \end{quote} |
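The pump functions themselves are small enough to sketch from the description above. This is a hedged reconstruction; the actual LuaSocket implementation may differ in detail:

```lua
-- Hedged sketch of pump.step and pump.all as described in the text.
local pump = {}

-- Move a single chunk of data from the source to the sink.
function pump.step(src, snk)
  local chunk, src_err = src()
  local ret, snk_err = snk(chunk, src_err)
  if chunk and ret then return 1
  else return nil, src_err or snk_err end
end

-- Pump all data, with an optional custom step function.
function pump.all(src, snk, step)
  step = step or pump.step
  while true do
    local ret, err = step(src, snk)
    if not ret then
      if err then return nil, err else return 1 end
    end
  end
end

-- Toy source and sink for demonstration (illustrative names).
local words, i = {"a", "b", "c"}, 0
local function src() i = i + 1; return words[i] end
local received = {}
local function snk(chunk)
  if chunk then table.insert(received, chunk) end
  return 1
end

assert(pump.all(src, snk))
print(table.concat(received))  -- abc
```

Note that `pump.step` forwards the final `nil` chunk (and any source error) to the sink, so the sink is always notified of the end of data.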
@@ -561,19 +576,17 @@ The way we split the filters here is not intuitive, on | |||
561 | purpose. Alternatively, we could have chained the Base64 | 576 | purpose. Alternatively, we could have chained the Base64 |
562 | encode filter and the line-wrap filter together, and then | 577 | encode filter and the line-wrap filter together, and then |
563 | chain the resulting filter with either the file source or | 578 | chain the resulting filter with either the file source or |
564 | the file sink. It doesn't really matter. The Base64 and the | 579 | the file sink. It doesn't really matter. |
565 | line wrapping filters are part of the \texttt{LuaSocket} | ||
566 | distribution. | ||
567 | 580 | ||
568 | \section{Exploding filters} | 581 | \section{Exploding filters} |
569 | 582 | ||
570 | Our current filter interface has one flagrant shortcoming. | 583 | Our current filter interface has one serious shortcoming. |
571 | When David Burgess was writing his \texttt{gzip} filter, he | 584 | Consider, for example, a \texttt{gzip} decompression filter. |
572 | noticed that a decompression filter can explode a small | 585 | During decompression, a small input chunk can be exploded |
573 | input chunk into a huge amount of data. To address this | 586 | into a huge amount of data. To address this problem, we |
574 | problem, we decided to change the filter interface and allow | 587 | decided to change the filter interface and allow exploding |
575 | exploding filters to return large quantities of output data | 588 | filters to return large quantities of output data in a chunk- |
576 | in a chunk by chunk manner. | 589 | by-chunk manner. |
577 | 590 | ||
578 | More specifically, after passing each chunk of input to | 591 | More specifically, after passing each chunk of input to |
579 | a filter, and collecting the first chunk of output, the | 592 | a filter, and collecting the first chunk of output, the |
@@ -582,11 +595,11 @@ filtered data is left. Within these secondary calls, the | |||
582 | caller passes an empty string to the filter. The filter | 595 | caller passes an empty string to the filter. The filter |
583 | responds with an empty string when it is ready for the next | 596 | responds with an empty string when it is ready for the next |
584 | input chunk. In the end, after the user passes a | 597 | input chunk. In the end, after the user passes a |
585 | \texttt{nil} chunk notifying the filter that there is no | 598 | \nil\ chunk notifying the filter that there is no |
586 | more input data, the filter might still have to produce too | 599 | more input data, the filter might still have to produce too |
587 | much output data to return in a single chunk. The user has | 600 | much output data to return in a single chunk. The user has |
588 | to loop again, now passing \texttt{nil} to the filter each time, | 601 | to loop again, now passing \nil\ to the filter each time, |
589 | until the filter itself returns \texttt{nil} to notify the | 602 | until the filter itself returns \nil\ to notify the |
590 | user it is finally done. | 603 | user it is finally done. |
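The two-phase draining loop described above can be made concrete. The sketch below is illustrative, not from the original text: `duplicator` is a made-up exploding filter that emits every input chunk twice, one chunk per call, and the driving loop follows the protocol exactly, draining with `""` after each input chunk and with `nil` at the end:

```lua
-- A made-up exploding filter obeying the protocol: each input chunk
-- is emitted twice, one output chunk per call.
local function duplicator()
  local pending
  return function(chunk)
    if chunk and chunk ~= "" then
      pending = chunk            -- remember a second copy
      return chunk               -- first chunk of output
    elseif pending then
      local p = pending          -- secondary call: flush the copy
      pending = nil
      return p
    elseif chunk == "" then
      return ""                  -- ready for the next input chunk
    else
      return nil                 -- nil input, nothing pending: done
    end
  end
end

-- The driving loop the text describes.
local filter = duplicator()
local input, i = {"a", "b"}, 0
local out = {}
repeat
  i = i + 1
  local chunk = input[i]              -- nil past the end of input
  local piece = filter(chunk)
  if chunk then
    while piece ~= "" do              -- drain with "" until ready
      table.insert(out, piece)
      piece = filter("")
    end
  else
    while piece ~= nil do             -- drain with nil until done
      table.insert(out, piece)
      piece = filter(nil)
    end
  end
until chunk == nil
local result = table.concat(out)
print(result)  -- aabb
```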
591 | 604 | ||
592 | Fortunately, it is very easy to modify a filter to respect | 605 | Fortunately, it is very easy to modify a filter to respect |
@@ -599,13 +612,13 @@ Interestingly, the modifications do not have a measurable | |||
599 | negative impact in the performance of filters that do | 612 | negative impact in the performance of filters that do |
600 | not need the added flexibility. On the other hand, for a | 613 | not need the added flexibility. On the other hand, for a |
601 | small price in complexity, the changes make exploding | 614 | small price in complexity, the changes make exploding |
602 | filters practical. | 615 | filters practical. |
603 | 616 | ||
604 | \section{A complex example} | 617 | \section{A complex example} |
605 | 618 | ||
606 | The LTN12 module in the \texttt{LuaSocket} distribution | 619 | The LTN12 module in the \texttt{LuaSocket} distribution |
607 | implements the ideas we have described. The MIME | 620 | implements all the ideas we have described. The MIME |
608 | and SMTP modules are especially integrated with LTN12, | 621 | and SMTP modules are tightly integrated with LTN12, |
609 | and can be used to showcase the expressive power of filters, | 622 | and can be used to showcase the expressive power of filters, |
610 | sources, sinks, and pumps. Below is an example | 623 | sources, sinks, and pumps. Below is an example |
611 | of how a user would proceed to define and send a | 624 | of how a user would proceed to define and send a |
@@ -622,9 +635,9 @@ local message = smtp.message{ | |||
622 | to = "Fulano <fulano@example.com>", | 635 | to = "Fulano <fulano@example.com>", |
623 | subject = "A message with an attachment"}, | 636 | subject = "A message with an attachment"}, |
624 | body = { | 637 | body = { |
625 | preamble = "Hope you can see the attachment\r\n", | 638 | preamble = "Hope you can see the attachment" .. CRLF, |
626 | [1] = { | 639 | [1] = { |
627 | body = "Here is our logo\r\n"}, | 640 | body = "Here is our logo" .. CRLF}, |
628 | [2] = { | 641 | [2] = { |
629 | headers = { | 642 | headers = { |
630 | ["content-type"] = 'image/png; name="luasocket.png"', | 643 | ["content-type"] = 'image/png; name="luasocket.png"', |
@@ -665,6 +678,18 @@ abstraction for final data destinations. Filters define an | |||
665 | interface for data transformations. The chaining of | 678 | interface for data transformations. The chaining of |
666 | filters, sources and sinks provides an elegant way to create | 679 | filters, sources and sinks provides an elegant way to create |
667 | arbitrarily complex data transformations from simpler | 680 | arbitrarily complex data transformations from simpler |
668 | components. Pumps simply move the data through. | 681 | components. Pumps simply push the data through. |
682 | |||
683 | \section{Acknowledgements} | ||
684 | |||
685 | The concepts described in this text are the result of long | ||
686 | discussions with David Burgess. A version of this text has | ||
687 | been released on-line as the Lua Technical Note 012, hence | ||
688 | the name of the corresponding LuaSocket module, | ||
689 | \texttt{ltn12}. Wim Couwenberg contributed to the | ||
690 | implementation of the module, and Adrian Sietsma was the | ||
691 | first to notice the correspondence between sources and Lua | ||
692 | iterators. | ||
693 | |||
669 | 694 | ||
670 | \end{document} | 695 | \end{document} |