\documentclass[10pt]{article}
\usepackage{fancyvrb}
\usepackage{url}
\DefineVerbatimEnvironment{lua}{Verbatim}{fontsize=\small,commandchars=\@\#\%}
\DefineVerbatimEnvironment{C}{Verbatim}{fontsize=\small,commandchars=\@\#\%}
\DefineVerbatimEnvironment{mime}{Verbatim}{fontsize=\small,commandchars=\$\#\%}
\newcommand{\stick}[1]{\vbox{\setlength{\parskip}{0pt}#1}}
\newcommand{\bl}{\ensuremath{\mathtt{\backslash}}}
\newcommand{\CR}{\texttt{CR}}
\newcommand{\LF}{\texttt{LF}}
\newcommand{\CRLF}{\texttt{CR~LF}}
\newcommand{\nil}{\texttt{nil}}

\title{Filters, sources, sinks, and pumps\\
{\large or Functional programming for the rest of us}}
\author{Diego Nehab}

\begin{document}

\maketitle

\begin{abstract}
Certain data processing operations can be implemented in the
form of filters. A filter is a function that can process
data received in consecutive invocations, returning partial
results each time it is called. Examples of operations that
can be implemented as filters include the end-of-line
normalization for text, Base64 and Quoted-Printable transfer
content encodings, the breaking of text into lines, SMTP
dot-stuffing, and many others. Filters become
even more powerful when we allow them to be chained together
to create composite filters. In this context, filters can be
seen as the internal links in a chain of data transformations.
Sources and sinks are the corresponding end points in these
chains. A source is a function that produces data, chunk by
chunk, and a sink is a function that takes data, chunk by
chunk. Finally, pumps are procedures that actively drive
data from a source to a sink, and indirectly through all
intervening filters. In this article, we describe the design of an
elegant interface for filters, sources, sinks, chains, and
pumps, and we illustrate each step with concrete examples.
\end{abstract}

\section{Introduction}

Within the realm of networking applications, we are often
required to apply transformations to streams of data. Examples
include the end-of-line normalization for text, Base64 and
Quoted-Printable transfer content encodings, breaking text
into lines with a maximum number of columns, SMTP
dot-stuffing, \texttt{gzip} compression, HTTP chunked
transfer coding, and the list goes on.

Many complex tasks require a combination of two or more such
transformations, and therefore a general mechanism for
promoting reuse is desirable. In the process of designing
\texttt{LuaSocket~2.0}, we repeatedly faced this problem.
The solution we reached proved to be very general and
convenient. It is based on the concepts of filters, sources,
sinks, and pumps, which we introduce below.

\emph{Filters} are functions that can be repeatedly invoked
with chunks of input, successively returning processed
chunks of output. Naturally, the result of
concatenating all the output chunks must be the same as the
result of applying the filter to the concatenation of all
input chunks. In fancier language, filters \emph{commute}
with the concatenation operator. More importantly, filters
must handle input data correctly no matter how the stream
has been split into chunks.

A \emph{chain} is a function that transparently combines the
effect of one or more filters. The interface of a chain is
indistinguishable from the interface of its component
filters. This allows a chained filter to be used wherever
an atomic filter is accepted. In particular, chains can be
themselves chained to create arbitrarily complex operations.

Filters can be seen as internal nodes in a network through
which data will flow, potentially being transformed many
times along the way. Chains connect these nodes together.
The initial and final nodes of the network are
\emph{sources} and \emph{sinks}, respectively. Less
abstractly, a source is a function that produces new chunks
of data every time it is invoked. Conversely, sinks are
functions that give a final destination to the chunks of
data they receive in successive calls. Naturally, sources
and sinks can also be chained with filters to produce
filtered sources and sinks.

Finally, filters, chains, sources, and sinks are all passive
entities: they must be repeatedly invoked in order for
anything to happen. \emph{Pumps} provide the driving force
that pushes data through the network, from a source to a
sink, and indirectly through all intervening filters.

In the following sections, we start with a simplified
interface, which we later refine. The evolution we present
is not contrived: it recreates the steps we ourselves
followed as we consolidated our understanding of these
concepts within our application domain.

\subsection{A simple example}

The end-of-line normalization of text is a good
example to motivate our initial filter interface.
Assume we are given text in an unknown end-of-line
convention (including possibly mixed conventions) out of the
commonly found Unix (\LF), Mac OS (\CR), and
DOS (\CRLF) conventions. We would like to be able to
use the following code to normalize the end-of-line markers:
\begin{quote}
\begin{lua}
@stick#
local CRLF = "\013\010"
local input = source.chain(source.file(io.stdin), normalize(CRLF))
local output = sink.file(io.stdout)
pump.all(input, output)
%
\end{lua}
\end{quote}

This program should read data from the standard input stream
and normalize the end-of-line markers to the canonical
\CRLF\ marker, as defined by the MIME standard.
Finally, the normalized text should be sent to the standard output
stream. We use a \emph{file source} that produces data from
standard input, and chain it with a filter that normalizes
the data. The pump then repeatedly obtains data from the
source, and passes it to the \emph{file sink}, which sends
it to the standard output.

In the code above, the \texttt{normalize} \emph{factory} is a
function that creates our normalization filter, which
replaces any end-of-line marker with the canonical marker.
The initial filter interface is
trivial: a filter function receives a chunk of input data,
and returns a chunk of processed data. When there is no
more input data left, the caller notifies the filter by invoking
it with a \nil\ chunk. The filter responds by returning
the final chunk of processed data (which could of course be
the empty string).

Although the interface is extremely simple, the
implementation is not so obvious. A normalization filter
respecting this interface needs to keep some kind of context
between calls. This is because a chunk boundary may lie between
the \CR\ and \LF\ characters marking the end of a single line. This
need for contextual storage motivates the use of
factories: each time the factory is invoked, it returns a
filter with its own context so that we can have several
independent filters being used at the same time. For
efficiency reasons, we must avoid the obvious solution of
concatenating all the input into the context before
producing any output chunks.
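
The role of the factory can be sketched in a few lines of
plain Lua. In this toy example (ours, not part of
\texttt{LuaSocket}), each filter returned by the factory
keeps an independent byte count in its closure:

```lua
-- Toy factory: each returned "filter" keeps its own running byte
-- count as context, captured in a closure. Two filters created by
-- the same factory never share state.
local function counter()
    local total = 0                 -- per-filter context
    return function(chunk)
        if chunk then total = total + #chunk end
        return chunk, total
    end
end

local f1, f2 = counter(), counter()
local _, n1 = f1("hello")           -- n1 is 5
local _, n2 = f2("hi")              -- n2 is 2: independent context
```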

To that end, we break the implementation into two parts:
a low-level filter, and a factory of high-level filters. The
low-level filter is implemented in C and does not maintain
any context between function calls. The high-level filter
factory, implemented in Lua, creates and returns a
high-level filter that maintains whatever context the low-level
filter needs, but isolates the user from its internal
details. That way, we take advantage of C's efficiency to
perform the hard work, and take advantage of Lua's
simplicity for the bookkeeping.

\subsection{The Lua part of the filter}

Below is the complete implementation of the factory of high-level
end-of-line normalization filters:
\begin{quote}
\begin{lua}
@stick#
function filter.cycle(lowlevel, context, extra)
    return function(chunk)
        local ret
        ret, context = lowlevel(context, chunk, extra)
        return ret
    end
end
%

@stick#
function normalize(marker)
    return filter.cycle(eol, 0, marker)
end
%
\end{lua}
\end{quote}

The \texttt{normalize} factory simply calls a more generic
factory, the \texttt{cycle}~factory, passing the low-level
filter~\texttt{eol}. The \texttt{cycle}~factory receives a
low-level filter, an initial context, and an extra
parameter, and returns a new high-level filter. Each time
the high-level filter is passed a new chunk, it invokes the
low-level filter with the previous context, the new chunk,
and the extra argument. It is the low-level filter that
does all the work, producing the chunk of processed data and
a new context. The high-level filter then replaces its
internal context, and returns the processed chunk of data to
the user. Notice that we take advantage of Lua's lexical
scoping to store the context in a closure between function
calls.
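
The same \texttt{cycle} factory works just as well with a
low-level filter written in Lua. The sketch below reproduces
\texttt{filter.cycle} from the text and drives it with a
hypothetical pure-Lua low-level function,
\texttt{number\_chunks} (ours, not part of
\texttt{LuaSocket}), whose context is a chunk counter:

```lua
-- filter.cycle, as in the text: the closure stores the context
-- between calls and threads it through the low-level filter.
local filter = {}
function filter.cycle(lowlevel, context, extra)
    return function(chunk)
        local ret
        ret, context = lowlevel(context, chunk, extra)
        return ret
    end
end

-- Hypothetical low-level filter: prefixes each chunk with a running
-- sequence number; the context is the number of chunks seen so far.
local function number_chunks(context, chunk, sep)
    if not chunk then return nil, 0 end   -- end of data: reset context
    context = context + 1
    return context .. sep .. chunk, context
end

local f = filter.cycle(number_chunks, 0, ":")
-- f("alpha") yields "1:alpha", f("beta") yields "2:beta"
```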

\subsection{The C part of the filter}

As for the low-level filter, we must first accept
that there is no perfect solution to the end-of-line marker
normalization problem. The difficulty comes from an
inherent ambiguity in the definition of empty lines within
mixed input. However, the following solution works well for
any consistent input, as well as for non-empty lines in
mixed input. It also does a reasonable job with empty lines
and serves as a good example of how to implement a low-level
filter.

The idea is to consider both \CR\ and~\LF\ as end-of-line
\emph{candidates}. We issue a single break if any candidate
is seen alone, or if it is followed by a different
candidate. In other words, \CR~\CR~and \LF~\LF\ each issue
two end-of-line markers, whereas \CR~\LF~and \LF~\CR\ issue
only one marker each. It is easy to see that this method
correctly handles the most common end-of-line conventions.
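
This candidate rule is small enough to transliterate into
pure Lua. The sketch below (ours, for illustration only)
processes one character at a time; the context holds the
pending candidate, or \texttt{0} when there is none, just
like the C implementation that follows:

```lua
local CR, LF = "\013", "\010"
local function candidate(c) return c == CR or c == LF end

-- Process one character: returns the text to emit and the new
-- context (the current candidate, or 0 when there is none).
local function pushchar(c, last, marker)
    if candidate(c) then
        if candidate(last) then
            if c == last then return marker, 0 end  -- CR CR / LF LF: 2nd break
            return "", 0             -- CR LF / LF CR: break already issued
        end
        return marker, c             -- first candidate: issue a break
    end
    return c, 0                      -- ordinary character, reset context
end

-- Normalize a whole string by folding pushchar over its characters.
local function normalize_string(s, marker)
    local out, context = {}, 0
    for i = 1, #s do
        local piece
        piece, context = pushchar(s:sub(i, i), context, marker)
        out[#out + 1] = piece
    end
    return table.concat(out)
end
```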

With this in mind, we divide the low-level filter into two
simple functions. The inner function~\texttt{pushchar} performs the
normalization itself. It takes each input character in turn,
deciding what to output and how to modify the context. The
context tells if the last processed character was an
end-of-line candidate, and if so, which candidate it was.
For efficiency, we use the buffer interface of Lua's
auxiliary library:
\begin{quote}
\begin{C}
@stick#
@#define candidate(c) (c == CR || c == LF)
static int pushchar(int c, int last, const char *marker,
        luaL_Buffer *buffer) {
    if (candidate(c)) {
        if (candidate(last)) {
            if (c == last)
                luaL_addstring(buffer, marker);
            return 0;
        } else {
            luaL_addstring(buffer, marker);
            return c;
        }
    } else {
        luaL_addchar(buffer, c);
        return 0;
    }
}
%
\end{C}
\end{quote}

The outer function~\texttt{eol} simply interfaces with Lua.
It receives the context and input chunk (as well as an
optional custom end-of-line marker), and returns the
transformed output chunk and the new context.
Notice that if the input chunk is \nil, the operation
is considered to be finished. In that case, the loop will
not execute a single time and the context is reset to the
initial state. This allows the filter to be reused many
times:
\begin{quote}
\begin{C}
@stick#
static int eol(lua_State *L) {
    int context = luaL_checkint(L, 1);
    size_t isize = 0;
    const char *input = luaL_optlstring(L, 2, NULL, &isize);
    const char *last = input + isize;
    const char *marker = luaL_optstring(L, 3, CRLF);
    luaL_Buffer buffer;
    luaL_buffinit(L, &buffer);
    if (!input) {
        lua_pushnil(L);
        lua_pushnumber(L, 0);
        return 2;
    }
    while (input < last)
        context = pushchar(*input++, context, marker, &buffer);
    luaL_pushresult(&buffer);
    lua_pushnumber(L, context);
    return 2;
}
%
\end{C}
\end{quote}

When designing filters, the challenging part is usually
deciding what to store in the context. For line breaking, for
instance, it could be the number of bytes that still fit in the
current line. For Base64 encoding, it could be a string
with the bytes that remain after the division of the input
into 3-byte atoms. The MIME module in the \texttt{LuaSocket}
distribution has many other examples.
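
As an illustration of such a context choice, here is a
hypothetical pure-Lua line-breaking filter whose context is
the number of bytes that still fit in the current line (a
much simplified stand-in for \texttt{LuaSocket}'s
\texttt{mime.wrap}; unlike the real thing, this toy ignores
line breaks already present in the input):

```lua
-- Hypothetical sketch of a line-breaking filter: the context is the
-- number of bytes that still fit in the current line, kept in the
-- closure between calls.
local function wrapper(length)
    local left = length             -- per-filter context
    return function(chunk)
        if not chunk then left = length; return "" end
        local out = {}
        while #chunk > left do
            out[#out + 1] = chunk:sub(1, left) .. "\013\010"
            chunk = chunk:sub(left + 1)
            left = length
        end
        left = left - #chunk
        out[#out + 1] = chunk
        return table.concat(out)
    end
end
```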

\section{Filter chains}

Chains greatly increase the power of filters. For example,
according to the standard for Quoted-Printable encoding,
text should be normalized to a canonical end-of-line marker
prior to encoding. After encoding, the resulting text must
be broken into lines of no more than 76 characters, with the
use of soft line breaks (a line terminated by the \texttt{=}
sign). To help specify complex transformations like
this, we define a chain factory that creates a composite
filter from one or more filters. A chained filter passes
data through all its components, and can be used wherever a
primitive filter is accepted.

The chaining factory is very simple. The auxiliary
function~\texttt{chainpair} chains two filters together,
taking special care if the chunk is the last. This is
because the final \nil\ chunk notification has to be
pushed through both filters in turn:
\begin{quote}
\begin{lua}
@stick#
local function chainpair(f1, f2)
    return function(chunk)
        local ret = f2(f1(chunk))
        if chunk then return ret
        else return ret .. f2() end
    end
end
%

@stick#
function filter.chain(...)
    local f = select(1, ...)
    for i = 2, select('@#', ...) do
        f = chainpair(f, select(i, ...))
    end
    return f
end
%
\end{lua}
\end{quote}

Thanks to the chain factory, we can
define the Quoted-Printable conversion as follows:
\begin{quote}
\begin{lua}
@stick#
local qp = filter.chain(normalize(CRLF), encode("quoted-printable"),
    wrap("quoted-printable"))
local input = source.chain(source.file(io.stdin), qp)
local output = sink.file(io.stdout)
pump.all(input, output)
%
\end{lua}
\end{quote}

\section{Sources, sinks, and pumps}

The filters we introduced so far act as the internal nodes
in a network of transformations. Information flows from node
to node (or rather from one filter to the next) and is
transformed along the way. Chaining filters together is our
way to connect nodes in this network. As the starting point
for the network, we need a source node that produces the
data. At the end of the network, we need a sink node that
gives a final destination to the data.

\subsection{Sources}

A source returns the next chunk of data each time it is
invoked. When there is no more data, it simply returns~\nil.
In the event of an error, the source can inform the
caller by returning \nil\ followed by the error message.

Below are two simple source factories. The \texttt{empty} source
returns no data, possibly returning an associated error
message. The \texttt{file} source yields the contents of a file
in a chunk by chunk fashion:
\begin{quote}
\begin{lua}
@stick#
function source.empty(err)
    return function()
        return nil, err
    end
end
%

@stick#
function source.file(handle, io_err)
    if handle then
        return function()
            local chunk = handle:read(2048)
            if not chunk then handle:close() end
            return chunk
        end
    else return source.empty(io_err or "unable to open file") end
end
%
\end{lua}
\end{quote}
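
In the same style, a source that hands out the pieces of a
string is convenient for testing (the sketch below is ours;
\texttt{LuaSocket}'s LTN12 module provides a similar
\texttt{source.string}):

```lua
local source = {}

-- Hypothetical string source: yields 'size'-byte pieces of 's' on
-- successive calls, then nil, following the source interface above.
function source.string(s, size)
    size = size or 2048
    local i = 1
    return function()
        local chunk = s:sub(i, i + size - 1)
        i = i + size
        if chunk ~= "" then return chunk end
        return nil
    end
end

local src = source.string("hello", 2)
-- successive calls return "he", "ll", "o", and then nil
```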

\subsection{Filtered sources}

A filtered source passes its data through the
associated filter before returning it to the caller.
Filtered sources are useful when working with
functions that get their input data from a source (such as
the pumps in our examples). By chaining a source with one or
more filters, such functions can be transparently provided
with filtered data, with no need to change their interfaces.
Here is a factory that does the job:
\begin{quote}
\begin{lua}
@stick#
function source.chain(src, f)
    return function()
        if not src then
            return nil
        end
        local chunk, err = src()
        if not chunk then
            src = nil
            return f(nil)
        else
            return f(chunk)
        end
    end
end
%
\end{lua}
\end{quote}

\subsection{Sinks}

Just as we defined an interface for a source of data, we can
also define an interface for a data destination. We call
any function respecting this interface a sink. In our first
example, we used a file sink connected to the standard
output.

Sinks receive consecutive chunks of data, until the end of
data is signaled by a \nil\ input chunk. A sink can be
notified of an error with an optional extra argument that
contains the error message, following a \nil\ chunk.
If a sink detects an error itself, and
wishes not to be called again, it can return \nil,
followed by an error message. A return value that
is not \nil\ means the sink will accept more data.

Below are two useful sink factories.
The \texttt{table} factory creates a sink that stores
individual chunks into an array. The data can later be
efficiently concatenated into a single string with Lua's
\texttt{table.concat} library function. The \texttt{null} sink
simply discards the chunks it receives:
\begin{quote}
\begin{lua}
@stick#
function sink.table(t)
    t = t or {}
    local f = function(chunk, err)
        if chunk then table.insert(t, chunk) end
        return 1
    end
    return f, t
end
%

@stick#
local function null()
    return 1
end

function sink.null()
    return null
end
%
\end{lua}
\end{quote}

Naturally, filtered sinks are just as useful as filtered
sources. A filtered sink passes each chunk it receives
through the associated filter before handing it down to the
original sink. In the following example, we use a source
that reads from the standard input. The input chunks are
sent to a table sink, which has been coupled with a
normalization filter. The filtered chunks are then
concatenated from the output array, and finally sent to
the standard output:
\begin{quote}
\begin{lua}
@stick#
local input = source.file(io.stdin)
local output, t = sink.table()
output = sink.chain(normalize(CRLF), output)
pump.all(input, output)
io.write(table.concat(t))
%
\end{lua}
\end{quote}
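
The \texttt{sink.chain} factory used above is the mirror
image of \texttt{source.chain}. A minimal sketch consistent
with our interfaces could look as follows (this is ours, not
\texttt{LuaSocket}'s actual implementation, which also has to
deal with the exploding-filter loop discussed in a later
section):

```lua
local sink = {}

-- Minimal sketch of a filtered sink: pass each chunk through the
-- filter f before handing it to the original sink snk; on the final
-- nil chunk, flush whatever f still holds, then forward the nil.
function sink.chain(f, snk)
    return function(chunk, err)
        if chunk then return snk(f(chunk), err) end
        local flushed = f(nil)
        if flushed and flushed ~= "" then
            local ret, snk_err = snk(flushed)
            if not ret then return nil, snk_err end
        end
        return snk(nil, err)
    end
end
```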

\subsection{Pumps}

Although not by design, our interface for sources is
compatible with Lua iterators. That is, a source can be
neatly used in conjunction with \texttt{for} loops. Using
our file source as an iterator, we can write the following
code:
\begin{quote}
\begin{lua}
@stick#
for chunk in source.file(io.stdin) do
    io.write(chunk)
end
%
\end{lua}
\end{quote}

Loops like this will always be present because everything
we designed so far is passive. Sources, sinks, filters: none
of them can do anything on their own. The operation of
pumping all data a source can provide into a sink is so
common that it deserves its own function:
\begin{quote}
\begin{lua}
@stick#
function pump.step(src, snk)
    local chunk, src_err = src()
    local ret, snk_err = snk(chunk, src_err)
    if chunk and ret then return 1
    else return nil, src_err or snk_err end
end
%

@stick#
function pump.all(src, snk, step)
    step = step or pump.step
    while true do
        local ret, err = step(src, snk)
        if not ret then
            if err then return nil, err
            else return 1 end
        end
    end
end
%
\end{lua}
\end{quote}

The \texttt{pump.step} function moves one chunk of data from
the source to the sink. The \texttt{pump.all} function takes
an optional \texttt{step} function and uses it to pump all the
data from the source to the sink.
Here is an example that uses the Base64 and the
line wrapping filters from the \texttt{LuaSocket}
distribution. The program reads a binary file from
disk and stores it in another file, after encoding it to the
Base64 transfer content encoding:
\begin{quote}
\begin{lua}
@stick#
local input = source.chain(
    source.file(io.open("input.bin", "rb")),
    encode("base64"))
local output = sink.chain(
    wrap(76),
    sink.file(io.open("output.b64", "w")))
pump.all(input, output)
%
\end{lua}
\end{quote}

The way we split the filters here is deliberately
non-intuitive. Alternatively, we could have chained the Base64
encoding filter and the line-wrap filter together, and then
chained the resulting filter with either the file source or
the file sink. It doesn't really matter.

\section{Exploding filters}

Our current filter interface has one serious shortcoming.
Consider for example a \texttt{gzip} decompression filter.
During decompression, a small input chunk can be exploded
into a huge amount of data. To address this problem, we
decided to change the filter interface and allow exploding
filters to return large quantities of output data in a chunk
by chunk manner.

More specifically, after passing each chunk of input to
a filter, and collecting the first chunk of output, the
user must now loop to receive other chunks from the filter until no
filtered data is left. Within these secondary calls, the
caller passes an empty string to the filter. The filter
responds with an empty string when it is ready for the next
input chunk. In the end, after the user passes a
\nil\ chunk notifying the filter that there is no
more input data, the filter might still have to produce too
much output data to return in a single chunk. The user has
to loop again, now passing \nil\ to the filter each time,
until the filter itself returns \nil\ to notify the
user it is finally done.
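
The calling discipline just described can be sketched as a
small helper (ours, for illustration; in \texttt{LuaSocket}
this loop is hidden inside the chaining functions). The
\texttt{tripler} below is a toy exploding filter used to
exercise it:

```lua
-- Drain one input chunk through an exploding filter f, appending
-- every piece of output to out. During data, f answers "" when it
-- is ready for the next chunk; after the final nil, it answers nil
-- when it is finally done.
local function drain(f, chunk, out)
    if chunk then
        local piece = f(chunk)
        while piece ~= "" do
            out[#out + 1] = piece
            piece = f("")
        end
    else
        local piece = f(nil)
        while piece do
            out[#out + 1] = piece
            piece = f(nil)
        end
    end
end

-- Toy exploding filter: emits each input chunk three times, one
-- copy per call, following the protocol above.
local function tripler()
    local pending = {}
    return function(chunk)
        if chunk and chunk ~= "" then
            pending = {chunk, chunk}
            return chunk
        end
        if #pending > 0 then return table.remove(pending, 1) end
        return chunk and "" or nil  -- "" during data, nil at the end
    end
end
```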

Fortunately, it is very easy to modify a filter to respect
the new interface. In fact, the end-of-line translation
filter we presented earlier already conforms to it. The
complexity is encapsulated within the chaining functions,
which must now include a loop. Since these functions only
have to be written once, the user is rarely affected.
Interestingly, the modifications do not have a measurable
negative impact on the performance of filters that do
not need the added flexibility. On the other hand, for a
small price in complexity, the changes make exploding
filters practical.

\section{A complex example}

The LTN12 module in the \texttt{LuaSocket} distribution
implements all the ideas we have described. The MIME
and SMTP modules are tightly integrated with LTN12,
and can be used to showcase the expressive power of filters,
sources, sinks, and pumps. Below is an example
of how a user would proceed to define and send a
multipart message, with attachments, using \texttt{LuaSocket}:
\begin{quote}
\begin{mime}
local smtp = require"socket.smtp"
local mime = require"mime"
local ltn12 = require"ltn12"

local message = smtp.message{
    headers = {
        from = "Sicrano <sicrano@example.com>",
        to = "Fulano <fulano@example.com>",
        subject = "A message with an attachment"},
    body = {
        preamble = "Hope you can see the attachment" .. CRLF,
        [1] = {
            body = "Here is our logo" .. CRLF},
        [2] = {
            headers = {
                ["content-type"] = 'image/png; name="luasocket.png"',
                ["content-disposition"] =
                    'attachment; filename="luasocket.png"',
                ["content-description"] = 'LuaSocket logo',
                ["content-transfer-encoding"] = "BASE64"},
            body = ltn12.source.chain(
                ltn12.source.file(io.open("luasocket.png", "rb")),
                ltn12.filter.chain(
                    mime.encode("base64"),
                    mime.wrap()))}}}

assert(smtp.send{
    rcpt = "<fulano@example.com>",
    from = "<sicrano@example.com>",
    source = message})
\end{mime}
\end{quote}

The \texttt{smtp.message} function receives a table
describing the message, and returns a source. The
\texttt{smtp.send} function takes this source, chains it with the
SMTP dot-stuffing filter, connects a socket sink
with the server, and simply pumps the data. The message is never
assembled in memory. Everything is produced on demand,
transformed in small pieces, and sent to the server in chunks,
including the file attachment which is loaded from disk and
encoded on the fly. It just works.

\section{Conclusions}

In this article, we introduced the concepts of filters,
sources, sinks, and pumps to the Lua language. These are
useful tools for stream processing in general. Sources provide
a simple abstraction for data acquisition. Sinks provide an
abstraction for final data destinations. Filters define an
interface for data transformations. The chaining of
filters, sources and sinks provides an elegant way to create
arbitrarily complex data transformations from simpler
components. Pumps simply push the data through.

\section{Acknowledgements}

The concepts described in this text are the result of long
discussions with David Burgess. A version of this text has
been released on-line as the Lua Technical Note 012, hence
the name of the corresponding LuaSocket module, LTN12. Wim
Couwenberg contributed to the implementation of the module,
and Adrian Sietsma was the first to notice the
correspondence between sources and Lua iterators.


\end{document}