\documentclass[10pt]{article}
\usepackage{fancyvrb}
\usepackage{url}
\DefineVerbatimEnvironment{lua}{Verbatim}{fontsize=\small,commandchars=\@\#\%}
\DefineVerbatimEnvironment{C}{Verbatim}{fontsize=\small,commandchars=\@\#\%}
\DefineVerbatimEnvironment{mime}{Verbatim}{fontsize=\small,commandchars=\$\#\%}
\newcommand{\stick}[1]{\vbox{\setlength{\parskip}{0pt}#1}}
\newcommand{\bl}{\ensuremath{\mathtt{\backslash}}}
\newcommand{\CR}{\texttt{CR}}
\newcommand{\LF}{\texttt{LF}}
\newcommand{\CRLF}{\texttt{CR~LF}}
\newcommand{\nil}{\texttt{nil}}

\title{Filters, sources, sinks, and pumps\\
    {\large or Functional programming for the rest of us}}
\author{Diego Nehab}

\begin{document}

\maketitle

\begin{abstract}
Certain data processing operations can be implemented in the
form of filters. A filter is a function that can process
data received in consecutive invocations, returning partial
results each time it is called. Examples of operations that
can be implemented as filters include the end-of-line
normalization for text, Base64 and Quoted-Printable transfer
content encodings, the breaking of text into lines, SMTP
dot-stuffing, and many others. Filters become
even more powerful when we allow them to be chained together
to create composite filters. In this context, filters can be
seen as the internal links in a chain of data transformations.
Sources and sinks are the corresponding end points in these
chains. A source is a function that produces data, chunk by
chunk, and a sink is a function that takes data, chunk by
chunk. Finally, pumps are procedures that actively drive
data from a source to a sink, and indirectly through all
intervening filters. In this article, we describe the design of an
elegant interface for filters, sources, sinks, chains, and
pumps, and we illustrate each step with concrete examples.
\end{abstract}

\section{Introduction}

Within the realm of networking applications, we are often
required to apply transformations to streams of data. Examples
include the end-of-line normalization for text, Base64 and
Quoted-Printable transfer content encodings, breaking text
into lines with a maximum number of columns, SMTP
dot-stuffing, \texttt{gzip} compression, HTTP chunked
transfer coding, and the list goes on.

Many complex tasks require a combination of two or more such
transformations, and therefore a general mechanism for
promoting reuse is desirable. In the process of designing
\texttt{LuaSocket~2.0}, we repeatedly faced this problem.
The solution we reached proved to be very general and
convenient. It is based on the concepts of filters, sources,
sinks, and pumps, which we introduce below.

\emph{Filters} are functions that can be repeatedly invoked
with chunks of input, successively returning processed
chunks of output. Naturally, the result of
concatenating all the output chunks must be the same as the
result of applying the filter to the concatenation of all
input chunks. In fancier language, filters \emph{commute}
with the concatenation operator. More importantly, filters
must handle input data correctly no matter how the stream
has been split into chunks.

A \emph{chain} is a function that transparently combines the
effect of one or more filters. The interface of a chain is
indistinguishable from the interface of its component
filters. This allows a chained filter to be used wherever
an atomic filter is accepted. In particular, chains can
themselves be chained to create arbitrarily complex operations.

Filters can be seen as internal nodes in a network through
which data will flow, potentially being transformed many
times along the way. Chains connect these nodes together.
The initial and final nodes of the network are
\emph{sources} and \emph{sinks}, respectively. Less
abstractly, a source is a function that produces new chunks
of data every time it is invoked. Conversely, sinks are
functions that give a final destination to the chunks of
data they receive in successive calls. Naturally, sources
and sinks can also be chained with filters to produce
filtered sources and sinks.

Finally, filters, chains, sources, and sinks are all passive
entities: they must be repeatedly invoked in order for
anything to happen. \emph{Pumps} provide the driving force
that pushes data through the network, from a source to a
sink, and indirectly through all intervening filters.

In the following sections, we start with a simplified
interface, which we later refine. The evolution we present
is not contrived: it recreates the steps we ourselves
followed as we consolidated our understanding of these
concepts within our application domain.

\subsection{A simple example}

The end-of-line normalization of text is a good
example to motivate our initial filter interface.
Assume we are given text in an unknown end-of-line
convention (including possibly mixed conventions) out of the
commonly found Unix (\LF), Mac OS (\CR), and
DOS (\CRLF) conventions. We would like to be able to
use the following code to normalize the end-of-line markers:
\begin{quote}
\begin{lua}
@stick#
local CRLF = "\013\010"
local input = source.chain(source.file(io.stdin), normalize(CRLF))
local output = sink.file(io.stdout)
pump.all(input, output)
%
\end{lua}
\end{quote}

This program should read data from the standard input stream
and normalize the end-of-line markers to the canonical
\CRLF\ marker, as defined by the MIME standard.
Finally, the normalized text should be sent to the standard output
stream. We use a \emph{file source} that produces data from
standard input, and chain it with a filter that normalizes
the data. The pump then repeatedly obtains data from the
source, and passes it to the \emph{file sink}, which sends
it to the standard output.

In the code above, the \texttt{normalize} \emph{factory} is a
function that creates our normalization filter, which
replaces any end-of-line marker with the canonical marker.
The initial filter interface is
trivial: a filter function receives a chunk of input data,
and returns a chunk of processed data. When there is no
more input data left, the caller notifies the filter by invoking
it with a \nil\ chunk. The filter responds by returning
the final chunk of processed data (which could of course be
the empty string).
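
To make the calling convention concrete, the sketch below
drives a normalization filter by hand, with arbitrarily
chosen chunks:
\begin{quote}
\begin{lua}
@stick#
local f = normalize(CRLF)
local out = {}
table.insert(out, f("first line\013"))
table.insert(out, f("\010second line\013\010"))
table.insert(out, f(nil))  -- end of data: collect the final chunk
io.write(table.concat(out))
%
\end{lua}
\end{quote}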

Although the interface is extremely simple, the
implementation is not so obvious. A normalization filter
respecting this interface needs to keep some kind of context
between calls. This is because a chunk boundary may lie between
the \CR\ and \LF\ characters marking the end of a single line. This
need for contextual storage motivates the use of
factories: each time the factory is invoked, it returns a
filter with its own context so that we can have several
independent filters being used at the same time. For
efficiency reasons, we must avoid the obvious solution of
concatenating all the input into the context before
producing any output chunks.

To that end, we break the implementation into two parts:
a low-level filter, and a factory of high-level filters. The
low-level filter is implemented in C and does not maintain
any context between function calls. The high-level filter
factory, implemented in Lua, creates and returns a
high-level filter that maintains whatever context the low-level
filter needs, but isolates the user from its internal
details. That way, we take advantage of C's efficiency to
perform the hard work, and take advantage of Lua's
simplicity for the bookkeeping.

\subsection{The Lua part of the filter}

Below is the complete implementation of the factory of high-level
end-of-line normalization filters:
\begin{quote}
\begin{lua}
@stick#
function filter.cycle(lowlevel, context, extra)
    return function(chunk)
        local ret
        ret, context = lowlevel(context, chunk, extra)
        return ret
    end
end
%

@stick#
function normalize(marker)
    return filter.cycle(eol, 0, marker)
end
%
\end{lua}
\end{quote}

The \texttt{normalize} factory simply calls a more generic
factory, the \texttt{cycle}~factory, passing the low-level
filter~\texttt{eol}. The \texttt{cycle}~factory receives a
low-level filter, an initial context, and an extra
parameter, and returns a new high-level filter. Each time
the high-level filter is passed a new chunk, it invokes the
low-level filter with the previous context, the new chunk,
and the extra argument. It is the low-level filter that
does all the work, producing the chunk of processed data and
a new context. The high-level filter then replaces its
internal context, and returns the processed chunk of data to
the user. Notice that we take advantage of Lua's lexical
scoping to store the context in a closure between function
calls.
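
For instance, because each call to the factory creates a
fresh closure, two filters obtained from it keep fully
independent contexts. A sketch, assuming the low-level
\texttt{eol} filter has been loaded:
\begin{quote}
\begin{lua}
@stick#
local f1 = normalize(CRLF)
local f2 = normalize(CRLF)
-- a marker split across chunks is handled by f1's context,
-- without interfering with f2
local a = f1("one\013") .. f1("\010two")
local b = f2("three\010")
%
\end{lua}
\end{quote}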

\subsection{The C part of the filter}

As for the low-level filter, we must first accept
that there is no perfect solution to the end-of-line marker
normalization problem. The difficulty comes from an
inherent ambiguity in the definition of empty lines within
mixed input. However, the following solution works well for
any consistent input, as well as for non-empty lines in
mixed input. It also does a reasonable job with empty lines
and serves as a good example of how to implement a low-level
filter.

The idea is to consider both \CR\ and~\LF\ as end-of-line
\emph{candidates}. We issue a single break if any candidate
is seen alone, or if it is followed by a different
candidate. In other words, \CR~\CR~and \LF~\LF\ each issue
two end-of-line markers, whereas \CR~\LF~and \LF~\CR\ issue
only one marker each. It is easy to see that this method
correctly handles the most common end-of-line conventions.

With this in mind, we divide the low-level filter into two
simple functions. The inner function~\texttt{pushchar} performs the
normalization itself. It takes each input character in turn,
deciding what to output and how to modify the context. The
context tells if the last processed character was an
end-of-line candidate, and if so, which candidate it was.
For efficiency, we use Lua's auxiliary library's buffer
interface:
\begin{quote}
\begin{C}
@stick#
@#define candidate(c) (c == CR || c == LF)
static int pushchar(int c, int last, const char *marker,
        luaL_Buffer *buffer) {
    if (candidate(c)) {
        if (candidate(last)) {
            if (c == last)
                luaL_addstring(buffer, marker);
            return 0;
        } else {
            luaL_addstring(buffer, marker);
            return c;
        }
    } else {
        luaL_addchar(buffer, c);
        return 0;
    }
}
%
\end{C}
\end{quote}

The outer function~\texttt{eol} simply interfaces with Lua.
It receives the context and input chunk (as well as an
optional custom end-of-line marker), and returns the
transformed output chunk and the new context.
Notice that if the input chunk is \nil, the operation
is considered to be finished. In that case, the loop will
not execute a single time and the context is reset to the
initial state. This allows the filter to be reused many
times:
\begin{quote}
\begin{C}
@stick#
static int eol(lua_State *L) {
    int context = luaL_checkint(L, 1);
    size_t isize = 0;
    const char *input = luaL_optlstring(L, 2, NULL, &isize);
    const char *last = input + isize;
    const char *marker = luaL_optstring(L, 3, CRLF);
    luaL_Buffer buffer;
    luaL_buffinit(L, &buffer);
    if (!input) {
        lua_pushnil(L);
        lua_pushnumber(L, 0);
        return 2;
    }
    while (input < last)
        context = pushchar(*input++, context, marker, &buffer);
    luaL_pushresult(&buffer);
    lua_pushnumber(L, context);
    return 2;
}
%
\end{C}
\end{quote}

When designing filters, the challenging part is usually
deciding what to store in the context. For line breaking, for
instance, it could be the number of bytes that still fit in the
current line. For Base64 encoding, it could be a string
with the bytes that remain after the division of the input
into 3-byte atoms. The MIME module in the \texttt{LuaSocket}
distribution has many other examples.
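
As an illustration of such a context, here is a sketch of a
line-breaking low-level filter, written in Lua rather than C
for brevity. The function and its fixed 76-column limit are
hypothetical, but it plugs directly into the
\texttt{cycle}~factory:
\begin{quote}
\begin{lua}
@stick#
-- context is the number of bytes that still fit in the line
local function brk(left, chunk)
    if not chunk then return nil, 76 end
    local out = {}
    while string.len(chunk) > left do
        table.insert(out, string.sub(chunk, 1, left))
        table.insert(out, "\013\010")
        chunk = string.sub(chunk, left + 1)
        left = 76
    end
    table.insert(out, chunk)
    return table.concat(out), left - string.len(chunk)
end

wrap76 = filter.cycle(brk, 76)
%
\end{lua}
\end{quote}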

\section{Filter chains}

Chains greatly increase the power of filters. For example,
according to the standard for Quoted-Printable encoding,
text should be normalized to a canonical end-of-line marker
prior to encoding. After encoding, the resulting text must
be broken into lines of no more than 76 characters, with the
use of soft line breaks (a line terminated by the \texttt{=}
sign). To help specify complex transformations like
this, we define a chain factory that creates a composite
filter from one or more filters. A chained filter passes
data through all its components, and can be used wherever a
primitive filter is accepted.

The chaining factory is very simple. The auxiliary
function~\texttt{chainpair} chains two filters together,
taking special care if the chunk is the last. This is
because the final \nil\ chunk notification has to be
pushed through both filters in turn:
\begin{quote}
\begin{lua}
@stick#
local function chainpair(f1, f2)
    return function(chunk)
        local ret = f2(f1(chunk))
        if chunk then return ret
        else return ret .. f2() end
    end
end
%

@stick#
function filter.chain(...)
    local f = select(1, ...)
    for i = 2, select('@#', ...) do
        f = chainpair(f, select(i, ...))
    end
    return f
end
%
\end{lua}
\end{quote}

Thanks to the chain factory, we can
define the Quoted-Printable conversion as follows:
\begin{quote}
\begin{lua}
@stick#
local qp = filter.chain(normalize(CRLF), encode("quoted-printable"),
    wrap("quoted-printable"))
local input = source.chain(source.file(io.stdin), qp)
local output = sink.file(io.stdout)
pump.all(input, output)
%
\end{lua}
\end{quote}

\section{Sources, sinks, and pumps}

The filters we introduced so far act as the internal nodes
in a network of transformations. Information flows from node
to node (or rather from one filter to the next) and is
transformed along the way. Chaining filters together is our
way to connect nodes in this network. As the starting point
for the network, we need a source node that produces the
data. At the end of the network, we need a sink node that
gives a final destination to the data.

\subsection{Sources}

A source returns the next chunk of data each time it is
invoked. When there is no more data, it simply returns~\nil.
In the event of an error, the source can inform the
caller by returning \nil\ followed by the error message.

Below are two simple source factories. The \texttt{empty} source
returns no data, possibly returning an associated error
message. The \texttt{file} source yields the contents of a file
chunk by chunk:
\begin{quote}
\begin{lua}
@stick#
function source.empty(err)
    return function()
        return nil, err
    end
end
%

@stick#
function source.file(handle, io_err)
    if handle then
        return function()
            local chunk = handle:read(2048)
            if not chunk then handle:close() end
            return chunk
        end
    else return source.empty(io_err or "unable to open file") end
end
%
\end{lua}
\end{quote}
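
Other sources follow the same pattern. For instance, a
hypothetical \texttt{string} source (a sketch, not part of
the code above) could produce the contents of a Lua string
in 2048-byte chunks:
\begin{quote}
\begin{lua}
@stick#
function source.string(s)
    local i = 1
    return function()
        local chunk = string.sub(s, i, i + 2047)
        i = i + 2048
        if chunk ~= "" then return chunk end
        return nil
    end
end
%
\end{lua}
\end{quote}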

\subsection{Filtered sources}

A filtered source passes its data through the
associated filter before returning it to the caller.
Filtered sources are useful when working with
functions that get their input data from a source (such as
the pumps in our examples). By chaining a source with one or
more filters, such functions can be transparently provided
with filtered data, with no need to change their interfaces.
Here is a factory that does the job:
\begin{quote}
\begin{lua}
@stick#
function source.chain(src, f)
    return function()
        if not src then
            return nil
        end
        local chunk, err = src()
        if not chunk then
            src = nil
            return f(nil)
        else
            return f(chunk)
        end
    end
end
%
\end{lua}
\end{quote}

\subsection{Sinks}

Just as we defined an interface for a source of data, we can
also define an interface for a data destination. We call
any function respecting this interface a sink. In our first
example, we used a file sink connected to the standard
output.

Sinks receive consecutive chunks of data, until the end of
data is signaled by a \nil\ input chunk. A sink can be
notified of an error with an optional extra argument that
contains the error message, following a \nil\ chunk.
If a sink detects an error itself, and
wishes not to be called again, it can return \nil,
followed by an error message. A return value that
is not \nil\ means the sink will accept more data.

Below are two useful sink factories.
The \texttt{table} factory creates a sink that stores
individual chunks into an array. The data can later be
efficiently concatenated into a single string with Lua's
\texttt{table.concat} library function. The \texttt{null} sink
simply discards the chunks it receives:
\begin{quote}
\begin{lua}
@stick#
function sink.table(t)
    t = t or {}
    local f = function(chunk, err)
        if chunk then table.insert(t, chunk) end
        return 1
    end
    return f, t
end
%

@stick#
local function null()
    return 1
end

function sink.null()
    return null
end
%
\end{lua}
\end{quote}
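
The \emph{file sink} used in our first example can be
written along the same lines as the \texttt{file} source.
The sketch below assumes the same error conventions:
\begin{quote}
\begin{lua}
@stick#
function sink.file(handle, io_err)
    if handle then
        return function(chunk, err)
            if not chunk then
                handle:close()
                return 1
            end
            return handle:write(chunk)
        end
    else
        return function()
            return nil, io_err or "unable to open file"
        end
    end
end
%
\end{lua}
\end{quote}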

Naturally, filtered sinks are just as useful as filtered
sources. A filtered sink passes each chunk it receives
through the associated filter before handing it down to the
original sink. In the following example, we use a source
that reads from the standard input. The input chunks are
sent to a table sink, which has been coupled with a
normalization filter. The filtered chunks are then
concatenated from the output array, and finally sent to
the standard output:
\begin{quote}
\begin{lua}
@stick#
local input = source.file(io.stdin)
local output, t = sink.table()
output = sink.chain(normalize(CRLF), output)
pump.all(input, output)
io.write(table.concat(t))
%
\end{lua}
\end{quote}

\subsection{Pumps}

Although not by design, our interface for sources is
compatible with Lua iterators. That is, a source can be
neatly used in conjunction with \texttt{for} loops. Using
our file source as an iterator, we can write the following
code:
\begin{quote}
\begin{lua}
@stick#
for chunk in source.file(io.stdin) do
    io.write(chunk)
end
%
\end{lua}
\end{quote}

Loops like this will always be present because everything
we designed so far is passive. Sources, sinks, filters: none
of them can do anything on their own. The operation of
pumping all data a source can provide into a sink is so
common that it deserves its own function:
\begin{quote}
\begin{lua}
@stick#
function pump.step(src, snk)
    local chunk, src_err = src()
    local ret, snk_err = snk(chunk, src_err)
    if chunk and ret then return 1
    else return nil, src_err or snk_err end
end
%

@stick#
function pump.all(src, snk, step)
    step = step or pump.step
    while true do
        local ret, err = step(src, snk)
        if not ret then
            if err then return nil, err
            else return 1 end
        end
    end
end
%
\end{lua}
\end{quote}

The \texttt{pump.step} function moves one chunk of data from
the source to the sink. The \texttt{pump.all} function takes
an optional \texttt{step} function and uses it to pump all the
data from the source to the sink.
Here is an example that uses the Base64 and the
line wrapping filters from the \texttt{LuaSocket}
distribution. The program reads a binary file from
disk and stores it in another file, after encoding it to the
Base64 transfer content encoding:
\begin{quote}
\begin{lua}
@stick#
local input = source.chain(
    source.file(io.open("input.bin", "rb")),
    encode("base64"))
local output = sink.chain(
    wrap(76),
    sink.file(io.open("output.b64", "w")))
pump.all(input, output)
%
\end{lua}
\end{quote}

The way we split the filters here is not intuitive, on
purpose. Alternatively, we could have chained the Base64
encode filter and the line-wrap filter together, and then
chained the resulting filter with either the file source or
the file sink. It doesn't really matter.

\section{Exploding filters}

Our current filter interface has one serious shortcoming.
Consider for example a \texttt{gzip} decompression filter.
During decompression, a small input chunk can be exploded
into a huge amount of data. To address this problem, we
decided to change the filter interface and allow exploding
filters to return large quantities of output data in a chunk
by chunk manner.

More specifically, after passing each chunk of input to
a filter, and collecting the first chunk of output, the
user must now loop to receive other chunks from the filter until no
filtered data is left. Within these secondary calls, the
caller passes an empty string to the filter. The filter
responds with an empty string when it is ready for the next
input chunk. In the end, after the user passes a
\nil\ chunk notifying the filter that there is no
more input data, the filter might still have too
much output data to return in a single chunk. The user has
to loop again, now passing \nil\ to the filter each time,
until the filter itself returns \nil\ to notify the
user it is finally done.
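
One possible driving loop for this extended protocol is
sketched below; the function is hypothetical, with
\texttt{f} an exploding filter and \texttt{snk} a sink:
\begin{quote}
\begin{lua}
@stick#
local function explode_step(f, snk, chunk)
    local piece = f(chunk)
    while piece do
        if piece ~= "" and not snk(piece) then return nil end
        -- an empty string means the filter wants more input
        if chunk and piece == "" then break end
        piece = f(chunk and "" or nil)
    end
    return 1
end
%
\end{lua}
\end{quote}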

Fortunately, it is very easy to modify a filter to respect
the new interface. In fact, the end-of-line translation
filter we presented earlier already conforms to it. The
complexity is encapsulated within the chaining functions,
which must now include a loop. Since these functions only
have to be written once, the user is rarely affected.
Interestingly, the modifications do not have a measurable
negative impact on the performance of filters that do
not need the added flexibility. On the other hand, for a
small price in complexity, the changes make exploding
filters practical.

\section{A complex example}

The LTN12 module in the \texttt{LuaSocket} distribution
implements all the ideas we have described. The MIME
and SMTP modules are tightly integrated with LTN12,
and can be used to showcase the expressive power of filters,
sources, sinks, and pumps. Below is an example
of how a user would proceed to define and send a
multipart message, with attachments, using \texttt{LuaSocket}:
\begin{quote}
\begin{mime}
local smtp = require"socket.smtp"
local mime = require"mime"
local ltn12 = require"ltn12"

local message = smtp.message{
    headers = {
        from = "Sicrano <sicrano@example.com>",
        to = "Fulano <fulano@example.com>",
        subject = "A message with an attachment"},
    body = {
        preamble = "Hope you can see the attachment" .. CRLF,
        [1] = {
            body = "Here is our logo" .. CRLF},
        [2] = {
            headers = {
                ["content-type"] = 'image/png; name="luasocket.png"',
                ["content-disposition"] =
                    'attachment; filename="luasocket.png"',
                ["content-description"] = 'LuaSocket logo',
                ["content-transfer-encoding"] = "BASE64"},
            body = ltn12.source.chain(
                ltn12.source.file(io.open("luasocket.png", "rb")),
                ltn12.filter.chain(
                    mime.encode("base64"),
                    mime.wrap()))}}}

assert(smtp.send{
    rcpt = "<fulano@example.com>",
    from = "<sicrano@example.com>",
    source = message})
\end{mime}
\end{quote}

The \texttt{smtp.message} function receives a table
describing the message, and returns a source. The
\texttt{smtp.send} function takes this source, chains it with the
SMTP dot-stuffing filter, connects a socket sink
to the server, and simply pumps the data. The message is never
assembled in memory. Everything is produced on demand,
transformed in small pieces, and sent to the server in chunks,
including the file attachment, which is loaded from disk and
encoded on the fly. It just works.

\section{Conclusions}

In this article, we introduced the concepts of filters,
sources, sinks, and pumps to the Lua language. These are
useful tools for stream processing in general. Sources provide
a simple abstraction for data acquisition. Sinks provide an
abstraction for final data destinations. Filters define an
interface for data transformations. The chaining of
filters, sources, and sinks provides an elegant way to create
arbitrarily complex data transformations from simpler
components. Pumps simply push the data through.

\section{Acknowledgements}

The concepts described in this text are the result of long
discussions with David Burgess. A version of this text has
been released on-line as the Lua Technical Note 012, hence
the name of the corresponding LuaSocket module, LTN12. Wim
Couwenberg contributed to the implementation of the module,
and Adrian Sietsma was the first to notice the
correspondence between sources and Lua iterators.

\end{document}