author     Diego Nehab <diego@tecgraf.puc-rio.br>   2007-05-31 22:27:40 +0000
committer  Diego Nehab <diego@tecgraf.puc-rio.br>   2007-05-31 22:27:40 +0000
commit     3074a8f56b5153f4477e662453102583d7b6f539 (patch)
tree       095eecdd7017e17115ab8387898d2f5e5f2f2323
parent     7b195164b0c8755b15e8055f1d524282847f6e13 (diff)
download   luasocket-3074a8f56b5153f4477e662453102583d7b6f539.tar.gz
           luasocket-3074a8f56b5153f4477e662453102583d7b6f539.tar.bz2
           luasocket-3074a8f56b5153f4477e662453102583d7b6f539.zip
Before sending to Roberto.
-rw-r--r--  gem/ltn012.tex  218
1 file changed, 105 insertions(+), 113 deletions(-)
diff --git a/gem/ltn012.tex b/gem/ltn012.tex
index 7dbc5ef..0f81b86 100644
--- a/gem/ltn012.tex
+++ b/gem/ltn012.tex
@@ -23,19 +23,17 @@ received in consecutive function calls, returning partial
 results after each invocation. Examples of operations that can be
 implemented as filters include the end-of-line normalization
 for text, Base64 and Quoted-Printable transfer content
-encodings, the breaking of text into lines, SMTP byte
-stuffing, and there are many others. Filters become even
+encodings, the breaking of text into lines, SMTP dot-stuffing,
+and there are many others. Filters become even
 more powerful when we allow them to be chained together to
 create composite filters. In this context, filters can be seen
 as the middle links in a chain of data transformations. Sources and sinks
 are the corresponding end points of these chains. A source
 is a function that produces data, chunk by chunk, and a sink
 is a function that takes data, chunk by chunk. In this
-chapter, we describe the design of an elegant interface for filters,
-sources, sinks and chaining, refine it
-until it reaches a high degree of generality. We discuss
-implementation challenges, provide practical solutions,
-and illustrate each step with concrete examples.
+article, we describe the design of an elegant interface for filters,
+sources, sinks, and chaining, and illustrate each step
+with concrete examples.
 \end{abstract}
 
 
@@ -52,7 +50,7 @@ transfer coding, and the list goes on.
 Many complex tasks require a combination of two or more such
 transformations, and therefore a general mechanism for
 promoting reuse is desirable. In the process of designing
-LuaSocket 2.0, David Burgess and I were forced to deal with
+\texttt{LuaSocket~2.0}, David Burgess and I were forced to deal with
 this problem. The solution we reached proved to be very
 general and convenient. It is based on the concepts of
 filters, sources, sinks, and pumps, which we introduce
@@ -62,18 +60,18 @@ below.
 with chunks of input, successively returning processed
 chunks of output. More importantly, the result of
 concatenating all the output chunks must be the same as the
-result of applying the filter over the concatenation of all
+result of applying the filter to the concatenation of all
 input chunks. In fancier language, filters \emph{commute}
 with the concatenation operator. As a result, chunk
 boundaries are irrelevant: filters correctly handle input
-data no matter how it was originally split.
+data no matter how it is split.
 
 A \emph{chain} transparently combines the effect of one or
-more filters. The interface of a chain must be
+more filters. The interface of a chain is
 indistinguishable from the interface of its components.
 This allows a chained filter to be used wherever an atomic
-filter is expected. In particular, chains can be chained
-themselves to create arbitrarily complex operations.
+filter is expected. In particular, chains can be
+themselves chained to create arbitrarily complex operations.
 
 Filters can be seen as internal nodes in a network through
 which data will flow, potentially being transformed many
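
As an aside, the commuting property described in this hunk can be illustrated with a toy, stateless filter factory (hypothetical; real filters keep context between calls):

    -- Hypothetical stateless filter: uppercases whatever it receives and
    -- has nothing left to flush when handed the final nil chunk.
    local function upper()
      return function(chunk) return chunk and chunk:upper() or "" end
    end

    -- Feeding the data in two chunks or in one yields the same output.
    local f1, f2 = upper(), upper()
    assert(f1("he") .. f1("llo\n") .. f1(nil) == f2("hello\n") .. f2(nil))
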
@@ -93,15 +91,13 @@ anything to happen. \emph{Pumps} provide the driving force
 that pushes data through the network, from a source to a
 sink.
 
-These concepts will become less abstract with examples. In
-the following sections, we start with a simplified
-interface, which we refine several times until no obvious
-shortcomings remain. The evolution we present is not
-contrived: it recreates the steps we followed ourselves as
-we consolidated our understanding of these concepts and the
-applications that benefit from them.
+In the following sections, we start with a simplified
+interface, which we later refine. The evolution we present
+is not contrived: it recreates the steps we followed
+ourselves as we consolidated our understanding of these
+concepts within our application domain.
 
-\subsection{A concrete example}
+\subsection{A simple example}
 
 Let us use the end-of-line normalization of text as an
 example to motivate our initial filter interface.
@@ -141,23 +137,23 @@ it with a \texttt{nil} chunk. The filter responds by returning
 the final chunk of processed data.
 
 Although the interface is extremely simple, the
-implementation is not so obvious. Any filter
+implementation is not so obvious. A normalization filter
 respecting this interface needs to keep some kind of context
-between calls. This is because chunks can for example be broken
-between the CR and LF characters marking the end of a line. This
-need for contextual storage is what motivates the use of
-factories: each time the factory is called, it returns a
+between calls. This is because a chunk boundary may lie between
+the CR and LF characters marking the end of a line. This
+need for contextual storage motivates the use of
+factories: each time the factory is invoked, it returns a
 filter with its own context so that we can have several
 independent filters being used at the same time. For
 efficiency reasons, we must avoid the obvious solution of
 concatenating all the input into the context before
 producing any output.
 
-To that end, we will break the implementation in two parts:
+To that end, we break the implementation into two parts:
 a low-level filter, and a factory of high-level filters. The
-low-level filter will be implemented in C and will not carry
+low-level filter is implemented in C and does not maintain
 any context between function calls. The high-level filter
-factory, implemented in Lua, will create and return a
+factory, implemented in Lua, creates and returns a
 high-level filter that maintains whatever context the low-level
 filter needs, but isolates the user from its internal
 details. That way, we take advantage of C's efficiency to
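
As a usage note, driving a filter under the interface described above looks roughly like this; the normalize factory name follows the article, and the chunks are made up:

    -- Feed successive chunks to the filter, then a final nil to flush it.
    local f = normalize("\r\n")
    local out = {}
    for _, chunk in ipairs{"first line\r", "\nsecond line\n", "third"} do
      out[#out + 1] = f(chunk)
    end
    out[#out + 1] = f(nil)        -- final chunk of processed data
    io.write(table.concat(out))
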
@@ -191,22 +187,21 @@ end
 The \texttt{normalize} factory simply calls a more generic
 factory, the \texttt{cycle} factory. This factory receives a
 low-level filter, an initial context, and an extra
-parameter, and returns the corresponding high-level filter.
-Each time the high-level filter is passed a new chunk, it
-invokes the low-level filter passing it the previous
-context, the new chunk, and the extra argument. The
-low-level filter in turn produces the chunk of processed
-data and a new context. The high-level filter then updates
-its internal context, and returns the processed chunk of
-data to the user. It is the low-level filter that does all
-the work. Notice that we take advantage of Lua's lexical
+parameter, and returns a new high-level filter. Each time
+the high-level filter is passed a new chunk, it invokes the
+low-level filter with the previous context, the new chunk,
+and the extra argument. It is the low-level filter that
+does all the work, producing the chunk of processed data and
+a new context. The high-level filter then updates its
+internal context, and returns the processed chunk of data to
+the user. Notice that we take advantage of Lua's lexical
 scoping to store the context in a closure between function
 calls.
 
 Concerning the low-level filter code, we must first accept
 that there is no perfect solution to the end-of-line marker
-normalization problem itself. The difficulty comes from an
-inherent ambiguity on the definition of empty lines within
+normalization problem. The difficulty comes from an
+inherent ambiguity in the definition of empty lines within
 mixed input. However, the following solution works well for
 any consistent input, as well as for non-empty lines in
 mixed input. It also does a reasonable job with empty lines
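
For readers without the full source at hand, a sketch of the cycle factory consistent with the description above, assuming the low-level filter signature low(ctx, chunk, extra) returning the processed chunk and a new context (the filter table prefix is an assumption):

    -- Wrap a context-free low-level filter into a high-level filter
    -- that keeps its context in a closure between calls.
    function filter.cycle(low, ctx, extra)
      return function(chunk)
        local ret
        ret, ctx = low(ctx, chunk, extra)   -- the low-level filter does the work
        return ret
      end
    end

    -- The normalize factory then reduces to one line (eol is the C filter).
    function normalize(marker)
      return filter.cycle(eol, 0, marker)
    end
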
@@ -218,17 +213,18 @@ The idea is to consider both CR and~LF as end-of-line
 is seen alone, or followed by a different candidate. In
 other words, CR~CR~and LF~LF each issue two end-of-line
 markers, whereas CR~LF~and LF~CR issue only one marker each.
-This idea correctly handles the Unix, DOS/MIME, VMS, and Mac
-OS, as well as other more obscure conventions.
+This method correctly handles the Unix, DOS/MIME, VMS, and Mac
+OS conventions.
 
 \subsection{The C part of the filter}
 
 Our low-level filter is divided into two simple functions.
-The inner function actually does the conversion. It takes
+The inner function performs the normalization itself. It takes
 each input character in turn, deciding what to output and
 how to modify the context. The context tells if the last
-character processed was an end-of-line candidate, and if so,
-which candidate it was.
+processed character was an end-of-line candidate, and if so,
+which candidate it was. For efficiency, it uses
+Lua's auxiliary library's buffer interface:
 \begin{quote}
 \begin{C}
 @stick#
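
The unchanged C body of the inner function is elided between this hunk and the next; as a rough aid, here is a Lua transliteration sketch of the decision it implements (the real code uses a luaL_Buffer; `last` is the previous end-of-line candidate carried in the context, `out` collects output):

    -- CR CR and LF LF issue two end-of-line markers; CR LF and LF CR one.
    local function process(c, last, marker, out)
      if c == "\r" or c == "\n" then
        if last then                      -- previous char was also a candidate
          if c == last then out[#out + 1] = marker end
          return nil                      -- the pair is consumed
        else
          out[#out + 1] = marker          -- candidate seen for the first time
          return c                        -- remember it as the new context
        end
      else
        out[#out + 1] = c
        return nil
      end
    end
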
@@ -252,12 +248,10 @@ static int process(int c, int last, const char *marker,
 \end{C}
 \end{quote}
 
-The inner function makes use of Lua's auxiliary library's
-buffer interface for efficiency. The
-outer function simply interfaces with Lua. It receives the
-context and the input chunk (as well as an optional
+The outer function simply interfaces with Lua. It receives the
+context and input chunk (as well as an optional
 custom end-of-line marker), and returns the transformed
-output chunk and the new context.
+output chunk and the new context:
 \begin{quote}
 \begin{C}
 @stick#
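
The outer C function is also elided. From the Lua side it behaves as the low-level eol filter, shipped as mime.eol in the LuaSocket distribution; a hypothetical direct call, assuming that API:

    local mime = require("mime")
    local ctx = 0                      -- initial context
    local part1, part2
    part1, ctx = mime.eol(ctx, "first line\r", "\r\n")
    part2, ctx = mime.eol(ctx, "\nsecond line\n", "\r\n")
    -- part1 .. part2 is the input with end-of-lines normalized to CRLF;
    -- a final mime.eol(ctx) would flush any pending candidate.
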
@@ -291,33 +285,29 @@ initial state. This allows the filter to be reused many
 times.
 
 When designing your own filters, the challenging part is to
-decide what will be the context. For line breaking, for
+decide what will be in the context. For line breaking, for
 instance, it could be the number of bytes left in the
 current line. For Base64 encoding, it could be a string
 with the bytes that remain after the division of the input
-into 3-byte atoms. The MIME module in the LuaSocket
+into 3-byte atoms. The MIME module in the \texttt{LuaSocket}
 distribution has many other examples.
 
 \section{Filter chains}
 
 Chains add a lot to the power of filters. For example,
-according to the standard for Quoted-Printable encoding, the
-text must be normalized into its canonic form prior to
-encoding, as far as end-of-line markers are concerned. To
-help specifying complex transformations like these, we define a
-chain factory that creates a composite filter from one or
-more filters. A chained filter passes data through all
-its components, and can be used wherever a primitive filter
-is accepted.
-
-The chaining factory is very simple. All it does is return a
-function that passes data through all filters and returns
-the result to the user. The auxiliary
-function~\texttt{chainpair} can only chain two filters
-together. In the auxiliary function, special care must be
-taken if the chunk is the last. This is because the final
-\texttt{nil} chunk notification has to be pushed through both
-filters in turn:
+according to the standard for Quoted-Printable encoding,
+text must be normalized to a canonic end-of-line marker
+prior to encoding. To help specifying complex
+transformations like this, we define a chain factory that
+creates a composite filter from one or more filters. A
+chained filter passes data through all its components, and
+can be used wherever a primitive filter is accepted.
+
+The chaining factory is very simple. The auxiliary
+function~\texttt{chainpair} chains two filters together,
+taking special care if the chunk is the last. This is
+because the final \texttt{nil} chunk notification has to be
+pushed through both filters in turn:
 \begin{quote}
 \begin{lua}
 @stick#
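
The chainpair code itself is unchanged and therefore elided before the next hunk; a sketch consistent with the description above, under the simplified (non-exploding) interface:

    -- Chain two filters: data goes through f1 and then f2; the final nil
    -- chunk has to be pushed through both filters in turn.
    local function chainpair(f1, f2)
      return function(chunk)
        local ret = f2(f1(chunk))
        if chunk then return ret
        else return ret .. f2(nil) end   -- flush f2 after f1 was flushed
      end
    end
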
@@ -333,7 +323,7 @@ end
 @stick#
 function filter.chain(...)
     local f = arg[1]
-    for i = 2, table.getn(arg) do
+    for i = 2, @#arg do
         f = chainpair(f, arg[i])
     end
     return f
@@ -343,7 +333,7 @@ end
 \end{quote}
 
 Thanks to the chain factory, we can
-trivially define the Quoted-Printable conversion:
+define the Quoted-Printable conversion as such:
 \begin{quote}
 \begin{lua}
 @stick#
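
The example code is unchanged and elided here; its shape is roughly the following, assuming the filter.chain factory above and LuaSocket's mime factories:

    -- Normalize end-of-line markers to CRLF, then apply the
    -- Quoted-Printable encoder, as the standard requires.
    local mime = require("mime")
    local qp = filter.chain(mime.normalize("\r\n"),
                            mime.encode("quoted-printable"))
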
@@ -361,7 +351,7 @@ pump.all(in, out)
 The filters we introduced so far act as the internal nodes
 in a network of transformations. Information flows from node
 to node (or rather from one filter to the next) and is
-transformed on its way out. Chaining filters together is our
+transformed along the way. Chaining filters together is our
 way to connect nodes in this network. As the starting point
 for the network, we need a source node that produces the
 data. In the end of the network, we need a sink node that
@@ -376,8 +366,8 @@ caller by returning \texttt{nil} followed by an error message.
 
 Below are two simple source factories. The \texttt{empty} source
 returns no data, possibly returning an associated error
-message. The \texttt{file} source is more usefule, and
-yields the contents of a file in a chunk by chunk fashion.
+message. The \texttt{file} source works harder, and
+yields the contents of a file in a chunk by chunk fashion:
 \begin{quote}
 \begin{lua}
 @stick#
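
The two factories are unchanged and elided from the hunk; sketches consistent with the surrounding prose (the 2048-byte block size is an assumption):

    -- A source that never produces data, optionally reporting an error.
    function source.empty(err)
      return function() return nil, err end
    end

    -- A source that yields a file's contents chunk by chunk and closes
    -- the handle once the data runs out.
    function source.file(handle, io_err)
      if handle then
        return function()
          local chunk = handle:read(2048)
          if not chunk then handle:close() end
          return chunk
        end
      else
        return source.empty(io_err or "unable to open file")
      end
    end
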
@@ -404,9 +394,13 @@ end
 
 \subsection{Filtered sources}
 
-It is often useful to chain a source with a filter. A
-filtered source passes its data through the
+A filtered source passes its data through the
 associated filter before returning it to the caller.
+Filtered sources are useful when working with
+functions that get their input data from a source (such as
+the pump in our first example). By chaining a source with one or
+more filters, the function can be transparently provided
+with filtered data, with no need to change its interface.
 Here is a factory that does the job:
 \begin{quote}
 \begin{lua}
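
That factory is likewise elided; a sketch under the simplified (non-exploding) filter interface:

    -- A filtered source: pull a chunk from the source, pass it through
    -- the filter, and flush the filter once the source is exhausted.
    function source.chain(src, f)
      local done = false
      return function()
        if done then return nil end
        local chunk, err = src()
        if not chunk then
          done = true
          return f(nil), err       -- the filter's final data, if any
        end
        return f(chunk)
      end
    end
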
@@ -425,23 +419,16 @@ end
 \end{lua}
 \end{quote}
 
-Our motivating example in the introduction chains a source
-with a filter. Filtered sources are useful when working with
-functions that get their input data from a source (such as
-the pump in the example). By chaining a source with one or
-more filters, the function can be transparently provided
-with filtered data, with no need to change its interface.
-
 \subsection{Sinks}
 
-Just as we defined an interface for sources of
-data, we can also define an interface for a
-destination for data. We call any function respecting this
+Just as we defined an interface for a data source,
+we can also define an interface for a data destination.
+We call any function respecting this
 interface a \emph{sink}. In our first example, we used a
 file sink connected to the standard output.
 
 Sinks receive consecutive chunks of data, until the end of
-data is notified with a \texttt{nil} chunk. A sink can be
+data is signaled by a \texttt{nil} chunk. A sink can be
 notified of an error with an optional extra argument that
 contains the error message, following a \texttt{nil} chunk.
 If a sink detects an error itself, and
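
As an aside, a minimal sink respecting this interface can be sketched as follows (a table sink that simply collects chunks; the name mirrors the LuaSocket distribution, but treat the code as an illustration):

    -- Store received chunks into a table; return a truthy value to
    -- signal success, and quietly accept the final nil chunk.
    function sink.table(t)
      t = t or {}
      return function(chunk, err)
        if chunk then t[#t + 1] = chunk end
        return 1
      end, t
    end
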
@@ -529,18 +516,21 @@ common that it deserves its own function:
 function pump.step(src, snk)
     local chunk, src_err = src()
     local ret, snk_err = snk(chunk, src_err)
-    return chunk and ret and not src_err and not snk_err,
-        src_err or snk_err
+    if chunk and ret then return 1
+    else return nil, src_err or snk_err end
 end
 %
 
 @stick#
 function pump.all(src, snk, step)
     step = step or pump.step
     while true do
         local ret, err = step(src, snk)
-        if not ret then return not err, err end
-    end
+        if not ret then
+            if err then return nil, err
+            else return 1 end
+        end
+    end
 end
 %
 \end{lua}
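
As a usage note, a complete pipeline built from the pieces above might look like this; the file names are made up, and a file sink analogous to the file source is assumed:

    -- Read a file, normalize its end-of-line markers, write the result.
    local src = source.chain(source.file(io.open("input.txt", "rb")),
                             normalize("\n"))
    local snk = sink.file(io.open("output.txt", "wb"))
    pump.all(src, snk)
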
@@ -571,21 +561,23 @@ The way we split the filters here is not intuitive, on
 purpose. Alternatively, we could have chained the Base64
 encode filter and the line-wrap filter together, and then
 chain the resulting filter with either the file source or
-the file sink. It doesn't really matter.
+the file sink. It doesn't really matter. The Base64 and the
+line wrapping filters are part of the \texttt{LuaSocket}
+distribution.
 
 \section{Exploding filters}
 
 Our current filter interface has one flagrant shortcoming.
 When David Burgess was writing his \texttt{gzip} filter, he
 noticed that a decompression filter can explode a small
-input chunk into a huge amount of data. To address this, we
-decided to change our filter interface to allow exploding
-filters to return large quantities of output data in a chunk
-by chunk manner.
+input chunk into a huge amount of data. To address this
+problem, we decided to change the filter interface and allow
+exploding filters to return large quantities of output data
+in a chunk by chunk manner.
 
-More specifically, after passing each chunk of input data to
-a filter and collecting the first chunk of output data, the
-user must now loop to receive data from the filter until no
+More specifically, after passing each chunk of input to
+a filter, and collecting the first chunk of output, the
+user must now loop to receive other chunks from the filter until no
 filtered data is left. Within these secondary calls, the
 caller passes an empty string to the filter. The filter
 responds with an empty string when it is ready for the next
@@ -593,7 +585,7 @@ input chunk. In the end, after the user passes a
 \texttt{nil} chunk notifying the filter that there is no
 more input data, the filter might still have to produce too
 much output data to return in a single chunk. The user has
-to loop again, this time passing \texttt{nil} each time,
+to loop again, now passing \texttt{nil} to the filter each time,
 until the filter itself returns \texttt{nil} to notify the
 user it is finally done.
 
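
A sketch of the driving loop implied by the exploding-filter protocol described above (this logic lives inside the chaining functions rather than in user code):

    -- After a data chunk, keep passing "" until the filter answers "";
    -- after the final nil, keep passing nil until it answers nil.
    local function feed(f, chunk, snk)
      local out = f(chunk)
      if chunk then
        while out and out ~= "" do
          snk(out)
          out = f("")
        end
      else
        while out do
          snk(out)
          out = f(nil)
        end
      end
    end
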
@@ -602,9 +594,9 @@ the new interface. In fact, the end-of-line translation
 filter we presented earlier already conforms to it. The
 complexity is encapsulated within the chaining functions,
 which must now include a loop. Since these functions only
-have to be written once, the user is not affected.
+have to be written once, the user is rarely affected.
 Interestingly, the modifications do not have a measurable
-negative impact in the the performance of filters that do
+negative impact in the performance of filters that do
 not need the added flexibility. On the other hand, for a
 small price in complexity, the changes make exploding
 filters practical.
@@ -617,7 +609,7 @@ and SMTP modules are especially integrated with LTN12,
 and can be used to showcase the expressive power of filters,
 sources, sinks, and pumps. Below is an example
 of how a user would proceed to define and send a
-multipart message with attachments, using \texttt{LuaSocket}:
+multipart message, with attachments, using \texttt{LuaSocket}:
 \begin{quote}
 \begin{mime}
 local smtp = require"socket.smtp"
@@ -656,8 +648,8 @@ assert(smtp.send{
 The \texttt{smtp.message} function receives a table
 describing the message, and returns a source. The
 \texttt{smtp.send} function takes this source, chains it with the
-SMTP dot-stuffing filter, creates a connects a socket sink
-to the server, and simply pumps the data. The message is never
+SMTP dot-stuffing filter, connects a socket sink
+with the server, and simply pumps the data. The message is never
 assembled in memory. Everything is produced on demand,
 transformed in small pieces, and sent to the server in chunks,
 including the file attachment that is loaded from disk and
@@ -665,14 +657,14 @@ encoded on the fly. It just works.
 
 \section{Conclusions}
 
-In this article we introduce the concepts of filters,
+In this article, we introduced the concepts of filters,
 sources, sinks, and pumps to the Lua language. These are
-useful tools for data processing in general. Sources provide
+useful tools for stream processing in general. Sources provide
 a simple abstraction for data acquisition. Sinks provide an
 abstraction for final data destinations. Filters define an
 interface for data transformations. The chaining of
 filters, sources and sinks provides an elegant way to create
-arbitrarily complex data transformation from simpler
-transformations. Pumps simply move the data through.
+arbitrarily complex data transformations from simpler
+components. Pumps simply move the data through.
 
 \end{document}