From 7187be8b76452aa968726180af24deaaa545431d Mon Sep 17 00:00:00 2001
From: Thijs Schreijer
Date: Wed, 23 Mar 2022 16:01:50 +0100
Subject: cleanup; delete the ./gem folder

---
 gem/ltn012.tex | 695 ---------------------------------------------------------
 1 file changed, 695 deletions(-)
 delete mode 100644 gem/ltn012.tex

diff --git a/gem/ltn012.tex b/gem/ltn012.tex
deleted file mode 100644
index 8027ecc..0000000
--- a/gem/ltn012.tex
+++ /dev/null
@@ -1,695 +0,0 @@
-\documentclass[10pt]{article}
-\usepackage{fancyvrb}
-\usepackage{url}
-\DefineVerbatimEnvironment{lua}{Verbatim}{fontsize=\small,commandchars=\@\#\%}
-\DefineVerbatimEnvironment{C}{Verbatim}{fontsize=\small,commandchars=\@\#\%}
-\DefineVerbatimEnvironment{mime}{Verbatim}{fontsize=\small,commandchars=\$\#\%}
-\newcommand{\stick}[1]{\vbox{\setlength{\parskip}{0pt}#1}}
-\newcommand{\bl}{\ensuremath{\mathtt{\backslash}}}
-\newcommand{\CR}{\texttt{CR}}
-\newcommand{\LF}{\texttt{LF}}
-\newcommand{\CRLF}{\texttt{CR~LF}}
-\newcommand{\nil}{\texttt{nil}}
-
-\title{Filters, sources, sinks, and pumps\\
-    {\large or Functional programming for the rest of us}}
-\author{Diego Nehab}
-
-\begin{document}
-
-\maketitle
-
-\begin{abstract}
-Certain data processing operations can be implemented in the
-form of filters. A filter is a function that can process
-data received in consecutive invocations, returning partial
-results each time it is called. Examples of operations that
-can be implemented as filters include the end-of-line
-normalization for text, Base64 and Quoted-Printable transfer
-content encodings, the breaking of text into lines, SMTP
-dot-stuffing, and many others. Filters become
-even more powerful when we allow them to be chained together
-to create composite filters. In this context, filters can be
-seen as the internal links in a chain of data transformations.
-Sources and sinks are the corresponding end points in these
-chains. A source is a function that produces data, chunk by
-chunk, and a sink is a function that takes data, chunk by
-chunk. Finally, pumps are procedures that actively drive
-data from a source to a sink, and indirectly through all
-intervening filters. In this article, we describe the design of an
-elegant interface for filters, sources, sinks, chains, and
-pumps, and we illustrate each step with concrete examples.
-\end{abstract}
-
-\section{Introduction}
-
-Within the realm of networking applications, we are often
-required to apply transformations to streams of data. Examples
-include the end-of-line normalization for text, Base64 and
-Quoted-Printable transfer content encodings, breaking text
-into lines with a maximum number of columns, SMTP
-dot-stuffing, \texttt{gzip} compression, HTTP chunked
-transfer coding, and the list goes on.
-
-Many complex tasks require a combination of two or more such
-transformations, and therefore a general mechanism for
-promoting reuse is desirable. In the process of designing
-\texttt{LuaSocket~2.0}, we repeatedly faced this problem.
-The solution we reached proved to be very general and
-convenient. It is based on the concepts of filters, sources,
-sinks, and pumps, which we introduce below.
-
-\emph{Filters} are functions that can be repeatedly invoked
-with chunks of input, successively returning processed
-chunks of output. Naturally, the result of
-concatenating all the output chunks must be the same as the
-result of applying the filter to the concatenation of all
-input chunks. In fancier language, filters \emph{commute}
-with the concatenation operator. More importantly, filters
-must handle input data correctly no matter how the stream
-has been split into chunks.
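-For instance, a stateless filter commutes trivially. The
-tiny property check below is illustrative only (the
-\texttt{upper} filter is hypothetical, not part of
-\texttt{LuaSocket}):
-\begin{quote}
-\begin{lua}
-@stick#
--- a stateless filter: on the final nil chunk it has
--- nothing buffered, so it flushes an empty string
-local function upper(chunk)
-    if chunk ~= nil then return string.upper(chunk) end
-    return ""
-end
-
--- both ways of splitting the stream must agree
-assert(upper("hel") .. upper("lo") .. upper(nil)
-    == upper("hello") .. upper(nil))
-%
-\end{lua}
-\end{quote}
-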
-A \emph{chain} is a function that transparently combines the
-effect of one or more filters. The interface of a chain is
-indistinguishable from the interface of its component
-filters. This allows a chained filter to be used wherever
-an atomic filter is accepted. In particular, chains can
-themselves be chained to create arbitrarily complex operations.
-
-Filters can be seen as internal nodes in a network through
-which data will flow, potentially being transformed many
-times along the way. Chains connect these nodes together.
-The initial and final nodes of the network are
-\emph{sources} and \emph{sinks}, respectively. Less
-abstractly, a source is a function that produces new chunks
-of data every time it is invoked. Conversely, sinks are
-functions that give a final destination to the chunks of
-data they receive in successive calls. Naturally, sources
-and sinks can also be chained with filters to produce
-filtered sources and sinks.
-
-Finally, filters, chains, sources, and sinks are all passive
-entities: they must be repeatedly invoked in order for
-anything to happen. \emph{Pumps} provide the driving force
-that pushes data through the network, from a source to a
-sink, and indirectly through all intervening filters.
-
-In the following sections, we start with a simplified
-interface, which we later refine. The evolution we present
-is not contrived: it recreates the steps we ourselves
-followed as we consolidated our understanding of these
-concepts within our application domain.
-
-\subsection{A simple example}
-
-The end-of-line normalization of text is a good
-example to motivate our initial filter interface.
-Assume we are given text in an unknown end-of-line
-convention (including possibly mixed conventions) out of the
-commonly found Unix (\LF), Mac OS (\CR), and
-DOS (\CRLF) conventions. We would like to be able to
-use the following code to normalize the end-of-line markers:
-\begin{quote}
-\begin{lua}
-@stick#
-local CRLF = "\013\010"
-local input = source.chain(source.file(io.stdin), normalize(CRLF))
-local output = sink.file(io.stdout)
-pump.all(input, output)
-%
-\end{lua}
-\end{quote}
-
-This program should read data from the standard input stream
-and normalize the end-of-line markers to the canonic
-\CRLF\ marker, as defined by the MIME standard.
-Finally, the normalized text should be sent to the standard output
-stream. We use a \emph{file source} that produces data from
-standard input, and chain it with a filter that normalizes
-the data. The pump then repeatedly obtains data from the
-source, and passes it to the \emph{file sink}, which sends
-it to the standard output.
-
-In the code above, the \texttt{normalize} \emph{factory} is a
-function that creates our normalization filter, which
-replaces any end-of-line marker with the canonic marker.
-The initial filter interface is
-trivial: a filter function receives a chunk of input data,
-and returns a chunk of processed data. When there is no
-more input data left, the caller notifies the filter by invoking
-it with a \nil\ chunk. The filter responds by returning
-the final chunk of processed data (which could of course be
-the empty string).
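-To make the calling convention concrete, here is a short
-sketch of a caller driving a filter by hand. The
-whitespace-deleting filter is hypothetical and exists only
-to illustrate the protocol:
-\begin{quote}
-\begin{lua}
-@stick#
--- a stateless filter that deletes space characters
-local function strip(chunk)
-    if chunk ~= nil then
-        return (string.gsub(chunk, " ", ""))
-    end
-    return ""
-end
-
-io.write(strip("hello "))   -- writes "hello"
-io.write(strip(" world"))   -- writes "world"
-io.write(strip(nil))        -- end of data: final (empty) chunk
-%
-\end{lua}
-\end{quote}
-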
-Although the interface is extremely simple, the
-implementation is not so obvious. A normalization filter
-respecting this interface needs to keep some kind of context
-between calls. This is because a chunk boundary may lie between
-the \CR\ and \LF\ characters marking the end of a single line. This
-need for contextual storage motivates the use of
-factories: each time the factory is invoked, it returns a
-filter with its own context so that we can have several
-independent filters being used at the same time. For
-efficiency reasons, we must avoid the obvious solution of
-concatenating all the input into the context before
-producing any output chunks.
-
-To that end, we break the implementation into two parts:
-a low-level filter, and a factory of high-level filters. The
-low-level filter is implemented in C and does not maintain
-any context between function calls. The high-level filter
-factory, implemented in Lua, creates and returns a
-high-level filter that maintains whatever context the low-level
-filter needs, but isolates the user from its internal
-details. That way, we take advantage of C's efficiency to
-perform the hard work, and take advantage of Lua's
-simplicity for the bookkeeping.
-
-\subsection{The Lua part of the filter}
-
-Below is the complete implementation of the factory of high-level
-end-of-line normalization filters:
-\begin{quote}
-\begin{lua}
-@stick#
-function filter.cycle(lowlevel, context, extra)
-    return function(chunk)
-        local ret
-        ret, context = lowlevel(context, chunk, extra)
-        return ret
-    end
-end
-%
-
-@stick#
-function normalize(marker)
-    return filter.cycle(eol, 0, marker)
-end
-%
-\end{lua}
-\end{quote}
-
-The \texttt{normalize} factory simply calls a more generic
-factory, the \texttt{cycle}~factory, passing the low-level
-filter~\texttt{eol}. The \texttt{cycle}~factory receives a
-low-level filter, an initial context, and an extra
-parameter, and returns a new high-level filter. Each time
-the high-level filter is passed a new chunk, it invokes the
-low-level filter with the previous context, the new chunk,
-and the extra argument. It is the low-level filter that
-does all the work, producing the chunk of processed data and
-a new context. The high-level filter then replaces its
-internal context, and returns the processed chunk of data to
-the user. Notice that we take advantage of Lua's lexical
-scoping to store the context in a closure between function
-calls.
-
-\subsection{The C part of the filter}
-
-As for the low-level filter, we must first accept
-that there is no perfect solution to the end-of-line marker
-normalization problem. The difficulty comes from an
-inherent ambiguity in the definition of empty lines within
-mixed input. However, the following solution works well for
-any consistent input, as well as for non-empty lines in
-mixed input. It also does a reasonable job with empty lines
-and serves as a good example of how to implement a low-level
-filter.
-
-The idea is to consider both \CR\ and~\LF\ as end-of-line
-\emph{candidates}. We issue a single break if any candidate
-is seen alone, or if it is followed by a different
-candidate. In other words, \CR~\CR~and \LF~\LF\ each issue
-two end-of-line markers, whereas \CR~\LF~and \LF~\CR\ issue
-only one marker each. It is easy to see that this method
-correctly handles the most common end-of-line conventions.
-
-With this in mind, we divide the low-level filter into two
-simple functions. The inner function~\texttt{pushchar} performs the
-normalization itself. It takes each input character in turn,
-deciding what to output and how to modify the context. The
-context tells if the last processed character was an
-end-of-line candidate, and if so, which candidate it was.
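-As an illustration, the same decision logic can be
-transcribed into plain Lua (a sketch only; the C version
-below is the one \texttt{LuaSocket} actually uses):
-\begin{quote}
-\begin{lua}
-@stick#
-local CR, LF = "\013", "\010"
-local function candidate(c) return c == CR or c == LF end
-
--- returns the text to emit and the new context; the context
--- is "" initially, or the previous candidate character
-local function pushchar(c, last, marker)
-    if candidate(c) then
-        if candidate(last) then
-            -- CR CR or LF LF issue a second break;
-            -- CR LF or LF CR issue nothing more
-            if c == last then return marker, "" end
-            return "", ""
-        end
-        return marker, c
-    end
-    return c, ""
-end
-%
-\end{lua}
-\end{quote}
-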
-For efficiency, we use the buffer interface of Lua's
-auxiliary library:
-\begin{quote}
-\begin{C}
-@stick#
-@#define candidate(c) (c == CR || c == LF)
-static int pushchar(int c, int last, const char *marker,
-        luaL_Buffer *buffer) {
-    if (candidate(c)) {
-        if (candidate(last)) {
-            if (c == last)
-                luaL_addstring(buffer, marker);
-            return 0;
-        } else {
-            luaL_addstring(buffer, marker);
-            return c;
-        }
-    } else {
-        luaL_addchar(buffer, c);
-        return 0;
-    }
-}
-%
-\end{C}
-\end{quote}
-
-The outer function~\texttt{eol} simply interfaces with Lua.
-It receives the context and input chunk (as well as an
-optional custom end-of-line marker), and returns the
-transformed output chunk and the new context.
-Notice that if the input chunk is \nil, the operation
-is considered to be finished. In that case, the loop will
-not execute a single time and the context is reset to the
-initial state. This allows the filter to be reused many
-times:
-\begin{quote}
-\begin{C}
-@stick#
-static int eol(lua_State *L) {
-    int context = luaL_checkint(L, 1);
-    size_t isize = 0;
-    const char *input = luaL_optlstring(L, 2, NULL, &isize);
-    const char *last = input + isize;
-    const char *marker = luaL_optstring(L, 3, CRLF);
-    luaL_Buffer buffer;
-    luaL_buffinit(L, &buffer);
-    if (!input) {
-        lua_pushnil(L);
-        lua_pushnumber(L, 0);
-        return 2;
-    }
-    while (input < last)
-        context = pushchar(*input++, context, marker, &buffer);
-    luaL_pushresult(&buffer);
-    lua_pushnumber(L, context);
-    return 2;
-}
-%
-\end{C}
-\end{quote}
-
-When designing filters, the challenging part is usually
-deciding what to store in the context. For line breaking, for
-instance, it could be the number of bytes that still fit in the
-current line. For Base64 encoding, it could be a string
-with the bytes that remain after the division of the input
-into 3-byte atoms. The MIME module in the \texttt{LuaSocket}
-distribution has many other examples.
-
-\section{Filter chains}
-
-Chains greatly increase the power of filters. For example,
-according to the standard for Quoted-Printable encoding,
-text should be normalized to a canonic end-of-line marker
-prior to encoding. After encoding, the resulting text must
-be broken into lines of no more than 76 characters, with the
-use of soft line breaks (a line terminated by the \texttt{=}
-sign). To help specify complex transformations like
-this, we define a chain factory that creates a composite
-filter from one or more filters. A chained filter passes
-data through all its components, and can be used wherever a
-primitive filter is accepted.
-
-The chaining factory is very simple. The auxiliary
-function~\texttt{chainpair} chains two filters together,
-taking special care if the chunk is the last. This is
-because the final \nil\ chunk notification has to be
-pushed through both filters in turn:
-\begin{quote}
-\begin{lua}
-@stick#
-local function chainpair(f1, f2)
-    return function(chunk)
-        local ret = f2(f1(chunk))
-        if chunk then return ret
-        else return ret .. f2() end
-    end
-end
-%
-
-@stick#
-function filter.chain(...)
-    local f = select(1, ...)
-    for i = 2, select('@#', ...) do
-        f = chainpair(f, select(i, ...))
-    end
-    return f
-end
-%
-\end{lua}
-\end{quote}
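-
-As a quick sanity check, we can reuse the hypothetical
-\texttt{strip} and \texttt{upper} filters from the earlier
-sketches. The composite behaves exactly like an atomic
-filter:
-\begin{quote}
-\begin{lua}
-@stick#
-local f = filter.chain(strip, upper)
--- one filtered result, regardless of chunking
-assert(f("a b") .. f(" c") .. f(nil) == "ABC")
-%
-\end{lua}
-\end{quote}
-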
-Thanks to the chain factory, we can
-define the Quoted-Printable conversion as follows:
-\begin{quote}
-\begin{lua}
-@stick#
-local qp = filter.chain(normalize(CRLF), encode("quoted-printable"),
-        wrap("quoted-printable"))
-local input = source.chain(source.file(io.stdin), qp)
-local output = sink.file(io.stdout)
-pump.all(input, output)
-%
-\end{lua}
-\end{quote}
-
-\section{Sources, sinks, and pumps}
-
-The filters we introduced so far act as the internal nodes
-in a network of transformations. Information flows from node
-to node (or rather from one filter to the next) and is
-transformed along the way. Chaining filters together is our
-way to connect nodes in this network. As the starting point
-for the network, we need a source node that produces the
-data. At the end of the network, we need a sink node that
-gives a final destination to the data.
-
-\subsection{Sources}
-
-A source returns the next chunk of data each time it is
-invoked. When there is no more data, it simply returns~\nil.
-In the event of an error, the source can inform the
-caller by returning \nil\ followed by the error message.
-
-Below are two simple source factories. The \texttt{empty} source
-returns no data, possibly returning an associated error
-message. The \texttt{file} source yields the contents of a file
-in a chunk by chunk fashion:
-\begin{quote}
-\begin{lua}
-@stick#
-function source.empty(err)
-    return function()
-        return nil, err
-    end
-end
-%
-
-@stick#
-function source.file(handle, io_err)
-    if handle then
-        return function()
-            local chunk = handle:read(2048)
-            if not chunk then handle:close() end
-            return chunk
-        end
-    else return source.empty(io_err or "unable to open file") end
-end
-%
-\end{lua}
-\end{quote}
-
-\subsection{Filtered sources}
-
-A filtered source passes its data through the
-associated filter before returning it to the caller.
-Filtered sources are useful when working with
-functions that get their input data from a source (such as
-the pumps in our examples). By chaining a source with one or
-more filters, such functions can be transparently provided
-with filtered data, with no need to change their interfaces.
-Here is a factory that does the job:
-\begin{quote}
-\begin{lua}
-@stick#
-function source.chain(src, f)
-    return function()
-        if not src then
-            return nil
-        end
-        local chunk, err = src()
-        if not chunk then
-            src = nil
-            return f(nil)
-        else
-            return f(chunk)
-        end
-    end
-end
-%
-\end{lua}
-\end{quote}
-
-\subsection{Sinks}
-
-Just as we defined an interface for a source of data, we can
-also define an interface for a data destination. We call
-any function respecting this interface a sink. In our first
-example, we used a file sink connected to the standard
-output.
-
-Sinks receive consecutive chunks of data, until the end of
-data is signaled by a \nil\ input chunk. A sink can be
-notified of an error with an optional extra argument that
-contains the error message, following a \nil\ chunk.
-If a sink detects an error itself, and
-wishes not to be called again, it can return \nil,
-followed by an error message. A return value that
-is not \nil\ means the sink will accept more data.
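-Any function with this behavior qualifies as a sink. For
-instance, the hypothetical sink below (not part of
-\texttt{LuaSocket}) merely counts the bytes it is offered:
-\begin{quote}
-\begin{lua}
-@stick#
-local count = 0
-local function bytecount(chunk, err)
-    -- a non-nil return value asks for more data
-    if chunk then count = count + string.len(chunk) end
-    return 1
-end
-%
-\end{lua}
-\end{quote}
-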
-Below are two useful sink factories.
-The \texttt{table} factory creates a sink that stores
-individual chunks into an array. The data can later be
-efficiently concatenated into a single string with Lua's
-\texttt{table.concat} library function. The \texttt{null} sink
-simply discards the chunks it receives:
-\begin{quote}
-\begin{lua}
-@stick#
-function sink.table(t)
-    t = t or {}
-    local f = function(chunk, err)
-        if chunk then table.insert(t, chunk) end
-        return 1
-    end
-    return f, t
-end
-%
-
-@stick#
-local function null()
-    return 1
-end
-
-function sink.null()
-    return null
-end
-%
-\end{lua}
-\end{quote}
-
-Naturally, filtered sinks are just as useful as filtered
-sources. A filtered sink passes each chunk it receives
-through the associated filter before handing it down to the
-original sink. In the following example, we use a source
-that reads from the standard input. The input chunks are
-sent to a table sink, which has been coupled with a
-normalization filter. The filtered chunks are then
-concatenated from the output array, and finally sent to
-standard output:
-\begin{quote}
-\begin{lua}
-@stick#
-local input = source.file(io.stdin)
-local output, t = sink.table()
-output = sink.chain(normalize(CRLF), output)
-pump.all(input, output)
-io.write(table.concat(t))
-%
-\end{lua}
-\end{quote}
-
-\subsection{Pumps}
-
-Although not by design, our interface for sources is
-compatible with Lua iterators. That is, a source can be
-neatly used in conjunction with \texttt{for} loops. Using
-our file source as an iterator, we can write the following
-code:
-\begin{quote}
-\begin{lua}
-@stick#
-for chunk in source.file(io.stdin) do
-    io.write(chunk)
-end
-%
-\end{lua}
-\end{quote}
-
-Loops like this will always be present because everything
-we designed so far is passive. Sources, sinks, filters: none
-of them can do anything on their own. The operation of
-pumping all data a source can provide into a sink is so
-common that it deserves its own function:
-\begin{quote}
-\begin{lua}
-@stick#
-function pump.step(src, snk)
-    local chunk, src_err = src()
-    local ret, snk_err = snk(chunk, src_err)
-    if chunk and ret then return 1
-    else return nil, src_err or snk_err end
-end
-%
-
-@stick#
-function pump.all(src, snk, step)
-    step = step or pump.step
-    while true do
-        local ret, err = step(src, snk)
-        if not ret then
-            if err then return nil, err
-            else return 1 end
-        end
-    end
-end
-%
-\end{lua}
-\end{quote}
-
-The \texttt{pump.step} function moves one chunk of data from
-the source to the sink. The \texttt{pump.all} function takes
-an optional \texttt{step} function and uses it to pump all the
-data from the source to the sink.
-Here is an example that uses the Base64 and the
-line wrapping filters from the \texttt{LuaSocket}
-distribution. The program reads a binary file from
-disk and stores it in another file, after encoding it to the
-Base64 transfer content encoding:
-\begin{quote}
-\begin{lua}
-@stick#
-local input = source.chain(
-    source.file(io.open("input.bin", "rb")),
-    encode("base64"))
-local output = sink.chain(
-    wrap(76),
-    sink.file(io.open("output.b64", "w")))
-pump.all(input, output)
-%
-\end{lua}
-\end{quote}
-
-The way we split the filters here is deliberately
-unintuitive. Alternatively, we could have chained the Base64
-encode filter and the line-wrap filter together, and then
-chained the resulting filter with either the file source or
-the file sink. It doesn't really matter.
-
-\section{Exploding filters}
-
-Our current filter interface has one serious shortcoming.
-Consider for example a \texttt{gzip} decompression filter.
-During decompression, a small input chunk can be exploded
-into a huge amount of data. To address this problem, we
-decided to change the filter interface and allow exploding
-filters to return large quantities of output data in a chunk
-by chunk manner.
-
-More specifically, after passing each chunk of input to
-a filter, and collecting the first chunk of output, the
-user must now loop to receive other chunks from the filter until no
-filtered data is left. Within these secondary calls, the
-caller passes an empty string to the filter. The filter
-responds with an empty string when it is ready for the next
-input chunk. In the end, after the user passes a
-\nil\ chunk notifying the filter that there is no
-more input data, the filter might still have to produce too
-much output data to return in a single chunk. The user has
-to loop again, now passing \nil\ to the filter each time,
-until the filter itself returns \nil\ to notify the
-user it is finally done.
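-In code, the new protocol looks something like the sketch
-below. The \texttt{push} helper is hypothetical and
-illustrative only; in LTN12 this loop lives inside the
-chaining functions:
-\begin{quote}
-\begin{lua}
-@stick#
--- feeds one chunk (possibly nil) to the filter f,
--- draining all pending output into the sink snk
-local function push(f, snk, chunk)
-    local out = f(chunk)
-    while out ~= nil and out ~= "" do
-        snk(out)
-        -- secondary calls: "" mid-stream, nil after the end
-        out = f(chunk and "" or nil)
-    end
-    return out   -- "" means ready for more; nil means done
-end
-%
-\end{lua}
-\end{quote}
-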
-Fortunately, it is very easy to modify a filter to respect
-the new interface. In fact, the end-of-line translation
-filter we presented earlier already conforms to it. The
-complexity is encapsulated within the chaining functions,
-which must now include a loop. Since these functions only
-have to be written once, the user is rarely affected.
-Interestingly, the modifications do not have a measurable
-negative impact on the performance of filters that do
-not need the added flexibility. On the other hand, for a
-small price in complexity, the changes make exploding
-filters practical.
-
-\section{A complex example}
-
-The LTN12 module in the \texttt{LuaSocket} distribution
-implements all the ideas we have described. The MIME
-and SMTP modules are tightly integrated with LTN12,
-and can be used to showcase the expressive power of filters,
-sources, sinks, and pumps. Below is an example
-of how a user would proceed to define and send a
-multipart message, with attachments, using \texttt{LuaSocket}:
-\begin{quote}
-\begin{mime}
-local smtp = require"socket.smtp"
-local mime = require"mime"
-local ltn12 = require"ltn12"
-
-local message = smtp.message{
-    headers = {
-        from = "Sicrano <sicrano@example.com>",
-        to = "Fulano <fulano@example.com>",
-        subject = "A message with an attachment"},
-    body = {
-        preamble = "Hope you can see the attachment" .. CRLF,
-        [1] = {
-            body = "Here is our logo" .. CRLF},
-        [2] = {
-            headers = {
-                ["content-type"] = 'image/png; name="luasocket.png"',
-                ["content-disposition"] =
-                    'attachment; filename="luasocket.png"',
-                ["content-description"] = 'LuaSocket logo',
-                ["content-transfer-encoding"] = "BASE64"},
-            body = ltn12.source.chain(
-                ltn12.source.file(io.open("luasocket.png", "rb")),
-                ltn12.filter.chain(
-                    mime.encode("base64"),
-                    mime.wrap()))}}}
-
-assert(smtp.send{
-    rcpt = "<fulano@example.com>",
-    from = "<sicrano@example.com>",
-    source = message})
-\end{mime}
-\end{quote}
-
-The \texttt{smtp.message} function receives a table
-describing the message, and returns a source. The
-\texttt{smtp.send} function takes this source, chains it with the
-SMTP dot-stuffing filter, connects a socket sink
-with the server, and simply pumps the data. The message is never
-assembled in memory. Everything is produced on demand,
-transformed in small pieces, and sent to the server in chunks,
-including the file attachment which is loaded from disk and
-encoded on the fly. It just works.
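-Because the message is itself just a source, it composes
-with any sink. For instance, to write the assembled message
-to a local file instead of sending it (a hypothetical
-variation on the example above):
-\begin{quote}
-\begin{mime}
-ltn12.pump.all(message, ltn12.sink.file(io.open("message.txt", "w")))
-\end{mime}
-\end{quote}
-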
-\section{Conclusions}
-
-In this article, we introduced the concepts of filters,
-sources, sinks, and pumps to the Lua language. These are
-useful tools for stream processing in general. Sources provide
-a simple abstraction for data acquisition. Sinks provide an
-abstraction for final data destinations. Filters define an
-interface for data transformations. The chaining of
-filters, sources, and sinks provides an elegant way to create
-arbitrarily complex data transformations from simpler
-components. Pumps simply push the data through.
-
-\section{Acknowledgements}
-
-The concepts described in this text are the result of long
-discussions with David Burgess. A version of this text has
-been released on-line as the Lua Technical Note 012, hence
-the name of the corresponding LuaSocket module, LTN12. Wim
-Couwenberg contributed to the implementation of the module,
-and Adrian Sietsma was the first to notice the
-correspondence between sources and Lua iterators.
-
-\end{document}