author Roberto Ierusalimschy <roberto@inf.puc-rio.br> 2019-02-20 10:13:46 -0300
committer Roberto Ierusalimschy <roberto@inf.puc-rio.br> 2019-02-20 10:13:46 -0300
commit e08e5df853560de6482d84066a7accc6a18de545
tree ee19686bb35da90709a32ed24bf7855de1a3946a /re.html
First version of LPeg on GIT
LPeg repository is being moved to git. Past versions won't be moved; they are still available in RCS.
Diffstat (limited to 're.html')
-rw-r--r-- re.html 500
1 files changed, 500 insertions, 0 deletions
diff --git a/re.html b/re.html
new file mode 100644
index 0000000..c7d575b
--- /dev/null
+++ b/re.html
@@ -0,0 +1,500 @@
1<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
2 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
3<html>
4<head>
5 <title>LPeg.re - Regex syntax for LPEG</title>
6 <link rel="stylesheet"
7 href="http://www.inf.puc-rio.br/~roberto/lpeg/doc.css"
8 type="text/css"/>
9 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
10</head>
11<body>
12
13<!-- $Id: re.html,v 1.25 2018/06/04 16:21:19 roberto Exp $ -->
14
15<div id="container">
16
17<div id="product">
18 <div id="product_logo">
19 <a href="http://www.inf.puc-rio.br/~roberto/lpeg/">
20 <img alt="LPeg logo" src="lpeg-128.gif"/>
21 </a>
22 </div>
23 <div id="product_name"><big><strong>LPeg.re</strong></big></div>
24 <div id="product_description">
25 Regex syntax for LPEG
26 </div>
27</div> <!-- id="product" -->
28
29<div id="main">
30
31<div id="navigation">
32<h1>re</h1>
33
34<ul>
35 <li><a href="#basic">Basic Constructions</a></li>
36 <li><a href="#func">Functions</a></li>
37 <li><a href="#ex">Some Examples</a></li>
38 <li><a href="#license">License</a></li>
41</ul>
42</div> <!-- id="navigation" -->
43
44<div id="content">
45
46<h2><a name="basic"></a>The <code>re</code> Module</h2>
47
48<p>
49The <code>re</code> module
50(provided by file <code>re.lua</code> in the distribution)
51supports a somewhat conventional regex syntax
52for pattern usage within <a href="lpeg.html">LPeg</a>.
53</p>
54
55<p>
56The next table summarizes <code>re</code>'s syntax.
57A <code>p</code> represents an arbitrary pattern;
58<code>num</code> represents a number (<code>[0-9]+</code>);
59<code>name</code> represents an identifier
60(<code>[a-zA-Z][a-zA-Z0-9_]*</code>).
61Constructions are listed in order of decreasing precedence.
</p>
62<table border="1">
63<tbody><tr><td><b>Syntax</b></td><td><b>Description</b></td></tr>
64<tr><td><code>( p )</code></td> <td>grouping</td></tr>
65<tr><td><code>'string'</code></td> <td>literal string</td></tr>
66<tr><td><code>"string"</code></td> <td>literal string</td></tr>
67<tr><td><code>[class]</code></td> <td>character class</td></tr>
68<tr><td><code>.</code></td> <td>any character</td></tr>
69<tr><td><code>%name</code></td>
70 <td>pattern <code>defs[name]</code> or a pre-defined pattern</td></tr>
71<tr><td><code>name</code></td><td>non terminal</td></tr>
72<tr><td><code>&lt;name&gt;</code></td><td>non terminal</td></tr>
73<tr><td><code>{}</code></td> <td>position capture</td></tr>
74<tr><td><code>{ p }</code></td> <td>simple capture</td></tr>
75<tr><td><code>{: p :}</code></td> <td>anonymous group capture</td></tr>
76<tr><td><code>{:name: p :}</code></td> <td>named group capture</td></tr>
77<tr><td><code>{~ p ~}</code></td> <td>substitution capture</td></tr>
78<tr><td><code>{| p |}</code></td> <td>table capture</td></tr>
79<tr><td><code>=name</code></td> <td>back reference
80</td></tr>
81<tr><td><code>p ?</code></td> <td>optional match</td></tr>
82<tr><td><code>p *</code></td> <td>zero or more repetitions</td></tr>
83<tr><td><code>p +</code></td> <td>one or more repetitions</td></tr>
84<tr><td><code>p^num</code></td> <td>exactly <code>num</code> repetitions</td></tr>
85<tr><td><code>p^+num</code></td>
86 <td>at least <code>num</code> repetitions</td></tr>
87<tr><td><code>p^-num</code></td>
88 <td>at most <code>num</code> repetitions</td></tr>
89<tr><td><code>p -&gt; 'string'</code></td> <td>string capture</td></tr>
90<tr><td><code>p -&gt; "string"</code></td> <td>string capture</td></tr>
91<tr><td><code>p -&gt; num</code></td> <td>numbered capture</td></tr>
92<tr><td><code>p -&gt; name</code></td> <td>function/query/string capture,
93equivalent to <code>p / defs[name]</code></td></tr>
94<tr><td><code>p =&gt; name</code></td> <td>match-time capture,
95equivalent to <code>lpeg.Cmt(p, defs[name])</code></td></tr>
96<tr><td><code>p ~&gt; name</code></td> <td>fold capture,
97equivalent to <code>lpeg.Cf(p, defs[name])</code></td></tr>
98<tr><td><code>&amp; p</code></td> <td>and predicate</td></tr>
99<tr><td><code>! p</code></td> <td>not predicate</td></tr>
100<tr><td><code>p1 p2</code></td> <td>concatenation</td></tr>
101<tr><td><code>p1 / p2</code></td> <td>ordered choice</td></tr>
102<tr><td>(<code>name &lt;- p</code>)<sup>+</sup></td> <td>grammar</td></tr>
103</tbody></table>
104<p>
105Any space appearing in a syntax description can be
106replaced by zero or more space characters and Lua-style comments
107(<code>--</code> until end of line).
108</p>
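<p>
As a minimal sketch of a few entries from the table
(the pattern and the name <code>date</code> are illustrative assumptions,
not part of the distribution),
the following combines simple captures, exact repetitions,
a not predicate, and Lua-style comments inside the pattern:
</p>

```lua
local re = require "re"

-- Free spacing and Lua-style comments are allowed inside the pattern.
local date = re.compile[[
  { [0-9]^4 } '-'    -- year: exactly four digits
  { [0-9]^2 } '-'    -- month: exactly two digits
  { [0-9]^2 }        -- day: exactly two digits
  !.                 -- not predicate: nothing may follow
]]

print(date:match("2019-02-20"))    --> 2019 02 20
print(date:match("2019-02-20x"))   --> nil
```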
109
110<p>
111Character classes define sets of characters.
112An initial <code>^</code> complements the resulting set.
113A range <em>x</em><code>-</code><em>y</em> includes in the set
114all characters with codes between the codes of <em>x</em> and <em>y</em>.
115A pre-defined class <code>%</code><em>name</em> includes all
116characters of that class.
117A simple character includes itself in the set.
118The only special characters inside a class are <code>^</code>
119(special only if it is the first character);
120<code>]</code>
121(can be included in the set as the first character,
122after the optional <code>^</code>);
123<code>%</code> (special only if followed by a letter);
124and <code>-</code>
125(can be included in the set as the first or the last character).
126</p>
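<p>
The rules above can be sketched with three small classes.
The variable names are hypothetical and chosen only for this illustration:
</p>

```lua
local re = require "re"

-- Ranges plus a simple character ('_'): identifier-like words.
local ident = re.compile("[a-zA-Z_][a-zA-Z0-9_]*")
print(ident:match("foo_bar1 rest"))   --> 9

-- An initial '^' complements the set: everything up to the first comma.
local upto = re.compile("{ [^,]* }")
print(upto:match("alpha,beta"))       --> alpha

-- ']' as the first character and '-' as the last one are ordinary members.
local odd = re.compile("[]a-]+")
print(odd:match("a-]x"))              --> 4
```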
127
128<p>
129Currently the pre-defined classes are similar to those from
130Lua's string library
131(<code>%a</code> for letters,
132<code>%A</code> for non letters, etc.).
133There is also a class <code>%nl</code>
134containing only the newline character,
135which is particularly handy for grammars written inside long strings,
136as long strings do not interpret escape sequences like <code>\n</code>.
137</p>
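<p>
A short sketch of <code>%nl</code> inside a long string
(the names <code>lines</code> and <code>t</code> are assumptions made for this example):
the pattern collects every newline-terminated line of a text into a table.
</p>

```lua
local re = require "re"

-- Inside [[...]] the sequence "\n" would not be interpreted,
-- so %nl stands for the newline character.
local lines = re.compile[[ {| ({ [^%nl]* } %nl)* |} ]]

local t = lines:match("one\ntwo\n")
print(#t, t[1], t[2])                  --> 2 one two

-- %A is the complement of %a (non letters).
print(re.match("123abc", "{%A+}"))     --> 123
```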
138
139
140<h2><a name="func">Functions</a></h2>
141
142<h3><code>re.compile (string [, defs])</code></h3>
143<p>
144Compiles the given string and
145returns an equivalent LPeg pattern.
146The given string may define either an expression or a grammar.
147The optional <code>defs</code> table provides extra Lua values
148to be used by the pattern.
149</p>
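<p>
As an illustration of the <code>defs</code> table
(the field name <code>tonum</code> is hypothetical),
a function capture can refer to a Lua function by name:
</p>

```lua
local re = require "re"

-- "p -> name" behaves like p / defs[name]; here defs.tonum is tonumber.
local number = re.compile("{ %d+ } -> tonum", { tonum = tonumber })

print(number:match("42 apples"))   --> 42
print(number:match("apples"))      --> nil
```

<p>
The compiled value is an ordinary LPeg pattern,
so it can be reused for many subjects without compiling the string again.
</p>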
150
151<h3><code>re.find (subject, pattern [, init])</code></h3>
152<p>
153Searches the given pattern in the given subject.
154If it finds a match,
155returns the index where this occurrence starts and
156the index where it ends.
157Otherwise, returns nil.
158</p>
159
160<p>
161An optional numeric argument <code>init</code> makes the search
162start at that position in the subject string.
163As usual in Lua libraries,
164a negative value counts from the end.
165</p>
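<p>
A small sketch of <code>init</code>, using a made-up subject string:
the second call skips the first numeral by starting the search after it.
</p>

```lua
local re = require "re"

local s = "the number 423 is odd, 77 is too"

print(re.find(s, "[0-9]+"))        --> 12 14
print(re.find(s, "[0-9]+", 15))    --> 24 25
print(re.find(s, "[0-9]+", 26))    --> nil
```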
166
167<h3><code>re.gsub (subject, pattern, replacement)</code></h3>
168<p>
169Does a <em>global substitution</em>,
170replacing all occurrences of <code>pattern</code>
171in the given <code>subject</code> by <code>replacement</code>.
</p>
172
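<p>
A hedged sketch of <code>re.gsub</code> with made-up subjects.
The second call assumes that the replacement string is treated like the
string in <code>p -&gt; 'string'</code>, so that <code>%1</code>, <code>%2</code>, ...
refer to captures of the pattern;
this matches the behavior of <code>re.lua</code> as distributed,
but it is not stated in the text above.
</p>

```lua
local re = require "re"

-- Collapse every run of spaces into a single space.
print(re.gsub("a   b    c", "' '+", " "))    --> a b c

-- Assumption: %n in the replacement refers to the n-th capture.
print(re.gsub("2019-02-20", "{%d+} '-' {%d+} '-' {%d+}", "%3/%2/%1"))
--> 20/02/2019
```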
173<h3><code>re.match (subject, pattern)</code></h3>
174<p>
175Matches the given pattern against the given subject,
176returning all captures.
177</p>
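<p>
For illustration (the subjects are invented),
note that <code>re.match</code> is anchored at the start of the subject and,
as with any LPeg pattern, a pattern without captures returns the position
after the match:
</p>

```lua
local re = require "re"

print(re.match("key=value", "{%a+} '=' {%a+}"))   --> key value
print(re.match("hello world", "%a+"))             --> 6
print(re.match("123", "%a+"))                     --> nil
```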
178
179<h3><code>re.updatelocale ()</code></h3>
180<p>
181Updates the pre-defined character classes to the current locale.
182</p>
183
184
185<h2><a name="ex">Some Examples</a></h2>
186
187<h3>A complete simple program</h3>
188<p>
189The next code shows a simple complete Lua program using
190the <code>re</code> module:
191</p>
192<pre class="example">
193local re = require"re"
194
195-- find the position of the first numeral in a string
196print(re.find("the number 423 is odd", "[0-9]+")) --&gt; 12 14
197
198-- returns all words in a string
199print(re.match("the number 423 is odd", "({%a+} / .)*"))
200--&gt; the number is odd
201
202-- returns the first numeral in a string
203print(re.match("the number 423 is odd", "s &lt;- {%d+} / . s"))
204--&gt; 423
205
206print(re.gsub("hello World", "[aeiou]", "."))
207--&gt; h.ll. W.rld
208</pre>
209
210
211<h3>Balanced parentheses</h3>
212<p>
213The following call will produce the same pattern produced by the
214Lua expression in the
215<a href="lpeg.html#balanced">balanced parentheses</a> example:
216</p>
217<pre class="example">
218b = re.compile[[ balanced &lt;- "(" ([^()] / balanced)* ")" ]]
219</pre>
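<p>
A short usage sketch for this pattern, with made-up subjects;
the match is anchored at the first character of the subject:
</p>

```lua
local re = require "re"

local b = re.compile[[ balanced <- "(" ([^()] / balanced)* ")" ]]

print(b:match("(a(b)c) tail"))   --> 8
print(b:match("(a(b c"))         --> nil
```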
220
221<h3>String reversal</h3>
222<p>
223The next example reverses a string:
224</p>
225<pre class="example">
226rev = re.compile[[ R &lt;- (!.) -&gt; '' / ({.} R) -&gt; '%2%1']]
227print(rev:match"0123456789") --&gt; 9876543210
228</pre>
229
230<h3>CSV decoder</h3>
231<p>
232The next example replicates the <a href="lpeg.html#CSV">CSV decoder</a>:
233</p>
234<pre class="example">
235record = re.compile[[
236 record &lt;- {| field (',' field)* |} (%nl / !.)
237 field &lt;- escaped / nonescaped
238 nonescaped &lt;- { [^,"%nl]* }
239 escaped &lt;- '"' {~ ([^"] / '""' -&gt; '"')* ~} '"'
240]]
241</pre>
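<p>
As a usage sketch (the input line is invented for this example),
the decoder returns a table with one string per field;
doubled quotes inside an escaped field are reduced to a single quote:
</p>

```lua
local re = require "re"

local record = re.compile[[
  record <- {| field (',' field)* |} (%nl / !.)
  field <- escaped / nonescaped
  nonescaped <- { [^,"%nl]* }
  escaped <- '"' {~ ([^"] / '""' -> '"')* ~} '"'
]]

local t = record:match('a,"b ""quoted"", with comma",c\n')
for i, field in ipairs(t) do
  print(i, field)
end
--> 1 a
--> 2 b "quoted", with comma
--> 3 c
```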
242
243<h3>Lua's long strings</h3>
244<p>
245The next example matches Lua long strings:
246</p>
247<pre class="example">
248c = re.compile([[
249 longstring &lt;- ('[' {:eq: '='* :} '[' close)
250 close &lt;- ']' =eq ']' / . close
251]])
252
253print(c:match'[==[]]===]]]]==]===[]') --&gt; 17
254</pre>
255
256<h3>Abstract Syntax Trees</h3>
257<p>
258This example shows a simple way to build an
259abstract syntax tree (AST) for a given grammar.
260To keep our example simple,
261let us consider the following grammar
262for lists of names:
263</p>
264<pre class="example">
265p = re.compile[[
266 listname &lt;- (name s)*
267 name &lt;- [a-z][a-z]*
268 s &lt;- %s*
269]]
270</pre>
271<p>
272Now, we will add captures to build a corresponding AST.
273As a first step, the pattern will build a table to
274represent each non terminal;
275terminals will be represented by their corresponding strings:
276</p>
277<pre class="example">
278c = re.compile[[
279 listname &lt;- {| (name s)* |}
280 name &lt;- {| {[a-z][a-z]*} |}
281 s &lt;- %s*
282]]
283</pre>
284<p>
285Now, a match against <code>"hi hello bye"</code>
286results in the table
287<code>{{"hi"}, {"hello"}, {"bye"}}</code>.
288</p>
289<p>
290For such a simple grammar,
291this AST is more than enough;
292actually, the tables around each single name
293are already overkill.
294More complex grammars,
295however, may need some more structure.
296Specifically,
297it would be useful if each table had
298a <code>tag</code> field telling what non terminal
299that table represents.
300We can add such a tag using
301<a href="lpeg.html#cap-g">named group captures</a>:
302</p>
303<pre class="example">
304x = re.compile[[
305 listname &lt;- {| {:tag: '' -&gt; 'list':} (name s)* |}
306 name &lt;- {| {:tag: '' -&gt; 'id':} {[a-z][a-z]*} |}
307 s &lt;- ' '*
308]]
309</pre>
310<p>
311With these group captures,
312a match against <code>"hi hello bye"</code>
313results in the following table:
314</p>
315<pre class="example">
316{tag="list",
317 {tag="id", "hi"},
318 {tag="id", "hello"},
319 {tag="id", "bye"}
320}
321</pre>
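<p>
A brief sketch of how such a tagged tree might be traversed
(the variable names <code>ast</code> and <code>node</code> are assumptions of this example):
</p>

```lua
local re = require "re"

local x = re.compile[[
  listname <- {| {:tag: '' -> 'list' :} (name s)* |}
  name <- {| {:tag: '' -> 'id' :} {[a-z][a-z]*} |}
  s <- ' '*
]]

local ast = x:match("hi hello bye")
print(ast.tag, #ast)               --> list 3

-- Each child is itself a table carrying its own tag.
for i, node in ipairs(ast) do
  print(i, node.tag, node[1])
end
--> 1 id hi
--> 2 id hello
--> 3 id bye
```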
322
323
324<h3>Indented blocks</h3>
325<p>
326This example breaks indented blocks into tables,
327respecting the indentation:
328</p>
329<pre class="example">
330p = re.compile[[
331 block &lt;- {| {:ident:' '*:} line
332 ((=ident !' ' line) / &(=ident ' ') block)* |}
333 line &lt;- {[^%nl]*} %nl
334]]
335</pre>
336<p>
337As an example,
338consider the following text:
339</p>
340<pre class="example">
341t = p:match[[
342first line
343  subline 1
344  subline 2
345second line
346third line
347  subline 3.1
348    subline 3.1.1
349  subline 3.2
350]]
351</pre>
352<p>
353The resulting table <code>t</code> will be like this:
354</p>
355<pre class="example">
356  {'first line'; {'subline 1'; 'subline 2'; ident = '  '};
357   'second line';
358   'third line'; { 'subline 3.1'; {'subline 3.1.1'; ident = '    '};
359                   'subline 3.2'; ident = '  '};
360   ident = ''}
361</pre>
362
363<h3>Macro expander</h3>
364<p>
365This example implements a simple macro expander.
366Macros must be defined as part of the pattern,
367following some simple rules:
368</p>
369<pre class="example">
370p = re.compile[[
371 text &lt;- {~ item* ~}
372 item &lt;- macro / [^()] / '(' item* ')'
373 arg &lt;- ' '* {~ (!',' item)* ~}
374 args &lt;- '(' arg (',' arg)* ')'
375 -- now we define some macros
376 macro &lt;- ('apply' args) -&gt; '%1(%2)'
377 / ('add' args) -&gt; '%1 + %2'
378 / ('mul' args) -&gt; '%1 * %2'
379]]
380
381print(p:match"add(mul(a,b), apply(f,x))") --&gt; a * b + f(x)
382</pre>
383<p>
384A <code>text</code> is a sequence of items,
385wherein we apply a substitution capture to expand any macros.
386An <code>item</code> is either a macro,
387any character different from parentheses,
388or a parenthesized expression.
389A macro argument (<code>arg</code>) is a sequence
390of items different from a comma.
391(Note that a comma may appear inside an item,
392e.g., inside a parenthesized expression.)
393Again we do a substitution capture to expand any macro
394in the argument before expanding the outer macro.
395<code>args</code> is a list of arguments separated by commas.
396Finally we define the macros.
397Each macro is a string substitution;
398it replaces the macro name and its arguments by its corresponding string,
399with each <code>%</code><em>n</em> replaced by the <em>n</em>-th argument.
400</p>
401
402<h3>Patterns</h3>
403<p>
404This example shows the complete syntax
405of patterns accepted by <code>re</code>.
406</p>
407<pre class="example">
408p = [=[
409
410pattern &lt;- exp !.
411exp &lt;- S (grammar / alternative)
412
413alternative &lt;- seq ('/' S seq)*
414seq &lt;- prefix*
415prefix &lt;- '&amp;' S prefix / '!' S prefix / suffix
416suffix &lt;- primary S (([+*?]
417 / '^' [+-]? num
418 / '-&gt;' S (string / '{}' / name)
419 / '=&gt;' S name) S)*
420
421primary &lt;- '(' exp ')' / string / class / defined
422 / '{:' (name ':')? exp ':}'
423 / '=' name
424 / '{}'
425 / '{~' exp '~}'
426 / '{' exp '}'
427 / '.'
428 / name S !arrow
429 / '&lt;' name '&gt;' -- old-style non terminals
430
431grammar &lt;- definition+
432definition &lt;- name S arrow exp
433
434class &lt;- '[' '^'? item (!']' item)* ']'
435item &lt;- defined / range / .
436range &lt;- . '-' [^]]
437
438S &lt;- (%s / '--' [^%nl]*)* -- spaces and comments
439name &lt;- [A-Za-z][A-Za-z0-9_]*
440arrow &lt;- '&lt;-'
441num &lt;- [0-9]+
442string &lt;- '"' [^"]* '"' / "'" [^']* "'"
443defined &lt;- '%' name
444
445]=]
446
447print(re.match(p, p)) -- a self description must match itself
448</pre>
449
450
451
452<h2><a name="license">License</a></h2>
453
454<p>
455Copyright &copy; 2008-2015 Lua.org, PUC-Rio.
456</p>
457<p>
458Permission is hereby granted, free of charge,
459to any person obtaining a copy of this software and
460associated documentation files (the "Software"),
461to deal in the Software without restriction,
462including without limitation the rights to use,
463copy, modify, merge, publish, distribute, sublicense,
464and/or sell copies of the Software,
465and to permit persons to whom the Software is
466furnished to do so,
467subject to the following conditions:
468</p>
469
470<p>
471The above copyright notice and this permission notice
472shall be included in all copies or substantial portions of the Software.
473</p>
474
475<p>
476THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
477EXPRESS OR IMPLIED,
478INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
479FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
480IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
481DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
482TORT OR OTHERWISE, ARISING FROM,
483OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
484THE SOFTWARE.
485</p>
486
487</div> <!-- id="content" -->
488
489</div> <!-- id="main" -->
490
491<div id="about">
492<p><small>
493$Id: re.html,v 1.25 2018/06/04 16:21:19 roberto Exp $
494</small></p>
495</div> <!-- id="about" -->
496
497</div> <!-- id="container" -->
498
499</body>
500</html>