Diffstat (limited to 'lpeg.html')
-rw-r--r-- lpeg.html 92
1 files changed, 3 insertions, 89 deletions
diff --git a/lpeg.html b/lpeg.html
index f4d8658..f50d327 100644
--- a/lpeg.html
+++ b/lpeg.html
@@ -10,7 +10,6 @@
 </head>
 <body>
 
-<!-- $Id: lpeg.html $ -->
 
 <div id="container">
 
@@ -664,10 +663,10 @@ LPeg does not specify when (and if) it evaluates its captures.
 consider the pattern <code>lpeg.P"a" / func / 0</code>.
 Because the "division" by 0 instructs LPeg to throw away the
 results from the pattern,
-LPeg may or may not call <code>func</code>.)
+it is not specified whether LPeg will call <code>func</code>.)
 Therefore, captures should avoid side effects.
 Moreover,
-most captures cannot affect the way a pattern matches a subject.
+captures cannot affect the way a pattern matches a subject.
 The only exception to this rule is the
 so-called <a href="#matchtime"><em>match-time capture</em></a>.
 When a match-time capture matches,
@@ -1175,91 +1174,6 @@ local record = lpeg.Ct(field * (',' * field)^0) * (lpeg.P'\n' + -1)
 </pre>
 
 
-<h3>UTF-8 and Latin 1</h3>
-<p>
-It is not difficult to use LPeg to convert a string from
-UTF-8 encoding to Latin 1 (ISO 8859-1):
-</p>
-
-<pre class="example">
--- convert a two-byte UTF-8 sequence to a Latin 1 character
-local function f2 (s)
-  local c1, c2 = string.byte(s, 1, 2)
-  return string.char(c1 * 64 + c2 - 12416)
-end
-
-local utf8 = lpeg.R("\0\127")
-           + lpeg.R("\194\195") * lpeg.R("\128\191") / f2
-
-local decode_pattern = lpeg.Cs(utf8^0) * -1
-</pre>
-<p>
-In this code,
-the definition of UTF-8 is already restricted to the
-Latin 1 range (from 0 to 255).
-Any encoding outside this range (as well as any invalid encoding)
-will not match that pattern.
-</p>
-
-<p>
-As the definition of <code>decode_pattern</code> demands that
-the pattern matches the whole input (because of the -1 at its end),
-any invalid string will simply fail to match,
-without any useful information about the problem.
-We can improve this situation redefining <code>decode_pattern</code>
-as follows:
-</p>
-<pre class="example">
-local function er (_, i) error("invalid encoding at position " .. i) end
-
-local decode_pattern = lpeg.Cs(utf8^0) * (-1 + lpeg.P(er))
-</pre>
-<p>
-Now, if the pattern <code>utf8^0</code> stops
-before the end of the string,
-an appropriate error function is called.
-</p>
-
-
-<h3>UTF-8 and Unicode</h3>
-<p>
-We can extend the previous patterns to handle all Unicode code points.
-Of course,
-we cannot translate them to Latin 1 or any other one-byte encoding.
-Instead, our translation results in a array with the code points
-represented as numbers.
-The full code is here:
-</p>
-<pre class="example">
--- decode a two-byte UTF-8 sequence
-local function f2 (s)
-  local c1, c2 = string.byte(s, 1, 2)
-  return c1 * 64 + c2 - 12416
-end
-
--- decode a three-byte UTF-8 sequence
-local function f3 (s)
-  local c1, c2, c3 = string.byte(s, 1, 3)
-  return (c1 * 64 + c2) * 64 + c3 - 925824
-end
-
--- decode a four-byte UTF-8 sequence
-local function f4 (s)
-  local c1, c2, c3, c4 = string.byte(s, 1, 4)
-  return ((c1 * 64 + c2) * 64 + c3) * 64 + c4 - 63447168
-end
-
-local cont = lpeg.R("\128\191")   -- continuation byte
-
-local utf8 = lpeg.R("\0\127") / string.byte
-           + lpeg.R("\194\223") * cont / f2
-           + lpeg.R("\224\239") * cont * cont / f3
-           + lpeg.R("\240\244") * cont * cont * cont / f4
-
-local decode_pattern = lpeg.Ct(utf8^0) * -1
-</pre>
-
-
 <h3>Lua's long strings</h3>
 <p>
 A long string in Lua starts with the pattern <code>[=*[</code>
@@ -1416,7 +1330,7 @@ the following command is all you need to install LPeg:
 <h2><a name="license">License</a></h2>
 
 <p>
-Copyright &copy; 2007-2019 Lua.org, PUC-Rio.
+Copyright &copy; 2007-2023 Lua.org, PUC-Rio.
 </p>
 <p>
 Permission is hereby granted, free of charge,