author Roberto Ierusalimschy <roberto@inf.puc-rio.br> 2019-03-15 13:14:17 -0300
committer Roberto Ierusalimschy <roberto@inf.puc-rio.br> 2019-03-15 13:14:17 -0300
commit 1e0c73d5b643707335b06abd2546a83d9439d14c
tree b80b7d5e2cfeeef888ddf98fcc6276832134c1bf /manual
parent 8fa4f1380b9a203bfdf002c2e9e9e13ebb8384c1
Changes in the validation of UTF-8
All UTF-8 encoding functionality (including the escape sequence '\u') accepts all values from the original UTF-8 specification (with sequences of up to six bytes). By default, the decoding functions in the UTF-8 library do not accept invalid Unicode code points, such as surrogates. A new parameter 'nonstrict' makes them accept all code points up to (2^31)-1, as in the original UTF-8 specification.
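As a rough illustration of the behavior described in this message, the sketch below assumes a Lua interpreter built from this revision or later. The fourth positional argument to utf8.len and utf8.codepoint is the new 'nonstrict' flag; the variable names 'big' and 'sur' are made up for the example.

```lua
-- Encoding accepts any value up to 0x7FFFFFFF, which needs six bytes.
local big = utf8.char(0x7FFFFFFF)
print(#big)                            -- length of the sequence in bytes

-- Decoding is strict by default: values above 0x10FFFF are rejected.
print(utf8.len(big))                   -- false value plus a byte position
print(utf8.len(big, 1, -1, true))      -- nonstrict: counted as one character

-- Surrogates follow the same rule.
local sur = utf8.char(0xD800)
print(pcall(utf8.codepoint, sur))      -- strict: the call raises an error
print(utf8.codepoint(sur, 1, 1, true) == 0xD800)  -- nonstrict: decoded
```

Passing the flag positionally means the explicit values for i and j have to be supplied as well, which is why the defaults (1 and -1 for utf8.len, 1 and 1 for utf8.codepoint) are spelled out above.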
Diffstat (limited to 'manual')
-rw-r--r-- manual/manual.of 43
1 files changed, 39 insertions, 4 deletions
diff --git a/manual/manual.of b/manual/manual.of
index 1e4ca857..8a8ebad5 100644
--- a/manual/manual.of
+++ b/manual/manual.of
@@ -1004,6 +1004,8 @@ the escape sequence @T{\u{@rep{XXX}}}
 (note the mandatory enclosing brackets),
 where @rep{XXX} is a sequence of one or more hexadecimal digits
 representing the character code point.
+This code point can be any value smaller than @M{2@sp{31}}.
+(Lua uses the original UTF-8 specification here.)
 
 Literal strings can also be defined using a long format
 enclosed by @def{long brackets}.
@@ -6899,6 +6901,7 @@ x = string.gsub("$name-$version.tar.gz", "%$(%w+)", t)
 }
 
 @LibEntry{string.len (s)|
+
 Receives a string and returns its length.
 The empty string @T{""} has length 0.
 Embedded zeros are counted,
@@ -6907,6 +6910,7 @@ so @T{"a\000bc\000"} has length 5.
 }
 
 @LibEntry{string.lower (s)|
+
 Receives a string and returns a copy of this string with all
 uppercase letters changed to lowercase.
 All other characters are left unchanged.
@@ -6915,6 +6919,7 @@ The definition of what an uppercase letter is depends on the current locale.
 }
 
 @LibEntry{string.match (s, pattern [, init])|
+
 Looks for the first @emph{match} of
 @id{pattern} @see{pm} in the string @id{s}.
 If it finds one, then @id{match} returns
@@ -6946,6 +6951,7 @@ The format string cannot have the variable-length options
 }
 
 @LibEntry{string.rep (s, n [, sep])|
+
 Returns a string that is the concatenation of @id{n} copies of
 the string @id{s} separated by the string @id{sep}.
 The default value for @id{sep} is the empty string
@@ -6958,11 +6964,13 @@ with a single call to this function.)
 }
 
 @LibEntry{string.reverse (s)|
+
 Returns a string that is the string @id{s} reversed.
 
 }
 
 @LibEntry{string.sub (s, i [, j])|
+
 Returns the substring of @id{s} that
 starts at @id{i} and continues until @id{j};
 @id{i} and @id{j} can be negative.
@@ -6998,6 +7006,7 @@ this function also returns the index of the first unread byte in @id{s}.
 }
 
 @LibEntry{string.upper (s)|
+
 Receives a string and returns a copy of this string with all
 lowercase letters changed to uppercase.
 All other characters are left unchanged.
@@ -7318,8 +7327,24 @@ or one plus the length of the subject string.
 As in the string library,
 negative indices count from the end of the string.
 
+Functions that create byte sequences
+accept all values up to @T{0x7FFFFFFF},
+as defined in the original UTF-8 specification;
+that implies byte sequences of up to six bytes.
+
+Functions that interpret byte sequences only accept
+valid sequences (well formed and not overlong).
+By default, they only accept byte sequences
+that result in valid Unicode code points,
+rejecting values larger than @T{10FFFF} and surrogates.
+A boolean argument @id{nonstrict}, when available,
+lifts these checks,
+so that all values up to @T{0x7FFFFFFF} are accepted.
+(Not well formed and overlong sequences are still rejected.)
+
 
 @LibEntry{utf8.char (@Cdots)|
+
 Receives zero or more integers,
 converts each one to its corresponding UTF-8 byte sequence
 and returns a string with the concatenation of all these sequences.
@@ -7327,14 +7352,15 @@ and returns a string with the concatenation of all these sequences.
 }
 
 @LibEntry{utf8.charpattern|
-The pattern (a string, not a function) @St{[\0-\x7F\xC2-\xF4][\x80-\xBF]*}
+
+The pattern (a string, not a function) @St{[\0-\x7F\xC2-\xFD][\x80-\xBF]*}
 @see{pm},
 which matches exactly one UTF-8 byte sequence,
 assuming that the subject is a valid UTF-8 string.
 
 }
 
-@LibEntry{utf8.codes (s)|
+@LibEntry{utf8.codes (s [, nonstrict])|
 
 Returns values so that the construction
 @verbatim{
@@ -7347,7 +7373,8 @@ It raises an error if it meets any invalid byte sequence.
 
 }
 
-@LibEntry{utf8.codepoint (s [, i [, j]])|
+@LibEntry{utf8.codepoint (s [, i [, j [, nonstrict]]])|
+
 Returns the codepoints (as integers) from all characters in @id{s}
 that start between byte position @id{i} and @id{j} (both included).
 The default for @id{i} is 1 and for @id{j} is @id{i}.
@@ -7355,7 +7382,8 @@ It raises an error if it meets any invalid byte sequence.
 
 }
 
-@LibEntry{utf8.len (s [, i [, j]])|
+@LibEntry{utf8.len (s [, i [, j [, nonstrict]]])|
+
 Returns the number of UTF-8 characters in string @id{s}
 that start between positions @id{i} and @id{j} (both inclusive).
 The default for @id{i} is @num{1} and for @id{j} is @num{-1}.
@@ -7365,6 +7393,7 @@ returns a false value plus the position of the first invalid byte.
 }
 
 @LibEntry{utf8.offset (s, n [, i])|
+
 Returns the position (in bytes) where the encoding of the
 @id{n}-th character of @id{s}
 (counting from position @id{i}) starts.
@@ -8755,6 +8784,12 @@ You can enclose the call in parentheses if you need to
 discard these extra results.
 }
 
+@item{
+By default, the decoding functions in the @Lid{utf8} library
+do not accept surrogates as valid code points.
+An extra parameter in these functions makes them more permissive.
+}
+
 }
 
 }