Changes in the validation of UTF-8

All UTF-8 encoding functionality (including the escape sequence '\u') accepts all values from the original UTF-8 specification (with sequences of up to six bytes). By default, the decoding functions in the UTF-8 library do not accept invalid Unicode code points, such as surrogates. A new parameter 'nonstrict' makes them accept all code points up to (2^31)-1, as in the original UTF-8 specification.
author: Roberto Ierusalimschy <roberto@inf.puc-rio.br> 2019-03-15 13:14:17 -0300
committer: Roberto Ierusalimschy <roberto@inf.puc-rio.br> 2019-03-15 13:14:17 -0300
commit: 1e0c73d5b643707335b06abd2546a83d9439d14c (patch)
tree: b80b7d5e2cfeeef888ddf98fcc6276832134c1bf /manual
parent: 8fa4f1380b9a203bfdf002c2e9e9e13ebb8384c1 (diff)
1 files changed, 39 insertions, 4 deletions
diff --git a/manual/manual.of b/manual/manual.of
index 1e4ca857..8a8ebad5 100644
--- a/manual/manual.of
+++ b/manual/manual.of
@@ -1004,6 +1004,8 @@ the escape sequence @T{\u{@rep{XXX}}}
 (note the mandatory enclosing brackets),
 where @rep{XXX} is a sequence of one or more hexadecimal digits
 representing the character code point.
+This code point can be any value smaller than @M{2@sp{31}}.
+(Lua uses the original UTF-8 specification here.)
 
 Literal strings can also be defined using a long format
 enclosed by @def{long brackets}.
@@ -6899,6 +6901,7 @@ x = string.gsub("$name-$version.tar.gz", "%$(%w+)", t)
 }
 
 @LibEntry{string.len (s)|
+
 Receives a string and returns its length.
 The empty string @T{""} has length 0.
 Embedded zeros are counted,
@@ -6907,6 +6910,7 @@ so @T{"a\000bc\000"} has length 5.
 }
 
 @LibEntry{string.lower (s)|
+
 Receives a string and returns a copy of this string with all
 uppercase letters changed to lowercase.
 All other characters are left unchanged.
@@ -6915,6 +6919,7 @@ The definition of what an uppercase letter is depends on the current locale.
 }
 
 @LibEntry{string.match (s, pattern [, init])|
+
 Looks for the first @emph{match} of
 @id{pattern} @see{pm} in the string @id{s}.
 If it finds one, then @id{match} returns
@@ -6946,6 +6951,7 @@ The format string cannot have the variable-length options
 }
 
 @LibEntry{string.rep (s, n [, sep])|
+
 Returns a string that is the concatenation of @id{n} copies of
 the string @id{s} separated by the string @id{sep}.
 The default value for @id{sep} is the empty string
@@ -6958,11 +6964,13 @@ with a single call to this function.)
 }
 
 @LibEntry{string.reverse (s)|
+
 Returns a string that is the string @id{s} reversed.
 
 }
 
 @LibEntry{string.sub (s, i [, j])|
+
 Returns the substring of @id{s} that
 starts at @id{i}  and continues until @id{j};
 @id{i} and @id{j} can be negative.
@@ -6998,6 +7006,7 @@ this function also returns the index of the first unread byte in @id{s}.
 }
 
 @LibEntry{string.upper (s)|
+
 Receives a string and returns a copy of this string with all
 lowercase letters changed to uppercase.
 All other characters are left unchanged.
@@ -7318,8 +7327,24 @@ or one plus the length of the subject string.
 As in the string library,
 negative indices count from the end of the string.
 
+Functions that create byte sequences
+accept all values up to @T{0x7FFFFFFF},
+as defined in the original UTF-8 specification;
+that implies byte sequences of up to six bytes.
+
+Functions that interpret byte sequences only accept
+valid sequences (well formed and not overlong).
+By default, they only accept byte sequences
+that result in valid Unicode code points,
+rejecting values larger than @T{10FFFF} and surrogates.
+A boolean argument @id{nonstrict}, when available,
+lifts these checks,
+so that all values up to @T{0x7FFFFFFF} are accepted.
+(Not well formed and overlong sequences are still rejected.)
+
 
 @LibEntry{utf8.char (@Cdots)|
+
 Receives zero or more integers,
 converts each one to its corresponding UTF-8 byte sequence
 and returns a string with the concatenation of all these sequences.
@@ -7327,14 +7352,15 @@ and returns a string with the concatenation of all these sequences.
 }
 
 @LibEntry{utf8.charpattern|
-The pattern (a string, not a function) @St{[\0-\x7F\xC2-\xF4][\x80-\xBF]*}
+
+The pattern (a string, not a function) @St{[\0-\x7F\xC2-\xFD][\x80-\xBF]*}
 @see{pm},
 which matches exactly one UTF-8 byte sequence,
 assuming that the subject is a valid UTF-8 string.
 
 }
 
-@LibEntry{utf8.codes (s)|
+@LibEntry{utf8.codes (s [, nonstrict])|
 
 Returns values so that the construction
 @verbatim{
@@ -7347,7 +7373,8 @@ It raises an error if it meets any invalid byte sequence.
 
 }
 
-@LibEntry{utf8.codepoint (s [, i [, j]])|
+@LibEntry{utf8.codepoint (s [, i [, j [, nonstrict]]])|
+
 Returns the codepoints (as integers) from all characters in @id{s}
 that start between byte position @id{i} and @id{j} (both included).
 The default for @id{i} is 1 and for @id{j} is @id{i}.
@@ -7355,7 +7382,8 @@ It raises an error if it meets any invalid byte sequence.
 
 }
 
-@LibEntry{utf8.len (s [, i [, j]])|
+@LibEntry{utf8.len (s [, i [, j [, nonstrict]]])|
+
 Returns the number of UTF-8 characters in string @id{s}
 that start between positions @id{i} and @id{j} (both inclusive).
 The default for @id{i} is @num{1} and for @id{j} is @num{-1}.
@@ -7365,6 +7393,7 @@ returns a false value plus the position of the first invalid byte.
 }
 
 @LibEntry{utf8.offset (s, n [, i])|
+
 Returns the position (in bytes) where the encoding of the
 @id{n}-th character of @id{s}
 (counting from position @id{i}) starts.
@@ -8755,6 +8784,12 @@ You can enclose the call in parentheses if you need to
 discard these extra results.
 }
 
+@item{
+By default, the decoding functions in the @Lid{utf8} library
+do not accept surrogates as valid code points.
+An extra parameter in these functions makes them more permissive.
+}
+
 }
 
 }
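
The behaviour documented by the patch above can be exercised with a short script. This is only a sketch: it assumes a Lua build that already includes this change (the Lua 5.4 series), and it passes the new flag positionally, since the manual text in this commit calls it nonstrict and the name is not visible to callers. The variable names (big, surrogate, overlong) are illustrative and do not come from the patch.

```lua
-- Assumption: this runs on a Lua build that includes the patch (Lua 5.4 series).

-- Encoding side: utf8.char and the \u escape accept values up to 0x7FFFFFFF,
-- which needs a six-byte sequence under the original UTF-8 scheme.
local big = utf8.char(0x7FFFFFFF)
print(#big)
print(big == "\u{7FFFFFFF}")

-- Decoding side, default mode: a surrogate is not a valid code point,
-- so utf8.len returns a false value plus the position of the bad byte.
local surrogate = "\u{D800}"
print(utf8.len(surrogate))

-- Passing a true value as the extra (nonstrict) argument lifts that check.
print(utf8.len(surrogate, 1, -1, true))
print(utf8.codepoint(big, 1, 1, true) == 0x7FFFFFFF)

-- Overlong sequences are still rejected, even with the flag set.
local overlong = "\xC0\x80"   -- overlong encoding of U+0000
print(utf8.len(overlong, 1, -1, true))

-- utf8.codes takes the flag as its second argument.
for pos, code in utf8.codes(surrogate, true) do
  print(pos, string.format("U+%04X", code))
end
```

Because the flag is an ordinary positional argument, callers of utf8.len and utf8.codepoint have to spell out i and j before they can pass it, as the calls above do.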