| Commit message (Collapse) | Author | Age | Files | Lines |
| |
| |
This commit adds a new wcwidth implementation at libbb/wcwidth_alt.c,
and uses it instead of the existing implementation when compiling for
Windows with CONFIG_LAST_SUPPORTED_WCHAR >= 0x30000, which is the case
with the Unicode config configs/mingw64u_defconfig.
The Windows-target condition keeps non-Windows builds unmodified, and
the last-supported-wchar threshold is a semi-hack which allows switching
between implementations without adding a new config option (the old
code supports codepoints up to 0x2ffff).
The new file wcwidth_alt.c was generated by a new script, scripts/mkwcwidth,
which prints a wcwidth implementation using the latest Unicode data from
a local clone of https://github.com/jquast/wcwidth . This repo is the
main Python wcwidth implementation, and is maintained and up to date.
Functional differences from the existing implementation:
- The new version uses Unicode 15.1.0 (latest; about 450 ranges of
  wide and zero-width codepoints), while the existing code is at
  roughly Unicode 5.0 (a nearly 20-year-old spec, about 150 ranges).
  The new spec includes, among others, various wide icons and emojis,
  which can now be edited correctly at the shell prompt, have correct
  alignment in 'ls', etc.
- The old implementation returns -1 (non-printable) for surrogates,
  while the new code returns 1, though this is inconsequential, and
  POSIX doesn't care. Also, libc implementations vary in this regard.
Technical differences:
- The old version compiles less code/data when the last supported
  wchar is smaller, while the new version doesn't. This doesn't
  matter because the new version is enabled only for the full range.
- The new version is smaller and relatively straightforward, and
  fully automated (generated), so updating to a newer spec is trivial.
  The old version mixes data, ad-hoc code (tailored to the data),
  and preprocessor checks, which makes its updates hard to automate.
  The old version has various forms of 32 and 16 bit data ranges, in
  several arrays, while the new version uses a single data array with
  a unified form of 32 bits per range, with two rules:
  - A data range can't span Unicode planes (enforced, but unlikely
    to be required, and if it is, code to split ranges would be simple).
  - A range can't hold more than 32768 codepoints, so bigger ranges
    are split automatically (currently there are 2 such ranges).
Performance-wise, the new version should be faster, even with three
times the data ranges. Both versions effectively do at most one binary
search in the data of one Unicode plane, but the new version finds both
zero-width and wide-width results in this one search, while the old
version only finds zero-width, and to detect wide-width it does an
additional linear series of manual range tests. Since most results
are width 1, this sequence is performed in most (non-ASCII) calls.
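The single-array layout and the one-search lookup described above can be
sketched roughly as follows. This is only an illustration: the bit packing,
the names (RANGE, sketch_ranges, sketch_plane_start, sketch_wcwidth) and the
tiny sample table are assumptions made for this sketch, not the actual
contents or layout of libbb/wcwidth_alt.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packing of one range into 32 bits (an assumption):
 *   bits 31..16  low 16 bits of the first codepoint (offset within its plane)
 *   bits 15..1   number of codepoints in the range minus 1 (0..32767)
 *   bit  0       0 = zero-width range, 1 = wide range
 * A range therefore cannot span planes or hold more than 32768 codepoints.
 */
#define RANGE(first, last, wide) \
	(((uint32_t)((first) & 0xffff) << 16) \
	 | ((uint32_t)((last) - (first)) << 1) \
	 | (uint32_t)(wide))

/* Tiny sample data, sorted by codepoint; a real table has hundreds of ranges */
static const uint32_t sketch_ranges[] = {
	/* plane 0 */
	RANGE(0x0300, 0x036f, 0),   /* combining diacritical marks: zero width */
	RANGE(0x1100, 0x115f, 1),   /* Hangul Jamo initial consonants: wide */
	RANGE(0x4e00, 0x9fff, 1),   /* CJK unified ideographs: wide */
	/* plane 1 */
	RANGE(0x1f600, 0x1f64f, 1), /* emoticons: wide */
};

/* Index of the first range of each plane, plus one end marker */
static const uint16_t sketch_plane_start[] = { 0, 3, 4 };

int sketch_wcwidth(uint32_t ucs)
{
	size_t nplanes = sizeof(sketch_plane_start) / sizeof(sketch_plane_start[0]) - 1;
	uint32_t plane = ucs >> 16;
	uint32_t low = ucs & 0xffff;
	size_t lo, hi;

	if (ucs == 0)
		return 0;
	if (ucs < 0x20 || (ucs >= 0x7f && ucs < 0xa0))
		return -1; /* control characters are non-printable */
	if (plane >= nplanes)
		return 1;  /* no data for this plane in the sample table */

	/* One binary search inside the plane finds zero-width and wide alike */
	lo = sketch_plane_start[plane];
	hi = sketch_plane_start[plane + 1];
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		uint32_t r = sketch_ranges[mid];
		uint32_t first = r >> 16;
		uint32_t last = first + ((r >> 1) & 0x7fff);

		if (low < first)
			hi = mid;
		else if (low > last)
			lo = mid + 1;
		else
			return (r & 1) ? 2 : 0;
	}
	return 1; /* not in any range: ordinary width 1 (surrogates included) */
}
```

With this sample table, sketch_wcwidth(0x4e2d) falls inside the CJK range and
is reported as wide, while a codepoint outside every range, such as 'a' or a
surrogate, falls through the search and gets width 1.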
In a cursory comparison of the new wcwidth with glibc and musl-libc
(both use O(1) lookup tables), with a few bodies of text, we're in the
same ballpark, typically at 60% of their speed or better.
Bloat-wise, the new version is about 180 bytes of code and 1800 bytes
of data. If it had a similar number of data ranges to the old code (150),
the new version would be about 200 bytes smaller, but because the
new version has 450 data ranges, it's about 1K bigger.
|
| |
| |
FEATURE_UTF8_MANIFEST enables Unicode args and filenames on Win 10+.
FEATURE_UTF8_INPUT allows the shell prompt to correctly digest
Unicode strings (as UTF8) which are typed or pasted.
This commit adds support for building with FEATURE_UNICODE_SUPPORT
(mostly by supporting 32 bit wchar_t which busybox expects):
- Unicode-aware line-edit: for the most part, cursor movement/del
  is (UTF8) codepoint-aware rather than assuming that one byte
  equals one char on screen.
- Codepoint-aware operations in some other utils, like rev or wc -c.
- When UNICODE_COMBINING_WCHARS and UNICODE_WIDE_WCHARS are enabled,
  some screen-width-aware operations, like with fold, ls, expand, etc.
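As a rough illustration of what "codepoint-aware" means in the list above,
the following sketch counts UTF8 codepoints instead of bytes. The function
name sketch_utf8_codepoints is made up for this example, and this is not the
busybox code; it also assumes the input is valid UTF8.

```c
#include <stddef.h>

/* Count codepoints in a NUL-terminated, valid UTF8 string by counting
 * every byte which is not a continuation byte (binary 10xxxxxx).
 */
size_t sketch_utf8_codepoints(const char *s)
{
	size_t n = 0;

	for (; *s; s++) {
		if (((unsigned char)*s & 0xc0) != 0x80)
			n++;
	}
	return n;
}
```

For "caf\xc3\xa9" (5 bytes) this counts 4 codepoints, which is the number a
cursor-movement routine needs, rather than the byte length.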
The busybox Unicode support is incomplete, and even less so with the
builtin libc replacement functions, like wcwidth, which are active
when UNICODE_USING_LOCALE is unset (mingw lacks those functions).
FEATURE_CHECK_UNICODE_IN_ENV should be set so that Unicode is not
hardcoded but rather depends on the ANSI codepage and some env vars:
LC_ALL=C disables Unicode support, else it's enabled if ACP is UTF8.
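The rule stated above can be sketched as a small pure function. The name
sketch_unicode_enabled and its parameters are hypothetical: the real check
may consult more env vars than LC_ALL, and would obtain the ANSI codepage
from the system rather than take it as an argument.

```c
#include <string.h>

#define SKETCH_CP_UTF8 65001u /* Windows codepage identifier for UTF8 */

/* lc_all: value of the LC_ALL env var, or NULL if unset (assumption:
 *         only LC_ALL is considered in this sketch)
 * acp:    the active ANSI codepage
 */
int sketch_unicode_enabled(const char *lc_all, unsigned acp)
{
	if (lc_all && strcmp(lc_all, "C") == 0)
		return 0;                 /* LC_ALL=C disables Unicode support */
	return acp == SKETCH_CP_UTF8; /* else enabled only if ACP is UTF8 */
}
```

A caller on Windows could pass getenv("LC_ALL") and the result of GetACP().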
There's at least one known issue where the tab-completion-prefix-case
is not updated correctly, e.g. ~/desk<tab> completes to ~/desktop/
instead of ~/Desktop/, because the code which handles it exists
only in the non-Unicode code paths, but that's not very critical.
That seems to be the only case where mingw-specific code is disabled
when Unicode is enabled, but there could be other unknown issues.
None of the Unicode options is enabled by default, and the next
commit will make it easier to create a build which supports Unicode.
|
| |
|
|
| |
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
|
| |
|
|
| |
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
|
| |
|
|
|
|
|
| |
This change retains "or later" state! No licensing _changes_ here,
only form is adjusted (article, space between "GPL" and "v2" and so on).
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
|
| |
| |
function                                             old     new   delta
unicode_strwidth                                       -      20     +20
read_line_input                                     4945    4953      +8
unicode_strlen                                        31       -     -31
------------------------------------------------------------------------------
(add/remove: 1/1 grow/shrink: 1/0 up/down: 28/-31) Total: -3 bytes
Signed-off-by: Tomas Heinrich <heinrich.tomas@gmail.com>
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
|
| |
|
|
| |
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
|
| |
|
|
|
| |
Signed-off-by: Tomas Heinrich <heinrich.tomas@gmail.com>
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
|
| |
|
|
| |
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
|
| |
| |
Off:
function                                             old     new   delta
unicode_bidi_isrtl                                     -      55     +55
isrtl_str                                             51      65     +14
unicode_isrtl                                         55       -     -55
read_line_input                                     5003    4937     -66
------------------------------------------------------------------------------
(add/remove: 1/4 grow/shrink: 1/1 up/down: 69/-121) Total: -52 bytes
On:
function                                             old     new   delta
static.neutral_b                                       -     320    +320
static.neutral_p                                       -     142    +142
unicode_bidi_isrtl                                     -      55     +55
unicode_bidi_is_neutral_wchar                          -      55     +55
isrtl_str                                             51      59      +8
unicode_isrtl                                         55       -     -55
read_line_input                                     5003    4937     -66
------------------------------------------------------------------------------
(add/remove: 4/4 grow/shrink: 1/1 up/down: 580/-121) Total: 459 bytes
Signed-off-by: Tomas Heinrich <heinrich.tomas@gmail.com>
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
|
| |
| |
function                                             old     new   delta
read_line_input                                     4886    5003    +117
in_uint16_table                                        -      97     +97
in_interval_table                                      -      78     +78
static.rtl_b                                           -      68     +68
unicode_isrtl                                          -      55     +55
isrtl_str                                              -      51     +51
static.rtl_p                                           -      42     +42
unicode_conv_to_printable2                           633     477    -156
------------------------------------------------------------------------------
(add/remove: 6/0 grow/shrink: 1/1 up/down: 508/-156) Total: 352 bytes
Signed-off-by: Tomas Heinrich <heinrich.tomas@gmail.com>
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
|
| |
|
|
| |
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
|
| |
|
|
| |
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
|
| |
|
|
| |
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
|
| |
|
|
| |
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
|
| |
|
| |
Also opens up the possibility of making other Unicode code smaller
and more correct later, but at a size cost for now:
function                                             old     new   delta
static.combining                                       -     516    +516
bb_wcwidth                                             -     328    +328
unicode_cut_nchars                                     -     141    +141
mbstowc_internal                                       -      93     +93
in_table                                               -      78     +78
cal_main                                             899     961     +62
static.combining0x10000                                -      40     +40
unicode_strlen                                         -      31     +31
bb_mbstrlen                                           31       -     -31
bb_mbstowcs                                          173     102     -71
------------------------------------------------------------------------------
(add/remove: 7/1 grow/shrink: 1/1 up/down: 1289/-102) Total: 1187 bytes
Uses code by Markus Kuhn, which is in the public domain:
http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
"Permission to use, copy, modify, and distribute this software
for any purpose and without fee is hereby granted. The author
disclaims all warranties with regard to this software."
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
|
| |
|
|
|
|
| |
+ smaller enhancements: inode is long long; -h is a bit narrower; etc
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
|
| |
| |
<heinrich.tomas@gmail.com>
General Unicode support is tweaked to expose unicode_status.
function                                             old     new   delta
init_unicode                                           -      77     +77
write2stdout                                           -      19     +19
adjust_column                                         68      71      +3
unicode_status                                         -       1      +1
unicode_is_enabled                                     1       -      -1
grep_main                                            780     773      -7
fold_main                                            619     552     -67
check_unicode_in_env                                  77       -     -77
------------------------------------------------------------------------------
(add/remove: 3/2 grow/shrink: 1/2 up/down: 100/-152) Total: -52 bytes
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
|
| |
|
|
| |
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
|
|
|
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
|