path: root/libbb/unicode.c
Commit message | Author | Age | Files | Lines
* win32: unicode: new wcwidth: allow enabling bidi | Avi Halachmi (:avih) | 2024-04-02 | 1 | -2/+5
  interval, in_interval_table, and in_uint16_table were previously not
  compiled when using the new wcwidth (commit c188a345a) because they're
  used by the old wcwidth but not by the new one. But they're also used
  by the BIDI routines.

  mingw64u_defconfig doesn't enable bidi (rightly - it's not working
  well), but it'd still be nice to allow enabling bidi while the new
  wcwidth is in effect.

  Enable the table lookup code if BIDI is enabled.
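The condition this commit describes can be pictured with a small preprocessor sketch. The SKETCH_* macros and the reporting function are hypothetical stand-ins, not the identifiers used in libbb/unicode.c; the only point is that the lookup helpers get compiled when either of their two consumers (the old wcwidth or the bidi routines) is present.

```c
/* Hypothetical stand-ins for the real build options; each is 0 or 1. */
#ifndef SKETCH_NEW_WCWIDTH
# define SKETCH_NEW_WCWIDTH 1   /* the generated wcwidth is in use */
#endif
#ifndef SKETCH_BIDI
# define SKETCH_BIDI 1          /* the bidi routines are requested */
#endif

/* The table helpers serve the old wcwidth and the bidi code, so they
 * are compiled whenever at least one of those consumers is built. */
#if !SKETCH_NEW_WCWIDTH || SKETCH_BIDI
# define SKETCH_HAVE_TABLE_LOOKUP 1
#else
# define SKETCH_HAVE_TABLE_LOOKUP 0
#endif

/* Reports whether the lookup helpers were compiled into this build. */
int sketch_table_lookup_compiled(void)
{
    return SKETCH_HAVE_TABLE_LOOKUP;
}
```

Building with -DSKETCH_BIDI=0 while SKETCH_NEW_WCWIDTH stays 1 reproduces the earlier behaviour, where the helpers were left out.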
* Revert "unicode: identify emoji width and modifiers" | Avi Halachmi (:avih) | 2024-03-29 | 1 | -8/+0
  This reverts commit 878b3cd27fe83f2b0ff476b884c34d165be0072c.

  It's no longer required, since the last commit uses a new wcwidth
  implementation which covers the cases added by commit 878b3cd2.
* win32: unicode: use newer wcwidth by default | Avi Halachmi (:avih) | 2024-03-29 | 1 | -0/+4
  This commit adds a new wcwidth implementation at libbb/wcwidth_alt.c,
  and uses it instead of the existing implementation when compiling for
  Windows and CONFIG_LAST_SUPPORTED_WCHAR >= 0x30000 - which is the case
  with the unicode configs/mingw64u_defconfig.

  The windows-target condition keeps non-Windows builds unmodified, and
  the last supported wchar threshold is a semi-hack to allow switching
  between implementations without adding a new config option (the old
  code supports codepoints up to 0x2ffff).

  The new file wcwidth_alt.c was generated by a new scripts/mkwcwidth,
  which prints a wcwidth implementation using the latest Unicode data
  from a local clone of https://github.com/jquast/wcwidth . This repo is
  the main Python wcwidth implementation, and is maintained and up to
  date.

  Functional differences from the existing implementation:
  - Unicode 15.1.0 (latest) with the new version (about 450 ranges of
    wide and zero-width codepoints), compared to roughly Unicode 5.0 of
    the existing code (nearly 20 years old spec, about 150 ranges).
    The new spec includes, among others, various wide icons and emojis,
    which can now be edited correctly at the shell prompt, have correct
    alignment in 'ls', etc.
  - The old implementation returns -1 (non-printable) for surrogates,
    while the new code returns 1, though this is inconsequential, and
    POSIX doesn't care. Also libc implementations vary in this regard.

  Technical differences:
  - The old version compiles less code/data when the last supported
    wchar is smaller, while the new version doesn't. This doesn't matter
    because the new version is enabled only for the full range.
  - The new version is smaller and relatively straightforward, and
    fully automated (generated), so updating to a newer spec is trivial.
    The old version mixes data, ad-hoc code (tailored to the data), and
    preprocessor checks, and its updates are hard to automate.

  The old version has various forms of 32 and 16 bit data ranges, in
  several arrays, while the new version uses a single data array with
  a unified form of 32 bits per range, with two rules:
  - A data range can't span Unicode planes (enforced, but unlikely to be
    required, and if it is, code to split ranges would be simple).
  - A range can't hold more than 32768 codepoints, so bigger ranges are
    split automatically (currently there are 2 such ranges).

  Performance wise, the new version should be faster, even with three
  times the data ranges. Both versions do effectively at most one binary
  search in one Unicode plane data, but the new version finds both
  zero-width and wide-width results in this one search, while the old
  version only finds zero-width, and to detect wide-width it does an
  additional linear series of manual range tests, but since most results
  are width 1, this sequence is performed in most (non-ASCII) calls.

  In a cursory comparison of the new wcwidth with glibc and musl-libc
  (both use O(1) lookup tables), with a few bodies of text, we're in the
  same ballpark, with typical speed of 60% or better.

  Bloat-wise, the new version is about 180 bytes code and 1800 bytes
  data. If it had a similar number of data ranges as the old code (150),
  the new version would be about 200 bytes smaller, but because the new
  version has 450 data ranges, it's about 1K bigger.
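To make the "32 bits per range, one binary search per plane" idea concrete, here is a minimal sketch. The bit layout (16 bits of in-plane start, 15 bits of length minus one, 1 bit for wide versus zero-width), the RANGE macro, the table names and the tiny sample data are all assumptions made for illustration; the generated wcwidth_alt.c may pack and index its data differently. The layout was chosen only because it satisfies the two rules quoted above: a range stays inside one plane and holds at most 32768 codepoints.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packing of one range into 32 bits:
 *   bits 31..16  first codepoint, low 16 bits (offset within its plane)
 *   bits 15..1   range length minus 1 (0..32767, so at most 32768)
 *   bit  0       1 = wide (width 2), 0 = zero-width */
#define RANGE(first, last, wide) \
    ((((uint32_t)(first) & 0xFFFFu) << 16) | \
     ((uint32_t)((last) - (first)) << 1) | (uint32_t)(wide))

/* Tiny sample data, sorted by first codepoint; not the generated table. */
static const uint32_t plane0[] = {
    RANGE(0x0300, 0x036F, 0),   /* combining diacritical marks */
    RANGE(0x1100, 0x115F, 1),   /* Hangul Jamo initial consonants */
    RANGE(0x200B, 0x200B, 0),   /* zero width space */
    RANGE(0x3041, 0x3096, 1),   /* Hiragana letters */
    RANGE(0x4E00, 0x9FFF, 1),   /* CJK unified ideographs */
};
static const uint32_t plane1[] = {
    RANGE(0x1F600, 0x1F64F, 1), /* Emoticons block */
};

int sketch_wcwidth(uint32_t ucs)
{
    const uint32_t *tab;
    size_t lo = 0, hi;
    uint32_t low16 = ucs & 0xFFFFu;

    if (ucs == 0)
        return 0;
    if (ucs < 0x20 || (ucs >= 0x7F && ucs < 0xA0))
        return -1;              /* control characters */

    /* Pick the table of the codepoint's plane. */
    switch (ucs >> 16) {
    case 0: tab = plane0; hi = sizeof(plane0) / sizeof(plane0[0]); break;
    case 1: tab = plane1; hi = sizeof(plane1) / sizeof(plane1[0]); break;
    default: return 1;          /* the sketch has no data for other planes */
    }

    /* A single binary search finds both zero-width and wide ranges. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uint32_t first = tab[mid] >> 16;
        uint32_t last = first + ((tab[mid] >> 1) & 0x7FFFu);

        if (low16 < first)
            hi = mid;
        else if (low16 > last)
            lo = mid + 1;
        else
            return (tab[mid] & 1u) ? 2 : 0;
    }
    return 1;                   /* not in any range: ordinary width */
}
```

With this sample data, sketch_wcwidth(0x4E2D) and sketch_wcwidth(0x1F600) report width 2, sketch_wcwidth(0x0301) reports 0, and an ASCII letter falls through the search and reports 1.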
* unicode: identify emoji width and modifiers | Avi Halachmi (:avih) | 2023-07-23 | 1 | -0/+8
  This adds the Emoticons block U+1F600..U+1F64F as double-width
  codepoints, and the skin tone modifiers range U+1F3FB..U+1F3FF as
  combining codepoints.

  The Emoticons variant modifiers U+FE0E and U+FE0F were already in.

  It's unclear how to test UNICODE_COMBINING_WCHARS and
  UNICODE_WIDE_WCHARS in general and also here specifically, but at
  least the data on Emojis width and combinings now exists.
* win32: support build with FEATURE_UNICODE_SUPPORT | Avi Halachmi (:avih) | 2023-07-22 | 1 | -0/+6
  FEATURE_UTF8_MANIFEST enables Unicode args and filenames on Win 10+.

  FEATURE_UTF8_INPUT allows the shell prompt to correctly digest Unicode
  strings (as UTF8) which are typed or pasted.

  This commit adds support for building with FEATURE_UNICODE_SUPPORT
  (mostly by supporting 32 bit wchar_t which busybox expects):
  - Unicode-aware line-edit - for the most part cursor movement/del
    being (UTF8) codepoint-aware rather than assuming that one-byte
    equals one-char-on-screen.
  - Codepoint-aware operations in some other utils, like rev or wc -c.
  - When UNICODE_COMBINING_WCHARS and UNICODE_WIDE_WCHARS are enabled,
    some screen-width-aware operations, like with fold, ls, expand, etc.

  The busybox Unicode support is incomplete, and even less so with the
  builtin libc replacement functions, like wcwidth, which are active
  when UNICODE_USING_LOCALE is unset (mingw lacks those functions).

  FEATURE_CHECK_UNICODE_IN_ENV should be set so that Unicode is not
  hardcoded but rather depends on the ANSI codepage and some env vars:
  LC_ALL=C disables Unicode support, else it's enabled if ACP is UTF8.

  There's at least one known issue where the tab-completion-prefix-case
  is not updated correctly, e.g. ~/desk<tab> completes to ~/desktop/
  instead of ~/Desktop/, because the code which handles it exists only
  in the non-unicode code paths, but that's not very critical.

  That seems to be the only case where mingw-specific code is disabled
  when Unicode is enabled, but there could be other unknown issues.

  None of the Unicode options is enabled by default, and the next commit
  will make it easier to create a build which supports Unicode.
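The runtime rule stated above ("LC_ALL=C disables Unicode support, else it's enabled if ACP is UTF8") can be sketched as a pure function. The function and macro names are hypothetical, and the sketch models only LC_ALL, since the message does not name the other environment variables it consults. 65001 is the numeric value of the Windows UTF-8 codepage.

```c
#include <string.h>

#define SKETCH_CP_UTF8 65001u   /* numeric value of the Windows UTF-8 codepage */

/* lc_all: value of $LC_ALL, or NULL if unset.
 * acp:    the active ANSI codepage of the process. */
int sketch_unicode_wanted(const char *lc_all, unsigned acp)
{
    if (lc_all != NULL && strcmp(lc_all, "C") == 0)
        return 0;               /* LC_ALL=C turns Unicode off */
    return acp == SKETCH_CP_UTF8;
}
```

On Windows the arguments would presumably come from getenv("LC_ALL") and the Win32 GetACP() call; taking them as parameters keeps the sketch portable.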
* unicode: relax array alignment for tables | Denys Vlasenko | 2020-11-30 | 1 | -8/+8
       text    data     bss     dec     hex filename
    1022075     559    5052 1027686   fae66 busybox_old
    1021988     559    5052 1027599   fae0f busybox_unstripped

  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* unicode: fix handling of short 1-4 char tables | Denys Vlasenko | 2020-11-30 | 1 | -2/+4
  function                         old     new   delta
  in_uint16_table                   92     107     +15

  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* unicode: code shrink in character width determination | Denys Vlasenko | 2019-07-23 | 1 | -0/+6
  function                         old     new   delta
  bb_wcwidth                       267     238     -29

  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* unzip: use printable_string() for printing filenames | Denys Vlasenko | 2018-09-30 | 1 | -1/+1
  function                         old     new   delta
  unzip_main                      2726    2792     +66
  printable_string2                  -      57     +57
  identify                        4329    4336      +7
  expmeta                          659     663      +4
  add_interface                     99     103      +4
  beep_main                        286     289      +3
  changepath                       192     194      +2
  builtin_type                     115     117      +2
  devmem_main                      469     470      +1
  input_tab                       1076    1074      -2
  create_J                        1821    1819      -2
  poplocalvars                     314     311      -3
  doCommands                      2222    2214      -8
  do_load                          918     902     -16
  printable_string                  57       9     -48
  ------------------------------------------------------------------------------
  (add/remove: 1/0 grow/shrink: 8/6 up/down: 146/-79)             Total: 67 bytes

  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* libbb: fix potential NULL pointer use | Denys Vlasenko | 2018-09-03 | 1 | -0/+2
  function                         old     new   delta
  unicode_conv_to_printable2       193     216     +23

  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* lineedit: improve Unicode handling (still buggy though) | Denys Vlasenko | 2013-08-19 | 1 | -4/+1
  function                         old     new   delta
  unicode_strlen                     -      31     +31
  read_line_input                 3876    3879      +3
  lineedit_read_key                255     246      -9
  parse_and_put_prompt             785     755     -30
  ------------------------------------------------------------------------------
  (add/remove: 1/0 grow/shrink: 1/2 up/down: 34/-39)              Total: -5 bytes

  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* Call setlocale(""), not "C", if we want to set the default one | Denys Vlasenko | 2013-07-07 | 1 | -3/+12
  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
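The distinction behind this subject line is standard C: a program starts in the "C" locale, setlocale(LC_ALL, "C") merely re-selects it, and setlocale(LC_ALL, "") selects the locale named by the environment. A minimal sketch follows; the two function names are hypothetical helpers, not busybox functions.

```c
#include <locale.h>
#include <string.h>

/* Returns 1 if the program's current locale is the "C" locale. */
int sketch_locale_is_c(void)
{
    const char *cur = setlocale(LC_ALL, NULL);  /* query only, no change */
    return cur != NULL && strcmp(cur, "C") == 0;
}

/* Switches to the locale named by the environment (LC_ALL, LC_CTYPE,
 * LANG, ...). Returns 0 if the environment names an unusable locale,
 * in which case the current locale is left unchanged. */
int sketch_use_env_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}
```

Before any setlocale call that changes the locale, sketch_locale_is_c() reports 1; what sketch_use_env_locale() selects depends on the environment it runs in.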
* unicode: check $LC_CTYPE too to detect Unicode mode | Denys Vlasenko | 2013-07-05 | 1 | -0/+8
  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* unicode: check $LC_ALL to detect Unicode mode, not only $LANG | Denys Vlasenko | 2013-07-02 | 1 | -4/+10
  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
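Together, the two commits above make Unicode detection look at $LC_ALL, $LC_CTYPE and $LANG. A rough sketch of such a check is shown here, using the POSIX precedence (LC_ALL first, then LC_CTYPE, then LANG, with empty values treated as unset). The function name and the ".utf"/".UTF" substring test are assumptions for illustration, not necessarily what libbb/unicode.c does.

```c
#include <stddef.h>
#include <string.h>

/* Each argument is the variable's value, or NULL if it is unset. */
int sketch_env_wants_unicode(const char *lc_all, const char *lc_ctype,
                             const char *lang)
{
    /* The first non-empty variable wins, in POSIX precedence order. */
    const char *s = lc_all;

    if (s == NULL || *s == '\0')
        s = lc_ctype;
    if (s == NULL || *s == '\0')
        s = lang;
    if (s == NULL)
        return 0;

    /* Assumed heuristic: look for a UTF-8 codeset suffix. */
    return strstr(s, ".utf") != NULL || strstr(s, ".UTF") != NULL;
}
```

A caller would pass getenv("LC_ALL"), getenv("LC_CTYPE") and getenv("LANG"). Note that LC_ALL=C overrides a UTF-8 LANG under this ordering.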
* lineedit: fixes for CONFIG_UNICODE_USING_LOCALE=y | Denys Vlasenko | 2011-03-27 | 1 | -3/+4
  function                         old     new   delta
  load_string                       45      91     +46
  save_string                       40      82     +42
  reinit_unicode                    34      61     +27
  BB_PUTCHAR                        97     120     +23
  init_unicode                      17      37     +20
  ------------------------------------------------------------------------------
  (add/remove: 0/0 grow/shrink: 5/0 up/down: 158/0)              Total: 158 bytes

  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* ash,hush: recheck LANG before every line input | Denys Vlasenko | 2011-03-23 | 1 | -11/+17
  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* libbb unicode: comment out unused function and unused parameter | Denys Vlasenko | 2011-01-11 | 1 | -5/+6
  Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
* unicode: update unicode_width on !unicode branch too. Closes bug 2593 | Denys Vlasenko | 2010-10-29 | 1 | -2/+5
  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* lineedit: fix completion with Unicode chars | Denys Vlasenko | 2010-09-02 | 1 | -1/+1
  function                         old     new   delta
  read_line_input                 4966    5002     +36
  bb_wcstombs                      170     159     -11
  ------------------------------------------------------------------------------
  (add/remove: 0/0 grow/shrink: 1/1 up/down: 36/-11)              Total: 25 bytes

  Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
* *: make GNU licensing statement forms more regular | Denys Vlasenko | 2010-08-16 | 1 | -1/+1
  This change retains "or later" state! No licensing _changes_ here,
  only form is adjusted (article, space between "GPL" and "v2" and so on).

  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* lineedit: fix column display for wide and combining chars in TAB completion | Tomas Heinrich | 2010-06-01 | 1 | -3/+14
  function                         old     new   delta
  unicode_strwidth                   -      20     +20
  read_line_input                 4945    4953      +8
  unicode_strlen                    31       -     -31
  ------------------------------------------------------------------------------
  (add/remove: 1/1 grow/shrink: 1/0 up/down: 28/-31)              Total: -3 bytes

  Signed-off-by: Tomas Heinrich <heinrich.tomas@gmail.com>
  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* stop using LAST_SUPPORTED_WCHAR and CONFIG_LAST_SUPPORTED_WCHAR, it's confusing | Denys Vlasenko | 2010-05-16 | 1 | -13/+13
  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* lineedit: partially fix wide and combining chars editing | Tomas Heinrich | 2010-05-16 | 1 | -1/+1
  Signed-off-by: Tomas Heinrich <heinrich.tomas@gmail.com>
  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* libbb/lineedit: add support for preserving "broken" (non-unicode) chars | Tomas Heinrich | 2010-04-29 | 1 | -9/+3
  Signed-off-by: Tomas Heinrich <heinrich.tomas@gmail.com>
  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* unicode: s/FEATURE_ASSUME_UNICODE/UNICODE_SUPPORT, add UNICODE_USING_LOCALE | Denys Vlasenko | 2010-03-26 | 1 | -7/+427
  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* unicode: optional table for better handling of neutral bidi chars | Tomas Heinrich | 2010-03-26 | 1 | -93/+255
  Off:
  function                         old     new   delta
  unicode_bidi_isrtl                 -      55     +55
  isrtl_str                         51      65     +14
  unicode_isrtl                     55       -     -55
  read_line_input                 5003    4937     -66
  ------------------------------------------------------------------------------
  (add/remove: 1/4 grow/shrink: 1/1 up/down: 69/-121)            Total: -52 bytes

  On:
  function                         old     new   delta
  static.neutral_b                   -     320    +320
  static.neutral_p                   -     142    +142
  unicode_bidi_isrtl                 -      55     +55
  unicode_bidi_is_neutral_wchar      -      55     +55
  isrtl_str                         51      59      +8
  unicode_isrtl                     55       -     -55
  read_line_input                 5003    4937     -66
  ------------------------------------------------------------------------------
  (add/remove: 4/4 grow/shrink: 1/1 up/down: 580/-121)           Total: 459 bytes

  Signed-off-by: Tomas Heinrich <heinrich.tomas@gmail.com>
  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* lineedit: first shot at optional unicode bidi input support | Tomas Heinrich | 2010-03-18 | 1 | -0/+132
  function                         old     new   delta
  read_line_input                 4886    5003    +117
  in_uint16_table                    -      97     +97
  in_interval_table                  -      78     +78
  static.rtl_b                       -      68     +68
  unicode_isrtl                      -      55     +55
  isrtl_str                          -      51     +51
  static.rtl_p                       -      42     +42
  unicode_conv_to_printable2       633     477    -156
  ------------------------------------------------------------------------------
  (add/remove: 6/0 grow/shrink: 1/1 up/down: 508/-156)           Total: 352 bytes

  Signed-off-by: Tomas Heinrich <heinrich.tomas@gmail.com>
  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* ls: fix handling of broken unicode sequences | Denys Vlasenko | 2010-01-31 | 1 | -22/+25
  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* further work on unicodization | Denys Vlasenko | 2010-01-30 | 1 | -18/+59
  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* more fine-grained Unicode support | Denys Vlasenko | 2010-01-29 | 1 | -16/+70
  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* testsuite-discovered fixes | Denys Vlasenko | 2010-01-25 | 1 | -3/+5
  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* libbb: better unicode width support. Hopefully fixes bug 839. | Denys Vlasenko | 2010-01-24 | 1 | -57/+153
  Also opens up a possibility to make other unicode stuff smaller
  and more correct later. but:

  function                         old     new   delta
  static.combining                   -     516    +516
  bb_wcwidth                         -     328    +328
  unicode_cut_nchars                 -     141    +141
  mbstowc_internal                   -      93     +93
  in_table                           -      78     +78
  cal_main                         899     961     +62
  static.combining0x10000            -      40     +40
  unicode_strlen                     -      31     +31
  bb_mbstrlen                       31       -     -31
  bb_mbstowcs                      173     102     -71
  ------------------------------------------------------------------------------
  (add/remove: 7/1 grow/shrink: 1/1 up/down: 1289/-102)         Total: 1187 bytes

  Uses code of Markus Kuhn, which is in public domain:
  http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
  "Permission to use, copy, modify, and distribute this software
  for any purpose and without fee is hereby granted. The author
  disclaims all warranties with regard to this software."

  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
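The in_table and static.combining symbols in the size report above suggest the approach of Markus Kuhn's wcwidth.c: a sorted table of codepoint intervals searched by bisection. A minimal sketch of that technique follows. The sketch_* names are hypothetical and the table is only a handful of sample intervals, not the data this commit added.

```c
#include <stddef.h>
#include <stdint.h>

struct interval {
    uint32_t first;
    uint32_t last;
};

/* A few sample zero-width intervals, sorted and non-overlapping. */
static const struct interval sketch_combining[] = {
    { 0x0300, 0x036F },   /* combining diacritical marks */
    { 0x0483, 0x0486 },   /* Cyrillic combining marks */
    { 0x0591, 0x05BD },   /* Hebrew accents and points */
    { 0x200B, 0x200F },   /* zero width space, joiners, direction marks */
    { 0xFE00, 0xFE0F },   /* variation selectors */
};

/* Binary search: returns 1 if ucs lies inside one of the n intervals. */
int sketch_in_table(uint32_t ucs, const struct interval *table, size_t n)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (ucs < table[mid].first)
            hi = mid;
        else if (ucs > table[mid].last)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}

/* Convenience wrapper over the sample table. */
int sketch_is_combining(uint32_t ucs)
{
    return sketch_in_table(ucs, sketch_combining,
            sizeof(sketch_combining) / sizeof(sketch_combining[0]));
}
```

A width function built on this returns 0 when sketch_is_combining() matches and otherwise goes on to decide between widths 1 and 2.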
* *: small code shrinks and compile fix for unicode | Denys Vlasenko | 2010-01-20 | 1 | -0/+3
  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* fold: unicode support. Based on a patch by Tomas Heinrich <heinrich.tomas@gmail.com> | Denys Vlasenko | 2010-01-04 | 1 | -19/+27
  General Unicode support is tweaked to expose unicode_status.

  function                         old     new   delta
  init_unicode                       -      77     +77
  write2stdout                       -      19     +19
  adjust_column                     68      71      +3
  unicode_status                     -       1      +1
  unicode_is_enabled                 1       -      -1
  grep_main                        780     773      -7
  fold_main                        619     552     -67
  check_unicode_in_env              77       -     -77
  ------------------------------------------------------------------------------
  (add/remove: 3/2 grow/shrink: 1/2 up/down: 100/-152)           Total: -52 bytes

  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* widen "Unicode in environment" check | Denys Vlasenko | 2009-07-16 | 1 | -1/+1
  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* lineedit+unicode: code shrink | Denys Vlasenko | 2009-07-16 | 1 | -22/+14
  function                         old     new   delta
  wcrtomb_internal                 161      83     -78

  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* comment fixes, no code changes | Denys Vlasenko | 2009-07-16 | 1 | -1/+2
  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* tweaking Unicode support | Denys Vlasenko | 2009-07-11 | 1 | -60/+47
  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
* added simplified Unicode support for non-locale-enabled builds | Denys Vlasenko | 2009-07-11 | 1 | -0/+241
  Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>