aboutsummaryrefslogtreecommitdiff
path: root/scripts/mkwcwidth (follow)
Commit message (Collapse)AuthorAgeFilesLines
* win32: unicode: use newer wcwidth by defaultAvi Halachmi (:avih)2024-03-291-0/+169
This commit adds a new wcwidth implementation at libbb/wcwidth_alt.c, and uses it instead of the existing implementation when compiling for windows and CONFIG_LAST_SUPPORTED_WCHAR >= 0x30000 - which is the case with the unicode configs/mingw64u_defconfig. The windows-target condition keeps non-windows build unmodified, and the last supported wchar threshold is a semi-hack to allow switching between implementations without adding a new config option (the old code supports codepoints up to 0x2ffff). The new file wcwidth_alt.c was generated by a new scripts/mkwcwidth, which prints a wcwidth implementation using latest unicode data from a local clone of https://github.com/jquast/wcwidth . This repo is the main python wcwidth implementation, and is maintained and up to date. Functional differences from the existing implementation: - Unicode 15.1.0 (latest) with the new version (about 450 ranges of wide and zero-width codepoints), compared to roughly Unicode 5.0 of the existing code (nearly 20 years old spec, about 150 ranges). The new spec includes, among others, various wide icons and emojis, which can now be edited correctly at the shell prompt, have correct alignment in 'ls', etc. - The old implementation returns -1 (non-printable) for surrogates, while the new code returns 1, though this is inconsequential, and POSIX doesn't care. Also libc implementations vary in this regard. Technical differences: - The old version compiles less code/data when the last supported wchar is smaller, while the new version doesn't. This doesn't matter because the new version is enabled only for the full range. - The new version is smaller and relatively straight forward, and fully automated (generated), so updates to newer spec is trivial. The old version mixes data, ad-hoc code (tailored to the data), and preprocessor checks, and is hard to automate updates. The old version has various forms of 32 and 16 bit data ranges, in several arrays, while the new version uses single data array with unified form of 32 bits per range, with two rules: - A data range can't span Unicode planes (enforced, but unlikely required, and if yes, code to split ranges would be simple). - A range can't hold more than 32768 codepoints, so bigger ranges are split automatically (currently there are 2 such ranges). Performance wise, the new version should be faster, even with three times the data ranges. Both versions do effectively at most one binary search in one Unicode plane data, but the new version finds both zero-width and wide-width results in this one search, while the old version only finds zero-width, and to detect wide-width it does an additional linear series of manual range tests, but since most results are width 1, this sequence is performed in most (non-ASCII) calls. In a cursory comparison of the new wcwidth with glibc and musl-libc (both use O(1) lookup tables), with few bodies of text, we're in the same ballpark, with typical speed of 60% or better. Bloat-wise, the new version is about 180 bytes code and 1800 bytes data. If it had similar number of data ranges as the old code (150), the new version would be about 200 bytes smaller, but because the new version has 450 data ranges, it's about 1K bigger.