On the twenty-fifth of December, Stephen J. Turnbull wrote:
Aidan Kehoe writes:
> We could disable the charsets-in-region tests on the Unicode builds.
> Or have a defined order we expect depending on the current language
> environment; but to be honest, with unicode-internal, the output of
> charsets-in-region isn’t something the user is going to care about, I
> would lean more towards the former.
Users like me who use diff on ISO-2022 (or windows-xxxx for that
matter) care about getting the same charsets out that they put in. If
we get that right, charsets-in-* will follow.
Well, then, we should test the charsets assigned by the iso-2022 coding
systems.
The coding systems should pass through extent information. GB2312, ISO-2022,
SHIFT_JIS and so forth should tag their non-ASCII as zh_CN, ja, zh_TW and so
forth, using extents, and then use that information when re-encoding. This
would also be helpful for the ECMA-48 support that GNU provides with
ansi-color.el, which doesn’t work on XEmacs currently. I’m not volunteering
for this anytime soon.
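To make the idea concrete, here is a rough sketch in Python rather than XEmacs Lisp. It assumes a hypothetical `Extent` record and a hypothetical `CODEC_LANG` table; neither exists in XEmacs, and real extents carry far more than a language tag. It only shows the shape of "tag non-ASCII runs on decode, consult the tags on encode":

```python
from dataclasses import dataclass

@dataclass
class Extent:
    """Hypothetical stand-in for an extent: a half-open range plus a language tag."""
    start: int
    end: int
    lang: str

# Hypothetical table; the pairing of codecs and language tags is an assumption.
CODEC_LANG = {"gb2312": "zh_CN", "shift_jis": "ja", "big5": "zh_TW"}

def decode_tagged(data, codec):
    """Decode DATA and tag every run of non-ASCII characters with the codec's language."""
    lang = CODEC_LANG[codec]
    text = data.decode(codec)
    extents, run_start = [], None
    for i, ch in enumerate(text):
        if ord(ch) > 0x7F:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            extents.append(Extent(run_start, i, lang))
            run_start = None
    if run_start is not None:
        extents.append(Extent(run_start, len(text), lang))
    return text, extents

def encode_tagged(text, extents):
    """Re-encode TEXT with the codec implied by its first extent (ASCII if untagged)."""
    if not extents:
        return text.encode("ascii")
    lang_codec = {lang: codec for codec, lang in CODEC_LANG.items()}
    return text.encode(lang_codec[extents[0].lang])

original = "abc 日本語 xyz".encode("shift_jis")
text, extents = decode_tagged(original, "shift_jis")
round_trip = encode_tagged(text, extents)
```

A real implementation would have to handle text with several differently tagged runs, which is where something like ISO-2022 designation would come in; this sketch simply takes the first tag.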
I was never interested in a proliferation of internal character sets; I
needed to use latin-unity for too long for that. My fixed-width coding
systems decoded from external data to existing internal characters, or the
jit-unicode character sets; Ben decided to replace them by creating internal
character sets for all the corresponding coding systems, including the
Windows code pages. I note he added a `charset-iso-2022-p’ predicate; I hope
the ISO-2022 coding systems take note of this and don’t use the Windows code
pages, but I haven’t checked.
> Ben’s choice. We do need more than 2^21, for our invalid sequence
> characters, but we certainly don’t need the full 2^30.
Um, 2^21 leaves us with 15 non-Unicode planes (17-31). Surely that's
enough? Am I missing something? There's also the Python PEP 383
strategy (use singleton low surrogates), which allows you to live
within the 17 Unicode planes.
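For reference, the plane arithmetic and the lone-surrogate strategy can both be checked in Python. This is only an illustration of Python's own `surrogateescape` error handler (specified in PEP 383), not of anything in the XEmacs source:

```python
# A 21-bit code space, divided into planes of 2**16 code points each.
PLANE_SIZE = 2 ** 16
total_planes = 2 ** 21 // PLANE_SIZE            # planes 0-31
unicode_planes = (0x10FFFF + 1) // PLANE_SIZE   # planes 0-16
spare_planes = total_planes - unicode_planes    # planes 17-31

# With "surrogateescape", each undecodable byte 0x80-0xFF is decoded to a
# lone low surrogate U+DC80-U+DCFF, and encoded back to the same byte.
raw = b"abc\xff\xfe"
text = raw.decode("utf-8", errors="surrogateescape")
escaped = [hex(ord(ch)) for ch in text[3:]]
round_trip = text.encode("utf-8", errors="surrogateescape")
```

Here `spare_planes` comes out as 15, `escaped` holds the two lone surrogates standing in for the bytes 0xFF and 0xFE, and `round_trip` equals `raw`.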
We need internal characters that have no standardly-defined external
representation. ‘Standardly-defined external representation with no
associated semantic definition’ is no use, as you and I see on the Mac; it’s
routine to encounter code points in the Unicode private use area in macOS
data, and if they conflict with characters we have assigned, that will give
bugs. ?\uF8FF should not suddenly be encoded as, e.g. a single-byte ÿ in a
UTF-8 file.
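A small Python sketch of the failure mode described above. The scheme here is hypothetical and deliberately bad: it parks each undecodable byte b at the private-use code point U+F800 + b. The names `HYPOTHETICAL_BASE`, `decode` and `encode` are made up for the example and do not correspond to anything XEmacs does:

```python
import codecs

# HYPOTHETICAL scheme, for illustration only: an undecodable byte b is
# stored internally as the private-use code point U+F800 + b.
HYPOTHETICAL_BASE = 0xF800

def _bytes_to_pua(exc):
    """Error handler: map each undecodable byte into the private use area."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(HYPOTHETICAL_BASE + b) for b in bad), exc.end
    raise exc

codecs.register_error("hypothetical-pua", _bytes_to_pua)

def decode(data):
    return data.decode("utf-8", errors="hypothetical-pua")

def encode(text):
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if HYPOTHETICAL_BASE + 0x80 <= cp <= HYPOTHETICAL_BASE + 0xFF:
            out.append(cp - HYPOTHETICAL_BASE)   # "restore" the raw byte
        else:
            out += ch.encode("utf-8")
    return bytes(out)

genuine = "\uf8ff".encode("utf-8")       # a real U+F8FF from external data
from_invalid = decode(b"\xff")           # an invalid byte, parked in the PUA
from_genuine = decode(genuine)           # the genuine character
corrupted = encode(from_genuine)         # no longer the three bytes we read
```

Both inputs decode to the same internal character, so on output the genuine U+F8FF (three bytes in UTF-8) is written as the single byte 0xFF. Internal characters with no standardly-defined external representation avoid the collision.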
--
‘As I sat looking up at the Guinness ad, I could never figure out /
How your man stayed up on the surfboard after forty pints of stout’
(C. Moore)