Aidan Kehoe writes:

 > I was never interested in a proliferation of internal character
 > sets, needed to use latin-unity for too long for that. My
 > fixed-width coding systems decoded from external data to existing
 > internal characters, or the jit-unicode character sets; Ben decided
 > to replace them by creating internal character sets.

I missed that. That must have been in his Unicode branch?
I'm pretty sure Ben was thinking about the reproducibility problem:
garbage in, *identical* garbage out. More generally, if the editor
doesn't touch a range in the buffer, it should be encoded exactly as
in the input file unless the user has requested a transcoding. This
is pretty important in stuff like mail, where the ISO 2022 encodings
are still in pretty common use in Japan, and I think in Korea as
well. I think I understand your frustration with latin-unity enough
to say this is just as important to me.
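Stated as a property, it's checkable mechanically.  A minimal sketch
in Emacs Lisp, where the file name and coding system are placeholders
rather than a proposal for the test suite:

    ;; Reproducibility property: read FILE as raw bytes, read it
    ;; again through CODING, re-encode the decoded text without
    ;; touching it, and demand byte-for-byte identity.
    (defun my-round-trip-clean-p (file coding)
      (let ((raw (with-temp-buffer
                   (let ((coding-system-for-read 'binary))
                     (insert-file-contents file))
                   (buffer-string)))
            (decoded (with-temp-buffer
                       (let ((coding-system-for-read coding))
                         (insert-file-contents file))
                       (buffer-string))))
        (string= raw (encode-coding-string decoded coding))))

    ;; E.g. (my-round-trip-clean-p "HELLO" 'iso-2022-7bit)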
I don't think it's worth dealing with stuff like one designation
sequence immediately followed by another, etc. -- it will never be
perfect unless you compare against a copy of the original.  So I'm up
for just fiddling with the algorithm used by the charsets-in-*
functions until the tests pass with HELLO, and dealing with any
reproducibility issues that remain as they show up.
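The kind of check I have in mind looks something like this (a sketch
only; iso-2022-7bit is my guess at the right coding system for HELLO,
and the interesting part is whatever list comes back, not a golden
answer):

    ;; Decode HELLO and ask which charsets the decoded text uses;
    ;; the point is to fiddle the charsets-in-* algorithm until the
    ;; result is sane and stable.
    (with-temp-buffer
      (let ((coding-system-for-read 'iso-2022-7bit))
        (insert-file-contents "HELLO"))
      (charsets-in-region (point-min) (point-max)))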
I assume that is OK with you, but tell me I'm wrong if not.  Do you
see any problems likely to arise from that kind of algorithm
fiddling?  (Granted, I don't see any that aren't just as bad in the
current situation.)  Any suggestions for reference while fiddling
would be appreciated.
We need internal characters that have no standardly-defined external
representation.
My question is: do we need more than 983040 such?  Unicode uses all
of the code points #x000000-#x0FFFFF, but of #x100000-#x1FFFFF it
uses only #x100000-#x10FFFF.  So we're free to use #x110000-#x1FFFFF
and still stay within 21 bits, in which case the #x200000 constraint
on mule-tests.el's #'test covers the entire internal repertoire.  And
the test for a Unicode code point vs. such an "internal character" is
cheap: `ch < #x110000'.