On the twenty-fourth day of December, Stephen J. Turnbull wrote:
> So after dealing with the test-all-characters issue[1] I noticed that
> charsets-in-region and charsets-in-string were failing. I suspect the
> reason is that in the Unicode build the precedence lists are not tuned
> right. I guess a real fix would need to be some intelligent approach
> to modifying the precedence list according to the charsets used at
> read time or input (although that won't work if the input is
> Unicode). The other problem is that "intelligent" really means
> constructing some kind of precedence graph, and it could easily be
> impossible (e.g. if both Chinese and Japanese were used in the same
> file, you'd need a language tag to disambiguate).
>
> I guess the first thing I'd try is ensuring that ISO charsets come
> before the windows-xxx and IBM CPxxx versions.
>
> Any other suggestions?

We could disable the charsets-in-region tests on the Unicode builds, or have
a defined order we expect depending on the current language environment. But
to be honest, with unicode-internal, the output of charsets-in-region isn’t
something the user is going to care about, so I would lean more towards the
former.
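
For what it's worth, the "ISO charsets first" ordering could be sketched
roughly as below. This is only an illustration in Python, not XEmacs code:
the charset names, the ranking function and tune_precedence are all
hypothetical, and the real precedence lists live in the C and Lisp sources.

```python
# Hypothetical sketch: push ISO charsets ahead of the windows-xxx and
# IBM CPxxx variants in a precedence list. All names are illustrative.

def charset_rank(name):
    """Return 0 for ISO charsets, 2 for Windows/IBM code pages, 1 otherwise."""
    lowered = name.lower()
    if lowered.startswith("iso"):
        return 0
    if lowered.startswith(("windows-", "ibm", "cp")):
        return 2
    return 1


def tune_precedence(charsets):
    """Reorder by rank; sorted() is stable, so ties keep their original order."""
    return sorted(charsets, key=charset_rank)


precedence = ["windows-1252", "iso-8859-1", "ibm-cp850", "koi8-r", "iso-8859-2"]
print(tune_precedence(precedence))
```

Because the sort is stable, anything already tuned by hand within a group
stays where it was; only the relative position of the groups changes.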

> Footnotes:
> [1] By the way, why does the Unicode build have a 2^30 repertoire?
> ISO 10646 has a 2^31 repertoire IIRC (maybe 2^32?), but Unicode only
> has 2^21 (precisely, 2^20+2^16).

Ben’s choice. We do need more than 2^21, for our invalid sequence
characters, but we certainly don’t need the full 2^30.
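
To put numbers on the sizes mentioned above, here is a quick check in
Python. It only restates the arithmetic from the footnote (2^20+2^16 code
points, a 2^30 internal space); the variable names are mine.

```python
# Unicode's code space is U+0000..U+10FFFF: 17 planes of 2^16 code points.
unicode_code_points = 2**20 + 2**16
assert unicode_code_points == 0x10FFFF + 1 == 17 * 2**16

# Number of bits needed to represent the highest Unicode code point.
bits_needed = (unicode_code_points - 1).bit_length()

# Values in a 2^30 space that lie outside the Unicode code space.
beyond_unicode_in_30_bits = 2**30 - unicode_code_points

print(unicode_code_points, bits_needed, beyond_unicode_in_30_bits)
```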
--
‘As I sat looking up at the Guinness ad, I could never figure out /
How your man stayed up on the surfboard after forty pints of stout’
(C. Moore)