"""
Q: Will UTF-16 ever be extended to more than a million characters?
A:
As stated, the goal of Unicode is not to encode glyphs, but characters. Over
a million possible codes is far more than enough for this goal. Unicode is
*not* designed to encode arbitrary data. If you wanted, for example, to give
each "instance of a character on paper throughout history" its own code, you
might need trillions or quadrillions of such codes; noble as this effort
might be, you would not use Unicode for such an encoding. No proposed
extension of UTF-16 to more than 2 surrogates has a chance of being
accepted into the Unicode Standard or ISO/IEC 10646.
"""
A good example of Unicode encoding characters but not glyphs is the CJK
(Chinese, Japanese, and Korean) range of Unicode code points. If I
recall correctly, for a given Unicode character in these ranges it's
not uncommon for Chinese, Japanese, and Korean to associate different
glyphs with the same code point.
I don't think any sane person would expect a character that UTF-16
cannot encode to be accepted into Unicode anytime in the foreseeable
future, unless we suddenly discover several alien races (that also use
ideographs) and need to record their documents in Unicode document
stores.
In other words, the fact that UTF-16 doesn't encode all 4 billion
UCS-4 code points isn't that big of a deal.
I will note that other people have told me that glibc defines wchar_t as a
UCS-4 type.
Bill
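(As a minimal check of that wchar_t point, assuming a glibc/Linux
system; this is platform-dependent, since Win32 defines wchar_t as a
16-bit type:)

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* On glibc, wchar_t is 32 bits wide, so one wchar_t holds
           any UCS-4 value directly, with no surrogates needed. */
        wchar_t c = 0x1D11E;  /* MUSICAL SYMBOL G CLEF, outside the BMP */

        printf("sizeof(wchar_t) = %zu bytes\n", sizeof(wchar_t));
        printf("code point      = U+%05lX\n", (unsigned long)c);
        return 0;
    }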
From: Ben Wing [mailto:ben@666.com]
Surrogates can only encode 1,000,000 chars. UCS-4 encodes
4,000,000,000 chars.
Is there another extension mechanism to handle the rest?
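(For reference, the surrogate mechanism pairs 1,024 high surrogates
with 1,024 low surrogates, giving the 1,048,576 supplementary code
points that "1,000,000 chars" rounds to. A sketch of the standard
encoding step in C; the helper name is just illustrative:)

    #include <stdio.h>

    /* Encode a supplementary code point (U+10000..U+10FFFF) as a
       UTF-16 surrogate pair. */
    static void to_surrogates(unsigned long cp,
                              unsigned *hi, unsigned *lo)
    {
        unsigned long v = cp - 0x10000;        /* 20 bits remain */
        *hi = 0xD800 + (unsigned)(v >> 10);    /* high 10 bits */
        *lo = 0xDC00 + (unsigned)(v & 0x3FF);  /* low 10 bits */
    }

    int main(void)
    {
        unsigned hi, lo;
        to_surrogates(0x10FFFF, &hi, &lo);     /* largest code point */
        printf("U+10FFFF -> %04X %04X\n", hi, lo);  /* DBFF DFFF */
        return 0;
    }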
Bill Tutt wrote:
> > From: Ben Wing [mailto:ben@666.com]
> > "Stephen J. Turnbull" wrote:
> >
> > > (3) We may want to be a little bit careful with the notion of the
> > > default internal representation. I can see that a default
> > > internal representation of UCS-2 (UTF-16, I presume is what you
> > > really mean?) would be attractive. So what happens if you have
> > > data that is not representable in the default internal
> > > representation? Do we just tell those users to get lost?
> > >
> > > It would be kind of weird if the default internal representation
> > > that Eistrings dealt with was UCS-2 but UTF-8 representation was
> > > available in buffers, which you don't rule out.
> >
> > by its nature, the default int. rep. must be able to represent all
> > chars. that would rule out utf16 if we have more than 1,000,000 and
> > some chars. but it doesn't rule out ucs4, or some utf16 extension
> > that could encode gigs o' chars, etc.
> >
>
> To clarify, UTF-16 can represent every character that Unicode and
> ISO/IEC 10646 will ever assign. UTF-16, just like UTF-8, breaks that
> annoying simplification that all characters are fixed width. As a
> happy coincidence, the only difference between UTF-16 and UCS-2 is
> knowing where the character boundaries are. A UTF-16 encoding of a
> Unicode character (e.g. U+000E0020) is itself two valid UCS-2
> characters. This is what the surrogate pair range in the Unicode code
> space is for.
>
> Making things completely Unicode aware isn't as easy as some people
> think; have a gander at some of the stuff on www.unicode.org if you
> haven't recently (esp. the technical reports). E.g. implementing a
> regular expression engine that supports a good chunk of Unicode's
> "features" is very non-trivial, especially if you don't want it to
> take forever.
>
> Bill
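(As a sketch of Bill's U+000E0020 example: that code point encodes as
the surrogate pair DB40 DC20, each half a well-formed 16-bit code unit
on its own, and decoding just reverses the offset arithmetic. The
helper name below is illustrative, not from the thread:)

    #include <stdio.h>

    /* Decode a UTF-16 surrogate pair back to a scalar value. */
    static unsigned long from_surrogates(unsigned hi, unsigned lo)
    {
        return 0x10000 + (((unsigned long)(hi - 0xD800) << 10)
                          | (lo - 0xDC00));
    }

    int main(void)
    {
        /* Bill's example character, U+000E0020. */
        printf("DB40 DC20 -> U+%06lX\n",
               from_surrogates(0xDB40, 0xDC20));
        return 0;
    }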