but *we* might use more than 1,000,000 code points in our internal
representation.
Bill Tutt wrote:
From Unicode's FAQ: (http://www.unicode.org/unicode/faq)
"""
Q: Will UTF-16 ever be extended to more than a million characters?
A:
As stated, the goal of Unicode is not to encode glyphs, but characters. Over
a million possible codes is far more than enough for this goal. Unicode is
*not* designed to encode arbitrary data. If you wanted, for example, to give
each "instance of a character on paper throughout history" its own code, you
might need trillions or quadrillions of such codes; noble as this effort
might be, you would not use Unicode for such an encoding. No proposed
extensions of UTF-16 to more than 2 surrogates have a chance of being
accepted into the Unicode Standard or ISO/IEC 10646.
"""
A good example of Unicode encoding characters but not glyphs is the CJK
(Chinese, Japanese, and Korean) Unicode code points. If I recall
correctly, for a given Unicode character in these ranges it's not
uncommon for Chinese, Japanese, and Korean to have different glyphs
associated with the same code point.
I don't think any sane person would expect a character that can't be
encoded in UTF-16 to be accepted into Unicode anytime in the foreseeable
future, unless we suddenly discover several alien races (that also use
ideographs) and need to record their documents in Unicode document stores.
In other words, the fact that UTF-16 doesn't encode all 4 billion UCS-4
code points isn't that big of a deal.
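For concreteness, here is the arithmetic behind those numbers as a
back-of-the-envelope C sketch (not code from any of these mails): a
surrogate pair combines one of 1,024 high units with one of 1,024 low
units, so UTF-16 tops out at a bit over 1.1 million code points, against
the roughly 4 billion values a 32-bit UCS-4 unit can hold.

    #include <stdio.h>

    int main(void)
    {
        /* UCS-2/BMP: one 16-bit unit per character. */
        unsigned long bmp = 1UL << 16;                 /* 65,536 */
        /* 2,048 of those units are reserved as surrogates. */
        unsigned long surrogates = 2048UL;
        /* A surrogate pair: 1,024 high units x 1,024 low units. */
        unsigned long supplementary = 1024UL * 1024UL; /* 1,048,576 */

        printf("UTF-16 encodable code points: %lu\n",
               bmp - surrogates + supplementary);      /* 1,112,064 */
        printf("32-bit UCS-4 values: %llu\n",
               1ULL << 32);                            /* 4,294,967,296 */
        return 0;
    }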
I will note that other people have told me that glibc defines wchar_t as a
UCS-4 type.
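If you want to check that on a particular box, something like this works
(a quick sketch assuming a hosted C99 compiler; __STDC_ISO_10646__ is
the macro glibc documents for exactly this guarantee):

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        printf("sizeof(wchar_t) = %zu bytes\n", sizeof(wchar_t));
    #ifdef __STDC_ISO_10646__
        /* glibc defines this: wchar_t values are ISO 10646 (UCS-4)
           code points directly. */
        printf("__STDC_ISO_10646__ = %ld\n", (long)__STDC_ISO_10646__);
    #else
        printf("wchar_t is not guaranteed to be UCS-4 here\n");
    #endif
        return 0;
    }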
Bill
> From: Ben Wing [mailto:ben@666.com]
>
>
> surrogates can only encode 1,000,000 chars. ucs-4 encodes
> 4,000,000,000 chars.
> is there another extension mechanism to handle the rest?
>
> Bill Tutt wrote:
>
> > > From: Ben Wing [mailto:ben@666.com]
> > > "Stephen J. Turnbull" wrote:
> > >
> > > > (3) We may want to be a little bit careful with the notion of
> > > > the default internal representation. I can see that a default
> > > > internal representation of UCS-2 (UTF-16, I presume is what you
> > > > really mean?) would be attractive. So what happens if you have
> > > > data that is not representable in the default internal
> > > > representation? Do we just tell those users to get lost?
> > > >
> > > > It would be kind of weird if the default internal representation
> > > > that Eistrings dealt with was UCS-2 but UTF-8 representation was
> > > > available in buffers, which you don't rule out.
> > >
> > > by its nature, the default int. rep. must be able to represent
> > > all chars. that would rule out utf16 if we have more than
> > > 1,000,000 and some chars. but it doesn't rule out ucs4, or some
> > > utf16 extension that could encode gigs o' chars, etc.
> > >
> >
> > To clarify, UTF-16 can represent all characters in UCS-4. UTF-16,
> > just like UTF-8, breaks that annoying simplification that all
> > characters are fixed width. As a happy coincidence, the only
> > difference between UTF-16 and UCS-2 is knowing where the character
> > boundaries are. A UTF-16 encoding of a Unicode character (e.g.
> > U+000E0020) is itself two valid UCS-2 characters. This is what the
> > surrogate pair range in the Unicode code space is for.
> >
> > Making things completely Unicode aware isn't as easy as some people
> > think; have a gander at some of the stuff on www.unicode.org if you
> > haven't recently (esp. the technical reports). E.g., implementing a
> > regular expression engine that supports a good chunk of Unicode's
> > "features" is very non-trivial, especially if you don't want it to
> > take forever.
> >
> > Bill
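To make the surrogate-pair mechanics above concrete, here is a minimal
C sketch (to_surrogates is a hypothetical helper for illustration, not
XEmacs code) splitting a supplementary code point into the two 16-bit
units Bill describes; for his U+000E0020 example it yields DB40 DC20:

    #include <stdio.h>

    /* Split a supplementary code point (U+10000..U+10FFFF) into a
       UTF-16 surrogate pair.  Both halves land in the D800-DFFF
       range that UCS-2 reserves, which is why a UTF-16 stream is
       also a sequence of 16-bit UCS-2 units -- the only thing
       UTF-16 adds is knowing where the character boundaries are.
       (Hypothetical helper, not from this thread.) */
    static void to_surrogates(unsigned long cp,
                              unsigned short *hi, unsigned short *lo)
    {
        cp -= 0x10000;                                 /* 20 bits left */
        *hi = (unsigned short)(0xD800 | (cp >> 10));   /* top 10 bits  */
        *lo = (unsigned short)(0xDC00 | (cp & 0x3FF)); /* low 10 bits  */
    }

    int main(void)
    {
        unsigned short hi, lo;
        to_surrogates(0xE0020UL, &hi, &lo);
        printf("U+E0020 -> %04X %04X\n", hi, lo);      /* DB40 DC20 */
        return 0;
    }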
--
Ben
In order to save my hands, I am cutting back on my mail. I also write
as succinctly as possible -- please don't be offended. If you send me
mail, you _will_ get a response, but please be patient, especially for
XEmacs-related mail. If you need an immediate response and it is not
apparent in your message, please say so. Thanks for your understanding.