You might indeed. I'm not exactly sure why you'd want to do that, but that's
your call. :)
Seems like a waste of a large chunk of memory if you make it your internal
representation without a fairly compelling reason....
Bill
From: Ben Wing [mailto:ben@666.com]
but *we* might use more than 1,000,000 code points in our internal
representation.
Bill Tutt wrote:
> From Unicode's FAQ (http://www.unicode.org/unicode/faq):
> """
> Q: Will UTF-16 ever be extended to more than a million characters?
> A:
> As stated, the goal of Unicode is not to encode glyphs, but characters. Over
> a million possible codes is far more than enough for this goal. Unicode is
> *not* designed to encode arbitrary data. If you wanted, for example, to give
> each "instance of a character on paper throughout history" its own code, you
> might need trillions or quadrillions of such codes; noble as this effort
> might be, you would not use Unicode for such an encoding. No proposed
> extensions of UTF-16 to more than 2 surrogates has a chance of being
> accepted into the Unicode Standard or ISO/IEC 10646.
> """
>
> A good example of Unicode encoding characters but not glyphs is the CJK
> (Chinese, Japanese, and Korean) Unicode code points. If I recall, for a
> given Unicode character in these ranges it's not uncommon for Chinese,
> Japanese, and Korean to have different glyphs associated with these code
> points.
>
> I don't think any sane person would expect a character that UTF-16 can't
> encode to be accepted into Unicode at any time in the foreseeable future,
> unless we suddenly discover several alien races (that also use ideographs)
> and need to record their documents in Unicode document stores.
>
> In other words, the fact that UTF-16 doesn't encode all 4 billion code points
> isn't that big of a deal.
>
> I will note that other people have told me that glibc defines wchar_t as a
> UCS-4 type.
>
> Bill
>
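For reference, the figures traded in this thread ("1,000,000" for surrogates, "4,000,000,000" for UCS-4) can be checked with a few lines of Python. This is only a sketch of the arithmetic: the variable names are illustrative and do not come from XEmacs or any library, and the last figure assumes UCS-4 is treated as a full 32-bit unit, as the thread does.

```python
# Sizes of the code spaces discussed in this thread.
# All names below are illustrative assumptions, not XEmacs identifiers.

HIGH_SURROGATES = 0xDBFF - 0xD800 + 1   # 1024 possible lead values
LOW_SURROGATES = 0xDFFF - 0xDC00 + 1    # 1024 possible trail values

# Each (high, low) pair names exactly one code point above U+FFFF.
supplementary = HIGH_SURROGATES * LOW_SURROGATES

# Highest code point reachable with a single surrogate pair.
utf16_max = 0x10000 + supplementary - 1

# Number of distinct values in a 32-bit unit (assumption: UCS-4 taken as 32 bits).
ucs4_32bit = 2 ** 32

print(supplementary)    # 1048576
print(hex(utf16_max))   # 0x10ffff
print(ucs4_32bit)       # 4294967296
```

So "1,000,000" is shorthand for 1,048,576 supplementary code points, and "4,000,000,000" for roughly 4.29 billion 32-bit values.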
> > From: Ben Wing [mailto:ben@666.com]
> >
> >
> > surrogates can only encode 1,000,000 chars. ucs-4 encodes
> > 4,000,000,000 chars.
> > is there another extension mechanism to handle the rest?
> >
> > Bill Tutt wrote:
> >
> > > > From: Ben Wing [mailto:ben@666.com]
> > > > "Stephen J. Turnbull" wrote:
> > > >
> > > > > (3) We may want to be a little bit careful with the notion of the
> > > > > default internal representation. I can see that a default
> > > > > internal representation of UCS-2 (UTF-16, I presume is what you
> > > > > really mean?) would be attractive. So what happens if you have
> > > > > data that is not representable in the default internal
> > > > > representation? Do we just tell those users to get lost?
> > > > >
> > > > > It would be kind of weird if the default internal representation
> > > > > that Eistrings dealt with was UCS-2 but UTF-8 representation was
> > > > > available in buffers, which you don't rule out.
> > > >
> > > > by its nature, the default int. rep. must be able to represent all
> > > > chars. that would rule out utf16 if we have more than 1,000,000 and
> > > > some chars. but it doesn't rule out ucs4, or some utf16 extension that
> > > > could encode gigs o' chars, etc.
> > > >
> > >
> > > To clarify, UTF-16 can represent all characters in UCS-4. UTF-16, just like
> > > UTF-8, breaks that annoying simplification that all characters are fixed
> > > width. As a happy coincidence, the only difference between UTF-16 and UCS-2
> > > is knowing where the character boundaries are. A UTF-16 encoding of a
> > > Unicode character (e.g. U+000E0020) is itself two valid UCS-2 characters.
> > > This is what the surrogate pair range in the Unicode code space is for.
> > >
> > > Making things completely Unicode aware isn't as easy as some people think;
> > > have a gander at some of the stuff on www.unicode.org if you haven't
> > > recently (esp. the technical reports).
> > > E.g., implementing a regular expression engine that supports a good chunk of
> > > Unicode's "features" is very non-trivial, especially if you don't want it to
> > > take forever.
> > >
> > > Bill
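The surrogate-pair mechanism described in the quoted message can be sketched in a few lines of Python, using the U+000E0020 example from the thread. The function names `to_surrogates` and `from_surrogates` are hypothetical helpers written for this illustration; they are not part of XEmacs or of any library.

```python
def to_surrogates(cp):
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not representable as a surrogate pair")
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go in the lead unit
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go in the trail unit
    return high, low


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0xE0020)
print(hex(high), hex(low))              # 0xdb40 0xdc20
print(hex(from_surrogates(high, low)))  # 0xe0020
```

Each half falls inside the 16-bit range, which is why the pair looks like two ordinary UCS-2 units to code that does not know about surrogates. The range check also shows where the mechanism stops: anything above U+10FFFF has no surrogate-pair form.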
--
Ben
In order to save my hands, I am cutting back on my mail. I also write
as succinctly as possible -- please don't be offended. If you send me
mail, you _will_ get a response, but please be patient, especially for
XEmacs-related mail. If you need an immediate response and it is not
apparent in your message, please say so. Thanks for your
understanding.
See also http://www.666.com/ben/typing.html.