surrogates can only encode 1,000,000 chars. ucs-4 encodes 4,000,000,000 chars.
is there another extension mechanism to handle the rest?
Bill Tutt wrote:
> From: Ben Wing [mailto:ben@666.com]
> "Stephen J. Turnbull" wrote:
>
> > (3) We may want to be a little bit careful with the notion of the
> > default internal representation. I can see that a default
> > internal representation of UCS-2 (UTF-16, I presume is what you
> > really mean?) would be attractive. So what happens if you have
> > data that is not representable in the default internal
> > representation? Do we just tell those users to get lost?
> >
> > It would be kind of weird if the default internal representation
> > that Eistrings dealt with was UCS-2 but UTF-8 representation was
> > available in buffers, which you don't rule out.
>
> by its nature, the default int. rep. must be able to represent all
> chars. that would rule out utf16 if we have more than 1,000,000 and
> some chars. but it doesn't rule out ucs4, or some utf16 extension
> that could encode gigs o' chars, etc.
>
To clarify, UTF-16 can represent all characters in UCS-4. UTF-16, just like
UTF-8, breaks that annoying simplification that all characters are fixed
width. As a happy coincidence, the only difference between UTF-16 and UCS-2
is knowing where the character boundaries are. A UTF-16 encoding of a
Unicode character (e.g. U+000E0020) is itself two valid UCS-2 characters.
This is what the surrogate pair range in the Unicode code space is for.
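
For illustration, the surrogate pair mapping can be sketched roughly as
follows. This is a minimal sketch, not code from XEmacs or any other project
in this thread; the names to_surrogate_pair, from_surrogate_pair and
PAIR_CAPACITY are made up for the example. It assumes the standard UTF-16
layout: high surrogates at U+D800..U+DBFF, low surrogates at U+DC00..U+DFFF,
covering code points U+10000..U+10FFFF.

```python
from typing import Tuple

# Hypothetical helper names; the constants are the standard UTF-16 surrogate ranges.
HIGH_BASE = 0xD800   # first high (leading) surrogate
LOW_BASE = 0xDC00    # first low (trailing) surrogate
SUPP_BASE = 0x10000  # first code point that needs a surrogate pair

# Each surrogate carries 10 bits, so pairs address 1024 * 1024 code points.
PAIR_CAPACITY = 1024 * 1024


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into (high, low) 16-bit units."""
    if not SUPP_BASE <= code_point < SUPP_BASE + PAIR_CAPACITY:
        raise ValueError("code point is outside the surrogate pair range")
    offset = code_point - SUPP_BASE       # 20-bit value
    high = HIGH_BASE + (offset >> 10)     # top 10 bits
    low = LOW_BASE + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into a single code point."""
    return SUPP_BASE + ((high - HIGH_BASE) << 10) + (low - LOW_BASE)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0xE0020)
    print(f"U+{0xE0020:06X} -> {high:04X} {low:04X}")
    print(f"pairs available: {PAIR_CAPACITY}")
```

Under these assumptions U+000E0020 comes out as the two 16-bit units DB40
DC20, and the pair mechanism addresses 1,048,576 code points in total.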
Making things completely Unicode aware isn't as easy as some people think;
have a gander at some of the stuff on www.unicode.org if you haven't
recently (esp. the technical reports).
e.g. Implementing a regular expression engine that supports a good chunk of
Unicode's "features" is very non-trivial, especially if you don't want it to
take forever.
Bill
Not a MS PR guy, etc...
--
Ben
In order to save my hands, I am cutting back on my mail. I also write
as succinctly as possible -- please don't be offended. If you send me
mail, you _will_ get a response, but please be patient, especially for
XEmacs-related mail. If you need an immediate response and it is not
apparent in your message, please say so. Thanks for your understanding.
See also http://www.666.com/ben/typing.html.