From: Ben Wing [mailto:ben@666.com]
"Stephen J. Turnbull" wrote:
> (3) We may want to be a little bit careful with the notion of the
> default internal representation. I can see that a default
> internal representation of UCS-2 (UTF-16, I presume is what you
> really mean?) would be attractive. So what happens if you have
> data that is not representable in the default internal
> representation? Do we just tell those users to get lost?
>
> It would be kind of weird if the default internal representation
> that Eistrings dealt with was UCS-2 but UTF-8 representation was
> available in buffers, which you don't rule out.
by its nature, the default int. rep. must be able to represent all chars. that
would rule out utf16 if we have more than 1,000,000 and some chars. but it
doesn't rule out ucs4, or some utf16 extension that could encode gigs o' chars,
etc.
To clarify: UTF-16 can represent every code point up to U+10FFFF, which covers
every character currently assigned in UCS-4. UTF-16, just like UTF-8, breaks
that annoying simplification that all characters are fixed width. As a happy
coincidence, the only difference between UTF-16 and UCS-2 is knowing where the
character boundaries are. A UTF-16 encoding of a Unicode character outside the
16-bit range (e.g. U+000E0020) is itself two valid 16-bit UCS-2 code units.
This is what the surrogate pair range in the Unicode code space is for.
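As a rough sketch of the surrogate mechanism (in Python; the function names
here are made up for illustration and are not from XEmacs or any library), a
code point above U+FFFF is split into a high and a low surrogate, and finding
character boundaries means pairing them back up:

```python
from typing import List


def utf16_encode_code_point(cp: int) -> List[int]:
    """Return the one or two 16-bit UTF-16 code units for a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the range UTF-16 can encode")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not characters")
    if cp < 0x10000:
        return [cp]                      # fits in one unit, same as UCS-2
    offset = cp - 0x10000                # 20 bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return [high, low]


def count_characters(units: List[int]) -> int:
    """Count characters in a list of UTF-16 code units.

    A well-formed high/low surrogate pair counts as one character;
    anything else (including a stray surrogate) counts as one unit.
    """
    count = 0
    i = 0
    while i < len(units):
        is_pair = (
            0xD800 <= units[i] <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        i += 2 if is_pair else 1
        count += 1
    return count


units = utf16_encode_code_point(0xE0020)
print([f"{u:04X}" for u in units])       # prints ['DB40', 'DC20']
```

Code that treats the same data as UCS-2 would see two separate 16-bit values
where count_characters sees one character.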
Making things completely Unicode-aware isn't as easy as some people think;
have a gander at some of the stuff on www.unicode.org if you haven't
recently (esp. the technical reports).
E.g., implementing a regular expression engine that supports a good chunk of
Unicode's "features" is very non-trivial, especially if you don't want it to
take forever.
Bill
Not a MS PR guy, etc...