RE: proposed Eistring interface

Friday, 21 April 2000

...
 From: Ben Wing [mailto:ben＠666.com]
 "Stephen J. Turnbull" wrote:

 > (3) We may want to be a little bit careful with the notion of the
 >     default internal representation.  I can see that a default
 >     internal representation of UCS-2 (UTF-16, I presume is what you
 >     really mean?) would be attractive.  So what happens if you have
 >     data that is not representable in the default internal
 >     representation?  Do we just tell those users to get lost?
 >
 >     It would be kind of weird if the default internal representation
 >     that Eistrings dealt with was UCS-2 but UTF-8 representation was
 >     available in buffers, which you don't rule out.

 by its nature, the default int. rep. must be able to represent all
 chars.  that would rule out utf16 if we have more than 1,000,000 and
 some chars.  but it doesn't rule out ucs4, or some utf16 extension
 that could encode gigs o' chars, etc.

To clarify, UTF-16 can represent every character in UCS-4 up to U+10FFFF,
which is the whole range Unicode assigns. UTF-16, just like UTF-8, breaks
that annoying simplification that all characters are fixed width. As a happy
coincidence, the only difference between UTF-16 and UCS-2 is knowing where
the character boundaries are. A UTF-16 encoding of a Unicode character above
U+FFFF (e.g. U+000E0020) is itself two valid UCS-2 characters. This is what
the surrogate pair range in the Unicode code space is for.
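
As a rough illustration of the surrogate pair mechanism, here is a minimal
Python sketch that splits a code point into UTF-16 code units and joins them
back. The function names (to_utf16_units, from_utf16_units) are made up for
this example and are not part of the proposed Eistring interface.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point UTF-16 can reach


def to_utf16_units(cp):
    """Return the list of 16-bit code units that encode code point cp."""
    if cp < 0 or cp > MAX_CODE_POINT:
        raise ValueError("code point outside the UTF-16 range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if cp < 0x10000:
        return [cp]  # one unit, identical to UCS-2
    v = cp - 0x10000  # 20 bits remain to be split
    high = 0xD800 | (v >> 10)  # top 10 bits go in the high surrogate
    low = 0xDC00 | (v & 0x3FF)  # bottom 10 bits go in the low surrogate
    return [high, low]


def from_utf16_units(units):
    """Decode a list of 16-bit code units into a list of code points."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            low = units[i + 1]
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            out.append(u)
            i += 1
    return out


units = to_utf16_units(0xE0020)
print([hex(u) for u in units])  # ['0xdb40', '0xdc20']
```

Code that treats the buffer as UCS-2 sees two 16-bit values here; only code
that knows about surrogates sees one character. The ceiling of U+10FFFF also
gives 17 * 65536 = 1,114,112 code points in total, which matches the
"1,000,000 and some" figure quoted above.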

Making things completely Unicode aware isn't as easy as some people think.
Have a gander at some of the stuff on www.unicode.org if you haven't
recently (esp. the technical reports). E.g. implementing a regular
expression engine that supports a good chunk of Unicode's "features" is very
non-trivial, especially if you don't want it to take forever.

Bill
Not a MS PR guy, etc...
