but *we* might use more than 1,000,000 code points in our internal
representation.
Bill Tutt wrote:
From Unicode's FAQ: (http://www.unicode.org/unicode/faq)
"""
Q: Will UTF-16 ever be extended to more than a million characters?
A:
As stated, the goal of Unicode is not to encode glyphs, but characters. Over
a million possible codes is far more than enough for this goal. Unicode is
*not* designed to encode arbitrary data. If you wanted, for example, to give
each "instance of a character on paper throughout history" its own code, you
might need trillions or quadrillions of such codes; noble as this effort
might be, you would not use Unicode for such an encoding. No proposed
extensions of UTF-16 to more than 2 surrogates have a chance of being
accepted into the Unicode Standard or ISO/IEC 10646.
"""
A good example of Unicode encoding characters but not glyphs is the CJK
(Chinese, Japanese, and Korean) Unicode code points. If I recall
correctly, for a given Unicode character in these ranges it's not
uncommon for Chinese, Japanese, and Korean to have different glyphs
associated with the same code point.
I don't think any sane person would expect a character that can't be
encoded in UTF-16 to be accepted into Unicode anytime in the foreseeable
future, unless we suddenly discover several alien races (that also use
ideographs) and need to record their documents in Unicode document stores.
In other words, the fact that UTF-16 doesn't encode all 4 billion UCS-4
code points isn't that big of a deal.
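For concreteness, here is the arithmetic behind those numbers as a
back-of-the-envelope C sketch (not code from any of these mails): a
surrogate pair combines one of 1,024 high units with one of 1,024 low
units, so UTF-16 tops out at a bit over 1.1 million code points, against
the roughly 4 billion values a 32-bit UCS-4 unit can hold.

    #include <stdio.h>

    int main(void)
    {
        /* UCS-2/BMP: one 16-bit unit per character. */
        unsigned long bmp = 1UL << 16;                 /* 65,536 */
        /* 2,048 of those units are reserved as surrogates. */
        unsigned long surrogates = 2048UL;
        /* A surrogate pair: 1,024 high units x 1,024 low units. */
        unsigned long supplementary = 1024UL * 1024UL; /* 1,048,576 */

        printf("UTF-16 encodable code points: %lu\n",
               bmp - surrogates + supplementary);      /* 1,112,064 */
        printf("32-bit UCS-4 values: %llu\n",
               1ULL << 32);                            /* 4,294,967,296 */
        return 0;
    }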
I will note that other people have told me that glibc defines wchar_t as a
UCS-4 type.
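If you want to check that on a particular box, something like this works
(a quick sketch assuming a hosted C99 compiler; __STDC_ISO_10646__ is
the macro glibc documents for exactly this guarantee):

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        printf("sizeof(wchar_t) = %zu bytes\n", sizeof(wchar_t));
    #ifdef __STDC_ISO_10646__
        /* glibc defines this: wchar_t values are ISO 10646 (UCS-4)
           code points directly. */
        printf("__STDC_ISO_10646__ = %ld\n", (long)__STDC_ISO_10646__);
    #else
        printf("wchar_t is not guaranteed to be UCS-4 here\n");
    #endif
        return 0;
    }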
Bill
> From: Ben Wing [mailto:ben@666.com]
>
>
> surrogates can only encode 1,000,000 chars. ucs-4 encodes
> 4,000,000,000 chars.
> is there another extension mechanism to handle the rest?
>
> Bill Tutt wrote:
>
> > > From: Ben Wing [mailto:ben@666.com]
> > > "Stephen J. Turnbull" wrote:
> > >
> > > > (3) We may want to be a little bit careful with the notion of
> > > > the default internal representation. I can see that a default
> > > > internal representation of UCS-2 (UTF-16, I presume is what you
> > > > really mean?) would be attractive. So what happens if you have
> > > > data that is not representable in the default internal
> > > > representation? Do we just tell those users to get lost?
> > > >
> > > > It would be kind of weird if the default internal representation
> > > > that Eistrings dealt with was UCS-2 but UTF-8 representation was
> > > > available in buffers, which you don't rule out.
> > >
> > > by its nature, the default int. rep. must be able to represent
> > > all chars. that would rule out utf16 if we have more than
> > > 1,000,000 and some chars. but it doesn't rule out ucs4, or some
> > > utf16 extension that could encode gigs o' chars, etc.
> > >
> >
> > To clarify, UTF-16 can represent all characters in UCS-4. UTF-16,
> > just like UTF-8, breaks that annoying simplification that all
> > characters are fixed width. As a happy coincidence, the only
> > difference between UTF-16 and UCS-2 is knowing where the character
> > boundaries are. A UTF-16 encoding of a Unicode character (e.g.
> > U+000E0020) is itself two valid UCS-2 characters. This is what the
> > surrogate pair range in the Unicode code space is for.
> >
> > Making things completely Unicode aware isn't as easy as some people
> > think; have a gander at some of the stuff on www.unicode.org if you
> > haven't recently (esp. the technical reports). E.g., implementing a
> > regular expression engine that supports a good chunk of Unicode's
> > "features" is very non-trivial, especially if you don't want it to
> > take forever.
> >
> > Bill
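To make the surrogate-pair mechanics above concrete, here is a minimal
C sketch (to_surrogates is a hypothetical helper for illustration, not
XEmacs code) splitting a supplementary code point into the two 16-bit
units Bill describes; for his U+000E0020 example it yields DB40 DC20:

    #include <stdio.h>

    /* Split a supplementary code point (U+10000..U+10FFFF) into a
       UTF-16 surrogate pair.  Both halves land in the D800-DFFF
       range that UCS-2 reserves, which is why a UTF-16 stream is
       also a sequence of 16-bit UCS-2 units -- the only thing
       UTF-16 adds is knowing where the character boundaries are.
       (Hypothetical helper, not from this thread.) */
    static void to_surrogates(unsigned long cp,
                              unsigned short *hi, unsigned short *lo)
    {
        cp -= 0x10000;                                 /* 20 bits left */
        *hi = (unsigned short)(0xD800 | (cp >> 10));   /* top 10 bits  */
        *lo = (unsigned short)(0xDC00 | (cp & 0x3FF)); /* low 10 bits  */
    }

    int main(void)
    {
        unsigned short hi, lo;
        to_surrogates(0xE0020UL, &hi, &lo);
        printf("U+E0020 -> %04X %04X\n", hi, lo);      /* DB40 DC20 */
        return 0;
    }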
--
Ben
In order to save my hands, I am cutting back on my mail. I also write
as succinctly as possible -- please don't be offended. If you send me
mail, you _will_ get a response, but please be patient, especially for
XEmacs-related mail. If you need an immediate response and it is not
apparent in your message, please say so. Thanks for your understanding.