RE: proposed Eistring interface

Sunday, 23 April 2000

...
 From: Ben Wing [mailto:ben＠666.com]

 I wrote this last night:

 NOTE: One possible default internal representation that was compatible
 with UTF16 but allowed all possible chars in UCS4 would be to take an
 unused range of 2048 chars (not from the private area because Microsoft
 actually uses up most or all of it with EUDC chars).  Let's say we picked
 4000 - 47FF.  Then, we'd have:

 0000 - FFFF    Simple chars

 D[8-B]xx D[C-F]xx  Surrogate char, represents 1M chars

 4[0-7]xx D[C-F]xx D[C-F]xx   Surrogate char, represents 2G chars

 This is exactly the same number of chars as UCS-4 handles, and it follows
 the same property as UTF8 and Mule-internal:

 1. There are two disjoint groupings of units, one representing leading
    units and one representing non-leading units.
 2. Given a leading unit, you immediately know how many units follow to
    make up a valid char, irrespective of any other context.
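
To make the quoted scheme concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not anything from XEmacs: the names (unit_count, encode, decode) are hypothetical, 4000 - 47FF is the example range picked in the note, and the 11 + 10 + 10 bit split for the three-unit form is my own guess at how the 2G values would be laid out. I also assume that code points which collide with the reserved range or with the surrogate ranges are written in the three-unit form.

```python
# Hypothetical lead-unit range from the note (4000 - 47FF); an assumption.
RESERVED_LO, RESERVED_HI = 0x4000, 0x47FF


def is_trailing(unit):
    """Non-leading units are exactly D[C-F]xx."""
    return 0xDC00 <= unit <= 0xDFFF


def unit_count(lead):
    """Property 2: the leading unit alone gives the length of the char."""
    if is_trailing(lead):
        raise ValueError("not a leading unit")
    if RESERVED_LO <= lead <= RESERVED_HI:
        return 3  # 4[0-7]xx D[C-F]xx D[C-F]xx
    if 0xD800 <= lead <= 0xDBFF:
        return 2  # D[8-B]xx D[C-F]xx, as in UTF-16
    return 1      # simple char


def encode(cp):
    """Encode a code point in 0 .. 0x7FFFFFFF as a list of 16-bit units."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit UCS-4 range")
    if 0x10000 <= cp <= 0x10FFFF:
        v = cp - 0x10000
        return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]
    collides = RESERVED_LO <= cp <= RESERVED_HI or 0xD800 <= cp <= 0xDFFF
    if cp < 0x10000 and not collides:
        return [cp]
    # Assumed layout: 11 bits in the lead unit, 10 bits in each trailing unit.
    return [RESERVED_LO + (cp >> 20),
            0xDC00 + ((cp >> 10) & 0x3FF),
            0xDC00 + (cp & 0x3FF)]


def decode(units):
    """Decode exactly one char from a list of units."""
    n = unit_count(units[0])
    if len(units) != n or not all(is_trailing(u) for u in units[1:]):
        raise ValueError("malformed sequence")
    if n == 1:
        return units[0]
    if n == 2:
        return 0x10000 + ((units[0] - 0xD800) << 10) + (units[1] - 0xDC00)
    return (((units[0] - RESERVED_LO) << 20)
            | ((units[1] - 0xDC00) << 10)
            | (units[2] - 0xDC00))
```

Under these assumptions, encode(0x7FFFFFFF) gives the units 47FF DFFF DFFF, and decode() recovers the original value from them.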

There isn't an empty block of 2048 code points in the BMP at the moment.
See http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n2213.pdf
(dated 2000-03-28)

The biggest open block I noticed is U+0000A500-U+0000ABFF.
The next biggest open block looks like U+00010900-U+00010FFF.
After that it's U+00011200 - U+00011FFF. Both of which are in Plane 1.
Plane 1 Roadmap: http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n2214.pdf

By open I mean that there isn't even a submitted proposal about what should
actually be encoded there.
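
For reference, a small Python sketch of the block-size arithmetic for the ranges listed above. The helper name block_size is hypothetical; the ranges are the ones named in this message, and only the first of them lies in the BMP.

```python
NEEDED = 2048  # size of the lead-unit range the proposal asks for


def block_size(first, last):
    """Number of code points in the inclusive range first..last."""
    return last - first + 1


open_blocks = [
    (0xA500, 0xABFF),    # BMP
    (0x10900, 0x10FFF),  # Plane 1
    (0x11200, 0x11FFF),  # Plane 1
]

sizes = [block_size(first, last) for first, last in open_blocks]

for (first, last), size in zip(open_blocks, sizes):
    print(f"U+{first:08X}-U+{last:08X}: {size} code points")
```

The BMP range comes to 1792 code points, which is short of the 2048 needed.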

Bill
