From: Ben Wing [mailto:ben@666.com]
I wrote this last night:
NOTE: One possible default internal representation that was compatible
with UTF-16 but allowed all possible chars in UCS-4 would be to take an
unused range of 2048 chars (not from the private area because Microsoft
actually uses up most or all of it with EUDC chars). Let's say we picked
4000 - 47FF. Then, we'd have:
0000 - FFFF                   Simple chars
D[8-B]xx D[C-F]xx             Surrogate char, represents 1M chars
4[0-7]xx D[C-F]xx D[C-F]xx    Surrogate char, represents 2G chars
This is exactly the same number of chars as UCS-4 handles, and it has the
same properties as UTF-8 and Mule-internal:
1. There are two disjoint groupings of units, one representing leading units
   and one representing non-leading units.
2. Given a leading unit, you immediately know how many units follow to make
   up a valid char, irrespective of any other context.
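
As a rough sketch of how those two properties play out, here is one way the
scheme could be coded in Python. The note above only fixes the unit ranges, so
the rest is assumed for illustration: the bit layout of the three-unit form
(11 bits in the 4[0-7]xx unit, then 10 bits in each trailing unit), the choice
to use that form only above 10FFFF, and the names encode, decode,
sequence_length and is_trailing.

```python
LEAD3_BASE = 0x4000   # hypothetical leading range 4000-47FF picked in the note
HIGH_BASE = 0xD800    # UTF-16 high surrogates D800-DBFF
TRAIL_BASE = 0xDC00   # non-leading units DC00-DFFF


def is_trailing(unit):
    """True for non-leading units (DC00-DFFF)."""
    return 0xDC00 <= unit <= 0xDFFF


def sequence_length(lead):
    """Number of units in the char that starts with this leading unit."""
    if is_trailing(lead):
        raise ValueError("not a leading unit: %04X" % lead)
    if 0x4000 <= lead <= 0x47FF:
        return 3
    if 0xD800 <= lead <= 0xDBFF:
        return 2
    return 1


def encode(cp):
    """Encode one code point (0 .. 2**31 - 1) as a list of 16-bit units."""
    if not 0 <= cp < 2**31:
        raise ValueError("outside the UCS-4 range")
    if cp < 0x10000:
        # The scheme assumes these BMP ranges hold no real characters.
        if 0x4000 <= cp <= 0x47FF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("code point collides with a reserved unit range")
        return [cp]
    if cp < 0x110000:
        # Ordinary UTF-16 surrogate pair.
        v = cp - 0x10000
        return [HIGH_BASE + (v >> 10), TRAIL_BASE + (v & 0x3FF)]
    # Assumed layout: 11 + 10 + 10 bits of the raw 31-bit value.
    return [
        LEAD3_BASE + (cp >> 20),
        TRAIL_BASE + ((cp >> 10) & 0x3FF),
        TRAIL_BASE + (cp & 0x3FF),
    ]


def decode(units, i=0):
    """Decode the char starting at units[i]; return (code point, next index)."""
    lead = units[i]
    n = sequence_length(lead)  # known from the leading unit alone
    tail = units[i + 1:i + n]
    if len(tail) != n - 1 or not all(is_trailing(u) for u in tail):
        raise ValueError("truncated or malformed sequence")
    if n == 1:
        return lead, i + 1
    if n == 2:
        cp = 0x10000 + ((lead - HIGH_BASE) << 10) + (tail[0] - TRAIL_BASE)
    else:
        cp = (((lead - LEAD3_BASE) << 20)
              | ((tail[0] - TRAIL_BASE) << 10)
              | (tail[1] - TRAIL_BASE))
    return cp, i + n
```

With this layout the three-unit form covers 2048 * 1024 * 1024 = 2**31 values,
and decode never has to look at anything before the leading unit.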
There isn't an empty block of 2048 chars in the BMP at the moment.
See http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n2213.pdf (dated 2000-03-28).
The biggest open block I noticed is U+0000A500-U+0000ABFF (1792 code points).
The next biggest open block looks like U+00010900-U+00010FFF.
After that it's U+00011200-U+00011FFF. Both of those are in Plane 1.
Plane 1 Roadmap:
http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n2214.pdf
By open I mean that there isn't even a submitted proposal about what should
actually be encoded there.
Bill