Thanks for the URLs -- any other interesting ones?
Bill Tutt wrote:
> From: Ben Wing [mailto:ben@666.com]
>
> I wrote this last night:
>
>
> NOTE: One possible default internal representation that was compatible
> with UTF16 but allowed all possible chars in UCS4 would be to take an
> unused range of 2048 chars (not from the private area because Microsoft
> actually uses up most or all of it with EUDC chars). Let's say we picked
> 4000 - 47FF. Then, we'd have:
>
> 0000 - FFFF                  Simple chars
>
> D[8-B]xx D[C-F]xx            Surrogate char, represents 1M chars
>
> 4[0-7]xx D[C-F]xx D[C-F]xx   Surrogate char, represents 2G chars
>
> This is exactly the same number of chars as UCS-4 handles, and it follows
> the same property as UTF8 and Mule-internal:
>
> 1. There are two disjoint groupings of units, one representing leading
>    units and one representing non-leading units.
> 2. Given a leading unit, you immediately know how many units follow to
>    make up a valid char, irrespective of any other context.
>
>
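The scheme quoted above can be sketched in a few lines of Python. This is only an illustration, not anything from XEmacs: the function names are made up, the 4000-47FF lead range is just the example range from the note, and the bit layout of the three-unit form (11 bits from the lead unit, 10 bits from each trailing unit, read directly as the UCS-4 value) is an assumption, since the note does not spell it out.

```python
# Sketch of the proposed UTF-16-compatible encoding (hypothetical names).
LEAD3_FIRST, LEAD3_LAST = 0x4000, 0x47FF   # assumed 3-unit lead range
HIGH_FIRST, HIGH_LAST = 0xD800, 0xDBFF     # UTF-16 high surrogates
TRAIL_FIRST, TRAIL_LAST = 0xDC00, 0xDFFF   # trailing (non-leading) units


def is_trail(unit):
    return TRAIL_FIRST <= unit <= TRAIL_LAST


def unit_count(lead):
    """Units in the char starting at `lead`, decided from `lead` alone."""
    if is_trail(lead):
        raise ValueError("not a leading unit: %04X" % lead)
    if HIGH_FIRST <= lead <= HIGH_LAST:
        return 2
    if LEAD3_FIRST <= lead <= LEAD3_LAST:
        return 3
    return 1


def encode(cp):
    """Encode one UCS-4 value (0 .. 7FFFFFFF) as a list of 16-bit units."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("outside UCS-4 range")
    if cp < 0x10000:
        # Values that double as lead or trailing units cannot be simple chars.
        if HIGH_FIRST <= cp <= TRAIL_LAST or LEAD3_FIRST <= cp <= LEAD3_LAST:
            raise ValueError("reserved unit value: %04X" % cp)
        return [cp]
    if cp < 0x110000:
        v = cp - 0x10000                     # ordinary UTF-16 surrogate pair
        return [HIGH_FIRST + (v >> 10), TRAIL_FIRST + (v & 0x3FF)]
    # Assumed layout: 11 + 10 + 10 bits hold the raw UCS-4 value.
    return [LEAD3_FIRST + (cp >> 20),
            TRAIL_FIRST + ((cp >> 10) & 0x3FF),
            TRAIL_FIRST + (cp & 0x3FF)]


def decode(units):
    """Decode a list of 16-bit units back into UCS-4 values."""
    out, i = [], 0
    while i < len(units):
        n = unit_count(units[i])
        chunk = units[i:i + n]
        if len(chunk) != n or not all(is_trail(u) for u in chunk[1:]):
            raise ValueError("truncated or malformed char at index %d" % i)
        if n == 1:
            cp = chunk[0]
        elif n == 2:
            cp = 0x10000 + ((chunk[0] - HIGH_FIRST) << 10) + (chunk[1] - TRAIL_FIRST)
        else:
            cp = (((chunk[0] - LEAD3_FIRST) << 20)
                  | ((chunk[1] - TRAIL_FIRST) << 10)
                  | (chunk[2] - TRAIL_FIRST))
        out.append(cp)
        i += n
    return out
```

For example, decode(encode(0x7FFFFFFF)) gives back [0x7FFFFFFF], and unit_count() never needs to look past the leading unit, which is property 2 in the note.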
There isn't an empty block of 2048 code points in the BMP at the moment.
See
http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n2213.pdf
(dated 2000-03-28)
The biggest open block I noticed is U+0000A500-U+0000ABFF.
The next biggest open block looks like U+00010900-U+00010FFF.
After that it's U+00011200-U+00011FFF. Both of those are in Plane 1.
Plane 1 Roadmap:
http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n2214.pdf
By "open" I mean that there isn't even a submitted proposal about what should
actually be encoded there.
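As a quick check of the arithmetic, here is a small Python sketch that computes the size of each block listed above and tests whether it could serve as the 2048-unit lead range. The helper names are made up for illustration, and the requirement that the range sit in the BMP is my reading of the proposal (a lead unit has to be a single 16-bit value).

```python
# Open blocks named above, as (first, last) code points.
BLOCKS = [(0xA500, 0xABFF), (0x10900, 0x10FFF), (0x11200, 0x11FFF)]
NEEDED = 2048  # size of the lead range the proposal asks for


def block_size(first, last):
    """Number of code points in the inclusive range first..last."""
    return last - first + 1


def in_bmp(first, last):
    """True if the whole range fits in 16 bits (Plane 0)."""
    return last <= 0xFFFF


def usable_as_lead_range(first, last):
    """A lead range must be in the BMP and hold at least NEEDED values."""
    return in_bmp(first, last) and block_size(first, last) >= NEEDED


for first, last in BLOCKS:
    print("U+%08X-U+%08X: %d code points, BMP=%s, usable=%s" % (
        first, last, block_size(first, last),
        in_bmp(first, last), usable_as_lead_range(first, last)))
```

The only BMP block in the list holds 1792 code points, short of the 2048 needed.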
Bill
--
Ben
In order to save my hands, I am cutting back on my mail. I also write
as succinctly as possible -- please don't be offended. If you send me
mail, you _will_ get a response, but please be patient, especially for
XEmacs-related mail. If you need an immediate response and it is not
apparent in your message, please say so. Thanks for your understanding.
See also
http://www.666.com/ben/typing.html.