Re: changing the values of iso-8859-* charsets

Friday, 4 November 2005

        ...
>>>> "Ben" == Ben Wing <ben(a)666.com> writes:
    Ben> yet your response is just "sorry, don't have time", along
    Ben> with further sarcasm.  do you not see the persistent sarcasm
    Ben> in comments like "you obviously don't really want the help
    Ben> that's available"?

No sarcasm, just my reading based on some unpleasant facts.  You keep
saying I should shut up until I read the code.  Currently that's an
18,000 line patch, which is only going to get longer.  Given the time
I'm willing to put in, we're talking months.  Will you wait?

Now, about emacs-unicode, what is it you want to know?  It looks like
most of the relevant API stuff is in src/character.h (sort of similar
to 21.5's text.h), charset.h, and coding.h.  The DEFUNs are in the
corresponding C files.  A diff against current CVS head didn't show
many interesting differences in a few spot checks (eg, subr.el and
simple.el), but the raw diff is about 350000 lines.  Pruning that to
35000 lines will be easy, but it would help to have some idea what I'm
looking for.

A quick skim suggests that they're using a strategy similar to
Morioka's UTF-2000, changing the internal representation in the
buffer, and converting to standard Mule charset and octets
representation when doing display and stream I/O.  For example, the
interface to fonts is still fontsets, which as far as I know are only
slightly abstracted from XFontSets; I saw nothing that suggests that
Unicode is going to get special treatment in accessing fonts.

They do now have an abstract character type.  I can't say much more
than that yet, but they do have a whole set of characterp etc. APIs.

Although the internal representation is basically UTF-8, there are
weirdities.  As you know, there are a number of "holes" in UTF-8,
because legal UTF-8 must be the shortest representation.  Eg, although
algorithmically UTF-8 0xC0 0xA4 converts to 0x24, or "$", that's not a
legal UTF-8 representation; it must be simply 0x24.  So the first
"hole" is the two-byte sequences 0xC0 0xNN and 0xC1 0xNN (NN from 0x80
to 0xBF), 128 in all, which exactly cover the 128 octets with the 8th
bit set in a natural way.
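
To make the "hole" concrete, here is a minimal Python sketch of the
purely algorithmic two-byte decoding, with no shortest-form check.
The function names are made up for illustration and are not taken
from the Emacs sources.

```python
def decode_two_byte(lead: int, trail: int) -> int:
    """Decode a two-byte UTF-8-style sequence algorithmically.

    Only the bit patterns are checked (110xxxxx 10xxxxxx); the
    shortest-form rule is deliberately NOT enforced here.
    """
    if lead & 0xE0 != 0xC0 or trail & 0xC0 != 0x80:
        raise ValueError("not a two-byte lead/trail pattern")
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


def is_overlong_two_byte(lead: int, trail: int) -> bool:
    """True if the sequence decodes to something that fits in one byte."""
    return decode_two_byte(lead, trail) < 0x80


# All two-byte sequences that fall into the "hole".
HOLE = [(lead, trail)
        for lead in range(0xC0, 0xE0)
        for trail in range(0x80, 0xC0)
        if is_overlong_two_byte(lead, trail)]

cp = decode_two_byte(0xC0, 0xA4)
print(hex(cp), chr(cp))             # 0x24 $
print(len(HOLE))                    # 128
print({hex(l) for l, _ in HOLE})    # only the leads 0xc0 and 0xc1
```

A conforming decoder refuses these sequences; for instance Python's
own bytes([0xC0, 0xA4]).decode("utf-8") raises UnicodeDecodeError.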

Emacs is going to use them, but I'm not sure what those "8-bit
characters" are for.  I'm guessing binary, but I'll let you know more
when I've looked closer.
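
If the guess about binary is right, one natural way to park a raw
octet with the high bit set in that hole would look like the sketch
below.  This is my assumption about what such a mapping could be, not
something read out of the Emacs code, and the function names are
hypothetical.

```python
def raw_byte_to_hole(b: int) -> bytes:
    """Map a raw octet 0x80..0xFF to a two-byte sequence in the hole.

    Assumed scheme: bit 6 of the octet selects the lead byte
    (0xC0 or 0xC1) and the low six bits go into the trail byte.
    """
    if not 0x80 <= b <= 0xFF:
        raise ValueError("only octets with the 8th bit set are mapped")
    return bytes([0xC0 | ((b >> 6) & 0x01), 0x80 | (b & 0x3F)])


def hole_to_raw_byte(seq: bytes) -> int:
    """Inverse of raw_byte_to_hole under the same assumed scheme."""
    if len(seq) != 2:
        raise ValueError("expected exactly two bytes")
    lead, trail = seq
    if lead not in (0xC0, 0xC1) or trail & 0xC0 != 0x80:
        raise ValueError("sequence is not in the 0xC0/0xC1 hole")
    return 0x80 | ((lead & 0x01) << 6) | (trail & 0x3F)


print(raw_byte_to_hole(0xA4).hex())                   # c0a4
print(raw_byte_to_hole(0xFF).hex())                   # c1bf
print(hex(hole_to_raw_byte(bytes([0xC1, 0x80]))))     # 0xc0
```

Under this scheme every octet from 0x80 to 0xFF round-trips, and no
valid UTF-8 text can collide with it, since a strict decoder never
produces or accepts a 0xC0 or 0xC1 lead byte.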

They're also using the 47 "high planes" from U+00110000 to U+00FFFFFF
for something.  I suspect they're taking another page from Morioka's
book and stuffing all the national standard (ie, Mule) character sets
in there.  I'm not sure why they're restricted to 24 bits; it may have
something to do with their Lisp_Object layout.

I haven't looked at regexps yet.

Superficially, except for the use of denormalized UTF-8, it looks like
what you've described for XEmacs, except adapted to GNU Emacs.

I suppose you've seen that much for yourself, but it's useful to me to
write it down.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
