>>>> "Ben" == Ben Wing <ben(a)666.com> writes:
Ben> well, basically i want some api that says "convert a charset
Ben> and codepoints into a Lisp char" and vice-versa "get a
Ben> charset and codepoints from a Lisp char". keep in mind that
Ben> we will be running both unicode-internal and old-mule for
Ben> awhile,
FWIW, IMO nobody is going to run unicode-internal before font handling
and language support are in, except for the sake of testing
unicode-internal itself. So we only need to support legacy Mule stuff
for the applications testers are actually using, on request from those
testers. That will give us a much better idea of how much backward
compatibility we should support, and how.
Ben> on the other hand, int-to-char and char-to-int always give
Ben> you the actual int that makes up the character; not very
Ben> portable.
But with Unicode inside, we simply don't care, because characters have
no internal structure. Sure, users will find uses for inspecting the
internal representation (e.g., pure curiosity), so we may as well
provide int-to-char and char-to-int. (And of course Unicode code
points are perfectly portable.)
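A sketch of the expected behavior under unicode-internal (not a spec;
the point is only that the character's integer *is* its code point):

```elisp
;; With Unicode as the internal representation, char-to-int simply
;; yields the Unicode code point, and int-to-char inverts it.
(char-to-int ?A)                 ; ASCII coincides with Unicode: 65
(int-to-char #xFC)               ; U+00FC LATIN SMALL LETTER U WITH DIAERESIS
(eq (int-to-char (char-to-int ?A)) ?A)  ; the round trip is the identity
```

Since the integer is just a Unicode scalar value, it means the same
thing on every platform and in every build, which is exactly the
portability make-char/split-char were trying to paper over.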
I cannot think of a single use case (excluding legacy Mule
compatibility) for the charset-oriented functions.
Start with the core code. There is no use of `split-char' and exactly
one use of `make-char' in 21.5/lisp. There are a few of each in
21.5/lisp/mule. The majority of uses are those in about.el,
european.el, greek.el, and hebrew.el. Those reflect a long-standing
bug: the lack of a proper facility for reading non-ASCII characters in
Lisp code, which will obviously be fixed by standardizing on Unicode
(presumably UTF-8) for XEmacs source code.
Of the rest, I doubt that any of the language-specific uses will
remain in unicode-internal. Although I haven't checked carefully,
there is reason for confidence: since these are language-specific
files, each refers to a single character set, so only the octets are
of interest and the charset can be left implicit. Then surely we can
use tables.
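For instance, the european.el syntax-table loops could be written
charset-free with a per-file octet-to-Unicode table. This is only a
sketch of the approach; the table name, its (truncated) contents, and
the generation step are all hypothetical:

```elisp
;; Hypothetical: a vector mapping Latin-2 octets #xA0.. to Unicode
;; code points, generated once from the standard mapping tables.
;; Only the first three entries are shown here, for illustration.
(defconst latin-2-high-to-unicode
  ;; #xA0 -> U+00A0 NBSP, #xA1 -> U+0104 A-ogonek, #xA2 -> U+02D8 breve
  (vector #x00A0 #x0104 #x02D8))

(loop for c from #xA0 to #xA2    ; the real loop would run to #xFF
      do (modify-syntax-entry
          (int-to-char (aref latin-2-high-to-unicode (- c #xA0)))
          "w"))
```

No charset object appears anywhere; the table carries all the
information `make-char' was being used for.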
That leaves 5 uses. Of those 5, the two in mule-category.el can be
trivially and portably replaced by int-to-char. The use in latin.el
can and should be replaced by reference to the Unicode case conversion
tables. Which leaves two from mule-charset.el. Deprecate the notion
of charset, as I propose, and we have *no* uses of split-char or
make-char left in core.
lisp/about.el: (let ((muenchen (format "M%cnchen" (make-char 'latin-iso8859-1 252))))
21.5/lisp/mule/cyrillic.el: (split (split-char ch)))
21.5/lisp/mule/cyrillic.el: (split (split-char ch)))
21.5/lisp/mule/devan-util.el: (let ((charcodes (split-char char)))
21.5/lisp/mule/devan-util.el: (make-char 'indian-2-column ?\x21 (nth 1 charcodes))
21.5/lisp/mule/devan-util.el: (let ((charcodes (split-char char)))
21.5/lisp/mule/devan-util.el: (make-char 'indian-is13194 (nth 2 charcodes))
21.5/lisp/mule/ethio-util.el: (make-char
21.5/lisp/mule/ethio-util.el: (let ((char-components (split-char ch)))
21.5/lisp/mule/european.el: do (modify-syntax-entry (make-char 'latin-iso8859-14 c) "w"))
21.5/lisp/mule/european.el: (modify-syntax-entry (make-char 'latin-iso8859-14 code) "_"))
21.5/lisp/mule/european.el: do (modify-syntax-entry (make-char 'latin-iso8859-16 c) "w"))
21.5/lisp/mule/european.el: (modify-syntax-entry (make-char 'latin-iso8859-16 code) "_"))
21.5/lisp/mule/european.el:(modify-syntax-entry (make-char 'latin-iso8859-16 #xa5) "\"")
21.5/lisp/mule/european.el:(modify-syntax-entry (make-char 'latin-iso8859-16 #xb5) "\"")
21.5/lisp/mule/european.el:(modify-syntax-entry (make-char 'latin-iso8859-16 #xab)
21.5/lisp/mule/european.el: (format "(%c" (make-char 'latin-iso8859-16 #xbb)))
21.5/lisp/mule/european.el:(modify-syntax-entry (make-char 'latin-iso8859-16 #xbb)
21.5/lisp/mule/european.el: (format ")%c" (make-char 'latin-iso8859-16 #xab)))
21.5/lisp/mule/european.el: (make-char 'latin-iso8859-15 c)
21.5/lisp/mule/european.el: (string (char-syntax (make-char 'latin-iso8859-1 c)))))
21.5/lisp/mule/european.el:(modify-syntax-entry (make-char 'latin-iso8859-16 #xab)
21.5/lisp/mule/european.el: (format "(%c" (make-char 'latin-iso8859-16 #xbb)))
21.5/lisp/mule/european.el:(modify-syntax-entry (make-char 'latin-iso8859-16 #xbb)
21.5/lisp/mule/european.el: (format ")%c" (make-char 'latin-iso8859-16 #xab)))
21.5/lisp/mule/european.el: do (modify-syntax-entry (make-char 'latin-iso8859-2 c) "w"))
21.5/lisp/mule/european.el:(modify-syntax-entry (make-char 'latin-iso8859-2 32) "w") ; no-break space
21.5/lisp/mule/european.el: do (modify-syntax-entry (make-char 'latin-iso8859-3 c) "w"))
21.5/lisp/mule/european.el:(modify-syntax-entry (make-char 'latin-iso8859-3 32) "w") ; no-break space
21.5/lisp/mule/european.el: do (modify-syntax-entry (make-char 'latin-iso8859-4 c) "w"))
21.5/lisp/mule/european.el:(modify-syntax-entry (make-char 'latin-iso8859-4 32) "w") ; no-break space
21.5/lisp/mule/greek.el: do (modify-syntax-entry (make-char 'greek-iso8859-7 c) "w"))
21.5/lisp/mule/hebrew.el: do (modify-syntax-entry (make-char 'hebrew-iso8859-8 c) "w"))
21.5/lisp/mule/hebrew.el:(modify-syntax-entry (make-char 'hebrew-iso8859-8 32) "w") ; no-break space
21.5/lisp/mule/latin.el: (put-case-table-pair (make-char charset uc) (make-char charset lc) tbl))))
21.5/lisp/mule/mule-category.el: (setq found (make-char 'ascii a)))
21.5/lisp/mule/mule-category.el: (setq list (cons (make-char 'ascii a) list)))
21.5/lisp/mule/mule-charset.el: (from-rev (nreverse (split-char from)))
21.5/lisp/mule/mule-charset.el: (to-rev (nreverse (split-char to))))
21.5/lisp/mule/thai-xtis.el: to (nth 1 (split-char (aref chars i))))
21.5/lisp/mule/thai-xtis.el: (setq ch (nth 1 (split-char (aref chars i)))
21.5/lisp/mule/vietnamese.el: (split-char (aref viet-viscii-decode-table i)))
21.5/lisp/mule/vietnamese.el: (split-char (aref viet-vscii-decode-table i)))
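To make the "trivially and portably replaced" claim concrete for the
two mule-category.el hits above (same variables as in the grep; just a
sketch of the rewrite):

```elisp
;; ASCII code points coincide with Unicode, so the charset argument
;; carries no information:
;;   (make-char 'ascii a)   ==>   (int-to-char a)
(setq found (int-to-char a))
(setq list (cons (int-to-char a) list))
```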
How about mule-base? Not much here, either. char-table.el is a
utility for describing a Mule charset. The language-specific uses can
presumably be replaced by simple tables. mule-keyboard.el is an
interface to terminals whose keyboards send ISO 2022 encodings.
I don't think we can justify charsets on the basis of mule-base, either.
mule-base/char-table.el: (make-char charset plane (+ (* r 16) l))
mule-base/char-table.el: (make-char charset (+ (* r 16) l))
mule-base/ethio-util.el: (make-char
mule-base/ethio-util.el: (let ((char-components (split-char ch)))
mule-base/mule-keyboard.el: (keyboard-self-insert (make-char charset last-command-char))
mule-base/mule-keyboard.el: (keyboard-self-insert (make-char charset _keyboard-first-byte_
mule-base/mule-keyboard.el: (keyboard-self-insert (make-char 'japanese-jisx0201-kana last-command-char)))
mule-base/thai-xtis-util.el: (make-char 'thai-xtis)))
How about charset precedence for Unicode translation?
Ben> first step would be to create a language object and set up
Ben> properties on it, such as the charset precedence list for
Ben> unicode translation.
Display doesn't need it. Windows, Quartz, and Xft all support Unicode
font indices. Just give the font engine the preferred font (which
will vary by language in a natural way) and the Unicode code point.[1]
Text output doesn't need it at all, because the coding system will be
specified, and it will specify character sets.
Charsets are not used for text input encoding detection as implemented
in current Mule. We detect _coding categories_ based on presence or
absence of ISO 2022 control functions, plus looking at the high bits
of octets to detect UTF-8, Shift-JIS, and Big5. Coding systems (and
thus charsets) are not detected; they are configured as properties of
the coding categories. Maybe we _should_ use character set
information, but we currently don't, and I don't see why `charset'
needs to be exposed to Lisp at this point.
Anyway, for both input and output, specifying a coding system achieves
everything (and more) that specifying charsets could, and is no more
confusing (especially since users in general confound coding systems
and character sets).
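Concretely, the existing coding-system API already covers both
directions without ever naming a charset (a sketch; the string bound
to external-data just stands in for bytes read from a file or process):

```elisp
;; Decoding: the coding system alone determines which charsets the
;; decoder produces internally; the caller never mentions one.
(let ((external-data "plain ASCII survives any ISO 2022 decoding"))
  (decode-coding-string external-data 'iso-2022-jp))

;; Encoding: likewise, only the coding system is specified.
(encode-coding-string "some text" 'utf-8)
```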
What am I missing?
Footnotes:
[1] Sure, there are legacy fonts, especially on X11, where we quite
likely will have to supply a registry in order to locate a usable
font. But in those cases, given a font family, all registries that
contain a given character will almost always map it to the same
glyph. So just take the first one. Anything left can
probably be handled ad hoc as a per-font user customization in the
face. (I have xfontsel in mind as the interface: get the registries
using XListFonts, then give the user a menu for the given font. Add
an XEmacs-specific alist mapping X registries to descriptions that
will be more useful to typical users.)
--
School of Systems and Information Engineering
http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.