Re: Char-related crashes (hopefully) fixed

Wednesday, 17 November 1999

        Can somebody give a bunch of examples where using integers as
characters is useful?  For that matter, where they are actually used?
Ben said "backward compatibility," but I haven't seen this used, and I
don't really know how to grep for it.  I have grepped for int-char,
int-to-char, char-int, and char-to-int and they're pretty rare in the
core and package code (2/3 of it) that I have.

The only one that I ever use is the C-q hack for inserting characters
by code value at the keyboard, and that could arguably be (and in
Japanese invariably is) delegated to an input method, which would know
about the language environment (and return a true character).

For iterating over a character set in "natural" order, only ASCII
satisfies the requirement of having one, and even that's shaky.  AFAIK
the Swedes and the Norwegians, or is it the Danes, disagree on
ordering the _letters_ in the ISO-8859-1 character set.  This really
should be table-driven, and will have to be for everything except
ASCII and ISO-8859-1 if we go to a Unicode internal representation.
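To make "table-driven" concrete, here is a minimal sketch in Python (not
Lisp, purely for illustration).  The two tail tables and the make_key
helper are hypothetical names, and the tables cover only the letters
that follow "z" in each alphabet.  It shows that code-value order
matches neither ordering.

```python
# Hypothetical collation tables: only the letters that follow "z",
# listed in each alphabet's own order.
SWEDISH_TAIL = "\u00e5\u00e4\u00f6"  # a-ring, a-diaeresis, o-diaeresis
DANISH_TAIL = "\u00e6\u00f8\u00e5"   # ae, o-slash, a-ring


def make_key(tail):
    """Return a sort key that ranks a-z first, then the tail letters."""
    alphabet = "abcdefghijklmnopqrstuvwxyz" + tail
    rank = {ch: i for i, ch in enumerate(alphabet)}
    # Anything missing from the table sorts after everything in it.
    return lambda ch: rank.get(ch, len(alphabet))


swedish_letters = ["\u00e4", "z", "\u00e5", "a"]
by_code = sorted(swedish_letters, key=ord)
by_swedish = sorted(swedish_letters, key=make_key(SWEDISH_TAIL))

danish_letters = ["\u00e5", "\u00e6", "\u00f8"]
by_danish = sorted(danish_letters, key=make_key(DANISH_TAIL))

print(by_code, by_swedish, by_danish)
```

Sorting by integer value puts a-diaeresis before a-ring, while the
Swedish table puts a-ring first; the Danish table moves a-ring to the
very end.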

We already have primitives for efficient case conversion and the like.

The only example I can think of offhand where you would really really
want the facility is to iterate over a code space where you don't know
which points are legal characters.  Eg, to print out tables of fonts.
Pretty specialized.  And this can be done through make-char, anyway.
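A rough sketch of that font-table case, in Python as a stand-in for the
Lisp one would really write.  The choice of cp1252 is only an
assumption to get a code space with holes, and legal_points is a
hypothetical helper: it probes each integer and keeps the ones that
decode to a character.

```python
def legal_points(codec, start, stop):
    """Map each code in [start, stop) that decodes to its character."""
    table = {}
    for code in range(start, stop):
        try:
            table[code] = bytes([code]).decode(codec)
        except UnicodeDecodeError:
            # Not a legal character in this code space; skip it.
            pass
    return table


# cp1252 is an arbitrary example of a code space with unassigned points.
table = legal_points("cp1252", 0x80, 0x100)
holes = [code for code in range(0x80, 0x100) if code not in table]

print([hex(code) for code in holes])
```

The loop itself is trivial; all of the knowledge about which points are
legal lives in the decoder, not in the integers.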

According to CLtL1, the main portable use for char-int is for hashing.
But that doesn't square with the kind of usage we've been talking
about (in loops and the like).

What else am I missing?

Ben's desiderata have some problems.

...
>>>>> "Ben" == Ben Wing <ben(a)666.com> writes:
    Ben> Either int-char should be the mirror opposite of char-int
    Ben> (i.e. accept all legal char integers), or it should be
    Ben> removed entirely.

OK.  I agree with this.

    Ben> int-char should *never* have any dependence on the language
    Ben> environment.

In one-byte buffers (either Olivier's 1/2/4 extension or `xemacs -font
*-iso8859-2') it implicitly will have dependence whatever you say.
Even without Mule, people can always use external encoders to change
raw ISO-8859-2 to ISO-2022 (not that anybody sane ever would, OK,
Hrvoje?).  Then the two files will be interpreted differently in a
Latin-1 locale Mule; the ISO-8859-2 file will be recognized as
ISO-8859-1, and the ISO-2022 file will be internally interpreted as
ISO-8859-2.

The point is that people normally assume that int-char should accept
their "natural" integer to character map.  For Americans, that's
ASCII, for Germans, that's ISO-8859-1, for Croatians, that's
ISO-8859-2.  And it works "correctly" in a no-mule XEmacs with `-font
*-iso8859-2'!  Japanese usually use ku-ten or JIS, and there's a
"natural" map from byte-sized integer pairs to shorts, but it's full
of holes.  So language environments don't agree on what a legal char
integer is, and where they do (eg, ISO-8859-1 and ISO-8859-2), they
don't agree on the map.  To satisfy your dictum (with which I agree,
but I take to mean we should get rid of these functions) we can take
the intersection where they agree

==> legal char integers == ASCII

which is what I prefer, or pick something arbitrary and efficient

==> char-int returns the internal representation

which I really hate, or something else.  Suggestions?
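For what it's worth, the disagreement between just two of these maps is
easy to check.  This is a minimal sketch in Python (its codec names
stand in for the language environments; disagreements is a hypothetical
helper), and it compares only ISO-8859-1 with ISO-8859-2, not every
environment mentioned above.

```python
def disagreements(codec_a, codec_b):
    """List the byte values that the two codecs decode differently."""
    differing = []
    for code in range(256):
        raw = bytes([code])
        if raw.decode(codec_a) != raw.decode(codec_b):
            differing.append(code)
    return differing


diff = disagreements("iso8859-1", "iso8859-2")

# The same integer names two different letters once it is above ASCII.
as_latin1 = bytes([0xE8]).decode("iso8859-1")  # e with grave
as_latin2 = bytes([0xE8]).decode("iso8859-2")  # c with caron

print(len(diff), as_latin1, as_latin2)
```

Every disagreement lies in the upper half; the two maps are identical
on 0 through 127, which is where the "intersection" argument comes from.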

    Ben> I don't think C-q should either.  If Hrvoje wants to insert
    Ben> Latin-2 characters by number, then make C-u C-q work so that
    Ben> it also prompts for a character set, with a default chosen
    Ben> from the language environment.

And restrict this to ASCII?  Or assume Latin-1 in GR if there is no
prefix argument?

This is a useful feature.  C-q currently inserts Latin-2 characters
for Hrvoje in no-mule XEmacs (stretching the point only a little); I
think it should continue to do so in Mule.  This really is an input
method issue, not a keyboard issue.  In XEmacs, inserting an integer
into a buffer has no meaning.  Users insert characters.  So this is a
completely different issue from the programming API, and should not be 
considered analogous.

Maybe we could have C-q insert according to the Unicode standard, and
treat C-u C-q as part of the input method.  But I think most users
would prefer to have C-q insert according to their locale-standard
tables, and select Unicode explicitly using the C-u C-q idiom.  In
fact (again this points to the input method idea), Japanese users
would probably like to have the alternatives of using kuten (pairs
from 1--94 x 1--94) or JIS (pairs from 0x21--0x7E x 0x21--0x7E) as
options since both indexing systems are common in tables.
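The two indexing systems differ only by an offset of 0x20 per byte, so
an input method could accept either.  A minimal sketch in Python, with
hypothetical helper names; it assumes JIS X 0208 and leans on Python's
euc_jp codec (JIS bytes with the high bit set) to get a character back.

```python
def kuten_to_jis(ku, ten):
    """Convert a ku-ten pair (1..94, 1..94) to a JIS pair (0x21..0x7E)."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must both be in 1..94")
    return ku + 0x20, ten + 0x20


def jis_to_char(hi, lo):
    """Decode a JIS X 0208 byte pair by way of its EUC-JP form."""
    # EUC-JP is the JIS pair with the high bit set on each byte.
    # Unassigned points raise UnicodeDecodeError: the map has holes.
    return bytes([hi | 0x80, lo | 0x80]).decode("euc_jp")


hi, lo = kuten_to_jis(16, 1)  # ku 16, ten 1: the first level-1 kanji
print(hex(hi), hex(lo), jis_to_char(hi, lo))
```

Either way the user ends up with a character, never a bare integer in
the buffer.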

-- 
University of Tsukuba                Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences       Tel/fax: +81 (298) 53-5091
__________________________________________________________________________
__________________________________________________________________________
What are those two straight lines for?  "Free software rules."
