Re: Please test (and comment on) Euro support (ISO 8859/15)

Monday, 21 January 2002

        Second go-round.  Actually kinda functional now.  Two known breakages:
the new letters added to Latin-9 are not yet set to word-constituent
in the syntax table.  And the skip-chars optimization makes Japanese
JIS-Roman == ASCII, which is not quite true.  Both matter only rarely
(except maybe for Finns?), so I've postponed the fixes to the next
release (which will probably be a package for 21.4 and 21.1, and an
addition to 21.5 core, unless bogosity is uncovered).

Users of non-Latin, non-Unicode scripts should avoid it for now.  It
will interfere with your work.  (It only knows about Latin and
universal coding systems, and it only allows you to save in coding
systems it knows about.)  This also will be fixed in the next round
(_I_ need it fixed for Japanese).

People who _don't_ need the Euro and only ever use one Latin coding
system should notice no change except in performance (and ASCII-only
buffers should see no performance degradation); please test.

There may be performance issues.  I expect this _will_ be noticeable
for monoscript users on files >100kB.  The only optimization that has
been done is to use `skip-chars-forward' to slide on by runs of ASCII
characters.  Therefore I expect a performance degradation proportional
to the fraction of non-ASCII characters in the buffer.  It was quite
noticeable in an all-ASCII buffer of 160kB before I added that
optimization.  Suggestions for improvement would be appreciated (don't
say code it in C; that is indeed planned for 21.5, but not acceptable
for 21.4 or 21.1).
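
To illustrate the optimization described above, here is a minimal
sketch in Python (not the package's Lisp).  The regular expression
stands in for `skip-chars-forward', and the function name
non_ascii_positions is hypothetical.  Per-character work happens only
at non-ASCII positions, which is why the cost should scale with the
fraction of non-ASCII characters.

```python
import re

# One maximal run of ASCII characters; this plays the role of
# (skip-chars-forward "\000-\177") in this sketch.
ASCII_RUN = re.compile(r"[\x00-\x7f]*")


def non_ascii_positions(text):
    """Return indices of non-ASCII characters, skipping ASCII runs in bulk."""
    positions = []
    pos = 0
    while pos < len(text):
        # Slide past the ASCII run (possibly empty) starting at pos.
        pos = ASCII_RUN.match(text, pos).end()
        if pos < len(text):
            # Only here would a per-character table lookup be needed.
            positions.append(pos)
            pos += 1
    return positions
```

An all-ASCII string is consumed by a single match, so the loop body
runs once regardless of length.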

This is a standalone Lisp package; it can be used with 21.4 or 21.5.4
(which don't support ISO 8859/15 yet) or with the 21.5 current CVS
(which does support ISO 8859/15).

Do try M-x latin-unity-test RET; it includes some useful documentation
and some forms you can C-x C-e to set things up.  (I don't yet have
the nerve to hook the latin-unity-sanity-check function by default.)

In this version, I have split it up into

http://turnbull.sk.tsukuba.ac.jp/Tools/XEmacs/latin-unity.el
http://turnbull.sk.tsukuba.ac.jp/Tools/XEmacs/latin-unity-tables.el
http://turnbull.sk.tsukuba.ac.jp/Tools/XEmacs/latin-unity-utils.el
http://turnbull.sk.tsukuba.ac.jp/Tools/XEmacs/latin-9-prefix.el

Hacker's guide:

latin-unity defines iso-8859-15 if necessary (it's currently broken---
syntax table needs to make the new letters into word constituents---
but usable for testing), and provides the run-time functionality.

The guts is the char-table in latin-unity-tables, which maps each
character to a vector of [0] feasible character sets (integer as
bitvector), [1-8]
translations of the character to other Latin sets (including ASCII and
JIS X 0201), and [9] the Unicode code point (for debugging purposes).
This is basically static data, and only needs to be changed if I
screwed up the data entry for the Unicode tables (or himi did in
Mule-UCS).
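
As a rough picture of one entry in that table, here is a sketch in
Python (not the package's Lisp).  The charset names, their order, and
the helper make_entry are assumptions made for illustration; only the
slot layout ([0] bitvector, [1-8] translations, [9] Unicode code
point) comes from the description above.

```python
# Hypothetical names and order for the eight translation slots [1-8];
# the real package defines its own set and order.
CHARSETS = ("ascii", "jisx0201-roman", "latin-1", "latin-2",
            "latin-3", "latin-4", "latin-5", "latin-9")

# One bit per charset, used in slot [0].
BIT = {name: 1 << i for i, name in enumerate(CHARSETS)}


def make_entry(ucs, codes):
    """Build a 10-slot entry from a Unicode code point and a mapping
    of charset name -> code of the character in that charset."""
    entry = [0] + [None] * len(CHARSETS) + [ucs]
    for name, code in codes.items():
        entry[0] |= BIT[name]                    # mark charset feasible
        entry[1 + CHARSETS.index(name)] = code   # store the translation
    return entry


# The Euro sign is at 0xA4 in ISO 8859/15 and absent from the others.
EURO = make_entry(0x20AC, {"latin-9": 0xA4})
```

Here EURO[0] has only the (hypothetical) latin-9 bit set, EURO[8]
holds 0xA4, and EURO[9] is the code point U+20AC.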

latin-unity-utils exists only to write the char-table, and that's the
only place where Mule-UCS is required.

latin-9-prefix.el is Dave Love's Quail input method for ISO 8859/15.
It's a standard Quail Latin method; if you use other Latin prefix
methods it will feel natural.  Requires the XEmacs leim package to be
installed.

Compile, stick 'em on your load-path, and go.

...
>>>> "Hrvoje" == Hrvoje Niksic <hniksic(a)arsdigita.com> writes:
    Hrvoje> What a cool hack!

Why, thanks!

    Hrvoje> But I think it's a good demonstration (and perhaps a good
    Hrvoje> basis?)

Yes, and yes, I think.  Replacing Mule-UCS lookups with a char-table
and optimizing ASCII traversal with skip-chars-forward gives "you
don't notice it" performance for me, but I don't have any large files
with significant amounts of non-ASCII Latin right to hand to test on.
I have reason to believe that for most European languages it will be
acceptably quick, though.

I think the UI is OK.  It's better than GNU Emacs's, which simply
presents an obnoxious list of broken ISO-8859+ISO-2022-extension
coding systems (i.e., exactly what you hate).  I tried a Google Groups
search for Dave Love's unity package.  All that turned up was

    Subject: Latin-9 (euro) support for Emacs 20
    From: d.love(a)dl.ac.uk (Dave Love)
    Date: 04 Jan 2002 19:58:44 +0000
    Message-ID: <rzqy9jddh1n.fsf(a)djlvig.dl.ac.uk>

which is just the Quail input method for ISO 8859/15.  (Of course I
intend to sync that, no point in having a slightly different input
method, but it's not related to the issue of Latin sanity.)

    Hrvoje> It's a shame that it's based on Mule-UCS,

It's not.  All it uses from Mule-UCS is char-to-ucs.  I think Ben's
workspace provides this functionality, as do Tomo's ucs-conv and
UTF-2000, and it would be easy to hack up something to parse the
www.unicode.org tables themselves (except that I'm pretty sure Ben
already has this, that's why I used Mule-UCS).  The only usage has now
been split out into an auxiliary file.

    Hrvoje> Perhaps it will work better in Ben's Unicode-aware Mule
    Hrvoje> workspace?

I don't think that has anything to do with it; I think I just need to
recode the representability check in C (actually, probably the current
code would work acceptably quickly with the obvious changes if we had
a skip-category-forward function for category tables analogous to
skip-syntax-forward).  All the Unicode parts are now precomputed in
the latin-unity-equivalences char-table.
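
One way to picture a representability check built on per-character
bitvectors is to AND them together over the buffer: whatever bits
survive name the character sets that can hold every character.  The
sketch below is in Python, uses a toy three-charset table, and is an
assumption about the general technique, not a transcription of the
package's code.

```python
# Toy bit assignments and table for illustration only.
LATIN_1, LATIN_2, LATIN_9 = 1, 2, 4
ALL = LATIN_1 | LATIN_2 | LATIN_9

FEASIBLE = {
    "\u00e9": ALL,       # e with acute: present in all three sets
    "\u0142": LATIN_2,   # l with stroke: only Latin-2 among these
    "\u20ac": LATIN_9,   # Euro sign: only Latin-9 among these
}


def feasible_charsets(text):
    """Return the bitmask of charsets able to represent all of text."""
    mask = ALL
    for ch in text:
        if ord(ch) < 0x80:
            continue                   # ASCII fits every Latin set
        mask &= FEASIBLE.get(ch, 0)    # unknown characters fit none
        if mask == 0:
            break                      # nothing left to rule out
    return mask
```

For example, a buffer holding "é" and the Euro sign leaves only the
LATIN_9 bit, while one mixing the Euro sign with "ł" leaves no bits,
meaning no single set in this toy table will do.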

UI and API still need work.  API-wise, I'm concerned that we need to
hook it in in more places than `write-region'.  I fixed one massive
bug: the code I distributed earlier never actually translates from one
Latin set to another.  (Got bit by leaving to eat dinner with the code
unfinished, and when I came back and tested it "worked" because Mule
as usual wrote an ISO-2022 file and happily read it back in producing
a correct-looking display.  :-P)

In the UI, there are some subtle issues, like whether the buffer
should be changed (currently it is changed to the target encoding) or
left as-is.

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
              Don't ask how you can "do" free software business;
              ask what your business can "do for" free software.
