>>>> "Ben" == Ben Wing <ben(a)666.com>
writes:
Ben> so would you call them "char-class"? what's the api for
Ben> defining and using them?
"char-class" is pretty ugly, but it would do. "repertoire" is
probably far too opaque for non-I18N/font-geeks.
I guess I would treat them as sets, having add (one character), remove
(one character), complement, union, and intersection methods for Lisp
handling. Plus we would provide a low-level interface for on-demand
loading them from Unidata and/or character set translation databases,
and provide the names of the "usual suspect" char-classes (e.g.,
"Unicode word characters", "ISO 8859/2 character set") as
predefined standard classes. Then
  (char-class-intersection (get-char-class 'iso-8859-2-set)
                           (get-char-class 'unicode-word-characters-v4-0))
would give the obvious.
Check me on this, but it seems to me the efficiency of these methods
(the mutators) doesn't matter much compared to access, which is going to
have to be fast-fast-fast because it's called from redisplay all the time.
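Just to make the shape concrete, here's a toy sketch of the Lisp-level
operations. Every name below is made up, and the real representation
would presumably be a C-level char-table or bit vector so that the
membership test really is one lookup; a hash table just keeps the sketch
self-contained (complement is omitted, since over the whole character
space it really wants that char-table):

  (defun make-char-class ()
    (make-hash-table :test 'eql))

  (defun char-class-add (class char)
    (puthash char t class))

  (defun char-class-remove (class char)
    (remhash char class))

  ;; The hot path: a single table lookup, cheap enough for redisplay.
  (defun char-class-member-p (class char)
    (gethash char class))

  (defun char-class-union (a b)
    (let ((result (make-char-class)))
      (maphash (lambda (ch v) (puthash ch v result)) a)
      (maphash (lambda (ch v) (puthash ch v result)) b)
      result))

  (defun char-class-intersection (a b)
    (let ((result (make-char-class)))
      (maphash (lambda (ch v) (when (gethash ch b) (puthash ch v result))) a)
      result))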
Ben> well, `regexp-compile' is already in mule-trex.
Well, I'll be damned. For Mule coder code, it's remarkably regular looking.
Ben> maybe we should maintain the current strategy of caching,
Ben> since there's lots of this code out there?
Oh, yes. We surely do have to provide a convenience API for throwaway
scripts and other non-efficiency-critical applications. GvR says that
similar caching in Python saves like 85-95% of the runtime of stuff like
(while (not (eobp)) (re-search-forward "[Bb]en" nil 'move-if-fail))
BTW, Python caches 100 regexps. I suspect that re-intensive apps like
Gnus splitting and font-lock could benefit from an immediate bump in
the size of the regexp cache. I've been meaning to check if that's
instrumented for cache effectiveness (or instrument it myself),
but... *sigh*
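In case it helps the discussion, here's roughly what the
convenience-level cache could look like in Lisp. The `regexp-compile'
you mention is assumed to take the regexp string and return a compiled
object; everything else (the names, the 100-entry limit borrowed from
Python) is just illustrative:

  (defvar regexp-compile-cache nil
    "Alist mapping regexp strings to their compiled forms.")

  (defvar regexp-compile-cache-size 100
    "How many compiled regexps to keep around (Python keeps 100).")

  (defun regexp-compile-cached (regexp)
    "Return a compiled form of REGEXP, reusing a cached compilation if any."
    (let ((hit (assoc regexp regexp-compile-cache)))
      (if hit
          (cdr hit)
        (let ((compiled (regexp-compile regexp)))
          (push (cons regexp compiled) regexp-compile-cache)
          ;; Crude eviction: drop the oldest entry when we overflow.
          (when (> (length regexp-compile-cache) regexp-compile-cache-size)
            (setcdr (nthcdr (1- regexp-compile-cache-size)
                            regexp-compile-cache)
                    nil))
          compiled))))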
Ben> what do we call the new user-level commands
Ben> e.g. `new-isearch-forward-regexp'? `isearch-forward-newre'?
Yes, although I'd want to do some apropos-ing to see if there are nice,
concise, precise generic names we could use.
For example, with our current strategy of using Boyer-Moore for
trivial regexps, there's no need for separate string and re matching
and searching APIs. There's also no need for separate compiled re and
convenience functions, since the compiled re will not be stringp.
Similarly, since strings and buffers are different types, we can have
a single API for both and dispatch.
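Something like the following is what I have in mind: one entry point,
where the pattern argument decides whether we compile (through the
cache sketched above) or use it as given, and the object argument
decides buffer vs. string. `search-buffer' and `search-string' here
stand in for whatever the real primitives end up being:

  (defun search-forward* (pattern &optional object bound noerror)
    "Search for PATTERN (string or compiled regexp) in OBJECT (buffer or string)."
    (let ((re (if (stringp pattern)
                  ;; Convenience path; a trivial regexp can still go
                  ;; through Boyer-Moore inside the primitive.
                  (regexp-compile-cached pattern)
                pattern)))                  ; already compiled, not stringp
      (cond ((or (null object) (bufferp object))
             (search-buffer re (or object (current-buffer)) bound noerror))
            ((stringp object)
             (search-string re object bound noerror))
            (t (error "Not a buffer or string: %S" object)))))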
Ben> or have a global var. just for these commands?
What do you mean?
Ben> as for fonts, i'm not sure what is so wrong about a
Ben> char->font mapping,
> Han unification, as a representative example of a large class
> of similar issues (such as ISO 8859).
Ben> clarify?
On second thought, there's a compelling reason. With a char->font
mapping you'd need a chartable per language in use, for every face, in
order to disambiguate characters that are in more than one language's
repertoire (and it could get worse than that). With the language-based
approach, you need only one chartable per language, and it's global
under normal usage.
Ben> but how do i ensure that "traditional arabic" gets used for
Ben> arabic and "times new roman" for western?
Does Times New Roman really do Arabic? If so, is it really all that
ugly?
Japanese has the analogous case, of course. You simply set the
precedence to { nice font for English, nice font for Japanese }, and
since the repertoire for the English font doesn't cover Japanese, you
win. I think this will handle the majority of such issues.
Another possibility would be to filter the font's repertoire. I.e.,
something like:
  if we have a language tag:
      set font-priority-list = filter font-priority-list for language
  for font in font-priority-list:
      if character in font->XEmacs-allowed-repertoire
         and font has glyph for character:
          return font
So even if Times Roman has Arabic, we won't permit it.
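In Lisp that might look something like this, where
`face-font-priority-list', `filter-fonts-for-language',
`font-allowed-repertoire', and `font-has-glyph-p' are all invented
names for whatever hooks redisplay actually grows (and
`char-class-member-p' is the membership test from the char-class
sketch above):

  (defun choose-font-for-char (char face &optional language)
    "Return the first font in FACE's priority list allowed to display CHAR."
    (let ((fonts (face-font-priority-list face))
          (result nil))
      (when language
        (setq fonts (filter-fonts-for-language fonts language)))
      (while (and fonts (not result))
        (let ((font (car fonts)))
          (when (and (char-class-member-p (font-allowed-repertoire font) char)
                     (font-has-glyph-p font char))
            (setq result font)))
        (setq fonts (cdr fonts)))
      result))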
Ben> [[...]] are you assuming we can reliably store the language
Ben> with the string's text property, and key off of that first?
Yes. For almost all users, in most cases, documents will be
monolingual, and in the first go-round we will disambiguate German
from Croatian in the same way that we disambiguate ISO 8859-15 from
ISO 8859-2 now: query the environment and, if necessary, the user. The
second stage of development is to use the classifier techniques you've
advocated to autodetect the language.
For casually multilingual documents, we'll just use bland fallbacks
for characters from foreign sets. We should also provide a
mark-region, invoke-menu, select-language UI. For Unicode documents,
we should offer to save with Plane 14 language tags. For your second
term paper, you can extend the classifier techniques to detect
language changes in the buffer and mark them up automatically (or
maybe that's your PhD thesis? :-)
Serious multilingual users will have their own ways of marking
language in stored documents; we need to provide an API that makes it
easy to parse those and mark up the buffer.
Ben> does a font object then contain a language->font mapping?
You mean face objects, not font objects, right? Anyway, yes. For
monolingual and casual multilingual use, probably we can just inherit
from the default face.
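Concretely, one possible shape (again, every name here is invented) is
a language->font alist hung off each face, with lookup falling through
to the default face and then to the face's ordinary font:

  (defun face-font-for-language (face language)
    "Return the font the face named FACE should use for LANGUAGE."
    (or (cdr (assq language (get face 'language-font-alist)))
        (cdr (assq language (get 'default 'language-font-alist)))
        (face-font face)))

  ;; A casual multilingual setup might only customize the default face:
  ;;   (put 'default 'language-font-alist
  ;;        '((japanese . "some nice Japanese font")
  ;;          (arabic   . "Traditional Arabic")))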
--
School of Systems and Information Engineering
http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.