Re: unicode internal

Wednesday, 5 October 2005

        ...
>>>> "Ben" == Ben Wing <ben(a)666.com> writes:
    Ben> a lot of what you say here is confusing so i'm going to make
    Ben> some general points and ask some questions.

    Ben> we need a way of representing named character classes.

Which is precisely what I said, using my words to boot.  You talked
about "generalized charsets".

    Ben> what api for new regexps would you propose?

1.  Regexp object type, a first class Lisp object with room for lots
of attributes (although it could be a vector, hash, or list rather
than a new object type).

2.  There's a family of regexp-compile methods with signature
(regexp-compile REGEXP SYNTAX) which convert the string REGEXP to a
regexp object according to SYNTAX.  Initially we should support GNU
Elisp syntax and PCRE syntax.

3.  The search and match (aka looking-at) functions return match
objects, which are first class Lisp objects with properties.  They
need to be named to avoid collision with existing Elisp functions.

4.  There is a (deprecated) global match variable, say global-match.

For backward compatibility we define

(defun re-search-forward (regexp &optional syntax)
  (setq global-match
        (new-re-search-forward (regexp-compile regexp
                                               (or syntax 'legacy-elisp))))
  (match-end global-match 0))

etc.
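
A minimal sketch of how points 1 to 3 might fit together, under stated
assumptions: every sketch-* name below is hypothetical and exists in no
Emacs, the regexp object and the match object are modelled as plain
property lists rather than new Lisp types, and only the legacy Elisp
syntax is handled (by delegating to the existing re-search-forward).

```elisp
;; Hypothetical sketch: none of the sketch-* functions exist in any Emacs.
;; A regexp object is a plist; a match object is a plist holding a private
;; copy of the match data, so it does not depend on global match state.

(defun sketch-regexp-compile (regexp syntax)
  "Return a regexp object for the string REGEXP read according to SYNTAX.
Only the symbol `legacy-elisp' is supported in this sketch."
  (unless (eq syntax 'legacy-elisp)
    (error "Unsupported regexp syntax: %S" syntax))
  (list :source regexp :syntax syntax))

(defun sketch-re-search-forward (regexp-object)
  "Search forward from point for REGEXP-OBJECT.
Return a match object, or nil if there is no match.  Point moves to the
end of the match, but the global match data is left untouched."
  (save-match-data
    (when (re-search-forward (plist-get regexp-object :source) nil t)
      ;; (match-data t) returns integer positions instead of markers.
      (list :regexp regexp-object :data (match-data t)))))

(defun sketch-match-beginning (match n)
  "Return the start position of group N in match object MATCH."
  (nth (* 2 n) (plist-get match :data)))

(defun sketch-match-end (match n)
  "Return the end position of group N in match object MATCH."
  (nth (1+ (* 2 n)) (plist-get match :data)))

;; Usage: the match is a value that can be passed around.
(with-temp-buffer
  (insert "foo 123 bar")
  (goto-char (point-min))
  (let ((m (sketch-re-search-forward
            (sketch-regexp-compile "[0-9]+" 'legacy-elisp))))
    (buffer-substring (sketch-match-beginning m 0)
                      (sketch-match-end m 0))))
;; => "123"
```

The accessors carry a sketch- prefix only so they can be loaded next to
the existing match-beginning and match-end without redefining them.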

...
> Does it matter?  What are the programmatic applications for
> these things? 
    Ben> what do you mean?

Now that you've made it clear that you're not talking about charsets,
you're talking about what Elisp implements via "syntax tables", I'm
not confused any more.

    Ben> it does matter because with the wrong implementation we
    Ben> either [a] take a humongous amount of space or [b]
    Ben> potentially make our regexps slower than they should be.

Yeah, sure.  "Premature optimization is the root of all error."  Since
it's pretty clear that both methods will work and can be encapsulated
in the same API, pick one, get it working, and we'll deal with
optimization and/or user preference later.

    Ben> as for fonts, i'm not sure what is so wrong about a
    Ben> char->font mapping,

Han unification, as a representative example of a large class of
similar issues (such as ISO 8859).

    Ben> the intention is that `put-char-table' can take a character
    Ben> class as well as a single character, and sets a value for
    Ben> that whole class.  this seems quite natural to me -- usually,
    Ben> you want to specify the e.g. "Traditional Arabic" font for
    Ben> Arabic characters, the e.g. "MS Mincho" font for Japanese,
    Ben> etc.

Natural, and wrong for multilingual documents, which are precisely the
ones where it matters.  We should map language -> font, and then check
the font repertoire for the character and have fallbacks.  Pretty much
as we currently do.
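
A rough sketch of the language -> font lookup with a repertoire check
and fallbacks, under stated assumptions: the sketch-* names and both
tables are hypothetical, the font names are the ones from the quoted
text plus a placeholder "fallback" entry, the ranges are illustrative
rather than real font coverage, and characters are passed as integer
code points.

```elisp
;; Hypothetical data for illustration only; not real font coverage.
(defvar sketch-font-repertoires
  '(("MS Mincho" (#x3040 . #x30ff) (#x4e00 . #x9fff))
    ("Traditional Arabic" (#x0600 . #x06ff))
    ("fallback" (#x0000 . #x10ffff)))
  "Alist mapping a font name to the code point ranges it can display.")

(defvar sketch-language-fonts
  '((japanese "MS Mincho" "fallback")
    (arabic "Traditional Arabic" "fallback"))
  "Alist mapping a language to its fonts, in order of preference.")

(defun sketch-font-covers-p (font char)
  "Return non-nil if FONT's repertoire contains the code point CHAR."
  (let ((ranges (cdr (assoc font sketch-font-repertoires)))
        (found nil))
    (while (and ranges (not found))
      (setq found (and (>= char (car (car ranges)))
                       (<= char (cdr (car ranges))))
            ranges (cdr ranges)))
    found))

(defun sketch-font-for (language char)
  "Return the first font preferred for LANGUAGE that can display CHAR.
Return nil if LANGUAGE is unknown or none of its fonts covers CHAR."
  (let ((candidates (cdr (assq language sketch-language-fonts)))
        (result nil))
    (while (and candidates (not result))
      (when (sketch-font-covers-p (car candidates) char)
        (setq result (car candidates)))
      (setq candidates (cdr candidates)))
    result))

;; The same unified Han code point resolves differently per language.
(sketch-font-for 'japanese #x4e00)   ; => "MS Mincho"
(sketch-font-for 'arabic #x4e00)     ; => "fallback"
```

Because the lookup is keyed on the language first, a character shared
between languages is not tied to a single font the way it would be in
a direct char -> font table.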

If you want a cache mapping characters to known good fonts, that's
another matter, but I think that's more premature optimization.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
