Stephen J. Turnbull wrote:
>>>>> "Ben" == Ben Wing <ben(a)666.com> writes:
Ben> a lot of what you say here is confusing so i'm going to make
Ben> some general points and ask some questions.
Ben> we need a way of representing named character classes.
Which is precisely what I said, using my words to boot. You talked
about "generalized charsets".
so would you call them "char-class"? what's the api for defining and
using them?
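to make the question concrete, here's one possible shape for such an api -- every name in this sketch (`define-char-class', `char-class-member-p') is made up, just to have something to argue about:

```elisp
;; Hypothetical API sketch -- none of these functions exist yet.
;; A named character class is defined once, then usable anywhere a
;; charset is accepted today (char tables, regexps, etc.).
(define-char-class 'arabic-letters
  '(union (range #x0621 #x064A)      ; basic Arabic letters
          (range #x0671 #x06D3)))    ; extended Arabic letters

;; membership test:
(char-class-member-p 'arabic-letters #x0627)   ; Arabic ALEF

;; and, once the regexp engine knows about named classes,
;; something like "[[:arabic-letters:]]+" in patterns.
```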
Ben> what api for new regexps would you propose?
1. Regexp object type, a first-class Lisp object with room for lots
of attributes (although it could be a vector, hash, or list rather
than a new object type).
2. There's a family of regexp-compile methods with signature
(regexp-compile REGEXP SYNTAX) which convert the string REGEXP to a
regexp object according to SYNTAX. Initially we should support GNU
Elisp syntax and PCRE syntax.
3. The search and match (aka looking-at) functions return match
objects, which are first-class Lisp objects with properties. They
need to be named to avoid collision with existing Elisp functions.
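spelled out, points 1-3 might look like this -- `new-re-search-forward'
is the name from point 4, the two-argument accessors are hypothetical:

```elisp
;; Sketch only.  The regexp object and the match object are both
;; first-class Lisp objects; nothing here touches global match data.
(setq re (regexp-compile "\\w+" 'pcre))   ; compile once, per syntax

(setq m (new-re-search-forward re))       ; returns a match object

;; accessors take the match object explicitly:
(match-start m 0)                         ; start of whole match
(match-end m 0)                           ; end of whole match
```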
well, `regexp-compile' is already in mule-trex. don't ask me what the
fuck that library does. maybe we should just nuke it.
4. There is a (deprecated) global match variable, say global-match.
For backward compatibility we define
(defun re-search-forward (regexp &optional syntax)
  (setq global-match
        (new-re-search-forward (regexp-compile regexp
                                               (or syntax 'legacy-elisp))))
  (match-end global-match 0))
etc.
maybe we should maintain the current strategy of caching, since there's
lots of this code out there?
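the current caching strategy could carry over almost unchanged -- a
sketch, assuming `regexp-compile' from point 2 above (the cache
variable and wrapper are made-up names):

```elisp
;; Hypothetical compile cache keyed on (STRING . SYNTAX), so the
;; legacy string-taking entry points stay cheap on repeated calls.
(defvar regexp-compile-cache (make-hash-table :test 'equal))

(defun regexp-compile-cached (regexp syntax)
  (let ((key (cons regexp syntax)))
    (or (gethash key regexp-compile-cache)
        (puthash key (regexp-compile regexp syntax)
                 regexp-compile-cache))))
```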
what do we call the new user-level commands e.g.
`new-isearch-forward-regexp'? `isearch-forward-newre'? or have a global
var. just for these commands?
>> Does it matter? What are the programmatic applications for
>> these things?
Ben> what do you mean?
Now that you've made it clear that you're not talking about charsets,
you're talking about what Elisp implements via "syntax tables", I'm
not confused any more.
Ben> it does matter because with the wrong implementation we
Ben> either [a] take a humongous amount of space or [b]
Ben> potentially make our regexps slower than they should be.
Yeah, sure. "Premature optimization is the root of all evil." Since
it's pretty clear that both methods will work and can be encapsulated
in the same API, pick one, get it working, and we'll deal with
optimization and/or user preference later.
Ben> as for fonts, i'm not sure what is so wrong about a
Ben> char->font mapping,
Han unification, as a representative example of a large class of
similar issues (such as ISO 8859).
clarify?
Ben> the intention is that `put-char-table' can take a character
Ben> class as well as a single character, and sets a value for
Ben> that whole class. this seems quite natural to me -- usually,
Ben> you want to specify the e.g. "Traditional Arabic" font for
Ben> Arabic characters, the e.g. "MS Mincho" font for Japanese,
Ben> etc.
Natural, and wrong for multilingual documents, which are precisely the
ones where it matters. We should map language -> font, and then check
the font repertoire for the character and have fallbacks. Pretty much
as we currently do.
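a sketch of that lookup order, with made-up accessor names throughout
(none of these functions exist; this is just the shape of the idea):

```elisp
;; Hypothetical font selection: language first, then check the
;; chosen font's repertoire, then fall back per character.
(defun font-for-char (ch lang)
  (let ((font (language-preferred-font lang)))   ; lang -> font map
    (if (and font (font-has-char-p font ch))     ; repertoire check
        font
      ;; character not covered: fall back to any font known
      ;; to contain CH rather than rendering a hollow box
      (find-fallback-font ch))))
```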
but how do i ensure that "traditional arabic" gets used for arabic and
"times new roman" for western? both will have both western and arabic
chars in them, but any other combination looks terrible. furthermore,
"times new roman" is missing random subsets of many unicode blocks,
which would lead to even worse font horrifitude.
are you assuming we can reliably store the language with the string's
text property, and key off of that first?
does a font object then contain a language->font mapping?
If you want a cache mapping characters to known good fonts, that's
another matter, but I think that's more premature optimization.