>>>>> "Ben" == Ben Wing <ben(a)666.com> writes:
Ben> -- [2] charsets should be generalized so that they encompass
Ben> an arbitrary set of unicode chars.
Once we have Unicode inside, charsets as currently understood make no
sense at all. It's not a generalization, it's something completely
different.
Don't worry about it, nobody outside of the Mule Lab and Kyoto
University will notice, let alone care.
Of course we will need character classes for regular expressions and
to describe font repertoires. But let's not call them charsets; not
only does that clash with Mule, it clashes with MIME usage too.
Ben> -- [3] we should add unicode-compatible charsets. the names
Ben> should be such that they programmatically map onto
Ben> perl-compatible (used with regexp \p, see below) charset
Ben> names.
Just use abstract character classes and have a standard table mapping
names to predefined classes. In other words, I don't think this
requires design, we Just Do It when we need it.
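Concretely, the kind of table I have in mind might look like this (a Python sketch; the class names and codepoint ranges are purely illustrative, real tables would be generated from the Unicode database):

```python
# Illustrative only: map class names to sorted lists of inclusive
# codepoint ranges.  A real implementation would derive these from
# the Unicode Character Database rather than hand-typing them.
CHAR_CLASSES = {
    'ascii':    [(0x0000, 0x007F)],
    'latin-1':  [(0x0000, 0x00FF)],
    'greek':    [(0x0370, 0x03FF)],
    'hiragana': [(0x3040, 0x309F)],
}

def in_class(name, ch):
    """True if character CH belongs to the named class."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CHAR_CLASSES[name])

in_class('greek', 'α')      # -> True
in_class('ascii', 'α')      # -> False
```

The point being that once the table exists, "charset" names are just keys into it; nothing else in the design has to know about them.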
Ben> -- [4] the perl regexp \p syntax should be adopted for
Ben> referencing charsets.
Uh, what syntax is that? And what is perl? :-)
Ben> the big problem here is \( and (, which are backwards.
I assume you mean all of the things we have that are backwards (like
|,{,} at least). But who cares? Regexps in use are by their nature
compiled objects. I don't see why we can't have multiple languages
that target that virtual machine.
We absolutely need an Emacs Lisp-compatible compiler, of course.
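To illustrate the "multiple front ends, one virtual machine" idea: here's a toy translator (Python, naive by design; it ignores escaped backslashes and everything but the grouping/alternation operators) that takes the Emacs spelling, where \( \) \| \{ \} are special and the bare characters are literals, and retargets the stock PCRE-ish engine:

```python
import re

def emacs_to_pcre(pattern: str) -> str:
    """Swap special/literal roles of ( ) { } | between Emacs and
    PCRE syntax.  Toy sketch: does not handle escaped backslashes."""
    out = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == '\\' and i + 1 < len(pattern) and pattern[i + 1] in '(){}|':
            out.append(pattern[i + 1])      # \( -> (  : make it special
            i += 2
        elif c in '(){}|':
            out.append('\\' + c)            # ( -> \(  : make it literal
            i += 1
        else:
            out.append(c)
            i += 1
    return ''.join(out)

# Emacs-syntax source, compiled for a PCRE-style engine:
compiled = re.compile(emacs_to_pcre(r'\(foo\|bar\) (baz)'))
m = compiled.match('foo (baz)')             # m.group(1) -> 'foo'
```

The compiled object neither knows nor cares which surface syntax it came from, which is exactly the property we want.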
Ben> the only reasonable solutions i can see are [a] a global variable
Ben> to control which kinds of regexps are used;
That won't work, we don't have namespaces/packages to protect
unsuspecting legacy Elisp libraries from PCREs. We're going to need
to provide multiple interfaces if we accept multiple syntaxes. Note
that Python used to use the PCRE library, but they trashed that years
ago in favor of a Python implementation. I'm not sure we could afford
that for stuff like font-lock, but surely we don't have to limit
ourselves to PCRE or rx or whatever else happens to have a C library
in the experimental stage.
Going back to the "regexps are compiled objects" theme, isn't it time
we gave up on the creaky JIT-compilation-and-caching strategy with
global match objects? Of course we want an interface that allows
one-shot regexps to be specified as strings for convenience, but
backwards compatibility (where you can also spell backwards G-N-U) is
the only reason not to return match objects.
Yes, Python is the "right" model here, because this is LISP. We
_can't_ provide Perl-style syntactic support for the global match
object, so we may as well return first-class objects instead of
booleans or indices that we can only use by feeding them to another
function anyway.
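For the record, the Python model looks like this (this is the stock `re` module, real API, nothing hypothetical):

```python
import re

# Compile once; the compiled pattern is a first-class object you hold
# a reference to, not an entry in a global JIT cache keyed on the
# source string.
pat = re.compile(r'(?P<key>\w+)=(?P<val>\w+)')

# Searching returns a match object (or None) instead of setting global
# match data; all the results travel with the object itself.
m = pat.search('color=red; size=10')
if m:
    print(m.group('key'), m.group('val'))   # -> color red
    print(m.span())                          # -> (0, 9)
```

Two threads, or two libraries, can each hold their own match object without stepping on each other, which is precisely what global match data can't give us.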
Ben> -- [5] char tables need to be changed. their current
Ben> implementation is heavily tied to the current mule character
Ben> structure. we will also need to change `map-char-table'.
Ben> unfortunately, this will be incompatible with its current
Ben> workings. fortunately, only three packages use
Ben> map-char-table, and from looking at these three, it's not
Ben> clear anything will break. also, the new map-char-table will
Ben> work like GNU Emacs.
I don't think this is anything to worry about.
Ben> -- [6] at some point, font objects should be changed to
Ben> include a char table so that different charsets can have
Ben> different fonts.
Huh? You can't win that way, it's a multivalued mapping in both
directions. Not to mention fundamentally a non-algorithmic issue. We
should stop thinking in terms of predefined charsets, period. We need
to think in terms of a user-specified default repertoire, and the
repertoire of the current document. (Of course we should make a
standard set of repertoires available by name for user convenience,
but that can come on an as-needed or as-the-fit-takes-someone basis.)
BTW I'm in the process of converting everything in the Xft branch to
use the fontconfig API for internal representation of font
_instantiators_; I think it's nuts to do anything else. It's not
perfect, but it can represent any external name (MS Windows, Mac,
XLFD) and any internal set of properties. Of course instantiated
fonts will use the platform representation, but I don't see any good
reason why we shouldn't translate "Courier-12:weight=bold",
"Courier:12:bold", and "-*-courier-bold-r-*-*-*-120-*-*-*-*-*-*" all
to the same font regardless of what platform we're on.
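To show what I mean, a deliberately naive normalizer (Python; the field handling is a toy, real fontconfig and XLFD parsing cover many more properties) that takes those three spellings to one property set:

```python
def parse_font_name(name):
    """Toy sketch: normalize three external font-name syntaxes to one
    property dict.  Assumes well-formed input; illustrative only."""
    if name.startswith('-'):                           # XLFD
        f = name.split('-')
        # -foundry-family-weight-slant-...-pixelsize-pointsize*10-...
        return {'family': f[2], 'weight': f[3], 'size': int(f[8]) // 10}
    if ':' in name and '=' in name:                    # fontconfig pattern
        base, props = name.split(':', 1)               # Courier-12:weight=bold
        family, _, size = base.partition('-')
        d = dict(p.split('=') for p in props.split(':'))
        return {'family': family.lower(),
                'weight': d.get('weight', 'medium'), 'size': int(size)}
    parts = name.split(':')                            # ad hoc: Courier:12:bold
    return {'family': parts[0].lower(), 'weight': parts[2],
            'size': int(parts[1])}

for n in ('Courier-12:weight=bold', 'Courier:12:bold',
          '-*-courier-bold-r-*-*-*-120-*-*-*-*-*-*'):
    print(parse_font_name(n))   # all three -> same property dict
```

Once everything is reduced to properties like that, handing the result to the platform to instantiate is the platform's problem, not the name syntax's.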
We can also use the fontconfig library itself to handle listing,
sorting, and filtering on some platforms, but that's not what I'm
talking about here. Getting that for free in Xft is of course a bonus
from my point of view, but it also meets your "vector representation"
desideratum.
Ben> also, in [5] we have an implementation choice. either we use
Ben> sorted ranges or we use page-table-style lookups, [...]
Ben> should we allow the user to control this, with range tables
Ben> the default?
Does it matter? What are the programmatic applications for these
things?
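For concreteness, the two strategies side by side (a Python sketch with made-up data; the real thing would be C, but the shapes are the same):

```python
import bisect

# Strategy 1: sorted non-overlapping ranges, binary-search lookup.
# Compact when the repertoire is a few dense runs (ISO 8859-ish).
ranges = [(0x0000, 0x007F, 'ascii'), (0x0370, 0x03FF, 'greek')]
starts = [lo for lo, _, _ in ranges]

def range_lookup(cp):
    i = bisect.bisect_right(starts, cp) - 1
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return None

# Strategy 2: two-level page table, constant-time lookup; pays one
# 256-entry page for every 256-codepoint block actually touched.
pages = {}

def page_set(cp, val):
    pages.setdefault(cp >> 8, [None] * 256)[cp & 0xFF] = val

def page_lookup(cp):
    page = pages.get(cp >> 8)
    return page[cp & 0xFF] if page else None

page_set(0x3042, 'hiragana')        # あ
```

Which is the lookup cost we care about; for the ISO 8859-style user the range table is tiny and the binary search short, while the Han/Hangul quarter-plane case fills pages densely enough that the page table stops being wasteful.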
In terms of what kind of character classes to expect, with the
exception of Han and Hangul users, basically I think we can expect
mostly things isomorphic to ISO 8859: a half page for ASCII, and a
full page (at most) for other national characters, with everything
else in Unicode a "don't-care". On the other hand, for the Han and
Hangul users we can often expect on the order of a quarter-plane (16K)
of usage due to the arbitrary (i.e., phonetic and stroke-count)
arrangements of those subsets within Unicode. Neither of those
subsets has much locality of reference as far as I know.
--
School of Systems and Information Engineering   http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                      Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.