Re: unicode internal

Thursday, 6 October 2005

Hrvoje Niksic wrote:

...
Ben Wing <ben(a)666.com> writes:

>-- [1] at some point, use extent properties to track the language of a
>text.  this is well-recognized.

I'm a bit hazy on the concept of tracking "language".  How is that
supposed to work, exactly?  I mean, a word processor can do it because
it has a chance to save its markup when saving the document.  Emacs
works, in most cases, with bare characters, or with charset (not
language) annotations, as is the case with coding cookies or with Gnus
processing MIME messages.

 a language tag is placed on the text as a string extent property and 
(ideally) follows the text around.  where the tag comes from depends on 
how the text arrived.  if it was typed in the current session, it comes 
from the user's language environment setting in combination with the 
charset of the character.  if it was read from some charset-specific 
encoding, it comes from the language of the charset, if derivable.

this is mostly relevant to CJK languages; unicode unifies them, but the 
preferred fonts and input methods are different.  for western languages, 
the mapping from charset to language is fuzzier; but we *don't* simply 
want to use some "charset" tag, because then we make artificial 
distinctions between iso-2022-jp, shift-jis and euc-jp, for example.
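
As a rough illustration of the two sources of language information described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the table entries, the language names, and the function names are hypothetical and do not come from any XEmacs code. It only shows why the tag is a language rather than a charset: the three Japanese encodings collapse to one value.

```python
# Hypothetical table: encoding name -> language. Entries are illustrative only.
ENCODING_LANGUAGE = {
    "iso-2022-jp": "japanese",
    "shift-jis": "japanese",
    "euc-jp": "japanese",
    "euc-kr": "korean",
    "big5": "chinese-traditional",
    "gb2312": "chinese-simplified",
}

# Languages whose Han characters are unified in Unicode (per the table above).
CJK_LANGUAGES = set(ENCODING_LANGUAGE.values())


def language_from_encoding(encoding):
    """Language implied by a charset-specific encoding, or None if not derivable."""
    return ENCODING_LANGUAGE.get(encoding.lower())


def is_unified_han(char):
    """True for the basic CJK Unified Ideographs block; extension blocks are ignored."""
    return 0x4E00 <= ord(char) <= 0x9FFF


def language_for_typed_char(char, language_environment):
    """Combine the user's language environment with the character being typed.

    A unified Han character typed in a CJK environment inherits that
    environment's language; anything else gets no tag in this sketch.
    """
    if is_unified_han(char) and language_environment in CJK_LANGUAGES:
        return language_environment
    return None
```

With this sketch, text decoded from iso-2022-jp, shift-jis, or euc-jp all carries the same "japanese" tag, so font and input-method selection need not care which encoding the bytes arrived in.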

unicode also has the possibility of embedding a language tag in it 
(forced down the unicode consortium's throat by the japanese), but it is 
highly dispreferred.

...

>-- [4] the perl regexp \p syntax should be adopted for referencing
>charsets. (char categories just suck.) for that matter, we should move
>in the direction of being as perl-compatible as possible with our
>regexps, since that is where the world is going. (cf java, python,
>ruby, c#, ...)

It's true that the world is moving to Perl-compatible regexps.  Note,
however, that everyone chooses a subset they like -- implementing the
whole thing is next to impossible.  Also note that Perl itself is
moving *away* from Perl regexps: see Apocalypse 5.

 sure; and some languages (e.g. java) have their own extensions to perl 
regexps.  but the overall base is well-supported, including lots of 
things like positive and negative lookahead that we don't have.  and 
perl 6 won't be out for years, if ever.
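
For readers unfamiliar with the lookahead constructs mentioned above, here is a small sketch using Python's "re" module, which shares this part of the Perl syntax. It is only an illustration of the feature, not a proposal for how it would be spelled in Emacs regexps.

```python
import re

# Positive lookahead: match "foo" only when "bar" follows, without consuming "bar".
followed_by_bar = re.compile(r"foo(?=bar)")

# Negative lookahead: match "foo" only when "bar" does NOT follow.
not_followed_by_bar = re.compile(r"foo(?!bar)")


def lookahead_demo(text):
    """Return (positive_match, negative_match) as matched strings or None."""
    pos = followed_by_bar.search(text)
    neg = not_followed_by_bar.search(text)
    return (pos.group() if pos else None, neg.group() if neg else None)
```

In both cases the match itself is just "foo"; the lookahead only constrains what may come after it.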

...

>the big problem here is \( and (, which are backwards.  the only
>reasonable solutions i can see are [a] a global variable to control
>which kinds of regexps are used; [b] a double set of all functions
>that take regexps.  comments?

The problem with [a] is that library functions can and do use regexps,
and setting the variable to something they don't expect will break
them.  This is already the case with case-fold-search, but that one is
well-known to library authors.  Introducing a new one would break huge
amounts of code.

I agree with Stephen that The Right Thing would be to expose "compiled
regexps" to Lisp.  Python's "re" module provides an example of how
this can be done.
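
A minimal sketch of what the Python approach looks like, for comparison. The pattern and the parse_headers function are made up for illustration; the point is only that the options are bound to the compiled object when it is created, so a library function does not depend on any global setting its caller may have changed.

```python
import re

# Flags are fixed at compile time and stored on the pattern object itself.
HEADER = re.compile(r"^(subject|from):\s*(.*)$", re.IGNORECASE | re.MULTILINE)


def parse_headers(text):
    """Library-style function: its matching behaviour travels with HEADER."""
    return {m.group(1).lower(): m.group(2) for m in HEADER.finditer(text)}
```

By analogy, a compiled-regexp object exposed to Lisp could record which syntax it was written in, which would sidestep both the global variable and the doubled set of functions.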
