Ben Wing <ben@666.com> writes:
-- [1] at some point, use extent properties to track the language of a
text. this is well-recognized.
I'm a bit hazy on the concept of tracking "language". How is that
supposed to work, exactly? I mean, a word processor can do it because
it has a chance to save its markup when saving the document. Emacs
works, in most cases, with bare characters, or with charset (not
language) annotations, as is the case with coding cookies or with Gnus
processing MIME messages.
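Mechanically, tagging a stretch of text is easy enough with extents; a
rough sketch (the `language' property name is invented for
illustration, and it sidesteps exactly the part I'm hazy about, namely
where the annotation comes from and whether it survives a save):

    (defun tag-region-language (start end lang &optional buffer)
      "Mark the text from START to END in BUFFER as written in LANG."
      (let ((ext (make-extent start end buffer)))
        (set-extent-property ext 'language lang)
        ext))

    (defun language-at (pos &optional buffer)
      "Return the language tagged at POS, or nil if none."
      (let ((ext (extent-at pos buffer 'language)))
        (and ext (extent-property ext 'language))))

    ;; (tag-region-language 10 200 'french)
    ;; (language-at 50)   ; => french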
-- [4] the perl regexp \p syntax should be adopted for referencing
charsets. (char categories just suck.) for that matter, we should move
in the direction of being as perl-compatible as possible with our
regexps, since that is where the world is going. (cf java, python,
ruby, c#, ...)
It's true that the world is moving to Perl-compatible regexps. Note,
however, that everyone chooses a subset they like -- implementing the
whole thing is next to impossible. Also note that Perl itself is
moving *away* from its traditional regexp syntax: see Apocalypse 5.
the big problem here is \( and (, which are backwards. the only
reasonable solutions i can see are [a] a global variable to control
which kinds of regexps are used; [b] a double set of all functions
that take regexps. comments?
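To spell out the clash in current Emacs Lisp, where \( \) group and
bare parens are literal, while Perl reads the same patterns the other
way around:

    ;; Emacs syntax: \( \) is a group, ( ) match literal parentheses.
    (string-match "\\(ab\\)+" "xababy")   ; => 1, groups and repeats "ab"
    (string-match "(ab)+" "x(ab)y")       ; => 1, "(" and ")" are literal here

    ;; A Perl-compatible engine reads them the other way around:
    ;; (ab)+ groups, \(ab\) matches literal parentheses.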
The problem with [a] is that library functions can and do use regexps,
and setting the variable to something they don't expect will break
them. This is already the case with case-fold-search, but that one is
well-known to library authors. Introducing a new one would break huge
amounts of code.
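case-fold-search already shows the failure mode in miniature; a global
regexp-syntax switch would trigger it far more often:

    ;; A library function written against the default (case-insensitive)
    ;; setting of case-fold-search:
    (defun lib-find-keyword (string)
      (string-match "foo" string))

    (lib-find-keyword "FOO")          ; => 0 under the default settings
    (let ((case-fold-search nil))     ; a caller rebinds the global ...
      (lib-find-keyword "FOO"))       ; => nil, silently breaking the library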
I agree with Stephen that The Right Thing would be to expose "compiled
regexps" to Lisp. Python's "re" module provides an example of how
this can be done.
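Roughly the shape I have in mind, mimicked here in Lisp on top of the
existing primitives (the my-re-* names are invented; a real
implementation would compile the pattern in C, and the object could
also record which regexp syntax it was written in, making the global
variable unnecessary):

    ;; A compiled regexp as a first-class object that carries its own
    ;; options, in the spirit of Python's re.compile():
    (defun my-re-compile (pattern &optional case-fold)
      "Return an object pairing PATTERN with its matching options."
      (list 'my-compiled-re pattern case-fold))

    (defun my-re-match (re string)
      "Match the compiled regexp RE against STRING."
      (let ((case-fold-search (nth 2 re)))
        (string-match (nth 1 re) string)))

    ;; (my-re-match (my-re-compile "foo\\(bar\\)?" t) "FOOBAR")  ; => 0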