Hrvoje Niksic wrote:
Ben Wing <ben(a)666.com> writes:
>-- [1] at some point, use extent properties to track the language of a
>text. this is well-recognized.
>
>
I'm a bit hazy on the concept of tracking "language". How is that
supposed to work, exactly? I mean, a word processor can do it because
it has a chance to save its markup when saving the document. Emacs
works, in most cases, with bare characters, or with charset (not
language) annotations, as is the case with coding cookies or with Gnus
processing MIME messages.

some sort of concept of language is placed on a string extent property
and (ideally) follows the text around. where this comes from exactly
depends on various things. if inputted in the same session, it depends
on the user's language environment setting in combination with the
charset of the character. if from some charset-specific encoding, then
from the language of the charset, if derivable.

this is mostly relevant to CJK languages; unicode unifies them, but the
preferred fonts and input methods are different. for western languages,
the mapping from charset to language is fuzzier; but we *don't* simply
want to use some "charset" tag, because then we make artificial
distinctions between iso-2022-jp, shift-jis and euc-jp, for example.
unicode also has the possibility of embedding a language tag in it
(forced down the unicode consortium's throat by the japanese), but it is
highly dispreferred.
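
A rough sketch of the lookup described above, in Python for brevity. Every name and table entry here is hypothetical and only illustrates the idea; none of it is existing XEmacs code. The point is that the tag is a language, so the three Japanese encodings collapse to a single value, and anything not derivable is left untagged.

```python
from typing import Optional

# Hypothetical table: encodings that imply a language outright.
# iso-2022-jp, shift_jis and euc-jp all yield the same tag, so no
# artificial distinction is made between them.
ENCODING_LANGUAGE = {
    "iso-2022-jp": "japanese",
    "shift_jis": "japanese",
    "euc-jp": "japanese",
    "euc-kr": "korean",
    "big5": "chinese-traditional",
}

# Hypothetical table: charsets that imply a language by themselves.
CHARSET_LANGUAGE = {
    "japanese-jisx0208": "japanese",
    "korean-ksc5601": "korean",
}


def language_for_decoded_text(encoding: str) -> Optional[str]:
    """Text read from a charset-specific encoding: use the language of
    the encoding if derivable, otherwise leave the text untagged."""
    return ENCODING_LANGUAGE.get(encoding.lower())


def language_for_typed_text(charset: str, environment: str) -> str:
    """Text typed in this session: the charset wins when it implies a
    language; otherwise fall back to the user's language environment."""
    return CHARSET_LANGUAGE.get(charset, environment)
```

With these tables, text decoded from shift_jis and from euc-jp both get the tag "japanese", while text decoded from utf-8 gets no tag at all.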
>-- [4] the perl regexp \p syntax should be adopted for referencing
>charsets. (char categories just suck.) for that matter, we should move
>in the direction of being as perl-compatible as possible with our
>regexps, since that is where the world is going. (cf java, python,
>ruby, c#, ...)
>
>
It's true that the world is moving to Perl-compatible regexps. Note,
however, that everyone chooses a subset they like -- implementing the
whole thing is next to impossible. Also note that Perl itself is
moving *away* from Perl regexps: see Apocalypse 5.

sure; and some languages (e.g. java) have their own extensions to perl
regexps. but the overall base is well-supported, including lots of
things like positive and negative lookahead that we don't have. and
perl 6 won't be out for years, if ever.
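
For reference, this is what positive and negative lookahead look like in one of the perl-style dialects, using Python's standard "re" module. It is only an illustration of the feature, not a proposal for our syntax; the sample strings are made up.

```python
import re

# Positive lookahead: match "foo" only when "bar" follows, without
# consuming "bar" as part of the match.
followed_by_bar = re.compile(r"foo(?=bar)")

# Negative lookahead: match "foo" only when "bar" does NOT follow.
not_followed_by_bar = re.compile(r"foo(?!bar)")

m1 = followed_by_bar.search("foobar")      # matches "foo" only
m2 = followed_by_bar.search("foobaz")      # no match
m3 = not_followed_by_bar.search("foobaz")  # matches "foo"
m4 = not_followed_by_bar.search("foobar")  # no match
```

In both cases the text examined by the lookahead is not part of the match, so m1 covers only the three characters of "foo".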
>the big problem here is \( and (, which are backwards. the only
>reasonable solutions i can see are [a] a global variable to control
>which kinds of regexps are used; [b] a double set of all functions
>that take regexps. comments?
>
>
The problem with [a] is that library functions can and do use regexps,
and setting the variable to something they don't expect will break
them. This is already the case with case-fold-search, but that one is
well-known to library authors. Introducing a new one would break huge
amounts of code.

I agree with Stephen that The Right Thing would be to expose "compiled
regexps" to Lisp. Python's "re" module provides an example of how
this can be done.
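
A minimal sketch of what that looks like in Python, for those who have not used it. The pattern and function names are invented for the example. The relevant property is that syntax and flags are fixed when the pattern object is created, so a library's regexps cannot be broken by whatever settings the caller happens to have in effect.

```python
import re

# Flags are baked into the compiled object at creation time.
name_any_case = re.compile(r"(emacs)+", re.IGNORECASE)
name_exact_case = re.compile(r"(emacs)+")


def library_mentions_emacs(text: str) -> bool:
    """A 'library' function: it always uses its own compiled pattern,
    so nothing the caller sets globally can change how it matches."""
    return name_exact_case.search(text) is not None


def user_mentions_emacs(text: str) -> bool:
    """A 'user' function that wants case-insensitive matching and gets
    it without affecting the library function above."""
    return name_any_case.search(text) is not None
```

Here user_mentions_emacs("XEmacs") is true while library_mentions_emacs("XEmacs") is false, and neither depends on a global like case-fold-search.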