Re: regexps

Monday, 3 January 2011

        Jamie Zawinski writes:

...
 > > The other question is should we try to support pcre + Emacs
 > > extensions, or just use the simpler syntax.
 >
 > What Emacs extensions do you mean?

Mule charsets and character classes mostly.  (These are increasingly
important, although lots of people don't need them at all.)  OTOH,
pcre syntax includes some POSIX stuff that we don't have even today
([[:alpha:]], for example), and I think it also has extensions to
POSIX, right, like context matching?  So how far in that direction
should we go?

I think we can get most of the mileage from a straightforward string-
oriented translator.
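
To make "string-oriented translator" concrete, here is a minimal sketch.
It is in Python only to keep it independent of any Emacs internals; the
name pcre_to_emacs and the subset of syntax it handles are my assumptions
for illustration, not a proposed implementation.  It swaps the operators
that Perl writes bare and Emacs writes with a backslash, and respells \d
and \s.  Bracket expressions are copied verbatim, so things like [\d] or
[\]] are not handled, and neither is lookaround.

```python
# Hypothetical sketch: translate a small subset of Perl/PCRE regexp syntax
# into Emacs regexp syntax, working purely on the pattern string.

# Operators that PCRE writes bare and Emacs writes with a backslash.
SWAPPED = set("(){}|")

# PCRE escapes that need a different spelling in Emacs.
ESCAPES = {
    "d": "[0-9]",
    "D": "[^0-9]",
    "s": r"\s-",   # Emacs whitespace syntax class
    "S": r"\S-",
}


def pcre_to_emacs(pattern):
    out = []
    i, n = 0, len(pattern)
    while i < n:
        c = pattern[i]
        if c == "\\" and i + 1 < n:
            nxt = pattern[i + 1]
            if nxt in SWAPPED:
                out.append(nxt)              # \( in PCRE is a literal paren
            elif nxt in ESCAPES:
                out.append(ESCAPES[nxt])
            else:
                out.append(c + nxt)          # \w, \b, \., \\ pass through
            i += 2
        elif c in SWAPPED:
            out.append("\\" + c)             # bare operator gains a backslash
            i += 1
        elif c == "[":
            # Copy the bracket expression verbatim up to the first "]"
            # (a leading "^" and a leading "]" are treated as members).
            j = i + 1
            if j < n and pattern[j] == "^":
                j += 1
            if j < n and pattern[j] == "]":
                j += 1
            while j < n and pattern[j] != "]":
                j += 1
            out.append(pattern[i:j + 1])
            i = j + 1
        else:
            out.append(c)
            i += 1
    return "".join(out)


print(pcre_to_emacs(r"(foo|bar)\d+\s*"))   # → \(foo\|bar\)[0-9]+\s-*
```

Even this much would cover the \d and \s complaints; the bracket and
lookaround cases are where a purely string-oriented approach starts to
need real parsing.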

...
 > These days I tend to deal with regexps much more often in perl,
 > javascript and php, so when I'm back in Emacs-land,

OK, so muscle memory etc really matters here; simple dodges like #r
are helpful, but not yet optimal.

...
 > And I never remember what the Emacs equivalents of \d and \s are,

Ooh, that hits me where I live.  "\s-" instead of "\s" doesn't bother
me, but it is annoying to me that we don't support "\d".

...
 > It also sure would be nice if copying a regexp from some random
 > piece of perl code and pasting it into emacs-lisp code just worked,

I'm not sure that can happen.  You'll need to wrap it either with a
function or a variable binding because Elisp will have to default to
Emacs regexps for the foreseeable future.  Is such wrapping good
enough?

...
 > Also, the perl "x" syntax (allowing inline whitespace and comments)
 > is really nice.  You can fake most of that in emacs by splitting a
 > regexp into multiple lines and using concat, but the more concise
 > syntax would be better.

You mean, instead of

    (concat "foo"    ;; sine qua non
            "\s+"    ;; multiple spaces occur in justified text
            "bar")   ;; preferably DNA Lounge

you want

    "foo    ;; sine qua non
     \s+    ;; multiple spaces occur in justified text
     bar    ;; preferably DNA Lounge"

or so?  Does look nice (especially avoiding that stray close paren in
the "middle" of describing the regexp).  What do you suggest for the
comment delimiter?  "#" as in Perl and Python, I guess, given your
copy/paste use case.  I think I'd rather not use both.
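
For what it's worth, supporting this could be a preprocessing pass over
the string before any translation.  A minimal sketch, again in Python and
assuming "#" as the only comment delimiter; strip_extended is a
hypothetical name, and the bracket handling is deliberately naive (a "]"
placed first inside a bracket expression would confuse it):

```python
def strip_extended(pattern):
    """Drop insignificant whitespace and '#' comments, Perl /x style."""
    out = []
    i, n = 0, len(pattern)
    in_class = False
    while i < n:
        c = pattern[i]
        if c == "\\" and i + 1 < n:
            # Keep every escape as is, including "\ " and "\#".
            out.append(pattern[i:i + 2])
            i += 2
            continue
        if in_class:
            out.append(c)              # whitespace and "#" are literal here
            if c == "]":
                in_class = False
        elif c == "[":
            in_class = True
            out.append(c)
        elif c == "#":
            # Skip the comment; the newline itself is dropped as whitespace.
            while i < n and pattern[i] != "\n":
                i += 1
            continue
        elif not c.isspace():
            out.append(c)
        i += 1
    return "".join(out)


verbose = """foo    # sine qua non
             \\s+    # multiple spaces occur in justified text
             bar    # preferably DNA Lounge"""
print(strip_extended(verbose))   # → foo\s+bar
```

The result is an ordinary one-line regexp, so nothing downstream has to
know that the extended form ever existed.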

...
 > Random thought: what if the #r syntax allowed arbitrary delimiters
 > like sed's "s" does, so that #r"foo\bar" was equivalent to
 > #r/foo\bar/ and #r@foo\bar@ ?  That would also allow trailing
 > regexp flags like #r/foo/i and #r/foo \s+ bar/x.

Pretty random (although Lars came up with the same kind of thing, so
maybe ...).  But remember, #r is reader syntax.  How do you propose
conveying those flags to the regexp compiler and/or search driver?  As
string properties, maybe.  Seems unlispy to me.

And what about #r/foo\bar/unknown -- would that be a syntax error?
But then you'd need to embed knowledge of the flag syntax in the lexer.

I think I would prefer that flags be (an) argument(s) to the regexp
translator, or use the Python "(?FLAGS)" syntax in the string.  But
I'm admittedly biased toward Python vs. Perl.  Another possibility
would be to have Perl-style regexps be delimited within the string:

    #r"/foo \s+ bar/x"

Ugly, though.  I guess you could allow

    #r/foo \s+ bar/x

and pass the string "/foo \s+ bar/x" to the translator (ie, in this
syntax, the string includes the delimiters).  Again, seems unlispy to
allow non-sexp syntax; I lean toward doing everything with ordinary
strings (though the rawstring reader syntax is very useful for regexps).
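
If the delimiters do travel inside the string, the translator rather
than the lexer can take them apart and decide what an unknown flag means.
A rough sketch in Python; split_delimited is a hypothetical name, and
KNOWN_FLAGS lists only the two flags mentioned in this thread, which is
an assumption rather than a proposal:

```python
# Assumption: only the flags that came up in this thread are recognized.
KNOWN_FLAGS = {"i", "x"}


def split_delimited(literal):
    """Split a string such as "/foo \\s+ bar/x" into (body, flags)."""
    if len(literal) < 2:
        raise ValueError("empty regexp literal")
    delim = literal[0]                 # any character may act as delimiter
    end = literal.rfind(delim)
    if end == 0:
        raise ValueError("missing closing delimiter %r" % delim)
    body, flags = literal[1:end], literal[end + 1:]
    unknown = set(flags) - KNOWN_FLAGS
    if unknown:
        # The error comes from the translator, so the reader needs no
        # knowledge of the flag syntax.
        raise ValueError("unknown flags: %s" % "".join(sorted(unknown)))
    return body, frozenset(flags)


print(split_delimited(r"/foo \s+ bar/x")[0])   # → foo \s+ bar
```

With this arrangement something like "/foo\bar/unknown" is rejected when
the regexp is translated, not when the form is read.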

...
 > True.  case-fold-search is an analogous precedent, but that doesn't
 > necessarily make it a good idea.

I think this is going to be more like when Bash decided to obey POSIX
locales and started collating ASCII as "AaBb...Zz" so that case
sensitive search for [a-z] would pick up "Z" (but not "A"!), but not
in LANG=C.  That was a real PITA.

_______________________________________________
XEmacs-Beta mailing list
XEmacs-Beta(a)xemacs.org
http://lists.xemacs.org/mailman/listinfo/xemacs-beta
