Re: regexps

Monday, 3 January 2011

        On Jan 2, 2011, at 10:53 PM, Stephen J. Turnbull wrote:

...
> It also sure would be nice if copying a regexp from some random
> piece of perl code and pasting it into emacs-lisp code just worked,

 I'm not sure that can happen.  You'll need to wrap it either with a
 function or a variable binding because Elisp will have to default to
 Emacs regexps for the foreseeable future.  Is such wrapping good
 enough?

A real-world example: here's some Perl code to canonicalize the various forms of
YouTube URLs:

  # YouTube /watch?v= or /watch#!v= or /v/ URLs, with or without subdomain,
  # or possibly on youtube-nocookie.com.
  if ($url =~ m@^http:// (?:[a-z]+\.)? (youtube) (?:-nocookie)? \.com/
                         (?: (?: watch )? (?: \? | \#! ) v= | v/ )
                         ([^<>?&,'"]+) ($|&) @sx) {
    my ($site, $id) = ($1, $2);
    $url = "http://www.$site.com/watch?v=$id";
  }
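
As a point of comparison (an illustration only, not something from this thread): Python's re module happens to accept the pattern text above as-is, with re.VERBOSE and re.DOTALL standing in for Perl's /x and /s. A minimal sketch; the name canonicalize_youtube is made up for the example.

```python
import re

# The pattern text is the Perl one, pasted unchanged.  re.VERBOSE and
# re.DOTALL play the role of Perl's /x and /s flags.
YOUTUBE_RE = re.compile(r'''^http:// (?:[a-z]+\.)? (youtube) (?:-nocookie)? \.com/
                            (?: (?: watch )? (?: \? | \#! ) v= | v/ )
                            ([^<>?&,'"]+) ($|&) ''', re.VERBOSE | re.DOTALL)


def canonicalize_youtube(url):
    """Return the canonical watch URL, or the input unchanged if no match.

    Hypothetical helper that mirrors the Perl snippet: group 1 is the
    site name, group 2 is the video id.
    """
    m = YOUTUBE_RE.search(url)
    if not m:
        return url
    site, video_id = m.group(1), m.group(2)
    return "http://www.%s.com/watch?v=%s" % (site, video_id)
```

Like the Perl original, this replaces the whole URL on a match, so any trailing query parameters are dropped.
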

Wouldn't it be nice to cut and paste that hairy regexp into Emacs-Lisp unchanged:

  (if (string-match
             #R@^http:// (?:[a-z]+\.)? (youtube) (?:-nocookie)? \.com/
                         (?: (?: watch )? (?: \? | \#! ) v= | v/ )
                         ([^<>?&,'"]+) ($|&) @sx
             url)
    (setq url (replace-match #r"http://www.\1.com/watch?v=\2" nil nil url)))

(Oooh, and what if replace-match also hacked $foo into (symbol-value 'foo)?  Sweet!)

So,  "#r" can mean "just do backslash hacking", as now, for historical
compatibility, and the new "#R" can mean "read this in Perl syntax".

...
 Pretty random (although Lars came up with the same kind of thing, so
 maybe ...).  But remember, #r is reader syntax.  How do you propose
 conveying those flags to the regexp compiler and/or search driver?  As
 string properties, maybe.  Seems unlispy to me.

It's no more unlispy than any reader macro.  The contract of the "#R" macro
is that the character following the "R" is a double-quote-like delimiter, like
in the Common Lisp |foo| symbol-quoting syntax or #|foo|# block-comment syntax.  So the
flow for the C code implementing "#R" would be:

- read the next character, i.e. the third character of the token (usually /,
  sometimes @, sometimes something weird): that's the delimiter.
- copy all literal bytes until the next occurrence of that character: that's the
  regexp string.
- read following bytes until whitespace or EOF: that's the set of post-regexp flags.
- return Funcall ("perl-regexp-to-emacs-regexp", regexp_string, regexp_flags);
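
The steps above can be sketched in a few lines. This is Python rather than the C the real reader would be written in, and the name read_perl_regexp is hypothetical. It returns the (regexp, flags) pair and leaves the final call to the conversion function to the caller.

```python
import io


def read_perl_regexp(stream):
    """Read what follows '#R' from a text stream: <delim>regexp<delim>flags.

    Hypothetical sketch of the reader flow; returns (regexp, flags).
    """
    # Step 1: the character right after "#R" is the delimiter.
    delim = stream.read(1)
    if delim == "" or delim.isspace():
        raise SyntaxError("#R must be followed by a delimiter character")

    # Step 2: copy everything literally up to the next delimiter.
    regexp = []
    while True:
        ch = stream.read(1)
        if ch == "":
            raise SyntaxError("EOF inside #R regexp")
        if ch == delim:
            break
        regexp.append(ch)

    # Step 3: the flags run until whitespace or EOF.  The terminating
    # whitespace character is consumed along with them.
    flags = []
    while True:
        ch = stream.read(1)
        if ch == "" or ch.isspace():
            break
        flags.append(ch)

    return "".join(regexp), "".join(flags)


# Example: what the reader would see after "#R" in a form like
# (string-match #R@^a b@sx url)
regexp, flags = read_perl_regexp(io.StringIO("@^a b@sx url)"))
```

One consequence of following the steps literally: a close-paren directly after the flags, with no whitespace in between, would be read as a flag. A real reader would probably want to stop at any token delimiter, not just whitespace.
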

That function can either mechanically transform the string into an Emacs-syntax string, or
it can return a new object whose type-of is "perl-regexp", which prints itself
with #R syntax and which string-match accepts as an argument.  (Maybe perl-regexp objects
can also be typep 'string.)
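
To give a feel for the "mechanically transform" option, here is a rough sketch in Python (not Elisp) that handles only a small subset of Perl syntax: grouping, alternation, escaped punctuation, simple character classes, and the /x flag. Everything it does not understand is rejected rather than guessed at. The function body and the choice of subset are assumptions made for illustration. The result is the regexp text itself, not a Lisp string literal with doubled backslashes.

```python
# Characters that are special in both syntaxes and quoted the same way.
SAME_WHEN_ESCAPED = ".*+?[^$\\"


def perl_regexp_to_emacs_regexp(regexp, flags=""):
    """Translate a small subset of Perl regexp syntax to Emacs syntax."""
    unknown = set(flags) - set("sx")
    if unknown:
        raise ValueError("unsupported flags: " + "".join(sorted(unknown)))
    extended = "x" in flags
    out = []
    i, n = 0, len(regexp)
    while i < n:
        ch = regexp[i]
        if ch == "\\":
            if i + 1 >= n:
                raise ValueError("trailing backslash")
            nxt = regexp[i + 1]
            if nxt.isalnum():
                raise NotImplementedError("escape \\" + nxt)
            # Escaped ( ) | # etc. are plain literals, which Emacs writes bare.
            out.append("\\" + nxt if nxt in SAME_WHEN_ESCAPED else nxt)
            i += 2
        elif ch == "[":
            # Copy a character class verbatim, including any whitespace.
            j = i + 1
            if j < n and regexp[j] == "^":
                j += 1
            if j < n and regexp[j] == "]":
                j += 1  # a leading ] is a literal member of the class
            while j < n and regexp[j] != "]":
                if regexp[j] == "\\":
                    raise NotImplementedError("backslash inside [...]")
                j += 1
            if j >= n:
                raise ValueError("unterminated [...]")
            out.append(regexp[i:j + 1])
            i = j + 1
        elif ch in "()|":
            if regexp.startswith("(?", i) and not regexp.startswith("(?:", i):
                raise NotImplementedError("only (?:...) groups are handled")
            # Emacs spells grouping and alternation with a backslash.
            out.append("\\" + ch)
            i += 1
        elif extended and ch.isspace():
            i += 1  # /x: unescaped whitespace is ignored
        elif extended and ch == "#":
            while i < n and regexp[i] != "\n":
                i += 1  # /x: skip a comment up to the end of the line
        elif ch in "{}":
            raise NotImplementedError("braces are not handled")
        elif ch == "." and "s" in flags:
            raise NotImplementedError("bare . under /s is not handled")
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For instance, passing `(?: \? | \#! ) v=` with flags `sx` yields `\(?:\?\|#!\)v=`. Unknown flags raise an error, which is the "just signals" behavior at read time.
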

(I can't even imagine how much RMS would object to such "unnecessary
complexity". Oh wait, yes I can!)

...
 And what about #r/foo\bar/unknown -- would that be a syntax error?
 But then you'd need to embed knowledge of the flag syntax in the lexer.

#'perl-regexp-to-emacs-regexp just signals an error, which surfaces as a parse
error; the lexer itself needs no knowledge of the flags.

Incidentally, #r is inconsistent, as it has no way to read a string ending in a backslash
or a double-quote.  That is, the rules for parsing character N differ from the rules for
characters [0-N).  A more consistent grammar would prohibit \" within #r, which is
exactly why sed-syntax regexps allow arbitrary quote-characters after the "s".

    #r"\"X" ->  "\\\"X"  [3 bytes]
    #r"\""  ->  "\\\""   [2 bytes]
    #r"\\X" ->  "\\\\X"  [3 bytes]
    #r"\\"  ->  "\\\\"   [2 bytes]
    #r"\X"  ->  "\\X"    [2 bytes]
    #r"\"   ->  EOF
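
The rule implied by that table can be written down as a tiny reader. This is a Python sketch of the behavior as tabulated above, not the actual XEmacs reader code, and read_raw_string is a made-up name: a backslash is kept in the result, but it also stops the following character from terminating the string.

```python
def read_raw_string(source):
    """Parse a complete #r"..." token and return the string's contents.

    Hypothetical sketch of the rule shown in the table: nothing is
    unescaped, yet a backslash still protects the character after it.
    """
    if not source.startswith('#r"'):
        raise ValueError('expected a token starting with #r"')
    out = []
    i = 3
    while i < len(source):
        ch = source[i]
        if ch == '"':
            return "".join(out)  # an unprotected quote ends the string
        out.append(ch)
        if ch == "\\" and i + 1 < len(source):
            # The backslash stays, and the next character is copied too
            # without being allowed to close the string.
            out.append(source[i + 1])
            i += 1
        i += 1
    raise EOFError("unterminated #r string")
```

Under this rule every double-quote in the result is preceded by a backslash, and every trailing run of backslashes has even length, so neither a bare trailing backslash nor a bare double-quote can be produced.
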

--
Jamie Zawinski      http://www.jwz.org/      http://www.dnalounge.com/

_______________________________________________
XEmacs-Beta mailing list
XEmacs-Beta(a)xemacs.org
http://lists.xemacs.org/mailman/listinfo/xemacs-beta
