On 06 Aug 2001, Jens Lautenbacher yowled:
> On 05 Aug 2001 12:24:12 +0100, Nix wrote:
> > The work of obsessive genius heavy-duty fontification regexp? Here it
> > is:
> This is sick :-) But reading the comments in the file a bit, I
> wonder if we shouldn't abandon the idea of using regexps for
> fontification completely. Sure, you can get away with 90% correct
Absolutely. I've always thought that using regexps for fontification is
a bad idea; they cannot do a good job in the general case, and I mean
*cannot*; the languages describable by regexps (FSMs) are a strict subset
of those that fontification is trying to describe (context-free or
pseudo-context-sensitive). As such, fontification schemes that depend
upon e.g. counting brackets correctly are not implementable in the
current scheme.
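
A minimal sketch of the bracket-counting point, in Python rather than
Emacs Lisp. It is not from the original thread: the pattern and the function
names are made up for illustration. A fixed regexp can only cope with the
nesting depth written into it, while a counter (in effect a one-symbol
stack) handles any depth.

```python
import re

# Hypothetical pattern: matches one parenthesised group containing at most
# one further level of nesting. Deeper nesting needs a longer pattern.
TWO_LEVELS = re.compile(r"\((?:[^()]|\((?:[^()])*\))*\)")


def balanced_by_regexp(text):
    """True if the whole text is one group nested at most two deep."""
    return TWO_LEVELS.fullmatch(text) is not None


def balanced_by_counting(text):
    """True if every ')' closes an earlier '(' and none are left open."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' with nothing to close
                return False
    return depth == 0
```

Both functions accept "(a(b)c)", but only the counting version accepts
"(a(b(c)))"; extending the regexp by another level just moves the limit.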
> result with some reasonably sized regexps, but getting the last 10%
> makes the regexps so complicated and slow that maybe another approach
> would be better.
More like `another approach would work'.
> JDE, the Emacs Java environment, uses the semantic package to construct
> a real parser to get at the semantic meaning of the code. Building such a
> parser with semantic seems doable if one has the BNF grammar of the
> language in question.
The ideal language-sensitive mode would use semantic to drive the
indentation *and* fontification; IIRC, this is one of semantic's
eventual design goals. The latest development version of semantic allows
the construction of non-rec-descent parsers, which is a good step; not
all languages can be conveniently described by rec-descent parsers
(although they're damned good for some languages, e.g. C++).
> Of course this would be slower than some easy regexps, but judging from
Slower? Probably not. Regexps have little knowledge of the language
grammar, so getting them right often requires massive backtracking in
the regexp engine. A proper language grammar, on the other hand, can
often be backtrack-free (and for most common languages it is).
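
To illustrate what backtrack-free means here, a small sketch of a predictive
recursive-descent parser in Python. It is not from the thread and has
nothing to do with semantic's actual API; the toy arithmetic grammar and all
names are assumptions chosen for brevity. Each decision is made by looking
at the next token only, so every token is consumed exactly once.

```python
import re

# Toy token set: integers, four operators and parentheses.
TOKEN = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(src):
    """Split src into tokens, rejecting anything outside the toy grammar."""
    src = src.rstrip()
    pos, out = 0, []
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if m is None:
            raise SyntaxError(f"bad input at offset {pos}")
        out.append(m.group(1))
        pos = m.end()
    return out


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0  # only ever moves forward: no backtracking

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return tok

    def expr(self):  # expr := term (('+' | '-') term)*
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.take() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):  # term := factor (('*' | '/') factor)*
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.take() == "*":
                value *= self.factor()
            else:
                value //= self.factor()  # integer division, for simplicity
        return value

    def factor(self):  # factor := NUMBER | '(' expr ')'
        tok = self.take()
        if tok == "(":
            value = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")


def evaluate(src):
    """Return (value, tokens consumed) for an arithmetic expression."""
    parser = Parser(tokenize(src))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    return value, parser.pos
```

For example, evaluate("2*(3+4)-1") returns (13, 9): nine tokens in, nine
steps forward, however deep the parentheses go.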
> JDE, perfectly doable on a modern machine. While this overhead is
> bounded, introducing more and more complex regexps will make fontlocking
> as slow as one wants to have it.
There is no overhead. Compare the time taken by that massive regexp to
fontify one of my 2000-line C programs (440 seconds) with the time taken
by GCC-3.0 to parse it (0.9 seconds, according to -ftime-report). Both
engines are written in C; regex.c in the XEmacs core versus c-parse.[yc]
(and some other stuff) in GCC. The difference is that GCC has more
knowledge of the language, and that pure regexp engines are *very* bad
at parsing :)
--
`It's all about bossing computers around. Users have to say "please".
Programmers get to say "do what I want NOW or the hard disk gets it".'
-- Richard Heathfield on the nature of programming