sentence-end regexp flexibilization

Friday, 15 November 2002

        Hello!

I am proposing a patch to lisp/paragraphs.el to make the sentence-end
regexp more flexible.

The problem:  Inline markup (as in XML) should not be considered part
              of the sentence.

Example: Editing the document fragment

         <para>One sentence.  Another sentence.</para>

         A user would see the following sentences:

         <para>One sentence.  Another sentence.</para>
               \----first--/  \----second-----/

         Emacs, in moving and killing, sees these:

         <para>One sentence.  Another sentence.</para>
               \----first--/  \--------second--------/

Diagnosis: sentence-end consists of an end-marker and a
           trailing-context part.  These are not differentiated in the
           definition.  Instead, the function forward-sentence
           searches for the entire regexp and then skips back over
           whitespace.
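
A minimal sketch of the behavior diagnosed above, run on the example
fragment.  The names demo-old-sentence-end and demo-old-forward are
hypothetical and not part of paragraphs.el; the end of the scratch
buffer stands in for the paragraph end.

```elisp
;; Sketch only: hypothetical helper names, not code from paragraphs.el.
(defvar demo-old-sentence-end
  "[.?!][]\"')}]*\\($\\| $\\|\t\\|  \\)[ \t\n]*"
  "The sentence-end value from before the patch.")

(defun demo-old-forward (text start)
  "Move from START in TEXT the way the unpatched forward branch does.
Return the resulting position."
  (with-temp-buffer
    (insert text)
    (goto-char start)
    (if (re-search-forward demo-old-sentence-end nil t)
        ;; The whole regexp matched; back up over trailing whitespace.
        (skip-chars-backward " \t\n")
      ;; No match: fall back to the (simulated) paragraph end.
      (goto-char (point-max)))
    (point)))

;; First sentence: stops right after "sentence.", as a user expects.
(demo-old-forward "<para>One sentence.  Another sentence.</para>" 1)
;; => 20

;; Second sentence: the "." is followed by "</para>", so nothing
;; matches and point lands after the closing tag.
(demo-old-forward "<para>One sentence.  Another sentence.</para>" 20)
;; => 46
```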

Proposed fix: Mark within the sentence-end expression the end-marker
              as the first subexpression.  Jump to its match-end
              instead of skipping backwards.
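
The proposed fix can be sketched in isolation like this.  The regexp
is the patched default with the end-marker as group 1;
demo-new-sentence-end and demo-new-forward are hypothetical names
used only for illustration.

```elisp
;; Sketch only: hypothetical helper names, not code from paragraphs.el.
(defvar demo-new-sentence-end
  "\\([.?!]\\)[]\"')}]*\\($\\| $\\|\t\\|  \\)[ \t\n]*"
  "The patched value: the end-marker is the first subexpression.")

(defun demo-new-forward (text start)
  "Move from START in TEXT by jumping to the end of group 1.
Return the resulting position."
  (with-temp-buffer
    (insert text)
    (goto-char start)
    (if (re-search-forward demo-new-sentence-end nil t)
        ;; Jump to the end-marker instead of skipping back.
        (goto-char (match-end 1))
      (goto-char (point-max)))
    (point)))

;; Trailing spaces and the newline are matched as context only;
;; point ends up directly after the first period.
(demo-new-forward "One sentence.  \nAnother sentence." 1)
;; => 14
```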

Limitations:

  * Always looks at the first subexpression.  If a regexp needs
    preceding context, that context may itself require groups, which
    would then come before the end-marker.  Shy grouping avoids this,
    but it is not very portable.

  * Probably breaks every existing redefinition of sentence-end, with
    no way to detect the breakage.  Only the rare redefinition in
    which the end-marker already happens to be the first group would
    survive.

  * Addresses only sentence ends, not sentence beginnings.

  * Does not fix/change other instances of whitespace skipping.  These
    skips always seem to use space, tab, and newline hard-coded in the
    code, neglecting the syntax-table.
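
To illustrate the first limitation, here is a small sketch of how a
group in the preceding context shifts the numbering, and how a shy
group restores it.  The helper demo-end-marker-position and both
regexps are made up for this example.

```elisp
;; Sketch only: hypothetical helper and regexps for illustration.
(defun demo-end-marker-position (regexp text)
  "Return (match-end 1) for the first match of REGEXP in TEXT, or nil."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (when (re-search-forward regexp nil t)
      (match-end 1))))

;; Preceding context in an ordinary group: group 1 is now the
;; context, so the jump target lands before the punctuation.
(demo-end-marker-position "\\([a-z]\\|[0-9]\\)\\([.?!]\\)  " "abc.  Def")
;; => 4

;; Same context in a shy group: the punctuation is group 1 again.
(demo-end-marker-position "\\(?:[a-z]\\|[0-9]\\)\\([.?!]\\)  " "abc.  Def")
;; => 5
```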

Alternatively, this might be turned into an sgml-forward-sentence,
since SGML is the language family most prominently affected.  But
forward-sentence is called from several other places (kill-sentence,
for example), which would have to be overridden, too.

For completeness, here's the sentence-end regexp I use in xml-mode:

"\\([.?!]\\)[]\"')}]*\\($\\| $\\|\t\\|  \\|</[a-zA-Z:_-]*>\\)\\([ \t\n]\\|</[a-zA-Z:_-]*>\\)*"
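
As a rough check of this regexp against the example fragment, the
following sketch compares the two ways of positioning point after a
match.  It assumes the line break inside the character class above is
mail wrapping and stands for a space; demo-xml-sentence-end and
demo-xml-forward are hypothetical names.

```elisp
;; Sketch only: hypothetical helper names, not code from paragraphs.el.
;; Assumption: the wrapped character class is read as "[ \t\n]".
(defvar demo-xml-sentence-end
  (concat "\\([.?!]\\)[]\"')}]*"
          "\\($\\| $\\|\t\\|  \\|</[a-zA-Z:_-]*>\\)"
          "\\([ \t\n]\\|</[a-zA-Z:_-]*>\\)*")
  "The xml-mode regexp quoted above, split for readability.")

(defun demo-xml-forward (text start use-group)
  "Move from START in TEXT to the next sentence end and return point.
With USE-GROUP non-nil, jump to the end of group 1 (the proposed
fix); otherwise skip back over whitespace (the old logic)."
  (with-temp-buffer
    (insert text)
    (goto-char start)
    (if (re-search-forward demo-xml-sentence-end nil t)
        (if use-group
            (goto-char (match-end 1))
          (skip-chars-backward " \t\n"))
      (goto-char (point-max)))
    (point)))

;; Proposed fix: the second sentence ends right after its period.
(demo-xml-forward "<para>One sentence.  Another sentence.</para>" 20 t)
;; => 39

;; Old skip-back logic with the same regexp: the closing tag is not
;; whitespace, so point stays after "</para>".
(demo-xml-forward "<para>One sentence.  Another sentence.</para>" 20 nil)
;; => 46
```

Extending the regexp alone is therefore not enough; the tag in the
trailing context only drops out of the sentence once point is taken
from the first group.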

Comments?

diff -u -r1.1 paragraphs.el
--- paragraphs.el	2002/11/12 16:51:46	1.1
+++ paragraphs.el	2002/11/12 16:57:28
@@ -134,10 +134,16 @@
 ensures that the paragraph functions will work equally within a region of
 text indented by a margin setting.")

-(defconst sentence-end "[.?!][]\"')}]*\\($\\| $\\|\t\\|  \\)[ \t\n]*" "\
+(defconst sentence-end "\\([.?!]\\)[]\"')}]*\\($\\| $\\|\t\\|  \\)[ \t\n]*" "\
 *Regexp describing the end of a sentence.
 All paragraph boundaries also end sentences, regardless.

+Mark the actual end of the sentence as the first subexpression.
+Usually, this would be the sentence-ending punctuation.  The remainder
+of the regexp then specifies required matching context.  If you have
+to use subexpressions before the `sentence-end', use the shy grouping
+operator \(?:...\) in XEmacs.
+
 In order to be recognized as the end of a sentence, the ending period,
 question mark, or exclamation point must be followed by two spaces,
 unless it's inside some sort of quotes or parenthesis.")
@@ -352,8 +358,8 @@
 	      (end-of-paragraph-text))))))

 (defun forward-sentence (&optional arg)
-  "Move forward to next `sentence-end'.  With argument, repeat.
-With negative argument, move backward repeatedly to `sentence-beginning'.
+  "Move forward to next `sentence-end'.  With ARG, repeat.
+With negative ARG, move backward repeatedly to `sentence-beginning'.

 The variable `sentence-end' is a regular expression that matches ends of
 sentences.  A paragraph boundary also terminates a sentence."
@@ -361,6 +367,9 @@
   (or arg (setq arg 1))
   (while (< arg 0)
     (let ((par-beg (save-excursion (start-of-paragraph-text) (point))))
+      ;; Not good: The concatenated string corresponds to the
+      ;; whitespace list at the end of the sentence-end regular
+      ;; expression.
       (if (re-search-backward (concat sentence-end "[^ \t\n]") par-beg t)
 	  (goto-char (1- (match-end 0)))
 	(goto-char par-beg)))
@@ -368,7 +377,12 @@
   (while (> arg 0)
     (let ((par-end (save-excursion (end-of-paragraph-text) (point))))
       (if (re-search-forward sentence-end par-end t)
-	  (skip-chars-backward " \t\n")
+          ;; If this happens to be used in a context where (a) no shy
+          ;; grouping is available and (b) there must be a group
+          ;; before the sentence-ending punctuation (as in "\(sentence
+          ;; type a\|sentence type b\)\([punct]\)"), the `1' would
+          ;; have to be replaced by something configurable.
+	  (goto-char (match-end 1))
 	(goto-char par-end)))
     (setq arg (1- arg))))

Best wishes,

-- 
Felix H. Gatzemeier                   fxg(a)i3.informatik.rwth-aachen.de
Office Phone: (0(049)241)80-21313
Disclaimer: I do not speak for anyone but myself.
Please do not send me mails containing documents in proprietary
formats (such as Microsoft Word) unless you really need to.
