Extent lossage with `decode-coding-region' and friends

Friday, 15 October 1999

The attached file defines a wrapper `safer-decode-coding-region' that
preserves most markers and extents across calls to
`decode-coding-region'.  It also contains some example code which
shows how `decode-coding-region' is normally broken.

It works by temporarily setting the endpoints of all extents abutting
the region to `open', and placing "fenceposts" at each end of the
region to ensure that markers get pushed and pulled in the right
directions by operations acting entirely within the region.
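
For markers alone, the fencepost idea can be sketched in a few lines.
This is not the code from the attached file, just a minimal
illustration under stated assumptions: `sketch-replace-region' and
`sketch-fenced-replace-region' are hypothetical names, a plain
delete-and-insert stands in for the real decoding step, the fencepost
character "|" is arbitrary, and extents are not handled at all.

```elisp
;; Hypothetical helper: replace START..END with NEW-TEXT by deleting the
;; old text and inserting the new, as a stand-in for the decoding step.
(defun sketch-replace-region (start end new-text)
  (save-excursion
    (goto-char start)
    (delete-region start end)
    (insert new-text)))

;; Hypothetical wrapper: surround the region with one-character
;; fenceposts so that markers sitting exactly on a boundary lie outside
;; the text that gets deleted and reinserted.
(defun sketch-fenced-replace-region (start end new-text)
  (save-excursion
    ;; End fencepost: `insert-before-markers' pushes markers at END
    ;; past the fencepost, so they stay after the region.
    (goto-char end)
    (insert-before-markers "|")
    ;; Start fencepost: plain `insert' leaves ordinary markers
    ;; (default insertion type) at START, before the fencepost.
    (goto-char start)
    (insert "|")
    ;; The region proper now occupies START+1 .. END+1.
    (sketch-replace-region (1+ start) (1+ end) new-text)
    ;; Remove the end fencepost first, then the start fencepost.
    (let ((new-end (+ start 1 (length new-text))))
      (delete-region new-end (1+ new-end)))
    (delete-region start (1+ start))))

;; Run REPLACE-FN on the "XYZ" in "abcXYZdef" and report where the two
;; boundary markers end up, together with the resulting buffer text.
(defun sketch-demo (replace-fn)
  (with-temp-buffer
    (insert "abcXYZdef")
    (let ((m-start (copy-marker 4))   ; at the start of "XYZ"
          (m-end   (copy-marker 7)))  ; at the end of "XYZ"
      (funcall replace-fn 4 7 "longer")
      (list (marker-position m-start)
            (marker-position m-end)
            (buffer-string)))))

(sketch-demo 'sketch-replace-region)
;; => (4 4 "abclongerdef")    end marker collapsed to the start
(sketch-demo 'sketch-fenced-replace-region)
;; => (4 10 "abclongerdef")   end marker follows the new text
```

Without the fenceposts the marker at the end of the region is dragged
to the start by the deletion and is not pushed back by the insertion;
with them, both boundary markers are only ever shifted by edits that
happen strictly to one side of them.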

I have understood Hrvoje to say that `decode-coding-region' totally
breaks markers and extents, but I haven't found that; my experience is that
only markers and extent endpoints within the closure of the region
being decoded are at risk.  You can see that the "whole-extent" and
the external endpoint of "overlap-extent" in the example code are
fine.  If somebody has bugs that involve external markers or extent
endpoints, and recipes to replicate them, I'd love to have a look.

`safer-decode-coding-region' is safer only for markers and extent
endpoints at the boundaries.  Strictly interior markers end up at the
beginning of the decoded region; I don't know the general rule for
extent endpoints, but in the example the extent endpoint moves in the
opposite direction, to the end of the decoded region.

I don't see much reason to care about this in the case of
`decode-coding-region'; setting markers in the middle of an
externally-encoded region is semantically dubious at best.
Unfortunately, that's not true for `encode-coding-region', and I
suspect it's probably not true for any functions that work the way
`*code-coding-region' does (feeding buffer text to an lstream,
deleting the text, and inserting the output of the lstream into the
buffer).
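
The effect of that delete-then-insert pattern on an interior marker can
be reproduced without any coding system at all.  The following is only
a sketch: `sketch-recode-region' is a hypothetical stand-in in which
`upcase' plays the role of the lstream conversion, and it exercises
markers only, not extents.

```elisp
;; Hypothetical stand-in for the lstream round trip: read the region's
;; text, delete it, then insert a transformed copy at the same place.
;; `upcase' merely plays the role of the coding-system conversion.
(defun sketch-recode-region (start end)
  (let ((output (upcase (buffer-substring start end))))
    (save-excursion
      (goto-char start)
      (delete-region start end)
      (insert output))))

;; Return an interior marker's position before and after the rewrite,
;; together with the resulting buffer text.
(defun sketch-interior-marker-demo ()
  (with-temp-buffer
    (insert "abcxyzdef")
    (let* ((interior (copy-marker 5))          ; between "x" and "y"
           (before (marker-position interior)))
      (sketch-recode-region 4 7)
      (list before (marker-position interior) (buffer-string)))))

(sketch-interior-marker-demo)
;; => (5 4 "abcXYZdef")
```

The deletion drags the marker to the start of the region, and since an
ordinary marker does not advance when text is inserted at its
position, it stays there once the new text goes in.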

I have figured out but not implemented a hack that can make a pretty
good guess at where an interior extent endpoint belongs after the
region is en/decoded (at least for non-modal and ISO-2022-conformant
modal encodings).  I don't see any way to handle markers, though,
because according to Info there's no way to get a list of markers.  In
any case it probably ought to be possible to handle extents much more
easily in the C code for F*code-coding-region.

Probably some of this (the hacking at the boundary of the region)
should be done at the level of lstream.c.  Nothing I do is specific to
`decode-coding-region'; in fact the insight came after considering an
RMS comment about the ordering of block insertions and deletions in
wid-edit.el, but that approach doesn't work with extents because of
their flexible closure properties.  Unfortunately that level of code
is beyond me at the moment.

Here's the code.  You need a Mule XEmacs (of course), and a Japanese
font and a color-capable terminal for the visuals.  You can't just use
`load-file' on it because that automatically translates the Japanese
from ISO-2022-JP to Mule internal encoding, ruining the experiment.
It's probably easiest to load it into a buffer, raw, and then do
`eval-buffer' on it.

-- 
University of Tsukuba                Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences       Tel/fax: +81 (298) 53-5091
__________________________________________________________________________
__________________________________________________________________________
What are those two straight lines for?  "Free software rules."
