how to add support for more Unicode characters?

Aidan Kehoe kehoea at parhasard.net
Wed Jun 22 05:59:07 EDT 2005


 On the twentieth day of June, David Kastrup wrote: 

 > Overlong UTF-8 sequences are _invalid_ in UTF-8.  So they should be
 > encoded into appropriately quoted byte characters in MULE, and converted
 > back to the original sequence when writing them out. [...]

If Mule does that, then it's not implementing Unicode. The Unicode Standard
4.0, page 78, Definition 37, Encoding form conversion (D37) (this is most
relevant because the XEmacs Unicode coding systems are an abstraction around
the operation of converting data more than they are anything else) says:

  [...] A conformant encoding form conversion will treat any ill-formed code
  unit sequence as an error condition. (See conformance clause C12a.) This
  guarantees that it will _neither_ _interpret_ _nor_ _emit_† an ill-formed
  code unit sequence. Any implementation of encoding form conversion must
  take this requirement into account, because an encoding form conversion
  implicitly involves a verification that the Unicode strings being
  converted do, in fact, contain well-formed code unit sequences. [...]

† This explicitly prohibits emitting over-long UTF-8, or surrogate code
points encoded in UTF-8, or any of the other things that make a given
sequence of octets invalid UTF-8. 

Page 61, Conformance clause C12a (referenced above) says:

  When a process interprets a code unit sequence which purports to be in a
  Unicode character encoding form, it shall treat ill-formed code unit
  sequences as an error condition, and shall not interpret such sequences as
  characters.

  · For example, in UTF-8 every code unit of the form #b110xxxxx _must_ be
  followed by a code unit of the form #b10xxxxxx. A sequence such as
  #b110xxxxx #b0xxxxxxx is ill-formed and _must_ _never_ _be_ _generated‡._
  When faced with this ill-formed code unit sequence while transforming or
  interpreting text, a conforming process must treat the first unit
  #b110xxxxx as an illegally terminated code unit sequence--for example, by
  signaling an error, filtering the code unit out, or representing the code
  unit with a marker such as U+FFFD REPLACEMENT CHARACTER.

‡ Again, programs are explicitly prohibited from writing out these octet
sequences if the result is to be called UTF-8. It doesn't explicitly
prohibit keeping the octet sequence around, so that's still an option, but
as with ASCII, as with data that's almost EUC-JP, so with UTF-8: _if you
don't want data to be treated as text, use a hex editor._ Line ending
treatment varies between extant versions of the editor too; realistically,
you are already too optimistic when you trust the editor not to change other
bytes if you read in a file as text, change one byte, and write it out
again.

  · Utility programs are not prevented from operating on "mangled" text. For
  example, a UTF-8 file could have had CRLF sequences introduced at every 80
  bytes by a bad mailer program. This could result in some UTF-8 byte
  sequences being interrupted by CRLFs, producing illegal byte
  sequences. This mangled text is no longer UTF-8. It is permissible for a
  conformant program to repair such text, recognizing that the mangled text
  was originally well-formed UTF-8 byte sequences. However, such repair of
  mangled data is a special case, and it must not be used in circumstances
  where it would cause security problems.

Writing over-long UTF-8 sequences to disk as the editor's default behaviour
when the UTF-8 encoding is used _is not a special case,_ _is incompatible
with other programs_--I'm sure you've had Latin-1 trashed into iso-2022
escape sequences often enough to realise this is a big deal--and _will make
security problems more likely._
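
To make the well-formedness rule concrete, here is a minimal sketch of a
check that rejects over-long forms, UTF-8-encoded surrogates and the other
ill-formed cases. The name my-utf-8-well-formed-p is made up for
illustration and is not part of XEmacs; the octet ranges are those of the
Unicode Standard's table of well-formed UTF-8 byte sequences. It takes a
list of integers rather than a string, so it does not depend on how either
editor represents raw octets internally.

```elisp
(defun my-utf-8-well-formed-p (octets)
  "Return t if OCTETS, a list of integers from 0 to 255, is well-formed UTF-8.
Hypothetical helper, for illustration only."
  (let ((ok t))
    (while (and ok octets)
      (let* ((lead (car octets))
             ;; SPEC is (COUNT LOW HIGH): the number of trailing octets and
             ;; the range allowed for the first of them.
             (spec (cond ((<= lead #x7f) '(0 0 0))
                         ;; Stray trail octet, or C0/C1 (always over-long).
                         ((< lead #xc2) nil)
                         ((<= lead #xdf) '(1 #x80 #xbf))
                         ;; A second octet below A0 would be over-long.
                         ((= lead #xe0) '(2 #xa0 #xbf))
                         ;; A second octet above 9F would be a surrogate.
                         ((= lead #xed) '(2 #x80 #x9f))
                         ((<= lead #xef) '(2 #x80 #xbf))
                         ;; A second octet below 90 would be over-long.
                         ((= lead #xf0) '(3 #x90 #xbf))
                         ((<= lead #xf3) '(3 #x80 #xbf))
                         ;; A second octet above 8F would exceed U+10FFFF.
                         ((= lead #xf4) '(3 #x80 #x8f))
                         (t nil)))
             (n (car spec))
             (low (nth 1 spec))
             (high (nth 2 spec)))
        (setq octets (cdr octets))
        (if (null spec)
            (setq ok nil)
          (while (and ok (> n 0))
            (let ((next (car octets)))
              (if (and next (>= next low) (<= next high))
                  ;; Every trail octet after the first is in 80..BF.
                  (setq octets (cdr octets)
                        low #x80
                        high #xbf
                        n (1- n))
                (setq ok nil)))))))
    ok))
```

With this, (my-utf-8-well-formed-p '(#xc3 #xa9)) returns t, while the
over-long NUL '(#xc0 #x80) and the encoded surrogate '(#xed #xa0 #x80) both
return nil. What a caller does on nil (signal an error, drop the octets, or
substitute U+FFFD) is the policy choice C12a leaves open.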

 > Emacs MULE preserves the original information even under those
 > circumstances.  XEmacs MULE (at least 21.4) doesn't, and this means that
 > it is pretty much impossible to get sensible behavior from it in this
 > situation: the information is simply gone.

Programmatically, _if you want the octets, use the 'binary coding system._
That's what it's for. That's what Gnus uses it for when implementing
MIME. That's what VM uses it for when implementing MIME. Not doing that is
why sendmail-user-agent's mbox FCC handling--synced from GNU Emacs back in
the last century, whee--trashed anything non-ASCII in my outgoing mail far
beyond repair. Thanks, GNU.
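
In case it helps, getting at the octets looks roughly like the following.
This is only a sketch: my-file-octets is a hypothetical helper name, not an
existing function, and all it assumes is that binding coding-system-for-read
to 'binary around insert-file-contents suppresses decoding.

```elisp
(defun my-file-octets (filename)
  "Return the contents of FILENAME as a string of undecoded octets.
Hypothetical helper, for illustration only."
  (with-temp-buffer
    ;; With the binary coding system each octet on disk becomes exactly one
    ;; character in the buffer; nothing is interpreted as UTF-8 or anything
    ;; else.
    (let ((coding-system-for-read 'binary))
      (insert-file-contents filename))
    (buffer-string)))
```

The length of the returned string is then the size of the file in octets,
and any decoding (or refusal to decode) is a separate, explicit step.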

 > Now you'll jump at the opportunity of telling me that no user or
 > programmer deserves to be able to cope with such an insane situation.

Nothing of the sort. I'll just object that this sort of insanity is exactly
that, insanity--though it wasn't when it was written, the pending
obsolescence of ISO 2022 wasn't anything Don Knuth could have predicted--and
it should not be in core XEmacs in C when it can be in Lisp. You could do it
as in § below, using set-process-filter appropriately; or, in GNU Emacs, you
could define a new coding system and use an appropriate post-read-conversion
with much the same Lisp. 

 > And this is a situation involving text.  Other things that people like to
 > do is actually load a binary file, do query and replace on it (preserving
 > string length) and save again.  Without a robust encoding, this will, of
 > course, break everything.  And if the strings _are_ in utf-8, it is most
 > convenient to do this operation in utf-8.  hexl-mode certainly does not
 > permit things like that.

So we should implement it. Thanks for the suggestion! (And note that the
code at ¶ below still needs to be finished; don't use it in anger, take it
just as a sign of good faith.)

 > Now this is clearly pretty much impossible with escape-code based
 > encodings (like JIS-whatever) since the encoding spec allows for
 > redundant escape sequences.
 > 
 > But with utf-8, it is possible to preserve content.

I don't really see what's stopping us from keeping the on-disk octets around
for an escape-based encoding, as text properties, if one were to be serious
about this approach. Especially since Emacs-Unicode already passes text
properties up from the coding systems.

;; § Code to handle TeX's idiosyncratic treatment of C1 control characters;
;; it also needs a (add-to-list 'process-coding-system-alist '("tex"
;; . binary)) and something like a set-process-filter too, to be workable. 

(defvar my-tex-previous-unprocessed-output ""
 "String to store any data at the end of the previous string that couldn't
be interpreted as UTF-8.  ")

(defun my-tex-process-filter (output)
 "Transform UTF-8 in TeX's error messages to XEmacs' internal encoding.  "
 (insert 
  (with-string-as-buffer-contents output
    (goto-char (point-min))
    ;; Replace TeX's escapes of the second control region. 
    (while (re-search-forward "\\^\\^\\([8-9][a-f0-9]\\)"  nil t)
      (replace-match (format "%c"
			     (string-to-int (match-string 1) 16))))
    (goto-char (point-min))
    (insert my-tex-previous-unprocessed-output)
    (setq my-tex-previous-unprocessed-output "")
    ;; Now, back up at the end of the string until there's an ASCII
    ;; character.
    (when (>= (char-before (point-max)) ?\200)
      (goto-char (point-max))
      (while (>= (char-before (point)) ?\200)
	(backward-char))
      (setq my-tex-previous-unprocessed-output 
	    (buffer-substring (point) (point-max)))
      (delete-region (point) (point-max)))
    ;; Do the actual decoding, now we've normalised the string to UTF-8. 
    (decode-coding-region (point-min) (point-max) 'utf-8)
    ;; And replace the character that means XEmacs saw an invalid UTF-8
    ;; sequence.
    (goto-char (point-min))
    (while (search-forward 
	    (format "%c" (make-char 'japanese-jisx0208 34 46)) nil t)
      (replace-match "your preferred replacement for invalid escapes")))))

;; ¶ Something we need to add to hexl.el, for the case where people want to
;; replace strings in binaries while preserving their length. 

(defun hexl-replace-preserving-length ()
  "Replace a given string by another of the same length or shorter,
preserving the length of the string in the buffer.  "
  (interactive)
  (let* ((first-string (read-string "String to replace: "))
	 (err (or (> (length first-string) 0)
		  (error 'invalid-argument
			"Can't replace a zero-length string.")))
	 (second-string (read-string "Replacement: "))
	 (coding-system 
	  (if (string-match "[^\000-\177]" (concat first-string second-string))
	      (read-coding-system "Coding system: " 'binary)
	    'binary))
	 (length-diff 0))
    (setq first-string (encode-coding-string first-string coding-system)
	  second-string (encode-coding-string second-string coding-system)
	  length-diff (- (length first-string) (length second-string)))
    (if (< length-diff 0)
	(error 'invalid-argument 
	       "Second string cannot be longer than the first, sorry."))
    (setq second-string (concat second-string 
				(make-string (1+ length-diff) ?\000))
	  first-string (concat first-string "\000"))
    (if hexl-in-save-buffer (error 'invalid-state
				   "Can't replace while saving, sorry.  ")
      (set-buffer-modified-p (save-excursion
			       (save-restriction
				 (let ((buf (generate-new-buffer " hexl"))
				       (name (buffer-name))
				       (start (point-min))
				       (end (point-max))
				       (count 0)
				       modified)
				   (set-buffer buf)
				   (insert-buffer-substring name start end)
				   (set-buffer name)
				   (dehexlify-buffer)
				   (goto-char (point-min))
				   (while (search-forward first-string nil t)
				     (replace-match second-string nil t)
				     (incf count))
				   (setq modified (buffer-modified-p))
				   (delete-region (point-min) (point-max))
				   (insert-buffer-substring buf start end)
				   (kill-buffer buf)
				   (message "Replaced %d occurrence%s"
					    count (if (equal 1 count) "" "s"))
				   modified)))))))

-- 
“I, for instance, am gung-ho about open source because my family is being
held hostage in Rob Malda’s basement. But who fact-checks me, or Enderle,
when we say something in public? No-one!” -- Danny O’Brien



