Stephen, this adds support for the ISO-IR 196 UTF-8 escape syntax in ISO
2022-oriented coding systems, as Sun used a few years ago for transferring
Unicode in X11 selections, and as you and Markus Kuhn objected to back then.
I sent the bulk of the patch to xemacs-patches last June, but one problem
with it then was that it broke auto-save files and byte-compilation. The
mapping from Unicode code points to XEmacs characters was not stable from
one invocation to the next. Implementing the UTF-8 escape syntax and using
it for just-in-time allocated code points side-steps that.
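For anyone who hasn't met the escape before, here is a minimal sketch in C
of the framing on the wire: ESC % G switches to UTF-8, ESC % @ returns to
ISO 2022. The function names are hypothetical and this is not the code in
the patch. It assumes a single Basic Multilingual Plane code point per
escape, which is the simplest case; an encoder could equally keep the
UTF-8 state open across a run of characters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the patch: encode one BMP code point
   (U+0000..U+FFFF) as UTF-8.  Returns the number of bytes written to
   OUT, which is between 1 and 3. */
static size_t
sketch_utf8_encode_bmp (unsigned int cp, unsigned char *out)
{
  assert (cp <= 0xFFFF);
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  out[0] = (unsigned char) (0xE0 | (cp >> 12));
  out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  out[2] = (unsigned char) (0x80 | (cp & 0x3F));
  return 3;
}

/* Hypothetical helper: write ESC % G, the UTF-8 bytes for CP, then
   ESC % @ to return to ISO 2022.  OUT needs room for at most 9 bytes. */
static size_t
sketch_wrap_in_utf8_escape (unsigned int cp, unsigned char *out)
{
  size_t n = 0;

  out[n++] = 0x1B; out[n++] = '%'; out[n++] = 'G';
  n += sketch_utf8_encode_bmp (cp, out + n);
  out[n++] = 0x1B; out[n++] = '%'; out[n++] = '@';
  return n;
}
```

Because the bytes between the two escapes are plain UTF-8, they name the
Unicode code point itself rather than whatever XEmacs character happened
to be allocated for it in the current session.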
I’ve been running the bulk of the patch for a year or more. It only
occurred to me last night, though, that implementing the UTF-8 escape
syntax would solve the auto-save and byte-compilation problem, so that
code (in mule-charset.c and mule-coding.c) is significantly newer. Given
the issues my C code has provoked recently, I’d love testing from other
people before committing.
NB: with this patch committed, XEmacs supports the whole Basic Multilingual
Plane of Unicode, and doesn’t trash any characters in this space.
SUPERSEDES 17077.60734.730479.39256(a)parhasard.net
lisp/ChangeLog addition:
2006-05-21 Aidan Kehoe <kehoea(a)parhasard.net>
* mule/mule-ccl.el:
"X Emacs" -> "XEmacs"
* mule/mule-ccl.el (ccl-compile-mule-to-unicode): New.
* mule/mule-ccl.el (ccl-compile-unicode-to-mule): New.
* mule/mule-ccl.el (ccl-dump-mule-to-unicode): New.
* mule/mule-ccl.el (ccl-dump-unicode-to-mule): New.
* mule/mule-ccl.el (define-ccl-program):
Add two new CCL commands, and commands to describe them; document
them.
man/ChangeLog addition:
2006-05-21 Aidan Kehoe <kehoea(a)parhasard.net>
* lispref/mule.texi (CCL Syntax):
* lispref/mule.texi (CCL Statements):
Describe the mule-to-unicode and unicode-to-mule statements;
rename the section they are described in.
src/ChangeLog addition:
2006-05-21 Aidan Kehoe <kehoea(a)parhasard.net>
* charset.h:
* charset.h (struct Lisp_Charset):
* charset.h (CHARSET_ENCODE_AS_UTF_8):
* charset.h (XCHARSET_ENCODE_AS_UTF_8):
Add a flag `encode-as-utf-8' to the Mule charset structure; if
set, it's an indication to ISO 2022-oriented coding systems that
the characters of that charset should be encoded using the ISO-IR
196 UTF-8 escape syntax, since they're not members of any other
well-known character set we're aware of.
Make enum unicode_type, encode_unicode_char and Funicode_to_char
available outside of unicode.c.
* lread.c:
* event-xlike-inc.c:
Use the charset.h declaration of Funicode_to_char, don't declare
it ourselves.
* general-slots.h:
Make `ccl-program' and `encode-as-utf-8' available as symbols
generally.
* mule-ccl.c:
Add CCL_MuleToUnicode, CCL_UnicodeToMule, implement them, enable
and debug CCL_MAKE_CHAR, have CCL_WriteMultibyteChar2 segfault
less, fix some grammar.
* mule-charset.c (make_charset):
* mule-charset.c (Fmake_charset):
* mule-charset.c (Fcharset_property):
* mule-charset.c (complex_vars_of_mule_charset):
Require the encode_as_utf_8 property when calling make_charset ();
accept it when creating a charset from Lisp in Fmake_charset.
* mule-coding.c:
* mule-coding.c (dynarr_add_2022_one_dimension):
* mule-coding.c (dynarr_add_2022_two_dimensions):
Add two convenience functions for iso2022_decode, to abstract out
writing UTF-8 a little.
* mule-coding.c (enum iso_esc_flag):
Add one more state to reflect the existence of the UTF-8 escape.
* mule-coding.c (struct iso2022_coding_stream):
Add a counter variable to the state to permit handling
variable-length UTF-8.
* mule-coding.c (parse_iso2022_esc):
Update the function to work with ISO_STATE_UTF_8; only the ESC % @
escape is processed in that state, everything else is ignored and
passed through by the error handler.
* mule-coding.c (iso2022_decode):
* mule-coding.c (iso2022_designate):
* mule-coding.c (iso2022_encode):
Handle the UTF-8 escape sequences in reading and in writing ISO
2022.
* redisplay-x.c (separate_textual_runs):
Add a comment to the effect that the dimension stuff breaks when
using CCL programs and registries to map to a bigger charset.
* unicode.c:
Add support for creating new characters on the fly as unknown
Unicode code points are encountered.
* unicode.c (get_free_codepoint): New.
* unicode.c (unicode_to_ichar): Reworked to create new code points
on the fly.
* unicode.c (Funicode_to_char): Update the docstring.
* unicode.c (struct unicode_coding_system):
Move enum unicode_type into charset.h.
* unicode.c (encode_unicode_char):
encode_unicode_char isn't static any longer; mule-coding.c uses
it.
* unicode.c (syms_of_unicode):
Make a couple of symbols available to unicode.c.
* unicode.c (vars_of_unicode):
Tell the garbage collector about current_jit_charset, initialise
it.
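For reviewers, the register format the two new CCL statements use for
character codes can be sketched as below. This is an illustration with
hypothetical function names, not code from the patch; it assumes position
codes already have the high bit cleared, as CCL_MAKE_CHAR does with its
0x7F masks.

```c
#include <assert.h>

/* Hypothetical helper: pack the position codes of a character into the
   single integer a CCL register holds.  One-dimensional characters have
   C2 == 0 and are stored as C1 alone; two-dimensional characters are
   stored with C1 in the high-order byte and C2 in the low-order byte. */
static int
sketch_ccl_pack_code (int c1, int c2)
{
  assert (c1 >= 0 && c1 < 256 && c2 >= 0 && c2 < 256);
  return c2 != 0 ? (c1 << 8) | c2 : c1;
}

/* Hypothetical helper: the reverse operation.  Values below 256 are
   treated as one-dimensional, anything larger as two-dimensional. */
static void
sketch_ccl_unpack_code (int code, int *c1, int *c2)
{
  if (code < 256)
    {
      *c1 = code & 0x7F;
      *c2 = 0;
    }
  else
    {
      *c1 = (code >> 8) & 0x7F;
      *c2 = code & 0x7F;
    }
}
```

Note the bytes are combined with a bitwise OR, not an AND.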
XEmacs Trunk source patch:
Diff command: cvs -q diff -u
Files affected: src/unicode.c src/redisplay-x.c src/mule-coding.c src/mule-charset.c
src/mule-ccl.c src/lread.c src/general-slots.h src/event-xlike-inc.c src/charset.h
man/lispref/mule.texi lisp/mule/mule-ccl.el
Index: lisp/mule/mule-ccl.el
===================================================================
RCS file: /pack/xemacscvs/XEmacs/xemacs/lisp/mule/mule-ccl.el,v
retrieving revision 1.9
diff -u -u -r1.9 mule-ccl.el
--- lisp/mule/mule-ccl.el 2005/05/05 17:10:38 1.9
+++ lisp/mule/mule-ccl.el 2006/05/21 19:08:48
@@ -5,20 +5,20 @@
;; Keywords: CCL, mule, multilingual, character set, coding-system
-;; This file is part of X Emacs.
+;; This file is part of XEmacs.
-;; GNU Emacs is free software; you can redistribute it and/or modify
+;; XEmacs is free software; you can redistribute it and/or modify
;; it under the terms of the GNU General Public License as published by
;; the Free Software Foundation; either version 2, or (at your option)
;; any later version.
-;; GNU Emacs is distributed in the hope that it will be useful,
+;; XEmacs is distributed in the hope that it will be useful,
;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
;; GNU General Public License for more details.
;; You should have received a copy of the GNU General Public License
-;; along with GNU Emacs; see the file COPYING. If not, write to the
+;; along with XEmacs; see the file COPYING. If not, write to the
;; Free Software Foundation, Inc., 59 Temple Place - Suite 330,
;; Boston, MA 02111-1307, USA.
@@ -48,7 +48,7 @@
[if branch loop break repeat write-repeat write-read-repeat
read read-if read-branch write call end
read-multibyte-character write-multibyte-character
- translate-character
+ translate-character mule-to-unicode unicode-to-mule
iterate-multiple-map map-multiple map-single]
"Vector of CCL commands (symbols).")
@@ -100,7 +100,9 @@
write-multibyte-character
translate-character
translate-character-const-tbl
- nil nil nil nil nil nil nil nil nil nil nil nil ; 0x04-0x0f
+ mule-to-unicode
+ unicode-to-mule
+ nil nil nil nil nil nil nil nil nil nil ; 0x06-0x0f
iterate-multiple-map
map-multiple
map-single
@@ -830,6 +832,29 @@
(ccl-embed-extended-command 'translate-character rrr RRR Rrr))))
nil)
+;; Compile mule-to-unicode
+(defun ccl-compile-mule-to-unicode (cmd)
+ (if (/= (length cmd) 3)
+ (error "CCL: Invalid number of arguments: %s" cmd))
+ (let ((RRR (nth 1 cmd))
+ (rrr (nth 2 cmd)))
+ (ccl-check-register RRR cmd)
+ (ccl-check-register rrr cmd)
+ (ccl-embed-extended-command 'mule-to-unicode RRR rrr 0))
+ nil)
+
+;; Given a Unicode code point in register rrr, write the charset ID of the
+;; corresponding character in RRR, and the Mule-CCL form of its code in rrr.
+(defun ccl-compile-unicode-to-mule (cmd)
+ (if (/= (length cmd) 3)
+ (error "CCL: Invalid number of arguments: %s" cmd))
+ (let ((rrr (nth 1 cmd))
+ (RRR (nth 2 cmd)))
+ (ccl-check-register rrr cmd)
+ (ccl-check-register RRR cmd)
+ (ccl-embed-extended-command 'unicode-to-mule rrr RRR 0))
+ nil)
+
(defun ccl-compile-iterate-multiple-map (cmd)
(ccl-compile-multiple-map-function 'iterate-multiple-map cmd)
nil)
@@ -1188,6 +1213,12 @@
(let ((tbl (ccl-get-next-code)))
(insert (format "translation table(%S) r%d r%d\n" tbl RRR rrr))))
+(defun ccl-dump-mule-to-unicode (rrr RRR Rrr)
+ (insert (format "change chars in r%d and r%d to unicode\n" RRR rrr)))
+
+(defun ccl-dump-unicode-to-mule (rrr RRR Rrr)
+ (insert (format "convert UCS code in r%d to a Mule char\n" rrr)))
+
(defun ccl-dump-iterate-multiple-map (rrr RRR Rrr)
(let ((notbl (ccl-get-next-code))
(i 0) id)
@@ -1358,10 +1389,14 @@
;; Call CCL program whose name is ccl-program-name.
CALL := (call ccl-program-name)
+
+TRANSLATE:= ;; Not implemented under XEmacs, except mule-to-unicode and
+ ;; unicode-to-mule.
+ (translate-character REG(table) REG(charset) REG(codepoint))
+ | (translate-character SYMBOL REG(charset) REG(codepoint))
+ | (mule-to-unicode REG(charset) REG(codepoint))
+ | (unicode-to-mule REG(unicode,code) REG(CHARSET))
-TRANSLATE:= ;; Not implemented under XEmacs.
- (translate-character REG(table) REG(charset) REG(codepoint))
- | (translate-character SYMBOL REG(charset) REG(codepoint))
MAP :=
(iterate-multiple-map REG REG MAP-IDs)
| (map-multiple REG REG (MAP-SET))
@@ -1373,8 +1408,8 @@
;; Terminate the CCL program.
END := (end)
-;; CCL registers. These can contain any integer value. As r7 is used by CCL
-;; interpreter itself, its value change unexpectedly.
+;; CCL registers. These can contain any integer value. As r7 is used by the
+;; CCL interpreter itself, its value can change unexpectedly.
REG := r0 | r1 | r2 | r3 | r4 | r5 | r6 | r7
ARG := REG | INT-OR-CHAR
Index: man/lispref/mule.texi
===================================================================
RCS file: /pack/xemacscvs/XEmacs/xemacs/man/lispref/mule.texi,v
retrieving revision 1.14
diff -u -u -r1.14 mule.texi
--- man/lispref/mule.texi 2005/06/19 20:49:47 1.14
+++ man/lispref/mule.texi 2006/05/21 19:08:50
@@ -1825,6 +1825,15 @@
| (write INT-OR-CHAR) | (write string) | (write REG ARRAY)
| string
CALL := (call ccl-program-name)
+
+
+TRANSLATE:= ;; Not implemented under XEmacs, except mule-to-unicode and
+ ;; unicode-to-mule.
+ (translate-character REG(table) REG(charset) REG(codepoint))
+ | (translate-character SYMBOL REG(charset) REG(codepoint))
+ | (mule-to-unicode REG(charset) REG(codepoint))
+ | (unicode-to-mule REG(unicode,code) REG(CHARSET))
+
END := (end)
REG := r0 | r1 | r2 | r3 | r4 | r5 | r6 | r7
@@ -1845,7 +1854,8 @@
The Emacs Code Conversion Language provides the following statement
types: @dfn{set}, @dfn{if}, @dfn{branch}, @dfn{loop}, @dfn{repeat},
-@dfn{break}, @dfn{read}, @dfn{write}, @dfn{call}, and @dfn{end}.
+@dfn{break}, @dfn{read}, @dfn{write}, @dfn{call}, @dfn{translate} and
+@dfn{end}.
@heading Set statement:
@@ -1933,11 +1943,31 @@
@code{write} and @code{read} statements for the semantics of the I/O
operations for each type of argument.
-@heading Other control statements:
+@heading Other statements:
The @dfn{call} statement, written @samp{(call @var{ccl-program-name})},
executes a CCL program as a subroutine. It does not return a value to
the caller, but can modify the register status.
+
+ The @dfn{mule-to-unicode} statement translates an XEmacs character into a
+UCS code point, using U+FFFD REPLACEMENT CHARACTER if the given XEmacs
+character has no known corresponding code point. It takes two
+arguments; the first is a register in which is stored the character set
+ID of the character to be translated, and into which the UCS code is
+stored. The second is a register which stores the XEmacs code of the
+character in question; if it is from a multidimensional character set,
+like most of the East Asian national sets, it's stored as @samp{((c1 <<
+8) | c2)}, where @samp{c1} is the first code, and @samp{c2} the second.
+(That is, as a single integer, the high-order eight bits of which encode
+the first position code, and the low order bits of which encode the
+second.)
+
+ The @dfn{unicode-to-mule} statement translates a Unicode code point
+(an integer) into an XEmacs character. Its first argument is a register
+containing the UCS code point; the code for the corresponding character
+will be written into this register, in the same format as for
+@samp{mule-to-unicode}. The second argument is a register into which will
+be written the character set ID of the converted character.
The @dfn{end} statement, written @samp{(end)}, terminates the CCL
program successfully, and returns to caller (which may be a CCL
Index: src/charset.h
===================================================================
RCS file: /pack/xemacscvs/XEmacs/xemacs/src/charset.h,v
retrieving revision 1.12
diff -u -u -r1.12 charset.h
--- src/charset.h 2005/10/24 10:07:34 1.12
+++ src/charset.h 2006/05/21 19:08:50
@@ -229,6 +229,11 @@
/* Which half of font to be used to display this character set */
int graphic;
+ /* If set, this charset should be written out in ISO-2022-based coding
+ systems using the escape sequence for UTF-8, not using our internal
+ representation and the associated real ISO 2022 designation. */
+ unsigned int encode_as_utf_8 :1;
+
/* If set, this is a "temporary" charset created when we encounter
an unknown final. This is so that we can successfully compile
and load such files. We allow a real charset to be created on top
@@ -261,6 +266,7 @@
#define CHARSET_REP_BYTES(cs) ((cs)->rep_bytes)
#define CHARSET_COLUMNS(cs) ((cs)->columns)
#define CHARSET_GRAPHIC(cs) ((cs)->graphic)
+#define CHARSET_ENCODE_AS_UTF_8(cs) ((cs)->encode_as_utf_8)
#define CHARSET_TYPE(cs) ((cs)->type)
#define CHARSET_DIRECTION(cs) ((cs)->direction)
#define CHARSET_FINAL(cs) ((cs)->final)
@@ -284,6 +290,7 @@
#define XCHARSET_REP_BYTES(cs) CHARSET_REP_BYTES (XCHARSET (cs))
#define XCHARSET_COLUMNS(cs) CHARSET_COLUMNS (XCHARSET (cs))
#define XCHARSET_GRAPHIC(cs) CHARSET_GRAPHIC (XCHARSET (cs))
+#define XCHARSET_ENCODE_AS_UTF_8(cs) CHARSET_ENCODE_AS_UTF_8 (XCHARSET (cs))
#define XCHARSET_TYPE(cs) CHARSET_TYPE (XCHARSET (cs))
#define XCHARSET_DIRECTION(cs) CHARSET_DIRECTION (XCHARSET (cs))
#define XCHARSET_FINAL(cs) CHARSET_FINAL (XCHARSET (cs))
@@ -548,5 +555,21 @@
int ichar_to_unicode (Ichar chr);
#endif /* MULE */
+
+/* ISO 10646 UTF-16, UCS-4, UTF-8, UTF-7, etc. */
+
+enum unicode_type
+{
+ UNICODE_UTF_16,
+ UNICODE_UTF_8,
+ UNICODE_UTF_7,
+ UNICODE_UCS_4
+};
+
+void encode_unicode_char (Lisp_Object USED_IF_MULE (charset), int h,
+ int USED_IF_MULE (l), unsigned_char_dynarr *dst,
+ enum unicode_type type, unsigned int little_endian);
+
+EXFUN (Funicode_to_char, 2); /* In unicode.c. */
#endif /* INCLUDED_charset_h_ */
Index: src/event-xlike-inc.c
===================================================================
RCS file: /pack/xemacscvs/XEmacs/xemacs/src/event-xlike-inc.c,v
retrieving revision 1.2
diff -u -u -r1.2 event-xlike-inc.c
--- src/event-xlike-inc.c 2005/06/26 18:05:04 1.2
+++ src/event-xlike-inc.c 2006/05/21 19:08:50
@@ -27,8 +27,6 @@
included here, not in event-xlike.c. However, event-xlike.c is always
X-specific, whereas the following code isn't, in the GTK case. */
-EXFUN (Funicode_to_char, 2); /* In unicode.c. */
-
static int
#ifdef THIS_IS_GTK
emacs_gtk_event_pending_p (int how_many)
Index: src/general-slots.h
===================================================================
RCS file: /pack/xemacscvs/XEmacs/xemacs/src/general-slots.h,v
retrieving revision 1.16
diff -u -u -r1.16 general-slots.h
--- src/general-slots.h 2005/07/03 21:48:01 1.16
+++ src/general-slots.h 2006/05/21 19:08:50
@@ -73,6 +73,7 @@
SYMBOL_KEYWORD (Q_callback_ex);
SYMBOL (Qcancel);
SYMBOL (Qcategory);
+SYMBOL (Qccl_program);
SYMBOL (Qcenter);
SYMBOL (Qchain);
SYMBOL (Qchange);
@@ -115,6 +116,7 @@
SYMBOL (Qdynarr_overhead);
SYMBOL (Qemergency);
SYMBOL (Qempty);
+SYMBOL (Qencode_as_utf_8);
SYMBOL (Qeq);
SYMBOL (Qeql);
SYMBOL (Qequal);
Index: src/lread.c
===================================================================
RCS file: /pack/xemacscvs/XEmacs/xemacs/src/lread.c,v
retrieving revision 1.77
diff -u -u -r1.77 lread.c
--- src/lread.c 2006/04/29 14:36:57 1.77
+++ src/lread.c 2006/05/21 19:08:52
@@ -34,6 +34,7 @@
#include "lstream.h"
#include "opaque.h"
#include "profile.h"
+#include "charset.h" /* For Funicode_to_char. */
#include "sysfile.h"
#include "sysfloat.h"
@@ -207,8 +208,6 @@
static int locate_file_open_or_access_file (Ibyte *fn, int access_mode);
EXFUN (Fread_from_string, 3);
-
-EXFUN (Funicode_to_char, 2); /* In unicode.c. */
/* When errors are signaled, the actual readcharfun should not be used
as an argument if it is an lstream, so that lstreams don't escape
Index: src/mule-ccl.c
===================================================================
RCS file: /pack/xemacscvs/XEmacs/xemacs/src/mule-ccl.c,v
retrieving revision 1.28
diff -u -u -r1.28 mule-ccl.c
--- src/mule-ccl.c 2005/06/26 19:05:07 1.28
+++ src/mule-ccl.c 2006/05/21 19:08:53
@@ -461,7 +461,17 @@
1:ExtendedCOMMNDRrrRRRrrrXXXXX
2:ARGUMENT(Translation Table ID)
*/
+/* Translate a character whose charset ID is reg[rrr] and whose code point
+ is reg[RRR] into its Unicode code point, which will be written into
+ reg[rrr]. */
+#define CCL_MuleToUnicode 0x04
+
+/* Translate a Unicode code point, in reg[rrr], into a Mule character,
+ writing the charset ID into reg[RRR] and the code point into reg[rrr]. */
+
+#define CCL_UnicodeToMule 0x05
+
/* Iterate looking up MAPs for reg[rrr] starting from the Nth (N =
reg[RRR]) MAP until some value is found.
@@ -577,7 +587,6 @@
...
N:SEPARATOR_z (< 0)
*/
-
#define MAX_MAP_SET_LEVEL 30
typedef struct
@@ -837,26 +846,41 @@
CODE to that invalid byte. */
/* On XEmacs, TranslateCharacter is not supported. Thus, this
- macro is not used. */
-#if 0
+ macro is only used in the MuleToUnicode transformation. */
#define CCL_MAKE_CHAR(charset, code, c) \
do { \
- if ((charset) == CHARSET_ASCII) \
- (c) = (code) & 0xFF; \
- else if (CHARSET_DEFINED_P (charset) \
- && ((code) & 0x7F) >= 32 \
- && ((code) < 256 || ((code >> 7) & 0x7F) >= 32)) \
+ if ((charset) == LEADING_BYTE_ASCII) \
{ \
- int c1 = (code) & 0x7F, c2 = 0; \
+ c = (code) & 0xFF; \
+ } \
+ else if ((charset) == LEADING_BYTE_CONTROL_1) \
+ { \
+ c = ((code) & 0xFF) - 0xA0; \
+ } \
+ else if (!NILP(charset_by_leading_byte(charset)) \
+ && ((code) >= 32) \
+ && ((code) < 256 || ((code >> 8) & 0x7F) >= 32)) \
+ { \
+ int c1, c2 = 0; \
\
- if ((code) >= 256) \
- c2 = c1, c1 = ((code) >> 7) & 0x7F; \
- (c) = make_ichar (charset, c1, c2); \
+ if ((code) < 256) \
+ { \
+ c1 = (code) & 0x7F; \
+ c2 = 0; \
+ } \
+ else \
+ { \
+ c1 = ((code) >> 8) & 0x7F; \
+ c2 = (code) & 0x7F; \
+ } \
+ c = make_ichar (charset_by_leading_byte(charset), \
+ c1, c2); \
} \
else \
- (c) = (code) & 0xFF; \
- } while (0)
-#endif
+ { \
+ c = (code) & 0xFF; \
+ } \
+ } while (0)
/* Execute CCL code on SRC_BYTES length text at SOURCE. The resulting
@@ -1392,9 +1416,9 @@
case CCL_TranslateCharacter:
#if 0
- /* XEmacs does not have translate_char, and its
- equivalent nor. We do nothing on this operation. */
- CCL_MAKE_CHAR (reg[RRR], reg[rrr], i);
+ /* XEmacs does not have translate_char, nor an
+ equivalent. We do nothing on this operation. */
+ CCL_MAKE_CHAR(reg[RRR], reg[rrr], op);
op = translate_char (GET_TRANSLATION_TABLE (reg[Rrr]),
i, -1, 0, 0);
SPLIT_CHAR (op, reg[RRR], i, j);
@@ -1420,6 +1444,56 @@
reg[rrr] = i;
#endif
break;
+
+ case CCL_MuleToUnicode:
+ {
+ Lisp_Object ucs;
+
+ CCL_MAKE_CHAR(reg[rrr], reg[RRR], op);
+ ucs = Fchar_to_unicode(make_char(op));
+
+ if (NILP(ucs))
+ {
+ /* Uhh, char-to-unicode doesn't return nil at the
+ moment, only ever -1. */
+ reg[rrr] = 0xFFFD; /* REPLACEMENT CHARACTER */
+ }
+ else
+ {
+ reg[rrr] = XINT(ucs);
+ if (-1 == reg[rrr])
+ {
+ reg[rrr] = 0xFFFD; /* REPLACEMENT CHARACTER */
+ }
+ }
+ break;
+ }
+
+ case CCL_UnicodeToMule:
+ {
+ Lisp_Object scratch;
+
+ scratch = Funicode_to_char(make_int(reg[rrr]), Qnil);
+
+ if (!NILP(scratch))
+ {
+ op = XCHAR(scratch);
+ BREAKUP_ICHAR (op, scratch, i, j);
+ reg[RRR] = XCHARSET_ID(scratch);
+
+ if (j != 0)
+ {
+ i = (i << 8) | j;
+ }
+
+ reg[rrr] = i;
+ }
+ else
+ {
+ reg[rrr] = reg[RRR] = 0;
+ }
+ break;
+ }
case CCL_IterateMultipleMap:
{
Index: src/mule-charset.c
===================================================================
RCS file: /pack/xemacscvs/XEmacs/xemacs/src/mule-charset.c,v
retrieving revision 1.46
diff -u -u -r1.46 mule-charset.c
--- src/mule-charset.c 2005/10/25 11:16:26 1.46
+++ src/mule-charset.c 2006/05/21 19:08:53
@@ -190,7 +190,7 @@
int type, int columns, int graphic,
Ibyte final, int direction, Lisp_Object short_name,
Lisp_Object long_name, Lisp_Object doc,
- Lisp_Object reg, int overwrite)
+ Lisp_Object reg, int overwrite, int encode_as_utf_8)
{
Lisp_Object obj;
Lisp_Charset *cs;
@@ -240,6 +240,7 @@
CHARSET_FINAL (cs) = final;
CHARSET_DOC_STRING (cs) = doc;
CHARSET_REGISTRY (cs) = reg;
+ CHARSET_ENCODE_AS_UTF_8 (cs) = encode_as_utf_8;
CHARSET_CCL_PROGRAM (cs) = Qnil;
CHARSET_REVERSE_DIRECTION_CHARSET (cs) = Qnil;
@@ -454,6 +455,12 @@
is passed the octets of the character, with the high
bit cleared and set depending upon whether the value
of the `graphic' property is 0 or 1.
+`encode-as-utf-8'
+ If 1, the charset will be written out using the UTF-8 escape
+ syntax in ISO 2022-oriented coding systems. Used for
+ supporting characters we know are part of Unicode but not of
+ any other known character set in escape-quoted and compound
+ text.
*/
(name, doc_string, props))
{
@@ -465,6 +472,7 @@
Lisp_Object charset = Qnil;
Lisp_Object ccl_program = Qnil;
Lisp_Object short_name = Qnil, long_name = Qnil;
+ int encode_as_utf_8 = 0;
Lisp_Object existing_charset;
int temporary = UNBOUNDP (name);
@@ -546,6 +554,14 @@
invalid_constant ("Invalid value for `direction'", value);
}
+ else if (EQ (keyword, Qencode_as_utf_8))
+ {
+ CHECK_INT (value);
+ encode_as_utf_8 = XINT (value);
+ if (encode_as_utf_8 < 0 || encode_as_utf_8 > 1)
+ invalid_constant ("Invalid value for `encode-as-utf-8'", value);
+ }
+
else if (EQ (keyword, Qfinal))
{
CHECK_CHAR_COERCE_INT (value);
@@ -553,7 +569,6 @@
if (final < '0' || final > '~')
invalid_constant ("Invalid value for `final'", value);
}
-
else if (EQ (keyword, Qccl_program))
{
struct ccl_program test_ccl;
@@ -612,7 +627,8 @@
charset = make_charset (id, name, dimension + 2, type, columns, graphic,
final, direction, short_name, long_name,
- doc_string, registry, !NILP (existing_charset));
+ doc_string, registry, !NILP (existing_charset),
+ encode_as_utf_8);
XCHARSET (charset)->temporary = temporary;
if (!NILP (ccl_program))
@@ -641,7 +657,7 @@
(charset, new_name))
{
Lisp_Object new_charset = Qnil;
- int id, dimension, columns, graphic;
+ int id, dimension, columns, graphic, encode_as_utf_8;
Ibyte final;
int direction, type;
Lisp_Object registry, doc_string, short_name, long_name;
@@ -672,10 +688,11 @@
short_name = CHARSET_SHORT_NAME (cs);
long_name = CHARSET_LONG_NAME (cs);
registry = CHARSET_REGISTRY (cs);
+ encode_as_utf_8 = CHARSET_ENCODE_AS_UTF_8 (cs);
new_charset = make_charset (id, new_name, dimension + 2, type, columns,
graphic, final, direction, short_name, long_name,
- doc_string, registry, 0);
+ doc_string, registry, 0, encode_as_utf_8);
CHARSET_REVERSE_DIRECTION_CHARSET (cs) = new_charset;
XCHARSET_REVERSE_DIRECTION_CHARSET (new_charset) = charset;
@@ -807,6 +824,7 @@
if (EQ (prop, Qfinal)) return make_char (CHARSET_FINAL (cs));
if (EQ (prop, Qchars)) return make_int (CHARSET_CHARS (cs));
if (EQ (prop, Qregistry)) return CHARSET_REGISTRY (cs);
+ if (EQ (prop, Qencode_as_utf_8)) return make_int (CHARSET_ENCODE_AS_UTF_8 (cs));
if (EQ (prop, Qccl_program)) return CHARSET_CCL_PROGRAM (cs);
if (EQ (prop, Qdirection))
return CHARSET_DIRECTION (cs) == CHARSET_LEFT_TO_RIGHT ? Ql2r : Qr2l;
@@ -1040,7 +1058,7 @@
build_string ("ASCII"),
build_msg_string ("ASCII"),
build_msg_string ("ASCII (ISO646 IRV)"),
- build_string ("\\(iso8859-[0-9]*\\|-ascii\\)"), 0);
+ build_string ("\\(iso8859-[0-9]*\\|-ascii\\)"), 0, 0);
staticpro (&Vcharset_control_1);
Vcharset_control_1 =
make_charset (LEADING_BYTE_CONTROL_1, Qcontrol_1, 2,
@@ -1049,7 +1067,7 @@
build_string ("C1"),
build_msg_string ("Control characters"),
build_msg_string ("Control characters 128-191"),
- build_string (""), 0);
+ build_string (""), 0, 0);
staticpro (&Vcharset_latin_iso8859_1);
Vcharset_latin_iso8859_1 =
make_charset (LEADING_BYTE_LATIN_ISO8859_1, Qlatin_iso8859_1, 2,
@@ -1058,7 +1076,7 @@
build_string ("Latin-1"),
build_msg_string ("ISO8859-1 (Latin-1)"),
build_msg_string ("ISO8859-1 (Latin-1)"),
- build_string ("iso8859-1"), 0);
+ build_string ("iso8859-1"), 0, 0);
staticpro (&Vcharset_latin_iso8859_2);
Vcharset_latin_iso8859_2 =
make_charset (LEADING_BYTE_LATIN_ISO8859_2, Qlatin_iso8859_2, 2,
@@ -1067,7 +1085,7 @@
build_string ("Latin-2"),
build_msg_string ("ISO8859-2 (Latin-2)"),
build_msg_string ("ISO8859-2 (Latin-2)"),
- build_string ("iso8859-2"), 0);
+ build_string ("iso8859-2"), 0, 0);
staticpro (&Vcharset_latin_iso8859_3);
Vcharset_latin_iso8859_3 =
make_charset (LEADING_BYTE_LATIN_ISO8859_3, Qlatin_iso8859_3, 2,
@@ -1076,7 +1094,7 @@
build_string ("Latin-3"),
build_msg_string ("ISO8859-3 (Latin-3)"),
build_msg_string ("ISO8859-3 (Latin-3)"),
- build_string ("iso8859-3"), 0);
+ build_string ("iso8859-3"), 0, 0);
staticpro (&Vcharset_latin_iso8859_4);
Vcharset_latin_iso8859_4 =
make_charset (LEADING_BYTE_LATIN_ISO8859_4, Qlatin_iso8859_4, 2,
@@ -1085,7 +1103,7 @@
build_string ("Latin-4"),
build_msg_string ("ISO8859-4 (Latin-4)"),
build_msg_string ("ISO8859-4 (Latin-4)"),
- build_string ("iso8859-4"), 0);
+ build_string ("iso8859-4"), 0, 0);
staticpro (&Vcharset_thai_tis620);
Vcharset_thai_tis620 =
make_charset (LEADING_BYTE_THAI_TIS620, Qthai_tis620, 2,
@@ -1094,7 +1112,7 @@
build_string ("TIS620"),
build_msg_string ("TIS620 (Thai)"),
build_msg_string ("TIS620.2529 (Thai)"),
- build_string ("tis620"),0);
+ build_string ("tis620"), 0, 0);
staticpro (&Vcharset_greek_iso8859_7);
Vcharset_greek_iso8859_7 =
make_charset (LEADING_BYTE_GREEK_ISO8859_7, Qgreek_iso8859_7, 2,
@@ -1103,7 +1121,7 @@
build_string ("ISO8859-7"),
build_msg_string ("ISO8859-7 (Greek)"),
build_msg_string ("ISO8859-7 (Greek)"),
- build_string ("iso8859-7"), 0);
+ build_string ("iso8859-7"), 0, 0);
staticpro (&Vcharset_arabic_iso8859_6);
Vcharset_arabic_iso8859_6 =
make_charset (LEADING_BYTE_ARABIC_ISO8859_6, Qarabic_iso8859_6, 2,
@@ -1112,7 +1130,7 @@
build_string ("ISO8859-6"),
build_msg_string ("ISO8859-6 (Arabic)"),
build_msg_string ("ISO8859-6 (Arabic)"),
- build_string ("iso8859-6"), 0);
+ build_string ("iso8859-6"), 0, 0);
staticpro (&Vcharset_hebrew_iso8859_8);
Vcharset_hebrew_iso8859_8 =
make_charset (LEADING_BYTE_HEBREW_ISO8859_8, Qhebrew_iso8859_8, 2,
@@ -1121,7 +1139,7 @@
build_string ("ISO8859-8"),
build_msg_string ("ISO8859-8 (Hebrew)"),
build_msg_string ("ISO8859-8 (Hebrew)"),
- build_string ("iso8859-8"), 0);
+ build_string ("iso8859-8"), 0, 0);
staticpro (&Vcharset_katakana_jisx0201);
Vcharset_katakana_jisx0201 =
make_charset (LEADING_BYTE_KATAKANA_JISX0201, Qkatakana_jisx0201, 2,
@@ -1130,7 +1148,7 @@
build_string ("JISX0201 Kana"),
build_msg_string ("JISX0201.1976 (Japanese Kana)"),
build_msg_string ("JISX0201.1976 Japanese Kana"),
- build_string ("jisx0201.1976"), 0);
+ build_string ("jisx0201.1976"), 0, 0);
staticpro (&Vcharset_latin_jisx0201);
Vcharset_latin_jisx0201 =
make_charset (LEADING_BYTE_LATIN_JISX0201, Qlatin_jisx0201, 2,
@@ -1139,7 +1157,7 @@
build_string ("JISX0201 Roman"),
build_msg_string ("JISX0201.1976 (Japanese Roman)"),
build_msg_string ("JISX0201.1976 Japanese Roman"),
- build_string ("jisx0201.1976"), 0);
+ build_string ("jisx0201.1976"), 0, 0);
staticpro (&Vcharset_cyrillic_iso8859_5);
Vcharset_cyrillic_iso8859_5 =
make_charset (LEADING_BYTE_CYRILLIC_ISO8859_5, Qcyrillic_iso8859_5, 2,
@@ -1148,7 +1166,7 @@
build_string ("ISO8859-5"),
build_msg_string ("ISO8859-5 (Cyrillic)"),
build_msg_string ("ISO8859-5 (Cyrillic)"),
- build_string ("iso8859-5"), 0);
+ build_string ("iso8859-5"), 0, 0);
staticpro (&Vcharset_latin_iso8859_9);
Vcharset_latin_iso8859_9 =
make_charset (LEADING_BYTE_LATIN_ISO8859_9, Qlatin_iso8859_9, 2,
@@ -1157,7 +1175,7 @@
build_string ("Latin-5"),
build_msg_string ("ISO8859-9 (Latin-5)"),
build_msg_string ("ISO8859-9 (Latin-5)"),
- build_string ("iso8859-9"), 0);
+ build_string ("iso8859-9"), 0, 0);
staticpro (&Vcharset_latin_iso8859_15);
Vcharset_latin_iso8859_15 =
make_charset (LEADING_BYTE_LATIN_ISO8859_15, Qlatin_iso8859_15, 2,
@@ -1166,7 +1184,7 @@
build_string ("Latin-9"),
build_msg_string ("ISO8859-15 (Latin-9)"),
build_msg_string ("ISO8859-15 (Latin-9)"),
- build_string ("iso8859-15"), 0);
+ build_string ("iso8859-15"), 0, 0);
staticpro (&Vcharset_japanese_jisx0208_1978);
Vcharset_japanese_jisx0208_1978 =
make_charset (LEADING_BYTE_JAPANESE_JISX0208_1978, Qjapanese_jisx0208_1978, 3,
@@ -1176,7 +1194,7 @@
build_msg_string ("JISX0208.1978 (Japanese)"),
build_msg_string
("JISX0208.1978 Japanese Kanji (so called \"old JIS\")"),
- build_string ("\\(jisx0208\\|jisc6226\\)\\.1978"), 0);
+ build_string ("\\(jisx0208\\|jisc6226\\)\\.1978"), 0, 0);
staticpro (&Vcharset_chinese_gb2312);
Vcharset_chinese_gb2312 =
make_charset (LEADING_BYTE_CHINESE_GB2312, Qchinese_gb2312, 3,
@@ -1185,7 +1203,7 @@
build_string ("GB2312"),
build_msg_string ("GB2312)"),
build_msg_string ("GB2312 Chinese simplified"),
- build_string ("gb2312"), 0);
+ build_string ("gb2312"), 0, 0);
staticpro (&Vcharset_japanese_jisx0208);
Vcharset_japanese_jisx0208 =
make_charset (LEADING_BYTE_JAPANESE_JISX0208, Qjapanese_jisx0208, 3,
@@ -1194,7 +1212,7 @@
build_string ("JISX0208"),
build_msg_string ("JISX0208.1983/1990 (Japanese)"),
build_msg_string ("JISX0208.1983/1990 Japanese Kanji"),
- build_string ("jisx0208.19\\(83\\|90\\)"), 0);
+ build_string ("jisx0208.19\\(83\\|90\\)"), 0, 0);
staticpro (&Vcharset_korean_ksc5601);
Vcharset_korean_ksc5601 =
make_charset (LEADING_BYTE_KOREAN_KSC5601, Qkorean_ksc5601, 3,
@@ -1203,7 +1221,7 @@
build_string ("KSC5601"),
build_msg_string ("KSC5601 (Korean"),
build_msg_string ("KSC5601 Korean Hangul and Hanja"),
- build_string ("ksc5601"), 0);
+ build_string ("ksc5601"), 0, 0);
staticpro (&Vcharset_japanese_jisx0212);
Vcharset_japanese_jisx0212 =
make_charset (LEADING_BYTE_JAPANESE_JISX0212, Qjapanese_jisx0212, 3,
@@ -1212,7 +1230,7 @@
build_string ("JISX0212"),
build_msg_string ("JISX0212 (Japanese)"),
build_msg_string ("JISX0212 Japanese Supplement"),
- build_string ("jisx0212"), 0);
+ build_string ("jisx0212"), 0, 0);
#define CHINESE_CNS_PLANE_RE(n) "cns11643[.-]\\(.*[.-]\\)?" n "$"
staticpro (&Vcharset_chinese_cns11643_1);
@@ -1224,7 +1242,7 @@
build_msg_string ("CNS11643-1 (Chinese traditional)"),
build_msg_string
("CNS 11643 Plane 1 Chinese traditional"),
- build_string (CHINESE_CNS_PLANE_RE("1")), 0);
+ build_string (CHINESE_CNS_PLANE_RE("1")), 0, 0);
staticpro (&Vcharset_chinese_cns11643_2);
Vcharset_chinese_cns11643_2 =
make_charset (LEADING_BYTE_CHINESE_CNS11643_2, Qchinese_cns11643_2, 3,
@@ -1234,7 +1252,7 @@
build_msg_string ("CNS11643-2 (Chinese traditional)"),
build_msg_string
("CNS 11643 Plane 2 Chinese traditional"),
- build_string (CHINESE_CNS_PLANE_RE("2")), 0);
+ build_string (CHINESE_CNS_PLANE_RE("2")), 0, 0);
staticpro (&Vcharset_chinese_big5_1);
Vcharset_chinese_big5_1 =
make_charset (LEADING_BYTE_CHINESE_BIG5_1, Qchinese_big5_1, 3,
@@ -1244,7 +1262,7 @@
build_msg_string ("Big5 (Level-1)"),
build_msg_string
("Big5 Level-1 Chinese traditional"),
- build_string ("big5"), 0);
+ build_string ("big5"), 0, 0);
staticpro (&Vcharset_chinese_big5_2);
Vcharset_chinese_big5_2 =
make_charset (LEADING_BYTE_CHINESE_BIG5_2, Qchinese_big5_2, 3,
@@ -1254,7 +1272,7 @@
build_msg_string ("Big5 (Level-2)"),
build_msg_string
("Big5 Level-2 Chinese traditional"),
- build_string ("big5"), 0);
+ build_string ("big5"), 0, 0);
#ifdef ENABLE_COMPOSITE_CHARS
@@ -1269,7 +1287,7 @@
build_string ("Composite"),
build_msg_string ("Composite characters"),
build_msg_string ("Composite characters"),
- build_string (""), 0);
+ build_string (""), 0, 0);
#else
/* We create a hack so that we have a way of storing ESC 0 and ESC 1
sequences as "characters", so that they will be output correctly. */
@@ -1281,6 +1299,6 @@
build_string ("Composite hack"),
build_msg_string ("Composite characters hack"),
build_msg_string ("Composite characters hack"),
- build_string (""), 0);
+ build_string (""), 0, 0);
#endif /* ENABLE_COMPOSITE_CHARS */
}
Index: src/mule-coding.c
===================================================================
RCS file: /pack/xemacscvs/XEmacs/xemacs/src/mule-coding.c,v
retrieving revision 1.37
diff -u -u -r1.37 mule-coding.c
--- src/mule-coding.c 2006/05/11 08:58:01 1.37
+++ src/mule-coding.c 2006/05/21 19:08:54
@@ -96,6 +96,42 @@
return c >= 0xA1 && c <= 0xDF;
}
+inline static void
+dynarr_add_2022_one_dimension (Lisp_Object charset, Ibyte c,
+ unsigned char charmask,
+ unsigned_char_dynarr *dst)
+{
+ if (XCHARSET_ENCODE_AS_UTF_8 (charset))
+ {
+ encode_unicode_char (charset, c & charmask, 0,
+ dst, UNICODE_UTF_8, 0);
+ }
+ else
+ {
+ Dynarr_add (dst, c & charmask);
+ }
+}
+
+inline static void
+dynarr_add_2022_two_dimensions (Lisp_Object charset, Ibyte c,
+ unsigned int ch,
+ unsigned char charmask,
+ unsigned_char_dynarr *dst)
+{
+ if (XCHARSET_ENCODE_AS_UTF_8 (charset))
+ {
+ encode_unicode_char (charset,
+ ch & charmask,
+ c & charmask, dst,
+ UNICODE_UTF_8, 0);
+ }
+ else
+ {
+ Dynarr_add (dst, ch & charmask);
+ Dynarr_add (dst, c & charmask);
+ }
+}
+
/* Convert Shift-JIS data to internal format. */
static Bytecount
@@ -671,6 +707,10 @@
ISO_ESC_2_4, /* We've seen ESC $. This indicates
that we're designating a multi-byte, rather
than a single-byte, character set. */
+ ISO_ESC_2_5, /* We've seen ESC %. This indicates an escape to a
+ Unicode coding system; the only one of these
+ we're prepared to deal with is UTF-8, which has
+ the next character as G. */
ISO_ESC_2_8, /* We've seen ESC 0x28, i.e. ESC (.
This means designate a 94-character
character set into G0. */
@@ -752,11 +792,15 @@
character constructed by overstriking two or more characters). */
#define ISO_STATE_COMPOSITE (1 << 5)
+/* If set, we're processing UTF-8 encoded data within ISO-2022
+ processing. */
+#define ISO_STATE_UTF_8 (1 << 6)
+
/* ISO_STATE_LOCK is the mask of flags that remain on until explicitly
turned off when in the ISO2022 encoder/decoder. Other flags are turned
off at the end of processing each character or escape sequence. */
# define ISO_STATE_LOCK \
- (ISO_STATE_COMPOSITE | ISO_STATE_R2L)
+ (ISO_STATE_COMPOSITE | ISO_STATE_R2L | ISO_STATE_UTF_8)
typedef struct charset_conversion_spec
{
@@ -922,6 +966,9 @@
Lisp_Object current_charset;
int current_half;
int current_char_boundary;
+
+ /* Used for handling UTF-8. */
+ unsigned char counter;
};
static const struct memory_description ccs_description_1[] =
@@ -1344,6 +1391,15 @@
}
case ISO_ESC:
+
+ /* The only available ISO 2022 sequence in UTF-8 mode is ESC % @, to
+ exit from it. If we see any other escape sequence, pass it through
+ in the error handler. */
+ if (*flags & ISO_STATE_UTF_8 && '%' != c)
+ {
+ return 0;
+ }
+
switch (c)
{
/**** single shift ****/
@@ -1411,6 +1467,10 @@
iso->esc = ISO_ESC_2_4;
goto not_done;
+ case '%': /* Prefix to an escape to or from Unicode. */
+ iso->esc = ISO_ESC_2_5;
+ goto not_done;
+
default:
if (0x28 <= c && c <= 0x2F)
{
@@ -1432,9 +1492,31 @@
/* bzzzt! */
goto error;
}
-
-
+ /* ISO-IR 196 UTF-8 support. */
+ case ISO_ESC_2_5:
+ if ('G' == c)
+ {
+ /* Activate UTF-8 mode. */
+ *flags &= ISO_STATE_LOCK;
+ *flags |= ISO_STATE_UTF_8;
+ iso->esc = ISO_ESC_NOTHING;
+ return 1;
+ }
+ else if ('@' == c)
+ {
+ /* Deactivate UTF-8 mode. */
+ *flags &= ISO_STATE_LOCK;
+ *flags &= ~(ISO_STATE_UTF_8);
+ iso->esc = ISO_ESC_NOTHING;
+ return 1;
+ }
+ else
+ {
+ /* Oops, we don't support the other UTF-? coding systems within
+ ISO 2022, only in their own context. */
+ goto error;
+ }
/**** directionality ****/
case ISO_ESC_5_11: /* ISO6429 direction control */
@@ -1822,6 +1904,87 @@
}
ch = 0;
}
+ else if (flags & ISO_STATE_UTF_8)
+ {
+ unsigned char counter = data->counter;
+ Ibyte work[MAX_ICHAR_LEN];
+ int len;
+ Lisp_Object chr;
+
+ if (ISO_CODE_ESC == c)
+ {
+ /* Allow the escape sequence parser to end the UTF-8 state. */
+ flags |= ISO_STATE_ESCAPE;
+ data->esc = ISO_ESC;
+ data->esc_bytes_index = 1;
+ continue;
+ }
+
+ switch (counter)
+ {
+ case 0:
+ if (c >= 0xfc)
+ {
+ ch = c & 0x01;
+ counter = 5;
+ }
+ else if (c >= 0xf8)
+ {
+ ch = c & 0x03;
+ counter = 4;
+ }
+ else if (c >= 0xf0)
+ {
+ ch = c & 0x07;
+ counter = 3;
+ }
+ else if (c >= 0xe0)
+ {
+ ch = c & 0x0f;
+ counter = 2;
+ }
+ else if (c >= 0xc0)
+ {
+ ch = c & 0x1f;
+ counter = 1;
+ }
+ else
+ /* ASCII, or the lower control characters. */
+ Dynarr_add (dst, c);
+
+ break;
+ case 1:
+ ch = (ch << 6) | (c & 0x3f);
+ chr = Funicode_to_char(make_int(ch), Qnil);
+
+ if (!NILP (chr))
+ {
+ assert(CHARP(chr));
+ len = set_itext_ichar (work, XCHAR(chr));
+ Dynarr_add_many (dst, work, len);
+ }
+ else
+ {
+ /* Shouldn't happen, this code should only be enabled in
+ XEmacsen with support for all of Unicode. */
+ Dynarr_add (dst, LEADING_BYTE_JAPANESE_JISX0208);
+ Dynarr_add (dst, 34 + 128);
+ Dynarr_add (dst, 46 + 128);
+ }
+
+ ch = 0;
+ counter = 0;
+ break;
+ default:
+ ch = (ch << 6) | (c & 0x3f);
+ counter--;
+ }
+
+ if (str->eof)
+ DECODE_OUTPUT_PARTIAL_CHAR (ch, dst);
+
+ data->counter = counter;
+ }
else if (byte_c0_p (c) || byte_c1_p (c))
{ /* Control characters */
@@ -2010,6 +2173,7 @@
}
Dynarr_add (dst, ISO_CODE_ESC);
+
switch (type)
{
case CHARSET_TYPE_94:
@@ -2102,6 +2266,14 @@
{ /* Processing ASCII character */
ch = 0;
+ if (flags & ISO_STATE_UTF_8)
+ {
+ Dynarr_add (dst, ISO_CODE_ESC);
+ Dynarr_add (dst, '%');
+ Dynarr_add (dst, '@');
+ flags &= ~(ISO_STATE_UTF_8);
+ }
+
restore_left_to_right_direction (codesys, dst, &flags, 0);
/* Make sure G0 contains ASCII */
@@ -2145,18 +2317,43 @@
Dynarr_add (dst, c);
char_boundary = 1;
}
-
else if (ibyte_leading_byte_p (c) || ibyte_leading_byte_p (ch))
{ /* Processing Leading Byte */
ch = 0;
charset = charset_by_leading_byte (c);
if (leading_byte_prefix_p (c))
- ch = c;
+ {
+ ch = c;
+ }
+ else if (XCHARSET_ENCODE_AS_UTF_8 (charset))
+ {
+ assert (!EQ (charset, Vcharset_control_1)
+ && !EQ (charset, Vcharset_composite));
+
+ /* If the character set is to be encoded as UTF-8, the escape
+ is always the same. */
+ if (!(flags & ISO_STATE_UTF_8))
+ {
+ Dynarr_add (dst, ISO_CODE_ESC);
+ Dynarr_add (dst, '%');
+ Dynarr_add (dst, 'G');
+ flags |= ISO_STATE_UTF_8;
+ }
+ }
else if (!EQ (charset, Vcharset_control_1)
&& !EQ (charset, Vcharset_composite))
{
int reg;
+ /* End the UTF-8 state. */
+ if (flags & ISO_STATE_UTF_8)
+ {
+ Dynarr_add (dst, ISO_CODE_ESC);
+ Dynarr_add (dst, '%');
+ Dynarr_add (dst, '@');
+ flags &= ~(ISO_STATE_UTF_8);
+ }
+
ensure_correct_direction (XCHARSET_DIRECTION (charset),
codesys, dst, &flags, 0);
@@ -2274,12 +2471,14 @@
switch (XCHARSET_REP_BYTES (charset))
{
case 2:
- Dynarr_add (dst, c & charmask);
+ dynarr_add_2022_one_dimension (charset, c,
+ charmask, dst);
break;
case 3:
if (XCHARSET_PRIVATE_P (charset))
{
- Dynarr_add (dst, c & charmask);
+ dynarr_add_2022_one_dimension (charset, c,
+ charmask, dst);
ch = 0;
}
else if (ch)
@@ -2287,6 +2486,9 @@
#ifdef ENABLE_COMPOSITE_CHARS
if (EQ (charset, Vcharset_composite))
{
+ /* #### Hasn't been written to handle composite
+ characters yet. */
+ assert(!XCHARSET_ENCODE_AS_UTF_8 (charset));
if (in_composite)
{
/* #### Bother! We don't know how to
@@ -2310,8 +2512,8 @@
else
#endif /* ENABLE_COMPOSITE_CHARS */
{
- Dynarr_add (dst, ch & charmask);
- Dynarr_add (dst, c & charmask);
+ dynarr_add_2022_two_dimensions (charset, c, ch,
+ charmask, dst);
}
ch = 0;
}
@@ -2324,8 +2526,8 @@
case 4:
if (ch)
{
- Dynarr_add (dst, ch & charmask);
- Dynarr_add (dst, c & charmask);
+ dynarr_add_2022_two_dimensions (charset, c, ch,
+ charmask, dst);
ch = 0;
}
else
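For anyone testing the mule-coding.c changes above, here is a minimal
standalone sketch of the framing they implement: a run of UTF-8 is
bracketed by ESC % G and ESC % @, and the decoder tracks continuation
bytes with a counter. The sketch_* names are hypothetical and not part of
the patch. The sketch assumes code points in the basic multilingual plane
only (lead bytes 0xF0 and above are not handled) and ignores everything
outside the UTF-8 state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the patch: write one BMP code point as
   UTF-8 and return the number of bytes written. */
static size_t
sketch_put_utf8 (unsigned int cp, unsigned char *out)
{
  assert (cp <= 0xFFFF);
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  out[0] = (unsigned char) (0xE0 | (cp >> 12));
  out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  out[2] = (unsigned char) (0x80 | (cp & 0x3F));
  return 3;
}

/* Emit ESC % G, the code points as UTF-8, then ESC % @. */
static size_t
sketch_encode_run (const unsigned int *cps, size_t n, unsigned char *out)
{
  size_t len = 0, i;

  out[len++] = 0x1B; out[len++] = '%'; out[len++] = 'G';
  for (i = 0; i < n; i++)
    len += sketch_put_utf8 (cps[i], out + len);
  out[len++] = 0x1B; out[len++] = '%'; out[len++] = '@';
  return len;
}

/* Decode the UTF-8 runs in IN, storing code points in CPS.  COUNTER holds
   the number of continuation bytes still expected, as in the patch. */
static size_t
sketch_decode_run (const unsigned char *in, size_t len, unsigned int *cps)
{
  size_t i, n = 0;
  int in_utf8 = 0;
  unsigned int ch = 0;
  unsigned char counter = 0;

  for (i = 0; i < len; i++)
    {
      unsigned char c = in[i];

      if (c == 0x1B && i + 2 < len && in[i + 1] == '%'
          && (in[i + 2] == 'G' || in[i + 2] == '@'))
        {
          /* ESC % G enters the UTF-8 state, ESC % @ leaves it. */
          in_utf8 = (in[i + 2] == 'G');
          counter = 0;
          ch = 0;
          i += 2;
          continue;
        }
      if (!in_utf8)
        continue; /* Ordinary ISO 2022 decoding is out of scope here. */

      if (counter == 0)
        {
          if (c >= 0xE0)
            { ch = c & 0x0F; counter = 2; }
          else if (c >= 0xC0)
            { ch = c & 0x1F; counter = 1; }
          else
            cps[n++] = c; /* ASCII or a lower control character. */
        }
      else
        {
          ch = (ch << 6) | (c & 0x3F);
          if (--counter == 0)
            cps[n++] = ch;
        }
    }
  return n;
}
```

Passing the three code points U+0041, U+00E9 and U+263A through
sketch_encode_run and then sketch_decode_run should return the same three
values; the ESC % G and ESC % @ bytes are the only overhead added to the
UTF-8 itself.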
Index: src/redisplay-x.c
===================================================================
RCS file: /pack/xemacscvs/XEmacs/xemacs/src/redisplay-x.c,v
retrieving revision 1.41
diff -u -u -r1.41 redisplay-x.c
--- src/redisplay-x.c 2005/11/26 11:46:10 1.41
+++ src/redisplay-x.c 2006/05/21 19:08:55
@@ -230,6 +230,10 @@
}
#endif /* MULE */
*text_storage++ = (unsigned char) byte1;
+ /* This dimension stuff is broken if you want to use a two-dimensional
+ X11 font to display a single-dimensional character set, as is
+ appropriate for the IPA (use one of the -iso10646-1 fonts) or some
+ of the other non-standard character sets. */
if (dimension == 2)
*text_storage++ = (unsigned char) byte2;
#else /* USE_XFT */
Index: src/unicode.c
===================================================================
RCS file: /pack/xemacscvs/XEmacs/xemacs/src/unicode.c,v
retrieving revision 1.32
diff -u -u -r1.32 unicode.c
--- src/unicode.c 2006/04/22 15:18:55 1.32
+++ src/unicode.c 2006/05/21 19:08:56
@@ -321,6 +321,10 @@
Lisp_Object Qignore_first_column;
+Lisp_Object Vcurrent_jit_charset;
+Lisp_Object Qlast_allocated_character;
+Lisp_Object Qccl_encode_to_ucs_2;
+
/************************************************************************/
/* Unicode implementation */
@@ -1001,12 +1005,73 @@
}
static Ichar
+get_free_codepoint(Lisp_Object charset)
+{
+ Lisp_Object name = Fcharset_name(charset);
+ Lisp_Object zeichen = Fget(name, Qlast_allocated_character, Qnil);
+ Ichar res;
+
+ /* Only allow this with the 96x96 character sets we are using for
+ temporary Unicode support. */
+ assert(2 == XCHARSET_DIMENSION(charset) && 96 == XCHARSET_CHARS(charset));
+
+ if (!NILP(zeichen))
+ {
+ int c1, c2;
+
+ BREAKUP_ICHAR(XCHAR(zeichen), charset, c1, c2);
+
+ if (127 == c1 && 127 == c2)
+ {
+ /* We've already used the highest-numbered character in this
+ set--tell our caller to create another. */
+ return -1;
+ }
+
+ if (127 == c2)
+ {
+ ++c1;
+ c2 = 0x20;
+ }
+ else
+ {
+ ++c2;
+ }
+
+ res = make_ichar(charset, c1, c2);
+ Fput(name, Qlast_allocated_character, make_char(res));
+ }
+ else
+ {
+ res = make_ichar(charset, 32, 32);
+ Fput(name, Qlast_allocated_character, make_char(res));
+ }
+ return res;
+}
+
+/* The just-in-time creation of XEmacs characters that correspond to unknown
+ Unicode code points happens when:
+
+ 1. The lookup would otherwise fail.
+
+ 2. There is an entry in the charsets array for the just-in-time Unicode
+ charset.
+
+ If there are no free code points in the just-in-time Unicode character
+ set, and the charsets array is the default unicode precedence list,
+ create a new just-in-time Unicode character set, add it at the end of the
+ unicode precedence list, create the XEmacs character in that character
+ set, and return it. */
+
+static Ichar
unicode_to_ichar (int code, Lisp_Object_dynarr *charsets)
{
int u1, u2, u3, u4;
int code_levels;
int i;
int n = Dynarr_length (charsets);
+ static int number_of_jit_charsets;
+ static Ascbyte last_jit_charset_final;
type_checking_assert (code >= 0);
/* This shortcut depends on the representation of an Ichar, see text.c.
@@ -1040,8 +1105,68 @@
return make_ichar (charset, retval >> 8, retval & 0xFF);
}
}
+
+ /* Only do the magic just-in-time assignment if we're using the default
+ list. */
+ if (unicode_precedence_dynarr == charsets)
+ {
+ /* There's an issue with auto-save files here. The assignment of
+ Unicode code points to Mule characters becomes much less stable,
+ and auto-saved characters, if escape-quoted is used for the
+ encoding, will be different code points from one XEmacs invocation
+ to the next. Not ideal. :-( Still, it's better than trashing
+ unknown Unicode data by default, as was previously the
+ approach. */
+
+ if (NILP (Vcurrent_jit_charset) ||
+ (-1 == (i = get_free_codepoint(Vcurrent_jit_charset))))
+ {
+ Ascbyte setname[32];
+ Lisp_Object charset_descr = build_string
+ ("Mule charset for otherwise unknown Unicode code points.");
+ Lisp_Object charset_regr = build_string("iso10646-1");
+
+ struct gcpro gcpro1, gcpro2;
+
+ if ('\0' == last_jit_charset_final)
+ {
+ /* This final byte shit is, umm, not that cool. */
+ last_jit_charset_final = 0x30;
+ }
+
+ snprintf(setname, sizeof(setname),
+ "jit-ucs-charset-%d", number_of_jit_charsets++);
- return (Ichar) -1;
+ /* Aside: GCPROing here would be overkill according to the FSF's
+ philosophy. make-charset cannot currently GC, but is intended
+ to be called from Lisp, with its arguments protected by the
+ Lisp reader. We GCPRO in case it GCs in the future and no-one
+ checks all the C callers. */
+
+ GCPRO2 (charset_descr, charset_regr);
+ Vcurrent_jit_charset = Fmake_charset
+ (intern(setname), charset_descr,
+ nconc2 (list2(Qencode_as_utf_8, make_int(1)),
+ nconc2 (list6(Qcolumns, make_int(1), Qchars, make_int(96),
+ Qdimension, make_int(2)),
+ list6(Qregistry, charset_regr,
+ Qfinal, make_char(last_jit_charset_final++),
+ /* This CCL program is initialised in
+ unicode.el. */
+ Qccl_program, Qccl_encode_to_ucs_2))));
+ UNGCPRO;
+
+ i = get_free_codepoint(Vcurrent_jit_charset);
+ }
+
+ if (-1 != i)
+ {
+ set_unicode_conversion((Ichar)i, code);
+ /* No need to add the charset to the end of the list; it's done
+ automatically. */
+ }
+ }
+ return (Ichar) i;
}
/* Add charsets to precedence list.
@@ -1283,38 +1408,14 @@
When there is no international support (i.e. the `mule' feature is not
present), this function simply does `int-to-char' and ignores the CHARSETS
argument.
-
-Note that the current XEmacs internal encoding has no mapping for many
-Unicode code points, and if you use characters that are vaguely obscure with
-XEmacs' Unicode coding systems, you will lose data.
-
-To add support for some desired code point in the short term--note that our
-intention is to move to a Unicode-compatible internal encoding soon, for
-some value of soon--if you are a distributor, add something like the
-following to `site-start.el.'
-
-(make-charset 'distro-name-private
- "Private character set for DISTRO"
- '(dimension 1
- chars 96
- columns 1
- final ?5 ;; Change this--see docs for make-charset
- long-name "Private charset for some Unicode char support."
- short-name "Distro-Private"))
-
-(set-unicode-conversion
- (make-char 'distro-name-private #x20) #x263A) ;; WHITE SMILING FACE
-
-(set-unicode-conversion
- (make-char 'distro-name-private #x21) #x3030) ;; WAVY DASH
-
-;; ...
-;;; Repeat as necessary.
-
-Redisplay will work on the sjt-xft branch, but not with server-side X11
-fonts as is the default. However, data read in will be preserved when they
-are written out again.
+If the CODE would not otherwise be converted to an XEmacs character, and the
+list of character sets to be consulted is nil or the default, a new XEmacs
+character will be created for it in one of the `jit-ucs-charset' Mule
+character sets, and that character will be returned. There is scope for
+tens of thousands of separate Unicode code points in every session using
+this technique, so despite XEmacs' internal encoding not being based on
+Unicode, your data won't be trashed.
*/
(code, USED_IF_MULE (charsets)))
{
@@ -1558,16 +1659,6 @@
/* Unicode coding system */
/************************************************************************/
-/* ISO 10646 UTF-16, UCS-4, UTF-8, UTF-7, etc. */
-
-enum unicode_type
-{
- UNICODE_UTF_16,
- UNICODE_UTF_8,
- UNICODE_UTF_7,
- UNICODE_UCS_4
-};
-
struct unicode_coding_system
{
enum unicode_type type;
@@ -1728,7 +1819,9 @@
}
}
-static void
+/* Also used in mule-coding.c for UTF-8 handling in ISO 2022-oriented
+ encodings. */
+void
encode_unicode_char (Lisp_Object USED_IF_MULE (charset), int h,
int USED_IF_MULE (l), unsigned_char_dynarr *dst,
enum unicode_type type, unsigned int little_endian)
@@ -2444,6 +2537,8 @@
DEFSUBR (Fload_unicode_mapping_table);
+ DEFSYMBOL (Qccl_encode_to_ucs_2);
+ DEFSYMBOL (Qlast_allocated_character);
DEFSYMBOL (Qignore_first_column);
#endif /* MULE */
@@ -2518,6 +2613,9 @@
&lisp_object_dynarr_description);
init_blank_unicode_tables ();
+
+ staticpro (&Vcurrent_jit_charset);
+ Vcurrent_jit_charset = Qnil;
/* Note that the "block" we are describing is a single pointer, and hence
we could potentially use dump_add_root_block_ptr(). However, given
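
To illustrate how get_free_codepoint in the unicode.c changes above walks a
96x96 character set, here is a minimal standalone sketch. It is only an
approximation: struct sketch_jit_charset and sketch_next_free are
hypothetical names, and the position is packed into an int rather than
stored as an Ichar on the charset name's property list, as the patch does.

```c
#include <assert.h>

/* Hypothetical stand-in for the `last-allocated-character' property the
   patch keeps for each just-in-time charset. */
struct sketch_jit_charset
{
  int used;   /* Nonzero once the first position has been handed out. */
  int c1, c2; /* Last allocated position; each octet is 0x20..0x7F. */
};

/* Return the next free position as (c1 << 8) | c2, or -1 if the charset
   is full and the caller needs to create another one. */
static int
sketch_next_free (struct sketch_jit_charset *cs)
{
  if (!cs->used)
    {
      /* Nothing allocated yet; start at the first position. */
      cs->used = 1;
      cs->c1 = 0x20;
      cs->c2 = 0x20;
    }
  else if (cs->c1 == 0x7F && cs->c2 == 0x7F)
    {
      /* The highest-numbered position has already been used. */
      return -1;
    }
  else if (cs->c2 == 0x7F)
    {
      /* End of a row; move to the start of the next one. */
      cs->c1++;
      cs->c2 = 0x20;
    }
  else
    {
      cs->c2++;
    }

  assert (cs->c1 >= 0x20 && cs->c1 <= 0x7F);
  assert (cs->c2 >= 0x20 && cs->c2 <= 0x7F);
  return (cs->c1 << 8) | cs->c2;
}
```

Each such charset holds 96 * 96 = 9216 positions, so a -1 result is the
signal to create the next jit-ucs-charset and retry there.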
--
Aidan Kehoe,
http://www.parhasard.net/