Comments? Objections?
man/ChangeLog addition:
2006-04-16 Aidan Kehoe <kehoea(a)parhasard.net>
* lispref/objects.texi (Character Type):
Describe support for ?\u and ?\U as character escapes allowing you
to specify the Unicode code point of a character.
src/ChangeLog addition:
2006-04-16 Aidan Kehoe <kehoea(a)parhasard.net>
* lread.c:
* lread.c (read_escape):
Support \uABCD and \U00ABCDEF in character and string constants to
specify an XEmacs character with the corresponding Unicode
mapping.
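To illustrate the digit-reading rule described in the entry above (up to four hex digits after \u, up to eight after \U, with fewer digits permitted), here is a minimal standalone sketch. It is not part of the patch: the function name parse_unicode_escape_digits and its string-based interface are hypothetical, chosen so the logic can be tried outside XEmacs. The real code in read_escape reads from readcharfun and pushes the first non-hex character back with unreadchar.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the patch.  Reads at most max_digits
   hexadecimal digits from s, stopping early at the first character that
   is not a hex digit.  Stores the number of characters consumed in
   *consumed and returns the accumulated code point. */
static unsigned long
parse_unicode_escape_digits (const char *s, int max_digits, size_t *consumed)
{
  unsigned long code = 0;
  size_t n = 0;

  assert (s != NULL && consumed != NULL);

  while (n < (size_t) max_digits)
    {
      char c = s[n];

      /* Compare against explicit ranges, as the patch does, rather than
         relying on isdigit()/isxdigit(). */
      if (c >= '0' && c <= '9')
        code = (code << 4) + (unsigned long) (c - '0');
      else if (c >= 'a' && c <= 'f')
        code = (code << 4) + (unsigned long) (c - 'a') + 10;
      else if (c >= 'A' && c <= 'F')
        code = (code << 4) + (unsigned long) (c - 'A') + 10;
      else
        break;  /* Not a hex digit: leave it for the caller. */

      n++;
    }

  *consumed = n;
  return code;
}
```

Called with max_digits set to 4, this mirrors the \u case, and with 8 the \U case. Note that when no hex digit follows the escape, the sketch returns 0 with nothing consumed; the loop in the patch appears to behave the same way and would pass code point 0 on to unicode-to-char.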
XEmacs Trunk source patch:
Diff command: cvs -q diff -u
Files affected: src/lread.c man/lispref/objects.texi
Index: man/lispref/objects.texi
===================================================================
RCS file: /pack/xemacscvs/XEmacs/xemacs/man/lispref/objects.texi,v
retrieving revision 1.7
diff -u -u -r1.7 objects.texi
--- man/lispref/objects.texi 2003/06/30 09:31:01 1.7
+++ man/lispref/objects.texi 2006/04/16 20:45:05
@@ -510,6 +510,23 @@
For example, character code 193 is a lowercase @samp{a} with an acute
accent, in @sc{iso}-8859-1.)
+@cindex unicode character escape
+ From version 21.5.25 onwards, XEmacs provides a syntax for specifying
+characters by their Unicode code points. @samp{?\uABCD} will give you
+an XEmacs character that maps to the code point @samp{U+ABCD} in
+Unicode-based representations (UTF-8 text files, Unicode-oriented fonts,
+etc.) Just as in the Java language, there is a slightly different syntax
+for specifying characters with code points above @samp{#xFFFF};
+@samp{\U00ABCDEF} will give you an XEmacs character that maps to the
+code point @samp{U+ABCDEF} in Unicode-based representations, if such an
+XEmacs character exists.
+
+ Unlike the Java language, XEmacs doesn't require exactly four or
+exactly eight hexadecimal digits for either syntax; you do not need to
+specify leading zeroes. Also unlike Java, while this syntax is available
+for character literals, and (see later) in strings, it is not available
+elsewhere in your Lisp source code.
+
@ignore @c None of this crap applies to XEmacs.
For use in strings and buffers, you are limited to the control
characters that exist in @sc{ascii}, but for keyboard input purposes,
@@ -614,6 +631,7 @@
@cindex backslash in character constant
@cindex octal character code
@cindex hexadecimal character code
+
Finally, there are two read syntaxes involving character codes.
It is not possible to represent multibyte or wide characters in this
way; the permissible range of codes is from 0 to 255 (@emph{i.e.},
Index: src/lread.c
===================================================================
RCS file: /pack/xemacscvs/XEmacs/xemacs/src/lread.c,v
retrieving revision 1.76
diff -u -u -r1.76 lread.c
--- src/lread.c 2005/07/12 23:26:49 1.76
+++ src/lread.c 2006/04/16 20:45:06
@@ -208,6 +208,8 @@
static int locate_file_open_or_access_file (Ibyte *fn, int access_mode);
EXFUN (Fread_from_string, 3);
+EXFUN (Funicode_to_char, 2); /* In unicode.c. */
+
/* When errors are signaled, the actual readcharfun should not be used
as an argument if it is an lstream, so that lstreams don't escape
to the Lisp level. */
@@ -1675,6 +1677,9 @@
{
/* This function can GC */
Ichar c = readchar (readcharfun);
+ /* \u allows up to four hex digits, \U up to eight. Default to the
+ behaviour for \u, and change this value in the case that \U is seen. */
+ int unicode_hex_count = 4;
if (c < 0)
signal_error (Qend_of_file, 0, READCHARFUN_MAYBE (readcharfun));
@@ -1791,11 +1796,50 @@
}
return i;
}
+ case 'U':
+ /* Post-Unicode-2.0: Up to eight hex chars */
+ unicode_hex_count = 8; /* Deliberate fall-through to the \u case. */
+ case 'u':
+
+ /* A Unicode escape, as in Java (though we only permit them in strings
+ and characters, not arbitrarily in the source code.) Also, in
+ contrast with Java, the sequence of hex digits doesn't have to be
+ exactly four or eight hexadecimal digits long. */
+ {
+ REGISTER Ichar i = 0;
+ REGISTER int count = 0;
+ Lisp_Object lisp_char;
+ while (++count <= unicode_hex_count)
+ {
+ c = readchar (readcharfun);
+ /* Remember, can't use isdigit(), isalpha() etc. on Ichars */
+ if (c >= '0' && c <= '9') i = (i << 4) + (c - '0');
+ else if (c >= 'a' && c <= 'f') i = (i << 4) + (c - 'a') + 10;
+ else if (c >= 'A' && c <= 'F') i = (i << 4) + (c - 'A') + 10;
+ else
+ {
+ unreadchar (readcharfun, c);
+ break;
+ }
+ }
-#ifdef MULE
- /* #### need some way of reading an extended character with
- an escape sequence. */
-#endif
+ /* Okay, we've read the code point, set the expected number of
+ digits back to the default. */
+ unicode_hex_count = 4;
+
+ lisp_char = Funicode_to_char(make_int(i), Qnil);
+ if (EQ(Qnil, lisp_char))
+ {
+ /* This is ugly and horrible and trashes the user's data, but
+ it's what unicode.c does. In the future, unicode-to-char
+ should not return nil. */
+ return make_ichar (Vcharset_japanese_jisx0208, 34 + 128, 46 + 128);
+ }
+ else
+ {
+ return XCHAR(lisp_char);
+ }
+ }
default:
return c;
--
In the beginning God created the heavens and the earth. And God was a
bug-eyed, hexagonal smurf with a head of electrified hair; and God said:
“Si, mi chiamano Mimi...”