NOTE: This patch has been committed.
SUPERSEDES 17474.44623.340909.789205(a)parhasard.net
On the twentieth of April, Stephen J. Turnbull wrote:
You're right. I guess I'm on the side of two syntaxes then, and I
still prefer fixed width because I don't want to have to remember to
write \u00DEADBEEF instead of \uDEADBEEF just because A happens to be
a hex digit.
Okay, fixed width it is.
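To make the trade-off concrete, here is a minimal C sketch contrasting the two ways of reading the digits. It is an illustration only, not the XEmacs reader: the names hex_digit_value, read_fixed_width and read_greedy are hypothetical, and the input is assumed to be a NUL-terminated string holding the text that follows the \u.

```c
#include <assert.h>

/* Value of one hex digit, or -1 if C is not a hex digit.
   (Hypothetical helper, not part of XEmacs.) */
static int hex_digit_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Fixed width: consume exactly WIDTH hex digits from S.
   Returns the number of characters consumed, or -1 if fewer than
   WIDTH hex digits are available. */
static int read_fixed_width(const char *s, int width, unsigned long *out)
{
    unsigned long v = 0;
    int i;

    assert(width > 0 && width <= 8);
    for (i = 0; i < width; i++) {
        int d = hex_digit_value((unsigned char) s[i]);
        if (d < 0)
            return -1;          /* also catches the terminating NUL */
        v = (v << 4) | (unsigned long) d;
    }
    *out = v;
    return width;
}

/* Variable width (the rejected alternative): consume hex digits for
   as long as they keep coming.  Returns the number consumed, or -1
   if there were none. */
static int read_greedy(const char *s, unsigned long *out)
{
    unsigned long v = 0;
    int n = 0;
    int d;

    while ((d = hex_digit_value((unsigned char) s[n])) >= 0) {
        v = (v << 4) | (unsigned long) d;
        n++;
    }
    if (n == 0)
        return -1;
    *out = v;
    return n;
}
```

With the fixed-width reader, the text "00DEADBEEF" after \u yields U+00DE and leaves "ADBEEF" as ordinary characters, whatever those characters are. The greedy reader would swallow all of "DEADBEEF" as one number, so the author would have to pad only when the following text happens to begin with a hex digit.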
>> I question the need for this at the present time, as code using
>> this escape would necessarily be incompatible with 21.4.
>> I would prefer introducing these syntaxes for character constants
>> when we convert the Lisp library source encoding to Unicode.
Aidan> I think that the earlier we introduce it, the better. Look
Aidan> how long it’s taken for people to adopt #x in Lisp code,
Aidan> and that was a function of the date of introduction of the
Aidan> syntax first in XEmacs and then in GNU Emacs.
If GNU Emacs already has it, OK. Otherwise, that argument says
"Nobody will use it until GNU has it, anyway...."
They don’t have it currently, and they have the same problem with their
internal representation, and so would have the same benefit from it.
I’ve sent an implementation their way. I suspect if not ignored because of
the eternal release, it’ll be ignored because of its provenance. Oh well.
Note that converting the Lisp sources to Unicode can be done in 21.5
at any time. (At the price of inventing some syntax to signal it,
because doing it in the packages is another matter.)
? A coding cookie’s not good enough?
man/ChangeLog addition:
2006-04-29 Aidan Kehoe <kehoea(a)parhasard.net>
* lispref/objects.texi (Character Type):
Document the Unicode escape syntax for character and string
literals.
src/ChangeLog addition:
2006-04-29 Aidan Kehoe <kehoea(a)parhasard.net>
* lread.c:
* lread.c (read_escape):
Support \uABCD and \U00ABCDEF for specifying characters by their
Unicode code point.
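
For readers who want the shape of the change without reading the diff, the following standalone C sketch mirrors its control flow: 'U' raises the digit count and falls through to 'u', the digits are accumulated four bits at a time, and a code point with no corresponding character is replaced by a substitute. The function names and the SKETCH_ constants are hypothetical. In particular, the range check against 0x10FFFF is only an assumed stand-in for unicode-to-char returning nil, and '~' mirrors the non-MULE fallback in the patch.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_CODE_POINT 0x10FFFFUL  /* assumed stand-in for "no such char" */
#define SKETCH_REPLACEMENT    '~'         /* same fallback as the non-MULE build */

/* Decode the text after a backslash, e.g. "u00e9" or "U0001F600".
   On success store the code point and the number of characters
   consumed, and return 0.  Return -1 on a malformed escape. */
static int sketch_read_unicode_escape(const char *s,
                                      unsigned long *code_point,
                                      size_t *consumed)
{
    int digits = 4;             /* \u takes four hex digits */
    unsigned long value = 0;
    size_t pos = 0;
    int i;

    assert(s != NULL && code_point != NULL && consumed != NULL);

    switch (s[pos++]) {
    case 'U':
        digits = 8;             /* \U takes eight hex digits */
        /* fall through */
    case 'u':
        break;
    default:
        return -1;              /* not a Unicode escape at all */
    }

    for (i = 0; i < digits; i++, pos++) {
        int c = (unsigned char) s[pos];
        unsigned long d;

        if (c >= '0' && c <= '9')      d = (unsigned long) (c - '0');
        else if (c >= 'a' && c <= 'f') d = (unsigned long) (c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') d = (unsigned long) (c - 'A' + 10);
        else return -1;         /* non-hex digit, or string ended early */

        value = (value << 4) | d;
    }

    *code_point = value;
    *consumed = pos;
    return 0;
}

/* Substitute a replacement when the code point has no character. */
static unsigned long sketch_substitute(unsigned long code_point)
{
    if (code_point > SKETCH_MAX_CODE_POINT)
        return (unsigned long) SKETCH_REPLACEMENT;
    return code_point;
}
```

The real patch signals a Lisp syntax error instead of returning -1, and asks unicode-to-char whether a character exists rather than checking a numeric range.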
XEmacs Trunk source patch:
Diff command: cvs -q diff -u
Files affected: src/lread.c man/lispref/objects.texi
Index: man/lispref/objects.texi
===================================================================
RCS file: /pack/xemacscvs/XEmacs/xemacs/man/lispref/objects.texi,v
retrieving revision 1.7
diff -u -u -r1.7 objects.texi
--- man/lispref/objects.texi 2003/06/30 09:31:01 1.7
+++ man/lispref/objects.texi 2006/04/29 14:32:22
@@ -510,6 +510,21 @@
For example, character code 193 is a lowercase @samp{a} with an acute
accent, in @sc{iso}-8859-1.)
+@cindex unicode character escape
+ From version 21.5.25 onwards, XEmacs provides a syntax for specifying
+characters by their Unicode code points. @samp{?\uABCD} will give you
+an XEmacs character that maps to the code point @samp{U+ABCD} in
+Unicode-based representations (UTF-8 text files, Unicode-oriented fonts,
+etc.) Just as in the C# language, there is a slightly different syntax
+for specifying characters with code points above @samp{#xFFFF};
+@samp{\U00ABCDEF} will give you an XEmacs character that maps to the
+code point @samp{U+ABCDEF} in Unicode-based representations, if such an
+XEmacs character exists.
+
+ Unlike in C#, while this syntax is available for character literals,
+and (see later) in strings, it is not available elsewhere in your Lisp
+source code.
+
@ignore @c None of this crap applies to XEmacs.
For use in strings and buffers, you are limited to the control
characters that exist in @sc{ascii}, but for keyboard input purposes,
@@ -614,6 +629,7 @@
@cindex backslash in character constant
@cindex octal character code
@cindex hexadecimal character code
+
Finally, there are two read syntaxes involving character codes.
It is not possible to represent multibyte or wide characters in this
way; the permissible range of codes is from 0 to 255 (@emph{i.e.},
Index: src/lread.c
===================================================================
RCS file: /pack/xemacscvs/XEmacs/xemacs/src/lread.c,v
retrieving revision 1.76
diff -u -u -r1.76 lread.c
--- src/lread.c 2005/07/12 23:26:49 1.76
+++ src/lread.c 2006/04/29 14:32:23
@@ -208,6 +208,8 @@
static int locate_file_open_or_access_file (Ibyte *fn, int access_mode);
EXFUN (Fread_from_string, 3);
+EXFUN (Funicode_to_char, 2); /* In unicode.c. */
+
/* When errors are signaled, the actual readcharfun should not be used
as an argument if it is an lstream, so that lstreams don't escape
to the Lisp level. */
@@ -1675,6 +1677,9 @@
{
/* This function can GC */
Ichar c = readchar (readcharfun);
+ /* \u takes exactly four hex digits, \U exactly eight. Default to the
+ behaviour for \u, and change this value in the case that \U is seen. */
+ int unicode_hex_count = 4;
if (c < 0)
signal_error (Qend_of_file, 0, READCHARFUN_MAYBE (readcharfun));
@@ -1763,7 +1768,7 @@
}
}
if (i >= 0400)
- syntax_error ("Attempt to create non-ASCII/ISO-8859-1 character",
+ syntax_error ("Non-ISO-8859-1 character specified with octal escape",
make_int (i));
return i;
}
@@ -1791,11 +1796,51 @@
}
return i;
}
+ case 'U':
+ /* Post-Unicode-2.0: Up to eight hex chars */
+ unicode_hex_count = 8;
+ case 'u':
+ /* A Unicode escape, as in C# (though we only permit them in strings
+ and characters, not arbitrarily in the source code.) */
+ {
+ REGISTER Ichar i = 0;
+ REGISTER int count = 0;
+ Lisp_Object lisp_char;
+ while (++count <= unicode_hex_count)
+ {
+ c = readchar (readcharfun);
+ /* Remember, can't use isdigit(), isalpha() etc. on Ichars */
+ if (c >= '0' && c <= '9') i = (i << 4) + (c - '0');
+ else if (c >= 'a' && c <= 'f') i = (i << 4) + (c - 'a') + 10;
+ else if (c >= 'A' && c <= 'F') i = (i << 4) + (c - 'A') + 10;
+ else
+ {
+ syntax_error ("Non-hex digit used for Unicode escape",
+ make_char (c));
+ break;
+ }
+ }
+
+ lisp_char = Funicode_to_char(make_int(i), Qnil);
+
+ if (EQ(Qnil, lisp_char))
+ {
+ /* This is ugly and horrible and trashes the user's data, but
+ it's what unicode.c does. In the future, unicode-to-char
+ should not return nil. */
#ifdef MULE
- /* #### need some way of reading an extended character with
- an escape sequence. */
+ i = make_ichar (Vcharset_japanese_jisx0208, 34 + 128, 46 + 128);
+#else
+ i = '~';
#endif
+ return i;
+ }
+ else
+ {
+ return XCHAR(lisp_char);
+ }
+ }
default:
return c;
cvs server: src/xft-fonts.h is a new entry, no comparison available
--
In the beginning God created the heavens and the earth. And God was a
bug-eyed, hexagonal smurf with a head of electrified hair; and God said:
“Si, mi chiamano Mimi...”