[PATCH] Support Unicode escapes in the Lisp reader, à la Java

Sunday, 16 April 2006

Comments? Objections? 

man/ChangeLog addition:

2006-04-16  Aidan Kehoe  <kehoea(a)parhasard.net>

	* lispref/objects.texi (Character Type):
	Describe support for ?\u and ?\U as character escapes allowing you
	to specify the Unicode code point of a character. 

src/ChangeLog addition:

2006-04-16  Aidan Kehoe  <kehoea(a)parhasard.net>

	* lread.c:
	* lread.c (read_escape):
	Support \uABCD and \U00ABCDEF in character and string constants to
	specify an XEmacs character with the corresponding Unicode
	mapping. 

XEmacs Trunk source patch:
Diff command:   cvs -q diff -u
Files affected: src/lread.c man/lispref/objects.texi

Index: man/lispref/objects.texi
===================================================================
RCS file: /pack/xemacscvs/XEmacs/xemacs/man/lispref/objects.texi,v
retrieving revision 1.7
diff -u -u -r1.7 objects.texi
--- man/lispref/objects.texi	2003/06/30 09:31:01	1.7
+++ man/lispref/objects.texi	2006/04/16 20:45:05
@@ -510,6 +510,23 @@
 For example, character code 193 is a lowercase @samp{a} with an acute
 accent, in @sc{iso}-8859-1.)

+@cindex unicode character escape
+   From version 21.5.25 onwards, XEmacs provides a syntax for specifying
+characters by their Unicode code points.  @samp{?\uABCD} will give you
+an XEmacs character that maps to the code point @samp{U+ABCD} in
+Unicode-based representations (UTF-8 text files, Unicode-oriented fonts,
+etc.)  Just as in the Java language, there is a slightly different syntax
+for specifying characters with code points above @samp{#xFFFF};
+@samp{\U00ABCDEF} will give you an XEmacs character that maps to the
+code point @samp{U+ABCDEF} in Unicode-based representations, if such an
+XEmacs character exists. 
+
+  Unlike the Java language, XEmacs doesn't require exactly four or
+exactly eight hexadecimal digits for either syntax; you do not need to
+specify leading zeroes. Also unlike Java, while this syntax is available
+for character literals, and (see later) in strings, it is not available
+elsewhere in your Lisp source code.
+
 @ignore @c None of this crap applies to XEmacs.
   For use in strings and buffers, you are limited to the control
 characters that exist in @sc{ascii}, but for keyboard input purposes,
@@ -614,6 +631,7 @@
 @cindex backslash in character constant
 @cindex octal character code
 @cindex hexadecimal character code
+
   Finally, there are two read syntaxes involving character codes.
 It is not possible to represent multibyte or wide characters in this
 way; the permissible range of codes is from 0 to 255 (@emph{i.e.},
Index: src/lread.c
===================================================================
RCS file: /pack/xemacscvs/XEmacs/xemacs/src/lread.c,v
retrieving revision 1.76
diff -u -u -r1.76 lread.c
--- src/lread.c	2005/07/12 23:26:49	1.76
+++ src/lread.c	2006/04/16 20:45:06
@@ -208,6 +208,8 @@
 static int locate_file_open_or_access_file (Ibyte *fn, int access_mode);
 EXFUN (Fread_from_string, 3);

+EXFUN (Funicode_to_char, 2);  /* In unicode.c.  */
+
 /* When errors are signaled, the actual readcharfun should not be used
    as an argument if it is an lstream, so that lstreams don't escape
    to the Lisp level.  */
@@ -1675,6 +1677,9 @@
 {
   /* This function can GC */
   Ichar c = readchar (readcharfun);
+  /* \u allows up to four hex digits, \U up to eight. Default to the
+     behaviour for \u, and change this value in the case that \U is seen. */
+  int unicode_hex_count = 4;

   if (c < 0)
     signal_error (Qend_of_file, 0, READCHARFUN_MAYBE (readcharfun));
@@ -1791,11 +1796,50 @@
 	  }
 	return i;
       }
+    case 'U':
+      /* Post-Unicode-2.0: Up to eight hex chars */
+      unicode_hex_count = 8;
+    case 'u':
+
+      /* A Unicode escape, as in Java (though we only permit them in strings
+	 and characters, not arbitrarily in the source code.) Also, in
+	 contrast with Java, the sequence of hex digits doesn't have to be
+	 exactly four or eight hexadecimal digits long. */
+      {
+	REGISTER Ichar i = 0;
+	REGISTER int count = 0;
+	Lisp_Object lisp_char;
+	while (++count <= unicode_hex_count)
+	  {
+	    c = readchar (readcharfun);
+	    /* Remember, can't use isdigit(), isalpha() etc. on Ichars */
+	    if      (c >= '0' && c <= '9')  i = (i << 4) + (c - '0');
+	    else if (c >= 'a' && c <= 'f')  i = (i << 4) + (c - 'a') + 10;
+            else if (c >= 'A' && c <= 'F')  i = (i << 4) + (c - 'A') + 10;
+	    else
+	      {
+		unreadchar (readcharfun, c);
+		break;
+	      }
+	  }

-#ifdef MULE
-      /* #### need some way of reading an extended character with
-	 an escape sequence. */
-#endif
+	/* Okay, we've read the code point, set the expected number of
+	   digits back to the default.  */
+	unicode_hex_count = 4; 
+
+	lisp_char = Funicode_to_char(make_int(i), Qnil);
+	if (EQ(Qnil, lisp_char))
+	  {
+	    /* This is ugly and horrible and trashes the user's data, but
+	       it's what unicode.c does. In the future, unicode-to-char
+	       should not return nil.  */
+	    return make_ichar (Vcharset_japanese_jisx0208, 34 + 128, 46 + 128);
+	  }
+	else
+	  {
+	    return XCHAR(lisp_char);
+	  }
+      }

     default:
 	return c;
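
For anyone who wants to poke at the digit handling outside of XEmacs, the
loop in read_escape can be approximated by a small standalone function.
This is only a sketch: parse_unicode_escape is a hypothetical name, it
reads from a plain C string instead of a readcharfun, it assumes an
ASCII-compatible execution character set, and it stops before the
Funicode_to_char lookup, so it returns the raw code point.

```c
#include <stddef.h>

/* Hypothetical helper, not part of lread.c.  Accumulate up to MAX_DIGITS
   hexadecimal digits from S (4 for \u, 8 for \U), stopping early at the
   first non-hex character.  The number of characters consumed is stored
   in *CONSUMED; the accumulated code point is returned.  */
static unsigned long
parse_unicode_escape (const char *s, int max_digits, size_t *consumed)
{
  unsigned long value = 0;
  size_t n = 0;

  while (n < (size_t) max_digits)
    {
      char c = s[n];
      unsigned digit;

      if      (c >= '0' && c <= '9')  digit = (unsigned) (c - '0');
      else if (c >= 'a' && c <= 'f')  digit = (unsigned) (c - 'a') + 10;
      else if (c >= 'A' && c <= 'F')  digit = (unsigned) (c - 'A') + 10;
      else
        break;  /* Not a hex digit: leave it for the caller, which is
                   what unreadchar does in the real reader.  */

      value = (value << 4) + digit;
      n++;
    }

  *consumed = n;
  return value;
}
```

With this sketch, parse_unicode_escape ("e9 ", 4, &n) gives 0xE9 with n set
to 2, which mirrors the "no leading zeroes required" behaviour described in
the objects.texi text.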

-- 
In the beginning God created the heavens and the earth. And God was a
bug-eyed, hexagonal smurf with a head of electrified hair; and God said:
“Si, mi chiamano Mimi...”
