Re: [QUERY] Support Unicode escapes in the Lisp reader, à la Java

Thursday, 20 April 2006

 On the seventeenth of April, Stephen J. Turnbull wrote:

...
 QUERY

 >>>>> "Aidan" == Aidan Kehoe <kehoea(a)parhasard.net> writes:

 2006-04-16  Aidan Kehoe  <kehoea(a)parhasard.net>

 	* lispref/objects.texi (Character Type):
 	Describe support for ?\u and ?\U as character escapes allowing you
 	to specify the Unicode code point of a character. 

 What is the purpose of having two syntaxes?   
Java does it that way; the idea was to make it comfortable for people coming
from there. That said, allowing variable-length constants and only allowing
the escapes in characters and strings already makes it unfamiliar to Java
people. 
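
For comparison, here is a minimal sketch of the fixed-width convention, four hex digits after \u and eight after \U, as found in languages such as C and Python. It is written in Python purely for illustration; `expand_fixed_width` is a hypothetical helper and has nothing to do with the actual XEmacs reader code.

```python
import re

# Hypothetical helper, not the XEmacs reader: expands escapes with fixed
# widths, \uXXXX (exactly 4 hex digits) and \UXXXXXXXX (exactly 8).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def expand_fixed_width(text):
    """Replace every fixed-width escape in text with the character it names."""
    def replace(match):
        digits = match.group(1) or match.group(2)
        return chr(int(digits, 16))
    return _ESCAPE.sub(replace, text)


# Raw strings keep the backslashes so the helper sees the escapes itself.
alpha = expand_fixed_width(r"\u03B1")      # GREEK SMALL LETTER ALPHA
clef = expand_fixed_width(r"\U0001D11E")   # a character outside the BMP
```

An escape with too few digits, such as \u3B1, is left untouched by this sketch; a real reader would more likely signal an error.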

...
 I would prefer a single syntax, ?\U<HEXDIGIT>+, with an error being
 signaled if the "character" can't be represented in that XEmacs.
I think triggering an error for that should be done in Funicode_to_char, not
in the Lisp reader. That way it would be available to all the
Unicode-oriented coding systems. 
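
A rough sketch of that layering, in Python for illustration only. The names below are hypothetical stand-ins (`unicode_to_char` merely echoes Funicode_to_char), and the check against the Unicode range is an assumption standing in for whatever "representable in this XEmacs" would mean in a given build.

```python
MAX_CODE_POINT = 0x10FFFF  # assumption: stands in for "representable here"


class UnrepresentableCharacter(ValueError):
    """Raised when a code point cannot be turned into a character."""


def unicode_to_char(code_point):
    """Single place where the representability check lives."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise UnrepresentableCharacter(f"U+{code_point:X} is not representable")
    return chr(code_point)


def read_escape(hex_digits):
    """Hypothetical reader hook: no check of its own, it just converts."""
    return unicode_to_char(int(hex_digits, 16))


def decode_utf32_be(data):
    """Hypothetical coding-system decoder sharing the same check."""
    return "".join(
        unicode_to_char(int.from_bytes(data[i:i + 4], "big"))
        for i in range(0, len(data), 4)
    )
```

Both callers get the same error behaviour without duplicating the test, which is the point of putting it in the conversion function.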

...
 Also, I think permitting elision of leading zeros is false convenience.
 You rarely see TWO-digit octal constants, even in syntaxes where they are
 permitted.
You see a one-digit octal constant in C all the time: '\0'. You don’t see
two-digit octal constants because people are used to C, where a two-digit
octal constant (counting the leading zero) has the same value as its decimal
counterpart and is just more typing. That is, 07 means exactly the same as 7,
and 03 means exactly the same as 3.
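
The arithmetic can be checked quickly. The snippet is Python rather than C, and parses the digit strings explicitly as base 8, since Python 3 does not accept C-style 07 literals.

```python
# Parse "00" through "07" as octal and compare with the plain decimal digit.
same = all(int(f"0{d}", 8) == d for d in range(8))

# With a second significant digit the two readings diverge: octal 010 is 8.
octal_010 = int("010", 8)
```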

(I much prefer hex escapes to octal escapes, myself, and drop leading zeroes
there all the time.)

...
 I think the same will happen here; ie, you'll typically see ?\U000A for
 linefeed, if people are going to use that syntax at all.
Well, not for line feed, I would imagine; people will stick with the old
syntax. Punctuation and Han characters are four digits already; the variable
length only becomes useful for people using non-Roman alphabets like Greek,
Cyrillic, Hebrew and Arabic. If anyone reading this uses any of them, shout!

...
 We should use the same syntax in strings as we do in characters, and in
 that case requiring four hexdigits will be easier to read and to write
 code for.
That patch handles strings already--the Lisp reader handles strings as a
succession of characters, but without question marks. 

...
 (I guess in that case it makes sense to have two syntaxes, since we do
 need to provide for Planes 1-16, but writing 8 digits 99% of the time
 would be unbearable.)
I also suspect that programmers not paying full attention (so, all of us, at
one point or another) would write things like "1\U239645" where they didn’t
mean the 45 to be part of the Unicode code point.
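
To make the hazard concrete, here is a small Python sketch comparing a greedy variable-length reading with a fixed four-digit one on that same string. Both patterns are hypothetical illustrations, not the patch's actual parsing rules.

```python
import re

# Hypothetical greedy reader: \U followed by as many hex digits as are present.
GREEDY = re.compile(r"\\U([0-9A-Fa-f]+)")
# Fixed-width alternative: \u followed by exactly four hex digits.
FIXED = re.compile(r"\\u([0-9A-Fa-f]{4})")


def code_points(pattern, text):
    """Return the code points (as ints) that the pattern reads from text."""
    return [int(m.group(1), 16) for m in pattern.finditer(text)]


# The greedy reading swallows the trailing "45" as part of the number.
greedy = code_points(GREEDY, r"1\U239645")
# The fixed-width reading stops after four digits and leaves "45" as text.
fixed = code_points(FIXED, r"1\u239645")
```

Here the greedy reading yields the single value 0x239645, which lies beyond the last Unicode code point (0x10FFFF), while the fixed-width reading yields 0x2396 followed by the literal characters 45.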

...
 I question the need for this at the present time, as code using this
 escape would necessarily be incompatible with 21.4.  I would prefer
 introducing these syntaxes for character constants when we convert the
 Lisp library source encoding to Unicode. 
I think that the earlier we introduce it, the better. Look how long it’s
taken for people to adopt #x in Lisp code, and that was a function of when
the syntax was introduced, first in XEmacs and then in GNU Emacs.

-- 
In the beginning God created the heavens and the earth. And God was a
bug-eyed, hexagonal smurf with a head of electrified hair; and God said:
“Si, mi chiamano Mimi...”
