On the seventeenth of April, Stephen J. Turnbull wrote:
QUERY
>>>>> "Aidan" == Aidan Kehoe <kehoea(a)parhasard.net> writes:
2006-04-16 Aidan Kehoe <kehoea(a)parhasard.net>
* lispref/objects.texi (Character Type):
Describe support for ?\u and ?\U as character escapes allowing you
to specify the Unicode code point of a character.
What is the purpose of having two syntaxes?
Java does it that way; the idea was to make it comfortable for people coming
from there. That said, allowing variable-length constants and only allowing
the escapes in characters and strings already makes it unfamiliar to Java
people.
I would prefer a single syntax, ?\U<HEXDIGIT>+, with an error being
signaled if the "character" can't be represented in that XEmacs.
I think triggering an error for that should be done in Funicode_to_char, not
in the Lisp reader. That way it would be available to all the
Unicode-oriented coding systems.
Also, I think permitting elision of leading zeros is false convenience.
You rarely see TWO-digit octal constants, even in syntaxes where they are
permitted.
You see a one-digit octal constant in C all the time: '\0'. You don’t see
two-digit octal constants because people are used to C, where two-digit octal
constants (counting the leading zero) are identical to their decimal
counterparts, except that they’re more typing. That is, 07 means exactly the
same as 7, and 03 means exactly the same as 3.
(I much prefer hex escapes to octal escapes, myself, and drop leading zeroes
there all the time.)
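To make the octal point concrete, here is a small Python sketch (not C, and
not anything from the patch). The helper name c_octal_value is hypothetical;
it only mimics how C assigns a value to an octal literal with a leading zero.

```python
def c_octal_value(literal: str) -> int:
    """Value of a C-style octal literal such as '07' or '012'.

    Hypothetical helper for illustration only; the leading zero is
    treated as part of the literal, as it is in C source.
    """
    if not literal.startswith("0") or any(ch not in "01234567" for ch in literal):
        raise ValueError(f"not a C-style octal literal: {literal!r}")
    return int(literal, 8)


# A leading zero plus one digit has the same value as the bare decimal digit.
single_digits_match = all(c_octal_value(f"0{d}") == d for d in range(8))

# With a second significant digit the octal and decimal readings diverge.
longer_literal = c_octal_value("012")
```

So the leading zero buys nothing for 00 through 07, and only changes the
value once more digits follow it.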
I think the same will happen here; i.e., you'll typically see ?\U000A for
linefeed, if people are going to use that syntax at all.
Well, not for line feed, I would imagine; people will stick with the old
syntax there. Punctuation and Han characters are four digits already; the
variable length only becomes useful for people using non-Roman alphabets,
like Greek, Cyrillic, Hebrew and Arabic. If anyone reading this uses any of
them, shout!
We should use the same syntax in strings as we do in characters, and in
that case requiring four hexdigits will be easier to read and to write
code for.
That patch handles strings already--the Lisp reader handles strings as a
succession of characters, but without question marks.
(I guess in that case it makes sense to have two syntaxes, since we do
need to provide for Planes 1-16, but writing 8 digits 99% of the time
would be unbearable.)
I also suspect that programmers not paying full attention (so, all of us, at
one point or another) would write things like "1\U239645" where they didn’t
mean the 45 to be part of the Unicode symbol.
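A rough Python sketch of that hazard, for illustration only. It is not the
XEmacs reader; read_variable and read_fixed are hypothetical names, and the
four-digit width is an assumption standing in for a fixed-length escape.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def read_variable(text: str, start: int) -> tuple[int, int]:
    """Greedy reader: consume every hex digit found from `start` onwards.

    Returns (code point, index just past the escape).
    """
    end = start
    while end < len(text) and text[end] in HEX_DIGITS:
        end += 1
    if end == start:
        raise ValueError("escape needs at least one hex digit")
    return int(text[start:end], 16), end


def read_fixed(text: str, start: int, width: int = 4) -> tuple[int, int]:
    """Fixed-width reader: consume exactly `width` hex digits."""
    digits = text[start:start + width]
    if len(digits) != width or any(ch not in HEX_DIGITS for ch in digits):
        raise ValueError(f"escape needs exactly {width} hex digits")
    return int(digits, 16), start + width


source = r"1\U239645"                  # raw string: backslash-U is kept literally
digits_at = source.index("\\U") + 2    # index of the first hex digit

greedy_code, greedy_end = read_variable(source, digits_at)
fixed_code, fixed_end = read_fixed(source, digits_at)
leftover = source[fixed_end:]          # text left over after the fixed-width escape
```

With the greedy reader the trailing 45 is swallowed and the value, 0x239645,
lies beyond U+10FFFF; with a fixed width of four the escape is 0x2396 and the
45 stays ordinary text.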
I question the need for this at the present time, as code using this
escape would necessarily be incompatible with 21.4. I would prefer
introducing these syntaxes for character constants when we convert the
Lisp library source encoding to Unicode.
I think that the earlier we introduce it, the better. Look how long it’s
taken for people to adopt #x in Lisp code, and that was a function of the
date of introduction of the syntax first in XEmacs and then in GNU Emacs.
--
In the beginning God created the heavens and the earth. And God was a
bug-eyed, hexagonal smurf with a head of electrified hair; and God said:
“Si, mi chiamano Mimi...”