SUPERSEDES 21.5
This is more what I have in mind, as a first draft. In particular, I
pretty much think that the long Examples and Compatibility Notes
sections should be moved to the Lispref, leaving behind about 10 terse
lines (including a reference to the Lispref).
(I thought I sent this a couple of days ago, but apparently not; at
least, it didn't make it to XEmacs Patches.)
2006-11-25 Stephen J. Turnbull <stephen(a)xemacs.org>
	* text.c (make-char): Clarify docstring.  Thanks to Aidan
Kehoe for inspiration, discussion, and some wording.
diff --git a/src/text.c b/src/text.c
index c5d7aee..63f56b4 100644
--- a/src/text.c
+++ b/src/text.c
@@ -4840,27 +4840,76 @@ Lstream_funget_ichar (Lstream *stream, I
/************************************************************************/
DEFUN ("make-char", Fmake_char, 2, 3, 0, /*
-Make a character from CHARSET and octets ARG1 and ARG2.
-ARG2 is required only for characters from two-dimensional charsets.
-
-Each octet should be in the range 32 through 127 for a 96 or 96x96
-charset and 33 through 126 for a 94 or 94x94 charset. (Most charsets
-are either 96 or 94x94.) Note that this is 32 more than the values
-typically given for 94x94 charsets. When two octets are required, the
-order is "standard" -- the same as appears in ISO-2022 encodings,
-reference tables, etc.
-
-\(Note the following non-obvious result: Computerized translation
-tables often encode the two octets as the high and low bytes,
-respectively, of a hex short, while when there's only one octet, it
-goes in the low byte. When decoding such a value, you need to treat
-the two cases differently when calling make-char: One is (make-char
-CHARSET HIGH LOW), the other is (make-char CHARSET LOW).)
-
-For example, (make-char 'latin-iso8859-2 185) or (make-char
-'latin-iso8859-2 57) will return the Latin 2 character s with caron.
-
-As another example, the Japanese character for "kawa" (stream), which
+Make a character from CHARSET, ARG1, and ARG2.
+CHARSET is a Lisp charset.
+Each of ARG1 and (optional) ARG2 must be integers.
+
+When compiled with MULE, a Lisp charset describes a mapping from
+positions in a table of characters to the internal encoding. In a
+file, the charset may be changed in midstream by control sequences,
+defined by a Lisp coding system (see `coding-system-p').
+
+CHARSET determines whether ARG2 may be non-nil.  Most charsets require
+ARG2 to be nil.  The exceptions are mostly Asian "multibyte character
+sets" (for Chinese, Japanese, and Korean), which require ARG2.
+
+ARG1 (and ARG2, if present) should be in the range 32-127.  Some
+charsets impose further restrictions on these values; the most common
+restriction is to a 94-character set, which uses the values 33-126.
+When two integers are required, their order is that of the standard
+document's reference tables.  ARG1 and ARG2 must be fixnums; characters
+and bignums are not accepted.
+
+Control characters are exceptional. To create an ordinary ASCII
+control character, use the 'ascii charset and the character's ASCII
+value in the range 0-31. To create a Control-1 character in the range
+128-159, use the 'control-1 charset and its position relative to 128
+in the range 0-31.
+
+Most tables for extended character sets use code points 128-255 for
+non-ASCII characters.  For convenience, ARG1 and ARG2 may also be in
+the range 160-255, in which case 128 is subtracted from each before
+forming the character.  (Values in the range 128-159 may be used with
+both the 'ascii and 'control-1 charsets; in either case ARG1 is
+translated to the range 0-31 before applying.)  That is, \(make-char
+CHARSET CODE) produces the character with code point CODE for any
+CHARSET in the ISO 8859 family.  Note that `split-char' always
+returns values in the range 32-127 for ordinary characters, and in
+the range 0-31 for control characters, regardless of charset.
+
+When compiled without MULE, this function does not do much, but is
+provided for compatibility. In this case, CHARSET is interpreted as
+follows:
+
+`ascii' -- ARG1 should be in the range 0 through 127.
+`control-1' -- ARG1 should be in the range 128 through 159.
+ else -- ARG1 is coerced to be between 0 and 255, and then the high
+ bit is set.
+
+`int-to-char of the resulting ARG1' is returned, and ARG2 is always
+ignored.
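A Python sketch of that non-MULE fallback (assuming "coerced" means masking to 8 bits, which is a guess; the function name is invented, and the charset symbols mirror the docstring):

```python
def make_char_no_mule(charset, arg1, arg2=None):
    """Model the non-MULE behavior described above.  ARG2 is accepted
    but always ignored.  Returns the integer code of the character."""
    if charset == 'ascii':
        assert 0 <= arg1 <= 127
        return arg1
    if charset == 'control-1':
        assert 128 <= arg1 <= 159
        return arg1
    # Any other charset: coerce to 0-255, then set the high bit.
    return (arg1 & 0xFF) | 0x80

# Matches the Latin 2 example: position 57 becomes code point 185.
print(make_char_no_mule('latin-iso8859-2', 57))   # prints 185
```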
+
+Note that `char-to-int' returns the integer that is the internal
+representation of a character.  For backward compatibility, in MULE
+this turns out to be the usual code point for ISO 8859-1 \(i.e., the
+character set which has ASCII for code points 0-127 and Latin-1 for
+code points 160-255).  That is, the invariants
+
+\(= x (char-to-int (make-char 'ascii x))) ; for x in 0-127
+\(= x (char-to-int (make-char 'control-1 x))) ; for x in 128-159
+\(= x (char-to-int (make-char 'latin-iso8859-1 x))) ; for x in 160-255
+
+are satisfied.  No other charsets satisfy these invariants.  \(The
+current implementation is inconsistent about signaling an error when
+an integer that does not correspond to an 8-bit code is used.)
+
+Examples
+
+\(make-char 'latin-iso8859-2 #xB9) or \(make-char 'latin-iso8859-2 57)
+will return the Latin 2 character s with caron. \(You may occasionally
+see \(make-char 'latin-iso8859-2 185) in the XEmacs sources.)
+
+More complex is the Japanese character for "kawa" \(stream), which
looks something like this:
| |
@@ -4869,8 +4918,8 @@ looks something like this:
| | |
/ |
-appears in the Unicode Standard (version 2.0) on page 7-287 with the
-following values (see also page 7-4):
+It appears in the Unicode Standard with the following values in
+various Asian character sets, all derived from Chinese:
U 5DDD (Unicode)
G 0-2008 (GB 2312-80)
@@ -4880,33 +4929,55 @@ B A474 (Big Five)
C 1-4455 (CNS 11643-1986 (1st plane))
A 213C34 (ANSI Z39.64-1989)
-These are equivalent to:
-
-\(make-char 'chinese-gb2312 52 40)
-\(make-char 'japanese-jisx0208 64 110)
-\(make-char 'korean-ksc5601 116 57)
-\(make-char 'chinese-cns11643-1 76 87)
-\(decode-big5-char '(164 . 116))
-
-\(All codes above are two decimal numbers except for Big Five and ANSI
-Z39.64, which we don't support. We add 32 to each of the decimal
-numbers. Big Five is split in a rather hackish fashion into two
-charsets, `big5-1' and `big5-2', due to its excessive size -- 94x157,
-with the first codepoint in the range 0xA1 to 0xFE and the second in
-the range 0x40 to 0x7E or 0xA1 to 0xFE. `decode-big5-char' is used to
-generate the char from its codes, and `encode-big5-char' extracts the
-codes.)
-
-When compiled without MULE, this function does not do much, but it's
-provided for compatibility. In this case, the following CHARSET symbols
-are allowed:
-
-`ascii' -- ARG1 should be in the range 0 through 127.
-`control-1' -- ARG1 should be in the range 128 through 159.
- else -- ARG1 is coerced to be between 0 and 255, and then the high
- bit is set.
-
- `int-to-char of the resulting ARG1' is returned, and ARG2 is always ignored.
+\(Most of these can be found in tables in etc/unicode in the XEmacs
+sources.) To convert each code into an internal MULE character, use
+
+\(make-char 'chinese-gb2312 52 40) ; G 20 08
+\(make-char 'japanese-jisx0208 64 110) ; J 32 78
+\(make-char 'korean-ksc5601 116 57) ; K 84 25
+\(make-char 'chinese-cns11643-1 76 87) ; C 44 55
+
+The Asian standards present their tables in "ku-ten", or "row and
+column", form.  As the comments show, those values are not 4-digit
+decimal numbers; they are pairs of 2-digit decimal numbers
+concatenated as strings.  The rows and columns are in the range 1-94,
+so we add 32 to each decimal number passed to `make-char' to force it
+into the range 33-126.  The "1st plane" of CNS 11643 \(denoted by the
+suffix "-1" on the charset symbol) is an added complexity: the CNS
+standard contains around 80,000 characters, which cannot fit in 2
+bytes \(even without the restriction to 33-126!).
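As a check on the arithmetic, the ku-ten conversion can be modeled like this (illustrative Python, not part of XEmacs; the function name is invented):

```python
def kuten_to_make_char_args(kuten):
    """Convert a 4-digit ku-ten string such as "2008" into the pair
    of integers passed to `make-char': row and column, each plus 32."""
    ku, ten = int(kuten[:2]), int(kuten[2:])
    assert 1 <= ku <= 94 and 1 <= ten <= 94
    return (ku + 32, ten + 32)

print(kuten_to_make_char_args("2008"))   # G 20 08 -> prints (52, 40)
print(kuten_to_make_char_args("3278"))   # J 32 78 -> prints (64, 110)
```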
+
+Unicode, Big Five, and ANSI are special cases.  ANSI is used very
+little, and there is no corresponding MULE API for it.  Unicode and
+Big Five do not fit the 32-127 range assumed by `make-char', so they
+have their own "decode-char" APIs.  \(Think of these as "decode from
+the external file coding to the internal XEmacs coding"; there are
+corresponding "encode-char" APIs.)
+
+\(decode-big5-char '(#xA4 . #x74)) ; a cons, just to be different
+\(decode-char 'ucs #x5DDD) ; partly GNU Emacs compatible
+
+`decode-big5-char' is used to generate the char from its codes, and
+`encode-big5-char' extracts the codes. Big Five is split in hackish
+fashion into two charsets, `big5-1' and `big5-2', due to its excessive
+size -- 94x157, with the first codepoint in the range 0xA1 to 0xFE and
+the second in the range 0x40 to 0x7E or 0xA1 to 0xFE.
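For the curious, the split can be modeled roughly as follows (a Python sketch based on the Mule-era Big5 decoding arithmetic; the exact constants and the level-1/level-2 boundary are assumptions, so check the XEmacs sources before relying on it):

```python
def big5_to_mule(hi, lo):
    """Linearize a Big5 code (hi in 0xA1-0xFE, lo in 0x40-0x7E or
    0xA1-0xFE, giving 157 cells per row) and split it across two
    94x94 charsets, big5-1 and big5-2 (assumed split at hi = 0xC9)."""
    assert 0xA1 <= hi <= 0xFE
    assert 0x40 <= lo <= 0x7E or 0xA1 <= lo <= 0xFE
    cell = lo - 0x40 if lo <= 0x7E else lo - 0xA1 + 63
    index = (hi - 0xA1) * 157 + cell
    if hi < 0xC9:                       # Big5 Level 1
        charset = 'big5-1'
    else:                               # Big5 Level 2
        charset = 'big5-2'
        index -= (0xC9 - 0xA1) * 157
    # Fold the linear index into a 94x94 grid, offset to 33-126.
    return (charset, index // 94 + 33, index % 94 + 33)

print(big5_to_mule(0xA4, 0x74))   # the "kawa" code point above
```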
+
+Compatibility notes
+
+Note that all of the characters corresponding to Japanese "kawa" are
+different in MULE because they have different charsets! \(You can
+verify this by applying `char-to-int' to them.) This is also true for
+characters that exist in more than one of the ISO 8859 charsets, such
+as German sharp S. The latin-unity package helps with this problem.
+
+All this complexity is for backward compatibility with existing MULE
+code, which is based on the ISO 2022 standard for combining multiple
+charsets in a single stream. ISO 2022 is also the source of the
+restriction of position values to the range 32-127. Future versions
+of XEmacs will use Unicode as the single internal charset rather than
+try to make MULE code coherent. Further references are in the Lispref.
*/
(charset, arg1, USED_IF_MULE (arg2)))
{
END OF PATCH