SUPERSEDES 21.5
This is more what I have in mind, as a first draft. In particular, I
pretty much think that the long Examples and Compatibility Notes
sections should be moved to the Lispref, leaving behind about 10 terse
lines (including a reference to the Lispref).
(I thought I sent this a couple of days ago, but apparently not; at
least, it didn't make it to XEmacs Patches.)
2006-11-25 Stephen J. Turnbull <stephen(a)xemacs.org>
	* text.c (make-char): Clarify docstring.  Thanks to Aidan
Kehoe for inspiration, discussion, and some wording.
diff --git a/src/text.c b/src/text.c
index c5d7aee..63f56b4 100644
--- a/src/text.c
+++ b/src/text.c
@@ -4840,27 +4840,76 @@ Lstream_funget_ichar (Lstream *stream, I
/************************************************************************/
DEFUN ("make-char", Fmake_char, 2, 3, 0, /*
-Make a character from CHARSET and octets ARG1 and ARG2.
-ARG2 is required only for characters from two-dimensional charsets.
-
-Each octet should be in the range 32 through 127 for a 96 or 96x96
-charset and 33 through 126 for a 94 or 94x94 charset. (Most charsets
-are either 96 or 94x94.) Note that this is 32 more than the values
-typically given for 94x94 charsets. When two octets are required, the
-order is "standard" -- the same as appears in ISO-2022 encodings,
-reference tables, etc.
-
-\(Note the following non-obvious result: Computerized translation
-tables often encode the two octets as the high and low bytes,
-respectively, of a hex short, while when there's only one octet, it
-goes in the low byte. When decoding such a value, you need to treat
-the two cases differently when calling make-char: One is (make-char
-CHARSET HIGH LOW), the other is (make-char CHARSET LOW).)
-
-For example, (make-char 'latin-iso8859-2 185) or (make-char
-'latin-iso8859-2 57) will return the Latin 2 character s with caron.
-
-As another example, the Japanese character for "kawa" (stream), which
+Make a character from CHARSET, ARG1, and ARG2.
+CHARSET is a Lisp charset.
+Each of ARG1 and (optional) ARG2 must be integers.
+
+When compiled with MULE, a Lisp charset describes a mapping from
+positions in a table of characters to the internal encoding. In a
+file, the charset may be changed in midstream by control sequences,
+defined by a Lisp coding system (see `coding-system-p').
+
+CHARSET determines whether ARG2 may be non-nil.  Most charsets require
+ARG2 to be nil.  The exceptions are mostly Asian "multibyte character
+sets" (for Chinese, Japanese, and Korean), which require ARG2.
+
+ARG1 (and ARG2, if present) should be in the range 32-127.  Some
+charsets impose further restrictions on these values; the most common
+restriction is to a 94-character set, which uses the values 33-126.
+When two integers are required, their order is that of the standard
+document's reference tables.  ARG1 and ARG2 must be fixnums; characters
+and bignums are not accepted.
+
+Control characters are exceptional. To create an ordinary ASCII
+control character, use the 'ascii charset and the character's ASCII
+value in the range 0-31. To create a Control-1 character in the range
+128-159, use the 'control-1 charset and its position relative to 128
+in the range 0-31.
+
+Most tables for extended character sets use code points 128-255 for
+non-ASCII characters.  For convenience, ARG1 and ARG2 may also be in
+the range 160-255, in which case 128 is subtracted from each before
+forming the character.  (Values in the range 128-159 may be used with
+both the 'ascii and 'control-1 charsets; in either case ARG1 is
+translated to the range 0-31 before applying.)  That is, \(make-char
+CHARSET CODE) produces the character with code point CODE for any
+CHARSET in the ISO 8859 family.  Note that `split-char' always
+returns values in the range 32-127 for ordinary characters, and in
+the range 0-31 for control characters, regardless of charset.
+
+When compiled without MULE, this function does not do much, but is
+provided for compatibility. In this case, CHARSET is interpreted as
+follows:
+
+`ascii' -- ARG1 should be in the range 0 through 127.
+`control-1' -- ARG1 should be in the range 128 through 159.
+ else -- ARG1 is coerced to be between 0 and 255, and then the high
+ bit is set.
+
+`int-to-char of the resulting ARG1' is returned, and ARG2 is always
+ignored.
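A Python sketch of that non-MULE fallback (assuming "coerced" means masking to 8 bits, which is a guess; the function name is invented, and the charset symbols mirror the docstring):

```python
def make_char_no_mule(charset, arg1, arg2=None):
    """Model the non-MULE behavior described above.  ARG2 is accepted
    but always ignored.  Returns the integer code of the character."""
    if charset == 'ascii':
        assert 0 <= arg1 <= 127
        return arg1
    if charset == 'control-1':
        assert 128 <= arg1 <= 159
        return arg1
    # Any other charset: coerce to 0-255, then set the high bit.
    return (arg1 & 0xFF) | 0x80

# Matches the Latin 2 example: position 57 becomes code point 185.
print(make_char_no_mule('latin-iso8859-2', 57))   # prints 185
```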
+
+Note that `char-to-int' returns the integer that is the internal
+representation of a character.  For backward compatibility, in MULE
+this turns out to be the usual code point for ISO 8859-1 \(i.e., the
+character set which has ASCII for code points 0-127 and Latin-1 for
+code points 160-255).  That is, the invariants
+
+\(= x (char-to-int (make-char 'ascii x))) ; for x in 0-127
+\(= x (char-to-int (make-char 'control-1 x))) ; for x in 128-159
+\(= x (char-to-int (make-char 'latin-iso8859-1 x))) ; for x in 160-255
+
+are satisfied.  No other charsets satisfy these invariants.  \(The
+current implementation is inconsistent about signaling an error when
+an integer that does not correspond to an 8-bit code is used.)
+
+Examples
+
+\(make-char 'latin-iso8859-2 #xB9) or \(make-char 'latin-iso8859-2 57)
+will return the Latin 2 character s with caron. \(You may occasionally
+see \(make-char 'latin-iso8859-2 185) in the XEmacs sources.)
+
+More complex is the Japanese character for "kawa" \(stream), which
looks something like this:
| |
@@ -4869,8 +4918,8 @@ looks something like this:
| | |
/ |
-appears in the Unicode Standard (version 2.0) on page 7-287 with the
-following values (see also page 7-4):
+It appears in the Unicode Standard with the following values in
+various Asian character sets, all derived from Chinese:
U 5DDD (Unicode)
G 0-2008 (GB 2312-80)
@@ -4880,33 +4929,55 @@ B A474 (Big Five)
C 1-4455 (CNS 11643-1986 (1st plane))
A 213C34 (ANSI Z39.64-1989)
-These are equivalent to:
-
-\(make-char 'chinese-gb2312 52 40)
-\(make-char 'japanese-jisx0208 64 110)
-\(make-char 'korean-ksc5601 116 57)
-\(make-char 'chinese-cns11643-1 76 87)
-\(decode-big5-char '(164 . 116))
-
-\(All codes above are two decimal numbers except for Big Five and ANSI
-Z39.64, which we don't support. We add 32 to each of the decimal
-numbers. Big Five is split in a rather hackish fashion into two
-charsets, `big5-1' and `big5-2', due to its excessive size -- 94x157,
-with the first codepoint in the range 0xA1 to 0xFE and the second in
-the range 0x40 to 0x7E or 0xA1 to 0xFE. `decode-big5-char' is used to
-generate the char from its codes, and `encode-big5-char' extracts the
-codes.)
-
-When compiled without MULE, this function does not do much, but it's
-provided for compatibility. In this case, the following CHARSET symbols
-are allowed:
-
-`ascii' -- ARG1 should be in the range 0 through 127.
-`control-1' -- ARG1 should be in the range 128 through 159.
- else -- ARG1 is coerced to be between 0 and 255, and then the high
- bit is set.
-
- `int-to-char of the resulting ARG1' is returned, and ARG2 is always ignored.
+\(Most of these can be found in tables in etc/unicode in the XEmacs
+sources.) To convert each code into an internal MULE character, use
+
+\(make-char 'chinese-gb2312 52 40) ; G 20 08
+\(make-char 'japanese-jisx0208 64 110) ; J 32 78
+\(make-char 'korean-ksc5601 116 57) ; K 84 25
+\(make-char 'chinese-cns11643-1 76 87) ; C 44 55
+
+The Asian standards present their tables in "ku-ten", or "row and
+column", form.  As the comments show, those values are not 4-digit
+decimal numbers; they are pairs of 2-digit decimal numbers
+concatenated as strings.  The rows and columns are in the range 1-94,
+so we add 32 to each decimal number passed to `make-char' to force it
+into the range 33-126.  The "1st plane" of CNS 11643 \(denoted by the
+suffix "-1" on the charset symbol) is an added complexity: the CNS
+standard contains around 80,000 characters, which cannot fit in 2
+bytes \(even without the restriction to 33-126!).
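As a check on the arithmetic, the ku-ten conversion can be modeled like this (illustrative Python, not part of XEmacs; the function name is invented):

```python
def kuten_to_make_char_args(kuten):
    """Convert a 4-digit ku-ten string such as "2008" into the pair
    of integers passed to `make-char': row and column, each plus 32."""
    ku, ten = int(kuten[:2]), int(kuten[2:])
    assert 1 <= ku <= 94 and 1 <= ten <= 94
    return (ku + 32, ten + 32)

print(kuten_to_make_char_args("2008"))   # G 20 08 -> prints (52, 40)
print(kuten_to_make_char_args("3278"))   # J 32 78 -> prints (64, 110)
```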
+
+Unicode, Big Five, and ANSI are special cases.  ANSI is used very
+little, and there is no corresponding MULE API for it.  Unicode and
+Big Five do not fit the 32-127 range assumed by `make-char', so they
+have their own "decode-char" APIs.  \(Think of these as "decode from
+the external file coding to the internal XEmacs coding"; there are
+corresponding "encode-char" APIs.)
+
+\(decode-big5-char '(#xA4 . #x74)) ; a cons, just to be different
+\(decode-char 'ucs #x5DDD) ; partly GNU Emacs compatible
+
+`decode-big5-char' is used to generate the char from its codes, and
+`encode-big5-char' extracts the codes. Big Five is split in hackish
+fashion into two charsets, `big5-1' and `big5-2', due to its excessive
+size -- 94x157, with the first codepoint in the range 0xA1 to 0xFE and
+the second in the range 0x40 to 0x7E or 0xA1 to 0xFE.
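For the curious, the split can be modeled roughly as follows (a Python sketch based on the Mule-era Big5 decoding arithmetic; the exact constants and the level-1/level-2 boundary are assumptions, so check the XEmacs sources before relying on it):

```python
def big5_to_mule(hi, lo):
    """Linearize a Big5 code (hi in 0xA1-0xFE, lo in 0x40-0x7E or
    0xA1-0xFE, giving 157 cells per row) and split it across two
    94x94 charsets, big5-1 and big5-2 (assumed split at hi = 0xC9)."""
    assert 0xA1 <= hi <= 0xFE
    assert 0x40 <= lo <= 0x7E or 0xA1 <= lo <= 0xFE
    cell = lo - 0x40 if lo <= 0x7E else lo - 0xA1 + 63
    index = (hi - 0xA1) * 157 + cell
    if hi < 0xC9:                       # Big5 Level 1
        charset = 'big5-1'
    else:                               # Big5 Level 2
        charset = 'big5-2'
        index -= (0xC9 - 0xA1) * 157
    # Fold the linear index into a 94x94 grid, offset to 33-126.
    return (charset, index // 94 + 33, index % 94 + 33)

print(big5_to_mule(0xA4, 0x74))   # the "kawa" code point above
```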
+
+Compatibility notes
+
+Note that all of the characters corresponding to Japanese "kawa" are
+different in MULE because they have different charsets! \(You can
+verify this by applying `char-to-int' to them.) This is also true for
+characters that exist in more than one of the ISO 8859 charsets, such
+as German sharp S. The latin-unity package helps with this problem.
+
+All this complexity is for backward compatibility with existing MULE
+code, which is based on the ISO 2022 standard for combining multiple
+charsets in a single stream. ISO 2022 is also the source of the
+restriction of position values to the range 32-127. Future versions
+of XEmacs will use Unicode as the single internal charset rather than
+try to make MULE code coherent. Further references are in the Lispref.
*/
(charset, arg1, USED_IF_MULE (arg2)))
{
END OF PATCH