Stephen J. Turnbull wrote:
>>>>>"Ben" == Ben Wing <ben(a)666.com>
writes:
>>>>>
>>>>>
Ben> the only thing that's really "quite clear" is that you
Ben> haven't bothered to actually read any of the code i've
Ben> written.
You're absolutely right: it was a _major_ bother to read it. In the
future, check it in to CVS or I'm not going to read it.
in time.
Ben> please go do that, then come back to me with *specific*
Ben> issues referring to *actual* code, and i'll be glad to
Ben> discuss them.
Vcharset_japanese_shift_jis. Shift JIS is not a character set, and
this object is only used in the corresponding coding system. It
should not be visible from Lisp.
it's a coded character set, no? it's treated that way in other
applications. if i want to create an mbcs coding system for shift-jis,
i need to combine the shift-jis coded charset with half-width katakana
and ascii. either i hard-code it into C or i do it from lisp, at which
point you need the shift-jis charset visible. there is currently no
other method for declaring a coded charset than using the `charset'
object (that's its point, in fact). also, are you *sure* that shift-jis
is *identical* to jisx0208-anything? might it not have extra characters
in it?
I don't understand what the problem with not-a-character values for
256^N character sets is. There aren't any. I would say one way to
deal with that problem (if it really is a problem in practice) is to
pick an arbitrary code point that is unlikely to be a character or
likely to be rarely used (such as ASCII SUB, 0x1A or ASCII NUL), and
use that as the not-a-character value. If you encounter it, then look
it up in the inverse table (ie, to Unicode). If the "Unicode" value
is 0xFFFF, then it's not-a-character.
well, i did pick such a value, 0x0001 i think. i'll see about
implementing the inverse lookup trick.
There's a remark about needing an mbcs detector. That's not possible
in the coding category framework AFAICS; mbcs as a category is
indistinguishable from binary (although instances such as Shift JIS
and KOI8 can be detected). Even when you go to the (statistical)
classifier framework, it looks to me like you're going to need a
different detector for _every_ mbcs coding system. Common operations
like range checking could be factored out, but (for example) KOI8 and
EUC-JP (with uniform ranges) will do it differently from Shift JIS and
Big5 (which have ranges which vary according to whether they are
single-byte, first-of-two, or second-of-two). So each mbcs will have
to have some code to make its idiosyncratic checks.
right. in general, every coding system needs its own detector (mbcs or
not). the current system is extremely bogus.
ISTR that there was a comment about KOI8 being assigned the ISO8
category although it's really mbcs; it might be better to make it
binary. Hard to say. Shift JIS and Big5, OTOH, clearly should be
conflated with binary IMO. Would it be hard to make a no-C0-controls
category, which only allows the usual whitespace characters from the
0x00-0x1F,0x7F set? All of the mbcs encodings I know of would fit
into that. Even EBCDIC almost does!
well, that comment was left over from the previous implementation of
koi8 as ccl; it used to say "this should really be in the ccl category
but xemacs doesn't have one". of course, a ccl category is totally
bogus, as bogus as an mbcs category.
it would indeed be possible to make a no-c0-controls category.
actually, there's already code doing checks like this in some
of the detectors, to distinguish them from binary.
in general, there needs to be a separation of detectors from categories;
and in fact categories should disappear entirely. instead we should
just have one "category" per coding system, and each detector assigns
some likelihood to all the coding systems that it knows how to detect.
a simple mbcs detector, for example, would check the characters that go
by, for each existing mbcs coding system, to see whether they are
defined in that coding system, and based on this, assign some likelihood
to all mbcs coding systems. we could then break ties based on some
preferred list of coding systems.
Why use Dynarrs for the charset list? An mbcs needs two charsets,
plus the minimum Dynarr overhead of 5 words. That's the space
required by 7 charsets. Seems rather implausible that you'd need more
than that. Wouldn't an application with 8 or more charsets be better
served by coalescing the non-overlapping charsets in the list into a
single charset?
well, it doesn't hurt to use dynarrs. there won't be anywhere near enough mbcs
coding systems for the space used by them to make any difference, and
i'd rather not build in arbitrary limits if there's no need to.
In configure.ac you should force Mule on if Unicode is requested, not
force Unicode off if Mule is not present. (Barfing and saying "make
up your mind" would also be reasonable, but I really think that people
who want Unicode _really want Unicode_.)
will do.
I don't understand the logic for nuking the columns and direction
properties of charsets. That's just going to introduce GNU
incompatibilities in table formatting and the like.
l2r is the default; likewise the default for the columns is the
dimension of the charset. i'm not nuking these properties but just
eliminating unnecessary arguments to simplify the code.
Shouldn't the no-such-character value be (Ichar) 0xFFFF, rather than
(Ichar) (-1)? I guess your tables are shorts, but still it's ugly,
and in code where they're mixed with ints could cause trouble.
you're confusing Ichars with the indices on the to-charset tables. the
to-charset tables use unsigned shorts to represent the two octets in a
charset codepoint. Ichars represent actual characters (unicode or
old-mule), and are ints. since all valid character values are
non-negative, a negative number is a good choice for "no such
character", but 0xFFFF or any other positive number isn't, even if
0xFFFF is theoretically "not a character" under Unicode.
char_table subtables need to become Lisp objects, or Marcus's new
allocator will break.
right, i did this last night.
The stuff about ranges spanning charsets in regex.c pinned my
bogometer. Ranges don't even make sense within charsets any more,
because collating order varies with locale in POSIX, and different
languages collate the same charsets in different ways. I agree that
ranges are a very convenient UI, but we should deprecate them,
document that they do not have a universal meaning, and document what
the meaning is in XEmacs (probably To Be Announced, I think we should
avoid committing).
sure, collating order varies, but regexps are about searching, not
sorting. a range over some part of a national charset specifies a
well-defined set of characters. a range over different charsets makes
no sense since charsets themselves are unordered wrt each other. a
range over any subset of unicode is also well-defined. there's no
reason to deprecate ranges.
Is Ichar_charset_obsolete_me_baby_please gone yet?
not until i rewrite the font-handling code. that won't happen till
after i get the current code working and checked in.
ben