Re: yet another XEmacs fork;-)

Re: serializing hash tables

Olivier Galibert

Friday, 14 November 2008

11:58 a.m.

On Fri, Nov 14, 2008 at 11:04:02AM +0000, Julian Bradfield wrote:

...

Nothing, and that was my first thought. But then I thought that iconv would have the advantage of making it easy to load the translation tables on demand - why should I have 20MB (or however much it is) of translation tables in main memory of every XEmacs instance, when my typical instance only ever sees Latin-1 and maybe a few dozen hanzi?

If memory is your problem, I'm pretty sure that you can dump them to C and make them fit in less than a read-only, shareable, load-on-demand megabyte without any special effort. The iso-8859, jisx, gb, cns etc mappings are not going to change any time soon once they're established.

The #1 problem I've seen with real unicode support on unix is the rendering, specifically the font selection, especially when you want to still be able to reach the bitmap fonts[1]. I don't think fontconfig/xft allows to reach them, it's written by anti-aliased-vector-fonts integrists who consider their use is the only one relevant. And that documentation is for wimps, and "deprecated" and "stubbed" are synonyms.

OG.

[1] I can't work without -Misc-MiscFixed-Medium-R-Normal--10-100-75-75-C-60-*, but I can stand using a different name as long as the glyph rendering is identical.

_______________________________________________
XEmacs-Beta mailing list
XEmacs-Beta(a)xemacs.org
http://calypso.tux.org/cgi-bin/mailman/listinfo/xemacs-beta

Julian Bradfield

Friday, 14 November

5:37 p.m.

New subject: yet another XEmacs fork;-)

On 2008-11-14, Olivier Galibert <galibert(a)pobox.com> wrote:

...

Yes, I thought that at first also. Maybe I should go back to that.

...

The #1 problem I've seen with real unicode support on unix is the rendering, specifically the font selection, especially when you want to still be able to reach the bitmap fonts[1]. I don't think

As I wrote in my previous reply to Stephen, but then deleted before sending, "it's not as if Mule does any of the hard stuff";-) [so why is it hard to get rid of?]

...

fontconfig/xft allows to reach them, it's written by anti-aliased-vector-fonts integrists who consider their use is the only one relevant. And that documentation is for wimps, and "deprecated" and "stubbed" are synonyms.

Indeed. I think I've managed to eliminate all scalable fonts from my world, except in Firefox, where it's too much like hard work to get rid of them.

...

[1] I can't work without -Misc-MiscFixed-Medium-R-Normal--10-100-75-75-C-60-*, but I can stand using a different name as long as the glyph rendering is identical.

You must have good eyesight (or a large monitor)! Personally, I can't live without -misc-...-15-140-... and its homebrewed doublewidth counterpart.

Stephen J. Turnbull

Saturday, 15 November

12:53 p.m.

Julian Bradfield writes:

...

As I wrote in my previous reply to Stephen, but then deleted before sending, "it's not as if Mule does any of the hard stuff";-) [so why is it hard to get rid of?]

Compatibility. Also, there *is* a lot of hard stuff in Mule. Regexps, for one. Font-handling, for another; OOo and Firefox do a crummy job with mixed Han texts. I don't know how we're going to emulate Mule in that respect with Unicode as the primary internal character set. Seriously f*cked-up file systems are another (try reading a file whose name contains UTF-8, KOI8-R, and Shift JIS in different segments in them other apps ... yes, I've seen such!) In fact, Python just went through a big debate on coding systems for the file system, which ended when Guido declared that designing a system to do it right was too hard, so they went with a 99% proposal. And there's a huge amount of tweaking in the detection code and other places.

Julian Bradfield

Sunday, 16 November

noon

...

Regexps, for one. Font-handling, for another; OOo and Firefox do a crummy job with mixed Han texts. I don't know how we're going to emulate Mule in that respect with Unicode as the primary internal character set.

I think the Unicode-ish party line on that one would be: "XEmacs is a text editor, not a typesetting program. It is sufficient that plain text be readable; it is not necessary for plain text in an editor to be typographically perfect. Han unification has been designed so that text is always readable, even if viewed with fonts from the wrong locale. Correct typography is the responsibility of higher-level markup and applications." (Apart from the first sentence, that is a brief paraphrase of the Unibook section on the topic.) Now I don't know how true that is. *My* problem is that as someone who doesn't speak or read any East Asian language, Han unification does unify glyphs that to me look quite distinct; but if to any CJK speaker, they don't, then I have to agree with the party line.

...

Seriously f*cked-up file systems are another (try reading a file whose name contains UTF-8, KOI8-R, and Shift JIS in different segments in them other apps ... yes, I've seen such!) In fact, Python just went through a big debate on coding systems for the file system, which ended when Guido declared that designing a system to do it right was too hard, so they went with a 99% proposal. And there's a huge amount of tweaking in the detection code and other places.

How do systems deal with the problem that in some encodings (any ISO2022 that allows general character sets) there are many octet-strings that encode the same abstract text string?

Come to that, how do UTF-8 based filesystems (Windows, Mac) behave when faced with a filename that is invalid - or are the OSes sufficiently well written to validate filenames on creation?

(Many years ago, we had a Pyramid Unix system, which had a network file system interface to the Vaxen. This interface did so little checking of filenames that it was possible, from a Vax, to create a Unix file on the Pyramid with a '/' in its name! Of course, the only way to remove it, or access it in any way, was from a Vax.)

Glynn Clements

Sunday, 16 November

5:19 p.m.

Julian Bradfield wrote:

...

How do systems deal with the problem that in some encodings (any ISO2022 that allows general character sets) there are many octet-strings that encode the same abstract text string?

For the most part, they don't. Unicode fans pretend that anything which causes problems for Unicode doesn't exist (or is "obsolete"; apparently, they get to decide that this is the case). Anyone who actually needs to use such encodings typically avoids Unicode like the plague (I've yet to see a Japanese game for Windows which uses the Unicode API rather than the codepage-based API). If you want to retrieve a filename from the OS then pass it back at a later point, you need to retain the raw data. If you can't get at the raw data, you lose.

...

Come to that, how do UTF-8 based filesystems (Windows, Mac) behave when faced with a filename that is invalid - or are the OSes sufficiently well written to validate filename on creation?

Mostly. A more significant point is that Unicode strings aren't strings of "characters", but of Unicode code points. The conversions between Unicode and abstract characters suffer from many of the same problems as with traditional encodings. E.g. an accented letter can often be represented either as a single code representing the composed character or as a sequence of the base letter and a combining accent (Windows and Linux typically use the former, while MacOSX uses the latter). NTFS will happily let you have files whose names represent identical text but differ in the exact sequence of codepoints.
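The composed versus decomposed point can be illustrated with a minimal Python sketch. It assumes only the standard `unicodedata` module; the sample word is an arbitrary choice and does not come from the thread.

```python
import unicodedata

# Two spellings of the same visible text "café".
composed = "caf\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# As sequences of code points they are different strings.
same_code_points = composed == decomposed

# Normalizing both to one form (NFC here) makes them compare equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# NFD goes the other way and splits the precomposed character in two.
nfd_length = len(unicodedata.normalize("NFD", composed))

print(same_code_points, nfc_equal, nfd_length)
```

A filesystem that stores names without normalizing them would treat `composed` and `decomposed` as two distinct names, which is the NTFS behaviour described above.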

...

(Many years ago, we had a Pyramid Unix system, which had a network file system interface to the Vaxen. This interface did so little checking of filenames that it was possible, from a Vax, to create a Unix file on the Pyramid with a '/' in its name! Of course, the only way to remove it, or access it in any way, was from a Vax.)

A slightly similar situation exists on Windows, at least for registry keys (I'm not sure about filenames). The "native" NT API represents strings using an explicit length, while the Win32 API uses NUL termination. Using the native API, you can create registry keys which contain embedded NUL characters. It's impossible to specify such keys via the Win32 API. They will show up in RegEdit, minus the first NUL and anything following it. Attempting to examine the key's subkeys or value will result in a "key not found" error.

--
Glynn Clements <glynn(a)gclements.plus.com>
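The mismatch between counted and NUL-terminated strings can be shown by analogy in Python with `ctypes`. This is only a sketch of the general mechanism, not of the registry API; the key name is made up for illustration.

```python
import ctypes

# Hypothetical key name with an embedded NUL; the name is invented.
counted_name = b"Visible\x00Hidden"

# create_string_buffer stores all the bytes and appends one terminating NUL.
buf = ctypes.create_string_buffer(counted_name)

# A length-counted view sees every byte that was stored.
counted_view = buf.raw[:len(counted_name)]

# A NUL-terminated view, as a C string API would read it, stops at the first NUL.
terminated_view = buf.value

print(counted_view, terminated_view)
```

Any interface that only accepts the terminated view has no way to name the object that the counted view created.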

Stephen J. Turnbull

Monday, 17 November

3:28 a.m.

Glynn Clements writes:

...

Julian Bradfield wrote:

> How do systems deal with the problem that in some encodings (any ISO2022 that allows general character sets) there are many octet-strings that encode the same abstract text string?

For the most part, they don't. Unicode fans pretend that anything which causes problems for Unicode doesn't exist (or is "obsolete"; apparently, they get to decide that this is the case).

Well, yes, they do. If you want something else, feel free to start your own standards effort. Ken'ichi Handa will help, I'm sure. ;-) However, even for 99% of Han users, simply putting things into the appropriate font will work. The only people who really need to disambiguate Han are Buddhist scholars; even Japanese high school students read their Chinese poetry in Japanese fonts.

...

Anyone who actually needs to use such encodings typically avoids Unicode like the plague (I've yet to see a Japanese game for Windows which uses the Unicode API rather than the codepage-based API).

Use any example but Japanese, please. Japanese exceptionalism is alive and well throughout the society. I find it hard to believe that changing your fonts when you change your .mo files wouldn't work fine for games as it does almost everywhere else, except in truly multilingual text; I think that Japanese just enjoy being different.

...

If you want to retrieve a filename from the OS then pass it back at a later point, you need to retain the raw data. If you can't get at the raw data, you lose.

That's exactly the conclusion the Python people just came to.

...

> Come to that, how do UTF-8 based filesystems (Windows, Mac) behave when faced with a filename that is invalid - or are the OSes sufficiently well written to validate filename on creation?

Mostly.

Mac OS X is not---it's just a Unix VFS---although HFS+ more or less is a validating FS. But even on HFS+ it's not hard to bypass the validation; see the comments on the ISO-8859-2 test in tests/automated/mule-tests.el. And in general, any system that supports mounting arbitrary file systems cannot guarantee validation. Your Pyramid/Vax example is perfectly general. In fact, typically GNOME and Windows applications simply silently drop such file names when encountered on the system.

...

A more significant point is that Unicode strings aren't strings of "characters", but of Unicode code points. The conversions between Unicode and abstract characters suffer from many of the same problems as with traditional encodings.

No, they suffer from various forms of inefficiency, but since there are two canonical decompositions you just have to do like the Japanese and make sure all strings take off their muddy shoes at the door and put on canonicalized slippers before entering the house. This isn't possible with traditional encodings, and of course it does require a lot of programmer discipline to construct and use these interfaces.

Stephen J. Turnbull

Monday, 17 November

3:55 a.m.

Julian Bradfield writes:

...

I think the Unicode-ish party line on that one would be: "XEmacs is a text editor, not a typesetting program.

I see no reason why it shouldn't be both. However, I believe that Unicode is sufficient to distinguish different characters, and that language information should be included in the markup, not in the character code. On the other hand, I think that language structure should be reflected in the document structure, not implied by the characters contained.

...

Now I don't know how true that is. *My* problem is that as someone who doesn't speak or read any East Asian language, Han unification does unify glyphs that to me look quite distinct; but if to any CJK speaker, they don't, then I have to agree with the party line.

There is a minority of CJK speakers who disagree. AFAIK the most vocal ones are mostly Japanese. As a non-speaker, though, you should not trust your eyes. The number and order in which the strokes are written is very important, more so than the orientation in many cases. For example, are you aware that in Han characters, rectangles are usually triangles (it only takes three strokes to write the character for "mouth", which is a square)? However, the not-so-extremists among those folks basically go to the extent of saying that "A" in Lucida Typewriter is not the same as "A" in Fraktur! (Extremists even take exception to systematic differences similar to the use of serifs in Latin glyphs.) There really is no consensus as to where to draw the line among them, either, whereas the Unicode advocates have a set of rules that are easy to apply in many thousands of common cases and ambiguous in very few cases, even for rare glyphs (except for the case of "lost" glyphs whose meaning is uncertain). Even the fanatics agree that the unified characters are closely related, though the Japanese ones insist that "Japanese" characters have an ineffable "Japanese spirit" not present in Chinese versions ....

...

How do systems deal with the problem that in some encodings (any ISO2022 that allows general character sets) there are many octet-strings that encode the same abstract text string?

They invariably treat those strings as different strings, just as Mule does (except as modified by latin-unity and similar GNU features).
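A small Python sketch of how one abstract string can have several ISO 2022 byte encodings, using the standard `iso2022_jp` codec. The sample text is arbitrary, and the codec is used only as a convenient stand-in for ISO 2022 in general.

```python
text = "日本"

# Encoded in one go: one escape into JIS X 0208, both characters,
# then one escape back to ASCII.
one_run = text.encode("iso2022_jp")

# Encoded character by character and concatenated: each character gets
# its own pair of escape sequences, so the byte string differs.
two_runs = b"".join(ch.encode("iso2022_jp") for ch in text)

same_bytes = one_run == two_runs
same_text = one_run.decode("iso2022_jp") == two_runs.decode("iso2022_jp") == text

print(same_bytes, same_text)
```

A system that compares filenames byte by byte sees `one_run` and `two_runs` as different names, even though both decode to the same text.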

...

(Many years ago, we had a Pyramid Unix system, which had a network file system interface to the Vaxen.

Oops! That was your example not Glynn's, I see. Mea maxima culpa!

Glynn Clements

Monday, 17 November

11:01 a.m.

Stephen J. Turnbull wrote:

...

> > How do systems deal with the problem that in some encodings (any ISO2022 that allows general character sets) there are many octet-strings that encode the same abstract text string?

> For the most part, they don't. Unicode fans pretend that anything which causes problems for Unicode doesn't exist (or is "obsolete"; apparently, they get to decide that this is the case).

Well, yes, they do. If you want something else, feel free to start your own standards effort. Ken'ichi Handa will help, I'm sure. ;-) However, even for 99% of Han users, simply putting things into the appropriate font will work. The only people who really need to disambiguate Han are Buddhist scholars; even Japanese high school students read their Chinese poetry in Japanese fonts.

I was referring mainly to the technical issues, e.g. the non-reversibility of encoding conversions. The problem with Unicode isn't that it's inherently defective as an encoding, but some of the "universalism" in the way that it's often used. E.g. languages or libraries which insist that all "text" is represented in Unicode, so that e.g. readdir() -> open() fails for files which don't match a specific encoding. Any such problems are then waved away with "use UTF-8 for all filenames". No mention of how to handle filenames obtained from binary data streams with no specified encoding (e.g. tar/zip/rar files, FTP), or whether we're supposed to simply ditch customers who have other ideas about which encodings to use for their data.

...

> Anyone who actually needs to use such encodings typically avoids Unicode like the plague (I've yet to see a Japanese game for Windows which uses the Unicode API rather than the codepage-based API).

Use any example but Japanese, please. Japanese exceptionalism is alive and well throughout the society. I find it hard to believe that changing your fonts when you change your .mo files wouldn't work fine for games as it does almost everywhere else, except in truly multilingual text; I think that Japanese just enjoy being different.

The issue tends to apply to any language which isn't based upon the Latin alphabet, although possibly to a lesser extent than for Japanese. If a language is latin-based, it's not too much of a stretch to just stick to ASCII in situations where use of other encodings is problematic. So long as there are file formats and network protocols where filenames are sequences of bytes with no encoding specified (or where the specified encoding is often incorrect), there will be a strong temptation for application programmers to make the encoding issue Someone Else's Problem (TM) by passing the data to anything which is willing to accept a string of bytes. On Windows, that means using the legacy "A" API rather than the Unicode "W" API. On Unix, that means passing the data directly to the OS without bothering about conversions. In Unicode-everywhere environments, it means either blindly accepting any built-in conversions or, if an encoding is required, hunting for a function (any function) which returns an encoding without requiring any arguments.

...

> If you want to retrieve a filename from the OS then pass it back at a later point, you need to retain the raw data. If you can't get at the raw data, you lose.

That's exactly the conclusion the Python people just came to.

Which conclusion? "Retain the raw data" or "you lose"?

...

> A more significant point is that Unicode strings aren't strings of "characters", but of Unicode code points. The conversions between Unicode and abstract characters suffer from many of the same problems as with traditional encodings.

No, they suffer from various forms of inefficiency, but since there are two canonical decompositions you just have to do like the Japanese and make sure all strings take off their muddy shoes at the door and put on canonicalized slippers before entering the house. This isn't possible with traditional encodings, and of course it does require a lot of programmer discipline to construct and use these interfaces.

Unfortunately, this canonicalisation frequently doesn't happen. It isn't too surprising, given the way that Unicode is so often touted as eliminating these sorts of problems.

--
Glynn Clements <glynn(a)gclements.plus.com>

Stephen J. Turnbull

Monday, 17 November

2:24 p.m.

Glynn Clements writes:

...

I was referring mainly to the technical issues, e.g. the non-reversibility of encoding conversions.

Like everything else you mention, AFAICS that's not caused by Unicode, and even with Mule code or TRON code you have issues like "packed EUC or 7-bit ISO-2022" which the charset codes can't help with. Rather, programmers tend to throw away information that they don't see an immediate use for (perhaps in the name of efficiency).

...

> > If you want to retrieve a filename from the OS then pass it back at a later point, you need to retain the raw data. If you can't get at the raw data, you lose.

> That's exactly the conclusion the Python people just came to.

Which conclusion? "Retain the raw data" or "you lose"?

"Retain the raw data or lose." There's no third alternative, although sufficiently creative programmers can (and do) have their data and lose anyway. ;-)

...

Unfortunately, this canonicalisation frequently doesn't happen. It isn't too surprising, given the way that Unicode is so often touted as eliminating these sorts of problems.

Not by the Unicode Consortium though. Rather by the same lazy or overworked programmers you've been citing throughout. The fact is that the problem is the Tower of Babel. One ISO standard is not going to turn back God's wrath (in fact, it probably just made Her madder!) Unicode is a major step toward making the world safe for low energy/high burden programmers, at least in a restricted area of multilingual and/or localized text processing. But as usual, the 10% of corner cases involve 90% of the work, and also as usual, those of us who care about the corner cases are going to have to bear the burden of dealing with them. Surely that doesn't surprise you. ;-)

Julian Bradfield

Monday, 17 November

6:02 p.m.

Stephen writes:

...

Julian Bradfield writes:

> I think the Unicode-ish party line on that one would be: "XEmacs is a text editor, not a typesetting program.

I see no reason why it shouldn't be both. However, I believe that

Urgh. Typesetting is a hard task. Have you ever read the TeX source code? (It is, incidentally, verging on a tragedy that TeX reached stability before Unicode was really on a roll. If Unicode had been five years earlier, or TeX five years later, we would be in a much better position than we are.)

...

Oh yes. I have a basic knowledge, sufficient to use a dictionary. The kind of thing Unicode does that is really annoying is exemplified by one of the characters I use most often, namely U+5C06 将. If you're seeing this in a PRC Chinese font, you'll see the top right component being an Evening radical 夕. If you're seeing it in a Japanese or Taiwan-CNS font, you'll see the top right component as a Claw radical 爪. This isn't a traditional/simplified distinction; the traditional version is U+5C07 將 (I'm relying on mule-ucs and VM (if I have utf-8 as my primary charset, which I think I have) to get these out as the right Unicode values - I haven't yet started using my own Unicode XEmacs for my mail!), which has the Evening radical in all fonts (in the 4-stroke rather than 3-stroke version), as well as the fancy 4-stroke Half-tree-trunk radical rather than the simplified 3-stroke one. The difference between the reference glyph for U+5C06 and its rendition in CNS and Japanese is a clear example of a difference that should prevent unification: a different radical in a component. Nonetheless, the IRG went ahead and unified them anyway, on the grounds that no existing character set distinguished them, and they are in fact stylistic variants of the same abstract character, even though they might not be. If you are faced with the Japanese/CNS variant of U+5C06, and try to look it up in the Unihan radical/stroke index, you will fail, because it isn't there. And there's no general rule that Evening/Claw radicals are interchangeable in some positions, even though ultimately it's presumably because an old form U+355A 㕚 of the claw radical (so obscure it's on CNS plane 5) is very similar to the evening radical; you just have to know it for this character.

...

Unicode advocates have a set of rules that are easy to apply in many thousands of common cases and ambiguous in very few cases, even for rare glyphs (except for the case of "lost" glyphs whose meaning is uncertain).

I wouldn't mind, if they actually applied their own criteria rigorously.

Glynn Clements

Monday, 17 November

6:45 p.m.

Stephen J. Turnbull wrote:

...

> > > If you want to retrieve a filename from the OS then pass it back at a later point, you need to retain the raw data. If you can't get at the raw data, you lose.

> > That's exactly the conclusion the Python people just came to.

> Which conclusion? "Retain the raw data" or "you lose"?

"Retain the raw data or lose." There's no third alternative, although sufficiently creative programmers can (and do) have their data and lose anyway. ;-)

Oh. I was starting to worry that they had decided that OS functions were going to start automagically trashing^Wconverting everything for you. [Like wxWidgets, which likes to provide its own main() which "conveniently" converts argv[] to Unicode.]

...

The fact is that the problem is the Tower of Babel. One ISO standard is not going to turn back God's wrath (in fact, it probably just made Her madder!) Unicode is a major step toward making the world safe for low energy/high burden programmers, at least in a restricted area of multilingual and/or localized text processing. But as usual, the 10% of corner cases involve 90% of the work, and also as usual, those of us who care about the corner cases are going to have to bear the burden of dealing with them. Surely that doesn't surprise you. ;-)

The problem isn't with having to do stuff myself, it's with being unable to. Once upon a time, if you wanted to hide malicious files, you had to resort to tricks like putting them in a subdirectory named "..." or embedding control characters in the name. Nowadays, you just need to give them a name which isn't valid UTF-8 and many programs will just pretend they don't exist.

--
Glynn Clements <glynn(a)gclements.plus.com>
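The "pretend they don't exist" behaviour is easy to reproduce. Below is a minimal Python sketch of a listing routine that decodes names strictly and skips whatever fails. The function names and sample filenames are hypothetical and are not taken from any real program.

```python
# Raw directory entries as an OS might return them; the names are made up.
raw_names = [b"notes.txt", b"r\xc3\xa9sum\xc3\xa9.txt", b"payload\xff.bin"]


def visible_names_strict(names):
    """Mimic a program that silently skips names it cannot decode as UTF-8."""
    shown = []
    for raw in names:
        try:
            shown.append(raw.decode("utf-8"))
        except UnicodeDecodeError:
            continue  # the entry simply disappears from the listing
    return shown


def undecodable(names):
    """Return the raw names that a strict UTF-8 listing would hide."""
    hidden = []
    for raw in names:
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            hidden.append(raw)
    return hidden


shown = visible_names_strict(raw_names)
hidden = undecodable(raw_names)
print(shown, hidden)
```

The file with the invalid byte is still on disk; only the listing has lost it.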

Mike FABIAN

Tuesday, 18 November

12:12 a.m.

Olivier Galibert <galibert(a)pobox.com> wrote:

...

rendering, specifically the font selection, especially when you want to still be able to reach the bitmap fonts[1]. I don't think fontconfig/xft allows to reach them,

it does!

...

it's written by anti-aliased-vector-fonts integrists who consider their use is the only one relevant. And that documentation is for wimps, and "deprecated" and "stubbed" are synonyms. OG. [1] I can't work without -Misc-MiscFixed-Medium-R-Normal--10-100-75-75-C-60-*, but I can stand using a different name as long as the glyph rendering is identical.

As you say, the name is different but the glyph rendering is identical:

mfabian＠magellan:~$ fc-list "Misc Fixed:pixelsize=10:style=Regular" family style file
/usr/share/fonts/misc/6x10.pcf.gz: Misc Fixed:style=Regular
/usr/share/fonts/japanese/5x10a.pcf.gz: Misc Fixed:style=Regular
/usr/share/fonts/misc/6x10-ISO8859-1.pcf.gz: Misc Fixed:style=Regular
mfabian＠magellan:~$

(Using bitmap fonts via fontconfig/Xft is a bit slower though than via the X11 core font system).

On openSUSE it is called "Misc Fixed", on other distributions just "Fixed". That is because on openSUSE freetype2 contains a patch to prepend the foundry ("Misc") and a space to the family ("Fixed") for bitmap fonts (*only* for bitmap fonts!). Otherwise you get problems distinguishing the many completely different and unrelated fonts with family name "Fixed" but a different foundry because the font selection GUIs don't let you specify the foundry. And the bitmap fonts -adobe-helvetica-* would be used by browsers for many pages which ask for "Helvetica". Prepending "Adobe " to make this "Adobe Helvetica" avoids using these bitmap fonts by default for many web pages without disabling them completely, so they are still available for users who like them.

In a config file, like for XEmacs, you could add the foundry, i.e. specify something like:

"Misc Fixed:foundry=Misc:pixelsize=10:style=Regular"

but I don't know any GUI yet which lets you specify a foundry.

--
Mike FABIAN <mfabian(a)suse.de>
http://www.suse.de/~mfabian
睡眠不足はいい仕事の敵だ。
I ❤ Unicode

Stephen J. Turnbull

Tuesday, 18 November

4:23 a.m.

Glynn Clements writes:

...

> "Retain the raw data or lose." There's no third alternative, although sufficiently creative programmers can (and do) have their data and lose anyway. ;-)

Oh. I was starting to worry that they had decided that OS functions were going to start automagically trashing^Wconverting everything for you.

They *are* going to do that (in Python 3.0), based on the current locale. The demand from losers^Hoverburdened mp3 jukebox programmers late for their afternoon golf games is too high to ignore (yes, they're using a Python binding to wxWidgets as you expected). *However*, Python 3 is also going to provide a bytes interface for those who want to do it right. (The last I looked the proposal was for that data to be cached somewhere so that you could always go back and look at it.) I guess that probably sounds like a cop-out to you, though.
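The "retain the raw data or lose" rule can be sketched in present-day Python. The example assumes the `surrogateescape` error handler, which is a feature of later Python 3 releases and is not the caching proposal mentioned above; the filename is made up.

```python
raw_name = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8; a made-up filename

# Strict decoding refuses the name outright.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lossy decoding succeeds but cannot be reversed: the original byte is gone.
lossy = raw_name.decode("utf-8", errors="replace")
lossy_round_trip = lossy.encode("utf-8")

# surrogateescape carries undecodable bytes through as lone surrogates,
# so encoding with the same handler restores the raw data exactly.
escaped = raw_name.decode("utf-8", errors="surrogateescape")
escaped_round_trip = escaped.encode("utf-8", errors="surrogateescape")

print(strict_ok, lossy_round_trip == raw_name, escaped_round_trip == raw_name)
```

Keeping the name as `bytes` throughout, which is what the bytes interface allows, avoids the question entirely. The escaped string is a compromise for code that needs a `str` but must still hand the same bytes back to the OS.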

...

Once upon a time, if you wanted to hide malicious files, you had to resort to tricks like putting them in a subdirectory named "..." or embedding control characters in the name. Nowadays, you just need to give them a name which isn't valid UTF-8 and many programs will just pretend they don't exist.

So much the worse for many programs, and their programmers. I really do see Unicode as a 95% solution here. It doesn't help with the transition very much, but nothing will do that as long as there are lots of lazy programmers and undemanding clients out there. However, to the extent that Unicode provides an easy way out for them, it will prevail, and it is a much more regular solution from the programming and system design point of view than ISO 2022.

Stephen J. Turnbull

Tuesday, 18 November

4:28 a.m.

Mike FABIAN writes:

...

In a config file, like for XEmacs, you could add the foundry, i.e. specify something like: "Misc Fixed:foundry=Misc:pixelsize=10:style=Regular" but I don't know any GUI yet which lets you specify a foundry.

It should be easy to add that to Customize (once Customize knows about fontconfig fontspecs), though. Patches welcome. Note that teaching Customize about fontconfig fontspecs is probably non-trivial, but adding a foundry field to the face widget should be straightforward and may be useful even for legacy fonts.

Stephen J. Turnbull

Tuesday, 18 November

6:04 a.m.

(permalink)

New subject: yet another XEmacs fork;-)

Julian Bradfield writes:

...

Urgh. Typesetting is a hard task.

I know that. I just don't think we should rule out improvements in XEmacs's capabilities there.

...

The kind of thing Unicode does that is really annoying is exemplified by one of the characters I use most often, namely U+5C06 将. The difference between the reference glyph for U+5C06 and its rendition in CNS and Japanese is a clear example of a difference that should prevent unification: a different radical in a component.

For what it's worth, I would have no trouble reading Japanese words such as 将来 (future) or 将軍 (shogun == generalissimo) if written with the reference glyph (in the context of other simplified Han).
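What unification means in practice for this character can be seen in a small sketch. It assumes Python's bundled `gb2312` and `euc_jp` codecs as stand-ins for the national standards under discussion (CNS is left out because no standard-library codec is assumed for it), and `legacy_encodings` is a name invented for the illustration.

```python
import unicodedata

CHAR = "\u5c06"  # the character under discussion, U+5C06

def legacy_encodings(char, codecs=("gb2312", "euc_jp")):
    """Encode one character in several legacy charsets."""
    return {codec: char.encode(codec) for codec in codecs}

if __name__ == "__main__":
    print(unicodedata.name(CHAR))
    for codec, raw in legacy_encodings(CHAR).items():
        # Each legacy byte sequence decodes back to the same single code point,
        # so the national glyph difference is no longer recorded in the text.
        print(codec, raw.hex(), raw.decode(codec) == CHAR)
```

Once the text is in Unicode, which variant of the glyph appears depends only on the font (or on markup outside the character data), which is exactly the complaint being debated.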

...

> Unicode advocates have a set of rules that are easy to apply in many thousands of common cases and ambiguous in very few cases, even for rare glyphs (except for the case of "lost" glyphs whose meaning is uncertain).

I wouldn't mind, if they actually applied their own criteria rigorously.

I don't see convincing evidence that they don't. It seems you just got burned by a rather arbitrary choice of some GB bureaucrat in a character you happen to refer to a lot, and the geopolitical weight of "China" that makes its standards the reference, rather than the earlier and probably more useful (to foreigners) ROC and Japanese character sets. I don't criticize you or your application, and obviously this kind of thing makes your work more difficult. I just don't see how it makes sense to reject the Unicode standard on this basis. If XEmacs slavishly kowtows to Unicode and fails to provide ways for you to work around these difficulties, that is a bug in XEmacs, not in Unicode, as far as I'm concerned. (Of course I don't plan to do the work! :-)

Julian Bradfield

Saturday, 22 November

3:20 p.m.

(permalink)

New subject: yet another XEmacs fork;-), Re: yet another XEmacs fork;-)

Stephen writes:

...

I have some ideas, but they have to wait for me to finish the basics! However, you mention OOo and Firefox - does anything (apart from Emacs!) do better? It would be nice to have an example of material that Firefox deals with badly, but something else deals with well.

...

> Urgh. Typesetting is a hard task.

I know that. I just don't think we should rule out improvements in XEmacs's capabilities there.

Feel free to do the work;-) I can't imagine why one would want to use Emacs to do typesetting - but then I can't imagine why one would use OOo/Word either, and millions do. (But few of those millions are typical Emacs users...)

...

And are you doing that by "error correction", i.e. seeing a wrong shape, but matching it to a nearby word that you know, or do you read as one might read Gothic (Fraktur) script, recognizing that it's the right letters, but in a style one doesn't know well?

Stephen J. Turnbull

Saturday, 22 November

5:23 p.m.

(permalink)

New subject: yet another XEmacs fork;-), Re: yet another XEmacs fork;-)

Julian Bradfield writes:

...

Well, TeX does pretty damn well, as you'd expect. Omega might be even better but Omega is still kinda painful to work with. I don't have an example text, offhand, but IIRC Firefox and OOo both occasionally forget that *all* spaces at line breaks should be deleted when wrapping Japanese, and only TeX gets the thinspaces around foreign words in Japanese correct IMO (but I don't know what real Japanese speakers think about that).
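The line-break rule can be sketched as a small re-flow helper. This is only a heuristic illustration, not what TeX, Firefox, or OOo actually do: `unwrap` and `is_wide` are hypothetical names, "Japanese" is approximated by Unicode East Asian Width classes W and F, and the break is dropped only when both neighbouring characters are wide (a stricter reading of the rule would drop it whenever either side is Japanese and leave thin spaces to the typesetter).

```python
import unicodedata

def is_wide(ch):
    """Approximate 'CJK character' by East Asian Width Wide or Fullwidth."""
    return unicodedata.east_asian_width(ch) in ("W", "F")

def unwrap(lines):
    """Join wrapped lines into one paragraph.

    A line break between two wide characters is deleted outright;
    any other line break becomes a single space.
    """
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if out and not (is_wide(out[-1]) and is_wide(line[0])):
            out += " "
        out += line
    return out

if __name__ == "__main__":
    print(unwrap(["これは日本", "語です"]))
    print(unwrap(["hello", "world"]))
```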

...

I can't imagine why one would want to use Emacs to do typesetting

Real typesetting, of course not. I wrote "improve capabilities", not "attain world domination"! ;-)

...

> For what it's worth, I would have no trouble reading Japanese words such as 将来 (future) or 将軍 (shogun == generalissimo) if written with the reference glyph (in the context of other simplified Han).

And are you doing that by "error correction", i.e. seeing a wrong shape, but matching it to a nearby word that you know, or do you read as one might read Gothic (Fraktur) script, recognizing that it's the right letters, but in a style one doesn't know well?

The latter. Two out of three right components in the right place ain't bad, after all (especially given that Unicode standardization is dependent on government work! ;-) The standard simplifications of Simplified Chinese are mostly similar to the deformations that result in (handwritten) abbreviated kanji, or Japanese kana. 4-stroke Claw -> 3-stroke Evening is plausible, however in Chinese I'm not conscious of other characters such as 受 which don't get that treatment (had to look it up).

participants (6)

Glynn Clements
Julian Bradfield
Mike FABIAN
Olivier Galibert
Stephen J. Turnbull

Re: yet another XEmacs fork;-)