goofy charset selection for Unicode pastes

Glynn Clements glynn.clements at virgin.net
Fri Aug 20 20:56:50 EDT 2004


Jamie Zawinski wrote:

> In Mozilla 1.7, view an HTML file containing this source:
> 
> 	foo &mdash; bar
> 
> Select it, and in XEmacs 21.4.15, do this:
> 
> 	(charsets-in-string (get-selection 'PRIMARY 'UTF8_STRING))
> 
> The result:
> 
> 	(chinese-cns11643-1 ascii)
> 
> If you insert that string (as with a paste) then you get a *Warnings*
> buffer that says:
> 
> 	(font/warning) Unable to instantiate font for face default,
> 	charset chinese-cns11643-1
> 
> So uh, how come I'm getting Chinese fonts for something as simple as
> M-dash?

[Disclaimer: what follows is based upon a rather incomplete
understanding of mule-ucs, to say the least.]

The Unicode -> Mule translator searches the list of charsets specified
by unicode-basic-translation-charset-order-list until it finds one
which contains the specified character, here the em-dash (U+2014).

The default setting of this variable (in un-define.el) is:

	    '(ascii
	      latin-iso8859-1
	      latin-iso8859-2
	      latin-iso8859-3
	      latin-iso8859-4
	      cyrillic-iso8859-5
					;  arabic-iso8859-6
	      greek-iso8859-7
	      hebrew-iso8859-8
	      latin-iso8859-9
	      latin-iso8859-14
	      latin-iso8859-15
	      ipa
	      japanese-jisx0208
	      japanese-jisx0212
	      chinese-gb2312
	      chinese-cns11643-1 
	      chinese-cns11643-2
	      chinese-cns11643-3
	      chinese-cns11643-4
	      chinese-cns11643-5
	      chinese-cns11643-6
	      chinese-cns11643-7
	      chinese-big5-1
	      chinese-big5-2
	      korean-ksc5601
	      latin-jisx0201
	      katakana-jisx0201
	      thai-tis620
	      ethiopic
	      indian-is13194
	      chinese-sisheng
	      vietnamese-viscii-lower
	      vietnamese-viscii-upper)

None of the ISO-8859-* charsets has an em-dash (ISO-8859-1 doesn't and,
AFAIK, the rest of them all have essentially the same set of
"punctuation" characters). The first charset in the list which *does*
have an em-dash is chinese-cns11643-1.
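
As a rough illustration of the first-match search described above, here
is a toy model in Python. It is not mule-ucs code: the charset order is
abbreviated from the list above, and the repertoire table and the
function names first_charset_for and charsets_for_text are made up for
the example. Only the em-dash membership follows what is stated above.

```python
# Toy model of a first-match charset search; NOT the mule-ucs implementation.
# The order is abbreviated from the default list quoted above.
CHARSET_ORDER = [
    "ascii",
    "latin-iso8859-1",
    "chinese-cns11643-1",
    "chinese-big5-1",
]

EM_DASH = "\u2014"

# Hypothetical, heavily truncated repertoires, for illustration only.
REPERTOIRE = {
    "ascii": {chr(c) for c in range(0x80)},
    "latin-iso8859-1": {chr(c) for c in range(0xA0, 0x100)},
    "chinese-cns11643-1": {EM_DASH},
    "chinese-big5-1": {EM_DASH},
}


def first_charset_for(char, order=CHARSET_ORDER, repertoire=REPERTOIRE):
    """Return the first charset in `order` that contains `char`, else None."""
    for name in order:
        if char in repertoire.get(name, ()):
            return name
    return None


def charsets_for_text(text):
    """Collect the charset chosen for each character of `text`."""
    return {first_charset_for(ch) for ch in text}


print(first_charset_for(EM_DASH))
print(sorted(charsets_for_text("foo \u2014 bar")))
```

In this model the string from the original report ends up split between
ascii and chinese-cns11643-1, the same pair that charsets-in-string
reported.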

Notes:

1. By default, XEmacs isn't set up to use the *-iso10646-1 fonts (and
I don't think that it can; if displaying Unicode was as simple as
selecting a Unicode font, I don't think that we'd be using mule-ucs).

2. mule-ucs doesn't understand the windows-125x encodings (and, if it
weren't for those, I doubt that many people would be using &mdash; in
the first place).

3. the choice of charset isn't determined by the presence or absence
of a suitable font; e.g. big5 also has an em-dash, but it comes after
cns11643-1, so having a big5 font won't help.

4. mule-ucs won't try to "approximate" a requested character, i.e. it
won't just give you an ASCII minus sign instead.
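
Notes 3 and 4 can be sketched with a small toy model in Python. Again,
this is not mule-ucs code: the repertoire table, the function name
pick_charset and the two order lists are hypothetical. The point is only
that the lookup depends on list order, never on fonts, and that a
character found nowhere yields no result rather than a substitute.

```python
# Toy model only; NOT mule-ucs code. Repertoires are hypothetical.
EM_DASH = "\u2014"
MISSING = "\u2603"  # stands in for a character absent from this toy table

REPERTOIRE = {
    "ascii": {chr(c) for c in range(0x80)},
    "chinese-cns11643-1": {EM_DASH},
    "chinese-big5-1": {EM_DASH},
}


def pick_charset(char, order):
    """First charset in `order` containing `char`; fonts are never consulted."""
    return next((name for name in order if char in REPERTOIRE[name]), None)


default_order = ["ascii", "chinese-cns11643-1", "chinese-big5-1"]
big5_first = ["ascii", "chinese-big5-1", "chinese-cns11643-1"]

# Only the order of the list changes which charset is chosen.
print(pick_charset(EM_DASH, default_order))
print(pick_charset(EM_DASH, big5_first))

# No approximation: an unknown character is not mapped to "-" or similar.
print(pick_charset(MISSING, default_order))
```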

> Also, the pasted Chinese character looks exactly like a tilde,
> not a dash at all.

A tilde is the standard representation for an undisplayable character
(see etc/HELLO for lots of tildes and lots of font warnings). If you
had the appropriate Chinese font installed, it would probably look
correct.

> Anyone understand this?

Not really; I just have the .el files and too much spare time.

-- 
Glynn Clements <glynn.clements at virgin.net>




More information about the XEmacs-Beta mailing list