goofy charset selection for Unicode pastes
Glynn Clements
glynn.clements at virgin.net
Fri Aug 20 20:56:50 EDT 2004
Jamie Zawinski wrote:
> In Mozilla 1.7, view an HTML file containing this source:
>
> foo &mdash; bar
>
> Select it, and in XEmacs 21.4.15, do this:
>
> (charsets-in-string (get-selection 'PRIMARY 'UTF8_STRING))
>
> The result:
>
> (chinese-cns11643-1 ascii)
>
> If you insert that string (as with a paste) then you get a *Warnings*
> buffer that says:
>
> (font/warning) Unable to instantiate font for face default,
> charset chinese-cns11643-1
>
> So uh, how come I'm getting Chinese fonts for something as simple as
> M-dash?

[Disclaimer: what follows is based upon a rather incomplete
understanding of mule-ucs, to say the least.]

The Unicode -> Mule translator searches the list of charsets specified
by unicode-basic-translation-charset-order-list until it finds one
which contains the specified character, i.e. mdash.

The default setting of this variable (in un-define.el) is:

  '(ascii
    latin-iso8859-1
    latin-iso8859-2
    latin-iso8859-3
    latin-iso8859-4
    cyrillic-iso8859-5
    ; arabic-iso8859-6
    greek-iso8859-7
    hebrew-iso8859-8
    latin-iso8859-9
    latin-iso8859-14
    latin-iso8859-15
    ipa
    japanese-jisx0208
    japanese-jisx0212
    chinese-gb2312
    chinese-cns11643-1
    chinese-cns11643-2
    chinese-cns11643-3
    chinese-cns11643-4
    chinese-cns11643-5
    chinese-cns11643-6
    chinese-cns11643-7
    chinese-big5-1
    chinese-big5-2
    korean-ksc5601
    latin-jisx0201
    katakana-jisx0201
    thai-tis620
    ethiopic
    indian-is13194
    chinese-sisheng
    vietnamese-viscii-lower
    vietnamese-viscii-upper)
None of the ISO-8859-* family have em-dash (ISO-8859-1 doesn't and,
AFAIK, the rest of them all have essentially the same set of
"punctuation" characters). The first one which *does* have an em-dash
is chinese-cns11643-1.
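
To make the first-match behaviour concrete, here is a rough sketch in Python (not the actual mule-ucs code, which is Emacs Lisp and which I haven't read closely). The names `ORDER`, `covers`, `pick_charset` and `charsets_used` are hypothetical, the order list is abbreviated, and the coverage tests are toy stand-ins that only encode what is said above: ASCII and ISO-8859-1 lack the em-dash, while cns11643-1 and big5 have it.

```python
# Toy model of an ordered, first-match charset search.
# The coverage rules are tiny stand-ins, NOT real charset tables.
EM_DASH = "\u2014"

# Abbreviated order list (assumption: only the entries needed here).
ORDER = [
    "ascii",
    "latin-iso8859-1",
    "chinese-cns11643-1",
    "chinese-big5-1",
]

def covers(charset, ch):
    """Return True if the toy charset contains the character."""
    cp = ord(ch)
    if charset == "ascii":
        return cp < 0x80
    if charset == "latin-iso8859-1":
        # Upper half of ISO-8859-1; no em-dash in there.
        return 0xA0 <= cp <= 0xFF
    if charset in ("chinese-cns11643-1", "chinese-big5-1"):
        # Stand-in: we only model the one character under discussion.
        return ch == EM_DASH
    return False

def pick_charset(ch, order=ORDER):
    """Return the first charset in `order` containing `ch`, else None."""
    for charset in order:
        if covers(charset, ch):
            return charset
    return None  # no approximation is attempted

def charsets_used(text, order=ORDER):
    """Return the distinct charsets picked for `text`, in first-seen order."""
    seen = []
    for ch in text:
        charset = pick_charset(ch, order)
        if charset is not None and charset not in seen:
            seen.append(charset)
    return seen
```

In this model the em-dash lands in chinese-cns11643-1 purely because that entry comes before chinese-big5-1 in the list; swapping the two entries changes the answer, and whether a font exists never enters into it.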
Notes:
1. By default, XEmacs isn't set up to use the *-iso10646-1 fonts (and
I don't think that it can; if displaying Unicode was as simple as
selecting a Unicode font, I don't think that we'd be using mule-ucs).
2. mule-ucs doesn't understand the windows-125x encodings (and, if it
weren't for those, I doubt that many people would be using &mdash; in
the first place).
3. the choice of charset isn't determined by the presence or absence
of a suitable font; e.g. big5 also has an em-dash, but it comes after
cns11643-1, so having a big5 font won't help.
4. mule-ucs won't try to "approximate" a requested character, i.e. it
won't just give you an ASCII minus sign instead.
> Also, the pasted Chinese character looks exactly like a tilde,
> not a dash at all.
A tilde is the standard representation for an undisplayable character
(see etc/HELLO for lots of tildes and lots of font warnings). If you
had the appropriate Chinese font installed, it would probably look
correct.
> Anyone understand this?
Not really; I just have the .el files and too much spare time.
--
Glynn Clements <glynn.clements at virgin.net>