Joachim Schrod wrote:
I have attached a file that has two lines (at the end). If I open that
file, I get the coding system big5. I would expect to get the coding
system iso-8859-1 or similar.
My coding categories are:
############################
## LIST OF CODING CATEGORIES (ordered by priority)
## CATEGORY:CODING-SYSTEM
##
utf-16-little-endian-bom:utf-16-little-endian-bom
utf-16-bom:utf-16-bom
utf-8-bom:utf-8-bom
iso-7:iso-2022-7bit
no-conversion:raw-text
utf-8:utf-8
iso-8-1:iso-8859-1
iso-8-2:ctext
iso-8-designate:ctext
iso-lock-shift:iso-2022-lock
shift-jis:shift-jis
big5:big5
utf-16-little-endian:utf-16-little-endian
utf-16:utf-16
ucs-4:ucs-4
I don't have much experience with XEmacs coding systems (in fact,
today I read doc strings on that topic for the first time).
Nevertheless, if I interpret that documentation correctly, iso-8-1
should be checked before big5; and since the file is encoded in
Latin1, it should match.
I have some additional information, since I learned about debug-coding-detection
in the meantime. Turning it on yields the following output on stderr:
detected coding system: nil
detect_coding_type: processing 88 bytes
First 16: .Maßstab wählt u 09 4D 61 DF 73 74 61 62 20 77 E4 68 6C 74 20 75
Last 16: e Fachkonzept.). 65 20 46 61 63 68 6B 6F 6E 7A 65 70 74 2E 29 0A
seen_non_ascii: 1
no-conversion: slightly-likely
utf-8: nearly-impossible
utf-8-bom: nearly-impossible
ucs-4: as-likely-as-unlikely
utf-16: quite-improbable
utf-16-little-endian: quite-improbable
utf-16-bom: quite-improbable
utf-16-little-endian-bom: quite-improbable
iso-7: somewhat-unlikely
iso-8-designate: somewhat-unlikely
iso-8-1: somewhat-unlikely
iso-8-2: somewhat-unlikely
iso-lock-shift: somewhat-unlikely
shift-jis: quite-improbable
big5: somewhat-likely
detect_coding_type: returning 0 (keep going)
detected coding system: #<coding-system big5 big5>
detected coding system: nil
<< deleted more than 31000 lines with the same output >>
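One way to read the output above is sketched below in Python. It is only a guess at the selection rule, not something confirmed from the XEmacs sources. It assumes that the best likelihood wins and that the category priority only breaks ties. The ordering of the likelihood labels is inferred from their names. LIKELIHOOD_ORDER, PRIORITY, DETECTED and pick_category are names made up for this illustration.

```python
# Likelihood labels seen in the debug output, best first.
# ASSUMPTION: the ordering is inferred from the label names.
LIKELIHOOD_ORDER = [
    "somewhat-likely",
    "slightly-likely",
    "as-likely-as-unlikely",
    "somewhat-unlikely",
    "quite-improbable",
    "nearly-impossible",
]

# Coding categories in the priority order listed earlier in this mail.
PRIORITY = [
    "utf-16-little-endian-bom", "utf-16-bom", "utf-8-bom", "iso-7",
    "no-conversion", "utf-8", "iso-8-1", "iso-8-2", "iso-8-designate",
    "iso-lock-shift", "shift-jis", "big5", "utf-16-little-endian",
    "utf-16", "ucs-4",
]

# Likelihoods copied from the debug output for the test file.
DETECTED = {
    "no-conversion": "slightly-likely",
    "utf-8": "nearly-impossible",
    "utf-8-bom": "nearly-impossible",
    "ucs-4": "as-likely-as-unlikely",
    "utf-16": "quite-improbable",
    "utf-16-little-endian": "quite-improbable",
    "utf-16-bom": "quite-improbable",
    "utf-16-little-endian-bom": "quite-improbable",
    "iso-7": "somewhat-unlikely",
    "iso-8-designate": "somewhat-unlikely",
    "iso-8-1": "somewhat-unlikely",
    "iso-8-2": "somewhat-unlikely",
    "iso-lock-shift": "somewhat-unlikely",
    "shift-jis": "quite-improbable",
    "big5": "somewhat-likely",
}


def pick_category(detected, priority):
    """Hypothetical rule: best likelihood first, priority as tie-breaker."""
    return min(
        priority,
        key=lambda cat: (LIKELIHOOD_ORDER.index(detected[cat]),
                         priority.index(cat)),
    )


if __name__ == "__main__":
    print(pick_category(DETECTED, PRIORITY))
```

Under these assumptions the sketch selects big5, which matches the observed result: the priority list would only matter if iso-8-1 had received the same likelihood as big5.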
I'm more and more convinced that this is a problem with auto-detection. The test
file has only latin1-encoded German umlauts beyond ASCII, and iso-8-1 gets a tag
`somewhat-unlikely'. That doesn't seem to be correct.
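As a sanity check outside XEmacs, the following Python sketch looks at the first 16 bytes from the hex dump in the debug output. It decodes them as Latin-1, tests whether they are valid UTF-8, and applies a purely structural Big5 test (lead byte 0x81-0xFE followed by a trail byte in 0x40-0x7E or 0xA1-0xFE). The helper names is_valid_utf8 and looks_like_big5 are made up for this illustration, and the structural test is a rough stand-in, not the heuristic XEmacs actually uses.

```python
# First 16 bytes of the test file, copied from the hex dump in the debug output.
FIRST_16 = bytes.fromhex("09 4D 61 DF 73 74 61 62 20 77 E4 68 6C 74 20 75")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def looks_like_big5(data: bytes) -> bool:
    """Structural check only: every non-ASCII byte must start a
    well-formed two-byte Big5 sequence. No mapping table is consulted."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:          # plain ASCII byte
            i += 1
            continue
        if not 0x81 <= lead <= 0xFE or i + 1 >= len(data):
            return False
        trail = data[i + 1]
        if not (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE):
            return False
        i += 2                   # consume lead and trail byte
    return True


if __name__ == "__main__":
    print(repr(FIRST_16.decode("latin-1")))
    print("valid UTF-8:", is_valid_utf8(FIRST_16))
    print("structurally Big5:", looks_like_big5(FIRST_16))
```

In these bytes each umlaut (DF, E4) happens to be followed by a lower-case ASCII letter (73, 68), which falls into the Big5 trail-byte range. That might be why big5 is rated higher than expected, but it does not explain why iso-8-1 is rated so low.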
And the >30,000 lines of `detected coding system: nil' look suspicious as
well. They don't appear when I visit a file with just the second line (and
iso-8859-1 is properly selected then). They also don't appear when I visit a
file with just the first line (when raw-text is selected as coding system).
Perhaps this helps to categorize my problem. Where is this likely/unlikely
decision made: in Lisp or in the C core?
Cheers,
Joachim
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Joachim Schrod Email: jschrod(a)acm.org
Roedermark, Germany