Re: [Bug: 21.5-b24] Problems with coding systems autodetect

Friday, 24 February 2006

        Joachim Schrod wrote:
...

 I have attached a file that has two lines (at the end). If I open that
 file, I get the coding system big5. I would expect to get the coding
 system iso-8859-1 or similar.

 My coding categories are:
 ############################
 ## LIST OF CODING CATEGORIES (ordered by priority)
 ## CATEGORY:CODING-SYSTEM
 ##
 utf-16-little-endian-bom:utf-16-little-endian-bom
 utf-16-bom:utf-16-bom
 utf-8-bom:utf-8-bom
 iso-7:iso-2022-7bit
 no-conversion:raw-text
 utf-8:utf-8
 iso-8-1:iso-8859-1
 iso-8-2:ctext
 iso-8-designate:ctext
 iso-lock-shift:iso-2022-lock
 shift-jis:shift-jis
 big5:big5
 utf-16-little-endian:utf-16-little-endian
 utf-16:utf-16
 ucs-4:ucs-4

 I don't have much experience with XEmacs coding systems (in fact,
 today I read doc strings on that topic for the first time). 
 Nevertheless, if I interpret that documentation correctly, iso-8-1
 should be checked before big5; and since the file is encoded in
 Latin1, it should match.

I have some additional information, since I have learned about
debug-coding-detection in the meantime. Turning it on yields the following
output on stderr:

detected coding system: nil
detect_coding_type: processing 88 bytes
First 16: .Mastab whlt u  09 4D 61 DF 73 74 61 62 20 77 E4 68 6C 74 20 75
Last 16: e Fachkonzept.).  65 20 46 61 63 68 6B 6F 6E 7A 65 70 74 2E 29 0A
seen_non_ascii: 1
no-conversion: slightly-likely
utf-8: nearly-impossible
utf-8-bom: nearly-impossible
ucs-4: as-likely-as-unlikely
utf-16: quite-improbable
utf-16-little-endian: quite-improbable
utf-16-bom: quite-improbable
utf-16-little-endian-bom: quite-improbable
iso-7: somewhat-unlikely
iso-8-designate: somewhat-unlikely
iso-8-1: somewhat-unlikely
iso-8-2: somewhat-unlikely
iso-lock-shift: somewhat-unlikely
shift-jis: quite-improbable
big5: somewhat-likely
detect_coding_type: returning 0 (keep going)
detected coding system: #<coding-system big5 big5>
detected coding system: nil
<< deleted more than 31000 lines with the same output >>
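
For illustration, the labels in this output can be read as an ordinal scale.
The sketch below is only a reading aid, not the XEmacs detector: the ordering
in SCALE is an assumption inferred from the plain meaning of the labels, and
the idea that the highest label wins before the priority list is consulted is
likewise an assumption. Under those assumptions, big5 is the only category
rated above no-conversion, which would be consistent with the result shown.

```python
# Likelihood labels as they appear in the debug output, lowest to highest.
# ASSUMPTION: this ordering is inferred from the wording of the labels;
# it is not taken from the XEmacs sources.
SCALE = [
    "nearly-impossible",
    "quite-improbable",
    "somewhat-unlikely",
    "as-likely-as-unlikely",
    "slightly-likely",
    "somewhat-likely",
]

# Category -> label, copied from the debug output in the order printed.
reported = {
    "no-conversion": "slightly-likely",
    "utf-8": "nearly-impossible",
    "utf-8-bom": "nearly-impossible",
    "ucs-4": "as-likely-as-unlikely",
    "utf-16": "quite-improbable",
    "utf-16-little-endian": "quite-improbable",
    "utf-16-bom": "quite-improbable",
    "utf-16-little-endian-bom": "quite-improbable",
    "iso-7": "somewhat-unlikely",
    "iso-8-designate": "somewhat-unlikely",
    "iso-8-1": "somewhat-unlikely",
    "iso-8-2": "somewhat-unlikely",
    "iso-lock-shift": "somewhat-unlikely",
    "shift-jis": "quite-improbable",
    "big5": "somewhat-likely",
}


def rank(labels):
    """Sort categories from most to least likely (hypothetical helper)."""
    return sorted(labels, key=lambda cat: SCALE.index(labels[cat]), reverse=True)


ranking = rank(reported)
print(ranking[:3])
```

On this reading, iso-8-1 never gets a chance to win on priority, because it is
rated lower than big5 in the first place.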

I'm more and more convinced that this is a problem with auto-detection. The test 
file has only latin1-encoded German umlauts beyond ASCII, and iso-8-1 gets a tag 
`somewhat-unlikely'. That doesn't seem to be correct.
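
A possible reason for the big5 rating, sketched under stated assumptions: the
two non-ASCII bytes in the "First 16" dump (DF and E4) are each followed by an
ASCII letter, and such pairs fit the commonly documented Big5 two-byte layout
(lead byte A1-F9, trail byte 40-7E or A1-FE). The helper big5_pairs below is
hypothetical and checks only that byte layout; it says nothing about how the
XEmacs detector actually scores the data.

```python
# First 16 bytes of the test file, copied from the debug output.
HEAD = bytes.fromhex("09 4D 61 DF 73 74 61 62 20 77 E4 68 6C 74 20 75")


def big5_pairs(data):
    """Return the (lead, trail) pairs if every non-ASCII byte starts a
    structurally valid Big5 pair, else None.

    ASSUMPTION: uses the commonly documented Big5 ranges; this is not
    the XEmacs detection code.
    """
    pairs = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:  # plain ASCII, single byte
            i += 1
            continue
        if i + 1 >= len(data) or not 0xA1 <= lead <= 0xF9:
            return None
        trail = data[i + 1]
        if not (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE):
            return None
        pairs.append((lead, trail))
        i += 2
    return pairs


print(repr(HEAD.decode("latin-1")))
print(big5_pairs(HEAD))
```

The same bytes decode cleanly as Latin-1 text and also pass the Big5 layout
check, so byte structure alone cannot tell the two apart for this sample.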

And the >30,000 lines of `detected coding system: nil' look suspicious as
well. They don't appear when I visit a file with just the second line (and
iso-8859-1 is properly selected then). They also don't appear when I visit a
file with just the first line (in which case raw-text is selected as the
coding system).

Perhaps this helps to categorize my problem. Where is this likely/unlikely
decision made: in Lisp or in the C core?

Cheers,
	Joachim

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Joachim Schrod				Email: jschrod(a)acm.org
Roedermark, Germany
