Re: Mule bugs: misidentification (Latin-1 vs. Chinese), revert issues

Monday, 14 January 2008

"Stephen J. Turnbull" <stephen(a)xemacs.org> writes:

...
 Michael Sperber writes:

  > Could you give a hint about detecting UTF-8?  (I know what UTF-8 looks
  > like, but not enough about the other coding systems to be able to say
  > what distinguishes them.)

 There are a lot of coding systems.  But basically if you have as many
 as 3 non-ASCII characters, the chance that any natural language text
 "looks like" UTF-8 is vanishingly small.  Except at the beginning and
 end of the string, a single byte >= 0xC0 gives you information about
 *at least* three other bytes: the preceding one may *not* be >= 0xC0,
 the following N bytes must be in the range 0x80 to 0xBF, and the next
 one after that must not be >= 0xC0.

I'm not sure I understand: These are conditions which must hold true for
UTF-8.  Is the presence of a valid UTF-8 3-byte encoding in a byte
sequence enough to be able to say that it is UTF-8?  What about typical
Latin-1 text, whose UTF-8 encodings will include only 2-byte encodings?
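
To make the question concrete, here is a minimal Python sketch of the kind
of structural check described in the quoted text.  The function name
utf8_multibyte_count and its return convention are made up for
illustration; this is not the detector XEmacs uses.  It only checks the
lead-byte/continuation-byte pattern, and as a simplification it does not
reject overlong forms or surrogates.

```python
def utf8_multibyte_count(data: bytes) -> int:
    """Count structurally well-formed multi-byte UTF-8 sequences.

    Returns -1 as soon as a byte breaks the lead/continuation pattern.
    Hypothetical helper for illustration; structural check only.
    """
    count = 0
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII carries no evidence either way.
            i += 1
            continue
        if 0xC0 <= b <= 0xDF:
            trail = 1          # lead byte of a 2-byte sequence
        elif 0xE0 <= b <= 0xEF:
            trail = 2          # lead byte of a 3-byte sequence
        elif 0xF0 <= b <= 0xF7:
            trail = 3          # lead byte of a 4-byte sequence
        else:
            # A byte in 0x80..0xBF with no lead byte before it, or 0xF8..0xFF.
            return -1
        tail = data[i + 1:i + 1 + trail]
        if len(tail) != trail or any(not 0x80 <= c <= 0xBF for c in tail):
            # Truncated sequence, or a non-continuation byte where one is required.
            return -1
        count += 1
        i += 1 + trail
    return count


if __name__ == "__main__":
    sample = "Friede, Völkerverständigung und überhaupt blabla"
    print(utf8_multibyte_count(sample.encode("utf-8")))
    print(utf8_multibyte_count(sample.encode("latin-1")))
```

In this sketch the UTF-8 encoding of the sample contains only 2-byte
sequences and still passes, while the Latin-1 encoding of the same text is
rejected at the first accented letter, because the byte after it is an
ASCII letter rather than something in the range 0x80 to 0xBF.  Whether a
given number of such sequences is enough evidence to decide is still the
open question.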

-- 
Cheers =8-} Mike
Friede, Völkerverständigung und überhaupt blabla

_______________________________________________
XEmacs-Beta mailing list
XEmacs-Beta(a)xemacs.org
http://calypso.tux.org/cgi-bin/mailman/listinfo/xemacs-beta
