Stephen J. Turnbull wrote:
> > > How do systems deal with the problem that in some encodings (any
> > > ISO2022 that allows general character sets) there are many
> > > octet-strings that encode the same abstract text string?
> >
> > For the most part, they don't. Unicode fans pretend that anything
> > which causes problems for Unicode doesn't exist (or is "obsolete";
> > apparently, they get to decide that this is the case).
> Well, yes, they do. If you want something else, feel free to start
> your own standards effort. Ken'ichi Handa will help, I'm sure. ;-)
> However, even for 99% of Han users, simply putting things into the
> appropriate font will work. The only people who really need to
> disambiguate Han are Buddhist scholars; even Japanese high school
> students read their Chinese poetry in Japanese fonts.
I was referring mainly to the technical issues, e.g. the
non-reversibility of encoding conversions.
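A minimal Python sketch of that non-reversibility (the EUC-JP bytes here are just an assumed example; any octets that are invalid in the target encoding behave the same way):

```python
# Decoding arbitrary bytes with errors="replace" substitutes U+FFFD,
# so re-encoding cannot recover the original octets.
raw = b"\xa4\xcb"          # valid EUC-JP, but not valid UTF-8
text = raw.decode("utf-8", errors="replace")
back = text.encode("utf-8")
print(text == "\ufffd\ufffd")   # True: both bytes became replacement chars
print(back == raw)              # False: the round trip destroyed the data
```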
The problem with Unicode isn't that it's inherently defective as an
encoding, but rather some of the "universalism" in the way that it's
often used, e.g. languages or libraries which insist that all "text"
is represented in Unicode, so that e.g. readdir() -> open() fails for
files whose names don't match a specific encoding.
Any such problems are then waved away with "use UTF-8 for all
filenames". No mention of how to handle filenames obtained from binary
data streams with no specified encoding (e.g. tar/zip/rar files, FTP),
or whether we're supposed to simply ditch customers who have other
ideas about which encodings to use for their data.
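For what it's worth, one answer that later emerged on the Python side (PEP 383's "surrogateescape" error handler) is to smuggle the undecodable bytes through the string type so that the exact octets survive a round trip; a minimal sketch:

```python
# Undecodable bytes are mapped to lone surrogates on the way in and
# restored exactly on the way out, so no filename data is lost.
raw = b"report-\xa4\xcb.txt"   # filename bytes in some unknown encoding
name = raw.decode("utf-8", errors="surrogateescape")
print(raw == name.encode("utf-8", errors="surrogateescape"))  # True
```

The catch, of course, is that `name` is no longer guaranteed to be displayable, or even valid Unicode.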
> > Anyone who actually needs to use such encodings typically avoids
> > Unicode like the plague (I've yet to see a Japanese game for
> > Windows which uses the Unicode API rather than the codepage-based
> > API).
> Use any example but Japanese, please. Japanese exceptionalism is
> alive and well throughout the society. I find it hard to believe that
> changing your fonts when you change your .mo files wouldn't work fine
> for games as it does almost everywhere else, except in truly
> multilingual text; I think that Japanese just enjoy being different.
The issue tends to apply to any language which isn't based upon the
Latin alphabet, although possibly to a lesser extent than for
Japanese. If a language is Latin-based, it's not too much of a stretch
to just stick to ASCII in situations where use of other encodings is
problematic.
So long as there are file formats and network protocols where
filenames are sequences of bytes with no encoding specified (or where
the specified encoding is often incorrect), there will be a strong
temptation for application programmers to make the encoding issue
Someone Else's Problem (TM) by passing the data to anything which is
willing to accept a string of bytes.
On Windows, that means using the legacy "A" API rather than the
Unicode "W" API. On Unix, that means passing the data directly to the
OS without bothering about conversions. In Unicode-everywhere
environments, it means either blindly accepting any built-in
conversions or, if an encoding is required, hunting for a function
(any function) which returns an encoding without requiring any arguments.
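On the Unix side, that pass-the-bytes-through route can be sketched in Python (POSIX-specific; with the bytes API the name is never interpreted as text at all):

```python
import os
import tempfile

# Sketch: passing bytes to os.listdir()/open() skips text decoding
# entirely, so a filename that is not valid UTF-8 can be created,
# listed, and reopened byte-for-byte.
d = os.fsencode(tempfile.mkdtemp())
raw_name = b"data-\xff.bin"              # not valid UTF-8
with open(os.path.join(d, raw_name), "wb") as f:
    f.write(b"payload")
entries = os.listdir(d)                  # bytes in -> bytes out
print(raw_name in entries)               # True on typical Unix filesystems
```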
> > If you want to retrieve a filename from the OS then pass it back at a
> > later point, you need to retain the raw data. If you can't get at the
> > raw data, you lose.
> That's exactly the conclusion the Python people just came to.
Which conclusion? "Retain the raw data" or "you lose"?
> > A more significant point is that Unicode strings aren't strings of
> > "characters", but of Unicode code points. The conversions between
> > Unicode and abstract characters suffer from many of the same problems
> > as with traditional encodings.
> No, they suffer from various forms of inefficiency, but since there
> are two canonical decompositions you just have to do like the Japanese
> and make sure all strings take off their muddy shoes at the door and
> put on canonicalized slippers before entering the house. This isn't
> possible with traditional encodings, and of course it does require a
> lot of programmer discipline to construct and use these interfaces.
Unfortunately, this canonicalisation frequently doesn't happen. It
isn't too surprising, given the way that Unicode is so often touted as
eliminating these sorts of problems.
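Concretely, the two canonical forms are NFC and NFD, and comparisons fail unless both strings have put on the same slippers; a minimal Python sketch:

```python
import unicodedata

composed = "caf\u00e9"         # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"      # 'e' + U+0301 COMBINING ACUTE ACCENT
print(composed == decomposed)  # False: same abstract text, different code points
# Normalising both to one canonical form (NFC here) restores equality.
print(unicodedata.normalize("NFC", composed)
      == unicodedata.normalize("NFC", decomposed))  # True
```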
--
Glynn Clements <glynn(a)gclements.plus.com>
_______________________________________________
XEmacs-Beta mailing list
XEmacs-Beta(a)xemacs.org
http://calypso.tux.org/cgi-bin/mailman/listinfo/xemacs-beta