Hi,
This is a follow-up to your commentary on my initial patch of the eighteenth
of October. It’s far, far later than I expected it would be, but ah well,
minimal civilian casualties this time. More work and mail later this month.
On the twenty-seventh day of the month of October, Stephen J. Turnbull wrote:
[...] This is basically unreviewable as it stands, in terms of "should you
continue in this direction". The two big problems are
(1) You MUST have a fairly detailed ChangeLog for a patch of this
size, to function as a "reader's guide" to what is important and
what changes are linked to each other. See any of Ben's
megachangelogs for an example of how to do this, and "do thou
likewise". Sure, they're horrible to read, but as someone who
does read them regularly I'll say that (a) they're essential and
(b) IMO it's not worth the effort that would be required to do a
significantly more readable job. Jerry James has one or two big
log entries, which are more readable but the patches involved are
smaller. Maybe you should aspire to Jerry's standard. :-)
Okay, noted.
(2) A fairly large chunk of change is simply gratuitous: whitespace,
fine-tuning of variable names, changes to comments not relevant to
your work, etc. It is _extremely_ difficult to read such, because
it's thematically unrelated but the reviewer wonders if there's
something he needs to check. Big deletions are "merely" annoying,
but ... do you really want to annoy the reviewer? ;-)
Certainly, the general aesthetic changes shouldn’t be part of those
patches. But as to the big deletions, when they’re made obsolete by what I’m
doing, I think taking them out makes sense.
(6) Don't gratuitously change nomenclature when changing _other_
semantics. Even if you think FORMAT_UTF_8_INTERNAL is clearly
superior to FORMAT_DEFAULT, if the semantics basically serve,
leave it alone. Instead, comment at the point of definition.
Well, the intention was to move away from the current ISO-2022-friendly
structure, to several alternative Unicode-friendly encodings. And my feeling
is that the UTF-8 compatible encoding _should not_ be the default, given
today’s RAM sizes and that every other internationalised app in the world
save Perl has moved to using fixed-width or UTF-16 internally. So calling
UTF-8 “FORMAT_DEFAULT” would just be confusing, in that context.
I didn’t judge changing those names as gratuitous. I still don’t, to be
honest.
(7) Don't remove large blocks of code that will eventually be needed
again. #if 0 or #ifdef WHEN_AIDAN_WILL_WANT_TO_SEE_THIS_CODE
and use hideif.el. Don't forget to comment the #endif!
If you remember, what was this in reference to? I don’t think I did remove
any code that would eventually be needed again, in any format close to what
was there already.
[[Yes, in this kind of patch Ben violates these rules all over the
place, but in practice nobody is going to thoroughly review Ben's
megapatches from beginning to end, and they wouldn't review a myriad
of micropatches either, so it doesn't really matter.]]
Specific comments (mostly questions):
(1) In re: FORMAT_UTF8_INTERNAL, both Ben and I disagree with this
change anyway. The purpose of the FORMAT_* defines is to tell
text manipulation code how to find character boundaries; it is not
to tell character manipulation code how to access character
properties (ie, how to compute a canonical Ichar from the text
representation).
Hmm, but how to compute a canonical Ichar from the text representation, and
how to compute a text representation from an Ichar, have always been
dependent on the FORMAT_* defines. It’s just that FORMAT_DEFAULT was always
implicit whenever MULE was #defined.
If MULE goes away, as I would like it to, and, from what I read, as Ben
would too, that will become a bit more evident.
The low-level code that handles insertion,
deletion, and marking of text (markers and extents) does not need
to know character values at all. It might be reasonable to
provide Mule (leading byte is charset)
--> and charset encodes bytecount
(2) What is the purpose of the EXPLICIT_*_FORMAT #defines?
To choose a single buffer text representation at compile time. I think
multiple, equally capable buffer text representations in a single binary are
overkill; and I think where one representation is more capable than another,
the less capable one should be taken out. The developers can experiment for
six months, and pick whichever has the best features for VM folders or
whatever.
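Roughly, the idea is no more than this; the macro and type names below are
made up for illustration, only the EXPLICIT_*_FORMAT pattern is mine:

/* Sketch: pick exactly one buffer text representation when building.  */
#if defined (EXPLICIT_FIXED_WIDTH_FORMAT)
typedef unsigned int Sketch_text_unit;   /* one unit per character */
#define SKETCH_UNITS_FOR_CHAR(c) 1
#elif defined (EXPLICIT_UTF_8_FORMAT)
typedef unsigned char Sketch_text_unit;  /* one to four units per character */
#define SKETCH_UNITS_FOR_CHAR(c) \
  ((c) < 0x80 ? 1 : (c) < 0x800 ? 2 : (c) < 0x10000 ? 3 : 4)
#else
#error "Choose one EXPLICIT_*_FORMAT at configure time."
#endif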
(3) People who use UTF-16 (as opposed to sticking to the BMP or using
UTF-32) can live with slightly inaccurate character counts as far
as I'm concerned;
I don’t want to knowingly give anyone code that gives inaccurate character
counts. :-)
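Getting the count right doesn’t cost much, either; a minimal sketch of the
kind of counting I mean (the function name is mine):

#include <stddef.h>
#include <stdint.h>

static ptrdiff_t
utf_16_char_count (const uint16_t *units, ptrdiff_t nunits)
{
  ptrdiff_t chars = 0, i = 0;
  while (i < nunits)
    {
      /* A lead surrogate followed by a trail surrogate is one character.  */
      if (units[i] >= 0xD800 && units[i] <= 0xDBFF
          && i + 1 < nunits
          && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF)
        i += 2;
      else
        i += 1;
      chars++;
    }
  return chars;
}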
(4) Some of your comments seem wrong, especially concerning case
comparisons. Case _is_ a character property. We may find it
useful to represent some case transformations with strings, but
really there's no need to consider the LATIN CAPITAL LETTER SHARP
S == 'SS' (which admittedly doesn't exist, but we could implement
it for our own convenience) to be a string in the abstract; it's a
composite character with a wide glyph.
A single backspace should only eliminate a single “S,” this LATIN CAPITAL
LETTER SHARP S should succeed in a string equality comparison with
(format "%c%c" ?S ?S), and implementing redisplay for composite characters
is on no-one’s timetable.
Dealing with external data: even if we could determine that a given “SS”
string loaded from disk or pasted in _is_ a LATIN CAPITAL LETTER SHARP S
according to all of the following (and they all have to be taken into account
when working it out):
- German’s standardised spelling.
- Whether or not our user chooses to honour the recent Rechtschreibreform.
- Whether or not our user wants to stick to seven bit characters because
they’re posting to a Unix-heavy newsgroup.
- Whether our user is in Switzerland and will thus, normally, ignore the
es-zed (ß) entirely.
I think downcasing this LATIN CAPITAL LETTER SHARP S to es-zed will surprise
our users--disagree with me if appropriate, those of you reading who are
native German speakers--and being able to downcase it is the only reason to
work out where to have it at all.
So what then? We now have a character for which the lowercase mapping _must
be equivalent to two things, which are themselves distinct, at the same
time_. So:
(assert (equal "SS" (upcase "ß")))
(assert (equal "ss" (downcase (upcase "ß"))))
Similarly, for characters like LATIN LETTER DZ the context may
determine whether it should be printed in uppercase ('DZ') or
titlecase ('Dz'), but we can in fact determine case simply by looking
at the character's code or glyph, and given the desired case, we can
map from that letter in any case to the corresponding letter in the
desired case.
Sure, that’s workable. It doesn’t solve the case for the sharp s or for
downcasing Turkish LATIN CAPITAL LETTER I WITH DOT ABOVE. Or any of those
crazy ligatures around U+FB00, just to cite some examples from the many in
CaseFolding.txt on Unicode.org.
In any event, I can see zero reason to deprecate qxestrncasecmp,
although the implementation may be problematic in cases like 'SS'.
Yeah; I think I wrote that comment when I thought moving to a non-ASCII
compatible internal string representation (as opposed to the buffer
representation) was practical in the medium term.
[...] `size_t' seems like a no-brainer, but it turned out to be very
bad karma in practice: not only did it cause a fatal bug, but to this
day I don't think the actual problem has ever been located.
[And verifying alignment by checking whether addresses are divisible by the
number of bytes of alignment required is emphatically not standard C, but it
seems to have worked out beautifully.]
(5) Why change `convert_ibyte_string_into_ichar_dynarr' et amici? If
needed, the len parameter's type should be changed to Charcount.
However, since callers have already done the hard work of counting
bytes for the variable-width format, it should be left to them, no?
Yes, normally the callers have done the work beforehand. But there remains,
for the functions themselves, the possibility that the byte count passed
across doesn’t correspond to a whole number of characters; my changes prevent
them from moving beyond the end of the source string in that event.
It’s a bit theoretical, and extraneous to the bulk of these changes, but I
don’t think people read that code often enough that anyone else is likely to
notice the potential for it to happen.
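Concretely, the guard amounts to something like this; a sketch with made-up
names, assuming a UTF-8-style variable-width format and well-formed lead
bytes:

#include <stddef.h>

static ptrdiff_t
sketch_sequence_length (unsigned char lead)
{
  /* How many bytes the lead byte claims for this character.  */
  return lead < 0x80 ? 1 : lead < 0xe0 ? 2 : lead < 0xf0 ? 3 : 4;
}

static ptrdiff_t
sketch_count_complete_chars (const unsigned char *src, ptrdiff_t len)
{
  const unsigned char *p = src, *end = src + len;
  ptrdiff_t chars = 0;

  while (p < end)
    {
      ptrdiff_t need = sketch_sequence_length (*p);
      /* If the final character claims more bytes than remain, stop here
         rather than reading past END.  */
      if (need > end - p)
        break;
      p += need;
      chars++;
    }
  return chars;
}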
(6) When changing the eistring code, be very careful to mark changes
that might have efficiency implications.
Okay. A lot of the code I was reading didn’t explicitly comment on the
constant folding it was doing, so I didn’t put the energy into commenting
myself. I will, though.
I don't know whether a compiler will be able to optimize the switch
in `null_char_ibyte_len_fmt' for example; whether you do or don't
know, say how much you know.
Will do.
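For what it’s worth, what I’d expect (and will say so in the comment) is
that an inline function of roughly this shape has its switch folded away
whenever the format argument is a compile-time constant at the call site;
the name and the byte counts below are illustrative only:

static inline int
sketch_null_char_len (int fmt)   /* cf. null_char_ibyte_len_fmt */
{
  switch (fmt)
    {
    case 0:  return 1;  /* a UTF-8-style format: NUL is one byte */
    case 1:  return 2;  /* a UTF-16-style format: NUL is two bytes */
    default: return 4;  /* a fixed-width 32-bit format */
    }
}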
(7) #define utf_16_surrogates_to_code(lead, trail) should probably
have a text_checking_assert version that asserts that lead and
trail are in the appropriate surrogate blocks.
That’s only used, and is only intended to be used, in one place, and there’s
an error signalled just above it if the surrogates are wrong.
There are several text checking assertions on valid UTF-16 in the changes I
made to text.h; from looking at the code beyond that, it didn’t seem as
appropriate to me to add them for the coding systems. Then again, there are
no assertions at all in mule-coding.c as far as I can see, so perhaps
treating coding-system code as subject to text checking assertions is the
right thing to do.
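If we do decide the checked version is worthwhile, it would look something
like the following; the predicate names are mine, and I’m standing assert in
for text_checking_assert so the sketch is self-contained:

#include <assert.h>

#ifndef text_checking_assert
#define text_checking_assert assert   /* stand-in for the real macro */
#endif

#define utf_16_lead_surrogate_p(u)  ((u) >= 0xD800 && (u) <= 0xDBFF)
#define utf_16_trail_surrogate_p(u) ((u) >= 0xDC00 && (u) <= 0xDFFF)

static inline int
checked_utf_16_surrogates_to_code (int lead, int trail)
{
  text_checking_assert (utf_16_lead_surrogate_p (lead));
  text_checking_assert (utf_16_trail_surrogate_p (trail));
  return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
}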
(8) mk_wcswidth.c may be useful in the long run for its efficiency and
because anything that Markus Kuhn proposes as a standard deserves
respect, but in the short run I would much prefer having this be
Lisp-controlled. In particular, you can be absolutely certain
that there are terminals that implement different semantics.
And there always have been. I'm using one now.
mk_wcswidth.c is intended to be used in a terminal that wants to
implement a standard interface, not by applications that may or
may not be running in such an interface.
More precisely, it doesn't seem like that much effort to add,
? With Ichars as UTF-32, it is pretty complex.
so if you want to add it as an option such as
`terminal-conforms-to-standard-wcswidth-p' that would be cool.
But terminals where `terminal-conforms-to-standard-wcswidth-p' is
nil must be supported.
So what algorithm should be used there for determining the width of a
character? Check the coding system being used for the console in question,
work out what character set that (Unicode) code point maps to in that coding
system, check the columns of that character set?
An alternative: have several available versions of wcwidth at the C level,
one with Japanese semantics (treating everything in JIS X 0212 as two-column,
doing pretty much what it does at the moment), two with Chinese
semantics--GB vs. Big5--and one with Markus' Unicode semantics[1]. Switch
between them based on language environment. But there's nothing to stop
someone in a Japanese language environment using a UTF-8 terminal coding
system--indeed, I would expect them to be doing so on OS X, where that's the
coding system used for file names--and at that point we can't predict with
certainty what character width is the right one to use. It may even vary
based on the font used in the terminal, and that's not information we can
access via curses.
(After having used the wrong character width for the double directed
quotation marks for several weeks, I have to say that it’s something that’s
profoundly annoying in real life. I was about to argue here that we can’t
fix the thing in the general case, and that we gain the most by endorsing an
emerging standard, but after that experience, I won’t. So, the
multiple-c-wcwidths thing seems the most attractive option; working things
out from the console coding system is probably inviting
disaster--initialising it based on language environment is probably the way
to go, if that can be organised.)
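To sketch what I mean by the multiple-c-wcwidths option; all the names below
are made up, the Unicode entry is where Markus’ tables would go, and the
language environment strings are only examples:

#include <string.h>
#include <stdint.h>

typedef int (*sketch_width_fn) (uint32_t code);

/* Stubs; the real functions would consult the respective width tables
   (Markus' for the Unicode one, JIS/GB/Big5-derived for the others).  */
static int sketch_width_unicode (uint32_t code)  { return 1; }
static int sketch_width_japanese (uint32_t code) { return 1; }
static int sketch_width_gb (uint32_t code)       { return 1; }
static int sketch_width_big5 (uint32_t code)     { return 1; }

static sketch_width_fn console_char_width = sketch_width_unicode;

/* Called once when the language environment is initialised (or changed).  */
static void
sketch_set_width_semantics (const char *language_environment)
{
  if (!strcmp (language_environment, "Japanese"))
    console_char_width = sketch_width_japanese;
  else if (!strcmp (language_environment, "Chinese-GB"))
    console_char_width = sketch_width_gb;
  else if (!strcmp (language_environment, "Chinese-BIG5"))
    console_char_width = sketch_width_big5;
  else
    console_char_width = sketch_width_unicode;
}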
I’m sure you’ve an idea of it already, but the existing algorithm is broken
as it is; try the below in a *scratch* buffer with a Unicode terminal coding
system, and observe the thing wrapping at forty columns.
(let ((i 0)
      (str ""))
  (while (< i 80)
    ;; EM DASH is two columns in our internal encoding, one in MS Mincho
    ;; when treated as Unicode.
    (setq str (format "%s%c" str (make-char 'chinese-cns11643-1 33 55))
          i (+ 1 i)))
  str)
Oooh, here's another thing that's broken:
(equal
 ;; LATIN CAPITAL LETTER C WITH CARON
 (encode-coding-string (make-char 'latin-iso8859-2 #xc8) 'utf-8)
 ;; Idem.
 (encode-coding-string (make-char 'japanese-jisx0212 42 45) 'utf-8))
=> t

(char-width (make-char 'latin-iso8859-2 #xc8))
=> 1

(char-width (make-char 'japanese-jisx0212 42 45))
=> 2

(and
 ;; [2]
 (check-can-encode (make-char 'latin-iso8859-2 #xc8) 'iso-2022-jp)
 (check-can-encode (make-char 'japanese-jisx0212 42 45) 'iso-2022-jp))
=> t
So we can't just look for the first supported character set that can be used
in the console's coding system; we have to make sure that the coding system
will use it. And in the abstract it's possible that something will alter the
order of preferred character sets between the checking and the encoding.
Best regards,
- Aidan
[1] Koreans probably aren't using Han characters to the extent that this
sort of thing will annoy them.
[2]
(defun check-can-encode (string codesys)
  "Check if coding system CODESYS will preserve the information in STRING."
  (if (characterp string)
      (setq string (format "%c" string)))
  (setq string (coerce string 'string))
  (equal (decode-coding-string (encode-coding-string string codesys) codesys)
         string))
--
“Ah come on now Ted, a Volkswagen with a mind of its own, driving all over
the place and going mad, if that’s not scary I don’t know what is.”