unicode internal issues

Saturday, 8 October 2005

i'm starting to make changes to implement the unicode-internal support.

some of the lisp primitives related to characters will need changes.  
here are my ideas; please comment.

char-charset, char-octet and split-char will remain in a unicode world 
but take an extra optional argument: a charset precedence list, like the 
one `unicode-to-char' takes now.  the argument is ignored in a 
mule-internal world.
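as a rough illustration of why the precedence list matters: in a unicode world one character can belong to several charsets, so the answer from something like split-char depends on which charset is tried first.  the sketch below is a python model, not xemacs code; python codec names stand in for charsets, and `split_char' and its default list are made up for the example.

```python
# hypothetical model of split-char in a unicode-internal world.
# python codec names ("ascii", "latin-1", "cp1252") stand in for charsets;
# none of this is an existing xemacs api.

def split_char(ch, precedence=("ascii", "latin-1", "cp1252")):
    """return (charset, octet, ...) for the first charset in `precedence`
    that can represent `ch`, or None if none of them can."""
    for charset in precedence:
        try:
            octets = ch.encode(charset)
        except UnicodeEncodeError:
            continue  # this charset has no such character; try the next one
        return (charset, *octets)
    return None


# the same character lands in different charsets under different lists
print(split_char("\u00e9"))                         # default list
print(split_char("\u00e9", ("cp1252", "latin-1")))  # cp1252 preferred
```

in a mule-internal world the character already carries its charset, which is why the extra argument can be ignored there.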

i'm thinking of making make-char be a polymorphic function -- either it 
takes a charset and octets, as currently, or it takes a unicode 
codepoint and an optional charset precedence list (ignored in a 
unicode-internal world), like the current `unicode-to-char' (which would 
be eliminated).  this is not very lispy but it seems preferable to 
having the name `make-char' not be the obvious way to make a character.  
alternatives are to leave the make-char/unicode-to-char split or to 
rename unicode-to-char to make-unicode-char.

also, in a unicode-internal world, make-char can return nil if the given 
charset and octets have no unicode equivalent.
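a minimal sketch of the polymorphic make-char idea, again as a python model under the assumption of a unicode-internal world.  the dispatch rule (a string first argument means "charset plus octets", an integer means "unicode codepoint") is only one possible way to tell the two forms apart, and `make_char' here is a hypothetical name, not the real primitive.

```python
# hypothetical model of a polymorphic make-char, unicode-internal world.
# python codec names stand in for charsets.

def make_char(first, *rest):
    """two calling conventions:
       make_char(charset, octet, ...)        -> char, or None if unmapped
       make_char(codepoint[, precedence])    -> char (precedence is ignored)
    """
    if isinstance(first, int):
        # unicode form: the optional precedence list in `rest` has no
        # effect, since the internal code already is the codepoint.
        return chr(first)
    # charset form: look the octets up in the charset's table.
    try:
        return bytes(rest).decode(first)
    except UnicodeDecodeError:
        return None  # no unicode equivalent for this charset codepoint


print(make_char("cp1252", 0x80) == make_char(0x20AC))  # both forms agree
print(make_char("cp1252", 0x81))                       # unassigned in python's table
```

the second call shows the nil case: python's cp1252 table leaves octet 0x81 unassigned, so the model returns None rather than inventing a character.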

char-to-unicode should probably remain, but it should return nil, not 
-1, in a mule-internal world if no unicode equivalent exists.

int-to-char and char-to-int always convert between chars and internal 
codes, as they do now.  in a unicode-internal world, the internal code 
is simply the unicode codepoint.

internally, some concept like "leading byte" will remain but will simply 
be an arbitrary charset index.  more than 256 charsets can exist, and 
charsets should be added for things like `windows-1252'.  for charsets 
like these, the octet in `make-charset' need not be in the iso2022 range 
of 32-127 (or equivalently, 160-255) -- windows-1252 defines various 
weird chars in the range 128-159.  we should also add big5, shift-jis 
and the like.
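to make the windows-1252 point concrete, the python sketch below lists which octets a charset assigns in a given range.  it relies on python's own `cp1252' and `latin-1' codec tables as stand-ins; `defined_octets' is a helper made up for the example.

```python
# which octets does a charset assign in the range 128-159?
# uses python's codec tables as a stand-in for charset definitions.

def defined_octets(charset, lo, hi):
    """map each octet in [lo, hi] that `charset` assigns to its character."""
    found = {}
    for octet in range(lo, hi + 1):
        try:
            found[octet] = bytes([octet]).decode(charset)
        except UnicodeDecodeError:
            pass  # unassigned in this charset
    return found


cp1252_high = defined_octets("cp1252", 128, 159)
latin1_high = defined_octets("latin-1", 128, 159)

# cp1252 puts printable characters here (euro sign, trademark sign, ...)
print(hex(ord(cp1252_high[0x80])), hex(ord(cp1252_high[0x99])))
# latin-1 has only c1 control codes in the same range
print(hex(ord(latin1_high[0x80])))
```

so a charset definition that only allows octets in 32-127 or 160-255 cannot describe windows-1252 as a single charset.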

there should also be functions to convert directly between unicode 
codepoints and charset codepoints, without the need to go through a 
char.  maybe `charset-codepoint-to-unicode' and 
`unicode-to-charset-codepoint' (returning a list, like `split-char').  
ideas for better names?  do i have the terminology correct?
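a sketch of what the two direct conversions might look like, as a python model.  the function names follow the ones floated above, but the signatures, the None-on-failure convention and the use of python codecs as charsets are all assumptions made for illustration.

```python
# hypothetical model of direct conversion between charset codepoints
# and unicode codepoints.  python codec names stand in for charsets.

def charset_codepoint_to_unicode(charset, *octets):
    """unicode codepoint for (charset, octets), or None if unmapped."""
    try:
        text = bytes(octets).decode(charset)
    except UnicodeDecodeError:
        return None
    # the octets must name exactly one character
    return ord(text) if len(text) == 1 else None


def unicode_to_charset_codepoint(code, precedence):
    """[charset, octet, ...] for the first charset in `precedence` that
    contains the codepoint, or None.  a list, in the style of split-char."""
    for charset in precedence:
        try:
            return [charset, *chr(code).encode(charset)]
        except UnicodeEncodeError:
            continue
    return None


# round trip through a two-octet charset
cp = unicode_to_charset_codepoint(0x3042, ["latin-1", "shift_jis"])
print(cp)
print(hex(charset_codepoint_to_unicode(*cp)))
```

the round trip at the end goes from a unicode codepoint to a charset codepoint and back, which is the property the pair of functions should preserve for any codepoint the chosen charset contains.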

ben
