this is way too complicated.
the idea i have is simple and workable.
below i've copied a bunch of stuff i've been writing over the last few weeks,
including stuff on formats [see the second major section below].
btw, regarding your complaints about detection: i've been creating detecting
coding systems, which work like identity in the encode direction and detect in
the other direction. so far, there are two: the convert-eol type [which would
implement crlf, etc.] has a `type' property which, if set to `autodetect', does
EOL autodetection. the other is `undecided', which does general detection
according to the category table. specifiable defaults could also handle eol
issues: e.g. instead of the "assumed internal" mentioned below, the assumed
format could be autodetect-eol. i have a system in place to "canonicalize after
decoding" the coding systems used during decoding, converting autodetecting
coding systems into the actually detected coding system, and also handling
chains in special ways, partly because of the autodetect-eol that's currently
being stuck onto the end of all regularly created coding systems to get the
existing behavior.
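to make that "canonicalize after decoding" step concrete, here's a minimal
sketch; the struct layout and names are hypothetical, not the actual XEmacs
internals:

/* Minimal sketch (hypothetical types and fields): after decoding, any
   autodetecting member of the chain is replaced by the coding system
   the detector actually settled on, so the canonicalized chain can be
   reused, e.g. for encoding the data back out. */
struct coding_system
{
  const char *name;
  int autodetecting;               /* e.g. `undecided' or autodetect-eol */
  struct coding_system *detected;  /* filled in once detection commits */
};

static void
canonicalize_after_decoding (struct coding_system **chain, int len)
{
  int i;
  for (i = 0; i < len; i++)
    if (chain[i]->autodetecting && chain[i]->detected)
      chain[i] = chain[i]->detected;
}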
i also have ideas about extending the category table, even beyond what's listed
below: to handle lots of different ways something could be encoded [e.g.
euc-jp, base64'd, then gzip'd, then base64'd again: we could specify the
encodings explicitly as base64|gzip|base64|euc-jp, but we'd like it
autodetected]. i have something in mind where the table can work recursively,
as long as the output is still external. thus, we detect base64 first and
decode; the result is still external [i.e. bytes, not chars], so we run again
and get gzip; decode, run the detector again, get base64; then again, get
euc-jp. this time the output is internal, so we maybe switch to a different
category-mapping/priority table, which might [at this moment] only detect the
eol type; decode that, and then get something that doesn't match anything. [we
could put the raw-text category first; that way, crlf and such have to have a
higher likeliness in order for them to take over.]
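a rough sketch of the recursive driver i have in mind; everything named here is
made up, and the real thing would plug into the category tables, but it shows
the control flow:

#include <string.h>

/* Sketch of recursive detection of layered encodings (e.g. base64 over
   gzip over base64 over euc-jp).  run_detector() and decode_step() are
   placeholders standing in for the real detector and decoder. */
enum data_kind { DATA_EXTERNAL, DATA_INTERNAL };

struct detected
{
  const char *coding_system;   /* e.g. "base64", "gzip", "euc-jp" */
  enum data_kind output_kind;  /* what the decoded output would be */
};

/* Returns nonzero if something was detected with sufficient likeliness. */
extern int run_detector (const unsigned char *data, long len,
                         int use_internal_table, struct detected *out);
/* Decodes one layer and returns the new length. */
extern long decode_step (const char *coding_system,
                         const unsigned char *in, long inlen,
                         unsigned char *out, long outsize);

static void
detect_layered (unsigned char *buf, long len,
                unsigned char *scratch, long scratch_size)
{
  struct detected d;
  int use_internal_table = 0;

  while (run_detector (buf, len, use_internal_table, &d))
    {
      len = decode_step (d.coding_system, buf, len, scratch, scratch_size);
      memcpy (buf, scratch, (size_t) len);
      if (d.output_kind == DATA_INTERNAL)
        /* once we're looking at internal data, switch to a table that
           might [at this moment] only detect the eol type, with raw-text
           first so crlf and such need a higher likeliness to win */
        use_internal_table = 1;
    }
}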
------------------------------------------------------------------
ABOUT DETECTION
------------------------------------------------------------------
we have not yet properly abstracted detection. it's a bit tricky to do
because when we do detection, we basically run all the detection
routines in parallel on chunks of data; so the struct detection_state
needs to have space for data for all detection routines, not just one
extra chunk of data. for easy access to detector-specific data, and to
the array of likeliness values that will replace the "mask", we need
macros that declare detectors and categories and initialize constants
for them, similar to lstream type indices. A sample for Unicode looks
like this:
DEFINE_DETECTOR (utf_8);
DEFINE_DETECTOR_CATEGORY (utf_8, utf_8);
DEFINE_DETECTOR (ucs_4);
DEFINE_DETECTOR_CATEGORY (ucs_4, ucs_4);
DEFINE_DETECTOR (utf_16);
DEFINE_DETECTOR_CATEGORY (utf_16, utf_16);
DEFINE_DETECTOR_CATEGORY (utf_16, utf_16_little_endian);
DEFINE_DETECTOR_CATEGORY (utf_16, utf_16_bom);
DEFINE_DETECTOR_CATEGORY (utf_16, utf_16_little_endian_bom);
we could allocate the detection structure and all type-specific data and
likeliness values in one memory block, but we'd need to be careful with
alignment. i think we can make effective use of
ALIGN_SIZE (size, ALIGNOF (max_align_t));
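as a rough illustration of the one-block idea (the names apart from
ALIGN_SIZE/ALIGNOF are made up, and the fallback definitions are only there so
the sketch stands alone):

#include <stddef.h>
#include <stdlib.h>

#ifndef ALIGN_SIZE  /* XEmacs provides these; crude stand-ins for the sketch */
#define ALIGNOF(type) __alignof__ (type)
#define ALIGN_SIZE(len, unit) ((((len) + (unit) - 1) / (unit)) * (unit))
#endif

struct detection_state { int dummy; };  /* placeholder for the common fields */

/* Lay out the common detection_state, each detector's private data, and
   the per-category likeliness array in one block, aligning each piece. */
static void *
allocate_detection_block (int num_detectors, const size_t *detector_data_sizes,
                          int num_categories, size_t *data_offsets,
                          size_t *likeliness_offset)
{
  size_t size = ALIGN_SIZE (sizeof (struct detection_state),
                            ALIGNOF (max_align_t));
  int i;

  for (i = 0; i < num_detectors; i++)
    {
      data_offsets[i] = size;
      size += ALIGN_SIZE (detector_data_sizes[i], ALIGNOF (max_align_t));
    }
  *likeliness_offset = size;
  size += num_categories * sizeof (int); /* one likeliness value per category */

  return calloc (1, size);  /* detector data lives at block + data_offsets[i] */
}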
however, in general the detection code has major problems and needs lots
of work:
-- instead of merely "yes" or "no" for particular categories, we need
a more flexible system, with various levels of likeliness. Currently
I've created a system with six levels, as follows:
enum detection_result
{
  /* Basically means a magic cookie was seen indicating this type, or
     something similar. */
  DETECTION_NEAR_CERTAINTY,
  /* Characteristics seen that are unlikely to be other coding system types
     -- e.g. ISO-2022 escape sequences, or perhaps a consistent pattern of
     alternating zero bytes in UTF-16, along with Unicode LF or CRLF
     sequences at regular intervals.  (Zero bytes are unlikely or impossible
     in most text encodings.) */
  DETECTION_QUITE_PROBABLE,
  /* At least some characteristics seen that match what's normally found in
     this encoding -- e.g. in Shift-JIS, a number of two-byte Japanese
     character sequences in the right range, and nothing out of range; or
     in Unicode, much higher statistical variance in the odd bytes than in
     the even bytes, or vice-versa (perhaps the presence of regular EOL
     sequences would bump this up to DETECTION_QUITE_PROBABLE).  This is
     quite often a statistical test. */
  DETECTION_SOMEWHAT_LIKELY,
  /* Default state.  Perhaps it indicates pure ASCII or something similarly
     vague seen in Shift-JIS.

     (On the other hand, for a pure ASCII detector, this might be exactly
     what you want.  Such a detector ideally wants all bytes in the range
     0x20 - 0x7E (no high bytes!), except for whitespace control chars and
     perhaps a few others; LF, CR, or CRLF sequences at regular intervals
     (where "regular" might mean an average < 100 chars and 99% < 300 for
     code and other stuff of the "text file w/line breaks" variety, but for
     the "text file w/o line breaks" variety, excluding blank lines,
     averages could easily be 600 or more, with 2000-3000 char "lines" not
     so uncommon); similar statistical variance between odds and evens (not
     Unicode); frequent occurrences of the space character; letters more
     common than non-letters; etc.  Granted, this doesn't even apply to
     everything called "ASCII", and we could potentially distinguish ASCII
     for code, ASCII for text, etc. as separate categories.  However, it
     does give us a lot to work off of in deciding what likelihood to
     choose -- and it shows there are in fact a lot of detectable patterns
     to look for even in something seemingly so generic as ASCII.) */
  DETECTION_NO_INFORMATION,
  /* Some characteristics seen that are unusual for this encoding --
     e.g. unusual control characters in a plain-text encoding, or little
     statistical variance in the odd and even bytes in UTF-16. */
  DETECTION_UNLIKELY,
  /* An erroneous sequence was seen. */
  DETECTION_IMPOSSIBLE
};
-- a smarter algorithm to pick the winning category -- something that
weighs the likelihood of being correct against the priority, and may
decide to present more than one possibility to the user. (a sketch of
such an algorithm appears after this list.)
-- The simple list of coding categories per detector is not enough.
Instead of coding categories, we need parameters. For example,
Unicode might have separate detectors for UTF-8, UTF-7, UTF-16,
and perhaps UCS-4; or UTF-16/UCS-4 would be one detection type.
UTF-16 would have parameters such as "little-endian" and "needs BOM",
and possibly another one like "collapse/expand/leave alone composite
sequences" once we add this support. Usually these parameters
correspond directly to a coding system parameter. Different
likelihood values can be specified for each parameter as well as for
the detection type as a whole. The user can specify particular
coding systems for a particular combination of detection type and
parameters, or can give "default parameters" associated with a
detection type. In the latter case, we create a new coding system as
necessary that corresponds to the detected type and parameters.
-- There need to be two priority lists and two category->coding-system
lists. One is general, the other langenv-specific. The user sets the
former, the langenv the latter. The langenv-specific entries take
precedence over the others. This works similarly to the Unicode
charset priority list.
-- a better means of presentation. rather than just coming up
with the new file decoded according to the detected coding
system, allow the user to browse through the file and
conveniently reject it if it looks wrong; then detection
starts again, but with that possibility removed. in cases where
certainty is low and thus more than one possibility is presented,
the user can browse each one and select one or reject them all.
-- fail-safe: even after the user has made a choice, if they
later on realize they have the wrong coding system, they can
go back, and we've squirreled away the original data so they
can start the process over. this may be tricky.
-- using a larger buffer for detection. we use just a small
piece, which can give quite random results. we may need to
buffer up all the data we look through because we can't
necessarily rewind. the idea is we proceed until we get a
result that's at least at a certain level of certainty
(e.g. "probable") or we reached a maximum limit of how much
we want to buffer.
-- dealing with interactive systems. we might need to go ahead
and present the data before we've finished detection, and
then re-decode it, perhaps multiple times, as we get better
detection results.
-- Clearly some of these are more important than others. at the
very least, the "better means of presentation" should be
implemented as soon as possible, along with a very simple means
of fail-safe whenever the data is readily available, e.g. it's
coming from a file, which is the most common scenario.
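here's the promised sketch of a smarter winner-picking step. the scoring is
made up (in particular, it assumes the likeliness values have been remapped so
that bigger numbers mean more likely, with unlikely/impossible at or below
zero); the point is just that likeliness dominates, priority only breaks
near-ties, and runners-up close to the winner could all be offered to the user:

struct category_score
{
  int category;    /* index into the category table */
  int likeliness;  /* remapped from enum detection_result; bigger = likelier */
  int priority;    /* position in the priority list; 0 = highest priority */
};

/* Returns the index of the winning entry, or -1 if nothing is plausible
   (in which case we'd fall back to asking the user). */
static int
pick_winning_category (const struct category_score *scores, int n)
{
  int best = -1, best_score = 0, i;

  for (i = 0; i < n; i++)
    {
      int score;
      if (scores[i].likeliness <= 0)  /* unlikely or impossible: never wins */
        continue;
      /* likeliness dominates; priority only matters between categories
         whose likeliness is comparable */
      score = scores[i].likeliness * 100 - scores[i].priority;
      if (best < 0 || score > best_score)
        {
          best = i;
          best_score = score;
        }
    }
  return best;
}

any categories scoring within some small margin of the winner would be the
"more than one possibility" presented to the user.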
------------------------------------------------------------------
ABOUT FORMATS
------------------------------------------------------------------
when calling make-coding-system, the name can be a cons of (format1 .
format2), specifying that it decodes format1->format2 and encodes the other
way. if only one name is given, that is assumed to be format1, and the
other is either `external' or `internal' depending on the end type.
normally, when decoding, the user gives the decoding order in formats but
can leave off the last one, `internal', which is assumed. a multichain
might look like gzip|multibyte|unicode, using the coding systems named
`gzip', `(unicode . multibyte)' and `unicode'. the way this actually works
is by searching for gzip->multibyte; if not found, look for gzip->external
or gzip->internal. (In general we automatically do conversion between
internal and external as necessary: thus gzip|crlf does the expected, and
maps to gzip->external, external->internal, crlf->internal, which when
fully specified would be gzip|external:external|internal:crlf|internal --
see below.) To forcibly fit together two converters that have explicitly
specified and incompatible names (say you have unicode->multibyte and
iso8859-1->ebcdic and you know that the multibyte and iso8859-1 in this
case are compatible), you can force-cast using :, like this:
ebcdic|iso8859-1:multibyte|unicode. (again, if you force-cast between
internal and external formats, the conversion happens automatically.)
--------------------------------------------------------------------------
ABOUT PDUMP, UNICODE, AND RUNNING XEMACS FROM A DIRECTORY WITH WEIRD CHARS
--------------------------------------------------------------------------
-- there's the problem that XEmacs can't be run in a directory with
non-ASCII/Latin-1 chars in it, since it will be doing Unicode
processing before we've had a chance to load the tables. In fact,
even finding the tables in such a situation is problematic using
the normal commands. my idea is to eventually load the stuff
extremely extremely early, at the same time as the pdump data gets
loaded. in fact, the unicode table data (stored in an efficient
binary format) can even be stuck into the pdump file (which on
windows would mean attaching it as a resource to the executable). we'd
need to extend pdump a bit to allow for attaching extra data to the
pdump file. (something like this: pdump_attach_extra_data (addr, length)
returns a number of some sort, an index into the file, which you
can then retrieve with pdump_load_extra_data(), which returns an
addr (mmap()ed or loaded); later you call pdump_unload_extra_data()
when finished. we'd probably also need
pdump_attach_extra_data_append(), which appends data to the data
just written out with pdump_attach_extra_data(). this way,
multiple tables in memory can be written out into one contiguous
table.) (we'd use the tar-like trick of allowing new blocks to be
written without going back to change the old blocks -- we just rely
on the end of file/end of memory.) this same mechanism could be
extracted out of pdump and used to handle the non-pdump situation
(or alternatively, we could just dump either the memory image of
the tables themselves or the compressed binary version). in the
case of extra unicode tables not known about at compile time that
get loaded before dumping, we either just dump them into the image
(pdump and all) or extract them into the compressed binary format,
free the original tables, and treat them like all other tables.
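spelled out as prototypes plus a usage sketch, the proposed interface might
look like the following; the exact signatures (and whether unload takes the
index or the address) are just guesses from the description above:

#include <stddef.h>

/* Dump time: attach a block of data to the pdump file; the returned
   index can itself be dumped (e.g. stored in a dumped variable). */
int pdump_attach_extra_data (void *addr, size_t length);
/* Append more data to the block just written, so several in-memory
   tables end up as one contiguous table in the file. */
void pdump_attach_extra_data_append (void *addr, size_t length);
/* Load time: retrieve the block (mmap()ed or read in) ... */
void *pdump_load_extra_data (int index);
/* ... and release it when finished. */
void pdump_unload_extra_data (int index);

/* Usage sketch for the unicode tables: */
static int unicode_tables_index;        /* gets dumped with everything else */

static void
dump_unicode_tables (void *table1, size_t len1, void *table2, size_t len2)
{
  unicode_tables_index = pdump_attach_extra_data (table1, len1);
  pdump_attach_extra_data_append (table2, len2);
}

static void
init_unicode_tables_very_early (void)
{
  void *tables = pdump_load_extra_data (unicode_tables_index);
  /* ... build the in-memory conversion tables from the binary data in
     `tables' ... */
  pdump_unload_extra_data (unicode_tables_index);
}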
--------------------------------------------------------------------------
HANDLING WRITING A FILE SAFELY, WITHOUT DATA LOSS
--------------------------------------------------------------------------
-- When writing a file, we need error detection; otherwise somebody
will create a Unicode file without realizing the coding system
of the buffer is Raw, and then lose all the non-ASCII/Latin-1
text when it's written out. We need two levels:
1. first, a "safe-charset" level that checks before any actual
encoding to see if all characters in the document can safely
be represented using the given coding system. FSF has a
"safe-charset" property of coding systems, but it's stupid
because this information can be automatically derived from
the coding system, at least the vast majority of the time.
What we need is some sort of
alternative-coding-system-precedence-list, langenv-specific,
where everything on it can be checked for safe charsets and
then the user given a list of possibilities. When the user
does "save with specified encoding", they should see the same
precedence list. Again, as with other precedence lists,
there's also a global one, and presumably all coding systems
not on either list get appended to the end (and perhaps not
checked at all when doing safe-checking?). safe-checking
should work something like this: compile a list of all
charsets used in the buffer, along with a count of chars
used. that way, "slightly unsafe" charsets, which would lose
only a few characters, can perhaps be presented at the end,
since they may be exactly what the user was looking for. (a
sketch of this pass appears after this list.)
2. when actually writing out, we need error checking in case an
individual char in a charset can't be written even though the
charsets are safe. again, the user gets the choice of other
reasonable coding systems.
3. same thing (error checking, list of alternatives, etc.) needs
to happen when reading! all of this will be a lot of work!
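here's a sketch of the safe-checking pass from (1): tally charset usage over
the buffer once, then score each candidate coding system by how many
characters it would lose. the types and the coding_system_supports_charset()
helper are placeholders, not the real charset machinery:

#define MAX_CHARSETS 256    /* placeholder bound */

struct charset_usage
{
  long count[MAX_CHARSETS];   /* chars used in the buffer, per charset id */
};

/* Placeholder: can this coding system represent this charset at all? */
extern int coding_system_supports_charset (int coding_system, int charset);

/* One pass over the buffer's characters (given here as charset ids);
   `usage' is assumed zero-initialized by the caller. */
static void
tally_buffer_charsets (const int *char_charset_ids, long nchars,
                       struct charset_usage *usage)
{
  long i;
  for (i = 0; i < nchars; i++)
    usage->count[char_charset_ids[i]]++;
}

/* Number of characters that would be lost encoding with this coding
   system: 0 means safe; a small number means "slightly unsafe", still
   worth offering near the end of the list of possibilities. */
static long
chars_lost (const struct charset_usage *usage, int coding_system)
{
  long lost = 0;
  int cs;
  for (cs = 0; cs < MAX_CHARSETS; cs++)
    if (usage->count[cs] != 0
        && !coding_system_supports_charset (coding_system, cs))
      lost += usage->count[cs];
  return lost;
}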
--ben
"Stephen J. Turnbull" wrote:
>>>>> "ben" == Ben Wing <ben(a)666.com> writes:
ben> Stefan is right, we need formats, not operations.
I think this is more confusion than clarification. In general, there
may be a number of different things to do with a format, or even a
pair of formats.
Eg, isn't what you really want to be able to write something like this:
(define-format jp-gz
  ;; The (detect ...) form is like a cond.
  ;; Each clause has a format to be checked; if the condition is OK,
  ;; execute the list of converters in the cdr of the clause in order.
  '(decoder
    (detect (gzip gunzip
                  (detect (utf-8 utf8-to-mule)
                          (shift-jis sjis-to-mule)
                          (euc-jp eucjp-to-mule)
                          (iso2022jp iso2022jp-to-mule)
                          ;; yeah yeah this syntax is bogus
                          (other (error "This isn't Japanese.")))
                  ;; no error if eol not detected, assume Unix
                  (detect (eol-crlf dos-to-unix)
                          (eol-cr mac-to-unix))
                  (other (error "This isn't gzip."))))))
But now it's not obvious what should be done with a gzip if we specify
(gzip internal). Should we apply gunzip to it, or should we apply the
jp-gz decoder? Both result in the `internal' format. How do we
disambiguate that?
Also, note that the jp-gz decoder does not have a well-defined inverse
operation. But wouldn't we like to be able to define such auto-
detecting converters, and pipeline them with other converters? Eg,
(define-format maybe-gz
  '(decoder
    (detect (gzip gunzip))))
You can put the maybe-gz converter in any input pipeline, and it will
detect gzipped files and gunzip them but not barf if they aren't (the
plain gzip converter should do so, I think). But it doesn't have an
inverse, either. This isn't quite it:
(define-format maybe-gz-too
  '(decoder
    (detect (gzip gunzip))
    encoder
    (detect (gzip identity)
            (other gzip))))
but that looks kinda useful to me.
We do not have these problems if we treat the operations as operations,
rather than try to deduce them from pairs of formats. That's always
been one of the problems with Mule: it insists on doing conversions
that you don't want and don't know about until your data is corrupted.
I don't see how we can avoid that if we try to work in terms of
formats rather than operations.
--
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences Tel/fax: +81 (298) 53-5091
_________________ _________________ _________________ _________________
What are those straight lines for? "XEmacs rules."