this is way too complicated.
the idea i have is simple and workable.
below i've copied a bunch of stuff i've been writing over the last few weeks,
including stuff on formats [see the second major section below].
btw, regarding your complaints about detection: i've been creating detecting
coding systems, which work like identity in the encode direction and detect in
the other direction. so far, there are two: the convert-eol type [which would
implement crlf, etc.] has a `type' property which, if set to `autodetect', does
EOL autodetection. the other is `undecided', which does general detection
according to the category table. specifiable defaults could also handle eol
issues: e.g. instead of the "assumed internal" mentioned below, the assumed
format could be autodetect-eol. i have a system in place to "canonicalize after
decoding" the coding systems used during decoding, converting autodetecting
coding systems into the actually detected coding system, and also handling
chains in special ways, partly because of the autodetect-eol that's currently
being stuck onto the end of all regularly created coding systems to get the
existing behavior.
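to make that "canonicalize after decoding" step concrete, here's a minimal
sketch; the struct layout and names are hypothetical, not the actual XEmacs
internals:

/* Minimal sketch (hypothetical types and fields): after decoding, any
   autodetecting member of the chain is replaced by the coding system
   the detector actually settled on, so the canonicalized chain can be
   reused, e.g. for encoding the data back out. */
struct coding_system
{
  const char *name;
  int autodetecting;               /* e.g. `undecided' or autodetect-eol */
  struct coding_system *detected;  /* filled in once detection commits */
};

static void
canonicalize_after_decoding (struct coding_system **chain, int len)
{
  int i;
  for (i = 0; i < len; i++)
    if (chain[i]->autodetecting && chain[i]->detected)
      chain[i] = chain[i]->detected;
}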
i also have ideas about extending the category table, even beyond what's listed
below: to handle lots of different ways something could be encoded [e.g.
euc-jp, base64'd, then gzip'd, then base64'd again: we could specify the
encodings explicitly as base64|gzip|base64|euc-jp, but we'd like it
autodetected]. i have something in mind where the table can work recursively,
as long as the output is still external. thus, we detect base64 first and
decode; the result is still external [i.e. bytes, not chars], so we run again
and get gzip; decode, run the detector again, get base64; then again, get
euc-jp. this time the output is internal, so we maybe switch to a different
category-mapping/priority table, which might [at this moment] only detect the
eol type; decode that, and then get something that doesn't match anything. [we
could put the raw-text category first; that way, crlf and such have to have a
higher likeliness in order for them to take over.]
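a rough sketch of the recursive driver i have in mind; everything named here is
made up, and the real thing would plug into the category tables, but it shows
the control flow:

#include <string.h>

/* Sketch of recursive detection of layered encodings (e.g. base64 over
   gzip over base64 over euc-jp).  run_detector() and decode_step() are
   placeholders standing in for the real detector and decoder. */
enum data_kind { DATA_EXTERNAL, DATA_INTERNAL };

struct detected
{
  const char *coding_system;   /* e.g. "base64", "gzip", "euc-jp" */
  enum data_kind output_kind;  /* what the decoded output would be */
};

/* Returns nonzero if something was detected with sufficient likeliness. */
extern int run_detector (const unsigned char *data, long len,
                         int use_internal_table, struct detected *out);
/* Decodes one layer and returns the new length. */
extern long decode_step (const char *coding_system,
                         const unsigned char *in, long inlen,
                         unsigned char *out, long outsize);

static void
detect_layered (unsigned char *buf, long len,
                unsigned char *scratch, long scratch_size)
{
  struct detected d;
  int use_internal_table = 0;

  while (run_detector (buf, len, use_internal_table, &d))
    {
      len = decode_step (d.coding_system, buf, len, scratch, scratch_size);
      memcpy (buf, scratch, (size_t) len);
      if (d.output_kind == DATA_INTERNAL)
        /* once we're looking at internal data, switch to a table that
           might [at this moment] only detect the eol type, with raw-text
           first so crlf and such need a higher likeliness to win */
        use_internal_table = 1;
    }
}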
------------------------------------------------------------------
ABOUT DETECTION
------------------------------------------------------------------
we have not yet properly abstracted detection. it's a bit tricky to do
because when we do detection, we basically run all the detection
routines in parallel on chunks of data; so the struct detection_state
needs to have space for data for all detection routines, not just one
extra chunk of data. for easy access to detector-specific data, and to
the array of likeliness values that will replace the "mask", we need
macros that declare detectors and categories and initialize constants
for them, similar to lstream type indices. A sample for Unicode looks
like this:
DEFINE_DETECTOR (utf_8);
DEFINE_DETECTOR_CATEGORY (utf_8, utf_8);
DEFINE_DETECTOR (ucs_4);
DEFINE_DETECTOR_CATEGORY (ucs_4, ucs_4);
DEFINE_DETECTOR (utf_16);
DEFINE_DETECTOR_CATEGORY (utf_16, utf_16);
DEFINE_DETECTOR_CATEGORY (utf_16, utf_16_little_endian);
DEFINE_DETECTOR_CATEGORY (utf_16, utf_16_bom);
DEFINE_DETECTOR_CATEGORY (utf_16, utf_16_little_endian_bom);
we could allocate the detection structure and all type-specific data and
likeliness values in one memory block, but we'd need to be careful with
alignment. i think we can make effective use of
ALIGN_SIZE (size, ALIGNOF (max_align_t));
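as a rough illustration of the one-block idea (the names apart from
ALIGN_SIZE/ALIGNOF are made up, and the fallback definitions are only there so
the sketch stands alone):

#include <stddef.h>
#include <stdlib.h>

#ifndef ALIGN_SIZE  /* XEmacs provides these; crude stand-ins for the sketch */
#define ALIGNOF(type) __alignof__ (type)
#define ALIGN_SIZE(len, unit) ((((len) + (unit) - 1) / (unit)) * (unit))
#endif

struct detection_state { int dummy; };  /* placeholder for the common fields */

/* Lay out the common detection_state, each detector's private data, and
   the per-category likeliness array in one block, aligning each piece. */
static void *
allocate_detection_block (int num_detectors, const size_t *detector_data_sizes,
                          int num_categories, size_t *data_offsets,
                          size_t *likeliness_offset)
{
  size_t size = ALIGN_SIZE (sizeof (struct detection_state),
                            ALIGNOF (max_align_t));
  int i;

  for (i = 0; i < num_detectors; i++)
    {
      data_offsets[i] = size;
      size += ALIGN_SIZE (detector_data_sizes[i], ALIGNOF (max_align_t));
    }
  *likeliness_offset = size;
  size += num_categories * sizeof (int); /* one likeliness value per category */

  return calloc (1, size);  /* detector data lives at block + data_offsets[i] */
}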
however, in general the detection code has major problems and needs lots
of work:
-- instead of merely "yes" or "no" for particular categories, we need
a more flexible system, with various levels of likeliness. Currently
I've created a system with six levels, as follows:
enum detection_result
{
  /* Basically means a magic cookie was seen indicating this type, or
     something similar. */
  DETECTION_NEAR_CERTAINTY,
  /* Characteristics seen that are unlikely to be other coding system types
     -- e.g. ISO-2022 escape sequences, or perhaps a consistent pattern of
     alternating zero bytes in UTF-16, along with Unicode LF or CRLF
     sequences at regular intervals.  (Zero bytes are unlikely or impossible
     in most text encodings.) */
  DETECTION_QUITE_PROBABLE,
  /* At least some characteristics seen that match what's normally found in
     this encoding -- e.g. in Shift-JIS, a number of two-byte Japanese
     character sequences in the right range, and nothing out of range; or
     in Unicode, much higher statistical variance in the odd bytes than in
     the even bytes, or vice-versa (perhaps the presence of regular EOL
     sequences would bump this up to DETECTION_QUITE_PROBABLE).  This is
     quite often a statistical test. */
  DETECTION_SOMEWHAT_LIKELY,
  /* Default state.  Perhaps it indicates pure ASCII or something similarly
     vague seen in Shift-JIS.

     (On the other hand, for a pure ASCII detector, this might be exactly
     what you want.  Such a detector ideally wants all bytes in the range
     0x20 - 0x7E (no high bytes!), except for whitespace control chars and
     perhaps a few others; LF, CR, or CRLF sequences at regular intervals
     (where "regular" might mean an average < 100 chars and 99% < 300 for
     code and other stuff of the "text file w/line breaks" variety, but for
     the "text file w/o line breaks" variety, excluding blank lines,
     averages could easily be 600 or more, with 2000-3000 char "lines" not
     so uncommon); similar statistical variance between odds and evens (not
     Unicode); frequent occurrences of the space character; letters more
     common than non-letters; etc.  Granted, this doesn't even apply to
     everything called "ASCII", and we could potentially distinguish ASCII
     for code, ASCII for text, etc. as separate categories.  However, it
     does give us a lot to work off of in deciding what likelihood to
     choose -- and it shows there are in fact a lot of detectable patterns
     to look for even in something seemingly so generic as ASCII.) */
  DETECTION_NO_INFORMATION,
  /* Some characteristics seen that are unusual for this encoding --
     e.g. unusual control characters in a plain-text encoding, or little
     statistical variance in the odd and even bytes in UTF-16. */
  DETECTION_UNLIKELY,
  /* An erroneous sequence was seen. */
  DETECTION_IMPOSSIBLE
};
-- a smarter algorithm to pick the winning category -- something that
weighs the likelihood of being correct against the priority, and may
decide to present more than one possibility to the user. (a sketch of
such an algorithm appears after this list.)
-- The simple list of coding categories per detector is not enough.
Instead of coding categories, we need parameters. For example,
Unicode might have separate detectors for UTF-8, UTF-7, UTF-16,
and perhaps UCS-4; or UTF-16/UCS-4 would be one detection type.
UTF-16 would have parameters such as "little-endian" and "needs BOM",
and possibly another one like "collapse/expand/leave alone composite
sequences" once we add this support. Usually these parameters
correspond directly to a coding system parameter. Different
likelihood values can be specified for each parameter as well as for
the detection type as a whole. The user can specify particular
coding systems for a particular combination of detection type and
parameters, or can give "default parameters" associated with a
detection type. In the latter case, we create a new coding system as
necessary that corresponds to the detected type and parameters.
-- There need to be two priority lists and two category->coding-system
lists. One is general, the other langenv-specific. The user sets the
former, the langenv the latter. The langenv-specific entries take
precedence over the others. This works similarly to the Unicode
charset priority list.
-- a better means of presentation. rather than just coming up
with the new file decoded according to the detected coding
system, allow the user to browse through the file and
conveniently reject it if it looks wrong; then detection
starts again, but with that possibility removed. in cases where
certainty is low and thus more than one possibility is presented,
the user can browse each one and select one or reject them all.
-- fail-safe: even after the user has made a choice, if they
later on realize they have the wrong coding system, they can
go back, and we've squirreled away the original data so they
can start the process over. this may be tricky.
-- using a larger buffer for detection. we use just a small
piece, which can give quite random results. we may need to
buffer up all the data we look through because we can't
necessarily rewind. the idea is we proceed until we get a
result that's at least at a certain level of certainty
(e.g. "probable") or we reached a maximum limit of how much
we want to buffer.
-- dealing with interactive systems. we might need to go ahead
and present the data before we've finished detection, and
then re-decode it, perhaps multiple times, as we get better
detection results.
-- Clearly some of these are more important than others. at the
very least, the "better means of presentation" should be
implemented as soon as possible, along with a very simple means
of fail-safe whenever the data is readily available, e.g. it's
coming from a file, which is the most common scenario.
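here's the promised sketch of a smarter winner-picking step. the scoring is
made up (in particular, it assumes the likeliness values have been remapped so
that bigger numbers mean more likely, with unlikely/impossible at or below
zero); the point is just that likeliness dominates, priority only breaks
near-ties, and runners-up close to the winner could all be offered to the user:

struct category_score
{
  int category;    /* index into the category table */
  int likeliness;  /* remapped from enum detection_result; bigger = likelier */
  int priority;    /* position in the priority list; 0 = highest priority */
};

/* Returns the index of the winning entry, or -1 if nothing is plausible
   (in which case we'd fall back to asking the user). */
static int
pick_winning_category (const struct category_score *scores, int n)
{
  int best = -1, best_score = 0, i;

  for (i = 0; i < n; i++)
    {
      int score;
      if (scores[i].likeliness <= 0)  /* unlikely or impossible: never wins */
        continue;
      /* likeliness dominates; priority only matters between categories
         whose likeliness is comparable */
      score = scores[i].likeliness * 100 - scores[i].priority;
      if (best < 0 || score > best_score)
        {
          best = i;
          best_score = score;
        }
    }
  return best;
}

any categories scoring within some small margin of the winner would be the
"more than one possibility" presented to the user.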
------------------------------------------------------------------
ABOUT FORMATS
------------------------------------------------------------------
when calling make-coding-system, the name can be a cons of (format1 .
format2), specifying that it decodes format1->format2 and encodes the other
way. if only one name is given, that is assumed to be format1, and the
other is either `external' or `internal' depending on the end type.
normally, when decoding, the user gives the decoding order in formats but
can leave off the last one, `internal', which is assumed. a multichain
might look like gzip|multibyte|unicode, using the coding systems named
`gzip', `(unicode . multibyte)' and `unicode'. the way this actually works
is by searching for gzip->multibyte; if not found, look for gzip->external
or gzip->internal. (In general we automatically do conversion between
internal and external as necessary: thus gzip|crlf does the expected, and
maps to gzip->external, external->internal, crlf->internal, which when
fully specified would be gzip|external:external|internal:crlf|internal --
see below.) To forcibly fit together two converters that have explicitly
specified and incompatible names (say you have unicode->multibyte and
iso8859-1->ebcdic and you know that the multibyte and iso8859-1 in this
case are compatible), you can force-cast using :, like this:
ebcdic|iso8859-1:multibyte|unicode. (again, if you force-cast between
internal and external formats, the conversion happens automatically.)
--------------------------------------------------------------------------
ABOUT PDUMP, UNICODE, AND RUNNING XEMACS FROM A DIRECTORY WITH WEIRD CHARS
--------------------------------------------------------------------------
-- there's the problem that XEmacs can't be run in a directory with
non-ASCII/Latin-1 chars in it, since it will be doing Unicode
processing before we've had a chance to load the tables. In fact,
even finding the tables in such a situation is problematic using
the normal commands. my idea is to eventually load the stuff
extremely extremely early, at the same time as the pdump data gets
loaded. in fact, the unicode table data (stored in an efficient
binary format) can even be stuck into the pdump file (which on
windows would mean attaching it as a resource to the executable). we'd
need to extend pdump a bit to allow for attaching extra data to the
pdump file. (something like this: pdump_attach_extra_data (addr, length)
returns a number of some sort, an index into the file, which you
can then retrieve with pdump_load_extra_data(), which returns an
addr (mmap()ed or loaded); later you call pdump_unload_extra_data()
when finished. we'd probably also need
pdump_attach_extra_data_append(), which appends data to the data
just written out with pdump_attach_extra_data(). this way,
multiple tables in memory can be written out into one contiguous
table.) (we'd use the tar-like trick of allowing new blocks to be
written without going back to change the old blocks -- we just rely
on the end of file/end of memory.) this same mechanism could be
extracted out of pdump and used to handle the non-pdump situation
(or alternatively, we could just dump either the memory image of
the tables themselves or the compressed binary version). in the
case of extra unicode tables not known about at compile time that
get loaded before dumping, we either just dump them into the image
(pdump and all) or extract them into the compressed binary format,
free the original tables, and treat them like all other tables.
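spelled out as prototypes plus a usage sketch, the proposed interface might
look like the following; the exact signatures (and whether unload takes the
index or the address) are just guesses from the description above:

#include <stddef.h>

/* Dump time: attach a block of data to the pdump file; the returned
   index can itself be dumped (e.g. stored in a dumped variable). */
int pdump_attach_extra_data (void *addr, size_t length);
/* Append more data to the block just written, so several in-memory
   tables end up as one contiguous table in the file. */
void pdump_attach_extra_data_append (void *addr, size_t length);
/* Load time: retrieve the block (mmap()ed or read in) ... */
void *pdump_load_extra_data (int index);
/* ... and release it when finished. */
void pdump_unload_extra_data (int index);

/* Usage sketch for the unicode tables: */
static int unicode_tables_index;        /* gets dumped with everything else */

static void
dump_unicode_tables (void *table1, size_t len1, void *table2, size_t len2)
{
  unicode_tables_index = pdump_attach_extra_data (table1, len1);
  pdump_attach_extra_data_append (table2, len2);
}

static void
init_unicode_tables_very_early (void)
{
  void *tables = pdump_load_extra_data (unicode_tables_index);
  /* ... build the in-memory conversion tables from the binary data in
     `tables' ... */
  pdump_unload_extra_data (unicode_tables_index);
}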
--------------------------------------------------------------------------
HANDLING WRITING A FILE SAFELY, WITHOUT DATA LOSS
--------------------------------------------------------------------------
-- When writing a file, we need error detection; otherwise somebody
will create a Unicode file without realizing the coding system
of the buffer is Raw, and then lose all the non-ASCII/Latin-1
text when it's written out. We need two levels:
1. first, a "safe-charset" level that checks before any actual
encoding to see if all characters in the document can safely
be represented using the given coding system. FSF has a
"safe-charset" property of coding systems, but it's stupid
because this information can be automatically derived from
the coding system, at least the vast majority of the time.
What we need is some sort of
alternative-coding-system-precedence-list, langenv-specific,
where everything on it can be checked for safe charsets and
then the user given a list of possibilities. When the user
does "save with specified encoding", they should see the same
precedence list. Again, as with other precedence lists,
there's also a global one, and presumably all coding systems
not on either list get appended to the end (and perhaps not
checked at all when doing safe-checking?). safe-checking
should work something like this: compile a list of all
charsets used in the buffer, along with a count of chars
used. that way, "slightly unsafe" charsets, which would lose
only a few characters, can perhaps be presented at the end,
since they may be exactly what the user was looking for. (a
sketch of this pass appears after this list.)
2. when actually writing out, we need error checking in case an
individual char in a charset can't be written even though the
charsets are safe. again, the user gets the choice of other
reasonable coding systems.
3. same thing (error checking, list of alternatives, etc.) needs
to happen when reading! all of this will be a lot of work!
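here's a sketch of the safe-checking pass from (1): tally charset usage over
the buffer once, then score each candidate coding system by how many
characters it would lose. the types and the coding_system_supports_charset()
helper are placeholders, not the real charset machinery:

#define MAX_CHARSETS 256    /* placeholder bound */

struct charset_usage
{
  long count[MAX_CHARSETS];   /* chars used in the buffer, per charset id */
};

/* Placeholder: can this coding system represent this charset at all? */
extern int coding_system_supports_charset (int coding_system, int charset);

/* One pass over the buffer's characters (given here as charset ids);
   `usage' is assumed zero-initialized by the caller. */
static void
tally_buffer_charsets (const int *char_charset_ids, long nchars,
                       struct charset_usage *usage)
{
  long i;
  for (i = 0; i < nchars; i++)
    usage->count[char_charset_ids[i]]++;
}

/* Number of characters that would be lost encoding with this coding
   system: 0 means safe; a small number means "slightly unsafe", still
   worth offering near the end of the list of possibilities. */
static long
chars_lost (const struct charset_usage *usage, int coding_system)
{
  long lost = 0;
  int cs;
  for (cs = 0; cs < MAX_CHARSETS; cs++)
    if (usage->count[cs] != 0
        && !coding_system_supports_charset (coding_system, cs))
      lost += usage->count[cs];
  return lost;
}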
--ben
"Stephen J. Turnbull" wrote:
>>>>> "ben" == Ben Wing <ben(a)666.com> writes:
ben> Stefan is right, we need formats, not operations.
I think this is more confusion than clarification. In general, there
may be a number of different things to do with a format, or even a
pair of formats.
Eg, isn't what you really want to be able to write something like this:
(define-format jp-gz
  ;; The (detect ...) form is like a cond.
  ;; Each clause has a format to be checked; if the condition is OK,
  ;; execute the list of converters in the cdr of the clause in order.
  '(decoder
    (detect (gzip gunzip
                  (detect (utf-8 utf8-to-mule)
                          (shift-jis sjis-to-mule)
                          (euc-jp eucjp-to-mule)
                          (iso2022jp iso2022jp-to-mule)
                          ;; yeah yeah this syntax is bogus
                          (other (error "This isn't Japanese.")))
                  ;; no error if eol not detected, assume Unix
                  (detect (eol-crlf dos-to-unix)
                          (eol-cr mac-to-unix))
                  (other (error "This isn't gzip."))))))
But now it's not obvious what should be done with a gzip if we specify
(gzip internal). Should we apply gunzip to it, or should we apply the
jp-gz decoder? Both result in the `internal' format. How do we
disambiguate that?
Also, note that the jp-gz decoder does not have a well-defined inverse
operation. But wouldn't we like to be able to define such auto-
detecting converters, and pipeline them with other converters? Eg,
(define-format maybe-gz
  '(decoder
    (detect (gzip gunzip))))
You can put the maybe-gz converter in any input pipeline, and it will
detect gzipped files and gunzip them but not barf if they aren't (the
plain gzip converter should do so, I think). But it doesn't have an
inverse, either. This isn't quite it:
(define-format maybe-gz-too
  '(decoder
    (detect (gzip gunzip))
    encoder
    (detect (gzip identity)
            (other gzip))))
but that looks kinda useful to me.
We do not have these problems if we treat the operations as operations,
rather than try to deduce them from pairs of formats. That's always
been one of the problems with Mule: it insists on doing conversions
that you don't want and don't know about until your data is corrupted.
I don't see how we can avoid that if we try to work in terms of
formats rather than operations.
--
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences Tel/fax: +81 (298) 53-5091
_________________ _________________ _________________ _________________
What are those straight lines for? "XEmacs rules."