"Stephen J. Turnbull" <stephen(a)xemacs.org> writes:
>> Writing a formal RFC2822 parser might actually give us less
>> than what we need...
Simon> I don't think so -- having a real parser would allow you to
Simon> say that you want to look at the first comment after an
Simon> addr-spec. A function could try all the standard ways to
Simon> put the full name in a header line and return it to the
Simon> caller.
The problem is that there aren't (as of RFC 1123, I'm not really up on the
RFC2822 clarifications) standardized ways, and identifying them is
actually more a lexical problem than an RFC2822 syntactic issue. A
correct parser would at least allow us to separate the net address
from the rest of the goop, though, which none of the current
applications I use (Gnus, VM, BBDB, Supercite) do with 100% accuracy.
Identifying the addr-spec should be straightforward. As for
extracting the full name, RFC 2822 now says one format SHOULD be used.
There is at least one older format used (first comment after
addr-spec) but supporting those two <famous last words>should be
sufficient</>.
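
To make the two formats concrete, here is a rough sketch in Python
(not Emacs Lisp, purely for illustration) using the standard library's
email.utils.parseaddr. The names and addresses are made up, and this
only shows the two cases mentioned above, not a full RFC 2822 parser.

```python
from email.utils import parseaddr

# Preferred RFC 2822 form: display name followed by an angle-addr.
preferred = parseaddr('"Simon Example" <simon@example.org>')

# Older form: bare addr-spec with the full name in a trailing comment.
legacy = parseaddr('simon@example.org (Simon Example)')

# Both calls return a (full_name, addr_spec) tuple.
print(preferred)
print(legacy)
```

Either way the caller gets the net address separated from the rest of
the header, which is the part that should be unambiguous.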
>> Another problem (separate from RFC2822 issues) is that things
>> like BBDB store the rendered version, so if we start pulling
>> out unrendered chunks of headers, we'd have to render them
>> before comparing...
Simon> This is a problem in BBDB currently too. Most if not all
Simon> non-ASCII names have multiple entries in my .bbdb because
Simon> they are encoded differently. OTOH using raw mail headers
Simon> would be just as bad. A proper
Simon> canonicalize-then-compare-for-equality function is needed
Simon> to fix this.
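
As an illustration of the duplicate-entry problem, a minimal sketch in
Python (again only for illustration, with a made-up name): the same
name arrives as two different RFC 2047 encoded-words, and only the
rendered forms compare equal. The render helper is hypothetical; it
just wraps the standard email.header functions.

```python
from email.header import decode_header, make_header

def render(raw):
    """Decode RFC 2047 encoded-words in a raw header value to text."""
    return str(make_header(decode_header(raw)))

# The same name, encoded two different ways by two different MUAs.
latin1 = "=?iso-8859-1?q?J=F6rg?="
utf8 = "=?utf-8?b?SsO2cmc=?="

print(latin1 == utf8)                  # raw headers differ
print(render(latin1) == render(utf8))  # rendered text matches
```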
"Proper" probably has to wait for full Unicode support, as people can
always confuse you with [U+00E4] <-> [U+0061 U+0308] in UTF-8, or even
[a "] in pidgin composition.
I don't think there will ever be a proper solution. Full Unicode has
its problems too; the main one is that Unicode is a moving target and
keeps evolving, but also that you usually need to implement a
decomposition mechanism to compare strings. And Unicode Inc keeps
remapping the decomposition tables so you never know what results you
get.
But as long as the BBDB database is stored in a UCS (which it
currently is not), a quite high degree of canonicalization is
automatic in Mule if rendered versions are always used, as things
like Unicode composing characters are rare.
They will become more common when Unicode becomes more common... most
likely a Unicode decomposition should be performed before comparing
the strings though.
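
A small sketch of what decompose-then-compare could look like, in
Python for illustration, using unicodedata.normalize with the NFD form.
The same_name helper and the sample name are hypothetical, and the
result depends on the Unicode tables shipped with the interpreter.

```python
import unicodedata

def same_name(a, b):
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "J\u00e4rvinen"   # U+00E4
combining = "Ja\u0308rvinen"    # U+0061 U+0308

print(precomposed == combining)           # plain comparison fails
print(same_name(precomposed, combining))  # decomposed forms match
```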
We really need to get this stuff unified and ripped _out_ of the MUAs
and into a separate library so that fixes and improvements to VM,
Gnus, Mew, Wanderlust, TM, SEMI, mh-e, etc etc all go into the same
library.
I agree completely. Unfortunately I don't have the time. :-/