Re: latin encodings, mail-lib, and posix regexp classes

Thursday, 6 June 2002

Josh Huber <huber(a)alum.wpi.edu> writes:

...
 Simon Josefsson <jas(a)extundo.com> writes:

> RFC (2)822 articles should not contain iso-8859-15 characters.
> Perhaps mail-extr is supposed to operate on raw articles, not MIME
> decoded ones?

 Ah, you're probably right.  Now I'm not sure what's supposed to be
 happening, since BBDB gets passed the header after it's been decoded,
 but it uses mail-extract-address-components (mail-extr.el) or
 rfc822-addresses (rfc822.el) [both part of mail-lib].  So, where's the
 bug?  There's code in there already to handle latin-1 chars:

 (let* ((latin1-ss (string (make-char 'latin-iso8859-1 223)))
        (latin9-ss (string (make-char 'latin-iso8859-15 1759)))
        (latin1-addr (concat "Joe Te" latin1-ss "t <joe.test@foo.org>"))
        (latin9-addr (concat "Joe Te" latin9-ss "t <joe.test@foo.org>")))
   (concat "Works: <" (car (mail-extract-address-components latin1-addr))
           ">, Broken: <" (car (mail-extract-address-components latin9-addr))
           ">"))

 => "Works: <Joe Teßt>, Broken: <Joe Te>"

This might just be an accident; I don't think mail-extr.el was
designed for anything but ASCII.  Can't BBDB be modified to work with
the raw header?  Then it doesn't have to rely on the caller passing it
correctly decoded data.
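
One way to picture the raw-header approach: take the header apart
while it is still plain ASCII, and decode only the display name
afterwards.  The following is a minimal sketch, not BBDB's or
mail-extr.el's actual code.  It assumes the header value has the
simple shape "Phrase <address>" and that the name arrives as an
RFC 2047 encoded-word.  `my-decode-raw-from' is a hypothetical helper
introduced for illustration; `rfc2047-decode-string' is the decoder
from rfc2047.el.

```elisp
(require 'rfc2047)

(defun my-decode-raw-from (raw)
  "Return (NAME ADDRESS) from RAW, a single-line \"Phrase <addr>\" value.
Hypothetical helper: it handles only that one simple shape and is not
a full RFC 822 address parser."
  (when (string-match "\\`\\(.*?\\)[ \t]*<\\([^>]+\\)>\\'" raw)
    ;; Grab both groups before calling anything that may change match data.
    (let ((phrase (match-string 1 raw))
          (address (match-string 2 raw)))
      ;; Decode only after the ASCII-only structure has been split.
      (list (rfc2047-decode-string phrase) address))))

(my-decode-raw-from "=?iso-8859-15?Q?Joe_Te=DFt?= <joe.test@foo.org>")
;; => ("Joe Teßt" "joe.test@foo.org")
```

With this ordering the structural parsing never sees a non-ASCII
character, so it does not matter which Latin charset the name uses.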

...
> mail-extr.el is in need of FSF syncing.  Perhaps that would be the
> first step?  It is a large undertaking though.

 Indeed.  Since there isn't support (although there is mention of it in
 regex.h) for POSIX char classes, this will be even more work.  Adding
 support for them first would be a good thing, imho.  Or perhaps we
 could rely on the fact that syntax tables are defined for each part of
 the address during parsing and just use \sw as the match character?

Emacs' mail-extr.el uses POSIX char classes in three places (which I
suspect might not be entirely correct, since it only works for raw
8-bit characters in headers, which are forbidden), so I think it can
be synced without worrying about it very much.
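
To make the two options concrete, here is a small sketch comparing a
POSIX class, the \sw syntax class, and a plain ASCII range on a name
containing ß.  It assumes an Emacs whose regexp engine supports
[[:alpha:]] and uses the standard syntax table rather than the tables
mail-extr.el sets up while parsing.  `my-leading-match' is a
hypothetical helper, not part of mail-extr.el.

```elisp
(defun my-leading-match (regexp string)
  "Return the prefix of STRING matched by REGEXP, or nil if none.
Hypothetical helper for comparing character-matching strategies."
  (let ((case-fold-search nil))
    ;; \sw consults the current syntax table, so pin it down explicitly.
    (with-syntax-table (standard-syntax-table)
      (when (string-match (concat "\\`\\(?:" regexp "\\)") string)
        (match-string 0 string)))))

(my-leading-match "[[:alpha:]]+" "Teßt <x>")  ; => "Teßt"
(my-leading-match "\\sw+" "Teßt <x>")         ; => "Teßt"
(my-leading-match "[A-Za-z]+" "Teßt <x>")     ; => "Te"
```

The ASCII-only range stops at the ß, which resembles the truncated
"Joe Te" seen earlier in this thread, though that is not necessarily
the cause of that result.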
