Stephen J. Turnbull writes:
I agree that VM is better than Netscape 4.0 in the UI, since it has
a
separate toggle for threading, rather than listing it as a kind of
sorting.
But the implementation contradicts what the docs say:
No, I don't think so. You can turn on threading and sort the messages by
various sorting orders. Unless there is a bug in the implementation, they
should all work. The following bug-report provides some background as to
why this was important:
https://bugs.launchpad.net/vm/+bug/498323
In Kyle Jones's VM, there was one built-in sort order for threads (what we call
'activity' order these days), and that is all you got. People tried to
persuade him to make it flexible but he didn't have any interest in it. I
myself found activity order confusing when dealing with large threads and
wished for the 'date' order. Some people prefer to use the
'reversed-date'
order and presumably 'reversed-activity' order. So, these are the 4
important orders people want to use in practice. In future, we might have
'delivery-date' and its reverse, and God knows what else.
> How the threads affect the sort order is of course handled in
the
> sorting function.
There's no "of course" about it. It's much simpler if you sort the
sibling sets in the tree rather than trying to use thread information
in the sort. It's not clear to me that it's all that much slower in
practice (vm-sort is O(N log N), jwz-thread is O(N log M) where N is
messages in folder and M is number of siblings per parent).
So, you would sort the roots of all threads by the requested sort order,
then sort the children of each root, and then their children and so on,
recursively? That was in fact the first idea that occurred to me. But then
I figured out a way to incorporate everything in just one sort applied to
all messages (I still do remember how people used to sort punched cards in
the old days!) and it became straightforward.
The trick is based on the following property:
The sort order of two messages (m1, m2) in a threaded sort reduces to the
sort order of their oldest different ancestors (p1, p2).
That is, m1 precedes m2 iff if p1 precedes p2.
If m1 and m2 belong to different threads, then their oldest different
ancestors would be their thread roots. Then the property is clear.
If they belong to the same thread, then you can take the thread minus its
root as a message list, and the same reasoning applies within the pared down
message list.
So, vm-sort-compare-thread, given m1 and m2, returns the corresponding p1
and p2, and the remaining sort-keys will order p1 and p2 in an appropriate
way.
I think I would call this clean and elegant, rather than "complex". If I
was working in a higher-order programming language, I would make its type to
be
vm-sort-compare-thread : sort-order -> sort-order
which captures more clearly what is going on.
As far as I can see, there are a couple of ideas missing.
In particular, the empty container at root concept. ISTR something
else, but it was a while back.
I don't recall all the details either. As I said, I used Jamie's web page
to understand what VM was doing. But I made no attempt to modify VM's
algorithm to correspond to Jamie's. One difference I remember vividly is
that Kyle Jones was keeping cycles in the thread-tree and detecting them
when traversing the tree. This made things unstable. One time the cycle
might get resolved one way and another time another way. So, I used Jamie's
recommendation not to keep cycles in the thread tree at all, and that made
things cleaner and clearer.
There are a number of reasons why I don't think discarding the
containers is a good idea in VM. First, the thread tree is not
folder-local, it's universal (ie, across users and nodes on the
Internet!)
I am not sure I understand how the Internet comes into the picture. VM's
thread trees are indeed folder-local. Virtual folders have their own thread
trees. I see that it would be feasible to keep a single thread tree for an
entire VM session. That would eliminate the issues of having to update the
thread trees of virtual folders when an underlying folder changes. However,
you would need to address the problem of reclaiming the thread tree nodes of
the folders that have been quit, basically rolling out your own memory
manager for the thread trees. So, that doesn't come for free!
Once I get the basic algorithm working, I plan to look at
computing "views" per-folder from a global database for the VM
session. (Maybe this isn't a great idea globally for a VM session,
but virtual folders mean a per-real-folder database doesn't fly
either. If you use V A and the like a lot, you're going to be
building a lot of databases unless you reuse existing thread
information.)
Indeed. But, as you observed earlier, building the thread-tree is not all
that expensive. So, doing it as part of virtual folder creation is quite
acceptable.
What VM does avoid is the rebuilding of the thread tree when you get new
mail or expunge a folder or just re-sort the messages. It would be feasible
to rebuild the thread trees in those cases. I don't have strong objections
to that if the thread-tree building is fast enough. One worry is that the
cycles in the thread tree might get resolved differently each time, and that
might confuse the user.
Finally, you're quite right, that (limited!) use of operations
on
trees like reparenting (sub)trees, and adding or removing nodes can be
very useful to help the user present the threads. One that you didn't
mention here but is in the VM code is one that bugs me a lot, namely
multiple instances of the same message. The database of containers
can group these for you in a flexible way, the summary list can't.
Yes, indeed, multiple instances are definitely a serious issue. I referred
to them as "duplicate copies" of messages earlier. They did give me
significant trouble because I didn't worry about them initially. But,
later, I discovered that it was important to keep track of which copy among
the duplicate copies is regarded as the canonical copy in the thread tree.
The duplicate copies share the same message-id, but we cannot assume that
they are identical. In the simplest case, one of them might have the
'deleted' attribute set and another one not. The duplicate copies have
different UID's on the IMAP server. Sometimes, they even have different
References headers because the mailing-list admins munge them!
> Oh, if I do understand why it proved to be so hard, it will
> probably go into a journal or perhaps it will turn into a bigger
> project!
Message threading in principle isn't hard, because the RFC 1035/5322
definitions of References and In-Reply-To guarantee a tree if all MUAs
implement those fields conformantly, and the information provided
gives the appropriate links.
Coming up with the algorithms wasn't hard, especially, because they were
there already from Kyle Jones. But smoothing out the problems and fixing
the bugs was hard. I do have some idea why that was. When I wrote out the
invariants for the data structure, they turned out to be longer than a
screenful. So, at each point in the (internal) code, I had to know which of
the invariants were true and which were not. And, that is quite a lot of
intellectual labor!
So, the research question for me is how to layer the data structure so that
the invariants are modularized and the intellectual labor in working with
the code is reduced. I am also interested in how much the existing
verification tools (such as Code Contracts from Microsoft) can help in
checking these invariants, and whether new verification technology would be
needed.
Cheers,
Uday
_______________________________________________
XEmacs-Beta mailing list
XEmacs-Beta(a)xemacs.org
http://lists.xemacs.org/mailman/listinfo/xemacs-beta