Re: [RFC PATCH] Re: excessive thread fusing

Subject: Re: [RFC PATCH] Re: excessive thread fusing

Date: Mon, 21 Apr 2014 12:20:58 -0400

To: Mark Walters

Cc: notmuch

From: Austin Clements


Quoth Mark Walters on Apr 21 at  8:20 am:
> 
> >> I haven't tracked through all the logic of the existing algorithm for
> >> this case. But I don't like hearing that notmuch constructs different
> >> threads for the same messages presented in different orders. This sounds
> >> like a bug separate from what we've discussed above. 
> 
> I think I have now found this bug and it is separate from the malformed
> In-Reply-To problems.
> 
> The problem is that when we merge threads we update all the thread-ids
> of documents in the loser thread. But we don't (if I understand the code
> correctly) update dangling "metadata" references to threads which don't
> (yet) have any documents.

This exactly the problem I wrote
id:1395608456-9673-1-git-send-email-amdragon@mit.edu to test, but I
had convinced myself everything was okay because we link a message to
both its parents and all of its children.  But that's only true if you
eventually receive the linking message (which in the test I made, you
do).  In this case, you never receive the linking message, so even
though notmuch has enough information to bring the two threads
together, it doesn't.

Maybe I should create a second variant of that test where all of the
messages reference their entire heritage (rather than just their
immediate parent) and test that they're *always* in one thread
regardless of receipt order (rather than only checking once they've
all been received)?  I think that would weed out this case.

> To make this explicit consider the 2 messages 17,18 in the set. 
> 
> Message 17 has id <87wrkidfrh.fsf@pinto.chemeng.ucl.ac.uk> and has no
> references/in-reply-to so has no parents.
> 
> Message 18 has a reference to <87wrkidfrh.fsf@pinto.chemeng.ucl.ac.uk>
> and an in-reply-to to <e.fraga@ucl.ac.uk> and
> <87wrkidfrh.fsf@pinto.chemeng.ucl.ac.uk>
> 
> If you add 17 first then it gets thread-id 1 and then when you add 18 message 18 gets
> thread-id 2 as does the dangling link to the "unseen" message
> e.fraga@ucl.ac.uk, and then message 17 is moved to thread-id 2.
> 
> However, if you add 18 first then it gets thread-id 1 and the dangling
> link gets id 1. When 17 is added it gets thread-id 2, message 18 gets
> thread-id updated to 2 *but* the dangling link to e.fraga@ucl.ac.uk does
> not get updated so stays thread-id 1.
> 
> In particular when 52 comes along with its link to e.fraga@ucl.ac.uk
> then:
> 
>         In the first case it gets put in thread-id 3 and the other two
>         messages get moved into thread 3.
> 
>         In the second case, message 52 gets put in thread 3 and thread 1
>         (the dangling link) gets moved into thread 3 leaving thread 2
>         (containing messages 17 and 18) untouched.

So, there's an obvious, messy way to fix this: update the metadata
references when we do thread renumbering.  This is messy because that
data *isn't indexed*.  The only way to find the records we need to
update is to scan them all.  This isn't completely terrible because
it's a sequential scan and we could cache it in memory, but it
certainly isn't going to help notmuch new's performance.  (My database
has 6,749 of these, which takes ~1 second to scan on a cold cache,
though that's with an SSD [1]).


But let me propose an idea I've been kicking around for a while: ghost
message documents.  Rather than using user metadata for tracking these
missing messages, use regular documents with the exact same terms we
use now for message IDs and thread IDs, but with a Tghost term instead
of a Tmail term to distinguish their type.  This solves the problem
using infrastructure we already have in place, simplifies the message
linking code, and may even make it faster.  It's a schema update, but
a simple and fast one.  I think the hardest part is that things like
notmuch_database_find_message would need to distinguish ghosts and
regular messages (which may require pre-fetching the Tghost or Tmail
posting list to do efficiently).

This also sets us up to do some cool things in the future, though
they're more invasive.  If we have message-like documents for these
ghosts, we can store other message-like metadata as well.  If we store
tags on them, then we can keep tags around for deleted messages and
*reapply them* if the message comes back.  This would finally fix the
races we have now where, if a message is renamed or moved during a
notmuch new, we may think it's deleted only to reindex it with default
tags on the next run.  We could also pre-tag messages that haven't
been indexed yet, say from procmail or when sending a message.  This
could simplify or even obviate notmuch insert.  If we add message
ctimes as proposed by Dave Mazières, this would give us a place to
store and query ctimes of deleted messages (otherwise it's unclear how
you find out about deletions without a full database scan).  In
effect, the database becomes truly monotonic.

[1] Curious?
    yes n | xapian-inspect postlist.DB | \
    awk '!/Key/ {next} /Key: \\x00\\xc0thread_id_/ {N++} /Key: \\x00\\xd0/ {exit} END {print N}'

Thread: