Re: [RFC PATCH] Re: excessive thread fusing

Subject: Re: [RFC PATCH] Re: excessive thread fusing

Date: Sun, 20 Apr 2014 13:03:34 +0100

To: Carl Worth, David Bremner, notmuch

Cc:

From: Mark Walters


On Sun, 20 Apr 2014, Carl Worth <cworth@cworth.org> wrote:
> Mark Walters <markwalters1009@gmail.com> writes:
>> I have done dome debugging of this.
>
> Thanks for looking closely, Mark!
>
>> There is a patch below which fixes this test case but who knows what
>> it breaks! Please DO NOT apply unless someone who knows this code says
>> it's OK.
>
> I wrote much of the original code being patched here, so hopefully I
> understand it and can say something useful.
>
> I agree that the patch should not be applied. I don't like to see one
> piece of code not trusting another in the same code base. If the
> parse_references() function doesn't deal well with a malformed header,
> then we should fix it, not step around it.

>
> Meanwhile, not treating all potential referenced message IDs
> consistently could definitely make the notmuch algorithm more fragile
> and sensitive to the order of message indexing, etc. So let's not do
> that.

I agree. This bug first came up in id:874nvcekjk.fsf@qmul.ac.uk; I think
that got mostly fixed by cf8aaafbad68
(id:1361836225-17279-1-git-send-email-aaronecay@gmail.com and related
thread) so we may want to check whether that change is still wanted if
we fix the actual bug.

> Instead, let's track down and fix the actual bug.
>
> Thanks for the idea of using two-digit names for these messages. That
> makes it much easier to inspect the relevant headers.
>
> Below, I've grepped out the actual References and In-Reply-To headers
> From the messages, and then simply substituted minimal, and
> easy-to-understand values for the message IDs.
>
> With these minimally modified headers, it's easy to manually inspect the
> relationships and see that messages 17 and 18 belong in one thread, and
> messages 32-52 belong in a separate thread.
>
> It's also quite easy to see the potential source of the bug. The
> In-Reply-To headers for messages 18, 32, and 52 all share a common
> string (an email address) formatted to look like a message-id,
> "<e.fraga@ucl.ac.uk>". If notmuch looks at those headers, and treats
> that string as a message-id, then all of theses messages will be
> connected into a single thread.
>
> And since that's the reported behavior, it seems likely that
> "<e.fraga@ucl.ac.uk>" is the cause of this bug.
>
>> I put some debug stuff in _notmuch_database_link_message_to_parents and
>> I think that the problem comes from the call to parse_references on line
>> 1767 which adds the malformed in-reply-to header to the hash table, so
>> this malformed line gets added as a potential parent. 
>
> Am I correct that your debugging showed that "<e.fraga@ucl.ac.uk>" is
> being added to the hash table?

Yes that is correct.

> My inspection of _parse_references() and parse_message_id() suggests
> that that's exactly what notmuch is doing, (treating both of the
> angle-bracketed portions ("<e.fraga@ucl.ac.uk>" as well as the actual
> message-ID, "<ID17>" or "<ID31>" or "<ID39>") as message IDs.
>
> So it seems like we need a new _parse_in_reply_to() function to use in
> place of _parse_references() and the new function will need a better
> heuristic for dealing with the unpredictability of In-Reply-To.
>
> The only real reason that we are trying to grab multiple message ID
> values from an In-Reply-To header is that RFC 2822 explicitly allows
> that, (to support a message simultaneously replying to multiple
> messages). I don't believe that that's common, but we might as well
> support it. At the same time, RFC 2822 also explicitly specifies that
> the In-Reply-To header will consist of nothing but message IDs.
>
> So perhaps the heuristic here could be to notice any characters outside
> of angle brackets, (like "Message" in the headers below), and in that
> case go to a strictly "not RFC 2822" mode and look for exactly one
> message ID. At that point, JWZ would recommend "the first <>-bracketed
> text found therein", but that would give precisely the wrong answer in
> this particular case. Here the correct Message ID appears in the last
> <>-bracketed text. I have not surveyed a large email corpus to determine
> how often "last <>-bracketed text" would fail as a heuristic.
>
> Another idea would be to trigger specifically on common forms. Judging
> From the samples in this particular thread, it seems like a workable
> heuristic would be:
>
> 	If the In-Reply-To header begins with '<':
>
> 		Parse that initial portion as a message ID
>
> 	Else if it ends with '>':
>
> 		Parse that final portion as a message ID
>
> 	Else
>
> 		Ignore this garbage-valued header.
>
> That's probably the best and most reliably thing to do here.
>
> Does anyone have any better ideas?

Is there a case coming before all the above: if the In-Reply-To header
is correctly formed then parse as we do currently? (You sort of suggest
so above but I just wanted to check)

>> As a clear example that I don't understand this code I don't know why
>> this no longer causes a problem if message 17 gets added too.
>
> I wanted to test my own knowledge of the code to see if I could explain
> this. But I didn't precisely follow your explanation of the behavior you
> saw. In cases (1) and (2) of your description, what order are you using
> to "add all messages" or "add all apart from 52"?

I just untarred the tar file David posted. Then the messages get added
in the following order:

45 39 47 33 31 18 42 51 41 46 37 44 35 36 34 49 40 48 38 52 17 50 32 43

which is the same as the order in the tar file. (I think this is notmuch
using inode based sort as it has not seen the directory before)

In Case 2 I started with a fresh untar; then moved message 52 out of the
Maildir; ran notmuch new,  then moved message 52 back
into the the Maildir tree and ran notmuch new again.

> Then, for cases (3) and (4), what is done before adding the messages
> mentioned in these cases? Add all other messages? Again, in what order?

In case 3 I started with a fresh untar. Moved all the message except 18
elsewhere. ran notmuch new. moved message 52 back and ran notmuch new.

In have checked case 4 carefully adding messages 1 at a time and running
notmuch new between each addition.

If I add 18 17 52 I get 2 threads.
If I add 17 18 52 I get 1 thread

> I haven't tracked through all the logic of the existing algorithm for
> this case. But I don't like hearing that notmuch constructs different
> threads for the same messages presented in different orders. This sounds
> like a bug separate from what we've discussed above. 

I agree but I don't know the logic well enough to be sure.

Best wishes

Mark

>
> -Carl
>
> 18:References: <ID17>
> 32:References: <ID31>
> 33:References: <ID31> <ID32>
> 34:References: <ID31> <ID32> <ID33>
> 35:References: <ID31> <ID32> <ID33>
> 36:References: <ID31> <ID32> <ID33> <ID35>
> 37:References: <ID31> <ID32> <ID33> <ID35> <ID36>
> 38:References: <ID31> <ID32> <ID33> <ID35> <ID36> <ID37>
> 39:References: <ID31> <ID32>
> 40:References: <ID31> <ID32> <ID39>
> 41:References: <ID31> <ID32> <ID39> <ID40>
> 42:References: <ID31> <ID32> <ID39> <ID40> <ID41>
> 43:References: <ID31> <ID32> <ID39> <ID40> <ID41> <ID42>
> 44:References: <ID31> <ID32> <ID39> <ID40> <ID41> <ID42>
> 45:References: <ID31> <ID32> <ID39> <ID40>
> 46:References: <ID31> <ID32> <ID39> <ID40> <ID45>
> 47:References: <ID31> <ID32> <ID39> <ID40> <ID45> <ID46>
> 48:References: <ID31> <ID32> <ID39> <ID40> <ID45> <ID46> <ID47>
> 49:References: <ID31> <ID32> <ID39> <ID40> <ID45> <ID46> <ID47> <ID48>
> 50:References: <ID31> <ID32> <ID39> <ID40> <ID45> <ID46> <ID47> <ID48> <ID49>
> 51:References: <ID31> <ID32> <ID39> <ID40> <ID45> <ID46> <ID47> <ID48> <ID49> <ID50>
> 52:References: <ID31> <ID32> <ID39>
>
> 18:In-reply-to: Message from Eric S Fraga <e.fraga@ucl.ac.uk> of "Tue, 01 Mar 2011 15:25:38 GMT." <ID17>
> 32:In-Reply-To: Message from Eric S Fraga <e.fraga@ucl.ac.uk> of "Thu, 10 Mar 2011 21:00:16 GMT." <ID31>
> 33:In-Reply-To: <ID32> (Nick Dokos's message of "Thu, 10 Mar 2011 18:06:33 -0500")
> 34:In-Reply-To: <ID33>
> 35:In-Reply-To: <ID33>
> 36:In-Reply-To: <ID35> (Carsten Dominik's message of "Sun, 13 Mar 2011 08:39:13 +0100")
> 37:In-Reply-To: <ID36>
> 38:In-Reply-To: <ID37> (Carsten Dominik's message of "Mon, 14 Mar 2011 08:40:33 +0100")
> 39:In-Reply-To: <ID32> (Nick Dokos's message of "Thu, 10 Mar 2011 18:06:33 -0500")
> 40:In-Reply-To: <ID39>
> 41:In-Reply-To: <ID40> (Carsten Dominik's message of "Fri, 11 Mar 2011 12:36:13 +0100")
> 42:In-Reply-To: <ID41>
> 43:In-Reply-To: <ID42>
> 44:In-Reply-To: <ID42>
> 45:In-reply-to: Message from Carsten Dominik <carsten.dominik@gmail.com> of "Fri, 11 Mar 2011 12:36:13 +0100." <ID40>
> 46:In-Reply-To: <ID45>
> 47:In-reply-to: Message from Carsten Dominik <carsten.dominik@gmail.com> of "Mon, 14 Mar 2011 11:21:36 BST." <ID46>
> 48:In-Reply-To: <ID47>
> 49:In-reply-to: Message from Carsten Dominik <carsten.dominik@gmail.com> of "Mon, 14 Mar 2011 18:02:54 BST." <ID48>
> 51:In-Reply-To: <ID50>
> 52:In-reply-to: Message from Eric S Fraga <e.fraga@ucl.ac.uk> of "Fri, 11 Mar 2011 08:47:58 GMT." <ID39>
>
> -- 
> carl.d.worth@intel.com

Thread: