[RFC PATCH] Re: excessive thread fusing

Subject: [RFC PATCH] Re: excessive thread fusing

Date: Sun, 20 Apr 2014 08:14:56 +0100

To: David Bremner, notmuch

Cc:

From: Mark Walters


On Sat, 19 Apr 2014, David Bremner <david@tethera.net> wrote:
> Gregor Zattler mentioned some problems with threading at 
>
>        id:20120126004024.GA13704@shi.workgroup
>
> After some off list discussions, I believe I have a smaller test case.
>
> The attached maildir contains 24 messages from the org mode list.
>
> According to notmuch, these form one thread, but I can't figure out
> exactly why. It seems like the chronologically first two messages should
> be a seperate thread. There are several of the infamous malformed ME-E
> In-reply-to headers, but each of these messages also has a References
> header; this seems to indicate a case missed by commit cf8aaafbad68.

Hi 

I have done dome debugging of this. There is a patch below which fixes
this test case but who knows what it breaks! Please DO NOT apply unless
someone who knows this code says it's OK.

First, the bug is quite sensitive. The attached 24 messages are numbered
and i will use the last two digits to refer to them (ie the 2 digits are
the ?? in 1397885606.0002??.mbox:2,). The number range from 17-52; 17
and 18 should be one thread and the rest a different thread.

1) If you add all messages you get one thread. 
2) If you add all apart from 52 you get 2 threads. However, then adding
52 still gives two threads.
3) If you add 18 and then 52 you get 1 thread.
4) If you add 17 and 18 then 52 you get 2 threads.

I think notmuch will use inode sort and since the tar file contains
these three files in the order 18 52 17 we get cases 1 and 2 above.

I put some debug stuff in _notmuch_database_link_message_to_parents and
I think that the problem comes from the call to parse_references on line
1767 which adds the malformed in-reply-to header to the hash table, so
this malformed line gets added as a potential parent. 

As a clear example that I don't understand this code I don't know why
this no longer causes a problem if message 17 gets added too.

Best wishes

Mark

---
 lib/database.cc |   21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/lib/database.cc b/lib/database.cc
index 1efb14d..373a255 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -1763,20 +1763,23 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
 					    this_message_id,
 					    parents, refs);
 
-    in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to");
-    in_reply_to_message_id = parse_references (message,
-					       this_message_id,
-					       parents, in_reply_to);
-
     /* For the parent of this message, use the last message ID of the
      * References header, if available.  If not, fall back to the
-     * first message ID in the In-Reply-To header. */
+     * first message ID in the In-Reply-To header. We only parse the
+     * In-Reply-To header if we need to as otherwise we might
+     * contanimate the hash table if it is malformed. */
     if (last_ref_message_id) {
         _notmuch_message_add_term (message, "replyto",
                                    last_ref_message_id);
-    } else if (in_reply_to_message_id) {
-	_notmuch_message_add_term (message, "replyto",
-			     in_reply_to_message_id);
+    } else {
+	in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to");
+	in_reply_to_message_id = parse_references (message,
+						   this_message_id,
+						   parents, in_reply_to);
+	if (in_reply_to_message_id) {
+	    _notmuch_message_add_term (message, "replyto",
+				       in_reply_to_message_id);
+	}
     }
 
     keys = g_hash_table_get_keys (parents);
-- 
1.7.10.4





Thread: