Subject: Re: Proof of concept for counting messages in thread

Date: Sun, 19 Feb 2023 09:04:42 -0400

To: Michael J Gruber

Cc: notmuch@notmuchmail.org

From: David Bremner


Michael J Gruber <michaeljgruber+grubix+git@gmail.com> writes:

>
> Yes, the extra ones all are ghosts, and I slowly remember that they
> scared me in the past already ...
>
> These ghosts appear to be pretty common. It happens all the time that
> I am joined to an existing discussion thread where I do not have all
> references.

About 8% of my 730k messages are ghosts. I don't think I have any
situation as extreme as yours, with hundreds of ghost messages for a
small number of actual messages in a thread.

If you would like to calculate the ratio for your mail store, you can run

% xapian-delve -v -A Tghost ~/.local/share/notmuch/default/xapian
% xapian-delve -v -A Tmail ~/.local/share/notmuch/default/xapian
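
If you prefer to get the percentage directly, here is a minimal sketch
using the Xapian Python bindings. It assumes the bindings are installed
and that the term frequencies of Tghost and Tmail are the numbers of
ghost and mail documents. The function names are made up for
illustration, and taking ghost + mail as the denominator is my
assumption about what "ratio" should mean here.

```python
def ghost_percentage(ghost_count, mail_count):
    """Return ghosts as a percentage of all message documents (ghost + mail)."""
    total = ghost_count + mail_count
    if total == 0:
        return 0.0
    return 100.0 * ghost_count / total


def type_counts(db_path):
    """Read the Tghost and Tmail term frequencies from a notmuch Xapian database."""
    # Imported lazily so the arithmetic above works without the bindings.
    import xapian

    db = xapian.Database(db_path)
    return db.get_termfreq("Tghost"), db.get_termfreq("Tmail")
```

Calling ghost_percentage(*type_counts(path)) with the same database path
as in the commands above should give a figure comparable to the one I
quoted.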

> I'd go as far as to say that counting ghosts as thread
> members makes this useless for me. On the other hand, notmuch's own
> count gets this right. And getting different counts is even more
> confusing.

The count shown in e.g. notmuch search is calculated after the query has
been run, so it isn't easily usable as part of a query. Maybe there is a
way to trade off some performance for fewer false positives. In principle
we could run a query for each thread found by the current technique to
postprocess the results. I can see that getting pretty slow if there are
many results, though.
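
To make the postprocessing idea concrete, here is a rough sketch done
outside notmuch with the command line tool. It is not how the proof of
concept works, only an illustration of the one-query-per-thread cost.
The helper names are hypothetical. The input is assumed to be a list of
thread ids in the form printed by notmuch search --output=threads.

```python
import subprocess


def notmuch_lines(*args):
    """Run a notmuch subcommand and return its output lines."""
    result = subprocess.run(["notmuch", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout.splitlines()


def count_real_messages(thread_id):
    """Count the non-ghost messages in one thread, e.g. 'thread:0000000000000002'."""
    # --exclude=false so that messages with excluded tags are still counted.
    out = notmuch_lines("count", "--output=messages", "--exclude=false",
                        "--", thread_id)
    return int(out[0])


def filter_by_real_size(thread_ids, minimum, count=count_real_messages):
    """Keep only threads with at least `minimum` real messages.

    This runs one extra query per candidate thread, which is where the
    slowdown for large result sets comes from.
    """
    return [tid for tid in thread_ids if count(tid) >= minimum]
```

The count parameter is only there so the filter can be tried with a fake
counter, without a mail store.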

At least for the original motivation of looking for messages without
replies, counting ghost messages makes some sense. In general it also
makes sense for finding large threads. I ran the query
'(thread (count 200 *))' on my mail store and most matches are genuinely
large threads. A few are false positives like the one you describe. In
my case it is easy to see where the ghosts come from, as the (spam)
messages have hundreds of (presumably fictional) references.

>
>> 2) Do they have more than one G term? That suggests a bug somewhere. We
>> actually have a test in the test suite [1] for that, but of course that is
>> with a simple artificial database.
>
> No, they all have one. But their sheer number looks suspicious: those
> 5 "real" e-mails have maybe 20 reference headers in total, and some of
> them refer to some of those 5. Grepping the account store for those
> references gives me around that number. Where do the 110 ghosts (90
> extra) come from which this thread points to? Still scared by them ...
> we need ghost busters!

The only information attached to a ghost message is the thread-id and
the message-id.  You can get a visual picture of the thread with the
attached script. But that will probably just confirm what you did with
grep. To see what is in the database, you can run

% quest -btype:T -bthread:G -d mail/.notmuch/xapian "type:ghost and thread:0000000000000002"

That gives you record numbers, which you can examine with xapian-delve
-r.
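
If you have many records to look at, the same inspection can be
scripted. The following is a sketch with the Xapian Python bindings, not
something notmuch ships. It assumes the T prefix for the type and the G
prefix for the thread-id as used above, and that Q is the prefix for the
message-id. The function names are hypothetical.

```python
def summarize_terms(terms):
    """Group a document's terms by prefix: T (type), G (thread), Q (message-id)."""
    names = {"T": "type", "G": "thread", "Q": "id"}
    summary = {}
    for term in terms:
        name = names.get(term[:1])
        if name is not None:
            # Strip the one-letter prefix and collect the value.
            summary.setdefault(name, []).append(term[1:])
    return summary


def document_terms(db_path, docid):
    """Return all terms of one record (document id) as strings."""
    # Imported lazily so summarize_terms works without the bindings.
    import xapian

    db = xapian.Database(db_path)
    terms = []
    for item in db.get_document(docid).termlist():
        term = item.term
        if isinstance(term, bytes):
            term = term.decode("utf-8", "replace")
        terms.append(term)
    return terms
```

A ghost with more than one entry under "thread" in the summary would be
the multiple G term case I asked about.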



draw-thread (application/octet-stream)
