Re: [RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag

Subject: Re: [RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag

Date: Sat, 28 Jan 2012 10:51:16 +0000

To: Austin Clements

Cc: notmuch@notmuchmail.org

From: Mark Walters


> >  	exclude_query = _notmuch_exclude_tags (query, final_query);
> >  
> > -	final_query = Xapian::Query (Xapian::Query::OP_AND_NOT,
> > -					 final_query, exclude_query);
> > +	enquire.set_weighting_scheme (Xapian::BoolWeight());
> > +	enquire.set_query (exclude_query);
> > +
> > +	mset = enquire.get_mset (0, notmuch->xapian_db->get_doccount ());
> > +
> > +	GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned int));
> > +
> > +	for (iterator = mset.begin (); iterator != mset.end (); iterator++)
> > +	{
> > +	    unsigned int doc_id = *iterator;
> > +	    g_array_append_val (excluded_doc_ids, doc_id);
> > +	}
> > +	messages->base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set);
> > +	_notmuch_doc_id_set_init (query, messages->base.excluded_doc_ids,
> > +				  excluded_doc_ids);
> 
> This might be inefficient for message-only queries, since it will
> fetch *all* excluded docids.  This highlights a basic difference
> between message and thread search: thread search can return messages
> that don't match the original query and hence needs to know all
> potentially excluded messages, while message search can only return
> messages that match the original query.

I now have some benchmarks (not run enough times to be hugely accurate
so ignore minor differences). The full results are below. The summary
is:

Large-archive = 1 100 000 messages in 290 000 threads (about 10 years of
lkml). I mark 1 000 000 deleted
Small-archive = 70 000 messages in 35 000 threads. 10 000 marked
deleted.

Doing the initial exclude work on the big collection takes about 0.8s
and on the small collection about 0.01s. So any query to the big
collection takes at least 0.8s longer and this all occurs before any
results appear.

I then implemented the exclude doing it once for each thread query in
_notmuch_create_thread. Roughly this made any query 50% slower.

In normal front end use even the 0.8s is not totally unusable, but it is
totally unacceptable in the backend where a user might do something like

for i in ` notmuch search --output=threads  from:xxx ` ; 
do 
   notmuch search --output=messages $i; 
done

to list all messages in all matching threads.

So I think my conclusions are:

(1) message only queries must be done without the full exclude.
(2) thread queries which only match one message should not do the full
exclude
(3) it would be nice to switch between the two approaches depending on
size but I don't see how to do that without extra(!) queries
(4) One possible might be do something that say does thirty threads with
the by thread method and then if not finished does the full exclude.
(5) thread-by-thread might be best for  Jani's limit-match 
id:"1327692900-22926-1-git-send-email-jani@nikula.org" 

Obviously, anything setting an exclude flag like this will be slower
(since it is doing more work): the question is are either of these (or a
combination like (4) above) acceptable?

I now have a mostly working implementation from library to
emacs frontend and I do like the overall outcome.

The complete benchmarks are below

Best wishes

Mark

LARGE COLLECTION is 1,100,000 messages 290,000 threads 1,000,000 deleted
SMALL COLLECTION is 70,000 messages in 35,000 threads 10,000 deleted

benchmarks: all times in seconds, x/y/z means a query which matches x
threads with y matching messages and z messages in total. Ig or ignore
means with the tag-exclude turned off (i.e. with a query matching the
excluded tag). list all messages is the time for the for loop listed
above giving all message-ids for all messages in any thread matching a
query.

Finally the three columns are master with exclude code disabled,
thread-thread is doing excludes once per thread construction, and
in-advance does all the exclude work in advance as in the patches I posted.

In most cases the benchmark is the average of a lot of runs so the
database should have been as cached as one could hope.

			master-(all)	thread-thread	in-advance
LARGE COLLECTION		   			
show single message	0.016		0.018		0.78
search single message	0.015		0.016		0.78
search single with tag	0.015		0.015		0.009
945/2627/20000
query ignore		2.9		n/a		3
query 			2.9		4.2		3.8
list all messages (ig)	13		n/a		13
list all messages 	13		14		12mins
4754/13000/110000
query ignore		15.9		n/a		17
query 			15.9		22		17.6
only messages		1.25		1.26		1.9
177/483/1752		
query			0.3		0.42		1.1

search '*'		20mins		28mins		21.5mins			

SMALL COLLECTION
1500/2800/5600
query			1.8		2.7		2
list all messages	14.5		16.4		30
single message		0.008		0.008		0.018

search '*'		28		49		32

Thread: