On Sat, 21 Nov 2009 17:36:18 -0500, Brett Viren <brett.viren@gmail.com> wrote:
> Processed 130871 total files in 38m 7s (57 files/sec.).
> Added 102723 new messages to the database (not much, really).

Just be glad that you have so little mail. ;-)

> This was ~2GB of mail on a 2.5GHz CPU. That seems pretty reasonable
> to me but I'd like to rerun the "notmuch new" under google perftools
> to see if there are any obvious bottlenecks that might be cleaned up.

To me, here are the obvious things to fix after looking at a profile:

  1. We're spending a *lot* of time searching in the Xapian database.

     But our initial indexing operation should only be *writing* data
     into the database, so what's this searching about? Well, for each
     new message, we look up the ID from its In-Reply-To header to
     find a thread ID to link to, and then we look up all of the IDs
     from its References header to find thread IDs that need to be
     merged with ours. So we do both parent and child lookups.

     Since those lookups are taking a bunch of time, I think it might
     make sense to just keep a hashtable mapping message-ID ->
     thread-ID and do the lookups in that (we should have plenty of
     memory on current machines, even with lots of mail).

  2. We're hitting the slow Xapian document updates for thread-ID
     merging.

     Whenever we find a child that is already in the database under a
     different thread ID but belongs in our thread, we simply want to
     set its thread ID to ours. But as we've talked about recently,
     Xapian has a bug (defect 250) that makes it much more expensive
     than it should be to update a single term.

     So we could do a first pass over the messages to find all their
     thread IDs and let them settle down before doing any indexing in
     a separate, second pass.

Step (2) should help even if we don't do step (1), but clearly we can
do both.

It would be great if anyone wants to take a look at either or both of
these, otherwise I will when I can.

-Carl
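
A rough sketch of how items (1) and (2) above might fit together, written in Python for brevity and not taken from the notmuch source. Everything named here is hypothetical: resolve_threads, the (message_id, in_reply_to, references) tuples, and the integer thread IDs are assumptions made for illustration. The merging uses a union-find structure kept in an ordinary dict, which is one possible way to let thread IDs "settle down" in memory; the mail above does not prescribe it.

```python
def resolve_threads(messages):
    """First pass: settle thread IDs in memory, with no database access.

    `messages` is an iterable of (message_id, in_reply_to, references)
    tuples, where in_reply_to may be None and references is a list of
    message IDs. Returns a dict mapping message ID -> thread ID.
    """
    parent = {}  # union-find forest: message ID -> parent message ID

    def find(msg_id):
        # Return the representative ID of the thread containing msg_id.
        parent.setdefault(msg_id, msg_id)
        while parent[msg_id] != msg_id:
            parent[msg_id] = parent[parent[msg_id]]  # path halving
            msg_id = parent[msg_id]
        return msg_id

    def union(a, b):
        # Merge the threads containing a and b.
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for message_id, in_reply_to, references in messages:
        find(message_id)  # register the message even if it links to nothing
        if in_reply_to:
            union(message_id, in_reply_to)
        for ref in references:
            union(message_id, ref)

    # Every merge has happened, so each ID can now get its final thread ID.
    # IDs that were only referenced (never seen as a message) are included.
    thread_ids = {}  # representative message ID -> thread ID
    thread_of = {}   # message ID -> thread ID
    for message_id in list(parent):
        root = find(message_id)
        thread_of[message_id] = thread_ids.setdefault(root, len(thread_ids))
    return thread_of
```

A second pass would then index each message once with its final thread ID from the returned table, so no document already written to the database needs its thread ID term changed afterwards.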