Re: [notmuch] 25 minutes load time with emacs -f notmuch

Subject: Re: [notmuch] 25 minutes load time with emacs -f notmuch

Date: Sat, 21 Nov 2009 18:07:10 +0100

To: Stefan Schmidt, notmuch@notmuchmail.org

Cc:

From: Carl Worth


On Sat, 21 Nov 2009 15:51:11 +0100, Stefan Schmidt <stefan@datenfreihafen.org> wrote:
> Disclaimer: I'm using vim, in combination with mutt for email, for years, but
> never dealt with emacs. Please have this in mind and spot any emacs user errors
> in this report. :)

Hi Stefan, welcome to Notmuch! And don't worry, we don't discriminate
(too much) against non-emacs users around here.

> I have first seen notmuch several weeks ago as it seems a silent project. Being
> more than happy now that it evolves quickly and a real developer community
> builds around it.

Yes. Notmuch was a silent project since it was just something that I was
doing for myself. I was always writing it as free software, and even had
a public git repository available, but hadn't advertised it at all yet.

And Keith did rather catch me off guard by announcing it. But I can't
complain as we have gotten a nice community started already, and it's
great to have other people writing the code that I intended to
write. :-)

But it's also true that some obvious problems just aren't taken care of
yet.

> But now to my problem. Getting my mail indexed was easy enough:
> 
> stefan@excalibur:~$ du -chs not-much-mail/
> 1.5G    not-much-mail/
> 1.5G    total
> stefan@excalibur:~$ time notmuch new
> Found 103677 total files.
> Processed 103677 total files in 42m 30s (40 files/sec.).
> Added 100899 new messages to the database (not much, really).

Good. I'm glad that went fairly smoothly for you.

Though, frankly, I think we need to fix "notmuch new" to do much better
than 40 files/sec. One plan I have for this is to not use the database
to search for message IDs when adding many messages---but to instead
just use a hash-table (seeded from any messages already in the
database). This would allow us to do all thread resolution before
indexing messages, without having to do the N different searches, and
also means we'd avoid continually rewriting documents when merging
thread IDs.
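
A rough sketch of that hash-table idea, in Python rather than C for
brevity. Everything here is an assumption for illustration: the function
name, the shape of the input, and the use of a union-find structure are
not taken from the notmuch source. It only shows how threads could be
resolved in memory, in one pass, before anything is indexed.

```python
# Sketch only: resolve threads in memory before indexing anything.
# All names here are hypothetical; this is not notmuch's actual code.

def resolve_threads(messages, known=None):
    """messages: iterable of (message_id, [referenced_ids]).
    known: optional dict of message_id -> thread_id seeded from the database.
    Returns a dict mapping every message ID seen to a thread representative."""
    parent = {}  # union-find forest keyed by message ID

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Seed: messages already in the database that share a thread ID are linked.
    first_in_thread = {}
    for mid, tid in (known or {}).items():
        union(first_in_thread.setdefault(tid, mid), mid)

    # One pass over the new messages; no database searches are needed.
    for mid, refs in messages:
        find(mid)
        for ref in refs:
            union(mid, ref)

    return {mid: find(mid) for mid in parent}
```

With the mapping computed up front, each message could be written once
with its final thread ID instead of being rewritten on every merge.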

> I put (require 'notmuch) in my ~/.emacs and start emacs with the -f notmuch
> option to enter the notmuch mode.

I'm glad you've figured that much out. I feel bad that that's not even
in the documentation anywhere yet.

> What happens then is that a notmuch process gets started and emacs
> waits for the return.

OK. This is a known shortcoming. As Bdale supposes, this problem is from
notmuch trying to load and construct every thread in your
database. There are actually several different bugs/missing features
here that should be addressed:

  * "notmuch new" should look at the S (seen) flag in maildir filenames
    to determine that they are read and do not need to be marked as
    "inbox" and "unread"

  * "notmuch setup" should prompt for some date range, ("last 2 months"
    by default?) before which no messages will be considered unread.

Either of those two fixes would have prevented your particular
problem. But it's still easy to generate searches that return large
numbers of results. So there's some more to do:

  * The emacs code needs to call "notmuch search" with the --first and
    --max-threads options to get a limited set of results, (one or two
    screenfuls). You should be able to test this at the command line and
    see that it returns results quickly. Then, of course, we'd like the
    emacs code to fill in subsequent screenfuls as you page.
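
To make the paging idea concrete, here is a small sketch of how a front
end might build the command for each screenful. The function name and the
page size are made up for the example, and it assumes the options are
spelled --first=N and --max-threads=N and come before the search terms.

```python
# Sketch only: build the argument list for one page of search results.
# search_command and threads_per_page are hypothetical names.

def search_command(query, page, threads_per_page=50):
    """Return argv for the given page (page numbering starts at 0)."""
    first = page * threads_per_page  # index of the first thread on this page
    return ["notmuch", "search",
            "--first=%d" % first,
            "--max-threads=%d" % threads_per_page,
            query]
```

The caller would run the page-0 command first, display it, and only run
the page-1 command once the user scrolls past the end of the first batch.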

But none of that helps you right now. What you need is to retroactively
remove all of the "inbox" and "unread" tags from messages older than
some time period. So then there's another missing feature:

  * We need to support date-range-based searches. If we had that you
    could just do:

	notmuch tag -inbox -unread until:"2 months ago"

    But we don't quite have this yet. Xapian does have support for a
    slightly less convenient date range specification:

	1970-01-01..2009-09-21

    but it turns out that we can't even use that just yet, since to make
    that work we would have to have dates saved as YYYYMMDD strings for
    each message, (where instead we have time_t values stored serialized
    into a string that will sort correctly). So we need a new
    ValueRangeProcessor class to map to timestamps, and then we'll need
    some fancy parsing to do things like "2 months ago".
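
As an illustration of the mapping such a range processor would have to
do, here is a sketch in Python. It is not Xapian's API and not the
encoding notmuch actually uses: parse_range and sortable are hypothetical
helpers, dates are treated as UTC, and the fixed-width hex string is just
one simple way to get strings that sort in timestamp order.

```python
# Sketch only: map a date range onto timestamps, and show one way to
# serialize timestamps so that string order matches numeric order.
import calendar
from datetime import datetime, timedelta

def parse_range(spec):
    """Turn 'YYYY-MM-DD..YYYY-MM-DD' into (start, end) Unix timestamps (UTC).
    The end date is inclusive, so it maps to the last second of that day."""
    lo, hi = spec.split("..")
    start = datetime.strptime(lo, "%Y-%m-%d")
    end = datetime.strptime(hi, "%Y-%m-%d") + timedelta(days=1)
    return (calendar.timegm(start.timetuple()),
            calendar.timegm(end.timetuple()) - 1)

def sortable(timestamp):
    """Fixed-width hex: string order equals numeric order for 0 <= t < 2**64."""
    return "%016x" % timestamp
```

A range query then becomes a plain string comparison between
sortable(start) and sortable(end). Parsing something like "2 months ago"
would be an extra layer on top that produces the same pair of timestamps.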

So, what's the best thing to do today if you want to start playing with
notmuch? I think you could pick one of the above to work on, (a quick
hack to "notmuch new" and a re-import might do the trick). Or you might
just remove the inbox and unread tags from all messages and then just
let messages that are actually *new* in the future get tagged into the
inbox by "notmuch new". Oh, but then there's another missing feature:

  * We need a syntax to specify a search string that should match all
    messages. Then you could do:

	notmuch tag -inbox -unread <whatever-magic-we-came-up-with>

Yikes! So many bugs and missing features. How is anyone actually using
this system? Well, Keith and I were able to get past all this by simply
doing a "notmuch restore" based on tags we got from sup-dump. So here
is another attempt:

  1. Run "notmuch dump <some-file>" to get the list of message IDs, (all
     with their "inbox" and "unread" tags).

  2. Edit that file to remove the tags you want.

  3. Run "notmuch restore <some-file>" to cause the tags to be removed.
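
Step 2 could be scripted along these lines. This is a sketch that
assumes each line of the dump file looks like "message-id (tag1 tag2)";
check your own dump before trusting that. The strip_tags name is made up
for the example, and it removes the tags from every message, since the
file carries no dates to filter on.

```python
# Sketch only: drop unwanted tags from every line of a dump file.
# Assumes lines of the form "message-id (tag1 tag2)"; strip_tags is a
# hypothetical helper, not part of notmuch.
import re

LINE = re.compile(r"^(\S+) \(([^)]*)\)$")

def strip_tags(lines, unwanted=("inbox", "unread")):
    out = []
    for line in lines:
        text = line.rstrip("\n")
        m = LINE.match(text)
        if not m:
            out.append(text)  # leave anything unrecognized untouched
            continue
        tags = [t for t in m.group(2).split() if t not in unwanted]
        out.append("%s (%s)" % (m.group(1), " ".join(tags)))
    return out
```

Reading the dump with open(), passing the lines through strip_tags, and
writing the result to a new file gives something to feed to "notmuch
restore".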

But, (*sigh*), that's not good either, because "notmuch dump" is
currently hard-coded to dump messages in message-ID order rather than
date order, (so you can't easily do something like "just remove the tags
from messages older than two months").

So, there's sadly no easy way to get what you want with the tools in
their current form. I guess that's the pain that you get for being an
early adopter. :-}

But if hacking a little C code doesn't scare you away, a lot of the
things listed above are actually really easy to fix. (Like, fixing
"notmuch dump" to just run in date order is a one-line change. Adding a
--sort command-line option to it wouldn't be much harder, etc.)

So hopefully the above serves as a nice TODO list.

Thanks everyone for your interest in this software even in its current,
can-be-painful-to-use state.

-Carl

PS. Expect the mass-re-tag operations to be about as slow as the
original "notmuch new" import of the messages. That's a known bug in
Xapian that's one of the highest priority things that I'd like to fix,
(along with all of the above and all the other things I want to do...)

At least we're not running out of things to work on here.
