Re: Notmuch DB Problems

Subject: Re: Notmuch DB Problems

Date: Mon, 10 Sep 2018 08:01:06 -0300

To: Mueen Nawaz,


From: David Bremner

Mueen Nawaz <> writes:

> After a lot of poking around, I figured out the problem, and this may be
> of interest to the developers (although not sure if it is a xapian issue
> or a notmuch issue).
> Here's why it would freeze:
> I have a post-new hook that runs a Python script. Depending on whether
> the new email it is processing matches a rule I have, it will fire off
> an email to the sender using the SMTP library in Python.
> I had recently upgraded my MTA (Postfix), and it had a backward
> incompatible change that broke my config. I don't know why, but I could
> still send emails via Emacs, but when I tried to send them via Python,
> Postfix would log an error and it would not send. The Python statement
> would freeze (I guess Postfix doesn't return an appropriate response?
> Not sure why). 
> I have a cron job to run "notmuch new" 3 times an hour. Since the hook
> was frozen, so was the notmuch new command. I had quite a lot of
> "notmuch new" processes. I assume this meant the DB was locked all this
> time for writing.
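
On the hung Python statement: smtplib.SMTP accepts a timeout argument that
applies to its blocking socket operations, so a server that accepts the
connection but never answers produces an exception rather than a hang.
A minimal sketch follows, not the script from the original report. The
function name send_reply, its parameters and the 30 second default are
all assumptions made for illustration.

```python
import smtplib
from email.message import EmailMessage


def send_reply(host, port, sender, recipient, subject, body, timeout=30.0):
    """Send one message; return True on success, False on any failure.

    Hypothetical helper: the name, the parameters and the default timeout
    are illustrative assumptions.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    try:
        # The timeout covers the connection attempt and later blocking
        # reads and writes on the same socket.
        with smtplib.SMTP(host, port, timeout=timeout) as smtp:
            smtp.send_message(msg)
    except (OSError, smtplib.SMTPException):
        # Covers refused connections, timeouts and SMTP-level errors.
        return False
    return True
```

Returning False instead of raising lets the hook log the failure and exit,
so "notmuch new" can finish.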

notmuch unlocks the database before running the hook, so I don't
understand how a hung hook results in a locked database. If it happens
again (or you're motivated to set up a testbed) I'd be interested in the
output of

           lsof ~/Maildir/.notmuch/xapian/flintlock

Also, is this by chance a network file system? Because those often
break locking.

> Now killing all those jobs did not fix the database. It was still
> broken. And as we saw the second time round, it was /really/ broken - it
> would not even open in read-only mode.

That seems like something the Xapian devs (in copy) might be interested
in fixing, if you could come up with a simple reproducer.

> It is scary that if a post-new hook freezes while the database is
> locked, it could (eventually) clobber the database. I don't know if
> notmuch can do anything to prevent this outcome?

notmuch could be cleverer about timing out on trying to acquire a
lock. I suspect it's a bit delicate to get that right, and I've been
hoping the underlying primitives would get a bit more flexible
w.r.t. locking.
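
To make "timing out on trying to acquire a lock" concrete, here is a
minimal Python sketch of the general technique: poll a non-blocking
fcntl lock until a deadline passes. It is not notmuch or Xapian code and
makes no claim about how either takes its lock. The function name, the
default timeout and the polling interval are assumptions for
illustration.

```python
import errno
import fcntl
import os
import time


def acquire_lock_with_timeout(path, timeout=5.0, poll_interval=0.1):
    """Try to take an exclusive fcntl lock on `path`.

    Returns the open file descriptor on success (closing it releases the
    lock), or None if the lock could not be taken within `timeout` seconds.
    Hypothetical helper, for illustration only.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    deadline = time.monotonic() + timeout
    while True:
        try:
            # LOCK_NB makes the call fail immediately instead of blocking.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except OSError as exc:
            # EACCES or EAGAIN means another process holds the lock.
            if exc.errno not in (errno.EACCES, errno.EAGAIN):
                os.close(fd)
                raise
        if time.monotonic() >= deadline:
            os.close(fd)
            return None
        time.sleep(poll_interval)
```

A caller that gets None back can report the failure and exit instead of
queueing up behind a stuck process.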

We could also potentially run hooks in the equivalent of "timeout", but
I don't know how much code that would be.  A simpler option (once we
understand what the real problem is) would be to suggest that users use
timeout themselves in hooks to be run unattended.
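
For a hook that is itself driven from Python, the same effect as
timeout(1) can be had with subprocess.run and its timeout argument. This
is a minimal sketch, not something notmuch does today; the function name
and the return convention are assumptions for illustration.

```python
import subprocess


def run_hook_with_timeout(argv, timeout_seconds):
    """Run a command; return its exit status, or None if it timed out.

    Hypothetical wrapper, for illustration only.
    """
    try:
        completed = subprocess.run(argv, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before
        # re-raising, so no direct child is left behind here.
        return None
    return completed.returncode
```

Only the direct child is killed on timeout; a hook that spawns its own
long-lived children would need extra care, for example a process group.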
