Re: [PATCH] Add configurable changed tag to messages that have been changed on disk

Subject: Re: [PATCH] Add configurable changed tag to messages that have been changed on disk

Date: Wed, 23 Apr 2014 16:59:20 -0400

To: David Mazieres expires 2014-07-22 PDT

Cc: notmuch

From: Austin Clements

Hi Dave!

Quoth David Mazieres on Apr 23 at  2:00 am:
> Gaute Hope <> writes:
> > A db-tick or a _good_ ctime solution can as far as I can see solve both
> > David M's (correct me if I am wrong) and my purposes, as well as
> > probably have more use cases in the future. It would even be an
> > interesting direct search: show me everything that changed lately,
> > sorted.
> I could live with a db-tick scheme.  I would prefer a ctime scheme,
> since then I can answer questions such as "what has changed in the last
> five minutes"?  I mean all kinds of other stuff starts to break if your
> clock goes backwards on a mail server machine, not the least of which is
> that incremental backups will fail silently, so you risk losing your
> mail.
> A middle ground might be to use the maximum of two values: 1) the
> time-of-day at which notmuch started executing, and 2) the highest ctime
> in the database plus 100 microseconds (leaving plenty of slop to store
> timestamps as IEEE doubles with 52 significant bits).  Since the values
> will be Btree-indexed, computing the max plus one will be cheap.

This makes me curious if you've considered how to fit this in to
Xapian.  The Xapian query syntax supports range queries over document
"values", but within the Xapian B-tree, values are stored in docid
order, not value order, so Xapian's range query operator is actually a
full scan in implementation.  I assume it does this so it doesn't have
to store both forward and inverse indexes of values.  (I spent some
time figuring out the layout of the Xapian database and have fairly
detailed notes if anyone's curious.)

This is still reasonably fast in practice because it's a sequential
scan and only requires a few bytes per message, but it's probably not
what you'd expect.  That said, Xapian does track per-value statistics
that would suffice for the particular problem of monotonic time stamps
(e.g., Database::get_value_upper_bound).

In principle it would be possible to use user metadata or even
document terms to support true B-tree range scans by ctime order, but
I don't think it's possible to express queries over this using
Xapian's query parser.  I've written about 90% of a (new) custom query
parser for Notmuch that would enable this, but little things like my
looming thesis deadline have interfered with me finishing it.

> Incidentally, if you are really this paranoid about time stamps, it
> should bother you that notmuch's directory timestamps only have one
> second granularity.  It's not that hard to get a new message delivered
> in the same second that notmuch new finished running.  In my
> synchronizer, I convert st_mtim (a struct timespec) into a double and
> keep that plus size in the database to decide if I need to re-hash
> files.  But for directories, I'm stuck with NOTMUCH_VALUE_TIMESTAMP,
> which are quantized to the second.  (Ironically, I think
> Xapian::sortable_serialize converts time_ts to doubles anyway, so
> avoiding st_mtim is not really helping performance.)

This is historical (and, I agree, unfortunate).  But nobody's
complained, so it hasn't been worth changing the libnotmuch interface
to support sub-second directory mtimes.  However, notmuch new does
correctly handle deliveries in the same second it runs.  If the
wall-clock time when it starts is the same as the on-disk directory
mtime, it skips updating the in-database directory mtime at the end.
Hence, on the next run, it will still consider the directory
out-of-date.  It's a bit of a hack, but it's a hack that would be
necessary for supporting older file systems even if we did support
sub-second timestamps.