Re: [PATCH] Add configurable changed tag to messages that have been changed on disk

Subject: Re: [PATCH] Add configurable changed tag to messages that have been changed on disk

Date: Wed, 23 Apr 2014 15:31:49 -0700

To: Austin Clements

Cc: notmuch


Austin Clements <amdragon@MIT.EDU> writes:

>> A middle ground might be to use the maximum of two values: 1) the
>> time-of-day at which notmuch started executing, and 2) the highest ctime
>> in the database plus 100 microseconds (leaving plenty of slop to store
>> timestamps as IEEE doubles with 52 significant bits).  Since the values
>> will be Btree-indexed, computing the max plus one will be cheap.
> This makes me curious if you've considered how to fit this in to
> Xapian.  The Xapian query syntax supports range queries over document
> "values", but within the Xapian B-tree, values are stored in docid
> order, not value order, so Xapian's range query operator is actually a
> full scan in implementation.  I assume it does this so it doesn't have
> to store both forward and inverse indexes of values.  (I spent some
> time figuring out the layout of the Xapian database and have fairly
> detailed notes if anyone's curious.)

Aside from finding the previous max time, everything else should work
identically to the date query operator and NOTMUCH_VALUE_TIMESTAMP.

Though I believe you, I'm a little surprised the values aren't indexed.
An alternative design might use terms like XCTIMExxxxxxxxxxxxxxxx where
the x's are hex digits.  But this seems a bit clunky and not using
Xapian the way it is indented to be used.

When I do a query with a giant result set ordered by date (notmuch
search --sort=oldest-first "*"), the first few results come back pretty
quickly, so I guess the full database scan is not an issue, at least for
~10^5 messages.

> This is still reasonably fast in practice because it's a sequential
> scan and only requires a few bytes per message, but it's probably not
> what you'd expect.  That said, Xapian does track per-value statistics
> that would suffice for the particular problem of monotonic time stamps
> (e.g., Database::get_value_upper_bound).

Oh, well in that case there is no issue.  That max is the only statistic
we need.  Everything that requires a full database scan, like get me all
messages whose properties have changed since time X, is something that
you can't do at all right now.  And in fact I'm already scanning all
100,000 message IDs AND diffing the results against a separate sqlite
database to detect changes in only 0.09 seconds (Linux) or 1.2 seconds
(32-bit OpenBSD).  This will only make that faster, and additionally
allow other people to do what I'm doing without writing a bunch of C++

> In principle it would be possible to use user metadata or even
> document terms to support true B-tree range scans by ctime order, but
> I don't think it's possible to express queries over this using
> Xapian's query parser.  I've written about 90% of a (new) custom query
> parser for Notmuch that would enable this, but little things like my
> looming thesis deadline have interfered with me finishing it.

Yeah, I've been avoiding the query parser and just scanning terms and
postlists directly.  Since the lack of ctime forced me to scan the whole
database anyway, I found it much faster to scan each tag's posting list
and dump the results into sqlite than to extract tag terms on a
per-document basis the way notmuch dump does.

>> Incidentally, if you are really this paranoid about time stamps, it
>> should bother you that notmuch's directory timestamps only have one
>> second granularity.
> This is historical (and, I agree, unfortunate).  But nobody's
> complained, so it hasn't been worth changing the libnotmuch interface
> to support sub-second directory mtimes.  However, notmuch new does
> correctly handle deliveries in the same second it runs.  If the
> wall-clock time when it starts is the same as the on-disk directory
> mtime, it skips updating the in-database directory mtime at the end.
> Hence, on the next run, it will still consider the directory
> out-of-date.  It's a bit of a hack, but it's a hack that would be
> necessary for supporting older file systems even if we did support
> sub-second timestamps.

Yeah, is kind of a problem me.  I currently scan the XFDIRENTRY terms
belonging to a directory only if the directory's notmuch mtime has
changed since the last time I examined Xapian's state.  I used to scan
the actual directories, which was fine, but not so useful because I
don't actually want to deal with messages that notmuch has not yet
indexed.  Conversely, if a directory has not changed since the last time
muchsync ran, but notmuch's idea of the directory has changed (because
someone ran notmuch new), then I do care about scanning for new/deleted

But couldn't notmuch fix the sub-second problem in a fully backwards
compatible way?  After all, the database is already storing these mtimes
as doubles.  Even for OSes that don't support st_mtim, notmuch could
just add 0.00001 seconds to the previous timestamp of a modified