Re: [PATCH] Add configurable changed tag to messages that have been changed on disk

Subject: Re: [PATCH] Add configurable changed tag to messages that have been changed on disk

Date: Sun, 06 Apr 2014 22:19:19 +0200

To: Gaute Hope, notmuch@notmuchmail.org

Cc:

From: David Mazieres


Gaute Hope <eg@gaute.vetsj.com> writes:

> When one of the source files for a message is changed on disk, renamed,
> deleted or a new source file is added, a configurable changed tag is
> added. The tag can be configured under the option 'changed_tags' in
> the [new] section, the default is none. Tests have been updated to
> accept the new config option.
>
> notmuch-setup now asks for a changed tag after the new tags question.
>
> This could be useful, for example, for 'afew' to detect remote changes in
> IMAP folders and update the FolderNameFilter to also add tags or remove
> tags when an _existing_ message has been added to or removed from a
> maildir.

I think this is the wrong way to achieve such functionality, because
then the change tag A) is expensive to remove, B) is easy to misuse
(remember to call fsync everywhere before deleting the change tag), and
C) can be used by only one application.

A better approach would be to add a new "modtime" Xapian value that is
updated whenever the tags or any other terms (such as XFDIRENTRY) are
added to or deleted from a docid.  If it's a Xapian value, rather than a
term, then modtime will be queryable just like date, allowing multiple
applications to query all docids modified since the last time they ran.
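
To make the idea concrete, here is a minimal sketch in Python.  It is a
toy in-memory model, not notmuch or Xapian code: the class TagIndex and
its methods are hypothetical names, and a logical counter stands in for
a real timestamp so the behaviour is deterministic.  It only shows that
stamping a per-docid value on every tag change lets several independent
consumers each ask "what changed since my own checkpoint?".

```python
import itertools


class TagIndex:
    """Toy stand-in for a mail index (hypothetical, not the notmuch API)."""

    def __init__(self):
        # Logical clock; a real modtime could be a timestamp instead.
        self._clock = itertools.count(1)
        self.tags = {}      # docid -> set of tags
        self.modtime = {}   # docid -> value stamped at the last change

    def _touch(self, docid):
        self.modtime[docid] = next(self._clock)

    def add_tag(self, docid, tag):
        tags = self.tags.setdefault(docid, set())
        if tag not in tags:          # only real changes bump the modtime
            tags.add(tag)
            self._touch(docid)

    def remove_tag(self, docid, tag):
        tags = self.tags.get(docid, set())
        if tag in tags:
            tags.remove(tag)
            self._touch(docid)

    def now(self):
        """Highest modtime handed out so far (0 for an empty index)."""
        return max(self.modtime.values(), default=0)

    def modified_since(self, checkpoint):
        """Docids changed after `checkpoint`, oldest change first.

        This toy scans every docid; the point of storing modtime as a
        value is that a real index could answer this with a range query.
        """
        changed = [d for d, m in self.modtime.items() if m > checkpoint]
        return sorted(changed, key=self.modtime.get)


index = TagIndex()
index.add_tag("msg-a", "inbox")
index.add_tag("msg-b", "inbox")
backup_checkpoint = index.now()      # a backup tool ran here
index.add_tag("msg-a", "flagged")
sync_checkpoint = index.now()        # a sync tool ran here
index.remove_tag("msg-b", "inbox")

print(index.modified_since(backup_checkpoint))
print(index.modified_since(sync_checkpoint))
```

Each consumer keeps nothing but its own checkpoint, so nothing has to
be written back to the index after a consumer has run.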

I currently have multiple applications that could significantly benefit
from such a modtime.  An obvious one is proper incremental backups with
notmuch-dump.
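
As a rough illustration of what an incremental backup could look like,
here is a small Python sketch.  Everything in it is an assumption made
for the example: the function incremental_dump, the shape of the
messages dictionary, and the output line layout are hypothetical and
are not notmuch-dump's actual interface or format.

```python
def incremental_dump(messages, last_seen):
    """Return (lines, new_checkpoint) for entries changed after last_seen.

    messages: dict mapping message id -> (modtime, set of tags).
    last_seen: modtime checkpoint saved by the previous backup run.
    """
    changed = [
        (mtime, msgid, tags)
        for msgid, (mtime, tags) in messages.items()
        if mtime > last_seen
    ]
    # Oldest change first; message id breaks ties deterministically.
    changed.sort(key=lambda entry: (entry[0], entry[1]))

    # Illustrative line layout only: "<message id> (<tags>)".
    lines = [
        "{} ({})".format(msgid, " ".join(sorted(tags)))
        for _, msgid, tags in changed
    ]
    new_checkpoint = max((m for m, _, _ in changed), default=last_seen)
    return lines, new_checkpoint


messages = {
    "a@example.org": (100, {"inbox", "unread"}),
    "b@example.org": (250, {"archive"}),
    "c@example.org": (300, {"inbox", "flagged"}),
}

lines, checkpoint = incremental_dump(messages, last_seen=200)
for line in lines:
    print(line)

# A second run with the saved checkpoint has nothing left to dump.
later_lines, later_checkpoint = incremental_dump(messages, checkpoint)
```

The backup tool only has to persist one number between runs, and a run
with no changes produces no output at all.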

Another example is a tool I have that synchronizes maildirs and notmuch
tags across machines.  With the current interface, there is no way to do
this without scanning the entire database, because any message, even a
very old one, may have changed tags or links.  Moreover, something like
notmuch-dump is way, way too slow to run every time you want to check
for new mail.  notmuch-dump costs 5-10 seconds on my 110,000-message
maildir!  In fact, any approach that gathers tags associated with each
individual docid is a complete non-starter, forcing me to violate
abstraction and examine the postlists associated with each tag and
XFDIRENTRY term.  Even my highly optimized implementation takes about
250 msec (1400 msec on a 32-bit machine), which adds perceptible latency
to synchronizing my clients' notmuch maildirs with my server's when I
poll for new mail.

Yet another application is something like nottoomuch-addresses, which
currently uses an occasionally incorrect heuristic to detect new
messages based on the Date header.

Let me make a stronger statement, which is that not only are
modification times an incredibly useful and general primitive, but lack
of modification times is the single thing that kept me away from notmuch
despite years of wanting to switch.  In the end, I invested months
developing a highly optimized change detector that efficiently diffs
Xapian's Btrees against a mysql database with a snapshot of the same
information.  My solution works, and I now enjoy a replicated notmuch
setup synchronized across three machines, including offline access on my
laptop.  But my 4,000-line C++ program might have been a 400-line shell
script if only notmuch supported docid mod times.

Also, to put this in perspective, how long does it take to remove the
changed tags from a bunch of messages?  If it's longer than 300 msec on
a 64-bit machine, then even with a single application you'd be better
off using my crazy on-the-side mysql version vector scheme.

David
