Re: [notmuch] Mail in git

Subject: Re: [notmuch] Mail in git

Date: Wed, 17 Feb 2010 10:03:36 -0500

To: notmuch

Cc:

From: Ben Gamari


Excerpts from martin f krafft's message of Tue Feb 16 20:21:01 -0500 2010:
> What I am wondering is if (explicit) tags couldn't be represented as
> tree-objects with this.
> 
>   evenless-link   — link a message object with a tree object
>   evenless–unlink – unlink a message object from tree object
>     [replaces evenless-unlink]

I was actually wondering this very thing. I'd just be worried about tags
with large numbers of messages (presumably we would need an All tag,
that would contain a reference to every known message). It seems like
the simple act of adding a message to the repository could turn into an
extremely expensive operation.

Moreover, deleting a message could also be quite expensive as this will
require rewriting all of the tags that reference it. Surely, we would
need to batch these sort of operations to avoid disasterous performance.

However, even with batching, it seems we would face some pretty serious
scalability issues. I think if we were to implement tag storage in
trees, we'd need to use a multi-level tree. This way we could avoid
rewriting a tree object containing all of the tag's messages on every
change. I apologize if this was already obvious to everyone but me.

> 
> messages would then be deleted whenever using git-gc.
> 
> No idea how this would sync if we don't keep ancestry. Otoh, it
> would probably not be very expensive to do just that.

I think that keeping the ancestry would be quite important and would
come with relatively low overhead given the correct dereferencing of
data structures.

> 
> notmuch would then only search and provide the hash ID(s); tags
> would be a function of storage.
> 
> Is it possible to find out all trees that reference a given object
> with Git in constant or sub-linear time?
> 
I don't believe so. I think this is one of the reasons why git gc is so
expensive.

- Ben

Thread: