Re: [notmuch] Mail in git

Subject: Re: [notmuch] Mail in git

Date: Tue, 16 Feb 2010 10:08:45 +0100

To: Stewart Smith, notmuch@notmuchmail.org

Cc:

From: Michal Sojka


Hi Stewart,

On Mon, 15 Feb 2010 11:29:14 +1100, Stewart Smith <stewart@flamingspork.com> wrote:
> Which goes from a 15GB Maildir to a 3.7GB git repo.

That's quite interesting ratio. I've tried a plain git add and git gc on
my mail store and the result was a repo of approximately 50% of mail
store size. Do you think that this difference might be caused by the way
you created the packs?

> 
> The algorithm of evenless.pl is basically:
> 1 get next directory entry
> 2 if is directory, recurse into it
> 3 write item to git (git hash-object -w)
> 4 add item to tree object
> 5 if number of items written = 1000
>   5.1 make pack of last 1000 items
> 6 goto 1

So it seems that you have all you mails in a single tree. How long it
takes to caculate difference of two trees (git diff-tree --name-status)?
This operation will be needed by "notmuch new" to determine which
files/blobs to index. I suppose it will be better if mail blobs are
stored in subtrees. If a subtree is not changed git doesn't need to
descend to it because it has the same sha1.

I think that storing mails in a similar structure as in .git/objects
(i.e. 256 subdirectories based on the first sha1 byte and file names
based on the last 39 sha1 bytes) would be reasonable.

> Next step?
> 
> Make notmuch be able to read mail out of it and add it to an index
> (oh, and some kind of verification and error checking about creating
> the git repo).

Besides using git to compact the size of mail store, another feature that
cames with git for free is synchronization. For this to work, you only
need to store tags in the repo. What might work is to store tags in
files named <mail-name>.tags. The tags would be stored in the files
alphabetically, one tag per line. I guess, that this way makes it easy
to merge tags during synchronization even without writing custom git
merge driver.

Onother point that must be solved if we would like to use git with
notmuch is the license problem. As it was pointed out by Carl in another
thread, Git is licensed under GPLv2 only whereas notmuch under GPLv3 and
these licences are incompatible. So I think we will need some kind of
hooks in notmuch from which external programs (git) will be called.

Cheers,
 Michal

Thread: