Re: Mail archives in Git using ssoma

Subject: Re: Mail archives in Git using ssoma

Date: Sun, 21 Aug 2016 21:14:55 +0000

To: W. Trevor King

Cc: notmuch@notmuchmail.org, meta@public-inbox.org

From: Eric Wong


"W. Trevor King" <wking@tremily.us> wrote:
> On Sun, Aug 21, 2016 at 06:37:04PM +0000, Eric Wong wrote:
> > Btw, for public-inbox, I'm using git-fast-import now, so imports are
> > a bit faster and $GIT_DIR/ssoma.index is no longer used.  This was
> > crucial for getting git@vger archives imported in a reasonable time.
> 
> ssoma-mda imports 22k notmuch messages in around 15 minutes (with
> profiling enabled), and:

In contrast, git@vger is around 300K messages.  LKML is well
into the millions, and I hope public-inbox (and git!) can handle
that one day, even on cheap hardware (haven't tried).

One problem I noticed with ssoma-mda is that it gets slower as
more messages get imported, since all those files sit in the
index, and the git index format is bad for incremental updates
with big, flat trees.  Big trees are a general problem with git:

    I'm now storing blob IDs directly in Xapian and will be
    using them more to avoid tree lookups.  Tree creation and
    lookups degrade the same way the index does as trees
    get bigger.

    Currently it's using a 2/38 split of the SHA-1, like git
    loose objects; a goal might be to move towards supporting
    2/2/36 (or deeper), as Jeff noted substantial object
    traversal improvements:

https://public-inbox.org/git/20160805092805.w3nwv2l6jkbuwlzf@sigill.intra.peff.net/

    Of course, support for 2/38 will be retained for old
    archives/messages.
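
To illustrate the difference between the two layouts, here is a
minimal sketch that splits a 40-character hex SHA-1 into path
components.  The function name `object_path` and its `widths`
parameter are made up for this example and are not taken from
public-inbox or ssoma.  Assuming names spread evenly, 300K messages
under 2/38 land in 256 trees of roughly 1,200 entries each, while
2/2/36 spreads them over 65,536 leaf trees of about 5 entries each.

```python
def object_path(sha1_hex, widths=(2, 38)):
    """Split a 40-char hex SHA-1 into path components of the given widths.

    Hypothetical helper for illustration only; not public-inbox code.
    """
    if len(sha1_hex) != 40 or sum(widths) != 40:
        raise ValueError("need a 40-character hex SHA-1 and widths summing to 40")
    parts = []
    pos = 0
    for width in widths:
        parts.append(sha1_hex[pos:pos + width])
        pos += width
    return "/".join(parts)


sha = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
print(object_path(sha))              # 2/38 layout
print(object_path(sha, (2, 2, 36)))  # 2/2/36 layout
```

Supporting old archives would then amount to trying the 2/38 path
when the deeper one is not found.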

>   $ python -m cProfile -o profile import.py notmuch.mbox
>   $ python -c "import pstats; p=pstats.Stats('profile'); p.sort_stats('cumulative').print_stats(10)"
>   Sun Aug 21 12:56:49 2016    profile
> 
>            101823722 function calls (99078415 primitive calls) in 885.069 seconds
> 
>      Ordered by: cumulative time
>      List reduced from 1145 to 10 due to restriction <10>
> 
>      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>        70/1    0.002    0.000  885.069  885.069 {built-in method exec}
>           1    0.111    0.111  885.069  885.069 /home/wking/src/notmuch/notmuch-archives.git/import.py:9(<module>)
>           1    0.400    0.400  884.915  884.915 /home/wking/src/notmuch/notmuch-archives.git/import.py:17(import_mbox)
>       22875    0.601    0.000  863.371    0.038 /home/wking/src/notmuch/notmuch-archives.git/ssoma_mda.py:362(deliver)
>       22875    8.943    0.000  810.459    0.035 /home/wking/src/notmuch/notmuch-archives.git/ssoma_mda.py:207(append)
>       22875    0.418    0.000  308.353    0.013 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:146(write_tree)
>       22875  307.855    0.013  307.855    0.013 {built-in method git_index_write_tree}
>       22874    0.575    0.000  279.293    0.012 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:238(diff_to_tree)
>       22874  278.501    0.012  278.501    0.012 {built-in method git_diff_tree_to_index}

It looks like writing the index is already the slowest step here
in terms of total time, too.  It might be interesting to profile
each *-mda invocation to see the degradation from the first to
the last message.
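
A rough way to measure that, as a sketch: time every delivery and
compare the mean of the first few against the mean of the last few.
The `deliver` callable here is a stand-in for whatever performs one
delivery (an assumption, not the real ssoma_mda.deliver signature),
and the helper names are invented for this example.

```python
import time


def time_deliveries(messages, deliver):
    """Return the wall-clock seconds each deliver(message) call took."""
    timings = []
    for msg in messages:
        start = time.perf_counter()
        deliver(msg)  # hypothetical per-message delivery callable
        timings.append(time.perf_counter() - start)
    return timings


def first_vs_last(timings, window=100):
    """Return (mean of first `window` timings, mean of last `window`)."""
    if not timings:
        raise ValueError("no timings recorded")
    window = max(1, min(window, len(timings)))
    first = timings[:window]
    last = timings[-window:]
    return sum(first) / len(first), sum(last) / len(last)
```

If the second number is much larger than the first, per-message cost
is growing with the size of the archive rather than staying flat.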

>       22875    0.088    0.000   80.413    0.004 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:99(read)
> 
> 38 ms per ssoma delivery is probably fast enough, especially if you

Not even close for me :)

> are invoking ssoma-mda once per message, since process setup will
> take a similar amount of time:
> 
>   $ time python -c 'print("hello")'
>   hello
> 
>   real    0m0.016s
>   user    0m0.013s
>   sys     0m0.003s
> 
> It's possible that fast-import would shave a few ms off the pygit2
> addition (I'm not sure, and maybe pygit2 is faster than fast-import).
> But I doubt it matters enough either way to be worth changing unless
> you are dealing with a really large corpus.

One key feature is fast-import avoids writing an index entirely.
I think pygit2 would have to learn that, too.
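
For anyone unfamiliar with it, here is a minimal sketch of what a
fast-import stream for one message could look like.  This is not
public-inbox's actual code; the function names, the ref, the path and
the committer identity are all placeholders chosen for the example.
The stream describes a commit with the file content inline, so no
index is involved.

```python
def data_cmd(payload):
    """Format a fast-import 'data' command for a bytes payload."""
    return ("data %d\n" % len(payload)).encode("ascii") + payload + b"\n"


def commit_cmd(ref, committer, when, message, path, blob):
    """Build one fast-import 'commit' command adding `blob` at `path`.

    `committer` is 'Name <email>' and `when` is seconds since the epoch;
    the timezone is fixed to +0000 to keep the sketch short.
    """
    out = [
        ("commit %s\n" % ref).encode("utf-8"),
        ("committer %s %d +0000\n" % (committer, when)).encode("utf-8"),
        data_cmd(message.encode("utf-8")),
        ("M 100644 inline %s\n" % path).encode("utf-8"),
        data_cmd(blob),
        b"\n",
    ]
    return b"".join(out)


stream = commit_cmd(
    ref="refs/heads/master",
    committer="Example <mda@example.com>",  # placeholder identity
    when=1471814095,                         # arbitrary example timestamp
    message="deliver one message",
    path="e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391",  # example 2/38 path
    blob=b"hello",
)
```

Feeding such a stream to `git fast-import` on stdin inside a
repository creates the commit and updates the ref.  As written, each
commit starts from an empty tree because there is no `from` line; a
real importer would chain commits within one stream.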
