Subject: Re: Deduplication ?

Date: Mon, 02 Jun 2014 20:06:09 +0300

To: Mark Walters, Tomi Ollila, Vladimir Marek,


From: Jani Nikula

On Mon, 02 Jun 2014, Mark Walters <> wrote:
> Tomi Ollila <> writes:
>> On Mon, Jun 02 2014, Mark Walters <> wrote:
>>> Vladimir Marek <> writes:
>>> If you want to save disk space then you could delete the duplicates
>>> afterwards with something like
>>> notmuch search --output=files --format=text0 --duplicate=2 '*' piped to
>>> xargs -0
>> What if there are 3 duplicates (or 4... ;)
> I was assuming that it was merging 2 duplicate-free bunches of messages,
> but I guess the new 100000 might not be. In that case running the above
> repeatedly (i.e. until it is a no-op) would be fine.

With 'notmuch new' in between the runs, obviously.
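For the record, a minimal sketch of that repeat-until-no-op loop (untested; assumes a configured notmuch setup, and that simply rm'ing the extra files is what you want):

```shell
#!/bin/sh
# Keep deleting the second copy of every message, re-indexing between
# passes, until no message has a second copy left.
# Untested sketch -- try it on a backup first.
while notmuch search --output=files --duplicate=2 '*' | grep -q .; do
    notmuch search --output=files --format=text0 --duplicate=2 '*' \
        | xargs -0 rm
    notmuch new    # re-index so the next pass sees the current state
done
```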

Alternatively, find the biggest --duplicate=N which still outputs
something, and run the command for each N...2.
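Roughly like this (again untested; assumes no files are added or moved between the passes):

```shell
#!/bin/sh
# Find the largest N for which --duplicate=N still outputs something,
# then delete the Nth, (N-1)th, ..., 2nd copies in turn. Each file is
# listed at most once per N, so no 'notmuch new' is needed between the
# rm passes -- one re-index at the end suffices. Untested sketch.
n=2
while notmuch search --output=files --duplicate=$((n + 1)) '*' | grep -q .; do
    n=$((n + 1))
done
while [ "$n" -ge 2 ]; do
    notmuch search --output=files --format=text0 --duplicate="$n" '*' \
        | xargs -0 rm
    n=$((n - 1))
done
notmuch new
```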

>> One should also have some message content heuristics to determine that the
>> content is indeed duplicate and not something totally different (not that
>> we can see the different content anyway... but...)
> That would be nice.

And quite hard.
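A crude approximation would be to only delete copies that are byte-identical to the first copy, comparing checksums, e.g. (untested sketch; assumes maildir paths without whitespace and sha256sum from coreutils):

```shell
#!/bin/sh
# For every message with more than one file, delete only those extra
# copies whose checksum matches the first copy byte-for-byte; files
# with differing content are left alone for manual inspection.
# Untested sketch; breaks on file names containing whitespace.
notmuch search --output=messages '*' | while read -r id; do
    set -- $(notmuch search --output=files "$id")
    [ "$#" -lt 2 ] && continue
    ref=$(sha256sum "$1" | cut -d' ' -f1)
    shift
    for f in "$@"; do
        [ "$(sha256sum "$f" | cut -d' ' -f1)" = "$ref" ] && rm -- "$f"
    done
done
notmuch new
```

Messages sharing a Message-ID can still differ in headers or even bodies, so this skips anything that differs at all rather than guessing which variant to keep.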