Re: Deduplication ?

Subject: Re: Deduplication ?

Date: Mon, 02 Jun 2014 15:26:17 +0100

To: Tomi Ollila, Vladimir Marek, notmuch@notmuchmail.org

From: Mark Walters


Tomi Ollila <tomi.ollila@iki.fi> writes:

> On Mon, Jun 02 2014, Mark Walters <markwalters1009@gmail.com> wrote:
>
>> Vladimir Marek <Vladimir.Marek@oracle.com> writes:
>>
>>> Hi,
>>>
>>> I want to import a bigger chunk of archived messages into my notmuch
>>> database. It's about 100k messages. The problem is that I most probably
>>> already have quite a lot of those messages in the DB. Basically I would
>>> like to add only those I don't have already.
>>>
>>> There are two possibilities
>>>
>>> a) I will add all the 100k messages and then remove the duplicates.
>>>
>>> b) I will write a script which will parse the message IDs of the
>>>    to-be-added messages and try to match them to the notmuch DB, adding
>>>    only files I can't find already.
>>>
>>> b) might be the better option, but I started to play with the idea of
>>> deduplication. I'm thinking about listing all the message IDs stored in
>>> DB, listing all files belonging to the IDs and deleting all but one.
>>> Also I'm thinking about implementing some simple algorithm telling me
>>> whether the messages are really very similar. Just to be sure I don't
>>> delete something I don't want to.
>>>
>>> Has anyone played with this idea?
>>
>> I am not sure what your use case is but notmuch automatically
>> deduplicates: that is, if the message-id is one it has already seen, no
>> further indexing takes place. The only thing that happens is the new
>> filename gets added to the list of filenames for the message.
>>
>> Thus importing should be almost as fast as if the message were not
>> there, and the database should be almost identical to what you would get
>> if you only imported the genuinely new messages.
>>
>> If you want to save disk space then you could delete the duplicates
>> afterwards with something like
>>
>> notmuch search --output=files --format=text0 --duplicate=2 '*' piped to
>> xargs -0
>
> What if there are 3 duplicates (or 4... ;)

I was assuming that it was merging 2 duplicate-free bunches of messages,
but I guess the new 100000 might not be duplicate-free. In that case
running the above repeatedly (i.e. until it is a no-op) would be fine.
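
A rough sketch of that repeat-until-no-op loop, in case it helps. It
assumes rm is the command xargs is meant to run, and that notmuch new is
run between passes so the database forgets the files just removed. The
function name dedupe_until_noop is made up for illustration.

```shell
#!/bin/sh
# Sketch only; try it on a copy of the mail store first.
# dedupe_until_noop is a made-up name, and rm is an assumed choice of
# command for xargs to run.
dedupe_until_noop() {
    list=$(mktemp) || return 1
    while :; do
        # Second filename of every message that has at least two files.
        notmuch search --output=files --format=text0 --duplicate=2 '*' \
            > "$list" || { rm -f "$list"; return 1; }
        # An empty list means there is nothing left to remove.
        [ -s "$list" ] || break
        # No -f: a stale path makes rm fail, which stops the loop.
        xargs -0 rm -- < "$list" || { rm -f "$list"; return 1; }
        # Rescan so the database forgets the files just removed.
        notmuch new > /dev/null || { rm -f "$list"; return 1; }
    done
    rm -f "$list"
}
```

Each pass removes at most one extra copy per message, so a message with
N copies needs N-1 passes before the search comes back empty.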

>
>>
>> (but please test it carefully first!)
>
> One should also have some message content heuristics to determine that the
> content is indeed duplicate and not something totally different (not that
> we can see the different content anyway... but...)

That would be nice.
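
A very crude sketch of what such a check could look like, assuming LF
line endings and treating two copies as the same only when everything
after the header block is byte-identical. body_sum and same_body are
made-up helper names. Copies that went through a mailing list and gained
a footer would fail this check, so it errs on the side of keeping files.

```shell
#!/bin/sh
# Sketch only: body_sum and same_body are hypothetical helpers.

# body_sum FILE: checksum of everything after the first blank line,
# i.e. the message body without its headers.
body_sum() {
    sed '1,/^$/d' "$1" | cksum
}

# same_body FILE1 FILE2: succeed when both bodies hash identically.
same_body() {
    [ "$(body_sum "$1")" = "$(body_sum "$2")" ]
}
```

The two filenames to compare could come from
notmuch search --output=files with an id: query for a single message.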

Best wishes

Mark


>>
>> I would think something like this is better than trying to parse the
>> message-ids yourself.
>
>
>>
>> Best wishes
>>
>> Mark
>>
>
> Tomi
>
>
>>
>>>
>>> -- 
>>> 	Vlad
