Subject: Re: Deduplication ?

Date: Mon, 02 Jun 2014 15:10:30 +0100

To: Vladimir Marek, David Edmondson

Cc: notmuch@notmuchmail.org

From: Mark Walters


Vladimir Marek <Vladimir.Marek@oracle.com> writes:

>> > I want to import a bigger chunk of archived messages into my notmuch
>> > database. It's about 100k messages. The problem is that I most probably
>> > already have quite a lot of those messages in the DB. Basically I would
>> > like to add only those I don't have already.
>> >
>> > There are two possibilities
>> >
>> > a) I will add all the 100k messages and then remove the duplicates.
>> >
>> > b) I will write a script which will parse the message IDs of the
>> >    to-be-added messages and try to match them to the notmuch DB, adding
>> >    only the files I can't find already.
>> >
>> > b) might be the better option, but I started to play with the idea of
>> > deduplication. I'm thinking about listing all the message IDs stored in
>> > DB, listing all files belonging to the IDs and deleting all but one.
>> > Also I'm thinking about implementing some simple algorithm telling me
>> > whether the messages are really very similar. Just to be sure I don't
>> > delete something I don't want to.
>> >
>> > Was anyone playing with the idea?
>> 
>> notsync[1] used the (lack of) existence of a message id in the store to
>> decide whether to add something from an IMAP server, but it is old,
>> crufty, unused and unloved code.
>
> I see, that's close to my b) solution, thanks!

Did you mean a) here? The idea was to add them all first and then run
this script to delete the duplicates.
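
For comparison, option b) from the quoted message could look roughly
like the sketch below. It is only an illustration, not what notsync
does: the function names are made up for this example, it assumes each
archived message is a single file with an ordinary Message-ID header,
and it assumes "notmuch search --output=messages '*'" prints one
"id:..." line per message in the database. Files without a Message-ID
are kept, since there is nothing to match them on.

```python
import subprocess
from email.parser import BytesHeaderParser


def message_id_of(path):
    """Return the Message-ID of the mail file at `path` without the
    surrounding angle brackets, or None if the header is missing."""
    with open(path, "rb") as f:
        headers = BytesHeaderParser().parse(f)
    raw = headers.get("Message-ID")
    if raw is None:
        return None
    return str(raw).strip().strip("<>")


def known_ids():
    """Collect the message IDs already in the notmuch database.

    Assumption: notmuch prints one 'id:<message-id>' line per message.
    """
    out = subprocess.run(
        ["notmuch", "search", "--output=messages", "*"],
        check=True, capture_output=True, text=True,
    ).stdout
    return {line[3:] for line in out.splitlines() if line.startswith("id:")}


def select_new(paths, known):
    """Return the files whose Message-ID is neither in `known` nor
    carried by an earlier file in `paths`."""
    seen = set(known)
    new = []
    for path in paths:
        mid = message_id_of(path)
        if mid is None or mid not in seen:
            new.append(path)
        if mid is not None:
            seen.add(mid)
    return new
```

Something like select_new(archive_files, known_ids()) would then give
the list of files worth copying into the mail store before running
"notmuch new". It compares Message-IDs only, so it does not check
whether two files with the same ID really have the same content.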

Best wishes

Mark

> -- 
> 	Vlad
> _______________________________________________
> notmuch mailing list
> notmuch@notmuchmail.org
> http://notmuchmail.org/mailman/listinfo/notmuch
