Re: Deduplication ?

Subject: Re: Deduplication ?

Date: Mon, 02 Jun 2014 15:10:30 +0100

To: Vladimir Marek, David Edmondson


From: Mark Walters

Vladimir Marek writes:

>> > I want to import a bigger chunk of archived messages into my notmuch
>> > database. It's about 100k messages. The problem is that I most probably
>> > already have quite a lot of those messages in the DB. Basically I would
>> > like to add only those I don't have already.
>> >
>> > There are two possibilities
>> >
>> > a) I will add all the 100k messages and then remove the duplicates.
>> >
>> > b) I will write a script which will parse the message IDs of the
>> >    to-be-added messages and try to match them against the notmuch DB,
>> >    adding only the files I can't find already.
>> >
>> > Ad b) might be the better option, but I started to play with the idea of
>> > deduplication. I'm thinking about listing all the message IDs stored in
>> > DB, listing all files belonging to the IDs and deleting all but one.
>> > Also I'm thinking about implementing some simple algorithm telling me
>> > whether the messages are really very similar. Just to be sure I don't
>> > delete something I don't want to.
>> >
>> > Was anyone playing with the idea?
>> notsync[1] used the (lack of) existence of a message id in the store to
>> decide whether to add something from an IMAP server, but it is old,
>> crufty, unused and unloved code.
> I see, that's close to my b) solution, thanks!

Did you mean a) here? The idea was to add them all first and then run
this script to delete the duplicates.
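
For what it's worth, here is a rough sketch of the "add everything, then
find the duplicates" idea in Python. It is not the notsync code and does
not talk to the notmuch database at all: as an assumption for
illustration, it reads the Message-ID header straight from the mail files
and reports every file beyond the first for each ID. The function names
message_id and find_duplicates are made up for this example.

```python
from collections import defaultdict
from email.parser import BytesHeaderParser


def message_id(path):
    """Return the Message-ID header of a mail file, or None if it has none."""
    with open(path, "rb") as f:
        # Only the header block is parsed; the body is left unstructured.
        headers = BytesHeaderParser().parse(f)
    mid = headers.get("Message-ID")
    return str(mid).strip() if mid else None


def find_duplicates(paths):
    """Group files by Message-ID and return all but the first file per ID.

    Files without a Message-ID are never reported, so they are never
    candidates for deletion.
    """
    by_id = defaultdict(list)
    for path in sorted(paths):  # sorted so the "kept" file is predictable
        mid = message_id(path)
        if mid is not None:
            by_id[mid].append(path)
    return [p for files in by_id.values() for p in files[1:]]
```

The sketch only lists candidates and deletes nothing. Two files sharing a
Message-ID are not guaranteed to have the same content, so comparing the
bodies before removing anything would still be a sensible extra check.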

Best wishes


> -- 
> 	Vlad