Deduplication ?

Subject: Deduplication ?

Date: Mon, 2 Jun 2014 14:32:12 +0200

Hi,

I want to import a bigger chunk of archived messages into my notmuch
database, about 100k messages. The problem is that I most probably
already have quite a lot of those messages in the DB. Basically, I would
like to add only those I don't have already.

There are two possibilities:

a) I add all the 100k messages and then remove the duplicates.

b) I write a script which parses the message IDs of the
   to-be-added messages and tries to match them against the notmuch DB,
   adding only the files I can't find already.
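
Option b) could be sketched roughly as follows. This is only a minimal
sketch, not a tested tool: the function names are made up for
illustration, and it assumes that `notmuch search --output=messages '*'`
prints one `id:<message-id>` line per message. Files without a
Message-ID header cannot be matched this way, so the sketch simply keeps
them for import.

```python
import subprocess
from email import policy
from email.parser import BytesParser


def known_message_ids():
    """Ask notmuch for every message ID it already has (hypothetical helper).

    Assumes each output line looks like 'id:<message-id>'.
    """
    out = subprocess.run(
        ["notmuch", "search", "--output=messages", "*"],
        check=True, capture_output=True, text=True,
    ).stdout
    return {line[3:] for line in out.splitlines() if line.startswith("id:")}


def message_id_of(path):
    """Parse only the headers of a mail file; return its Message-ID or None."""
    with open(path, "rb") as fh:
        msg = BytesParser(policy=policy.compat32).parse(fh, headersonly=True)
    raw = msg.get("Message-ID")
    if raw is None:
        return None
    # Drop surrounding whitespace and the angle brackets around the ID.
    return str(raw).strip().strip("<>")


def files_to_import(paths, known_ids):
    """Return the files whose Message-ID is neither in the DB nor seen earlier."""
    seen = set(known_ids)
    wanted = []
    for path in paths:
        mid = message_id_of(path)
        if mid is None:
            # No Message-ID header: cannot be matched, so keep it.
            wanted.append(path)
        elif mid not in seen:
            seen.add(mid)
            wanted.append(path)
    return wanted
```

Typical use would be `files_to_import(archive_files, known_message_ids())`,
then copying the returned files into the mail store and running
`notmuch new`.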

Option b) might be the better one, but I started to play with the idea of
deduplication. I'm thinking about listing all the message IDs stored in
the DB, listing all files belonging to each ID, and deleting all but one.
I'm also thinking about implementing some simple algorithm that tells me
whether the messages are really very similar, just to be sure I don't
delete something I want to keep.
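
The deduplication idea could look something like the sketch below. It
assumes the mapping from message ID to file names has already been
collected (for instance with `notmuch search --output=files id:...`),
and it uses difflib from the Python standard library as one possible
"simple algorithm" for similarity. The names and the 0.9 threshold are
arbitrary choices for illustration. Only the bodies are compared, since
copies of the same message usually differ in their headers.

```python
from difflib import SequenceMatcher
from pathlib import Path


def body_of(path):
    """Return the text after the first blank line, i.e. the raw message body."""
    raw = Path(path).read_bytes().replace(b"\r\n", b"\n")
    _, _, body = raw.partition(b"\n\n")
    return body.decode("utf-8", errors="replace")


def similarity(path_a, path_b):
    """Similarity of two message bodies as a ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, body_of(path_a), body_of(path_b)).ratio()


def plan_deletions(files_by_id, threshold=0.9):
    """Split duplicate files into 'safe to delete' and 'needs a human look'.

    files_by_id maps a message ID to the list of files carrying that ID.
    The first file (in sorted order) of each ID is always kept.
    """
    delete, review = [], []
    for paths in files_by_id.values():
        if len(paths) < 2:
            continue  # nothing to deduplicate
        keep, *rest = sorted(paths)
        for path in rest:
            if similarity(keep, path) >= threshold:
                delete.append(path)
            else:
                review.append(path)
    return delete, review
```

Nothing is deleted here; the function only returns two lists, so the
result can be inspected before any file is removed. SequenceMatcher can
be slow on very large bodies, so a cheaper comparison may be needed for
100k messages.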

Has anyone played with this idea?

-- 
	Vlad
