Re: Deduplication ?

Hi,

So I wrote some code which works well for me. It removed ~40k
messages out of 500k. It does not try to be a complete solution; it only
detects and removes the obvious cases. The idea is to help me control
the number of duplicates when I import big mail archives, which surely
contain many duplicates, into my mail database.

> Thinking about this a bit...

> The headers are likely to be different, so you could remove them (get
> rid of everything up to the first empty line).

Yes, that's what I ended up doing. Of the duplicate copies, I delete
the files which have fewer 'Received:' headers.
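
For anyone who wants to try the idea without the Perl script, here is a
minimal sketch in Python. It is not the attached script; the function
names are made up for illustration, and returning None on a tie is my
assumption, not something the script necessarily does.

```python
import email
import re


def split_message(raw: bytes):
    """Split a raw message into (headers, body) at the first empty line."""
    parts = re.split(rb"\r?\n\r?\n", raw, maxsplit=1)
    headers = parts[0]
    body = parts[1] if len(parts) > 1 else b""
    return headers, body


def count_received(raw: bytes) -> int:
    """Count the 'Received:' headers of a raw message."""
    msg = email.message_from_bytes(raw)
    return len(msg.get_all("Received", []))


def fewer_received(raw_a: bytes, raw_b: bytes):
    """Return 0 or 1 for the copy with fewer 'Received:' headers.

    Returns None on a tie (an assumption made for this sketch).
    """
    a, b = count_received(raw_a), count_received(raw_b)
    if a == b:
        return None
    return 0 if a < b else 1
```

The copy selected by fewer_received() would be the candidate for
deletion, and only once the bodies have been found to match.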

> Various mailing lists add footers, so you would need to remove them (a
> regular expression based approach would catch most of them easily).

I defined a list of known footers. Then I take the two mails with the
same message-id, create a diff between them, and compare it to the list
of footers.
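
The footer check can be sketched roughly like this in Python. Again,
this is not the attached Perl script: the footer text in KNOWN_FOOTERS is
a made-up placeholder, and treating two bodies as duplicates only when
the differing lines exactly match one known footer is an assumption of
this sketch.

```python
import difflib

# Hypothetical example footer; a real list would hold the footers
# actually seen in your archives.
KNOWN_FOOTERS = [
    "_______________________________________________\n"
    "example-list mailing list\n"
    "example-list@lists.example.org\n",
]


def changed_lines(body_a: str, body_b: str):
    """Return the non-blank lines present in only one of the two bodies."""
    diff = difflib.ndiff(body_a.splitlines(), body_b.splitlines())
    return [
        line[2:].strip()
        for line in diff
        if line[:2] in ("+ ", "- ") and line[2:].strip()
    ]


def same_apart_from_footer(body_a: str, body_b: str, footers=KNOWN_FOOTERS):
    """True if the bodies are identical or differ only by a known footer."""
    changed = changed_lines(body_a, body_b)
    if not changed:
        return True
    for footer in footers:
        footer_lines = [l.strip() for l in footer.splitlines() if l.strip()]
        if changed == footer_lines:
            return True
    return False
```

Anything that differs by more than a known footer is left alone, which
matches the goal of removing only the obvious cases.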

> The remaining content should be the same for identical messages, so a
> sensible hash (md5) could be used to compare.
> 
> Although, some MTAs modify the body of the message when manipulating
> encoding. I don't know how to address this.

I'm attaching my perl script if anyone is interested. It's in no way a
complete solution. It is supposed to be used as follows:

notmuch search --output=files --duplicate=2 '*' > dups
./dedup # It opens the file 'dups'

The attached version does not remove anything (the 'unlink' command is
commented out).
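
The same "dry run by default" behaviour could look like this in Python.
This is only a sketch under assumptions: the helper names and the dry_run
flag are hypothetical, and the only thing taken from the workflow above
is that the file list is read from a file named 'dups', one path per
line.

```python
import os


def load_paths(list_file="dups"):
    """Read one file path per line, skipping empty lines."""
    with open(list_file, encoding="utf-8") as fh:
        return [line.rstrip("\n") for line in fh if line.strip()]


def remove_files(paths, dry_run=True):
    """Delete the given files, or only report them when dry_run is True.

    Returns the list of paths that were (or would be) removed.
    """
    selected = []
    for path in paths:
        if not os.path.isfile(path):
            continue  # already gone, e.g. moved by the mail client
        if dry_run:
            print("would remove:", path)
        else:
            os.unlink(path)
        selected.append(path)
    return selected
```

Calling remove_files(load_paths()) only prints the candidates; passing
dry_run=False is what actually deletes them.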

Interestingly this does not work (it seems to return all messages):
notmuch search --output=messages --duplicate=2 '*'

Also, I have found that if I run 'notmuch search' and 'notmuch new' at
the same time, 'notmuch search' sometimes crashes. That's why I don't
use

notmuch search ... | ./dedup

Use with care :)

Thank you for your help
-- 
	Vlad
