Re: Deduplication ?

Subject: Re: Deduplication ?

Date: Mon, 02 Jun 2014 18:25:42 +0100

To: Jani Nikula, Mark Walters, Tomi Ollila, Vladimir Marek,


From: David Edmondson

On Mon, Jun 02 2014, Jani Nikula wrote:
>>> One should also have some message content heuristics to determine that the
>>> content is indeed duplicate and not something totally different (not that
>>> we can see the different content anyway... but...)
>> That would be nice.
> And quite hard.

Thinking about this a bit...

The headers are likely to be different, so you could remove them (get
rid of everything up to the first empty line).

Various mailing lists add footers, so you would need to remove them (a
regular expression based approach would catch most of them easily).

The remaining content should be the same for identical messages, so a
sensible hash of it (md5, for example) could be used to compare them.
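
As a rough illustration of the three steps above (drop the headers, strip
a list footer, hash what is left), here is a minimal Python sketch. The
names body_digest and FOOTER_RE are made up for this example, and the
footer pattern is only an assumption about what a Mailman-style footer
looks like (a line of underscores followed by list details); real lists
would need their own patterns. Normalising CRLF line endings and
trimming surrounding whitespace are extra assumptions as well.

```python
import hashlib
import re

# Hypothetical footer pattern: a line of 10 or more underscores and
# everything after it. Real mailing lists will need their own patterns.
FOOTER_RE = re.compile(r"\n_{10,}\n.*\Z", re.DOTALL)


def body_digest(raw: str) -> str:
    """Return an md5 hex digest of a message body, ignoring headers and footer."""
    # Assumption: treat CRLF and LF line endings as equivalent.
    normalized = raw.replace("\r\n", "\n")
    # Get rid of everything up to the first empty line (the headers).
    _, _, body = normalized.partition("\n\n")
    # Remove a trailing list footer, if the pattern matches one.
    body = FOOTER_RE.sub("", body)
    # Assumption: leading/trailing whitespace is not significant.
    return hashlib.md5(body.strip().encode("utf-8")).hexdigest()


direct_copy = "From: a@example.com\nSubject: hi\n\nHello world\n"
list_copy = (
    "From: list@example.com\nReceived: by mx.example.org\nSubject: hi\n\n"
    "Hello world\n\n"
    "_______________________________________________\n"
    "example mailing list\nlist@example.org\n"
)
same = body_digest(direct_copy) == body_digest(list_copy)
```

Two copies with the same digest would then be candidates for being
duplicates; differing digests say nothing more than that the remaining
bytes differ.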

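However, some MTAs modify the body of the message when they change its
encoding (the content transfer encoding, for example), so two copies of
the same message can still differ byte for byte. I don't know how to
address this.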
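
One possible partial mitigation, offered only as an assumption and not
something proposed in this thread, is to hash the decoded payload rather
than the raw bytes, so that base64 and quoted-printable copies of the
same text compare equal. The sketch below uses Python's standard email
package; decoded_body_digest is a made-up name, and it does nothing
about footers, charset conversions or other body rewriting.

```python
import hashlib
from email import message_from_bytes


def decoded_body_digest(raw: bytes) -> str:
    """Return an md5 hex digest of the decoded leaf parts of a message."""
    msg = message_from_bytes(raw)
    chunks = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        # decode=True undoes the Content-Transfer-Encoding
        # (base64 or quoted-printable) and returns bytes.
        payload = part.get_payload(decode=True)
        if payload is not None:
            chunks.append(payload)
    # Assumption: leading/trailing whitespace is not significant.
    return hashlib.md5(b"".join(chunks).strip()).hexdigest()


plain_copy = b"Subject: hi\n\nHello world\n"
base64_copy = (
    b"Subject: hi\nContent-Transfer-Encoding: base64\n\n"
    b"SGVsbG8gd29ybGQK\n"
)
same_decoded = decoded_body_digest(plain_copy) == decoded_body_digest(base64_copy)
```

This only covers re-encoding of otherwise identical content; an MTA that
rewraps lines or converts the character set would still produce a
different digest.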