Re: Fixed Message-ID trouble

Subject: Re: Fixed Message-ID trouble

Date: Tue, 26 Sep 2023 14:15:31 +0200

To: Alexander Adolf

Cc: Daniel Corbe, notmuch@notmuchmail.org

From: Andreas Kähäri


On Tue, Sep 26, 2023 at 01:44:00PM +0200, Alexander Adolf wrote:
> David Bremner <david@tethera.net> writes:
> 
> > Alexander Adolf <alexander.adolf@condition-alpha.com> writes:
> >
> >> Bearing in mind that re-recognising a message which has arrived
> >> multiple times via different routes is a worthwhile feature, it would
> >> seem to me that a hash over the invariant part of the message, that is
> >> the body, would allow for such detection. In that light, it would seem
> >> to me that the tuple (body_hash, message_id) could be a candidate for
> >> a “unique enough”(tm) identifier?
> >
> > I always had the impression that the message body had too much variation
> > imposed by different delivery routes for this to be very helpful:
> > essentially the hash would be different for every file due to trailers
> > added by mailing lists,
> 
> Ah, good point. I hadn't thought of mailing list trailers. Could these
> perhaps be detected via the signature line separator "-- \n"?
> 
> I guess this also touches on the question of what a consensus definition
> of "sameness" could be. If we take the message-id only, it'd be a purely
> technical one. If we'd include the content one way or another (for
> instance via hash over the body), that would rather be an editorial
> definition of "sameness".
> 
> > re-encoding,
> 
> Like...? utf-8 to/from quoted-printable...?
> 
> > stupid "external message" headers added by malicious^Wcorporate mail
> > servers, etc...
> 
> Headers would not "muddy the waters" since they are headers. In my mind,
> the hash would be over the body only.

Hi, I'm not really part of the discussion, but I can add a quick thought
and a suggestion.

There are corporate mail servers that add a boilerplate "header" to the
body of outgoing email messages.  The more common practice is to add a
"footer" to the message.  I have seen these footers being added both
before and after the user's signature.  A hash over the body of the
message therefore cannot be relied on to identify the message uniquely.
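
To illustrate, here is a rough Python sketch of how an appended trailer
changes a body hash.  The sample messages, the trailer text, and the
body_hash() helper are made up for this example, and SHA-256 is an
arbitrary choice of hash.  It only handles simple single-part messages.

```python
import hashlib
from email import message_from_string


def body_hash(raw: str) -> str:
    """Return a SHA-256 hex digest over the body only (headers ignored)."""
    msg = message_from_string(raw)
    # get_payload(decode=True) returns None for multipart messages;
    # this sketch treats that case as an empty body.
    payload = msg.get_payload(decode=True) or b""
    return hashlib.sha256(payload).hexdigest()


# Hypothetical message as sent by the author.
ORIGINAL = (
    "Message-ID: <fixed@example.com>\n"
    "Subject: hello\n"
    "\n"
    "Hello world.\n"
)

# The same message after a (made-up) mailing list appended a trailer.
VIA_LIST = ORIGINAL + (
    "_______________________________________________\n"
    "example mailing list -- list@example.org\n"
)

# Extra headers do not affect the hash, but the trailer does.
print(body_hash(ORIGINAL) == body_hash("X-Extra: 1\n" + ORIGINAL))
print(body_hash(ORIGINAL) == body_hash(VIA_LIST))
```

Added headers leave the hash alone, while anything added to the body
on one route but not another yields a different hash for the same
message.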

Using the earliest Received header (the one furthest down) as a unique
identifier would possibly be a better approach.  Since this likely
contains the identity of the originating mail server, some mail queue
ID, and a timestamp, it should be unique enough to identify the message,
even if the message is received via multiple routes and has a non-unique
Message ID.
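
A minimal Python sketch of that idea, assuming every hop prepends its
own Received header so that the last one in the file is the oldest.
The earliest_received() helper and the sample message are hypothetical.
Messages that were never relayed may have no Received header at all, in
which case the function returns None.

```python
from email import message_from_string


def earliest_received(raw: str):
    """Return the bottom-most Received header, or None if there is none."""
    msg = message_from_string(raw)
    hops = msg.get_all("Received", [])
    if not hops:
        return None
    # Headers are returned in file order; the last one was added first.
    # Collapse folding whitespace so that re-wrapped copies compare equal.
    return " ".join(hops[-1].split())


# Made-up message that passed through two servers.
SAMPLE = (
    "Received: from list.example.org by mx.example.net id B2;\n"
    "\tTue, 26 Sep 2023 14:16:02 +0200\n"
    "Received: from laptop.example.com by smtp.example.com id A1;\n"
    "\tTue, 26 Sep 2023 14:15:31 +0200\n"
    "Message-ID: <fixed@example.com>\n"
    "\n"
    "Hello.\n"
)

print(earliest_received(SAMPLE))
```

For the sample, this picks the smtp.example.com line rather than the
one added later by the list server.  Whether real-world servers leave
that first header untouched is something I have not verified.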

> > I could be wrong, maybe hashing is a useful approach, but I'd need to
> > see some numbers to be convinced.
> 
> I fully agree that we need to adapt to the realities of how things are
> actually used, not how they were intended to be used.
> 
> How would I find instances of multiple files for the same message-id in
> my database for example?
> 
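
One way to look for this outside of notmuch itself would be to walk the
mail directory and group files by their Message-ID header.  This is
only a sketch: the files_by_message_id() helper is hypothetical, it
reads every file under the given directory as a mail message, and it
has not been tried on a large mail store.

```python
import os
from collections import defaultdict
from email.parser import BytesParser


def files_by_message_id(root):
    """Map each Message-ID seen in more than one file to those files."""
    groups = defaultdict(list)
    parser = BytesParser()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fp:
                # Only the headers are needed, so skip parsing the body.
                headers = parser.parse(fp, headersonly=True)
            message_id = headers.get("Message-ID")
            if message_id:
                groups[str(message_id).strip()].append(path)
    # Keep only the IDs that occur in at least two files.
    return {
        mid: sorted(paths)
        for mid, paths in groups.items()
        if len(paths) > 1
    }
```

Calling files_by_message_id() on the top of the mail store returns a
dictionary from Message-ID to the list of files sharing it, which could
then be compared by body hash or by earliest Received header.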
> 
> Cheers,
> 
>   --alexander
> _______________________________________________
> notmuch mailing list -- notmuch@notmuchmail.org
> To unsubscribe send an email to notmuch-leave@notmuchmail.org

-- 
Andreas (Kusalananda) Kähäri
Uppsala, Sweden

