On Tue, Sep 26, 2023 at 01:44:00PM +0200, Alexander Adolf wrote:
> David Bremner <david@tethera.net> writes:
>
> > Alexander Adolf <alexander.adolf@condition-alpha.com> writes:
> >
> >> Bearing in mind that re-recognising a message which has arrived
> >> multiple times via different routes is a worthwhile feature, it would
> >> seem to me that a hash over the invariant part of the message, that is
> >> the body, would allow for such detection. In that light, it would seem
> >> to me that the tuple (body_hash, message_id) could be a candidate for
> >> a “unique enough”(tm) identifier?
> >
> > I always had the impression that the message body had too much variation
> > imposed by different delivery routes for this to be very helpful:
> > essentially the hash would be different for every file due to trailers
> > added by mailing lists,
>
> Ah, good point. I hadn't thought of mailing list trailers. Could these
> perhaps be detected via the signature line separator "-- \n"?
>
> I guess this also touches on the question of what a consensus definition
> of "sameness" could be. If we take the message-id only, it'd be a purely
> technical one. If we'd include the content one way or another (for
> instance via hash over the body), that would rather be an editorial
> definition of "sameness".
>
> > re-encoding,
>
> Like...? utf-8 to/from quoted-printable...?
>
> > stupid "external message" headers added by malicious^Wcorporate mail
> > servers, etc...
>
> Headers would not "muddy the waters" since they are headers. In my mind,
> the hash would be over the body only.

Hi,

I'm not really part of the discussion, but I can add a quick thought and
a suggestion.

There are corporate mail servers that add a boilerplate "header" to the
body of outgoing email messages. The more common practice is to add a
"footer" to the message. I have seen these footers being added both
before and after the user's signature.
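
To make the footer problem concrete, here is a minimal sketch in Python
(standard library only). The sample message, the footer text and the
body_hash() helper are all made up for illustration; this is not how
notmuch does anything. It just shows that two copies of a message with
the same Message-ID get different body hashes once one route appends a
footer.

```python
import hashlib
from email import message_from_string

# Made-up sample message; addresses and Message-ID are placeholders.
RAW = """\
From: alice@example.org
To: list@example.org
Message-ID: <1234@example.org>
Subject: hello

Hi all,
just a short note.
"""

# Hypothetical footer, standing in for what a mailing list or a
# corporate mail server might append to the body.
FOOTER = "_______________\nexample mailing list -- list@example.org\n"


def body_hash(raw: str) -> str:
    """Return the SHA-256 hex digest of the body (headers excluded)."""
    msg = message_from_string(raw)
    body = msg.get_payload()  # a plain string for a non-multipart message
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


direct = body_hash(RAW)             # copy delivered directly
via_list = body_hash(RAW + FOOTER)  # copy that went through the list

same_id = (message_from_string(RAW)["Message-ID"]
           == message_from_string(RAW + FOOTER)["Message-ID"])

print(same_id)             # prints True: the Message-ID is unchanged
print(direct == via_list)  # prints False: the body hash is not
```

Re-encoding (for instance utf-8 to quoted-printable) would change the
hash in the same way, unless the body is decoded and normalised first.
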
You cannot use a hash that includes the body of the message to identify
the message uniquely.

Using the earliest Received header (the one furthest down) as a unique
identifier would possibly be a better approach. Since this likely
contains the identity of the originating mail server, some mail queue
ID, and a timestamp, it should be unique enough to identify the message,
even if the message is received via multiple routes and has a non-unique
Message-ID.

> > I could be wrong, maybe hashing is a useful approach, but I'd need to
> > see some numbers to be convinced.
>
> I fully agree that we need to adapt to the realities of how things are
> actually used, not how they were intended to be used.
>
> How would I find instances of multiple files for the same message-id in
> my database for example?
>
>
> Cheers,
>
>   --alexander
> _______________________________________________
> notmuch mailing list -- notmuch@notmuchmail.org
> To unsubscribe send an email to notmuch-leave@notmuchmail.org

-- 
Andreas (Kusalananda) Kähäri
Uppsala, Sweden

.
_______________________________________________
notmuch mailing list -- notmuch@notmuchmail.org
To unsubscribe send an email to notmuch-leave@notmuchmail.org
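
P.S. The earliest-Received-header idea above could be sketched roughly
like this in Python (standard library only). The function name
earliest_received_id() and the sample headers are invented for the
example, and hashing the header with SHA-256 is just one possible way
to turn it into a fixed-size key. It assumes that every hop prepends
its own Received header, so the last one in the file is the oldest.

```python
import hashlib
from email import message_from_string
from typing import Optional


def earliest_received_id(raw: str) -> Optional[str]:
    """Hash the bottom-most Received header, or return None if absent."""
    msg = message_from_string(raw)
    received = msg.get_all("Received")
    if not received:
        return None  # e.g. a locally saved draft with no trace headers
    # Hops prepend their header, so the last entry is the earliest hop.
    # Collapse folding whitespace so line wrapping does not matter.
    earliest = " ".join(received[-1].split())
    return hashlib.sha256(earliest.encode("utf-8")).hexdigest()


# Made-up trace header written by the originating server (folded).
ORIGIN = ("Received: from client.example.org by mx.example.org"
          " with ESMTP id ABC123;\n\tTue, 26 Sep 2023 13:44:00 +0200\n")

# Copy A: delivered directly.
COPY_A = ("Received: from mx.example.org by inbox.example.net;"
          " Tue, 26 Sep 2023 13:44:05 +0200\n"
          + ORIGIN
          + "Message-ID: <1@example.org>\n\nHello\n")

# Copy B: same message via a list, with extra hops and a footer.
COPY_B = ("Received: from lists.example.com by inbox.example.net;"
          " Tue, 26 Sep 2023 13:45:10 +0200\n"
          "Received: from mx.example.org by lists.example.com;"
          " Tue, 26 Sep 2023 13:44:30 +0200\n"
          + ORIGIN
          + "Message-ID: <1@example.org>\n\nHello\n-- list footer\n")

id_a = earliest_received_id(COPY_A)
id_b = earliest_received_id(COPY_B)
print(id_a == id_b)  # prints True: both copies share the first hop
```

This only helps if the originating server actually records a Received
header and no later hop rewrites or strips it, which would have to be
checked against real mail before relying on it.
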