David Bremner <david@tethera.net> writes:

> Alexander Adolf <alexander.adolf@condition-alpha.com> writes:
>
>> Bearing in mind that re-recognising a message which has arrived
>> multiple times via different routes is a worthwhile feature, it would
>> seem to me that a hash over the invariant part of the message, that is
>> the body, would allow for such detection. In that light, it would seem
>> to me that the tuple (body_hash, message_id) could be a candidate for
>> a “unique enough”(tm) identifier?
>
> I always had the impression that the message body had too much variation
> imposed by different delivery routes for this to be very helpful:
> essentially the hash would be different for every file due to trailers
> added by mailing lists,

Ah, good point. I hadn't thought of mailing list trailers. Could these
perhaps be detected via the signature line separator "-- \n"?

I guess this also touches on the question of what a consensus definition
of "sameness" could be. If we take the message-id only, it'd be a purely
technical one. If we included the content one way or another (for
instance via a hash over the body), that would rather be an editorial
definition of "sameness".

> re-encoding,

Like...? utf-8 to/from quoted-printable...?

> stupid "external message" headers added by malicious^Wcorporate mail
> servers, etc...

Headers would not "muddy the waters" since they are headers. In my mind,
the hash would be over the body only.

> I could be wrong, maybe hashing is a useful approach, but I'd need to
> see some numbers to be convinced.

I fully agree that we need to adapt to the realities of how things are
actually used, not how they were intended to be used. How would I find
instances of multiple files for the same message-id in my database, for
example?


Cheers,

  --alexander
_______________________________________________
notmuch mailing list -- notmuch@notmuchmail.org
To unsubscribe send an email to notmuch-leave@notmuchmail.org
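
For illustration only, here is a minimal Python sketch of the
(body_hash, message_id) idea from the thread above. It uses only the
standard library and is not how notmuch identifies messages. All function
names (body_text, strip_after_separator, message_key,
group_by_message_id) are hypothetical. The assumptions are: the hash
covers the decoded text body only, which should absorb transfer-encoding
differences such as quoted-printable versus 8bit; everything from the
first "-- " line onwards is dropped; and SHA-256 is an arbitrary choice
of hash. Note that a footer like the one on this list starts with a row
of underscores rather than "-- ", so this heuristic alone would not
catch it.

```python
import hashlib
from collections import defaultdict
from email import message_from_bytes, policy

SIG_SEPARATOR = "-- "  # the signature separator line, i.e. "-- \n"


def body_text(msg):
    """Return the decoded text body of a parsed message, or "" if none."""
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return ""
    # get_content() undoes the transfer encoding and decodes the charset.
    return part.get_content()


def strip_after_separator(text):
    """Drop everything from the first "-- " line onwards (assumed heuristic)."""
    kept = []
    for line in text.splitlines():
        if line == SIG_SEPARATOR:
            break
        kept.append(line.rstrip())
    return "\n".join(kept).strip()


def message_key(raw):
    """Return the (body_hash, message_id) tuple for one raw message file."""
    msg = message_from_bytes(raw, policy=policy.default)
    body = strip_after_separator(body_text(msg))
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    message_id = (msg["Message-ID"] or "").strip()
    return digest, message_id


def group_by_message_id(raw_messages):
    """Map each Message-ID to the set of distinct body hashes seen for it."""
    groups = defaultdict(set)
    for raw in raw_messages:
        digest, message_id = message_key(raw)
        groups[message_id].add(digest)
    return dict(groups)
```

Feeding group_by_message_id() the raw bytes of every file in a mail store
would give a first rough number: Message-IDs whose set holds more than
one hash are the cases where the body differed between copies even after
this normalisation.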