Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]

Date: Sat, 05 Sep 2015 14:13:00 -0500

To: David Bremner, notmuch@notmuchmail.org

Cc:

From: Rob Browning


Rob Browning <rlb@defaultvalue.org> writes:

> David Bremner <david@tethera.net> writes:

>> It seems plausible to specify UTF-8 input for the library, but what
>> about the CLI? It seems like the canonicalization operation increases
>> the chance of mangling user input in non-UTF-8 locales.
>
> Yes, the key question: what does notmuch intend?  i.e. given a sequence
> of bytes, how will notmuch interpret them?  I think we should decide
> that, and document it clearly somewhere.
>
> The commit message describes my understanding of how things currently
> work, and if/when I get time, I'd like to propose some related
> documentation updates (perhaps to notmuch-search-terms or
> notmuch-insert/new?).
>
> Oh, and if I do understand things correctly, notmuch may already stand a
> chance of mangling any bytes that happen to form a valid UTF-8 sequence
> but aren't actually encoded as UTF-8 (excepting encodings that are a
> strict subset of UTF-8, like ASCII).
>
> For example (if I did this right), [0xd1 0xa1] is valid UTF-8, producing
> Cyrillic omega "ѡ" (U+0461), and also valid Latin-1, producing "Ñ¡".
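
As a quick check of the byte pair in the quoted example, here is a
minimal Python sketch (not notmuch code; the variable names are
illustrative only) that decodes the same two bytes under both
encodings:

```python
# The two bytes from the quoted example.
raw = bytes([0xD1, 0xA1])

# Read as UTF-8: a single code point, U+0461 (Cyrillic small letter omega).
as_utf8 = raw.decode("utf-8")

# Read as Latin-1: two code points, U+00D1 and U+00A1.
as_latin1 = raw.decode("latin-1")

print(len(as_utf8), [hex(ord(c)) for c in as_utf8])
print(len(as_latin1), [hex(ord(c)) for c in as_latin1])
```

Both decodes succeed, so the bytes alone cannot say which encoding
the sender meant.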

So on this particular point, I'm perhaps too used to thinking about the
general encoding problem, and wasn't thinking about our specific
constraints.

If (1) "normal" message bodies are required to be US-ASCII (which I'd
forgotten might be the case), and (2) MIME handles the rest, then
perhaps notmuch will only receive raw bytes via user input (e.g. query
strings).

In which case, we could just document that notmuch interprets user input
as UTF-8 (and we might or might not mention the Latin-1 fallback).

Later locale support could be added if desired, and none of this would
involve the quite nasty problem of encoding detection.
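
For illustration only, the "UTF-8, with a Latin-1 fallback" reading of
user input described above could look roughly like the following
Python sketch. decode_user_input is a hypothetical helper, not a
notmuch function, and the sketch assumes the fallback applies only
when the bytes are not valid UTF-8:

```python
def decode_user_input(raw: bytes) -> str:
    """Hypothetical helper: read bytes as UTF-8, else fall back to Latin-1."""
    try:
        # Strict decoding raises on any invalid UTF-8 sequence.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: Latin-1 maps every byte value to a code
        # point, so this branch cannot fail.
        return raw.decode("latin-1")


valid_utf8 = decode_user_input(b"\xd1\xa1")   # valid UTF-8, one code point
lone_byte = decode_user_input(b"\xd1")        # truncated UTF-8, falls back
plain_ascii = decode_user_input(b"notmuch")   # ASCII reads the same either way
```

Under this rule no encoding detection is needed, at the cost that
Latin-1 input which happens to be valid UTF-8 (such as the two-byte
example earlier in the thread) is always read as UTF-8.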

-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4
