Subject: Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]

Date: Wed, 02 Sep 2015 21:45:12 -0500

To: David Bremner, notmuch@notmuchmail.org

From: Rob Browning


David Bremner <david@tethera.net> writes:

> One way to break this up into more bite sized pieces would be to first
> create one or more tests that fail with current notmuch, and mark those
> as broken.

Right - for the moment I just wanted to post what I had for
consideration.  I didn't want to spend too much more time on the
approach if it turned out to be uninteresting or inappropriate.

One simple place to start might be the included T570-normalization.sh.
Though perhaps that should be "canonicalization"?
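
To make that concrete, here's a rough, untested sketch of the kind of
equivalence such a test would exercise, using glib's g_utf8_normalize()
(notmuch already links against glib).  I'm assuming NFC as the
canonical form here, though that's exactly the sort of thing we'd need
to decide:

    #include <glib.h>
    #include <stdio.h>
    #include <string.h>

    int
    main (void)
    {
        /* "é" in two canonically equivalent UTF-8 encodings: */
        const char *nfc = "\xc3\xa9";   /* U+00E9, precomposed */
        const char *nfd = "e\xcc\x81";  /* 'e' + U+0301 combining acute */

        /* The raw bytes differ... */
        printf ("equal before: %s\n", strcmp (nfc, nfd) ? "no" : "yes");

        /* ...but normalizing both to NFC makes them compare equal. */
        gchar *a = g_utf8_normalize (nfc, -1, G_NORMALIZE_DEFAULT_COMPOSE);
        gchar *b = g_utf8_normalize (nfd, -1, G_NORMALIZE_DEFAULT_COMPOSE);
        printf ("equal after:  %s\n", strcmp (a, b) ? "no" : "yes");

        g_free (a);
        g_free (b);
        return 0;
    }

A test along those lines could index one form and verify that a search
for the other form still matches.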

> Can you explain why notmuch is the right place to do this, and not
> Xapian? I know we talked back and forth about this, but I never really
> got a solid sense of what the conclusion was. Is it just dependencies?

I have no strong opinion there, but doing the work in Xapian would
require a new Xapian release at a minimum, and likely new dependencies.

And generally speaking, I suspect that application needs with respect
to encoding "detection", tokenization, stemming, stop words, synonyms,
phrase detection, etc. may be domain-specific and complex enough that
Xapian won't want to try to accommodate the full range of
possibilities, at least not in its core library.

Though it might try to handle some or all of that by providing suitable
customizability (presumably via callbacks or subclassing or...).  And
since I'm new to Xapian, I'm not completely sure what's already
available.

> It seems plausible to specify UTF-8 input for the library, but what
> about the CLI? It seems like the canonicalization operation increases
> the chance of mangling user input in non-UTF-8 locales.

Yes, that's the key question: what does notmuch intend?  I.e. given a
sequence of bytes, how will notmuch interpret them?  I think we should
decide that and document it clearly somewhere.

The commit message describes my understanding of how things currently
work, and if/when I get time, I'd like to propose some related
documentation updates (perhaps to notmuch-search-terms or
notmuch-insert/new?).

Oh, and if I do understand things correctly, notmuch may already stand
a chance of mangling any input that happens to be a valid UTF-8 byte
sequence but was actually written in some other encoding (excepting
encodings that are a strict subset of UTF-8, like ASCII).

For example (if I did this right), the byte pair [0xd1 0xa1] is valid
UTF-8, producing Cyrillic omega "ѡ" (U+0461), and is also valid
Latin-1, producing "Ñ¡".
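
If it helps, a small glib program shows the ambiguity; both readings go
through without complaint (the g_convert() call assumes the usual iconv
"ISO-8859-1" codeset name):

    #include <glib.h>
    #include <stdio.h>

    int
    main (void)
    {
        const char *bytes = "\xd1\xa1";

        /* Read as UTF-8: a single character, U+0461. */
        printf ("valid UTF-8? %s\n",
                g_utf8_validate (bytes, -1, NULL) ? "yes" : "no");

        /* Read as Latin-1 and converted to UTF-8: two characters, "Ñ¡". */
        gchar *from_latin1 = g_convert (bytes, -1, "UTF-8", "ISO-8859-1",
                                        NULL, NULL, NULL);
        printf ("as Latin-1:  %s\n", from_latin1);
        g_free (from_latin1);
        return 0;
    }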

> I suppose some upgrade code to canonicalize all the terms? That sounds
> pretty slow.

Perhaps, or you could just document that older indexed data might not
be canonicalized, and that you should reindex if that matters to you.
Although anyone with affected characters might well want to reindex
anyway, if the canonical form isn't the one people normally receive
(which seemed possible).

Hmm, another question -- for terms, does notmuch store ordinal
positions, Unicode character offsets, input byte offsets, or...?
Canonicalization will of course change the latter two.
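
If the stored positions are ordinal (per token), canonicalization
shouldn't disturb them, but byte and character offsets can both shift.
A quick, untested illustration with glib:

    #include <glib.h>
    #include <stdio.h>
    #include <string.h>

    int
    main (void)
    {
        /* "café" with a decomposed accent: 6 bytes, 5 code points. */
        const char *nfd = "caf" "e\xcc\x81";

        /* After NFC composition: 5 bytes, 4 code points. */
        gchar *nfc = g_utf8_normalize (nfd, -1, G_NORMALIZE_DEFAULT_COMPOSE);

        printf ("bytes: %zu -> %zu\n", strlen (nfd), strlen (nfc));
        printf ("chars: %ld -> %ld\n",
                (long) g_utf8_strlen (nfd, -1),
                (long) g_utf8_strlen (nfc, -1));

        g_free (nfc);
        return 0;
    }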

I imagine it might be possible to traverse the index terms and just
detect and merge those affected, but I have no idea whether that would
be practical.

> I really didn't look at the code very closely, but there were a
> surprising number of calls to talloc_free. But those kind of details can
> wait.

Right, I wasn't sure what the policies were, so in most cases, I just
tried to release the data when it was no longer needed.

Thanks
-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4
