Re: accented characters

Subject: Re: accented characters

Date: Mon, 13 Nov 2017 15:35:15 +0100

To: David Bremner

Cc: notmuch@notmuchmail.org, Bruno Deremble

From: Stefano Zacchiroli


On Mon, Nov 13, 2017 at 09:22:36AM -0400, David Bremner wrote:
> The other thing I don't know is how many people would be happy with just
> stripping all accents. That could be done in a gmime filter, as you
> suggest. That would be more likely to require changes to the query
> language. Off hand I don't know how to transparently de-accent all query
> words.

My gut feeling is that removing accents by default from both the terms
in the index and the user's queries would go a long way toward addressing
this problem. Especially so if it's a boolean option in the notmuch config
(which defaults to stripping accents).

As a random example/data point, chromium does that: when you search for
an unaccented string in a web page, it finds every variant of it with
accents. It is, by far, my best UX w.r.t. accents on GNU/Linux.

Unicode has a notion of canonical decomposition (NFD) that rewrites
accented characters as a sequence of non-accented characters + combining
marks: https://en.wikipedia.org/wiki/Unicode_equivalence . A bunch of
libraries use that to normalize away accents in Unicode strings. I'm aware
of a few in Python, for instance, but not in C++ (which I believe is what
you'd be interested in).
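
To make the idea concrete, here is a minimal Python sketch, assuming only
the standard-library unicodedata module. The helper name strip_accents is
made up for illustration; it is not part of notmuch, and it is not
necessarily how a C++ implementation would look.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining marks (accents) removed.

    Hypothetical helper for illustration only.
    """
    # NFD splits each precomposed character into base character + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns 0 for base characters and non-zero for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Zürich café naïve"))  # prints "Zurich cafe naive"
```

The same function would have to be applied both to the terms at index
time and to the query words, so that the two sides match. Note that
letters without a canonical decomposition, such as "ø", pass through
unchanged with this approach.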

HTH,
-- 
Stefano Zacchiroli . zack@upsilon.cc . upsilon.cc/zack . . o . . . o . o
Computer Science Professor . CTO Software Heritage . . . . . o . . . o o
Former Debian Project Leader & OSI Board Director  . . . o o o . . . o .
« the first rule of tautology club is the first rule of tautology club »
_______________________________________________
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch