Re: multilingual notmuch (and Content-Language)

Subject: Re: multilingual notmuch (and Content-Language)

Date: Sun, 18 Mar 2018 21:32:35 +0200

To: Daniel Kahn Gillmor,


From: Jani Nikula

On Sun, 18 Mar 2018, Daniel Kahn Gillmor <> wrote:
>  * if we know our index expects english, and we have a message part that
>    *is not* english (e.g. Content-Language: es), we could avoid indexing
>    that part.

Why would we do that? Search mostly works just fine for non-English
languages, it's just that the *stemming* is not right.

> what do you think?  what ideas are missing from the brainstorm above?  I'd
> love to hear from people with multilingual mailboxes about how we might
> be able to make notmuch work better for them.

With my limited understanding of this, stemming happens both at indexing
and at search time. At indexing, the term generator indexes both the
full and the stemmed form of each word. I'm wondering if we could look
at Content-Language (and, failing that, heuristics), and (if the user so
desires) use multiple term generators with different stemmers on a
per-document basis. Or, use non-stemming indexing for unidentified or
unsupported languages. How far would that take us? Then, perhaps, we
could also perform language-specific queries?
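
To make the per-document idea a bit more concrete, here is a rough
sketch in Python of just the selection step. It is an illustration, not
notmuch code: the pick_stemmer helper and the STEMMER_FOR_LANG table are
made up for this example, and the fallback to "none" reflects the
non-stemming option mentioned above. The returned names are meant to be
the kind of language names a Xapian stemmer is constructed with.

```python
from email import message_from_string

# Hypothetical mapping from Content-Language primary subtags to stemmer
# language names (assumption: these are names a Xapian stemmer accepts).
STEMMER_FOR_LANG = {
    "en": "english",
    "es": "spanish",
    "fi": "finnish",
    "de": "german",
}


def pick_stemmer(raw_message, default="none"):
    """Return a stemmer name for one message, based on Content-Language.

    Falls back to `default` (no stemming) when the header is missing or
    names a language we have no stemmer for.
    """
    msg = message_from_string(raw_message)
    header = msg.get("Content-Language")
    if not header:
        return default
    # The header may list several tags ("es, en"); take the first one and
    # keep only its primary subtag ("en-US" -> "en").
    first = header.split(",")[0].strip().lower()
    primary = first.split("-")[0]
    return STEMMER_FOR_LANG.get(primary, default)
```

The indexer would then hand the result to whichever term generator it
uses for that document. Heuristic language detection for messages
without the header is left out here entirely.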

I don't know how feasible that is, or if it would require Xapian
