Re: multilingual notmuch (and Content-Language)

Subject: Re: multilingual notmuch (and Content-Language)

Date: Sun, 18 Mar 2018 21:32:35 +0200

To: Daniel Kahn Gillmor, notmuch@notmuchmail.org

Cc:

From: Jani Nikula


On Sun, 18 Mar 2018, Daniel Kahn Gillmor <dkg@fifthhorseman.net> wrote:
>  * if we know our index expects english, and we have a message part that
>    *is not* english (e.g. Content-Language: es), we could avoid indexing
>    that part.

Why would we do that? Search mostly works just fine for non-English
languages; it's just that the *stemming* is not right.

> what do you think?  what ideas are missing from the brainstorm above?  I'd
> love to hear from people with multilingual mailboxes about how we might
> be able to make notmuch work better for them.

As far as I understand it, stemming happens both at indexing and at
searching. At indexing, the term generator indexes both the full and
the stemmed version of each word. I'm wondering if we could look at
Content-Language (and, failing that, heuristics), and (if the user so
desires) use multiple term generators with different stemmers on a
per-document basis. Or use non-stemming indexing for unidentified or
unsupported languages. How far would that take us? Then, perhaps, we
could also perform language-specific queries?
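
To make the per-document idea concrete, here is a rough sketch in plain
Python. It does not use Xapian or notmuch at all. The suffix-stripping
"stemmers", the STEMMERS table, and the pick_stemmer and index_terms
helpers are all hypothetical stand-ins made up for illustration. The
only thing borrowed from Xapian is the convention that stemmed terms
carry a "Z" prefix next to the unstemmed ones.

```python
from email import message_from_string


def _strip_suffixes(suffixes):
    """Build a toy stemmer that strips the first matching suffix.

    A stand-in for a real Snowball stemmer, only here so the sketch runs.
    """
    def stem(word):
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word
    return stem


# Hypothetical table of "supported" languages (assumption, not notmuch code).
STEMMERS = {
    "en": _strip_suffixes(("ing", "ed", "s")),
    "es": _strip_suffixes(("ando", "ar", "os", "as")),
}


def pick_stemmer(message):
    """Return (language, stemmer), or (None, None) if nothing matches."""
    header = message.get("Content-Language", "")
    # Content-Language may list several tags, e.g. "es-MX, en".
    for tag in header.split(","):
        primary = tag.strip().lower().split("-")[0]
        if primary in STEMMERS:
            return primary, STEMMERS[primary]
    return None, None


def index_terms(message):
    """Return (language, terms) for a single-part text message.

    Every word is indexed unstemmed. A "Z"-prefixed stemmed term is added
    only when a stemmer for the message language is available.
    """
    lang, stem = pick_stemmer(message)
    terms = set()
    for raw in message.get_payload().lower().split():
        word = raw.strip(".,;:!?")
        if not word:
            continue
        terms.add(word)
        if stem is not None:
            terms.add("Z" + stem(word))
    return lang, terms


spanish = message_from_string("Content-Language: es\n\nbuscando correos")
finnish = message_from_string("Content-Language: fi\n\netsin viesteja")
print(index_terms(spanish))
print(index_terms(finnish))
```

The Finnish message falls through to non-stemming indexing because the
table has no "fi" entry, so plain search still works on it. A
language-specific query would then have to pick the same stemmer at
search time, which is the part I am least sure about.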

I don't know how feasible that is, or if it would require Xapian
changes.

BR,
Jani.
_______________________________________________
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch
