multilingual notmuch (and Content-Language)

Date: Sun, 18 Mar 2018 15:02:31 +0000



From: Daniel Kahn Gillmor

There is a specification that describes a Content-Language:
header.  Another describes a multipart/multilingual Content-Type.

notmuch currently uses xapian with a hard-coded English stemmer which
works great for me as a monolingual American, but limits the
applicability of notmuch to anglophones (people who speak English).
That makes me sad.

AIUI, xapian is pretty much committed to being a single-language
indexer.  But i just wanted to point out that it's possible that we
could be smarter about this in notmuch, and wanted to make a space for
possible design discussion.

a few concrete suggestions (intended as brainstorming, feedback welcome):

 * if we know our index expects english, and we have a message part that
   *is not* english (e.g. Content-Language: es), we could avoid indexing
   that part.

 * during indexing, we could add a property to each message when we
   discover a Content-Language header.  this would let you do something
   like "notmuch search property:lang=es" to find all messages
   explicitly tagged as spanish.

 * (pretty crazy) If we're willing to search in another language we
   could add an additional xapian database configured for that language, and
   we could index identified parts in that language.

 * for text parts without a Content-Language: header, we could do some
   concrete heuristics to guess the language.  For example, choose the
   1000 most popular words for each language we might know about, and
   look for their presence in the text.  Choose the language that is
   most heavily represented, and store it in the index as a property.
   this could be combined with the suggestions above.
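
to make the first and last suggestions a bit more concrete, here is a
rough sketch in python (not notmuch code, and not a proposal for the
actual implementation).  it uses only the standard library's email
module.  the names COMMON_WORDS, guess_language, part_language and
indexable_parts are made up for illustration, and the word lists are
tiny stand-ins for the "1000 most popular words" idea.  actually
storing the result as a message property is left out.

```python
import re
from email import message_from_string

# Hypothetical, tiny word lists for illustration only; the idea in the
# text is to use something like the 1000 most popular words per language.
COMMON_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it", "for", "with"},
    "es": {"el", "la", "de", "que", "y", "en", "los", "es", "por", "con"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "mit", "den", "ein", "zu"},
}


def guess_language(text):
    """Return the language whose common words occur most often, or None."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(1 for w in words if w in vocab)
        for lang, vocab in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def part_language(part):
    """Prefer a declared Content-Language; otherwise fall back to guessing."""
    declared = part.get("Content-Language")
    if declared:
        # Keep only the primary subtag of the first tag: "es-MX, en" -> "es".
        primary = str(declared).split(",")[0].strip().split("-")[0].lower()
        if primary:
            return primary
    payload = part.get_payload(decode=True)
    if payload is None:
        return None
    charset = part.get_content_charset() or "utf-8"
    try:
        text = payload.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset label; fall back to utf-8 rather than failing.
        text = payload.decode("utf-8", errors="replace")
    return guess_language(text)


def indexable_parts(msg, index_lang="en"):
    """Yield (part, language, should_index) for each text part.

    Parts with an unknown language are still indexed; only parts known
    to be in a different language than the index are skipped.
    """
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        lang = part_language(part)
        yield part, lang, (lang is None or lang == index_lang)
```

with something like this, a part declared as "Content-Language: es-MX"
would be reported as "es" and skipped by an english index, while an
unlabeled part would be classified by the word-count heuristic.  short
texts and closely related languages will be misclassified by such a
simple counter, so treating the guess as a hint seems safer than
treating it as a fact.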

what do you think?  what ideas are missing from the brainstorm above?  I'd
love to hear from people with multilingual mailboxes about how we might
be able to make notmuch work better for them.


