On Sun, Mar 18 2018, Daniel Kahn Gillmor wrote:

> https://tools.ietf.org/html/rfc3282 describes a Content-Language:
> header. https://tools.ietf.org/html/rfc8255 describes
> a multipart/multilingual Content-Type.
>
> notmuch currently uses xapian with a hard-coded English stemmer which
> works great for me as a monolingual American, but limits the
> applicability of notmuch to Anglophones (people who speak English).
> That makes me sad.
>
> AIUI, xapian is pretty much committed to being a single-language
> indexer.

Have you seen the different stemmers it already has? Reference:

https://xapian.org/docs/sourcedoc/html/dir_430c089e7e18d7ac6ff937a35cc3312c.html

> But i just wanted to point out that it's possible that we
> could be smarter about this in notmuch, and wanted to make a space for
> possible design discussion.
>
> a few concrete suggestions (intended as brainstorming, feedback welcome):
>
> * if we know our index expects english, and we have a message part that
>   *is not* english (e.g. Content-Language: es), we could avoid indexing
>   that part.

I'd prefer leaving the choice of default stemmer to the user.

> * during indexing, we could add a property to each message when we
>   discover a Content-Language header. this would let you do something
>   like "notmuch search property:lang=es" to find all messages
>   explicitly tagged as spanish.
>
> * (pretty crazy) If we're willing to search in another language we
>   could add an additional xapian database configured for that language, and
>   we could index identified parts in that language.

Do we need a separate DB if we can use different stemmers dynamically?

> * for text parts without a Content-Language: header, we could do some
>   concrete heuristics to guess the language. For example, choose the
>   1000 most popular words for each language we might know about, and
>   look for their presence in the text. Choose the language that is
>   most heavily represented, and store it in the index as a property.
>   this could be combined with the suggestions above.

+1 for heuristics.

> what do you think? what ideas are missing from the brainstorm above? I'd
> love to hear from people with multilingual mailboxes about how we might
> be able to make notmuch work better for them.

As an actively bilingual person (English and Spanish), I love this idea.

Servilio

-- 
Servilio Afre Puentes
Programmer/Analyst, SHARCNET project
RHPCS | http://www.rhpcs.mcmaster.ca
SHARCNET | https://sharcnet.ca
Compute Ontario | http://computeontario.ca
Compute/Calcul Canada | http://computecanada.ca
905-525-9140, x22540
_______________________________________________
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch
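
For illustration, here is a rough sketch of the word-frequency heuristic from the last suggestion quoted above. It is only a sketch under stated assumptions: the function name guess_language, the min_hits threshold, and the ten-word lists are all made up for the example (the proposal talks about roughly 1000 words per language), and none of this is existing notmuch or Xapian code.

```python
import re
from typing import Dict, Optional, Set

# Hypothetical, tiny word lists for illustration only. The proposal
# suggests about 1000 of the most popular words per language.
COMMON_WORDS: Dict[str, Set[str]] = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it", "for", "with"},
    "es": {"el", "la", "de", "que", "y", "en", "los", "es", "por", "con"},
}


def guess_language(text: str, min_hits: int = 3) -> Optional[str]:
    """Return the code of the most heavily represented language, or None.

    None is returned when too few common words are found, or when the
    two best-scoring languages are tied (min_hits is an assumed cutoff).
    """
    # Split into lowercase runs of letters; this also keeps accented letters.
    words = re.findall(r"[^\W\d_]+", text.lower())

    # Count how many words of the text appear in each language's list.
    scores = {
        lang: sum(1 for w in words if w in vocab)
        for lang, vocab in COMMON_WORDS.items()
    }

    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_lang, best_score = ranked[0]
    if best_score < min_hits:
        return None
    if len(ranked) > 1 and ranked[1][1] == best_score:
        return None
    return best_lang
```

The returned code is the kind of value that could be stored as the "lang" property mentioned above, so that a query like "notmuch search property:lang=es" would also cover parts without a Content-Language: header. Returning None for short or ambiguous text leaves room for falling back to whatever default stemmer the user chose.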