On Fri, Apr 25, 2025 at 06:41:48AM -0300, David Bremner wrote:
> Is it possible (and sensible) to have multiple stemmers active at the
> same time? Most people who want to stem a non-English language will
> probably have a multilingual mail store.

The short answer is yes. The appropriate way to handle them depends on
what you want to support.

As a general design point, where feasible I'd suggest making the
assumption that each email is in a single language. It's possible to
not assume this, but unless you have reliable tagging of the languages
of spans of text (which is rare in practice for any document
collection) you would need to decide on a granularity (e.g. leaf MIME
part, paragraph, sentence, etc) and detect the language (e.g. with
textcat or similar). It's also more complex to implement, slower to
search, and language detection of shorter spans of text tends to be
less reliable. For notmuch you can probably have the user configure a
list of languages to limit the detection to, which may help.

With this single-language assumption, you can shard the database by
language so all the English documents go in one Xapian DB, all the
French in another, etc. You don't have to, but if you do, a
single-language search can just open one shard, and a multi-language
search will run over all the shards but can be optimised better.

Then you need to decide how queries are handled. There are two
connected questions here - what to search and how to determine the
language of the query. Some options:

* A French query only searches French documents, etc. There could
  also be a "universal" search mode which searches everything without
  any stemming (e.g. if you want to be able to find any email referring
  to a person/place/thing/etc). Either you need to detect the query
  language (probably not reliable enough) or rely on the user telling
  you.
* The query string is stemmed for each possible language and these are
  combined with suitable filters, something like this (where `Len` is a
  boolean term indexing the English documents, etc):

      query = (query_parsed_with_en_stemmer & (Query("Len") * 0)) |
              (query_parsed_with_fr_stemmer & (Query("Lfr") * 0)) |
              (query_parsed_with_de_stemmer & (Query("Lde") * 0));

  (Here `* 0` just zeros the scale factors to give boolean filters.)

  Advantages are that you don't need to determine (or be told) the
  query's language, and misdetected languages are still likely to work
  fairly well (especially as we're most likely to misdetect a related
  language).

  The main downside is a more complex query, but if you shard by
  language then term `Len` indexes exactly all the documents in a
  single shard and nothing else, and the query optimiser and matcher
  will spot that and shortcut so it's more like running each subquery
  on its respective shard. (The approach still works without the
  sharding, and you may find it's fast enough with everything in a
  single database.)

* A hybrid of the two:

      query = (query_parsed_with_fr_stemmer & (Query("Lfr") * 0)) |
              (query_parsed_without_stemming & ~Query("Lfr"));

  Here you still need to guess (or be told) a language for the query,
  but using the wrong language mostly just means an unstemmed search.

Cheers,
    Olly
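
A minimal sketch of the second option above (stem the query once per
language, filter each subquery by that language's boolean term, and OR
the results), written as a toy in-memory index in plain Python rather
than against Xapian. Everything in it is an assumption made for
illustration: `toy_stem` is a crude suffix stripper standing in for a
real stemmer, and `index`, `search_all_languages` and the sample
documents are hypothetical names, not notmuch or Xapian API.

```python
# Toy stand-ins for real stemmers: strip one plural-like suffix.
# Illustrative assumption only; these are not the Snowball algorithms.
TOY_SUFFIXES = {"en": ("s",), "fr": ("s", "x")}


def toy_stem(word, lang):
    word = word.lower()
    for suffix in TOY_SUFFIXES[lang]:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def index(docs):
    """Build term -> set of doc ids from {doc_id: (lang, text)}."""
    postings = {}
    for doc_id, (lang, text) in docs.items():
        # "L" + lang plays the role of the boolean terms Len, Lfr, ...
        terms = {"L" + lang}
        # Each document is stemmed with its own language's stemmer only.
        terms |= {"Z" + toy_stem(w, lang) for w in text.split()}
        for t in terms:
            postings.setdefault(t, set()).add(doc_id)
    return postings


def search_all_languages(postings, query, langs):
    """OR together one language-filtered subquery per configured language."""
    words = query.split()
    if not words:
        return set()
    hits = set()
    for lang in langs:
        # Start from the language filter, then AND in each stemmed term.
        matched = set(postings.get("L" + lang, set()))
        for w in words:
            matched &= postings.get("Z" + toy_stem(w, lang), set())
        hits |= matched
    return hits


docs = {
    1: ("en", "stemming mailing lists"),
    2: ("fr", "les listes de diffusion"),
    3: ("en", "unrelated message"),
}
postings = index(docs)
```

The stemmed terms of all languages share one namespace here, as they
would in a single unsharded database, which is why each subquery needs
its language filter: without it, a query stemmed with one language's
rules could match documents that were indexed with another's.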