Subject: Re: regex search in the body

Date: Sun, 27 Apr 2025 23:11:52 +0100

To: David Bremner

Cc: notmuch@notmuchmail.org

From: Olly Betts


On Fri, Apr 25, 2025 at 06:41:48AM -0300, David Bremner wrote:
> Is it possible (and sensible) to have multiple stemmers active at the
> same time? Most people who want to stem a non-English language will
> probably have a multilingual mail store.

The short answer is yes.

The appropriate way to handle them depends on what you want to support.

As a general design point, where feasible I'd suggest making the
assumption that each email is in a single language.  It's possible
to not assume this, but unless you have reliable tagging of the
languages of spans of text (which is rare in practice for any
document collection) you would need to decide on a granularity (e.g.
leaf MIME part, paragraph, sentence, etc) and detect the language (e.g.
with textcat or similar).  It's also more complex to implement, slower
to search, and language detection of shorter spans of text tends to be
less reliable.  For notmuch you can probably have the user configure
a list of languages to restrict detection to, which may help.
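
As a rough illustration of restricting detection to a user-configured
list, here is a minimal sketch.  It is a toy stopword counter, not
textcat (which uses character n-gram profiles), and the stopword
tables and the name `detect_language` are made up for this example.
It expects lowercase, whitespace-separated text.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stopword tables, far too small for real use.
static const std::map<std::string, std::set<std::string>> kStopwords = {
    {"en", {"the", "and", "is", "of", "to"}},
    {"fr", {"le", "la", "et", "est", "les", "des"}},
    {"de", {"der", "die", "und", "ist", "das", "nicht"}},
};

// Return the language from `allowed` whose stopwords occur most often in
// `text`, or "" if none of the allowed languages matches at all.
// Languages outside `allowed` are never considered.
std::string detect_language(const std::string& text,
                            const std::vector<std::string>& allowed) {
    std::string best;
    int best_hits = 0;
    for (const auto& lang : allowed) {
        auto it = kStopwords.find(lang);
        if (it == kStopwords.end()) continue;  // no table for this language
        int hits = 0;
        std::istringstream words(text);
        std::string w;
        while (words >> w) {
            if (it->second.count(w)) ++hits;
        }
        if (hits > best_hits) {
            best_hits = hits;
            best = lang;
        }
    }
    return best;
}
```

An empty result could be treated as "unknown" and indexed without
stemming, though that policy is an assumption rather than something
notmuch does today.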

With this single-language assumption, you can shard the database by
language so all the English documents go in one Xapian DB, all the
French in another, etc.  You don't have to, but if you do, a
single-language search can just open one shard, and a multi-language
search will run over all the shards but can be optimised better.
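
The shard selection itself is simple.  A minimal sketch, assuming one
database directory per language code (the `ShardMap` type, the paths
and the function name are hypothetical):

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical layout: language code -> path of that language's shard.
using ShardMap = std::map<std::string, std::string>;

// Return the shard paths a search needs to open.  An empty `query_lang`
// means a multi-language search, which has to cover every shard.
std::vector<std::string> shards_for_query(const ShardMap& shards,
                                          const std::string& query_lang) {
    std::vector<std::string> paths;
    if (!query_lang.empty()) {
        // Single-language search: at most one shard is needed.
        auto it = shards.find(query_lang);
        if (it != shards.end()) paths.push_back(it->second);
        return paths;
    }
    for (const auto& entry : shards) paths.push_back(entry.second);
    return paths;
}
```

With Xapian, the returned paths could each be opened and combined into
one searchable object with Xapian::Database::add_database().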

Then you need to decide how queries are handled.  There are two
connected questions here - what to search and how to determine the
language of the query.  Some options:

* A French query only searches French documents, etc.  There could also
  be a "universal" search mode which searches everything without any
  stemming (e.g. if you want to be able to find any email referring to a
  person/place/thing/etc).  Either you need to detect the query language
  (probably not reliable enough) or rely on the user telling you.

* The query string is stemmed for each possible language and these are
  combined with suitable filters, something like this (where `Len` is
  a boolean term indexing the English documents, etc):

  query = (query_parsed_with_en_stemmer & (Query("Len") * 0))
        | (query_parsed_with_fr_stemmer & (Query("Lfr") * 0))
        | (query_parsed_with_de_stemmer & (Query("Lde") * 0));

  (Here `* 0` just zeros the scale factors to give boolean filters.)

  Advantages are that you don't need to determine (or be told) the
  query's language, and misdetected languages are still likely to work
  fairly well (especially as we're most likely to misdetect a related
  language).

  The main downside is a more complex query, but if you shard by
  language then term `Len` indexes exactly all the documents in
  a single shard and nothing else, and the query optimiser and matcher
  will spot that and shortcut so it's more like running each subquery on
  its respective shard.  (The approach still works without the sharding,
  and you may find it's fast enough with everything in a single
  database.)
  
* A hybrid of the two:

  query = (query_parsed_with_fr_stemmer & (Query("Lfr") * 0))
        | (query_parsed_without_stemming & ~Query("Lfr"));

  Here you still need to guess (or be told) a language for the query,
  but using the wrong language mostly just means an unstemmed search.
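
To make the matching behaviour of the last two options concrete, here
is a toy in-memory model.  It uses only the standard library, not the
Xapian API, and everything in it is an assumption for illustration:
the stemmers are crude suffix strippers rather than Snowball, and
`Doc`, `stem`, `make_doc` and the `matches_*` functions are made-up
names.  Stemmed terms get a "Z" prefix, following Xapian's convention,
and the unstemmed forms are indexed alongside them.

```cpp
#include <set>
#include <string>
#include <vector>

// A document: its language tag plus the terms indexed for it.
struct Doc {
    std::string lang;
    std::set<std::string> terms;
};

// Strip the first matching suffix, never leaving an empty stem.
static std::string strip_first(const std::string& w,
                               const std::vector<std::string>& suffixes) {
    for (const auto& suf : suffixes) {
        if (w.size() > suf.size() &&
            w.compare(w.size() - suf.size(), suf.size(), suf) == 0)
            return w.substr(0, w.size() - suf.size());
    }
    return w;
}

// Toy per-language stemmers; unknown languages are left unstemmed.
std::string stem(const std::string& lang, const std::string& w) {
    if (lang == "en") return strip_first(w, {"ing", "s"});
    if (lang == "fr") return strip_first(w, {"x", "s"});
    return w;
}

// Index each word both unstemmed and as a "Z"-prefixed stem.
Doc make_doc(const std::string& lang, const std::vector<std::string>& words) {
    Doc d{lang, {}};
    for (const auto& w : words) {
        d.terms.insert(w);
        d.terms.insert("Z" + stem(lang, w));
    }
    return d;
}

// Second option: stem the query word once per language and accept a
// stemmed match only from documents tagged with that language.
bool matches_any_language(const Doc& d, const std::string& word,
                          const std::vector<std::string>& langs) {
    for (const auto& lang : langs) {
        if (d.lang == lang && d.terms.count("Z" + stem(lang, word)))
            return true;
    }
    return false;
}

// Hybrid option: stemmed match for documents in the guessed language,
// unstemmed match for every other document.
bool matches_hybrid(const Doc& d, const std::string& word,
                    const std::string& guess) {
    if (d.lang == guess) return d.terms.count("Z" + stem(guess, word)) > 0;
    return d.terms.count(word) > 0;
}
```

In this model an English document containing "dogs" is found by the
query "dog" under the second option, but under the hybrid with a wrong
guess of French only the exact form "dogs" finds it.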

Cheers,
    Olly
_______________________________________________
notmuch mailing list -- notmuch@notmuchmail.org
To unsubscribe send an email to notmuch-leave@notmuchmail.org
