Re: regex search in the body

Subject: Re: regex search in the body

Date: Wed, 26 Mar 2025 06:11:39 -0300

To: Peter Münster, notmuch@notmuchmail.org

Cc:

From: David Bremner


Peter Münster <pm@a16n.net> writes:

> On Thu, Mar 20 2025, David Bremner wrote:
>
>> not getting enough matches with a word search?
>
> Yes, indeed.
>
>
>> Can you give me an example of the kind of search you are trying to do?
>
> I would like to find all messages with the substring "identité":
> - identité
> - identités
> - l'identité
> - l’identité
> - d'identité
> - d’identité

Thanks for explaining your use case. I previously thought that regex
search would not be helpful on single terms, but I can see it would be a
workaround for notmuch's inadequate unilingual stemming (which is a
harder problem to fix).

The following source change seems to enable it, at least for
s-expression queries:

diff --git a/lib/parse-sexp.cc b/lib/parse-sexp.cc
index 930888e9..7ce218fe 100644
--- a/lib/parse-sexp.cc
+++ b/lib/parse-sexp.cc
@@ -85,7 +85,7 @@ static _sexp_prefix_t prefixes[] =
     { "attachment",     Xapian::Query::OP_AND,          SEXP_INITIAL_MATCH_ALL,
       SEXP_FLAG_FIELD | SEXP_FLAG_WILDCARD | SEXP_FLAG_EXPAND },
     { "body",           Xapian::Query::OP_AND,          SEXP_INITIAL_MATCH_ALL,
-      SEXP_FLAG_FIELD },
+      SEXP_FLAG_FIELD | SEXP_FLAG_REGEX },
     { "date",           Xapian::Query::OP_INVALID,      SEXP_INITIAL_MATCH_ALL,
       SEXP_FLAG_RANGE },
     { "from",           Xapian::Query::OP_AND,          SEXP_INITIAL_MATCH_ALL,

The test suite and documentation would need to be adjusted, but I think
we could probably support that in the next major release of notmuch
(0.40). If you are comfortable building from source you can of course
just make the change in your build of notmuch.

With that change your query could be done as

   NOTMUCH_DEBUG_QUERY=t ./notmuch count --query=sexp '(body (rx identité))'

It does take about 5 seconds to run on this fairly fast computer, with
my ~800k messages.
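
To illustrate why a regex helps with the list above, here is a minimal
Python sketch (not notmuch code). It assumes the regex is tried against
each indexed body term in turn; the term list and the helper names
exact_matches and regex_matches are made up for the example.

```python
import re

# Hypothetical sample of body terms; a real index holds far more.
TERMS = [
    "identité",
    "identités",
    "l'identité",
    "l’identité",
    "d'identité",
    "d’identité",
    "identique",  # should not match
]


def exact_matches(word, terms):
    """Terms equal to the search word (roughly what a plain word search finds)."""
    return [t for t in terms if t == word]


def regex_matches(pattern, terms):
    """Terms that contain a match for the regex anywhere in the term."""
    rx = re.compile(pattern)
    return [t for t in terms if rx.search(t)]


if __name__ == "__main__":
    print(exact_matches("identité", TERMS))
    print(regex_matches("identité", TERMS))
```

In this sketch the exact comparison finds only the bare word, while the
regex picks up all six variants. It also hints at the cost: every term
has to be tested against the regex.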

Emacs integration would be a separate question, and would probably
require a hard build dependency on the sfsexp library, but that is a
discussion already started.

In principle a similar change should work for the Xapian (infix) query
parser, but unfortunately there are some complications that I didn't
manage to (quickly) debug.  So I don't know if we can support the infix
syntax or not. I don't think that's a blocker, as there are already
several kinds of search that are only supported in the s-expression
query syntax.

> And, less important, it would be nice (it fails with mu) to search in
> html-only messages. Example:
>
> "/v.*hicule/" should match "v&eacute;hicule"
>

This won't work in notmuch either, because "v&eacute;hicule" is indexed
as two or three terms (words).
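
A rough Python sketch of the problem, assuming the indexer splits text on
non-word characters (a crude stand-in, not the real term generator, which
may split differently):

```python
import html
import re


def crude_terms(text):
    """Split text into runs of word characters (illustrative only)."""
    return re.findall(r"\w+", text)


def any_term_matches(pattern, terms):
    """True if the regex matches inside at least one single term."""
    rx = re.compile(pattern)
    return any(rx.search(t) for t in terms)


raw = "v&eacute;hicule"          # as it appears in the HTML source
decoded = html.unescape(raw)     # entity decoded to the actual character

raw_terms = crude_terms(raw)
decoded_terms = crude_terms(decoded)

if __name__ == "__main__":
    print(raw_terms, any_term_matches("v.*hicule", raw_terms))
    print(decoded_terms, any_term_matches("v.*hicule", decoded_terms))
```

Because the regex is tested against one term at a time, it cannot span
the pieces left by the undecoded entity; it would only match if the
entity were decoded before indexing.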
_______________________________________________
notmuch mailing list -- notmuch@notmuchmail.org
To unsubscribe send an email to notmuch-leave@notmuchmail.org
