Re: how to search for hyphenated words? (was: how to search for Morse code?)

Subject: Re: how to search for hyphenated words? (was: how to search for Morse code?)

Date: Wed, 13 Mar 2019 11:23:34 -0700

To: David Bremner, Carl Worth, Gregor Zattler, notmuch@notmuchmail.org

From: Matt Armstrong


David Bremner <david@tethera.net> writes:

> Matt Armstrong <marmstrong@google.com> writes:
>
>> Carl Worth <cworth@cworth.org> writes:
>>
>>> Hi Gregor,
>>>
>>> The trick here is that when notmuch is indexing body text it feeds it
>>> into a Xapian function that parses the text by finding "terms" in the
>>> text. And this parser considers both punctuation and whitespace as
>>> separators between terms.
>>
>> I notice that Xapian supports something called "phrase searches",
>> documented as:
>>
>>   "A phrase surrounded with double quotes ("") matches documents
>>   containing that exact phrase. Hyphenated words are also treated as
>>   phrases, as are cases such as filenames and email addresses
>>   (e.g. /etc/passwd or president@whitehouse.gov)."
>>
>> I assume that this particular Xapian feature is unavailable in notmuch?
>> If so, I wonder if enabling it has ever been considered?
>
> It is enabled, and documented in notmuch-search-terms(7). Unfortunately
> I don't think it's related to the original request. The mention of
> hyphenated words is about the input to the query parser, not
> (necessarily) the retrieved text.

Ah, so it boils down to the Xapian definition of "exact phrase."
Notably, "exact phrase" is not "identical sequence of characters" as
some people might expect.

Quick tests with various search engines show that their phrase searches
work the same way.  E.g. searching for "org notmuch" finds all sorts of
results:

  org-notmuch.el
  notmuchmail.org/notmuch-emacs/
  to:devicetree@vger.kernel.org notmuch tag +inbox +unread -new
  (require 'org-notmuch nil t)
  https://notmuchmail.org/notmuch-emacs/. *
  imaps://mail.example.org/Notmuch/search
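
To illustrate why all of these count as matches, here is a rough Python
sketch.  It assumes a much-simplified tokenizer that lowercases the text
and treats every non-alphanumeric character as a separator.  This is not
Xapian's actual term generator (which also handles stemming, prefixes and
a few special characters), and the function names are made up for the
example.

```python
import re


def terms(text):
    """Split text into lowercase terms.

    Simplifying assumption: anything that is not an ASCII letter or digit
    separates terms, so punctuation and whitespace are equivalent.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_match(phrase, text):
    """Return True if the terms of phrase occur consecutively in text."""
    want = terms(phrase)
    have = terms(text)
    n = len(want)
    return any(have[i:i + n] == want for i in range(len(have) - n + 1))


if __name__ == "__main__":
    samples = [
        "org-notmuch.el",
        "notmuchmail.org/notmuch-emacs/",
        "(require 'org-notmuch nil t)",
        "imaps://mail.example.org/Notmuch/search",
        "notmuch org",  # same terms in the wrong order: not a phrase match
    ]
    for sample in samples:
        print(phrase_match("org notmuch", sample), sample)
```

Under that assumption "org-notmuch.el" becomes the terms org, notmuch, el,
so the phrase "org notmuch" matches it just as it would match the two
words separated by a space.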

For what it is worth, I have taken to using period separators in the
notmuch phrase searches I write, both in scripts and interactively.
Periods avoid the confusion of quoting double-quoted strings in the
shell, and the query always remains a single shell "word."  They are
also, most often, clearly not the exact content I'm searching for, so
they make it clear that the match algorithm is inexact.  E.g.

  subject:notmuch.is.wonderful

instead of:

  subject:"notmuch is wonderful"
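
The shell-quoting point can be seen with Python's shlex module, which
splits a command line the way a POSIX shell would.  This is only a sketch:
the helper name shell_words is made up, and nothing here runs notmuch
itself, it just shows which arguments the program would receive.

```python
import shlex


def shell_words(command_line):
    """Split a command line into words the way a POSIX shell would."""
    return shlex.split(command_line)


if __name__ == "__main__":
    # The dotted form survives the shell untouched, as one word.
    print(shell_words("notmuch search subject:notmuch.is.wonderful"))

    # With bare double quotes the shell removes the quotes, so the
    # program never sees them.
    print(shell_words('notmuch search subject:"notmuch is wonderful"'))

    # To pass the double quotes through, they need an extra layer of
    # shell quoting.
    print(shell_words("""notmuch search 'subject:"notmuch is wonderful"'"""))
```

In the second case the argument arrives as subject:notmuch is wonderful,
without the double quotes that mark it as a phrase, which is the kind of
confusion the period form sidesteps.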
