Re: [PATCH 2/5] util: Function to parse boolean term queries

Subject: Re: [PATCH 2/5] util: Function to parse boolean term queries

Date: Tue, 25 Dec 2012 20:23:00 -0500

To: David Bremner

Cc: notmuch@notmuchmail.org

From: Austin Clements


Quoth David Bremner on Dec 25 at 10:18 am:
> Austin Clements <amdragon@MIT.EDU> writes:
> 
> > +    if (consume_double_quote (&pos)) {
> > +	char *out = talloc_strdup (ctx, pos);
> > +	pos = *term_out = out;
> > +	while (1) {
> 
> Overall the control flow here is a bit tricky to follow. I'm not sure if
> a real loop condition would help or make it worse.
> 
> > +	    if (! *pos) {
> > +		/* Premature end of string */
> > +		goto FAIL;
> > +	    } else if (*pos == '"') {
> > +		if (*++pos != '"')
> > +		    break;
> > +	    } else if (consume_double_quote (&pos)) {
> > +		break;
> > +	    }
> 
> I'm confused by the asymmetry here. Quoted strings can start with
> unicode quotes, but internally can only have ascii '"'? Is this
> documented somewhere?

Only in the source, to my knowledge.  Here's how Xapian parses these
things (where 'it' is a UTF8 string iterator):

if (it != end && is_double_quote(*it)) {
    // Quoted boolean term (can contain any character).
    ++it;
    while (it != end) {
	if (*it == '"') {
	    // Interpret "" as an escaped ".
	    if (++it == end || *it != '"')
		break;
	} else if (is_double_quote(*it)) {
	    ++it;
	    break;
	}
	Unicode::append_utf8(name, *it++);
    }
} else {
    // Can't boolean filter prefix a subexpression, so
    // just use anything following the prefix until the
    // next space or ')' as part of the boolean filter
    // term.
    while (it != end && *it > ' ' && *it != ')')
	Unicode::append_utf8(name, *it++);
}

> > +    } else {
> > +	while (*pos > ' ' && *pos != ')')
> > +	    ++pos;
> > +	if (*pos)
> > +	    goto FAIL;
> > +    }
> 
> So if there is no quote, we skip the part after the ':'? I guess I
> probably missed something because that doesn't sound like the intended
> behaviour.

This isn't skipping it; it's checking its well-formedness.  In this
case, *term_out already points to a correct string that can be used
literally; we just have to check that there's no trailing garbage
after the boolean query.

This is certainly worth commenting.

For the record, I also tried passing the query straight to the
library, without parsing it in the CLI (and simply checking that the
query returned exactly one result), and it was noticeably slower (the
restore performance test took between 3.82 and 5.25 seconds for the
code in this series and ~7.2 seconds using a general query.)

Thread: