RE: [PATCH] test: add known broken test for indexing html

Subject: RE: [PATCH] test: add known broken test for indexing html

Date: Sat, 18 Mar 2017 12:04:07 -0300

To: Jeffrey Stedfast, notmuch@notmuchmail.org

Cc:

From: David Bremner


Jeffrey Stedfast <jestedfa@microsoft.com> writes:

> Hi David,
>
> Base64 encoded inline image data is always within the src attribute value of an <img> tag and will always begin with "data:" followed by the mime-type and then followed by ";base64," so it's pretty easy to spot.
>
> While on this topic, why index HTML attribute values at all? Other than perhaps some known ones like perhaps the 'alt' value of <img> tags?
>
> I would argue that the only portion of any HTML that you should be indexing at all for searching is the character data between tags.

We're not currently parsing the HTML, so none of these distinctions
(tags, attribute values, character data) are really available to
us. Maybe adding an HTML parser is the right solution, but it's
non-trivial.
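
For illustration only, here is roughly what "index only the character
data between tags" could look like. This is a sketch in Python using the
standard library's html.parser, not anything notmuch does today. The
names TextOnly and indexable_text are made up for the example, and
skipping <script> and <style> content is an extra assumption on my
part. Since attribute values are never collected, a base64 "data:" URI
in an <img> src never reaches the indexer.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect character data between tags; attribute values are ignored."""

    # Assumption: script/style bodies are not useful search terms.
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # attrs (including src="data:...;base64,...") is deliberately unused.
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def indexable_text(html):
    """Return the text a (hypothetical) indexer would see for an HTML part."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    sample = '<p>Hello <img src="data:image/png;base64,iVBORw0KGgo=" alt="logo"> world</p>'
    print(indexable_text(sample))  # prints "Hello world"
```

Keeping something like the 'alt' text would mean looking at attrs in
handle_starttag for <img> only, which is a policy choice rather than a
parsing problem.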

d
