Subject: RE: [PATCH] test: add known broken test for indexing html

Date: Sat, 18 Mar 2017 12:08:27 -0300

To: Jeffrey Stedfast, notmuch@notmuchmail.org

Cc:

From: David Bremner


Jeffrey Stedfast <jestedfa@microsoft.com> writes:

> Base64 encoded inline image data is always within the src attribute
> value of an <img> tag and will always begin with "data:" followed by
> the mime-type and then followed by ";base64," so it's pretty easy to
> spot.
>
> While on this topic, why index HTML attribute values at all? Other
> than perhaps some known ones like perhaps the 'alt' value of <img>
> tags?
>
> I would argue that the only portion of any HTML that you should be
> indexing at all for searching is the character data between tags.
>
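
To make the suggestion quoted above concrete, here is a minimal sketch in
Python (not notmuch's actual C indexing code). It uses the standard
library's html.parser; the names DATA_URI_RE, looks_like_base64_data_uri,
TextOnlyExtractor and indexable_text are hypothetical, chosen only for
illustration. It assumes the policy described in the quote: keep character
data between tags, keep the 'alt' value of <img>, and drop every other
attribute value, so an inline "data:...;base64," image never reaches the
indexer.

```python
import re
from html.parser import HTMLParser

# Assumed pattern for the form described above: "data:", a MIME type,
# then ";base64,". Data URIs with extra parameters are not covered.
DATA_URI_RE = re.compile(r"data:[\w.+-]+/[\w.+-]+;base64,", re.IGNORECASE)


def looks_like_base64_data_uri(value):
    """Return True if an attribute value starts like an inline base64 image."""
    return DATA_URI_RE.match(value.lstrip()) is not None


class TextOnlyExtractor(HTMLParser):
    """Collect character data between tags, plus <img> alt text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Attribute values are ignored, except 'alt' on <img>.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.chunks.append(alt)

    def handle_data(self, data):
        # Character data between tags is the only other thing kept.
        text = data.strip()
        if text:
            self.chunks.append(text)


def indexable_text(html):
    """Return the text a search indexer would see under this policy."""
    parser = TextOnlyExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

With this policy the base64 payload is never inspected at all; the
looks_like_base64_data_uri check would only matter for an indexer that
keeps attribute values and wants to filter out just this one case.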

I should mention that we also have a fair amount of base64 gunk from
inline PGP signatures. I'm not sure whether it's just ugly to look at
when dumping the database terms, or whether it actually makes a
measurable difference in time/space usage.
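
For illustration only, a rough sketch of how such signature blocks could
be skipped before indexing, written in Python rather than notmuch's C. The
function name strip_pgp_signatures is hypothetical, and the sketch assumes
the usual ASCII-armor delimiter lines "-----BEGIN PGP SIGNATURE-----" and
"-----END PGP SIGNATURE-----". Whether doing this is worth it depends on
the time/space measurement mentioned above.

```python
BEGIN_SIG = "-----BEGIN PGP SIGNATURE-----"
END_SIG = "-----END PGP SIGNATURE-----"


def strip_pgp_signatures(text):
    """Drop ASCII-armored PGP signature blocks from a message body."""
    kept = []
    in_signature = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == BEGIN_SIG:
            # Start skipping at the opening delimiter.
            in_signature = True
        elif stripped == END_SIG:
            # Stop skipping; the delimiter itself is also dropped.
            in_signature = False
        elif not in_signature:
            kept.append(line)
    return "\n".join(kept)
```

The signed text itself is left untouched; only the armored signature
lines, which are the base64 part, are removed.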

d
