RE: [PATCH] test: add known broken test for indexing html

Date: Sat, 18 Mar 2017 16:21:34 +0000

To: David Bremner, notmuch@notmuchmail.org

From: Jeffrey Stedfast


Hey David,

I actually have an HTML tokenizer in MimeKit that exists for (among other things) exactly this kind of purpose. Perhaps I should port it to C and ship it with GMime 😊

https://github.com/jstedfast/MimeKit/tree/master/MimeKit/Text

Jeff

> -----Original Message-----
> From: David Bremner [mailto:david@tethera.net]
> Sent: Saturday, March 18, 2017 11:04 AM
> To: Jeffrey Stedfast <jestedfa@microsoft.com>; notmuch@notmuchmail.org
> Subject: RE: [PATCH] test: add known broken test for indexing html
> 
> Jeffrey Stedfast <jestedfa@microsoft.com> writes:
> 
> > Hi David,
> >
> > Base64 encoded inline image data is always within the src attribute value of
> > an <img> tag and will always begin with "data:" followed by the mime-type
> > and then followed by ";base64," so it's pretty easy to spot.
> >
> > While on this topic, why index HTML attribute values at all? Other than
> > perhaps some known ones like perhaps the 'alt' value of <img> tags?
> >
> > I would argue that the only portion of any HTML that you should be
> > indexing at all for searching is the character data between tags.
> 
> We're not currently parsing the HTML, so none of these distinctions are really
> available to us. Maybe adding an HTML parser is the right solution, but it's a
> bit non-trivial.
> 
> d
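
To make the idea in the quoted discussion concrete, here is a rough sketch (in Python, using the standard library's html.parser) of indexing only the character data between tags, plus the 'alt' value of <img> tags, while recognising base64 "data:" values in src. This is not notmuch, GMime, or MimeKit code. The names TextExtractor, is_base64_data_uri, and indexable_text are hypothetical, and skipping <script> and <style> contents is an assumption of the sketch rather than something proposed in the thread.

```python
from html.parser import HTMLParser

# Elements whose contents are not human-readable text (assumption of this sketch).
SKIP_ELEMENTS = {"script", "style"}


def is_base64_data_uri(value):
    """True if value looks like 'data:<mime-type>;base64,<payload>'."""
    head, sep, _payload = value.partition(",")
    head = head.lower()
    return sep == "," and head.startswith("data:") and head.endswith(";base64")


class TextExtractor(HTMLParser):
    """Collect character data between tags, plus <img alt="..."> values."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []         # text fragments worth indexing
        self.inline_images = 0   # base64 data: URIs seen (never indexed)
        self._skip_depth = 0     # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_ELEMENTS:
            self._skip_depth += 1
            return
        if tag == "img":
            attr_map = dict(attrs)
            if is_base64_data_uri(attr_map.get("src") or ""):
                self.inline_images += 1
            alt = attr_map.get("alt")
            if alt:
                self.chunks.append(alt)

    def handle_endtag(self, tag):
        if tag in SKIP_ELEMENTS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Attribute values never arrive here, only text between tags.
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def indexable_text(html):
    """Return the text of an HTML part that a search index might keep."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    sample = (
        '<p>Hello <b>world</b></p>'
        '<img alt="a cat" src="data:image/png;base64,iVBORw0KGgo=">'
    )
    print(indexable_text(sample))  # prints: Hello world a cat
```

Joining the fragments with single spaces is a simplification: a word split by inline markup (for example "Hel<b>lo</b>") would be indexed as two terms, which a real tokenizer would need to handle.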
