Hey David,

I actually have an HTML tokenizer for MimeKit for (among other things) this type of purpose. Perhaps I need to port that to C and include that with GMime 😊

https://github.com/jstedfast/MimeKit/tree/master/MimeKit/Text

Jeff

> -----Original Message-----
> From: David Bremner [mailto:david@tethera.net]
> Sent: Saturday, March 18, 2017 11:04 AM
> To: Jeffrey Stedfast <jestedfa@microsoft.com>; notmuch@notmuchmail.org
> Subject: RE: [PATCH] test: add known broken test for indexing html
>
> Jeffrey Stedfast <jestedfa@microsoft.com> writes:
>
> > Hi David,
> >
> > Base64 encoded inline image data is always within the src attribute value of
> > an <img> tag and will always begin with "data:" followed by the mime-type
> > and then followed by ";base64," so it's pretty easy to spot.
> >
> > While on this topic, why index HTML attribute values at all? Other than
> > perhaps some known ones like perhaps the 'alt' value of <img> tags?
> >
> > I would argue that the only portion of any HTML that you should be
> > indexing at all for searching is the character data between tags.
>
> We're not currently parsing the HTML, so none of these distinctions are really
> available to us. Maybe adding an HTML parser is the right solution, but it's a
> bit non-trivial.
>
> d
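
To make the suggestion in the quoted message concrete, here is a minimal Python sketch of indexing only the character data between tags, plus the 'alt' value of <img> tags. It uses the standard library's html.parser, not MimeKit's tokenizer or anything that exists in notmuch or GMime. The names IndexableTextExtractor, indexable_text, and is_inline_base64 are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


def is_inline_base64(src):
    """Return True if src looks like data:<mime-type>;base64,<payload>.

    Hypothetical helper: it only checks the shape described in the thread
    (a "data:" prefix and ";base64" right before the first comma).
    """
    header, sep, _payload = src.partition(",")
    header = header.lower()
    return sep == "," and header.startswith("data:") and header.endswith(";base64")


class IndexableTextExtractor(HTMLParser):
    """Collect character data between tags, plus alt text of <img> tags.

    Assumption: every other attribute value (including src) is ignored,
    so inline base64 image data is never collected for indexing.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Only the alt attribute of <img> is treated as searchable text.
        if tag == "img":
            for name, value in attrs:
                if name == "alt" and value:
                    self.chunks.append(value)

    def handle_data(self, data):
        # Character data between tags; whitespace-only runs are skipped.
        text = data.strip()
        if text:
            self.chunks.append(text)


def indexable_text(html):
    """Return the searchable text of an HTML fragment as one string."""
    parser = IndexableTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    sample = (
        '<p>Hello <b>world</b></p>'
        '<img alt="a cat" src="data:image/png;base64,iVBORw0KGgo=">'
    )
    print(indexable_text(sample))
```

Because only character data and alt text are collected, the base64 payload in src never reaches the index, so the is_inline_base64 check would matter mainly for an approach that does not parse the HTML at all.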