[David Bremner] Re: RFC: drop html tags

Subject: [David Bremner] Re: RFC: drop html tags

Date: Tue, 21 Mar 2017 14:55:20 -0300

To: notmuch@notmuchmail.org

Cc:

From: David Bremner


Steven Allen <steven@stebalien.com> writes:

> David Bremner <david@tethera.net> writes:
>> Although HTML itself is not regular (probably not anything sane in the
>> latest incarnations), well formed tags should be as far as I know.
>> Here is a simple fix to the problem of giant embedded images in HTML:
>> drop all tags.  Unbalanced < > could force an HTML part not to be
>> indexed.
>
> What about attribute values?
>
>     <input value="a<b">
>
> Contrary to a lot of misinformation on the web, I'm pretty sure this is
> perfectly legal in HTML (not XML).
>
> Docs: https://www.w3.org/TR/html5/syntax.html#attributes-0
>
> In the JavaScript regex format, I believe the correct way to parse this is:
>
>     /<("[^"]*"|'[^']*'|[^"'>]*)*>/g
>
> Basically, while inside a tag, ignore everything between double and single quotes.

Thanks for the reality check. It should be possible to handle quotes. In
my limited understanding of that regex, we can do a bit better by
forcing pairs of quotes to match, since <chaos attribute="'"> is
probably legal.
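
For illustration, here is a minimal JavaScript sketch of the quote-aware
tag dropping discussed above. The helper name dropTags is hypothetical
(nothing like it exists in notmuch), and the regex is a slightly
restructured variant of the quoted one: it consumes one unquoted
character per iteration instead of nesting two quantifiers, which can
backtrack badly on malformed input in backtracking engines.

```javascript
// Hypothetical helper, not part of notmuch.
// Inside a tag, a double- or single-quoted run is consumed as a unit,
// so a '<', '>' or the other kind of quote inside it does not end the tag.
const TAG = /<(?:"[^"]*"|'[^']*'|[^"'>])*>/g;

function dropTags(html) {
  // Replace every well-formed tag with the empty string.
  return html.replace(TAG, "");
}

// A '<' inside a quoted attribute value is skipped along with the tag.
console.log(dropTags('a <input value="a<b"> b'));

// A single quote inside a double-quoted value does not open a new quoted run.
console.log(dropTags(`x <chaos attribute="'"> y`));

// An unterminated quote leaves the text untouched, so the caller can
// detect the leftover '<' and decide not to index the part.
console.log(dropTags('<a href="x>text'));
```

Note that a bare < in running text (as in "a < b <p>") would still be
treated as the start of a tag by this sketch, so it only covers the
well-formed case.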

Cheers,

d
