Steven Allen <steven@stebalien.com> writes:

> David Bremner <david@tethera.net> writes:
>> Although HTML itself is not regular (probably not anything sane in the
>> latest incarnations), well formed tags should be as far as I know.
>> Here is a simple fix to the problem of giant embedded images in HTML:
>> drop all tags. Unbalanced < > could force an HTML part not to be
>> indexed.
>
> What about attribute values?
>
> <input value="a<b">
>
> Contrary to a lot of misinformation on the web, I'm pretty sure this is
> perfectly legal in HTML (not XML).
>
> Docs: https://www.w3.org/TR/html5/syntax.html#attributes-0
>
> In the JavaScript regex format, I believe the correct way to parse this is:
>
> /<("[^"]*"|'[^']*'|[^"'>]*)*>/g
>
> Basically, while inside a tag, ignore everything between double and
> single quotes.

Thanks for the reality check. It should be possible to handle quotes. In
my limited understanding of that regex, we can do a bit better by
forcing pairs of quotes to match, since I think <chaos attribute="'"> is
probably legal.

Cheers,

d
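
For reference, a minimal sketch of the tag-dropping idea in JavaScript,
assuming the quoted regex is used roughly as posted. The name stripTags
is a hypothetical helper made up for illustration. The one change from
the quoted pattern is a non-capturing group that consumes a single
unquoted character per iteration, since the nested star in the original
can backtrack badly on a "<" that never closes. As far as I can tell,
the alternation already matches quotes in pairs, so the
<chaos attribute="'"> case comes out as a single tag.

```javascript
// Sketch only: stripTags is a hypothetical name, not an existing API.
// Inside a tag, a quoted attribute value is consumed whole, so a '<', a '>'
// or a quote of the other kind inside it does not end the tag early.
const TAG = /<(?:"[^"]*"|'[^']*'|[^"'>])*>/g;

function stripTags(html) {
  // Replace every well-formed tag with nothing; other text is kept as-is.
  return html.replace(TAG, "");
}

console.log(JSON.stringify(stripTags('a <input value="a<b"> b')));
console.log(JSON.stringify(stripTags("x <chaos attribute=\"'\"> y")));
console.log(JSON.stringify(stripTags('<img src="data:image/png;base64,AA>BB"> text')));
```

A tag with an unterminated quote never matches this pattern, so input
such as '<a title="oops> text' comes back unchanged rather than being
dropped.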