RFC: drop html tags

Subject: RFC: drop html tags

Date: Tue, 21 Mar 2017 10:15:43 -0300

Although HTML itself is not a regular language (and in its latest
incarnations probably not anything sane), well-formed tags should be,
as far as I know. Here is a simple fix to the problem of giant
embedded images in HTML: drop all tags. One caveat: an unbalanced
"<" or ">" could cause an HTML part not to be indexed.
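
To make the idea concrete, here is a minimal sketch of a streaming
tag dropper. It is not the code from the patches; the names
tag_state_t and drop_tags are made up for illustration, and it assumes
the caller supplies an output buffer at least as large as the input.

```c
#include <stddef.h>

/* Hypothetical scanner state; these names are not from the patch series. */
typedef enum { IN_TEXT, IN_TAG } tag_state_t;

/*
 * Copy `len` bytes from `in` to `out`, dropping everything from a '<'
 * through the next '>'.  `*state` persists between calls, so a tag may
 * be split across two chunks of the stream.  Returns the number of
 * bytes written; `out` must have room for `len` bytes.
 */
static size_t
drop_tags (tag_state_t *state, const char *in, size_t len, char *out)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        char c = in[i];

        if (*state == IN_TEXT) {
            if (c == '<')
                *state = IN_TAG;   /* start of a tag: stop emitting */
            else
                out[n++] = c;      /* ordinary text: pass through */
        } else if (c == '>') {
            *state = IN_TEXT;      /* end of the tag: resume emitting */
        }
    }
    return n;
}
```

Because the state carries over between calls, a huge inline image such
as <img src="data:..."> is skipped even when it spans many chunks. The
same property explains the caveat above: after a stray "<" with no
closing ">", the scanner stays in IN_TAG and emits nothing further.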

If the general approach seems sensible, it can probably be tidied up
a bit, e.g. by storing a state table in the filter struct rather than
creating a function to define the appropriate state table and jumping
through a function pointer. On the other hand, the function-pointer
approach is in principle more flexible, as it does not insist that
all scanners are automata-based. I originally wanted to try a real
HTML parser, but I couldn't see how to get the one I looked at
(gumbo) working easily in "stream" mode.
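
The two designs can be sketched side by side as follows. All types and
names here (transition_t, table_filter_t, generic_filter_t, and so on)
are hypothetical and only meant to show the trade-off; they are not
the structures used in the patches.

```c
#include <stddef.h>

/* Hypothetical types; none of these names come from the patch series. */
typedef struct {
    char match;             /* byte that triggers the transition */
    int next_if_match;
    int next_if_not_match;
} transition_t;

/* Option A: the filter owns its state table directly. */
typedef struct {
    const transition_t *table;  /* indexed by the current state */
    int state;
    int first_skipping_state;   /* states >= this one emit nothing */
} table_filter_t;

/* Option B: the filter only knows a scan callback, which need not be
 * an automaton at all. */
typedef struct generic_filter {
    size_t (*scan) (struct generic_filter *f, const char *in,
                    size_t len, char *out);
    void *data;
} generic_filter_t;

/* Table-driven scanner for option A. */
static size_t
table_scan (table_filter_t *f, const char *in, size_t len, char *out)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        const transition_t *t = &f->table[f->state];
        int next = (in[i] == t->match) ? t->next_if_match
                                       : t->next_if_not_match;

        /* Emit a byte only if we start and end outside a skipping state. */
        if (f->state < f->first_skipping_state &&
            next < f->first_skipping_state)
            out[n++] = in[i];
        f->state = next;
    }
    return n;
}

/* Two states: 0 = text, 1 = inside a tag. */
static const transition_t html_table[] = {
    { '<', 1, 0 },
    { '>', 0, 1 },
};

/* Adapter so the table-driven scanner can also sit behind option B. */
static size_t
table_scan_cb (generic_filter_t *f, const char *in, size_t len, char *out)
{
    return table_scan ((table_filter_t *) f->data, in, len, out);
}
```

With option A, adding a scanner means adding a table. With option B, a
non-automaton scanner (for instance one wrapping a real parser) could
be plugged in behind the same callback, at the cost of the extra
indirection.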

Thread:

David Bremner—RFC: drop html tags [inbox, unread]
- David Bremner—[rfc patch 1/6] test: add known broken test for indexing html [inbox, notmuch::bug, notmuch::obsolete, notmuch::patch, notmuch::test, unread]
- David Bremner—[rfc patch 2/6] lib: add content type argument to uuencode filter. [inbox, notmuch::obsolete, notmuch::patch, unread]
- David Bremner—[rfc patch 3/6] lib/index: Add another layer of indirection in filtering [inbox, notmuch::obsolete, notmuch::patch, unread]
- David Bremner—[rfc patch 4/6] lib/index: separate state table definition from scanner. [inbox, notmuch::obsolete, notmuch::patch, unread]
- David Bremner—[rfc patch 5/6] lib/index: generalize filter name [inbox, notmuch::obsolete, notmuch::patch, unread]
- David Bremner—[rfc patch 6/6] lib/index: add simple html filter [inbox, notmuch::obsolete, notmuch::patch, unread]