Although HTML as a whole is not a regular language (and the latest incarnations probably do not fit any sane language class), well-formed tags should be regular, as far as I know. Here is a simple fix for the problem of giant embedded images in HTML: drop all tags. The catch is that an unbalanced < or > could cause an HTML part not to be indexed.

If the general approach seems sensible, it can probably be tidied up a bit, e.g. by storing a state table in the filter struct rather than creating a function that defines the appropriate state table and jumping through a function pointer. On the other hand, the function pointer approach is in principle more flexible, as it does not insist that all scanners be automata-based.

I originally wanted to try a real HTML parser, but I couldn't see how to get the one I looked at (gumbo) working easily in "stream" mode.
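
To make the table-driven variant concrete, here is a minimal sketch of a tag-dropping scanner whose state survives between chunks, so it can run in "stream" mode. It is not the code from the patch: every name in it (tag_filter, tag_filter_feed, the TF_ and CC_ constants) is hypothetical. It also assumes the crudest possible rule, namely that any '>' closes the current tag, so a '>' inside a quoted attribute value ends the tag early.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; a sketch, not the patch's actual code. */
enum tag_state { TF_TEXT = 0, TF_IN_TAG, TF_N_STATES };
enum char_class { CC_OTHER = 0, CC_LT, CC_GT, CC_N_CLASSES };

/* The only thing that must persist between chunks is the current state. */
typedef struct {
    enum tag_state state;
} tag_filter;

/* Transition table: rows are states, columns are character classes. */
static const enum tag_state next_state[TF_N_STATES][CC_N_CLASSES] = {
    /*                 other       '<'        '>'     */
    /* TF_TEXT   */ { TF_TEXT,   TF_IN_TAG, TF_TEXT },
    /* TF_IN_TAG */ { TF_IN_TAG, TF_IN_TAG, TF_TEXT },
};

static enum char_class classify(char c)
{
    if (c == '<')
        return CC_LT;
    if (c == '>')
        return CC_GT;
    return CC_OTHER;
}

void tag_filter_init(tag_filter *f)
{
    f->state = TF_TEXT;
}

/*
 * Copy the non-tag bytes of in[0..len) to out and return how many bytes
 * were written. out must have room for len bytes. A byte is kept only if
 * the scanner is in TF_TEXT both before and after consuming it, which
 * drops the '<', the tag body, and the closing '>'.
 */
size_t tag_filter_feed(tag_filter *f, const char *in, size_t len, char *out)
{
    size_t n = 0;

    assert(f != NULL && in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        enum tag_state prev = f->state;

        f->state = next_state[prev][classify(in[i])];
        if (prev == TF_TEXT && f->state == TF_TEXT)
            out[n++] = in[i];
    }
    return n;
}

/*
 * Convenience wrapper for NUL-terminated input. It NUL-terminates the
 * output, so out must have room for strlen(in) + 1 bytes.
 */
size_t tag_filter_feed_str(tag_filter *f, const char *in, char *out)
{
    size_t n = tag_filter_feed(f, in, strlen(in), out);

    out[n] = '\0';
    return n;
}
```

Feeding the chunks `<p>hi <img src="da` and `ta:AAAA"> there</p>` one after the other yields `hi ` and then ` there`, because the scanner is still inside the img tag when the second chunk arrives. The same property produces the failure mode mentioned above: after a `<` that is never closed, the state stays at TF_IN_TAG and every later byte of the part is dropped.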