Re: Drop HTML tags when indexing

Subject: Re: Drop HTML tags when indexing

Date: Sat, 25 Mar 2017 09:59:20 -0300

To: notmuch@notmuchmail.org

Cc:

From: David Bremner


David Bremner <david@tethera.net> writes:

> Steven Allen pointed out [2] that the previous scanner [1] was a
> little too simplistic. This version handles (or claims to) quoted
> strings in attributes, which can apparently contain '>' and '<'
> characters. This required generalizing the state machine runner a bit
> [3] to handle states with out-degree more than two.
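
For readers without the patches at hand, here is a minimal sketch of the
kind of scanner described in the quoted text. It is not the code from the
series: the state names and the strip_tags function are hypothetical, and
it assumes the only goal is to drop everything between '<' and the
matching '>' while treating quoted attribute values as opaque. Note that
the TAG state has three outgoing transitions, which is the "out-degree
more than two" case.

```c
#include <stddef.h>

/* Scanner states; the names are illustrative, not taken from the series. */
enum state { TEXT, TAG, DQUOTE, SQUOTE };

/*
 * Copy `in` to `out`, dropping everything from '<' up to the '>' that
 * closes the tag. A '>' or '<' inside a quoted attribute value does not
 * end the tag. `out` must have room for strlen(in) + 1 bytes.
 * Returns the number of bytes written, excluding the terminating NUL.
 */
size_t
strip_tags (const char *in, char *out)
{
    enum state s = TEXT;
    size_t n = 0;

    for (; *in; in++) {
        char c = *in;

        switch (s) {
        case TEXT:
            if (c == '<')
                s = TAG;            /* tag opens: stop copying */
            else
                out[n++] = c;       /* ordinary text: keep it */
            break;
        case TAG:
            if (c == '>')
                s = TEXT;           /* tag closes: resume copying */
            else if (c == '"')
                s = DQUOTE;         /* double-quoted attribute value */
            else if (c == '\'')
                s = SQUOTE;         /* single-quoted attribute value */
            break;
        case DQUOTE:
            if (c == '"')
                s = TAG;            /* only the matching quote ends it */
            break;
        case SQUOTE:
            if (c == '\'')
                s = TAG;
            break;
        }
    }
    out[n] = '\0';
    return n;
}
```

With this sketch, the input `<a title="x > y">link</a> text` comes out as
`link text`, because the '>' inside the quoted title does not end the tag.
Comments, CDATA sections and entities are deliberately ignored here.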

For what it is worth, this series shrank my index by about the same
amount as skipping HTML messages entirely: about 15% of my messages
have HTML parts, and this series made the index about 15% smaller.

d
