Re: Drop HTML tags when indexing

Date: Sat, 25 Mar 2017 09:59:20 -0300

David Bremner <david@tethera.net> writes:

> Steven Allen pointed out [2] that the previous scanner [1] was a
> little too simplistic. This version handles (or claims to) quoted
> strings in attributes, which can apparently contain '>' and '<'
> characters. This required generalizing the state machine runner a bit
> [3] to handle states with out-degree more than two.

For what it is worth, this series shrank my index by about the same
amount as skipping html messages entirely: about 15% of my messages
have html parts, and this series made the index about 15% smaller.
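
For readers who have not looked at the patches, the kind of scanner
described in the quoted text can be sketched roughly as below. This is
a hypothetical stand-alone illustration, not the code from the series:
the names `State` and `strip_html_tags` are made up here, and the real
filter works on streamed buffers rather than a whole string. The point
is only that the "inside a tag" state needs three outgoing transitions
(close the tag, enter a double-quoted value, enter a single-quoted
value), so a '>' or '<' inside a quoted attribute does not end the tag.

```cpp
#include <string>

// Hypothetical sketch; these names are illustrative and not taken from notmuch.
enum class State { Text, Tag, DoubleQuoted, SingleQuoted };

// Return a copy of `input` with everything between '<' and the matching '>'
// dropped. Quoted attribute values are skipped as a unit, so a '>' inside
// quotes does not terminate the tag.
std::string strip_html_tags(const std::string &input)
{
    std::string output;
    State state = State::Text;

    for (char c : input) {
        switch (state) {
        case State::Text:
            if (c == '<')
                state = State::Tag;          // start of a tag: stop emitting
            else
                output += c;                 // ordinary text is kept
            break;
        case State::Tag:
            // Three possible transitions out of this state.
            if (c == '>')
                state = State::Text;
            else if (c == '"')
                state = State::DoubleQuoted;
            else if (c == '\'')
                state = State::SingleQuoted;
            break;
        case State::DoubleQuoted:
            if (c == '"')
                state = State::Tag;          // end of "..." attribute value
            break;
        case State::SingleQuoted:
            if (c == '\'')
                state = State::Tag;          // end of '...' attribute value
            break;
        }
    }
    return output;
}
```

With this sketch, an input such as `<a title="x>y">link</a> text` comes
out as `link text`. It makes no attempt to handle comments, scripts, or
character entities.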

d
