RE: [PATCH] test: add known broken test for indexing html

Subject: RE: [PATCH] test: add known broken test for indexing html

Date: Sat, 18 Mar 2017 13:37:21 +0000

To: David Bremner, notmuch@notmuchmail.org

From: Jeffrey Stedfast


Hi David,

Base64-encoded inline image data in HTML mail normally sits in the src attribute value of an <img> tag and begins with "data:", followed by the MIME type and then ";base64,", so it's pretty easy to spot.
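
To make the "easy to spot" part concrete, here is a minimal Python sketch of a check for that prefix. The name is_base64_data_uri and the regular expression are made up for illustration and are not part of notmuch; the pattern only covers the plain "data:<type>/<subtype>;base64," form described above and ignores optional parameters such as a charset.

```python
import re

# Hypothetical helper, not notmuch code: matches the prefix
# "data:<type>/<subtype>;base64," at the start of an attribute value.
# Assumption: no extra parameters (e.g. ";charset=...") before ";base64,".
DATA_URI_PREFIX = re.compile(r"\s*data:[\w.+-]+/[\w.+-]+;base64,", re.IGNORECASE)


def is_base64_data_uri(value: str) -> bool:
    """Return True if an attribute value looks like an inline base64 data URI."""
    return DATA_URI_PREFIX.match(value) is not None
```

An indexer could call this on each src value and skip the value when it returns True.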

While on this topic, why index HTML attribute values at all, other than perhaps a few known ones such as the 'alt' value of <img> tags?

I would argue that the only portion of any HTML that you should be indexing at all for searching is the character data between tags.
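
As a rough sketch of that idea (not how notmuch does or should implement it), the Python below uses the standard library's html.parser to collect only the character data between tags, plus the 'alt' value of <img> tags. The names TextOnlyExtractor and indexable_text are hypothetical, and the sketch makes no attempt to skip <script> or <style> content.

```python
from html.parser import HTMLParser


class TextOnlyExtractor(HTMLParser):
    """Collect character data between tags; ignore attributes except img alt."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Assumption: 'alt' on <img> is the only attribute worth indexing.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.chunks.append(alt)

    def handle_data(self, data):
        # Text that appears between tags.
        self.chunks.append(data)


def indexable_text(html: str) -> str:
    """Return whitespace-normalized text suitable for term generation."""
    parser = TextOnlyExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())
```

Because the src value is never looked at, the base64 payload from the test message would not reach the term generator at all, while the ":-)" alt text would still be searchable.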

Hope my $0.02 helps,

Jeff

> -----Original Message-----
> From: notmuch [mailto:notmuch-bounces@notmuchmail.org] On Behalf Of
> David Bremner
> Sent: Saturday, March 18, 2017 9:25 AM
> To: notmuch@notmuchmail.org
> Subject: [PATCH] test: add known broken test for indexing html
> 
> 'quite' on IRC reported that notmuch new was grinding to a halt during initial
> indexing, and we eventually narrowed the problem down to some html parts
> with large embedded images. These cause the number of terms added to
> the Xapian database to explode (the first 400 messages generated 4.6M
> unique terms), and of course the resulting terms are not much use for
> searching.
> ---
> 
> I'm not sure of the best approach to fix this. Workarounds include limiting the
> size of the part indexed, and skipping html parts. The latter is easy, but
> probably too drastic.  A nice solution might be a filter similar to the existing
> one that strips out uuencoded text but for base64. Alas base64 crud seems
> to come with all kinds of syntactic wrappers, so it's probably harder to filter.
> 
> 
>  test/T680-html-indexing.sh       | 12 +++++++
>  test/corpora/README              |  3 ++
>  test/corpora/html/embedded-image | 69 ++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 84 insertions(+)
>  create mode 100755 test/T680-html-indexing.sh
>  create mode 100644 test/corpora/html/embedded-image
> 
> diff --git a/test/T680-html-indexing.sh b/test/T680-html-indexing.sh
> new file mode 100755
> index 00000000..78768c4f
> --- /dev/null
> +++ b/test/T680-html-indexing.sh
> @@ -0,0 +1,12 @@
> +#!/usr/bin/env bash
> +test_description="indexing of html parts"
> +. ./test-lib.sh || exit 1
> +
> +add_email_corpus html
> +
> +test_begin_subtest 'embedded images should not be indexed'
> +test_subtest_known_broken
> +notmuch search kwpza7svrgjzqwi8fhb2msggwtxtwgqcxp4wbqr4wjddstqmeqa7 > OUTPUT
> +test_expect_equal_file /dev/null OUTPUT
> +
> +test_done
> diff --git a/test/corpora/README b/test/corpora/README
> index 77c48e6e..c9a35fed 100644
> --- a/test/corpora/README
> +++ b/test/corpora/README
> @@ -9,3 +9,6 @@ default
>  broken
>    The broken corpus contains messages that are broken and/or RFC
>    non-compliant, ensuring we deal with them in a sane way.
> +
> +html
> +  The html corpus contains html parts
> diff --git a/test/corpora/html/embedded-image b/test/corpora/html/embedded-image
> new file mode 100644
> index 00000000..40851530
> --- /dev/null
> +++ b/test/corpora/html/embedded-image
> @@ -0,0 +1,69 @@
> +From: =?utf-8?b?bWFsbW9ib3Jn?= <daemon@lublin.se>
> +To: =?utf-8?b?Ym9lbmRlLm1hbG1vYm9yZw==?= <daemon@lublin.se>
> +Date: Tue, 19 Jul 2016 11:54:24 +0200
> +X-Feed2Imap-Version: 1.2.5
> +Message-Id: <boendemalmoborg-1834@eltanin.uberspace.de>
> +Subject: =?utf-8?b?VGFjayBhbGxhIHRyYWZpa2FudGVyIG9jaCBmb3Rnw6RuZ2FyZSE=?=
> +Content-Type: multipart/alternative; boundary="=-1468922508-176605-12427-9500-21-="
> +MIME-Version: 1.0
> +
> +
> +--=-1468922508-176605-12427-9500-21-=
> +Content-Type: text/plain; charset=utf-8; format=flowed
> +Content-Transfer-Encoding: 8bit
> +
> +<http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/>
> +
> +Malmö 2016-07-09
> +
> +I skrivande stund är vi i färd med att avetablera vår entreprenad på
> +Tigern 3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett
> +större dräneringsarbete som i sin tur har inneburit vissa
> +trafikbegränsningar på Regementsgatan samt Davidshallsgatan under några
> +veckors tid. Fastighetsägaren är mycket nöjd med vår arbetsinsats och
> +vi kan glatt meddela att båda vägfilerna kommer att öppnas inom kort.
> +Nu kommer den vackra fastigheten att klara sig torrskodd under många år
> +framöver [A]
> +
> +
> +
> +[A] http://malmoborg.se/wp-includes/images/smilies/icon_smile.gif
> +--
> +Feed: Förvaltnings AB Malmöborg
> +<http://malmoborg.se>
> +Item: Tack alla trafikanter och fotgängare!
> +<http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/>
> +Date: 2016-07-19 11:54:24 +0200
> +Author: malmoborg
> +Filed under: Nyheter
> +
> +--=-1468922508-176605-12427-9500-21-=
> +Content-Type: text/html; charset=utf-8
> +Content-Transfer-Encoding: 8bit
> +
> +<table border="1" width="100%" cellpadding="0" cellspacing="0"
> +borderspacing="0"><tr><td> <table width="100%" bgcolor="#EDEDED"
> +cellpadding="4" cellspacing="2"> <tr><td
> +align="right"><b>Feed:</b></td> <td width="100%"><a
> +href="http://malmoborg.se"> <b>Förvaltnings AB Malmöborg</b> </a>
> +</td></tr><tr><td align="right"><b>Item:</b></td> <td width="100%"><a
> +href="http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/
> +"><b>Tack alla trafikanter och fotgängare!</b> </a>
> +</td></tr></table></td></tr></table>
> +
> +<p>Malmö 2016-07-09</p>
> +<p>I skrivande stund är vi i färd med att avetablera vår entreprenad på
> +Tigern 3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett
> +större dräneringsarbete som i sin tur har inneburit vissa
> +trafikbegränsningar på Regementsgatan samt Davidshallsgatan under några
> +veckors tid. Fastighetsägaren är mycket nöjd med vår arbetsinsats och
> +vi kan glatt meddela att båda vägfilerna kommer att öppnas inom kort.
> +Nu kommer den vackra fastigheten att klara sig torrskodd under många år
> +framöver <img
> +src="data:image/gif;base64,R0lGODlhDwAPALMOAP/qAEVFRQAAAP/OAP/
> JAP+0AP6d
> +AP/+k//9E///////
> +xzMzM///6//lAAAAAAAAACH5BAEAAA4ALAAAAAAPAA8AAARb0EkZap3YV
> abO
> +GRcWcAgCnIMRTEEnCCfwpqt2mHEOagoOnz+CKnADxoKFyiHHBBCSAdOiCV
> g8
> +KwPZa7sVrgJZQWI8FhB2msGgwTXTWGqCXP4WBQr4wjDDstQmEQA7
> +" alt=":-)" class="wp-smiley" /> </p>
> +<p>&nbsp;</p>
> +<hr width="100%"/>
> +<table width="100%" cellpadding="0" cellspacing="0"> <tr><td
> +align="right"><font
> +color="#ababab">Date:</font>&nbsp;&nbsp;</td><td><font
> +color="#ababab">2016-07-19 11:54:24 +0200</font></td></tr> <tr><td
> +align="right"><font
> +color="#ababab">Author:</font>&nbsp;&nbsp;</td><td><font
> +color="#ababab">malmoborg</font></td></tr>
> +<tr><td align="right"><font color="#ababab">Filed
> +under:</font>&nbsp;&nbsp;</td><td><font
> +color="#ababab">Nyheter</font></td></tr>
> +</table>
> +
> +--=-1468922508-176605-12427-9500-21-=--
> --
> 2.11.0
> 
> _______________________________________________
> notmuch mailing list
> notmuch@notmuchmail.org
> https://notmuchmail.org/mailman/listinfo/notmuch
