Hi David,

Base64-encoded inline image data typically sits in the src attribute value of an <img> tag and begins with "data:", followed by the MIME type and then ";base64,", so it's pretty easy to spot.

While on this topic, why index HTML attribute values at all, other than perhaps a few known ones such as the 'alt' value of <img> tags? I would argue that the only portion of any HTML you should be indexing for search is the character data between tags.

Hope my $0.02 helps,

Jeff

> -----Original Message-----
> From: notmuch [mailto:notmuch-bounces@notmuchmail.org] On Behalf Of
> David Bremner
> Sent: Saturday, March 18, 2017 9:25 AM
> To: notmuch@notmuchmail.org
> Subject: [PATCH] test: add known broken test for indexing html
>
> 'quite' on IRC reported that notmuch new was grinding to a halt during initial
> indexing, and we eventually narrowed the problem down to some html parts
> with large embedded images. These cause the number of terms added to
> the Xapian database to explode (the first 400 messages generated 4.6M
> unique terms), and of course the resulting terms are not much use for
> searching.
> ---
>
> I'm not sure the best approach to fix this. Workarounds include limiting the
> size of the part indexed, and skipping html parts. The latter is easy, but
> probably too drastic. A nice solution might be a filter similar to the existing
> one that strips out uuencoded text but for base64. Alas base64 crud seems
> to come with all kinds of syntactic wrappers, so it's probably harder to filter.
> > > test/T680-html-indexing.sh | 12 +++++++ > test/corpora/README | 3 ++ > test/corpora/html/embedded-image | 69 > ++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 84 insertions(+) > create mode 100755 test/T680-html-indexing.sh create mode 100644 > test/corpora/html/embedded-image > > diff --git a/test/T680-html-indexing.sh b/test/T680-html-indexing.sh new file > mode 100755 index 00000000..78768c4f > --- /dev/null > +++ b/test/T680-html-indexing.sh > @@ -0,0 +1,12 @@ > +#!/usr/bin/env bash > +test_description="indexing of html parts" > +. ./test-lib.sh || exit 1 > + > +add_email_corpus html > + > +test_begin_subtest 'embedded images should not be indexed' > +test_subtest_known_broken > +notmuch search > kwpza7svrgjzqwi8fhb2msggwtxtwgqcxp4wbqr4wjddstqmeqa7 > > +OUTPUT test_expect_equal_file /dev/null OUTPUT > + > +test_done > diff --git a/test/corpora/README b/test/corpora/README index > 77c48e6e..c9a35fed 100644 > --- a/test/corpora/README > +++ b/test/corpora/README > @@ -9,3 +9,6 @@ default > broken > The broken corpus contains messages that are broken and/or RFC > non-compliant, ensuring we deal with them in a sane way. 
> + > +html > + The html corpus contains html parts > diff --git a/test/corpora/html/embedded-image > b/test/corpora/html/embedded-image > new file mode 100644 > index 00000000..40851530 > --- /dev/null > +++ b/test/corpora/html/embedded-image > @@ -0,0 +1,69 @@ > +From: =?utf-8?b?bWFsbW9ib3Jn?= <daemon@lublin.se> > +To: =?utf-8?b?Ym9lbmRlLm1hbG1vYm9yZw==?= <daemon@lublin.se> > +Date: Tue, 19 Jul 2016 11:54:24 +0200 > +X-Feed2Imap-Version: 1.2.5 > +Message-Id: <boendemalmoborg-1834@eltanin.uberspace.de> > +Subject: > +=?utf- > 8?b?VGFjayBhbGxhIHRyYWZpa2FudGVyIG9jaCBmb3Rnw6RuZ2FyZSE=?= > +Content-Type: multipart/alternative; boundary="=-1468922508-176605- > 12427-9500-21-=" > +MIME-Version: 1.0 > + > + > +--=-1468922508-176605-12427-9500-21-= > +Content-Type: text/plain; charset=utf-8; format=flowed > +Content-Transfer-Encoding: 8bit > + > +<http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/> > + > +Malmö 2016-07-09 > + > +I skrivande stund är vi i färd med att avetablera vår entreprenad på > +Tigern 3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett > +större dräneringsarbete som i sin tur har inneburit vissa > +trafikbegränsningar på Regementsgatan samt Davidshallsgatan under några > +veckors tid. Fastighetsägaren är mycket nöjd med vår arbetsinsats och > +vi kan glatt meddela att båda vägfilerna kommer att öppnas inom kort. > +Nu kommer den vackra fastigheten att klara sig torrskodd under många år > +framöver [A] > + > + > + > +[A] http://malmoborg.se/wp-includes/images/smilies/icon_smile.gif > +-- > +Feed: Förvaltnings AB Malmöborg > +<http://malmoborg.se> > +Item: Tack alla trafikanter och fotgängare! 
> +<http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/> > +Date: 2016-07-19 11:54:24 +0200 > +Author: malmoborg > +Filed under: Nyheter > + > +--=-1468922508-176605-12427-9500-21-= > +Content-Type: text/html; charset=utf-8 > +Content-Transfer-Encoding: 8bit > + > +<table border="1" width="100%" cellpadding="0" cellspacing="0" > +borderspacing="0"><tr><td> <table width="100%" bgcolor="#EDEDED" > +cellpadding="4" cellspacing="2"> <tr><td > +align="right"><b>Feed:</b></td> <td width="100%"><a > +href="http://malmoborg.se"> <b>Förvaltnings AB Malmöborg</b> </a> > +</td></tr><tr><td align="right"><b>Item:</b></td> <td width="100%"><a > +href="http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/ > +"><b>Tack alla trafikanter och fotgängare!</b> </a> > +</td></tr></table></td></tr></table> > + > +<p>Malmö 2016-07-09</p> > +<p>I skrivande stund är vi i färd med att avetablera vår entreprenad på > +Tigern 3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett > +större dräneringsarbete som i sin tur har inneburit vissa > +trafikbegränsningar på Regementsgatan samt Davidshallsgatan under några > +veckors tid. Fastighetsägaren är mycket nöjd med vår arbetsinsats och > +vi kan glatt meddela att båda vägfilerna kommer att öppnas inom kort. 
> +Nu kommer den vackra fastigheten att klara sig torrskodd under många år > +framöver <img > +src="data:image/gif;base64,R0lGODlhDwAPALMOAP/qAEVFRQAAAP/OAP/ > JAP+0AP6d > +AP/+k//9E/////// > +xzMzM///6//lAAAAAAAAACH5BAEAAA4ALAAAAAAPAA8AAARb0EkZap3YV > abO > +GRcWcAgCnIMRTEEnCCfwpqt2mHEOagoOnz+CKnADxoKFyiHHBBCSAdOiCV > g8 > +KwPZa7sVrgJZQWI8FhB2msGgwTXTWGqCXP4WBQr4wjDDstQmEQA7 > +" alt=":-)" class="wp-smiley" /> </p> > +<p> </p> > +<hr width="100%"/> > +<table width="100%" cellpadding="0" cellspacing="0"> <tr><td > +align="right"><font > +color="#ababab">Date:</font> </td><td><font > +color="#ababab">2016-07-19 11:54:24 +0200</font></td></tr> <tr><td > +align="right"><font > +color="#ababab">Author:</font> </td><td><font > +color="#ababab">malmoborg</font></td></tr> > +<tr><td align="right"><font color="#ababab">Filed > +under:</font> </td><td><font > +color="#ababab">Nyheter</font></td></tr> > +</table> > + > +--=-1468922508-176605-12427-9500-21-=-- > -- > 2.11.0 > > _______________________________________________ > notmuch mailing list > notmuch@notmuchmail.org > https://notmuchmail.org/mailman/listinfo/notmuch
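For what it's worth, here is a rough sketch of the two ideas from the reply at the top of this message: dropping "data:...;base64," payloads, and indexing only the character data between tags plus the 'alt' value of <img> tags. It is written in Python purely for illustration and is not notmuch's code (notmuch itself is C). The names strip_data_uris, IndexableText and indexable_text are made up for this example, and the regex assumes the data URI sits inside a quoted attribute value, as it does in the test message above.

```python
import re
from html.parser import HTMLParser

# Hypothetical filter, not part of notmuch.
# Matches a data: URI with a base64 payload. Whitespace is allowed inside the
# payload because mail software often wraps long attribute values.
# Assumption: the URI is inside a quoted attribute, so the closing quote ends
# the match. In free-running text this pattern could swallow following words.
DATA_URI_RE = re.compile(
    r"data:(?:[\w.+-]+/[\w.+-]+)?(?:;[\w.+-]+=[\w.+-]+)*;base64,[A-Za-z0-9+/=\s]*",
    re.IGNORECASE,
)


def strip_data_uris(html):
    """Remove base64 data: URIs so their payload never reaches the indexer."""
    return DATA_URI_RE.sub("", html)


class IndexableText(HTMLParser):
    """Collect character data between tags, plus <img alt="..."> values."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> also arrive here, because the
        # default handle_startendtag() calls handle_starttag().
        if tag == "img":
            for name, value in attrs:
                if name == "alt" and value:
                    self.chunks.append(value)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def indexable_text(html):
    """Return only the text worth indexing; all other attributes are dropped."""
    parser = IndexableText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Either function alone would keep the embedded GIF in the test corpus out of the index. The second one is the more drastic of the two, since it discards every attribute value except 'alt', which also makes the data URI filter unnecessary for that code path.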