[PATCH 2/3] test: add known broken test for indexing text/* attachments

Date: Sat, 20 Aug 2022 11:50:06 -0700

To: notmuch@notmuchmail.org

Cc: jwilk@jwilk.net

From: David Bremner


Indexing attachments in general requires external help to convert them
to text, but (most?) text/* parts should be indexable internally,
possibly with optimizations like those for the text/html case.
---
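Note for reviewers, not part of the commit: the new subtest searches for
"ersatz" because, in the corpus message, that word occurs only in the
text/x-diff attachment and not in the inline text/plain body. The Python
sketch below illustrates that distinction on a cut-down stand-in for the
corpus message. It uses only the standard library; the helper name
text_parts_containing and the abbreviated message are assumptions made
for illustration, and this is not how notmuch itself indexes mail.

```python
from email.message import EmailMessage


def text_parts_containing(msg, term):
    """Return (content type, filename) for each text/* part containing term."""
    hits = []
    for part in msg.walk():
        # Skip multipart containers and non-text leaves such as signatures.
        if part.get_content_maintype() != "text":
            continue
        body = part.get_content()  # undoes the transfer encoding and charset
        if term.lower() in body.lower():
            hits.append((part.get_content_type(), part.get_filename()))
    return hits


# Cut-down stand-in for the corpus message: an inline body plus a
# text/x-diff attachment (hypothetical content, abbreviated from the original).
msg = EmailMessage()
msg["Subject"] = "Re: [PATCH 1/2] system_data_types.7: srcfix"
msg.set_content("I'm attaching what I've just committed to groff git.\n")
msg.add_attachment(
    "Rewrite some ersatz standardese I had managed to concoct.\n",
    subtype="x-diff",
    filename="excise_standardese.diff",
)

hits = text_parts_containing(msg, "ersatz")
print(hits)
```

Only the attachment should be reported here, which is why a search that
ignores text/* attachments finds nothing and the subtest is marked as
known broken.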
 test/T050-new.sh                              |   8 +
 ...TCH-1-2-system_data_types.7-srcfix.txt:2,S | 282 ++++++++++++++++++
 2 files changed, 290 insertions(+)
 create mode 100644 test/corpora/indexing/PATCH-1-2-system_data_types.7-srcfix.txt:2,S

diff --git a/test/T050-new.sh b/test/T050-new.sh
index 6791f87c..cb67889c 100755
--- a/test/T050-new.sh
+++ b/test/T050-new.sh
@@ -455,4 +455,12 @@ Date: Fri, 17 Jun 2016 22:14:41 -0400
 EOF
 test_expect_equal_file EXPECTED OUTPUT
 
+add_email_corpus indexing
+
+test_begin_subtest "index text/* attachments"
+test_subtest_known_broken
+notmuch search id:20200930101213.2m2pt3jrspvcrxfx@localhost.localdomain > EXPECTED
+notmuch search id:20200930101213.2m2pt3jrspvcrxfx@localhost.localdomain and ersatz > OUTPUT
+test_expect_equal_file_nonempty EXPECTED OUTPUT
+
 test_done
diff --git a/test/corpora/indexing/PATCH-1-2-system_data_types.7-srcfix.txt:2,S b/test/corpora/indexing/PATCH-1-2-system_data_types.7-srcfix.txt:2,S
new file mode 100644
index 00000000..1361c6f2
--- /dev/null
+++ b/test/corpora/indexing/PATCH-1-2-system_data_types.7-srcfix.txt:2,S
@@ -0,0 +1,282 @@
+From mboxrd@z Thu Jan  1 00:00:00 1970
+Return-Path: <SRS0=/pzd=DH=vger.kernel.org=linux-man-owner@kernel.org>
+X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
+	aws-us-west-2-korg-lkml-1.web.codeaurora.org
+X-Spam-Level: 
+X-Spam-Status: No, score=-8.3 required=3.0 tests=BAYES_00,DKIM_SIGNED,
+	DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM,
+	HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,
+	SPF_PASS,URIBL_BLOCKED,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no
+	version=3.4.0
+Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
+	by smtp.lore.kernel.org (Postfix) with ESMTP id AFE3FC4727E
+	for <linux-man@archiver.kernel.org>; Wed, 30 Sep 2020 10:12:21 +0000 (UTC)
+Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
+	by mail.kernel.org (Postfix) with ESMTP id 4E0D62074A
+	for <linux-man@archiver.kernel.org>; Wed, 30 Sep 2020 10:12:21 +0000 (UTC)
+Authentication-Results: mail.kernel.org;
+	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="Osm9Pn67"
+Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
+        id S1725823AbgI3KMU (ORCPT <rfc822;linux-man@archiver.kernel.org>);
+        Wed, 30 Sep 2020 06:12:20 -0400
+Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50038 "EHLO
+        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
+        with ESMTP id S1725779AbgI3KMU (ORCPT
+        <rfc822;linux-man@vger.kernel.org>); Wed, 30 Sep 2020 06:12:20 -0400
+Received: from mail-pf1-x443.google.com (mail-pf1-x443.google.com [IPv6:2607:f8b0:4864:20::443])
+        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5026DC061755
+        for <linux-man@vger.kernel.org>; Wed, 30 Sep 2020 03:12:20 -0700 (PDT)
+Received: by mail-pf1-x443.google.com with SMTP id b124so832681pfg.13
+        for <linux-man@vger.kernel.org>; Wed, 30 Sep 2020 03:12:20 -0700 (PDT)
+DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
+        d=gmail.com; s=20161025;
+        h=date:from:to:cc:subject:message-id:references:mime-version
+         :content-disposition:in-reply-to:user-agent;
+        bh=qR1FJVXOhU6/g+m4SoSco3vMtV+CNvRvNyXS1xuG+T4=;
+        b=Osm9Pn67G380QiA1ORltntJShSHlKg/KZZfKV8ebvfEXJw9893EO0N6J6GDR+zkmHi
+         TOQuIe7x9y95Pipm54rWWEW33U3gwoXRHsPc2Kivm6L8Ixb+f0T0rMPKw/FOkL8OGo9t
+         WmmSvnlErAXHqBq9aRAJJsf2bSlDgdAyYY1Qe6PSq2hKi2rg+sOy1Vaj4RqZ6jTK/DWY
+         tX28Ql0XS3kKWp0Lc8MNsSP+SXlcdwHQYll5LeReAg1oi++hICgWphuMmo3OH+2B1WtO
+         hMH7VuUONqbuE1aLoZ6PyyUlCeN1soJd8bKY0cmY0TKCsw0Jvkuh/XzYDVNi6wOSM6Ez
+         okpA==
+X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
+        d=1e100.net; s=20161025;
+        h=x-gm-message-state:date:from:to:cc:subject:message-id:references
+         :mime-version:content-disposition:in-reply-to:user-agent;
+        bh=qR1FJVXOhU6/g+m4SoSco3vMtV+CNvRvNyXS1xuG+T4=;
+        b=TJU+duGLhruSES/5sJy4y1wfcltfokDpA58edkSUJyasvsooUo67VNtOB3ZK49iHm5
+         C/cjy0ExxTECB0aM6p+B1jcePdWoPUaVBY9bVd/Q5DNhm4KhTO8ON96gB43d2rLWLOiK
+         /Y1vCu+MwOpY0JQTojbC140s/JYccR/KPapTmbUkRzrpmeoYqw8CbBPV60rQxYCn9GUu
+         FeCXJY5q9OfaYW1viQZoBL5n1IMMpJDVa61Q8gZ33b3wRCvQv/x1eZCsVlYpjcqf7Umc
+         /Amx3i27cxvo8pSvvwiTzrlJHJv0Gkytz13i7s+zW+XKzZRyzy3yirtU2DFTGat6FeMn
+         H8Ig==
+X-Gm-Message-State: AOAM530Yon7xNOW6kiuy6bVpbpwbzR/9pldRB49OtZaSAHAZg7Gyf7qE
+        JXgAH20rZzYlwqOZyeZCeAwtWh09PeI=
+X-Google-Smtp-Source: ABdhPJxzyZAVDBtMwQ5+dUqVg37y/LgZByrSaTxvhS6wnx6sJuG8ROItw0CwDAg939XUVADeje/nZQ==
+X-Received: by 2002:a63:c547:: with SMTP id g7mr1563654pgd.234.1601460739764;
+        Wed, 30 Sep 2020 03:12:19 -0700 (PDT)
+Received: from localhost.localdomain ([1.129.172.177])
+        by smtp.gmail.com with ESMTPSA id k14sm1804437pjd.45.2020.09.30.03.12.17
+        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
+        Wed, 30 Sep 2020 03:12:19 -0700 (PDT)
+Date:   Wed, 30 Sep 2020 20:12:15 +1000
+From:   "G. Branden Robinson" <g.branden.robinson@gmail.com>
+To:     "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
+Cc:     Jakub Wilk <jwilk@jwilk.net>, linux-man@vger.kernel.org
+Subject: Re: [PATCH 1/2] system_data_types.7: srcfix
+Message-ID: <20200930101213.2m2pt3jrspvcrxfx@localhost.localdomain>
+References: <20200925080330.184303-1-colomar.6.4.3@gmail.com>
+ <20200927061015.4obt73pdhyh7wecu@localhost.localdomain>
+ <20200928132959.x4koforqnzohxh5u@jwilk.net>
+ <9b8303fe-969e-c9f0-e3cd-0590b342d5bf@gmail.com>
+MIME-Version: 1.0
+Content-Type: multipart/signed; micalg=pgp-sha256;
+        protocol="application/pgp-signature"; boundary="jg2hlfugxpumieke"
+Content-Disposition: inline
+In-Reply-To: <9b8303fe-969e-c9f0-e3cd-0590b342d5bf@gmail.com>
+User-Agent: NeoMutt/20180716
+Precedence: bulk
+List-ID: <linux-man.vger.kernel.org>
+X-Mailing-List: linux-man@vger.kernel.org
+
+
+--jg2hlfugxpumieke
+Content-Type: multipart/mixed; boundary="wl6i3r6gpq7ibouc"
+Content-Disposition: inline
+
+
+--wl6i3r6gpq7ibouc
+Content-Type: text/plain; charset=us-ascii
+Content-Disposition: inline
+Content-Transfer-Encoding: quoted-printable
+
+Hi Jakub and Michael,
+
+At 2020-09-29T14:13:26+0200, Michael Kerrisk (man-pages) wrote:
+> On 9/28/20 3:29 PM, Jakub Wilk wrote:
+> > Hi Branden!
+> >=20
+> > In groff_man_style(7) you wrote:
+> >> Unused macro arguments are more often simply omitted, or good style
+> >> suggests that a more appropriate macro be chosen, that earlier
+> >> arguments are more important than later ones, or that arguments
+> >> have identical significance such that skipping any is superfluous.
+> >=20
+> > After 15 minutes of gawking at this sentence, I still don't
+> > understand what are you trying to say here. The sentence should be
+> > either thoroughly rephrased or removed.
+>=20
+> I must say that I too found it hard to parse. I presume, Branden,
+> that you mean:
+>=20
+> [[
+> Unused macro arguments are more often simply omitted, or good style=20
+> suggests
+> EITHER (1)=20
+> that a more appropriate macro be chosen,=20
+> (2)
+> that earlier arguments are more important than later ones, or
+> (3)
+> that arguments have=20
+> identical significance such that skipping any is superfluous.
+> ]]
+
+You got it.  But it was too much work.
+
+> But it takes a few scans to work that out. Perhaps break this into
+> smaller pieces, or add some explicit structuring elements to the
+> sentence?
+
+I was trying to be comprehensive with respect to several anti-patterns I
+had in mind.  However, using the anti-patterns concretely is premature
+at that point in the page.  So I both expanded and relocated the
+material.
+
+I'm attaching what I've just committed to groff git.
+
+Further feedback is welcome, of course; revision of documentation is a
+process that is never completed, only abandoned.  And I haven't given up
+yet.  :)
+
+Thank you both for your reviews.
+
+Regards,
+Branden
+
+--wl6i3r6gpq7ibouc
+Content-Type: text/x-diff; charset=us-ascii
+Content-Disposition: attachment; filename="excise_standardese.diff"
+Content-Transfer-Encoding: quoted-printable
+
+commit dd2c4cf05a659ae7127e342924668ff0fa0deaa1
+Author: G. Branden Robinson <g.branden.robinson@gmail.com>
+Date:   Wed Sep 30 19:56:38 2020 +1000
+
+    groff_man_style(7): Clarify empty macro arguments.
+   =20
+    Rewrite some ersatz standardese I had managed to concoct regarding why
+    empty macro arguments are usually not needed.  Put an expanded
+    discussion, with anti-patterns and remedies, in section "Notes", with
+    forward reference from subsection "Macro reference preliminaries".
+   =20
+    Thanks to Jakub Wilk and Michael Kerrisk for the critique.
+
+diff --git a/tmac/groff_man.7.man.in b/tmac/groff_man.7.man.in
+index c62d97ba..b96cbaf4 100644
+--- a/tmac/groff_man.7.man.in
++++ b/tmac/groff_man.7.man.in
+@@ -281,23 +281,8 @@ but the
+ package is designed such that this should seldom be necessary.
+ _ifstyle()dnl
+ .
+-Unused macro arguments are more often simply omitted,
+-.\" antipattern: '.TP ""' (just '.TP' will do)
+-or good style suggests that a more appropriate macro be chosen,
+-.\" antipattern: '.BI "" italic bold' (use '.IB' instead)
+-that earlier arguments are more important than later ones,
+-.\" antipattern: '.TH foo 1 "" "foo "1.2.3"' (don't skip the date!)
+-.\" antipattern: '.IP "" 4n' (use .TP or .RS/.RE, depending on needs)
+-or that arguments have identical significance such that skipping any is
+-superfluous.
+-.\" antipattern: '.B one two "" three' (pointless)
+-.\"   Technically, the above has a side-effect of additional space
+-.\"   between "two" and "three", but there are much more obvious ways of
+-.\"   getting it if desired.
+-.\"     .B "one two  three"
+-.\"     .B one "two " three
+-.\"     .B one two " three"
+-.\"     .B one two\~ three
++See section \(lqNotes\(rq below for examples of cases where better
++alternatives to empty arguments in macro calls are available.
+ _endif()dnl
+ .
+ Most macro arguments are strings that will be output as text;
+@@ -3235,6 +3220,63 @@ Some tips on troubleshooting your man pages follow.
+ .
+ .
+ .TP
++\(bu Do I ever need to use an empty macro argument ("")?
++Probably not.
++.
++When this seems necessary,
++often a shorter or clearer alternative is available.
++.
++.\" antipattern: '.TP ""' (just '.TP' will do)
++.\" antipattern: '.BI "" italic bold' (use '.IB' instead)
++.\" antipattern: '.TH foo 1 "" "foo 1.2.3"' (don't skip the date!)
++.\" antipattern: '.IP "" 4n' (use .TP or .RS/.RE, depending on needs)
++.\" antipattern: '.B one two "" three' (pointless)
++.\"   Technically, the above has a side-effect of additional space
++.\"   between "two" and "three", but there are much more obvious ways of
++.\"   getting it if desired.
++.\"     .B "one two  three"
++.\"     .B one "two " three
++.\"     .B one two " three"
++.\"     .B one two\~ three
++.TS
++c c
++lfCB lfCB.
++Instead of.\|.\.	.\|.\|.do this.
++_
++\&.TP \(dq\(dq	.TP
++\&.BI \(dq\(dq italic-text bold-text	.IB italic-text bold-text
++\&.TH foo 1 \(dq\(dq \(dqfoo 1.2.3\(dq	.TH foo 1 \
++\f(CIyyyy\fP-\f(CImm\fP-\f(CIdd\fP \(dqfoo 1.2.3\(dq
++\&.IP \(dq\(dq 4n	.TP 4n
++\&.B one two \(dq\(dq three	.B one two three
++.TE
++.
++.
++.IP
++In the title heading
++.RB ( .TH ),
++the date of the page's last revision is more important than packaging
++information;
++it should not be omitted.
++.
++Ideally,
++a page maintainer will keep both up to date.
++.
++.
++.IP
++In the last example,
++the empty argument does have a subtly different effect than its
++suggested replacement;
++the empty argument becomes an additional space character\(embut it is a
++regular breaking space,
++so it can be discarded at the end of an output line.
++.
++It is better not to be subtle,
++particularly with space,
++which can be overlooked in source and rendered forms.
++.
++.
++.TP
+ .RB \(bu " .RS" " doesn't indent relative to my indented paragraph"
+ The
+ .B .RS
+
+--wl6i3r6gpq7ibouc--
+
+--jg2hlfugxpumieke
+Content-Type: application/pgp-signature; name="signature.asc"
+
+-----BEGIN PGP SIGNATURE-----
+
+iQIzBAEBCAAdFiEEh3PWHWjjDgcrENwa0Z6cfXEmbc4FAl90WfUACgkQ0Z6cfXEm
+bc5raQ/9GhXG/5U7McaEEu+aW1IgaTYTMbsMpew5u3tBlj3/IenGzsy8wDO912BD
+aHPSedYoc485k1Vh/Kowyx569RhyIXiMtH7uINCEtheMSUNgITNFqXo8mhaqVMlU
+3JoV12btQddOIqHnGX6c5V9Z38KXFmVctD6CxjLaWGLp/Bu9tSKwSaHOOmtUYyOv
+fYpMzr0amd4z9f+O8PPnToqBhwUitEvis1ZHYU6gIj8VwOjD0gNsWjA9HR3uC3c9
+GK/R5przMANrNejzSgofm0/yAL6a61WhqhYEtzLUYu2NFnsyNJWzITNsNnoxzgQ5
+liKL0Onmw0YWjOo4Z9Zht9Iyd6JhJxW0uRwlpFhE6UlCkFHK8nbv3NbHT2xlx/po
+rxY5jDC3Ap3+mdYHY8k5o8vFd4QOXc2bSTuDRZoWtFZQsjnl4Fpkqks1W54Txq4y
+o3Vu9aOPx//Jfi8sDc/qD/mFnyUu+AMFWjIj8UxQN4HmbrbXg/DEczRfP68DjOiX
+ssy/0Rmm/H1cu7oBMoSss63mpk/NvPTSzzCR+VhU4PHQ7rxSZYS105tzkBVfe37e
+hSS00rQVWe2YnI1KkfJHFjzveHiPXf+IxC0Z4PpJuLhl+pIZ/FgxJ5yEkX0XVUIy
+aYRzKI3JaJktYl6WvulKSBPzQxIyOgrqVkZW4lv/uTh64pE6E5w=
+=oeam
+-----END PGP SIGNATURE-----
+
+--jg2hlfugxpumieke--
+
-- 
2.35.1
