Re: [PATCH v4] lib: replace the header parser with gmime

Subject: Re: [PATCH v4] lib: replace the header parser with gmime

Date: Mon, 31 Mar 2014 00:10:20 +0300

To: Austin Clements

Cc: notmuch@notmuchmail.org

From: Jani Nikula


On Sun, 30 Mar 2014, Austin Clements <amdragon@MIT.EDU> wrote:
> Various little comments below.  Overall this is looking really good.
>
> Quoth Jani Nikula on Mar 23 at 10:01 pm:
>> The notmuch library includes a full blown message header parser. Yet
>> the same message headers are parsed by gmime during indexing. Switch
>> to gmime parsing completely.
>> 
>> These are the main changes:
>> 
>> * Gmime stops header parsing at the first invalid header, and presumes
>>   the message body starts from there. The current parser is quite
>>   liberal in accepting broken headers. The change means we will be
>>   much pickier about accepting invalid messages.
>> 
>> * The current parser converts tabs used in header folding to
>>   spaces. Gmime preserve the tabs. Due to a broken python library used
>>   in mailman, there are plenty of mailing lists that produce headers
>>   with tabs in header folding, and we'll see plenty of tabs. (This
>>   change has been mitigated in preparatory patches.)
>> 
>> * For pure header parsing, the current parser is likely faster than
>>   gmime, which parses the whole message rather than just the
>>   headers. Since we parse the message and its headers using gmime for
>>   indexing anyway, this avoids and extra header parsing round when
>>   adding new messages. In case of duplicate messages, we'll end up
>>   parsing the full message although just headers would be
>>   sufficient. All in all this should still speed up 'notmuch new'.
>> 
>> * Calls to notmuch_message_get_header() may be slightly slower than
>>   previously for headers that are not indexed in the database, due to
>>   parsing of the whole message. Within the notmuch code base, notmuch
>>   reply is the only such user.
>> 
>> ---
>> 
>> This is v4 of patches [1] and [2], combining them into one. Patch [3]
>> is required to make all the tests pass.
>> 
>> The implementation has been changed considerably. Notably we only
>> cache the decoded headers to a hash table if and when they are
>> needed. The main reasons for doing caching at all is that gmime
>> provides raw headers that need decoding, and to minimize changes
>> outside of message-file.c we want to own and track the allocated
>> strings instead of pushing the responsibility to callers.
>> 
>> This addresses most comments from Austin.
>> 
>> The diff in message-file.c is rather confusing due to a mix of
>> added/removed lines. Looking at the patched file is much more
>> pleasant.
>> 
>> BR,
>> Jani.
>> 
>> [1] id:bdef45dc21c777130a7ae1fa650f172d8c6eaaf4.1391456555.git.jani@nikula.org
>> [2] id:31d785c4a3e4b90862a0fdc545d4e900a4c898e2.1391456555.git.jani@nikula.org
>> [3] id:1395247490-19892-1-git-send-email-jani@nikula.org
>> ---
>>  lib/database.cc       |  15 +-
>>  lib/index.cc          |  70 +--------
>>  lib/message-file.c    | 420 ++++++++++++++++++++------------------------------
>>  lib/notmuch-private.h |  54 +++----
>>  4 files changed, 208 insertions(+), 351 deletions(-)
>> 
>> diff --git a/lib/database.cc b/lib/database.cc
>> index aef748f75b5d..4750f40cf0fb 100644
>> --- a/lib/database.cc
>> +++ b/lib/database.cc
>> @@ -1971,15 +1971,10 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
>>      if (ret)
>>  	goto DONE;
>>  
>> -    notmuch_message_file_restrict_headers (message_file,
>> -					   "date",
>> -					   "from",
>> -					   "in-reply-to",
>> -					   "message-id",
>> -					   "references",
>> -					   "subject",
>> -					   "to",
>> -					   (char *) NULL);
>> +    /* Parse message up front to get better error status. */
>> +    ret = notmuch_message_file_parse (message_file);
>> +    if (ret)
>> +	goto DONE;
>>  
>>      try {
>>  	/* Before we do any real work, (especially before doing a
>> @@ -2066,7 +2061,7 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
>>  	    date = notmuch_message_file_get_header (message_file, "date");
>>  	    _notmuch_message_set_header_values (message, date, from, subject);
>>  
>> -	    ret = _notmuch_message_index_file (message, filename);
>> +	    ret = _notmuch_message_index_file (message, message_file);
>>  	    if (ret)
>>  		goto DONE;
>>  	} else {
>> diff --git a/lib/index.cc b/lib/index.cc
>> index 78c18cf36d10..46a019325454 100644
>> --- a/lib/index.cc
>> +++ b/lib/index.cc
>> @@ -425,63 +425,15 @@ _index_mime_part (notmuch_message_t *message,
>>  
>>  notmuch_status_t
>>  _notmuch_message_index_file (notmuch_message_t *message,
>> -			     const char *filename)
>> +			     notmuch_message_file_t *message_file)
>>  {
>> -    GMimeStream *stream = NULL;
>> -    GMimeParser *parser = NULL;
>> -    GMimeMessage *mime_message = NULL;
>> +    GMimeMessage *mime_message;
>>      InternetAddressList *addresses;
>> -    FILE *file = NULL;
>>      const char *from, *subject;
>> -    notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
>> -    static int initialized = 0;
>> -    char from_buf[5];
>> -    bool is_mbox = false;
>> -    static bool mbox_warning = false;
>> -
>> -    if (! initialized) {
>> -	g_mime_init (GMIME_ENABLE_RFC2047_WORKAROUNDS);
>> -	initialized = 1;
>> -    }
>> -
>> -    file = fopen (filename, "r");
>> -    if (! file) {
>> -	fprintf (stderr, "Error opening %s: %s\n", filename, strerror (errno));
>> -	ret = NOTMUCH_STATUS_FILE_ERROR;
>> -	goto DONE;
>> -    }
>> -
>> -    /* Is this mbox? */
>> -    if (fread (from_buf, sizeof (from_buf), 1, file) == 1 &&
>> -	strncmp (from_buf, "From ", 5) == 0)
>> -	is_mbox = true;
>> -    rewind (file);
>>  
>> -    /* Evil GMime steals my FILE* here so I won't fclose it. */
>> -    stream = g_mime_stream_file_new (file);
>> -
>> -    parser = g_mime_parser_new_with_stream (stream);
>> -    g_mime_parser_set_scan_from (parser, is_mbox);
>> -
>> -    mime_message = g_mime_parser_construct_message (parser);
>> -
>> -    if (is_mbox) {
>> -	if (!g_mime_parser_eos (parser)) {
>> -	    /* This is a multi-message mbox. */
>> -	    ret = NOTMUCH_STATUS_FILE_NOT_EMAIL;
>> -	    goto DONE;
>> -	}
>> -	/* For historical reasons, we support single-message mboxes,
>> -	 * but this behavior is likely to change in the future, so
>> -	 * warn. */
>> -	if (!mbox_warning) {
>> -	    mbox_warning = true;
>> -	    fprintf (stderr, "\
>> -Warning: %s is an mbox containing a single message,\n\
>> -likely caused by misconfigured mail delivery.  Support for single-message\n\
>> -mboxes is deprecated and may be removed in the future.\n", filename);
>> -	}
>> -    }
>> +    mime_message = notmuch_message_file_get_mime_message (message_file);
>> +    if (! mime_message)
>> +	return NOTMUCH_STATUS_FILE_NOT_EMAIL;
>
> This could also be because of out-of-memory.
> notmuch_message_file_get_mime_message (and notmuch_message_file_parse,
> in turn) should probably take a notmuch_status_t *status_ret argument
> like other internal functions.

Agreed (though I opted for returning notmuch_status_t and passing
GMimeMessage** as parameter).

>
>>  
>>      from = g_mime_message_get_sender (mime_message);
>>  
>> @@ -502,15 +454,5 @@ mboxes is deprecated and may be removed in the future.\n", filename);
>>  
>>      _index_mime_part (message, g_mime_message_get_mime_part (mime_message));
>>  
>> -  DONE:
>> -    if (mime_message)
>> -	g_object_unref (mime_message);
>> -
>> -    if (parser)
>> -	g_object_unref (parser);
>> -
>> -    if (stream)
>> -	g_object_unref (stream);
>> -
>> -    return ret;
>> +    return NOTMUCH_STATUS_SUCCESS;
>>  }
>> diff --git a/lib/message-file.c b/lib/message-file.c
>> index a2850c278b5a..88662608d319 100644
>> --- a/lib/message-file.c
>> +++ b/lib/message-file.c
>> @@ -26,30 +26,15 @@
>>  
>>  #include <glib.h> /* GHashTable */
>>  
>> -typedef struct {
>> -    char *str;
>> -    size_t size;
>> -    size_t len;
>> -} header_value_closure_t;
>> -
>>  struct _notmuch_message_file {
>>      /* File object */
>>      FILE *file;
>> +    char *filename;
>>  
>> -    /* Header storage */
>> -    int restrict_headers;
>> +    /* Cache for decoded headers */
>>      GHashTable *headers;
>> -    int broken_headers;
>> -    int good_headers;
>> -    size_t header_size; /* Length of full message header in bytes. */
>> -
>> -    /* Parsing state */
>> -    char *line;
>> -    size_t line_size;
>> -    header_value_closure_t value;
>>  
>> -    int parsing_started;
>> -    int parsing_finished;
>> +    GMimeMessage *message;
>>  };
>>  
>>  static int
>> @@ -76,15 +61,12 @@ strcase_hash (const void *ptr)
>>  static int
>>  _notmuch_message_file_destructor (notmuch_message_file_t *message)
>>  {
>> -    if (message->line)
>> -	free (message->line);
>> -
>> -    if (message->value.size)
>> -	free (message->value.str);
>> -
>>      if (message->headers)
>>  	g_hash_table_destroy (message->headers);
>>  
>> +    if (message->message)
>> +	g_object_unref (message->message);
>> +
>>      if (message->file)
>>  	fclose (message->file);
>>  
>> @@ -102,20 +84,17 @@ _notmuch_message_file_open_ctx (void *ctx, const char *filename)
>>      if (unlikely (message == NULL))
>>  	return NULL;
>>  
>> +    /* Only needed for error messages during parsing. */
>> +    message->filename = talloc_strdup (message, filename);
>> +    if (message->filename == NULL)
>> +	goto FAIL;
>> +
>>      talloc_set_destructor (message, _notmuch_message_file_destructor);
>>  
>>      message->file = fopen (filename, "r");
>>      if (message->file == NULL)
>>  	goto FAIL;
>>  
>> -    message->headers = g_hash_table_new_full (strcase_hash,
>> -					      strcase_equal,
>> -					      free,
>> -					      g_free);
>> -
>> -    message->parsing_started = 0;
>> -    message->parsing_finished = 0;
>> -
>>      return message;
>>  
>>    FAIL:
>> @@ -137,264 +116,205 @@ notmuch_message_file_close (notmuch_message_file_t *message)
>>      talloc_free (message);
>>  }
>>  
>> -void
>> -notmuch_message_file_restrict_headersv (notmuch_message_file_t *message,
>> -					va_list va_headers)
>> +notmuch_status_t
>> +notmuch_message_file_parse (notmuch_message_file_t *message)
>>  {
>> -    char *header;
>> +    GMimeStream *stream;
>> +    GMimeParser *parser;
>> +    notmuch_status_t status = NOTMUCH_STATUS_SUCCESS;
>> +    static int initialized = 0;
>> +    char from_buf[5];
>> +    notmuch_bool_t is_mbox = FALSE;
>> +    static notmuch_bool_t mbox_warning = FALSE;
>>  
>> -    if (message->parsing_started)
>> -	INTERNAL_ERROR ("notmuch_message_file_restrict_headers called after parsing has started");
>> +    if (message->message)
>> +	return NOTMUCH_STATUS_SUCCESS;
>>  
>> -    while (1) {
>> -	header = va_arg (va_headers, char*);
>> -	if (header == NULL)
>> -	    break;
>> -	g_hash_table_insert (message->headers,
>> -			     xstrdup (header), NULL);
>> +    if (! initialized) {
>> +	g_mime_init (GMIME_ENABLE_RFC2047_WORKAROUNDS);
>> +	initialized = 1;
>>      }
>>  
>> -    message->restrict_headers = 1;
>> -}
>> -
>> -void
>> -notmuch_message_file_restrict_headers (notmuch_message_file_t *message, ...)
>> -{
>> -    va_list va_headers;
>> -
>> -    va_start (va_headers, message);
>> +    message->headers = g_hash_table_new_full (strcase_hash, strcase_equal,
>> +					      free, g_free);
>> +    if (! message->headers)
>> +	return NOTMUCH_STATUS_OUT_OF_MEMORY;
>>  
>> -    notmuch_message_file_restrict_headersv (message, va_headers);
>> -}
>> +    /* Is this mbox? */
>> +    if (fread (from_buf, sizeof (from_buf), 1, message->file) == 1 &&
>> +	strncmp (from_buf, "From ", 5) == 0)
>> +	is_mbox = TRUE;
>> +    rewind (message->file);
>>  
>> -static void
>> -copy_header_unfolding (header_value_closure_t *value,
>> -		       const char *chunk)
>> -{
>> -    char *last;
>> +    stream = g_mime_stream_file_new (message->file);
>>  
>> -    if (chunk == NULL)
>> -	return;
>> +    /* We'll own and fclose the FILE* ourselves. */
>> +    g_mime_stream_file_set_owner (GMIME_STREAM_FILE (stream), FALSE);
>>  
>> -    while (*chunk == ' ' || *chunk == '\t')
>> -	chunk++;
>> +    parser = g_mime_parser_new_with_stream (stream);
>> +    g_mime_parser_set_scan_from (parser, is_mbox);
>>  
>> -    if (value->len + 1 + strlen (chunk) + 1 > value->size) {
>> -	unsigned int new_size = value->size;
>> -	if (value->size == 0)
>> -	    new_size = strlen (chunk) + 1;
>> -	else
>> -	    while (value->len + 1 + strlen (chunk) + 1 > new_size)
>> -		new_size *= 2;
>> -	value->str = xrealloc (value->str, new_size);
>> -	value->size = new_size;
>> +    message->message = g_mime_parser_construct_message (parser);
>> +    if (! message->message) {
>> +	status = NOTMUCH_STATUS_FILE_NOT_EMAIL;
>> +	goto DONE;
>>      }
>>  
>> -    last = value->str + value->len;
>> -    if (value->len) {
>> -	*last = ' ';
>> -	last++;
>> -	value->len++;
>> +    if (is_mbox) {
>> +	if (! g_mime_parser_eos (parser)) {
>> +	    /* This is a multi-message mbox. */
>> +	    status = NOTMUCH_STATUS_FILE_NOT_EMAIL;
>> +	    goto DONE;
>> +	}
>> +	/*
>> +	 * For historical reasons, we support single-message mboxes,
>> +	 * but this behavior is likely to change in the future, so
>> +	 * warn.
>> +	 */
>> +	if (! mbox_warning) {
>> +	    mbox_warning = TRUE;
>> +	    fprintf (stderr, "\
>> +Warning: %s is an mbox containing a single message,\n\
>> +likely caused by misconfigured mail delivery.  Support for single-message\n\
>> +mboxes is deprecated and may be removed in the future.\n", message->filename);
>
> I still wonder if this is the right place to issue this warning.  For
> example, doesn't this mean that, when constructing a reply to a single
> message mbox, we'll issue this warning?
>
> Or maybe it's time to drop single message mbox support.  We deprecated
> it in 0.15.1, which was over a year ago.  OTOH, wheezy still has
> 0.13.2, so there could be people still using mboxes without the
> warning.  OTOOH, we only kept this around to appear David anyway.  So
> maybe the answer is to leave this in an not think too hard about it.

I'm adding a prep patch to drop mbox support altogether.

>
>> +	}
>>      }
>>  
>> -    strcpy (last, chunk);
>> -    value->len += strlen (chunk);
>> +  DONE:
>> +    g_object_unref (stream);
>> +    g_object_unref (parser);
>> +
>> +    if (status) {
>> +	g_hash_table_destroy (message->headers);
>> +	message->headers = NULL;
>>  
>> -    last = value->str + value->len - 1;
>> -    if (*last == '\n') {
>> -	*last = '\0';
>> -	value->len--;
>> +	if (message->message) {
>> +	    g_object_unref (message->message);
>> +	    message->message = NULL;
>> +	}
>> +
>> +	rewind (message->file);
>>      }
>> +
>> +    return status;
>>  }
>>  
>> -/* As a special-case, a value of NULL for header_desired will force
>> - * the entire header to be parsed if it is not parsed already. This is
>> - * used by the _notmuch_message_file_get_headers_end function.
>> - * Another special case is the Received: header. For this header we
>> - * want to concatenate all instances of the header instead of just
>> - * hashing the first instance as we use this when analyzing the path
>> - * the mail has taken from sender to recipient.
>> - */
>> -const char *
>> -notmuch_message_file_get_header (notmuch_message_file_t *message,
>> -				 const char *header_desired)
>> +GMimeMessage *
>> +notmuch_message_file_get_mime_message (notmuch_message_file_t *message)
>>  {
>> -    int contains;
>> -    char *header, *decoded_value, *header_sofar, *combined_header;
>> -    const char *s, *colon;
>> -    int match, newhdr, hdrsofar, is_received;
>> -    static int initialized = 0;
>> +    if (notmuch_message_file_parse (message))
>> +	return NULL;
>>  
>> -    is_received = (strcmp(header_desired,"received") == 0);
>> +    return message->message;
>> +}
>>  
>> -    if (! initialized) {
>> -	g_mime_init (GMIME_ENABLE_RFC2047_WORKAROUNDS);
>> -	initialized = 1;
>> -    }
>> +/*
>> + * Get all instances of a header decoded and concatenated.
>> + *
>> + * The result must be freed using g_free().
>> + *
>> + * Return NULL on errors, empty string for non-existing headers.
>> + */
>> +static char *
>> +_notmuch_message_file_get_combined_header (notmuch_message_file_t *message,
>> +					   const char *header)
>> +{
>> +    GMimeHeaderList *headers;
>> +    GMimeHeaderIter *iter;
>> +    char *combined = NULL;
>>  
>> -    message->parsing_started = 1;
>> -
>> -    if (header_desired == NULL)
>> -	contains = 0;
>> -    else
>> -	contains = g_hash_table_lookup_extended (message->headers,
>> -						 header_desired, NULL,
>> -						 (gpointer *) &decoded_value);
>> -
>> -    if (contains && decoded_value)
>> -	return decoded_value;
>> -
>> -    if (message->parsing_finished)
>> -	return "";
>> -
>> -#define NEXT_HEADER_LINE(closure)				\
>> -    while (1) {							\
>> -	ssize_t bytes_read = getline (&message->line,		\
>> -				      &message->line_size,	\
>> -				      message->file);		\
>> -	if (bytes_read == -1) {					\
>> -	    message->parsing_finished = 1;			\
>> -	    break;						\
>> -	}							\
>> -	if (*message->line == '\n') {				\
>> -	    message->parsing_finished = 1;			\
>> -	    break;						\
>> -	}							\
>> -	if (closure &&						\
>> -	    (*message->line == ' ' || *message->line == '\t'))	\
>> -	{							\
>> -	    copy_header_unfolding ((closure), message->line);	\
>> -	}							\
>> -	if (*message->line == ' ' || *message->line == '\t')	\
>> -	    message->header_size += strlen (message->line);	\
>> -	else							\
>> -	    break;						\
>> -    }
>> +    headers = g_mime_object_get_header_list (GMIME_OBJECT (message->message));
>> +    if (! headers)
>> +	return NULL;
>>  
>> -    if (message->line == NULL)
>> -	NEXT_HEADER_LINE (NULL);
>> +    iter = g_mime_header_iter_new ();
>> +    if (! iter)
>> +	return NULL;
>>  
>> -    while (1) {
>> +    if (! g_mime_header_list_get_iter (headers, iter))
>> +	goto DONE;
>>  
>> -	if (message->parsing_finished)
>> -	    break;
>> +    do {
>> +    const char *value;
>
> Incorrect indentation.

I wonder what happened there!

>
>> +	char *decoded;
>>  
>> -	colon = strchr (message->line, ':');
>> +	if (strcasecmp (g_mime_header_iter_get_name (iter), header) != 0)
>> +	    continue;
>>  
>> -	if (colon == NULL) {
>> -	    message->broken_headers++;
>> -	    /* A simple heuristic for giving up on things that just
>> -	     * don't look like mail messages. */
>> -	    if (message->broken_headers >= 10 &&
>> -		message->good_headers < 5)
>> -	    {
>> -		message->parsing_finished = 1;
>> -		break;
>> +	value = g_mime_header_iter_get_value (iter);
>> +	decoded = g_mime_utils_header_decode_text (value);
>
> Maybe a comment saying that GMime retains ownership of value, but
> decoded must be freed with g_free?  GMime is confusing about
> ownership, and the fact that value *doesn't* have to be freed (which
> is not documented by GMime!) just adds to the confusion.  I could see
> this being easy to screw up during later code changes.

Agreed.

>
>> +	if (! decoded) {
>> +	    if (combined) {
>> +		g_free (combined);
>> +		combined = NULL;
>>  	    }
>> -	    NEXT_HEADER_LINE (NULL);
>> -	    continue;
>> +	    goto DONE;
>>  	}
>>  
>> -	message->header_size += strlen (message->line);
>> +	if (combined) {
>> +	    char *tmp = g_strdup_printf ("%s %s", combined, decoded);
>> +	    g_free (decoded);
>> +	    g_free (combined);
>> +	    if (! tmp) {
>> +		combined = NULL;
>> +		goto DONE;
>> +	    }
>>  
>> -	message->good_headers++;
>> +	    combined = tmp;
>> +	} else {
>> +	    combined = decoded;
>> +	}
>> +    } while (g_mime_header_iter_next (iter));
>
> Yet another funny GMime API...  Did you consider
>
> for (; g_mime_header_iter_is_valid (iter); g_mime_header_iter_next (iter))
>
> ?  I realize that the g_mime_header_iter_is_valid isn't actually
> necessary because g_mime_header_list_get_iter returns NULL if the
> iterator would be empty, but using a for-loop would make it more
> obvious that this code is correct.

As I mentioned on IRC, that would lead to an infinite loop. Ugh.

>
>>  
>> -	header = xstrndup (message->line, colon - message->line);
>> +    /* Return empty string for non-existing headers. */
>> +    if (! combined)
>> +	combined = g_strdup ("");
>>  
>> -	if (message->restrict_headers &&
>> -	    ! g_hash_table_lookup_extended (message->headers,
>> -					    header, NULL, NULL))
>> -	{
>> -	    free (header);
>> -	    NEXT_HEADER_LINE (NULL);
>> -	    continue;
>> -	}
>> +  DONE:
>> +    g_mime_header_iter_free (iter);
>>  
>> -	s = colon + 1;
>> -	while (*s == ' ' || *s == '\t')
>> -	    s++;
>> +    return combined;
>> +}
>
> Oof.  That received header had better be worth it!

I'm not sure it is, but we've got users for it, so...

>
>>  
>> -	message->value.len = 0;
>> -	copy_header_unfolding (&message->value, s);
>> +/* Return NULL on errors, empty string for non-existing headers. */
>
> This may sound strange, but I'd be inclined to omit this comment,
> since it made me not realize I should go looking for the much more
> thorough comment in notmuch-private.h.

Agreed.

>
>> +const char *
>> +notmuch_message_file_get_header (notmuch_message_file_t *message,
>> +				 const char *header)
>> +{
>> +    const char *value = NULL;
>> +    char *decoded;
>> +    notmuch_bool_t found;
>>  
>> -	NEXT_HEADER_LINE (&message->value);
>> +    if (notmuch_message_file_parse (message))
>> +	return NULL;
>>  
>> -	if (header_desired == NULL)
>> -	    match = 0;
>> +    /* If we have a cached decoded value, use it. */
>> +    found = g_hash_table_lookup_extended (message->headers, header,
>> +					  NULL, (gpointer *) &value);
>> +    if (found)
>> +	return value ? value : "";
>
> When would header be found, but value be NULL?  (If this can't in fact
> happen, how about using g_hash_table_lookup?)

Right. This was a remnant of header restricting which has now been
removed. Switching to g_hash_table_lookup.

>
>> +
>> +    if (strcasecmp (header, "received") == 0) {
>> +	/*
>> +	 * The Received: header is special. We concatenate all
>> +	 * instances of the header as we use this when analyzing the
>> +	 * path the mail has taken from sender to recipient.
>> +	 */
>> +	decoded = _notmuch_message_file_get_combined_header (message, header);
>> +    } else {
>> +	value = g_mime_object_get_header (GMIME_OBJECT (message->message),
>> +					  header);
>> +	if (value)
>> +	    decoded = g_mime_utils_header_decode_text (value);
>>  	else
>> -	    match = (strcasecmp (header, header_desired) == 0);
>> -
>> -	decoded_value = g_mime_utils_header_decode_text (message->value.str);
>> -	header_sofar = (char *)g_hash_table_lookup (message->headers, header);
>> -	/* we treat the Received: header special - we want to concat ALL of 
>> -	 * the Received: headers we encounter.
>> -	 * for everything else we return the first instance of a header */
>> -	if (strcasecmp(header, "received") == 0) {
>> -	    if (header_sofar == NULL) {
>> -		/* first Received: header we encountered; just add it */
>> -		g_hash_table_insert (message->headers, header, decoded_value);
>> -	    } else {
>> -		/* we need to add the header to those we already collected */
>> -		newhdr = strlen(decoded_value);
>> -		hdrsofar = strlen(header_sofar);
>> -		combined_header = g_malloc(hdrsofar + newhdr + 2);
>> -		strncpy(combined_header,header_sofar,hdrsofar);
>> -		*(combined_header+hdrsofar) = ' ';
>> -		strncpy(combined_header+hdrsofar+1,decoded_value,newhdr+1);
>> -		g_free (decoded_value);
>> -		g_hash_table_insert (message->headers, header, combined_header);
>> -	    }
>> -	} else {
>> -	    if (header_sofar == NULL) {
>> -		/* Only insert if we don't have a value for this header, yet. */
>> -		g_hash_table_insert (message->headers, header, decoded_value);
>> -	    } else {
>> -		free (header);
>> -		g_free (decoded_value);
>> -		decoded_value = header_sofar;
>> -	    }
>> -	}
>> -	/* if we found a match we can bail - unless of course we are
>> -	 * collecting all the Received: headers */
>> -	if (match && !is_received)
>> -	    return decoded_value;
>> -    }
>> -
>> -    if (message->parsing_finished) {
>> -        fclose (message->file);
>> -        message->file = NULL;
>> +	    decoded = g_strdup ("");
>>      }
>>  
>> -    if (message->line)
>> -	free (message->line);
>> -    message->line = NULL;
>> -
>> -    if (message->value.size) {
>> -	free (message->value.str);
>> -	message->value.str = NULL;
>> -	message->value.size = 0;
>> -	message->value.len = 0;
>> -    }
>> +    if (! decoded)
>> +	return NULL;
>>  
>> -    /* For the Received: header we actually might end up here even
>> -     * though we found the header (as we force continued parsing
>> -     * in that case). So let's check if that's the header we were
>> -     * looking for and return the value that we found (if any)
>> -     */
>> -    if (is_received)
>> -	return (char *)g_hash_table_lookup (message->headers, "received");
>> -
>> -    /* We've parsed all headers and never found the one we're looking
>> -     * for. It's probably just not there, but let's check that we
>> -     * didn't make a mistake preventing us from seeing it. */
>> -    if (message->restrict_headers && header_desired &&
>> -	! g_hash_table_lookup_extended (message->headers,
>> -					header_desired, NULL, NULL))
>> -    {
>> -	INTERNAL_ERROR ("Attempt to get header \"%s\" which was not\n"
>> -			"included in call to notmuch_message_file_restrict_headers\n",
>> -			header_desired);
>> -    }
>> +    /* Cache the decoded value. We also own the strings. */
>> +    g_hash_table_insert (message->headers, xstrdup (header), decoded);
>>  
>> -    return "";
>> +    return decoded;
>>  }
>> diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
>> index 59eb2bc285a5..734a4e338554 100644
>> --- a/lib/notmuch-private.h
>> +++ b/lib/notmuch-private.h
>> @@ -46,6 +46,8 @@ NOTMUCH_BEGIN_DECLS
>>  
>>  #include <talloc.h>
>>  
>> +#include <gmime/gmime.h>
>> +
>>  #include "xutil.h"
>>  #include "error_util.h"
>>  
>> @@ -320,13 +322,6 @@ notmuch_message_set_author (notmuch_message_t *message, const char *author);
>>  const char *
>>  notmuch_message_get_author (notmuch_message_t *message);
>>  
>> -
>> -/* index.cc */
>> -
>> -notmuch_status_t
>> -_notmuch_message_index_file (notmuch_message_t *message,
>> -			     const char *filename);
>> -
>>  /* message-file.c */
>>  
>>  /* XXX: I haven't decided yet whether these will actually get exported
>> @@ -352,32 +347,31 @@ _notmuch_message_file_open_ctx (void *ctx, const char *filename);
>>  void
>>  notmuch_message_file_close (notmuch_message_file_t *message);
>>  
>> -/* Restrict 'message' to only save the named headers.
>> +/* Parse the message.
>>   *
>> - * When the caller is only interested in a short list of headers,
>> - * known in advance, calling this function can avoid wasted time and
>> - * memory parsing/saving header values that will never be needed.
>> + * This will be done automatically as necessary on other calls
>> + * depending on it, but an explicit call allows for better error
>> + * status reporting.
>> + */
>> +notmuch_status_t
>> +notmuch_message_file_parse (notmuch_message_file_t *message);
>
> Most internal functions have a leading _.  (We've diverged from that
> in a few places, especially in notmuch_message_file_*, but it would be
> nice to not diverge further from that convention.)

Fixed.

>
>> +
>> +/* Get the gmime message of a message file.
>>   *
>> - * The variable arguments should be a list of const char * with a
>> - * final '(const char *) NULL' to terminate the list.
>> + * The message file is parsed as necessary.
>>   *
>> - * If this function is called, it must be called before any calls to
>> - * notmuch_message_get_header for this message.
>> + * Returns GMimeMessage* on success (which the caller must not unref),
>> + * NULL if the message file parsing fails.
>>   *
>> - * After calling this function, if notmuch_message_get_header is
>> - * called with a header name not in this list, then NULL will be
>> - * returned even if that header exists in the actual message.
>> + * XXX: Would be nice to not have to expose GMimeMessage here.
>>   */
>> -void
>> -notmuch_message_file_restrict_headers (notmuch_message_file_t *message, ...);
>> -
>> -/* Identical to notmuch_message_restrict_headers but accepting a va_list. */
>> -void
>> -notmuch_message_file_restrict_headersv (notmuch_message_file_t *message,
>> -					va_list va_headers);
>> +GMimeMessage *
>> +notmuch_message_file_get_mime_message (notmuch_message_file_t *message);
>
> Same comment about _.

Ditto.

>
>>  
>>  /* Get the value of the specified header from the message as a UTF-8 string.
>>   *
>> + * The message file is parsed as necessary.
>> + *
>>   * The header name is case insensitive.
>>   *
>>   * The Received: header is special - for it all Received: headers in
>> @@ -387,13 +381,19 @@ notmuch_message_file_restrict_headersv (notmuch_message_file_t *message,
>>   * only until the message is closed. The caller should copy it if
>>   * needing to modify the value or to hold onto it for longer.
>>   *
>> - * Returns NULL if the message does not contain a header line matching
>> - * 'header'.
>> + * Returns NULL on errors, empty string if the message does not
>> + * contain a header line matching 'header'.
>>   */
>>  const char *
>>  notmuch_message_file_get_header (notmuch_message_file_t *message,
>>  				 const char *header);
>>  
>> +/* index.cc */
>> +
>> +notmuch_status_t
>> +_notmuch_message_index_file (notmuch_message_t *message,
>> +			     notmuch_message_file_t *message_file);
>> +
>
> Any particular reason this floated down?  (Obviously it's not an
> actual problem.)

It needs the typedef for notmuch_message_file_t. Previous versions of
the patch moved the typedef, but this keeps the typedef next to other
stuff from message-file.c.


Thanks for the review.

BR,
Jani.



>
>>  /* messages.c */
>>  
>>  typedef struct _notmuch_message_node {

Thread: