On Sun, Aug 19 2012, david@tethera.net wrote: > From: David Bremner <bremner@debian.org> > > Initial use case is testing dump and restore, so we only have > message-ids and tags. > > The message ID's are nothing like RFC compliant, but it doesn't seem > any harder to roundtrip random UTF-8 strings than RFC-compliant ones. > > Tags are UTF-8, even though notmuch is in principle more generous than > that. > --- Mostly LGTM (the whole series). Few comments inline... Finally, 6/6 adds known broken test -- when will we see this code taken into use and the broken test fixed :) > test/.gitignore | 1 + > test/Makefile.local | 9 +++ > test/basic | 2 +- > test/random-corpus.c | 202 ++++++++++++++++++++++++++++++++++++++++++++++++++ > 4 files changed, 213 insertions(+), 1 deletion(-) > create mode 100644 test/random-corpus.c [ ... ] > > diff --git a/test/random-corpus.c b/test/random-corpus.c > new file mode 100644 > index 0000000..8c5b559 > --- /dev/null > +++ b/test/random-corpus.c [ ... ] > + > +/* Current largest UTF-32 value defined. Note that most of these will > + * be printed as boxes in most fonts. > + */ Should we be talking about UTF-8 valies. UTF-8 (currently has the same limit). > + > +#define GLYPH_MAX 0x10FFFE > + > +static gunichar > +random_unichar () > +{ > + int start = 1, stop = GLYPH_MAX; > + int class = random() % 2; > + > + /* > + * Choose about half ascii as test characters, as ascii > + * punctation and whitespace is the main cause of problems for > + * the (old) restore parser > + */ > + switch (class) { > + case 0: > + /* ascii */ > + start = 0x01; > + stop = 0x7f; > + break; > + case 1: > + /* the rest of unicode */ > + start = 0x80; > + stop = GLYPH_MAX; > + } > + > + if (start == stop) > + return start; > + else > + return start + (random() % (stop - start + 1)); > +} > + > +static char * > +random_utf8_string (void *ctx, size_t char_count) > +{ > + > + gchar *buf = NULL; > + size_t buf_size = 0; > + > + size_t offset = 0; > + > + size_t i; > + > + buf = talloc_realloc (ctx, NULL, gchar, char_count); > + buf_size = char_count; > + > + for (i = 0; i < char_count; i++) { > + gunichar randomchar; > + size_t written; > + > + /* 6 for one glyph, one for null */ > + if (buf_size - offset < 8) { > + buf_size += 16; > + buf = talloc_realloc (ctx, buf, gchar, buf_size); This reallocation will hit many times, as originally there was just char_count bytes allocated -- this limit will probably get hit before halfway the creation of random string (half uses 1 byte, other half 2, 3 or 4 bytes, mostly 4 (even only half of the 4-byte range is used...) Maybe originally allocating char_count * 2 + 8 and if realloc required (char_count - i) * 2 + 8... or maybe better, just doing the latter realloc and replacing first with buf = NULL; buf_size = 0; Alternatively you could play with random states; calculate size, reset random state, alloc size + 1 and write chars. > + } > + > + randomchar = random_unichar(); > + > + written = g_unichar_to_utf8 (randomchar, buf + offset); > + > + if (written <= 0) { > + fprintf (stderr, "error converting to utf8\n"); > + exit (1); > + } > + > + offset += written; > + > + } Above there is extra newline. There are a few others in other files (at least after opening and before closing brace). Maybe uncrustify your source :) > + buf[offset] = 0; > + return buf; > +} > + [ ... ] Tomi