Quoth David Bremner on Nov 25 at 8:05 pm:
> Austin Clements <amdragon@MIT.EDU> writes:
>
> >> +add_email_corpus takes arguments "--small" and "--medium" for when you
> >> +want smaller corpuses to check.
> >
> > "corpora"?
>
> reworded to say
>
> ,----
> | add_email_corpus takes arguments "--small" and "--medium" for when you
> | want smaller subsets of the corpus to check.
> `----

That's clearer.

> > I'm a bit confused by this. What happens if you don't specify --small
> > or --medium? Is the "large"/default corpus just the combined small
> > and medium corpora? Would be worth a comment, at least.
>
> Hopefully the README makes this clear(er) now?

The README definitely helps. Might still be worth a comment in the
code since it took me some thinking to realize it would do something
reasonable when given no argument. Perhaps above the initial
assignment of arg,
  # With no argument, use the entire (combined) corpus
to acknowledge that this is a legitimate and intentional code path?

> > This probably doesn't matter now, but I wonder if we want to unpack on
> > first use to somewhere not test-specific and then cp -rl the corpus
> > into the test directory. I haven't tried unpacking the corpus yet,
> > but if you're running tests repeatedly to compare results, or running
> > more than one performance test, it seems like a full decompress and
> > unpack could get onerous.
>
> Hmm. On my machine it is 10s for the copy versus 45s for a full
> unpack. For some reason I tested with "cp -a" which is incredibly slow,
> so I thought there was no loss. For comparison the basic test takes
> about 10 minutes on the same machine.
>
> In any case this can wait until we have a second test file and a second
> call to add_mail_corpus, adding caching now would not help.

It would help (a little) if you run basic multiple times. I think
it's completely reasonable to leave it as is for now and see if
caching would help down the road.
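
For reference, the unpack-once-then-"cp -rl" idea could look roughly
like the sketch below. This is only an illustration, not code from the
patch: the function name cached_corpus_copy and its three arguments are
made up here, and it assumes GNU cp (for -l) and that the cache and the
test directory live on the same file system so hard links are possible.

```shell
# Hypothetical sketch: unpack the corpus once into a shared cache, then
# hard-link it into each test directory.  All names are illustrative.
cached_corpus_copy () {
    tarball=$1    # compressed corpus archive
    cache=$2      # shared directory that is not test-specific
    dest=$3       # per-test destination (must not exist yet)

    # Unpack only on first use; the stamp file marks a completed unpack,
    # so an interrupted unpack is retried rather than trusted.
    if [ ! -e "$cache/.unpacked" ]; then
        mkdir -p "$cache/corpus" &&
        tar -xf "$tarball" -C "$cache/corpus" &&
        touch "$cache/.unpacked" || return 1
    fi

    # cp -rl recreates the directory tree but hard-links the files
    # instead of copying their contents.
    cp -rl "$cache/corpus" "$dest"
}
```

A call would be something like
  cached_corpus_copy corpus.tar.xz "$HOME/.cache/corpus" "$TMP/corpus"
with all three paths being placeholders. One caveat of hard links: a
test that rewrites a message file in place would also change the cached
copy, so this only suits tests that treat the corpus as read-only.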