Karl Wiberg writes:
> On Fri, Dec 4, 2009 at 1:29 AM, Carl Worth wrote:
> > And a step beyond that would support different languages for
> > different emails, but that sounds like something "hard" to identify.
>
> But probably not as hard as identifying spam. It could probably be
> done with a simple Bayesian filter counting word frequencies---but
> it'd be much better if somebody else had already solved the problem,
> since this smells suspiciously like something that ought to be a
> separate project and put in a library ... does anyone know if such a
> project already exists?

There's TextCat:

http://www.let.rug.nl/vannoord/TextCat/

It looks at n-gram frequencies, and can guess pretty reliably from even a
fairly small amount of text.

TextCat is in Perl.  I don't know if there's a C or C++ implementation,
but it isn't a huge piece of code - finding a good technique was the
clever part of it.

Cheers,
    Olly
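
To make the n-gram idea concrete, here is a minimal sketch in Python of
ranking character n-grams by frequency and comparing a message's profile
against per-language profiles by rank distance.  This is not TextCat's code:
the function names, the profile size of 300, the "_" word-boundary padding
and the toy training sentences are all assumptions for illustration, and
real use would need far more training text per language.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Map the most frequent character n-grams (n = 1..max_n) to their rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # "_" marks word boundaries (assumed convention)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Most frequent first; ties broken alphabetically so results are repeatable.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return {gram: rank for rank, gram in enumerate(ranked[:top])}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the language cost a fixed penalty."""
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def guess_language(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# Toy training text (hypothetical, far too small for real use).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then "
               "the dog chases the fox through the woods",
    "german": "der schnelle braune fuchs springt auf den faulen hund und dann "
              "jagt der hund den fuchs durch den wald",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}
```

Calling guess_language(some_text, PROFILES) returns the key of the closest
profile.  Because only the ranks of n-grams are compared, a profile is just
a small table per language, which fits the point above that this isn't a
huge piece of code.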