Tuesday, July 10, 2012
Google Ngrams part 2
I originally thought, probably like most people, that Google Ngrams was pretty fun and amazing, until I figured out that much of the underlying data is simply wrong. I am kind of impressed at how bad the actual dataset is. I think there are a few problems that could most likely be fixed easily. One is fixing errors from the optical character recognition (OCR): there are consistent mistakes, such as English letters being recognized as non-letter characters, that could be caught with some simple parsing. Google is also using a very loose interpretation of what a word is; what they actually mean is a string of characters. I don't think many (any?) languages consider $0.00 or 2& a word.
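As a rough illustration of the kind of simple parsing I mean, here is a minimal sketch that keeps only tokens made of letters and throws away strings like "$0.00" or "2&". It assumes the standard tab-separated 1-gram layout (ngram, year, match_count, volume_count); the file path and function names are just mine, not anything Google provides.

```python
import csv
import re
import sys

# Keep tokens made entirely of ASCII letters, optionally with an internal
# apostrophe or hyphen; this drops strings like "$0.00" or "2&".
WORD_RE = re.compile(r"^[A-Za-z]+(?:['-][A-Za-z]+)*$")

def filter_ngram_rows(path):
    """Yield rows from a 1-gram file whose token looks like an actual word.

    Assumes a tab-separated layout: ngram, year, match_count, volume_count.
    """
    with open(path, encoding="utf-8") as fh:
        for token, year, match_count, volume_count in csv.reader(fh, delimiter="\t"):
            # Strip part-of-speech suffixes like "_NOUN" if present.
            bare = token.split("_")[0]
            if WORD_RE.match(bare):
                yield bare, int(year), int(match_count), int(volume_count)

if __name__ == "__main__":
    kept = sum(1 for _ in filter_ngram_rows(sys.argv[1]))
    print(f"rows kept: {kept}")
```

A dictionary word list could be swapped in for the regex if you wanted to be stricter, at the cost of losing legitimate rare words.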
One of the other problems is proper nouns and non-common, non-English words, for example words in books that are being actively translated in the text (see here for an example). Figure 2 in the original paper would be completely wrong by all normal standards, because it leans so heavily on frequency and those frequencies would be wrong. The reason so many of the words in the ngrams are not in the dictionary is that they either are not words or are not English words! Further, there is a clear dependence of the number of ngrams, and most likely of spurious ngrams, on the number of books for a given year (though I have not verified the spurious-ngram part).
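One way to at least control for that book-count dependence is to normalize each word's yearly match count by the total number of tokens counted that year. A minimal sketch, assuming you have already built a per-year totals mapping (Google does distribute total-count data, but I'm treating its exact format as an assumption here) and that the rows come from something like filter_ngram_rows() above:

```python
from collections import defaultdict

def relative_frequencies(rows, totals_by_year):
    """Convert raw match counts to per-year relative frequencies.

    rows: iterable of (token, year, match_count, volume_count) tuples,
          e.g. the output of filter_ngram_rows() above.
    totals_by_year: dict mapping year -> total match count for that year,
          built separately from the corpus's total-counts data.
    """
    freqs = defaultdict(dict)
    for token, year, match_count, _ in rows:
        total = totals_by_year.get(year)
        if total:  # skip years with no recorded total
            freqs[token][year] = match_count / total
    return freqs
```

This doesn't fix spurious ngrams, but it stops a word from looking more popular in a year simply because more books were scanned for that year.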
While writing this I stumbled upon Microsoft's Web N-gram service (http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx), which I will look at soon. Hopefully in the next two weeks I will put up the basic information I found from my language analysis. I am going to the Netherlands to perform some ultrafast laser spectroscopy, so hopefully I will have lots of free time to code and write, as I heard that they don't work on the weekends overseas. Haha.