Make Words Usable for Machine Learning
| | twinkle | little | star | all | the | night |
|---|---|---|---|---|---|---|
| Twinkle, twinkle, little star. | 2 | 1 | 1 | 0 | 0 | 0 |
| Twinkle, twinkle, all the night. | 2 | 0 | 0 | 1 | 1 | 1 |
Build a Matrix
While our example was simple (a vocabulary of just 6 words), term frequency matrices built on larger datasets quickly become unwieldy.
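The matrix above can be built with plain Python; this is a minimal sketch (no NLP libraries assumed), tokenizing each line and counting each vocabulary word per document:

```python
import re
from collections import Counter

# The two example lines from the table above.
docs = [
    "Twinkle, twinkle, little star.",
    "Twinkle, twinkle, all the night.",
]

# Tokenize: lowercase each line and keep only runs of letters.
tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]

# Vocabulary = every unique word, in first-seen order (the matrix columns).
vocab = []
for tokens in tokenized:
    for tok in tokens:
        if tok not in vocab:
            vocab.append(tok)

# One row of term counts per document.
matrix = [[Counter(tokens)[word] for word in vocab] for tokens in tokenized]

print(vocab)   # ['twinkle', 'little', 'star', 'all', 'the', 'night']
print(matrix)  # [[2, 1, 1, 0, 0, 0], [2, 0, 0, 1, 1, 1]]
```

Each row is one document and each column one vocabulary word, matching the table above.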
Imagine turning every word in the Oxford English Dictionary into a column of a matrix: that’s 171,476 columns. Now imagine adding everyone’s name, and every corporation, product, or street name that has ever existed. Now feed it slang. Feed it every rap song. Feed it fantasy novels like Lord of the Rings and Harry Potter so that our model knows what to do when it encounters “The Shire” or “Hogwarts.” Good, and that’s just English. Now do the same for Russian, Mandarin, and every other language.
After this is accomplished, we are approaching a matrix with several billion columns, and two problems arise. First, it becomes computationally infeasible and memory-intensive to perform calculations over such a matrix. Second, the curse of dimensionality kicks in: distance measurements grow so absurdly large in scale that they all start to look the same. Most of the research and time that goes into natural language processing is less about the syntax of language (important as that is) and more about how to reduce the size of this matrix.
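To get a feel for the first problem, a back-of-the-envelope storage estimate helps. The figures below are illustrative assumptions (not from the text): a dense matrix of 64-bit counts, one million documents, and a two-billion-word vocabulary:

```python
# Rough storage cost of a dense term frequency matrix, assuming
# 8 bytes (a 64-bit integer) per cell. The document and vocabulary
# sizes are hypothetical, chosen only to show the scale.
def dense_matrix_bytes(n_docs, n_terms, bytes_per_cell=8):
    return n_docs * n_terms * bytes_per_cell

n_docs = 1_000_000         # hypothetical corpus size
n_terms = 2_000_000_000    # hypothetical multi-language vocabulary

total = dense_matrix_bytes(n_docs, n_terms)
print(f"{total / 1e15:,.0f} petabytes")  # 16 petabytes
```

Since almost every cell in such a matrix would be zero, this is exactly the kind of waste that vocabulary-shrinking techniques (and sparse storage) are meant to eliminate.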
Now we know what we must do and the challenges we face along the way. The next three blogs in the series will address these problems directly. We will introduce you to three concepts: conforming, stemming, and stop word removal.