Text Analytics (Natural Language Processing) with R Programming

It is my firm conviction that Natural Language Processing/Text Analytics is a must-have skill for any practicing Data Scientist. From analyzing customer feedback in NSAT surveys to scraping Microsoft’s internal job postings for analyzing popular technical skills, to segmenting customers via textual features, I have universally found that Text Analytics is a wildly useful skill.

Not surprisingly, I am often asked by students of our Bootcamp, folks that I mentor on Data Science and my LinkedIn contacts about the subject of Text Analytics. The good news is that there are many great resources for the R programmer to learn Text Analytics. What follows is a practical curriculum where the only required knowledge is basic R programming skills. I have read all of the books referenced below and can attest that studying the curriculum will have you mastering Text Analytics in no time!

Text Analytics with R for Students of Literature is quite simply the best, most straightforward introduction to working with text that I have found. Professor Jockers illustrates many of the fundamentals using out of the box R programming. This book provides a great foundation for anyone looking to get started in Text Analytics with R.

Taming Text is the next stop on the Text Analytics journey. While this book is primarily written for Java programmers, there is a lot of theory that is immensely useful for R programmers learning to work with text. Additionally, the book covers the OpenNLP Java library which is available to R programmers via the excellent openNLP package.

The CRAN NLP Task View illustrates the wide-ranging Text Analytics support for the R programmer. Unfortunately, it also illustrates that the landscape is fractured as well. However, a couple of packages are worthy of study. The tm package is often the go-to Text Analytics package for R programmers. However, the new quanteda package shows a lot of promise. Lastly, the excellent openNLP package deserves a second callout.

Introduction to Information Retrieval, while focused primarily on the problem of search, nevertheless contains a wealth of theory and understanding (e.g., the Vector Space Model) to take the R programmer to the next level. The text is language agnostic, is quite excellent, and free!

While the Natural Language Toolkit (NLTK) is Python-based, the book on the subject is a wealth of goodness to the R programmer. I put this resource last in the list as learning the above conceptual material and R packages provides the necessary background to translate some of the concepts (e.g., chunking) into the R context. Awesome stuff and free to boot!

There you have it, a practical curriculum for the R programmer to ramp into Text Analytics. Don’t hesitate to reach out if you have any questions or comments – I monitor my blog almost continually.

Until next time, happy data sleuthing!

Watch our video tutorials on Text Analytics.