10 R Packages I Wish I Knew About Earlier

Initial Use of R Packages

I started using R about 3 years ago. It was slow going at first. R’s syntax was trickier and less intuitive than that of the languages I was used to, and it took a while to get accustomed to the nuances. It wasn’t immediately clear to me that the power of the language was bound up with the community and the diverse packages available.

R can be more prickly and obscure than other languages like Python or Java. The good news is that there are tons of packages that provide simple and familiar interfaces on top of Base R. This post is about ten packages I love and use every day, and that I wish I had known about earlier.

R Package 1: sqldf


One of the steepest parts of the R learning curve is the syntax. It took me a while to get over using <- instead of =. I often hear people ask, "How do I just do a VLOOKUP?!" R is great for general data munging tasks, but it takes a while to master. I think it’s safe to say that sqldf was my R “training wheels.”

sqldf lets you perform SQL queries on your R data frames. People coming over from SAS will find it very familiar, and anyone with basic SQL skills will have no trouble using it, since sqldf uses SQLite syntax.
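As a minimal sketch of the VLOOKUP-style task mentioned above, here is a join plus an aggregate written as a single query. The `orders` and `customers` data frames are made up for illustration and are not from the original post.

```r
library(sqldf)

# Hypothetical example data: orders and a customer lookup table
orders <- data.frame(
  order_id    = 1:4,
  customer_id = c(10, 20, 10, 30),
  amount      = c(25.0, 40.0, 15.5, 60.0)
)
customers <- data.frame(
  customer_id = c(10, 20, 30),
  name        = c("Ann", "Bob", "Cara")
)

# A VLOOKUP-style join plus an aggregate, written in SQLite syntax.
# sqldf looks up the data frames by the names used in the query.
totals <- sqldf("
  SELECT c.name, SUM(o.amount) AS total
  FROM orders o
  JOIN customers c ON o.customer_id = c.customer_id
  GROUP BY c.name
  ORDER BY c.name
")
print(totals)
```

The result is an ordinary data frame, so it can be passed straight to any other R function.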

R Package 2: forecast


I don’t do time series analysis very often, but when I do, forecast is my library of choice. forecast makes it incredibly easy to fit time series models like ARIMA, ARMA, AR, Exponential Smoothing, etc.

My favorite feature is the resulting forecast plot:

Figure: Forecast of lung disease related deaths in the UK
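A plot like this can be produced in a few lines. The sketch below assumes the built-in `ldeaths` series (monthly UK deaths from lung diseases) and an automatically selected ARIMA model; the original post may have used a different dataset or model.

```r
library(forecast)

# ldeaths ships with R: monthly deaths from lung diseases in the UK, 1974-1979
fit <- auto.arima(ldeaths)    # select and fit an ARIMA model automatically
fc  <- forecast(fit, h = 12)  # forecast the next 12 months

plot(fc)                      # point forecasts with shaded prediction intervals
```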

R Package 3: plyr


When I first started using R, I was using basic control operations for manipulating data (for, if, while, etc.). I quickly learned that this was an amateur move, and that there was a better way to do it.

In R, the apply family of functions is the preferred way to call a function on each element of a list or vector. While Base R has this out of the box, its usage can be tricky to master. I’ve found the plyr package to be an easy-to-use substitute for the split-apply-combine functionality in Base R.

plyr gives you several functions (ddply, daply, dlply, adply, ldply) following a common blueprint: split a data structure into groups, apply a function to each group, and return the results in a data structure.

ddply splits a data frame and returns a data frame (hence the dd). daply splits a data frame and returns an array (hence the da). Hopefully you’re getting the idea here.
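The naming scheme can be sketched on the built-in `iris` data. The column names `mean_sepal_length` and `n` are chosen here for illustration.

```r
library(plyr)

# ddply: data frame in, data frame out.
# Split iris by Species, summarise each group, combine the results.
species_means <- ddply(iris, "Species", summarise,
                       mean_sepal_length = mean(Sepal.Length),
                       n = length(Sepal.Length))
print(species_means)

# dlply: data frame in, list out.
# Fit one linear model per species and collect the fits in a list.
models <- dlply(iris, "Species",
                function(df) lm(Sepal.Length ~ Sepal.Width, data = df))
names(models)
```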

R Package 4: stringr


I find base R’s string functionality to be extremely difficult and cumbersome to use. Another package written by Hadley Wickham, stringr, provides some much needed string operators in R. Many of the functions use data structures that aren’t commonly used when doing basic analysis.

stringr is remarkably easy to use. Nearly all of the functions (and all of the important ones) are prefixed with “str_” so they’re very easy to remember.
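A few of the most common helpers, sketched on a made-up character vector:

```r
library(stringr)

# Hypothetical input with inconsistent whitespace and capitalisation
x <- c("  Apple pie ", "banana split", "Cherry tart")

trimmed  <- str_trim(x)                   # strip leading/trailing whitespace
lowered  <- str_to_lower(trimmed)         # convert to lower case
has_an   <- str_detect(trimmed, "an")     # does each string contain "an"?
snake    <- str_replace(trimmed, " ", "_")  # replace the first space only
prefixes <- str_sub(trimmed, 1, 3)        # first three characters
n_chars  <- str_length(trimmed)           # number of characters per string

print(data.frame(trimmed, lowered, has_an, snake, prefixes, n_chars))
```

Every function takes the string as its first argument and is vectorised over it, which is what makes the package predictable to use.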

R Package 5: The Database Driver Package of Your Choice


Everyone does it when they first start (myself included). You’ve just written an awesome query in your preferred SQL editor. Everything is perfect – the column names are all snake case, the dates have the right datatype, you finally debugged the "must appear in the GROUP BY clause or be used in an aggregate function" issue. You’re ready to do some analysis in R, so you run the query in your SQL editor, copy the results to a csv (or…God forbid… .xlsx) and read it into R. You don’t have to do this!

R has great drivers for nearly every conceivable database. On the off chance you’re using a database which doesn’t have a standalone driver (SQL Server), you can always use RODBC.

Next time you’ve got that perfect query written, just paste it into R and execute it using RPostgreSQL, RMySQL, RMongo, or RODBC. In addition to preventing you from having tens or hundreds of CSV files sitting around, running the query in R saves you time both in I/O and in converting datatypes. Dates, times, and date-times will be automatically set to their R equivalent. It also makes your R script reproducible, so you or someone else on your team can easily produce the same results.
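The workflow looks roughly like the sketch below. To keep it runnable without a database server, it assumes the DBI interface with an in-memory SQLite database (via the RSQLite package) as a stand-in; the `visits` table is made up. With a real server you would pass a different driver, for example `RPostgreSQL::PostgreSQL()`, along with connection details specific to your setup.

```r
library(DBI)

# Stand-in connection: an in-memory SQLite database (assumption for this sketch)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table so the query below has something to read
dbWriteTable(con, "visits", data.frame(
  page  = c("home", "pricing", "home"),
  views = c(120L, 45L, 80L)
))

# Paste the query from your SQL editor straight into dbGetQuery()
page_views <- dbGetQuery(con, "
  SELECT page, SUM(views) AS total_views
  FROM visits
  GROUP BY page
  ORDER BY page
")
print(page_views)

dbDisconnect(con)
```

Because the query lives in the script, rerunning the script reruns the query, with no exported files to keep track of.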

R Package 6: lubridate


I’ve never had great luck with dates in R. I’ve never fully grasped the idiosyncrasies of working with POSIXs vs. R Dates. Enter lubridate.

lubridate is one of those magical libraries that just seems to do exactly what you expect it to. The functions all have obvious names like year, month, ymd, and ymd_hms. It’s similar to Moment.js for those familiar with JavaScript.
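A short sketch of those functions in action, using an arbitrary example date:

```r
library(lubridate)

d  <- ymd("2014-03-15")               # parse year-month-day text into a Date
dt <- ymd_hms("2014-03-15 13:45:30")  # parse a date-time (UTC by default)

year(d)                 # 2014
month(d)                # 3
hour(dt)                # 13
wday(d, label = TRUE)   # day of the week as a labelled factor

later <- d + days(30)   # date arithmetic with a period
later                   # "2014-04-14"
```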

Here’s a really handy reference card that I found in a paper. It covers just about everything you might conceivably want to do to a date.

R Package 7: ggplot2


Another Hadley Wickham package and probably his most widely known one. ggplot2 ranks high on everyone’s list of favorite R packages. It’s easy to use and it produces some great looking plots. It’s a great way to present your work, and there are many resources available to help you get started.
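For a sense of the syntax, here is a minimal sketch using the built-in `mtcars` data. The choice of variables and labels is only an illustration.

```r
library(ggplot2)

# Scatter plot of fuel economy against weight, coloured by cylinder count,
# with one straight-line fit per group
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

print(p)
```

Plots are built by adding layers with `+`, so swapping `geom_point` for another geom changes the chart type without touching the rest.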

R Package 8: qcc


qcc is a library for statistical quality control. Back in the 1950s, the now defunct Western Electric Company was looking for a better way to detect problems with telephone and electrical lines. They came up with a set of rules to help them identify problematic lines. The rules compare new data points against the historical mean of a series and, based on the standard deviation, help judge whether the new points reflect a shift in the mean.

The classic example is monitoring a machine that produces lug nuts. Let’s say the machine is supposed to produce 2.5 inch long lug nuts. We measure a series of lug nuts: 2.48, 2.47, 2.51, 2.52, 2.54, 2.42, 2.52, 2.58, 2.51. Is the machine broken? Well it’s hard to tell, but the Western Electric Rules can help.

While you might not be monitoring telephone lines, qcc can help you monitor transaction volumes, visitors or logins on your website, database operations, and lots of other processes.
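As a rough sketch, the lug nut measurements above can be charted as individual observations with qcc’s `xbar.one` chart type. Nine points is far too few for a real control chart, so treat this only as a demonstration of the call.

```r
library(qcc)

# Lug nut lengths (inches) from the example above
lengths_in <- c(2.48, 2.47, 2.51, 2.52, 2.54, 2.42, 2.52, 2.58, 2.51)

# Individuals ("xbar.one") chart: one measurement per sample.
# The call draws the chart and returns an object with the computed statistics.
chart <- qcc(lengths_in, type = "xbar.one")

chart$center   # estimated process mean
chart$limits   # lower and upper control limits
```

Points that break the chart’s rules are highlighted on the plot, which is usually the quickest way to spot a problem.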

Figure: qcc xbar.one chart for x and new.x

R Package 9: reshape2


I always find that the hardest part of any sort of analysis is getting the data into the right format. reshape2 is yet another package by Hadley Wickham that specializes in converting data from wide to long format and vice versa. I use it all the time in conjunction with ggplot2 and plyr.

It’s a great way to quickly take a look at a dataset and get your bearings. You can use the melt function to convert wide data to long data, and dcast to go from long to wide.
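A minimal round trip, using a made-up table of quarterly sales per store:

```r
library(reshape2)

# Hypothetical wide data: one row per store, one column per quarter
wide <- data.frame(
  store = c("A", "B"),
  q1    = c(10, 20),
  q2    = c(15, 25)
)

# Wide to long: one row per store/quarter pair
long <- melt(wide, id.vars = "store",
             variable.name = "quarter", value.name = "sales")
print(long)

# Long back to wide: one column per value of quarter
wide_again <- dcast(long, store ~ quarter, value.var = "sales")
print(wide_again)
```

The long form is the one ggplot2 and plyr generally expect, which is why the three packages get used together so often.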

R Package 10: randomForest


This list wouldn’t be complete without including at least one machine learning package you can impress your friends with. Random Forest is a great algorithm to start with. It’s easy to use, it can do supervised or unsupervised learning, it can be used with many different types of datasets, and most importantly, it’s effective! Here’s how it works in R.
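The sketch below is one possible illustration rather than the original post’s example. It assumes the built-in `iris` data, a 70/30 train/test split, and 500 trees.

```r
library(randomForest)

set.seed(42)  # make the random split and the forest reproducible

# Hold out 30% of iris as a test set
test_idx <- sample(nrow(iris), size = round(0.3 * nrow(iris)))
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

# Supervised classification: predict Species from the four measurements
rf <- randomForest(Species ~ ., data = train, ntree = 500, importance = TRUE)
print(rf)                                # includes the out-of-bag error estimate

pred     <- predict(rf, newdata = test)
accuracy <- mean(pred == test$Species)   # share of held-out rows predicted correctly
accuracy

importance(rf)                           # variable importance scores
```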

Shared from http://blog.yhat.com/

Raja Iqbal

Raja is the CEO and Chief Data Scientist at Data Science Dojo. He has worked at Microsoft Bing and Bing Ads in various R&D roles in data science and machine learning.
