Aspiring Data Scientist? You’ll Need Some Math!

Why So Much Math?

At the end of each of our bootcamps we ask our students to provide us with feedback on their experience. In particular, we ask for honest assessments and opinions on how we can improve. It’s something we take very seriously at Data Science Dojo and I can list a number of changes we’ve made as a direct result of student feedback. Given that our students come from a broad spectrum of backgrounds, it is not surprising that we invariably receive feedback that distills down to, “why so much mathematics for machine learning?”

Math and Programming – the Tools of the Data Scientist

It is my firm conviction that you do not need a PhD in Statistics/Computer Science/Machine Learning/Whatever to become a Data Scientist. However, it is equally my conviction that Data Science ultimately boils down to two things – Mathematics and Programming. Per this belief, our bootcamp curriculum is engineered to provide the required foundation in both mathematical concepts/theory and programming for Data Science.

As you might imagine, it is very rare for our students to provide feedback along the lines of, “why so much programming?” Some students comment that there was more programming than they expected, but rarely is the need for a Data Scientist to have coding skills questioned. Not so for mathematics. This is unfortunate as I would strenuously argue that without some mathematical knowledge a Data Scientist will not be able to build effective models.

Here’s a hypothetical example to illustrate my point. An aspiring Data Scientist does some research regarding a particular problem and finds a blog post, a paper, and/or a forum post recommending the application of a regression model built with Stochastic Gradient Descent to the problem space. The following screenshot is an excerpt from the documentation for Python’s most excellent scikit-learn library.

NOTE – Rest assured that similar R examples exist as well (e.g., the awesome glmnet package) and I only use scikit-learn here as the scikit-learn HTML documentation is more visually attractive ;-).

[Screenshot: scikit-learn SGDRegressor API documentation]

The above green boxes illustrate some of the mathematical knowledge required to use this algorithm to build the most effective model. For example:

  1. The Stochastic Gradient Descent algorithm – what it is and how it works.
  2. Regularization – what it is and how it works.
  3. The differences between L1 and L2 regularization – why a Data Scientist might want one vs. the other or a blend of both.
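To make the three items above a little more concrete, here is a minimal sketch in plain Python. It is not scikit-learn's implementation. The function names, the synthetic data, the constant learning rate, and the soft-thresholding step used for the L1 part are all my own assumptions, chosen to keep the example short. The parameter names penalty, alpha, and l1_ratio simply mirror the ones in the SGDRegressor documentation.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, clamping at exactly zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sgd_regression(X, y, penalty="l2", alpha=0.1, l1_ratio=0.5,
                   learning_rate=0.05, epochs=50, seed=0):
    """Fit y ~ w.x + b by SGD on squared error plus an optional penalty.

    Hypothetical helper for illustration only, not a library API.
    """
    # Fraction of the penalty that is L1; the remainder is L2.
    l1_share = {"none": 0.0, "l2": 0.0, "l1": 1.0,
                "elasticnet": l1_ratio}[penalty]
    strength = 0.0 if penalty == "none" else alpha
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # "stochastic": visit one random row at a time
        for i in order:
            error = sum(wj * xj for wj, xj in zip(w, X[i])) + b - y[i]
            for j, xj in enumerate(X[i]):
                # Gradient of 0.5 * error**2 plus the L2 part of the penalty.
                grad = error * xj + strength * (1.0 - l1_share) * w[j]
                stepped = w[j] - learning_rate * grad
                # L1 part: shrink toward zero, clamping small weights to 0.
                w[j] = soft_threshold(
                    stepped, learning_rate * strength * l1_share)
            b -= learning_rate * error  # the intercept is not penalized
    return w, b


# Synthetic data, made up for illustration: y depends on x0 only,
# while x1 is an irrelevant feature.
data_rng = random.Random(42)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(200)]
y = [3.0 * x0 + data_rng.gauss(0, 0.1) for x0, _ in X]

for penalty in ("none", "l2", "l1", "elasticnet"):
    w, b = sgd_regression(X, y, penalty=penalty)
    print(f"{penalty:>10}: w = {[round(v, 3) for v in w]}, b = {round(b, 3)}")
```

Running it prints the fitted weights for each penalty. Both penalties should shrink the weight on x0 below what the unpenalized fit finds, and the L1 variant should additionally keep the weight on the irrelevant x1 at or very near zero. That difference is the usual reason to pick L1, L2, or a blend of the two.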

I believe this relatively simple example illustrates my point about math and programming. Specifically, this example shows that without the required math knowledge, the Data Scientist has little hope of coding up the training/construction of the most effective model in any reasonable way.

For these reasons, our students learn every highlighted item above as part of our curriculum’s coverage of regression. We also teach our students the mathematics and theory for other important topics like decision trees, boosting, and recommender systems. It is also for these reasons that I advise the aspiring Data Scientists I mentor that they will eventually need to dust off their math textbooks.

Until next time, happy data sleuthing!