Data Science for Doofuses: What Toolbox to Use

Call it machine learning, AI, advanced data analytics, or data mining. It all boils down to looking at datasets and finding patterns that tell you something you didn’t know. For example, that your average revenue per customer is $192, that the number of intrusion attempts on your network correlates with your number of tweets, or that log entries can be grouped around three cluster centroids.

Your management and customers will focus on the sexy part (the buzzwords and the sophisticated-sounding techniques: support vector machines, deep learning, and so on), but if you’ve been in this business for a while you’ll know that the actual job is usually 80% boring, with tasks such as:

  • Figuring out how to convert data from obscure-format X to standardish format Y.
  • Decluttering the dataset, fixing misformatted entries, dealing with missing values, and all the data preparation and cleaning steps.
  • Finding the right tools for the job and learning how to use them. Maybe you’re joining a team that’s using different tools than the ones you’re used to, or the customer wants you to use this tool they have a license for.

This post is about the last point, tools. I’m not going to explain classification and regression, how to manage your data analysis project, or how to make visualizations that will look impressive in your PowerPoint slides.

This post is for readers who 1) are not data scientists, 2) don’t have money to spend on consultants or software licenses, and 3) would rather use free and open-source solutions. If you do have the budget, those may be a better option: you’ll learn from experts and you’ll get technical support. Finally, I’ll just assume that, like everyone, you know Python.

Alright, here’s my “10 things highly productive people do” list:

Use Anaconda as your Python environment

Seriously. Don’t use your system’s Python install. Anaconda is “the leading open data science platform powered by Python,” and contains most of the tools you’ll need to data-science in Python. With Anaconda, always work within dedicated conda environments (the equivalent of Python’s virtualenvs).

Use Jupyter as a notebook

Jupyter is IPython on steroids: a web-based notebook interface that lets you write cells of code interleaved with Markdown-formatted text and HTML and JavaScript content. This makes it convenient to share your work with non-technical people.

Use Pandas for basic data analysis

You want to add this line to your list of imports:

import pandas as pd

Pandas provides “high-performance, easy-to-use data structures and data analysis tools” and is the de facto standard for manipulating data frames in Python. Pandas comes with the default Anaconda install.
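
To give you an idea, here’s a minimal sketch of a typical Pandas session (the file logs.csv and the status and bytes columns are made-up names for illustration, not anything standard):

import pandas as pd

# Load a CSV file into a DataFrame (logs.csv is a hypothetical example file)
df = pd.read_csv("logs.csv")

# First rows and summary statistics of the numeric columns
print(df.head())
print(df.describe())

# Average bytes transferred per status code
print(df.groupby("status")["bytes"].mean())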

Use Scikit-learn for modelling

Scikit-learn lets you carry out all the standard modelling tasks in predictive and exploratory analytics. It’s not as efficient as some proprietary systems and may choke on large amounts of data, but it will do the trick for most of your modelling jobs. Like Pandas, you’ll find Scikit-learn in Anaconda.
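
As a minimal sketch, here’s how fitting a model looks with Scikit-learn, using its bundled iris toy dataset and a random forest as placeholders (neither is a recommendation for your own data):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy dataset bundled with Scikit-learn, used here only for illustration
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit a classifier and check its accuracy on the held-out test set
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))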

If you need a GUI, use Orange

Orange is essentially a GUI for Scikit-learn, and a GUI that does not suck. Orange has improved greatly since version 3.0, when it switched to using NumPy and Scikit-learn instead of dedicated C++ components. It works on Linux, macOS, and Windows.

If you deal with BIG DATA, use Spark

Big data is whatever is too big to manage locally on your laptop. When running complex analytics, though, you’re more likely to be CPU-bound than memory-bound. I found Spark surprisingly easy to use (a bit less so to set up properly) compared to other distributed computation engines and databases. If you already know Scala, or if you love Java, use the Scala API. Otherwise stick to the simpler Python API, PySpark (warning: it’s less documented than its Scala counterpart, and some Scala functions aren’t available in Python).

You can use Spark within any Python script by adding the following code:

import findspark
findspark.init()  # locate the local Spark installation

import pyspark
from pyspark.sql import HiveContext  # HiveContext lives in pyspark.sql, not in the top-level package

sc = pyspark.SparkContext()
sqlContext = HiveContext(sc)

You’ll then be able to query your data through the Spark SQL module, using either Spark’s DataFrame API or plain old SQL.
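
For example, assuming a hypothetical events.json file with a host column, the same aggregation can be written both ways with the sqlContext created above:

# Load a JSON file into a Spark DataFrame (events.json is a hypothetical example file)
df = sqlContext.read.json("events.json")

# DataFrame API: count events per host
df.groupBy("host").count().show()

# Plain old SQL on the same data, via a temporary table
df.registerTempTable("events")
sqlContext.sql("SELECT host, COUNT(*) AS n FROM events GROUP BY host").show()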

Pro tips: in Spark, use DataFrame objects rather than the legacy RDD format, and use the spark.ml API rather than the legacy MLlib. Set up a cluster with enough cores and memory; otherwise Spark may end up slower than your small-data tools.
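
As a rough sketch of what the spark.ml API looks like on a DataFrame (the x1, x2, and label column names are placeholders, and df stands for any DataFrame loaded as above):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

# spark.ml expects all features packed into a single vector column
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
train = assembler.transform(df)

# Fit a logistic regression on the assembled features
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(train)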

If you need help, use Google

You’ll find how to solve most of your Pandas or Scikit-learn issues in the usual places (Stack Overflow, etc.). But Spark debugging can be painful, especially when using PySpark rather than Scala, and you might run into bugs or undocumented functionality. You’ll also need some fine-tuning to optimize Spark’s processing performance. Learning how it works under the hood will help you solve performance issues.
