Call it machine learning, AI, advanced data analytics, or data mining. It all boils down to looking at datasets and finding patterns that tell you something you didn’t know. For example, that your average revenue per customer is $192, that the number of intrusion attempts on your network correlates with your number of tweets, or that log entries can be grouped around three cluster centroids.
Your management and customers will focus on the sexy part (the buzzwords and the sophisticated-sounding techniques: support vector machines, deep learning, and so on), but if you’ve been in this business a while you’ll know that the actual job is usually 80% boring, with tasks such as:
- Figuring out how to convert data from obscure-format X to standardish format Y.
- Decluttering the dataset, fixing misformatted entries, dealing with missing values, and all the data preparation and cleaning steps.
- Finding the right tools for the job and learning how to use them. Maybe you’re joining a team that’s using different tools than the ones you’re used to, or the customer wants you to use this tool they have a license for.
This post is about the last point: tools. I’m not going to explain classification and regression, how to manage your data analysis project, or how to make visualizations that will look impressive in your PowerPoint slides.
This post is for readers who 1) are not data scientists and 2) would rather use free and open-source solutions than spend money on consultants or software licenses. If you have the budget, paid options may be a better fit: you’ll learn from experts and you’ll get technical support. Finally, I’ll just assume that, like everyone, you know Python.
Alright, here’s my “10 things highly productive people do” list:
Use Anaconda as Python environment
Seriously. Don’t use your system’s Python install. Anaconda is “the leading open data science platform powered by Python,” and contains most of the tools you’ll need to data-science in Python. With Anaconda, always work within dedicated conda environments (the equivalent of Python’s virtualenvs).
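A minimal setup sketch, assuming Anaconda is already installed (the environment name ds is just an example):

```shell
# Create an isolated environment with the core data science stack
conda create --name ds python pandas scikit-learn jupyter

# Activate it before working (older conda versions use "source activate ds")
conda activate ds
```

Everything you install from inside the environment stays in the environment, so breaking one experiment never breaks the others.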
Use Jupyter as a notebook
Use Pandas for basic data analysis
You want to add this line to your list of imports:
import pandas as pd
Pandas provides “high-performance, easy-to-use data structures and data analysis tools” and is the de facto standard for manipulating data frames in Python. Pandas comes with the default Anaconda install.
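For instance, the $192 average revenue per customer from the introduction is a one-liner once your data is in a DataFrame (the dataset and column names below are made up for illustration):

```python
import pandas as pd

# Toy dataset standing in for a real file,
# which you would load with e.g. pd.read_csv("customers.csv")
df = pd.DataFrame({
    "customer": ["a", "b", "c", "d"],
    "segment": ["retail", "retail", "pro", "pro"],
    "revenue": [150, 234, 192, 192],
})

print(df["revenue"].mean())                     # average revenue per customer -> 192.0
print(df.groupby("segment")["revenue"].mean())  # same metric, broken down by segment
```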
Use Scikit-learn for modelling
Scikit-learn lets you carry out all the standard modelling tasks in predictive and exploratory analytics. It’s not as efficient as some proprietary systems and may choke on large amounts of data, but it will do the trick for most of your modelling jobs. Like Pandas, you’ll find Scikit-learn in Anaconda.
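A minimal sketch of the usual fit/predict workflow, using synthetic data as a stand-in for your own:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic two-class dataset; substitute your own features and labels
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Train a classifier and measure accuracy on held-out data
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

accuracy = accuracy_score(y_test, clf.predict(X_test))
print(accuracy)
```

Swapping in another estimator (logistic regression, gradient boosting, a clustering model) only changes the lines that create and fit clf; that uniform API is Scikit-learn’s main selling point.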
If you need a GUI, use Orange
Orange is essentially a GUI for Scikit-learn, and a GUI that does not suck. Orange has improved greatly since version 3.0, when it switched to using numpy and Scikit-learn instead of dedicated C++ components. It works on Linux, macOS, and Windows.
If you deal with BIG DATA, use Spark
Big data is whatever is too big to manage locally on your laptop. When running complex analytics, though, the bottleneck is more likely to be CPU-bound than memory-bound. I found Spark surprisingly easy to use (a bit less easy to set up properly) compared to other distributed computation engines or databases. If you already know Scala, or if you love Java, use the Scala API. Otherwise stick to the simpler Python API, PySpark (warning: it’s less documented than its Scala counterpart, and some Scala functions aren’t available in Python).
You can use Spark from any Python script by adding the following code:
import findspark
findspark.init()  # make the local Spark installation importable

import pyspark

sc = pyspark.SparkContext()
sqlContext = pyspark.sql.HiveContext(sc)  # HiveContext lives in pyspark.sql
You’ll then be able to make queries using the Spark SQL module, entering queries either using Spark’s DataFrame API or with plain old SQL.
Protips:
- In Spark, use DataFrame objects rather than the legacy RDD format.
- Use the spark.ml API rather than the legacy MLlib.
- Set up a cluster with enough cores and memory; otherwise it may be slower than small-data tools.
If you need help, use Google
You’ll find solutions to most of your Pandas or Scikit-learn issues in the usual places (StackOverflow, etc.). But Spark debugging can be painful, especially when using PySpark rather than Scala, and you might run into bugs or undocumented behavior. You’ll also need some fine-tuning to optimize Spark processing performance. Learning how it works under the hood will help you solve performance issues.