Future-Proof Your Data Science Workflow
It’s been said that data is the new oil.
As one of the most valuable resources in the tech age, it’s critical for companies to have a tight process for solving data problems while handling ever-increasing volumes of information. According to statistical data provider Statista, global data creation is projected to grow to more than 180 zettabytes by 2025.
But in the early stages of a product’s or firm’s development, it may be hard to predict future needs for information access and storage while creating clearly defined steps for solving data problems in the present. How can a company future-proof its data science workflow?
“In the early stages of building a data product, when you have less certainty about future needs, you want to keep things simple, modular and flexible,” said Pamela Wu, a senior data scientist at business directory data company Enigma.
Wu recognizes that the company’s workflow will evolve as it grows, and that focusing on testing now will pay dividends later.
Built In NYC caught up with Wu for more insights into building data science workflows for an evolving company.
Tell us a bit about your technical process for building data science workflows. What are some of your favorite tools?
Databricks — via Apache Spark — has been a great fit for us because it’s general purpose and versatile. We’re constantly expanding our product and use Databricks notebooks to do exploratory analysis with rich visualizations. We can hook a notebook up to a job cluster that runs validation on a schedule, or even iterate on a production job with small data, and then hook it up to a job cluster that we trigger whenever we need to change the inputs. Notebooks can be versioned with Git alongside their corresponding Python script file, allowing us to code review the notebook source code.
I expect this answer will be very different in a year, as Enigma continues to grow. We’re currently exploring a platform specialized for data labeling that can help us manage the training data we’ve created and all of the validation labels we’ve accumulated.
What processes or best practices have you found most effective in creating workflows that are reproducible and stable?
For stability, testing is paramount. And I mean all flavors of testing: unit testing, integration testing, end-to-end testing. This may seem obvious, but it can be hard for a data scientist to stomach the fact that the amount of time they estimated for completing a task has to be doubled to include writing tests that cover the whole range of expected behavior.
There are a lot of great guides out there on best practices for testing. The principle I find most relevant for data scientists, who write a lot of tests on number and string transformations, is to use numbers and values you can reason out in less than five seconds. Data scientists should be writing tests that they could run in their own heads only slightly slower than the testing framework could. Also, try to hack yourself and make your logic fail — but only one hack per test, please.
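To make that principle concrete, here is a minimal sketch in Python. The functions percent_change and clean_name are hypothetical stand-ins for the kind of number and string transformations Wu describes; they are not taken from Enigma's codebase. Each test uses values that can be checked mentally in a few seconds, and each "hack" test introduces exactly one unusual input.

```python
def percent_change(old: float, new: float) -> float:
    """Return the change from old to new as a percentage of old (hypothetical helper)."""
    if old == 0:
        raise ValueError("percent change is undefined when old is 0")
    return (new - old) / old * 100.0


def clean_name(raw: str) -> str:
    """Collapse whitespace and title-case a business name (hypothetical helper)."""
    return " ".join(raw.split()).title()


def test_percent_change_simple_increase():
    # 100 -> 150 is plainly +50%: easy to verify in your head.
    assert percent_change(100, 150) == 50.0


def test_percent_change_simple_decrease():
    # 200 -> 100 is plainly -50%.
    assert percent_change(200, 100) == -50.0


def test_percent_change_zero_baseline_is_rejected():
    # One hack only: a zero baseline. Everything else about the input is ordinary.
    try:
        percent_change(0, 10)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for a zero baseline")


def test_clean_name_collapses_whitespace():
    # One hack only: messy whitespace. Casing is left to a separate test.
    assert clean_name("  acme   corp ") == "Acme Corp"


def test_clean_name_fixes_casing():
    # One hack only: inconsistent casing, with clean whitespace.
    assert clean_name("ACME corp") == "Acme Corp"
```

Keeping one hack per test means that when a test fails, the cause is unambiguous: a failure in the whitespace test points at whitespace handling, not casing. The test functions follow the naming convention that pytest discovers automatically, but they also run as plain function calls.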
What advice do you have for other data scientists looking to improve how they build workflows?
Make sure that your workflow meets your company’s particular needs. For example, Enigma is a data company and our work directly impacts company revenues. For this reason, validation is a crucial part of our data science workflow.
We strive to be customer-obsessed and validate our models based on what customers are seeing, as opposed to baseline distribution. We’ve also spent a lot of time figuring out how to make our validation repeatable and to validate as close to the end of our pipeline as possible. Before we put anything into production we look at the absolute end of the line to make sure we’re seeing it through the eyes of our customer.
It’s also important to invest in clear product documentation on key data science decisions. In the absence of firm product guidelines, decisions will be made ad hoc and lead to internal contradictions. Efficiency begins with stringent guidelines that are documented and rooted in customer investigation.