How to Build a Successful Data Science Workflow

Written by Alton Zenon III
Published on Apr. 21, 2020

In a relatively new and dynamic field like data science, adaptability is key. 

“There’s no one workflow that suits everyone and every project,” said CB Insights Senior Data Scientist Rongyao Huang. “Identify the primary focus of each project phase and pick tools accordingly.” Since the field evolves fast, Huang is constantly on the lookout for new tools as they emerge.

Huang and other data scientists across New York agree: the right data science workflow depends on the particulars of the project, which should be discussed early in the planning phase with stakeholders across the business, including product and engineering.

“At the beginning of the process, I ask partners about the problem we are trying to solve and the metrics that will be most useful,” Natalie Goldman, data analyst at theSkimm, said. “I use this question as a tool to help focus on the user and determine what we can actually find using data.”

Defining metrics and documenting progress throughout a project’s pipeline help build repeatable processes and track down errors. When an established metric of success is missed, teams can retrace their steps, determine what went wrong, and iterate quickly to ensure future progress. 

“Build pipelines to iterate as fast as possible while testing robustly,” said Orchard Lead Data Engineer Greg Svigruha. “Don’t get too wedded to any specific tool, framework or language. Use anything that provides the most value.”

 

Rongyao Huang
Senior Data Scientist • CB Insights

Huang said her team breaks data science projects into three phases: exploration, refinement and productization. Each stage has a different primary focus, and an effective workflow should aim to speed up each one with respect to its goal.

 

What are some of your favorite tools that you’ve used to build your data science workflow?

Jupyter Notebook is the best environment for the exploration stage. Its interactive nature and ability to combine code with documentation and visuals make it more efficient for fast prototyping and collaboration. Docker and virtualenv isolate project environments, making them replicable and portable, a must-have when doing remote development or collaborating with others. And we use Project Log to keep a record for each campaign.

A good graphical user interface in the refinement stage makes refactoring a lot easier. Our favorite GUIs include PyCharm, Atom, Visual Studio Code and Sublime. We use Google Sheets for error analysis and performance tracking. And we check TensorBoard for deep learning projects.

In the productization stage, we’ve developed templates for jobs and services. A data science solution implements interfaces defining how data is fetched, prepared, processed and saved; the remaining work of configuration, logging and deployment is abstracted away. We’re also continuously adding to a utils library where common functionalities like database reads and writes are standardized.
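As an illustration only (the function names and connection string below are hypothetical, not CB Insights’ actual library), a standardized database read/write helper might look something like this:

```python
# Hypothetical sketch of a shared utils module for standardized database reads/writes.
import pandas as pd
from sqlalchemy import create_engine

ENGINE = create_engine("postgresql://user:password@db-host:5432/analytics")  # placeholder URL

def read_table(query: str) -> pd.DataFrame:
    """Run a SQL query and return the result as a DataFrame."""
    return pd.read_sql(query, ENGINE)

def write_table(df: pd.DataFrame, table_name: str) -> None:
    """Write a DataFrame to a table, replacing any existing contents."""
    df.to_sql(table_name, ENGINE, if_exists="replace", index=False)
```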

 

Data and model versioning in the refinement stage

Huang said it's helpful to standardize how data and model versions are named, annotated and saved. Her process includes automating functions like seeding random components, adding Unix timestamps to standardize file and folder names, saving metadata alongside data and models, and backing everything up on Amazon S3 after hitting “save.”
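A minimal sketch of that kind of convention, with hypothetical helper names and bucket rather than Huang’s actual code:

```python
# Hypothetical sketch: seeded randomness, Unix-timestamped names, metadata stored
# alongside the model, and a backup to Amazon S3.
import json
import pickle
import random
import time

import boto3
import numpy as np

def seed_everything(seed: int = 42) -> None:
    """Fix random seeds so runs are repeatable."""
    random.seed(seed)
    np.random.seed(seed)

def save_model(model, name: str, metadata: dict, bucket: str = "my-ds-backups") -> None:
    timestamp = int(time.time())              # Unix timestamp standardizes file names
    base = f"{name}_{timestamp}"

    with open(f"{base}.pkl", "wb") as f:      # serialize the model itself
        pickle.dump(model, f)
    with open(f"{base}.json", "w") as f:      # metadata lives next to the model
        json.dump({**metadata, "saved_at": timestamp}, f)

    s3 = boto3.client("s3")                   # back both files up to S3
    s3.upload_file(f"{base}.pkl", bucket, f"models/{base}.pkl")
    s3.upload_file(f"{base}.json", bucket, f"models/{base}.json")
```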

 

What are some of your best practices for creating reproducible and stable workflows?

A good reproducibility framework is PyTorch Lightning, a lightweight ML wrapper that helps organize PyTorch code, decoupling the data science from the engineering and automating training pipelines. The Lightning template implements standardized interfaces for practices like model configuration and data preparation, among others.

However, I think this should be a cautious investment, especially for small teams. It comes with a flexibility-versus-automation tradeoff, there’s a learning curve for users and there’s a maintenance cost for developers. It’s worth adopting when it helps the team across a broad range of tasks, automates heavy components of the workflow and the team is committed to maintaining it in the long run.
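For readers who haven’t used the framework, a bare-bones LightningModule looks roughly like the sketch below; the toy network, dimensions and training call are illustrative only, not CB Insights’ code.

```python
# Minimal PyTorch Lightning sketch: the module standardizes where the model,
# training step and optimizer are defined; the Trainer automates the loop.
import pytorch_lightning as pl
import torch
from torch import nn

class ToyRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self(x), y)
        self.log("train_loss", loss)          # logged automatically (e.g. to TensorBoard)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# trainer = pl.Trainer(max_epochs=5)
# trainer.fit(ToyRegressor(), train_dataloaders=train_loader)  # train_loader is assumed
```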

 

Christine Hung
VP of Data Insights Engineering • Flatiron Health

Hung’s team approaches data science experiments with the idea in mind that they should be easily replicated. Using tools like Jupyter Notebook for experimenting or Blocks for pipeline-building, they prioritize traceability and repeatability so they can track errors easily. 

 

What are some of your favorite tools that you’ve used to build your data science workflow?

Our data science team uses the right tool for the specific project at hand. Depending on who the team is collaborating with, we may end up using different technology sets, but within each project, we always have reproducibility top of mind. 

Some of our favorite tools include the Jupyter Notebook for experimenting and prototyping, scikit-learn for model development, and an in-house, multi-language ETL system called Blocks for building end-to-end data science pipelines.
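As a generic illustration rather than Flatiron’s actual code, a scikit-learn experiment built with reproducibility in mind pins its random seeds so a rerun produces the same result:

```python
# Generic scikit-learn sketch: a seeded, self-contained experiment on toy data
# that is easy to rerun and compare across versions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)  # toy data
model = RandomForestClassifier(n_estimators=200, random_state=0)             # fixed seed

scores = cross_val_score(model, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```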

 

What are some of your best practices for creating reproducible and stable workflows?

We are believers in versioning not only code and models, but also in versioning data. To reproduce the results of an analysis, it’s essential that we are able to go back to the original data easily. We also spend a lot of time ensuring our workflows have the right level of continuous evaluation and monitoring so that it’s easy to spot regressions and stability issues as soon as they surface. 

Lastly, we take the time to educate our teams on reproducibility and to implement processes that help ensure we are following best practices. A little bit of work up front goes a long way.

 

What advice do you have for other data scientists looking to improve their workflows?

It’s important to design workflows that are inclusive of the tools and technologies commonly used by the team members involved in a project. Sometimes, when multiple teams are involved in a project, there are ways to integrate multiple languages together to allow domain experts to work in their preferred toolset while still enabling an efficient, stable and reproducible workflow.

It’s also important to treat every analysis and experiment as something that’s going to be revisited. Be very thoughtful about monitoring and documenting how data is generated, transformed and analyzed. Design a process that requires minimal additional effort, but allows for every analysis to be easily recreated.

 

Harry Shore
Data Analyst • The Trade Desk

Shore said the data and engineering teams work very closely together at The Trade Desk. Communication creates transparency and encourages the sharing of different perspectives. 

 

What are some of your favorite tools that you’ve used to build your data science workflow?

We’ve long used Vertica as our core data store and analytics platform, which is great for data exploration and rapid prototyping. It’s easy to pull data and export it to another application. When I’m uncovering relationships in a new data set, I’ll often do some filtering and aggregation in Vertica, then throw the result set into Tableau to build exploratory plots.
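As a rough illustration of that pattern (the table, columns and connection details below are invented), an aggregate query can be pulled into pandas and exported as a file Tableau reads as a data source:

```python
# Hypothetical sketch: aggregate in the database, pull the result set into pandas,
# then export a CSV for Tableau. Connection details and schema are placeholders.
import pandas as pd
import vertica_python  # Vertica's Python client

QUERY = """
    SELECT advertiser_id, DATE_TRUNC('day', event_time) AS day, COUNT(*) AS impressions
    FROM events
    GROUP BY 1, 2
"""

conn_info = {"host": "vertica-host", "port": 5433, "user": "analyst",
             "password": "...", "database": "adplatform"}

with vertica_python.connect(**conn_info) as conn:
    df = pd.read_sql(QUERY, conn)

df.to_csv("impressions_by_day.csv", index=False)  # load this file into Tableau
```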

However, as our modeling work got more sophisticated, we moved from Vertica to Spark because some algorithms just aren’t feasible to implement in SQL. Zeppelin is a great tool for rapid Spark development, and our engineering team has worked to make our core data sets available in Parquet format via S3.

 

What are some of your best practices for creating reproducible and stable workflows?

The move from Vertica to Spark had the potential to complicate workflows; it was easy to schedule execution of a Vertica query and copy the results table into a production system. But models in Spark can have a wide array of designs, dependencies and outputs.

One of our engineers had a great idea: What if we had a single Spark project that could be used to both develop and run models? So, we worked closely with engineering as they developed a shared library, where each model was its own class, and each class could be run on a schedule using Airflow. The killer feature was that the library could also be loaded into Zeppelin. So as each data scientist was doing data exploration and prototyping, the code they were writing used the exact same helper functions and data interfaces available in production. This methodology made for a close to seamless transition from prototyping to production.
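The shared library itself isn’t public, but the pattern described, each model as its own class sharing common data interfaces so the same code runs in Zeppelin and on an Airflow schedule, might be sketched roughly like this (class names and paths are hypothetical):

```python
# Hypothetical sketch of the shared-library pattern: every model subclasses a common
# base class with the same data interfaces used in production.
from abc import ABC, abstractmethod
from pyspark.sql import DataFrame, SparkSession

class BaseModel(ABC):
    """Shared data interfaces; subclasses implement only the modeling logic."""

    def __init__(self, spark: SparkSession):
        self.spark = spark

    def load_input(self, path: str) -> DataFrame:
        return self.spark.read.parquet(path)          # core data sets live in Parquet on S3

    def save_output(self, df: DataFrame, path: str) -> None:
        df.write.mode("overwrite").parquet(path)

    @abstractmethod
    def run(self) -> None:
        """Train/score the model; called by an Airflow task or from a Zeppelin notebook."""

class BidFactorModel(BaseModel):                      # illustrative model name
    def run(self) -> None:
        data = self.load_input("s3://bucket/core/bids/")
        scored = data.withColumn("score", data["bid_price"] * 1.1)  # placeholder logic
        self.save_output(scored, "s3://bucket/models/bid_factor/")
```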

 

What advice do you have for other data scientists looking to improve their workflows?

We have a close working relationship with our engineers. From the first design conversations through productization and release, both data scientists and engineers are part of the conversation. This benefits everyone; early data science prototypes can be informed by production considerations, and the engineers working on the final stages of deployment have an understanding of how the model is supposed to work. Different people have different perspectives, and hearing them all at various stages of development can be helpful.

Also, don’t add too much structure before you need it. The pipeline we built has been helpful, but we’ve also made a point of keeping the structure pretty loose. Each model accesses data and is scheduled to run the same way. But beyond that, design is driven by the requirements of the project and what the data scientist building it thinks is appropriate.

 

Natalie Goldman
Data Analyst • theSkimm

Context is crucial for Goldman and her team. While numbers and raw data are important, Goldman said they can be deceiving when they lack circumstantial information. So it’s key to spend time analyzing qualitative data and user research.

 

What are some of your favorite tools that you’ve used to build your data science workflow?

At a high level, my workflow is as follows: align on success metrics; find, validate, clean and analyze the data; apply models; communicate results; make recommendations and continue to monitor results. 

At the beginning of the process, I ask partners about the problem we are trying to solve and the metrics that will be most useful. I use this question as a “tool” to help focus on the user and determine what we can actually find using data. 

 

What are some of your best practices for creating reproducible and stable workflows?

I have found building informative, customizable dashboards to be the most effective, especially during long testing periods. I often share my dashboard with a partner in the company to test its effectiveness and then iterate depending on whether or not it is successfully interpreted without my assistance. Practices I have found helpful include using colors, usually green and red, as indicators of “good” and “bad,” and using text boxes on the dashboard to aid in interpretation. I also build in filters or editable fields to zero in on key data without changing anything on the back end.

 

What advice do you have for other data scientists looking to improve their workflows?

Documentation, organization and effective dashboarding are three tools to improve workflow. I also recommend using a planned file structure, setting calendar reminders and systematically collecting results and storing them in one place.

Additionally, as data professionals, we often trust numbers as the holy grail for everything we do. However, it’s important to recognize that numbers and metrics can sometimes be misleading without the proper context. Qualitative data and user research can provide invaluable insight into how users interact with products, and why they do what they do. Integrating research data such as surveys, Net Promoter Scores or brand studies into our workflows can help us put things into perspective.

 

Maureen Teyssier
Chief Data Scientist • Reonomy, an Altus Group Business

“Data metrics are necessary before and after feature generation, and on the output from the model,” Teyssier said. 

Clear metrics — defined early alongside key stakeholders — make output changes easy to monitor. 

 

What are some of your favorite tools that you’ve used to build your data science workflow?

We begin data science projects with a discovery phase that includes stakeholders from the product, data science and engineering teams. Having this collaboration upfront greatly increases the success rate of the projects. It gives our data scientists enough context to feel confident in their decisions and allows them to feel the importance of their work.

Machine learning projects are only successful when high-signal data is fed into models. Our data scientists create this signal by doing visualizations and analysis in Databricks, which allows them to extract data from many points in our Spark-Scala production pipelines. Using an interactive Spark environment also allows them to write code that is easier to transition into our production pipelines, which is important with cross-functional teams.
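As a generic sketch of that kind of exploratory step (the path and column names are invented, not Reonomy’s schema), a Databricks notebook cell might aggregate a pipeline table before plotting:

```python
# Hypothetical Databricks/PySpark sketch: pull an intermediate table from the
# production pipeline, aggregate it, and bring a small summary back for plotting.
from pyspark.sql import functions as F

# `spark` is provided automatically in a Databricks notebook.
properties = spark.read.parquet("s3://bucket/pipeline/properties/")  # invented path

summary = (
    properties
    .groupBy("property_type")
    .agg(F.count("*").alias("n"), F.avg("building_area").alias("avg_area"))
    .orderBy(F.desc("n"))
)

summary_pd = summary.toPandas()   # small enough to inspect and plot locally
summary_pd.plot.bar(x="property_type", y="avg_area")
```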

 

What are some of your best practices for creating reproducible and stable workflows?

When there’s machine learning embedded in production pipelines, it’s essential to create actionable metrics in several locations within the pipeline. “Actionable” means the metrics have enough breadth to capture changes in the data, but they aren’t so general that it’s difficult to obtain an understanding of what is going wrong. 

Data metrics are necessary before and after feature generation, and on the output from the model. Having these metrics means when output changes, it’s possible to quickly identify whether or not it’s okay. If it’s not okay, the metrics indicate where it needs to be fixed. We have also chosen not to dynamically train the models because, for a growing company, it adds a lot of uncertainty for marginal lift.
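A minimal sketch of what such checkpoint metrics could look like, with illustrative helper names and thresholds rather than Reonomy’s actual implementation:

```python
# Hypothetical sketch: compute simple, actionable metrics at fixed points in the
# pipeline (raw data, generated features, model output) and flag unexpected changes.
import pandas as pd

def data_metrics(df: pd.DataFrame, stage: str) -> dict:
    return {
        "stage": stage,
        "rows": len(df),
        "null_rate": float(df.isna().mean().mean()),
        "numeric_means": df.select_dtypes("number").mean().to_dict(),
    }

def check_drift(current: dict, baseline: dict, tolerance: float = 0.10) -> list:
    """Return human-readable alerts when a metric moves more than `tolerance` vs. baseline."""
    alerts = []
    for col, mean in current["numeric_means"].items():
        base = baseline["numeric_means"].get(col)
        if base and abs(mean - base) / abs(base) > tolerance:
            alerts.append(f"{current['stage']}: mean of {col} shifted {mean:.3f} vs {base:.3f}")
    return alerts
```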

 

What advice do you have for other data scientists looking to improve how they build their workflows?

It’s important for a data scientist to consider a few key things: a clear idea of what needs to be built; ready access to the data needed to test features and models; the right tools that allow for quick iteration; clear communication with the people that will be implementing the model in production; and a way to surface technical performance metrics to stakeholders in the company. 

 

Greg Svigruha
Lead Data Engineer • Orchard

Svigruha’s data team relies heavily on testing and redeployment, since its work helping users buy and sell homes at fair prices depends on the ever-changing housing market. Before any modeling change is deployed, the team backtests it with simulations of how it would have performed historically.

 

What are some of your favorite tools that you’ve used to build your data science workflow?

Our models need to be redeployed continually as the housing market changes, for two main reasons. We have near real-time transactional data feeding into our system that accounts for movements in the market, and we need the model and its predictions to reflect those changes. And improvements to the modeling algorithm itself are deployed multiple times a week.

We retrain our production model every night to make sure it has the latest market data. We use Airflow to perform a number of functions on AWS, like executing the latest algorithm to create model files, building a Docker image from the prediction service’s codebase, deploying and performing walk forward testing.
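A stripped-down sketch of a nightly DAG along those lines; the task commands and names are placeholders, not Orchard’s actual pipeline:

```python
# Hypothetical Airflow sketch: retrain nightly, build the prediction service's Docker
# image, deploy, then run walk-forward tests. Commands are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_model_retrain",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    train = BashOperator(task_id="train_model", bash_command="python train.py")
    build = BashOperator(task_id="build_image", bash_command="docker build -t prediction-service .")
    deploy = BashOperator(task_id="deploy", bash_command="./deploy.sh")
    test = BashOperator(task_id="walk_forward_test", bash_command="python walk_forward_test.py")

    train >> build >> deploy >> test
```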

 

What are some of your best practices for creating reproducible and stable workflows?

Every change to the modeling algorithm needs to be backtested before it’s deployed to production, which is the most challenging part. Ideally, our backtests simulate how the modeling algorithm would have performed over the last year, had we deployed it one year ago. 

The evaluation data has to be large enough to be statistically significant and to counter seasonal effects, but we cannot use the same model to predict for an entire year because we would have trained new ones during that time. So, we repeatedly create new versions of the model by applying a shifting window to a historical dataset. One key difference compared to reality is that we simulate weekly instead of daily retraining, for cost and capacity reasons. This workflow is also orchestrated with Airflow and AWS.

 

Steps in Orchard’s backtest simulations

  • Adding new features to the model algorithm and regenerating training data
  • Launching 52 EC2 machines for the 52 weeks in a year
  • Training models on shifted versions of 10 years of historical data
  • Aggregating residuals and computing statistics of the model’s expected performance
  • Comparing performance to baseline and deciding on a proposed change
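
A minimal sketch of the shifting-window scheme behind those steps, with illustrative dates and window lengths:

```python
# Hypothetical sketch of the shifting-window backtest: one model per week of the
# evaluation year, each trained only on data available up to that week.
import pandas as pd

history_start = pd.Timestamp("2010-01-01")   # ~10 years of historical data
eval_start = pd.Timestamp("2019-01-01")      # simulate the most recent year

windows = []
for week in range(52):
    cutoff = eval_start + pd.Timedelta(weeks=week)
    windows.append({
        "train": (history_start, cutoff),                    # everything before the cutoff
        "test": (cutoff, cutoff + pd.Timedelta(weeks=1)),    # the following week
    })

# Each window would be trained and evaluated on its own machine, then residuals aggregated.
for w in windows[:2]:
    print(w)
```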

 

 

Responses have been edited for length and clarity. Images via listed companies.
