Data Scientists Share What Technologies They’re Using to Build Their Data Pipelines — and Why

Written by Madeline Hester
Published on Jan. 17, 2020

Before Harry’s was founded in 2013, most men purchased shaving razors at their local convenience stores. By adopting a direct-to-consumer business model, Harry’s sells thousands of its German-engineered razors to consumers at affordable prices.

But the company doesn’t operate on sharp edges alone. With a $112 million Series D funding round in December 2017, the e-commerce company expanded its offerings into skin care and shaving cream. Still, each new razor that comes off the factory line in Germany means one thing for Head of Analytics Pooja Modi: high volumes of data.

To address scaling needs, Modi said, “We’re spending a lot of energy on data validation and measuring data quality at every single step, with robust monitoring and alerting on the health of our data. We are also focused on data testing and documentation, enabling us to better communicate context and expectations across team members.”

Ensuring that the data pipeline continues to scale with the business means starting with the right tools, whether that means turning to trusted programming languages like Python or harnessing new technologies like Snowflake. Read on to hear how Modi and Cherre Senior Data Scientist John Maiden process data with cutting-edge tools.

 

Pooja Modi
Head of Analytics • Harry's Inc.

As Harry’s e-commerce business expands from men’s razors to encompass shaving products and skincare, Head of Analytics Pooja Modi said, “Scalability is definitely top of mind.” To ensure Harry’s data pipeline can scale to support higher volumes of data, her team is focused on measuring data quality at every step.

“We are focused on data testing and documentation.”

 

What technologies or tools are you currently using to build your data pipeline, and why did you choose those technologies specifically?

This question is perfectly timed as we are in the middle of piloting several different tools while getting smarter on our decision criteria. Historically, we have relied on Redshift and Looker, enriched with a set of in-house capabilities to support data ingestion and orchestration (e.g. our open-source tool Arthur).  

We’re now in the midst of piloting several new technologies (e.g. Snowflake, Stitch and DBT), broadly optimizing for usability, reliability, cost and feature richness. We like to work with technologies that come with a high level of customer support from the vendors and user communities. It’s also appreciated when tools provide turnkey integrations, enabling us to refocus our bandwidth on solving specific complexities within the business.

 

As your company — and thus, your volume of data — grows, what steps are you taking to ensure your data pipeline continues to scale with the business?

Scalability is definitely top of mind for Harry’s as we are in the middle of rapidly spinning up new brands and scaling to several more retailers this year. We need to ensure that our data pipeline can scale to support a higher volume and variety of data. Secondly, the data pipeline needs to continue to be manageable for the current and future team. To address these needs, we’re spending a lot of energy on data validation and measuring data quality at every single step, with robust monitoring and alerting on the health of our data. We are also focused on data testing and documentation, enabling us to better communicate context and expectations across team members.
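
The step-level validation Modi describes often comes down to simple, automated checks, such as row counts and null rates, that run after each load and raise an alert when a threshold is breached. Below is a minimal sketch in Python of what such a check might look like; the table columns, thresholds and alerting hook are hypothetical illustrations, not Harry’s actual implementation.

```python
# Hypothetical sketch of step-level data quality checks; the table columns,
# thresholds and alert hook are illustrative assumptions, not Harry's pipeline code.
import pandas as pd


def check_orders(df, min_rows=1000, max_null_rate=0.01):
    """Return a list of human-readable data quality failures for an orders table."""
    failures = []

    # Volume check: a sudden drop in row count often signals a broken upstream load.
    if len(df) < min_rows:
        failures.append(f"row count {len(df)} is below the expected minimum of {min_rows}")

    # Completeness check: key columns should be (almost) fully populated.
    for column in ("order_id", "sku", "order_date"):
        null_rate = df[column].isna().mean()
        if null_rate > max_null_rate:
            failures.append(f"{column} null rate {null_rate:.2%} exceeds {max_null_rate:.2%}")

    return failures


def alert(failures):
    # Stand-in for a real integration with a monitoring or paging tool.
    for failure in failures:
        print(f"[DATA QUALITY ALERT] {failure}")


if __name__ == "__main__":
    # Toy data so the sketch runs end to end.
    orders = pd.DataFrame({
        "order_id": range(1200),
        "sku": ["razor-blade-8ct"] * 1200,
        "order_date": ["2020-01-17"] * 1200,
    })
    alert(check_orders(orders))
```

In practice, checks like these are typically wired into the orchestration layer so that a failed validation blocks downstream jobs and notifies the team, which is what robust monitoring and alerting on data health usually looks like day to day.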

 

John Maiden
Senior Data Scientist • Cherre

Cherre provides investors, insurers, real estate advisors and other large enterprises with a platform to collect, resolve and augment real estate data from hundreds of thousands of public, private and internal sources. Senior Data Scientist John Maiden explained why he relies on standard Google Cloud tools to manage that ever-changing data.

“Python is a mature language with great library support for ML and AI applications.”

 

What technologies or tools are you currently using to build your data pipeline, and why did you choose those technologies specifically?

We're a Google Cloud shop, so we use many of the standard ETL tools like Airflow, BigQuery and PostgreSQL to get data ready for analysis. Once the data is ready, we use Python and the usual suspects, such as Pandas and scikit-learn for small data sets, and Spark (PySpark) when we need to scale. Python is a mature language with great library support for ML and AI applications, and PySpark allows us to extend that functionality to large data sets. Spark GraphFrames is a critical technology for us when it comes to graph processing, since we're handling hundreds of millions of rows.
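
As a concrete illustration of the graph processing Maiden mentions, GraphFrames represents a graph as two Spark DataFrames, one of vertices and one of edges, and ships distributed algorithms such as connected components. The sketch below uses that approach to group records that share an identifier; the schema, column names and package version are assumptions made for the example, not Cherre’s actual pipeline.

```python
# Hypothetical sketch: PySpark + GraphFrames to group property records that appear
# to refer to the same entity, via connected components. The schema and column
# names are illustrative assumptions, not Cherre's actual data model.
# Requires the `graphframes` Python package plus the matching Spark package below.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = (
    SparkSession.builder
    .appName("entity-resolution-sketch")
    # GraphFrames ships as a Spark package; the version must match your Spark build.
    .config("spark.jars.packages", "graphframes:graphframes:0.8.2-spark3.2-s_2.12")
    .getOrCreate()
)
# connectedComponents() requires a checkpoint directory.
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

# Vertices: one row per raw record. GraphFrames expects an "id" column.
vertices = spark.createDataFrame(
    [("r1", "123 Main St"), ("r2", "123 Main Street"), ("r3", "9 Elm Ave")],
    ["id", "address"],
)

# Edges: pairs of records believed to match (in practice produced by blocking and
# matching rules). GraphFrames expects "src" and "dst" columns.
edges = spark.createDataFrame([("r1", "r2")], ["src", "dst"])

g = GraphFrame(vertices, edges)

# Each connected component becomes one resolved entity ID.
components = g.connectedComponents()
components.select("id", "address", "component").show()
```

Because connected components runs as a distributed Spark job, the same pattern scales from the toy DataFrames above to the hundreds of millions of rows Maiden describes.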

 

As your company — and thus, your volume of data — grows, what steps are you taking to ensure your data pipeline continues to scale with the business?

There are two parts to making our work scale. The first is building reusable pipelines and a growth-focused architecture that can develop alongside growing client demand. The second is cultivating a team with strong domain knowledge that understands the type and quality of data currently available, and what we need to add or improve to better support our products.

 

Responses have been edited for length and clarity. Images via listed companies.