All Companies Use Data. But Not All of Them Are Data-Driven.

On October 3, 2021, two years into the pandemic, the International Consortium of Investigative Journalists published a landmark report detailing a shadowy financial system rife with fraud and offshore dealings of the world’s elite. The largest collaboration of its kind, the Pandora Papers uncovered secrets distilled through millions of leaked documents and almost three terabytes of data.

At that same moment, Johns Hopkins’ Covid-19 map continued to tally numbers on its data-hungry dashboard, as it had when the virus first blanketed the globe. Just a year before, data-driven publication The Pudding found that scientists published, on average, 11 coronavirus-related articles per hour throughout the entirety of 2020.

But data doesn’t rest. And in 2022, The New York Times showed that neither did the virus, with their data-visualized breakdown of what it takes to understand the fast-changing Covid-19 variants.

No matter the era, no matter the events, big or small — data colors and connects the world.

For tech companies, the complex interplay of statistics fed through programs for processing, analysis and use drives many of their core functions. Products are fueled by the data with which engineering teams work, and their development cycles are similarly informed. Knowledge is power, but that doesn’t mean all data teams use the same technology in the same way.

For cannabis company LeafLink, the budding nature of the industry means the real-time operational data that logistics depends on calls for extra finesse and adaptation. “We are exploring the right architecture to meet near real-time service-level agreements for some of our operational and platform data needs,” said Nikhil Goyal, the company’s VP of data and analytics. Meanwhile, in a starkly different field, edtech company Teachers Pay Teachers employs a cloud-based data stack. “We’re very mindful of our responsibility and obligations around student data,” explained Data Engineering Director Carly Stambaugh.

The world of data moves fast. New challenges, evolving technologies and expanding stakeholder demands all contribute to the dynamic data stack that engineering teams fine-tune on a regular basis. To get more insight on how data scientists and engineers are looking ahead, Built In NYC sat down with two tech companies motivated by foresight for scale, future needs and the inevitable obsolescence in their stack.

Nikhil Goyal

VP, Data & Analytics • LeafLink

LeafLink is a B2B cannabis company that connects industry brands, distributors and retailers on a wholesale platform. With a suite of digital tools and technology, it aims to make the cannabis supply chain more streamlined and accessible. As an e-commerce platform, LeafLink’s product is powered by data, backed by a robust tech stack to support its many functions. For Nikhil Goyal, VP of data and analytics, the company “has a very diverse use case for data,” necessary to tackle the unique challenges of an emerging industry.

Describe your data stack, and why you use that combination of tools.

Our data stack comprises Fivetran, Segment, Airflow and dbt for ELT and orchestration, Redshift and S3 for data warehousing and storage, Tableau for BI and external-facing data products, and a Data Service API for consumption of data by any application. Our DevOps engineering team maintains the deployments and guarantees more than 99.5 percent uptime for the Airflow, data warehouse and Tableau servers. We also use Census for reverse ETL, sending data back into business systems like Salesforce and HubSpot, and split.io for experimentation.

“The nature and operations of each of [our core offerings] make their data needs unique.”

How does your organization use data, and what unique requirements does that place on your team and the technology you use?

Internally, we have the GTM teams and the product teams as our main stakeholders looking to inform OKRs, measure performance and generate insights and hypotheses. Externally, the data team has built and maintained four embedded paid data products that are part of LeafLink.com, and we also maintain the pipelines which feed data to the CTAs placed within the website for things like seller ranking, customer matching, repurchase recommendations and more.

Since LeafLink has three core offerings (the marketplace, payments and logistics), the nature and operations of each of these make their data needs unique. At the same time, we also need to blend the data from each of them to form a 360 view of the customers. One of the most challenging aspects is logistics, since they need real-time operational data and have customized experiences and tool sets depending on the market. The nature of the cannabis industry, where there is no UPC for products and no public data on company performance, poses other challenges in providing insight on market growth and inventory.

How has your stack evolved over time, and how do you think it’ll continue to evolve in the next few years?

When we started building the platform, we deployed Airflow on a single large instance. Our DevOps team quickly helped us scale up to Airflow on Kubernetes. We used to have Periscope, where self-service capability was limited, so we moved to Tableau, which allows users to build dashboards without typing SQL. To operationalize data, we added Census and also a Slack notification system. We started building the data warehouse without dbt at first but eventually settled on dbt to maintain all data models and business logic. On the team adoption side, we used to have two to three people who would write dbt models and build dashboards; today, everyone except data engineers does it while specializing in their core skill set.

While we have batch prediction models and are able to meet the 15-minute criteria for data from OLTP to OLAP, the biggest area to unlock as we move towards an event-driven architecture on our application side is going to be real-time data for logistics operations and applied machine learning to drive behavior on the platform. As LeafLink grows, we’ll need to get faster at both.

Carly Stambaugh

Engineering Director, Data Eng • TPT (formerly Teachers Pay Teachers)

Teachers Pay Teachers (TPT) is an edtech company offering a teacher-powered learning platform with a suite of tools that aims to empower educators. TPT provides teachers with an online marketplace where they can buy and sell educational materials within a supportive community. To make digital content and tools accessible for teachers, the data team works with resources of their own, with goals in mind to leverage data even more. “We are using data to power our products — and our product development cycle — through user research, deep dive analyses, predictive algorithms and experimentation,” said Data Engineering Director Carly Stambaugh.

Describe your data stack, and why you use that combination of tools.

Our data stack is heavily cloud-based. We use a hybrid cloud solution, GCP and AWS, with our production databases being RDS instances in AWS and BigQuery as our data warehouse. We ingest several external data sources, such as Salesforce, Heap and others, and import this data into our warehouse using Stitch as our EL tool. For transformations, we run Python and JavaScript ETL processes, which run in our production Kubernetes cluster, as well as dbt 1.1.x. We tie everything together with Google Cloud Composer.

Our product and business teams use Looker for metrics and BI, and our data scientists use tools like Hex notebooks to perform analyses and communicate results with a broader audience. We’ve built an in-house system for training and deploying machine learning algorithms to production, which uses BigQuery ML.

How does your organization use data, and what unique requirements does that place on your team and the technology you use?

At Teachers Pay Teachers, we use data to track the health of our business, as well as to understand how our users are interacting with our products. This allows us to connect a team’s efforts to the impact it has on teachers and students every day. Because we are a marketplace, we also provide data and insights to our teacher authors about the health of their business and how their work impacts teachers and students.

As an edtech company, student data also affects the technology and processes we use with our data at all steps of the pipeline, from data collection and storage to data insights.

“We use data to track the health of our business, as well as to understand how our users are interacting with our products.”

How has your stack evolved over time, and how do you think it’ll continue to evolve in the next few years?

The first iteration of our data stack relied heavily on custom in-house code and on-premises infrastructure. For example, we were hosting our own Airflow and Jenkins instances. Much of the data engineering team’s work over the past year has focused on modernizing our stack, and removing legacy codebases and infrastructure.

Looking forward, we want to increase availability of data, for example by implementing micro-batch and streaming data pipelines in addition to our current batch pipelines and dbt runs. We also want to provide more real-time predictions and models from data science deployed into production, for instance a real-time recommendation engine to help teachers find the right resource at the right time.
