What’s It Like to Build a Data Foundation From Scratch?
Monika Heinig, who has a doctorate in mathematics, has gotten to do something that would immediately make most data scientists envious.
Nearly three years ago, she was the first official hire at Clyde, an end-to-end product protection platform for retailers. Tasked with organizing the company’s data foundation, Heinig found herself responsible for determining what data the team used, how it was stored and what their analytics database looked like. That’s “typically something a data scientist can only dream of,” Heinig said.
To fully understand what an exciting and challenging opportunity this was, look no further than Clyde’s mission. The NYC startup is building tech that allows brands to offer extended warranties and product protection plans. So, if a consumer buys something expensive and a Clyde plan — and then the product breaks — that consumer knows they’re covered by Clyde.
Keeping these consumer extended warranties in order and capitalizing on the chance to collect personalized data from buyers keeps Heinig and the tech team working with a staggering amount of data each day. But that doesn’t mean their approach is scattershot or casts too wide a net.
“Every data point we collect is done with a purpose,” Heinig said. “While we might not be able to utilize every single one at the moment, it was collected so that we could do something specific or answer a specific question down the road.”
Built In NYC connected with Heinig to learn more about how she built out the data science function at Clyde. She filled us in on what Clyde does right when it comes to data collection and storage, why all of that information is vital for providing a superior product, and how she’s set the company up for success in working with data in the future.
First off, tell us a bit about your work at Clyde. What are your main responsibilities? How has your role changed during your time at Clyde?
My role has varied over my two-and-a-half years at Clyde, but I have primarily been focused on establishing Clyde’s data foundation. As the first hire and employee number five, my first responsibility was to look at the data and see where we were.
Thanks to our engineering team, the data we had on the contracts we sold was great, but I knew we needed more data. Shortly after I started, we spun up an entire analytics database where I got to hand-pick what data we collected and exactly how it was collected and stored, which is typically something a data scientist can only dream of. With data science, the better the data going in, the better the results coming out and vice versa.
After getting all of the data and understanding it for myself, I then needed to be able to share that information with not only our internal team but also our partner merchants, enabling them to see how they were doing, where they could improve and how to optimize offerings. I set up our current business intelligence (BI) tool to do just that. I put together dashboards and reports to monitor performance and trends and to make recommendations.
Why is it important that Clyde be data-centered? How does that give the company an edge on the competition?
For Clyde to succeed and grow, it must make good decisions. The best decisions come from the best information — which comes from the best data, so Clyde must be data-centered.
As I previously described, it was my responsibility to set up the foundation of data at Clyde to set us up to not only do the best reporting, but also set us up to make the best recommendations, and have the best data for machine learning algorithms. Every data point we collect is done with a purpose. While we might not be able to utilize every single one at the moment, it was collected so that we could do something specific or answer a specific question down the road.
There are companies that collect data just to collect data, and love to boast about how much data they have. But at the end of the day, most of it is probably extremely messy and hard to work with — if it can even be salvaged at all. Since data at Clyde is collected with a vision of making the experience for the end customer flawless and providing the best feedback and most money to our partners, we’re very methodical. That inherently leads to accuracy and efficiency.
“The best decisions come from the best information — which comes from the best data.”
What kinds of data are you collecting and how do you collect it?
Clyde not only collects information about the contracts that we sell (including date of purchase, price, term purchased, which product the contract was purchased for, and the merchant and customer information), but we also capture all order data and which calls to action (CTAs) were visible to the end customer at time of purchase. That means for every order, we have information about every product that was purchased, whether there was a contract matched to that product at the time of purchase, and if so, what options were available at the time of purchase. We also record whether a contract was indeed purchased.
This allows us to distinguish between eligible products (products that were matched to contracts at time of sale and could have had a contract purchased with them) and ineligible products (products that were not matched to contracts at time of sale, or where the end customer did not see any CTAs and so had no opportunity to purchase a contract). This then allows us to accurately calculate a merchant’s attach rate, which is the percentage of eligible products that are purchased with a contract.
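The attach-rate calculation described above can be sketched in a few lines. This is an illustrative example only — the field names and record shape are hypothetical, not Clyde’s actual schema:

```python
from dataclasses import dataclass

@dataclass
class OrderLine:
    """One product on an order (hypothetical shape for illustration)."""
    product_id: str
    contract_offered: bool    # a matching contract and CTA were visible at purchase
    contract_purchased: bool

def attach_rate(lines: list) -> float:
    """Attach rate = contracts sold / eligible products."""
    # Only products with a visible offer count toward the denominator
    eligible = [l for l in lines if l.contract_offered]
    if not eligible:
        return 0.0
    return sum(l.contract_purchased for l in eligible) / len(eligible)

orders = [
    OrderLine("sku-1", True, True),
    OrderLine("sku-2", True, False),
    OrderLine("sku-3", False, False),   # ineligible: no CTA shown
    OrderLine("sku-4", True, True),
]
rate = attach_rate(orders)   # 2 contracts across 3 eligible products
```

Excluding ineligible products from the denominator is what makes the metric fair to merchants whose catalogs are only partially matched to contracts.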
Along with all of this snapshot data, we also collect the customers’ state, ZIP code and country, which enables us to determine eligibility and gives us data for heat maps to inform merchants where most of their customers reside. Complete customer information that is required by the underwriter is stored separately.
Additionally, another dataset we collect is around how an end user interacts with our CTAs. We receive a row of data if a person has seen a CTA and which CTA it was, and a row for all the possible clicks made — like if they click the Clyde logo, if they click the FAQs, if they go back and forth between different term options (we get a row of data for each click), if they exit out of the modal and how they exit. By having hashed unique identifiers associated with each click, we can also determine the number of unique users who interact with our CTAs on a daily basis.
At Clyde, we are very conscious and protective of people’s personal information and security, and so we are SOC 2 compliant.
How is the information the data provides embedded in the production process from start to finish?
The information the data provides is primarily relayed internally, either through verbal communication from me to the team, in the form of dashboards and reports through our internal BI tool, or just through a simple Excel spreadsheet. The goal of my job is to provide the stakeholders with information in the most understandable and effective way for them. Not every question or request requires a complicated machine learning algorithm — sometimes an Excel file does the job.
One of the other important aspects of the job is documentation, as many of the data results and procedures need to be reproducible.
“Not every question or request requires a complicated machine learning algorithm — sometimes an Excel file does the job.”
What’s a hurdle or challenge you didn’t originally anticipate when starting in this role? How did you overcome it?
One major challenge I did not anticipate was having data in two separate databases. We have one database that feeds our product, so it is very important that it holds only the essential information, so as not to slow the product down. Our other analytics database captures all of our snapshot data, which is stored in a data warehouse intended to hold extremely large amounts of data. However, we have several metrics that are calculated from the data in one and several metrics that are calculated from the data in the other.
Since the two databases didn’t talk, it was hard to provide all of the data necessary. Luckily, with the technology now available, we have the capability to merge the data together into one large data set and can even incorporate data from a third source.
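The merge described above amounts to loading both sources into one place and joining on a shared key. Here is a minimal sketch using SQLite as a stand-in warehouse — the table and column names are hypothetical, and a left join keeps every order even when no contract was sold:

```python
import sqlite3

# Stand-in warehouse holding extracts from both stores:
# "orders" from the product database, "contracts" from the analytics database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_total REAL)")
conn.execute("CREATE TABLE contracts (order_id INTEGER, contract_price REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(101, 199.0), (102, 89.0), (103, 349.0)])
conn.executemany("INSERT INTO contracts VALUES (?, ?)",
                 [(101, 24.99), (103, 39.99)])

# LEFT JOIN keeps every order; contract fields are NULL where none was sold
rows = conn.execute("""
    SELECT o.order_id, o.order_total, c.contract_price
    FROM orders AS o
    LEFT JOIN contracts AS c ON o.order_id = c.order_id
    ORDER BY o.order_id
""").fetchall()
```

The same shape generalizes to a third source: land each extract as its own table keyed on the shared identifier, then join them in the warehouse rather than asking the operational database to do analytics work.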
What’s an exciting project you are working on right now?
Currently, I am setting up Looker as a new internal BI tool. This will allow Clyde users to not only have direct access to more data, but also allow them to interact with it and be able to answer some of their own questions.
Looker is more user-friendly, which will be especially helpful for a non-technical audience, but its many capabilities make it useful for everyone. At the moment, all data-related inquiries go through me, so this will not only allow others to explore the data themselves, in a secure fashion, but will also free up some of my time to work on other data science projects from my laundry list of ideas.
What is your overall vision for the future of the data science and analytics team at Clyde?
First, I would love to grow the team, as I am currently a lean team of one. Along with a larger team, I would love to see the data being fully utilized to do more predicting and machine learning to look forward, rather than primarily look backward at what historically has happened or is currently happening. This would enable our success teams to be more proactive rather than reactive, and hopefully improve optimizations for Clyde, our merchant partners and our underwriter partners.
I think the foundation of data that has been established will fuel this. To have several high-impact projects running simultaneously across a team of data scientists, data engineers and data analysts that provide information, insights, recommendations and improvements is where I hope Clyde will be in the not-so-distant future!