Refreshing a tech stack is anything but a run-of-the-mill endeavor. Yet making a change can be necessary to better sustain growth — a situation which Shray Kumar, an infrastructure engineer, is helping retail martech company Bluecore currently navigate.
“As we onboard more clients, we’re growing at a dramatic scale. We’re sending hundreds of millions of emails a day,” Kumar said. “Now we have to fine-tune more knobs around our performance in order to meet that scale in a cost-efficient way. We want to be able to provide our engineers more observability and have everything homogenized in the stack.”
Over the remainder of this year and into the next, Bluecore is implementing a series of changes that team members said will ultimately boost its email marketing capabilities. Key changes include transitioning from Google App Engine (GAE) to Kubernetes, breaking up a monolith into Golang-based microservices and more.
Glenn Nagel, who is a staff software engineer on an infrastructure-focused team, has played a part in many of the changes. The migration is all part of a broader effort to transition Bluecore systems to be stream-based, which Nagel said will be an asset to data ingestion, allowing for a simpler architecture for the team to navigate.
“We want to be able to process things and pipeline data much more effectively, as well as the ability to have a true streaming representation of the system that is based around an event stream versus a database record. Then we can have a change log of things that have occurred over time,” Nagel said.
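To make the idea of an event stream serving as a change log concrete, here is a minimal Go sketch. It is not Bluecore's code: the `Event` fields, the event kinds, and the `EventLog` type are all hypothetical names chosen for illustration. It shows how state at any point in time can be rebuilt by replaying an append-only log, which a single overwritten database record cannot offer.

```go
package main

import "fmt"

// Event is a hypothetical record of something a shopper did.
type Event struct {
	Seq       int    // position in the log, starting at 1
	ShopperID string // who did it
	Kind      string // e.g. "viewed", "added_to_cart", "removed_from_cart"
	ProductID string // what it was done to
}

// EventLog is an append-only list of events, i.e. the change log.
type EventLog struct {
	events []Event
}

// Append adds an event to the end of the log and assigns its sequence number.
func (l *EventLog) Append(shopperID, kind, productID string) Event {
	e := Event{Seq: len(l.events) + 1, ShopperID: shopperID, Kind: kind, ProductID: productID}
	l.events = append(l.events, e)
	return e
}

// CartAt replays events up to and including upToSeq and returns the
// shopper's cart (product ID -> quantity) as it stood at that point.
func (l *EventLog) CartAt(shopperID string, upToSeq int) map[string]int {
	cart := map[string]int{}
	for _, e := range l.events {
		if e.Seq > upToSeq {
			break
		}
		if e.ShopperID != shopperID {
			continue
		}
		switch e.Kind {
		case "added_to_cart":
			cart[e.ProductID]++
		case "removed_from_cart":
			if cart[e.ProductID] > 0 {
				cart[e.ProductID]--
			}
			if cart[e.ProductID] == 0 {
				delete(cart, e.ProductID)
			}
		}
	}
	return cart
}

func main() {
	el := &EventLog{}
	el.Append("shopper-1", "viewed", "shoes")
	el.Append("shopper-1", "added_to_cart", "shoes")
	el.Append("shopper-1", "removed_from_cart", "shoes")

	// The same log answers both "what is the cart now?" and "what was it earlier?".
	fmt.Println("cart after event 2:", el.CartAt("shopper-1", 2))
	fmt.Println("cart after event 3:", el.CartAt("shopper-1", 3))
}
```

Because nothing is overwritten, the history itself stays queryable, and a batch job and a real-time consumer could in principle run the same replay logic.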
What’s Happening in the Stack
- Transitioning from GAE to Kubernetes
- Decoupling a monolith into microservices written in Golang
- Leveraging Pub/Sub Messaging and Argo Rollouts
The team described the changes as mutually beneficial. For Bluecore, team members benefit from more consistency in the tech stack, which Software Engineer Sunny Shapir noted is useful for junior engineers who are still learning. Additionally, Kumar said tech like Pub/Sub Messaging will ease potential growth into other geographic markets. On the other side, clients enjoy what colleagues characterized as faster and more reliable email deployment capabilities.
“Reliability is very important to our customers. It does wonders for the relationship. It’s nice to answer their questions with confidence and not have so many unknowns,” said Shapir, who works on campaigns.
Additionally, Nagel said that it helps the team monitor performance to see if they’re meeting their customers’ needs: “The changes give us the ability to guarantee our service level objectives and agreements to make sure we can process the events and send the emails in the time frames that we expect to. The consistency itself has really made it much simpler to say in a very large distributed system: Where am I at? How is it going?” he said.
Chatting with Built In NYC, Kumar, Nagel and Shapir discussed the changes at play, the motivating factors behind the updates, and the internal and external advantages that have resulted.
We’ve touched on the work you’re doing at Bluecore and the migration underway. Let’s talk more about changes to the tech stack. First off, is there a major throughline between all of those changes?
Nagel: Historically, Bluecore does a lot of stuff that’s event-driven: A customer viewed this; they added to their cart; they did all the various other things. But there were other areas in the system where that was a batch process, like bundling all the data up and turning it into large backend systems.
In an effort to modernize all of this, we’re moving much more toward a fully event stream-based system. That allows us to take the old business logic as it existed for, say, something being delivered to an inbox, opened or added to your cart, and make that same backend work for both the batch streaming systems and the real-time streaming systems. A lot of the effort there is to make it so that we have homogeneous data ingestion. That in turn reduces the amount of code that we have to deal with over time and simplifies the overall architecture of the system.
You’re migrating from GAE to Kubernetes. Tell us more about that.
Nagel: Horizontal scaling is much easier to do in a Go, Pub/Sub-based streaming system. Without that, we simply can’t auto-scale, and we need to be able to going forward to meet those service level objectives and agreements. Other reasons include speed and cost efficiency. It’s much more cost-effective and faster to process these things in Kubernetes and Go than we were able to do in Python.
Kumar: We have more control and more observability around whether or not software is working as expected. Having access to those fine-grained dials allows us to understand how we need to dedicate our engineering time.
Shapir: Instead of just using whatever existed in the Bluecore ecosystem to get it to do whatever we wanted, we now have the luxury of choosing a tech stack for each service that we're trying to build out and making sure that it’s optimized for this specific use case.
You’re also decoupling your Python monolith into microservices written in Golang. Why?
Nagel: In terms of splitting it between Python and Go, one of the key ways we did that was through the use of Protobuf interfaces. This allowed us to create a very concrete, well-defined API contract between the systems. That in turn allowed us to effectively create system boundaries where they didn’t exist before. That’s been really critical in terms of moving things out of the monolith into Kubernetes.
Additionally, you’re using Pub/Sub Messaging. What are the benefits of that?
Nagel: One of the advantages is being able to handle the back pressure when there’s a very large influx of events. That allows us to scale gracefully over time and add more pods to a Kubernetes cluster, simply based on CPU usage or queue depth. That means that we can take our time processing the data, we can recover gracefully and everything works really well.
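As a rough illustration of scaling on queue depth, the Go sketch below computes how many pods would be needed so that each one handles at most a target share of the backlog. The function name, the per-pod target, and the replica bounds are assumptions made for this example, not Bluecore's settings; the rule is similar in spirit to how a Kubernetes autoscaler sizes a deployment against a target average value per pod.

```go
package main

import (
	"fmt"
	"math"
)

// DesiredReplicas returns ceil(queueDepth / targetPerPod), clamped to the
// range [minReplicas, maxReplicas]. All parameters are hypothetical knobs.
func DesiredReplicas(queueDepth, targetPerPod, minReplicas, maxReplicas int) int {
	if targetPerPod <= 0 {
		// No sensible target: fall back to the minimum rather than divide by zero.
		return minReplicas
	}
	n := int(math.Ceil(float64(queueDepth) / float64(targetPerPod)))
	if n < minReplicas {
		n = minReplicas
	}
	if n > maxReplicas {
		n = maxReplicas
	}
	return n
}

func main() {
	// As the backlog grows, the pod count follows it up to the cap;
	// messages beyond that simply wait in the queue (back pressure).
	for _, depth := range []int{0, 950, 5000} {
		fmt.Printf("queue depth %d -> %d pods\n", depth, DesiredReplicas(depth, 100, 2, 20))
	}
}
```

The cap matters: once the maximum is reached, the queue absorbs the rest of the spike and consumers work through it at their own pace instead of dropping events.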
Kumar: You’re able to mirror messages and have a few different subscribers get both sets of messages. You can shadow things in a lower environment. For example, if you have to reproduce something that a client did or a set of events so you can reproduce a bug, Pub/Sub makes it much easier.
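The mirroring Kumar describes can be pictured with a small in-memory model, sketched here in Go. This is not the Google Cloud Pub/Sub client library; `Topic`, `Subscribe`, `Publish`, and `Pull` are hypothetical names for a toy version in which every subscription receives its own copy of each published message.

```go
package main

import "fmt"

// Topic is a toy stand-in for a message topic. Each subscription keeps
// its own queue, so consumers do not compete for messages.
type Topic struct {
	subs map[string][]string // subscription name -> pending messages
}

// NewTopic returns an empty topic with no subscriptions.
func NewTopic() *Topic {
	return &Topic{subs: map[string][]string{}}
}

// Subscribe registers a subscription; it only sees messages published afterwards.
func (t *Topic) Subscribe(name string) {
	if _, ok := t.subs[name]; !ok {
		t.subs[name] = nil
	}
}

// Publish appends a copy of the message to every subscription's queue.
func (t *Topic) Publish(msg string) {
	for name := range t.subs {
		t.subs[name] = append(t.subs[name], msg)
	}
}

// Pull returns and clears the pending messages for one subscription.
func (t *Topic) Pull(name string) []string {
	msgs, ok := t.subs[name]
	if !ok {
		return nil
	}
	t.subs[name] = nil
	return msgs
}

func main() {
	topic := NewTopic()
	topic.Subscribe("production")
	topic.Subscribe("staging-shadow") // hypothetical lower environment

	topic.Publish("cart_add:shoes")
	topic.Publish("email_open:campaign-42")

	// Both subscriptions see the full sequence, so a bug can be replayed
	// in the shadow environment without touching production consumers.
	fmt.Println("production:    ", topic.Pull("production"))
	fmt.Println("staging-shadow:", topic.Pull("staging-shadow"))
}
```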
What’s the advantage of utilizing Argo Rollouts?
Kumar: The rollout component is for canary releases. With Argo Rollouts, we can gradually shift the deployment from an old version to a new version, monitor the changes, and then quickly revert if our monitoring or observability tools notice any errors. This basically allows us to quickly determine whether or not things are happening as expected.
Also, what’s nice about our infrastructure stack now is that Argo CD uses infrastructure as code. It’s very easy to evaluate when a change went out and what was modified in that change, as well.
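The canary pattern described above can be sketched as a simple loop in Go. This is not the Argo Rollouts API, which is configured declaratively; the step weights, the error threshold, and the `errorRate` callback are assumptions standing in for whatever a real monitoring system would report.

```go
package main

import "fmt"

// RunCanary shifts traffic to the new version one step at a time.
// After each shift it asks errorRate for the observed error rate at that
// weight; if the rate exceeds maxErrorRate, it rolls back to 0% and stops.
// It reports the final traffic weight and whether the rollout reached 100%.
func RunCanary(steps []int, maxErrorRate float64, errorRate func(weight int) float64) (finalWeight int, promoted bool) {
	for _, w := range steps {
		if errorRate(w) > maxErrorRate {
			return 0, false // revert: all traffic back to the old version
		}
		finalWeight = w
	}
	return finalWeight, finalWeight == 100
}

func main() {
	steps := []int{10, 25, 50, 100} // hypothetical percentages of traffic

	// A healthy release: error rate stays low at every step.
	healthy := func(weight int) float64 { return 0.001 }
	w, ok := RunCanary(steps, 0.01, healthy)
	fmt.Printf("healthy release: weight=%d promoted=%v\n", w, ok)

	// A bad release: errors spike once half the traffic is shifted.
	bad := func(weight int) float64 {
		if weight >= 50 {
			return 0.2
		}
		return 0.001
	}
	w, ok = RunCanary(steps, 0.01, bad)
	fmt.Printf("bad release:     weight=%d promoted=%v\n", w, ok)
}
```

The point of the small early steps is that a faulty version is caught while it serves only a fraction of traffic, which limits how many requests are affected before the revert.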
Lastly, how do these updates set Bluecore as a company up for a successful future?
Kumar: With the way that we’ve architected the system, the prospect of breaking into new markets — for example, the UK has their own compliance needs around General Data Protection Regulation, or GDPR — and duplicating our system or deploying into different regions will be much easier with this new infrastructure stack, because Pub/Sub can be made to be a global service.
Nagel: With the ability to scale our clusters elsewhere, we can hire at a more rapid pace because with that consistency comes developer optimizations and improved workflows. Everything is consistent, works well and is as documented as it can be. It’s been a huge improvement in velocity for the team.
Shapir: That consistency helps junior engineers learn best practices, too. It’s not only an investment in the product that we’re building, but also an investment in the engineers we’re hiring, because they get to learn a lot more than just writing software. Writing software is the easy part; doing it right is the harder part.