What Technologies Are Best for Building Data Pipelines?

There’s no one-size-fits-all approach. But two NYC data scientists explain how they are scaling this vital part of their tech stack. 

Written by Colin Hanner
Published on Jul. 16, 2021

Is it too much to ask for fast, reliable and easy-to-use data pipelines that eliminate redundancies? 

For data scientists, the answer is often “yes,” especially as companies grow and need to scale rapidly. (And the resource itself keeps expanding: the International Data Corporation predicts the world’s data will grow to 163 zettabytes by 2025, 10 times the amount generated in 2016.)

Transferring massive amounts of data from a source can cause significant wait times, and discrepancies in the data can mean lots of manual fixes down the line. Simply put, there’s a lot that can go wrong, and that’s not ideal when a business, or a business’s business, is at stake.

“As a real-time alerting business, we need all of our object detection analytics to run extremely fast,” said Pablo Oberhauser, a lead data scientist at Actuate. Part of this, Oberhauser said, is deciding what data needs to be quickly accessible as opposed to what can go into harder-to-reach storage. “Finding the right balance in this question is something we take seriously and are constantly evaluating as a team.”

 


 

Of course, there are many ways to build and scale data pipelines. Vahid Saberi, a senior data scientist at EPAM Systems, combats these issues by leveraging Spark, an analytics engine for large-scale data processing that keeps the pipelines the company builds for its clients fast and efficient.

Below, Oberhauser and Saberi go into depth about the tools and technologies they’re using to achieve maximum efficiency. 


 

Pablo Oberhauser
Data Science Lead • Actuate

 

What technologies or tools are you currently using to build your data pipeline, and why did you choose those technologies?

As a real-time alerting business, we need all of our object detection analytics to run extremely fast. As such, we leverage both Python and C++ components in our pipeline to move our data around and make predictions. 
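Oberhauser doesn’t detail how the Python and C++ pieces connect, but one common pattern is exposing a compiled C++ routine to Python through ctypes. A purely hypothetical sketch, where the library name and function signature are assumptions rather than Actuate’s actual code:

```python
# Hypothetical sketch: calling a compiled C++ detection routine from Python.
# The library name ("libdetector.so") and function signature are assumptions,
# not Actuate's actual code.
import ctypes

# Load the shared library built from the C++ component.
lib = ctypes.CDLL("./libdetector.so")

# Declare the C signature: float run_detection(const float* pixels, int n)
lib.run_detection.argtypes = [ctypes.POINTER(ctypes.c_float), ctypes.c_int]
lib.run_detection.restype = ctypes.c_float

def detect(frame):
    """Pass a flat sequence of pixel values to the C++ detector, return its score."""
    buf = (ctypes.c_float * len(frame))(*frame)
    return lib.run_detection(buf, len(frame))
```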

One piece that is really important in our data science group is machine learning ops. We use tools like Neptune to track experiments and results on different subsets of our data, Great Expectations to track data drift, and AWS to build and maintain our growing data lake. Neptune also helps us with data versioning: we can store the queries that pulled a given training set as run metadata, for example.
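As one illustration of that kind of tracking, here is a minimal sketch using a recent version of Neptune’s Python client; the project name, parameters and query are placeholders, not Actuate’s actual setup:

```python
# Hedged sketch of experiment tracking with Neptune's Python client (>= 1.0).
# Project name, parameters and query are placeholders.
import neptune

# Assumes NEPTUNE_API_TOKEN is set in the environment.
run = neptune.init_run(project="my-org/object-detection")

# Log hyperparameters and the query that pulled this training set,
# so the data snapshot can be reproduced later.
run["parameters"] = {"learning_rate": 1e-3, "batch_size": 32}
run["data/training_query"] = "SELECT * FROM frames WHERE captured_at >= '2021-01-01'"
run["metrics/val_mAP"] = 0.71

run.stop()
```

Great Expectations plays a complementary role: asserting expectations (value ranges, null rates) against each new batch is what surfaces drift early.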

We chose AWS to store our data because of its ease of use and helpful Python libraries like Boto3, which help us streamline all of our data pipelining tasks. For data management, we pair our data lake with Delta Lake, which handles the metadata and querying for our team.
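For instance, a routine pipelining task like landing a processed file in an S3-backed data lake takes only a few Boto3 calls; the bucket and key names here are placeholders:

```python
# Minimal Boto3 sketch: uploading and listing objects in an S3 data lake.
# Bucket and key names are placeholders.
import boto3

s3 = boto3.client("s3")

# Land a processed Parquet file in the lake.
s3.upload_file("detections.parquet", "example-data-lake",
               "bronze/detections/2021-07-16.parquet")

# List what has landed under that prefix.
resp = s3.list_objects_v2(Bucket="example-data-lake", Prefix="bronze/detections/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```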

 


 

As your company — and thus, your volume of data — grows, what steps are you taking to ensure your data pipeline continues to scale with the business?

As the volume of our data grows, one of the constant challenges is deciding what data to keep “hot” and what data can go into less accessible storage. Finding the right balance in this question is something we take seriously and are constantly evaluating as a team. 
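One common way to encode that hot-versus-cold decision on AWS, sketched here generically rather than as Actuate’s actual policy, is an S3 lifecycle rule that transitions aging objects to cheaper, slower storage classes:

```python
# Sketch: an S3 lifecycle rule that moves aging data to colder storage tiers.
# Bucket name, prefix and day thresholds are illustrative, not Actuate's policy.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-old-detections",
                "Filter": {"Prefix": "bronze/detections/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm after 30 days
                    {"Days": 180, "StorageClass": "GLACIER"},     # cold after 6 months
                ],
            }
        ]
    },
)
```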

Specifically for the data science team, the single most important part of having that much data is keeping our experimentation and feedback loops as short as possible. To test whether an idea is going to work out, we need to be really good at sampling and building representative tests so we don’t spend days waiting on full model training to see if an experiment was successful.
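A typical way to keep those loops short is to iterate on a small, stratified sample before committing to full training. A hedged sketch with scikit-learn, where the file and label column are hypothetical:

```python
# Sketch: carve out a small, class-balanced sample for quick experiments
# before running full model training. File and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_parquet("labeled_frames.parquet")

# Keep 5% of the data, preserving the label distribution so the
# quick experiment stays representative of the full set.
sample, _ = train_test_split(
    df, train_size=0.05, stratify=df["label"], random_state=42
)
```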

 

 

Vahid Saberi
Senior Data Scientist • EPAM Systems

 

What technologies or tools are you currently using to build your data pipeline, and why did you choose those technologies?

Since EPAM is a consulting, digital platform engineering and development services company, the choice of technology depends on the client and the environment they provide. If the data is relatively small and can be handled by a single computer, we will use Python data processing and pipelining libraries such as Pandas.
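For those single-machine cases, a pipeline step can be a short chain of Pandas transformations. A minimal sketch, with invented file and column names:

```python
# Sketch of a single-machine pipeline step in Pandas.
# File and column names are invented for illustration.
import pandas as pd

orders = (
    pd.read_csv("orders.csv", parse_dates=["order_date"])
      .dropna(subset=["customer_id"])                      # drop malformed rows
      .assign(revenue=lambda d: d["quantity"] * d["unit_price"])
      .groupby(["customer_id", pd.Grouper(key="order_date", freq="M")])
      .agg(monthly_revenue=("revenue", "sum"))
      .reset_index()
)
orders.to_parquet("monthly_revenue.parquet")
```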

However, in most scenarios we are dealing with big data sets that require multiple compute nodes for processing, and we need to rely on big data technologies such as Spark. Spark is an open-source big data analytics engine that uses in-memory computation, which makes it very fast. Spark supports several programming languages, including Python, Java and Scala, and it offers high-level classes very similar to the Pandas DataFrame as well as a SQL interface, which makes data pipeline implementation convenient for our data engineers, scientists and developers.

Lazy evaluation is another advantage of Spark: it makes our pipelines efficient by keeping intermediate transformations as RDDs or DataFrames and evaluating the results only when required. We can also use Spark to develop pipelines for both batch data and data streams.
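As a brief illustration of those points (the paths and column names below are placeholders, not from an EPAM project), the same aggregation can be written against the DataFrame API or the SQL interface, and nothing executes until an action is called:

```python
# Sketch: PySpark DataFrame and SQL interfaces, with lazy evaluation.
# Paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

events = spark.read.parquet("s3://example-bucket/events/")  # defines the source; no data processed yet

# Transformations only build up a plan; Spark does not execute them here.
daily = (
    events
    .filter(F.col("status") == "ok")
    .groupBy(F.to_date("timestamp").alias("day"))
    .count()
)

# The same logic via the SQL interface.
events.createOrReplaceTempView("events")
daily_sql = spark.sql(
    "SELECT to_date(timestamp) AS day, COUNT(*) AS cnt "
    "FROM events WHERE status = 'ok' GROUP BY to_date(timestamp)"
)

# An action triggers evaluation of the whole plan at once.
daily.show()
```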
 



As your company — and thus, your volume of data — grows, what steps are you taking to ensure your data pipeline continues to scale with the business?

The pipelines we develop in Spark are inherently scalable because we can add more worker nodes as the data volume increases. Since we usually develop our solutions on the cloud, it is very convenient to scale the compute resources and allocate more worker nodes as demand increases. Because EPAM is a Google Cloud partner and a Microsoft Gold-certified partner, our engineers are well-versed in developing big data solutions on the cloud. We use our expertise and extensive experience to design and develop solutions that are robust and scalable as our clients grow.
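In Spark deployments generally, that elasticity often shows up as configuration rather than code. A minimal sketch, assuming Spark 3.x and illustrative executor counts (these are not EPAM’s production settings), enables dynamic allocation so the cluster can grow and shrink with demand:

```python
# Sketch: enabling Spark dynamic allocation so executors scale with load.
# Executor counts are illustrative, not production-tuned values.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("scalable-pipeline")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "50")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")  # Spark 3.0+
    .getOrCreate()
)
```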

Responses have been edited for length and clarity. Headshots provided by respective companies. Header image courtesy of Actuate.
