Senior Data Engineer
As a member of the Yieldmo data team, you will build innovative data pipelines for processing and analyzing our large user datasets (250+ billion events per month). A unique challenge of the role is being comfortable working across varied technologies: developing custom transformation/integration apps in Python and Java, building pipelines in Spark, Kafka, and Kinesis, and transforming and analyzing data in SQL.
Responsibilities:
- Develop ETL (Extract, Transform, and Load) data pipelines in Spark, Kinesis, Kafka, and custom Python apps to transfer massive amounts of data (over 20 TB/month) efficiently between systems
- Engineer complex, efficient, distributed data transformation solutions using Python, Java, Scala, and SQL
- Productionize machine learning models while efficiently utilizing resources in a clustered environment
- Research, plan, design, develop, document, test, implement, and support Yieldmo’s proprietary software applications
- Validate analytical data for accuracy and completeness of reported business metrics
- Be open to taking on, learning, and implementing engineering projects outside your core competency
- Understand the business problem and engineer/architect/build an efficient, cost-effective, and scalable technology infrastructure solution
- Monitor system performance after implementation and iteratively devise solutions to improve performance and user experience
- Research and innovate on new data product ideas to grow Yieldmo’s revenue opportunities and contribute to the company’s intellectual property
Requirements:
- BS or higher degree in computer science, engineering, or another related field
- 5+ years of object-oriented programming experience in languages such as Java, Scala, or C++
- 3+ years of experience developing in Python to transform large datasets on distributed and clustered infrastructure
- 5+ years of experience engineering ETL data pipelines for big data systems
- Prior experience designing and building ETL infrastructure involving streaming systems such as Kafka, Spark, and AWS Kinesis
- Experience implementing clustered/distributed/multi-threaded infrastructure to support machine learning processing on Spark or SageMaker
- Proficiency in SQL, including experience performing data transformations and data analysis
- Comfort juggling multiple technologies and high-priority tasks
- Nice to have: experience with distributed columnar databases such as Vertica, Greenplum, Redshift, or Snowflake
Success in this role:
- Demonstrate a passion for data
- Eagerly research and learn new technologies to develop creative and efficient solutions to business problems
- Take full ownership of your initiatives
- Stay focused on the successful implementation of the task at hand before moving on to the next engineering challenge
- Go above and beyond: while engineering for the current ask, think of the big picture, including adjacent code bases and processes. Look for ways to make systems more robust and fault tolerant, monitor for failures, and program for automated recovery