H1

Staff Data Engineer - Emerald

Posted Yesterday
Hybrid
New York, NY, USA
170K-190K Annually
Senior level
The Staff Data Engineer will lead the architecture and scalability of H1's healthcare data processing systems, ensuring efficient entity resolution and big data operations. Responsibilities include design and optimization of distributed Spark pipelines, team leadership, and collaboration with AI/ML teams to enhance data processing capabilities.
The summary above was generated by AI
At H1, we believe access to the best healthcare information is a basic human right. Our mission is to provide a platform that can optimally inform every doctor interaction globally. This promotes health equity and builds needed trust in healthcare systems. To accomplish this, our teams harness the power of data and AI technology to unlock groundbreaking medical insights and convert those insights into action that results in optimal patient outcomes and accelerates an equitable and inclusive drug development lifecycle. Visit h1.co to learn more about us.
 
Data Engineering is responsible for the development and delivery of our most important asset - our data. Looking across thousands of data sources from across the globe, the data engineering team is responsible for making sense out of that data to create the world's most extensive and comprehensive knowledge base of healthcare stakeholders and the ecosystem they influence. It is our job to ensure that only accurate, normalized data flows through to our customers, and at a velocity that keeps up with the changes in the real world. As we rapidly expand the markets we serve and the breadth and depth of data we want to collect for our customers, the team must grow and scale to meet that demand.
 
WHAT YOU'LL DO AT H1
As a Staff Data Engineer on the Emerald team, you will play a critical role in shaping the architecture, scalability, and technical direction of H1’s healthcare entity resolution platform. EMERALD is responsible for linking large-scale external healthcare datasets, including PubMed, clinical trials, conferences, ct.gov, and web-collected data, to H1’s canonical physician and organization profiles.

This role sits at the intersection of distributed data engineering, entity matching, identity resolution, and large-scale healthcare data processing. You will lead a small team of engineers while remaining deeply hands-on technically, owning the systems and pipelines powering automatching, grouping logic, identity mapping, deduplication, and enrichment workflows processing tens of millions of records.

You will partner closely with Product, AI/ML, Analytics, and Engineering teams to improve platform accuracy, scalability, reliability, and operational efficiency across one of H1’s most critical data platforms.

You will:
- Lead the design, optimization, and scalability of distributed Spark/PySpark pipelines powering entity resolution and large-scale healthcare data processing.
- Own systems supporting automatching, identity mapping, grouping logic, deduplication, enrichment, and auto-approval workflows across healthcare provider and organization datasets.
- Build and maintain scalable processing frameworks for PubMed, clinical trial, ct.gov, conference, and other healthcare data sources.
- Drive infrastructure optimization initiatives focused on improving throughput, runtime, observability, and cloud compute cost efficiency.
- Partner closely with AI/ML teams to integrate matching and resolution models into EMERALD and improve matching precision and recall.
- Lead complex technical initiatives from architecture and design through deployment, monitoring, and long-term production support.
- Serve as a technical leader and mentor across the team through code reviews, technical guidance, and engineering best practices.
- Collaborate directly with Product and business stakeholders to align technical solutions with operational and customer needs.
- Support production operations, incident response, troubleshooting, and ongoing platform reliability.

ABOUT YOU
You are an experienced data engineer with deep expertise in building and optimizing distributed data systems in cloud-native environments. You thrive on solving complex scalability and performance challenges across high-volume data processing systems and enjoy operating in highly technical, fast-paced engineering environments.

You bring strong hands-on engineering expertise across distributed computing, large-scale data processing, and infrastructure optimization while also helping guide technical direction and mentor engineers across the organization.

- Deep expertise with distributed data processing frameworks such as Apache Spark and Hadoop, particularly within AWS environments.
- Strong proficiency in Python (PySpark), Scala, Java, or other modern programming languages used for large-scale distributed processing.
- Experience building scalable ETL/ELT frameworks across both batch and streaming architectures.
- Experience with entity resolution, identity mapping, automatching, deduplication, or large-scale matching systems is strongly preferred.
- Strong understanding of distributed file formats including Apache Parquet and Apache Avro.
- Experience with streaming technologies such as Kafka, Spark Streaming, or KSQL.
- Strong grasp of software engineering fundamentals including distributed systems, data structures, concurrency, and system design.
- Experience performing root cause analysis across large-scale distributed systems and complex data pipelines.
- Ability to write clean, maintainable, modular, and production-grade code.
- Experience improving performance, scalability, observability, and infrastructure efficiency within distributed systems.
- Strong communication and collaboration skills across both technical and non-technical stakeholders.
- Familiarity with modern development and infrastructure tooling including Git, CI/CD pipelines, Docker, Kubernetes, Terraform, Argo, Hudi, and Jira.

REQUIREMENTS
- 8+ years of experience building and maintaining large-scale distributed data systems and pipelines.
- Demonstrated technical leadership experience mentoring engineers and driving complex technical initiatives.
- Extensive experience with Apache Spark and AWS-based big data technologies including EMR, S3, and distributed compute environments.
- Strong coding experience in Python (PySpark), Scala, Java, or equivalent languages used for distributed processing systems.
- Experience optimizing large-scale Spark workloads for performance, scalability, and infrastructure cost efficiency.
- Experience with streaming and event-driven architectures using technologies such as Kafka or Spark Streaming.
- Experience with orchestration and lakehouse technologies such as Argo and Hudi or comparable platforms.
- Experience with containerization and infrastructure technologies such as Docker, Kubernetes, and Terraform.
- Experience working with relational or distributed databases such as PostgreSQL or Redshift.
- Proven ability to operate effectively within highly scalable, production-grade distributed systems.
- Experience working with healthcare, life sciences, Real World Evidence (RWE), or large-scale healthcare datasets is strongly preferred.
 
COMPENSATION
This role pays $170,000 to $190,000 per year, based on experience, in addition to stock options.

Anticipated role close date: 8/1/2026

H1 OFFERS
- Full suite of health insurance options, in addition to generous paid time off
- Pre-planned company-wide wellness holidays
- Retirement options
- Health & charitable donation stipends
- Impactful Business Resource Groups
- Flexible work hours & the opportunity to work from anywhere
- The opportunity to work with leading biotech and life sciences companies in an innovative industry with a mission to improve healthcare around the globe
 
 
H1 is proud to be an equal opportunity employer that celebrates diversity and is committed to creating an inclusive workplace with equal opportunity for all applicants and teammates. Our goal is to recruit the most talented people from a diverse candidate pool regardless of race, color, ancestry, national origin, religion, disability, sex (including pregnancy), age, gender, gender identity, sexual orientation, marital status, veteran status, or any other characteristic protected by law.
 
H1 is committed to working with and providing access and reasonable accommodation to applicants with mental and/or physical disabilities. If you require an accommodation, please reach out to your recruiter once you've begun the interview process. All requests for accommodations are treated discreetly and confidentially, as practical and permitted by law.

HQ

H1 New York, New York, USA Office

386 Park Ave South, New York, NY, United States, 10016
