MLabs

Research Crawling Engineer

In-Office or Remote
3 Locations
100K-130K Annually
Senior level
Design, build, and operate large-scale, fault-tolerant web crawlers and data pipelines that collect, clean, deduplicate, and normalize web-scale datasets (terabyte to petabyte). Navigate anti-bot defenses and dynamic JS sites, optimize crawl performance and cost, and collaborate with research teams to produce ML-ready datasets.
The summary above was generated by AI

Location: Remote - Must have a 6-hour overlap with EST

Remote | Full-time

Compensation: $100K - $130K

We are hiring on behalf of our client, a technical infrastructure firm that delivers massive-scale web data to organizations developing advanced artificial intelligence models. The organization supports high-capacity bandwidth-sharing networks and operates a distributed crawler capable of accessing high-quality public web data at a global scale. Additionally, the team has engineered sophisticated pipelines for the ingestion, segmentation, and annotation of billions of multimedia files, facilitating dataset creation for frontier research labs.

The organization operates as a lean, technical team that prioritizes speed and direct execution. As a Research Crawling Engineer, the successful candidate will design and operate large-scale web data acquisition systems. This role encompasses distributed systems, scraping infrastructure, and data pipelines, focusing on providing high-quality inputs for research and model development.

Key Responsibilities

  • Construct and maintain large-scale web crawlers across diverse domains.
  • Design high-throughput, fault-tolerant systems for data collection, managing volumes ranging from millions to billions of URLs per day.
  • Navigate anti-bot systems, rate limits, and dynamic, JavaScript-heavy websites.
  • Develop robust pipelines for data cleaning, deduplication, filtering, and normalization.
  • Build and maintain datasets specifically structured for research and machine learning model training.
  • Monitor and optimize crawl performance, coverage, and data quality through rapid iteration.
  • Collaborate with research teams to ensure data collection efforts align with modeling requirements.
  • Optimize infrastructure to ensure cost-efficiency, low latency, and reliability.

Requirements
  • Extensive programming experience in one or more of the following: Go, Rust, Python, Java, or C++.
  • Proven experience in building web crawlers or large-scale data pipelines.
  • Solid understanding of HTTP, networking protocols, and browser behavior.
  • Familiarity with distributed systems and parallel processing techniques.
  • Experience handling large datasets, ideally at the terabyte to petabyte scale.
  • Demonstrated ability to debug and maintain systems within unstable or adversarial environments.

Preferred Qualifications

  • Experience with NLP pipelines or dataset curation for machine learning.
  • Familiarity with LLM pre-training data or retrieval systems.
  • Practical experience with headless browsers (e.g., Playwright, Puppeteer, or Chrome DevTools Protocol).
  • Knowledge of proxy systems, IP rotation, and large-scale request orchestration.
  • Background in data quality evaluation or benchmarking.
  • Experience running workloads on cloud or bare-metal infrastructure.

Benefits
  • Impactful Opportunity: Contribute to the development of a web-scale crawler and knowledge graph at the forefront of AI data accessibility.
  • High-Performance Culture: Join a lean, low-ego team that prioritizes high output and professional growth.
  • Remote Work: This position is part of a fully remote team, offering flexibility and autonomy.
  • Competitive Compensation: A package including a competitive salary, comprehensive benefits, and equity, commensurate with experience and the ability to operate at scale.



Interview Process

  1. Recruiter Coordination Call
  2. Hiring Manager Interview
  3. Founder / CEO Interview
  4. Secondary Executive Interview
  5. Final Interview

Due to the high volume of applications we anticipate, we regret that we are unable to provide individual feedback to all candidates. If you do not hear back from us within 4 weeks of your application, please assume that you have not been successful on this occasion. We genuinely appreciate your interest and wish you the best in your job search.

Commitment to Equality and Accessibility

At MLabs, we are committed to offering equal opportunities to all candidates. We ensure no discrimination, publish accessible job adverts, and provide information in accessible formats. Our goal is to foster a diverse, inclusive workplace with equal opportunities for all. If you need any reasonable adjustments during any part of the hiring process, or you would like to see the job advert in an accessible format, please let us know at the earliest opportunity by emailing [email protected].

MLabs Ltd collects and processes the personal information you provide, such as your contact details, work history, resume, and other relevant data, for recruitment purposes only. This information is managed securely in accordance with MLabs Ltd’s Privacy Policy and Information Security Policy, and in compliance with applicable data protection laws. Your data may be shared only with clients and trusted partners where necessary for recruitment purposes. You may request the deletion of your data or withdraw your consent at any time by contacting [email protected].
