Firecrawl

Research Engineer – Evals

Reposted 12 Days Ago
Remote or Hybrid
2 Locations
160K-240K Annually
Mid level
The Research Engineer will design and build evaluation systems to measure data quality for Firecrawl's web data extraction, creating metrics and benchmarks, and collaborating on model improvements.
The summary above was generated by AI
Research Engineer — Evals

You'll build the evaluation systems that tell us whether Firecrawl actually works. That sounds simple. It isn't. Our core promise — convert any URL into clean, structured, LLM-ready data reliably — is hard to measure rigorously across millions of different websites, formats, and edge cases. As we layer in models and agent workflows, the question "did that work?" gets harder, not easier.

This isn't an eval role where you inherit a framework and run benchmarks. You'll design the metrics, build the pipelines, generate the datasets, and own the feedback loop from output quality back to model and product decisions. If you care about what "good" actually means and have the engineering depth to measure it, this is the role.

Salary Range: $160,000 to $240,000/year (Range shown is for U.S.-based employees in San Francisco, CA. Compensation outside the U.S. is adjusted fairly based on your country's cost of living.)

Equity Range: Up to 0.10%

Location: San Francisco, CA or Remote (Americas, UTC-3 to UTC-10)

Job Type: Full-Time

Experience: 3+ years in ML engineering, applied AI, or data quality — with production systems

Visa: US Citizenship/Visa required for SF; N/A for Remote

About Firecrawl

Firecrawl is the easiest way to extract data from the web. Developers use us to reliably convert URLs into LLM-ready markdown or structured data with a single API call. In just a year, we've hit 8 figures in ARR and 120k+ GitHub stars by building the fastest way for developers to get LLM-ready data.

We're a small, fast-moving, technical team building essential infrastructure that superintelligence will use to gather data on the web. We ship fast and deep.

What You’ll Do

Build the eval stack from scratch. Design and own the systems that measure whether Firecrawl's outputs are actually good — across scrape, crawl, extract, and map. That means defining metrics, building pipelines, curating datasets, and integrating evals into CI/CD so regressions get caught before they ship. You build the infra yourself because you're the one who needs it to work.

Design benchmarks that reflect reality. Our outputs need to hold up across millions of websites — SPAs, paywalled content, dynamic rendering, structured and unstructured formats. You'll build benchmark datasets that cover the real distribution of what our customers send us, including the edge cases that break naive approaches. Ground truth doesn't come for free — you'll design the collection and labeling systems too.

Own LLM-as-judge pipelines. You'll design and validate automated judges that score extraction quality at scale, know the failure modes of LLM-based evaluation, and build the human review tooling needed when automation isn't enough. You understand the difference between an eval that measures something real and one that just flatters the system.

Close the loop with models and RL. Evals here aren't a reporting layer — they're a training signal. You'll work closely with the RL and Search/IR research engineers to turn quality measurements into reward signals and feedback loops that make models meaningfully better. Your benchmarks directly influence what gets trained next.

Run fast experiments and communicate clearly. You design experiments that test meaningful hypotheses, run them quickly, and make decisions based on results. When you have findings, anyone on the team can understand what they mean — no decoder ring required.

What We're Looking For

Builds their own eval infrastructure. You don't wait for tooling to appear. You write the pipelines, curate the datasets, design the rubrics, and validate the judges yourself — because you understand that infra choices directly affect what you're actually measuring. You've run evals at scale and debugged the places where they lie.

Knows what "good" means for unstructured web data. You've worked with messy, real-world data before. You understand why markdown quality is hard to define, why structured extraction fidelity varies by schema, and why naive string-match metrics miss the point. You have strong opinions about what a useful benchmark actually looks like — and the rigor to validate them.

Fluent in LLM evaluation methodology. You understand LLM-as-judge systems, their correlation with human judgment, and where they break down. You've designed rubrics that hold up under adversarial inputs, built human review pipelines that scale, and know how to measure inter-rater agreement. You're not fooled by evals that only look good in aggregate.

Production-minded. You care about whether your evals reflect real production behavior, not just offline benchmarks. You've worked on systems serving real traffic and made hard tradeoffs between evaluation depth, coverage, and cost. A benchmark that doesn't represent what customers actually send isn't a benchmark worth maintaining.

Fast and clear. You'd rather run three rough experiments this week than one polished one next month. When you have results, anyone on the team can understand what they mean — and what to do next.

Backgrounds that tend to do well: ML engineers who've built eval or data quality systems at AI labs or applied teams. Engineers who've worked on LLM fine-tuning or RLHF pipelines and understand how feedback quality drives model improvement. People who've worked at the intersection of data infrastructure and model development. Anyone who's been the person on the team asking "but how do we know this actually works?"

What We're NOT Looking For

Benchmark runners. If your eval experience is running existing frameworks on existing benchmarks and reporting numbers, this isn't the right fit. We need someone who builds the frameworks and defines the benchmarks.

People who treat evals as an afterthought. If your default workflow is to build first and evaluate later — or to treat pass rates as a proxy for actual quality — you'll struggle here. Evals are a first-class product, not a QA gate.

Researchers who need a platform team. If you expect pipelines, datasets, and labeling infrastructure to exist before you can be productive, you'll be frustrated. You build the tools you need.

Slow iterators. If your standard experiment cycle is measured in weeks, not days, you'll struggle with the pace. We need someone who can design, run, and interpret a meaningful experiment within a day or two.

Bonus Points
  • Any other niche expertise and skills

  • Previous experience at a scraping, automation, or security-focused startup

  • Ex-founder

Benefits & Perks

Available to all employees
  • Salary that makes sense — $160,000-240,000/year (U.S.-based), based on impact, not tenure

  • Own a piece — Up to 0.1% equity in what you're helping build

  • Unlimited PTO — Minimum 3 weeks off encouraged; take the time you need to recharge

  • Parental leave — 12 weeks fully paid, for moms and dads

  • Wellness stipend — $100/month for the gym, therapy, massages, or whatever keeps you human

  • Learning & Development - Expense up to $150/year toward anything that helps you grow professionally

  • Team offsites — A change of scenery, minus the trust falls

  • Sabbatical — 3 paid months off after 4 years, do something fun and new

Available to US-based full-time employees
  • Full coverage, no red tape — Medical, dental, and vision (100% for employees, 50% for spouse/kids) — no weird loopholes, just care that works

  • Life & Disability insurance — Employer-paid short-term disability, long-term disability, and life insurance — coverage for life's curveballs

  • Supplemental options — Optional accident, critical illness, hospital indemnity, and voluntary life insurance for extra peace of mind

  • Doctegrity telehealth — Talk to a doctor from your couch

  • 401(k) plan — Retirement might be a ways off, but future-you will thank you

  • Pre-tax benefits — Access to FSAs and commuter benefits to help your wallet out a bit

  • Pet insurance — Because fur babies are family too

Available to SF-based employees
  • SF HQ perks — Snacks, drinks, team lunches, and the occasional burst of chaotic startup energy

Interview Process
  1. Application Review – Send us your stuff, and a quick note on why you're excited

  2. Intro Chat (~25 min) – Quick alignment call with a member of our team

  3. Technical Interview (~1 hr) – Tackle a small challenge

  4. Interview with Founders (~30 min) – Culture, vision, and long-term fit

  5. Paid Work Trial (1–2 weeks) – Work on something real with us

  6. Decision – We move fast

If you’ve ever wanted to own a product-critical system and build alongside founders, this is your moment. Apply now and let’s talk.


