Braintrust

Eval Engineer

Posted 3 Days Ago

Be an Early Applicant

In-Office

New York City, NY, USA

Mid level

In-Office

New York City, NY, USA

Mid level

The Eval Engineer will design and run evaluations of AI capabilities, create datasets, analyze system performances, and publish results to improve AI understanding in the developer ecosystem.

The summary above was generated by AI

About the company

Braintrust is the AI observability platform. By connecting evals and observability in one workflow, Braintrust gives builders the visibility to understand how AI behaves in production and the tools to improve it.

Teams at Notion, Stripe, Zapier, Vercel, and Ramp use Braintrust to compare models, test prompts, and catch regressions — turning production data into better AI with every release.

About the role

We’re hiring an Eval Engineer to design and run creative evaluations of new AI capabilities. Your job is to turn emerging AI ideas into measurable experiments and publish the results for the developer ecosystem.

When new models, agents, or frameworks appear, everyone has opinions about what works but few people actually test them. This role exists to change that.

You’ll design experiments that compare models, prompts, and agent architectures against real tasks. You’ll build the datasets, scoring logic, and evaluation harnesses. Then you’ll publish the results so builders understand what actually works.

This role sits at the intersection of engineering, experimentation, and technical storytelling.

What you’ll ownIndustry evals

Design and run evaluations of new AI capabilities
Compare frontier models, agent systems, and tool workflows
Turn emerging ideas into measurable benchmarks

Eval design

Define datasets, tasks, and scoring logic for experiments
Design realistic workloads that reflect production environments
Create tests that expose failure modes and edge cases

Experiment implementation

Build evaluation harnesses using Braintrust
Run comparisons across models, prompts, and agent approaches
Analyze traces, outputs, and failure patterns

Creative test construction

Invent novel ways to stress test AI systems
Design scenarios that break agents, prompts, and model reasoning
Build adversarial or complex datasets that reveal weaknesses

Technical content

Write technical posts explaining evaluation methodology and results
Share datasets and scoring logic so experiments are reproducible
Help establish better evaluation patterns for the industry via courses

Evaluation playbooks

Develop reusable eval patterns for agents, RAG systems, and LLM apps
Create open source reference implementations developers can adopt
Contribute examples and guides that help teams build better evals

What great looks like

You’re an engineer who likes testing systems more than building features
You enjoy breaking things and understanding why they fail
You can design experiments that isolate meaningful differences between approaches
You understand how LLMs, agents, and RAG systems actually work
You write clearly for technical audiences
You ship experiments quickly and iterate often
You care about methodology and reproducibility
You’re curious, creative, and opinionated about how AI should be evaluated

What you’ve done

Built or contributed to evaluation systems for LLM or agent applications
Designed experiments comparing models, prompts, or AI architectures
Written Python code to run tests across models or APIs
Built datasets or scoring logic for AI quality measurement
Investigated model failures or unexpected behaviors
Published technical blog posts, research notes, or engineering write-ups
Built prototypes quickly to test ideas

If you want to help the industry understand how to measure AI systems and design the evaluations everyone else learns from, this is the role.

Benefits include

Medical, dental, and vision insurance
Daily lunch, snacks, and beverages
Flexible time off
Competitive salary and equity
AI Stipend

Equal opportunity

Braintrust is an equal opportunity employer. All applicants will be considered for employment without attention to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran or disability status.

Top Skills

Agent Systems

Evaluation Harnesses

Llm

Python

Similar Jobs

Waymo

Senior Software Engineer

5 Days Ago

In-Office

New York, NY, USA

213K-263K Annually

Senior level

213K-263K Annually

Senior level

Automotive

The Senior Software Engineer will drive the architecture of core services for evaluation workflows, mentor junior engineers, and ensure APIs meet evaluation complexities.

Top Skills: C++Python

Braintrust

Eval Engineer

16 Days Ago

In-Office

New York City, NY, USA

Entry level

Blockchain • Web3

The Eval Engineer designs and runs evaluations of emerging AI technologies, builds evaluation frameworks, analyzes results, and publishes findings for the developer community.

Top Skills: Python

Mastercard

Principal Software Engineer

An Hour Ago

Hybrid

170K-337K Annually

Senior level

170K-337K Annually

Senior level

Blockchain • Fintech • Payments • Consulting • Cryptocurrency • Cybersecurity • Quantum Computing

Lead the technical vision for Mastercard's corporate platforms, modernizing systems and data architecture, ensuring compliance and driving innovation.

Top Skills: AWSAzureClouderaDatabricksDockerJavaKubernetesMiddleware And IntegrationRest ApisSpark

What you need to know about the NYC Tech Scene

As the undisputed financial capital of the world, New York City is an epicenter of startup funding activity. The city has a thriving fintech scene and is a major player in verticals ranging from AI to biotech, cybersecurity and digital media. It also has universities like NYU, Columbia and Cornell Tech attracting students and researchers from across the globe, providing the ecosystem with a constant influx of world-class talent. And its East Coast location and three international airports make it a perfect spot for European companies establishing a foothold in the United States.

Key Facts About NYC Tech

Number of Tech Workers: 549,200; 6% of overall workforce (2024 CompTIA survey)
Major Tech Employers: Capgemini, Bloomberg, IBM, Spotify
Key Industries: Artificial intelligence, Fintech
Funding Landscape: $25.5 billion in venture capital funding in 2024 (Pitchbook)
Notable Investors: Greycroft, Thrive Capital, Union Square Ventures, FirstMark Capital, Tiger Global Management, Tribeca Venture Partners, Insight Partners, Two Sigma Ventures
Research Centers and Universities: Columbia University, New York University, Fordham University, CUNY, AI Now Institute, Flatiron Institute, C.N. Yang Institute for Theoretical Physics, NASA Space Radiation Laboratory