We turn raw on-chain activity into trustworthy intelligence — clustering addresses into real-world entities, attributing them to services and actors, and surfacing risk for compliance and investigations teams. We're looking for a data scientist who is as comfortable shipping a heuristic to production as they are designing it: someone who can move from a messy hypothesis to a working pipeline without waiting on someone else to wire up the data.
You'll work closely with our attribution and clustering leads on models and heuristics that run across billions of transactions and multiple chains (Bitcoin, Ethereum, Tron, Solana, and more).
What you'll do
Design, test, and ship clustering and attribution heuristics, and measure them with real precision/coverage metrics rather than vibes.
Own your data end to end — pull, clean, join, and model large on-chain datasets without depending on a separate team for every query.
Build and maintain the pipelines that take a heuristic from notebook to production, including backfills, incremental runs, and validation.
Investigate edge cases (mixers, bridges, exchange hot wallets, consolidation patterns) and translate findings into repeatable logic.
Partner with investigations and product to define what "correct" looks like and benchmark against ground truth.
Prototype quickly, then harden what works.
4+ years building data science or data engineering systems that actually shipped (not just notebooks).
Strong Python and SQL; comfortable with large datasets and the gotchas of joins, dedup, and skew at scale.
Solid grasp of clustering, graph/network analysis, or entity resolution — and a habit of validating results, not just producing them.
Ability to reason about precision vs. coverage trade-offs and defend your metrics.
Self-directed: you can scope an ambiguous problem, get the data yourself, and drive it to a result.
You don't need to have used all of these, but here's what you'd be working with day to day:
Databricks — our lakehouse and processing backbone. Large-scale on-chain datasets are transformed and modeled here via Spark and SQL; most heuristics run as Databricks jobs against billions of transactions.
Kafka — real-time ingestion of on-chain and transaction data. New blocks and events stream in continuously, so a lot of our work is designed to run incrementally rather than as one-off batch jobs.
Python — the primary language for everything from exploratory analysis to production heuristics and pipeline code.
TigerGraph — our graph database, where addresses, transactions, and entities live as a network. Clustering, traversals, and relationship queries (who funds whom, consolidation paths, entity linkage) happen here.
Supporting cast you'll likely touch:
SQL everywhere — for ad-hoc analysis, validation, and defining ground-truth datasets.
Columnar / analytical stores (e.g., ClickHouse) for fast aggregate queries over large tables.
Orchestration & scheduling for backfills and recurring pipeline runs.
Git / GitHub for version control and code review — we expect pipelines and heuristics to be reviewed like any other code.
GCP as our cloud environment.
Small, high-trust team. You'll have a lot of ownership and very little bureaucracy. We prototype fast, measure honestly, and ship.
Merkle Science New York, New York, USA Office
43 W 23rd St, New York, NY, United States, 10010


