00 — research
From an interesting paper to a job that runs every weekday at 06:00.
For research-led labs and applied-research teams inside larger orgs who measure the work in reproducible runs.
eval: before training, and before shipping
typical scope: 4–10 wks; paper to a daily job
first-week result: 1 rerun; new researchers ship clean
01 — what we do
Concrete deliverables. Named tools. Honest constraints.
01
Eval harnesses.
Task suites, dataset versioning, and a CI hook that fails the build when the metric drops. Eval before training, eval before shipping.
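The hook itself stays small. A minimal sketch of the gate in Python; task names and floors mirror the example config in section 03, and everything here is illustrative:
# ci gate sketch: reads a metrics json, exits non-zero on any floor breach,
# so any CI system fails the build. names and floors are illustrative only.
import json
import sys

FLOORS = {"doc_qa_recall@5": 0.82, "rerank_ndcg@10": 0.71}

def main(results_path: str) -> int:
    with open(results_path) as f:
        results = json.load(f)                    # {"task_name": metric, ...}
    bad = {t: (results.get(t, 0.0), floor)
           for t, floor in FLOORS.items() if results.get(t, 0.0) < floor}
    for task, (got, floor) in bad.items():
        print(f"REGRESSION {task}: {got:.3f} < floor {floor:.3f}", file=sys.stderr)
    return 1 if bad else 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))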
02
RAG that scales past a demo.
Chunking, retrieval, and reranking tuned on your corpus, not a leaderboard. We measure recall at the document level, not vibes at the chat level.
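What "recall at the document level" means here, as a sketch; the function name is ours, not a library's:
# document-level recall@k sketch. ranked chunks collapse to their parent
# documents first, so ten chunks from one document count once; recall is
# then scored on documents, not chunks.
def doc_recall_at_k(ranked_doc_ids_of_chunks, gold_doc_ids, k=5):
    ranked_docs = list(dict.fromkeys(ranked_doc_ids_of_chunks))  # dedupe, keep rank order
    hits = set(ranked_docs[:k]) & set(gold_doc_ids)
    return len(hits) / max(len(gold_doc_ids), 1)

# three chunks from d3 count once; one of two gold docs retrieved -> 0.5
assert doc_recall_at_k(["d3", "d3", "d3", "d1"], {"d1", "d9"}, k=5) == 0.5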
03
Model selection.
Open-weights vs. hosted, small vs. large, fine-tune vs. prompt — answered with a controlled eval on your data. The cheapest model that clears the bar wins.
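The selection rule is mechanical once the eval exists. A sketch, with hypothetical candidates, costs, and bar:
# "cheapest model that clears the bar", as a loop. candidate names, costs,
# and the bar are hypothetical; evaluate() is your controlled eval.
CANDIDATES = [                       # ascending cost per 1M tokens
    ("small-open-weights", 0.10),
    ("mid-hosted", 0.50),
    ("large-hosted", 3.00),
]
BAR = 0.80                           # the task metric a winner must clear

def pick_model(evaluate):            # evaluate: name -> metric on your data
    for name, cost in CANDIDATES:
        score = evaluate(name)
        print(f"{name}: {score:.3f} at ${cost:.2f}/1M tok")
        if score >= BAR:
            return name              # first over the bar is cheapest over the bar
    raise RuntimeError("nothing clears the bar; revisit the task before the model")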
04
Paper-to-prod pipelines.
A repeatable path from a notebook reproduction to a daily job. Config, seeds, data lineage, and a kill-switch on regression.
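The pipeline's spine fits on one page. A sketch of the daily-job entrypoint, with illustrative names and numbers throughout:
# pinned config, fixed seed, named data version, and a kill-switch that
# refuses to publish under the floor. all names here are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class JobConfig:
    seed: int = 1234
    data_version: str = "corpus/v3"    # lineage: the exact snapshot used
    metric_floor: float = 0.82         # kill-switch threshold

def daily_job(cfg: JobConfig, run, publish):
    metrics = run(cfg)                 # deterministic given cfg.seed
    if metrics["doc_qa_recall@5"] < cfg.metric_floor:
        raise SystemExit(f"kill-switch: {metrics} under floor {cfg.metric_floor}")
    publish(metrics, provenance=cfg)   # results ship with the config that made them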
05
Experiment tooling.
Run tracking, artefact storage, and a query layer your researchers will actually open. We adapt to whatever you already use before we recommend new tools.
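Stripped to its core, the tooling is a ledger plus one query. A sketch over sqlite, purely illustrative; in practice we adapt to the tracker you already run:
# the minimum a tracker must do: append-only run records, queryable later.
import json, sqlite3, time

def log_run(db, config: dict, metrics: dict, artefact_uri: str):
    db.execute("CREATE TABLE IF NOT EXISTS runs"
               " (ts REAL, config TEXT, metrics TEXT, artefact TEXT)")
    db.execute("INSERT INTO runs VALUES (?, ?, ?, ?)",
               (time.time(), json.dumps(config), json.dumps(metrics), artefact_uri))
    db.commit()

def best_runs(db, metric: str, n: int = 5):
    rows = db.execute("SELECT config, metrics, artefact FROM runs").fetchall()
    scored = [(json.loads(m).get(metric, 0.0), c, a) for c, m, a in rows]
    return sorted(scored, reverse=True)[:n]   # the query researchers actually open

# db = sqlite3.connect("runs.db")
# log_run(db, {"lr": 3e-4}, {"doc_qa_recall@5": 0.84}, "s3://runs/0421")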
06
Reproducible scaffolding.
Project templates with locked deps, deterministic seeds, and a one-command rerun. New researchers ship a result in their first week.
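"Deterministic seeds" is one helper, called first in every entrypoint. One common shape, assuming numpy and optionally torch:
# the seed-everything helper every template ships with. torch is guarded so
# the same scaffold works in cpu-only or torch-free repos.
import os, random

import numpy as np

def seed_everything(seed: int = 1234) -> None:
    os.environ["PYTHONHASHSEED"] = str(seed)   # applies to subprocesses we launch
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True   # trade speed for repeatability
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass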
02 — how we engage
We will not build a new MLOps platform. We will fix the three places your researchers lose a day a week.
Typical scope is 4 to 10 weeks. We start by reproducing one of your most-cited internal results and writing down what hurt.
Output is your repo, your runners, your conventions — with the gaps closed.
03 — example shape
Anonymised. Plausible. The shape of a real engagement.
A research-led lab had eight researchers, six different run trackers, and no shared eval suite.
Promising results were not surviving review because nobody could rerun them clean.
Over five weeks we chose one tracker, wrote the eval harness around the two tasks the lab actually shipped against, and wired it into CI.
A regression on either task now fails the PR, and the dashboard names the responsible run.
Regressions in two papers in flight were caught and fixed before submission.
The harness is the lab’s now. We do not maintain it.
# eval-in-CI — fails the PR on any task regression.
# runs on every push, full suite nightly.
eval:
  on: [pull_request, push]
  tasks:
    - name: doc_qa_recall@5
      floor: 0.82
      data: corpus/v3
    - name: rerank_ndcg@10
      floor: 0.71
      data: corpus/v3
  on_regress:
    fail: true
    annotate_pr: true
    name_owner: true  # last touch on the failing module
04 — contact
Tell us where the work is.
One sentence is enough. We reply within two business days. If we are a bad fit we will say so.