00 — research
From an interesting paper to a job that runs every weekday at 06:00.
For research-led labs and applied-research teams inside larger orgs who measure the work in reproducible runs.
eval: before training, and before shipping
typical scope: 4–10 wks; paper to a daily job
first-week result: 1 rerun; new researchers ship clean
01 — what we do
Concrete deliverables. Named tools. Honest constraints.
01
Eval harnesses.
Task suites, dataset versioning, and a CI hook that fails the build when the metric drops. Eval before training, eval before shipping.
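The hook itself stays small. A minimal sketch of the gate in Python; task names and floors mirror the example config in section 03, and everything here is illustrative:
# ci gate sketch: reads a metrics json, exits non-zero on any floor breach,
# so any CI system fails the build. names and floors are illustrative only.
import json
import sys

FLOORS = {"doc_qa_recall@5": 0.82, "rerank_ndcg@10": 0.71}

def main(results_path: str) -> int:
    with open(results_path) as f:
        results = json.load(f)                    # {"task_name": metric, ...}
    bad = {t: (results.get(t, 0.0), floor)
           for t, floor in FLOORS.items() if results.get(t, 0.0) < floor}
    for task, (got, floor) in bad.items():
        print(f"REGRESSION {task}: {got:.3f} < floor {floor:.3f}", file=sys.stderr)
    return 1 if bad else 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))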
02
RAG that scales past a demo.
Chunking, retrieval, and reranking tuned on your corpus, not a leaderboard. We measure recall at the document level, not vibes at the chat level.
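What "recall at the document level" means here, as a sketch; the function name is ours, not a library's:
# document-level recall@k sketch. ranked chunks collapse to their parent
# documents first, so ten chunks from one document count once; recall is
# then scored on documents, not chunks.
def doc_recall_at_k(ranked_doc_ids_of_chunks, gold_doc_ids, k=5):
    ranked_docs = list(dict.fromkeys(ranked_doc_ids_of_chunks))  # dedupe, keep rank order
    hits = set(ranked_docs[:k]) & set(gold_doc_ids)
    return len(hits) / max(len(gold_doc_ids), 1)

# three chunks from d3 count once; one of two gold docs retrieved -> 0.5
assert doc_recall_at_k(["d3", "d3", "d3", "d1"], {"d1", "d9"}, k=5) == 0.5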
03
Model selection.
Open-weights vs. hosted, small vs. large, fine-tune vs. prompt — answered with a controlled eval on your data. The cheapest model that clears the bar wins.
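The selection rule is mechanical once the eval exists. A sketch, with hypothetical candidates, costs, and bar:
# "cheapest model that clears the bar", as a loop. candidate names, costs,
# and the bar are hypothetical; evaluate() is your controlled eval.
CANDIDATES = [                       # ascending cost per 1M tokens
    ("small-open-weights", 0.10),
    ("mid-hosted", 0.50),
    ("large-hosted", 3.00),
]
BAR = 0.80                           # the task metric a winner must clear

def pick_model(evaluate):            # evaluate: name -> metric on your data
    for name, cost in CANDIDATES:
        score = evaluate(name)
        print(f"{name}: {score:.3f} at ${cost:.2f}/1M tok")
        if score >= BAR:
            return name              # first over the bar is cheapest over the bar
    raise RuntimeError("nothing clears the bar; revisit the task before the model")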
04
Paper-to-prod pipelines.
A repeatable path from a notebook reproduction to a daily job. Config, seeds, data lineage, and a kill-switch on regression.
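The pipeline's spine fits on one page. A sketch of the daily-job entrypoint, with illustrative names and numbers throughout:
# pinned config, fixed seed, named data version, and a kill-switch that
# refuses to publish under the floor. all names here are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class JobConfig:
    seed: int = 1234
    data_version: str = "corpus/v3"    # lineage: the exact snapshot used
    metric_floor: float = 0.82         # kill-switch threshold

def daily_job(cfg: JobConfig, run, publish):
    metrics = run(cfg)                 # deterministic given cfg.seed
    if metrics["doc_qa_recall@5"] < cfg.metric_floor:
        raise SystemExit(f"kill-switch: {metrics} under floor {cfg.metric_floor}")
    publish(metrics, provenance=cfg)   # results ship with the config that made them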
05
Experiment tooling.
Run tracking, artefact storage, and a query layer your researchers will actually open. We adapt to whatever you already use before we recommend new tools.
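Stripped to its core, the tooling is a ledger plus one query. A sketch over sqlite, purely illustrative; in practice we adapt to the tracker you already run:
# the minimum a tracker must do: append-only run records, queryable later.
import json, sqlite3, time

def log_run(db, config: dict, metrics: dict, artefact_uri: str):
    db.execute("CREATE TABLE IF NOT EXISTS runs"
               " (ts REAL, config TEXT, metrics TEXT, artefact TEXT)")
    db.execute("INSERT INTO runs VALUES (?, ?, ?, ?)",
               (time.time(), json.dumps(config), json.dumps(metrics), artefact_uri))
    db.commit()

def best_runs(db, metric: str, n: int = 5):
    rows = db.execute("SELECT config, metrics, artefact FROM runs").fetchall()
    scored = [(json.loads(m).get(metric, 0.0), c, a) for c, m, a in rows]
    return sorted(scored, reverse=True)[:n]   # the query researchers actually open

# db = sqlite3.connect("runs.db")
# log_run(db, {"lr": 3e-4}, {"doc_qa_recall@5": 0.84}, "s3://runs/0421")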
06
Reproducible scaffolding.
Project templates with locked deps, deterministic seeds, and a one-command rerun. New researchers ship a result in their first week.
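"Deterministic seeds" is one helper, called first in every entrypoint. One common shape, assuming numpy and optionally torch:
# the seed-everything helper every template ships with. torch is guarded so
# the same scaffold works in cpu-only or torch-free repos.
import os, random

import numpy as np

def seed_everything(seed: int = 1234) -> None:
    os.environ["PYTHONHASHSEED"] = str(seed)   # applies to subprocesses we launch
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True   # trade speed for repeatability
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass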
02 — how we engage
We will not build a new MLOps platform. We will fix the three places your researchers lose a day a week.
Typical scope is 4 to 10 weeks. We start by reproducing one of your most-cited internal results and writing down what hurt.
Output is your repo, your runners, your conventions — with the gaps closed.
03 — example shape
Anonymised. Plausible. The shape of a real engagement.
A research-led lab had eight researchers, six different run trackers, and no shared eval suite.
Promising results were not surviving review because nobody could rerun them clean.
Over five weeks we chose one tracker, wrote the eval harness around the two tasks the lab actually shipped against, and wired it into CI.
A regression on either task now fails the PR, and the dashboard names the responsible run.
Regressions in two papers in flight were caught and fixed before submission.
The harness is the lab’s now. We do not maintain it.
# eval-in-CI — fails the PR on any task regression.
# runs on every push, full suite nightly.
eval:
  on: [pull_request, push]
  tasks:
    - name: doc_qa_recall@5
      floor: 0.82
      data: corpus/v3
    - name: rerank_ndcg@10
      floor: 0.71
      data: corpus/v3
  on_regress:
    fail: true
    annotate_pr: true
    name_owner: true  # last touch on the failing module
04 — contact
Tell us where the work is.
One sentence is enough. We reply within two business days. If we are a bad fit we will say so.