00 — engineering
AI inside the engineering org, applied only where the model beats the script.
For engineering orgs, platform teams, and CTO offices measuring the work in shipped diffs and incident hours saved.
merge gate
eval-in-CI
no silent regressions
typical scope
3–10 wks
reading PRs to shipping diffs
handover
your repo
no SaaS dependency
01 — what we do
Concrete deliverables. Named tools. Honest constraints.
01
AI for internal tooling.
Codebase Q&A, log triage, on-call summarisers — built against your repos and your runbooks. We use a model only where a script falls short.
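As a rough sketch of that rule, a triage step can try a plain pattern table first and send only the leftovers to a model. The patterns, labels, and classifyWithModel call below are hypothetical, not a fixed design.

// Hypothetical log-triage step: rules first, a model only as the fallback.
type Triage = { label: string; source: "rule" | "model" };

const RULES: Array<[RegExp, string]> = [
  [/OOMKilled|out of memory/i, "infra/memory"],
  [/ECONNREFUSED|connection refused/i, "infra/network"],
  [/AssertionError/i, "test/assertion"],
];

// Stand-in for a real model call, assumed to exist elsewhere.
declare function classifyWithModel(logLine: string): Promise<string>;

export async function triage(logLine: string): Promise<Triage> {
  for (const [pattern, label] of RULES) {
    if (pattern.test(logLine)) return { label, source: "rule" };
  }
  // Only lines the rules cannot place are sent to the model.
  return { label: await classifyWithModel(logLine), source: "model" };
}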
02
Code-gen workflows.
Prompts, harnesses, and review gates that turn a model into a repeatable contributor on a narrow surface. Boilerplate, migrations, test scaffolds — the boring parts.
03
Dev infra.
Build-time and review-time integrations: PR summaries that are accurate, flake detectors that earn their place, and a kill-switch on the model that wrote them.
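The kill-switch can be as small as one flag plus a deterministic fallback. A minimal sketch for the PR-summary case, assuming a hypothetical PR_SUMMARY_MODEL flag and a summariseWithModel stand-in:

// Hypothetical kill-switch on a model-written PR summary.
declare function summariseWithModel(diff: string): Promise<string>;

// Deterministic fallback: list the files the diff touches.
function deterministicSummary(diff: string): string {
  const files = diff
    .split("\n")
    .filter((line) => line.startsWith("+++ b/"))
    .map((line) => line.slice("+++ b/".length));
  return `Touches ${files.length} file(s): ${files.join(", ")}`;
}

export async function prSummary(diff: string): Promise<string> {
  if (process.env.PR_SUMMARY_MODEL === "off") {
    return deterministicSummary(diff); // kill-switch engaged
  }
  try {
    return await summariseWithModel(diff);
  } catch {
    return deterministicSummary(diff); // a bad model day degrades, never blocks
  }
}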
04
Eval-in-CI.
Every model-touching change gates on an eval suite that fails loudly. No silent regressions, no eyeballed diffs.
05
Doc and search systems.
RAG over your docs, code, and incident history. Tuned for precision on the questions your engineers actually ask, not a public benchmark.
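One way to keep "tuned for precision" honest is to score retrieval against a golden set of questions engineers actually asked. A sketch, where retrieve and the case shape are assumptions rather than a specific library:

// Hypothetical precision@k check over an internal golden set.
type GoldenCase = { question: string; relevantDocIds: string[] };

// Stand-in for the retrieval layer under test.
declare function retrieve(question: string, k: number): Promise<string[]>;

export async function precisionAtK(cases: GoldenCase[], k = 5): Promise<number> {
  let relevant = 0;
  let retrieved = 0;
  for (const c of cases) {
    const ids = await retrieve(c.question, k);
    relevant += ids.filter((id) => c.relevantDocIds.includes(id)).length;
    retrieved += ids.length;
  }
  return retrieved === 0 ? 0 : relevant / retrieved;
}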
06
Model-aware migration tooling.
Codemods scaffolded by a model and verified by tests. The model writes the diff, the test suite earns the merge.
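A minimal sketch of that merge rule, assuming a hypothetical proposeCodemod call that returns a unified diff and a pnpm test script: the diff only survives if the existing suite stays green after it is applied.

// Hypothetical gate for a model-written codemod: apply the diff,
// run the tests, revert on any failure.
import { execSync } from "node:child_process";
import { writeFileSync } from "node:fs";

declare function proposeCodemod(instruction: string): Promise<string>;

export async function applyIfTestsPass(instruction: string): Promise<boolean> {
  writeFileSync("codemod.patch", await proposeCodemod(instruction));
  execSync("git apply codemod.patch");            // the model writes the diff
  try {
    execSync("pnpm test", { stdio: "inherit" });  // the test suite earns the merge
    return true;
  } catch {
    execSync("git apply -R codemod.patch");       // revert: no green, no merge
    return false;
  }
}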
02 — how we engage
We will not bolt a chatbot onto a product that does not need one. We will not replace a working linter with a model.
Typical scope is 3 to 10 weeks. We start by reading a quarter's worth of incidents and a week's worth of PRs before suggesting where AI helps.
Deliverables live in your repo and your CI. We leave with a handover, not a SaaS dependency.
03 — example shape
an engagement shape
Anonymised. Plausible. The shape of a real engagement.
A platform team at a 200-engineer company spent two days a week triaging flaky tests.
Most of the flakes were three known patterns and a long tail of real bugs.
We built a classifier against eight months of CI history, wired it into the PR gate, and added an eval suite so the classifier could not silently regress.
Flake-triage time fell to about half a day a week within a month.
The real bugs in the long tail surfaced cleanly, because they no longer hid behind the noise.
We were on the work for six weeks. The platform team owns the classifier now.
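Very roughly, the classifier and its regression floor had this shape. The patterns, labels, and the 0.95 floor below are illustrative, not the real ones.

// Hypothetical flake gate: quarantine only the known patterns,
// leave everything else as a real failure for a human.
type Verdict = "known-flake" | "needs-triage";

const KNOWN_FLAKE_PATTERNS: RegExp[] = [
  /ECONNRESET .* registry/i,         // transient registry resets
  /timed out waiting for browser/i,  // headless browser start-up races
  /port \d+ already in use/i,        // parallel jobs colliding on ports
];

export function classifyFailure(logTail: string): Verdict {
  return KNOWN_FLAKE_PATTERNS.some((p) => p.test(logTail))
    ? "known-flake"
    : "needs-triage";
}

// Precision floor over a labelled slice of CI history, checked in CI,
// so the classifier cannot quietly start swallowing real bugs.
export function meetsPrecisionFloor(
  labelled: Array<{ logTail: string; isFlake: boolean }>,
  floor = 0.95,
): boolean {
  const flagged = labelled.filter((c) => classifyFailure(c.logTail) === "known-flake");
  if (flagged.length === 0) return true;
  const correct = flagged.filter((c) => c.isFlake).length;
  return correct / flagged.length >= floor;
}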
# eval gate on the code-gen surface.
# runs on every PR that touches prompts/ or harness/.
set -euo pipefail

pnpm eval:codegen \
  --suite migrations \
  --suite test-scaffolds \
  --floor pass_at_1=0.78 \
  --floor compile_rate=0.97 \
  --report ci/last_eval.json

# fail the PR if any floor regressed vs main
pnpm eval:diff ci/last_eval.json origin/main
04 — contact
Tell us where the work is.
One sentence is enough. We reply within two business days. If we are a bad fit we will say so.