00 — engineering
AI inside the engineering org, applied only where the model beats the script.
For engineering orgs, platform teams, and CTO offices measuring the work in shipped diffs and incident hours saved.
merge gate
eval-in-CI
no silent regressions
typical scope
3–10 wks
reading PRs to shipping diffs
handover
your repo
no SaaS dependency
01 — what we do
Concrete deliverables. Named tools. Honest constraints.
01
AI for internal tooling.
Codebase Q&A, log triage, on-call summarisers — built against your repos and your runbooks. We use a model only where a script falls short.
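As a rough sketch of that rule, a triage step can try a plain pattern table first and send only the leftovers to a model. The patterns, labels, and classifyWithModel call below are hypothetical, not a fixed design.

// Hypothetical log-triage step: rules first, a model only as the fallback.
type Triage = { label: string; source: "rule" | "model" };

const RULES: Array<[RegExp, string]> = [
  [/OOMKilled|out of memory/i, "infra/memory"],
  [/ECONNREFUSED|connection refused/i, "infra/network"],
  [/AssertionError/i, "test/assertion"],
];

// Stand-in for a real model call, assumed to exist elsewhere.
declare function classifyWithModel(logLine: string): Promise<string>;

export async function triage(logLine: string): Promise<Triage> {
  for (const [pattern, label] of RULES) {
    if (pattern.test(logLine)) return { label, source: "rule" };
  }
  // Only lines the rules cannot place are sent to the model.
  return { label: await classifyWithModel(logLine), source: "model" };
}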
02
Code-gen workflows.
Prompts, harnesses, and review gates that turn a model into a repeatable contributor on a narrow surface. Boilerplate, migrations, test scaffolds — the boring parts.
03
Dev infra.
Build-time and review-time integrations: PR summaries that are accurate, flake detectors that earn their place, and a kill-switch on the model that wrote them.
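The kill-switch can be as small as one flag plus a deterministic fallback. A minimal sketch for the PR-summary case, assuming a hypothetical PR_SUMMARY_MODEL flag and a summariseWithModel stand-in:

// Hypothetical kill-switch on a model-written PR summary.
declare function summariseWithModel(diff: string): Promise<string>;

// Deterministic fallback: list the files the diff touches.
function deterministicSummary(diff: string): string {
  const files = diff
    .split("\n")
    .filter((line) => line.startsWith("+++ b/"))
    .map((line) => line.slice("+++ b/".length));
  return `Touches ${files.length} file(s): ${files.join(", ")}`;
}

export async function prSummary(diff: string): Promise<string> {
  if (process.env.PR_SUMMARY_MODEL === "off") {
    return deterministicSummary(diff); // kill-switch engaged
  }
  try {
    return await summariseWithModel(diff);
  } catch {
    return deterministicSummary(diff); // a bad model day degrades, never blocks
  }
}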
04
Eval-in-CI.
Every model-touching change gates on an eval suite that fails loudly. No silent regressions, no eyeballed diffs.
05
Doc and search systems.
RAG over your docs, code, and incident history. Tuned for precision on the questions your engineers actually ask, not a public benchmark.
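One way to keep "tuned for precision" honest is to score retrieval against a golden set of questions engineers actually asked. A sketch, where retrieve and the case shape are assumptions rather than a specific library:

// Hypothetical precision@k check over an internal golden set.
type GoldenCase = { question: string; relevantDocIds: string[] };

// Stand-in for the retrieval layer under test.
declare function retrieve(question: string, k: number): Promise<string[]>;

export async function precisionAtK(cases: GoldenCase[], k = 5): Promise<number> {
  let relevant = 0;
  let retrieved = 0;
  for (const c of cases) {
    const ids = await retrieve(c.question, k);
    relevant += ids.filter((id) => c.relevantDocIds.includes(id)).length;
    retrieved += ids.length;
  }
  return retrieved === 0 ? 0 : relevant / retrieved;
}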
06
Model-aware migration tooling.
Codemods scaffolded by a model and verified by tests. The model writes the diff, the test suite earns the merge.
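A minimal sketch of that merge rule, assuming a hypothetical proposeCodemod call that returns a unified diff and a pnpm test script: the diff only survives if the existing suite stays green after it is applied.

// Hypothetical gate for a model-written codemod: apply the diff,
// run the tests, revert on any failure.
import { execSync } from "node:child_process";
import { writeFileSync } from "node:fs";

declare function proposeCodemod(instruction: string): Promise<string>;

export async function applyIfTestsPass(instruction: string): Promise<boolean> {
  writeFileSync("codemod.patch", await proposeCodemod(instruction));
  execSync("git apply codemod.patch");            // the model writes the diff
  try {
    execSync("pnpm test", { stdio: "inherit" });  // the test suite earns the merge
    return true;
  } catch {
    execSync("git apply -R codemod.patch");       // revert: no green, no merge
    return false;
  }
}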
02 — how we engage
We will not bolt a chatbot onto a product that does not need one. We will not replace a working linter with a model.
Typical scope is 3 to 10 weeks. We start by reading a quarter's worth of incidents and a week's worth of PRs before suggesting where AI helps.
Deliverables live in your repo and your CI. We leave with a handover, not a SaaS dependency.
03 — example shape
an engagement shape
Anonymised. Plausible. The shape of a real engagement.
A platform team at a 200-engineer company spent two days a week triaging flaky tests.
Most of the flakes were three known patterns and a long tail of real bugs.
We built a classifier against eight months of CI history, wired it into the PR gate, and added an eval suite so the classifier could not silently regress.
Flake-triage time fell to about half a day a week within a month.
The real bugs in the long tail surfaced cleanly, because they no longer hid behind the noise.
We were on the work for six weeks. The platform team owns the classifier now.
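Very roughly, the classifier and its regression floor had this shape. The patterns, labels, and the 0.95 floor below are illustrative, not the real ones.

// Hypothetical flake gate: quarantine only the known patterns,
// leave everything else as a real failure for a human.
type Verdict = "known-flake" | "needs-triage";

const KNOWN_FLAKE_PATTERNS: RegExp[] = [
  /ECONNRESET .* registry/i,         // transient registry resets
  /timed out waiting for browser/i,  // headless browser start-up races
  /port \d+ already in use/i,        // parallel jobs colliding on ports
];

export function classifyFailure(logTail: string): Verdict {
  return KNOWN_FLAKE_PATTERNS.some((p) => p.test(logTail))
    ? "known-flake"
    : "needs-triage";
}

// Precision floor over a labelled slice of CI history, checked in CI,
// so the classifier cannot quietly start swallowing real bugs.
export function meetsPrecisionFloor(
  labelled: Array<{ logTail: string; isFlake: boolean }>,
  floor = 0.95,
): boolean {
  const flagged = labelled.filter((c) => classifyFailure(c.logTail) === "known-flake");
  if (flagged.length === 0) return true;
  const correct = flagged.filter((c) => c.isFlake).length;
  return correct / flagged.length >= floor;
}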
# eval gate on the code-gen surface.
# runs on every PR that touches prompts/ or harness/.
set -euo pipefail

pnpm eval:codegen \
  --suite migrations \
  --suite test-scaffolds \
  --floor pass_at_1=0.78 \
  --floor compile_rate=0.97 \
  --report ci/last_eval.json

# fail the PR if any floor regressed vs main
pnpm eval:diff ci/last_eval.json origin/main
04 — contact
Tell us where the work is.
One sentence is enough. We reply within two business days. If we are a bad fit we will say so.