Let an agent rewrite the interview prompt overnight and keep only what scores higher. Start with one facet, expand to the whole experience.
From a real conversation to a better prompt. Example: the Thumpn growth-lead interview (agent = "Meghna").
| What we'll do | How | Why |
|---|---|---|
| Phase 0: Build the eval harness (Week 1) | | |
| Collect transcripts | Pull ~30 real interviews, scrub PII, tag by candidate type | Realistic, varied cases to score against |
| Define the score | Lock metrics with Kinnari, write a 0–5 rubric | One number the loop can optimize |
| Build rig + judge | Replay answers to the agent; a strong model grades output | Automates scoring so we can run it hundreds of times |
| Phase 1: Optimize one facet (Week 2–3) | | |
| Scope to persona | Freeze the whole prompt except the persona section | Isolate one variable so gains are clear |
| Run the loop | Edit → replay → judge → keep if better, ~100× overnight | Finds improvements we wouldn't hand-write |
| Human gate | Spot-check the top 5 each morning before promoting | Catch metric-gaming before it ships |
| Phase 2: Add facets one by one (Week 4–6) | | |
| Expand scope | Unfreeze memory, depth, signal, "don't sound AI" in turn | Compound gains while keeping attribution |
| Upgrade optimizer | Move to DSPy/MIPRO once many sections tune together | Smarter search when many parts move |
| Guard regressions | Re-score on a frozen eval set, block any drop | Protect wins already banked |
| Phase 3: Optimize the full interview (Week 7+) | | |
| Go end-to-end | Optimize the whole-interview score, keep prompt variants | Tune the experience, not just parts |
| Productionize | Nightly report; flag-gated A/B on live interviews | Continuous improvement with a safety net |
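
The Phase 1 loop (edit → replay → judge → keep if better) could be sketched roughly as below. This is a minimal illustration, not the actual harness: `optimize_facet`, `toy_edit`, `toy_judge`, the `memory` text, and the transcript IDs are all hypothetical stand-ins, and the toy judge simply counts trait words so the example runs offline. In a real rig both the edit proposer and the judge would presumably call a model.

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Tuple

Prompt = Dict[str, str]  # section name -> section text


def optimize_facet(
    prompt: Prompt,
    facet: str,
    propose_edit: Callable[[str, random.Random], str],
    judge: Callable[[Prompt, str], float],
    transcripts: List[str],
    iterations: int = 100,
    seed: int = 0,
) -> Tuple[Prompt, float]:
    """Hill-climb one prompt section; every other section stays frozen."""
    rng = random.Random(seed)

    def avg_score(p: Prompt) -> float:
        # Replay every transcript against the prompt and average the 0-5 grades.
        return mean(judge(p, t) for t in transcripts)

    best, best_score = dict(prompt), avg_score(prompt)
    for _ in range(iterations):
        candidate = dict(best)
        candidate[facet] = propose_edit(best[facet], rng)  # edit
        candidate_score = avg_score(candidate)             # replay + judge
        if candidate_score > best_score:                   # keep if better
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy stand-ins (assumptions) so the sketch runs without any model calls.
TRAITS = ("warm", "curious", "concise")
PHRASES = ("Be warm.", "Be curious.", "Be concise.", "Use emojis.")


def toy_edit(section: str, rng: random.Random) -> str:
    # Hypothetical proposer: append one random instruction to the section.
    return section + " " + rng.choice(PHRASES)


def toy_judge(prompt: Prompt, transcript: str) -> float:
    # Placeholder rubric: one point per trait named in the persona, capped at 5.
    # It ignores the transcript, which a real judge would not.
    return min(5.0, float(sum(trait in prompt["persona"] for trait in TRAITS)))


base_prompt = {
    "persona": "You are Meghna, interviewing a growth lead.",
    "memory": "Refer back to earlier answers.",  # frozen during this run
}
transcripts = ["transcript-1", "transcript-2", "transcript-3"]

baseline = mean(toy_judge(base_prompt, t) for t in transcripts)
best_prompt, best_score = optimize_facet(
    base_prompt, "persona", toy_edit, toy_judge, transcripts
)
print(f"baseline={baseline:.1f} best={best_score:.1f}")
```

Note that the toy judge is trivially gameable (any text containing the trait words scores well), which is the failure mode the morning human gate and the frozen eval set are meant to catch. Neither safeguard is modeled in this sketch.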