Shipped this month:
- Sandbox v1. The 4-LLM playground with rubric scoring. Took twice as long as estimated, mostly because the JSON streaming differences across providers were worse than we expected. Gemini's chunk boundaries especially.
- Capability heatmap. 16 metaskills, scored from real outputs. The rubric draft was open in a Notion doc for two weeks before any code was written, which turned out to be the right order.
- Slack bot v0.3. Threaded reply support. Posts the daily digest to a per-engineer thread, not a flood channel. People liked it.
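A note on the streaming problem from Sandbox v1: one common way to stop caring where a provider cuts its chunks is to buffer the text and emit a value only once a complete JSON document has arrived. The sketch below is illustrative, not the Sandbox code. `JsonStreamBuffer` is a hypothetical name, and it assumes chunks have already been decoded to text and that the stream is a sequence of JSON objects or arrays (a bare top-level number split across chunks would be emitted too early).

```python
import json


class JsonStreamBuffer:
    """Accumulate text chunks; return each complete top-level JSON value.

    Hypothetical helper for illustration. Assumes the stream carries JSON
    objects or arrays, one after another, split at arbitrary positions.
    """

    def __init__(self):
        self._buf = ""
        self._decoder = json.JSONDecoder()

    def feed(self, chunk):
        """Add one chunk and return the list of values completed by it."""
        self._buf += chunk
        values = []
        while True:
            # raw_decode does not skip leading whitespace, so strip it first.
            pending = self._buf.lstrip()
            if not pending:
                self._buf = ""
                break
            try:
                value, end = self._decoder.raw_decode(pending)
            except json.JSONDecodeError:
                # Incomplete (or malformed) value: keep it and wait for more.
                self._buf = pending
                break
            values.append(value)
            self._buf = pending[end:]
        return values


buf = JsonStreamBuffer()
first = buf.feed('{"decision": "shi')        # cut inside a string
second = buf.feed('p"}{"decision"')          # finishes one, starts another
third = buf.feed(': "wait"}\n')              # finishes the second
```

Because the buffer treats any parse failure as "not finished yet", genuinely malformed output would sit in the buffer forever; a real integration would probably want a size or time limit on top of this.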
Slipped:
- On-call rotation integration. Decided not to ship in March because we couldn't agree on whether on-call duty should affect capability scores. Punted to April for a longer design discussion.
- Mobile app. Started, then stopped. The web app is responsive enough; the native shell wasn't earning its complexity.
Things that surprised us:
- The capability heatmap caused more emotional reactions than we expected. We thought "score from real outputs" would feel less invasive than self-rating. It feels more invasive, because you can't argue with the data the way you can argue with a self-assessment. We added a "context note" feature mid-month so engineers can explain what the score doesn't see.
- Sandbox usage skewed heavily toward debugging existing prompts, not exploring new ones. We had assumed the opposite. Will reflect that in v2's UX.
- Two design partners told us the same thing in the same week: "RUQA makes me want to do better work." We're not sure how to square that with "doesn't change incentives during calibration." Possibly the act of seeing your own work synthesized in plain language creates accountability without scoring.
Unresolved bug:
In ~3% of synthesis runs, the AI's "decisions" extraction includes a phantom decision that wasn't in the input signals. The prompt is deterministic at temperature=0. The signals don't contain the phantom. The output does. We've reproduced it twice manually and lost it both times. Logs are at https://gist.github.com/ruqa/march-phantom. If you've seen this pattern with structured-output prompts on Sonnet 4.6, please email us.
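
One way to stop losing reproductions of a bug like this is to check every extracted decision against the input at run time and log the full run when the check fails. The sketch below assumes a schema the post does not describe: each decision is a dict with a hypothetical `evidence` field holding a quote the model claims to have taken from a signal. It does not explain the phantom; it only flags it.

```python
def find_phantom_decisions(decisions, signals):
    """Return the decisions whose evidence quote appears in no input signal.

    `decisions` is a list of dicts with a hypothetical "evidence" key;
    `signals` is a list of raw signal strings. Matching is a
    case-insensitive substring test after collapsing whitespace.
    """

    def normalize(text):
        return " ".join(text.lower().split())

    normalized_signals = [normalize(s) for s in signals]
    phantoms = []
    for decision in decisions:
        quote = normalize(decision.get("evidence", ""))
        # A missing or empty quote counts as ungrounded.
        if not quote or not any(quote in s for s in normalized_signals):
            phantoms.append(decision)
    return phantoms


signals = [
    "Alice: let's move the launch to April 12.",
    "Bob: agreed, I'll update the roadmap.",
]
decisions = [
    {"text": "Launch moved to April 12", "evidence": "move the launch to April 12"},
    {"text": "Hire a contractor", "evidence": "we should hire a contractor"},
]
phantoms = find_phantom_decisions(decisions, signals)
```

Here `phantoms` holds only the second decision, since its quote matches neither signal. A verbatim-quote check is strict by design: it will also flag a real decision whose evidence was paraphrased, so flagged runs are candidates to inspect, not confirmed phantoms.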