Daily Claude eval refresh, on autopilot

If your team runs Claude (or any model) on a recurring task, you have an eval harness. The hard part isn't the eval — it's making sure someone actually runs it every day, sees the result, and notices when it regresses.

This is what that loop looks like as a routine.

The setup

Skill. Your eval harness, packaged as a Claude skill. It pulls today's test set, runs the prompts through the model, scores them, and emits an HTML report with the scoreboard, the regressions, and the diffs.
Report. A single Comma report named something like Claude eval — production prompts. Every routine run posts the new HTML as the next revision.
Cadence. Daily, hourly, or whatever matches your release cycle — any cadence, on any plan.
Cost ceiling. Per-run cap tuned to the size of your eval set.

The shape, day by day

Monday 09:00 UTC. Routine fires. The eval runs against today's prompts. HTML revision lands on the eval report. Pings the eng-quality channel.

Monday 09:15. Lead opens the link. Notices that a regression appeared on the "summarize support ticket" task. Selects the offending row in the results table and pins a comment: "This one started after the prompt change Friday — investigate."

Tuesday 09:00 UTC. Routine fires again. New revision lands. The comment from Monday is still there, anchored to the same row. The diff against yesterday is one click. The investigation continues without anyone re-finding context.

Friday. Two regressions are fixed, one is a false positive. Comments get resolved. The historical revisions stay; the audit trail is intact.

This is the loop that's hard to keep running on a laptop. It's not hard on a routine.

Why a Comma report and not a dashboard

The eval result is HTML the agent already produces. Sending it to a dashboard tool means rebuilding it in that tool's schema. Sending it to Slack means losing the per-row anchoring. Sending it to email means losing the conversation. The Comma report keeps the original HTML and adds the comment layer outside the iframe — the eval keeps its shape and the discussion attaches in the right places.

What it costs

You pay only for the AI compute each run uses, billed one of two ways:

Prepaid credit. Use Comma's keys and draw down a prepaid balance (at cost plus 20%). A daily eval that runs in a few hundred tokens costs cents; hosted runs pause at a $0 balance rather than overflowing.
Bring your own Bedrock keys. Bill the spend straight to your own AWS account — free on Comma, on any plan.

If the eval set grows, the per-run cap protects the budget. Runaway prompts pause rather than overflow.

Setup in three steps

Wrap the harness as a skill. A short Markdown skill that says "run the eval against the production prompt set, emit HTML."
Create the eval report. Either upload your current eval HTML or start with a placeholder.
Wire the routine. From the report, Add routine → pick the skill, pick daily cadence, set the cap.

That's it. The next morning, the routine has already run.

Variations

Daily regression sweep. Tighter eval set, runs at 06:00 UTC. Pings eng-quality only when scores drop below a threshold.
PR-triggered eval. Combine a routine (daily baseline) with an MCP-driven one-off run an agent fires when a PR lands. Same report, different revisions.
Per-model eval. One routine per model, all posting to the same report family — easier to compare drift across versions.

Try it

Create your first routine →