Daily Claude eval refresh, on autopilot

If your team runs Claude (or any model) on a recurring task, you likely already have an eval harness. The hard part isn't the eval; it's making sure someone actually runs it every day, sees the result, and notices when a score regresses.

This is what that loop looks like as a routine.

The setup

  • Skill. Your eval harness, packaged as a Claude skill. It pulls today's test set, runs the prompts through the model, scores them, and emits an HTML report with the scoreboard, the regressions, and the diffs.
  • Report. A single Comma report named something like Claude eval — production prompts. Every routine run posts the new HTML as the next revision.
  • Cadence. Daily on Pro ($15/mo), hourly on Team ($75/seat).
  • Cost ceiling. Per-run cap tuned to the size of your eval set.
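
To make the skill's last step concrete, here is a minimal sketch of turning scored results into an HTML scoreboard. The `EvalResult` type, the 0-to-1 score scale, the `tolerance` default, and `render_report` itself are assumptions made for this illustration; they are not part of Comma or of any Claude API, and your harness will have its own shapes.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class EvalResult:
    task: str        # hypothetical task name, e.g. "summarize support ticket"
    score: float     # assumed scale: 0.0 (fail) to 1.0 (pass)
    baseline: float  # score from the previous run, same scale


def render_report(results: list[EvalResult], tolerance: float = 0.05) -> str:
    """Render a scoreboard table, flagging rows that dropped by more than `tolerance`."""
    rows = []
    for r in results:
        regressed = (r.baseline - r.score) > tolerance
        status = "regression" if regressed else "ok"
        # Escape task names so arbitrary prompt titles cannot break the markup.
        rows.append(
            f'<tr class="{status}"><td>{escape(r.task)}</td>'
            f"<td>{r.score:.2f}</td><td>{r.baseline:.2f}</td><td>{status}</td></tr>"
        )
    header = "<tr><th>Task</th><th>Score</th><th>Baseline</th><th>Status</th></tr>"
    return "<table>" + header + "".join(rows) + "</table>"
```

One row per task keeps the output easy to scan, and a per-row status class lets the report style regressions differently without any extra tooling.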

The shape, day by day

Monday 09:00 UTC. Routine fires. The eval runs against today's prompts. The HTML revision lands on the eval report, and the routine pings the eng-quality channel.

Monday 09:15. Lead opens the link. Notices that a regression appeared on the "summarize support ticket" task. Selects the offending row in the results table and pins a comment: "This one started after the prompt change Friday — investigate."

Tuesday 09:00 UTC. Routine fires again. New revision lands. The comment from Monday is still there, anchored to the same row. The diff against yesterday is one click. The investigation continues without anyone re-finding context.
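
If the harness also wants to list day-over-day changes inside the HTML it emits, the comparison can be sketched as below. This is only an illustration of the idea, not how the report's built-in revision diff works; `diff_runs`, the task-to-score dictionaries, and the `tolerance` default are all hypothetical.

```python
def diff_runs(
    yesterday: dict[str, float],
    today: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, list[str]]:
    """Compare two runs (task name -> score) and bucket the changes."""
    shared = set(yesterday) & set(today)
    return {
        # Tasks whose score fell by more than the tolerance.
        "regressed": sorted(t for t in shared if yesterday[t] - today[t] > tolerance),
        # Tasks whose score rose by more than the tolerance.
        "improved": sorted(t for t in shared if today[t] - yesterday[t] > tolerance),
        # Tasks that only exist in one of the two runs.
        "added": sorted(set(today) - set(yesterday)),
        "removed": sorted(set(yesterday) - set(today)),
    }
```

The tolerance is there so that small run-to-run noise does not show up as a regression every morning.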

Friday. Two regressions are fixed, one is a false positive. Comments get resolved. The historical revisions stay; the audit trail is intact.

This is the loop that's hard to keep running on a laptop. It's not hard on a routine.

Why a Comma report and not a dashboard

The eval result is HTML the agent already produces. Sending it to a dashboard tool means rebuilding it in that tool's schema. Sending it to Slack means losing the per-row anchoring. Sending it to email means losing the conversation. The Comma report keeps the original HTML and adds the comment layer outside the iframe — the eval keeps its shape and the discussion attaches in the right places.

What it costs

Routines respect each plan's monthly Bedrock cap:

  • Pro — $50 monthly cap with $5 included credit. Plenty of headroom for a daily eval that runs in a few hundred tokens.
  • Team — $300 monthly cap with $30 included credit, or BYO Bedrock keys so the spend lands in your own AWS account.

If the eval set grows, the per-run cap protects the budget: a run that would exceed the cap is refused rather than allowed to overspend.
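
The refuse-rather-than-overspend behavior can be pictured with a small guard like the one below. It is a sketch of the general pattern only; `RunBudget`, `BudgetExceeded`, and the dollar figures are invented for the example and do not describe how routines enforce the cap internally.

```python
class BudgetExceeded(Exception):
    """Raised when a charge would push a run past its cap."""


class RunBudget:
    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, estimated_cost_usd: float) -> None:
        """Record spend, refusing any charge that would exceed the cap."""
        if self.spent_usd + estimated_cost_usd > self.cap_usd:
            # Refuse up front instead of spending first and noticing later.
            raise BudgetExceeded(
                f"charge of ${estimated_cost_usd:.2f} would exceed "
                f"the ${self.cap_usd:.2f} per-run cap"
            )
        self.spent_usd += estimated_cost_usd
```

Checking before each model call, rather than after, is what keeps a growing eval set from silently running over.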

Setup in three steps

  1. Wrap the harness as a skill. A short Markdown skill that says "run the eval against the production prompt set, emit HTML."
  2. Create the eval report. Either upload your current eval HTML or start with a placeholder.
  3. Wire the routine. From the report, Add routine → pick the skill, pick daily cadence, set the cap.

That's it. The next morning, the routine has already run.

Variations

  • Daily regression sweep. Tighter eval set, runs at 06:00 UTC. Pings eng-quality only when scores drop below a threshold.
  • PR-triggered eval. Combine a routine (the daily baseline) with an MCP-driven one-off run that an agent fires when a PR lands. Same report, different revisions.
  • Per-model eval. One routine per model, all posting to the same report family — easier to compare drift across versions.
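
For the regression sweep, the "ping only below a threshold" rule might look like the sketch below. The function names, the 0.8 default threshold, and the message format are assumptions for illustration; actually posting to the channel is left out because it depends on your own tooling.

```python
from typing import Optional


def tasks_below_threshold(scores: dict[str, float], threshold: float = 0.8) -> list[str]:
    """Return the names of tasks scoring strictly below the threshold, sorted."""
    return sorted(task for task, score in scores.items() if score < threshold)


def build_alert(scores: dict[str, float], threshold: float = 0.8) -> Optional[str]:
    """Build a channel message, or return None when nothing needs attention."""
    failing = tasks_below_threshold(scores, threshold)
    if not failing:
        return None  # quiet day: no ping
    return f"{len(failing)} task(s) below {threshold:.2f}: " + ", ".join(failing)
```

Returning `None` on a clean run makes the quiet path explicit, so the caller can skip the notification entirely.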

Try it

Create your first routine →
