The review layer for AI reports

The shape of AI work changed this year, and the artifact moved with it. Teams are no longer staring at a dashboard or a notebook — they are staring at an HTML report that an agent regenerated overnight. That shift is the entire story. Stanford's 2026 forecast names the move from generation to evaluation as the defining transition of the year (Stanford News); Forrester frames the same year around three questions, the first of which is whether enterprises can actually trust the outputs they are now drowning in (Forrester). The missing primitive in that picture is not authoring. It is review.

Competitors aren't going to own this

Look at where the incumbents are spending. Notion's 2025 push has been agents that draft, summarize, and chain — generation, not review. Google is shipping Gemini co-authoring across Docs and Workspace, also generation. Atlassian's Rovo agents summarize and rewrite Confluence pages, again generation. Every one of those products treats the document as something to be written, with the LLM as a faster typist. None of them treats the document as something to be inspected, with the LLM as a producer whose output must be sampled, compared, and signed off on.

The eval community has been on a different page for over a year. Hamel Husain, who has taught the ritual of looking-at-data to roughly 800 engineers and PMs, calls the manual review pass the highest-leverage thing an AI team does (Hamel's evals FAQ). Eugene Yan at Anthropic has spent the last year publishing on how the review loop is the actual unit of progress (eugeneyan.com). Andrew Ng has been repeating the same point for months: the bottleneck is no longer model quality, it is the review ritual that turns model output into decisions (@AndrewYNg). The practitioners agree. The platforms haven't caught up. That gap is the window.

What "the review layer" actually means

A review layer is not a comment widget. It is four primitives that compose:

  1. Snapshots over time. An AI report is not one document; it is the sequence of versions an agent produced on a schedule. Every snapshot is diffable against the last one. Without that history, review is just triage.

  2. Anchored comments. Reviewers comment on the paragraph, the table cell, the specific number that moved — not on the document as a whole. Anchored comments are how disagreement gets resolved instead of re-litigated next week.

  3. Scheduled regeneration. The agent that produced the report needs to be able to produce the next one on a cadence. Routines — Comma's hosted-cron primitive — keep the artifact alive without turning it into a fire drill.

  4. MCP for AI reviewers. Humans aren't the only reviewers anymore. An AI agent that triages eval regressions, or auto-labels the rows where the new model lost to the old one, is a first-class commenter. MCP is how that agent reads the report, posts comments, and ships the next revision. The Model Context Protocol crossed 97 million monthly SDK downloads in 2025 and was donated to the Linux Foundation in December (Pento — A year of MCP); it is now the integration rail, and Comma's MCP server ships with a scoped token model that treats agents as ordinary collaborators.

These four primitives compose into one thing: accountability for AI output. Snapshots tell you what changed. Anchored comments tell you who saw it. Schedules tell you when it ran. MCP tells you which agent posted it. None of those is interesting in isolation; together, they are the substrate for review.
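To make the composition concrete, here is a minimal sketch of how snapshots, anchored comments, and agent authorship could share one data model. It is illustrative only and assumes that each report block carries a stable id. The names `Snapshot`, `Comment`, `diff_snapshots`, and `comments_to_revisit` are hypothetical and are not Comma's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One scheduled run of a report: stable block id -> rendered text."""
    version: int
    blocks: dict


@dataclass(frozen=True)
class Comment:
    """A comment pinned to one block, left by a human or an agent."""
    anchor_id: str
    author: str
    author_kind: str  # "human" or "agent" (hypothetical labels)
    body: str


def diff_snapshots(prev, curr):
    """Return block ids that were added, removed, or changed between runs."""
    prev_ids, curr_ids = set(prev.blocks), set(curr.blocks)
    return {
        "added": sorted(curr_ids - prev_ids),
        "removed": sorted(prev_ids - curr_ids),
        "changed": sorted(
            block_id
            for block_id in prev_ids & curr_ids
            if prev.blocks[block_id] != curr.blocks[block_id]
        ),
    }


def comments_to_revisit(comments, diff):
    """Comments whose anchored block changed or disappeared in the new run."""
    touched = set(diff["changed"]) | set(diff["removed"])
    return [c for c in comments if c.anchor_id in touched]


# Made-up example data: two weekly runs and two anchored comments.
last_week = Snapshot(1, {"summary": "Pass rate 91%", "row-17": "new model wins"})
this_week = Snapshot(2, {"summary": "Pass rate 88%", "row-17": "new model wins",
                         "row-42": "new model loses"})
comments = [
    Comment("summary", "lead-reviewer", "human", "Is 91% within tolerance?"),
    Comment("row-17", "triage-agent", "agent", "Labelled: no regression."),
]

diff = diff_snapshots(last_week, this_week)
stale = comments_to_revisit(comments, diff)
print(diff)
print([c.author for c in stale])
```

In this sketch the diff answers what changed, the anchor id ties each comment to the block it was about, and `author_kind` records whether a human or an agent left it. Only the comment on the block that moved comes back for a second look.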

The proof

Three teams, same primitives.

An AI eval team ships a Claude skill that rebuilds a regression report every Monday morning. Each snapshot is diffed against the prior week. The lead reviewer leaves anchored comments on the rows that regressed; the on-call agent files them as Linear issues by lunch. The report is the artifact the entire eval cadence revolves around.

A data team runs a Monday business review built by a notebook pipeline. The output is an HTML page with revenue, retention, and funnel cells. The CFO comments on the same paragraph every Monday; the analyst answers in-thread instead of in a four-message Slack tangent. The history of those threads is the history of the metric.

A compliance team receives a quarterly board report that an agent assembles from policy logs. Every paragraph that changed since the last quarter is highlighted; the compliance officer signs off paragraph by paragraph. The signed-off snapshot is the legal artifact, not a screenshot in a deck.

Different reviewers, different cadences, the same four primitives.

What we're shipping next

The next release, R1, lands automatic changelogs on every snapshot. Today, when a routine reruns and posts a new revision, reviewers have to skim to find what moved. After R1, every snapshot opens with a plain-English summary of the diff: which numbers changed, which paragraphs were rewritten, which sections were added. The reviewer's first job stops being "where do I look" and starts being "do I agree with this." That is the right starting point for review.
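As a rough illustration of what such a changelog involves, the sketch below turns two snapshots into plain-English lines. It assumes each snapshot is a mapping from a stable block id to its text, and it uses a simple regular expression to tell "same wording, different figures" apart from a rewrite. The function name `changelog` and the example data are hypothetical; this is not a description of how the shipped feature works.

```python
import re

# Integers, decimals, and percentages such as 42, 3.5, or 91%.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?%?")
PLACEHOLDER = "\0"  # stands in for every number when comparing wording


def changelog(prev, curr):
    """Build plain-English changelog lines from two {block_id: text} snapshots."""
    lines = []
    for block_id in sorted(curr.keys() - prev.keys()):
        lines.append(f"Added section '{block_id}'.")
    for block_id in sorted(prev.keys() - curr.keys()):
        lines.append(f"Removed section '{block_id}'.")
    for block_id in sorted(prev.keys() & curr.keys()):
        old, new = prev[block_id], curr[block_id]
        if old == new:
            continue
        # Same wording with different figures: report the numbers that moved.
        if NUMBER.sub(PLACEHOLDER, old) == NUMBER.sub(PLACEHOLDER, new):
            pairs = zip(NUMBER.findall(old), NUMBER.findall(new))
            moved = [f"{a} -> {b}" for a, b in pairs if a != b]
            lines.append(f"Numbers changed in '{block_id}': {', '.join(moved)}.")
        else:
            lines.append(f"Paragraph '{block_id}' was rewritten.")
    return lines


# Made-up example data.
prev = {
    "summary": "Pass rate 91% on 1200 cases.",
    "method": "We sample rows at random.",
    "appendix": "Old notes.",
}
curr = {
    "summary": "Pass rate 88% on 1200 cases.",
    "method": "We now stratify rows by task.",
    "risks": "Latency is up.",
}

summary_lines = changelog(prev, curr)
for line in summary_lines:
    print(line)
```

The three kinds of line map onto the three things the summary is meant to report: numbers that changed, paragraphs that were rewritten, and sections that were added or removed.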

Close

The market has spent two years asking how to generate more. It is about to spend the next two asking whether what was generated is correct. The artifact at the center of that question is the AI-generated HTML report, and the primitive it has been missing is review. Comma is the review layer for AI reports. We believe that is the category, and we are going to act like it.