Context budgeting for AI reports

Most teams start optimizing LLM cost too late.

They look at the final answer, choose a cheaper model, shorten the prompt, and hope quality stays the same. That misses where the money actually goes in agentic workflows. For AI report agents, the expensive part is often not the last paragraph. It is the repeated context before the answer: prior runs, raw tables, dashboard exports, tool logs, reviewer comments, and old summaries that get sent back to the model every time.

That is the hidden cost of scheduled AI reporting. The report runs again. The agent fetches the same sources again. It reads the same context again. It summarizes the same stable facts again. Then humans leave feedback in Slack, email, or a doc, and the next run needs that feedback too.

This post is a practical framework for fixing that. Not by blindly compressing everything. Not by downgrading every call to the cheapest model. By adding a context budget to the report workflow.

The new cost problem: recurring context

The current AI cost conversation has shifted. Enterprises are no longer treating model usage as unlimited. Recent coverage has described companies adding internal AI caps, pushing teams toward smaller models, and using context engineering to control token consumption. Research is moving in the same direction: prompt caching, context compaction, small-model delegation, and cache-aware memory management are all attempts to solve the same problem.

For report agents, the problem is sharper because reports are repetitive by design.

A weekly churn report, sales readout, eval summary, support digest, or board prep memo usually has four types of context:

  1. Stable instructions: the format, audience, definitions, and source rules.
  2. Fresh data: this week's numbers, incidents, tickets, deals, or eval rows.
  3. Human feedback: comments from the last review cycle.
  4. Historical memory: previous conclusions, open questions, and unresolved decisions.

Only one of those changes fully every run. But many agents pay to resend all four.

A simple token budget model

Here is an illustrative model for a recurring AI report. The exact numbers will differ by team, but the shape is common.

| Workflow stage      | Naive tokens per run | Budgeted tokens per run | What changed                                |
|---------------------|----------------------|-------------------------|---------------------------------------------|
| Stable instructions | 8,000                | 8,000 (cached)          | Keep the prefix stable                      |
| Fresh data          | 16,000               | 12,000                  | Filter to changed rows and relevant slices  |
| Old comments        | 12,000               | 2,000                   | Pass unresolved actions only                |
| Prior report body   | 20,000               | 3,000                   | Store the report as a revision, pass deltas |
| Tool logs           | 6,000                | 1,000                   | Summarize after validation                  |
| Final report        | 5,000                | 5,000                   | Keep the output quality bar                 |
| Total               | 67,000               | 31,000                  | 54% fewer total tokens in this example      |

This is not a benchmark. It is a planning model. The lesson is simple: more than half the tokens are often not fresh data. In the naive column above, fresh data is 16,000 of 67,000 tokens, or about a quarter. The rest is instructions, old comments, prior report content, tool residue, and output. That is where context budgeting starts.
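To rerun that arithmetic with your own numbers, a minimal Python sketch is enough. The stage names and token counts below are the illustrative figures from the table, not measurements.

```python
# Illustrative per-stage token counts from the planning table:
# stage -> (naive tokens per run, budgeted tokens per run).
STAGES = {
    "stable_instructions": (8_000, 8_000),
    "fresh_data": (16_000, 12_000),
    "old_comments": (12_000, 2_000),
    "prior_report_body": (20_000, 3_000),
    "tool_logs": (6_000, 1_000),
    "final_report": (5_000, 5_000),
}


def totals(stages):
    """Sum the naive and budgeted columns."""
    naive = sum(n for n, _ in stages.values())
    budgeted = sum(b for _, b in stages.values())
    return naive, budgeted


def reduction_pct(naive, budgeted):
    """Percentage of tokens removed, rounded to a whole number."""
    return round(100 * (naive - budgeted) / naive)


naive_total, budgeted_total = totals(STAGES)
saving = reduction_pct(naive_total, budgeted_total)
# Share of the naive run that is anything other than fresh data.
non_fresh_share = 1 - STAGES["fresh_data"][0] / naive_total

print(naive_total, budgeted_total, saving, round(non_fresh_share, 2))
```

Swap in the per-stage counts from your own runs before drawing conclusions; the shape matters more than these specific numbers.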

Do not compress blindly

The lazy answer is: compress the prompt.

That is dangerous. Prompt compression can remove the exact thing the model needs to preserve quality: a definition, an exception, a reviewer objection, a known false positive, a row-level caveat. In reporting workflows, small omissions become leadership mistakes.

A better rule:

Compress context only after you know what job that context is doing.

For AI reports, split context into five buckets:

| Bucket              | Keep as-is?  | Better treatment                              |
|---------------------|--------------|-----------------------------------------------|
| Report format rules | Usually yes  | Put in a stable cached prefix                 |
| Fresh source data   | Sometimes    | Filter to changed rows and relevant slices    |
| Prior full report   | Rarely       | Store as a revision, pass only deltas         |
| Human comments      | No           | Pass unresolved comments and decisions only   |
| Tool logs           | Almost never | Summarize or drop after validation            |

That table is the operating model. You are not trying to make every prompt short. You are trying to make every token earn its place.

The cache-friendly report loop

Recent research on prompt caching for long-horizon agents points to an unintuitive finding: caching helps only when you avoid breaking the cache. If each run rewrites the top of the prompt, injects dynamic tool output into the wrong place, or mutates long stable sections, you pay for more tokens and lose latency benefits.

For report agents, the pattern should look like this:

Stable prefix
- role
- audience
- report structure
- definitions
- formatting rules
- source rules

Dynamic block
- fresh data only
- unresolved comments only
- changed assumptions only
- links to report revisions

Output
- HTML report
- short changelog
- open questions

The stable prefix should move as little as possible. The dynamic block should be small and explicit. The output should become an artifact, not another chunk of prompt history.
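One way to keep that discipline is to assemble the prompt in code, so the stable prefix is a constant and only the dynamic block varies. The sketch below assumes your provider's cache matches on an exact leading prefix. The prefix text, field names, and example URLs are all hypothetical.

```python
import hashlib
import json

# Hypothetical stable prefix: role, audience, structure, definitions, rules.
STABLE_PREFIX = "\n".join([
    "Role: analyst writing the weekly churn report.",
    "Audience: leadership team.",
    "Structure: summary, key metrics, open questions.",
    "Definitions: churn = cancelled accounts / active accounts at week start.",
    "Formatting: output a single HTML document.",
    "Sources: use only the data in the dynamic block.",
])


def prefix_fingerprint(prefix: str) -> str:
    """Short hash of the prefix; log it per run to notice accidental edits."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


def build_prompt(fresh_data, unresolved_comments, changed_assumptions, revision_links):
    """Append the dynamic block after the untouched stable prefix."""
    dynamic_block = {
        "fresh_data": fresh_data,
        "unresolved_comments": unresolved_comments,
        "changed_assumptions": changed_assumptions,
        "report_revisions": revision_links,
    }
    # sort_keys keeps serialization deterministic for identical inputs.
    return STABLE_PREFIX + "\n\n" + json.dumps(dynamic_block, sort_keys=True, indent=2)


week_1 = build_prompt({"churn_pct": 4.1}, [], [],
                      ["https://example.com/reports/churn?rev=1"])
week_2 = build_prompt({"churn_pct": 4.6}, ["explain the jump"], [],
                      ["https://example.com/reports/churn?rev=2"])
print(prefix_fingerprint(STABLE_PREFIX))
```

If the fingerprint changes between runs, something edited the prefix, and that is worth knowing before the bill arrives.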

That last part is where most teams are sloppy. They treat the model output as disposable text. Then the next run needs to paste it back in because nobody knows where the real artifact lives.

A better pattern is artifact-first reporting:

  1. The agent generates the report as HTML.
  2. The HTML is published to a stable URL.
  3. Each refresh becomes a new revision on the same report.
  4. Humans comment on exact paragraphs, rows, and table cells.
  5. The next run reads only unresolved comments and material deltas.
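The five steps above can be sketched as a tiny in-memory model. This is not Comma's API. The class names, method names, and anchor strings are hypothetical and only illustrate the shape: one stable URL, many revisions, comments anchored to the artifact.

```python
from dataclasses import dataclass, field


@dataclass
class Comment:
    anchor: str            # hypothetical id of a paragraph, row, or table cell
    text: str
    resolved: bool = False


@dataclass
class ReportArtifact:
    url: str                                        # stable across refreshes
    revisions: list = field(default_factory=list)   # HTML of each revision
    comments: list = field(default_factory=list)

    def publish(self, html: str) -> int:
        """Store a new revision and return its 1-based revision number."""
        self.revisions.append(html)
        return len(self.revisions)

    def comment(self, anchor: str, text: str) -> Comment:
        """Attach a reviewer comment to a location in the report."""
        c = Comment(anchor, text)
        self.comments.append(c)
        return c

    def next_run_inputs(self) -> dict:
        """What the next run reads: a pointer plus unresolved comments only."""
        return {
            "report_url": self.url,
            "latest_revision": len(self.revisions),
            "unresolved": [(c.anchor, c.text) for c in self.comments if not c.resolved],
        }


report = ReportArtifact(url="https://example.com/reports/weekly-churn")
report.publish("<h1>Weekly churn</h1><p>Churn: 4.1%</p>")
first = report.comment("summary.p1", "Define churn in the intro.")
report.comment("metrics.row.churn", "Why did this move?")
first.resolved = True
report.publish("<h1>Weekly churn</h1><p>Churn: 4.6%</p>")
inputs = report.next_run_inputs()
print(inputs)
```

Note what is absent from `inputs`: the report body itself. The next prompt gets a pointer and the open items, nothing more.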

Comma is built around this pattern. It renders the HTML report faithfully, keeps the review conversation anchored to the artifact, and gives agents an MCP/API surface to publish revisions. The sales pitch is not "save tokens by using Comma." The more accurate version is: stop making the prompt carry work that belongs in the report artifact.

A practical implementation guide

1. Add a token ledger

Log token usage by stage, not just by request.

At minimum, track:

  • instruction tokens
  • retrieved data tokens
  • prior-report tokens
  • comment and feedback tokens
  • tool-output tokens
  • output tokens

If you only track total cost, you will optimize the wrong thing.
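A ledger like this does not need to be elaborate. Here is a minimal sketch, assuming you already get token counts from your provider's usage metadata or a tokenizer. The stage names mirror the list above, and the class itself is hypothetical.

```python
from collections import defaultdict

# Stage names follow the tracking list above.
STAGES = ("instructions", "retrieved_data", "prior_report",
          "comments", "tool_output", "output")


class TokenLedger:
    """Accumulates token counts per run and per stage."""

    def __init__(self):
        self._runs = defaultdict(lambda: dict.fromkeys(STAGES, 0))

    def record(self, run_id: str, stage: str, tokens: int) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self._runs[run_id][stage] += tokens

    def by_stage(self, run_id: str) -> dict:
        return dict(self._runs[run_id])

    def total(self, run_id: str) -> int:
        return sum(self._runs[run_id].values())

    def largest_stage(self, run_id: str) -> str:
        stages = self._runs[run_id]
        return max(stages, key=stages.get)


ledger = TokenLedger()
# Illustrative counts only, taken from the naive column of the planning model.
for stage, tokens in [("instructions", 8_000), ("retrieved_data", 16_000),
                      ("prior_report", 20_000), ("comments", 12_000),
                      ("tool_output", 6_000), ("output", 5_000)]:
    ledger.record("run-1", stage, tokens)

print(ledger.total("run-1"), ledger.largest_stage("run-1"))
```

With these illustrative numbers the largest stage is the prior report, not the data, which is exactly the kind of thing a total-cost number hides.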

2. Budget each stage

Set a target budget per run. For example:

Report instructions: 8k cached tokens
Fresh data: 12k to 20k tokens
Human feedback: 2k tokens
Historical memory: 1k to 4k tokens
Output: 3k to 6k tokens

The numbers are placeholders. The point is to make budget overruns visible. A report that suddenly needs 40k tokens of historical memory is usually not smarter. It is usually leaking old context.
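Making overruns visible can be a few lines of code. This sketch uses the upper end of each placeholder range above as a ceiling. The stage keys and the function name are hypothetical.

```python
# Ceilings taken from the upper end of the placeholder ranges above.
BUDGET_CEILING = {
    "instructions": 8_000,
    "fresh_data": 20_000,
    "human_feedback": 2_000,
    "historical_memory": 4_000,
    "output": 6_000,
}


def find_overruns(usage: dict, ceilings: dict = BUDGET_CEILING) -> dict:
    """Return {stage: tokens over budget} for every stage above its ceiling."""
    overruns = {}
    for stage, used in usage.items():
        ceiling = ceilings.get(stage)
        if ceiling is not None and used > ceiling:
            overruns[stage] = used - ceiling
    return overruns


# Illustrative run where historical memory has started leaking old context.
usage = {
    "instructions": 8_000,
    "fresh_data": 14_000,
    "human_feedback": 1_500,
    "historical_memory": 40_000,
    "output": 4_000,
}
overruns = find_overruns(usage)
print(overruns)
```

Whether an overrun should fail the run or just raise an alert is a team decision. The sketch only reports it.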

3. Separate "changed" from "available"

Dashboards, warehouses, and ticket systems can return everything. The model should not read everything.

Before the report call, add a cheap filtering step:

  • what changed since the last run?
  • what crossed a threshold?
  • what did a reviewer ask about?
  • what source is newly relevant?
  • what can be linked instead of pasted?

This is where small models, deterministic code, and simple SQL often beat frontier models.
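As a deterministic example of that filtering step, the sketch below compares the last snapshot with the current one and keeps only rows that are new, moved past a relative threshold, or were asked about by a reviewer. The metric names, the 10% threshold, and the function name are assumptions for illustration.

```python
def select_rows(previous: dict, current: dict, threshold: float = 0.10,
                asked_about: frozenset = frozenset()) -> dict:
    """Keep rows that are new, moved more than `threshold`, or were asked about."""
    selected = {}
    for key, value in current.items():
        old = previous.get(key)
        if key in asked_about:
            reason = "reviewer asked"
        elif old is None:
            reason = "new"
        elif old != value and (old == 0 or abs(value - old) / abs(old) > threshold):
            reason = "crossed threshold"
        else:
            continue  # unchanged or within tolerance: link it, do not paste it
        selected[key] = {"previous": old, "current": value, "reason": reason}
    return selected


# Hypothetical funnel metrics from two consecutive runs.
previous = {"signup_to_trial": 0.42, "trial_to_paid": 0.18, "checkout_to_payment": 0.90}
current = {"signup_to_trial": 0.43, "trial_to_paid": 0.18,
           "checkout_to_payment": 0.72, "mobile_checkout": 0.65}

rows = select_rows(previous, current, asked_about=frozenset({"trial_to_paid"}))
for name, info in rows.items():
    print(name, info["reason"])
```

No model call is involved. The report prompt then receives three rows instead of the full export.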

4. Keep comments structured

Do not feed the model a Slack thread dump.

A useful comment object looks like this:

{
  "anchor": "conversion_table.row.checkout_to_payment",
  "comment": "Is this drop caused by the mobile bug from June 18?",
  "status": "unresolved",
  "requested_action": "check bug impact before final summary"
}

That gives the next run a decision, not noise.
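Turning a list of those objects into prompt input can look like the following sketch. It assumes the four fields shown above are required and that `status` is either "unresolved" or "resolved". The second comment and the output line format are invented for illustration.

```python
REQUIRED_FIELDS = {"anchor", "comment", "status", "requested_action"}


def unresolved_actions(comments: list) -> list:
    """Validate comment objects and render only unresolved ones as short lines."""
    actions = []
    for c in comments:
        missing = REQUIRED_FIELDS - c.keys()
        if missing:
            raise ValueError(f"comment missing fields: {sorted(missing)}")
        if c["status"] == "unresolved":
            actions.append(
                f'[{c["anchor"]}] {c["requested_action"]} '
                f'(reviewer asked: {c["comment"]})'
            )
    return actions


comments = [
    {
        "anchor": "conversion_table.row.checkout_to_payment",
        "comment": "Is this drop caused by the mobile bug from June 18?",
        "status": "unresolved",
        "requested_action": "check bug impact before final summary",
    },
    {
        # Hypothetical resolved comment; it should not reach the next prompt.
        "anchor": "summary.p2",
        "comment": "Please spell out the churn definition.",
        "status": "resolved",
        "requested_action": "add definition",
    },
]

actions = unresolved_actions(comments)
print("\n".join(actions))
```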

5. Publish the output as the memory layer

The full report should not be pasted into the next prompt unless the agent truly needs to quote or compare it. Store it as a revision. Pass a short changelog and unresolved review items instead.
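A short changelog can often be derived mechanically from two stored revisions. The sketch below uses Python's standard `difflib` and assumes each revision has already been reduced to plain-text lines. Diffing raw HTML would be noisier, and the sample report lines are invented.

```python
import difflib


def changelog(previous_text: str, current_text: str, max_lines: int = 20) -> list:
    """Return only the removed ('- ') and added ('+ ') lines between revisions."""
    diff = difflib.ndiff(previous_text.splitlines(), current_text.splitlines())
    # ndiff also emits unchanged ('  ') and hint ('? ') lines; drop those.
    changes = [line for line in diff if line.startswith(("+ ", "- "))]
    return changes[:max_lines]


# Hypothetical plain-text renderings of two report revisions.
revision_1 = "Churn: 4.1%\nOpen question: pricing page impact"
revision_2 = "Churn: 4.6%\nOpen question: pricing page impact"

changes = changelog(revision_1, revision_2)
print("\n".join(changes))
```

The unchanged open question never re-enters the prompt. Only the line that moved does.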

This is the product lesson behind Comma: HTML reports are already structured artifacts. Treat them like documents. Let people comment on them. Let agents revise them. Do not turn every report back into a prompt blob.

Where routers and compression still fit

Context budgeting does not replace model routing or prompt compression. It makes them safer.

A router can decide:

  • use a small model to classify whether a report changed materially
  • use a cheaper model to summarize comments
  • use a frontier model only for synthesis, judgment, or executive narrative
  • retry with a stronger model when confidence is low
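Those routing rules fit in a small lookup table. In this sketch the tier names, task labels, and the 0.7 confidence floor are all hypothetical, and how you obtain a confidence score is left open.

```python
# Hypothetical tier names, ordered from cheapest to strongest.
TIERS = ["small", "mid", "frontier"]

# Hypothetical task labels mapped to the cheapest tier expected to handle them.
DEFAULT_TIER = {
    "classify_material_change": "small",
    "summarize_comments": "mid",
    "synthesis": "frontier",
    "judgment": "frontier",
    "executive_narrative": "frontier",
}


def route(task: str, last_confidence: float = None, min_confidence: float = 0.7) -> str:
    """Pick a tier for a task; step up one tier if the last attempt was unsure."""
    tier = DEFAULT_TIER[task]
    if last_confidence is not None and last_confidence < min_confidence:
        stronger = min(TIERS.index(tier) + 1, len(TIERS) - 1)
        tier = TIERS[stronger]
    return tier


print(route("classify_material_change"))
print(route("classify_material_change", last_confidence=0.4))
print(route("synthesis", last_confidence=0.4))
```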

Compression can decide:

  • shorten old comments after they are resolved
  • distill prior reports into deltas
  • trim tool outputs after validation
  • preserve exact values, definitions, and exceptions
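The last bullet is the one worth enforcing in code. A coarse guard, sketched below, checks that every number in the original text still appears in the compressed version, along with any terms you mark as protected. The regex, the function name, and the sample sentences are assumptions, and a substring check like this can miss cases, so treat it as a tripwire rather than a proof.

```python
import re

# Matches integers, decimals, and percentages such as 18, 4.6, 1,200, 72%.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*%?")


def dropped_facts(original: str, compressed: str, protected_terms=()) -> list:
    """List numbers and protected terms present in `original` but not in `compressed`."""
    numbers = set(NUMBER.findall(original))
    missing = sorted(n for n in numbers if n not in compressed)
    missing += [t for t in protected_terms if t in original and t not in compressed]
    return missing


# Invented example text.
original = ("Checkout conversion fell from 90% to 72% after June 18. "
            "Exception: enterprise accounts are excluded.")
compressed = "Checkout conversion fell to 72%. Enterprise accounts are excluded."

missing = dropped_facts(original, compressed, protected_terms=("Exception",))
print(missing)
```

If the list is non-empty, keep the original text for that stage instead of the compressed one.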

The mistake is applying those techniques globally. The right move is to apply them at the stage where they are least likely to damage quality.

The takeaway

LLM cost optimization is not only a model-choice problem. For report agents, it is a workflow-design problem.

If your agent re-reads the same dashboard, same old report, same Slack thread, and same instructions every run, a cheaper model will only hide the waste for a while. The durable fix is to budget context like infrastructure:

  • keep stable instructions cache-friendly
  • pass fresh data, not all available data
  • convert human feedback into structured unresolved actions
  • store reports as artifacts and revisions
  • route and compress by workflow stage

The future of AI reporting is not a longer prompt. It is a smaller prompt connected to a better artifact.

