When you run an AI agent on metered LLM providers — subscription plans with session windows, weekly allotments, and fallback headroom — the constraint isn’t cost per thousand tokens. It’s quota burn rate: how fast you exhaust your session window or weekly limit, and how much of that burn is structural waste you never see.

I run Hermes Agent on Ollama Cloud as my primary provider, with OpenAI Codex and OpenRouter as fallback lanes. Ollama Cloud has a ~5-hour session limit and a weekly allotment that resets Sunday evening. There’s no live remaining-quota API. When you hit the wall, the agent stops working until the window resets.

This post documents the audit methodology I used to find and fix structural waste — and includes a reusable prompt so you can run the same audit on your own setup.

Why Not Per-Token Cost

The instinct is to optimize cents-per-1K-tokens. That’s the wrong frame when you’re on a subscription plan.

If your provider charges $0.001 per 1K tokens but caps you at 5 million tokens per session window, the real question is: how much of that 5M is wasted on structural overhead before the agent even starts reasoning?

Every tool schema injected into the prompt, every imported skill catalog, every silent auxiliary model call — those consume quota on every turn, including trivial chats where you just ask “what time is it?”

The goal is quota preservation: finding and eliminating structural waste so your limited metered budget goes further toward actual reasoning work.
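A rough sketch of that arithmetic, to make the framing concrete. The 5M-token window is the illustrative cap from above, and the 17 tools and 137 skills are the counts from my own setup described later in this post. The per-schema and per-skill token sizes and the turn count are hypothetical placeholders, not measurements.

```python
# Back-of-the-envelope estimate of structural overhead per session window.
# Per-item token sizes and the turn count are hypothetical placeholders.

SESSION_WINDOW_TOKENS = 5_000_000  # illustrative cap from the example above


def structural_overhead(tool_count: int, tokens_per_schema: int,
                        skill_count: int, tokens_per_skill: int) -> int:
    """Tokens paid on every turn before the agent starts reasoning."""
    return tool_count * tokens_per_schema + skill_count * tokens_per_skill


def wasted_fraction(overhead_per_turn: int, turns: int,
                    window: int = SESSION_WINDOW_TOKENS) -> float:
    """Share of the session window consumed by structural overhead alone."""
    return min(1.0, overhead_per_turn * turns / window)


# Assumed sizes: 400 tokens per tool schema, 40 tokens per skill description.
before = structural_overhead(17, 400, 137, 40)
after = structural_overhead(9, 400, 0, 40)

print(f"per-turn overhead: {before} -> {after} tokens")
print(f"window share at 200 turns: {wasted_fraction(before, 200):.1%} "
      f"-> {wasted_fraction(after, 200):.1%}")
```

Swap in sizes measured from your own prompts. The point is only that overhead scales with turn count, so it is paid even when the turns themselves are trivial.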

The Six Audit Areas

The areas are ranked by leverage: structural fixes that save quota on every turn beat micro-optimizations that save tokens on specific calls.

| # | Area | Why It Drains Quota | Fix Pattern |
|---|------|---------------------|-------------|
| 1 | Default toolset breadth | Every tool schema is injected into every prompt, even for trivial chats. 17 tools = 17 schema blocks paid per turn. | Cut always-on toolsets to ~9 daily-driver tools; load the rest on demand. |
| 2 | Cross-profile skill imports | Imported skill catalogs (e.g. 137 skills from another profile) inflate the skill list on the paid path, even if never used. | Remove blanket external_dirs imports; import only specific skills if needed. |
| 3 | Auxiliary model routing | Helper tasks (vision, compression, skills_hub, session_search, approval, mcp) silently consume paid quota in the background. | Route each auxiliary to the cheapest capable lane: local for lightweight helpers, cheap-cloud for quality-sensitive ones, stronger cloud only for vision. |
| 4 | Recurring agentic cron jobs | Background LLM-driven jobs burn quota without an active chat session and are easy to forget. | Convert deterministic jobs to script-only (no_agent) execution; keep reasoning jobs only where reasoning is genuinely needed; move remaining reasoning jobs to cheaper models. |
| 5 | Fallback exposure | Each fallback event uses a different paid provider. Frequent fallback = unexpected spend. | Evaluate actual fallback frequency; use cheapest acceptable fallback route. |
| 6 | Compression tuning | Lower priority: compression is already doing useful work. | Tune only after structural fixes above. |

Key Principles

These principles guide every decision in the audit:

  • Structural wins beat micro-tweaks. Reducing always-on tool breadth saves more than prompt phrasing.
  • Cheapest acceptable, not cheapest. Quality matters; downshift only where the drop is unnoticeable.
  • Measure before changing. Pull fresh-cycle usage data; don’t mix old and new billing windows.
  • Preserve justified reasoning. Don’t strip reasoning from jobs that actually need it.
  • Verify through the real execution path. A config edit is not proof — trigger an actual run.
  • Keep resilience explicit. Fallbacks exist for a reason; evaluate failure modes before removing them.

What I Found and Fixed

In my Hermes Agent setup, three structural changes had the biggest impact:

1. Toolset Narrowing (17 → 9)

The default profile had 17 always-on toolsets. I cut to 9 daily-driver tools (file, terminal, web, memory, session_search, skills, clarify, todo, messaging). The other 8 (code_execution, computer_use, cronjob, delegation, image_gen, kanban, tts, vision) are loaded on demand when the task actually needs them.

Result: Lower per-turn paid prompt overhead on every trivial chat, with no friction on normal daily admin work.
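The split can be sketched as two explicit lists plus a small selector. The toolset names are the ones listed above; the `toolsets_for_turn` helper is hypothetical and only illustrates the idea, since how your agent framework actually loads tools on demand will differ.

```python
# Sketch of an always-on core plus on-demand extras.
# Toolset names are from this post; the selector function is hypothetical.

ALWAYS_ON = ("file", "terminal", "web", "memory", "session_search",
             "skills", "clarify", "todo", "messaging")

ON_DEMAND = ("code_execution", "computer_use", "cronjob", "delegation",
             "image_gen", "kanban", "tts", "vision")


def toolsets_for_turn(requested=()):
    """Return the always-on core plus any explicitly requested extras."""
    wanted = set(requested)
    unknown = wanted - set(ALWAYS_ON) - set(ON_DEMAND)
    if unknown:
        # Fail loudly so a typo never silently drops a needed tool.
        raise ValueError(f"unknown toolsets: {sorted(unknown)}")
    extras = [name for name in ON_DEMAND if name in wanted]
    return list(ALWAYS_ON) + extras


print(len(toolsets_for_turn()))                    # trivial chat: core only
print(toolsets_for_turn(["vision", "tts"])[-2:])   # extras appended on demand
```

Keeping both lists explicit makes the paid baseline auditable: anything not in `ALWAYS_ON` costs nothing until a task asks for it.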

2. Skill Import Cleanup (137 → 0)

I had another profile’s entire skill tree (137 skills, 130 duplicate names) imported into the default paid profile via skills.external_dirs. Those skills inflated the skill catalog on every paid turn even though most were never used. Removing the blanket import dropped 137 skill descriptions from the prompt baseline.
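A minimal sketch for sizing up an import like this before removing it. It assumes each skill lives in its own subdirectory named after the skill, which is a hypothetical layout; adapt `skill_names` to however your framework stores skills.

```python
# Sketch: compare a profile's own skills against an imported skill tree.
# Assumes one subdirectory per skill (hypothetical layout).
from pathlib import Path


def skill_names(root):
    """List skill names under a directory, one subdirectory per skill."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_dir())


def import_report(local, imported):
    """Summarize how much of an import duplicates what is already local."""
    local_set, imported_set = set(local), set(imported)
    duplicates = sorted(local_set & imported_set)
    return {
        "imported": len(imported_set),
        "duplicates": len(duplicates),
        "new": len(imported_set - local_set),
        "duplicate_names": duplicates,
    }


# Toy example with made-up skill names.
report = import_report(["git", "notes"], ["git", "notes", "deploy"])
print(report)
```

If nearly everything in the import is a duplicate, the blanket import buys little and the few genuinely new skills can be imported individually.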

3. Auxiliary Rerouting (7 paths)

Seven auxiliary helper tasks were silently consuming paid cloud quota. I rerouted them to a hybrid strategy:

| Helper | Before | After |
|--------|--------|-------|
| skills_hub | Ollama Cloud | Local llama.cpp |
| approval | Ollama Cloud | Local llama.cpp |
| mcp | Ollama Cloud | Local llama.cpp |
| flush_memories | Ollama Cloud | Local llama.cpp |
| session_search | Ollama Cloud (expensive model) | Ollama Cloud (cheap model) |
| compression | Ollama Cloud (expensive model) | Ollama Cloud (cheap model) |
| vision | Ollama Cloud (expensive model) | Ollama Cloud (capable vision model) |

The lightweight helpers moved to a local llama.cpp instance (free, no quota). Quality-sensitive helpers stayed on cloud but on cheaper model tiers. Vision stayed on cloud because modality demands it.
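The routing can be written down as a plain lookup table. The helper names come from the table above; the lane labels and the two functions are hypothetical placeholders, not Hermes configuration keys.

```python
# Sketch of the hybrid routing as a lookup table.
# Lane labels are hypothetical placeholders, not real config values.

LOCAL = "local/llama.cpp"
CLOUD_CHEAP = "ollama-cloud/cheap-model"
CLOUD_VISION = "ollama-cloud/vision-model"

AUX_ROUTES = {
    "skills_hub": LOCAL,
    "approval": LOCAL,
    "mcp": LOCAL,
    "flush_memories": LOCAL,
    "session_search": CLOUD_CHEAP,
    "compression": CLOUD_CHEAP,
    "vision": CLOUD_VISION,
}


def lane_for(helper: str) -> str:
    """Look up a helper's lane; unknown helpers raise KeyError on purpose."""
    return AUX_ROUTES[helper]


def metered_helpers():
    """Helpers that still consume paid quota after rerouting."""
    return sorted(h for h, lane in AUX_ROUTES.items() if lane != LOCAL)


print(lane_for("approval"))
print(metered_helpers())
```

Raising on an unknown helper is a deliberate choice here: a newly added auxiliary task should get an explicit routing decision instead of defaulting to the paid lane.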

Observability

You can’t fix what you can’t see. I ship Hermes logs to Loki via Promtail and built a Grafana dashboard on top of them that visualizes:

  • Session burn (last 5h) — am I exhausting my session window?
  • Weekly burn since Sunday reset — how much of my weekly allotment is gone?
  • Burn by agent — is Ghost (orchestrator) doing specialist work it shouldn’t?
  • Fallback leakage — how often is OpenAI fallback firing?
  • Cron/background drain — which background jobs are quietly burning quota?
  • Cumulative tokens by provider — where is spend concentrating?

Token counts are an operational proxy, not billing truth — there’s no authoritative remaining-quota API for Ollama Cloud. But proxy telemetry is sufficient for decision support as long as the dashboard is labeled honestly.
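The window math behind the first two panels can be sketched in a few lines. This is not the dashboard's actual query logic: the `Call` record shape is hypothetical, the 18:00 reset hour is an assumed stand-in for "Sunday evening", timestamps are naive local times, and the sliding 5-hour window only approximates the provider's session window.

```python
# Sketch: proxy burn metrics from parsed log records.
# Record shape and the 18:00 Sunday reset hour are assumptions.
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Call:
    """One LLM call parsed from the agent log (hypothetical record shape)."""
    at: datetime
    provider: str
    tokens: int
    fallback: bool = False


def last_weekly_reset(now: datetime, reset_hour: int = 18) -> datetime:
    """Most recent Sunday at reset_hour that is not after `now`."""
    days_back = (now.weekday() - 6) % 7  # Monday=0 ... Sunday=6
    candidate = (now - timedelta(days=days_back)).replace(
        hour=reset_hour, minute=0, second=0, microsecond=0)
    if candidate > now:
        candidate -= timedelta(days=7)
    return candidate


def burn(calls, since: datetime, until: datetime) -> int:
    """Total tokens for calls with since < at <= until."""
    return sum(c.tokens for c in calls if since < c.at <= until)


def session_burn(calls, now: datetime, hours: int = 5) -> int:
    """Tokens used in the trailing `hours` window."""
    return burn(calls, now - timedelta(hours=hours), now)


def weekly_burn(calls, now: datetime, reset_hour: int = 18) -> int:
    """Tokens used since the most recent weekly reset."""
    return burn(calls, last_weekly_reset(now, reset_hour), now)


def fallback_share(calls) -> float:
    """Fraction of all tokens that went through a fallback provider."""
    total = sum(c.tokens for c in calls)
    leaked = sum(c.tokens for c in calls if c.fallback)
    return leaked / total if total else 0.0


def tokens_by_provider(calls) -> Counter:
    """Cumulative tokens per provider."""
    totals = Counter()
    for c in calls:
        totals[c.provider] += c.tokens
    return totals


# Made-up sample records for illustration.
now = datetime(2024, 1, 10, 12, 0)  # a Wednesday
calls = [
    Call(datetime(2024, 1, 10, 11, 0), "ollama-cloud", 3000),
    Call(datetime(2024, 1, 10, 9, 30), "openai", 1000, fallback=True),
    Call(datetime(2024, 1, 8, 9, 0), "ollama-cloud", 5000),
    Call(datetime(2024, 1, 6, 9, 0), "ollama-cloud", 7000),
]
print(session_burn(calls, now), weekly_burn(calls, now))
print(f"{fallback_share(calls):.2%}", dict(tokens_by_provider(calls)))
```

Because these numbers come from your own logs, treat them as trends and thresholds for decisions, not as the provider's accounting.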

Reusable Audit Prompt

Here’s the prompt I wrote so anyone can run the same audit on their own setup. Paste it into your AI agent (Hermes, Claude, ChatGPT, etc.) and fill in the environment details at the bottom.

You are auditing my AI agent setup for **metered-quota preservation**.
The goal is not per-token cost minimization — it is preserving limited
quota (session windows, weekly allotments, fallback headroom) by finding
and fixing structural waste.

### Step 1 — Inventory the metered paths

Read my agent config and identify every path that consumes paid LLM
quota. For each, record:

- **Route name** (e.g. "primary interactive model", "vision helper",
  "compression helper", "fallback", "cron job: morning summary")
- **Provider + model** (e.g. `ollama-cloud / glm-5.2`,
  `openai / gpt-4o-mini`)
- **Paid or free** — is this a metered provider or a local/free lane?
- **Frequency** — per-turn (always-on), per-job (recurring cron), or
  on-demand (fallback / conditional)
- **Tool/schema overhead** — how many tool schemas are injected into the
  prompt on every turn? List them.

### Step 2 — Rank by quota leverage

Rank all metered paths from highest structural drain to lowest. Use
these criteria:

1. **Always-on cost** (paid on every turn, even trivial chats) beats
   on-demand cost
2. **Background cron cost** (paid without an active session) beats
   interactive cost
3. **Silent auxiliary cost** (helper calls the user doesn't see) beats
   visible interactive cost
4. **Fallback frequency** — how often does fallback actually fire?

### Step 3 — Find structural optimizations

For each high-leverage path, propose a fix. Categorize each as:

- **Toolset narrowing** — remove always-on tools that aren't needed on
  most turns; load them on demand instead
- **Skill import cleanup** — remove cross-profile or external skill
  directories that inflate the paid prompt with unused skill catalogs
- **Auxiliary rerouting** — move helper tasks (vision, compression,
  skills_hub, session_search, approval, mcp) to the cheapest capable
  lane:
  1. local model (llama.cpp, Ollama local) for lightweight helpers
  2. cheap cloud model for quality-sensitive helpers
  3. stronger cloud only where modality or quality justifies it
- **Cron conversion** — convert deterministic background jobs to
  script-only (`no_agent`) execution; keep reasoning jobs only where
  reasoning is genuinely needed; move remaining reasoning jobs to
  cheaper models
- **Fallback right-sizing** — evaluate actual fallback frequency and
  use the cheapest acceptable fallback route
- **Compression tuning** — only after all structural fixes above

### Step 4 — Validate before recommending

For each proposed change, answer:

- Is this change **reversible**?
- What is the **risk of regression**?
- Is there a **cheaper acceptable substitute**, or am I just picking the
  cheapest regardless of quality?
- Does **fallback/resilience** remain adequate after the change?
- Is there **evidence of actual spend concentration** on this path, or am
  I guessing?

### Step 5 — Produce the audit document

Write a markdown document with these sections:

1. **Scope** — what is included and excluded in this audit
2. **Current metered paths** — the full inventory from Step 1
3. **Findings** — ranked structural drains with evidence
4. **Recommended priority order** — numbered list, highest leverage first
5. **Concrete proposals** — for each optimization: what to change, why,
   expected effect, risk, and implementation approach
6. **Implementation status** — mark each as proposed / implemented / needs
   verification
7. **Remaining optimizations to revisit** — things to check after the
   next billing cycle with fresh data
8. **Open questions** — unknowns that need measurement or provider
   research
9. **Changelog** — timestamped log of what was changed and why

### Rules

- **Measure before changing.** Pull fresh-cycle usage data; don't mix
  old and new billing windows.
- **Structural wins beat micro-tweaks.** Reducing always-on tool
  breadth saves more than prompt phrasing.
- **Cheapest acceptable, not cheapest.** Quality matters; downshift
  only where the drop is unnoticeable.
- **Preserve justified reasoning.** Don't strip reasoning from jobs that
  genuinely need it.
- **Verify through the real execution path.** A config edit is not
  proof — trigger an actual run.
- **Keep resilience explicit.** Fallbacks exist for a reason; evaluate
  failure modes before removing them.
- **Don't leak secrets.** Never echo API keys or token fragments.

### Environment details to provide

Fill these in before running the audit:

Agent framework:       (e.g. Hermes Agent, Claude Code, custom)
Config file path:      (e.g. ~/.hermes/config.yaml)
Primary provider:      (e.g. Ollama Cloud, OpenAI, OpenRouter)
Primary model:         (e.g. glm-5.2, gpt-4o)
Fallback provider(s):  (e.g. OpenAI Codex, OpenRouter)
Fallback model(s):     (e.g. gpt-5.4-mini, deepseek-chat-v3.1)
Local model server:    (e.g. llama.cpp on port 8082, Ollama local)
Auxiliary routes:      (list any — vision, compression, session_search,
                        skills_hub, approval, mcp, flush_memories)
Cron jobs:             (list recurring jobs and whether they use an LLM)
Tool count:            (how many tool schemas are always-on?)
Skill imports:         (any cross-profile or external skill dirs?)

Lessons

  1. The biggest drains are invisible. Tool schemas, imported skill catalogs, and silent auxiliary calls consume quota on every turn without any visible signal.
  2. Structure beats phrasing. Cutting 8 tools from the always-on set saves more quota than any prompt-level optimization.
  3. Local is quota-free. If you have a local model server (llama.cpp, Ollama local), route lightweight helper tasks there. In my setup the quality difference for tasks like approval checks or skill lookups was negligible, and the quota savings recur on every turn.
  4. Observability is non-negotiable. Without a dashboard showing burn by agent, provider, and time window, you’re guessing. Proxy telemetry from your own logs is sufficient — you don’t need a provider-side quota API to make good decisions.
  5. Revisit after each billing cycle. Run the audit once to find structural waste, then re-measure after the next quota reset to verify the fixes actually moved the needle.
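For that re-measurement, comparing raw totals across cycles can mislead if usage changed, so one simple option is to normalize by turn count first. A minimal sketch, with made-up numbers standing in for two billing cycles:

```python
# Sketch: compare quota burn across two cycles, normalized per turn.
# The token and turn counts below are made-up illustrative values.


def tokens_per_turn(total_tokens: int, turns: int) -> float:
    """Average tokens consumed per turn in one billing cycle."""
    if turns <= 0:
        raise ValueError("turns must be positive")
    return total_tokens / turns


def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after` (negative means savings)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (after - before) / before


before_cycle = tokens_per_turn(2_400_000, 200)
after_cycle = tokens_per_turn(1_500_000, 250)
print(f"per-turn burn change: {relative_change(before_cycle, after_cycle):+.0%}")
```

Keep each cycle's data separate when computing these averages, in line with the rule about not mixing old and new billing windows.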