The Symptom

I spent some time chasing a frustrating failure mode in a self-hosted agent stack: the model was clearly alive, but some requests came back empty, or with enough hidden reasoning overhead that the whole system felt sluggish.

The confusing part was that the usual “is the service up?” checks all looked fine.

  • The API responded.
  • The model was loaded on the GPU.
  • Short prompts worked.
  • Health checks passed.

But once the prompts got larger, the system started to misbehave in ways that were hard to separate:

  • some requests timed out,
  • some returned zero useful output,
  • some looked like parser or repair failures,
  • and some just felt slow enough to be broken.

What Was Actually Going Wrong

The root problem was not one thing. It was a combination of prompt size, model behavior, and the wrong API surface for the job.

Two lessons stood out:

  1. A model can be “healthy” and still be a bad fit for a particular endpoint or request pattern.
  2. An OpenAI-compatible endpoint is not always a faithful substitute for the model vendor’s native chat API.

In my case, the local model was doing best when called through its native chat endpoint with hidden reasoning disabled. The OpenAI-compatible path sometimes consumed the entire budget in ways that looked like empty output or stalled replies.

The Fix

I made three practical changes:

  • tightened the prompt and output contract so the model had less room to wander,
  • lowered the local budgets so the request size fit the actual runner context,
  • switched the local path to the native chat endpoint with reasoning disabled for the cases that needed deterministic output.

That was the difference between “works in theory” and “works in practice.”

What I Measured

The most useful debugging signal was not a vague “the bot is slow.” It was the combination of:

  • response latency,
  • whether the response was empty,
  • whether the system tried to repair malformed output,
  • and whether the prompt was being clipped before generation even began.

Once I started tracking those separately, the failure mode became much easier to reason about. Some symptoms were model latency. Some were parser issues. Some were prompt pressure. They were not all the same bug.

Why This Matters

If you build on top of local models, you eventually run into a choice:

  • keep the local model lane simple and deterministic, or
  • let everything flow through the same generic interface and hope it behaves.

The second option is convenient until it isn’t.

For operational work, I now prefer a much stricter approach:

  • keep the local model path narrow,
  • make output constraints explicit,
  • log the difference between “slow,” “empty,” and “failed to repair,”
  • and only use the generic compatibility layer where it has proven reliable.

Takeaway

The lesson here was not “local models are bad.” The lesson was that local models need to be treated like systems, not magic.

If a request is failing, ask:

  • is the prompt too large?
  • is the endpoint the wrong one for this model?
  • is the output contract too loose?
  • is the repair layer hiding the real failure?

Once those are separated, the fix usually becomes obvious.

And when it does not, observability is the next best tool.