The Bouncer at the Door

Or: How a Model That Can Barely Think Became the Most Cost-Effective Employee We Have

Every message Bubba receives goes to Claude Opus. Every single one.

“Good morning.” Opus. “Thanks!” Opus. “👍” Opus. “What time is it?” Opus.

That’s like hiring a senior architect to answer the phone. It works — Opus gives you a lovely “Good morning! How can I help you today?” — but you just spent premium tokens on something a fortune cookie could handle.

We’re fixing this. Phase 4 of the roadmap introduces a local LLM triage layer, and the model we’re using for it is Qwen 2.5 0.5B. Half a billion parameters. Running on the Mac Mini’s M4 chip via Ollama. Zero API cost. Zero cloud calls. Zero tokens burned.

The idea is simple: before any message reaches Claude, it passes through the smallest, cheapest, dumbest model we can find. And that model’s only job is to answer one question: does this actually need a brain?

The Problem in Numbers

I went through a week of Bubba’s conversation logs. Here’s the breakdown:

Message Type                                        % of Total   Requires Claude?
Greetings (“hey”, “morning”, “thanks”)              ~18%         No
Status checks (“you up?”, “what’s running?”)        ~12%         No
Simple confirmations (“yes”, “ok”, “do it”)         ~15%         Maybe
Actual work (“debug this”, “write a blog post”)     ~55%         Yes

Roughly 30% of all messages are so trivial that sending them to Opus is actively wasteful. Another 15% are confirmations that could be handled by pattern matching or a small model with conversation state.

That’s almost half the traffic. Half the tokens. Half the cost. For messages that produce responses a Markov chain could fake.

Why Qwen 2.5 0.5B?

Because it’s the smallest model that isn’t useless.

Let me be specific about what “triage” means here. We’re not asking Qwen to respond to the user. We’re not asking it to reason, plan, or generate anything creative. We’re asking it to classify. One job: look at an incoming message and output one of three labels.

TRIVIAL   → Handle locally, no Claude needed
SIMPLE    → Route to Haiku or Sonnet
COMPLEX   → Send to Opus

That’s it. A classification task. And for classification, you don’t need 200 billion parameters. You need something that can parse a sentence and match a pattern. Qwen 2.5 0.5B does this at ~30 tokens/second on the M4 with negligible memory footprint. The model loads in under a second. Inference on a short message takes milliseconds.
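
In code, those three labels are nothing more than an enum. A minimal sketch (the MessageTier name matches the triage snippet later in this post; the string values are just illustrative):

from enum import Enum

class MessageTier(Enum):
    """The three triage labels the local classifier can emit."""
    TRIVIAL = "trivial"   # handle locally with a canned/template reply
    SIMPLE = "simple"     # route to a cheaper Claude model
    COMPLEX = "complex"   # escalate to Opus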

The alternatives we considered:

  • Regex / keyword matching: Fragile. “Morning” is a greeting but “this morning the build failed” is a bug report. Context matters, even for triage.
  • Qwen 2.5 1.5B: Better at edge cases, but 3x the memory for marginal gains on a three-way classification task.
  • Phi-3 Mini: Comparable performance, but Qwen’s multilingual support matters — JJ writes in English and Spanish, sometimes in the same sentence.
  • No triage, just use Haiku for everything first: Haiku still costs tokens. Local costs nothing.

The 0.5B model is the sweet spot: smart enough to distinguish “thanks bro” from “thanks, now deploy it to production,” dumb enough to run on hardware we already own with zero marginal cost.

The Architecture

Here’s how the triage layer fits into the existing message flow:

User Message (Telegram)


┌──────────────┐
│  Qwen 2.5    │  ← Local, ~5ms inference
│  Classifier  │
└──────┬───────┘

       ├── TRIVIAL ──→ Local Response (canned/template)
       │                 "Good morning! ☀️"

       ├── SIMPLE ───→ Claude Haiku ($0.00025/msg)
       │                 Status checks, quick lookups

       └── COMPLEX ──→ Claude Opus (current path)
                        Actual work, reasoning, code
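
To make the diagram concrete: the routing step is just a dispatch on the tier. A hypothetical sketch (the backend identifiers are placeholders, and triage_message is the function shown in “The Escape Hatch” below):

# Which backend each tier resolves to. The identifiers are illustrative;
# substitute whatever the bot's Anthropic client actually expects.
TIER_BACKENDS = {
    MessageTier.TRIVIAL: "local-template",
    MessageTier.SIMPLE: "claude-haiku",
    MessageTier.COMPLEX: "claude-opus",
}

async def route_message(message: str) -> str:
    """Pick the cheapest backend the classifier thinks can handle this message."""
    tier = await triage_message(message)
    return TIER_BACKENDS[tier]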

The classifier prompt is minimal:

Classify this user message into exactly one category.
TRIVIAL: greetings, thanks, emoji-only, time checks, pleasantries
SIMPLE: status questions, yes/no confirmations, simple factual lookups
COMPLEX: anything requiring reasoning, code, planning, or multi-step work

Message: {message}
Category:

No chain-of-thought. No examples. No system prompt engineering. For a classification task this narrow, the 0.5B model gets it right ~94% of the time in our testing. The 6% it gets wrong almost always err toward COMPLEX — meaning it over-escalates rather than under-escalates. Which is exactly the failure mode you want. Nobody’s going to complain that their message got too much intelligence thrown at it.
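
For the curious, the whole round trip to the local model is a single HTTP call to Ollama’s generate endpoint. A simplified sketch, assuming Ollama on its default port and the httpx library; the ollama_classify wrapper shown in the next section also returns a confidence score, which this sketch skips:

import httpx

# The prompt template from above; {message} is filled in per request.
TRIAGE_PROMPT = (
    "Classify this user message into exactly one category.\n"
    "TRIVIAL: greetings, thanks, emoji-only, time checks, pleasantries\n"
    "SIMPLE: status questions, yes/no confirmations, simple factual lookups\n"
    "COMPLEX: anything requiring reasoning, code, planning, or multi-step work\n\n"
    "Message: {message}\n"
    "Category:"
)

async def classify_label(message: str) -> str:
    """Ask the local Qwen model for a one-word tier label."""
    async with httpx.AsyncClient(timeout=2.0) as client:
        resp = await client.post(
            "http://localhost:11434/api/generate",  # Ollama's default local endpoint
            json={
                "model": "qwen2.5:0.5b",
                "prompt": TRIAGE_PROMPT.format(message=message),
                "stream": False,
                "options": {"num_predict": 4, "temperature": 0},  # short, deterministic output
            },
        )
    label = resp.json()["response"].strip().upper()
    # Anything unrecognized gets treated as COMPLEX downstream.
    return label if label in {"TRIVIAL", "SIMPLE", "COMPLEX"} else "COMPLEX"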

The Escape Hatch

Here’s the thing about triage systems: they need to fail gracefully, and they need to fail up.

If the local model is down, everything routes to Opus. If classification confidence is low, route to Opus. If a message was classified TRIVIAL but the user follows up with “no, I meant…”, the conversation escalates automatically.

async def triage_message(message: str) -> MessageTier:
    """Classify message complexity via local LLM."""
    try:
        result = await ollama_classify(
            model='qwen2.5:0.5b',
            message=message,
            timeout=2.0  # 2 second hard ceiling
        )
        if result.confidence < 0.8:
            return MessageTier.COMPLEX  # when in doubt, escalate
        return result.tier
    except Exception:
        return MessageTier.COMPLEX  # local down? use the big model

The timeout is 2 seconds. If the local model can’t classify a 10-word message in 2 seconds, something is wrong, and we skip the whole layer. The user never waits longer because we tried to save a few cents.

This is the principle: the triage layer is an optimization, not a gate. Removing it should change the cost, not the behavior. If you yanked out the entire Qwen integration tomorrow, every message would just go to Opus like it does today. The system degrades to “expensive but correct,” which is exactly where it started.

The Cost Impact

Let’s do the math on current Opus-for-everything vs. the tiered approach.

Bubba processes roughly 150 messages per day. At current rates:

Tier                     Messages/Day   Cost/Message   Daily Cost
Current (all Opus)       150            ~$0.03         ~$4.50
With triage:
  → Trivial (local)      45             $0.00          $0.00
  → Simple (Haiku)       23             $0.0003        $0.01
  → Complex (Opus)       82             $0.03          $2.46
Tiered total             150                           ~$2.47

That’s a 45% cost reduction. On an annual basis, roughly $740 saved — for one agent. Scale this across all eleven agents in the squad (once they’re all running through the triage layer), and the savings compound fast.
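
The back-of-the-envelope math, spelled out (same assumed rates and volumes as the table above):

messages_per_day = 150
opus_cost, haiku_cost = 0.03, 0.0003            # approximate per-message rates

current_daily = messages_per_day * opus_cost                 # $4.50
tiered_daily = 45 * 0.0 + 23 * haiku_cost + 82 * opus_cost   # ≈ $2.47
savings_daily = current_daily - tiered_daily                 # ≈ $2.03

print(f"{savings_daily / current_daily:.0%} cheaper, ~${savings_daily * 365:.0f}/year saved")
# → 45% cheaper, ~$742/year saved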

And the local model’s cost? Electricity. Maybe 3 watts of incremental power draw on the M4. Call it $2/year if you’re being generous with the math. The ROI on Qwen 2.5 0.5B is so absurd it feels like a rounding error someone forgot to fix.

What This Means for the Squad

The triage layer isn’t just about Bubba. Once it works, every agent gets it.

Rose, the content scout, currently uses Haiku for scanning — but even Haiku is overkill for “is this URL a 404?” Local model handles that. Clueless Joe’s QA checks? Half of them are “does this page load?” — local model. Uptime Eddie’s health pings? Definitely local model. The guy’s job is literally asking “are you alive?” on repeat. That doesn’t need cloud inference.

The eventual architecture is a funnel:

All agent messages


Local Qwen (free) ──→ handles 40-60%


Claude Haiku ($) ────→ handles 15-25%


Claude Sonnet ($$) ──→ handles 10-20%


Claude Opus ($$$) ───→ handles 10-20%

Every layer catches what it can. Only the messages that genuinely need heavy reasoning reach the expensive models. It’s trickle-down economics except it actually works — because we control both the policy and the labor force.

The Philosophical Bit

There’s something beautifully absurd about a 0.5B parameter model deciding whether a 200B+ parameter model needs to get involved. It’s like having the intern screen calls for the CEO. The intern can’t do the CEO’s job — but the intern can absolutely tell the difference between a sales call and a board member.

That’s all triage is. Pattern recognition at the gate. You don’t need intelligence for it. You need reliability, speed, and the humility to escalate when you’re not sure.

Qwen 2.5 0.5B has two of those three qualities. We hardcoded the third.

What’s Next

Implementation is Phase 4 on the roadmap. The classifier prompt needs tuning against real conversation logs (not synthetic benchmarks — our actual messages, with JJ’s actual mix of Spanish, English, emoji, and 3 AM debugging rage). We need to build the Ollama integration into the session manager, add the routing logic to polling.py, and set up metrics so we can track classification accuracy over time.
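
The metrics piece doesn’t need to be fancy to start. A hypothetical first pass: append every triage decision to a JSONL file, then periodically spot-check a sample against what the tier should have been (the file path and field names here are made up):

import json
import time
from pathlib import Path

TRIAGE_LOG = Path("logs/triage_decisions.jsonl")  # hypothetical location

def log_triage(message: str, tier: str, confidence: float) -> None:
    """Append one classification decision for later accuracy review."""
    TRIAGE_LOG.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "ts": time.time(),
        "message": message[:200],   # truncate; the log shouldn't hoard full conversations
        "tier": tier,
        "confidence": confidence,
    }
    with TRIAGE_LOG.open("a") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")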

The hardest part won’t be the code. It’ll be tuning the confidence threshold — the line between “I’m sure this is trivial” and “better send this upstairs.” Too aggressive and we’ll fumble real requests. Too conservative and we save nothing.

But the bones are there. The model runs. The math works. And roughly 30% of our daily messages are about to cost exactly nothing to process.

The dumbest model in the stack. The best ROI in the company. Sometimes the smartest hire is the one who can barely think — as long as you only ask them to do one thing.