The Problem: Sequential Fallback Wastes Money
You’ve built a multi-model gateway. When one provider is rate-limited, it falls back to another. Smart, right?
Not really. Here’s what actually happens:
flowchart LR
P["Prompt"] --> O["Try Opus"]
O --> |"Rate Limited!"| S["Try Sonnet"]
S --> |"Rate Limited!"| G["Try Gemini"]
G --> |"Rate Limited!"| L["Try Local"]
L --> R["Finally... Response"]
style O fill:#ef4444,color:#fff
style S fill:#ef4444,color:#fff
style G fill:#ef4444,color:#fff
style L fill:#22c55e,color:#fff
Each “Try” is an API call. Each rate-limited response still counts against your quota. You’re paying to be told “no.”
For a simple question like “What’s 2+2?”, you might burn through 3 API calls before getting an answer from a local model that could have handled it instantly.
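In code, the anti-pattern looks roughly like the sketch below. callModel is a hypothetical stand-in for real provider SDK calls, and the model identifiers just mirror the diagram; this is an illustration of the pattern, not the gateway's actual code.
// A sketch of sequential fallback: every attempt is a real API call,
// including the ones that only come back with a 429.
declare function callModel(model: string, prompt: string): Promise<string>;

const FALLBACK_ORDER = ['claude-opus', 'claude-sonnet', 'gemini-pro', 'local-ollama'];

async function naiveFallback(prompt: string): Promise<string> {
  for (const model of FALLBACK_ORDER) {
    try {
      return await callModel(model, prompt); // counts against quota even when it fails
    } catch (e) {
      // assume it was a rate limit and pay for the next attempt too
      continue;
    }
  }
  throw new Error('Every provider is rate-limited');
}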
The Insight: Classify First, Route Smart
What if you knew the right destination BEFORE making any API calls?
A tiny local model can analyze the prompt and categorize it in under a second. Then route directly to the appropriate pool. No wasted calls.
flowchart TB
P["Prompt"] --> C["Local Classifier\nqwen2.5:1.5b\n<1 second, FREE"]
C --> |"reasoning"| R["Reasoning Pool\nOpus, Sonnet, Gemini Pro"]
C --> |"coding"| CO["Coding Pool\nGPT-4, Sonnet, DeepSeek"]
C --> |"quick"| Q["Quick Pool\nGemini Flash, Haiku"]
C --> |"vision"| V["Vision Pool\nGemini Vision, Claude Vision"]
R --> E["Execute on\nFirst Available"]
CO --> E
Q --> E
V --> E
E --> |"All exhausted"| F["Fallback\nLocal Ollama 14B"]
style C fill:#8b5cf6,color:#fff
style R fill:#3b82f6,color:#fff
style CO fill:#22c55e,color:#fff
style Q fill:#f97316,color:#fff
style V fill:#ec4899,color:#fff
Why Local Classification Works
The 1.5B parameter classifier has one job: categorize the request. It doesn’t need to answer it—just understand what kind of task it is.
This is a much easier problem than generating a response:
| Task | Difficulty | Model Needed |
|---|---|---|
| “Is this a coding task?” | Easy | 1.5B is plenty |
| “Write the code” | Hard | Needs larger model |
The classifier runs at 150-200 tokens/second on Apple Silicon. Classification takes under a second and costs nothing.
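If you want to sanity-check those numbers on your own machine, here is a rough timing sketch. It assumes the ollama npm client and qwen2.5:1.5b-instruct already pulled; the actual classification prompt appears in a later section.
import ollama from 'ollama';

// Rough latency check for the 1.5B router; numbers will vary by hardware.
async function timeRouter(prompt: string): Promise<void> {
  const start = performance.now();
  const result = await ollama.generate({
    model: 'qwen2.5:1.5b-instruct',
    prompt,
    options: { temperature: 0, num_predict: 8 }, // category labels are short
  });
  const ms = performance.now() - start;
  console.log(`${ms.toFixed(0)}ms ->`, result.response.trim());
}

await timeRouter('Answer with one word: is "What is 2+2?" a coding task?');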
Model Pool Design
Each category routes to a pool of models optimized for that task type:
const MODEL_POOLS = {
reasoning: ['claude-opus', 'claude-sonnet', 'gemini-pro'],
coding: ['gpt-4', 'claude-sonnet', 'deepseek-coder'],
quick: ['gemini-flash', 'claude-haiku'],
vision: ['gemini-vision', 'claude-vision'],
review: ['moonshot-kimi', 'claude-sonnet'],
};
Within each pool, the system tries models in order until one succeeds. But now you’re only trying models that are appropriate for the task.
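One small design choice worth making (my addition, not something the pools require): derive the category type from the pool table itself, so the classifier's label set and the routing table can never drift apart.
// Deriving Category from MODEL_POOLS keeps the label set and the routing
// table in sync; adding a pool automatically adds a valid category.
type Category = keyof typeof MODEL_POOLS; // 'reasoning' | 'coding' | 'quick' | 'vision' | 'review'

// Type guard used when validating the classifier's raw output.
function isCategory(x: string): x is Category {
  return x in MODEL_POOLS;
}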
The Classification Prompt
The local classifier uses a simple prompt:
Classify this request into exactly one category:
- reasoning: Complex analysis, architecture, multi-step thinking
- coding: Write, debug, or review code
- quick: Simple questions, formatting, trivial tasks
- vision: Analyze images or visual content
- review: Edit, proofread, provide feedback
Request: {user_prompt}
Category:
The 1.5B model handles this reliably. It’s pattern matching, not reasoning.
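Parsing that single-word answer is the only fragile part. Below is a minimal sketch of a parser; the name parseCategory and the 'quick' default are my assumptions, not necessarily how the original gateway handles it.
// Normalize the model's completion ("Coding." -> "coding") and fall back
// to a cheap default when the output isn't one of the known labels.
function parseCategory(raw: string): Category {
  const first = raw.trim().toLowerCase().split(/\s+/)[0] ?? '';
  const label = first.replace(/[^a-z]/g, ''); // strip stray punctuation
  return isCategory(label) ? label : 'quick'; // default category is an assumption
}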
Performance Results
Before local routing:
| Metric | Value |
|---|---|
| Avg API calls per request | 2.3 |
| Wasted (rate-limited) calls | 41% |
| Avg response time | 4.2s |
| Monthly API cost | $127 |
After local routing:
| Metric | Value |
|---|---|
| Avg API calls per request | 1.1 |
| Wasted calls | 3% |
| Avg response time | 2.1s |
| Monthly API cost | $43 |
The classifier adds ~0.8s latency but saves 2+ seconds on average by avoiding failed API calls.
Hardware Requirements
The setup runs comfortably on a Mac Mini M4 with 32GB RAM:
| Model | Role | VRAM | Speed |
|---|---|---|---|
| qwen2.5:1.5b-instruct | Router | ~2GB | 200 tok/s |
| qwen2.5:14b-instruct-q8_0 | Fallback | ~16GB | 45 tok/s |
With both models loaded simultaneously (roughly 18GB combined), there's still plenty of headroom for other tasks.
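To avoid cold-load penalties between requests, Ollama lets each call pin its model in memory via the keep_alive parameter (the default unload timeout is about five minutes). A sketch, assuming the ollama npm client; '30m' is an arbitrary choice.
import ollama from 'ollama';

// Warm both models and ask Ollama to keep them resident for 30 minutes.
await ollama.generate({
  model: 'qwen2.5:1.5b-instruct',
  prompt: 'ping',
  keep_alive: '30m', // router stays hot
});
await ollama.generate({
  model: 'qwen2.5:14b-instruct-q8_0',
  prompt: 'ping',
  keep_alive: '30m', // fallback stays hot
});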
Implementation Sketch
import ollama from 'ollama';

// callModel, executeLocally, and CLASSIFICATION_PROMPT are assumed to be
// defined elsewhere in the gateway; MODEL_POOLS, Category, and parseCategory
// are shown above, and isRateLimited is sketched below.
async function routeRequest(prompt: string): Promise<Response> {
  // Step 1: Local classification (FREE, <1s)
  const category = await classifyLocally(prompt);

  // Step 2: Get appropriate model pool
  const pool = MODEL_POOLS[category];

  // Step 3: Try each model in the pool, skipping rate-limited providers
  for (const model of pool) {
    try {
      return await callModel(model, prompt);
    } catch (e) {
      if (isRateLimited(e)) continue;
      throw e;
    }
  }

  // Step 4: All providers exhausted, fall back to local execution
  return await executeLocally(prompt);
}

async function classifyLocally(prompt: string): Promise<Category> {
  const result = await ollama.generate({
    model: 'qwen2.5:1.5b-instruct',
    prompt: CLASSIFICATION_PROMPT.replace('{user_prompt}', prompt),
  });
  // generate() returns an object; the completion text is in result.response
  return parseCategory(result.response);
}
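The sketch leaves error classification to isRateLimited. A minimal version is below, assuming the provider error exposes an HTTP status code; where that status lives differs per SDK, so adjust the property access accordingly.
// Treat HTTP 429 as "rate limited, try the next model in the pool".
function isRateLimited(e: unknown): boolean {
  return typeof e === 'object' && e !== null &&
    (e as { status?: number }).status === 429;
}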
When NOT to Use This
Local routing adds complexity. Skip it if:
- You only use one model provider
- Rate limits aren’t a problem for you
- Latency is critical and classification overhead matters
- You’re not running on hardware that can host local models
Key Takeaways
- Sequential fallback wastes API calls - You pay to be told “no”
- Local classification is nearly free - 1.5B models run fast on modern hardware
- Route by task type, not availability - Match requests to appropriate model pools
- Keep local fallback - When all APIs fail, local execution saves the day
The best API call is the one you don’t make. Local-first routing ensures you only call the APIs that can actually help.
This pattern emerged from building a multi-provider gateway. The aha moment: checking availability is expensive. Classification is cheap.
