The Vision: An AI That Just Handles Things
The best AI assistant is one you forget is AI. It doesn’t wait for commands. It anticipates needs. It runs in the background, handling routine tasks while you focus on what matters.
But there’s a catch: AI APIs cost money. An always-on assistant that naively calls GPT-4 for everything will rack up hundreds of dollars monthly. The challenge is building something autonomous AND economical.
This is the architecture I developed to solve that problem.
The Five-Layer Architecture
flowchart TB
subgraph L0["Layer 0: CONTEXT BUDGET"]
L0A["Token tracking per session"]
L0B["Semantic cache"]
L0C["Memory compression"]
L0D["Budget warnings"]
end
subgraph L1["Layer 1: ROUTING INTELLIGENCE"]
L1A["Local Ollama classifier (7B)"]
L1B["Category: trivial|standard|complex"]
L1C["Rate limit aware fallback"]
L1D["~90% queries free"]
end
subgraph L2["Layer 2: EXECUTION MODES"]
L2A["Reflexive: direct response"]
L2B["Deliberate: plan then execute"]
L2C["Autonomous: background daemon"]
L2D["Ultrawork: parallel waves"]
end
subgraph L3["Layer 3: SKILLS + SANDBOX"]
L3A["Skill registry"]
L3B["Docker sandbox"]
L3C["Audit logging"]
end
subgraph L4["Layer 4: COMMUNICATION"]
L4A["Signal / Email / Telegram"]
L4B["Obsidian vault"]
L4C["Content harvester"]
end
L0 --> L1
L1 --> L2
L2 --> L3
L3 --> L4
style L0 fill:#3b82f6,color:#fff
style L1 fill:#8b5cf6,color:#fff
style L2 fill:#ec4899,color:#fff
style L3 fill:#f97316,color:#fff
style L4 fill:#22c55e,color:#fff
Layer 0: Context Budget Management
Every LLM interaction has a cost. Layer 0 tracks and optimizes token usage:
Token Tracking: Each session maintains a running count. When approaching limits, the system warns and suggests summarization.
Semantic Cache: Before calling any API, check if a semantically similar query was answered recently. Embedding similarity > 0.92? Return the cached answer. 100% savings on cache hits.
Memory Compression: When context grows too large, compress older messages while preserving key information. A 10,000 token conversation becomes 3,000 tokens without losing critical context.
Budget Warnings: Set daily/weekly spending limits. The system alerts before you hit them, not after.
Layer 1: Routing Intelligence
This is where the magic happens. A tiny local model decides where each request goes.
flowchart LR
Q["Query"] --> LC["Local Classifier\n(qwen2.5:1.5b)"]
LC --> |"trivial"| LOCAL["Local Ollama\n(FREE)"]
LC --> |"standard"| FAST["Gemini Flash\nSonnet 3.5"]
LC --> |"complex"| PREMIUM["Opus\nGPT-4"]
LOCAL --> R["Response"]
FAST --> R
PREMIUM --> R
style LC fill:#8b5cf6,color:#fff
style LOCAL fill:#22c55e,color:#fff
style FAST fill:#f97316,color:#fff
style PREMIUM fill:#ef4444,color:#fff
The classifier runs locally on a 1.5B parameter model. It takes <1 second and costs nothing. It categorizes requests:
- Trivial (60%): “What time is it?”, “Summarize this text”, formatting tasks → Local Ollama
- Standard (30%): Code completion, analysis, explanations → Gemini Flash or Sonnet
- Complex (10%): Architecture decisions, novel problems, multi-step reasoning → Opus or GPT-4
Result: ~90% of queries never hit expensive APIs.
Layer 2: Execution Modes
Not all tasks need the same approach:
Reflexive Mode: Simple query → immediate response. No planning, no verification. For quick answers.
Deliberate Mode: Complex request → interview for clarification → create plan → execute with verification. For important tasks where mistakes are costly.
Autonomous Mode: Background daemon that handles scheduled tasks. Morning briefings, email summaries, routine checks. Runs without prompting.
Ultrawork Mode: Parallel execution of independent subtasks. When you need 10 things done, why do them sequentially?
Layer 3: Skills + Sandbox
Skills are pre-baked instruction sets that dramatically reduce token usage:
Without skill: 2000 tokens explaining how to do X
With skill: 50 tokens referencing skill + parameters
A skill registry maps capabilities to compressed instructions. The AI doesn’t need to figure out how to do common tasks—it just loads the skill.
All execution happens in a Docker sandbox with:
- Network allowlist (only approved APIs)
- File system restrictions
- Resource limits
- Complete audit logging
Layer 4: Communication
The companion needs to reach you where you are:
- Signal/Telegram: Quick notifications and commands
- Email: Longer reports, morning briefings
- Obsidian Vault: Knowledge accumulation, notes, seeds for future content
Information flows both ways. You can command via any channel; it responds through the appropriate one.
Token Optimization Results
| Strategy | Savings |
|---|---|
| Local classifier first | ~90% trivial queries free |
| Skill distillery | ~2000 tokens per skill invocation |
| Semantic cache | 100% on cache hit |
| Memory compression | ~70% context reduction |
Combined effect: What would cost $150/month with naive API usage costs <$5/month with this architecture.
Success Metrics
After running this architecture for a month:
| Metric | Target | Actual |
|---|---|---|
| Daily cost | <$5 | $2.30 avg |
| Cache hit rate | >30% | 42% |
| Local routing | >60% | 73% |
| Response latency p95 | <3s | 1.8s |
The system handles hundreds of interactions daily while keeping costs minimal.
Why This Matters
AI assistants are becoming essential tools, but cost and control remain barriers. This architecture proves you can have:
- Autonomy: It acts without constant prompting
- Economy: Costs under a coffee per day
- Security: Sandboxed execution with audit trails
- Extensibility: Skills system allows easy capability addition
The best AI assistant is one that works so well you forget it’s there. This architecture makes that possible without breaking the bank.
Implementation Notes
The full implementation combines:
- Ollama for local model inference
- Docker for sandboxed execution
- Redis for semantic caching
- SQLite for conversation history
- Various APIs behind a unified gateway
Each layer is independently testable and replaceable. Start with layers 0-2, add sandbox when you’re ready for autonomous execution.
This architecture emerged from building Moltbot, a personal AI companion. The goal: an assistant that handles things so well you forget it’s AI.
