Zero-cost inference backends — Ollama + Cloudflare Workers AI + Puter.js
Zero-cost inference backends (2026-06-30 grill)
Decision
Three zero-cost model backends are approved for oriz workflows, each marking a distinct deployment surface:
| Backend | Surface | Free tier | Card? | Role |
|---|---|---|---|---|
| Ollama | Local, dev machine | Unlimited (your GPU) | n/a | Primary dev runtime; offline; CI on workstation |
| Cloudflare Workers AI | Serverless, edge Worker | 10,000 neurons/day | NO | Primary serverless runtime; prod-side inference |
| Puter.js | Browser, end-user pays | Unlimited from our side | NO (end-user may optionally add one to their Puter account) | User-facing chat and on-page AI features |
All three pass the no-card-on-file hard rule. All three already have service entries in knowledge/services/business/ai/. This decision codifies them as a single fallback ladder end-to-end.
Why a ladder (not pick-one)
- Different points of failure. Local GPU dies → serverless. Serverless quota hits → browser. Browser blocked (user has no Puter account) → local. Each rung covers the previous rung's gap, satisfying
parallel-by-defaultvia fallback chain. never-hit-quotasrequires ≥10× headroom. Local Ollama gives essentially infinite headroom. Serverless Workers AI caps at 10K neurons/day — sufficient for edge AI tasks, not for bulk scraping. Browser Puter is end-user pays, so quota is per-user and doesn't reduce our headroom.no-card-on-filerejects Anthropic / OpenAI direct. The free alternatives here are exactly what that rule forces us into.
Routing rules
| Workload | Backend (first pick) | Fallback |
|---|---|---|
| Dev on laptop, no network / offline | Ollama (localhost:11434/v1/chat/completions) |
Cloudflare Workers AI |
| Prod inference inside a Cloudflare Worker | Cloudflare Workers AI (env.AI.bind() native binding) |
Puter.js dispatched to client |
| User-facing chat in browser | Puter.js | Fall back to Workers AI for hard server-side steps |
| Open-source CLI agent failover (any of Aider, Cline, Kilo Code, OpenCode, gocode, Coddy) | Ollama at localhost:11434 (all confirmed OpenAI-compat) |
Cloudflare Workers AI over HTTPS |
| Free-tier hosted Google model for general chat | Gemini CLI — see gemini-cli-agent-addition-2026-06-30 |
n/a (no public REST) |
Per-backend details
- Ollama. Install via
ollama.com. Default endpointhttp://localhost:11434. OpenAI-compat at/v1/chat/completions(streaming, vision, tools, reasoning all supported). Desktop GUI + CLI expose the same underlying server. Optional paid "Ollama Cloud" exists — explicitly out of scope; remain on local. - Cloudflare Workers AI. See
cloudflare-workers-ai. Native worker binding viaenv.AI. 10K neurons/day free. All listed catalog models (Llama 3.x, Mistral, Mixtral, BGE, Nomic, Whisper, Stable Diffusion XL Lightning) included in the free quota. - Puter.js. See
puter-js. Browser-side via<script src="https://js.puter.com/v2/">. End-user pays Puter directly; we hold no keys, no accounts. Model IDs mirror OpenRouter's catalog for parity, but billing routes through Puter. - Gemini CLI. See
gemini-cli-agent-addition-2026-06-30. Linked here because it shares the "free-of-cost" framing even though it is not an API endpoint itself.
Quota invariants (per never-hit-quotas)
| Backend | Soft alarm trip | Cap |
|---|---|---|
| Cloudflare Workers AI | 5,000 neurons/day (50%) | 10,000 neurons/day hard cap |
| Local Ollama | Disk space | No API-side cap |
| Puter.js | n/a (end-user pays) | Per-user at Puter's discretion |
| Gemini CLI | 600 req/day (60%); 36 req/min (60%) | 1,000 req/day, 60 req/min |
What this decision does NOT do
- Does not replace Claude Code as primary oriz driver. For the monorepo's full session-cost surface (reasoning-heavy refactors, multi-file architectural changes), paid Claude Code remains primary. The three backends above are for free-routed workloads and failovers.
- Does not bypass
auto-grill-on-architectural-decisions. Any new service joining this ladder (e.g., LM Studio as a 4th option) requires its own grill-locked OKF decision. - Does not extend to hosting suggestions. The Cf Workers-AI free tier is generous but quota-bound, not infinite. Long-context summarisation, bulk scraping, or anything else that could exhaust daily neurons on its own MUST be pre-grilled for quota impact.
Cross-refs
no-card-on-file— gate this decision respectsnever-hit-quotas— quota alarm ruleauto-grill-on-architectural-decisions— process producing this filecloudflare-workers-ai— serverless service entryputer-js— browser service entryai-puter-plus-cf-workers-ai— prior split decision reinforcing the surface splitgemini-cli-agent-addition-2026-06-30— sibling agent-class decision