Cloudflare Workers AI

service Sat Jun 20 2026 00:00:00 GMT+0000 (Coordinated Universal Time) aicloudflareworkersinferenceserver-sideprimary

Cloudflare Workers AI

Role

Server-side AI inference inside the umbrella Hono Worker at api.oriz.in — the place to put any AI call where:

The prompt or response includes a secret the browser must not see.
The output feeds another server-side step (queue dispatch, DB write, image-CDN warm-up).
The model needed is one Workers AI hosts (Llama 3, Mistral, BGE embeddings, Stable Diffusion XL, etc.).
Latency from api.oriz.in matters — Workers AI binds natively, so the inference happens on the same edge node as the Worker, with zero egress.

Browser-side AI stays on Puter.js — see the split decision at decisions/architecture/ai-puter-plus-cf-workers-ai.md.

Free tier

10,000 neurons / day (rolling 24 h)
Native binding — no separate credentials surface, declared in the Hono Worker's wrangler.toml as [ai]
Same-account billing posture as Pages / Workers / Queues / KV
Zero egress when the call originates inside Cloudflare
All hosted models on the catalog included in the free quota:
- Llama 3.1 / Llama 3.3 family
- Mistral 7B / Mixtral
- BGE / Nomic embeddings
- Stable Diffusion XL Lightning (image gen)
- Whisper (ASR)
- speech-to-text + text-to-speech variants

Card / subscription required?

NO. Same account as the rest of the Cloudflare stack — the account stays no-card per rules/no-card-on-file.md.

Quota-headroom plan

Per rules/interaction/never-hit-quotas.md:

The Hono Worker tracks neurons consumed per day in KV; trips a soft cap at 50% (5,000 neurons) to flag approach.
Browser-side AI features go through Puter.js, not Workers AI, so the 10K / day budget is reserved for true server-side use.
Long-context summarization batched into low-traffic windows; embeddings cached by content hash so re-runs cost zero neurons.

Why two AI services?

Different surfaces:

Use case	Service
Browser AI (chat in `oriz-me`, on-page assistants)	Puter.js
Server AI (inside Hono Worker, chained with DB / Queue / R2)	Cloudflare Workers AI (this file)
Hosted Gemini if a feature truly needs Google's specific model	Firebase AI Logic (firebase-ai-logic-basics skill)

Picked together so each surface has a no-card free tier already sitting on infra the family uses. See decisions/architecture/ai-puter-plus-cf-workers-ai.md.

Alternatives

Puter.js — sibling, browser-side
OpenRouter — REJECTED, requires server-paid account
Replicate — no card-free tier
Hugging Face Inference API — small free quota, slower
Together AI — has a free trial but card-required for production

Swap cost

Medium — Workers AI's binding API is Cloudflare-specific. A swap means rewriting the AI helper module to call OpenAI / Anthropic / Hugging Face over HTTP and adding a credentials surface. Encapsulate in apps/api/src/ai/ so the swap is one file.

Why this is our pick

No card, native binding — zero new accounts.
Zero egress + same edge as the Worker — better p50 than any external HTTP gateway.
Catalog matches family needs — Llama for general chat / summary, BGE for embeddings (RAG), Stable Diffusion for occasional og:image fallbacks, Whisper for podcast / video transcription.
Stack cohesion — same posture as decisions/architecture/queue-cloudflare-native.md.

Cloudflare Workers AI

Cloudflare Workers AI

Role

Free tier

Card / subscription required?

Quota-headroom plan

Why two AI services?

Alternatives

Swap cost

Why this is our pick

Cross-refs