GPT-4o
OpenAI's flagship multimodal model combining GPT-4 intelligence with vision, audio, and image capabilities in a fast and affordable package.
128K context · 520ms TTFT · 82 tok/s
Make data-driven decisions about which AI model fits your budget.
25 models found
Affordable and efficient small model that outperforms GPT-3.5 Turbo, designed for lightweight tasks with fast response times.
128K context · 290ms TTFT · 125 tok/s
High-capability GPT-4 model with a 128k context window, optimized for complex tasks requiring deep reasoning and comprehensive understanding.
128K context · 780ms TTFT · 42 tok/s
OpenAI's advanced reasoning model that thinks before answering, excelling at complex math, science, and coding problems through extended internal deliberation.
128K context · 5.2s TTFT · 24 tok/s
Faster and more cost-efficient reasoning model optimized for STEM tasks, offering strong performance on coding and math at a fraction of o1 Preview's cost.
128K context · 2.9s TTFT · 38 tok/s
Highly efficient embedding model with strong multilingual performance, significantly outperforming the second generation ada embedding model.
8K context · 45ms TTFT
Most capable embedding model for English and multi-lingual tasks, producing larger vector representations for higher retrieval accuracy.
8K context · 75ms TTFT
Anthropic's most intelligent model yet, setting new industry benchmarks for graduate-level reasoning, code, and vision tasks with an industry-leading 200k context window.
200K context · 610ms TTFT · 72 tok/s
Anthropic's fastest and most compact model, offering best-in-class speed with exceptional coding skills and instruction following at a low cost.
200K context · 310ms TTFT · 105 tok/s
Anthropic's most powerful model for highly complex tasks, delivering top-level performance across a wide range of evaluations, including nuanced content creation at the highest level of intelligence.
200K context · 1.2s TTFT · 28 tok/s
The fastest and most compact model in the Claude 3 family, offering near-instant responsiveness and strong performance at an accessible price point for high-throughput applications.
200K context · 390ms TTFT · 95 tok/s
Google's best-performing multimodal model with a breakthrough 1M token context window for processing long documents, videos, and code repositories in a single prompt.
1M context · 720ms TTFT · 63 tok/s
Google's fastest multimodal model optimized for high-volume tasks. Inherits Gemini 1.5 Pro's 1M context window at a fraction of the cost through model distillation.
1M context · 340ms TTFT · 115 tok/s
Google's first-generation production model optimized for natural language tasks, multi-turn conversations, and code generation with reliable performance.
33K context · 580ms TTFT · 68 tok/s
Google's multimodal model capable of understanding images and text simultaneously, enabling visual question answering, image captioning, and document analysis.
16K context · 820ms TTFT · 52 tok/s
Meta's flagship open-source large language model available via hosted API (Together AI). Strong reasoning and instruction-following with competitive performance against proprietary models.
8K context · 420ms TTFT · 78 tok/s
Compact and efficient open-source model from Meta, available via hosted API. Ideal for lightweight tasks requiring fast inference at minimal cost.
8K context · 190ms TTFT · 160 tok/s
Self-hosted via Ollama or similar runtime. Zero API costs — you only pay for your own hardware/cloud infrastructure.
8K context · throughput hardware dependent
Mistral's flagship model with top-tier reasoning capabilities, native function calling, and deep code generation across 80+ programming languages.
33K context · 640ms TTFT · 58 tok/s
Cost-efficient Mistral model for bulk operations with low latency. Ideal for classification, customer support, and text generation tasks at scale.
33K context · 350ms TTFT · 98 tok/s
Fine-tuned instruction-following version of Mistral 7B, offering outstanding performance for its size on coding, reasoning, and multi-turn dialogue tasks.
33K context · 270ms TTFT · 128 tok/s
Self-hosted Mistral 7B via Ollama or llama.cpp. Efficient enough to run on consumer GPUs — zero API cost, complete data privacy.
33K context · throughput hardware dependent
Cohere's most powerful model, optimized for enterprise RAG and tool use. State-of-the-art retrieval-augmented generation with multi-hop reasoning and grounded responses.
128K context · 690ms TTFT · 55 tok/s
Cohere's balanced model for retrieval-augmented generation and complex reasoning tasks. Optimized for long context search and tool use at a lower price point.
128K context · 490ms TTFT · 74 tok/s
Cohere's lightest chat model for high-throughput, low-latency applications. Fast response times make it ideal for chatbots and summarization pipelines.
4K context · 240ms TTFT · 135 tok/s
Everything you need to know to choose the right LLM for your project.
Six dimensions determine which LLM fits your use case best; evaluating them in order quickly eliminates poor matches.
Tip: Start with cost × context window. These two alone eliminate 80% of mismatches before you need to compare capabilities.
Every API call generates two separate costs, both measured per million tokens: one for input (prompt) tokens and one for output (completion) tokens.
Tip: Set max_tokens conservatively. Capping at 500 tokens can reduce your bill by 60–80% in chat applications.
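As a sketch of how the two line items combine — the per-1M prices below are illustrative placeholders, not anyone's current list prices:

```python
def call_cost(prompt_tokens: int, completion_tokens: int,
              input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the USD cost of one API call, billed per 1M tokens."""
    return (prompt_tokens * input_price_per_m
            + completion_tokens * output_price_per_m) / 1_000_000

# Capping max_tokens directly caps the output side of the bill.
uncapped = call_cost(70, 2000, 2.50, 10.00)  # long, uncapped reply
capped = call_cost(70, 500, 2.50, 10.00)     # same prompt, max_tokens=500
```

Because output tokens are typically priced several times higher than input tokens, the output cap dominates the savings.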
The context window defines the maximum amount of text the model can see in a single request — input and output combined. Larger windows cost proportionally more.
Tip: You are billed for the full context on every call. Periodically summarize long conversations to keep context lean and costs low.
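Since the full history is rebilled on every turn, summarization savings compound over a conversation. A sketch with made-up turn sizes:

```python
def conversation_input_tokens(turn_tokens, history_cap=None):
    """Total input tokens billed across a chat where every turn resends
    the full history, optionally kept under a summarization cap."""
    total, history = 0, 0
    for t in turn_tokens:
        history += t
        if history_cap is not None:
            history = min(history, history_cap)  # summarize down to the cap
        total += history  # each call is billed for the whole context
    return total

turns = [500] * 10  # ten turns of ~500 tokens each (hypothetical)
full = conversation_input_tokens(turns)                   # resend everything
lean = conversation_input_tokens(turns, history_cap=1500) # keep context lean
```

In this toy run the capped conversation bills roughly half the input tokens of the uncapped one, and the gap widens as the chat grows.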
Latency has two independent metrics that matter for very different scenarios: time to first token (TTFT) and output tokens per second (TPS).
Tip: Always benchmark in your production deployment region, not the provider's marketing benchmark region.
Both are production-ready options. The right choice depends on volume, privacy requirements, and your team's operational capacity:
A token is the smallest text unit that an LLM processes. All LLM pricing is denominated in tokens, not words, sentences, or characters.
Tip: Use the provider's tokenizer tool before estimating costs. Token count varies significantly by language and content type.
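Absent the real tokenizer, a common rule of thumb is roughly 4 characters per English token. The sketch below uses that heuristic and should be treated as a rough estimate only, not a substitute for the provider's tokenizer:

```python
def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Crude heuristic (~4 chars/token for English prose). Real counts
    come from the provider's tokenizer and vary by language and content."""
    return max(1, round(len(text) / chars_per_token))

estimate = rough_token_estimate("The quick brown fox jumps over the lazy dog.")
```

Code, non-Latin scripts, and whitespace-heavy text can deviate from this ratio substantially, which is why the tip above points at the real tokenizer.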
TTFT is the elapsed wall-clock time from sending an API request to receiving the very first output token.
Tip: Deploy your backend in the same cloud region as the LLM endpoint to reduce network RTT contribution.
TPS measures how fast the model generates output tokens after the first token is emitted.
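Both metrics fall out of simple arithmetic over a streamed response's timestamps. The timestamps below are fabricated to mirror the GPT-4o figures listed above:

```python
def ttft_and_tps(request_sent: float, token_times: list) -> tuple:
    """Compute time-to-first-token (seconds) and decode-phase tokens/sec
    from per-token arrival timestamps of a streamed response."""
    ttft = token_times[0] - request_sent
    # TPS is measured over the decode phase, i.e. after the first token.
    decode_seconds = token_times[-1] - token_times[0]
    tps = (len(token_times) - 1) / decode_seconds
    return ttft, tps

# Hypothetical trace: request at t=0, first token at 0.52s,
# then 82 more tokens arriving over exactly one second.
times = [0.52 + i / 82 for i in range(83)]
ttft, tps = ttft_and_tps(0.0, times)
```

TTFT governs perceived responsiveness; TPS governs how long a long answer takes to finish once it has started.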
Function calling lets the model invoke external tools or APIs during its reasoning process, making it the foundation of LLM agents.
Tip: Clear, specific tool descriptions are critical. The model's ability to call tools correctly depends entirely on how well you describe them.
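A minimal sketch of the loop, using a hypothetical get_weather tool described in the OpenAI-style "tools" JSON Schema shape; the dispatcher and the weather value are stubs, not a real API:

```python
import json

# Hypothetical tool definition in the OpenAI-style "tools" format.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current temperature in Celsius for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Paris'"}
            },
            "required": ["city"],
        },
    },
}

def dispatch(tool_call: dict) -> str:
    """Route a model-emitted tool call to a local function (stubbed here)."""
    args = json.loads(tool_call["arguments"])  # model emits arguments as JSON text
    if tool_call["name"] == "get_weather":
        return f"21°C in {args['city']}"  # stub instead of a real weather API
    raise ValueError(f"unknown tool: {tool_call['name']}")

# Simulate the model asking to call the tool; the result is sent back
# to the model in a follow-up message so it can compose its final answer.
result = dispatch({"name": "get_weather", "arguments": '{"city": "Paris"}'})
```

The model never executes anything itself: it emits a structured call, your code runs it, and the result goes back into the conversation.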
Fine-tuning updates the base model's weights on your own training data, permanently adapting its behavior and style.
Tip: Exhaust prompt engineering and RAG first. Fine-tuning is the last resort, not the first instinct.
RAG injects relevant external knowledge into the model's prompt at request time using semantic vector search, without retraining the model.
Tip: Use a small embeddings model (e.g., text-embedding-3-small at $0.02/1M tokens) — not a large chat model — for the embedding step.
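The retrieval step reduces to nearest-neighbor search over embedding vectors. A toy sketch with hand-made 3-d vectors standing in for real embedding-model output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy 3-d vectors standing in for real embedding output (typically 1536+ dims).
corpus = {
    "returns policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "how do I return an item?"

best = max(corpus, key=lambda doc: cosine(corpus[doc], query))
# The text of `best` is then pasted into the chat prompt as grounding context.
```

In production the vectors come from an embeddings endpoint and live in a vector store, but the ranking step is exactly this similarity comparison.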
Streaming delivers the model's response token-by-token via Server-Sent Events (SSE) as each token is generated, instead of waiting for the complete response.
Tip: Always enable streaming for user-facing interfaces. Only use non-streaming for background pipelines.
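Client-side, a stream is just a sequence of data: lines to parse and concatenate. A minimal sketch assuming OpenAI-style delta chunks:

```python
import json

def accumulate_sse(lines):
    """Reassemble the full reply text from 'data: {...}' SSE lines."""
    parts = []
    for line in lines:
        payload = line.removeprefix("data: ").strip()
        if payload == "[DONE]":  # sentinel marking the end of the stream
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        parts.append(delta.get("content", ""))  # some chunks carry no text
    return "".join(parts)

stream = [
    'data: {"choices":[{"delta":{"content":"Sure!"}}]}',
    'data: {"choices":[{"delta":{"content":" Here is the"}}]}',
    'data: [DONE]',
]
reply = accumulate_sse(stream)
```

Rendering each delta as it arrives is what makes streaming feel instantaneous even when the full answer takes seconds to generate.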
Every chat completion request follows the same lifecycle over HTTPS. Understanding this flow helps you estimate costs, debug errors, and design efficient applications:
The client sends a JSON payload over HTTPS and listens for a Server-Sent Events (SSE) stream.
POST /v1/chat/completions HTTP/1.1
Content-Type: application/json

{
  "model": "gpt-4o",
  "messages": [
    { "role": "system",
      "content": "You are a helpful assistant." },
    { "role": "user",
      "content": "Summarise this article..." }
  ],
  "max_tokens": 500,
  "temperature": 0.7,
  "stream": true
}

data: {"choices":[{"delta":{"content":"Sure!"}}]}
data: {"choices":[{"delta":{"content":" Here is the"}}]}
// ... more chunks ...
data: {"choices":[{"finish_reason":"stop"}],"usage":{"prompt_tokens":70,"completion_tokens":150,"total_tokens":220}}
data: [DONE]

Example: (70 × $2.50/1M) + (150 × $10.00/1M) = $0.000175 + $0.0015 = $0.001675 per call
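The same arithmetic can be computed directly from the usage block of the final chunk; the prices are the example's illustrative GPT-4o rates:

```python
def bill(usage: dict, input_per_m: float, output_per_m: float) -> float:
    """Price one call from the `usage` block of the final stream chunk."""
    return (usage["prompt_tokens"] * input_per_m
            + usage["completion_tokens"] * output_per_m) / 1_000_000

usage = {"prompt_tokens": 70, "completion_tokens": 150, "total_tokens": 220}
cost = bill(usage, 2.50, 10.00)  # illustrative $2.50 / $10.00 per 1M tokens
```

Logging this per-call figure from every response is the simplest way to reconcile your own metering against the provider's invoice.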