LLMComparator
Live pricing — updated regularly

Compare LLMAPI Prices

Make data-driven decisions about which AI model fits your budget.

🧠 25 Models
🏢 6 Providers
📡 Live Pricing
100% Free

25 models found

OpenAI

GPT-4o

Most Capable

OpenAI's flagship multimodal model combining GPT-4 intelligence with vision, audio, and image capabilities in a fast and affordable package.

Input$2.50
Output$10.00

/ 1M tokens

128K tokens

520ms TTFT

82 tok/s

Vision · Functions · Streaming · Code
Official pricing
OpenAI

GPT-4o mini

Affordable and efficient small model that outperforms GPT-3.5 Turbo, designed for lightweight tasks with fast response times.

Input$0.150
Output$0.600

/ 1M tokens

128K tokens

290ms TTFT

125 tok/s

Functions · Streaming · Code
Official pricing
OpenAI

GPT-4 Turbo

High-capability GPT-4 model with a 128k context window, optimized for complex tasks requiring deep reasoning and comprehensive understanding.

Input$10.00
Output$30.00

/ 1M tokens

128K tokens

780ms TTFT

42 tok/s

Vision · Functions · Streaming · Code
Official pricing
OpenAI

o1 Preview

OpenAI's advanced reasoning model that thinks before answering, excelling at complex math, science, and coding problems through extended internal deliberation.

Input$15.00
Output$60.00

/ 1M tokens

128K tokens

5.2s TTFT

24 tok/s

Reasoning · Code
Official pricing
OpenAI

o1 mini

Faster and more cost-efficient reasoning model optimized for STEM tasks, offering strong performance on coding and math at a fraction of o1 Preview's cost.

Input$3.00
Output$12.00

/ 1M tokens

128K tokens

2.9s TTFT

38 tok/s

Reasoning · Code
Official pricing
OpenAI

text-embedding-3-small

Highly efficient embedding model with strong multilingual performance, significantly outperforming the second generation ada embedding model.

Input$0.020
Output$0

/ 1M tokens

8K tokens

45ms TTFT

Official pricing
OpenAI

text-embedding-3-large

Most capable embedding model for English and multi-lingual tasks, producing larger vector representations for higher retrieval accuracy.

Input$0.130
Output$0

/ 1M tokens

8K tokens

75ms TTFT

Official pricing
Anthropic

Claude 3.5 Sonnet

Anthropic's most intelligent model yet, setting new industry benchmarks for graduate-level reasoning, code, and vision tasks with an industry-leading 200k context window.

Input$3.00
Output$15.00

/ 1M tokens

200K tokens

610ms TTFT

72 tok/s

Vision · Streaming · Code · Long Context
Official pricing
Anthropic

Claude 3.5 Haiku

Anthropic's fastest and most compact model, offering best-in-class speed with exceptional coding skills and instruction following at a low cost.

Input$0.800
Output$4.00

/ 1M tokens

200K tokens

310ms TTFT

105 tok/s

Streaming · Code
Official pricing
Anthropic

Claude 3 Opus

Anthropic's most powerful model for highly complex tasks, offering top-level performance on a wide range of evaluations including nuanced content creation at the highest intelligence.

Input$15.00
Output$75.00

/ 1M tokens

200K tokens

1.2s TTFT

28 tok/s

Vision · Reasoning · Long Context
Official pricing
Anthropic

Claude 3 Haiku

The fastest and most compact model in the Claude 3 family, near-instant responsiveness and strong performance at an accessible price point for high-throughput applications.

Input$0.250
Output$1.25

/ 1M tokens

200K tokens

390ms TTFT

95 tok/s

Vision · Streaming
Official pricing
Google

Gemini 1.5 Pro

Google's best-performing multimodal model with a breakthrough 1M token context window for processing long documents, videos, and code repositories in a single prompt.

Input$3.50
Output$10.50

/ 1M tokens

1M tokens

720ms TTFT

63 tok/s

Vision · Long Context · Code · Functions
Official pricing
Google

Gemini 1.5 Flash

Google's fastest multimodal model optimized for high-volume tasks. Inherits Gemini 1.5 Pro's 1M context window at a fraction of the cost through model distillation.

Input$0.075
Output$0.300

/ 1M tokens

1M tokens

340ms TTFT

115 tok/s

Vision · Long Context · Streaming
Official pricing
Google

Gemini 1.0 Pro

Google's first-generation production model optimized for natural language tasks, multi-turn conversations, and code generation with reliable performance.

Input$0.500
Output$1.50

/ 1M tokens

33K tokens

580ms TTFT

68 tok/s

Functions · Streaming
Official pricing
Google

Gemini Pro Vision

Google's multimodal model capable of understanding images and text simultaneously, enabling visual question answering, image captioning, and document analysis.

Input$0.500
Output$1.50

/ 1M tokens

16K tokens

820ms TTFT

52 tok/s

Vision · Streaming
Official pricing
Meta

Llama 3 70B (API)

Meta's flagship open-source large language model available via hosted API (Together AI). Strong reasoning and instruction-following with competitive performance against proprietary models.

Input$0.900
Output$0.900

/ 1M tokens

8K tokens

420ms TTFT

78 tok/s

Streaming · Code
Official pricing
Meta

Llama 3 8B (API)

Best Value

Compact and efficient open-source model from Meta, available via hosted API. Ideal for lightweight tasks requiring fast inference at minimal cost.

Input$0.200
Output$0.200

/ 1M tokens

8K tokens

190ms TTFT

160 tok/s

Streaming · Code
Official pricing
Meta

Llama 3 (Self-hosted)

Self-hosted via Ollama or similar runtime. Zero API costs — you only pay for your own hardware/cloud infrastructure.

FREE
Self-hosted · Open Source

8K tokens

Hardware dependent

Streaming · Code
Official pricing
Mistral

Mistral Large

Mistral's flagship model with top-tier reasoning capabilities, native function calling, and deep code generation across 80+ programming languages.

Input$8.00
Output$24.00

/ 1M tokens

33K tokens

640ms TTFT

58 tok/s

Functions · Code · Streaming
Official pricing
Mistral

Mistral Small

Cost-efficient Mistral model for bulk operations with low latency. Ideal for classification, customer support, and text generation tasks at scale.

Input$1.00
Output$3.00

/ 1M tokens

33K tokens

350ms TTFT

98 tok/s

Functions · Streaming
Official pricing
Mistral

Mistral 7B Instruct

Fine-tuned instruction-following version of Mistral 7B, offering outstanding performance for its size on coding, reasoning, and multi-turn dialogue tasks.

Input$0.250
Output$0.250

/ 1M tokens

33K tokens

270ms TTFT

128 tok/s

Streaming · Code
Official pricing
Mistral

Mistral 7B (Self-hosted)

Self-hosted Mistral 7B via Ollama or llama.cpp. Efficient enough to run on consumer GPUs — zero API cost, complete data privacy.

FREE
Self-hosted · Open Source

33K tokens

Hardware dependent

Streaming · Code
Official pricing
Cohere

Command R+

Cohere's most powerful model, optimized for enterprise RAG and tool use. State-of-the-art retrieval-augmented generation with multi-hop reasoning and grounded responses.

Input$3.00
Output$15.00

/ 1M tokens

128K tokens

690ms TTFT

55 tok/s

Functions · Long Context · Streaming
Official pricing
Cohere

Command R

Cohere's balanced model for retrieval-augmented generation and complex reasoning tasks. Optimized for long context search and tool use at a lower price point.

Input$0.500
Output$1.50

/ 1M tokens

128K tokens

490ms TTFT

74 tok/s

Functions · Long Context · Streaming
Official pricing
Cohere

Command Light

Cohere's lightest chat model for high-throughput, low-latency applications. Fast response times make it ideal for chatbots and summarization pipelines.

Input$0.300
Output$0.600

/ 1M tokens

4K tokens

240ms TTFT

135 tok/s

Streaming
Official pricing


FAQ

Frequently Asked Questions

Everything you need to know to choose the right LLM for your project.

How to Choose the Right Model

5 questions

Six dimensions determine which LLM fits your use case best. Evaluating them in order quickly eliminates poor matches:

  • 💰 Cost — Input vs. output price per 1M tokens. Output tokens are typically 3–10× more expensive.
  • 📏 Context window — Maximum tokens processed per request (input + output combined).
  • ⚡ Latency — TTFT for interactive chat; TPS for batch throughput.
  • 🧠 Capabilities — Vision, function calling, reasoning, code generation, fine-tuning support.
  • 🔒 Privacy — Cloud APIs send data to third parties; local models stay on your hardware.
  • 🔗 Reliability — Uptime SLAs, rate limits, SDK maturity, and long-term provider commitment.
💡

Tip: Start with cost × context window. These two alone eliminate 80% of mismatches before you need to compare capabilities.

Every API call generates two separate costs, both measured per million tokens:

  • Input tokens: everything you send — system prompt + conversation history + user message.
  • Output tokens: what the model generates — the completion, answer, or generated content.
  • Output is usually priced 3–10× higher (e.g., GPT-4o: $2.50 input / $10.00 output per 1M tokens).
  • Long chat histories make input cost dominant; generation tasks make output cost dominant.
  • Setting max_tokens controls the maximum output spend per request.
💡

Tip: Set max_tokens conservatively. Capping at 500 tokens can reduce your bill by 60–80% in chat applications.
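The two-part cost above can be sketched as a small helper. The prices are the GPT-4o rates listed in the grid; the 2,000-token history and 500-token cap are illustrative figures, not a recommendation.

```typescript
// Per-request cost: input and output are each priced per 1M tokens.
interface ModelPricing {
  inputPerMTok: number;   // USD per 1M input tokens
  outputPerMTok: number;  // USD per 1M output tokens
}

const GPT_4O: ModelPricing = { inputPerMTok: 2.5, outputPerMTok: 10.0 };

function requestCost(inputTokens: number, outputTokens: number, p: ModelPricing): number {
  // Summing before dividing keeps the arithmetic exact for whole-token counts.
  return (inputTokens * p.inputPerMTok + outputTokens * p.outputPerMTok) / 1_000_000;
}

// A chat turn with a 2,000-token history and max_tokens capped at 500:
// max_tokens bounds the output term, so this is the worst-case spend per call.
const worstCase = requestCost(2_000, 500, GPT_4O);
console.log(worstCase.toFixed(6)); // 0.010000
```

Because the cap bounds only the output term, long chat histories still grow the input term on every call.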

The context window defines the maximum amount of text the model can see in a single request — input and output combined. Since billing is per token, requests that use more of the window cost proportionally more:

  • 8K tokens — Simple chatbots, short Q&A, code completions.
  • 32K tokens — Document Q&A, multi-turn conversations, code review.
  • 128K tokens — Legal documents, large codebases, report summarization.
  • 200K+ tokens — Entire books, multi-file code analysis, complex agent tasks.
💡

Tip: You are billed for the full context on every call. Periodically summarize long conversations to keep context lean and costs low.
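One way to keep the context lean is to drop the oldest turns once the history exceeds the input budget. This is a sketch; it assumes per-turn token counts come from a real tokenizer, and a production version would typically pin the system prompt rather than drop it.

```typescript
// Keep a conversation inside the model's context window by dropping the
// oldest turns first. The window must cover input plus reserved output.
interface Turn {
  role: "system" | "user" | "assistant";
  tokens: number; // assumed to come from the provider's tokenizer
}

function trimToBudget(history: Turn[], windowTokens: number, maxOutputTokens: number): Turn[] {
  const budget = windowTokens - maxOutputTokens; // input budget
  const kept: Turn[] = [];
  let used = 0;
  // Walk from newest to oldest so recent context survives.
  for (let i = history.length - 1; i >= 0; i--) {
    if (used + history[i].tokens > budget) break;
    used += history[i].tokens;
    kept.unshift(history[i]);
  }
  return kept;
}
```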

Latency has two independent metrics that matter for very different scenarios:

  • TTFT (Time to First Token) — ms until the first character arrives. Dominant UX metric for streaming chat (target < 500ms).
  • TPS (Tokens per Second) — generation speed after first token. Determines total response time for long completions.
  • Real-time chat apps: optimize for low TTFT + enable streaming.
  • Batch pipelines (summarization, analysis): optimize for high TPS and low cost; TTFT is irrelevant.
  • Autonomous agents: both matter — each sequential tool call compounds total latency.
💡

Tip: Always benchmark in your production deployment region, not the provider's marketing benchmark region.
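The two metrics combine into a back-of-envelope total response time: TTFT plus generation time. The figures used below are the GPT-4o and o1 Preview numbers from the grid above; real latency varies by region and load.

```typescript
// Total wall-clock time for a completion: time to first token, then
// output tokens generated at the model's tokens-per-second rate.
function totalSeconds(ttftMs: number, tokensPerSec: number, outputTokens: number): number {
  return ttftMs / 1000 + outputTokens / tokensPerSec;
}

// A 500-token answer, using the grid figures:
const gpt4o = totalSeconds(520, 82, 500);      // ~6.6s
const o1Preview = totalSeconds(5200, 24, 500); // ~26s
```

The comparison shows why reasoning models feel slow in chat: both the 10× TTFT and the lower TPS compound over a long answer.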

Cloud APIs and self-hosted models are both production-ready options. The right choice depends on volume, privacy requirements, and your team's operational capacity:

  • ☁️ Cloud APIs — State-of-the-art models, zero infrastructure, instant scaling, pay-per-token. Data leaves your network.
  • 🖥️ Self-hosted — Full data privacy, no per-call cost after hardware, unlimited rate. Requires GPU + maintenance + expertise.
  • Choose cloud for: prototyping, variable load, and teams that need the latest frontier model capabilities.
  • Choose self-hosted for: >10M tokens/day at predictable load, strict data sovereignty, air-gapped environments.
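The volume threshold can be estimated with a rough break-even calculation. The $1,200/month server figure below is a hypothetical placeholder, not a quoted price; substitute your own hardware or cloud quote, and note this ignores engineering time and utilization gaps.

```typescript
// At what daily token volume does a fixed-cost GPU server beat
// pay-per-token API pricing?
function breakEvenTokensPerDay(serverUsdPerMonth: number, blendedUsdPerMTok: number): number {
  const serverUsdPerDay = serverUsdPerMonth / 30;
  return (serverUsdPerDay / blendedUsdPerMTok) * 1_000_000;
}

// Llama 3 70B via API at $0.90/1M (input = output) vs a hypothetical
// $1,200/month GPU server: break-even lands around 44M tokens/day.
const tokensPerDay = breakEvenTokensPerDay(1200, 0.9);
```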

Technical Glossary

7 questions

A token is the smallest text unit that an LLM processes. All LLM pricing is denominated in tokens — not words, sentences, or characters:

  • 1 token ≈ ¾ word in English, ≈ 4 characters.
  • "Hello world" = 2 tokens. "Supercalifragilistic" = 6 tokens.
  • Code and non-Latin scripts (Chinese, Arabic) consume substantially more tokens per character.
  • Rule of thumb: 1,000 tokens ≈ 750 English words ≈ 1.5 double-spaced pages.
  • A 10-page PDF is roughly 3,000–5,000 tokens.
💡

Tip: Use the provider's tokenizer tool before estimating costs. Token count varies significantly by language and content type.
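The rules of thumb above can be turned into a quick estimator. This heuristic only holds for English prose; for real budgeting use the provider's tokenizer (e.g. tiktoken for OpenAI models), which counts code and non-Latin scripts very differently.

```typescript
// Back-of-envelope token estimate using the ≈4 characters/token heuristic.
// Rough budgeting only; a real tokenizer is authoritative.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// 1 token ≈ ¾ of an English word, so 1,000 tokens ≈ 750 words.
function estimateWords(tokens: number): number {
  return Math.round(tokens * 0.75);
}
```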

TTFT is the elapsed clock time from sending an API request to receiving the very first output token:

  • Measured in milliseconds (ms). Common range: 200–2,000ms depending on model and server load.
  • Components: network round-trip + server queuing + prompt processing + first token generation.
  • Under 300ms feels instant; above 1,000ms feels sluggish in interactive UIs.
  • Longer prompts typically increase TTFT — the model must process all input before generating.
  • Key metric for: chatbots, voice assistants, real-time code completion.
💡

Tip: Deploy your backend in the same cloud region as the LLM endpoint to reduce network RTT contribution.

TPS measures how fast the model generates output tokens after the first token is emitted:

  • Common range: 30–150 TPS depending on model size and server hardware.
  • Higher TPS = shorter total response time, especially visible in 500+ token responses.
  • TPS degrades under high server load on shared inference endpoints.
  • Self-hosted instances on dedicated GPUs offer more stable, predictable throughput.
  • TPS and TTFT are independent: fast start (low TTFT) + slow generation (low TPS) is common for large models.
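Both metrics fall out of per-token arrival timestamps. The function below is a generic sketch; the timestamps would come from your own client-side instrumentation of a streamed response.

```typescript
// Derive TTFT and TPS from token arrival times (ms since request sent).
// The first token gives TTFT; TPS is measured over the generation phase
// only, so the first token is excluded from the rate.
function streamStats(arrivalsMs: number[]): { ttftMs: number; tps: number } {
  const ttftMs = arrivalsMs[0];
  const generationMs = arrivalsMs[arrivalsMs.length - 1] - arrivalsMs[0];
  const tps = generationMs > 0 ? ((arrivalsMs.length - 1) / generationMs) * 1000 : 0;
  return { ttftMs, tps };
}
```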

Function calling lets the model invoke external tools or APIs during its reasoning process, making it the foundation of LLM agents:

  • You declare available tools via JSON schema (name, description, parameter types, required fields).
  • The model signals which tool to call and with which arguments — your code actually executes it.
  • You return the tool result to the model, which incorporates it into its response.
  • Common tools: web search, database queries, code interpreters, REST APIs, calculators.
  • Supported by: GPT-4o, Claude 3.x, Gemini 1.5+, Mistral Large, Llama 3.1+.
💡

Tip: Clear, specific tool descriptions are critical. The model's ability to call tools correctly depends entirely on how well you describe them.
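The three-step loop above looks roughly like this in practice: a tool declared in the JSON-schema shape most providers accept, plus the dispatch step where your code executes the call. The add_numbers tool is a made-up example for illustration, not a provider API.

```typescript
// Tool declaration: name, description, and a JSON-schema parameter spec.
// The description is what the model reads to decide when to call it.
const tools = [
  {
    type: "function",
    function: {
      name: "add_numbers",
      description: "Add two numbers and return the sum.",
      parameters: {
        type: "object",
        properties: {
          a: { type: "number", description: "First addend" },
          b: { type: "number", description: "Second addend" },
        },
        required: ["a", "b"],
      },
    },
  },
];

// The model only signals the call; your code actually runs it and
// returns the result to the model as a tool message.
function dispatch(name: string, args: { a: number; b: number }): number {
  if (name === "add_numbers") return args.a + args.b;
  throw new Error(`Unknown tool: ${name}`);
}
```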

Fine-tuning updates the base model's weights on your own training data, permanently adapting its behavior and style:

  • ✅ Best for: consistent brand tone/voice, domain-specific formatting, output structure standardization.
  • ❌ Avoid for: adding new factual knowledge — use RAG instead.
  • Requires hundreds to thousands of labeled (prompt → ideal response) example pairs.
  • Fine-tuned models cost 1.5–3× more per token than base models.
  • Any change in requirements needs a full new training run — RAG is far easier to iterate.
💡

Tip: Exhaust prompt engineering and RAG first. Fine-tuning is the last resort, not the first instinct.

RAG injects relevant external knowledge into the model's prompt at request time using semantic vector search, without retraining the model:

  1. Documents → split into chunks → converted to embeddings (dense numerical vectors).
  2. Embeddings stored in a vector database (pgvector, Pinecone, Chroma, Weaviate).
  3. At query time: user question → embed → find nearest document chunks by cosine similarity.
  4. Retrieved chunks are added to the prompt as context before the model generates a response.
  • Much cheaper and faster to update than fine-tuning — just re-embed changed documents.
💡

Tip: Use a small embeddings model (e.g., text-embedding-3-small at $0.02/1M tokens) — not a large chat model — for the embedding step.
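The retrieval step (step 3) reduces to cosine similarity plus a sort. The 3-dimensional vectors below are hand-made toys for illustration; real embeddings have hundreds to thousands of dimensions and come from an embedding model.

```typescript
// Cosine similarity: dot product of the vectors over the product of
// their magnitudes. 1 = same direction, 0 = orthogonal.
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k chunks most similar to the query embedding.
function topK(query: number[], chunks: { text: string; vec: number[] }[], k: number): string[] {
  return [...chunks]
    .sort((x, y) => cosine(query, y.vec) - cosine(query, x.vec))
    .slice(0, k)
    .map((c) => c.text);
}
```

A vector database does the same ranking, but with indexes (e.g. approximate nearest-neighbor search) so it scales past a linear scan.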

Streaming delivers the model's response token-by-token via Server-Sent Events (SSE) as each token is generated, instead of waiting for the complete response:

  • Reduces perceived latency from "full completion time" to "first token time" — often a 10–30× reduction in perceived wait.
  • Supported by virtually all major providers: OpenAI, Anthropic, Google, Mistral, Cohere.
  • Same cost as non-streaming — billed by tokens used, not by request duration.
  • Enables typewriter animations, real-time UI updates, and early error recovery in the client.
  • Non-streaming is simpler for background batch pipelines where no user is waiting.
💡

Tip: Always enable streaming for user-facing interfaces. Only use non-streaming for background pipelines.
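On the client, consuming an SSE stream is a matter of skipping non-data lines, stopping at the [DONE] sentinel, and concatenating the deltas. This sketch assumes the OpenAI-style chunk shape; other providers use slightly different event formats.

```typescript
// Reassemble the model's text from raw SSE lines. Lines of interest look
// like `data: {json}`; the terminal sentinel is `data: [DONE]`.
function assembleStream(sseLines: string[]): string {
  let out = "";
  for (const line of sseLines) {
    if (!line.startsWith("data: ")) continue; // skip blanks and comments
    const payload = line.slice("data: ".length);
    if (payload === "[DONE]") break;
    const chunk = JSON.parse(payload);
    // The final chunk carries finish_reason/usage and no delta content.
    out += chunk.choices?.[0]?.delta?.content ?? "";
  }
  return out;
}
```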

How LLM APIs Work

1 question

Every chat completion request follows the same lifecycle over HTTPS. Understanding this flow helps you estimate costs, debug errors, and design efficient applications:

💻 Your App
openai.chat.completions.create({
  model, messages, stream: true
})

Sends a JSON payload over HTTPS and listens for a Server-Sent Events (SSE) stream.

HTTP Request
Streaming Response
🤖 LLM API Provider
  1. Receive & validate request
  2. Tokenize input
  3. Run model inference
  4. Detokenize & stream output
  5. Track usage & bill
HTTP Request
POST /v1/chat/completions HTTP/1.1
Content-Type: application/json

{
  "model": "gpt-4o",
  "messages": [
    { "role": "system",
      "content": "You are a helpful assistant." },
    { "role": "user",
      "content": "Summarise this article..." }
  ],
  "max_tokens": 500,
  "temperature": 0.7,
  "stream": true
}
Streaming Response
data: {"choices":[{"delta":
  {"content":"Sure!"}}]}

data: {"choices":[{"delta":
  {"content":" Here is the"}}]}

// ... more chunks ...

data: {"choices":[{"finish_reason":"stop"}],
  "usage":{
    "prompt_tokens": 70,
    "completion_tokens": 150,
    "total_tokens": 220
  }}

data: [DONE]
💰 Cost Breakdown
(input tokens × input price) + (output tokens × output price) = total cost

Example: (70 × $2.50/1M) + (150 × $10.00/1M) = $0.000175 + $0.0015 = $0.001675 per call
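The worked example maps directly onto the usage object in the final stream chunk. The prices are the GPT-4o rates used in this example; billing for other models substitutes their own rates.

```typescript
// The shape of the `usage` object in the final SSE chunk above.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

// (input tokens × input price) + (output tokens × output price),
// with both prices quoted in USD per 1M tokens.
function usageCostUsd(u: Usage, inputPerMTok: number, outputPerMTok: number): number {
  return (u.prompt_tokens * inputPerMTok + u.completion_tokens * outputPerMTok) / 1_000_000;
}

const cost = usageCostUsd(
  { prompt_tokens: 70, completion_tokens: 150, total_tokens: 220 },
  2.5,  // GPT-4o input $/1M
  10.0, // GPT-4o output $/1M
);
// cost === 0.001675, matching the worked example above
```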