GPT-4o
OpenAI's flagship multimodal model combining GPT-4 intelligence with vision, audio, and image capabilities in a fast and affordable package.
128K context · 520ms TTFT · 82 tok/s
Make data-driven decisions about which AI model fits your budget.
25 models found
Affordable and efficient small model that outperforms GPT-3.5 Turbo, designed for lightweight tasks with fast response times.
128K context · 290ms TTFT · 125 tok/s
High-capability GPT-4 model with a 128k context window, optimized for complex tasks requiring deep reasoning and comprehensive understanding.
128K context · 780ms TTFT · 42 tok/s
OpenAI's advanced reasoning model that thinks before answering, excelling at complex math, science, and coding problems through extended internal deliberation.
128K context · 5.2s TTFT · 24 tok/s
Faster and more cost-efficient reasoning model optimized for STEM tasks, offering strong performance on coding and math at a fraction of o1 Preview's cost.
128K context · 2.9s TTFT · 38 tok/s
Highly efficient embedding model with strong multilingual performance, significantly outperforming the second generation ada embedding model.
8K context · 45ms TTFT
Most capable embedding model for English and multi-lingual tasks, producing larger vector representations for higher retrieval accuracy.
8K context · 75ms TTFT
Anthropic's most intelligent model yet, setting new industry benchmarks for graduate-level reasoning, code, and vision tasks with an industry-leading 200k context window.
200K context · 610ms TTFT · 72 tok/s
Anthropic's fastest and most compact model, offering best-in-class speed with exceptional coding skills and instruction following at a low cost.
200K context · 310ms TTFT · 105 tok/s
Anthropic's most powerful model for highly complex tasks, delivering top-level performance across a wide range of evaluations, including nuanced content creation at the highest level of intelligence.
200K context · 1.2s TTFT · 28 tok/s
The fastest and most compact model in the Claude 3 family, offering near-instant responsiveness and strong performance at an accessible price point for high-throughput applications.
200K context · 390ms TTFT · 95 tok/s
Google's best-performing multimodal model with a breakthrough 1M token context window for processing long documents, videos, and code repositories in a single prompt.
1M context · 720ms TTFT · 63 tok/s
Google's fastest multimodal model optimized for high-volume tasks. Inherits Gemini 1.5 Pro's 1M context window at a fraction of the cost through model distillation.
1M context · 340ms TTFT · 115 tok/s
Google's first-generation production model optimized for natural language tasks, multi-turn conversations, and code generation with reliable performance.
33K context · 580ms TTFT · 68 tok/s
Google's multimodal model capable of understanding images and text simultaneously, enabling visual question answering, image captioning, and document analysis.
16K context · 820ms TTFT · 52 tok/s
Meta's flagship open-source large language model available via hosted API (Together AI). Strong reasoning and instruction-following with competitive performance against proprietary models.
8K context · 420ms TTFT · 78 tok/s
Compact and efficient open-source model from Meta, available via hosted API. Ideal for lightweight tasks requiring fast inference at minimal cost.
8K context · 190ms TTFT · 160 tok/s
Self-hosted via Ollama or similar runtime. Zero API costs — you only pay for your own hardware/cloud infrastructure.
8K context · throughput hardware dependent
Mistral's flagship model with top-tier reasoning capabilities, native function calling, and deep code generation across 80+ programming languages.
33K context · 640ms TTFT · 58 tok/s
Cost-efficient Mistral model for bulk operations with low latency. Ideal for classification, customer support, and text generation tasks at scale.
33K context · 350ms TTFT · 98 tok/s
Fine-tuned instruction-following version of Mistral 7B, offering outstanding performance for its size on coding, reasoning, and multi-turn dialogue tasks.
33K context · 270ms TTFT · 128 tok/s
Self-hosted Mistral 7B via Ollama or llama.cpp. Efficient enough to run on consumer GPUs — zero API cost, complete data privacy.
33K context · throughput hardware dependent
Cohere's most powerful model, optimized for enterprise RAG and tool use. State-of-the-art retrieval-augmented generation with multi-hop reasoning and grounded responses.
128K context · 690ms TTFT · 55 tok/s
Cohere's balanced model for retrieval-augmented generation and complex reasoning tasks. Optimized for long context search and tool use at a lower price point.
128K context · 490ms TTFT · 74 tok/s
Cohere's lightest chat model for high-throughput, low-latency applications. Fast response times make it ideal for chatbots and summarization pipelines.
4K context · 240ms TTFT · 135 tok/s
Everything you need to know to choose the right LLM for your project.
Six dimensions determine which LLM fits your use case best; evaluating them in order quickly eliminates poor matches.
Tip: Start with cost × context window. These two alone eliminate 80% of mismatches before you need to compare capabilities.
Every API call generates two separate costs, both measured per million tokens: one for input (prompt) tokens and one for output (completion) tokens.
Tip: Set max_tokens conservatively. Capping at 500 tokens can reduce your bill by 60–80% in chat applications.
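As a sketch of how the two line items combine — the per-1M prices below are illustrative placeholders, not anyone's current list prices:

```python
def call_cost(prompt_tokens: int, completion_tokens: int,
              input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the USD cost of one API call, billed per 1M tokens."""
    return (prompt_tokens * input_price_per_m
            + completion_tokens * output_price_per_m) / 1_000_000

# Capping max_tokens directly caps the output side of the bill.
uncapped = call_cost(70, 2000, 2.50, 10.00)  # long, uncapped reply
capped = call_cost(70, 500, 2.50, 10.00)     # same prompt, max_tokens=500
```

Because output tokens are typically priced several times higher than input tokens, the output cap dominates the savings.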
The context window defines the maximum amount of text the model can see in a single request — input and output combined. Larger windows cost proportionally more.
Tip: You are billed for the full context on every call. Periodically summarize long conversations to keep context lean and costs low.
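Since the full history is rebilled on every turn, summarization savings compound over a conversation. A sketch with made-up turn sizes:

```python
def conversation_input_tokens(turn_tokens, history_cap=None):
    """Total input tokens billed across a chat where every turn resends
    the full history, optionally kept under a summarization cap."""
    total, history = 0, 0
    for t in turn_tokens:
        history += t
        if history_cap is not None:
            history = min(history, history_cap)  # summarize down to the cap
        total += history  # each call is billed for the whole context
    return total

turns = [500] * 10  # ten turns of ~500 tokens each (hypothetical)
full = conversation_input_tokens(turns)                   # resend everything
lean = conversation_input_tokens(turns, history_cap=1500) # keep context lean
```

In this toy run the capped conversation bills roughly half the input tokens of the uncapped one, and the gap widens as the chat grows.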
Latency has two independent metrics that matter for very different scenarios: time to first token (TTFT) and output tokens per second (TPS).
Tip: Always benchmark in your production deployment region, not the provider's marketing benchmark region.
Both are production-ready options. The right choice depends on volume, privacy requirements, and your team's operational capacity:
A token is the smallest text unit that an LLM processes. All LLM pricing is denominated in tokens, not words, sentences, or characters.
Tip: Use the provider's tokenizer tool before estimating costs. Token count varies significantly by language and content type.
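Absent the real tokenizer, a common rule of thumb is roughly 4 characters per English token. The sketch below uses that heuristic and should be treated as a rough estimate only, not a substitute for the provider's tokenizer:

```python
def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Crude heuristic (~4 chars/token for English prose). Real counts
    come from the provider's tokenizer and vary by language and content."""
    return max(1, round(len(text) / chars_per_token))

estimate = rough_token_estimate("The quick brown fox jumps over the lazy dog.")
```

Code, non-Latin scripts, and whitespace-heavy text can deviate from this ratio substantially, which is why the tip above points at the real tokenizer.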
TTFT is the elapsed wall-clock time from sending an API request to receiving the very first output token.
Tip: Deploy your backend in the same cloud region as the LLM endpoint to reduce network RTT contribution.
TPS measures how fast the model generates output tokens after the first token is emitted.
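Both metrics fall out of simple arithmetic over a streamed response's timestamps. The timestamps below are fabricated to mirror the GPT-4o figures listed above:

```python
def ttft_and_tps(request_sent: float, token_times: list) -> tuple:
    """Compute time-to-first-token (seconds) and decode-phase tokens/sec
    from per-token arrival timestamps of a streamed response."""
    ttft = token_times[0] - request_sent
    # TPS is measured over the decode phase, i.e. after the first token.
    decode_seconds = token_times[-1] - token_times[0]
    tps = (len(token_times) - 1) / decode_seconds
    return ttft, tps

# Hypothetical trace: request at t=0, first token at 0.52s,
# then 82 more tokens arriving over exactly one second.
times = [0.52 + i / 82 for i in range(83)]
ttft, tps = ttft_and_tps(0.0, times)
```

TTFT governs perceived responsiveness; TPS governs how long a long answer takes to finish once it has started.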
Function calling lets the model invoke external tools or APIs during its reasoning process, making it the foundation of LLM agents.
Tip: Clear, specific tool descriptions are critical. The model's ability to call tools correctly depends entirely on how well you describe them.
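A minimal sketch of the loop, using a hypothetical get_weather tool described in the OpenAI-style "tools" JSON Schema shape; the dispatcher and the weather value are stubs, not a real API:

```python
import json

# Hypothetical tool definition in the OpenAI-style "tools" format.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current temperature in Celsius for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Paris'"}
            },
            "required": ["city"],
        },
    },
}

def dispatch(tool_call: dict) -> str:
    """Route a model-emitted tool call to a local function (stubbed here)."""
    args = json.loads(tool_call["arguments"])  # model emits arguments as JSON text
    if tool_call["name"] == "get_weather":
        return f"21°C in {args['city']}"  # stub instead of a real weather API
    raise ValueError(f"unknown tool: {tool_call['name']}")

# Simulate the model asking to call the tool; the result is sent back
# to the model in a follow-up message so it can compose its final answer.
result = dispatch({"name": "get_weather", "arguments": '{"city": "Paris"}'})
```

The model never executes anything itself: it emits a structured call, your code runs it, and the result goes back into the conversation.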
Fine-tuning updates the base model's weights on your own training data, permanently adapting its behavior and style.
Tip: Exhaust prompt engineering and RAG first. Fine-tuning is the last resort, not the first instinct.
RAG injects relevant external knowledge into the model's prompt at request time using semantic vector search, without retraining the model.
Tip: Use a small embeddings model (e.g., text-embedding-3-small at $0.02/1M tokens) — not a large chat model — for the embedding step.
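The retrieval step reduces to nearest-neighbor search over embedding vectors. A toy sketch with hand-made 3-d vectors standing in for real embedding-model output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy 3-d vectors standing in for real embedding output (typically 1536+ dims).
corpus = {
    "returns policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "how do I return an item?"

best = max(corpus, key=lambda doc: cosine(corpus[doc], query))
# The text of `best` is then pasted into the chat prompt as grounding context.
```

In production the vectors come from an embeddings endpoint and live in a vector store, but the ranking step is exactly this similarity comparison.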
Streaming delivers the model's response token-by-token via Server-Sent Events (SSE) as each token is generated, instead of waiting for the complete response.
Tip: Always enable streaming for user-facing interfaces. Only use non-streaming for background pipelines.
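Client-side, a stream is just a sequence of data: lines to parse and concatenate. A minimal sketch assuming OpenAI-style delta chunks:

```python
import json

def accumulate_sse(lines):
    """Reassemble the full reply text from 'data: {...}' SSE lines."""
    parts = []
    for line in lines:
        payload = line.removeprefix("data: ").strip()
        if payload == "[DONE]":  # sentinel marking the end of the stream
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        parts.append(delta.get("content", ""))  # some chunks carry no text
    return "".join(parts)

stream = [
    'data: {"choices":[{"delta":{"content":"Sure!"}}]}',
    'data: {"choices":[{"delta":{"content":" Here is the"}}]}',
    'data: [DONE]',
]
reply = accumulate_sse(stream)
```

Rendering each delta as it arrives is what makes streaming feel instantaneous even when the full answer takes seconds to generate.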
Every chat completion request follows the same lifecycle over HTTPS. Understanding this flow helps you estimate costs, debug errors, and design efficient applications:
The client sends a JSON payload over HTTPS and listens for a Server-Sent Events (SSE) stream.
POST /v1/chat/completions HTTP/1.1
Content-Type: application/json

{
  "model": "gpt-4o",
  "messages": [
    { "role": "system",
      "content": "You are a helpful assistant." },
    { "role": "user",
      "content": "Summarise this article..." }
  ],
  "max_tokens": 500,
  "temperature": 0.7,
  "stream": true
}

data: {"choices":[{"delta":{"content":"Sure!"}}]}
data: {"choices":[{"delta":{"content":" Here is the"}}]}
// ... more chunks ...
data: {"choices":[{"finish_reason":"stop"}],"usage":{"prompt_tokens":70,"completion_tokens":150,"total_tokens":220}}
data: [DONE]

Example: (70 × $2.50/1M) + (150 × $10.00/1M) = $0.000175 + $0.0015 = $0.001675 per call
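The same arithmetic can be computed directly from the usage block of the final chunk; the prices are the example's illustrative GPT-4o rates:

```python
def bill(usage: dict, input_per_m: float, output_per_m: float) -> float:
    """Price one call from the `usage` block of the final stream chunk."""
    return (usage["prompt_tokens"] * input_per_m
            + usage["completion_tokens"] * output_per_m) / 1_000_000

usage = {"prompt_tokens": 70, "completion_tokens": 150, "total_tokens": 220}
cost = bill(usage, 2.50, 10.00)  # illustrative $2.50 / $10.00 per 1M tokens
```

Logging this per-call figure from every response is the simplest way to reconcile your own metering against the provider's invoice.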