Vision Models

anthropic
$10.00/1M

Anthropic: Claude Fable 5

Claude Fable 5 is a Mythos-class model from Anthropic, built for autonomous knowledge work...

πŸ“ 1,000,000 ctx Compare →
nex-agi
Free/1M

Nex AGI: Nex-N2-Pro (free)

Nex-N2-Pro is an agentic mixture-of-experts model from Nex AGI, with 17B active parameters...

πŸ“ 262,144 ctx Compare →
qwen
$0.40/1M

Qwen: Qwen3.7 Plus

Qwen3.7-Plus is a cost-effective model in Alibaba's Qwen3.7 series. It supports text and i...

πŸ“ 1,000,000 ctx Compare →
minimax
$0.30/1M

MiniMax: MiniMax M3

MiniMax-M3 is a multimodal foundation model from MiniMax. It supports text, image, and vid...

πŸ“ 1,048,576 ctx Compare →
stepfun
$0.20/1M

StepFun: Step 3.7 Flash

Step 3.7 Flash is StepFun's latest high-efficiency multimodal Mixture-of-Experts model. It...

πŸ“ 256,000 ctx Compare →
anthropic
$5.00/1M

Anthropic: Claude Opus 4.8

Claude Opus 4.8 is Anthropic's most capable generally available model in the Opus family. ...

πŸ“ 1,000,000 ctx Compare →
x-ai
$1.00/1M

xAI: Grok Build 0.1

Grok Build 0.1 is xAI's fast coding model trained specifically for agentic software engi...

πŸ“ 256,000 ctx Compare →
perceptron
$0.15/1M

Perceptron: Perceptron Mk1

Perceptron Mk1 (Mark One) is Perceptron's highest-quality vision-language model for video ...

πŸ“ 32,768 ctx Compare →
google
$0.25/1M

Google: Gemini 3.1 Flash Lite

Gemini 3.1 Flash Lite is Google's GA high-efficiency multimodal model optimized for low-...

πŸ“ 1,048,576 ctx Compare →
x-ai
$1.25/1M

xAI: Grok 4.3

Grok 4.3 is a reasoning model from xAI. It accepts text and image inputs with text output,...

πŸ“ 1,000,000 ctx Compare →
mistralai
$1.50/1M

Mistral: Mistral Medium 3.5

Mistral Medium 3.5 is a dense 128B instruction-following model from Mistral AI. It support...

πŸ“ 262,144 ctx Compare →
nvidia
Free/1M

NVIDIA: Nemotron 3 Nano Omni (free)

NVIDIA Nemotron™ 3 Nano Omni is a 30B-A3B open multimodal model designed to function as ...

πŸ“ 256,000 ctx Compare →
qwen
$0.30/1M

Qwen: Qwen3.5 Plus 2026-04-20

Qwen3.5 Plus (April 2026) is a large-scale multimodal language model from Alibaba. It acce...

πŸ“ 1,000,000 ctx Compare →
qwen
$0.19/1M

Qwen: Qwen3.6 Flash

Qwen3.6 Flash is a fast, efficient language model from Alibaba's Qwen 3.6 series. It suppo...

πŸ“ 1,000,000 ctx Compare →
qwen
$0.29/1M

Qwen: Qwen3.6 27B

Qwen3.6 27B is a dense 27-billion-parameter language model from the Qwen Team at Alibaba, ...

πŸ“ 262,144 ctx Compare →
xiaomi
$0.14/1M

Xiaomi: MiMo-V2.5

MiMo-V2.5 is a native omnimodal model by Xiaomi. It delivers Pro-level agentic performance...

πŸ“ 1,048,576 ctx Compare →
openai
$8.00/1M

OpenAI: GPT-5.4 Image 2

[GPT-5.4](https://openrouter.ai/openai/gpt-5.4) Image 2 combines OpenAI's GPT-5.4 model wi...

πŸ“ 272,000 ctx Compare →
google
Free/1M

Google: Gemma 4 31B (free)

Gemma 4 31B Instruct is Google DeepMind's 30.7B dense multimodal model supporting text and...

πŸ“ 262,144 ctx Compare →
google
$0.12/1M

Google: Gemma 4 31B

Gemma 4 31B Instruct is Google DeepMind's 30.7B dense multimodal model supporting text and...

πŸ“ 262,144 ctx Compare →
z-ai
$1.20/1M

Z.ai: GLM 5V Turbo

GLM-5V-Turbo is Z.ai's first native multimodal agent foundation model, built for vision-...

πŸ“ 202,752 ctx Compare →
rekaai
$0.10/1M

Reka Edge

Reka Edge is an extremely efficient 7B multimodal vision-language model that accepts image...

πŸ“ 16,384 ctx Compare →
openai
$0.20/1M

OpenAI: GPT-5.4 Nano

GPT-5.4 nano is the most lightweight and cost-efficient variant of the GPT-5.4 family, opt...

πŸ“ 400,000 ctx Compare →
openai
$0.75/1M

OpenAI: GPT-5.4 Mini

GPT-5.4 mini brings the core capabilities of GPT-5.4 to a faster, more efficient model opt...

πŸ“ 400,000 ctx Compare →
qwen
$0.10/1M

Qwen: Qwen3.5-9B

Qwen3.5-9B is a multimodal foundation model from the Qwen3.5 family, designed to deliver s...

πŸ“ 262,144 ctx Compare →
google
$0.50/1M

Google: Nano Banana 2 (Gemini 3.1 Flash Image Preview)

Gemini 3.1 Flash Image Preview, a.k.a. "Nano Banana 2," is Google's latest state of the ...

πŸ“ 131,072 ctx Compare →
qwen
$0.14/1M

Qwen: Qwen3.5-35B-A3B

The Qwen3.5 Series 35B-A3B is a native vision-language model designed with a hybrid archit...

πŸ“ 262,144 ctx Compare →
qwen
$0.20/1M

Qwen: Qwen3.5-27B

The Qwen3.5 27B native vision-language Dense model incorporates a linear attention mechani...

πŸ“ 262,144 ctx Compare →
qwen
$0.26/1M

Qwen: Qwen3.5-122B-A10B

The Qwen3.5 122B-A10B native vision-language model is built on a hybrid architecture that ...

πŸ“ 262,144 ctx Compare →
qwen
$0.07/1M

Qwen: Qwen3.5-Flash

The Qwen3.5 native vision-language Flash models are built on a hybrid architecture that in...

πŸ“ 1,000,000 ctx Compare →
qwen
$0.26/1M

Qwen: Qwen3.5 Plus 2026-02-15

The Qwen3.5 native vision-language series Plus models are built on a hybrid architecture t...

πŸ“ 1,000,000 ctx Compare →
qwen
$0.39/1M

Qwen: Qwen3.5 397B A17B

The Qwen3.5 series 397B-A17B native vision-language model is built on a hybrid architectur...

πŸ“ 262,144 ctx Compare →
z-ai
$0.30/1M

Z.ai: GLM 4.6V

GLM-4.6V is a large multimodal model designed for high-fidelity visual understanding and l...

πŸ“ 131,072 ctx Compare →
amazon
$0.30/1M

Amazon: Nova 2 Lite

Nova 2 Lite is a fast, cost-effective reasoning model for everyday workloads that can proc...

πŸ“ 1,000,000 ctx Compare →
mistralai
$0.15/1M

Mistral: Ministral 3 8B 2512

A balanced model in the Ministral 3 family, Ministral 3 8B is a powerful, efficient tiny l...

πŸ“ 262,144 ctx Compare →
mistralai
$0.10/1M

Mistral: Ministral 3 3B 2512

The smallest model in the Ministral 3 family, Ministral 3 3B is a powerful, efficient tiny...

πŸ“ 131,072 ctx Compare →
google
$2.00/1M

Google: Nano Banana Pro (Gemini 3 Pro Image Preview)

Nano Banana Pro is Google's most advanced image-generation and editing model, built on G...

πŸ“ 65,536 ctx Compare →
qwen
$0.10/1M

Qwen: Qwen3 VL 32B Instruct

Qwen3-VL-32B-Instruct is a large-scale multimodal vision-language model designed for high-...

πŸ“ 262,144 ctx Compare →
openai
$2.50/1M

OpenAI: GPT-5 Image Mini

GPT-5 Image Mini combines OpenAI's advanced language capabilities, powered by [GPT-5 Mini]...

πŸ“ 400,000 ctx Compare →
qwen
$0.08/1M

Qwen: Qwen3 VL 8B Instruct

Qwen3-VL-8B-Instruct is a multimodal vision-language model from the Qwen3-VL series, built...

πŸ“ 256,000 ctx Compare →
openai
$10.00/1M

OpenAI: GPT-5 Image

[GPT-5](https://openrouter.ai/openai/gpt-5) Image combines OpenAI's GPT-5 model with state...

πŸ“ 400,000 ctx Compare →
google
$0.30/1M

Google: Nano Banana (Gemini 2.5 Flash Image)

Gemini 2.5 Flash Image, a.k.a. "Nano Banana," is now generally available. It is a state of...

πŸ“ 32,768 ctx Compare →
qwen
$0.13/1M

Qwen: Qwen3 VL 30B A3B Thinking

Qwen3-VL-30B-A3B-Thinking is a multimodal model that unifies strong text generation with v...

πŸ“ 131,072 ctx Compare →
qwen
$0.13/1M

Qwen: Qwen3 VL 30B A3B Instruct

Qwen3-VL-30B-A3B-Instruct is a multimodal model that unifies strong text generation with v...

πŸ“ 262,144 ctx Compare →
qwen
$0.26/1M

Qwen: Qwen3 VL 235B A22B Thinking

Qwen3-VL-235B-A22B Thinking is a multimodal model that unifies strong text generation with...

πŸ“ 131,072 ctx Compare →
qwen
$0.20/1M

Qwen: Qwen3 VL 235B A22B Instruct

Qwen3-VL-235B-A22B Instruct is an open-weight multimodal model that unifies strong text ge...

πŸ“ 262,144 ctx Compare →
z-ai
$0.60/1M

Z.ai: GLM 4.5V

GLM-4.5V is a vision-language foundation model for multimodal agent applications. Built on...

πŸ“ 65,536 ctx Compare →
bytedance
$0.10/1M

ByteDance: UI-TARS 7B

UI-TARS-1.5 is a multimodal vision-language agent optimized for GUI-based environments, in...

πŸ“ 128,000 ctx Compare →
baidu
$0.42/1M

Baidu: ERNIE 4.5 VL 424B A47B

ERNIE-4.5-VL-424B-A47B is a multimodal Mixture-of-Experts (MoE) model from Baidu's ERNIE...

πŸ“ 131,072 ctx Compare →
google
$0.05/1M

Google: Gemma 3 4B

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It ha...

πŸ“ 131,072 ctx Compare →
google
$0.05/1M

Google: Gemma 3 12B

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It ha...

πŸ“ 131,072 ctx Compare →
google
$0.08/1M

Google: Gemma 3 27B

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It ha...

πŸ“ 131,072 ctx Compare →
qwen
$0.25/1M

Qwen: Qwen2.5 VL 72B Instruct

Qwen2.5-VL is proficient in recognizing common objects such as flowers, birds, fish, and i...

πŸ“ 131,072 ctx Compare →
minimax
$0.20/1M

MiniMax: MiniMax-01

MiniMax-01 combines MiniMax-Text-01 for text generation and MiniMax-VL-01 for image u...

πŸ“ 1,000,192 ctx Compare →
amazon
$0.06/1M

Amazon: Nova Lite 1.0

Amazon Nova Lite 1.0 is a very low-cost multimodal model from Amazon focused on fast ...

πŸ“ 300,000 ctx Compare →
meta-llama
$0.35/1M

Meta: Llama 3.2 11B Vision Instruct

Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle ...

πŸ“ 131,072 ctx Compare →
openai
$0.15/1M

OpenAI: GPT-4o-mini (2024-07-18)

GPT-4o mini is OpenAI's newest model after [GPT-4 Omni](/models/openai/gpt-4o), supporting...

πŸ“ 128,000 ctx Compare →
openai
$0.15/1M

OpenAI: GPT-4o-mini

GPT-4o mini is OpenAI's newest model after [GPT-4 Omni](/models/openai/gpt-4o), supporting...

πŸ“ 128,000 ctx Compare →
openai
$5.00/1M

OpenAI: GPT-4o (2024-05-13)

GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs...

πŸ“ 128,000 ctx Compare →
openai
$2.50/1M

OpenAI: GPT-4o

GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs...

πŸ“ 128,000 ctx Compare →
openai
$10.00/1M

OpenAI: GPT-4 Turbo

The latest GPT-4 Turbo model with vision capabilities. Vision requests can now use JSON mo...

πŸ“ 128,000 ctx Compare →