DeepSeek V4 Flash costs $0.28 per million output tokens. That's not a typo. GPT-5.5 costs $30. Claude Opus 4.8 costs $25. DeepSeek V4 Flash is 89× to 107× cheaper than the frontier models — and it scores 91.6% on LiveCodeBench and 79.0% on SWE-bench Verified. It runs at 126.7 tokens per second on Artificial Analysis, ~84 tok/s (337.3 char/s) on CodingFleet's DeepSeek API endpoint, and ~60+ tok/s in production on Lightning AI. 284B total parameters, 13B active per token, 1M context window, MIT license, weights on Hugging Face. This is the model that makes high-volume AI coding free. Here's the complete review. Try it on CodingFleet.
📊 Key Stats
- $0.28/1M output tokens. 89× cheaper than GPT-5.5 ($30). 107× cheaper than Claude Opus 4.8 ($25 with discount math). Input: $0.14/1M. Cached input: $0.028/1M — 98% discount.
- 126.7 tok/s on Artificial Analysis. Well above the open-weight large-model median of 58.2 t/s. ~84 tok/s (337.3 char/s) on CodingFleet's DeepSeek API. ~60+ tok/s on Lightning AI. 53 tok/s median on OpenRouter across providers.
- 91.6% LiveCodeBench. 79.0% SWE-bench Verified. 52.6% SWE-bench Pro. Within 1.6 points of V4 Pro on Verified. Within 2 points on LiveCodeBench. The gap to the frontier is measured in single digits.
- MIT license. 284B total / 13B active MoE. 1M context. Self-host on a single 80GB GPU when quantized. Fits comfortably on 2×48GB with Unsloth GGUF builds. 11.6T tokens served on OpenRouter in the last 30 days — 2.4× more than V4 Pro.
Try DeepSeek V4 Flash on CodingFleet — 20+ LLMs, including this one at ~84 tok/s.
Specifications
Speed: The Numbers From Every Source
DeepSeek V4 Flash is fast. Here's the speed data from every available source:
| Source | Speed | Notes |
|---|---|---|
| Artificial Analysis (Max Effort) | 126.7 tok/s | Independent benchmark. Well above median (58.2 t/s). |
| CodingFleet (DeepSeek API) | ~84 tok/s (337.3 char/s) | Real-world coding throughput. ~4 chars/tok. |
| Lightning AI | ~60+ tok/s | Production deployment estimate. |
| OpenRouter (median across providers) | 53.0 tok/s | Across 14 providers. P50 throughput. |
| DeepSeek V3.2 (for comparison) | ~35 tok/s | Flash is 1.7-3.6× faster than predecessor. |
| Claude Opus 4.8 (for comparison) | Moderate | Flash is dramatically faster. |
TTFT (Time To First Token): 1.30s on DeepSeek API. CodingFleet char/s → tok/s conversion uses ~4 chars/token (standard for English code). Sources: Artificial Analysis; Lightning AI; OpenRouter; CodingFleet internal measurements.
Why speed matters for coding: In execution-guided workflows — generate code, run tests, see failures, fix — faster iterations mean more bugs fixed per minute. At 126.7 tok/s, Flash completes 4× more repair cycles than a 30 tok/s model in the same time. For autonomous coding agents that loop dozens of times per task, speed is literally capability. See our unit test generator guide for why this compounds.
Coding Benchmarks: How Close to the Frontier?
| Benchmark | V4 Flash (Max) | V4 Pro (Max) | Claude Opus 4.8 | Flash vs Pro Gap |
|---|---|---|---|---|
| LiveCodeBench | 91.6% | 93.5% | 88.8% | -1.9 pts |
| SWE-bench Verified | 79.0% | 80.6% | 88.6% | -1.6 pts |
| SWE-bench Pro | 52.6% | 55.4% | 69.2% | -2.8 pts |
| Terminal-Bench 2.0 | 56.9% | 67.9% | 65.4% | -11.0 pts |
| GPQA Diamond | 88.1% | 90.1% | 93.6% | -2.0 pts |
| MMLU-Pro | 86.2% | 87.5% | — | -1.3 pts |
| Codeforces Rating | 3052 | 3206 | — | -154 pts |
| HumanEval | 69.5% | 76.8% | 95.0% | -7.3 pts |
| Output Price /1M tok | $0.28 | $0.87 | $25.00 | 3.1× cheaper |
Sources: DeepSeek V4 Flash Model Card; V4 Pro Model Card; Vellum — Opus 4.8; FriendliAI. All Flash and Pro scores are vendor-reported (Max reasoning mode).
The gap to V4 Pro is tiny — 1.6 points on Verified, 1.9 on LiveCodeBench, 2.8 on Pro. Flash delivers ~96% of V4 Pro's coding capability at 3.1× lower cost. The only meaningful gap is Terminal-Bench (56.9% vs 67.9%) — for CLI agent coding, Pro's extra reasoning depth matters. For everything else — code generation, bug fixes, algorithm implementation — Flash is functionally equivalent to Pro at a fraction of the price.
Why Flash Is So Fast and Cheap
Flash shares the same architectural innovations as V4 Pro, just scaled down. The hybrid CSA+HCA attention mechanism (Compressed Sparse Attention + Heavily Compressed Attention) is the key: at 1M tokens, V4 Flash uses an even smaller fraction of FLOPs and KV cache than Pro. CSA compresses KV caches 4× with a lightning indexer selecting the top 1,024 relevant entries per query. HCA provides a 128× compressed global view. A sliding window handles recency.
The Muon optimizer replaced AdamW for faster convergence. Manifold-Constrained Hyper-Connections (mHC) stabilize signal propagation across deep layers. MoE experts use FP4 precision with FP8 for other parameters — maximizing memory efficiency. The result: 284B total parameters, only 13B active per token. That's smaller than many laptops' RAM. At 126.7 tok/s with 1M context, Flash achieves a price-performance ratio that no other model approaches.
Self-hosting: Flash fits on a single 80GB GPU when quantized (FP4 + FP8). On 2×48GB with Unsloth GGUF builds, it runs comfortably. vLLM and SGLang both shipped Day-0 support. The model weights are ~158GB on disk — download from Hugging Face, deploy on RunPod or Clore.ai, and you're running your own instance for pennies per million tokens. No other model combines 91.6% LiveCodeBench with self-hosting this accessible.
Adoption: The Numbers Don't Lie
Developers are voting with their tokens. On OpenRouter in the last 30 days:
| Model | Total Tokens (30 days) | Prompt | Completion | Reasoning |
|---|---|---|---|---|
| DeepSeek V4 Flash | 11.6T | 401B | 9.1B | 8.61B |
| DeepSeek V4 Pro | 4.86T | 196B | 2.4B | 5.45B |
Source: OpenRouter. Flash serves 2.4× more tokens than V4 Pro — developers prefer speed + value over raw capability for most tasks.
Flash serves 11.6 trillion tokens per month — 2.4× more than V4 Pro. The market has spoken: for high-volume coding, speed and price beat benchmark margins. Flash is available across 14 providers on OpenRouter, from DeepSeek's first-party API to Microsoft Foundry, Vercel AI Gateway, Novita AI, and Baidu Qianfan.
What Flash Isn't Good At
No model is perfect. Flash has specific weaknesses you should know about:
- Terminal-Bench (56.9% vs Pro's 67.9%). For CLI agent coding — multi-step tool use, environment interaction, iterative debugging — Flash falls meaningfully behind Pro. The 13B active parameters lack the reasoning depth for complex agentic workflows. Use Pro for autonomous coding agents, Flash for code generation.
- High hallucination rate: 96%. Artificial Analysis reports that when Flash doesn't know the answer, it answers anyway 96% of the time — compared to 94% for V4 Pro. This is a known DeepSeek tradeoff: the models are optimized for helpfulness, not refusal. For factual coding (API documentation, library behavior), verify output. For creative coding (algorithm design, architecture), this matters less.
- Verbose: 240M output tokens on the AA Intelligence Index. Flash generates a lot of tokens — the median for similar models is 43M. On coding tasks, this means longer explanations, more commentary. The low per-token price offsets the verbosity, but it's worth capping
max_tokensfor latency control. - HumanEval gap (69.5% vs Pro's 76.8%). For isolated function generation from docstrings, Flash is notably weaker than Pro. This is the benchmark most aligned with "write a function that does X" — if your workflow depends heavily on this pattern, Pro is worth the premium.
- Text-only. No vision. No multimodal. Flash can't process images, video, or audio. For coding from screenshots or diagrams, use MiniMax M3 or Gemini 3.5 Flash instead.
Verdict: The Model That Makes AI Coding Free
DeepSeek V4 Flash is the most important model of 2026 — not because it's the best, but because it's good enough at a price that makes high-volume AI coding free.
At $0.28/1M output, you can generate 3.5 million tokens — roughly 2.6 million lines of code — for one dollar. A full-codebase analysis that costs $25 with Claude Opus 4.8 costs $0.28 with Flash. A CI/CD pipeline generating tests for every commit costs pennies per day. An automated code review system processing thousands of PRs costs less than a coffee.
The benchmark gaps to the frontier are real but shrinking: 9.6 points behind Opus 4.8 on SWE-bench Pro, 8.3 points on Terminal-Bench. For critical production debugging, pay for Opus. For everything else — code generation, test writing, boilerplate, documentation, code review, batch refactoring — Flash is the economically correct choice.
The tiered stack: Flash for volume (90% of coding tasks), V4 Pro or Opus 4.8 for depth (the 10% that matter most). At $0.28/1M, you can afford to be profligate with Flash and conservative with frontier models. That's the stack. See our budget models ranking and heavy user's guide for the complete strategy.
20+ LLMs. MIT license. $0.28/1M output. The fastest model we've measured.
Sources: Artificial Analysis — V4 Flash Intelligence & Speed | DeepSeek V4 Flash Model Card (Hugging Face) | MorphLLM — DeepSeek V4 Architecture Guide | Lightning AI — V4 Comparison | FriendliAI — V4 Pro & Flash Analysis | Fireworks AI — Best LLMs for Coding | OpenRouter — Activity & Pricing | Artificial Analysis — V4 Launch Analysis.