The $0.28 Developer: DeepSeek V4 Flash — The Fastest, Cheapest Coding Model

DeepSeek V4 Flash costs $0.28 per million output tokens. That's not a typo. GPT-5.5 costs $30. Claude Opus 4.8 costs $25. DeepSeek V4 Flash is 89× to 107× cheaper than the frontier models — and it scores 91.6% on LiveCodeBench and 79.0% on SWE-bench Verified. It runs at 126.7 tokens per second on Artificial Analysis, ~84 tok/s (337.3 char/s) on CodingFleet's DeepSeek API endpoint, and ~60+ tok/s in production on Lightning AI. 284B total parameters, 13B active per token, 1M context window, MIT license, weights on Hugging Face. This is the model that makes high-volume AI coding free. Here's the complete review. Try it on CodingFleet.

📊 Key Stats

$0.28/1M output tokens. 89× cheaper than GPT-5.5 ($30). 107× cheaper than Claude Opus 4.8 ($25 with discount math). Input: $0.14/1M. Cached input: $0.028/1M — 98% discount.
126.7 tok/s on Artificial Analysis. Well above the open-weight large-model median of 58.2 t/s. ~84 tok/s (337.3 char/s) on CodingFleet's DeepSeek API. ~60+ tok/s on Lightning AI. 53 tok/s median on OpenRouter across providers.
91.6% LiveCodeBench. 79.0% SWE-bench Verified. 52.6% SWE-bench Pro. Within 1.6 points of V4 Pro on Verified. Within 2 points on LiveCodeBench. The gap to the frontier is measured in single digits.
MIT license. 284B total / 13B active MoE. 1M context. Self-host on a single 80GB GPU when quantized. Fits comfortably on 2×48GB with Unsloth GGUF builds. 11.6T tokens served on OpenRouter in the last 30 days — 2.4× more than V4 Pro.

Try DeepSeek V4 Flash on CodingFleet — 20+ LLMs, including this one at ~84 tok/s.

Specifications

SpecDeepSeek V4 Flash ArchitectureMixture-of-Experts (MoE) Total Parameters284B Active Parameters13B per token Context Window1,000,000 tokens Max Output131K tokens AttentionHybrid CSA + HCA OptimizerMuon (AdamW for embeddings) Training Data33T tokens PrecisionFP4 experts / FP8 other parameters Reasoning Modes3 (Non-Think, Think High, Think Max) Input Price$0.14/1M (cache miss) / $0.028/1M (cache hit, 98% discount) Output Price$0.28/1M Blended Price (7:2:1)$0.06/1M LicenseMIT (open weights on Hugging Face) Release DateApril 24, 2026 API CompatibilityOpenAI + Anthropic specs

Speed: The Numbers From Every Source

DeepSeek V4 Flash is fast. Here's the speed data from every available source:

Source	Speed	Notes
Artificial Analysis (Max Effort)	126.7 tok/s	Independent benchmark. Well above median (58.2 t/s).
CodingFleet (DeepSeek API)	~84 tok/s (337.3 char/s)	Real-world coding throughput. ~4 chars/tok.
Lightning AI	~60+ tok/s	Production deployment estimate.
OpenRouter (median across providers)	53.0 tok/s	Across 14 providers. P50 throughput.
DeepSeek V3.2 (for comparison)	~35 tok/s	Flash is 1.7-3.6× faster than predecessor.
Claude Opus 4.8 (for comparison)	Moderate	Flash is dramatically faster.

TTFT (Time To First Token): 1.30s on DeepSeek API. CodingFleet char/s → tok/s conversion uses ~4 chars/token (standard for English code). Sources: Artificial Analysis; Lightning AI; OpenRouter; CodingFleet internal measurements.

Why speed matters for coding: In execution-guided workflows — generate code, run tests, see failures, fix — faster iterations mean more bugs fixed per minute. At 126.7 tok/s, Flash completes 4× more repair cycles than a 30 tok/s model in the same time. For autonomous coding agents that loop dozens of times per task, speed is literally capability. See our unit test generator guide for why this compounds.

Coding Benchmarks: How Close to the Frontier?

Benchmark	V4 Flash (Max)	V4 Pro (Max)	Claude Opus 4.8	Flash vs Pro Gap
LiveCodeBench	91.6%	93.5%	88.8%	-1.9 pts
SWE-bench Verified	79.0%	80.6%	88.6%	-1.6 pts
SWE-bench Pro	52.6%	55.4%	69.2%	-2.8 pts
Terminal-Bench 2.0	56.9%	67.9%	65.4%	-11.0 pts
GPQA Diamond	88.1%	90.1%	93.6%	-2.0 pts
MMLU-Pro	86.2%	87.5%	—	-1.3 pts
Codeforces Rating	3052	3206	—	-154 pts
HumanEval	69.5%	76.8%	95.0%	-7.3 pts
Output Price /1M tok	$0.28	$0.87	$25.00	3.1× cheaper

Sources: DeepSeek V4 Flash Model Card; V4 Pro Model Card; Vellum — Opus 4.8; FriendliAI. All Flash and Pro scores are vendor-reported (Max reasoning mode).

The gap to V4 Pro is tiny — 1.6 points on Verified, 1.9 on LiveCodeBench, 2.8 on Pro. Flash delivers ~96% of V4 Pro's coding capability at 3.1× lower cost. The only meaningful gap is Terminal-Bench (56.9% vs 67.9%) — for CLI agent coding, Pro's extra reasoning depth matters. For everything else — code generation, bug fixes, algorithm implementation — Flash is functionally equivalent to Pro at a fraction of the price.

Why Flash Is So Fast and Cheap

Flash shares the same architectural innovations as V4 Pro, just scaled down. The hybrid CSA+HCA attention mechanism (Compressed Sparse Attention + Heavily Compressed Attention) is the key: at 1M tokens, V4 Flash uses an even smaller fraction of FLOPs and KV cache than Pro. CSA compresses KV caches 4× with a lightning indexer selecting the top 1,024 relevant entries per query. HCA provides a 128× compressed global view. A sliding window handles recency.

The Muon optimizer replaced AdamW for faster convergence. Manifold-Constrained Hyper-Connections (mHC) stabilize signal propagation across deep layers. MoE experts use FP4 precision with FP8 for other parameters — maximizing memory efficiency. The result: 284B total parameters, only 13B active per token. That's smaller than many laptops' RAM. At 126.7 tok/s with 1M context, Flash achieves a price-performance ratio that no other model approaches.

Self-hosting: Flash fits on a single 80GB GPU when quantized (FP4 + FP8). On 2×48GB with Unsloth GGUF builds, it runs comfortably. vLLM and SGLang both shipped Day-0 support. The model weights are ~158GB on disk — download from Hugging Face, deploy on RunPod or Clore.ai, and you're running your own instance for pennies per million tokens. No other model combines 91.6% LiveCodeBench with self-hosting this accessible.

Adoption: The Numbers Don't Lie

Developers are voting with their tokens. On OpenRouter in the last 30 days:

Model	Total Tokens (30 days)	Prompt	Completion	Reasoning
DeepSeek V4 Flash	11.6T	401B	9.1B	8.61B
DeepSeek V4 Pro	4.86T	196B	2.4B	5.45B

Source: OpenRouter. Flash serves 2.4× more tokens than V4 Pro — developers prefer speed + value over raw capability for most tasks.

Flash serves 11.6 trillion tokens per month — 2.4× more than V4 Pro. The market has spoken: for high-volume coding, speed and price beat benchmark margins. Flash is available across 14 providers on OpenRouter, from DeepSeek's first-party API to Microsoft Foundry, Vercel AI Gateway, Novita AI, and Baidu Qianfan.

What Flash Isn't Good At

No model is perfect. Flash has specific weaknesses you should know about:

Terminal-Bench (56.9% vs Pro's 67.9%). For CLI agent coding — multi-step tool use, environment interaction, iterative debugging — Flash falls meaningfully behind Pro. The 13B active parameters lack the reasoning depth for complex agentic workflows. Use Pro for autonomous coding agents, Flash for code generation.
High hallucination rate: 96%. Artificial Analysis reports that when Flash doesn't know the answer, it answers anyway 96% of the time — compared to 94% for V4 Pro. This is a known DeepSeek tradeoff: the models are optimized for helpfulness, not refusal. For factual coding (API documentation, library behavior), verify output. For creative coding (algorithm design, architecture), this matters less.
Verbose: 240M output tokens on the AA Intelligence Index. Flash generates a lot of tokens — the median for similar models is 43M. On coding tasks, this means longer explanations, more commentary. The low per-token price offsets the verbosity, but it's worth capping max_tokens for latency control.
HumanEval gap (69.5% vs Pro's 76.8%). For isolated function generation from docstrings, Flash is notably weaker than Pro. This is the benchmark most aligned with "write a function that does X" — if your workflow depends heavily on this pattern, Pro is worth the premium.
Text-only. No vision. No multimodal. Flash can't process images, video, or audio. For coding from screenshots or diagrams, use MiniMax M3 or Gemini 3.5 Flash instead.

Verdict: The Model That Makes AI Coding Free

DeepSeek V4 Flash is the most important model of 2026 — not because it's the best, but because it's good enough at a price that makes high-volume AI coding free.

At $0.28/1M output, you can generate 3.5 million tokens — roughly 2.6 million lines of code — for one dollar. A full-codebase analysis that costs $25 with Claude Opus 4.8 costs $0.28 with Flash. A CI/CD pipeline generating tests for every commit costs pennies per day. An automated code review system processing thousands of PRs costs less than a coffee.

The benchmark gaps to the frontier are real but shrinking: 9.6 points behind Opus 4.8 on SWE-bench Pro, 8.3 points on Terminal-Bench. For critical production debugging, pay for Opus. For everything else — code generation, test writing, boilerplate, documentation, code review, batch refactoring — Flash is the economically correct choice.

The tiered stack: Flash for volume (90% of coding tasks), V4 Pro or Opus 4.8 for depth (the 10% that matter most). At $0.28/1M, you can afford to be profligate with Flash and conservative with frontier models. That's the stack. See our budget models ranking and heavy user's guide for the complete strategy.

⚡ Try DeepSeek V4 Flash at ~84 tok/s on CodingFleet →

20+ LLMs. MIT license. $0.28/1M output. The fastest model we've measured.

📊 Key Stats

Specifications

Speed: The Numbers From Every Source

Coding Benchmarks: How Close to the Frontier?

Why Flash Is So Fast and Cheap

Adoption: The Numbers Don't Lie

What Flash Isn't Good At

Verdict: The Model That Makes AI Coding Free

Continue reading

MiniMax M2.7 vs DeepSeek V4 Flash: Budget Open-Weight Coding Showdown

GPT-5.6 Terra vs Gemini 3.5 Flash: Which Mid-Tier Model Wins in 2026?

Best AI Diagram Generators from Code in 2026: UML, Flowcharts & Architecture

Best AI Code Explainers in 2026: Understand Any Code in Seconds