Google shipped Gemini 3.1 Pro on February 19, 2026, just three months after Gemini 3 Pro. It's the first time Google has used a ".1" increment instead of its usual ".5" mid-cycle updates, and the jump in capability makes the naming change feel justified.
## What Changed
The headline number: 77.1% on ARC-AGI-2, more than double Gemini 3 Pro's score. Because ARC-AGI-2 tests novel pattern recognition rather than memorized knowledge, the jump looks less like benchmark gaming and more like a genuine reasoning upgrade.
The “upgraded core intelligence” that first appeared in Gemini 3 Deep Think is now baked into the standard Pro model. Same pricing, same API format, dramatically better output.
## Key Specs
- Context window: 1 million input tokens, up to 64K output tokens
- Multimodal: Text, audio, images, video, and entire code repositories
- Thinking levels: Low, Medium, and High - controlling how much internal reasoning the model performs before responding
- Pricing: $2 per million input tokens, $12 per million output tokens (same as Gemini 3 Pro)
The thinking levels are the interesting part. Rather than a single fixed mode, developers can tune reasoning depth per request - balancing quality, latency, and cost. Low suits quick lookups; High suits complex multi-step problems.
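As a rough illustration, a per-request thinking level might look like the sketch below. The payload shape follows the public `generateContent` REST format, but the `thinkingLevel` field name is an assumption carried over from earlier Gemini releases, not confirmed documentation for 3.1 Pro.

```python
# Sketch of per-request reasoning control. The "thinkingLevel" field
# name is an assumption, not confirmed API documentation.

VALID_LEVELS = {"low", "medium", "high"}

def build_request(prompt: str, thinking_level: str = "medium") -> dict:
    """Build a generateContent payload with a tunable thinking level."""
    if thinking_level not in VALID_LEVELS:
        raise ValueError(f"unknown thinking level: {thinking_level!r}")
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {"thinkingLevel": thinking_level},
    }

# Quick lookup: shallow reasoning, lower latency and cost.
fast = build_request("What does ARC-AGI-2 measure?", "low")

# Complex multi-step problem: let the model reason longer first.
deep = build_request("Plan a migration of this schema across services.", "high")
```

The point of the knob is that you pay (in latency and output tokens) only for the reasoning depth a given request actually needs.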
## Benchmark Comparison
Here’s how Gemini 3.1 Pro stacks up against the current top models:
| Benchmark | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.2 |
|---|---|---|---|
| ARC-AGI-2 (reasoning) | 77.1% | - | - |
| GPQA Diamond (science) | 94.3% | 91.3% | 92.4% |
| SWE-Bench Verified (coding) | 80.6% | 80.8% | - |
| Humanity’s Last Exam (tools) | 51.4% | 53.1% | - |
| APEX-Agents (agentic tasks) | 33.5% | 29.8% | 23.0% |
| Terminal-Bench 2.0 | 68.5% | - | 77.3%* |
*GPT-5.3-Codex with specialized harness
Google claims Gemini 3.1 Pro leads on 12 of 18 tracked benchmarks. That’s a strong position, but the details matter.
## Where Each Model Wins
Gemini 3.1 Pro dominates general reasoning, scientific knowledge, and agentic workflows. The ARC-AGI-2 score is particularly impressive - it suggests the model handles novel problems rather than just pattern-matching from training data.
Claude Opus 4.6 retains the edge in real-world software engineering (SWE-Bench Verified) and expert-level human evaluations. On the GDPval-AA Elo benchmark, Claude Sonnet 4.6 scores 1633 points versus Gemini 3.1 Pro’s 1317. That’s a significant gap - suggesting Claude’s outputs tend to be more polished and contextually appropriate when judged by human experts.
GPT-5.3-Codex is the coding specialist. On Terminal-Bench 2.0, it scores 77.3% versus Gemini’s 68.5%. For dedicated coding workflows, OpenAI’s specialized model still has an edge.
The pattern: Gemini 3.1 Pro is the best general-purpose model on paper. Specialized models still win in their niches.
## Pricing - The Real Story
This is where Gemini 3.1 Pro gets genuinely disruptive:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Gemini 3.1 Pro | $2 | $12 |
| Claude Sonnet 4.6 | $3 | $15 |
| Claude Opus 4.6 | $15 | $75 |
Gemini 3.1 Pro is 7.5x cheaper on input and 6.25x cheaper on output compared to Claude Opus 4.6 - while matching or beating it on most benchmarks. Context caching can reduce costs by another 75%.
For teams running high-volume inference, this pricing difference compounds fast. A workload costing $10,000/month on Opus 4.6 would run roughly $1,300-$1,600 on Gemini 3.1 Pro, depending on the input/output mix.
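The arithmetic is easy to check. Here's a back-of-envelope cost model using the per-token prices from the table above; the example token volumes are illustrative, not from the article.

```python
# Back-of-envelope monthly cost model from the published per-token prices.
# Example volumes below are illustrative assumptions.

PRICES_USD_PER_MTOK = {  # model: (input, output) USD per 1M tokens
    "gemini-3.1-pro": (2.0, 12.0),
    "claude-sonnet-4.6": (3.0, 15.0),
    "claude-opus-4.6": (15.0, 75.0),
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """USD cost for a month of inference; volumes in millions of tokens."""
    price_in, price_out = PRICES_USD_PER_MTOK[model]
    return input_mtok * price_in + output_mtok * price_out

# A workload of 500M input and 33M output tokens per month:
opus = monthly_cost("claude-opus-4.6", 500, 33)    # $9,975
gemini = monthly_cost("gemini-3.1-pro", 500, 33)   # $1,396
```

At that mix the ratio is about 7x, and context caching (up to 75% off repeated input, per Google's claim above) would widen the gap further for cache-friendly workloads.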
## Where to Access It
Gemini 3.1 Pro is available in preview across:
- Gemini app - higher limits for AI Pro and Ultra plan users
- Google AI Studio - free-tier access with rate limits (great for prototyping)
- Vertex AI - enterprise access
- NotebookLM - Pro and Ultra users
- GitHub Copilot - public preview
- Android Studio - for mobile developers
The model is a drop-in replacement for Gemini 3 Pro - same API format, same pricing. Just swap the model ID and the performance upgrade is immediate.
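In practice the "swap the model ID" upgrade is a one-line change. The sketch below builds a request against the public `generateContent` REST endpoint; the preview model IDs are assumptions, so check the live model list for the exact identifiers before relying on them.

```python
import json
import urllib.request

# Minimal sketch of the drop-in upgrade path. The model IDs below are
# assumptions - confirm the exact preview identifiers in AI Studio.

BASE_URL = "https://generativelanguage.googleapis.com/v1beta/models"

def build_call(model_id: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Same payload, headers, and endpoint shape for any model ID."""
    body = json.dumps({"contents": [{"parts": [{"text": prompt}]}]})
    return urllib.request.Request(
        url=f"{BASE_URL}/{model_id}:generateContent",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )

# Upgrading: only the model ID changes, everything else stays the same.
old = build_call("gemini-3-pro-preview", "Hello", "YOUR_KEY")
new = build_call("gemini-3.1-pro-preview", "Hello", "YOUR_KEY")
```

Because the payload and auth headers are identical across models, the swap carries no migration cost beyond re-running your evals.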
## Early Impressions
The reasoning upgrade is real. In hands-on testing, Gemini 3.1 Pro handled constraint-heavy logic without collapsing into contradictions - something most models still struggle with. In one test, it generated a full implementation plan and code across six files in a single pass with zero errors.
But it's not all smooth. Launch day was rough: response times of 100+ seconds for simple queries, rate-limit errors, and timeouts. That's typical for a high-demand launch, but worth noting if you're planning production use immediately.
Hallucination rates dropped from 88% to 50% on the AA-Omniscience benchmark. That’s a meaningful improvement, but 50% is still high enough that structured prompting and human oversight remain necessary for high-stakes tasks.
## What This Means
Three observations about where things are heading:
The gap between models is shrinking. A year ago, there were clear tiers. Now, the top 3-4 models trade wins across benchmarks. The differences are becoming task-specific rather than across-the-board.
Price is becoming the differentiator. When benchmark scores are within a few percentage points of each other, cost per token starts mattering a lot more. Gemini 3.1 Pro delivering near-top performance at a fraction of the price puts real pressure on competitors.
Specialization is winning over generalization. No single model is best at everything. Claude leads in expert tasks and software engineering. GPT-5.3-Codex leads in dedicated coding. Gemini leads in reasoning and agentic workflows. The future likely involves routing different tasks to different models rather than picking one.
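That routing pattern can be sketched in a few lines. The task taxonomy and model IDs below are illustrative assumptions, not an official API - the point is the shape: map each task type to whichever model currently leads its niche, with the cheap generalist as the default.

```python
# Hypothetical task router for the "right model per task" pattern.
# Task names and model IDs are illustrative assumptions.

ROUTES = {
    "reasoning": "gemini-3.1-pro",
    "agentic": "gemini-3.1-pro",
    "software-engineering": "claude-opus-4.6",
    "coding-agent": "gpt-5.3-codex",
    "expert-writing": "claude-opus-4.6",
}

def pick_model(task_type: str, cost_sensitive: bool = False) -> str:
    """Route a task to a model; cost-sensitive work goes to the cheap generalist."""
    if cost_sensitive:
        return "gemini-3.1-pro"
    return ROUTES.get(task_type, "gemini-3.1-pro")

model = pick_model("coding-agent")  # specialized coding goes to the specialist
```

Real routers add fallbacks and per-route budgets, but even this toy version captures why "pick one model" is becoming the wrong question.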
## Should You Switch?
It depends on the workload:
- General reasoning and multimodal tasks - Gemini 3.1 Pro is the clear choice right now
- Production software engineering - Claude Opus 4.6 still has the edge
- Dedicated coding agents - GPT-5.3-Codex with its specialized harness wins
- Cost-sensitive high-volume inference - Gemini 3.1 Pro, no contest
- Expert-quality writing and analysis - Claude models still produce more polished output per human evaluations
The best strategy for most teams: test Gemini 3.1 Pro on your actual workload. The benchmark numbers are promising, but the real question is how it performs on your specific use case.
The AI landscape just got more competitive. And for users, that’s nothing but good news.
This article was written with AI assistance.