What if you didn't have to pick one model?
That's the core idea behind Sakana Fugu, a multi-agent AI system launched by that coordinates Claude, GPT, and Gemini through a single OpenAI-compatible API endpoint. One request in. The system routes it, delegates it, synthesizes it, and sends one answer back.
It doesn't just run multiple models. It learned how to make them work together.
The results are hard to ignore. Fugu Ultra leads on 10 of 11 major benchmarks, outperforming Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro on coding, reasoning, and scientific tasks. On SWE-Bench Pro, the coding benchmark that most developers trust, Fugu Ultra scores 73.7% compared to Claude Opus 4.8's 69.2% and GPT-5.5's 58.6%.
One big caveat: those are Sakana-reported numbers, not independently verified yet. That matters, and we'll get into it.
Who Built Sakana Fugu
Sakana AI was founded by Llion Jones and David Ha. Jones is one of the eight co-authors of "Attention Is All You Need," the 2017 paper that introduced the transformer architecture everything in AI now runs on. Ha was previously at Google Brain.
That pedigree explains the research-first approach. Fugu isn't built on prompt tricks or fixed routing rules. It's built on two papers published at ICLR 2026.
How Fugu Actually Works
The Two Research Systems Behind It
TRINITY assigns three distinct roles across the model pool:
Thinker - breaks down the problem and plans an approach
Worker - executes the actual task
Verifier - checks the output for errors and inconsistencies
The coordinator is lightweight and "evolved" rather than hand-coded, meaning the role assignments learned to be efficient rather than following a ruleset someone wrote.
Conductor takes a different approach. It uses reinforcement learning to discover coordination strategies between agents. Not strategies humans designed. Strategies the system found by experimenting.
Sakana describes it as: "instead of using domain knowledge to prescribe team organization, roles, or workflows, Fugu learns to dynamically assemble agents from a pool and coordinate them through non-obvious but highly efficient collaboration patterns."
That last phrase matters. The coordination patterns aren't intuitive. They're just effective.
What Happens When You Send a Request
Your prompt hits the single API endpoint
Fugu's orchestrator decides whether to handle it directly or delegate to specialists
The relevant models process their assigned portions independently
A synthesizer evaluates all outputs, detects contradictions, weights reasoning chains
One consolidated answer comes back
The whole process is invisible from the outside. Your application sends a standard OpenAI-format request. It gets a response. What happened in between is Sakana's proprietary system.
Fugu vs Fugu Ultra: Which One
Two tiers. Different priorities.
Fugu | Fugu Ultra | |
|---|---|---|
Priority | Speed and low latency | Answer quality |
Best for | Coding, review, interactive work | Complex reasoning, research, hard problems |
Agent coordination | Standard | Coordinates more expert agents |
Response time | Faster | Slower |
Pay-as-you-go | Underlying model rate | $5/$30 per 1M input/output tokens |
For most daily coding and review tasks, standard Fugu is the right call. For anything where getting the best possible answer matters more than getting it fast, Fugu Ultra is the one to use.
Benchmark Results
[IMAGE: Clean data visualization comparing Fugu Ultra against competing AI models on coding and reasoning benchmarks, bar chart style, dark background, teal and purple bars, showing clear performance gaps]
Fugu Ultra leads on 10 of 11 benchmarks tested. Here's what the numbers look like on the ones that matter most:
Benchmark | Fugu Ultra | Claude Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
SWE-Bench Pro | 73.7% | 69.2% | 58.6% | 54.2% |
LiveCodeBench Pro | 90.8% | 84.8% | 88.4% | 82.9% |
TerminalBench 2.1 | 82.1% | 74.6% | 78.2% | 70.3% |
Humanity's Last Exam | 50.0% | 49.8% | 41.4% | 44.4% |
MRCRv2 | 93.6% | — | 94.8% | — |
The only loss is MRCRv2, where GPT-5.5 edges ahead at 94.8% versus 93.6%. Every other benchmark goes to Fugu Ultra.
The Caveat You Should Know
All of these numbers are self-reported by Sakana AI. Independent verification hasn't happened yet. That's not unusual for a new product launch, but it does mean the benchmark claims deserve healthy skepticism until third parties start testing. The code review demonstration, where beta testers saw Fugu Ultra surface 20+ issues versus roughly 3 from competitors, is more immediately usable as signal than the benchmark table.
Pricing
Subscription Plans (include both Fugu and Fugu Ultra)
Standard: $20/month - lightweight daily use
Pro: $100/month - 10x usage for focused work sessions
Max: $200/month - 30x usage for heavy workloads
Free second month promo if you subscribe before July 2026
Pay-as-You-Go (token-based)
Fugu: Standard rate of the underlying model being used
Fugu Ultra (standard context): $5 input / $30 output per 1M tokens
Fugu Ultra (context over 272K tokens): $10 input / $45 output per 1M tokens
Cached input: $0.50 to $1.00 per 1M tokens
The subscription tiers are straightforward. The pay-as-you-go pricing for Fugu Ultra is on the expensive end, particularly at the long-context rate. For high-volume production workloads, the cost math needs careful attention before committing.
One pricing detail worth noting: when multiple agents run in parallel, you pay only the rate of the highest-tier model involved. Not a stacked fee for each agent.
Where Fugu Excels
Best Use Cases
Code review at depth - finding edge cases, security issues, and logic bugs across large codebases
Complex multi-step reasoning - problems that require planning before executing
Research and paper reproduction - scientific analysis with multiple validation steps
Security assessments - penetration testing scenarios where thoroughness beats speed
Long-context analysis - patent analysis, document review, legal summarization
High-stakes accuracy - medical, financial, and legal domains where errors are costly
Where It Falls Short
Be honest about this before you adopt it:
Real-time applications - orchestration overhead adds latency. If you need sub-second responses, Fugu adds friction
Simple, well-defined tasks - routing a basic question through three models is overkill and costs more than it should
High-volume, cost-sensitive operations - the per-token rate for Fugu Ultra at scale is not cheap
Compliance-sensitive routing - the orchestration logic is proprietary and non-auditable. If your organization needs to know exactly which model processed which data, Fugu can't fully satisfy that
EU/EEA users - not available yet, GDPR compliance work is still in progress
How to Access Fugu
Direct Access
Via with a standard OpenAI-compatible API client
Available globally except EU/EEA
Third-Party Platforms
OpenRouter - multi-model routing platform
Vercel AI Gateway - for Next.js and Vercel deployments
Creao - AI workflow platform
Enterprise Options
Opt out of specific providers for privacy and compliance requirements
Custom configurations for regulated industries
Token usage and costs reported per request for monitoring
What Makes This Actually Different
Most "multi-model" approaches in AI right now are prompt-chaining tools. You write a chain, pick which model runs which step, and manage the outputs yourself. LangChain, AutoGen, and similar frameworks work this way. They're powerful but they require you to design the workflow.
Fugu removes that design step. The orchestration layer learned when to delegate, how agents should communicate, and how to combine outputs. You don't configure a pipeline. You send a request.
That's the practical difference. And it's why the code review result is compelling. Nobody designed a "find bugs with multiple models" workflow. The system figured out how to do it and did it better than any single model alone.
For teams already building on AI tools and wondering whether adding model orchestration is worth the complexity, Fugu makes the answer simpler. The complexity is already handled.
If you're evaluating where AI fits into your creative and content workflows more broadly, the covers how platforms like MagicShot are already running including GPT Image 2.0, Nano Banana 2, VEO 3.1, and Seedance 2.0 under one subscription for creative work.
Sakana Fugu is worth watching closely over the next few months as independent benchmark verification catches up to the self-reported numbers. If the performance holds under third-party testing, it's a meaningful shift in what a single API endpoint can deliver.
