What if you didn't have to pick one model?

That's the core idea behind Sakana Fugu, a multi-agent AI system launched by Sakana AI that coordinates Claude, GPT, and Gemini through a single OpenAI-compatible API endpoint. One request in. The system routes it, delegates it, synthesizes it, and sends one answer back.

It doesn't just run multiple models. It learned how to make them work together.

The results are hard to ignore. Fugu Ultra leads on 10 of 11 major benchmarks, outperforming Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro on coding, reasoning, and scientific tasks. On SWE-Bench Pro, the coding benchmark that most developers trust, Fugu Ultra scores 73.7% compared to Claude Opus 4.8's 69.2% and GPT-5.5's 58.6%.

One big caveat: those are Sakana-reported numbers, not independently verified yet. That matters, and we'll get into it.

Who Built Sakana Fugu

Sakana AI was founded by Llion Jones and David Ha. Jones is one of the eight co-authors of "Attention Is All You Need," the 2017 paper that introduced the transformer architecture everything in AI now runs on. Ha was previously at Google Brain.

That pedigree explains the research-first approach. Fugu isn't built on prompt tricks or fixed routing rules. It's built on two papers published at ICLR 2026.

How Fugu Actually Works

The Two Research Systems Behind It

TRINITY assigns three distinct roles across the model pool:

  • Thinker - breaks down the problem and plans an approach

  • Worker - executes the actual task

  • Verifier - checks the output for errors and inconsistencies

The coordinator is lightweight and "evolved" rather than hand-coded, meaning the role assignments learned to be efficient rather than following a ruleset someone wrote.

Conductor takes a different approach. It uses reinforcement learning to discover coordination strategies between agents. Not strategies humans designed. Strategies the system found by experimenting.

Sakana describes it as: "instead of using domain knowledge to prescribe team organization, roles, or workflows, Fugu learns to dynamically assemble agents from a pool and coordinate them through non-obvious but highly efficient collaboration patterns."

That last phrase matters. The coordination patterns aren't intuitive. They're just effective.

What Happens When You Send a Request

  1. Your prompt hits the single API endpoint

  2. Fugu's orchestrator decides whether to handle it directly or delegate to specialists

  3. The relevant models process their assigned portions independently

  4. A synthesizer evaluates all outputs, detects contradictions, weights reasoning chains

  5. One consolidated answer comes back

The whole process is invisible from the outside. Your application sends a standard OpenAI-format request. It gets a response. What happened in between is Sakana's proprietary system.

Fugu vs Fugu Ultra: Which One

Two tiers. Different priorities.

Fugu

Fugu Ultra

Priority

Speed and low latency

Answer quality

Best for

Coding, review, interactive work

Complex reasoning, research, hard problems

Agent coordination

Standard

Coordinates more expert agents

Response time

Faster

Slower

Pay-as-you-go

Underlying model rate

$5/$30 per 1M input/output tokens

For most daily coding and review tasks, standard Fugu is the right call. For anything where getting the best possible answer matters more than getting it fast, Fugu Ultra is the one to use.

Benchmark Results

[IMAGE: Clean data visualization comparing Fugu Ultra against competing AI models on coding and reasoning benchmarks, bar chart style, dark background, teal and purple bars, showing clear performance gaps]

Fugu Ultra leads on 10 of 11 benchmarks tested. Here's what the numbers look like on the ones that matter most:

Benchmark

Fugu Ultra

Claude Opus 4.8

GPT-5.5

Gemini 3.1 Pro

SWE-Bench Pro

73.7%

69.2%

58.6%

54.2%

LiveCodeBench Pro

90.8%

84.8%

88.4%

82.9%

TerminalBench 2.1

82.1%

74.6%

78.2%

70.3%

Humanity's Last Exam

50.0%

49.8%

41.4%

44.4%

MRCRv2

93.6%

94.8%

The only loss is MRCRv2, where GPT-5.5 edges ahead at 94.8% versus 93.6%. Every other benchmark goes to Fugu Ultra.

The Caveat You Should Know

All of these numbers are self-reported by Sakana AI. Independent verification hasn't happened yet. That's not unusual for a new product launch, but it does mean the benchmark claims deserve healthy skepticism until third parties start testing. The code review demonstration, where beta testers saw Fugu Ultra surface 20+ issues versus roughly 3 from competitors, is more immediately usable as signal than the benchmark table.

Pricing

Subscription Plans (include both Fugu and Fugu Ultra)

  • Standard: $20/month - lightweight daily use

  • Pro: $100/month - 10x usage for focused work sessions

  • Max: $200/month - 30x usage for heavy workloads

  • Free second month promo if you subscribe before July 2026

Pay-as-You-Go (token-based)

  • Fugu: Standard rate of the underlying model being used

  • Fugu Ultra (standard context): $5 input / $30 output per 1M tokens

  • Fugu Ultra (context over 272K tokens): $10 input / $45 output per 1M tokens

  • Cached input: $0.50 to $1.00 per 1M tokens

The subscription tiers are straightforward. The pay-as-you-go pricing for Fugu Ultra is on the expensive end, particularly at the long-context rate. For high-volume production workloads, the cost math needs careful attention before committing.

One pricing detail worth noting: when multiple agents run in parallel, you pay only the rate of the highest-tier model involved. Not a stacked fee for each agent.

Where Fugu Excels

Best Use Cases

  • Code review at depth - finding edge cases, security issues, and logic bugs across large codebases

  • Complex multi-step reasoning - problems that require planning before executing

  • Research and paper reproduction - scientific analysis with multiple validation steps

  • Security assessments - penetration testing scenarios where thoroughness beats speed

  • Long-context analysis - patent analysis, document review, legal summarization

  • High-stakes accuracy - medical, financial, and legal domains where errors are costly

Where It Falls Short

Be honest about this before you adopt it:

  • Real-time applications - orchestration overhead adds latency. If you need sub-second responses, Fugu adds friction

  • Simple, well-defined tasks - routing a basic question through three models is overkill and costs more than it should

  • High-volume, cost-sensitive operations - the per-token rate for Fugu Ultra at scale is not cheap

  • Compliance-sensitive routing - the orchestration logic is proprietary and non-auditable. If your organization needs to know exactly which model processed which data, Fugu can't fully satisfy that

  • EU/EEA users - not available yet, GDPR compliance work is still in progress

How to Access Fugu

Direct Access

  • Via sakana.ai/fugu/ with a standard OpenAI-compatible API client

  • Available globally except EU/EEA

Third-Party Platforms

  • OpenRouter - multi-model routing platform

  • Vercel AI Gateway - for Next.js and Vercel deployments

  • Creao - AI workflow platform

Enterprise Options

  • Opt out of specific providers for privacy and compliance requirements

  • Custom configurations for regulated industries

  • Token usage and costs reported per request for monitoring

What Makes This Actually Different

Most "multi-model" approaches in AI right now are prompt-chaining tools. You write a chain, pick which model runs which step, and manage the outputs yourself. LangChain, AutoGen, and similar frameworks work this way. They're powerful but they require you to design the workflow.

Fugu removes that design step. The orchestration layer learned when to delegate, how agents should communicate, and how to combine outputs. You don't configure a pipeline. You send a request.

That's the practical difference. And it's why the code review result is compelling. Nobody designed a "find bugs with multiple models" workflow. The system figured out how to do it and did it better than any single model alone.

For teams already building on AI tools and wondering whether adding model orchestration is worth the complexity, Fugu makes the answer simpler. The complexity is already handled.

If you're evaluating where AI fits into your creative and content workflows more broadly, the AI tools guide for content creators covers how platforms like MagicShot are already running multiple frontier models including GPT Image 2.0, Nano Banana 2, VEO 3.1, and Seedance 2.0 under one subscription for creative work.

Sakana Fugu is worth watching closely over the next few months as independent benchmark verification catches up to the self-reported numbers. If the performance holds under third-party testing, it's a meaningful shift in what a single API endpoint can deliver.