GPT-5.5 and SubQ: Two AI Bets That Landed in May 2026
On the same day in May 2026, OpenAI made GPT-5.5 Instant the ChatGPT default while startup Subquadratic launched SubQ, a 12M-token model that ditches the transformer's quadratic attention. Two opposite bets on AI's future.
On 5 May 2026, two announcements landed within hours of each other and pointed in almost opposite directions. OpenAI quietly swapped the brain behind ChatGPT, making GPT-5.5 Instant the new default for hundreds of millions of users. The same day, a nine-person startup in Miami you had probably never heard of launched SubQ, a model built on fundamentally different math from the transformers everyone else is scaling. One is a bet that the current architecture still has room to run. The other is a bet that it doesn't. They can't both be fully right.
This is the more interesting story in AI right now: not "which model tops the leaderboard," but a genuine fork in how these systems are built. Here's what actually shipped, what's a measured fact versus a vendor claim, and why the gap between the two bets matters for anyone deciding what to build on.
Bet one: refine the default (GPT-5.5 Instant)
OpenAI's move was the opposite of a launch event. As TechCrunch reported, GPT-5.5 Instant replaced GPT-5.3 Instant as ChatGPT's default for all tiers and shipped to the API as the chat-latest alias on the same day. No new product, no new modality, just a better engine dropped into the car most people already drive.
The headline metric OpenAI led with was reliability, not raw intelligence. In the company's internal evaluations, GPT-5.5 Instant produced 52.5% fewer hallucinated claims than GPT-5.3 Instant on "high-stakes" prompts spanning medicine, law, and finance. That framing is telling. When your model is the default for a billion-query-a-day product, a halving of confident-but-wrong answers on exactly the topics where wrongness is expensive is worth more than another point on a reasoning benchmark.
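The 52.5% figure is a relative reduction, so what it means in absolute terms depends on a baseline the article does not give. A minimal sketch of that arithmetic, assuming a purely hypothetical baseline of 4 hallucinated claims per 100 high-stakes answers (that number is an illustration, not an OpenAI figure):

```python
def rate_after_relative_reduction(baseline_rate: float, relative_reduction: float) -> float:
    """Return the error rate left after a relative (not absolute) reduction."""
    if not 0.0 <= baseline_rate <= 1.0:
        raise ValueError("baseline_rate must be a fraction between 0 and 1")
    if not 0.0 <= relative_reduction <= 1.0:
        raise ValueError("relative_reduction must be a fraction between 0 and 1")
    return baseline_rate * (1.0 - relative_reduction)


# HYPOTHETICAL baseline, chosen only to make the arithmetic concrete.
HYPOTHETICAL_BASELINE = 0.04
# Relative reduction reported from OpenAI's internal evaluations.
REPORTED_REDUCTION = 0.525

new_rate = rate_after_relative_reduction(HYPOTHETICAL_BASELINE, REPORTED_REDUCTION)
print(f"baseline: {HYPOTHETICAL_BASELINE:.2%}, after reduction: {new_rate:.2%}")
```

Under that assumed baseline the rate drops from 4% to 1.9%. The same relative cut applied to a smaller baseline yields a smaller absolute change, which is why the baseline matters when reading the headline number.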
A few practical details for anyone building on the API:
- GPT-5.3 Instant doesn't vanish. Paid users keep access for three months through model settings before it retires, so production integrations have a migration window rather than a cliff.
- Personalization is part of the pitch. Enhanced use of past chats, files, and connected Gmail is rolling out first to Plus and Pro on web, then mobile and other tiers. It's a reminder that for OpenAI, the model and the memory layer around it are now one product.
- The default is the product. Most users never change the model picker. Shipping improvements into the default is how OpenAI moves the median experience without asking anyone to do anything.
The strategic read: OpenAI is betting that the transformer, plus better training and better grounding, still has meaningful headroom, and that the path to trust runs through fewer hallucinations, not exotic architecture.
Bet two: change the math (SubQ and subquadratic attention)
Subquadratic, founded by Justin Dangel and Alex Whedon, came out of stealth the same day with $29 million in seed funding and a model, SubQ, that claims a 12-million-token context window processed in a single pass. To put that in perspective, 12 million tokens is roughly a shelf of technical books held in working memory at once.
The reason that's hard is baked into the standard transformer. Classic self-attention is quadratic: double the input length and you roughly quadruple the compute, because every token attends to every other token. That O(n²) wall is why long context has historically been expensive and why "1M token" models came with eye-watering price tags.
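The doubling-quadruples relationship can be checked with a simple count. This sketch tallies query-key pairs only; it is a rough proxy for attention compute and ignores constant factors, hidden dimensions, and everything else in the model:

```python
def dense_attention_pairs(n_tokens: int) -> int:
    """Count query-key pairs when every token attends to every token."""
    return n_tokens * n_tokens


# Doubling the input length quadruples the number of pairs.
for n in (1_000, 2_000, 4_000):
    print(f"{n:>6} tokens -> {dense_attention_pairs(n):>12,} pairs")

ratio = dense_attention_pairs(2_000) / dense_attention_pairs(1_000)
print(f"pairs ratio when length doubles: {ratio}")

# At the 12M-token scale discussed above, the dense pair count is enormous.
print(f"12M tokens -> {dense_attention_pairs(12_000_000):,} pairs")
```

At 12 million tokens the dense count is 144 trillion pairs, which gives a sense of why a single-pass window of that size is hard to reach with full attention.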
SubQ's answer is an architecture the company calls Subquadratic Sparse Attention (SSA). Rather than comparing every token to every other token, SSA selects only the most relevant positions for each query and computes relationships within that subset. The claimed result is attention cost that scales with the small number of selected tokens rather than the full sequence: roughly linear instead of quadratic.
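To make "attend only within a selected subset" concrete, here is a toy sketch in plain Python. It is not Subquadratic's SSA, whose internals are not described in the company's claims as reported here. The helper names are hypothetical, and the naive `top_k_positions` selector still scores every key, so this toy stays quadratic overall. A real subquadratic method would need a cheaper way to pick positions.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, positions):
    """Scaled dot-product attention for one query, restricted to `positions`."""
    scale = math.sqrt(len(query))
    scores = [dot(query, keys[p]) / scale for p in positions]
    weights = softmax(scores)
    width = len(values[0])
    return [sum(w * values[p][d] for w, p in zip(weights, positions))
            for d in range(width)]


def top_k_positions(query, keys, k):
    """Hypothetical naive selector: score all keys, keep the k best.

    This scan touches every key, so it does NOT make the toy subquadratic.
    """
    ranked = sorted(range(len(keys)), key=lambda p: dot(query, keys[p]), reverse=True)
    return sorted(ranked[:k])


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
query = [1.0, 1.0]

dense_out = attend(query, keys, values, list(range(len(keys))))
selected = top_k_positions(query, keys, k=2)
sparse_out = attend(query, keys, values, selected)
print("selected positions:", selected)
print("dense output:", dense_out, "sparse output:", sparse_out)
```

Once positions are chosen, the attention step costs k pairs per query instead of n. Whether the selection itself can be done cheaply and without losing quality is exactly what the vendor claims would need to demonstrate.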
What's measured versus what's marketing
This is exactly the kind of claim that deserves skepticism, so here is the line between the two:
| Claim | Status |
|---|---|
| 12M-token single-pass context | Vendor-stated capability |
| ~52× more efficient than FlashAttention at 1M tokens | Vendor benchmark |
| 95% on RULER 128K at ~$8 vs Claude Opus 94% at ~$2,600 | Vendor benchmark; a stated ~300× cost reduction |
| Coding "similar to Claude Opus" at ~1/20th the cost | Vendor comparison |
Every figure above comes from Subquadratic's own materials. None has yet been reproduced by independent third parties at the time of writing, and self-reported benchmarks in this industry have a long history of looking different under outside scrutiny. The RULER long-context benchmark is a reasonable choice, since it's designed to stress retrieval across long inputs rather than reward lucky guessing, but a 300× cost claim is extraordinary, and extraordinary claims wait for replication.
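The table's dollar figures can be turned into ratios directly. A short sketch, using only the vendor-stated approximate costs from the table; the "fraction that replicates" values are hypothetical and serve only to show how the ratio shrinks if outside testing confirms part of the claim:

```python
def cost_ratio(baseline_cost_usd: float, candidate_cost_usd: float) -> float:
    """How many times cheaper the candidate is than the baseline."""
    if candidate_cost_usd <= 0:
        raise ValueError("candidate_cost_usd must be positive")
    return baseline_cost_usd / candidate_cost_usd


# Vendor-stated, approximate figures for the RULER 128K run.
CLAUDE_OPUS_COST_USD = 2600.0
SUBQ_COST_USD = 8.0

ruler_ratio = cost_ratio(CLAUDE_OPUS_COST_USD, SUBQ_COST_USD)
print(f"stated RULER 128K cost ratio: {ruler_ratio:.0f}x")

# HYPOTHETICAL fractions of the claimed advantage surviving replication.
for fraction in (1.0, 0.5, 0.1):
    print(f"if {fraction:.0%} holds up: {ruler_ratio * fraction:.1f}x cheaper")
```

The two stated costs work out to 325×, consistent with the rounded "~300×". Even at a hypothetical tenth of that, the gap would still be over 30×.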
What's not in dispute is the direction. Subquadratic and near-linear attention variants have been an active research line for years; what's new is a funded startup shipping a commercial model on that bet and pricing it aggressively. If even a fraction of the efficiency holds up independently, the economics of long context change.
The backdrop: open weights got very good in April
The May 5 fork didn't happen in a vacuum. A month earlier, on 2 April 2026, Google shipped Gemma 4, its most capable open-weights family yet, under the permissive Apache 2.0 license. The lineup spans an effective 2B and 4B for on-device work, a 26B mixture-of-experts, and a 31B dense model, with support for 140-plus languages, native function calling, and audio and video inputs.
The competitive detail that stung: Google positioned Gemma 4 explicitly against the wave of strong Chinese open-weights models, and the 31B dense variant landed near the top of open-model leaderboards on release. The takeaway for builders is that the floor has risen. You can now run a genuinely capable model on your own hardware, fine-tune it, and never send a token to anyone's API, a real option that didn't meaningfully exist two years ago.
The research lineage behind the hype
SubQ didn't invent the idea of escaping quadratic attention; it commercialized a research direction that's been building for years. The broad family of "efficient attention" and state-space approaches has been chasing sub-quadratic or near-linear scaling for a while, motivated by exactly the problem SubQ targets: the runaway cost of long context. What's genuinely new in 2026 isn't the ambition; it's a funded startup shipping a commercial model on that bet and pricing it to undercut frontier incumbents by orders of magnitude.
That context should shape how you read the launch. A first mover commercializing a known research direction is a meaningful event, but it's also the moment when claims tend to outrun verification, because the incentive to look dramatically better than the transformer is enormous. The healthy posture is to treat the direction as credible and the specific multipliers as provisional. If selective, sparse attention really does deliver near-linear scaling at production quality, the second-order effects matter more than any single benchmark: long-context use cases that are uneconomical at quadratic cost (querying an entire codebase, reasoning over a year of documents, holding a book-length conversation without forgetting the start) move from "demo" to "default." That's the prize the whole race is actually chasing.
Why the fork matters for what you build
Strip away the announcements and you're left with three different theories of where the value is, all live at once:
- OpenAI β value is in the default experience: a reliable, well-grounded frontier model that most people never have to configure.
- Subquadratic β value is in architecture and unit economics: if attention stops being quadratic, long-context use cases that are uneconomical today become routine.
- Google's Gemma β value is in openness and control: weights you own, run anywhere, under a license that lets you ship.
For a team choosing a foundation in mid-2026, the practical guidance falls out of that:
- If you need reliability on high-stakes content, a hosted frontier default like GPT-5.5 Instant is the conservative pick, and the hallucination-rate delta is a real, vendor-measured improvement, not vibes.
- If your problem is genuinely long-context (entire codebases, long legal discovery, multi-document synthesis), watch the subquadratic players closely, but pilot against independent benchmarks before you rewrite anything around a 12M-token window.
- If control, cost, or data residency dominate, open weights like Gemma 4 are now strong enough to be a serious default, not a compromise.
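For the long-context case, a pilot can start with a simple retrieval probe run on your own data. The sketch below is a "needle in a haystack" check, a much cruder stand-in for RULER rather than the benchmark itself. Every name in it is hypothetical, and the two "models" are local stub functions so the example runs without any API; in a real pilot you would swap in a call to the system under test.

```python
import random
from typing import Callable

MARKER = "The access code is "


def build_haystack(needle: str, n_filler: int, position: int) -> str:
    """Bury one needle sentence among numbered filler sentences."""
    sentences = [f"Filler sentence number {i}." for i in range(n_filler)]
    sentences.insert(position, needle)
    return " ".join(sentences)


def retrieval_accuracy(model: Callable[[str], str], n_filler: int,
                       trials: int, seed: int = 0) -> float:
    """Fraction of trials in which the model's answer contains the secret."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        secret = str(rng.randint(100000, 999999))
        position = rng.randint(0, n_filler)
        prompt = build_haystack(f"{MARKER}{secret}.", n_filler, position)
        prompt += " What is the access code?"
        if secret in model(prompt):
            hits += 1
    return hits / trials


def perfect_reader(prompt: str) -> str:
    """Stub model that always finds the needle."""
    start = prompt.index(MARKER) + len(MARKER)
    return prompt[start:start + 6]


def truncating_reader(prompt: str, window: int = 2000) -> str:
    """Stub model that only sees the last `window` characters."""
    tail = prompt[-window:]
    if MARKER not in tail:
        return "unknown"
    start = tail.index(MARKER) + len(MARKER)
    return tail[start:start + 6]


perfect_score = retrieval_accuracy(perfect_reader, n_filler=500, trials=20)
truncated_score = retrieval_accuracy(truncating_reader, n_filler=500, trials=20)
print(f"perfect reader: {perfect_score:.0%}, truncating reader: {truncated_score:.0%}")
```

The design choice is to score by substring match on a random secret, which keeps the check cheap and hard to pass by guessing. It says nothing about multi-hop reasoning or synthesis, so treat a pass as a floor, not as evidence that a vendor's published numbers replicate.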
What to watch
- Independent replication of SubQ's benchmarks. A 300× cost claim is either the most important efficiency result of the year or it isn't reproducible. Third-party RULER and coding evals will settle it.
- Whether hallucination rate becomes the headline metric. OpenAI leading with a 52.5% reduction rather than a reasoning score is a signal that the industry's pitch is shifting from "smarter" to "more trustworthy."
- The subquadratic research line going mainstream. If the big labs ship their own near-linear attention in their next flagships, that's the tell that the quadratic transformer's reign is actually ending, not just being challenged by a startup.
- Open weights closing the gap. Each Gemma and open-model release narrows the distance to frontier. The question for 2026 is how small that gap gets, and how many teams decide it's small enough to stop paying per token.
Two bets, same day, opposite directions. The healthiest sign for the field is that it's no longer obvious which one wins.