The ASIC Gambit: Etched's Sohu-2 Tape-Out
On April 4, 2026, AI chip startup Etched announced the tape-out of its second-generation transformer-specific Application-Specific Integrated Circuit (ASIC), codenamed Sohu-2. The headline figure is staggering: 500,000 tokens per second when running a Llama 70B model at FP8 precision. For context, that's enough throughput to generate the complete text of War and Peace (roughly 750,000 tokens) in about a second and a half. This isn't an incremental improvement over general-purpose GPUs like the H100 or B200; it's a declaration that the future of large model inference belongs to hardware designed for one task and one task only: executing the transformer architecture.
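As a back-of-envelope sanity check of that claim (the word count and tokens-per-word ratio are rough assumptions, not official figures):

```python
# Sanity check of the Sohu-2 headline throughput claim.
# Assumptions: War and Peace is ~587,000 English words, and a typical
# BPE tokenizer produces roughly 1.3 tokens per word.

THROUGHPUT_TOK_PER_S = 500_000   # claimed Sohu-2 throughput (Llama 70B, FP8)
WAR_AND_PEACE_WORDS = 587_000    # approximate English word count
TOKENS_PER_WORD = 1.3            # rough BPE tokenization ratio

total_tokens = WAR_AND_PEACE_WORDS * TOKENS_PER_WORD
seconds = total_tokens / THROUGHPUT_TOK_PER_S
print(f"{total_tokens:,.0f} tokens -> {seconds:.1f} s")  # ~763,100 tokens -> ~1.5 s
```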
Why a Single-Purpose Chip Changes Everything
Technically, the Sohu-2's breakthrough stems from a ruthless focus. General-purpose GPUs, the workhorses of the current AI boom, are marvels of flexible computation. They can train models, render graphics, mine cryptocurrencies, and run scientific simulations. This flexibility comes at a cost: significant silicon area, power, and clock cycles are devoted to circuitry that a transformer's forward pass simply never uses.
Etched's ASIC strips all that away. By designing silicon that directly and efficiently implements the core operations of a transformer (attention, feed-forward networks, layer norms), Etched eliminates the overhead of instruction decoding, scheduling, and general-purpose memory management for operations a transformer never performs. The result is that headline number: 500,000 tokens per second. The strategic implication is even more profound: inference cost per token could plummet by an order of magnitude compared to running on leased GPU cloud instances.
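To make "the core operations of a transformer" concrete, here is a minimal NumPy sketch of one pre-norm transformer block: a single attention head, no masking, no KV cache, none of which a production chip could actually omit. The point is how small and fixed the "instruction set" is:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector; on an ASIC this is a fixed-function unit.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def transformer_block(x, Wq, Wk, Wv, Wo, W1, W2):
    """One pre-norm transformer block: essentially the entire repertoire
    an inference ASIC must hard-wire. Single head, no masking, no KV
    cache, for clarity."""
    h = layer_norm(x)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v   # attention
    x = x + attn @ Wo                                    # residual 1
    h = layer_norm(x)
    x = x + np.maximum(h @ W1, 0.0) @ W2                 # feed-forward + residual 2
    return x

# Tiny smoke test with random weights.
d, seq = 64, 8
rng = np.random.default_rng(0)
ws = [rng.normal(0, 0.02, s) for s in [(d, d)] * 4 + [(d, 4 * d), (4 * d, d)]]
out = transformer_block(rng.normal(size=(seq, d)), *ws)
print(out.shape)  # (8, 64)
```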
This creates a new economic axis in AI. For the last three years, the dominant narrative has been about scaling parameters and data. Sohu-2 introduces a parallel narrative about scaling inference efficiency. When generating 1 million tokens costs pennies instead of dollars, the calculus for every application—from real-time conversational agents and interactive tutors to large-scale synthetic data generation—fundamentally changes.
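To put numbers on "pennies instead of dollars": the comparison below uses hypothetical hourly rates and an assumed GPU-node throughput as stand-ins, since no real pricing has been published:

```python
# Illustrative cost-per-million-tokens comparison. The hourly rates and
# GPU throughput are hypothetical placeholders, not quoted prices.

def cost_per_million_tokens(hourly_rate_usd, tokens_per_second):
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

gpu = cost_per_million_tokens(hourly_rate_usd=25.0, tokens_per_second=10_000)    # assumed GPU node
asic = cost_per_million_tokens(hourly_rate_usd=25.0, tokens_per_second=500_000)  # Sohu-2 claim, same rate
print(f"GPU node:  ${gpu:.2f} per 1M tokens")    # ~$0.69
print(f"ASIC node: ${asic:.4f} per 1M tokens")   # ~$0.0139
print(f"advantage: {gpu / asic:.0f}x")           # 50x under these assumptions
```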
The Immediate Ripple Effects (Next 6 Months)
The announcement is a tape-out, not a product on shelves. Etched indicates samples will be available to select cloud partners in Q3 2026. The first and most obvious impact will be felt in the cloud inference price war already ignited by announcements like Databricks' Mosaic AI Inference. Cloud providers that secure early access to Sohu-2 or similar ASICs will gain a potentially insurmountable cost advantage for serving large open-source models like Llama 70B or DBRX.
This will accelerate two existing trends:
1. The Commoditization of Foundation Model APIs: If the hardware to run Llama 70B cheaply is available, then the marginal cost of offering an API for it approaches zero. Differentiation will shift even more decisively to latency, reliability, and unique data or fine-tuning.
2. The Rise of the "Model-as-a-Feature": Products that previously could not justify the real-time inference cost of a 70B-parameter model (complex video game NPCs, individualized learning companions in educational software, real-time design copilots) suddenly find it economically feasible. The InstaFlow 1-step image generation in Hugging Face's diffusers v0.30, for example, becomes far more deployable at scale if the text prompt is first processed by a large language model that costs almost nothing to run; a rough latency check follows this list.
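That latency check: whether a short NPC reply fits a real-time budget comes down to per-stream decode speed. Every figure below is an illustrative assumption, not a measured Sohu-2 number:

```python
# Feasibility check for a real-time "model-as-a-feature" use case:
# can an NPC reply within a snappy latency budget? All figures are
# illustrative assumptions, not measured Sohu-2 numbers.

REPLY_TOKENS = 60             # short in-game NPC line
PER_STREAM_TOK_PER_S = 2_000  # assumed per-user decode speed (the 500k/s is aggregate)
LATENCY_BUDGET_S = 0.25       # product requirement: the reply feels instantaneous

reply_latency = REPLY_TOKENS / PER_STREAM_TOK_PER_S
verdict = "within" if reply_latency <= LATENCY_BUDGET_S else "over"
print(f"{reply_latency * 1000:.0f} ms per reply ({verdict} budget)")  # 30 ms per reply (within budget)
```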
The 12-Month Horizon: A New Hardware Stack
Looking out a year, the implications grow more structural. If transformer ASICs prove reliable and their software stacks mature, we will see the emergence of a dedicated AI inference hardware layer, with specialized silicon serving the workloads it fits and general-purpose GPUs handling everything else.
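What might that layer look like in practice? One plausible pattern is a thin routing tier that sends pure-transformer models to the specialized pool and everything else to GPUs. A minimal sketch, with hypothetical pool names and model registry:

```python
# Sketch of a routing tier for a heterogeneous inference fleet: send
# pure-transformer models to the ASIC pool, everything else to GPUs.
# Pool names and the ModelInfo registry are hypothetical illustrations.

from dataclasses import dataclass

@dataclass
class ModelInfo:
    name: str
    architecture: str  # e.g. "transformer", "diffusion", "ssm"

POOLS = {"asic": [], "gpu": []}  # pending-request queues per hardware pool

def route(model: ModelInfo, request: dict) -> str:
    # A transformer ASIC only pays off for the architecture it hard-wires;
    # anything else falls back to the flexible (and pricier) GPU pool.
    pool = "asic" if model.architecture == "transformer" else "gpu"
    POOLS[pool].append((model.name, request))
    return pool

print(route(ModelInfo("llama-70b", "transformer"), {"prompt": "hi"}))         # asic
print(route(ModelInfo("instaflow-1step", "diffusion"), {"prompt": "a cat"}))  # gpu
```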
The Honest Counterpoint
The promise is immense, but the path is littered with challenges. ASICs are notoriously inflexible. What happens when the next seminal AI paper introduces a new architecture that isn't a pure transformer? Sohu-2's value is tied to the enduring dominance of the transformer paradigm. Furthermore, building a robust software ecosystem from scratch is a herculean task that has defeated many well-funded hardware startups. Finally, a cost advantage on paper must survive translation into total cost of ownership in a data center, where power, cooling, reliability, and above all utilization decide the real economics; the sketch below shows how sensitive per-token cost is to utilization alone.
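A rough TCO-per-token sketch, in which every input is an assumption chosen purely for illustration:

```python
# Rough total-cost-of-ownership per million tokens. Every input below is
# an assumed placeholder; real TCO depends on actual hardware, facilities,
# and traffic patterns.

def tco_per_million_tokens(capex_usd, lifetime_years, power_kw, pue,
                           usd_per_kwh, utilization, tokens_per_second):
    hours = lifetime_years * 365 * 24
    energy_cost = power_kw * pue * usd_per_kwh * hours  # power incl. cooling (PUE)
    tokens = tokens_per_second * utilization * hours * 3600
    return (capex_usd + energy_cost) / tokens * 1_000_000

# A paper speed advantage shrinks fast if the fleet sits idle:
for util in (0.9, 0.3):
    cost = tco_per_million_tokens(capex_usd=200_000, lifetime_years=4,
                                  power_kw=10, pue=1.3, usd_per_kwh=0.08,
                                  utilization=util, tokens_per_second=500_000)
    print(f"utilization {util:.0%}: ${cost:.4f} per 1M tokens")
# utilization 90%: $0.0042 per 1M tokens
# utilization 30%: $0.0125 per 1M tokens
```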
The Provocative Question
If the cost of inference for a 70B-parameter model falls to near-zero, does the primary bottleneck for AI advancement cease to be computation and become solely one of human creativity in problem formulation and evaluation?