Computing on The Infinite Unknown

Tools That You Serve

Jared Watkins — Mon, 29 Jun 2026 00:00:00 +0000

There’s a recurring pattern in how humans build things: we design a system to serve us, it scales, and somewhere past a certain threshold it flips, and we’re serving it. This isn’t a flaw in individual designs. It’s something that happens consistently, across domains, to institutions and tools that get big enough. A lot of people have written about this from different angles (urban planners, economists, social critics) without quite connecting the dots. One thread that links them runs through, of all places, the Unix philosophy from the early 1970s.

Unix came out of Bell Labs as a reaction to failure. Multics was a comprehensive, design-it-all-at-once attempt to build the perfect OS. Bell Labs pulled out in 1969 because it had become too complex to ship. Ken Thompson and Dennis Ritchie built Unix deliberately small, on a discarded PDP-7, partly so Thompson could port a game he liked. Doug McIlroy proposed the pipe, which Thompson implemented in 1973: feed the output of one program as input to the next. Suddenly a collection of narrow tools could chain into arbitrarily complex operations that no individual tool had anticipated. That toolbox from the 1970s still works. Most software from the same era is archaeology.

Richard Gabriel named why this matters in a 1989 essay he called “Worse is Better.” The Unix approach ships a slightly wrong, very simple design. The competing approach (he associated it with MIT and the Lisp community) prioritizes correctness: don’t ship until it’s right. The surprising result: worse-is-better wins almost every time. A simple design gets deployed, spreads, accumulates users and tooling, and gets iterated on while the correct design is still being finalized in committee. By the time the right thing ships, the worse-but-shipped version owns the ecosystem and has a decade of real-world refinement behind it. Gabriel called Unix and C “the ultimate computer viruses.” He didn’t mean it as a compliment, exactly, but he acknowledged it explained something real about how technology actually spreads. QWERTY keyboards persist because they arrived first and got embedded in muscle memory and manufacturing lines. VHS beat Betamax on recording time and network effects, not picture quality, even though Betamax reached the market first. Windows beat OS/2 despite IBM’s vastly superior technical execution, because Microsoft had it on every cheap PC clone while IBM was still arguing about positioning. Whatever is good enough and gets widely deployed first almost always beats a more correct late arrival.

Jane Jacobs made the same argument about city planning in 1961. Robert Moses was the master planner who reshaped New York and became the template for urban renewal projects across postwar America. His approach: find a dense mixed-use neighborhood, declare it blighted, demolish it, and replace it with towers on cleared superblocks. Internally consistent. Professionally specified. Theoretically correct. Pruitt-Igoe in St. Louis, 33 identical towers on a razed superblock, was demolished twenty years after it opened. The elevators stopped only on every third floor to encourage “vertical neighborhoods.” The long skip-stop corridors became ungovernable. There were no shops, no services, no reason for anyone without a lease to be there, which meant no informal eyes on the street, no social fabric, and no way to tell residents from strangers. Within a decade the buildings were largely abandoned by anyone who had another option, and the city was paying more to maintain them than it would have cost to house the residents elsewhere. Cabrini-Green in Chicago followed the same arc. The designs failed because they were optimized for what a planner could specify on paper: unit counts, square footage, setbacks, green space ratios. They couldn’t accommodate the uses the designers hadn’t anticipated, which turns out to be most of the uses that matter: the corner store that becomes a neighborhood anchor, the stoop culture that creates informal surveillance, the mixed-use density that makes walking somewhere worth doing. The neighborhoods they replaced looked like chaos to the professional eye but held an evolved order that had been tested by actual human behavior over decades, and couldn’t be recreated by writing a better spec.

James Scott taxonomized this failure mode in “Seeing Like a State” (1998). He called it high modernism: the belief that trained experts with the right theory can design an optimal system from scratch, and that the messiness of the existing system is a problem to be solved rather than information to be understood. His examples are almost comically consistent across domains: scientific forestry in 19th-century Prussia (monoculture plantations that looked orderly, maximized timber yield for one generation, and then collapsed because they’d eliminated the ecosystem complexity that made forests self-sustaining), Soviet collectivization, the construction of Brasília as a planned capital city that was efficient on a map and alienating to actually inhabit. The common thread is that the designed system is legible to the administrator and opaque to the people living in it, and it destroys the tacit local knowledge that made the messy original actually function. Scott called that knowledge metis, the Greek word for the practical intelligence you develop through experience rather than training. It’s the farmer who knows which field floods in a wet spring, the mechanic who knows this engine runs hot, the neighborhood where everyone knows which landlord fixes things. High modernism has no column for metis in its spreadsheet, so it eliminates it. Hayek made the same argument about economic markets a generation earlier: prices encode distributed knowledge that no central planner can extract or process. The Soviet planning apparatus didn’t fail because Soviet planners were stupid. It failed because the information required to make millions of allocation decisions correctly is dispersed across millions of people and embedded in their local practices and preferences, not available in any form a committee can use.

Ivan Illich pulled all of this into a sharp, disturbing point in “Tools for Conviviality” in 1973, the same year Thompson was adding pipes to Unix. He distinguished between convivial tools and industrial tools. A convivial tool is one you can use to accomplish your own ends without specialized intermediation. A bicycle multiplies your mobility and you control it entirely. An industrial tool has scaled past the point where it serves you and started shaping you to serve it. It requires experts. It creates dependency. And past a certain threshold, it becomes actively counterproductive, undermining the very purpose it was built to serve.

His examples are brutal. Schools were designed to produce education. At scale they produce credential-dependence: people learn that learning requires institutional validation, that curiosity outside approved channels doesn’t count for much. The school’s primary output becomes the belief that you need a school to learn anything, which is the opposite of education. Medicine was designed to produce health. At scale it produces what Illich called iatrogenesis, harm caused by the healer, not just through clinical errors but through the systematic medicalization of ordinary life. Birth, aging, grief, chronic discomfort, the ordinary difficult textures of being a person, get redefined as medical conditions requiring professional management, until people lose the capacity to navigate these experiences without supervision. The car promised mobility. Illich calculated that when you add up all the hours Americans spend earning money to cover car payments, insurance, fuel, repairs, and maintenance, then divide by distance traveled, the average American in 1973 was moving at about 6 km/h. Roughly a brisk walk. The tool that was supposed to save time was consuming it, and had simultaneously restructured American cities to make every other mode of getting around nearly impossible.

The GLP-1 wave is a live example of medicalization in progress, and it’s worth a closer look because the stakes are unusually high. Semaglutide and its relatives work. They reduce obesity, which is a real health problem with real consequences. But GLP-1 receptors are distributed throughout the brain, not just the gut, and what these drugs actually suppress isn’t hunger specifically but the reward-motivation signal more broadly. People on them report not just reduced appetite but reduced desire for alcohol, gambling, shopping, and a general flattening of the motivational drives that make life feel worth living. We’re medicating away the experience of wanting things. The clinical case for doing this in individuals with severe obesity is defensible. Nobody has seriously answered what happens when tens of millions of people are on these drugs indefinitely. We may be about to find out that human motivation was load-bearing in ways we didn’t appreciate until we started suppressing it at scale.

Back to the car for a moment, because it illustrates Illich’s sharpest concept: radical monopoly. Not market monopoly, where one company controls a product category. Radical monopoly is what happens when a tool restructures its environment so thoroughly that alternatives stop being viable. The infrastructure that would support them has been eliminated. Once American cities were built around the car, there was nowhere left to walk to. The monopoly wasn’t enforced by a corporation; it was baked into the landscape. You couldn’t opt out individually no matter how much you wanted to.

I’ve watched this happen to the internet in my lifetime. Email, the web, IRC, Usenet were open protocols. Anyone could implement them, direct them toward their own ends, leave without losing anything. Facebook, Instagram, TikTok, and Google Search present as tools for connection and discovery. At scale they’ve restructured those activities around their own requirements, optimizing for the engagement time that sells advertising rather than for the things you care about. The social graph you built on one of these platforms isn’t portable. The communities on one exist nowhere else. You can’t leave without losing access to them. That’s radical monopoly, delivered by software in about twenty years.

And the iatrogenesis follows the same pattern. Clinical: documented increases in depression and anxiety causally linked to platform use. Social: algorithmic mediation becoming the default mode of discovering what’s happening, such that unmediated reality starts to feel incomplete. Cultural, the deepest layer: the possibility that platforms are gradually eroding people’s autonomous capacity to form opinions, manage attention, and navigate social life without algorithmic assistance. It’s exactly what Illich predicted would happen when an industrial tool reaches sufficient scale in a domain previously managed through human practice.

Which brings us to AI, where I think the stakes get a lot larger.

The current wave of AI tools is, right now, mostly convivial. I can use an LLM API to build whatever I want. I can run models locally. I can switch providers. The outputs are mine. AI tools slot into pipelines the way Unix tools do, taking input, producing output, no lock-in required. If you squint, it looks like the early internet, and that’s not an accident. A lot of the people building these tools were shaped by Unix culture and are deliberately trying to replicate its structural properties.

But the conditions for the flip are already forming. The models that matter most require infrastructure most people can’t run. The capabilities that make AI actually useful (as opposed to a novelty) live behind APIs controlled by a handful of companies. The consumer products built on top of those APIs, the AI assistants embedded in operating systems, productivity suites, search engines, and phones, are not Unix tools. They don’t expose clean interfaces. They don’t compose. They’re designed to be the environment, not a tool within it. Microsoft embedding Copilot into Office, Google weaving Gemini into Search and Android, Apple building intelligence into the OS, none of those are convivial designs. They’re platforms using AI to deepen the integration that already made them hard to leave.

The Illich framework predicts what comes next. Once enough work, thinking, and decision-making flows through these systems, the capacity to do those things without them starts to atrophy. It happened to navigation when GPS arrived, to memory when search engines did, to arithmetic when calculators became ubiquitous. Each of those is a relatively narrow domain. AI touches reasoning and judgment in ways that make those examples look small.

The same three layers of iatrogenesis show up here. Clinical: we’ll see documented cognitive effects, probably around attention, recall, and tolerance for ambiguity, as people offload more of their thinking to systems that do it faster and more fluently than they can. Social: the normalization of AI-mediated communication and decision-making, such that a job application, a medical question, or a legal problem handled without AI assistance starts to feel like showing up underprepared. Cultural, the hardest layer: the possibility that at sufficient scale, AI systems optimized for engagement or productivity metrics will gradually reshape what people think good thinking looks like, just as social media reshaped what people thought good communication looked like. Not through malicious intent. Just because that’s what industrial tools do past the threshold.

The Unix philosophy’s answer to all of this is structural, not political. Small tools. Open protocols. Composable pieces. No single component achieves radical monopoly because the design forces clean interfaces others can connect to. You can swap grep for something better without touching the rest of the pipeline. You can leave one email provider for another because email is a protocol, not a platform. The catch is that this kind of design is harder. Clean interfaces require upfront discipline; you have to think carefully about what each component does and doesn’t do, and resist the pressure to just add the feature directly rather than expose a composable primitive. Monoliths are faster to build and easier to ship. The incentives point toward integration and lock-in, which is why the default trajectory for any successful platform is to keep pulling more functionality in rather than keeping things separable. Building convivially has to be a deliberate choice, made against the grain. Whether AI development makes that choice or consolidates into a few deeply integrated platforms is probably the most consequential design question in technology right now, and it’s being decided mostly by which business models are winning.
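One way to picture that swap-ability is a toy pipeline. This is a minimal Python sketch, not how Unix itself works: the stage names (`grep`, `grep_ignore_case`, `sort_lines`, `uniq`, `pipeline`) are hypothetical stand-ins for shell tools, and the shared interface (an iterable of lines in, an iterator of lines out) is an assumption that plays the role of the pipe.

```python
from typing import Callable, Iterable, Iterator

# The "clean interface": every stage maps lines to lines.
Stage = Callable[[Iterable[str]], Iterator[str]]


def grep(pattern: str) -> Stage:
    """Keep only lines containing `pattern` (plain substring match)."""
    def stage(lines: Iterable[str]) -> Iterator[str]:
        return (line for line in lines if pattern in line)
    return stage


def grep_ignore_case(pattern: str) -> Stage:
    """A drop-in replacement for grep: same interface, different matching."""
    needle = pattern.lower()

    def stage(lines: Iterable[str]) -> Iterator[str]:
        return (line for line in lines if needle in line.lower())
    return stage


def sort_lines(lines: Iterable[str]) -> Iterator[str]:
    """Sort all lines, like sort(1)."""
    return iter(sorted(lines))


def uniq(lines: Iterable[str]) -> Iterator[str]:
    """Drop adjacent duplicate lines, like uniq(1)."""
    previous = None
    for line in lines:
        if line != previous:
            yield line
        previous = line


def pipeline(*stages: Stage) -> Stage:
    """Compose stages left to right, like `a | b | c` in a shell."""
    def run(lines: Iterable[str]) -> Iterator[str]:
        for stage in stages:
            lines = stage(lines)
        return iter(lines)
    return run


log = ["error: disk", "ok", "Error: net", "error: disk"]
strict = pipeline(grep("error"), sort_lines, uniq)
relaxed = pipeline(grep_ignore_case("error"), sort_lines, uniq)

print(list(strict(log)))   # ['error: disk']
print(list(relaxed(log)))  # ['Error: net', 'error: disk']
```

Swapping `grep` for `grep_ignore_case` changes one argument to `pipeline`; `sort_lines` and `uniq` never find out.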

Illich, Hayek, Scott, Jacobs, and McIlroy are making the same argument in different languages about different domains. Systems built from human-scale composable pieces, evolved through use, consistently outperform systems designed all at once by expert authority. And tools past a certain threshold of scale stop serving people and start capturing them. The thread runs from Prussian forestry in the 1800s to the AI assistant on your phone in 2026, and the dynamic is the same throughout.

Whether the people building these systems have ever heard of Illich is irrelevant. The pattern he described doesn’t require anyone to intend it.

There’s a lot more depth in the research section if you want to go further: Unix history, the MIT/New Jersey philosophical divide, the security critique, and Illich’s three books.

Future Value of GPU Compute

Jared Watkins — Mon, 01 Jun 2026 00:00:00 +0000

There’s a number that comes up constantly in AI coverage: the size of the capex commitment. Microsoft, Google, Meta, Amazon, collectively spending hundreds of billions on AI infrastructure. The implicit assumption in most of that coverage is that more spending equals more capability equals more future value. That’s not wrong, but it leaves out all the interesting parts.

The real question isn’t how much you’re spending. It’s what you’re actually buying, how fast it depreciates (in practice, not on paper), and how the economics shift over the hardware lifetime compared to what you could have deployed instead. Those are three separate variables that interact in non-obvious ways.

The Building Itself Is Expensive and Getting More Expensive Fast

Before a single GPU gets installed, you need a facility. Not a regular datacenter, an AI datacenter, which means high-density power delivery, liquid cooling, and structural requirements that didn’t exist five years ago. For AI infrastructure, the right unit to think in is cost per megawatt of capacity, because power density is the actual constraint. Square footage matters too, but a traditional datacenter might provision 5 to 10kW per rack while an AI facility runs 100kW per rack or more. You’re not buying floor space, you’re buying powered, cooled floor space.

Year	AI DC ($/sqft)	AI DC ($/MW)	YoY ($/MW)
2023	~$700	~$10M	baseline
2024	~$800	~$14M	+40%
2025	~$1,100	~$20M	+43%
2026	~$1,500+	~$28M+	+40%

Both metrics are climbing, but per-MW is rising faster because each generation of AI hardware packs more power into the same floor space. The cost per square foot broke $1,000 for the first time in 2025 and by 2026 is roughly double its 2023 level. Per-MW costs have nearly tripled over the same period. All-in costs including GPU hardware push the per-MW number to $30 to 40 million today.
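The growth figures are easy to recompute from the table. A small Python sketch follows; the helper names `yoy_growth` and `multiple` are mine, and the inputs are the table’s approximate values with the “+” on the 2026 figures dropped.

```python
# Approximate figures copied from the table above (USD).
cost_per_mw = {2023: 10e6, 2024: 14e6, 2025: 20e6, 2026: 28e6}
cost_per_sqft = {2023: 700, 2024: 800, 2025: 1100, 2026: 1500}


def yoy_growth(series):
    """Year-over-year growth as a fraction, keyed by the later year."""
    years = sorted(series)
    return {later: series[later] / series[earlier] - 1
            for earlier, later in zip(years, years[1:])}


def multiple(series, start, end):
    """How many times larger the value is in `end` than in `start`."""
    return series[end] / series[start]


mw_growth = yoy_growth(cost_per_mw)
for year, growth in mw_growth.items():
    print(f"{year}: {growth:+.0%} per MW")

print(f"$/MW,   2023 to 2026: {multiple(cost_per_mw, 2023, 2026):.1f}x")
print(f"$/sqft, 2023 to 2026: {multiple(cost_per_sqft, 2023, 2026):.1f}x")
```

On these inputs the per-MW multiple comes out at 2.8x and the per-square-foot multiple at about 2.1x.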

Two things are driving this. First, AI chips are power-hungry and dense in ways that older datacenter designs didn’t anticipate. Second, construction labor and materials costs haven’t come down from their post-pandemic highs, and the competition for workers who know how to build these things has gotten fierce.

Both costs are going up every year, and next-gen hardware runs hotter and denser, so the next build will cost more still.

What It Actually Costs to Deploy a GPU Slot

The facility cost per MW is one number. What that translates to per GPU slot is the one that matters for the economics, and it’s been climbing faster than the per-MW headline suggests because each generation of hardware is denser.

An H100 at 700W GPU draw means roughly 1,200 H100 slots per megawatt of facility capacity once you account for PUE overhead (at PUE 1.2, roughly 833kW of IT load per MW of facility power, divided by ~700W per slot). A B200 at 1,000W drops that to around 830 slots per MW. Newer and more capable, but the facility cost per slot goes up even before the chip price. At $20M per MW (2025 pricing), an H100 facility costs roughly $17,000 per GPU slot in facility-side capital. A B200 facility at $28M per MW costs around $34,000 per slot. The chip itself is separate.

Generation	GPU TDP	Slots per MW (at PUE 1.2)	Facility cost/MW	Facility cost/slot
H100 (2023)	700W	~1,200	~$10M	~$8,000
H100 (2025)	700W	~1,200	~$20M	~$17,000
B200 (2026)	1,000W	~830	~$28M	~$34,000
B300/NVL72 (2026+)	1,400W	~595	~$35M+	~$59,000+
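The per-slot arithmetic behind this table can be sketched in a few lines of Python. The function names are mine, and the model makes the same simplifying assumption as the text: slots are sized on GPU draw alone at PUE 1.2. Unrounded results land slightly off the table because the table rounds the slot counts first.

```python
def slots_per_mw(gpu_watts, pue=1.2):
    """GPU slots one megawatt of facility power supports (GPU draw only)."""
    it_load_watts = 1_000_000 / pue  # power left for IT load after PUE overhead
    return it_load_watts / gpu_watts


def facility_cost_per_slot(cost_per_mw, gpu_watts, pue=1.2):
    """Facility-side capital per GPU slot, excluding the chip itself."""
    return cost_per_mw / slots_per_mw(gpu_watts, pue)


# (label, GPU TDP in watts, facility cost per MW in USD) from the table above.
generations = [
    ("H100 (2023)", 700, 10e6),
    ("H100 (2025)", 700, 20e6),
    ("B200 (2026)", 1000, 28e6),
    ("B300/NVL72 (2026+)", 1400, 35e6),
]

for label, watts, cost_per_mw in generations:
    slots = slots_per_mw(watts)
    per_slot = facility_cost_per_slot(cost_per_mw, watts)
    print(f"{label:20s} {slots:7.0f} slots/MW  ${per_slot:,.0f} per slot")
```

Changing `pue` is the quickest way to see how sensitive the per-slot figure is to cooling efficiency.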

The NVL72 rack (NVIDIA’s GB200 rack-scale system) is the clearest example of where this is heading: 72 Blackwell GPUs, well over 100kW of draw, in roughly the same floor footprint as the 10kW enterprise rack it replaced. The facility-side capital per GPU slot at that density, including DLC plumbing, HVDC power distribution, CDUs, and the structural reinforcements needed for 6,000 lb racks, is in the $55,000 to $65,000 range before you put a single chip in it.

This is the part that doesn’t get discussed when people talk about GPU efficiency gains. Yes, a B300 delivers 3.7x the TFLOPS/watt of an H100. But deploying it costs roughly 3.5x more per slot in facility capital than deploying an H100 did at 2025 facility prices. The per-chip efficiency improvement and the per-slot infrastructure cost are both rising, and they’re rising together. The net economics per useful compute unit are better than the raw hardware comparison suggests, but the capex required to realize that improvement keeps increasing.
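To make the “both rising together” point concrete, here is a rough Python sketch that divides facility capital per slot by FP8 throughput per chip. It uses only figures quoted in this post (the ~$17,000 and ~$59,000 per-slot costs, and the ~2,000 and ~15,000 FP8 TFLOPS per chip). The metric itself, facility dollars per TFLOPS, is my framing, and it ignores chip price, utilization, and everything else outside facility capital.

```python
# (facility $ per slot, FP8 TFLOPS per GPU), both quoted in this post.
h100_2025 = (17_000, 2_000)
b300 = (59_000, 15_000)


def facility_dollars_per_tflops(cost_per_slot, tflops):
    """Facility-side capital needed per TFLOPS of FP8 throughput."""
    return cost_per_slot / tflops


slot_cost_ratio = b300[0] / h100_2025[0]      # how much more each slot costs
throughput_ratio = b300[1] / h100_2025[1]     # how much more each slot delivers
h100_per_tflops = facility_dollars_per_tflops(*h100_2025)
b300_per_tflops = facility_dollars_per_tflops(*b300)

print(f"slot cost:  {slot_cost_ratio:.2f}x")
print(f"throughput: {throughput_ratio:.1f}x")
print(f"H100 (2025): ${h100_per_tflops:.2f} of facility capital per TFLOPS")
print(f"B300:        ${b300_per_tflops:.2f} of facility capital per TFLOPS")
```

On these inputs the capital per TFLOPS falls by roughly half even as the capital per slot more than triples, which is the tension the paragraph describes.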

There’s a one-way ratchet embedded in this cost structure that’s worth pointing out. Each generation requires more expensive facilities: more power density, liquid cooling instead of air, HVDC distribution instead of standard AC, structural reinforcements, DLC manifolds. Operators who build for B200 or NVL72 density are committing to infrastructure that implicitly requires the revenue profile of B200 or NVL72 class hardware to justify the capital. You can’t put H100s in a $60,000-per-slot facility and make the economics work. And you can’t easily go backwards: a facility designed for that rack density can’t be cheaply redeployed for lower-density hardware once that hardware stops generating enough revenue to cover the infrastructure cost.

This changes the incentive structure for everyone in the chain. Operators who’ve committed to high-density infrastructure need the next GPU generation to ship, need it to command premium rental rates long enough to amortize the facility, and need demand to stay strong enough that the cascade through inference and secondary markets actually materializes. The facilities lock in an expectation of continuous hardware improvement at roughly the same cadence, because slowing down means stranded infrastructure cost. Nvidia knows this. The hyperscalers know this. It’s part of why the build-out keeps accelerating even as H100 rental rates crater: stopping or slowing means admitting that the facilities already built are ahead of the demand that can pay for them.

There’s also a density wall approaching. Each generation has pushed more watts into the same rack footprint by shifting cooling from air to liquid. Air cooling topped out around 30 to 40kW per rack. Direct liquid cooling handles 100kW comfortably and 1MW at significant engineering cost. What comes after 1MW is genuinely uncertain: immersion cooling (submerging hardware in dielectric fluid) is one path, but it introduces its own operational complexity and still has physical limits. At some point the silicon itself has thermal constraints that no amount of cooling innovation can engineer around. When the hardware efficiency curve eventually flattens, the ratchet becomes a problem: facilities priced for the rate of improvement we’ve seen over the last three years, serving hardware that isn’t improving at that rate anymore.

What the GPU You’re Deploying Today Actually Earns

Cloud GPU rental rates are the most transparent signal we have for what compute is actually worth in the market. The H100, which defined the AI compute moment of 2023 and 2024, is the object lesson.

GPU	Peak Rental Rate	Current Rate (2026)	Decline
A100	~$6.00/hr	~$1.35/hr	-78%
H100	~$8.50/hr	~$1.50-2.50/hr	-70-82%
H200	~$6.00/hr	~$2.50-3.40/hr	-43-58%
B200	(launching)	~$5.00-6.00/hr	–

H100 rental rates have collapsed 70 to 82% from peak. The A100 is basically commodity at this point. The reason is straightforward: there’s more H100 capacity than the market needs, because the H200 and B200 have shown up and buyers prefer the newer hardware for new workloads.
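The decline column can be recomputed directly from the table’s peak and current rates. A short Python sketch, with the rates copied from the table and `decline` as my own helper name:

```python
# (peak $/hr, low current $/hr, high current $/hr) from the table above.
rates = {
    "A100": (6.00, 1.35, 1.35),
    "H100": (8.50, 1.50, 2.50),
    "H200": (6.00, 2.50, 3.40),
}


def decline(peak, current):
    """Fractional drop from the peak rental rate."""
    return 1 - current / peak


decline_ranges = {}
for gpu, (peak, low, high) in rates.items():
    # The higher current rate gives the smaller decline, and vice versa.
    decline_ranges[gpu] = (decline(peak, high), decline(peak, low))

for gpu, (smallest, largest) in decline_ranges.items():
    print(f"{gpu}: {smallest:.0%} to {largest:.0%} below peak")
```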

This is the revenue side of the useful-life question. When you deploy an H100 cluster today, you’re not locking in today’s rental rate. You’re locking in a trajectory that the A100 already traced.

The Efficiency Gap Widens Every Generation

Here’s where the math gets uncomfortable. Each new GPU generation doesn’t just offer more performance. It offers dramatically better performance per dollar and per watt, which changes the economics of what the previous generation can charge.

Figure: AI Compute Performance Scaling (normalized to H100 = 1). Relative FP8 throughput by generation: H100 (2023) 1, H200 (2024) 1.9, B200 (2025) 4.5, B300 (2026) 7.5.

The specific numbers from NVIDIA: H100 delivers roughly 2 petaFLOPS FP8, H200 around 3.9 petaFLOPS, B200 hits 9 petaFLOPS, and the B300 (Blackwell Ultra, shipping January 2026) delivers 15 petaFLOPS per chip. That’s about a 7.5x improvement in raw throughput from H100 to B300 in three years.

But raw throughput understates the problem. NVIDIA claims 50x higher throughput per megawatt for Blackwell vs. Hopper on inference workloads. Fifty times. Even discounting that figure generously (NVIDIA’s marketing numbers deserve scrutiny), the efficiency gap for inference (which is where most of the actual revenue generation happens) is substantial.

GPU	TDP (Watts)	FP8 TFLOPS	TFLOPS/Watt	Relative Efficiency
H100 SXM	700W	~2,000	2.86	1.0x
H200 SXM	700W	~3,900	5.57	1.9x
B200 SXM	1,000W	~9,000	9.00	3.1x
B300	1,400W	~15,000	10.7	3.7x

Every watt-hour that runs an H100 today is a watt-hour that could run a B200 delivering 3x more useful work. That comparison only gets worse as time passes and the next generation (presumably something after B300) widens the gap further.
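The efficiency column in the table is just throughput divided by TDP, normalized to the H100. A minimal Python sketch using the table’s approximate figures (the helper name is mine):

```python
# (TDP in watts, approximate FP8 TFLOPS) from the table above.
gpus = {
    "H100 SXM": (700, 2_000),
    "H200 SXM": (700, 3_900),
    "B200 SXM": (1_000, 9_000),
    "B300": (1_400, 15_000),
}


def tflops_per_watt(tdp_watts, tflops):
    """FP8 throughput delivered per watt of rated GPU power."""
    return tflops / tdp_watts


baseline = tflops_per_watt(*gpus["H100 SXM"])
relative = {name: tflops_per_watt(*spec) / baseline for name, spec in gpus.items()}

for name, spec in gpus.items():
    print(f"{name:9s} {tflops_per_watt(*spec):5.2f} TFLOPS/W  {relative[name]:.2f}x")
```

The unrounded ratios come out a hair above the table’s (1.95x, 3.15x, 3.75x), since the table appears to truncate rather than round.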

Power Is the Cost That Keeps on Taking

The last piece is operating expense, specifically power. An H100 cluster running flat-out burns ~700W per GPU, but that’s just the chip. The full system draw per GPU slot is closer to 1,000 to 1,200W once you add cooling overhead and power conversion losses (PUE). On top of that, the InfiniBand switch fabric for a dense H100 cluster adds another 20 to 40W per GPU slot when you amortize switch power across the GPUs it serves (NVLink is on-package and already inside the 700W TDP), and NVMe storage arrays add roughly 20 to 50W per GPU slot depending on the storage-to-compute ratio. A fully loaded H100 slot isn’t a 700W problem, it’s closer to a 1,040 to 1,290W problem when you account for everything running alongside it.

Component	Power per GPU slot
H100 GPU	~700W
Cooling + PUE overhead	~300-500W
InfiniBand switch fabric (amortized)	~20-40W
Storage (NVMe arrays, amortized)	~20-50W
Total system	~1,040-1,290W
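The total row is the sum of the low ends and the sum of the high ends of each component range. As a quick check in Python, with the component figures copied from the table:

```python
# (low watts, high watts) per GPU slot, from the table above.
components = {
    "H100 GPU": (700, 700),
    "Cooling + PUE overhead": (300, 500),
    "InfiniBand switch fabric (amortized)": (20, 40),
    "Storage (NVMe arrays, amortized)": (20, 50),
}

total_low = sum(low for low, _ in components.values())
total_high = sum(high for _, high in components.values())

print(f"Total system draw per slot: {total_low:,} to {total_high:,} W")
```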

At current datacenter electricity rates, a fully loaded H100 slot runs $600 to $1,300 per year in power costs. Best case is a large operator with a long-term PPA locked at $0.06/kWh and minimum system draw: over 5 years that’s about $3,000 cumulative. For operators on spot power or renegotiating contracts, rates compound. U.S. electricity prices jumped 27% between 2019 and 2025 and PPA contract prices jumped 35% in 2024 alone, so assume 8% per year if you’re not locked in. Starting at $0.08/kWh mid-range, that compounds to over $5,400 cumulative by year 5, and the annual bill in year 5 is about 36% higher than year 1.
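Here is a rough Python sketch of that power-cost model, assuming continuous operation (8,760 hours a year) and a rate that steps up once per year. The function names are mine, and the results depend on which system draw you plug in, so they will not necessarily reproduce the rounded figures above.

```python
HOURS_PER_YEAR = 8760


def annual_power_cost(watts, usd_per_kwh):
    """One year of electricity for a slot running flat-out."""
    return watts / 1000 * HOURS_PER_YEAR * usd_per_kwh


def cumulative_power_cost(watts, start_rate, annual_increase, years=5):
    """Sum of yearly bills when the rate rises by `annual_increase` each year."""
    return sum(
        annual_power_cost(watts, start_rate * (1 + annual_increase) ** year)
        for year in range(years)
    )


# Locked PPA at $0.06/kWh, minimum system draw of 1,040W.
locked_total = cumulative_power_cost(1040, 0.06, 0.0)

# Unlocked: $0.08/kWh rising 8% a year, at both ends of the draw range.
unlocked_low = cumulative_power_cost(1040, 0.08, 0.08)
unlocked_high = cumulative_power_cost(1290, 0.08, 0.08)

# Year-5 bill relative to year 1 after four annual increases.
year5_vs_year1 = (1 + 0.08) ** 4

print(f"Locked PPA, 5 years:  ${locked_total:,.0f}")
print(f"Unlocked, 5 years:    ${unlocked_low:,.0f} to ${unlocked_high:,.0f}")
print(f"Year 5 vs year 1:     {year5_vs_year1 - 1:.0%} higher")
```

The locked case comes to roughly $2,700 over five years at minimum draw, in the neighborhood of the $3,000 quoted once you round the annual bill up to $600.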

The PPA question matters a lot here. If you’re locked in for 10 to 15 years, your power cost is predictable. The problem is that new capacity additions often can’t get PPA coverage fast enough, and the rate you lock in today is higher than what you’d have locked in two years ago. You’re not escaping the trend, you’re just drawing it out.

So the power bill isn’t just an operating cost, it’s an opportunity cost. Every watt running an H100 at year 3 is a watt that could be running a B200 delivering 3x more useful compute for roughly the same electricity. That gap doesn’t shrink over time, it widens as each new generation ships.

When Does It Make Sense to Replace?

At what point does a newer generation make your existing hardware worth turning off, not just less profitable?

The naive answer is “when the new hardware pays for itself in efficiency gains,” but that understates the real calculation. You’re not just comparing the new hardware’s efficiency against your old hardware’s efficiency. You’re comparing the total system economics: what the old hardware earns minus what it costs to run, versus what new hardware would earn minus what it costs to acquire and run. Replacement makes sense when the gap between those two numbers exceeds the capital cost of the swap.

With H100s specifically, the numbers are getting uncomfortable. Rental rates have dropped 70 to 80% from peak, so revenue per hour is already a fraction of what it was at deployment. B200s deliver roughly 3x better compute per watt. If an H100 slot earns $2/hr today and a B200 slot earns $5/hr at similar or lower power cost, the B200 pays back acquisition cost within months at reasonable utilization. The H100 isn’t worthless, but its margin is thin enough that the replacement calculus is real.
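The replacement rule can be written down as a small model. This is a sketch under stated assumptions, not a capex tool: the $2/hr and $5/hr rental rates are the illustrative ones from the paragraph above, while the utilization, operating costs, swap capital, and the `SlotEconomics` class itself are hypothetical placeholders to be replaced with real numbers.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 8760


@dataclass
class SlotEconomics:
    rental_usd_per_hour: float
    utilization: float        # fraction of hours actually rented (assumed)
    opex_usd_per_year: float  # power, cooling, staff, etc. (assumed)

    def annual_margin(self) -> float:
        """What the slot earns in a year minus what it costs to run."""
        revenue = self.rental_usd_per_hour * HOURS_PER_YEAR * self.utilization
        return revenue - self.opex_usd_per_year


def should_replace(old, new, swap_capital_usd, horizon_years):
    """True when the margin gained over the horizon exceeds the swap cost."""
    gain = (new.annual_margin() - old.annual_margin()) * horizon_years
    return gain > swap_capital_usd


def payback_months(old, new, swap_capital_usd):
    """Months of extra margin needed to recover the swap capital."""
    gain_per_year = new.annual_margin() - old.annual_margin()
    if gain_per_year <= 0:
        return float("inf")  # the swap never pays for itself
    return 12 * swap_capital_usd / gain_per_year


# Hypothetical inputs: 70% utilization, made-up opex and swap capital.
h100 = SlotEconomics(rental_usd_per_hour=2.0, utilization=0.7, opex_usd_per_year=1_000)
b200 = SlotEconomics(rental_usd_per_hour=5.0, utilization=0.7, opex_usd_per_year=1_400)
swap_capital = 35_000  # net cost of the swap after any resale proceeds (assumed)

print(f"Replace on a 3-year horizon? {should_replace(h100, b200, swap_capital, 3)}")
print(f"Payback: {payback_months(h100, b200, swap_capital):.1f} months")
```

The answer is very sensitive to utilization and to how much of the swap cost is offset by selling the old hardware, so it is worth running with a range of inputs rather than a single point estimate.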

The honest answer for hardware purchased today is probably a 2 to 3 year economic peak, followed by a tail where the hardware still earns but increasingly just covers operating costs. NVIDIA’s roughly annual release cadence makes this worse than it used to be. When a new architecture ships every 12 to 18 months and each generation delivers 2x to 4x efficiency improvements, the crossover point arrives faster than enterprise server refresh cycles (which ran 5 years) would suggest. That 2 to 3 year window is the number that should be driving every capex model in this space.

The Secondary Market Adds Another Exit

You don’t have to run aging hardware to end-of-life. You can sell it.

The secondary market for AI GPUs is real and surprisingly liquid, at least for now. H100s that cost $30,000 new are trading used at $18,000 to $22,000 in early 2026. A100s, a full generation back, still fetch $8,000 to $18,000 depending on variant. That's not nothing. For an operator who deployed H100s in 2023 at $30K a chip, selling in 2025 at $20K and redeploying capital into B200s is a legitimate economic decision, potentially better than running the H100s into year 4 at shrinking margins.
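Those resale prices imply a depreciation rate you can back out directly. A small Python sketch follows; the prices are the ones quoted above, the two-year holding period comes from the 2023-to-2025 example, and the constant-rate decay is my simplifying assumption.

```python
def residual_fraction(resale_usd, purchase_usd):
    """Share of the purchase price recovered on resale."""
    return resale_usd / purchase_usd


def implied_annual_decay(resale_usd, purchase_usd, years):
    """Constant yearly loss of value that would produce this resale price."""
    return 1 - (resale_usd / purchase_usd) ** (1 / years)


purchase = 30_000
low_residual = residual_fraction(18_000, purchase)
high_residual = residual_fraction(22_000, purchase)
decay = implied_annual_decay(20_000, purchase, years=2)

print(f"Residual value: {low_residual:.0%} to {high_residual:.0%} of purchase price")
print(f"Implied decay for a $20K sale after 2 years: {decay:.1%} per year")
```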

This creates a third option beyond “run it until it’s worthless” or “scrap it”: sell into the secondary market while the hardware still has residual value, and use the proceeds to fund the next deployment. The optimal exit point is somewhere before rental rates collapse far enough that buyers start pricing in the same math you’re doing.

The catch is that the secondary market itself is on a decay curve. As more operators reach this conclusion and the supply of used H100s grows, secondary prices will follow rental rates down. The A100 traced this path already: premium secondary pricing in 2022 and 2023, then a long slide as H100 supply expanded and A100 demand softened. The operators who sold their A100s early captured real value. The ones who held into 2025 got commodity prices. H100s are somewhere in the middle of that same arc right now, which means the window for a good secondary exit is open but not permanently.

Long term, this accelerates. As generation gaps widen and efficiency improvements compound, the pool of buyers willing to deploy used older-gen hardware shrinks. Eventually the secondary market for a given GPU generation isn’t “data center operators looking for a deal,” it’s “research labs, universities, and crypto miners who need cheap compute and don’t care about inference economics.” That’s not zero value, but it’s a different buyer at a much lower price.

The Financial Engineering Layer

The AI infrastructure build-out has attracted an enormous amount of creative financing, and a lot of it is structured in ways that move the depreciation problem off the operator’s balance sheet. The basic pattern: sell the GPUs to a special-purpose vehicle, lease them back on a triple-net structure, book the lease payments as operating expense rather than capital depreciation. The SPV owns the depreciating asset; you operate it. Risk transferred, at least on paper.

A specific and publicly disclosed example: [Apollo led a $3.5 billion capital solution for Valor Compute Infrastructure](https://ir.apollo.com/news-events/press-releases/detail/599/apollo-backs-5-4-billion-valor-and-xai-data-center-compute) to fund a $5.4 billion purchase of GB200 GPUs, leased to xAI on a triple-net structure. Nvidia went in as an anchor LP. What’s interesting is the round-trip: Nvidia booked $5.4 billion in revenue on the sale, but then re-injected $1.9 billion back into VCI as a limited partner. Outside capital in the deal was roughly $3.5 billion. If part of your "sale" is funded by capital you re-injected, there's a legitimate question under ASC 606 about whether the full $5.4 billion should be recognized as revenue, or whether the $1.9 billion round-trip portion should be netted off. Auditors will also need to decide whether VCI qualifies as a variable interest entity that should be consolidated onto Nvidia’s balance sheet. The accounting treatment of the round-trip is a real open question that will matter at scale.

What is ASC 606 and why does it matter here?

ASC 606 is the US accounting standard that governs when a company can recognize revenue from a contract with a customer. The core principle: revenue gets recorded when control of a good or service transfers to the buyer, in an amount that reflects what the seller expects to receive in exchange. Straightforward for most transactions, but it has teeth in situations where the “sale” isn’t a clean arm’s-length exchange.

The Nvidia/VCI deal raises a specific issue under ASC 606’s guidance on “variable consideration” and “transactions with related parties.” Nvidia sold $5.4 billion of GB200 GPUs to VCI. Fine. But Nvidia also put $1.9 billion back into VCI as a limited partner, meaning it effectively funded roughly 35% of its own customer’s purchase. The question auditors have to answer: did control of the GPUs genuinely transfer to VCI, or is this more like a consignment arrangement where Nvidia retains meaningful economic exposure to the assets?

If VCI bears the real risks and rewards of ownership (it does, at least formally, on a triple-net lease structure), Nvidia can book the sale. But the $1.9 billion re-injection complicates the "transaction price" calculation. Under ASC 606 paragraph 606-10-32-25, consideration payable to a customer reduces the transaction price unless it's in exchange for a distinct good or service. An LP stake isn't obviously a distinct good or service (it looks more like a price concession or an inducement to do the deal). The clean treatment would be to net the $1.9 billion off the $5.4 billion and recognize $3.5 billion in revenue. The aggressive treatment is to book the full $5.4 billion and carry the LP stake as a separate investment.
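The gap between the two treatments is plain arithmetic, and a short sketch makes it concrete. The dollar figures are the ones quoted above; the function name and the boolean flag are illustrative assumptions of mine, and the rule is simplified well below what an auditor would actually apply.

```python
# Illustrative only: gross vs. net revenue recognition when a seller pays
# consideration back to its customer. Amounts are in millions of USD and come
# from the deal figures quoted in the text. The function is a simplification,
# not a statement of how ASC 606 is applied in practice.

def recognized_revenue(sale_price_m: int, paid_to_customer_m: int,
                       for_distinct_good_or_service: bool) -> int:
    """Payment to the customer reduces the transaction price unless it
    buys something distinct from the sale itself."""
    if for_distinct_good_or_service:
        return sale_price_m
    return sale_price_m - paid_to_customer_m

SALE_M = 5_400       # GPU sale to VCI
LP_STAKE_M = 1_900   # capital Nvidia put back into VCI as a limited partner

gross = recognized_revenue(SALE_M, LP_STAKE_M, for_distinct_good_or_service=True)
net = recognized_revenue(SALE_M, LP_STAKE_M, for_distinct_good_or_service=False)
self_funded_share = LP_STAKE_M / SALE_M

print(f"aggressive treatment (gross): ${gross:,}M")
print(f"clean treatment (net):        ${net:,}M")
print(f"share of purchase funded by the seller: {self_funded_share:.0%}")
```

The whole dispute lives in that one boolean: whether the LP stake counts as payment for something distinct.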

There’s also a Variable Interest Entity question under ASC 810. If Nvidia has the power to direct VCI’s activities and absorbs a significant portion of its losses or returns (both plausible given a 35% LP stake), VCI might need to be consolidated onto Nvidia’s balance sheet entirely. At that point the “sale” disappears and the GPUs stay on Nvidia’s books as leased assets. That’s a very different income statement.

The point is that “Nvidia booked $5.4 billion in GPU revenue” and “Nvidia sold $5.4 billion of GPUs to an independent buyer” are not necessarily the same statement.

The GPU valuations sitting inside these structures are also worth watching. Fair value on assets with no active market gets classified as Level 3, meaning no directly observable price inputs. That doesn’t mean unverifiable (auditors use secondary market comps and bring in valuation specialists), but it does mean management has significant discretion in the estimates. On a multi-year lease with 16x leverage and GPU residual-value risk at the bottom, optimistic Level 3 marks are where problems tend to hide until they don’t.

At the hyperscaler level the same dynamic plays out through depreciation schedules. Meta extended its useful-life estimate for servers and network gear to 5.5 years in January 2025, reducing its 2025 depreciation expense by roughly $2.9 billion in a single accounting change. Microsoft and Google had already made similar moves, to six-year schedules. Amazon went the other direction: it shortened useful life for a subset of servers in February 2025, explicitly citing “the increased pace of technology development, particularly in the area of artificial intelligence.” One of these companies is reading the hardware market correctly.

What all of it has in common is that the economic decay curve I’ve been describing doesn’t disappear, it just moves to whoever is on the other side of the financing. When enough of those bets go wrong at the same time, that tends to get interesting in a headline-making way. Michael Burry called the deal “fugazi” and warned that retirees were unknowingly carrying GPU residual-value risk through Athene (Apollo’s insurance subsidiary, which bought the securitized debt). That framing is a bit sensationalized (policyholders hold fixed contractual claims against Athene’s solvency, not direct exposure to GPU prices), but the underlying concern about round-tripped revenue and optimistic Level 3 marks stacked on 16x leverage is defensible. One auditor put it well: “I’d hate to be the audit partner signing these transactions off.” The Arthur Andersen reference at the end of that sentence wasn’t subtle.

Where the Risk Actually Goes

The creative financing structures I just described don’t reduce risk. They redistribute it, and the redistribution isn’t random. It follows a predictable pattern: the party with the most sophisticated understanding of GPU economics offloads exposure to the party with less. Think of it as 3-card monte where the risk is the card and the shell game keeps moving until it ends up with whoever stopped paying close attention.

Technology obsolescence risk is the core one. A GPU bought today at $30,000 may be worth $10,000 in three years when the next architecture ships. On a 5-year lease, the lender who accepted the GPU as collateral at year-1 valuations is holding an asset that may not cover the loan balance by year 3.

Duration mismatch risk compounds this. GPU useful life runs 18 to 36 months in practice. The debt financing these purchases runs 5 to 7 years. The collateral deteriorates faster than the loan amortizes. Nobody structures a mortgage where the house loses 70% of its value in the first three years, but that’s roughly what GPU-backed debt looks like if you take the hardware economics seriously.
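Here is a rough sketch of that mismatch in numbers. The $30,000 to $10,000 decline over three years is from the paragraph above; the 80% loan-to-value, 10% rate, and annual level payments are hypothetical assumptions I picked to make the example run, not terms from any real deal.

```python
# Sketch: collateral value vs. outstanding loan balance on GPU-backed debt.
# Hypothetical loan terms; only the GPU price decline comes from the text.

GPU_PRICE = 30_000.0
VALUE_AT_YEAR_3 = 10_000.0
# Constant yearly retention factor that takes $30,000 to $10,000 in 3 years.
ANNUAL_RETENTION = (VALUE_AT_YEAR_3 / GPU_PRICE) ** (1 / 3)

LOAN_TO_VALUE = 0.80   # assumption
RATE = 0.10            # assumption, annual
TERM_YEARS = 5         # low end of the 5 to 7 year range

def collateral_value(year: int) -> float:
    """GPU resale value after `year` years of constant-rate decay."""
    return GPU_PRICE * ANNUAL_RETENTION ** year

def loan_balance(year: int) -> float:
    """Remaining principal on a level-payment loan after `year` payments."""
    principal = GPU_PRICE * LOAN_TO_VALUE
    growth = (1 + RATE) ** TERM_YEARS
    payment = principal * RATE * growth / (growth - 1)
    accrued = (1 + RATE) ** year
    return principal * accrued - payment * (accrued - 1) / RATE

for year in range(TERM_YEARS + 1):
    value, balance = collateral_value(year), loan_balance(year)
    flag = "  <- loan exceeds collateral" if balance > value else ""
    print(f"year {year}: collateral ${value:>9,.0f}  balance ${balance:>9,.0f}{flag}")
```

Under these assumptions the loan goes underwater in year 2 and stays there until it is nearly paid off. Different terms move the crossover, but the shape is the same whenever the asset decays faster than the debt amortizes.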

Concentration risk is the one that makes this systemic rather than just individual. [AI data center debt issuance exceeded $200 billion in 2025](https://www.theaiconsultingnetwork.com/blog/ai-data-center-gpu-debt-financing-insurance-cre-investors-2026), with JPMorgan projecting $30 to $40 billion in annual GPU-backed securitization by 2026 and 2027. The underlying collateral across all of these deals is essentially the same asset class: NVIDIA GPUs facing the same obsolescence curve on roughly the same timeline. Traditional securitization gets its safety from diversification across uncorrelated assets. Mortgage-backed securities worked (when they worked) because not every house in every market falls at once. GPU-backed securities have no such protection. When Blackwell Ultra made Hopper look slow, it made every H100-collateralized loan look worse simultaneously.

Counterparty concentration risk is separate but adjacent. The deals keep looping back through the same small set of players: Nvidia as manufacturer and LP, Apollo as arranger, Athene as ultimate debt holder, a handful of hyperscalers and neoclouds as lessees. When something goes wrong, it goes wrong for all of them at once. Nvidia’s LP stake means its balance sheet is exposed to the same collateral decline it theoretically offloaded by selling the GPUs.

Now watch how the shell game works in practice. Nvidia sells GPUs to VCI (technology obsolescence risk transferred to VCI). Apollo arranges financing against those GPUs (duration mismatch risk transferred to debt investors). That debt gets securitized and sold to Athene (concentration risk transferred to an insurance company’s investment portfolio). Athene backs annuity products with those assets (ultimate exposure lands with retirees holding fixed contractual claims, insulated by one layer of corporate solvency). At each step, the party selling the risk knows more about GPU economics than the party buying it.

What a realistic bad scenario looks like: a next-generation architecture ships 18 months into a 5-year lease and cuts inference costs by 4x (this is roughly what happened from H100 to B200 for certain workloads). The lessee’s revenue drops because their customers switch to cheaper inference elsewhere. They start struggling to make lease payments. The SPV that owns the GPUs triggers default provisions and tries to liquidate the collateral. Secondary market prices collapse as every operator with aging hardware tries to exit simultaneously (cross-default provisions in most data center loan agreements can amplify this into a cascade). Athene is sitting on a portfolio of securities backed by GPUs now worth a fraction of their collateralized value. The Federal Reserve Bank of Chicago has flagged this structure explicitly as a tail risk for banks exposed to AI infrastructure debt.

None of this has to happen all at once to be a problem. Even a partial version, one or two large lessees in distress while secondary GPU prices slide, would be enough to reprice the whole asset class and make the next round of AI infrastructure financing significantly more expensive. Which slows the build-out, which affects GPU demand, which affects Nvidia’s revenue, which affects the LP stakes Nvidia holds in the SPVs it helped capitalize. The loop is tight.

The bet they’re making is that AI demand grows fast enough, and stays strong enough over the lease term, that the collateral stays valuable. That’s a reasonable bet. It’s just not the same as the risk having gone away.

Putting It Together

A deployed GPU has several things working against it simultaneously. Rental revenue falls as newer hardware commoditizes its tier. Relative compute value falls as each generation delivers more tokens per dollar. Operating cost stays flat or rises while the efficiency of alternatives keeps improving. And the secondary market exit window closes as more operators reach the same conclusion at the same time.

xychart-beta
    title "H100: Revenue vs. Operating Cost as % of Revenue"
    x-axis ["Year 1", "Year 2", "Year 3", "Year 4", "Year 5"]
    y-axis "Index (Year 1 revenue = 100)" 0 --> 180
    line [100, 75, 50, 30, 18]
    line [40, 55, 85, 130, 175]

The first line is rental revenue per hour (consistent with the A100 trajectory). The second is operating cost as a percentage of that year’s revenue: low early when revenue is strong, rising sharply as rental rates fall while the power bill doesn’t. The crossover lands around year 3, which is why Michael Burry’s 2 to 3 year useful-life estimate is closer to economic reality than the 6-year schedules the hyperscalers are booking.
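There are two reasonable ways to read “crossover” on that chart, and a small sketch using only the plotted values shows they bracket year 3. The helper function is my own; linear interpolation between yearly points is an assumption about how to read the chart, not something the underlying data guarantees.

```python
# Uses only the two series plotted in the chart above.
revenue_index = [100, 75, 50, 30, 18]          # rental revenue, Year 1 = 100
opex_pct_of_revenue = [40, 55, 85, 130, 175]   # operating cost as % of that year's revenue

def first_crossing(series, threshold):
    """Return the (fractional, 1-based) year where `series` first rises above
    `threshold`, interpolating linearly between yearly points."""
    for i in range(1, len(series)):
        prev_gap = series[i - 1] - threshold[i - 1]
        gap = series[i] - threshold[i]
        if prev_gap <= 0 < gap:
            # Index i-1 is year i, so the crossing lies between year i and i+1.
            return i + (-prev_gap) / (gap - prev_gap)
    return None

# Reading 1: where the two plotted lines intersect.
lines_cross = first_crossing(opex_pct_of_revenue, revenue_index)
# Reading 2: where operating cost passes 100% of revenue (true break-even).
breakeven = first_crossing(opex_pct_of_revenue, [100] * len(opex_pct_of_revenue))

print(f"plotted lines cross at about year {lines_cross:.1f}")
print(f"operating cost exceeds revenue at about year {breakeven:.1f}")
```

Either way, the hardware stops paying its own power bill well before a six-year schedule runs out.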

When you see “$500 billion in AI infrastructure capex” in a headline, roughly 25 to 30% is the facility, 50 to 60% is GPU hardware, and the rest is networking, storage, and integration. The facility stays useful for 20+ years. The GPU hardware is the part on the accelerating obsolescence curve, and the creative financing layered on top of it doesn’t change that, it just determines who’s holding the bag when the curve bends hard.

There’s a legitimate counter-argument: the “value cascade” (training clusters today, inference clusters tomorrow, secondary market exit before the floor drops out) means the hardware earns across multiple tiers before it’s truly worthless. That’s real. The question is whether demand fills the capacity at each tier, and right now H100 rental rates suggest it doesn’t, at least not at the margins originally assumed. The SPV structures and 6-year depreciation schedules are, in part, a bet that the cascade works as advertised. Either the hyperscalers have demand visibility the rest of us don’t, or those structures are doing a lot of work to make the timeline look more forgiving than it is.

Probably both.

Building an SMB inference stack

Jared Watkins — Sat, 23 May 2026 00:00:00 +0000

Frontier API costs are fine when you’re experimenting. They get painful once you’re running a real workload. GPT-4o is $2.50 per million input tokens and $10 per million output. Claude Sonnet 4.6 is $3 in and $15 out. Claude Opus 4.7, Anthropic’s current flagship, runs $5 in and $25 out. Anthropic also recently shifted enterprise billing to usage-based consumption pricing on top of seat fees, which means those token costs show up as a line item more visibly than before. A 10-person team doing active AI use across document drafting, summarization, code review, and internal Q&A will generate somewhere around 100 million output tokens a month. Using GPT-4o that’s $1,000/month. Sonnet 4.6 is about $1,500/month and with Opus 4.7 it climbs to $2,500/month. It might not ruin you, but it’s a recurring bill that scales directly with adoption, and adoption tends to grow.

The math changes when you own the hardware. A $15,000 server running 24/7 at full throughput starts paying for itself in months at serious API volumes. You also get things you can’t buy from a frontier API: zero egress of sensitive documents, no data retention concerns, the ability to fine-tune on your own data, and a latency profile that’s not subject to someone else’s infrastructure decisions.
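A back-of-envelope version of that payback math, using the output-token prices and the 100 million tokens/month figure from above. It is deliberately crude: it ignores input tokens, power, and your own ops time, and it assumes a local model can actually replace the API for your workload, which is a big assumption.

```python
# Rough payback estimate: months of API spend needed to equal the server price.
# Prices and volumes are the ones quoted in the text; everything else is ignored.

OUTPUT_PRICE_PER_M = {          # USD per million output tokens
    "GPT-4o": 10.0,
    "Claude Sonnet 4.6": 15.0,
    "Claude Opus 4.7": 25.0,
}
MONTHLY_OUTPUT_TOKENS_M = 100   # millions of output tokens per month
SERVER_COST = 15_000            # USD, hardware only

def monthly_api_cost(price_per_m: float, tokens_m: float) -> float:
    return price_per_m * tokens_m

def breakeven_months(server_cost: float, monthly_cost: float) -> float:
    return server_cost / monthly_cost

for model, price in OUTPUT_PRICE_PER_M.items():
    cost = monthly_api_cost(price, MONTHLY_OUTPUT_TOKENS_M)
    months = breakeven_months(SERVER_COST, cost)
    print(f"{model:<18} ${cost:>7,.0f}/month  ->  payback in {months:.0f} months")
```

At this volume the payback window runs from about half a year (against Opus pricing) to a bit over a year (against GPT-4o). Double the token volume and those windows halve.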

This is for people who’ve already tinkered with local models and are thinking about the next step: a dedicated server, maybe a rack, maybe something you’d sell as a managed service to clients. I’m assuming you’ve read the hardware guide for individual developers or are already past that level.

The three moving parts

Before getting into hardware, it’s worth naming what you’re actually building. A production inference stack has three distinct layers.

The inference engine is the process that actually runs the model: loads weights into memory, handles batching, manages KV cache, produces tokens. vLLM, Ollama, llama.cpp, SGLang, TensorRT-LLM. These are not interchangeable and the differences matter at team scale.

The gateway/router sits in front of the inference engine and handles everything the engine doesn’t: authentication, per-user rate limiting, routing between models, cost tracking, fallback to cloud APIs when local is overwhelmed. LiteLLM is the main player here. This is the piece most first-time server builders skip and then regret.
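To make the fallback idea concrete, here is a toy sketch of the routing decision a gateway makes. This is not LiteLLM's API or configuration format; the `Backend` class, the names, and the capacity numbers are all hypothetical, and a real gateway also handles auth, retries, and cost tracking.

```python
# Toy illustration of priority routing with overflow to a cloud API.
# Hypothetical names and limits; not how any particular gateway is implemented.
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    max_in_flight: int      # concurrent requests this backend should accept
    in_flight: int = 0

    def has_capacity(self) -> bool:
        return self.in_flight < self.max_in_flight

def pick_backend(backends: list[Backend]) -> Backend:
    """Return the first backend, in priority order, that has spare capacity."""
    for backend in backends:
        if backend.has_capacity():
            return backend
    raise RuntimeError("all backends are saturated")

local = Backend("local-vllm", max_in_flight=2)
cloud = Backend("cloud-api", max_in_flight=100)

routed = []
for _ in range(4):                      # four requests arrive at once
    chosen = pick_backend([local, cloud])
    chosen.in_flight += 1               # request stays in flight for this demo
    routed.append(chosen.name)

print(routed)
```

The first two requests land on the local server and the overflow spills to the cloud backend, which is the behavior you want when local is overwhelmed.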

The hardware determines what models you can run and how fast. Get this wrong and the other two don’t matter.

Hardware tiers

A small Mac Studio cluster — the kind of thing that fits on a shelf and serves a 5 to 15 person team.

Tier 1: High-memory single server ($5K to $25K)

This is where most teams should start. One box, one power circuit, one set of cooling concerns, no networking complexity.

Apple Mac Studio M3 Ultra: The prior post covered these as developer machines. At server scale, the picture changes a bit. A Mac Studio M3 Ultra (96GB unified, 819 GB/s) running Ollama with an OpenAI-compatible API endpoint is a legitimate inference server for a 5 to 15 person team. You can run Llama 4 Scout comfortably, DeepSeek-R1 70B at Q8, or any 70B dense model at Q4 with room to spare. The 96GB ceiling means you won’t fit Maverick-class MoE models at usable quantization without heavy compression. The other limitation is concurrency: unified memory architecture means you’re sharing bandwidth across all active requests, and you don’t get the same parallel throughput as discrete NVIDIA GPUs. For teams doing mostly batch or sequential work (document processing, summarization pipelines), this is fine. For interactive multi-user chat at scale, you’ll hit the ceiling faster than the specs suggest.

The Mac Studio M3 Ultra starts at $3,999 with 96GB. Silent, efficient, zero driver pain, runs macOS (which is either a feature or a bug depending on your ops preferences). Apple Silicon supports vLLM-MLX now, which handles concurrency better than Ollama for team deployments.

NVIDIA workstation-class GPUs: The RTX PRO 6000 Blackwell (96GB VRAM, 1,792 GB/s bandwidth) at around $8,000 to $9,200 is the single-card ceiling for NVIDIA workstation hardware right now. 96GB gets you a 70B model at Q4 with throughput that makes sense for interactive use, typically 60 to 90 tokens/sec on Llama 3.1 70B depending on batch size and quantization. If you want 48GB, the L40S is the server-grade option (built for 24/7 operation, passive cooling, ECC memory) at around $8,000 to $12,000.

The CUDA ecosystem advantage is real. vLLM, TensorRT-LLM, SGLang all have first-class NVIDIA support. You get speculative decoding, PagedAttention, FlashAttention. The software story is significantly more mature than Apple or AMD.

AMD Instinct MI300X: The MI300X is genuinely interesting at this tier. 192GB of HBM3 at roughly 5.3 TB/s memory bandwidth in a single GPU. That’s more memory than two 80GB H100s combined and more bandwidth than any single H100 (about 3.35 TB/s on the SXM part). It’s enough capacity to run a 70B model at full precision without sharding, or a 405B model on one card at aggressive quantization (roughly 3-bit; 4-bit weights alone come to about 203GB). Enterprise pricing runs $10,000 to $15,000 per card. The catch is ROCm. AMD’s software stack has improved meaningfully but it still requires more configuration work than CUDA, and some frameworks have incomplete support. If you’re building on vLLM or Ollama with a model that has well-tested ROCm support (Llama, Qwen, Mistral), you’re mostly fine. If you’re doing something exotic, budget time for it.

Tier 1 comparison:

Hardware	Memory	Bandwidth	Approx cost	70B Q4 tok/s	TDP	tok/s/W	Best for
Mac Studio M3 Ultra	96GB unified	819 GB/s	~$4K to $5K	~30 to 45	~250W	~0.14	Low-concurrency, batch, macOS shop
RTX PRO 6000 Blackwell	96GB VRAM	1,792 GB/s	~$9K card	~60 to 90	300W	~0.25	Interactive team serving, CUDA stack
L40S	48GB VRAM	864 GB/s	~$10K card	~25 to 35 (split layers)	350W	~0.09	30B models, 24/7 server duty
AMD MI300X	192GB HBM3	5,300 GB/s	~$12K card	~120 to 150	750W	~0.17	High-throughput batch, large models

The MI300X throughput numbers look wild on paper. In practice you land closer to the lower end in real serving scenarios because bandwidth isn’t the only variable, but it’s still the fastest single-card option for large-model inference if you can get the ROCm stack working.

Why efficiency jumps between Tier 1 and Tier 2

The tok/s/W column in the Tier 1 table looks bad compared to what you’ll see at Tier 2, and the reason is worth understanding because it shapes every hardware decision at team scale.

A single GPU serving one user at a time is wasteful by design. On each forward pass through the model, the GPU reads all the weights from memory (tens of gigabytes) to produce a handful of tokens for a single request. Most of that memory bandwidth is spent moving weights, not doing useful work per watt. The GPU is underutilized.

Continuous batching changes the math entirely. Instead of serving one request per forward pass, vLLM packs multiple in-flight requests into the same pass. The weights get read once from memory and applied to dozens of concurrent requests simultaneously. Token output per joule scales almost linearly with batch size up to the point where VRAM fills up or the GPU becomes compute-bound. A single H100 running at batch=1 produces roughly 50 to 80 tokens/sec. The same H100 at batch=32 with FP8 and vLLM produces 1,800 to 2,000 tokens/sec. Same GPU, same power draw, roughly 25 to 40 times more tokens per watt.
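A first-order model of why batching scales this way: each decode step costs whichever is larger, the time to stream the weights from memory or the time to do the math for every request in the batch. The constants below are spec-sheet style assumptions (about 70GB of FP8 weights, 3.35 TB/s of HBM bandwidth, roughly 2 FLOPs per parameter per token, and an FP8 dense peak near 1,979 TFLOPS). The model ignores KV-cache reads and scheduler overhead, so treat the absolute numbers as rough and look at the shape.

```python
# Roofline-style sketch of decode throughput vs. batch size on one GPU.
# All constants are assumptions for illustration, not measurements.

WEIGHT_BYTES = 70e9          # ~70B parameters at 1 byte each (FP8)
MEM_BANDWIDTH = 3.35e12      # bytes/sec of memory bandwidth
PEAK_FLOPS = 1.979e15        # FP8 dense peak, FLOPs/sec
FLOPS_PER_TOKEN = 2 * 70e9   # ~2 FLOPs per parameter per generated token

def tokens_per_second(batch: int) -> float:
    """One decode step emits one token per request in the batch."""
    weight_read_time = WEIGHT_BYTES / MEM_BANDWIDTH        # paid once per step
    compute_time = batch * FLOPS_PER_TOKEN / PEAK_FLOPS    # grows with batch
    step_time = max(weight_read_time, compute_time)
    return batch / step_time

for batch in (1, 4, 16, 64, 256, 1024):
    print(f"batch={batch:>4}: ~{tokens_per_second(batch):>8,.0f} tok/s")

# Batch size at which compute time overtakes the weight read.
crossover_batch = (WEIGHT_BYTES / MEM_BANDWIDTH) / (FLOPS_PER_TOKEN / PEAK_FLOPS)
print(f"compute-bound beyond roughly batch={crossover_batch:.0f}")
```

Throughput rises linearly while the step is dominated by the weight read, then flattens once compute takes over. Power draw barely changes across that range, which is where the per-watt gain comes from.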

This is why the efficiency numbers in the comparison matrix jump from under 0.3 tok/s/W at Tier 1 to 1.0 to 2.9 tok/s/W at Tier 2. It’s not that the hardware is fundamentally different. It’s that Tier 2 deployments have enough concurrent users to actually saturate the batch. A single-user developer setup running Ollama is running at batch=1 almost all of the time. A 20-person team hitting an inference server through LiteLLM is generating enough concurrent requests to fill a batch continuously. That utilization difference, not the GPU specs, is what makes the per-watt numbers look so different between tiers.

The practical implication: if you’re choosing hardware for a team deployment, pick for the batch throughput ceiling, not single-request latency. And if you’re already running a Tier 1 box for a team, before you upgrade the hardware, check whether you’re actually running continuous batching. Switching from Ollama to vLLM on the same hardware can double or triple your effective throughput with no additional spend.

Tier 2: Multi-GPU server ($25K to $100K)

Two to four GPU configurations running vLLM with tensor parallelism. This is where 70B models get comfortable at real concurrency levels and 405B models become plausible.

2x to 4x L40S: Four L40S cards (48GB each, 192GB total) in a Supermicro or Dell server lands around $60,000 to $80,000 all-in. You’re running Llama 3.1 70B at Q4 with comfortable headroom, good concurrency via vLLM’s continuous batching, and a server that can handle 20 to 50 concurrent users without breaking a sweat. The L40S is still available new through CDW, ASA Computers, ServerSupply, and Viperatech at around $7,500 to $10,000 per card (don’t confuse it with the original L40, which is EOL). Probably the most cost-effective Tier 2 config for businesses that need reliable 70B inference without betting on newer silicon.

From a power efficiency standpoint, 4× L40S running Llama-2-70B FP8 with vLLM achieved 1,718 tokens/sec in batch (offline) mode and 1,469 tokens/sec in server mode in MLPerf Inference v4.1 results published by Red Hat. At 4× 350W TDP (1,400W total draw), that works out to roughly 1.2 tok/s/W at batch throughput and 1.0 tok/s/W under interactive load. These are measured results on a production workload, not marketing specs.

2x to 4x RTX PRO 6000 Blackwell Server Edition: The current-gen replacement for the L40S in the PCIe server card category. 96GB GDDR7 (double the L40S), 1.6 TB/s memory bandwidth (nearly double), fifth-gen Tensor Cores, FP4 support. Four cards give you 384GB total, enough to run Llama 3.1 70B at full FP16 with room left for large KV caches, which the L40S can’t do. Available now from Lenovo, AMAX, Hyperscalers, and on Amazon at $8,000 to $9,200 per card, putting a four-card server at $80,000 to $100,000 all-in. The catch is the 600W TDP per card: four cards draw 2,400W in GPU power alone versus 1,400W for four L40S cards, which starts to matter for colo power budgets. vLLM and SGLang both have Blackwell support. If you’re buying new hardware today and plan to run it for three to five years, the RTX PRO 6000 Server Edition is the better starting point.

2x H100 SXM: Two H100s (80GB each, 160GB total) run $80,000 to $120,000. Faster raw throughput than four L40S cards, better for latency-sensitive workloads. The H100 SXM variant matters here: SXM has NVLink interconnect between cards, which means tensor parallelism across them is fast. PCIe-connected GPUs have to go through the CPU interconnect and the bandwidth penalty is real.

The H100 SXM draws 700W each. At continuous batching with FP8 and vLLM, a single H100 achieves approximately 2,000 tokens/sec on Llama 3.1 70B, giving ~2.9 tok/s/W per card, roughly 2.4× more efficient per watt than 4× L40S on the same workload. The tradeoff is that each H100 costs significantly more than each L40S, so the L40S config wins on capital cost per token even if it loses on power cost per token. For colo deployments billing on power draw, that efficiency gap starts to matter at scale.
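The per-watt figures in the last few paragraphs are simple division, and it is worth seeing them unrounded. The inputs are the throughput and TDP numbers quoted above; the function name is mine. Note the denominator is GPU TDP only, not wall power.

```python
# Tokens per second per watt, using GPU TDP as the denominator.
# Throughput and TDP figures are the ones quoted in the text.

def tok_per_s_per_watt(tokens_per_s: float, watts: float) -> float:
    return tokens_per_s / watts

L40S_WATTS = 4 * 350    # four cards at 350W TDP
H100_WATTS = 700        # one H100 SXM

l40s_batch = tok_per_s_per_watt(1_718, L40S_WATTS)    # MLPerf offline mode
l40s_server = tok_per_s_per_watt(1_469, L40S_WATTS)   # MLPerf server mode
h100 = tok_per_s_per_watt(2_000, H100_WATTS)
h100_vs_l40s = h100 / l40s_batch

print(f"4x L40S, batch:  {l40s_batch:.2f} tok/s/W")
print(f"4x L40S, server: {l40s_server:.2f} tok/s/W")
print(f"1x H100 SXM:     {h100:.2f} tok/s/W")
print(f"H100 advantage over L40S batch: {h100_vs_l40s:.2f}x")
```

The unrounded ratio comes out slightly under the 2.4× you get from dividing the rounded figures, which is a reminder that these are planning numbers, not precise measurements.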

On self-build vs. branded systems: Lambda Labs, Supermicro, and Dell all sell rack-mount GPU servers in this tier. Self-building is possible but you’re taking on the support burden, and the savings over a Supermicro system are smaller than people expect once you factor in rail kits, power distribution, and the time to debug weird firmware interactions.

vLLM is the right inference engine at this scale. Its continuous batching, PagedAttention, and tensor parallel support across multiple GPUs are production-tested. SGLang edges it out on throughput for some workloads (roughly 29% higher on 7B to 8B models on H100, narrowing to 3 to 5% on 70B), and has better latency tails. Either works; vLLM has a larger community and more deployment examples.

Tier 3: Rack scale ($100K+)

Eight-GPU nodes, multiple nodes, InfiniBand networking. This is where you’re running 405B+ models comfortably, serving hundreds of concurrent users, or building a managed service for multiple clients.

An 8x H100 SXM server from Supermicro (the SYS-821GE-TNHR is the reference system) runs $200,000 to $320,000 depending on configuration. A rack with four of these nodes is $800K to roughly $1.3M in hardware. That’s before NDR InfiniBand switches, PDUs, and colocation, all of which cost more than most people budget.

On networking first: inter-node GPU communication is the thing that makes or breaks distributed inference. 1GbE is a non-starter. 10GbE is marginal. 25GbE is the floor, and 100G+ InfiniBand is what serious systems actually run. A pair of NDR 400G switches to connect four nodes adds roughly $200K to $300K to the hardware bill and another 1.5 kW to the power draw. vLLM handles pipeline and tensor parallelism across nodes, but it needs the bandwidth to go with it.

Power is where the math gets uncomfortable. The GPU TDP figure (700W × 8 = 5.6 kW) is not the system draw. NVIDIA’s own DGX H100 spec puts total system power at 10.2 kW, covering CPUs, NVLink switch fabric, memory, storage, and cooling. The Supermicro SYS-821GE-TNHR ships with eight 3,000W PSUs for a reason. Four nodes at 10.2 kW each is 40.8 kW in compute alone. Add the InfiniBand switches and overhead, and you’re at 43 to 45 kW of IT load per rack. At a typical data center PUE of 1.3 to 1.5, the facility draw is 56 to 68 kW per rack.

That is not a standard cabinet, and finding a facility that will take it is harder than it sounds. Most colo providers cap air-cooled racks at 15 to 25 kW; above that you’re in dedicated high-density space, often negotiating liquid cooling. Many Tier I operators (Equinix, Digital Realty, CoreSite) won’t touch a sub-100 kW deployment in 2026 given power queue backlogs. Expect to work with mid-tier or regional providers. Current US market rates for committed high-density power run $130 to $225 per kW per month, depending on market. At 44 kW committed, that’s $68K to $119K per year per rack in power alone, before cross-connects ($100 to $400/month each) and remote hands ($150 to $300/hour). Budget the actual number before signing anything.
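The rack math from the last two paragraphs fits in a few lines, which makes it easy to rerun with your own quotes. Node power, switch power, PUE range, and the $/kW rates are the figures from the text; the function names are mine, and the 43 to 45 kW IT load includes an overhead allowance on top of the 42.3 kW the nodes and switches add up to.

```python
# Rack power and colo cost, using the figures quoted in the text.

NODES = 4
NODE_KW = 10.2      # full system draw per 8-GPU node
SWITCH_KW = 1.5     # pair of InfiniBand switches

it_load_floor_kw = NODES * NODE_KW + SWITCH_KW   # before other overhead

def facility_kw(it_kw: float, pue: float) -> float:
    """Facility draw implied by an IT load at a given PUE."""
    return it_kw * pue

def annual_power_cost(committed_kw: float, usd_per_kw_month: float) -> float:
    """Yearly cost of committed power at a per-kW monthly rate."""
    return committed_kw * usd_per_kw_month * 12

print(f"nodes + switches: {it_load_floor_kw:.1f} kW")
print(f"facility draw:    {facility_kw(43, 1.3):.0f} to {facility_kw(45, 1.5):.0f} kW")
print(f"power cost/year:  ${annual_power_cost(44, 130):,.0f} "
      f"to ${annual_power_cost(44, 225):,.0f} at 44 kW committed")
```

Swap in the rate from an actual colo quote before trusting the total; the spread between markets is wide enough to change the decision.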

The Watt Counts benchmark paper (arXiv:2604.09048) makes the point cleanly: at rack scale, power capacity is the binding constraint, not GPU count. Every improvement in tokens/watt directly reduces facility footprint and operating cost. A fully utilized 8-GPU node with vLLM FP8 and continuous batching produces around 16,000 tokens/sec, which works out to roughly 2.9 tok/s/W against GPU TDP (5.6 kW) but only about 1.6 tok/s/W against the 10.2 kW full system draw. The system-draw figure is the useful planning number: divide your throughput requirement by it and you get the power budget you need to secure before you order hardware.

Who needs this? MSPs building multi-tenant AI platforms, enterprises with very high-volume document processing, anyone seriously selling inference as a product. At this scale, self-hosted tokens cost well under $1/M for 70B-class models, versus $10/M for GPT-4o, $15/M for Claude Sonnet 4.6, or $25/M for Opus 4.7. The economics work. The challenge is utilization: the infrastructure costs what it costs whether the GPUs are busy or not.

If you want to push further and fancy yourself a hyperscaler, the next frontier is megawatt-scale rack architectures that make a four-node H100 cabinet look quaint. I wrote about where that’s heading: Megawatt racks and what comes after.

Comparison matrix

Tier	Approx cost	Concurrent users	Max model size	Throughput (70B ref)	Efficiency (tok/s/W)	Example use case
Tier 1: Single server	$5K to $25K	5 to 20	70B to 405B (quant)	30 to 150 tok/s	0.09 to 0.25	Small business document processing; solo MSP onboarding first clients
Tier 2: Multi-GPU server	$25K to $100K	20 to 100	405B (quant), 70B comfortable	150 to 600 tok/s	1.0 to 2.9	Mid-size business AI platform; VAR serving 5 to 10 business clients
Tier 3: Rack scale	$100K to $1M+	100 to 500+	Full precision 405B+, multi-model	1,000+ tok/s	~1.6 per node (system draw)	Multi-tenant MSP with dozens of clients; high-volume batch processing at enterprise scale

The efficiency column reflects tok/s per watt on Llama 3.1 70B with FP8/Q4 quantization. Tier 1 numbers reflect single-GPU, single-request throughput; Tier 2 and Tier 3 numbers assume continuous batching under vLLM, which is where the step change comes from. Tier 1 and Tier 2 use GPU TDP as the denominator (single-card or multi-card draw). Tier 3 uses full system draw (10.2 kW per 8-GPU node per NVIDIA DGX H100 spec), which is the right number for facility planning. Sources: MLPerf Inference v4.1 L40S results (Red Hat), Spheron H100 tokens/watt analysis, Watt Counts benchmark (arXiv:2604.09048).

Clustering: connecting machines together

Everything above assumes a single machine with one or more GPUs in it. That’s the right starting point for most teams. But once you’ve maxed out what a single box can hold, or you want to run models larger than any single node can fit, clustering is the next question. The answer is significantly different depending on whether you’re on Apple Silicon or NVIDIA.

Mac clustering: exo and RDMA over Thunderbolt

The Apple path is genuinely interesting, partly because it wasn’t really viable until late 2025.

The software that makes it work is exo, an open-source clustering tool from Exo Labs. Exo handles automatic device discovery (no manual configuration), distributes model weights across nodes using tensor parallelism (not pipeline parallelism, which matters a lot, as I’ll explain below), and exposes a single OpenAI-compatible API endpoint across the whole cluster. It uses MLX as its inference backend, which means it’s Mac-only today. Linux support via GPU is still in development.

The network path matters enormously. Running two Mac Studios over 2.5GbE Ethernet works, but you’re leaving most of the benefit on the table. Thunderbolt 5 does roughly 50 to 60 Gbps real-world throughput between nodes. The bigger improvement came when macOS 26.2 added RDMA support over Thunderbolt 5, and exo 1.0 shipped day-zero support for it. RDMA drops inter-node memory access latency from around 300 microseconds down to under 50 microseconds. That latency gap is what separates “this technically works” from “this actually makes the model faster as you add nodes.”

To enable it, you boot each Mac into recovery mode, run rdma_ctl enable in Terminal, and reboot. That’s the whole setup on the Mac side. Exo handles the rest.

The hardware constraint to understand is Thunderbolt topology. Thunderbolt 5 switches don’t exist. Every Mac has to be directly cabled to every other Mac. Right now the practical ceiling is four nodes, in a full-mesh where each machine has a direct TB5 connection to all the others. You also can’t use the Thunderbolt 5 port adjacent to the Ethernet port on the Mac Studio back panel for RDMA, and all machines in the cluster need to run the exact same macOS version (including minor beta numbers) or the RDMA discovery breaks.
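The full-mesh requirement is what makes the cabling grow quickly. A quick sketch of the link count, with hypothetical host names; the formula is just every unordered pair of machines.

```python
# Full mesh: every machine is cabled directly to every other machine.
from itertools import combinations

def full_mesh_links(nodes):
    """Return every pair of nodes that needs its own direct cable."""
    return list(combinations(nodes, 2))

nodes = ["mac-1", "mac-2", "mac-3", "mac-4"]   # hypothetical host names
links = full_mesh_links(nodes)
ports_per_node = len(nodes) - 1                # one port per peer

print(f"{len(nodes)} nodes -> {len(links)} cables, {ports_per_node} ports used per machine")
for a, b in links:
    print(f"  {a} <-> {b}")
```

Four nodes need six cables and three usable ports on each machine. A fifth node would push that to ten cables and four ports per machine, which is where the port restriction mentioned above starts to bite.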

Jeff Geerling benchmarked a 4-node cluster of M3 Ultra Mac Studios on Apple loaner hardware (two at 512GB, two at 256GB; these configurations aren’t available to buy, and the current M3 Ultra tops out at 96GB). Running exo with RDMA, he got:

Model	Parameters	Active params	Cluster tok/s
Qwen3-235B (8-bit)	235B MoE	22B active	~32 tok/s
DeepSeek V3.1 (8-bit)	671B MoE	37B active	~15 tok/s
Kimi K2 Thinking (native 4-bit)	~1T MoE	32B active	~30 tok/s

Those throughput numbers are interesting for what they demonstrate about the architecture, but the memory configuration isn’t something you can actually buy. Four standard M3 Ultra Mac Studios at 96GB each give you 384GB total. That’s enough to run Qwen3-235B at Q4 comfortably, but DeepSeek V3.1 671B doesn’t fit at 8-bit (you’d need 4-bit or lower, with very little headroom), and Kimi K2 Thinking at native weights is out entirely. More practically: four M3 Ultras at $3,999 each is $16K in hardware before storage and networking cables, and you’re running a cluster that’s already at its four-node ceiling. The practical point is that exo with RDMA over Thunderbolt 5 scales up as you add nodes rather than staying flat or degrading. That’s the test of a real distributed inference implementation, and it passes. But the memory ceiling of what you can actually purchase matters more than the benchmark hardware.
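A quick way to sanity-check what fits: weights take roughly parameters × bits ÷ 8 bytes. The sketch below applies that to the three benchmark models against a 4 × 96GB cluster. It counts weights only, so KV cache, activations, and OS overhead all come on top, and the 1,000B figure for Kimi K2 is the rounded “~1T” from the table, not an exact count.

```python
# Weights-only memory estimate vs. a 4 x 96GB cluster.
# Ignores KV cache, activations, and OS overhead, so real headroom is smaller.

CLUSTER_GB = 4 * 96

def weight_gb(params_billions: float, bits: int) -> float:
    """Approximate weight footprint in GB at a given quantization width."""
    return params_billions * bits / 8

models = [
    ("Qwen3-235B", 235, 4),
    ("DeepSeek V3.1", 671, 8),
    ("DeepSeek V3.1", 671, 4),
    ("Kimi K2 Thinking", 1000, 4),   # "~1T" rounded; an approximation
]

for name, params_b, bits in models:
    size = weight_gb(params_b, bits)
    verdict = "fits" if size <= CLUSTER_GB else "does not fit"
    print(f"{name:<17} {bits}-bit: {size:>6.1f} GB  -> {verdict} in {CLUSTER_GB} GB")
```

DeepSeek at 4-bit technically squeezes in, but with under 50GB left across four machines for everything else, “fits” is generous.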

The alternative for Mac clustering without RDMA is llama.cpp’s RPC backend, which spreads model layers across nodes using pipeline parallelism. It works for fitting models that won’t fit on one machine, but throughput degrades as you add nodes because each layer’s output has to be transferred before the next node can start work. Exo’s tensor-parallel approach does more communication per step but computes in parallel rather than sequentially, which is why you see a speedup with more nodes instead of a slowdown.

What doesn’t work across Mac clusters: vLLM, SGLang, TensorRT-LLM. All NVIDIA CUDA. The Mac path is exo plus MLX, full stop.

NVIDIA clustering: NVLink and InfiniBand

The NVIDIA clustering story is better-documented and more mature, but the network requirements are unforgiving.

Within a single server, NVLink is the interconnect for SXM-form-factor GPUs (H100 SXM, H200 SXM). NVLink bandwidth between cards is 900 GB/s in the H100 generation. That’s fast enough to treat the cards as a single unified pool for tensor parallelism in vLLM. PCIe-connected GPUs (L40S, RTX PRO 6000) share bandwidth through the CPU interconnect instead, which cuts effective inter-GPU bandwidth to roughly 128 GB/s bidirectional (PCIe 5.0 x16). Tensor parallelism still works, but it’s slower, and you’ll see the difference in throughput on large models.

vLLM handles multi-GPU distribution via Megatron-LM tensor parallelism. For single-server multi-GPU, set --tensor-parallel-size to the number of GPUs and launch normally. For multi-node, you start a Ray cluster first:

# On the head node
ray start --head

# On each worker node
ray start --address=<head-node-ip>:6379

# Then launch vLLM on the head node, setting tensor-parallel-size to total GPU count
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B \
  --tensor-parallel-size 8   # 2 nodes × 4 GPUs each

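SGLang supports the same multi-node tensor parallelism, but it doesn’t go through Ray: you launch one server process per node and point them at each other with --nnodes, --node-rank, and --dist-init-addr.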

Multi-node networking is where this gets expensive. 10GbE between nodes is marginal. 25GbE is the floor for small-scale (2 to 4 node) deployments. Serious multi-node inference runs on InfiniBand: HDR (200 Gbps) or NDR (400 Gbps). The gap is large: running a 70B model across two H100 nodes over 10GbE instead of 100G InfiniBand can cut throughput by 30 to 50%, because the all-reduce in each forward pass bottlenecks on the network. If you’re spending $100K+ on GPU hardware and using a $200 switch for the interconnect, you’ve made a mistake.
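To get a feel for why the link speed matters so much, here’s a back-of-envelope sketch using the textbook ring all-reduce cost model. It’s a simplification, not a measurement: it treats each node as one participant, the per-step latencies and the 2,048-token prompt are my assumptions, and real stacks overlap communication with compute.

```python
def ring_allreduce_seconds(message_bytes: float, participants: int,
                           link_gbps: float, latency_s: float) -> float:
    """Textbook ring all-reduce cost: 2(n-1) steps, each moving 1/n of the message."""
    steps = 2 * (participants - 1)
    bytes_on_wire = steps / participants * message_bytes
    bytes_per_second = link_gbps * 1e9 / 8
    return steps * latency_s + bytes_on_wire / bytes_per_second


# Illustrative workload: one prefill pass on a 70B-class model.
HIDDEN_SIZE = 8192        # Llama 3.1 70B hidden dimension
LAYERS = 80               # Llama 3.1 70B transformer layers
TOKENS = 2048             # assumed prompt length
BYTES_PER_VALUE = 2       # fp16/bf16 activations
ALLREDUCES_PER_LAYER = 2  # one after attention, one after the MLP

message = TOKENS * HIDDEN_SIZE * BYTES_PER_VALUE

# name -> (link speed in Gbps, assumed per-step latency in seconds)
links = {
    "10GbE": (10, 50e-6),
    "25GbE": (25, 30e-6),
    "100G InfiniBand": (100, 2e-6),
}

for name, (gbps, latency) in links.items():
    per_op = ring_allreduce_seconds(message, 2, gbps, latency)
    total = per_op * LAYERS * ALLREDUCES_PER_LAYER
    print(f"{name}: ~{total:.2f} s of all-reduce per prefill pass")
```

The absolute numbers depend entirely on the assumptions, but the ratio between links is the point: the communication term scales inversely with bandwidth, and it gets paid 160 times per pass.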

The NVLink SXM advantage also shows up here: H100 SXM nodes connected via NVLink Switch fabric (the NVSwitch-based backplane in DGX systems) treat cross-node bandwidth nearly the same as intra-node bandwidth, which is why 8-GPU DGX nodes are so effective for large model inference.

Network requirements summary:

Setup	Minimum	Recommended	Notes
Mac cluster (exo)	Thunderbolt 5 direct cable	TB5 full mesh + RDMA	No switches; point-to-point only; max ~4 nodes
NVIDIA single-server (SXM)	NVLink (built-in)	NVLink with NVSwitch fabric	Already present in SXM systems
NVIDIA single-server (PCIe)	PCIe 5.0 x16	PCIe 5.0 x16	~128 GB/s bidirectional vs 900 GB/s NVLink
NVIDIA multi-node	25GbE	HDR/NDR InfiniBand (200 to 400 Gbps)	IB adds $20K to $100K+ per switch depending on port count

Dense models vs MoE: clustering impacts them very differently

This is something I haven’t seen explained clearly elsewhere, and it changes the calculus on whether clustering is worth it for a given model.

Dense models (Llama 3.1 70B, Qwen2.5-Coder 32B, and similar) read all their weights through memory on every forward pass. Every token generated requires touching all 70 billion parameters. This means memory bandwidth is the primary bottleneck, and adding more nodes only helps if the nodes can share bandwidth fast enough. With NVLink, this works well because the interconnect is wide enough. With anything slower, the cross-node transfer overhead starts eating into the gains. Tensor parallelism on a 2-node NVIDIA NVLink setup for a 70B dense model gives you roughly 1.7 to 1.9x throughput (not quite linear due to communication overhead). On PCIe multi-node at 25GbE, that scales down toward 1.2 to 1.4x at best. The communication tax is real.

MoE models are different in an important way. A model like Qwen3-235B has 235 billion total parameters but only activates 22 billion of them per forward pass (hence “Mixture of Experts”: each token is routed to a small subset of expert layers, not all of them). The 213 billion inactive parameters still have to be held in memory and loaded as-needed when their expert layer is selected, but they aren’t contributing to bandwidth usage on every single token.
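The dense-versus-MoE difference is easy to see with a crude bandwidth ceiling: if every active weight is read once per token, single-stream decode speed can’t exceed memory bandwidth divided by bytes read. This is a sketch under that assumption only. The 800 GB/s figure is a round number I picked for illustration, and it ignores KV cache reads, batching, and interconnect cost.

```python
def decode_ceiling_tok_s(active_params_billions: float, bits_per_param: float,
                         mem_bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream tok/s if all active weights are read once per token."""
    gb_read_per_token = active_params_billions * bits_per_param / 8
    return mem_bandwidth_gb_s / gb_read_per_token


BANDWIDTH_GB_S = 800  # assumed round figure for one machine's memory bandwidth

dense_70b = decode_ceiling_tok_s(70, 8, BANDWIDTH_GB_S)  # dense: all 70B read per token
moe_235b = decode_ceiling_tok_s(22, 8, BANDWIDTH_GB_S)   # MoE: only 22B active per token

print(f"70B dense, 8-bit:       ~{dense_70b:.0f} tok/s ceiling")
print(f"235B MoE (22B active):  ~{moe_235b:.0f} tok/s ceiling")
```

The MoE model is more than three times larger in total, yet its per-token ceiling is roughly three times higher, because the ceiling tracks active parameters rather than total ones.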

The practical implication for clustering is that MoE models get two distinct benefits from adding nodes: more total VRAM to hold the full weight set without heavy quantization, and the ability to run models that wouldn’t fit on one machine at all. The communication overhead of tensor parallelism also tends to be lower for MoE compared to dense models at the same parameter count, because you’re routing fewer active parameters across nodes per pass.

This explains why the Mac cluster benchmarks above were MoE models, and why that’s also what makes sense on purchasable hardware. With four M3 Ultras at 96GB each (384GB total), Qwen3-235B MoE fits cleanly at Q4 and runs well across nodes because the inter-node traffic is mostly routing decisions and activations, not the full 235B weight matrix. You’d never attempt Llama 3.1 405B (dense) across a Mac cluster at useful quality because moving all 405B parameters through memory on every token would saturate the Thunderbolt 5 links almost immediately.

The rough practical guidance: if you’re clustering primarily to fit a model that’s too large for one machine, MoE is your friend and clustering works well. If you’re clustering to get more throughput on a model that already fits, dense models need very fast interconnects (NVLink or InfiniBand) to see meaningful gains.

Query routing

Every non-trivial inference setup needs a router in front of the model server. The reasons pile up fast: you need to handle different models for different task types (you’re not running Whisper through vLLM), you want fallback to cloud APIs when local is overwhelmed, you need authentication and per-user rate limiting, and you want to track what’s being spent where.

LiteLLM is the right answer for most setups. MIT-licensed, actively maintained, exposes a single OpenAI-compatible API endpoint that routes to 100+ backends including local vLLM/Ollama instances and cloud APIs. You configure routing rules, it handles the rest. Your application code never needs to know whether a request is going to local hardware or a cloud API.

Portkey is the polished hosted alternative. Better observability UI out of the box, commercial support, governance and guardrails built in. Worth considering if you’re building a managed service and don’t want to operate the gateway infrastructure yourself.

RouteLLM from the LMSYS team (the Chatbot Arena people) is academically interesting: it trains classifiers on human preference data to predict which model will produce better output for a given prompt, then routes based on that prediction. In practice it’s more useful for routing between different-quality versions of the same task than for task-type routing. Research-grade, not production infrastructure.

The three routing strategies that actually matter for a business inference stack:

Capability-based routing: Different endpoint for each task type. Whisper endpoint for audio, vision model for image queries, general LLM for text. The client specifies the task type in the model parameter, LiteLLM routes accordingly. Simple, explicit, reliable.

Cost-based routing: Route to local first, fall back to cloud if the queue depth exceeds a threshold. Requires you to monitor local queue depth and expose it to the router, but LiteLLM supports this.

Load-based fallback: Under normal conditions, everything runs local. When local inference is overloaded (say, a batch job is saturating the GPU), interactive requests fall back to cloud APIs. Ensures interactive users don’t notice the batch job.
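Here’s roughly what those strategies look like as decision logic. This is a plain-Python sketch of the idea, not LiteLLM configuration: the endpoint names, the `Request` type, and the queue-depth threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    task: str          # "audio", "vision", or "text"
    interactive: bool  # True for chat-style traffic, False for batch jobs


# Hypothetical endpoint names, one per task type (capability-based routing).
LOCAL_ENDPOINTS = {
    "audio": "local/whisper",
    "vision": "local/vision",
    "text": "local/llm",
}
# Hypothetical cloud fallback; only text has one in this sketch.
CLOUD_FALLBACK = {"text": "cloud/llm"}
MAX_QUEUE_DEPTH = 8  # assumed threshold for "local is overloaded"


def route(req: Request, queue_depth: int) -> str:
    """Pick an endpoint: local by task type, cloud only for overloaded interactive traffic."""
    if req.task not in LOCAL_ENDPOINTS:
        raise ValueError(f"unknown task type: {req.task}")
    overloaded = queue_depth > MAX_QUEUE_DEPTH
    if overloaded and req.interactive and req.task in CLOUD_FALLBACK:
        return CLOUD_FALLBACK[req.task]
    return LOCAL_ENDPOINTS[req.task]


print(route(Request("audio", interactive=True), queue_depth=0))
print(route(Request("text", interactive=True), queue_depth=20))
print(route(Request("text", interactive=False), queue_depth=20))
```

Batch requests stay local even when the queue is deep, which keeps the cloud bill tied to interactive traffic only.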

Usage tracking and billing

This is where “self-hosted” gets complicated if you’re reselling.

LiteLLM’s proxy has built-in per-user and per-team spend tracking. It automatically tracks token counts per API key, associates keys with users or teams, and exposes a /user/daily/activity endpoint with spend breakdowns by model, date, and API key. For internal chargeback within a company, this is sufficient out of the box.

For billing external clients, you still need to build the last mile. LiteLLM gives you token counts and can export to Prometheus. What it doesn’t give you: invoice generation, payment processing, client-facing dashboards, or any concept of your billing rate per token. You need to build or buy that layer.

A workable reseller billing pipeline: LiteLLM for token attribution (per client API key), Prometheus for metrics export, your billing system for rate application and invoice generation. If you’re just getting started with a handful of clients, simpler still: query the LiteLLM API monthly, pull per-key token counts, apply your markup in a spreadsheet. Unglamorous. Effective.
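The spreadsheet version is small enough to sketch in a few lines. This assumes you’ve already pulled per-key token counts into a dict; the client names, token counts, and the $2 rate are made up, and nothing here calls LiteLLM.

```python
def invoice_lines(tokens_by_key: dict[str, int],
                  rate_per_million: float) -> list[tuple[str, int, float]]:
    """Turn per-key token counts into (key, tokens, amount_usd) rows, sorted by key."""
    rows = []
    for key, tokens in sorted(tokens_by_key.items()):
        amount = round(tokens / 1_000_000 * rate_per_million, 2)
        rows.append((key, tokens, amount))
    return rows


# Made-up monthly totals, keyed by client API key alias.
monthly_tokens = {"client-a": 42_500_000, "client-b": 7_250_000}

for key, tokens, amount in invoice_lines(monthly_tokens, rate_per_million=2.00):
    print(f"{key}: {tokens:,} tokens -> ${amount:,.2f}")
```

Everything past this (tax, payment collection, a client-facing view) is the part you still have to build or buy.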

Worth naming plainly: there is no off-the-shelf managed inference billing platform for self-hosted models the way Stripe is for payments. You’re assembling pieces.

Models by task type

Not all tasks need the same model. Running a 70B model for tasks that a 7B handles fine wastes memory and throughput. Here’s what’s actually worth running at each tier:

Task	Recommended model(s)	Min tier	At scale	Notes
Document summarization (long)	Qwen3 72B, Llama 3.1 70B	Tier 1 (96GB VRAM)	Tier 2 for 20+ concurrent users	Qwen3 and Llama 3.1 70B both have 128K context. Either is excellent.
RAG / document Q&A	Qwen3 30B-A3B, Llama 3.1 8B	Tier 1 (32GB VRAM)	Tier 1 scales well; retrieval quality matters more than model size	8B is often fine. Invest in the retrieval pipeline before upgrading the model.
Voice to text	Whisper large-v3-turbo	8GB VRAM (any tier, secondary card)	Tier 2+ for high-volume transcription alongside LLM workloads	25 to 30x real time on GPU. Keep it on its own card, it competes for KV cache.
Image / vision tasks	Qwen2-VL 72B, Llama 3.2 Vision 90B	Tier 1 (80GB+ VRAM)	Tier 2 for concurrent vision + text serving	Vision models are larger than text equivalents for the same quality bar.
PDF extraction / OCR	Tesseract pipeline or Qwen2-VL	Tier 1 (16GB VRAM)	Tier 2 for high-volume batch	Tesseract still wins for clean scans. VLMs add value for complex layouts.
Document writing / editing	Qwen3 30B-A3B, Llama 3.1 8B	Tier 1 (16GB VRAM)	Tier 1 handles most business volumes	8B with good prompt engineering often beats 70B with lazy prompts.
Code / software development	Qwen2.5-Coder 32B	Tier 1 (32GB VRAM)	Tier 2 for team-wide interactive use	Matches GPT-4o on HumanEval, outperforms it on SWE-bench for practical agentic coding.
General reasoning / complex tasks	DeepSeek-R1 70B, Qwen3 235B-A22B	Tier 1 (96GB+)	Tier 2 for 235B MoE at real concurrency; Tier 3 for multi-tenant serving	DeepSeek-R1 for structured reasoning. Qwen3 235B-A22B MoE is the quality ceiling for open weights.
Multi-tenant AI platform (mixed workloads)	Multiple specialized models	Tier 2	Tier 3 for dozens of clients or 100+ concurrent users	At this scale you’re running separate endpoints per task type behind a LiteLLM router.

A few things worth flagging. Put Whisper on a separate GPU from your main LLM inference stack if possible: it’s a different architecture, uses memory differently, and you don’t want audio transcription jobs competing with document processing for KV cache. If your server has two cards, Whisper goes on one and the LLM stack on the other.

For RAG workloads, the model size matters less than people expect. A 7B or 8B model with well-structured context and good retrieval usually beats a 70B model with poor retrieval. Invest in the retrieval pipeline before upgrading the model.

7B models are genuinely good enough for a surprising range of business tasks: document classification, entity extraction, sentiment analysis, first-pass summarization, simple Q&A. Don’t run a 70B model for jobs where a 7B does the work. The throughput difference is enormous. You’ll serve 5 to 10x more concurrent requests at the same latency.

Latency and where you host it

Two latency numbers matter: time to first token (TTFT) and tokens per second after that.

TTFT is the latency a user actually feels. The typing indicator appears, then they wait. 200ms is the standard threshold for “responsive”. Below 200ms feels interactive. 500ms is noticeable. 1,000ms+ feels like a loading state, not a chat interface. For interactive use cases (chat, document Q&A, code suggestions), TTFT is the number to optimize.

Tokens per second is what matters for throughput once generation starts. For reading text in a UI, anything above 30 tok/s feels fast. For background batch processing, throughput matters and TTFT doesn’t.

These two metrics trade off against each other, which is annoying. Continuous batching (what vLLM does) increases throughput dramatically by serving multiple requests in parallel, but it can increase TTFT because a new request has to wait for the current batch to make progress before it gets scheduled. At low concurrency this isn’t a problem. At 50 concurrent users, you need to tune the batching parameters or TTFT will degrade.

“Hosted nearby but not on-prem” is a real option and I think it’s underrated. A GPU server in a colocation facility 30ms of round-trip latency away adds up to roughly 80ms to 230ms of TTFT for a typical prompt (30ms network + 50 to 200ms prefill depending on prompt length), before any queueing under load. That’s acceptable for most use cases. On-prem eliminates the network component but adds physical management burden. For most small businesses and MSPs, colocation is the right tradeoff: you own the hardware and the data never leaves your control, but someone else handles the facility.

On storage: model weights are big. Llama 3.1 70B at Q4 is about 40GB. Qwen3 235B-A22B at Q4 is around 130GB. If you’re running multiple models, you want NVMe, not spinning disk. Not for performance during inference (weights are loaded into GPU memory, not streamed from disk on every token), but for reasonable load times when switching models. An NVMe array with 2TB to 4TB is the right starting point for a multi-model server.
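The load-time argument is simple division. A quick sketch, using the model sizes from above and sequential read speeds that are my assumptions (a conservative NVMe drive and a typical hard disk):

```python
MODEL_GB = {"Llama 3.1 70B Q4": 40, "Qwen3 235B-A22B Q4": 130}  # sizes from the text
READ_GB_S = {"NVMe": 3.0, "spinning disk": 0.2}                 # assumed sequential read speeds


def load_seconds(size_gb: float, read_gb_s: float) -> float:
    """Best-case time to read a model's weights off disk."""
    return size_gb / read_gb_s


for model, size in MODEL_GB.items():
    for disk, speed in READ_GB_S.items():
        print(f"{model} from {disk}: ~{load_seconds(size, speed):.0f} s")
```

Under those assumptions, swapping in the larger model takes well under a minute from NVMe and more than ten minutes from a hard disk.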

Reference configurations

Small: ~$8K to $10K total

Mac Studio M3 Ultra ($3,999 base, ~$4,500 to $5,500 configured) plus $3,000 to $4,500 for NVMe storage, networking, and UPS. Ollama or vLLM-MLX for inference, LiteLLM proxy in front, Prometheus for metrics. Run Qwen3 30B-A3B for most tasks, Llama 3.1 8B for high-volume lightweight work, Whisper large-v3-turbo for transcription.

Capacity: 5 to 15 concurrent users doing document processing and summarization. Not the right setup for more than a few concurrent heavy users.

At $10/M output tokens from a frontier API and around 500M tokens/month, you're spending $5,000/month on API costs. A $10K server pays for itself in 2 months. At lower volumes the math is tighter.
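If you want to run that payback math against your own volume, here’s a minimal sketch. The function names are mine, and the optional monthly operating cost defaults to zero, which flatters the result: power, colo, and your own time are real.

```python
def monthly_api_cost(tokens_millions: float, price_per_million: float) -> float:
    """What the same token volume would cost on a metered API."""
    return tokens_millions * price_per_million


def payback_months(hardware_cost: float, monthly_api_spend: float,
                   monthly_opex: float = 0.0) -> float:
    """Months until avoided API spend covers the hardware, net of running costs."""
    savings = monthly_api_spend - monthly_opex
    if savings <= 0:
        raise ValueError("no payback: running costs meet or exceed API spend")
    return hardware_cost / savings


api_spend = monthly_api_cost(tokens_millions=500, price_per_million=10)
print(f"API spend: ${api_spend:,.0f}/month")
print(f"Payback on $10K: {payback_months(10_000, api_spend):.1f} months")
```

Plug in a lower volume and the payback stretches quickly, which is the “math is tighter” part.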

Medium: ~$60K to $100K total

4x L40S (48GB each, 192GB total VRAM) in a Supermicro or Dell server, plus networking, NVMe storage, colocation first year. All-in around $60,000 to $80,000. If you’re buying new today, 4x RTX PRO 6000 Blackwell Server Edition is worth serious consideration: 384GB total at full precision, better throughput headroom, similar server chassis, all-in around $80,000 to $100,000. vLLM with tensor parallel across the four cards, LiteLLM proxy with per-user tracking, Grafana/Prometheus for visibility. Llama 3.1 70B at Q8 as the main model, Qwen2.5-Coder 32B on a separate endpoint, Whisper on its own card.

Capacity: 20 to 50 concurrent users. Production-grade serving for a mid-size business or a small MSP with a handful of clients. 150 to 300 tok/s aggregate throughput on 70B.

At 2B output tokens/month (not extraordinary for a 30-person team using AI across document workflows) and $10/M or more, frontier API costs run $20,000+/month. The $60,000 server investment pays back in 3 months. At 500M tokens/month you’re looking at 12 to 18 months payback, which is still reasonable for a 3 to 5 year hardware lifecycle.

Large: ~$500K total

Two 8x H100 SXM nodes ($400K to $640K in hardware depending on configuration), NDR InfiniBand switch, NVMe storage array, and colocation for year one. Hardware alone pushes this well past the $250K number that circulates in smaller-scale discussions. Colo for two nodes at ~22 kW committed runs $35K to $60K/year at current US market rates, before cross-connects and remote hands. Call it $500K all-in for a realistic first year, more if you’re in a high-cost market like Northern Virginia or Silicon Valley. vLLM with pipeline and tensor parallel across nodes, LiteLLM as the unified gateway, custom billing middleware feeding your invoicing system.

Capacity: 100+ concurrent users. Multi-tenant MSP with multiple business clients. Enough headroom to run batch jobs in parallel with interactive serving.

Cost per million tokens below $0.05 at good utilization. The economics are compelling if you’re billing clients even $1 to $2 per million tokens. At Claude Sonnet 4.6 pricing ($15/M output) and 5B output tokens/month, you’re spending $75K/month on APIs. A $500K infrastructure investment pays back in under a year at that volume. The challenge isn’t the economics. It’s maintaining the utilization that makes the math work.

What I’d actually build

The medium configuration is the most interesting to me from a business standpoint. It’s the tier that hits a useful intersection: a 20 to 50-person business or MSP serving a handful of clients, generating 1 to 2 billion output tokens per month across those clients, and spending enough on frontier APIs that the $60K to $80K hardware investment pays back in under six months. Four L40S cards draw about 1.4 kW under GPU load, well inside what any standard colo accepts without a conversation about liquid cooling or dedicated high-density space. You get real serving capacity without the facility negotiation headaches that come with Tier 3. Four L40S cards in a Supermicro chassis, vLLM, LiteLLM, and a thin billing layer. It’s a legitimate business you can run out of a half-rack.

The hard part isn’t the hardware or the model selection. It’s the operational layer: monitoring inference server health, managing model updates without dropping requests, building enough around LiteLLM to actually send invoices. Those are real engineering problems that take real time. If you’re a solo operator, budget for that before assuming the hardware cost is the whole story.

But the unit economics work, and they’re getting better. Open-source models in 2026 are genuinely competitive with frontier APIs for most business tasks. The gap has closed enough that “we just use OpenAI” is increasingly a choice about operational simplicity rather than quality. That’s worth knowing, even if you decide the tradeoff isn’t worth it for your situation.

Let the (AI) Bodies Hit the Floor

Jared Watkins — Thu, 07 May 2026 00:00:00 +0000

In 2001, an estimated 95% of all the fiber optic cable in the ground was dark. Telecom companies poured over $500 billion into it on the thesis that internet traffic was doubling every hundred days (it wasn’t, but everyone believed it was), and WorldCom, Global Crossing, and Williams Communications took on enormous debt to lay fiber across oceans and under highways. It took nearly 20 years for traffic to grow into that capacity. Several of those companies went bankrupt. The fiber itself was fine, sitting patiently in the ground, waiting for demand to catch up.

The GPU version of this is already happening. And unlike fiber, GPUs don’t wait.

The buildout that can’t be built

The numbers are staggering on paper. Microsoft, Amazon, Google, and Meta have committed roughly $725 billion in AI-related capital expenditure for 2026 alone, up 77% from the prior year. The announced pipeline of US datacenter capacity for 2026 is somewhere around 12 to 16 gigawatts. That’s a lot of zeros.

Here’s the problem: only about a third of it has broken ground. Close to half of planned US datacenter builds for 2026 have been delayed or canceled, according to Sightline Climate. Not because demand evaporated (the checks are being written), but because the physical world has hard constraints that press releases don’t.

The poster child is Stargate, the $500 billion joint venture between OpenAI, Oracle, and SoftBank that Trump personally announced in January 2025. More than a year later, the JV hasn’t hired staff and isn’t actively developing datacenters. The planned 600 MW expansion at the Abilene, Texas campus was canceled after negotiations broke down. Satellite imagery of the original 1,200-acre site shows six plots cleared, one with actual development. Oracle pushed delivery schedules for several large OpenAI facilities from 2027 to 2028, blaming labor and materials shortages.

But the construction delays aren’t even the most interesting problem with Stargate. The money isn’t real. SoftBank, the supposed primary financial backer, could only assemble 10% equity funding. The rest was going to come from debt. Their first $10 billion tranche was borrowed from Mizuho and other Japanese lenders. OpenAI tried to build its own datacenters but couldn’t get financing because lenders weren’t willing to back billion-dollar construction projects from a company losing $14 billion a year with no clear path to profitability. SoftBank eventually conditioned its investment on OpenAI restructuring into a public benefit corporation; otherwise the commitment would drop to $10 billion. The partners spent months arguing about control structure instead of breaking ground.

And then there’s the circular financing, which is the part that should make anyone who remembers the telecom bubble really nervous. NVIDIA invested in OpenAI. OpenAI uses that money to buy NVIDIA chips. Oracle committed to spending $40 billion on NVIDIA GPUs to power Stargate's Abilene facility. OpenAI signed a $300 billion deal to buy Oracle cloud capacity. So NVIDIA funds OpenAI, who pays Oracle, who pays NVIDIA. The money goes in a circle. Bloomberg published a whole investigation into these arrangements, calling them what they are: AI circular deals where Microsoft, OpenAI, and NVIDIA keep paying each other.

This is vendor financing with extra steps. If you drew it on a whiteboard, a first-year business student would circle it in red. Somewhere in Santa Clara and Redmond, very smart people are nodding at this chart and calling it a partnership ecosystem. In the late 90s, telecom equipment makers like Nortel and Lucent lent money to their customers so those customers could buy their products. It inflated demand numbers beautifully right up until the loans went bad and the whole thing collapsed. The AI version is more sophisticated (it’s structured as equity investments, cloud commitments, and partnership agreements rather than simple loans), but the economic logic is identical. The demand looks enormous on paper because the same dollars are being counted multiple times as they circulate between a handful of companies. When actual external revenue has to support the structure instead of recycled internal capital, the math stops working.

Stargate isn’t an outlier. It’s the pattern.

To be clear, the fact that these companies are willing to spend this kind of money isn’t irrational. The underlying demand signal for AI compute is real and growing fast. The problem isn’t the bet. It’s the mismatch between the speed of the financial commitments and the speed of the physical world.

Everything bottlenecks at once

The constraint isn’t any single thing. It’s everything, simultaneously.

Large power transformers now have lead times of 128 to 144 weeks. That’s two and a half to nearly three years. Prices are up 77% since 2019, and Wood Mackenzie projects a 30% deficit in power transformer availability for 2026. These aren’t exotic components. They’re the things that connect a datacenter to the electrical grid. Without them, nothing turns on.

HBM (the specialized memory that goes into AI accelerators) has demand growing 80 to 100% annually against supply growing 50 to 60%. Only three companies on earth make it. NVIDIA’s Blackwell GPUs are sold out through mid-2026 with a massive backlog, and the company reportedly cut consumer RTX 50-series production significantly because the same memory capacity feeds HBM production lines. Datacenters will consume 70% of all memory chips produced worldwide in 2026.
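Those growth rates compound, which is what makes the gap uncomfortable. Here’s a small sketch of how the demand-to-supply ratio drifts, assuming (my assumption, purely for illustration) that the two start out in balance.

```python
def demand_supply_ratio(demand_growth: float, supply_growth: float, years: int) -> float:
    """Demand divided by supply after `years`, starting from a ratio of 1.0."""
    return ((1 + demand_growth) / (1 + supply_growth)) ** years


# Narrowest and widest gaps implied by the ranges in the text.
scenarios = {
    "narrow gap (80% demand vs 60% supply)": (0.80, 0.60),
    "wide gap (100% demand vs 50% supply)": (1.00, 0.50),
}

for label, (demand, supply) in scenarios.items():
    ratios = [demand_supply_ratio(demand, supply, y) for y in (1, 2, 3)]
    formatted = ", ".join(f"{r:.2f}x" for r in ratios)
    print(f"{label}: years 1 to 3 -> {formatted}")
```

Even the narrow case leaves demand outrunning supply by about an eighth every year, and the shortfall stacks.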

Copper is at roughly $5.60 a pound and hit $6 earlier this year. A datacenter needs about 27 tonnes per megawatt of capacity. The same copper is being fought over by the renewable energy buildout that’s supposed to power these same facilities.
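For a sense of scale, here’s the copper arithmetic applied to the 12 to 16 GW announced pipeline mentioned earlier. It’s a rough sketch: it assumes metric tonnes, uses the spot price above as if it applied to fabricated cable and busbar (it doesn’t), and pretends the whole pipeline actually gets built.

```python
LB_PER_TONNE = 2204.62          # pounds in a metric tonne
COPPER_TONNES_PER_MW = 27       # figure from the text
PRICE_USD_PER_LB = 5.60         # spot price from the text


def copper_tonnes(capacity_mw: float) -> float:
    """Copper needed for a given datacenter capacity."""
    return capacity_mw * COPPER_TONNES_PER_MW


def copper_cost_usd(capacity_mw: float) -> float:
    """Raw metal cost at the spot price, ignoring fabrication and installation."""
    return copper_tonnes(capacity_mw) * LB_PER_TONNE * PRICE_USD_PER_LB


for gw in (12, 16):
    mw = gw * 1000
    print(f"{gw} GW: {copper_tonnes(mw):,.0f} tonnes, "
          f"~${copper_cost_usd(mw) / 1e9:.1f}B of raw copper")
```

That works out to a few hundred thousand tonnes for one year of one country’s announced builds, before the grid upgrades needed to feed them.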

And then there’s the grid interconnection queue, which is the real binding constraint. You can have the land, the permits, the chips, and the money, but if you can’t get power to the site, you have a very expensive, very well-permitted patch of dirt.

None of these constraints are permanent. Transformer manufacturing is scaling (Hitachi Energy, Siemens, and others are expanding capacity). HBM production is ramping. But “scaling” and “ramping” operate on industrial timelines. Tech leadership hasn’t had to face that level of reality in recent decades.

Behind-the-meter won’t save you

The industry’s answer to the power bottleneck has been behind-the-meter generation: bring your own power plant, skip the grid entirely. It’s a smart instinct, and it’s the kind of creative problem-solving that eventually does work in infrastructure buildouts. But the near-term reality is messier than the pitch decks suggest.

Most BTM deals are centered on natural gas, with some nuclear restarts and fuel cell projects in the mix. AEP and Bloom Energy announced a 1 GW fuel cell deal (the largest utility-scale fuel cell procurement in US history). It hasn’t delivered yet. Some of the announced deals read more like science fiction (space-based solar, small modular reactor designs that don’t exist yet) with delivery timelines in the 2028 to 2030 range. That’s not giving you near-term relief. Anything more than 3 years out feels more like hope than a plan.

The fuel cell angle deserves its own reality check. Bloom Energy and others talk about fuel cells as clean, flexible BTM power, and the technology is real. But hydrogen isn’t an energy source. It’s an energy carrier. Unlike natural gas, which comes out of the ground ready to burn, hydrogen has to be manufactured first, usually by electrolysis (running electricity through water) or steam methane reforming (which uses natural gas anyway, so what’s the point). The round-trip energy efficiency of producing hydrogen, compressing it, and running it through a fuel cell is roughly 40%. You’re losing 60% of the energy you started with before a single GPU turns on. (There is a version of this where the clean energy answer is just a very complicated way to burn natural gas less efficiently, and some of these announcements are basically that.) And that’s before you deal with the storage and transport problems: hydrogen is the smallest molecule there is, it leaks through containment walls and pipe joints that are perfectly tight for other gases, it causes embrittlement in conventional steel (gradually weakening the metal until it cracks), it needs to be stored at extremely high pressures or cryogenic temperatures, and it’s explosive across a wide range of concentrations in air. There is no industrial-scale hydrogen supply chain today, and building one is a decades-long infrastructure project unto itself. Fuel cells running on natural gas are more practical, but at that point you’ve built a less efficient gas turbine with extra steps.
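A round-trip figure like that comes from multiplying the losses at each stage. The stage efficiencies below are illustrative values I picked to show the mechanics, not measurements of any particular system; swap in your own.

```python
from math import prod

# Assumed, illustrative stage efficiencies (fraction of energy that survives each step).
STAGES = {
    "electrolysis": 0.70,
    "compression and storage": 0.90,
    "fuel cell": 0.60,
}

round_trip = prod(STAGES.values())

for stage, efficiency in STAGES.items():
    print(f"{stage}: {efficiency:.0%} retained")
print(f"round trip: {round_trip:.0%} retained, {1 - round_trip:.0%} lost")
```

No single stage looks terrible on its own. The problem is that the losses multiply, so three respectable numbers compound into losing most of what you started with.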

Even the relatively conventional BTM projects (gas turbines, which are the most deployable option) face the same transformer and switchgear bottlenecks as grid-connected builds. Gas turbines produce AC at medium voltage. The next-gen AI racks they’re supposed to power run on 400V to 800V DC. That means you still need the full power conversion chain between the generator and the rack: step-down transformers, rectifiers, and DC distribution infrastructure, all of which use the same components that are backordered for years. BTM doesn’t eliminate the supply chain. It just changes who you’re buying from. And the 800VDC ecosystem that NVIDIA’s latest architectures require won’t even be commercially available until the second half of 2026.

It also doesn’t eliminate the community opposition problem, and may actually make it worse. Nobody loves having a datacenter next door, but a datacenter with its own gas-fired power plant raises environmental and permitting issues that a grid-connected facility doesn’t. There are now 188 local opposition groups across 40 states. Over 300 state datacenter bills were filed in just the first six weeks of 2026. Maine enacted the first state-level moratorium on large datacenters. Virginia (home to 643 facilities) has a proposed moratorium halting all new applications until July 2028. Georgia’s Senate is considering a one-year ban. There’s even a federal moratorium bill now. While it’s not expected to pass, the sentiment against these projects is growing.

Behind-the-meter power isn’t big enough and won’t come online fast enough to rescue the near-term pipeline. And the local opposition is only getting louder.

Dark GPUs are worse than dark fiber

Here’s where the analogy breaks down in a way that makes the current situation more dangerous to the economy than the dark fiber of the past.

Dark fiber sits in the ground. It doesn’t rot. It doesn’t become obsolete. It costs almost nothing to maintain once installed. The fiber laid in 1998 was still perfectly usable in 2015 when traffic finally grew into it. Patience was rewarded, even if the investors who funded the original buildout went bankrupt waiting.

GPUs don’t work that way. A GPU that can’t be powered or deployed today isn’t going to sit on a shelf and be useful in three years. AI accelerator generations move on 18 to 24 month cycles. NVIDIA’s Blackwell is already being succeeded by Rubin. The H100s that were the hottest commodity in 2023 are already being displaced. A chip produced today that can’t be put to work has a shelf life measured in months before it’s obsolete, not decades.

And this isn’t hypothetical. It’s already happening. Microsoft’s Satya Nadella has said publicly that power, not compute, is their biggest datacenter constraint, and that Microsoft has AI GPUs “sitting in inventory” because it lacks the power to install them. In Santa Clara (literally minutes from NVIDIA’s headquarters), two freshly built datacenters, Digital Realty’s SJC37 and Stack Infrastructure’s SVY02 campus, are standing empty because the local utility can’t supply the electricity. They may sit empty for years. In Northern Virginia, the largest datacenter market in the country, connection delays are running multiple years as utilities struggle to reinforce high-voltage infrastructure. Regions in the Pacific Northwest and the Southeast are reporting wait times of two to five years for new power capacity.

So the dark GPUs aren’t a future risk. They exist right now. NVIDIA keeps shipping chips. The datacenters keep getting built (or half-built). And the power to run them isn’t there. Every GPU that sits in a warehouse or in a powered-down rack is depreciating toward obsolescence while the next generation rolls off the fab line.

The capital destruction isn’t deferred. It’s immediate. And it’s compounding. Unlike fiber that sat dark but held its value, a GPU that misses its deployment window doesn’t get a second chance. By the time the power arrives, the chip is last-generation and worth a fraction of what was paid for it.

This pricing reality probably isn’t fully baked into the market yet, because the worst of the delivery failures are still 12 to 18 months out. The commitments have been made, the purchase orders are in, but the physical constraints haven’t fully collided with the financial expectations. When they do, someone is going to be holding a lot of very expensive, very obsolete silicon.

The subsidy cliff

There’s a demand-side problem too, and it’s related.

OpenAI is projected to lose $14 billion in 2026 despite hitting $20 billion in annualized revenue and having 900 million weekly ChatGPT users. Ninety-five percent of those users don’t pay. The company’s cumulative losses between 2023 and 2029 are projected at an eye-watering $115 billion (with a B), with profitability not expected until 2029 or 2030. That’s a lot of subsidized usage.

Anthropic is doing better on unit economics (they reportedly hit $30 billion ARR while spending a quarter of what OpenAI spends on training), but the broader pattern holds across the industry: AI services are being sold below cost to build market share, and the bills are coming due. The era of $20-a-month plans that cost the provider $100-plus to serve is ending as these companies approach IPOs and investor patience thins.

When prices rise to cover actual costs, how much of current usage survives? A 2025 MIT study found that 95% of enterprise AI pilot programs failed to deliver measurable financial returns. Only 6% of organizations qualify as “AI high performers” (generating 5% or more EBIT impact) per McKinsey. Now, to be fair, those numbers deserve some nuance. Enterprise adoption of anything is historically slow, and a lot of those early pilots were running older models with teams that were still figuring out how to use them. It’s not clear how much of that 95% failure rate reflects genuine limitations of the technology versus enterprises being bad at adoption (which they almost always are with new tools) versus a measurement problem where the ROI is real but shows up in places the study wasn’t looking. The tools have gotten dramatically better even in the last twelve months, and the organizations that started early are probably seeing compounding returns that newer adopters haven’t caught up to yet. But even granting all of that, the gap between the infrastructure investment and the demonstrable enterprise revenue is enormous, and it’s the revenue that has to justify the buildout.

Sequoia Capital’s David Cahn laid out the math starkly: take NVIDIA’s GPU revenue, double it for total datacenter costs, double again for the margins end users need to justify the spending, and you get a $600 billion annual revenue requirement from AI services. The actual revenue is a fraction of that. That gap tripled in twelve months.
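Cahn’s arithmetic is simple enough to write down. The sketch below works backward from the $600 billion figure; the implied GPU revenue it prints is derived from the two doublings described above, not a number quoted in this post, and the function names are just illustrative.

```python
def required_ai_revenue(gpu_revenue_bn: float,
                        datacenter_multiplier: float = 2.0,
                        margin_multiplier: float = 2.0) -> float:
    """Annual AI-services revenue (in $bn) needed to justify a given GPU spend."""
    # First doubling: GPUs are assumed to be about half of total datacenter cost.
    total_datacenter_cost_bn = gpu_revenue_bn * datacenter_multiplier
    # Second doubling: end users need margin on top of what they pay for compute.
    return total_datacenter_cost_bn * margin_multiplier


def implied_gpu_revenue(required_revenue_bn: float) -> float:
    """Invert the two doublings to recover the GPU revenue input."""
    return required_revenue_bn / (2.0 * 2.0)


gpu_bn = implied_gpu_revenue(600.0)
print(f"Implied annual GPU revenue: ${gpu_bn:.0f}bn")
print(f"Required AI services revenue: ${required_ai_revenue(gpu_bn):.0f}bn")
```

Both multipliers are rules of thumb, so the output is only as good as those two assumptions. Change either one and the revenue requirement moves proportionally.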

The balance sheet problem

The hyperscalers funding this buildout are entering unfamiliar financial territory.

Historically, these companies spent about 40% of their operating cash flow on capital expenditure. In 2026, that number approaches 100%. Google’s free cash flow is projected to drop roughly 90%, from $73 billion to around $8 billion. Amazon is expected to go free-cash-flow negative (Morgan Stanley projects negative $17 billion, Bank of America projects negative $28 billion). Microsoft’s free cash flow drops an estimated 28%. Meta has $237 billion in non-cancelable contractual commitments.

These companies have always been valued partly on their enormous free cash flow generation. They didn’t need to borrow. They self-funded everything. That’s changing. Bank of America forecasts hyperscaler debt issuance will hit $175 billion in 2026, more than six times the annual average of the prior five years.

When tech companies that were valued like capital-light software businesses start borrowing like capital-intensive industrial companies, the market tends to re-rate them accordingly. Software companies trade at 25 to 35x earnings. Heavy industrials and utilities trade at 12 to 18x. If investors start pricing hyperscalers like the infrastructure-heavy companies they’re becoming, the multiple compression alone wipes out trillions in market cap before a single revenue target is missed.
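To get a feel for the size of that re-rating, here is a minimal sketch that holds earnings flat and changes only the multiple. The flat-earnings assumption and the 30x and 15x midpoints are mine, chosen purely for illustration; the multiple ranges are the ones quoted above.

```python
def price_change_from_rerating(old_multiple: float, new_multiple: float) -> float:
    """Fractional price change if earnings stay flat and only the P/E moves."""
    return new_multiple / old_multiple - 1.0


software_range = (25.0, 35.0)     # earnings multiples quoted for software
industrial_range = (12.0, 18.0)   # multiples quoted for industrials and utilities

# Mildest case: bottom of the software range to top of the industrial range.
mildest = price_change_from_rerating(software_range[0], industrial_range[1])
# Harshest case: top of the software range to bottom of the industrial range.
harshest = price_change_from_rerating(software_range[1], industrial_range[0])
# Midpoint to midpoint (30x to 15x), an illustrative assumption.
midpoint = price_change_from_rerating(30.0, 15.0)

for label, change in [("mildest", mildest), ("midpoint", midpoint), ("harshest", harshest)]:
    print(f"{label}: {change:.0%}")
```

Under those assumptions the price drop lands somewhere between roughly a quarter and two-thirds, with no change in the underlying business at all.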

That said, these companies aren’t utilities. They still have the advertising, cloud, and commerce businesses that generated the cash flow in the first place. The AI capex is an overlay on businesses that remain enormously profitable. The repricing risk is real, but it assumes the market ignores the base business entirely, which is the kind of overcorrection that creates buying opportunities as often as it creates crises.

There’s a workforce trap buried in that repricing. These companies have built their compensation structures around stock. At Google, Meta, and Microsoft, stock-based compensation is a huge chunk of total pay, especially for the engineering talent they can’t afford to lose during a buildout this complex. That worked beautifully when share prices climbed every year. But when multiples compress, stock comp stops being a retention tool and starts being a source of attrition. The employees who can leave, will. And the companies can’t just replace stock comp with cash, because they’ve spent all their cash (and then some) on datacenters and GPUs. You end up in a situation where the people you need most to execute the buildout are the ones most likely to walk, right when you have the least financial flexibility to keep them.

And here’s why that matters beyond tech investors: AI-related stocks now represent roughly 45% of S&P 500 market cap. Forty-one AI-linked stocks (about 8% of index constituents) account for 47% of the total index value and contributed 74% of the index’s gains since ChatGPT launched. The S&P 500 isn’t a broad market index anymore. It’s a leveraged bet on AI monetization.

AI-linked investment-grade debt has climbed to $1.4 trillion, representing 15% of the US credit market. If the revenue doesn’t materialize on schedule, the correction doesn’t stay in tech. It cascades through every retirement account, every index fund, every pension plan that’s passively allocated to the S&P 500.

Crowding out the reindustrialization

Here’s a dimension of this that I think is badly underappreciated: the AI buildout isn’t happening in a vacuum. It’s happening at the same time the US is trying to reshore semiconductor fabrication, battery manufacturing, and a whole range of industrial capacity that’s been offshore for decades.

The numbers on that reshoring push are enormous. Since 2020, over $630 billion has been invested across 140 semiconductor projects alone, creating roughly 500,000 jobs in 28 states. TSMC is building a $100 billion campus in Arizona. Micron announced $200 billion across Idaho, New York, and Virginia. Total manufacturing construction spending hit $234 billion annually by mid-2024, up 217% from 2019. The IRA, CHIPS Act, and IIJA together authorized over $2 trillion in federal funding. This is the most ambitious industrial policy the US has attempted in generations.

And it needs the same stuff the AI buildout needs. The same transformers, the same switchgear, the same copper, the same grid interconnection capacity, the same skilled electricians, the same construction labor. A single TSMC fab phase requires around 200 megawatts of power. Multiply that across dozens of fabs, battery plants, and related industrial facilities, and you’re talking about gigawatts of new industrial demand competing with the datacenter pipeline for grid capacity that doesn’t exist yet.

The grid interconnection queue now exceeds 2,100 gigawatts, which is more than the total installed capacity of the US grid. Everything is in line: datacenter projects, semiconductor fabs, battery plants, solar farms, wind farms. The queue itself has become the bottleneck, and the datacenter buildout is the 800-pound gorilla in that line.

The labor competition is just as bad. The datacenter construction industry faces a projected shortfall of up to 500,000 workers. TSMC’s Arizona fab was delayed six months largely because of skilled labor shortages, with Intel and other fab builders competing for the same pool of certified electricians and mechanical specialists. Construction unemployment hit a record low of 3.2% in August 2025. There’s no reserve workforce to absorb simultaneous megaproject buildouts across multiple industries.

What this means in practice is that every datacenter project that outbids an industrial project for transformers, power capacity, or construction crews is directly slowing down the reshoring effort. And the datacenter operators have deeper pockets. Hyperscalers can pay whatever it takes for a transformer allocation or a power interconnection because they’re spending hundreds of billions this year. A midsized semiconductor equipment supplier or battery plant builder can’t compete with that.

Data for Progress polling in early 2026 found that more than two-thirds of voters support new manufacturing, housing, and clean energy projects in their communities. Support for new datacenter development sits at 48%. People want the factories. They’re less sure about the datacenters. And the datacenters are eating the supply chain alive.

If the AI buildout stalls and the capital turns out to have been misallocated, the damage isn’t limited to tech company balance sheets. It will have crowded out and delayed the industrial projects that were supposed to reduce American dependence on foreign supply chains. That’s a strategic cost that goes well beyond stock prices.

The employment squeeze

And all of this is happening alongside a labor market disruption that’s already underway and accelerating.

Companies are cutting jobs in anticipation of AI’s impact, not because AI has actually proven it can replace those jobs. Harvard Business Review reported in January 2026 that firms are laying off workers based on AI’s potential, not its demonstrated performance. Fifty-five thousand job cuts were directly attributed to AI in 2025, with another 32,000 in the first two months of 2026. One in six employers expects AI to reduce headcount this year.

The irony is thick: the same AI that isn’t generating enough revenue to justify its infrastructure costs is already being used as justification for layoffs. Gartner predicts that by 2027, half of the companies that attributed headcount reductions to AI will rehire for similar functions under different titles, having overestimated what AI could actually do. But that’s cold comfort to the people being let go now.

Looking further out, the World Economic Forum projects 23% structural labor market churn through 2027, with a net loss of 14 million jobs globally. The introduction of capable, affordable robotics in the three-to-six-year timeframe will sharpen this considerably, extending AI displacement from knowledge work into physical labor.

Where this all lands

What makes this moment different from a normal correction is that everything is converging at once. The physical buildout is stalling (transformers, power, chips, community opposition). The demand is softer than projected (subsidies ending, enterprise ROI still unproven at scale). The financial engineering is reaching its limits (free cash flow consumed, debt replacing equity, circular financing inflating demand numbers). The stock market is concentrated enough that a tech repricing ripples through every index fund and pension plan in the country. The labor market is absorbing AI-driven cuts based on hype rather than demonstrated capability. And the whole thing is competing for resources with the reindustrialization effort that’s supposed to reduce American dependence on foreign supply chains.

Let me try to put a timeline on how this plays out.

Late 2026 through mid-2027 is when the first wave of delivery failures becomes undeniable. The datacenter projects announced in 2024 and 2025 with 18 to 24 month timelines start missing their dates in volume. GPUs pile up in warehouses and unpowered facilities. The gap between announced capacity and operational capacity widens visibly. Hyperscaler earnings calls start featuring uncomfortable questions about returns on AI capex. The token factory companies (OpenAI, Anthropic, and the rest) face real pressure to raise prices as investor patience wears thin, and usage numbers start revealing how much of current demand was price-sensitive.

2027 through 2028 is when the financial consequences arrive. If hyperscaler free cash flow stays near zero or negative for multiple quarters, the market will reprice these companies. A shift from software-company multiples (25 to 35x) toward industrial multiples (12 to 18x) on companies that represent 45% of the S&P 500 would be a multi-trillion dollar repricing event. Stock-based compensation loses its pull, talent starts moving, and the companies can’t replace it with cash they don’t have. Credit markets tighten on AI-linked debt (currently $1.4 trillion). Meanwhile, the reindustrialization pipeline is two to three years behind schedule because the datacenters ate the transformers, the construction labor, and the grid interconnection capacity.

2028 through 2030 is where the employment picture gets sharp. By then, the AI tools will have matured enough (and the robotics will have arrived in early commercial form) that the job displacement moves from anticipatory layoffs to structural replacement. The companies doing the replacing will be under financial pressure themselves, creating a strange dynamic where firms cut headcount to save money on labor while simultaneously spending more on AI infrastructure that isn’t paying for itself yet.

The aggregate impact? If even half of this plays out on the timeline I’m sketching, you’re looking at a meaningful drag on US GDP. Not a recession caused by AI directly, but a combination of suppressed capital investment returns, stock market wealth destruction concentrated in the indices that most Americans are exposed to through retirement accounts, a delayed reindustrialization that leaves supply chain vulnerabilities unaddressed, and labor market disruption hitting both white-collar and (eventually) blue-collar workers simultaneously. The 2000 to 2002 tech correction erased about $5 trillion in market value and contributed to a mild recession. The current AI exposure is larger in both absolute and relative terms.

I’ve tried to be fair to the counterarguments throughout, because they’re real. The engineering talent being thrown at these constraints is world-class, the financial incentives to solve them are enormous, and the history of technology consistently embarrasses people who bet against human ingenuity on long enough timescales. I don’t think this is a bubble in the sense that the underlying technology is fake. I think it’s a buildout that’s outrunning its own supply chain and revenue base, and the correction when those things catch up will be painful.

The question isn’t whether AI will be transformative. I think it will. The question is whether the timeline of the buildout matches the timeline of the revenue, and what happens to the US economy during the gap between the two. The dark fiber era took twenty years to resolve. The companies that laid the fiber went bankrupt, but the fiber itself eventually became the backbone of the modern internet. With dark GPUs, the hardware won’t wait. The chips depreciate, the architectures move on, and the capital is gone. If there’s a resolution, it has to come faster than twenty years, because the assets don’t have twenty years in them. And I don’t think anyone knows the answer yet.

Predictions vs. Reality

The original post above doesn’t get edited as evidence comes in. This section does. New entries go here as the predictions play out.

Update: May 2026

I wrote that the token factory companies would face real pressure to raise prices as investor patience wears thin, and that usage numbers would start revealing how much demand was price-sensitive. I put that in the late 2026 to mid-2027 window. It’s showing up now, and it’s coming from the demand side first.

Microsoft canceled most of its internal Claude Code licenses in mid-May. The affected engineers, in the Experiences and Devices division, have until June 30 to migrate to GitHub Copilot CLI. The reason isn’t capability. It’s that agentic coding tools burn tokens at rates that are orders of magnitude above single-query LLM use, and the per-engineer monthly bills were running $500 to $2,000. Multiply that across a division at Microsoft scale and it stops penciling out.

Sit with that for a second. Microsoft. The company that bet its cloud franchise on Copilot, that’s directly invested in OpenAI, that’s been the loudest enterprise AI evangelist in the industry. Pulling back on AI tool licenses inside its own engineering org. If the economics don’t hold for the company selling this stuff, the pitch to enterprise customers has a credibility problem.

Uber’s story is sharper. The CTO disclosed that Uber had exhausted its entire 2026 AI coding tools budget by April. The whole year’s budget. In four months. And there’s a detail in there that’s worse than the number: Uber said token consumption didn’t appear to correlate reliably with useful product output. So they weren’t even getting value proportional to what they were spending. They were just burning tokens.

This is the subsidy cliff, but from the buyer side. The issue here isn’t providers pricing too low (though that’s also true). It’s enterprises discovering that agentic workflows cost dramatically more than anyone forecast back when “token spend” was still an abstract line in a planning doc rather than a real invoice.

It’s not just these two. A Mavvrik survey found 85% of companies missed their AI cost forecasts by more than 10%, and 84% saw gross margins drop more than six points. One healthcare company consumed a trillion tokens over six months and racked up more than $6 million in unplanned costs before anyone in finance figured out what was causing the spike. Amazon had an internal “tokenmaxxing” problem where employees were inflating AI usage metrics by using tools for unnecessary tasks (which is a very human response to being measured on a proxy metric, and also completely useless to the business).

Goldman Sachs put out a forecast that agentic AI could drive a 24-fold increase in token demand by 2030. That’s the bull case for the infrastructure buildout. It’s also a problem for near-term revenue if enterprises are already pulling licenses at current token consumption levels. The demand expansion Goldman is forecasting assumes someone will pay for all those tokens. The evidence from May 2026 is that enterprises are not enthusiastic about that.

On the ROI side, the numbers haven’t improved much since I cited the 2025 MIT study. McKinsey’s 2026 survey puts the failure rate at 73% of AI projects not delivering intended business value. IBM finds only 25% of AI initiatives delivering expected ROI. Morgan Stanley says only 21% of S&P 500 companies can point to any measurable AI benefit at all. Token prices have fallen roughly 280-fold over two years, and total enterprise AI spending rose 320% in the same period. Cheaper tokens don’t reduce the bill. They expand consumption until something structural pushes back. Right now that structural pushback is a CFO asking why the engineering department blew its annual AI budget before Q2 ended.
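Those two figures imply a consumption number that’s worth computing. This is a back-of-the-envelope sketch that assumes all of the spending went to tokens at a single blended price, which is a simplification of mine (real spend also covers seats, platform fees, and a changing mix of models).

```python
def implied_consumption_growth(price_drop_fold: float,
                               spend_increase_pct: float) -> float:
    """Multiple by which token volume must have grown.

    spend = price * volume, so volume growth = spend growth / price change.
    A 280-fold price drop means the new price is 1/280 of the old one.
    """
    spend_factor = 1.0 + spend_increase_pct / 100.0   # +320% is a 4.2x factor
    price_factor = 1.0 / price_drop_fold
    return spend_factor / price_factor


growth = implied_consumption_growth(price_drop_fold=280.0, spend_increase_pct=320.0)
print(f"Implied token consumption growth: about {growth:,.0f}x")
```

On those assumptions, consumption grew by roughly three orders of magnitude while the unit price fell by a bit more than two. That is the shape of the bill the CFOs are looking at.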

I said the subsidy cliff would become visible in late 2026. It’s visible now. The first people going over it are enterprise buyers who got a real invoice, which is a problem for OpenAI and Anthropic specifically, since enterprise contracts (not the $20/month consumer plans that cost more to serve than they charge) are supposed to be the revenue base that eventually makes the unit economics work.

Megawatt Compute Racks!

Jared Watkins — Mon, 27 Apr 2026 00:00:00 +0000

I’ve designed a lot of racks. Most of them land somewhere between 10 and 15 kW. That’s the enterprise baseline, the design point that most of the world’s installed data center capacity is built around. Standard power distribution, off-the-shelf PDUs, hot aisle/cold aisle airflow. The physics are solved, the playbook is 30 years old, and nothing about it is surprising.

The racks going into AI facilities right now are a different species entirely. The ones being installed today are in the 80 to 100 kW range. The ones coming next are over a megawatt. Each step breaks assumptions from the one before it.

Glossary — acronyms and jargon used in this post

AC / DC — Alternating current / direct current. AC is what comes from the wall; DC is what processors actually run on. Every server has a power supply that converts AC to DC internally. The efficiency push in modern datacenters is about doing that conversion once, at high voltage, rather than repeatedly at lower voltages inside each server.

Aisle containment (hot aisle / cold aisle) — A layout convention where racks face alternating directions so that cold air intakes face a “cold aisle” and hot exhaust faces a “hot aisle.” Containment means physically enclosing one or both aisles with panels and doors to prevent cold and hot air from mixing, which makes cooling far more efficient.

AllReduce — A collective communication operation used in distributed GPU training where each GPU sends its gradient updates to all others and receives theirs back simultaneously. It’s the most bandwidth-intensive operation in large model training, and the reason interconnect bandwidth between GPUs is as important as raw compute.

Ampacity — The maximum current a conductor (wire, bus bar) can carry continuously without overheating. Higher ampacity requires either thicker conductors or higher voltage to carry the same power.

Blind-mate connector — A connector designed to make contact automatically as a component slides into position, without manual alignment or plugging. Used in high-density datacenter systems so a server tray makes both electrical and liquid cooling connections in a single insertion motion.

Bus bar — A solid copper or aluminum conductor that distributes power through a rack or row. Higher ampacity than wire bundles; used in datacenter power distribution because it handles high currents more efficiently than discrete cables.

Busway — An overhead or underfloor power distribution track (think: a giant extension cord rail) that runs the length of a server row and provides tap-off points for each rack. Replaces individual conduit runs at high rack densities where per-rack wiring becomes impractical.

CDU (Coolant Distribution Unit) — The rack-level or row-level appliance that circulates chilled water through a liquid-cooled system. It typically includes pumps, a heat exchanger that connects to the building’s facility water loop, and flow controls. Think of it as the “radiator unit” for a liquid-cooled rack.

CFM (Cubic Feet per Minute) — A measure of airflow volume. Used to quantify how much air needs to move through a rack for air cooling. At 100 kW densities, the CFM requirements become loud, physically challenging, and expensive.

Cold plate — A metal block (usually copper) bolted directly onto a GPU or CPU that has internal channels carrying coolant. Transfers heat from the chip directly into the liquid rather than into the surrounding air.

CRAC (Computer Room Air Conditioner) — The dedicated precision air conditioning units used in datacenters. Unlike home AC, they’re designed for high-sensible-heat loads (mostly heat, little humidity control) and run continuously. At 100 kW rack densities, you typically need one every few rows rather than around the perimeter.

DLC (Direct Liquid Cooling) — A cooling approach where liquid-carrying cold plates are attached directly to heat-generating components (GPUs, CPUs). The heat goes straight into the coolant rather than first into the air. Required at megawatt densities where air physically cannot carry enough heat.

GaN (Gallium Nitride) — A wide-bandgap semiconductor used in high-frequency power conversion. More efficient than silicon at high switching speeds; used in DC-DC conversion stages in datacenter power supplies and increasingly in consumer chargers.

GOES (Grain-Oriented Electrical Steel) — A specialty steel used in transformer cores. The grain alignment is optimized to reduce magnetic losses. There’s limited global production capacity, and both AI datacenter buildout and renewable energy interconnection are competing for it.

HVDC (High-Voltage DC) — A power distribution approach that distributes DC power at high voltage (400V or 800V, up from the 48V bus used at rack level in earlier designs) through a datacenter rather than distributing AC and converting it at each server. Eliminates conversion stages and reduces energy losses.

IGBT (Insulated Gate Bipolar Transistor) — A power semiconductor switch used in UPS systems, solar inverters, and motor drives. Being progressively replaced by SiC in high-performance applications due to SiC’s better efficiency at high voltages and temperatures.

LPM (Liters per Minute) — The flow rate of coolant through a liquid cooling loop. At 1.2 LPM/kW (the industry rule of thumb for direct liquid cooling), an 85 kW rack requires around 102 LPM.

NVLink — NVIDIA’s proprietary high-bandwidth interconnect for GPU-to-GPU communication within a rack or system. Much faster than PCIe or Ethernet; allows multiple GPUs to act as a single unified compute resource.

OCP (Open Compute Project) — A Meta-founded industry consortium that publishes open hardware specifications for datacenter equipment: racks, power distribution, servers, networking. ORV3 (Open Rack Version 3) is their current rack standard; it defines bus bar voltage, connector specs, and physical dimensions.

ODM (Original Design Manufacturer) — Companies like Wiwynn, Quanta, and Supermicro that design and manufacture servers sold under another brand or sold directly to hyperscalers. The AI rack market is largely built on ODM hardware.

OpEx — Operating expenditure; ongoing costs like power, cooling, and staffing. Contrasted with CapEx (capital expenditure), which is the upfront cost of building or buying infrastructure.

PDU (Power Distribution Unit) — The rack-level power strip, essentially, but engineered for datacenter loads. Provides individual branch circuits to each server with metering and protection. At 100 kW densities they’re large, heavy, and custom-spec’d rather than off-the-shelf.

PUE (Power Usage Effectiveness) — The ratio of total facility power to IT equipment power. A PUE of 1.0 is theoretically perfect (all power goes to compute). 1.2 means 20% overhead for cooling and lighting. Lower is better; modern liquid-cooled facilities approach 1.1 to 1.2.

Raised floor — A floor system with removable tiles sitting above the structural slab, creating a plenum underneath for cabling and air distribution. Standard in enterprise datacenters; the underfloor space distributes cold air up through perforated tiles to server inlets.

RDHx / Rear-Door Heat Exchanger — A heat exchanger that replaces the rear door of a standard rack and captures heat from the rack’s exhaust airflow by running chilled water through a finned coil. A hybrid approach: rack fans still run, but the liquid loop captures a large portion of the heat before it reaches the room.

SiC (Silicon Carbide) — A wide-bandgap semiconductor with better high-voltage and high-temperature performance than silicon. Used in EV traction inverters, solar inverters, and increasingly in datacenter power conversion. The same 1,200V SiC MOSFET goes into both 800V EV drivetrains and 800VDC datacenter rectifiers.

Switchgear — High-voltage electrical equipment that controls, protects, and isolates power distribution circuits. The large metal cabinets you see at the entry point of a facility’s electrical system. Lead times for datacenter-grade switchgear have extended significantly as AI buildout demand accelerates.

UPS (Uninterruptible Power Supply) — A battery-backed power system that provides continuous power during utility outages or fluctuations. Datacenter UPS systems use a “double-conversion” topology where all power passes through the battery inverter continuously, giving true zero-transfer-time protection at the cost of some efficiency.

42U — A rack size designation where “U” is a rack unit (1.75 inches). A 42U rack is 73.5 inches tall, the most common standard height. Higher-density AI racks may run 48U or more.

A standard data center rack is roughly 42U to 48U tall (about 7 feet), 24 inches wide, and 36 to 48 inches deep. Call it 18 to 24 cubic feet of usable volume. That physical envelope is the constant across everything that follows. A 10 kW enterprise rack, a 100 kW AI rack, and a 1 MW hyperscale system all occupy roughly the same floor footprint. The density jump is what changes everything else.

The baseline: what 10–15 kW per rack looks like

The CERN data centre: rows of enclosed racks with hot/cold aisle containment, CRAC units along the back wall, raised floor, overhead cable trays. Classic enterprise design, refined over 30 years. Photo: Florian Hirzinger, CC BY-SA 3.0

A typical rack in this range has 20 to 40 servers, each pulling 300 to 500 watts at load. Three-phase AC, maybe 208V or 480V. PDUs are catalog items. Your biggest concern is cable management.

Cooling is straightforward: hot aisle/cold aisle containment, CRAC units pushing conditioned air, done. Air has enough thermal capacity to carry the heat load out of the rack before anything catches fire. Enterprise data centers are typically designed for 150 to 200 watts per square foot of raised floor, and these racks fit comfortably inside that.

This is the installed base (something like 90% of rack capacity in the world right now). The buildings that house it were designed for it. Thousands of these facilities exist and the design has been refined over 30 years.

The middle tier: 80–100 kW racks being deployed today

NVIDIA DGX SuperPOD — a cluster of DGX H100 racks. Each populated rack runs 40 to 50 kW; a full SuperPOD row approaches 100 kW per rack footprint. Image: NVIDIA

The first wave of purpose-built AI data centers isn’t running megawatt racks. It’s running GPU clusters in the 80 to 100 kW per rack range. Think DGX H100 clusters, or dense A100/H100 configurations from ODMs like Wiwynn, Quanta, or Supermicro. This is what’s actually getting installed at scale right now, and it already breaks the enterprise playbook in several important ways.

At 80 to 100 kW, air cooling is still technically possible but you’re working against physics rather than with it. The airflow volumes required are substantial: on the order of 6,000 to 16,000 CFM through a single rack, depending on how much temperature rise you allow across it, which means high-velocity fans, significant acoustic load, and real structural air management. Hot aisle containment stops being optional and becomes mandatory. Cold aisle containment and blanking panels have to be perfect; any bypass airflow means hot spots. A lot of facilities running these densities are running at CRAC unit limits, with CRACs located every few rows rather than around the perimeter.
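For a rough check on airflow at these densities, the standard sensible-heat approximation for air is enough. This is a sketch, not a design calculation: the 1.08 factor assumes air near sea-level density, and the 20°F and 40°F temperature rises across the rack are illustrative values I picked, not figures from a specific deployment.

```python
BTU_PER_HR_PER_KW = 3412.14    # 1 kW of heat is about 3,412 BTU/hr
SENSIBLE_HEAT_FACTOR = 1.08    # BTU/hr per (CFM * degF), standard-air approximation


def required_cfm(rack_kw: float, delta_t_f: float) -> float:
    """Airflow needed to carry rack_kw of heat at a given air temperature rise."""
    heat_btu_per_hr = rack_kw * BTU_PER_HR_PER_KW
    return heat_btu_per_hr / (SENSIBLE_HEAT_FACTOR * delta_t_f)


for rack_kw in (10, 80, 100):
    for delta_t_f in (20, 40):   # assumed inlet-to-exhaust rise, degF
        cfm = required_cfm(rack_kw, delta_t_f)
        print(f"{rack_kw:>3} kW at {delta_t_f} F rise: {cfm:,.0f} CFM")
```

The result is very sensitive to the assumed temperature rise. Doubling the allowed rise halves the airflow, which is why high-density air-cooled designs push exhaust temperatures as high as the hardware tolerates.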

The power delivery changes significantly too. At 100 kW per rack you’re looking at roughly 280 amps per phase at 208V three-phase (closer to 350 amps of circuit capacity once you hold continuous load to 80% of the rating), which means you’re no longer running standard 30A or 60A branch circuits to a PDU. You need high-amperage three-phase feeds, often delivered via overhead busway (a dedicated power track running the length of the row) rather than individual conduit runs. The PDUs themselves are large, heavy, and rated for continuous 80%+ of their maximum load. Branch circuit protection, cord sizing, and PDU tap-off points all have to be engineered specifically for the load. This isn’t a catalog selection anymore.
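The current numbers are easy to get wrong, so here is the arithmetic as a short sketch. It assumes a balanced three-phase load at unity power factor and an 80% continuous-load limit on the circuit rating; both are simplifying assumptions on my part, and real feeds get sized by an electrical engineer against local code.

```python
import math


def three_phase_line_current(power_w: float, line_voltage_v: float,
                             power_factor: float = 1.0) -> float:
    """Per-phase line current for a balanced three-phase load."""
    return power_w / (math.sqrt(3) * line_voltage_v * power_factor)


def required_circuit_rating(load_current_a: float,
                            continuous_limit: float = 0.8) -> float:
    """Circuit rating needed if continuous load may only use part of the rating."""
    return load_current_a / continuous_limit


rack_w = 100_000
single_phase_equivalent = rack_w / 208          # P / V, ignores the sqrt(3) factor
line_a = three_phase_line_current(rack_w, 208)  # balanced three-phase
print(f"P/V at 208V:          {single_phase_equivalent:.0f} A")
print(f"Three-phase at 208V:  {line_a:.0f} A per phase")
print(f"Circuit rating:       {required_circuit_rating(line_a):.0f} A")
print(f"Three-phase at 480V:  {three_phase_line_current(rack_w, 480):.0f} A per phase")
```

Dividing power by voltage alone gives a figure near 480 amps; the balanced three-phase formula brings that down to about 280 amps per phase, and the continuous-load margin pushes the required rating back up toward 350. Either way it is far beyond a 30A or 60A branch circuit.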

Hyperscalers building for this tier are also pushing past 48V DC distribution and moving to 400VDC and 800VDC architectures, centralizing the AC-to-DC rectification closer to the utility feed and distributing high-voltage DC directly to the rack row. The efficiency gains at 100 kW density are real enough that Meta, Google, and Microsoft are all deploying medium-voltage distribution (some running as high as 13.8 kV) before stepping down to rack-level DC. Delta’s 800VDC “AI Power Cube” (co-developed with NVIDIA) is targeting 1.1 MW-scale racks, but the same architecture is relevant even at 100 kW because it eliminates conversion stages that compound into real money at this density.
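The case for higher distribution voltage comes down to current. A minimal sketch, assuming a 100 kW rack and, for the loss comparison, the same conductor resistance at every voltage (an illustrative simplification; in practice designers take much of the benefit as thinner conductors rather than lower loss):

```python
def bus_current_a(power_w: float, bus_voltage_v: float) -> float:
    """DC bus current needed to deliver power_w at a given bus voltage."""
    return power_w / bus_voltage_v


def relative_conduction_loss(bus_voltage_v: float,
                             reference_voltage_v: float = 48.0) -> float:
    """I^2 * R loss relative to the reference bus, same power and same resistance."""
    return (reference_voltage_v / bus_voltage_v) ** 2


rack_w = 100_000
for volts in (48.0, 400.0, 800.0):
    amps = bus_current_a(rack_w, volts)
    loss = relative_conduction_loss(volts)
    print(f"{volts:>5.0f} V bus: {amps:>7,.0f} A, conduction loss {loss:.2%} of the 48 V case")
```

At 48V a 100 kW rack needs over 2,000 amps on the bus bar. At 800V it needs 125. That difference in copper is the practical argument for moving the rectification upstream.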

The buildings designed for this tier look noticeably different from enterprise data centers. Power density per square foot goes from the 150 to 200W/sqft enterprise standard up to 500 to 800W/sqft for a dense GPU row. That changes transformer sizing, switchgear ratings, UPS topology, and generator capacity significantly. Floor loading is a separate hard constraint: racks with liquid cooling hardware at this density can weigh 2,000 to 3,000 pounds, and if you’re running a coolant distribution unit (CDU) per row, a fully flooded unit alone can weigh around 3 tons, so you need slab capacity around 800 kg/m² rather than the typical raised-floor spec. You also need extended rack depth (standard 42-inch racks won’t fit current NVIDIA HGX servers), and those deeper racks affect aisle spacing and the whole floor layout.

On the cooling side, 100 kW is where two distinct approaches are both in active use and worth understanding separately.

The first is rear-door heat exchangers. A rear-door HX (RDHx) replaces the rack’s back door with a chilled-water coil that the rack’s own fans blow exhaust air through. The liquid captures heat from the airstream before it reaches the room, but air is still the medium moving heat away from the chips. The fans keep running, and you still need hot/cold aisle management. Latest-generation units like OptiCool’s 120 kW RDHx can now absorb close to the full heat load of a 100 kW rack, up from the 40 to 70% capture typical of earlier units. A common 2025 deployment pattern runs about 70% liquid capture via RDHx with the remaining 30% handled by conventional room cooling. This approach works without redesigning the facility cooling loop from scratch, which is why it’s popular as a retrofit and for facilities not quite ready to commit to full direct liquid cooling.

The second approach is direct liquid cooling (DLC), where coolant runs through cold plates bolted directly onto the GPUs and CPUs. No air involved in moving heat away from the chips at all. Heat goes straight into the coolant. DLC is more efficient and handles higher densities, but it requires CDUs, supply and return manifold plumbing, and leak detection throughout. The industry sizing rule for a DLC loop is 1.2 liters per minute per kilowatt at 45°C inlet temperature: an 85 kW rack needs a CDU and manifold supporting roughly 102 LPM of flow. That’s not exotic hardware, but it has to be deliberately designed in rather than bolted on after the fact.
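The sizing rule is easy to turn into a calculator, and it is worth checking what coolant temperature rise it implies. This is a minimal sketch that assumes plain water properties (1 kg per liter, 4.18 kJ/kg·°C). Real loops often run a glycol mix with lower heat capacity, so treat the temperature rise as approximate.

```python
def dlc_flow_lpm(rack_kw, lpm_per_kw=1.2):
    """Coolant flow (liters per minute) from the 1.2 LPM/kW rule of thumb."""
    return rack_kw * lpm_per_kw

def coolant_delta_t_c(rack_kw, flow_lpm, density_kg_per_l=1.0, cp_kj_per_kg_c=4.18):
    """Coolant temperature rise (°C) from Q = m_dot * cp * dT, water assumed."""
    mass_flow_kg_s = flow_lpm * density_kg_per_l / 60.0
    return rack_kw / (mass_flow_kg_s * cp_kj_per_kg_c)

rack_kw = 85  # the example rack from the text
flow = dlc_flow_lpm(rack_kw)
print(f"{rack_kw} kW rack: {flow:.0f} LPM, about {coolant_delta_t_c(rack_kw, flow):.1f} °C rise")
```

Under those assumptions the rule works out to roughly a 12°C rise from inlet to outlet regardless of rack size, which is a useful cross-check against a CDU datasheet.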

At 100 kW, both approaches are viable. The choice comes down to how the facility was built and what the next GPU generation will demand.

The critical point is that 100 kW racks are demanding but solvable within a purpose-built or heavily upgraded facility. Building new infrastructure to this spec costs somewhere between $200K and $300K per rack in facility-side capital (not counting the compute itself). That’s a real number. Retrofitting an existing facility up to 40 kW density is cheaper, around $50K to $100K per rack, but leaves headroom on the table when the next GPU generation arrives. The challenges are well understood, the vendor ecosystem is mature, and there’s enough operational experience to draw from. None of it requires fundamentally new infrastructure categories. It just requires actually building the right infrastructure rather than adapting what’s already there.

What it does require is supply chain access that’s getting harder to take for granted, because a lot of the components that make 100 kW infrastructure work are the same ones going into utility-scale solar farms by the thousands.

The most acute overlap is in transformers. A hall of 100 kW GPU racks drawing several megawatts requires large medium-voltage transformers to step utility power down to distribution voltage. Those same transformer types are going into solar interconnection projects in massive numbers: over 90% of new electric generating capacity installed globally in 2025 was solar and wind, and every one of those projects needs medium-voltage step-up transformers to get power onto the grid. Large power transformers now take two to three years to procure in some cases, versus weeks before 2020. The cores of those transformers require grain-oriented electrical steel (GOES), which in the US is produced by essentially one domestic mill (Cleveland-Cliffs). Hyperscalers have been documented outbidding utility grid suppliers for transformer allocations. That’s not a supply chain abstraction. That’s a literal bidding war between AI infrastructure buildout and the power grid that everyone depends on.

The UPS systems at 100 kW facilities have the same problem at the semiconductor level. Double-conversion UPS units (which virtually all purpose-built AI facilities use, since they can’t tolerate even a millisecond of power interruption during GPU training runs) rely on IGBTs and increasingly SiC MOSFETs for the conversion stages. Those devices are in the same demand pool as solar inverter switching components. A 650V GaN switch or a 1,200V SiC MOSFET doesn’t know if it’s going into a solar microinverter, a UPS module, or a datacenter PDU. The fabs don’t care either. Renesas, for example, is now explicitly marketing a single bidirectional 650V GaN device for both solar inverter and AI datacenter applications simultaneously. That’s convenient for the chip vendor and a scheduling problem for anyone trying to place a large order during a tight quarter.

The copper situation compounds everything at this tier too. Microsoft’s 80 MW Chicago facility used roughly 2,100 tonnes of copper across on-site and near-site power connections (about 26 tonnes per megawatt). Scale that to a 100-rack GPU hall at 10 MW of IT load and you’re sourcing 260 tonnes of copper just for the power infrastructure, before you run any cable to the racks themselves. That copper is competing with the solar farms and grid storage projects being built at unprecedented rates to supply the power those same facilities need. It is genuinely circular: the AI buildout is driving power demand that requires renewable buildout, and both the AI buildout and the renewable buildout are competing for the same copper, transformers, and power semiconductors to do it.

The new world: 1+ MW in the same box

The NVIDIA GB200 NVL72 — 72 Blackwell GPUs, 18 compute trays, 9 switch trays, direct liquid cooling throughout. Over a megawatt at peak load. Image: NVIDIA

The NVL72 (NVIDIA’s GB200 rack-scale system) fits in roughly the same floor footprint as all of the above. Same basic rack envelope. And it draws over a megawatt. That’s not a typo. One megawatt, in a box the size of a large refrigerator cabinet.

The physical layout of the NVL72 is worth understanding because it’s nothing like a conventional rack. Inside you have 18 liquid-cooled compute trays and 9 switch trays, all in 1U form factor, plus 4 NVLink cartridges mounted vertically at the rear. Those 4 cartridges alone contain over 5,000 active copper cables, the interconnect fabric that lets all 72 GPUs talk to each other as a single unified compute domain. Each GPU gets 1.8 TB/s of NVLink bandwidth, which is 36x faster than 400 Gbps Ethernet and about 2x faster than the previous H200 generation (which topped out at 900 GB/s per GPU). The aggregate AllReduce bandwidth across all 72 GPUs is 260 TB/s. That number exists because of those 5,000 copper cables crammed into the rear of the rack.

To put the power density in physical terms: 1 MW sustained is enough to power somewhere around 800 to 900 average American homes. All of it going into a box that fits in a large office.

NVIDIA contributed the NVL72’s rack, compute tray, and switch tray designs to the Open Compute Project in late 2024, which means the full mechanical and electrical specs are now public. A few things in those specs are worth calling out because they show just how much had to be rethought from first principles.

The rack frame has over 100 lbs of steel reinforcements to handle 6,000 lbs of mating force as trays blind-mate into position. The bus bar carries 1,400 amps (double the existing ORV3 standard), same width as before but with a deeper profile for the increased ampacity. The cooling connections use a floating blind-mate liquid cooling manifold: each tray makes its coolant connection automatically as it slides in, the same mechanical motion that makes the electrical connection. No separate plumbing step, no hose connections. The tray goes in, everything connects.

A megawatt per rack is roughly a 10 to 12x jump over the 80 to 100 kW racks being deployed today, and 70 to 100x over the enterprise baseline, all in the same floor footprint. And unlike the move from 15 kW to 100 kW (demanding but solvable), the jump to 1 MW broke categories. The existing OCP standards didn’t have answers. NVIDIA had to write new ones.

Power delivery: why you can’t just plug it in

Standard AC power delivery has a dirty secret: every conversion step wastes energy. Electricity comes in from the utility, goes through a transformer, hits a UPS, gets distributed through PDUs, and finally runs through server power supply units that convert it again to DC on the board. Each of those steps is 94 to 97% efficient. Cascade four or five of them and you’ve lost 15 to 25% of your input power to heat before a single computation runs.

At 10 kW per rack, this is annoying but manageable. At 100 kW, it’s a real cost that starts showing up in facility OpEx. At 1 MW, it’s a crisis.
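Cascaded losses multiply rather than add, and a few lines of arithmetic make that concrete. The sketch below brackets the best and worst cases from the per-stage range above, then scales one mid-range case across the three rack sizes. The 95.5% mid-range stage efficiency and the choice to measure loss against input power are my assumptions.

```python
from math import prod

def end_to_end_efficiency(stage_efficiencies):
    """Overall efficiency of conversion stages in series (product of the stages)."""
    return prod(stage_efficiencies)

def loss_fraction(per_stage_efficiency, stages):
    """Fraction of input power lost across identical stages in series."""
    return 1.0 - end_to_end_efficiency([per_stage_efficiency] * stages)

best = loss_fraction(0.97, 4)   # fewest stages, best efficiency
worst = loss_fraction(0.94, 5)  # most stages, worst efficiency
print(f"loss range: {best:.1%} to {worst:.1%}")

for rack_kw in (10, 100, 1000):
    lost_kw = rack_kw * loss_fraction(0.955, 5)  # assumed mid-range case
    print(f"{rack_kw:5d} kW rack: about {lost_kw:6.1f} kW lost as heat")
```

The extremes land a little outside the 15 to 25% range quoted above, which is what you would expect from pairing the best stage count with the best efficiency and the worst with the worst.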

Key highlights from NVIDIA’s GB200 NVL72 OCP submission: new bus bar spec, floating blind-mate cooling manifold, reinforced rack frame. Image: NVIDIA

DC distribution eliminates one to two of those conversion stages. The OCP ORV3 standard uses a 48V DC bus bar delivered via blind-mate connector: the rack slides in and makes contact. No cord management, no intermediate conversion, direct DC to the server boards. Some hyperscale deployments push this further to 400V HVDC, eliminating another stage. The NVL72’s enhanced bus bar spec (1,400 amps, as mentioned above) is now part of NVIDIA’s OCP contribution, available to the industry rather than kept proprietary.

The difference between 85% end-to-end efficiency and 95% efficiency is 100 kW of waste heat per megawatt rack. A hundred kilowatts that you’re paying for, generating heat from, and then paying again to cool. That “paying again to cool” part is real and it compounds the loss. Modern liquid-cooled facilities run a PUE around 1.2 to 1.3, meaning roughly 0.2 to 0.3 kW of cooling energy is consumed for every kW of heat the facility has to reject. Apply that to the waste heat alone (the heat that never needed to exist in the first place) and the cooling overhead adds another 25% or so on top of the direct conversion loss cost.

At 1,000 racks (a medium-sized hyperscale hall), the annual cost difference between the least and most efficient delivery methods in the table below, counting both the losses and the cost to cool those losses, is somewhere between $96 and $115 million per year. That’s the business case for a 9-figure investment in HVDC infrastructure. The math isn’t subtle.

Here’s a rough comparison across delivery methods for a 1 MW rack, at $0.065/kWh with a 1.25 cooling overhead factor applied to waste heat:

Power delivery	Efficiency	Loss (kW)	Annual power cost of losses	Annual cooling cost of losses	Total annual waste cost
Single phase 120V AC	80 to 82%	180 to 200	$100K to $115K	$25K to $29K	$125K to $144K
Three phase 208V/480V AC	88 to 91%	90 to 120	$51K to $70K	$13K to $18K	$64K to $88K
48V DC (OCP ORV3)	94 to 96%	40 to 60	$23K to $35K	$6K to $9K	$29K to $44K

The gap between single-phase AC and 48V DC, fully loaded, is roughly $96K to $115K per rack per year in pure waste: power you bought, converted to heat you didn’t want, and then spent more money to remove. These numbers are why you see hyperscalers spending billions on power infrastructure before they spend anything on compute.
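The table can be regenerated from its stated inputs, which makes it easy to swap in your own electricity price. This is a sketch that follows the same convention as the table: loss is measured against 1 MW of input power, and cooling is charged at 25% of the loss cost (the 1.25 factor). The function and row names are mine.

```python
HOURS_PER_YEAR = 8760
PRICE_PER_KWH = 0.065    # $/kWh, from the text
COOLING_OVERHEAD = 0.25  # the 1.25 factor: 25% extra to cool each lost kW

def annual_waste_cost(input_kw, efficiency):
    """Return (loss_kw, power_cost, cooling_cost, total_cost) in kW and $/year."""
    loss_kw = input_kw * (1.0 - efficiency)
    power_cost = loss_kw * HOURS_PER_YEAR * PRICE_PER_KWH
    cooling_cost = power_cost * COOLING_OVERHEAD
    return loss_kw, power_cost, cooling_cost, power_cost + cooling_cost

ROWS = [  # (label, low efficiency, high efficiency) as given in the table
    ("Single phase 120V AC", 0.80, 0.82),
    ("Three phase 208V/480V AC", 0.88, 0.91),
    ("48V bus bar (OCP ORV3)", 0.94, 0.96),
]

for label, eff_low, eff_high in ROWS:
    worst = annual_waste_cost(1000, eff_low)[3]
    best = annual_waste_cost(1000, eff_high)[3]
    print(f"{label:26s} ${best / 1000:4.0f}K to ${worst / 1000:4.0f}K per year")
```

The computed values land within rounding of the table. Multiply any row by the rack count to get a hall-level figure.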

Cooling: air physically cannot do this job

At 80 to 100 kW, air cooling is already working hard. You’re managing it with rear-door HXs, tight containment, and purpose-built facilities, but the physics are still workable if you’re disciplined. At 1 MW, you’ve left the realm of “air cooling is expensive” and entered “air cooling is physically impossible in any meaningful sense.” I don’t mean difficult. I mean the airflow velocities required to move enough heat would damage components and make the room uninhabitable.

Here’s the physics. Air has a specific heat capacity of about 1 kJ/kg·°C. Water has about 4.18 kJ/kg·°C. But density matters too: water is about 830 times denser than air at standard conditions. So water carries roughly 3,400 times as much heat per unit volume as air. To remove 1 MW of heat with air at the temperature deltas you can realistically achieve in a data center (maybe 15 to 20°C rise across a rack), you’d need roughly 40 to 55 cubic meters of air per second through a single rack (on the order of 90,000 to 120,000 CFM), which generates serious acoustic problems and creates structural forces on lightweight components.
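The comparison comes straight out of Q = ρ × V̇ × cp × ΔT. The sketch below assumes round-number fluid properties (air at 1.2 kg/m³ and 1.005 kJ/kg·°C, water at 998 kg/m³ and 4.18 kJ/kg·°C), so the results are approximate.

```python
AIR = {"density_kg_m3": 1.2, "cp_kj_kg_c": 1.005}      # assumed room-condition air
WATER = {"density_kg_m3": 998.0, "cp_kj_kg_c": 4.18}   # assumed plain water
M3S_TO_CFM = 2118.88  # cubic meters per second to cubic feet per minute

def volumetric_heat_capacity(fluid):
    """Heat carried per cubic meter per degree, in kJ/(m^3·°C)."""
    return fluid["density_kg_m3"] * fluid["cp_kj_kg_c"]

def volume_flow_m3s(heat_kw, delta_t_c, fluid):
    """Volume flow needed to carry heat_kw at a given temperature rise."""
    return heat_kw / (volumetric_heat_capacity(fluid) * delta_t_c)

ratio = volumetric_heat_capacity(WATER) / volumetric_heat_capacity(AIR)
print(f"water carries about {ratio:,.0f}x the heat of air per unit volume")

for delta_t in (15, 20):
    air = volume_flow_m3s(1000, delta_t, AIR)
    water_lpm = volume_flow_m3s(1000, delta_t, WATER) * 60_000  # m^3/s to L/min
    print(f"dT {delta_t} °C: air {air:5.1f} m^3/s ({air * M3S_TO_CFM:,.0f} CFM), "
          f"water {water_lpm:,.0f} L/min")
```

The same megawatt that needs tens of cubic meters of air per second needs well under a cubic meter of water per minute.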

Direct liquid cooling (DLC) is the answer. Coolant flows through cold plates physically attached to GPUs, CPUs, and memory modules. The coolant absorbs heat at the source, carries it to a coolant distribution unit (CDU), and gets rejected to the facility cooling loop. With water as the coolant, you can pull 1 MW out of a rack with a flow rate on the order of 1,000 liters per minute (about 1,200 LPM by the 1.2 LPM/kW sizing rule). Manageable with standard chilled water infrastructure, as long as that infrastructure was designed for it.

Cold plates themselves are also getting smarter. The standard approach runs parallel channels in a rectilinear grid across the chip surface, distributing coolant uniformly regardless of where the heat is actually being generated. Modern GPUs aren’t thermally uniform. There are intense hotspots at compute cores and memory interfaces sitting next to relatively cool regions, and a thermally blind cold plate treats all of it the same way. A Swiss EPFL spinout called Corintis is attacking this directly with microfluidic chip-scale cooling: channels roughly 100 microns in diameter (about the width of a human hair) etched into or just above the silicon die, with AI-optimized topologies that route more flow to hotspots and less to cool regions. Microsoft tested an early version on production server hardware and reported 3x better heat removal than advanced cold plates, a 65% reduction in GPU peak temperature, and a 55% drop in pressure compared to a parallel-channel baseline. The next generation goes further, embedding the channels directly in the chip die and co-designing the thermal structure alongside the electronics. I’ve got a full writeup on Corintis in the research section if you want the technical detail. The point here is that cold plates aren’t standing still. The ceiling on what DLC can do at the chip level keeps rising.

Rear-door heat exchangers (a useful stopgap up to roughly 100 kW per rack) get you through that tier without redesigning facility cooling loops from scratch. At 1 MW, they’re a footnote.

The facility requirements for DLC at megawatt density are not trivial. You need dedicated CDUs for each rack or row, supply and return manifold piping, leak detection at every connection point (a leak in a 1 MW liquid-cooled rack is a very bad day), and secondary containment for the coolant loop. NVIDIA’s reference architecture for a 7 MW GB200 NVL72 cluster (developed with Vertiv) shows what this looks like at scale: a purpose-built liquid-cooled floor plan where every element, from CDU placement to power distribution topology, is designed around the rack rather than adapted from a conventional air-cooled facility. That reference architecture reportedly cuts deployment time by up to 50% compared to custom-designed approaches, which says something about how standardized this problem is becoming even at megawatt densities.

Reference floor plan for a 7 MW GB200 NVL72 cluster, developed by NVIDIA and Vertiv. Every element is designed around the rack’s liquid cooling and HVDC requirements. Image: NVIDIA

This is a ground-up design requirement, and increasingly one with published reference architectures and OCP-standardized components rather than a bespoke engineering problem every time.

What it actually costs to run a 1 MW rack for a year

Let’s put numbers on the full picture. One megawatt of IT load, operating 24/7/365:

Raw power cost: 1 MW × 8,760 hours × $0.065/kWh = $569,400 per year
Apply a PUE (Power Usage Effectiveness) of 1.2, which is realistic for a modern liquid-cooled facility: total facility load is 1.2 MW
Total facility power cost: $683,000 per year per rack

That’s before amortizing the cost of the rack itself (the NVL72 is reportedly in the $3 to $4M range per system, before networking), the facility infrastructure, or the power delivery and cooling buildout.

Scale to 1,000 racks and you’re looking at roughly $683M per year in power costs alone. A mid-sized hyperscale AI hall. The infrastructure to support those racks (power substations, cooling towers, HVDC distribution, DLC manifolds) runs another $1 to $2 billion in capital. The compute itself is additional.

This is why the conversation in data center infrastructure has shifted so completely in the last two years. The decisions about PUE targets, power delivery topology, and cooling architecture are not engineering preferences. They’re P&L items. The difference between a 1.4 PUE facility and a 1.2 PUE facility, at this scale, is about $114M per year in wasted power costs for that same 1,000-rack hall. Every tenth of a PUE point is worth fighting for.
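The operating-cost arithmetic in this section fits in one small function, which makes it easy to rerun with a different power price or PUE target. A minimal sketch using the $0.065/kWh rate and 1 MW rack from above; the function names are mine.

```python
HOURS_PER_YEAR = 8760

def annual_power_cost(it_load_kw, price_per_kwh, pue=1.0):
    """Yearly electricity cost ($) for an IT load running 24/7 at a given PUE."""
    return it_load_kw * pue * HOURS_PER_YEAR * price_per_kwh

def pue_gap_cost(it_load_kw, price_per_kwh, pue_worse, pue_better):
    """Yearly cost difference ($) between two PUE targets for the same IT load."""
    return (annual_power_cost(it_load_kw, price_per_kwh, pue_worse)
            - annual_power_cost(it_load_kw, price_per_kwh, pue_better))

PRICE = 0.065     # $/kWh, from the text
RACK_KW = 1000    # one 1 MW rack
HALL_RACKS = 1000

print(f"raw IT power, one rack:  ${annual_power_cost(RACK_KW, PRICE):,.0f}")
print(f"at PUE 1.2, one rack:    ${annual_power_cost(RACK_KW, PRICE, 1.2):,.0f}")
print(f"at PUE 1.2, whole hall:  ${annual_power_cost(RACK_KW * HALL_RACKS, PRICE, 1.2):,.0f}")
print(f"PUE 1.4 vs 1.2, hall:    ${pue_gap_cost(RACK_KW * HALL_RACKS, PRICE, 1.4, 1.2):,.0f}")
```

The first two lines reproduce the $569,400 and roughly $683,000 per-rack figures; the last one prices the gap between two PUE targets for the full hall.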

The supply chain you didn’t expect: this stuff competes with electric cars

Here’s something that doesn’t come up enough in datacenter conversations: a meaningful chunk of the bill of materials for a 1 MW rack comes from the same supply chain as a high-end electric vehicle. Not metaphorically similar. Literally the same components, made by the same manufacturers, allocated from the same production capacity.

Start with the power semiconductors. The move to 800VDC datacenter distribution requires silicon carbide (SiC) MOSFETs rated at 1,200V for the front-end AC-DC conversion stages, and gallium nitride (GaN) transistors for the downstream DC-DC conversion. Those 1,200V SiC devices are the exact same class of component used in 800V EV traction inverters, the inverter that drives the motors in a Porsche Taycan, a Hyundai Ioniq 6, or a Lucid Air. Infineon, onsemi, STMicroelectronics, and Wolfspeed are the dominant suppliers to both markets, and they’re drawing from the same wafer fabrication capacity. NVIDIA’s 800V HVDC supplier alliance (announced May 2025 with Navitas and others) is specifically targeting this component class for the 1 MW rack generation. The SiC content per rack is projected to increase roughly 11x from GB200 to the Rubin Ultra generation. That’s not a rounding error in the supply chain.

The copper situation is similar. At 54V distribution, a single 1 MW rack requires around 200 kg of copper bus bar. That’s why the push to 800VDC matters beyond efficiency: running the same power at roughly 15x higher voltage cuts the current by the same factor, which means roughly 45% less copper for the same delivered power. Even with that reduction, the aggregate copper demand from hyperscale AI buildout is enormous. Analysts project a 6 million-tonne global shortfall by 2035, driven jointly by AI infrastructure and clean energy electrification (including EVs and grid storage). These aren’t separate demand pools. They’re competing for the same mining output, the same refining capacity, and the same bus bar fabricators.
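The reason voltage matters so much for copper is plain I = P / V. The sketch below shows the current a 1 MW rack pulls at each bus voltage and how resistive loss scales if you held the conductor fixed. It does not try to derive the 45% copper figure, which depends on conductor sizing rules, insulation, and the rest of the distribution path.

```python
def bus_current_a(power_w, voltage_v):
    """DC bus current (A) needed to deliver power_w at voltage_v."""
    return power_w / voltage_v

def resistive_loss_ratio(v_low, v_high):
    """I^2*R loss at v_high relative to v_low, same power and same conductor."""
    return (v_low / v_high) ** 2

RACK_POWER_W = 1_000_000  # the 1 MW rack from the text

for volts in (54, 800):
    print(f"{volts:3d} V bus: {bus_current_a(RACK_POWER_W, volts):8,.0f} A")

print(f"current ratio: {bus_current_a(RACK_POWER_W, 54) / bus_current_a(RACK_POWER_W, 800):.1f}x")
print(f"I^2*R loss at 800 V vs 54 V (same conductor): {resistive_loss_ratio(54, 800):.2%}")
```

Roughly 18,500 amps at 54V versus 1,250 amps at 800V is the whole argument in two numbers.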

The coolant pumps are another one. The CDUs moving fluid through DLC loops at 1 MW densities use high-pressure centrifugal pumps with specifications (flow rate, pressure head, thermal tolerance) that overlap closely with the thermal management systems in EV battery packs. The same industrial suppliers (Grundfos, Ebara, and several automotive-derived vendors) serve both markets. This isn’t a theoretical concern; procurement teams at large datacenter operators have already run into allocation conflicts when trying to source at scale.

What makes this interesting is the timing mismatch. EV demand hit a rough patch in Western markets through 2024 and into 2025, which left SiC manufacturers with heavily capitalized production capacity that looked underutilized once EV ramp rates slowed. Wolfspeed (historically one of the most important SiC suppliers) filed for bankruptcy restructuring in mid-2025 after betting heavily on continued EV growth that didn’t materialize fast enough. Meanwhile datacenter demand for the same devices was accelerating sharply. The SiC market ended up with a strange combination of manufacturer financial stress and genuine tightening on specific high-specification parts. The long-term 800V EV trend is still intact (the physics of 800V drivetrains are compelling and won’t reverse), which means the demand competition is real and ongoing, just with a timing phase shift between the two application domains.

The practical implication for anyone building megawatt-class infrastructure: the supply chain for these racks isn’t just datacenter infrastructure suppliers. It’s also automotive tier-1 suppliers, SiC wafer fabs, and copper miners. Lead times on 1,200V SiC modules, high-ampacity bus bar stock, and specialty coolant pump assemblies are all being driven by a demand pool that extends well beyond the datacenter industry’s historical footprint.

Where this goes

I keep coming back to the physical constraint: same rack footprint, 100x the power density, and the world’s data center capacity was designed for the baseline, not the frontier. The greenfield buildout happening right now (the gigawatt-scale campus announcements, the utility partnerships, the dedicated substation builds) isn’t hype. It’s the physical infrastructure catching up to a compute density that existing facilities simply can’t support.

What comes after 1 MW per rack is a question I don’t have a clean answer to yet. There are 2 MW designs in discussion. Immersion cooling (fully submerging hardware in dielectric fluid) becomes more compelling as density increases further, though it introduces its own operational complexity. And at some point the silicon itself has thermal limits that packaging and cooling can’t engineer around.

The Clock Is Already Running on Quantum Crypto Risk

Jared Watkins — Fri, 17 Apr 2026 00:00:00 +0000

On March 31, Google Quantum AI published a paper that got quiet but serious attention from the security world. The upshot: under optimistic error rate assumptions, you might need fewer than 500,000 physical qubits to break the elliptic curve cryptography that underpins almost everything on the internet. Previous estimates were in the millions. Their suggested migration target is 2029.

That’s not “quantum computers are coming someday.” That’s three years.

To understand why this matters, you need to know what “break” means here. RSA, ECDSA, and Diffie-Hellman, the signature and key exchange algorithms that secure HTTPS, VPNs, SSH, email, banking, and a substantial fraction of the world’s financial infrastructure, rely on mathematical problems (integer factorization and discrete logarithm) that are hard for classical computers. In 1994, Peter Shor showed that a quantum computer could solve these problems efficiently. We’ve been slowly accepting that this would eventually be a real problem ever since. “Eventually” keeps getting closer.

The threat isn’t only about what a quantum computer breaks in real-time. There’s a second threat model called “harvest now, decrypt later” (HNDL), and it’s already happening. Nation-state adversaries are almost certainly collecting encrypted traffic today (VPN sessions, TLS connections, classified communications) with the explicit intent to decrypt it retroactively once a CRQC (a cryptographically relevant quantum computer) becomes available. If your data needs to stay secret for more than 5 to 10 years, the security clock is already running, not starting when a CRQC is built.
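The "clock is already running" point is often formalized as Mosca's inequality: if the years your data must stay secret plus the years it takes you to migrate exceed the years until a CRQC exists, you are already exposed. A minimal sketch follows; every input in the examples is a made-up illustration, not an estimate.

```python
def hndl_exposure_years(secrecy_years, migration_years, years_to_crqc):
    """Years of exposure under Mosca's inequality (0 if none).

    Data encrypted until migration finishes must stay secret for
    secrecy_years beyond that point; anything past the CRQC date is exposed.
    """
    return max(0.0, secrecy_years + migration_years - years_to_crqc)

def is_exposed(secrecy_years, migration_years, years_to_crqc):
    """True when secrecy + migration time exceeds the time until a CRQC."""
    return hndl_exposure_years(secrecy_years, migration_years, years_to_crqc) > 0

# Illustrative inputs only: (label, secrecy years, migration years, years to CRQC)
SCENARIOS = [
    ("short-lived session data", 1, 1, 3),
    ("medical records", 25, 3, 3),
    ("same records, later CRQC", 25, 3, 9),
]

for label, secrecy, migration, crqc in SCENARIOS:
    years = hndl_exposure_years(secrecy, migration, crqc)
    print(f"{label:26s} exposed={is_exposed(secrecy, migration, crqc)} ({years:.0f} years)")
```

The useful property is that pushing the CRQC date out only shrinks the exposure window. For long-lived data it never closes it.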

Why now?

The NIST post-quantum cryptography standardization process has been grinding along since 2016, and in August 2024 it finally produced real standards: FIPS 203 (ML-KEM, derived from CRYSTALS-Kyber), FIPS 204 (ML-DSA, derived from CRYSTALS-Dilithium), and FIPS 205 (SLH-DSA, derived from SPHINCS+). ML-KEM and ML-DSA are the workhorses that are going to replace RSA, ECDH, and ECDSA everywhere, with SLH-DSA as a hash-based backup signature scheme. The standards are done. The problem is that “done” at NIST and “deployed everywhere” are separated by a gap that historically takes a decade or more to close.

And Google’s paper just compressed the urgency. Prior estimates suggesting we had until the mid-2030s to get our act together were already making people uncomfortable; a plausible path to CRQC by 2029 is a different conversation entirely.

Network infrastructure

For enterprise networking, the migration path exists but it’s not plug-and-play. IKEv2/IPsec (the protocol behind most enterprise VPNs) has IETF extensions for PQC KEMs (RFC 9242, RFC 9370). TLS 1.3 has hybrid key exchange drafts; Chrome and Firefox have been running X25519Kyber768 in some configurations since 2023. The standards are real and the protocol work is largely done.

The vendor picture is messier. Fortinet, Palo Alto, Check Point, and Cisco all ship or are shipping ML-KEM IKEv2 support in recent software versions. But “ships support” and “deployed in production at scale across heterogeneous environments” are not the same thing. Most of these implementations are software-only right now, with meaningful CPU overhead. A lot of networking hardware doesn’t have the crypto acceleration to run post-quantum algorithms at line rate. Interoperability between vendor implementations is still an open problem. And then there’s the long tail of embedded devices, industrial systems, and legacy infrastructure that nobody’s going to upgrade on a 3-year timeline regardless of how urgent the threat becomes.

The cost to upgrade just the identifiable enterprise network perimeter globally (firewalls, VPN concentrators, core routing infrastructure) is probably in the low hundreds of billions of dollars when you account for hardware refresh cycles, migration labor, and the operational disruption of replacing cryptographic algorithms end-to-end. That’s before touching government classified systems (which have hard mandates under NSA’s CNSA 2.0 by 2030) or financial sector infrastructure. The Fed and CISA have both been clear that the financial sector needs to treat this as a critical infrastructure priority, but the actual upgrade spend is still largely ahead of us.

Cryptocurrency is a much harder problem

Enterprise networking is hard because of scale and operational inertia. Cryptocurrency is hard for a different and more fundamental reason: there’s no one in charge.

Bitcoin uses ECDSA secp256k1 and Schnorr signatures for transaction authorization. Both are fully quantum-vulnerable via Shor’s algorithm. Approximately 28 to 35% of the total Bitcoin supply sits in addresses where the public key is already visible on-chain, meaning a CRQC could derive the private key directly without the owner doing anything. That’s somewhere in the neighborhood of 6 to 7 million BTC, including most of Satoshi’s known holdings (around 1.1M BTC in early P2PK outputs). At current prices that’s over a trillion dollars in directly attackable assets across the crypto ecosystem.

Migrating Bitcoin requires the whole ecosystem to agree on a new signature scheme, ship it as a consensus change (probably a soft fork for the address infrastructure, definitely a hard fork for phasing out old signature types), and then get every wallet, exchange, node operator, and user to actually do the migration. The historical precedent here is not encouraging. SegWit took about two years from proposal to activation, and a significant portion of the network still doesn’t use it. Taproot activated in 2021 and most wallets still default to legacy address types. Core developers are estimating 5 to 10 years for full Bitcoin PQC migration from the point a change is activated. That’s a best case.

There’s an active governance debate happening right now. BIP-360 proposes quantum-resistant output infrastructure, and BIP-361 proposes a controversial phased sunset of legacy signature types that would effectively freeze coins in old address formats that don’t migrate. Jameson Lopp (Casa CTO) is driving the more aggressive approach. Adam Back is publicly against mandatory freezing. The fact that this is still being debated at the conceptual level, with no miner signaling, no algorithm formally selected, and a testnet that’s only a few weeks old, is not reassuring given the 2029 migration target that Google just put on the table.

Ethereum is in a similar position but with a more organized response. They have a dedicated post-quantum team, an active proposal (EIP-8141), and devnets running. Their self-imposed 2029 migration deadline implies roughly seven hard forks in the remaining time. It’s ambitious but at least there’s an organized program.

Some chains are further along. Algorand has Falcon-based signatures live on mainnet for state proofs as of November 2025. QRL was built from the ground up as a quantum-resistant blockchain in 2018. But the assets most at risk are on Bitcoin and Ethereum, which have the most value, the most decentralized governance, and therefore the hardest migration paths.

The broader picture

Beyond networks and crypto, there’s everything else. HTTPS certificates that secure web browsing. S/MIME email encryption. Code signing certificates that verify software updates. Document signing. The PKI underpinning most of the internet’s trust model. Healthcare record confidentiality, where HNDL is particularly concerning given how long medical data retains sensitivity. Financial records. Legal contracts. Smart grid infrastructure. Every layer of the stack that relies on public-key cryptography has to migrate, in some cases multiple times (signing and key exchange often use different algorithms).

Early estimates for migrating federal government systems put the number well north of $7 billion and the better part of this decade to complete. The private sector is an order of magnitude larger. The total cost of global cryptographic infrastructure migration, done properly, is probably in the trillions when you account for hardware, software, labor, testing, and the inevitable disruption to systems that can’t be taken offline cleanly.

The realistic outcome isn’t that we migrate everything in time. It’s that we prioritize the highest-sensitivity, highest-value assets and connections, get those migrated first, and accept that long-tail infrastructure will remain vulnerable for years longer than is comfortable. That means some HNDL-harvested traffic will eventually be decrypted. Some legacy embedded systems will get cracked once a CRQC is operational. The question is whether the critical stuff (financial clearing, classified communications, critical infrastructure control) is protected before the window closes.

The standards exist. The vendor implementations are starting to arrive. The governance debates in decentralized systems are happening, if contentiously. Whether 2029 proves out as the actual CRQC timeline or the real date turns out to be 2032 or 2035, the direction is clear and the work is late. The difference between treating this as a three-year problem versus a fifteen-year problem is whether we’re actually going to be ready.

I’ve been tracking this closely in the research section. There’s more depth there on specific vendor implementations, the cryptocurrency exposure numbers, and the NIST standards themselves, if you want to go further.

Picking hardware for local AI inference in 2026

Jared Watkins — Tue, 07 Apr 2026 00:00:00 +0000

Nobody buying AI hardware in 2026 is short on opinions. Everyone has a take. The forums are full of people who swear by their setup and can’t understand why anyone would choose differently. Most of those arguments are happening across completely different use cases, which raises the noise floor for this subject.

“What’s the best hardware for running AI locally?” is roughly as useful as asking what’s the best vehicle without mentioning whether you’re hauling gravel or commuting to an office. The answer depends entirely on what you’re trying to do, and getting that wrong wastes real money.

Here’s my attempt to cut through it.

The three things that actually matter

Local AI hardware comes down to three variables: capacity, bandwidth, and software stack.

Capacity is whether the model fits in memory at all. If it doesn’t fit, nothing else matters. You’re either offloading to disk (more on that disaster in a bit) or you need a bigger box.

Bandwidth is how fast the hardware can feed data to the compute units. This is the single best first-pass predictor of how fast tokens actually come out. Memory bandwidth is not the same as tokens per second, but it’s the cleanest way to sort real performance tiers before you waste a weekend arguing with someone posting single-prompt screenshots.

Software stack is how much of the spec sheet you can actually cash out. A card with strong bandwidth numbers on paper does nothing useful if the inference framework doesn’t support it. This is still where CUDA’s dominance matters, and it’s where Tenstorrent’s fully open source stack is a genuine long-term bet worth watching.
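To make the capacity and bandwidth points concrete, here’s a back-of-envelope sketch. It assumes a dense model whose weights are all read once per generated token, and it ignores KV cache, activations, and runtime overhead, so the tok/s figure is a ceiling rather than a prediction. The function names and the 10% headroom default are my own placeholders, not part of any framework.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, memory_gb: float,
         headroom: float = 0.10) -> bool:
    """True if the weights fit while leaving some memory for KV cache and runtime.

    The 10% default headroom is an assumption, not a measured figure.
    """
    return weights_gb(params_billion, bits_per_weight) <= memory_gb * (1 - headroom)


def decode_ceiling_tok_s(params_billion: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed: every weight is read once per token."""
    return bandwidth_gb_s / weights_gb(params_billion, bits_per_weight)


if __name__ == "__main__":
    # 70B dense model at 4 bits per weight on a 128GB, 273 GB/s box.
    print(f"weights: {weights_gb(70, 4):.1f} GB")
    print(f"fits in 128GB: {fits(70, 4, 128)}")
    print(f"ceiling: {decode_ceiling_tok_s(70, 4, 273):.1f} tok/s")
```

Real Q4 formats carry a bit more than 4 bits per weight once scales are included, and measured decode speeds land well under this ceiling. That’s why bandwidth works as a sorting tool and not as a benchmark.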

The hardware landscape

Five distinct markets, same buzzword. Here’s what each one is actually good for.

Raw speed when the model fits: discrete GPUs

If the model fits in VRAM, discrete GPUs are still the fastest thing by a wide margin. Nothing else comes close on a per-token basis.

NVIDIA’s RTX PRO 6000 Blackwell (96GB, 1792 GB/s, around $8,000 to $9,200 retail right now) and the RTX 5090 (32GB, 1792 GB/s, street price has been running $3,000 to $5,000 and climbing due to supply issues) share identical bandwidth. The difference is capacity. The PRO 6000 can hold a 70B model at Q4 comfortably and will push around 100 to 120 tok/s on it; the 5090 tops out around 30B quantized but hits 150 to 200 tok/s on 8B models where bandwidth and VRAM both cooperate. The RTX 4090 (24GB, 1008 GB/s) runs around 80 to 100 tok/s on 8B and is still worth knowing about if you find one at a good price on the secondary market.

AMD’s discrete cards deserve more credit than they typically get. The RX 7900 XTX (24GB, 960 GB/s) is genuinely competitive on bandwidth per dollar. The Radeon PRO W7900 (48GB, 864 GB/s) doubles the memory at workstation pricing. The newer AI PRO R9700 (32GB, 640 GB/s) sits in between. ROCm support has improved enough that AMD is a real option now, especially with llama.cpp and Ollama.

Intel showed up too. The Arc Pro B65 (32GB, ~608 GB/s) and B60 (24GB, ~456 GB/s) are interesting if you’re following where Intel’s headed with this. Not my first choice today, but they’re not irrelevant.

Discrete GPUs win because they can drink from a firehose. They lose the moment the model doesn’t fit.

Biggest one-box memory: Apple Silicon

Apple’s pitch is simple: not the fastest, but more unified memory in a quiet box than anything else you can buy.

Apple’s current Mac Studio lineup spans two chips, and they’re not interchangeable for AI work: the M3 Ultra has more memory bandwidth and more total memory than the M4 Max, which makes it the better inference box despite being the older chip.

The Mac Studio M3 Ultra tops out at 96GB of unified memory at 819 GB/s. That’s enough to run Llama 4 Scout (109B MoE) at reasonable quantization, or DeepSeek-R1 70B at Q8 with room to spare. The 96GB config starts around $3,999. There is no higher memory option — Apple does not offer a 192GB or 512GB M3 Ultra configuration; the M3 Ultra is a single fixed memory tier.

The Mac Studio M4 Max (up to 64GB, 546 GB/s on the upgraded 40-core GPU config, from around $1,999) does about 20 to 25 tok/s on a 70B Q4 model and around 50 tok/s on 8B. If you want top-of-line Mac Studio and don’t need the CUDA stack, the M3 Ultra is currently the stronger inference box — more bandwidth, more memory ceiling. The M4 Max is faster on smaller models where 64GB is enough, and it’s cheaper. But if you’re buying for capacity, the M3 Ultra is the one to get.

The MacBook Pro M5 Max (up to 128GB, 460 to 614 GB/s, from around $3,900) is in the same ballpark as the M4 Max Mac Studio. The MacBook Pro M5 Pro (up to 64GB, 307 GB/s, from around $2,200) lands around 10 to 15 tok/s on 70B when it fits. The Mac mini M4 Pro (up to 64GB, 273 GB/s, from around $1,400) is at the bottom of this tier, roughly 5 to 8 tok/s on 70B (usable for background work, slow for interactive use).

Apple wins when you want one box, you want silence, and you want to run models that simply won’t fit on a normal GPU. It loses when raw tokens per second and concurrency start to matter more than everything else.

Coherent NVIDIA appliance: DGX Spark and RTX Spark

The DGX Spark (128GB unified, 273 GB/s) launched at $3,999 and has since been bumped to $4,699 due to memory supply constraints. It’s not a bandwidth monster. It’s a compact NVIDIA CUDA appliance with 128GB of coherent memory and NVFP4 support that hasn’t fully matured yet but is genuinely interesting for the future of quantization.

NVIDIA just announced the RTX Spark at Computex 2026, and it’s essentially the same architectural premise in a consumer form factor. The RTX Spark is a superchip (Grace ARM CPU with up to 20 cores, Blackwell GPU with 6,144 CUDA cores, up to 128GB unified LPDDR5X) built for Windows laptops and compact desktops, co-developed with Microsoft. OEMs including ASUS, Dell, HP, Lenovo, and Microsoft Surface are targeting fall 2026. This is the first time the full CUDA stack ships inside a thin Windows laptop, which is genuinely new even if the rest of the specs feel familiar.

The bandwidth story is the same as the DGX Spark: 273 GB/s from LPDDR5X, which puts real numbers on the table. On a 70B Q4 model, the DGX Spark decodes at around 3 tok/s. On 8B it’s around 40 to 50 tok/s, where smaller models are more compute-bound so the CUDA advantage shows up. The Mac Studio M4 Max at $2,000 does 20 to 25 tok/s on 70B (6 to 8x faster on the large model that actually justifies the 128GB box) and is likely cheaper than a premium RTX Spark laptop will land.

NVIDIA is also marketing the RTX Spark with a claim of 1 petaflop of AI performance, which is technically accurate the same way claiming a car “can go 150 mph” on a track under ideal conditions is technically accurate. That figure is FP4 with structured sparsity enabled, a 2x multiplier that only applies when model weights are at least 50% zeros. Most aren’t. At FP16 it’s closer to 250 teraflops. The same trick shows up for the RTX 4090 (1,321 TOPS with sparsity vs. around 660 dense) in the TOPS section below, but the RTX Spark version is more brazen because the gap is bigger and the format (FP4) is less established in real inference pipelines.

Pricing for RTX Spark consumer devices isn’t confirmed yet, but premium laptops will likely land somewhere in the $2,000 to $3,500 range given TSMC 3nm fabrication and LPDDR5X memory costs. If that holds, the value proposition against a Mac Studio M4 Max is rough: same memory, half the bandwidth, different OS, and CUDA dependency to justify the premium. The CUDA software story is real and matters to developers who need it. But if you’re just running inference, the bandwidth gap follows you everywhere.

There’s also the Windows on ARM compatibility question, which has a rough history. The Surface RT (2012) was a fiasco, Windows 10 ARM limped along for years with an emulation layer that was slow and incomplete, and even the first Snapdragon X Elite machines in 2024 had real gaps in driver support. The current picture is genuinely better. Windows 11’s Prism emulator runs most x86 apps with around 10 to 15% overhead, and over 93% of commonly used apps run natively as of early 2026. The remaining compatibility failures are almost entirely kernel-mode drivers: anti-cheat software, some security tools, legacy hardware drivers. Jensen Huang claimed at Computex that RTX Spark will run “every Windows app ever made,” which is the kind of thing a CEO says at a keynote and which the kernel-mode driver situation makes not quite true. For most users running standard productivity and developer software, the platform is fine. If you depend on specific kernel-level tooling (corporate endpoint security with no ARM64 driver, game anti-cheat, some DAW plugins), you’ll want to check before buying.

NVIDIA’s roadmap has Vera Rubin with LPDDR6 memory after this, which should improve the bandwidth ceiling meaningfully. The first-generation RTX Spark is an interesting platform bet, not a current-generation performance win.

Both the DGX Spark and RTX Spark are developer appliances first. Full NVIDIA stack, 128GB in a small box, not optimizing for raw decode speed. The GB10-class machines like the ASUS Ascent GX10 belong here too.

First real x86 unified-memory contender: Strix Halo

AMD’s Ryzen AI Max / Strix Halo is the most interesting new category in local AI hardware, in my opinion. Up to 128GB of LPDDR5X at ~256 GB/s, with up to ~96GB assignable as GPU memory on Windows. The Framework Desktop implements this starting at $1,099 for 32GB, $1,599 for 64GB, and $1,999 for the 128GB config. Real-world decode on a Llama 70B Q4 model lands around 4 to 5 tok/s, similar to the DGX Spark (nearly the same bandwidth ceiling, nearly the same result) and well below the Mac Studio M4 Max. On 8B models it does around 40 to 45 tok/s, comfortable for interactive use.

This is not just another mini PC. It’s the first mainstream x86 box where local AI starts feeling like a serious hardware class rather than a laptop pretending very hard. The value proposition at 128GB for $1,999 is hard to beat, especially if you’re running MoE models where capacity matters more than raw bandwidth. You’re paying for the ability to load the model, not for fast decode once it’s loaded.

The fully open source bet: Tenstorrent

Tenstorrent’s Wormhole n300 (24GB, 576 GB/s, around $1,400) and Blackhole p150 (32GB, 512 GB/s, around $1,400 with 800G interconnect) run a fully open source stack from top to bottom. I’m genuinely rooting for this one to mature. The AI world needs more fully open stacks, and the bandwidth is competitive with mid-tier discrete GPUs. The Blackhole’s interconnect makes multi-card scaling worth watching as the software ecosystem develops.

RISC-V: SpacemiT K3

SpacemiT is a Chinese fabless semiconductor company that has been quietly building a RISC-V roadmap worth checking out. Their K1 chip shipped over 150,000 units, which is an unusually high number for RISC-V. The K3 is their follow-up, and it’s a meaningful step up.

The K3 packages eight X100 RISC-V CPU cores (up to 2.4 GHz) and eight A100 AI cores into a single SoC, with RVA23 compliance (a RISC-V application profile), LPDDR5-6400 (low-power memory), and 60 TOPS of AI compute at INT8/FP8/FP16/BF16. For context: the K1 topped out at 2 TOPS and 16GB of LPDDR4. The K3 is a 30x jump in AI performance and doubles the max memory to 32GB.

The 60 TOPS figure comes from the A100 AI core cluster, not a discrete NPU. SpacemiT claims the platform can run a 30B-parameter model at more than 10 tokens/second. I’d wait for independent benchmarks before taking that at face value, but at least the claim is specific enough to be falsifiable.

Early CPU benchmarks from CNX-Software (January 2026) show multi-core 7-Zip performance slightly better than a Rockchip RK3588, single-core slightly below a Raspberry Pi 5. Memory bandwidth (memcpy ~5,947 MB/s) is closer to Pi 5 territory than RK3588. AES-256 single-core performance lags both. Competitive general-purpose compute for a RISC-V chip, not competitive with ARM at the same price point. Not yet.

K3-based boards coming to market

Several boards are either shipping or in active pre-order as of May 2026:

Board	Maker	Price	RAM	Status	Key I/O
Jupiter 2	Milk-V	$300--$575	8/16/32GB LPDDR5	Pre-order	10GbE SFP+, PCIe Gen3 x4, Wi-Fi 6, M.2 NVMe
BPI-SM10 Dev Kit	Banana Pi	TBD	up to 32GB LPDDR5	Announced	Compute module format
K3 Pico-ITX SBC	SpacemiT/Sipeed	$299+	up to 32GB LPDDR5	Shipping	10GbE, UFS up to 256GB, M.2 NVMe
AIBOX-K3	Firefly	$349--$689	8/32GB	Available	Industrial edge AI box, fanless
DC-ROMA Mainboard III	DeepComputing	$699--$999	16/32GB LPDDR5	Pre-order (ships June 2026)	Framework Laptop 13 drop-in mainboard

The Milk-V Jupiter 2 is the most compelling entry point at $300 for 8GB: Pico-ITX form factor, 10GbE SFP+, PCIe Gen3 x4, Wi-Fi 6/BT 5.2, and an aluminum enclosure with a built-in fan. The DeepComputing board is a different thing entirely. It turns a Framework Laptop 13 into a RISC-V laptop, which is either a compelling experiment or a $699 curiosity depending on your use case. Firefly also announced the CSC2-N48SPK3, a 2U rack server with 48 K3 nodes (2,880 TOPS aggregate) starting around $38,800. That one is squarely for researchers and the “I want a RISC-V cluster” crowd.

Linux and inference stack

The K3 is RVA23-compliant, which matters because Ubuntu 25.10 and later mandate RVA23. The K1 was excluded; the K3 is not. Canonical officially supports Ubuntu 26.04 LTS on K3 platforms. The kernel shipping on current hardware is 6.12.16. Initial mainline support (device tree sources under arch/riscv/boot/dts/spacemit/) landed in Linux 7.0, with peripheral driver expansion queued for 7.1.

RVV 1.0 is implemented with 256-bit vector registers per core. For llama.cpp specifically: SpacemiT’s bianbu downstream fork has the best-optimized build path, since peak AI performance requires their GCC toolchain. Upstream llama.cpp has RISC-V RVV support, but expect lower throughput until the A100 AI core backend matures further.

Who this is for

Hobbyists, RISC-V ecosystem enthusiasts, and developers who want to build or test RISC-V-native software. If you’re comparing the Jupiter 2 against a Raspberry Pi 5 or Orange Pi 5 on pure price/performance, you’ll be disappointed. If you’re deliberately targeting a non-ARM, non-x86 architecture (for software portability work, embedded Linux development, or just the novelty of running actual LLM inference on RISC-V hardware), the K3 is the most capable option shipping today.

The AI PC trap

Most machines wearing an “AI PC” sticker are still bandwidth-starved in any practical sense. Snapdragon X Elite (~135 GB/s), Intel Lunar Lake (~136 GB/s), MacBook Air M5 (~153 GB/s), Snapdragon X2 Elite (~152 to 228 GB/s depending on SKU). On an 8B Q4 model you’re looking at 15 to 25 tok/s, which is usable. Try to push a 13B dense model and you’re dropping below 15 tok/s on most of these. Anything bigger either doesn’t fit or crawls. These are fine machines for small models, personal assistants, edge workloads. They are not serious local inference hardware for anything larger than a 7 to 8B dense model. Physics still applies, which is inconvenient but consistent.

The gimmicks section (or: technically possible doesn’t mean useful)

A pattern keeps coming up in local AI discussions that costs people real time and money chasing something that doesn’t work well in practice.

The pitch goes like this: “My hardware doesn’t have enough memory, but I can still run a big model by only loading part of it at a time.” This is usually presented as a clever hack. Sometimes it is. More often it’s a performance cliff dressed up as a feature.

Layer offloading. Tools like llama.cpp let you split model layers between GPU VRAM, system RAM, and even disk. The flag is -ngl (number of GPU layers). When you don’t have enough VRAM for the whole model, the remaining layers stay in system RAM and run on the CPU. The problem is that every token has to pass through those CPU-side layers, which run at system RAM bandwidth, a fraction of what VRAM delivers, with intermediate results crossing the PCIe bus between GPU and CPU along the way. Real-world numbers here are brutal: people running 70B models with partial CPU offloading report around 1 to 3 tokens per second. That’s technically running the model. It’s also roughly the speed of reading text out loud to yourself. Not useful for interactive work.

Disk offloading. Some tools and frameworks support streaming model weights from NVMe directly. Modern NVMe drives can hit 7 GB/s reads in ideal conditions, which sounds fast until you realize your GPU memory bandwidth is 10 to 100x that. The energy penalty alone is significant: recent research puts SSD-offloaded decode at roughly 3 to 4x the energy cost versus in-memory inference on comparable hardware. Token generation with disk offload in practice tends to land below 1 token per second. I’ve seen people run 405B models this way. I’ve also seen them wait two minutes for a 60-token response.
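The cliff in both offloading cases falls out of simple arithmetic. Here’s a minimal sketch that treats the model as split across a fast tier (VRAM) and a slow tier (system RAM or NVMe), with each tier streaming its share of the weights once per token. The function name is mine, and the 60 GB/s system RAM figure in the example is an assumed number for illustration, not a measurement.

```python
def two_tier_ceiling_tok_s(model_gb: float, fast_gb: float,
                           fast_bw_gb_s: float, slow_bw_gb_s: float) -> float:
    """Decode ceiling when part of the model sits in a slower memory tier.

    Per token, the fast tier streams its share of the weights at fast_bw_gb_s
    and the slow tier streams the remainder at slow_bw_gb_s. The two times add
    up because layers execute one after another.
    """
    in_fast = min(model_gb, fast_gb)
    in_slow = model_gb - in_fast
    seconds_per_token = in_fast / fast_bw_gb_s + in_slow / slow_bw_gb_s
    return 1.0 / seconds_per_token


if __name__ == "__main__":
    model_gb = 35.0  # roughly a 70B model at 4 bits per weight
    # 24GB card at 1008 GB/s with the rest in system RAM (assumed 60 GB/s).
    print(f"RAM offload:  {two_tier_ceiling_tok_s(model_gb, 24, 1008, 60):.1f} tok/s")
    # Same card with the rest streamed from a 7 GB/s NVMe drive.
    print(f"NVMe offload: {two_tier_ceiling_tok_s(model_gb, 24, 1008, 7):.2f} tok/s")
    # Everything in VRAM on a hypothetical 48GB card at the same bandwidth.
    print(f"All in VRAM:  {two_tier_ceiling_tok_s(model_gb, 48, 1008, 7):.1f} tok/s")
```

Even as an optimistic ceiling, the slow tier dominates: a third of the model sitting behind a 7 GB/s link costs far more time per token than the two thirds sitting in VRAM.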

Extreme quantization. Q4 is excellent; Q5 and Q8 are great when you can afford the memory for them. The cliff is at the bottom. Q2 quantization degrades quality enough that for many use cases you’d be better off running a smaller, better-quantized model. A Q2 70B model often loses to a Q4 7B on reasoning tasks while using four times the memory. Extra parameters only help if the quantization leaves enough precision to use them.

The 30 tokens-per-second floor. For interactive use, actual back-and-forth conversation or coding assistance where you’re watching the output stream, 30 tok/s is roughly where it starts feeling like a tool rather than a waiting exercise. Below 15 tok/s it becomes noticeable. Below 5 tok/s it’s painful regardless of model quality. For batch processing or background tasks, slower is tolerable. But if you’re evaluating a hardware setup for daily driving, “it runs” and “it’s usable” are different things.

The test I’d apply: if your setup produces tokens slower than you read them, you’re probably past the gimmick threshold for interactive use.
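The thresholds above are easy to turn into a quick check against any benchmark number you’re handed. This is a sketch using the rough cutoffs from the text (30, 15, and 5 tok/s); the bucket labels, especially the one for the 15 to 30 band, are my own shorthand.

```python
def response_seconds(tokens: int, tok_s: float) -> float:
    """Wall-clock time to stream a response of the given length."""
    return tokens / tok_s


def interactive_feel(tok_s: float) -> str:
    """Bucket a decode speed using the rough thresholds from the text."""
    if tok_s >= 30:
        return "feels like a tool"
    if tok_s >= 15:
        return "acceptable"  # my own label; the text doesn't name this band
    if tok_s >= 5:
        return "noticeably slow"
    return "painful"


if __name__ == "__main__":
    for speed in (45, 20, 8, 0.5):
        wait = response_seconds(300, speed)
        print(f"{speed:>5} tok/s: {interactive_feel(speed)}, "
              f"{wait:.0f}s for a 300-token answer")
```

The two-minute wait for a 60-token response mentioned earlier is the same arithmetic: 60 tokens at 0.5 tok/s is 120 seconds.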

What models are you actually trying to run?

Hardware decisions only make sense relative to the models you’re targeting. Open source models in 2026 have gotten genuinely good, close enough to frontier API models on many tasks that the conversation has shifted from “is open source good enough?” to “which open source model is right for this?”

The big architectural shift is MoE (Mixture of Experts). These models have enormous total parameter counts but only activate a fraction of them per token. That changes the capacity-vs-speed tradeoff dramatically. A model that “needs” 192GB to load might only activate 17B parameters per forward pass.
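A small sketch shows why MoE changes the math. The simplifying assumption is that memory footprint scales with total parameters while per-token memory traffic scales with active parameters; real models also read shared layers and router weights every token, so treat the speed figure as an upper bound. The class and method names are mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEModel:
    name: str
    total_params_b: float   # billions of parameters that must be loaded
    active_params_b: float  # billions of parameters touched per token

    def load_gb(self, bits_per_weight: float) -> float:
        """Memory needed to hold all the weights."""
        return self.total_params_b * bits_per_weight / 8

    def read_per_token_gb(self, bits_per_weight: float) -> float:
        """Weights streamed from memory for each generated token."""
        return self.active_params_b * bits_per_weight / 8

    def decode_ceiling_tok_s(self, bits_per_weight: float,
                             bandwidth_gb_s: float) -> float:
        """Bandwidth-bound upper limit on decode speed."""
        return bandwidth_gb_s / self.read_per_token_gb(bits_per_weight)


if __name__ == "__main__":
    scout = MoEModel("Llama 4 Scout", total_params_b=109, active_params_b=17)
    print(f"load at 4-bit:    {scout.load_gb(4):.1f} GB")
    print(f"read per token:   {scout.read_per_token_gb(4):.1f} GB")
    print(f"ceiling @256GB/s: {scout.decode_ceiling_tok_s(4, 256):.1f} tok/s")
```

A dense 109B model on the same 256 GB/s box would have a ceiling under 5 tok/s. That gap is why high-capacity, modest-bandwidth machines pair well with MoE models.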

24 to 32GB (RTX 5090, RX 7900 XTX, Arc Pro B65, a maxed-out MacBook Air M5): Llama 4 Scout is just out of reach here. At Q4 it needs roughly 55 to 60GB (109B total, 17B active), so you need a second GPU or a larger box. What realistically fits: Qwen3 30B-A3B (only 3B active per token), Gemma 4 26B MoE (~14GB at Q4, 85+ tok/s on consumer hardware, genuinely excellent for the size), Phi-4 14B for reasoning, and Qwen2.5-Coder 14B for coding work. Useful territory, not the frontier.

48 to 64GB (Mac Studio M4 Max, MacBook Pro M5 Pro, Framework Desktop 64GB): Dense 30 to 40B models at Q4 land comfortably here. Llama 4 Scout (109B MoE, 17B active) fits at reasonable quantization. Qwen3 235B-A22B MoE needs more room, but the smaller Qwen3 variants are excellent here. This is where local AI starts feeling like a real tool rather than an experiment.

96GB (Mac Studio M3 Ultra, RTX PRO 6000, plus the DGX Spark and other 128GB configs): Llama 4 Scout at Q8, DeepSeek-R1 70B for serious reasoning, Qwen3 235B-A22B MoE with 22B active parameters. The DGX Spark and Framework Desktop stretch to 128GB, which adds some headroom. The Mac Studio M3 Ultra maxes at 96GB; that’s the only memory tier available. GPT-OSS-120B also fits here at Q4.

128GB+ (multi-GPU rigs, DGX Spark, Framework Desktop): The Mac Studio M3 Ultra doesn’t reach this tier — its ceiling is 96GB. To get to 128GB or beyond in a single box you’re looking at the DGX Spark, Framework Desktop, or multi-GPU NVIDIA setups. Llama 4 Maverick (400B MoE, 17B active) and DeepSeek-V3 (671B MoE) at aggressive quantization need this range or higher. If you need frontier-class open source models running locally with zero cloud dependency, multi-GPU is currently the path to get there.

The quadrant chart

Here’s how these platforms map when you combine memory capacity, bandwidth, and cost into a single picture. The vertical axis is a synthesized “memory performance” score explained in detail in the collapsed section below. The horizontal axis is platform cost. Only min and max configs are shown for platforms with multiple options.

quadrantChart
    title Local AI Hardware - Memory Performance vs Cost 2026
    x-axis Low Cost --> High Cost
    y-axis Low Performance - bandwidth weighted --> High Performance - bandwidth weighted
    quadrant-1 High Perf High Cost
    quadrant-2 High Perf Low Cost
    quadrant-3 Low Perf Low Cost
    quadrant-4 Low Perf High Cost
    RTX PRO 6000: [0.99, 0.91]
    RTX 5090: [0.48, 0.72]
    Mac Studio M3 Ultra: [0.45, 0.54]
    MBP M5 Max 128GB: [0.59, 0.53]
    DGX Spark: [0.54, 0.43]
    Framework 128GB: [0.21, 0.43]
    Radeon PRO W7900: [0.40, 0.42]
    RTX 4090: [0.27, 0.40]
    RX 7900 XTX: [0.08, 0.39]
    Arc Pro B65: [0.05, 0.28]
    Mac mini M4 Pro: [0.11, 0.27]
    Framework 64GB: [0.18, 0.25]
    TT Blackhole: [0.15, 0.22]
    TT Wormhole: [0.10, 0.21]
    MacBook Air M5: [0.10, 0.11]
    SpacemiT K3: [0.01, 0.02]

How the memory performance score is calculated + data table

The score on the vertical axis synthesizes two things: memory capacity (GB) and memory bandwidth (GB/s). Neither alone tells the full story. A box with massive bandwidth but tiny capacity runs out of useful models quickly, and a box with massive capacity but slow bandwidth produces tokens at a crawl.

Scoring method:

I normalized both dimensions independently across the full set of platforms (0 = worst in set, 1 = best in set), then combined them with bandwidth weighted at 65% and capacity at 35%. The y-axis is a tokens-per-second proxy, not a general memory score. Bandwidth drives tok/s directly; capacity determines which models fit but doesn’t affect how fast they run once loaded. A platform with more memory but slower bandwidth will score lower here even if it can run larger models — that tradeoff is real and intentional. The formula is:

capacity_score = (GB - min_GB) / (max_GB - min_GB)
bandwidth_score = (GB_s - min_GB_s) / (max_GB_s - min_GB_s)
performance_score = 0.35 * capacity_score + 0.65 * bandwidth_score

The dataset spans 8GB (low end) to 128GB (max single-box config shown) for capacity, and 51 GB/s (SpacemiT K3) to 1792 GB/s (RTX 5090 / PRO 6000) for bandwidth.

Cost axis: Normalized from ~$299 (SpacemiT K3) to ~$8,500 (RTX PRO 6000). I used street price midpoints where ranges exist.
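The scoring method can be sketched in a few lines of Python. The min and max constants come from the ranges stated above; the function names are mine, and the spot checks use rows from the data table.

```python
# Dataset ranges as stated in the text.
CAP_MIN_GB, CAP_MAX_GB = 8, 128
BW_MIN_GB_S, BW_MAX_GB_S = 51, 1792
COST_MIN_USD, COST_MAX_USD = 299, 8500


def normalize(value: float, lo: float, hi: float) -> float:
    """Min-max normalization: 0 = worst in set, 1 = best in set."""
    return (value - lo) / (hi - lo)


def performance_score(memory_gb: float, bandwidth_gb_s: float) -> float:
    """Bandwidth-weighted score: 35% capacity, 65% bandwidth."""
    capacity_score = normalize(memory_gb, CAP_MIN_GB, CAP_MAX_GB)
    bandwidth_score = normalize(bandwidth_gb_s, BW_MIN_GB_S, BW_MAX_GB_S)
    return 0.35 * capacity_score + 0.65 * bandwidth_score


def cost_score(cost_usd: float) -> float:
    """Position on the horizontal (cost) axis."""
    return normalize(cost_usd, COST_MIN_USD, COST_MAX_USD)


if __name__ == "__main__":
    platforms = [
        ("RTX 5090", 32, 1792, 4200),
        ("Mac Studio M3 Ultra 96GB", 96, 819, 3999),
        ("Framework 128GB", 128, 256, 1999),
    ]
    for name, gb, bw, usd in platforms:
        print(f"{name}: perf={performance_score(gb, bw):.2f} "
              f"cost={cost_score(usd):.2f}")
```

Changing the 0.35 / 0.65 weights is the quickest way to see how much the ranking depends on treating bandwidth as the dominant factor.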

Platform	Memory (GB)	Bandwidth (GB/s)	Cap Score	BW Score	Perf Score	Cost ($)	Cost Score
RTX PRO 6000	96	1792	0.73	1.00	0.91	8,500	1.00
RTX 5090	32	1792	0.20	1.00	0.72	4,200	0.48
Mac Studio M3 Ultra 96GB	96	819	0.73	0.44	0.54	3,999	0.45
MacBook Pro M5 Max 128GB	128	546	1.00	0.28	0.53	5,100	0.59
DGX Spark	128	273	1.00	0.13	0.43	4,699	0.54
Framework 128GB	128	256	1.00	0.12	0.43	1,999	0.21
Radeon PRO W7900	48	864	0.33	0.47	0.42	3,600	0.40
RTX 4090	24	1008	0.13	0.55	0.40	2,500	0.27
RX 7900 XTX	24	960	0.13	0.52	0.39	950	0.08
Arc Pro B65	32	608	0.20	0.32	0.28	750	0.05
Mac mini M4 Pro 64GB	64	273	0.47	0.13	0.25	1,400	0.13
Framework 64GB	64	256	0.47	0.12	0.24	1,599	0.16
TT Blackhole p150	32	512	0.20	0.26	0.24	1,400	0.13
TT Wormhole n300	24	576	0.13	0.30	0.24	1,400	0.13
MacBook Air M5 32GB	32	153	0.20	0.06	0.11	1,100	0.10
SpacemiT K3 Pico-ITX (8GB)	8	~51†	0.00	0.00	0.00	299	0.00
Milk-V Jupiter 2 (8GB)	8	~51†	0.00	0.00	0.00	300	0.00

Note: Cap Score and BW Score of 0.00 for the K3 boards reflect dataset minimums (8GB, 51 GB/s), not zero capability. All scores are relative to the range of platforms in this comparison. The 24GB discrete GPU cards score 0.13 on capacity because the scale starts at the 8GB K3 boards, not at 24GB: (24 - 8) / (128 - 8) ≈ 0.13.

†SpacemiT K3 LPDDR5-6400 theoretical peak bandwidth is ~51 GB/s, which corresponds to a 64-bit bus (6,400 MT/s × 8 bytes per transfer; a 32-bit bus would give half that). Both K3 boards anchor the low end of the bandwidth scale here. They belong in a different tier entirely. See the RISC-V section above for context on what these boards are actually for.
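Theoretical peak bandwidth is just transfer rate times bus width. Here’s a quick sketch of that arithmetic. The 64-bit bus for the K3 is an inference from the ~51 GB/s figure, not a confirmed spec, and the LPDDR5X-8000 on a 256-bit bus in the second example is likewise an assumed configuration that happens to match the ~256 GB/s quoted for Strix Halo.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s: float, bus_width_bits: int) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes).

    Each transfer moves bus_width_bits / 8 bytes.
    """
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


if __name__ == "__main__":
    # LPDDR5-6400 on an assumed 64-bit bus, and on a 32-bit bus for contrast.
    print(f"6400 MT/s x 64-bit:  {peak_bandwidth_gb_s(6400, 64):.1f} GB/s")
    print(f"6400 MT/s x 32-bit:  {peak_bandwidth_gb_s(6400, 32):.1f} GB/s")
    # Assumed LPDDR5X-8000 on a 256-bit bus.
    print(f"8000 MT/s x 256-bit: {peak_bandwidth_gb_s(8000, 256):.1f} GB/s")
```

Measured copy bandwidth will come in well under these peaks, which is consistent with the memcpy result quoted in the RISC-V section.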

A few things jump out. The RX 7900 XTX is the best value pure-bandwidth play if 24GB is enough for your models. The Framework 128GB is quietly the best price-to-capacity ratio in the whole field. The bandwidth score drags it down, but nothing else gives you 128GB of unified memory for $1,999. The DGX Spark is the most interesting chart anomaly: high capacity, middling bandwidth, high cost, and a software stack that might eventually justify all of it. The Mac Studio M3 Ultra at 96GB sits at a strong capacity/bandwidth sweet spot: it’s the only Apple Silicon option with Ultra-class bandwidth and the most memory you can get in a single Mac Studio today.

Understanding TOPS

Every AI chip ships with a TOPS number. TOPS stands for Tera Operations Per Second (one trillion integer operations per second). It sounds like a clean unit, and it is, as far as it goes.

The problems start when you try to compare TOPS numbers across vendors.

TOPS is almost always measured at INT8 precision, 8-bit integer arithmetic. INT8 is common in quantized inference because it’s faster and cheaper than FP32, and accuracy loss is usually acceptable. But vendors quote peak theoretical TOPS under ideal conditions: perfectly parallelized workloads, full hardware utilization, no memory stalls, supported operators only.

Real LLM inference rarely hits that ceiling. Memory bandwidth is usually the bottleneck, not compute. A large language model spends most of its time moving weights between memory and compute units, and if those weights don’t fit in fast on-chip memory (they usually don’t), you’re waiting on DRAM bandwidth regardless of how fast the arithmetic units are.

There’s also a precision equivalence problem. Apple quotes its Neural Engine at 38 TOPS for the M4, but that figure comes from counting INT8 operations as 2x the FP16 rate. That’s industry convention, not necessarily hardware reality. The ANE dequantizes INT8 weights to FP16 before compute, so the 2x multiplier is partly accounting. A 38 TOPS ANE and a 38 TOPS NPU from a different vendor are not the same thing.

The bigger issue is software maturity. A 2 TOPS NPU with mature, optimized software will outperform a 26 TOPS NPU with poor framework support for most real-world workloads. Apple’s Neural Engine is the clearest example: Core ML and the ANE runtime are deeply co-designed, operator coverage is comprehensive, and the toolchain handles quantization automatically. A Hailo-8 at 26 TOPS needs Hailo’s SDK and specific model conversion; operator coverage gaps are real and documented. For RISC-V platforms like the K3, the AI core stack is still maturing. The hardware headroom is there, but the software to actually cash out 60 TOPS on arbitrary LLM workloads isn’t yet at Apple’s or NVIDIA’s level.

Cross-platform TOPS comparison

Platform	TOPS	Precision	Scope	Notes
SpacemiT K1	2	INT8	AI cores	8-core RISC-V, K3 predecessor
SpacemiT K3	60	INT4/INT8/FP8/FP16/BF16	8 A100 AI cores	RVA23 RISC-V; software stack still maturing
Hailo-8L	13	INT8	Dedicated NPU	Used in Raspberry Pi AI HAT+
Hailo-8	26	INT8	Dedicated NPU	M.2 accelerator; ~2.5W; SDK-dependent
Apple M3 Neural Engine	18	FP16	NPU only	16-core ANE; mature Core ML stack
Apple M3 Ultra Neural Engine	~36	FP16	NPU only	32-core ANE (2x M3 die); estimated
Apple M4 Neural Engine	38	INT8 (conv.)	NPU only	Per Apple; INT8 to FP16 dequant in practice
Intel Core Ultra (Lunar Lake)	~47	INT8	NPU only	Core Ultra 9 288V; Copilot+ certified
Qualcomm Snapdragon X Elite	45	INT8	NPU only	Hexagon NPU; Windows on ARM
Qualcomm Snapdragon X2 Elite	80	INT8	NPU only	Latest gen (2025); 78% jump over X Elite
Tenstorrent Wormhole n150	~262 TFLOPs	FP8	Full chip	Not a traditional TOPS figure; 72 Tensix cores
NVIDIA RTX 4090	1,321	INT8 w/ sparsity	GPU tensor	Includes structured sparsity 2x multiplier
NVIDIA RTX 5090	~3,352	INT8 w/ sparsity	GPU tensor	With sparsity; ~1,677 TOPS dense

A few things worth calling out: the RTX 4090’s 1,321 TOPS includes structured sparsity, a 2x multiplier that only applies when model weights are 50% or more zero. Most aren’t. The card’s dense INT8 figure is closer to 660 TOPS. The Tenstorrent Wormhole reports FP8 TFLOPs rather than TOPS, which reflects a different architectural philosophy entirely. The SpacemiT K3’s 60 TOPS is dependent on a software stack that’s still being built.
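When you’re lining up spec sheets, it helps to strip the sparsity multiplier before comparing anything. A minimal sketch, assuming the quoted figure uses the 2x structured-sparsity multiplier described above (the function name and the default multiplier parameter are mine):

```python
def dense_tops(quoted_tops: float, includes_sparsity: bool,
               sparsity_multiplier: float = 2.0) -> float:
    """Convert a quoted TOPS figure to its dense equivalent.

    If the vendor number already assumes structured sparsity, divide the
    multiplier back out; otherwise return the figure unchanged.
    """
    if includes_sparsity:
        return quoted_tops / sparsity_multiplier
    return quoted_tops


if __name__ == "__main__":
    quoted = [
        ("RTX 4090", 1321, True),
        ("RTX 5090", 3352, True),
        ("Hailo-8", 26, False),
    ]
    for name, tops, sparse in quoted:
        print(f"{name}: quoted {tops}, dense {dense_tops(tops, sparse):.0f}")
```

This only normalizes one marketing variable. Precision, scope (NPU versus full chip), and software maturity still differ from row to row.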

Raw TOPS is a starting point, not an answer. Match the platform to the workload, check framework support for your model architecture, and benchmark before committing.

Which bottleneck are you buying?

Stop asking which hardware is best. Start asking which bottleneck you’re willing to pay to solve.

If you’re doing multi-agent workflows where you need fast concurrent inference, with multiple agents running in parallel each waiting on responses, bandwidth wins and you want discrete NVIDIA. If you’re running a single large reasoning model for deep analysis or long-context work, capacity wins and you want unified memory. If you’re experimenting and want the best flexibility per dollar, the Framework Desktop at 128GB or a Mac mini M4 Pro are hard to beat as starting points.

The local AI hardware market in 2026 is finally interesting enough that there’s no single right answer, which means the space has matured past the point where CUDA was the only viable path and a $10,000 GPU was the only serious option.

Sources and where to buy

Where to buy

NVIDIA RTX 5090 – Best Buy, Newegg, B&H Photo (stock is spotty, prices above MSRP)
NVIDIA RTX PRO 6000 Blackwell – Newegg, Amazon RTX PRO 6000, Micro Center, B&H Photo RTX PRO 6000
RTX 4090 – secondary market, eBay RTX 4090, Newegg used
Apple Mac Studio – Mac Studio M4 Max (up to 64GB), Mac Studio M3 Ultra (96GB)
Apple MacBook Pro – MacBook Pro 16" M5 Max 128GB
Apple Mac mini – Mac mini M4 Pro config selector
Apple MacBook Air – MacBook Air M5 configs
NVIDIA DGX Spark – NVIDIA Marketplace direct (enterprise ordering)
ASUS Ascent GX10 – ASUS store (system integrator quotes)
Framework Desktop – Framework order page (direct from manufacturer, 128GB config available)
AMD RX 7900 XTX – Newegg RX 7900 XTX, Amazon RX 7900 XTX, B&H Photo RX 7900 XTX
AMD Radeon PRO W7900 – AMD.com Radeon PRO, CDW PRO W7900, B&H Photo PRO W7900
AMD Radeon AI PRO R9700 – AMD.com AI PRO, CDW R9700
Intel Arc Pro B65 – Intel Arc Pro B-series, CDW Arc Pro B65
Tenstorrent Wormhole / Blackhole – Tenstorrent store (direct ordering)

I built a research section with AI and it might be useful to you too

Jared Watkins — Sat, 04 Apr 2026 00:00:00 +0000

I’ve been spending a lot of time lately exploring what AI is actually good for beyond writing code… which is where most people seem to stop. One thing I’ve landed on that I think is genuinely useful is using it to maintain a structured research knowledge base on technical topics I care about. So I built one, and it now lives on this site.

It covers cutting-edge datacenters (cooling systems, power infrastructure, robotic server management, construction), energy (solar, SMR nuclear, batteries, grid resources), and robotics (actuators, sensors, edge compute, aerial and ground drones). These aren’t random topics. They’re pretty tightly interconnected when you start pulling on the threads, and that overlap is a big part of why I find them interesting.

I work at Amazon, which means I’m adjacent to infrastructure at a scale that makes you think differently about where things are heading. The question of how you build, cool, power, and automate a datacenter isn’t abstract to me. It’s the physical substrate underneath everything I work on. So I want to understand it at the component level. Who’s actually making the immersion cooling systems? Which SMR designs are closest to permitted? What startups are supplying the actuation systems that will eventually run the robotic logistics inside these facilities? That kind of thing.

What I found when I started digging is that most publicly available research in this space is either too high-level (analyst report fluff) or too narrow (a single company’s press releases). There isn’t a lot of good connective tissue between, say, perovskite solar cell efficiency breakthroughs and the energy sourcing decisions being made for the next generation of AI training clusters. But those things are connected, and if you’re trying to understand where the puck is going – for investing, for career positioning, or just because you find it fascinating – the connections are where the useful insight lives.

The research section is my attempt to build that connective tissue. Each entry is structured like a well-maintained Wikipedia stub: what it is, why it matters, recent developments, key people and organizations, sources. The focus is on smaller companies, university spinouts, and component suppliers, not the IBMs and Googles of the world, which already have plenty of coverage. The interesting stuff is one or two layers down from the obvious names.

It’s also AI-maintained on a schedule. I set up automated updates that keep the entries reasonably fresh as new developments happen. I do review things but I’m not manually rewriting every entry every time something changes. It’s an experiment in whether you can use AI not just to generate content but to curate and maintain a living knowledge base over time. So far the answer seems to be yes, with appropriate skepticism about any specific claim that would benefit from verification.

If you’re researching similar topics, hopefully this saves you some time. Why burn your tokens on what I’ve already gathered? And if you notice something wrong or missing, let me know.

Fixing MySQL SSL Replication Errors

Jared Watkins — Tue, 07 May 2013 00:00:00 +0000

I spent two nights trying to get MySQL replication over SSL to work.. and kept hitting this generic error message: connection error 2026. After much searching and trying different things I finally found the solution that worked for me. If you are hitting this error, here’s a short list of things to check.

Searching around I found a few key things to look for when you see this error. First.. make sure your certificate CNs are unique. Some people mistakenly use the same strings when generating server and client certificates. When generating your CSR and going through all the cert detail questions, your CN would normally be the fully qualified domain name of the host you are generating the cert for. You can print the subject of a cert (which includes the CN) with:

openssl x509 -in <cert file> -noout -subject
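If there are more than a couple of certs to compare, the check can be scripted. The following is a rough Python sketch, not something from the original setup: it assumes you have already collected one subject line per cert from the openssl command above, and the helper names `common_name` and `duplicate_cns` as well as the example host names are made up for illustration.

```python
import re
from collections import Counter


def common_name(subject_line):
    """Pull the CN out of one line of `openssl x509 -noout -subject` output."""
    # Covers both the older "/CN=host" layout and the newer "CN = host" one.
    match = re.search(r"\bCN\s*=\s*([^,/\n]+)", subject_line)
    return match.group(1).strip() if match else None


def duplicate_cns(subject_lines):
    """Return the CNs that show up on more than one cert."""
    counts = Counter(common_name(line) for line in subject_lines)
    return sorted(cn for cn, n in counts.items() if cn is not None and n > 1)


# Hypothetical subject lines; paste in your own openssl output here.
subjects = [
    "subject= /C=US/ST=WA/O=Example/CN=db1.example.com",
    "subject=C = US, ST = WA, O = Example, CN = db1.example.com",
    "subject=C = US, ST = WA, O = Example, CN = db2.example.com",
]
print(duplicate_cns(subjects))
```

Any name this prints is shared by at least two certs and is worth regenerating.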

Then I would check that you can connect manually from your replication slave to the master.

mysql -h <master server> --ssl -u <replication user on master>

After you connect, make sure your session is using SSL by issuing the command “\s”. It should list something like this: SSL: Cipher in use is DHE-RSA-AES256-SHA

In my case the manual connection was working fine.. but not the replication.

The problem I had was actually with the paths used in the ‘CHANGE MASTER TO’ line on the slave. When setting values for MASTER_SSL_CA, MASTER_SSL_CERT, and MASTER_SSL_KEY you should specify the complete path to the files. The docs suggest otherwise.. in combination with MASTER_SSL_CAPATH. I found that particular setting wasn’t used the way I expected.. so I left it out and went with full path names.
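To make the full-path point concrete, here is a small Python sketch that builds a statement of that shape and refuses relative paths. It is an illustration under assumptions, not the exact statement from the setup described here: the helper `change_master_ssl_sql` is hypothetical, and the host name, user, and file locations are placeholders.

```python
import posixpath


def change_master_ssl_sql(host, user, ca, cert, key):
    """Build a CHANGE MASTER TO statement with complete paths for the SSL files."""
    paths = {"MASTER_SSL_CA": ca, "MASTER_SSL_CERT": cert, "MASTER_SSL_KEY": key}
    for option, path in paths.items():
        # Reject relative paths up front, since those were the trouble spot above.
        if not posixpath.isabs(path):
            raise ValueError(f"{option} should be a complete path, got {path!r}")
    lines = [
        "CHANGE MASTER TO",
        f"  MASTER_HOST='{host}',",
        f"  MASTER_USER='{user}',",
        "  MASTER_SSL=1,",
    ]
    lines += [f"  {option}='{path}'," for option, path in paths.items()]
    # Swap the trailing comma on the last option for the closing semicolon.
    lines[-1] = lines[-1].rstrip(",") + ";"
    return "\n".join(lines)


# Placeholder values; substitute your own host, user, and cert locations.
print(change_master_ssl_sql(
    "master.example.com",
    "repl",
    "/etc/mysql/ssl/ca-cert.pem",
    "/etc/mysql/ssl/client-cert.pem",
    "/etc/mysql/ssl/client-key.pem",
))
```

The sketch leaves out MASTER_PASSWORD and the binary log coordinates, and it does no quoting of its inputs, so treat the output as a template to review rather than something to pipe straight into the server.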

Hope that’s helpful for someone else!

Clearwire In Seattle + Easy Google Mapping

Jared Watkins — Sun, 12 Aug 2012 00:00:00 +0000

I’m living in a part of town that has no good option for broadband internet. That’s very annoying as in some parts of town you can get Verizon FiOS.. and in most of the rest you can get Comcast, which isn’t terrible. Where I am, your big choice is a company called Broadstripe.. which is so bad even the employees blog about it. So with that I decided to try Clearwire.. the WiMAX broadband provider.

(Sidenote: I don’t like it when companies use names that are easily confused or otherwise in common use.. makes it very difficult to research them. So while they may call themselves ‘Clear’ now.. I’m sticking with the original Clearwire.)

Clearwire operates a network of wireless towers to provide broadband internet (fixed or mobile) in the 2 GHz range. That means their signals don’t penetrate buildings or trees very well and line of sight is best. In my case I’m going with one of their fixed wireless devices that includes an external port for a directional antenna. Where I am is technically listed as a dead zone on their map.. but I find that if I’m on the 3rd floor I get a good signal.. 4 out of 5. In periodic testing I’ve seen speeds of 4 Mbps down and 1 Mbps up in the best times and about half that in the worst times, which is not bad for $50 a month in a ‘dead zone’.

Since a directional antenna could improve on that… especially when the weather is bad (and this is Seattle we are talking about) I thought it would be good to know where exactly the closest towers are so I could choose the best room placement and help aim the antenna. I did a little research and found a site that has a database of tower operators and pulled out Clearwire antenna locations all over the West and East side of Seattle. I then found a pretty slick way of taking that spreadsheet data and putting it almost directly into an embedded Google map. Check it out on my Project Page: Seattle Clearwire Towers. The tower data is at antennasearch.com and the mapping tool is at batchgeo.com.
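The map answers the "where" visually, but for aiming an antenna a compass bearing is handy too. The following is a minimal Python sketch of how the same spreadsheet data could be ranked by distance and turned into bearings. It is a separate illustration, not part of the batchgeo workflow: the function names are hypothetical, the coordinates are placeholders rather than real tower locations, and the distance uses the standard haversine formula on a spherical Earth.

```python
import math

EARTH_RADIUS_KM = 6371.0


def distance_and_bearing(lat1, lon1, lat2, lon2):
    """Great-circle distance (km) and initial bearing (degrees) from point 1 to point 2."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    # Haversine formula for the distance.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    dist = 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
    # Initial bearing, where 0 is north and 90 is east.
    y = math.sin(dlmb) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb)
    bearing = (math.degrees(math.atan2(y, x)) + 360) % 360
    return dist, bearing


def nearest_towers(home, towers, count=3):
    """Return up to `count` (name, km, bearing) rows, closest tower first."""
    home_lat, home_lon = home
    ranked = []
    for name, lat, lon in towers:
        dist, bearing = distance_and_bearing(home_lat, home_lon, lat, lon)
        ranked.append((name, dist, bearing))
    ranked.sort(key=lambda row: row[1])
    return ranked[:count]


# Placeholder coordinates for illustration only, not real tower locations.
home = (47.60, -122.30)
towers = [
    ("tower-a", 47.61, -122.30),
    ("tower-b", 47.60, -122.25),
    ("tower-c", 47.55, -122.35),
]
for name, dist, bearing in nearest_towers(home, towers):
    print(f"{name}: {dist:.2f} km at {bearing:.0f} degrees")
```

The bearing is relative to true north, so a magnetic compass reading would need the local declination applied, and line of sight still matters more than raw distance.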