How Chiplet Disaggregation Breaks Cache Coherency and What Architects Do About It
P. NakamuraSplitting a monolithic SoC into chiplets is not a packaging exercise that happens after the architecture is settled. The moment you physically separate compute dies, you've introduced latency that cache coherency protocols treat as a fault condition. Designers who learn this late pay for it in respins.
Photo by Jonas Svidras on Pexels.
Cache coherency on a monolithic die is fast enough to hide behind hardware. MESI or MOESI state machines run in the tens of cycles, and the on-die interconnect fabric delivers directory lookups before software even notices the cost. Separate those same caches across a UCIe bump array or an EMIB bridge, and the round-trip penalty climbs into the hundreds of cycles. Coherency traffic that was cheap becomes expensive. What was invisible is now a bottleneck measured in real application latency.
This is the disaggregation tax, and it compounds.
Every cache line that crosses a die boundary carries three costs: the electrical crossing penalty (serialization, retimer logic, and signal integrity overhead), the protocol overhead of maintaining directory state across physically separate agents, and the thermal variance that makes timing margins unequal across chiplets. None of these existed when the same cache hierarchy lived on a single piece of silicon.
graph TD
A[Compute Die A] -->|UCIe Link| C{Directory Controller}
B[Compute Die B] -->|UCIe Link| C
C --> D[HBM Memory Die]
C --> E[I/O Die]
A -->|Snoop Request| B
B -->|Snoop Response| A
The snoop path is where things get painful. On a monolithic die, a snoop broadcast reaches every cache in a few cycles. Across chiplets, that same broadcast must transit the die-to-die interconnect, and every chiplet in the domain must respond before the requesting agent can proceed. Scale that to eight compute chiplets in a server processor, and the worst-case snoop latency grows linearly with chiplet count unless the directory protocol is redesigned around the physical topology.
AMD addressed this directly in EPYC by engineering the coherency fabric to treat each CCD (core complex die) as a coherency island with explicit cross-die request queues. The Infinity Fabric carries coherency state across the package, but the protocol was designed with that latency budget as a first-class input, not a late constraint. That design decision predates the physical floorplan. It predates RTL sign-off on the individual dies.
Intel's Foveros approach for tiled client processors trades some coherency flexibility for tighter vertical coupling. By stacking compute tiles over a base die that owns the interconnect logic, snoop responses can be handled closer to the memory controller without crossing the package horizontally. Vertical integration reduces one dimension of latency while introducing different tradeoffs around thermal stacking and base die yield.
What both approaches share: the coherency protocol is a first-order input to the package design. Not the other way around.
Software-managed coherency is gaining traction as an alternative for workloads that can tolerate it. ML inference engines, certain DSP pipelines, and streaming data-path accelerators often have predictable data ownership patterns. When software can guarantee that die A owns a data region while die B processes a different one, hardware coherency across that boundary becomes unnecessary overhead. The protocol can be relaxed, and the die-to-die bandwidth that would carry snoop traffic gets redirected to productive data movement.
That model requires compiler and runtime awareness of die topology. The software stack has to know which physical die a thread is executing on and schedule data placement accordingly. It's a significant ask. But for purpose-built inference silicon, the performance gains from eliminating cross-die coherency traffic are large enough that vendors are building the toolchains to support it.
Hybrid approaches split the difference. Define a small coherency domain per cluster of cores on each compute die, handle local sharing inside that domain with full hardware coherency, and demote cross-die sharing to a software-assisted protocol with explicit ownership transfer. Intel's CXL.cache layer formalizes something close to this for attached accelerators, though the latency profile of PCIe-rooted CXL is still too high for tightly-coupled compute workloads.
The path forward for high-core-count disaggregated processors runs through directory protocols designed around known die-to-die latency numbers, snoop filter hierarchies that reduce broadcast scope across package boundaries, and software runtimes that schedule with physical topology in mind. None of this is speculative; production silicon from AMD, Intel, and Qualcomm has been shipping variants of these approaches for several product generations.
The hard part is that every time packaging density improves, the tradeoff space shifts. Tighter bump pitch reduces die-to-die latency. Hybrid bonding reduces it further. Each step changes what the coherency protocol can afford to do in hardware and what it must hand off to software. Architects working on 2nm chiplet designs today are making coherency decisions based on interconnect latencies that packaging engineers are still characterizing in the lab.
Get that handoff wrong and the protocol is either over-engineered for a link that turned out to be fast, or under-engineered for one that turned out to be slow. Both outcomes ship with a performance tax that no firmware update will fully recover.
Get Chiplet Ecosystem in your inbox
New posts delivered directly. No spam.
No spam. Unsubscribe anytime.