How Die-to-Die Interconnect Latency Is Reshaping Chiplet Disaggregation Decisions

P. Nakamura
4 min read

Die-to-die latency is not the sexiest number in a chiplet spec sheet. Bandwidth density gets the headlines. Bump pitch earns the conference talks. But latency, measured in single-digit nanoseconds that seem almost trivially small, is quietly becoming the deciding factor in which functions engineers will disaggregate into separate chiplets and which they will not.

Here's the uncomfortable reality: breaking a monolithic SoC into chiplets always costs you something in signal travel time. Even a 2 mm hop across a silicon interposer introduces roughly 100–200 ps of propagation delay, plus serialization overhead, plus whatever the PHY layer adds on both ends. Stack those up across multiple die boundaries in a complex multi-chiplet design and you're easily at 2–5 ns round-trip for a single transaction. That's negligible for memory streaming. It is absolutely not negligible for a tightly coupled cache coherency protocol.
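To make the stacking effect concrete, here is a minimal sketch of the budget arithmetic. Only the 150 ps wire delay is taken from the range quoted above; the serialization and PHY figures, along with the `Crossing` and `round_trip_ns` names, are illustrative assumptions and not measurements from any particular design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Crossing:
    """One die-to-die boundary crossing. All values are illustrative assumptions."""
    wire_ps: float         # propagation delay across the interposer trace
    serdes_ps: float       # serialization plus deserialization overhead
    phy_ps_per_end: float  # PHY latency at each end of the link

    def one_way_ps(self) -> float:
        # The PHY cost is paid twice: once on the sending die, once on the receiver.
        return self.wire_ps + self.serdes_ps + 2 * self.phy_ps_per_end


def round_trip_ns(crossings: list[Crossing]) -> float:
    """Round-trip latency in ns; request and response each cross every boundary."""
    one_way_ps = sum(c.one_way_ps() for c in crossings)
    return 2 * one_way_ps / 1000.0


# Two boundaries in the path (for example, compute die -> I/O die -> compute die).
hop = Crossing(wire_ps=150, serdes_ps=250, phy_ps_per_end=300)
print(f"{round_trip_ns([hop, hop]):.1f} ns round trip")  # prints "4.0 ns round trip"
```

With these assumed numbers the wire itself contributes only 15% of each crossing; the rest is serialization and PHY overhead, which is why the total lands in the nanosecond range even for millimeter-scale hops.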

Where Latency Bites First

The disaggregation decisions that hurt most are the ones involving coherent shared state. Last-level cache slices are a prime example. AMD's chiplet approach with Zen CPUs works because each core complex die (CCD) manages its own L3; cross-die coherency traffic goes through the I/O die, and the latency penalty is accepted and accounted for in the NUMA-like topology. Intel's Foveros-based designs for client chips take a different tack, keeping the latency-sensitive compute logic stacked tightly over the base die. Neither approach is wrong; they're calibrated to different latency budgets.

For AI inference accelerators, the calculus shifts again. Matrix math doesn't care much about nanosecond round trips; what it cares about is sustained bandwidth. This is precisely why disaggregating HBM from compute logic into a 2.5D CoWoS layout is a clean win: latency tolerance is high, bandwidth density is everything, and the interposer delivers.

But disaggregate the tensor cores themselves from the SRAM scratchpad that feeds them? Now you're in trouble. The scratchpad access pattern is low-latency by design; it exists to hide DRAM latency. Adding a die-to-die hop between the compute tile and its scratchpad reintroduces the very latency you built the scratchpad to eliminate.
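A rough way to see the damage is to express the hop in clock cycles. The sketch below assumes a 4-cycle local scratchpad access, a 1.5 GHz compute clock, and a 2 ns die-to-die round trip; all three figures and both function names are hypothetical, chosen only to show the arithmetic.

```python
import math


def added_cycles(hop_round_trip_ns: float, clock_ghz: float) -> int:
    """Whole clock cycles spent waiting on one extra die-to-die round trip."""
    # ns * GHz = cycles; round up because a partial cycle still stalls the pipeline.
    return math.ceil(hop_round_trip_ns * clock_ghz)


def remote_access_cycles(local_cycles: int, hop_round_trip_ns: float, clock_ghz: float) -> int:
    """Scratchpad access latency once the scratchpad sits on a different die."""
    return local_cycles + added_cycles(hop_round_trip_ns, clock_ghz)


local = 4  # assumed on-die scratchpad latency, in cycles
remote = remote_access_cycles(local, hop_round_trip_ns=2.0, clock_ghz=1.5)
print(f"{local} cycles on-die, {remote} cycles across the die boundary")
# prints "4 cycles on-die, 7 cycles across the die boundary"
```

Under these assumptions the access latency nearly doubles, and the gap widens as the clock gets faster, since the hop cost in nanoseconds stays fixed while the cycle time shrinks.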

The PHY Tax

Every die-to-die interconnect standard (UCIe, BoW, AIB, XSR) carries a PHY overhead that dominates the latency budget at short distances. UCIe's short-reach PHY targets sub-2 ns latency in the best case, but that's for a point-to-point link with no protocol layered above it. Add a protocol layer (PCIe, CXL), and you're looking at 10–20 ns or more, depending on credit-return timing and packetization overhead.

This is why the industry is bifurcating into two tiers of die-to-die connectivity:

graph TD
    A[Chiplet-to-Chiplet Traffic] --> B{Latency Sensitive?}
    B -->|Yes| C[Native D2D PHY - UCIe SR / AIB]
    B -->|No| D[Protocol-Based - CXL / PCIe]
    C --> E[Sub-5ns, tight bump pitch]
    D --> F[10-50ns, longer reach]
    E --> G[Cache, scratchpad, control planes]
    F --> H[Memory expansion, I/O, accelerators]

The native PHY path (short reach, low overhead, tight bump pitch) is reserved for functions where latency matters enough to justify the routing complexity and the power cost of dense interconnect. The protocol-based path absorbs everything else.
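The two-tier split in the diagram can be sketched as a simple selection function. The 10 ns threshold is the lower bound of the protocol-based range shown in the diagram; the per-function latency budgets in the example, and the names `Tier` and `choose_tier`, are assumptions for illustration only.

```python
from enum import Enum


class Tier(Enum):
    NATIVE_PHY = "native D2D PHY (UCIe SR / AIB)"
    PROTOCOL = "protocol-based (CXL / PCIe)"


# Lower bound of the protocol-based latency range quoted in the diagram.
PROTOCOL_FLOOR_NS = 10.0


def choose_tier(latency_budget_ns: float) -> Tier:
    """Pick the connectivity tier that can meet a round-trip latency budget."""
    # A budget tighter than the protocol floor cannot be met by CXL or PCIe,
    # so the traffic must ride the native PHY; everything else goes protocol-based.
    if latency_budget_ns < PROTOCOL_FLOOR_NS:
        return Tier.NATIVE_PHY
    return Tier.PROTOCOL


# Hypothetical budgets for the function classes named in the diagram.
budgets_ns = {"cache coherency": 4.0, "scratchpad": 2.0, "memory expansion": 50.0, "I/O": 40.0}
for name, budget in budgets_ns.items():
    print(f"{name}: {choose_tier(budget).value}")
```

A real partitioning flow would also weigh reach, power, and bump count, but starting from the latency budget keeps the choice of interconnect subordinate to the requirement it has to meet.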

What This Means for Disaggregation Strategy

Not every block in an SoC is a candidate for chiplet separation, and latency tolerance is the first filter to apply, before yield math, before reticle limits, before any packaging cost modeling.

A useful mental model: if the block communicates with another block at rates above ~10 GB/s and requires round-trip latency below ~10 ns, you have a coherency-class interface. Separate those two blocks at your peril unless you're prepared to redesign the protocol to tolerate the gap, or to stack the two dies face-to-face with hybrid bonding at pitches below 10 µm.
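The rule of thumb above translates directly into a first-pass filter. In this sketch the two thresholds come from the text, while the interface names and their bandwidth and latency figures are made-up examples, not data from a real SoC.

```python
BANDWIDTH_THRESHOLD_GB_S = 10.0  # from the rule of thumb above
LATENCY_THRESHOLD_NS = 10.0      # from the rule of thumb above


def is_coherency_class(bandwidth_gb_s: float, round_trip_budget_ns: float) -> bool:
    """True when an interface is both high-rate and latency-critical."""
    return (bandwidth_gb_s > BANDWIDTH_THRESHOLD_GB_S
            and round_trip_budget_ns < LATENCY_THRESHOLD_NS)


# Hypothetical interfaces: (name, bandwidth in GB/s, required round trip in ns).
interfaces = [
    ("core <-> L3 slice", 64.0, 5.0),
    ("compute <-> HBM", 400.0, 100.0),
    ("control <-> sensor hub", 1.0, 5.0),
]

for name, bw, latency in interfaces:
    verdict = "keep together" if is_coherency_class(bw, latency) else "candidate to split"
    print(f"{name}: {verdict}")
```

Only the first interface trips both thresholds. The HBM link moves far more data but tolerates the latency, which matches the earlier point that memory streaming disaggregates cleanly.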

Some teams are doing exactly that. TSMC's SoIC with hybrid bonding pushes bond pitch down to 4–9 µm, which shrinks the effective latency penalty dramatically. At that scale, a die-to-die hop starts looking less like a chip boundary and more like a long on-die wire. That's not an accident; it's the entire point.

The path forward for chiplet disaggregation isn't simply "break everything apart and connect with UCIe." It's a careful partitioning exercise where latency budgets drive the first cut, and the interconnect technology is chosen to fit the budget, not the other way around. Engineers who internalize that ordering will make better disaggregation decisions than those chasing bandwidth density numbers alone.
