How Chiplet Clock Domain Boundaries Multiply Timing Closure Complexity Across Dies

P. Nakamura
4 min read

Clock domain crossings inside a monolithic die are annoying. Across a die-to-die interface, they become a different class of problem entirely.

When you split an SoC into multiple chiplets, every inter-die signal crosses a physical boundary where the clocks are not just different in frequency but sourced from separate PLLs on separate power domains, fabricated on potentially different process nodes. The usual CDC (clock domain crossing) analysis flow treats that boundary as a synchronizer insertion problem. Add a FIFO, close timing, move on. What it misses is that the latency budget, the jitter profile, and the spread of valid setup/hold windows are all fundamentally changed by the die-to-die interconnect medium itself.

Consider a UCIe-attached compute die talking to an HBM controller die. The PHY on each side runs its own forwarded-clock scheme, and the two sides negotiate a training sequence at bringup to align data and clock edges. That alignment is not static. Thermal gradients shift PLL lock points. Supply noise modulates VCO frequency. The synchronizer that closes timing at 25°C junction temperature may fail its setup margin at 90°C if the PDN on the HBM controller chiplet sags during a burst write. Timing sign-off at the die level does not catch this. It requires a co-simulation of the full package, and most flows do not support that natively.

The problem compounds when the two connected chiplets are built on different process nodes, or come from different foundries altogether. A compute tile on TSMC N3 and an I/O tile on TSMC N6 will have measurably different clock jitter floors because the PLL circuit styles optimized for each node differ in VCO topology and loop filter tuning. Integrated jitter on an N3 PLL running at 4 GHz can be 200 fs RMS or better. On an older node running the same nominal frequency it may be 400-600 fs. That 200-400 fs difference is not a rounding error when your die-to-die interface is budgeted for 800 Gb/s and every bit needs a clean eye opening.
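To put those jitter figures against a bit period, here is a rough sketch in Python. The 32 Gb/s per-lane rate and the 1e-12 bit error rate target are assumptions chosen for illustration, not figures from any specific design. The model also assumes the two PLLs contribute uncorrelated Gaussian jitter that combines root-sum-square, which ignores deterministic jitter entirely.

```python
import math
from statistics import NormalDist

def combined_rms_fs(*rms_fs):
    """Root-sum-square of uncorrelated RMS jitter contributions (fs)."""
    return math.sqrt(sum(j * j for j in rms_fs))

def peak_to_peak_fs(rms_fs, ber):
    """Peak-to-peak Gaussian jitter at a target BER: 2 * Q(ber) * sigma."""
    q = -NormalDist().inv_cdf(ber)
    return 2.0 * q * rms_fs

def ui_fraction(rms_a_fs, rms_b_fs, lane_gbps, ber=1e-12):
    """Fraction of one unit interval consumed by the combined random jitter."""
    ui_fs = 1e6 / lane_gbps  # 1 / (Gb/s) is in ns; 1 ns = 1e6 fs
    total = peak_to_peak_fs(combined_rms_fs(rms_a_fs, rms_b_fs), ber)
    return total / ui_fs

# Assumed 32 Gb/s per lane; the jitter values are the ones quoted in the text.
for label, a, b in (("N3 + N3", 200, 200), ("N3 + older node", 200, 600)):
    print(f"{label}: {100 * ui_fraction(a, b, lane_gbps=32):.1f}% of UI")
```

Under these assumptions, pairing a 200 fs PLL with a 600 fs PLL roughly doubles the share of the unit interval lost to random jitter compared with two 200 fs PLLs (about 28% versus 13%), before any deterministic jitter, skew, or crosstalk is counted.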

What do architects actually do?

The most common mitigation is to widen the interface instead of speeding it up: run the inter-die link at a lower per-lane rate over a wider data bus, with each chiplet's PHY handling serialization to and from the core clock domain. This trades bandwidth density (bits per second per bump) for timing headroom. UCIe's bump pitch and signaling choices reflect exactly this tradeoff. Running the physical interface at a slower, more conservative clock lets you absorb more jitter without adding synchronizer stages.
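The arithmetic behind that tradeoff is simple enough to sketch. The snippet below takes the 800 Gb/s budget mentioned earlier and shows how lane count and unit interval move as the per-lane rate drops. The candidate lane rates are illustrative assumptions, and the calculation ignores encoding overhead, sideband signals, and spare lanes.

```python
import math

def lanes_needed(aggregate_gbps, lane_gbps):
    """Minimum number of data lanes to carry the aggregate bandwidth."""
    return math.ceil(aggregate_gbps / lane_gbps)

def unit_interval_ps(lane_gbps):
    """Bit period per lane in picoseconds (1 / Gb/s = ns, times 1000)."""
    return 1000.0 / lane_gbps

AGGREGATE_GBPS = 800  # budget quoted in the text

for rate in (32, 16, 8):  # assumed per-lane rates in Gb/s
    print(f"{rate:>2} Gb/s per lane: {lanes_needed(AGGREGATE_GBPS, rate):>3} lanes, "
          f"UI = {unit_interval_ps(rate):.2f} ps")
```

Halving the lane rate doubles both the unit interval and the number of data bumps, which is why the slower option only makes sense when bump pitch is fine enough to afford the extra lanes.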

A second approach is mesochronous design. Both chiplets derive their inter-die clocks from a single reference distributed across the package, typically through a low-skew clock spine embedded in the interposer or RDL. TSMC's CoWoS-S has demonstrated reference clock distribution structures in the interposer layer for exactly this reason. The inter-die interface then operates mesochronously: same frequency, bounded phase offset, no FIFO needed for steady-state operation. This does not eliminate the alignment problem; it replaces a wide-open asynchronous crossing with a tightly bounded one that synthesis and static timing analysis can reason about correctly.

Below is a simplified view of how clock source relationships change across die boundaries in a typical chiplet assembly:

```mermaid
graph TD
    A[Package Reference Clock] --> B(Chiplet A PLL)
    A --> C(Chiplet B PLL)
    B --> D[Chiplet A Logic Domain]
    C --> E[Chiplet B Logic Domain]
    B --> F{Die-to-Die PHY A}
    C --> G{Die-to-Die PHY B}
    F --> H((Synchronization Point))
    G --> H
```

The synchronization point in that diagram is where all the budget pain concentrates. Every nanosecond of latency added there has downstream consequences for cache coherency protocols, transaction ordering, and memory access time. Adding synchronizer stages to cover worst-case jitter costs two to four cycles per crossing, which is 0.5 to 1 ns at 4 GHz. Across a processor-to-memory-controller interface that fires thousands of transactions per microsecond with a finite number of requests in flight, that added latency shows up as measurable bandwidth degradation.
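One way to see how a latency penalty becomes a bandwidth penalty is Little's law: with a fixed number of requests in flight, sustained throughput is in-flight data divided by round-trip latency. The sketch below applies that to the synchronizer cost. The 80 ns baseline round trip, 32 outstanding requests, 64-byte transfers, and two crossings per round trip are all assumed values for illustration only.

```python
def added_latency_ns(stages, freq_ghz, crossings=1):
    """Latency added by synchronizer stages in ns (cycles / GHz = ns)."""
    return stages * crossings / freq_ghz

def sustained_gb_per_s(outstanding, bytes_per_txn, latency_ns):
    """Little's law: throughput = in-flight bytes / latency. bytes/ns equals GB/s."""
    return outstanding * bytes_per_txn / latency_ns

# Hypothetical memory path, not taken from any real design.
BASE_LATENCY_NS = 80.0
OUTSTANDING = 32
LINE_BYTES = 64

base = sustained_gb_per_s(OUTSTANDING, LINE_BYTES, BASE_LATENCY_NS)
for stages in (2, 4):
    # Assume one crossing on the request path and one on the response path.
    extra = added_latency_ns(stages, freq_ghz=4.0, crossings=2)
    bw = sustained_gb_per_s(OUTSTANDING, LINE_BYTES, BASE_LATENCY_NS + extra)
    print(f"{stages} stages: +{extra:.2f} ns, {bw:.2f} GB/s "
          f"({100 * (1 - bw / base):.1f}% below baseline)")
```

With these assumed numbers the loss is on the order of 1 to 2.5 percent. The point is the shape of the relationship: the shorter the baseline latency and the smaller the pool of outstanding requests, the more each synchronizer stage costs.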

There is one thing the industry has not fully solved yet: automated sign-off across a heterogeneous chiplet assembly. Die-level timing closure tools close timing to the bump. Package-level simulation captures electrical parasitics. Nothing currently stitches those two sign-off worlds together with the fidelity that a monolithic design gets from a single STA run. Companies doing this well are writing custom glue scripts or investing in early EDA partnerships. Everyone else is leaving timing margin on the table and hoping bringup finds the problems before a customer does.
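As a minimal sketch of what such a glue script might check, the snippet below combines three numbers per bump: the launch die's latest data arrival at its bump, the package's worst-case interconnect delay, and the capture die's latest allowed arrival at its bump. Everything here is hypothetical, including the `BumpTiming` record, its field names, and the sample values. It also assumes all three numbers have already been mapped onto a common time reference, which is itself a large part of the real problem.

```python
from dataclasses import dataclass

@dataclass
class BumpTiming:
    name: str
    launch_arrival_ps: float    # die A STA: latest data arrival at the TX bump
    pkg_delay_max_ps: float     # package extraction: slowest interconnect delay
    capture_required_ps: float  # die B STA: latest allowed arrival at the RX bump

def setup_slack_ps(bump, uncertainty_ps=0.0):
    """Cross-die setup slack; uncertainty_ps lumps jitter and drift margin."""
    arrival = bump.launch_arrival_ps + bump.pkg_delay_max_ps
    return bump.capture_required_ps - arrival - uncertainty_ps

def violations(bumps, uncertainty_ps=0.0):
    """Return (name, slack) for every bump with negative setup slack."""
    slacks = ((b.name, setup_slack_ps(b, uncertainty_ps)) for b in bumps)
    return [(name, slack) for name, slack in slacks if slack < 0]

# Hypothetical sample data standing in for parsed tool reports.
bumps = [
    BumpTiming("d0", launch_arrival_ps=120.0, pkg_delay_max_ps=35.0, capture_required_ps=180.0),
    BumpTiming("d1", launch_arrival_ps=150.0, pkg_delay_max_ps=40.0, capture_required_ps=185.0),
]
print(violations(bumps))
print(violations(bumps, uncertainty_ps=30.0))
```

In this toy data, each die could pass its own sign-off to the bump while `d1` still fails once the package delay is added, and `d0` fails as soon as a 30 ps uncertainty margin is applied. A production flow would also need the hold check, per-corner data, and a defensible source for the uncertainty term.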

That gap between die-level sign-off and system-level reality is where chiplet timing problems actually live. Closing it requires treating the package as part of the timing domain, not as a boundary condition appended after the real design work is done.
