
In-Package Optical I/O versus Co-Packaged Optics – Let’s Get Technical!

Mar 1, 2023

There’s a lot of industry excitement around advances in optical interconnects – and also a lack of clarity. Terms are often mixed and dissimilar technologies lumped into the same bucket. This is nowhere more prevalent than with in-package optical I/O (OIO) and co-packaged optics modules (CPO). The truth: comparing these two technologies is an apples-to-watermelons (or, if you prefer, networking-to-compute fabric) comparison. And here are the reasons:

  1. CPO is a replacement strategy for pluggable optics, geared towards large Ethernet network switches; its footprint (bandwidth density, energy cost, latency) is just a bit better than pluggable optics.
  2. OIO is a chiplet-based optical interconnect solution integrated in the same package as the compute chips (CPUs, GPUs, XPUs). It is designed to enable seamless communication among them in a distributed compute system (across boards, racks and compute rows) at a bandwidth density, energy cost, and latency comparable to those of in-package electrical interconnects.
  3. Low latency, high-bandwidth density and low energy make OIO uniquely suitable for compute fabrics (i.e., memory-semantic fabrics), which are emerging as the key drivers of new data center architectures tailored for machine learning scale-out, resource disaggregation and memory pooling.

Let me elaborate on this last point, since these OIO characteristics matter a great deal for advanced AI and data center design. In 2022, we saw dramatic breakthroughs in machine learning driven by foundational models like ChatGPT. These models are trained continuously and require thousands of GPUs (both for compute power and memory footprint), making the hardware that supports them a permanent part of the new data center infrastructure. For example, NVIDIA DGX™ systems are expanding their NVLink® memory-semantic fabric beyond the DGX board, which hosts 8 H100 GPUs, to connect up to 256 H100 GPUs with a two-stage NVLink fabric, and pushing the InfiniBand network further out. This fabric scaling will continue both in radix (number of nodes supported) and bandwidth, and OIO is uniquely suited to enable it in terms of power usage and cost-efficiency.

Figure 1 illustrates Ayar Labs’ in-package OIO, with four TeraPHY™ optical I/O chiplets in the same package (multi-chip package – MCP) as the host System-on-a-Chip (SoC). The solution below enables 16 Terabits per second (Tbps) of bidirectional throughput at <5 pJ/b energy cost from a typical-size compute package (50 mm x 50 mm).


Figure 1: Ayar Labs’ in-package Optical I/O (four TeraPHY™ optical I/O chiplets on the same MCP with the host SoC).
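
To put those package-level numbers in perspective, here is a back-of-the-envelope sketch in Python. It uses only the 16 Tbps, four-chiplet and <5 pJ/b figures quoted above. The function names are made up for illustration, and because the energy figure is an upper bound, the resulting power is an upper bound too.

```python
# Figures quoted in the text; helper names are illustrative, not an Ayar Labs API.
PACKAGE_GBPS = 16_000.0      # 16 Tbps of bidirectional throughput per package
CHIPLETS_PER_PACKAGE = 4
ENERGY_PJ_PER_BIT = 5.0      # text says "<5 pJ/b", so treat this as a ceiling


def per_chiplet_gbps(package_gbps: float, chiplets: int) -> float:
    """Throughput each chiplet has to carry if the load is split evenly."""
    return package_gbps / chiplets


def io_power_watts(throughput_gbps: float, pj_per_bit: float) -> float:
    """Power spent on I/O at a given throughput and energy per bit.

    Gbps * pJ/b = 1e9 b/s * 1e-12 J/b = 1e-3 W, hence the division by 1000.
    """
    return throughput_gbps * pj_per_bit / 1000.0


chiplet_gbps = per_chiplet_gbps(PACKAGE_GBPS, CHIPLETS_PER_PACKAGE)
power_ceiling_w = io_power_watts(PACKAGE_GBPS, ENERGY_PJ_PER_BIT)

print(f"Per chiplet: {chiplet_gbps:.0f} Gbps")
print(f"I/O power ceiling: {power_ceiling_w:.0f} W")
```

With these inputs the optical I/O for the whole package stays under roughly 80 W, at about 4 Tbps per chiplet.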

TeraPHY™ OIO chiplets are designed in a CMOS process and assembled into an MCP just like any other CMOS chiplet, using standard 2D and 2.5D packaging technologies. These chiplets are designed to support wide-parallel interfaces, enabling in-package communication with high bandwidth density at the lowest energy cost per bit. Wide-parallel interfaces are emerging through various flavors of the Universal Chiplet Interconnect Express™ (UCIe) standard and NVLink-C2C.

The optical link part of the TeraPHY™ chiplet is also engineered to enable optimal energy efficiency by exploiting parallelism – something that CMOS designers are very familiar with. By utilizing efficient wavelength division multiplexing (WDM) enabled by small microring resonators integrated on the same chip as the link electronics, TeraPHY™ OIO chiplets utilize multiple wavelengths (link lanes) per fiber, and multiple links (fibers) per chip. For example, the current generation of TeraPHY™ chiplets achieves a bidirectional throughput of 4096 Gbps with 8 optical ports. Each optical port carries 256 Gbps in each direction, using 8 wavelengths at 32 Gbps per wavelength.

Having more wavelengths per fiber enables link electronics to be optimized for energy efficiency, rather than pushing the high data rate per very few wavelengths at higher energy cost. The multiple wavelengths for multiple links are enabled by our SuperNova™ optical source, which in its current generation provides 8 optical ports with 8 wavelengths per port.

And this is just the start. By adding more wavelengths, and more optical ports, we can keep doubling the chip throughput without having to compromise on its energy efficiency.
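
The arithmetic behind these numbers is simple enough to sketch in Python. The class below is purely illustrative (the names are mine, not part of any TeraPHY™ interface). It assumes that the bidirectional figure is twice the per-direction figure, which is what the quoted 8 ports x 256 Gbps = 2048 Gbps versus 4096 Gbps implies. The 16-wavelength variant is a hypothetical configuration used only to show the scaling, not a product specification.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class WdmChiplet:
    """Illustrative model of a WDM optical I/O chiplet (hypothetical names)."""
    optical_ports: int
    wavelengths_per_port: int
    gbps_per_wavelength: int

    @property
    def port_gbps(self) -> int:
        # Each wavelength on a fiber is an independent lane.
        return self.wavelengths_per_port * self.gbps_per_wavelength

    @property
    def per_direction_gbps(self) -> int:
        return self.optical_ports * self.port_gbps

    @property
    def bidirectional_gbps(self) -> int:
        # Assumption: transmit and receive carry the same rate.
        return 2 * self.per_direction_gbps


# Current generation, as quoted in the text.
current = WdmChiplet(optical_ports=8, wavelengths_per_port=8, gbps_per_wavelength=32)

# Hypothetical scaling step: double the wavelengths, keep 32 Gbps per wavelength.
more_wavelengths = replace(current, wavelengths_per_port=16)

print(current.port_gbps, current.bidirectional_gbps)
print(more_wavelengths.bidirectional_gbps)
```

Doubling either the wavelength count or the port count doubles the total while the per-wavelength rate stays at 32 Gbps, which is exactly the scaling argument made above.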

Figure 2 summarizes the key technologies that enable these efficient TeraPHY™ OIO chiplet solutions.


Figure 2 – Key technologies enabling the TeraPHY™ OIO

Now let’s contrast that with CPO modules. Figure 3 illustrates the CPO module with its own package, 16 of which are then assembled with a switch package on the large (> 160mm x 160mm) interposer board (so clearly the optics are not in the same package as the switch SoC). I would call this solution NPO (near-package optics), but it is actually a form factor specified by the Co-Packaged Optics Collaboration Joint Development Forum and Optical Internetworking Forum (OIF) – for example, see the “3.2 Tb/s Co-packaged Optics Optical Module Product Requirements Document.” These CPO modules are forced to support 112 Gbps PAM4 signaling per wavelength, which, combined with the lack of electronic-photonic integration, results in poor energy efficiency compared to the OIO solution.


Figure 3 – (upper left) CPO module top view illustrating discrete photonic and electronic chips on a module package; (lower left) cross-section of the full solution illustrating separate package substrates for the CPO modules and the switch, all mounted on the interposer board; (right) 51.2 Terabits/s CPO switch demo.

Product of shoreline bandwidth density and energy efficiency

So, let’s look at some data. The chart below illustrates the first important figure of merit (FoM1) for various interconnect solutions. This is a product of shoreline bandwidth density (Gbps/mm) and energy efficiency (1/(pJ/b)).


Figure 4 – Product of shoreline bandwidth density and energy efficiency of various interconnect solutions plotted vs reach

Why is this metric important? Chips and packages, and even the chassis board, have limited edge escape real estate as well as limited ability to dissipate heat. I plot the FoM1 for various interfaces versus their reach capability, to better understand the capability of the technology to escape large amounts of data, at lower energy cost, while reaching a reasonable distance required to implement a distributed computing solution.
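
For readers who want to compute FoM1 for their own interfaces, here is a minimal Python sketch of the definition given above: shoreline bandwidth density (Gbps/mm) multiplied by energy efficiency (1/(pJ/b)). The function name and the input values are placeholders chosen for easy arithmetic. They are not measurements of any of the interconnects in Figure 4.

```python
def fom1(throughput_gbps: float, shoreline_mm: float, pj_per_bit: float) -> float:
    """Shoreline bandwidth density times energy efficiency.

    Returns a value in (Gbps/mm) / (pJ/b). Higher is better: more data
    escaping per millimeter of edge, for less energy per bit.
    """
    if shoreline_mm <= 0 or pj_per_bit <= 0:
        raise ValueError("shoreline_mm and pj_per_bit must be positive")
    shoreline_density = throughput_gbps / shoreline_mm   # Gbps/mm
    energy_efficiency = 1.0 / pj_per_bit                 # 1/(pJ/b)
    return shoreline_density * energy_efficiency


# Placeholder interface: 1000 Gbps over 10 mm of edge at 2 pJ/b.
baseline = fom1(throughput_gbps=1000.0, shoreline_mm=10.0, pj_per_bit=2.0)

# Same interface at half the energy per bit: the figure of merit doubles.
more_efficient = fom1(throughput_gbps=1000.0, shoreline_mm=10.0, pj_per_bit=1.0)

print(baseline, more_efficient)
```

Because the metric is a product, an interface can score well either by packing more bandwidth along the edge or by spending less energy per bit. Plotting it against reach, as in Figure 4, then shows how far each technology can carry that combination.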

At the package level, the UCIe Advanced interface offers the highest FoM1, enabled by 2.5D integration but limited to only a few millimeters (mm) of reach. UCIe Standard and NVLink-C2C enable in-package chip-to-chip connectivity over standard organic package substrates, reducing the packaging cost. These links are very energy efficient and have very high shoreline bandwidth density, but unfortunately, their reach is limited to within the SoC package. Ideally, we would like to carry these energy and bandwidth density efficiencies to anywhere in a large-scale distributed compute system – up to distances of at least hundreds of meters. This is where Ayar Labs’ OIO solution comes in.

Key takeaways:

  • Ayar Labs’ in-package OIO solution [starting with the demonstrated TeraPHY™ 4 Tbps (4T) and follow-on generations of TeraPHY™ 8 Tbps (8T) and 16 Tbps (16T)] is in the FoM1 regime of UCIe Standard and NVLink-C2C, enabling reach-independent off-package connectivity at a shoreline bandwidth density and energy cost comparable to those of in-package electrical interconnects.
  • CPO is down by close to an order of magnitude compared to OIO, and only slightly above the 800G pluggable optics solutions.

Area bandwidth density

A second important metric is the area bandwidth density (Gbps/mm²). It represents the efficiency of the interface footprint in utilizing the chip, package and board real estate. Figure 5 illustrates our second figure of merit – a product of area bandwidth density (Gbps/mm²) and energy efficiency (1/(pJ/b)). This metric is especially important for the interconnect solutions placed in the package. Package real estate is at a premium – especially in compute applications, where packages are typically smaller than those of large networking switches and the area is needed to host compute chips and memory stacks, leaving little room for the I/O.


Figure 5 – Product of area bandwidth density and energy efficiency of various interconnect solutions plotted vs reach
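
The second figure of merit can be sketched the same way, together with a small helper for expressing the gap between two solutions in orders of magnitude. Everything here is illustrative: the function names are mine and the two sets of inputs are invented round numbers, not data read off Figure 5.

```python
import math


def fom2(throughput_gbps: float, footprint_mm2: float, pj_per_bit: float) -> float:
    """Area bandwidth density times energy efficiency, in (Gbps/mm^2) / (pJ/b)."""
    if footprint_mm2 <= 0 or pj_per_bit <= 0:
        raise ValueError("footprint_mm2 and pj_per_bit must be positive")
    return (throughput_gbps / footprint_mm2) / pj_per_bit


def orders_of_magnitude(a: float, b: float) -> float:
    """How many factors of ten separate two positive figures of merit."""
    return math.log10(a / b)


# Invented placeholder inputs, chosen only to make the arithmetic easy to follow.
compact_link = fom2(throughput_gbps=2000.0, footprint_mm2=50.0, pj_per_bit=4.0)
bulky_link = fom2(throughput_gbps=800.0, footprint_mm2=400.0, pj_per_bit=10.0)

gap = orders_of_magnitude(compact_link, bulky_link)
print(f"{compact_link:.1f} vs {bulky_link:.1f}: {gap:.2f} orders of magnitude apart")
```

Since both axes of a chart like Figure 5 span several decades, comparing on a log scale is usually more informative than quoting raw ratios.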

Key takeaways:

  • OIO is in the range of in-package and board-level electrical interconnect solutions hosted in the compute package.
  • CPO is more than an order of magnitude below OIO and closer to the pluggable optics solutions.

Latency

Latency is a third key metric (and actually super important for memory-semantic fabrics). Ayar Labs’ OIO chiplets are designed to have 5 nanoseconds (ns) latency, comparable to in-package interfaces, with a raw bit error rate (BER) target of 10⁻¹⁵. In contrast, just like pluggable optics, CPO requires a forward-error correction code to achieve this BER target, at a cost of 100-150ns of latency. This is acceptable in traditional networking applications, but not in memory-semantic fabrics for machine learning scale-out and disaggregation.
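
To see why the difference matters, consider a rough Python sketch using only the latency figures quoted above. It makes two simplifying assumptions that are mine, not from any specification: each figure is paid once per one-way link traversal, and a memory-semantic load needs two traversals (request and response). Fiber time of flight and switch latency are deliberately left out.

```python
# Figures quoted in the text.
OIO_LATENCY_NS = 5.0
FEC_LATENCY_RANGE_NS = (100.0, 150.0)


def round_trip_ns(per_traversal_ns: float, traversals: int = 2) -> float:
    """Interface latency accumulated over a request/response exchange.

    Assumption: the per-traversal cost is paid once in each direction.
    """
    return per_traversal_ns * traversals


def overhead_ratio(fec_ns: float, raw_ns: float) -> float:
    """How many times larger the FEC latency is than the raw-link latency."""
    return fec_ns / raw_ns


oio_rt = round_trip_ns(OIO_LATENCY_NS)
fec_rt_low, fec_rt_high = (round_trip_ns(ns) for ns in FEC_LATENCY_RANGE_NS)
ratio_low, ratio_high = (overhead_ratio(ns, OIO_LATENCY_NS) for ns in FEC_LATENCY_RANGE_NS)

print(f"OIO round trip: {oio_rt:.0f} ns")
print(f"FEC round trip: {fec_rt_low:.0f}-{fec_rt_high:.0f} ns")
print(f"FEC latency is {ratio_low:.0f}x-{ratio_high:.0f}x the OIO latency")
```

Under these assumptions, FEC alone adds 200-300 ns to every remote load, against about 10 ns for the OIO interface.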

Last but not least is the cost efficiency of the interconnect solution (throughput/cost – Gbps/$). Due to having many discrete components and the resulting module assembly costs, pluggable optics solutions have struggled to break the 1 Gbps/$ barrier. A CPO module, as shown in Fig. 3, has a similar number of components and needs its own module package (package substrate, lid, etc.), just like pluggable optics modules, so it is hard to see how it will be able to significantly improve the cost efficiency. While this may be fine for traditional networking applications, where most of the cost is concentrated in the switch and many servers have relatively low throughput needs (hence relatively low networking cost compared to other hardware in the server), the situation is entirely different in high-performance distributed compute used for machine learning applications. Here, each compute unit (e.g., GPU accelerator) has 1-2 orders of magnitude more off-package interconnect throughput that needs to reach other distributed compute units.
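
A quick Python sketch shows how this scales. The 1 Gbps/$ figure comes from the text. The 100 Gbps baseline for a traditional server is a hypothetical round number picked for illustration, as are the function and variable names.

```python
def optics_cost_usd(throughput_gbps: float, gbps_per_dollar: float) -> float:
    """Interconnect cost for a given throughput at a given cost efficiency."""
    if gbps_per_dollar <= 0:
        raise ValueError("gbps_per_dollar must be positive")
    return throughput_gbps / gbps_per_dollar


PLUGGABLE_GBPS_PER_DOLLAR = 1.0   # the barrier quoted in the text
BASELINE_SERVER_GBPS = 100.0      # hypothetical traditional-server throughput

# 1x is the baseline server; 10x and 100x correspond to the
# "1-2 orders of magnitude more" throughput of a compute unit.
for scale in (1, 10, 100):
    gbps = BASELINE_SERVER_GBPS * scale
    cost = optics_cost_usd(gbps, PLUGGABLE_GBPS_PER_DOLLAR)
    print(f"{gbps:>8.0f} Gbps -> ${cost:,.0f} of optics per unit")
```

At a fixed Gbps/$, the optics bill grows linearly with throughput, so a cost that is negligible for a traditional server becomes a significant line item for every accelerator in a machine learning cluster.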

OIO is uniquely suited to significantly improve cost efficiency by leveraging integration – a proven recipe for improving cost efficiency in the CMOS world.

For the TeraPHY™ OIO chiplet, we leverage electronic-photonic integration to pack many links on the same CMOS chip, and then integrate these chips into the same package as the compute SoC. The SuperNova™ optical source is also highly integrated, and provides many wavelengths, further amortizing the cost. Of course, to fully optimize the manufacturing costs, a high-volume manufacturing ecosystem needs to be established. Ayar Labs has been working with many partners, illustrated in Fig. 6, to realize this technology potential.

Figure 6 – Ayar Labs and partners in the in-package optical I/O ecosystem

As we’ve seen, optical I/O and CPO are both exciting optical breakthroughs, designed for very different applications. Thus, the characteristics of each – across power density, performance per watt, latency, package cost, etc. – are also distinct. When looking at the overall market for optical interconnect, or discussing the technologies individually, it pays to be clear on the specific technology and application.

See the industry’s first 4 Tbps OIO solution demonstration at OFC, Ayar Labs booth #6008, or learn more about this key milestone announcement.
