
Broadcom's Jericho4 ASICs open the door to multi-datacenter AI training

Linking existing data centers into one vast super cluster offers an alternative to building ever-larger single campuses.

Opening a new era in AI training, Broadcom's Jericho4 ASICs enable training workloads to span multiple datacenters.

Bandwidth and latency have long been significant hurdles to training artificial intelligence (AI) models across multiple datacenters. Recent advancements, however, are paving the way for a more connected and efficient future.

Modern distributed AI systems increasingly rely on Pulse Amplitude Modulation 4-level (PAM4) optical digital signal processors (DSPs), which offer 200 Gbps per lane and enable 1.6 Tbps modules to handle the massive bandwidth demands inside data centers and across racks. This technology ensures low-latency, high-reliability connections over tens to hundreds of meters, crucial for scaling AI clusters across multiple rows and pods [1].
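As a quick sanity check on those per-lane and per-module figures (the eight-lane layout below is an assumption, not something stated above):

```python
# Back-of-the-envelope check on the module figures quoted above,
# assuming the common eight-lane layout for a 1.6 Tbps module.
LANES_PER_MODULE = 8      # assumption: eight lanes per optical module
GBPS_PER_LANE = 200       # PAM4 DSP rate cited in the article

module_tbps = LANES_PER_MODULE * GBPS_PER_LANE / 1000
print(f"Per-module bandwidth: {module_tbps:.1f} Tbps")   # -> 1.6 Tbps
```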

For longer distances within a data center campus (2–20 km), coherent-lite technology provides a viable solution. It offers longer reach than PAM4 links but at a lower cost and power than traditional coherent systems, operating in the O-band and helping to alleviate power/space constraints inside single facilities [1].

To connect geographically distributed compute clusters across cities or continents, coherent ZR and ZR+ optics enable multi-terabit bandwidth over distances up to 2,500 km, ensuring high bandwidth and low latency over long distances critical for cross-data-center distributed training [1].

Enterprise networks are also being designed for ultra-low latency and massive throughput to meet the demands of distributed AI environments [3]. Distributing computational workloads closer to data sources (edge computing) reduces latency dramatically, enhances data privacy, and reduces bandwidth pressure between sites [5].

Techniques like model pruning, quantization, and distillation further reduce the compute and data-transfer load per training step, indirectly easing the latency burden [2][4].
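As a rough illustration of how quantization shrinks what has to cross the wire between sites, here is a minimal sketch, not taken from any particular framework, that packs a float32 gradient into 8-bit integers plus a scale factor before it would be exchanged:

```python
import numpy as np

def quantize_int8(grad: np.ndarray):
    """Compress a float32 gradient to int8 plus a per-tensor scale factor."""
    scale = float(np.abs(grad).max()) / 127.0
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(grad / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 gradient on the receiving side."""
    return q.astype(np.float32) * scale

grad = np.random.randn(1_000_000).astype(np.float32)
q, scale = quantize_int8(grad)
print(f"Bytes on the wire: {grad.nbytes:,} -> {q.nbytes:,}")  # roughly 4x smaller
```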

Broadcom's Jericho4, currently available for large customers to sample, is a next-gen networking ASIC designed for high throughput and low latency, targeting data center networking with support for Terabit-scale Ethernet switching and advanced traffic management features. Unlike the optical and DSP-based physical interconnects mentioned above, Jericho4 operates as a high-performance packet-processing chip within the networking switch and router layer [6].

Jericho4 excels in managing large-scale data flows at line-rate speeds with advanced congestion management, telemetry, and programmability, essential for modern AI data center fabrics [6]. The key difference is Jericho4 focuses on packet switching and network layer optimization inside and between data centers, whereas the PAM4/Coherent DSPs focus on the physical layer and optical transmission technology that carry these packets across distances [6].

In summary, overcoming bandwidth and latency challenges for distributed AI training combines optical hardware innovations (PAM4 DSPs, coherent optics) to transport large data volumes efficiently over distance with network ASICs like Jericho4 to optimise data flow within and between network nodes. Jericho4 complements but does not replace the physical-layer optical technologies critical for multi-datacenter AI training scale [1][3].

Broadcom's Jericho4 is positioned for datacenter-to-datacenter interconnect (DCI) and offers 51.2 Tb/s of aggregate bandwidth across the ASIC's switch and fabric ports [2]. Amir Sheffer, an associate product line manager at Broadcom, stated that Jericho4 is the only valid solution for running a training cluster beyond the capacity of a single building [2].

The round trip latency for Jericho4 over a 100-kilometer span works out to nearly one millisecond, before considering transceiver and protocol overheads [3]. Jericho4 can be scaled into configurations of up to 36,000 hyper ports, capable of connecting two datacenters at 115.2 petabits per second [3].
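The one-millisecond figure follows from the propagation speed of light in fibre, roughly 200,000 km/s, or about two thirds of c; a back-of-the-envelope check:

```python
# Propagation-only round-trip delay over a 100 km fibre span.
# Assumes light travels at roughly 200,000 km/s in glass (about 2/3 c)
# and ignores transceiver and protocol overheads, as noted above.
SPEED_IN_FIBRE_KM_PER_S = 200_000
span_km = 100

rtt_ms = (2 * span_km / SPEED_IN_FIBRE_KM_PER_S) * 1_000
print(f"Round-trip propagation delay: {rtt_ms:.2f} ms")   # -> 1.00 ms
```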

The paper by Google's DeepMind team, titled "Streaming DiLoCo with overlapping communication," published in late January, details an approach to low-communication training for distributed workloads [4]. Although the paper does not mention Jericho4 directly, it provides a potential solution to the latency challenges faced by distributed training workloads [4].

By using quantization and strategically scheduling communication between datacenters, many of the bandwidth and latency challenges can be overcome [4]. The basic idea in the paper was to create distributed work groups that don't have to talk to one another all that often [4].
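A minimal sketch of that idea, assuming nothing about DeepMind's actual implementation: each site runs many local optimizer steps and only periodically exchanges an averaged update with its peers, so the inter-datacenter link is touched once every H steps rather than every step. The function names and the sync interval below are illustrative.

```python
import numpy as np

H = 500  # inner steps between cross-datacenter syncs (illustrative value)

def local_step(params: np.ndarray) -> np.ndarray:
    """Stand-in for one ordinary optimizer step on data held at this site."""
    fake_gradient = np.random.randn(*params.shape)
    return params - 1e-3 * fake_gradient

def cross_site_average(local_params: np.ndarray, peer_params: list) -> np.ndarray:
    """Stand-in for the rare, expensive exchange over the inter-datacenter link."""
    return np.mean([local_params] + peer_params, axis=0)

params = np.zeros(10)
for step in range(1, 10_001):
    params = local_step(params)
    if step % H == 0:
        # Only here does traffic cross the DCI link; the update could also be
        # quantized before transmission to shrink it further.
        peer_params = []  # in practice, gathered from the other datacenters
        if peer_params:
            params = cross_site_average(params, peer_params)
```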

That 115.2 Pb/s is sufficient to connect 144,000 GPUs, each at 800 Gbps, to an equal number in a neighboring datacenter without bottlenecks [5]. Historically, datacenter operators have employed some degree of over-subscription in their DCI deployments, whether 4:1 or 8:1, and this is likely to remain the case [5].
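Those figures line up: 144,000 GPUs at 800 Gbps apiece works out to exactly 115.2 Pb/s, and an over-subscription ratio simply multiplies the number of GPUs a given fabric can nominally serve. A quick check:

```python
GPUS = 144_000
GBPS_PER_GPU = 800

required_pbps = GPUS * GBPS_PER_GPU / 1_000_000
print(f"Required DCI bandwidth: {required_pbps:.1f} Pb/s")   # -> 115.2 Pb/s

# With the historical 4:1 or 8:1 over-subscription mentioned above, the same
# fabric could nominally front four to eight times as many GPUs.
for ratio in (4, 8):
    print(f"{ratio}:1 over-subscription -> ~{GPUS * ratio:,} GPUs")
```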

Jericho4, as an alternative to building one great big datacenter campus, allows AI outfits to build multiple smaller datacenters and pool their resources [6]. This approach can help reduce the impact of a single point of failure and provide more flexible scaling options for AI model developers.

In conclusion, Broadcom's Jericho4 and the advancements in optical interconnects, optimised networking fabrics, and edge/accelerator infrastructure are paving the way for more efficient, high-throughput, and low-latency distributed AI training across multiple datacenters. As these technologies continue to evolve, we can expect to see even more exciting developments in the field of AI.

References:
1. Optical Interconnects for Distributed AI Training
2. Broadcom Jericho4: Terabit-scale Ethernet Switching ASIC
3. Broadcom Jericho4: Redefining Data Center Interconnects
4. Google's DeepMind Team Publishes Paper on Low-Communication Training for Distributed Workloads
5. Optimising Distributed AI Training for Multi-Datacenter Scenarios
6. Broadcom Jericho4: The Solution for Distributed AI Training Across Multiple Datacenters

