
AICB: A benchmark for evaluating the communication subsystem of LLM training clusters

Xinyue Li, Heyang Zhou, Qingxu Li, Sen Zhang, Gang Lu
Alibaba Cloud, Beijing, 100124, China

Abstract

AICB (Artificial Intelligence Communication Benchmark) is a benchmark for evaluating the communication subsystem of GPU clusters, covering representative workloads from Large Language Model (LLM) training. Guided by the theories and methodologies of Evaluatology, we use AICB to simplify real-workload LLM training systems while maintaining good representativeness and usability. AICB bridges the gap between application benchmarks and microbenchmarks in the scope of LLM training. In addition, we constructed a new GPU-free evaluation system that helps researchers evaluate the communication subsystem of LLM training systems. To meet the urgent demand for evaluation in this area, we open-source AICB and make it available at https://github.com/aliyun/aicb.

Background

1. Introduction

AI infrastructure is developing rapidly with the flourishing of Artificial Intelligence [1], [2]. For example, the explosion of Large Language Model (LLM) applications has driven the fast evolution of training frameworks [3], [4], [5], collective communication algorithms [6], network transports [7], and scale-out and scale-up network architectures [8]. Because of the large number of parameters in an LLM, data is distributed across different GPUs for computation, which requires synchronization among these GPUs. Therefore, in LLM training, communication affects training efficiency in addition to computation. Consequently, evaluating the performance of the communication subsystem is a critical subject: it ensures that foundational technologies evolve in a manner that is both responsible and conducive to continued progress in the field.
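
To make the communication step concrete, the sketch below shows the gradient synchronization that data-parallel LLM training performs after every backward pass. It is illustrative only and not part of AICB: it assumes PyTorch's torch.distributed with the NCCL backend, launched by a tool such as torchrun, and the model and tensor sizes are placeholders.

import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    """All-reduce each gradient so every rank holds the averaged value."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # This collective is the communication cost that benchmarks like
            # AICB characterize: its latency depends on message size, the
            # collective algorithm, and cluster scale, and it sits on the
            # critical path of every training step.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")  # rank/world size from launcher env
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    model = torch.nn.Linear(4096, 4096).cuda()  # stand-in for an LLM layer
    model(torch.randn(8, 4096, device="cuda")).sum().backward()
    sync_gradients(model)  # communication interleaved with computation
    dist.destroy_process_group()

With N data-parallel ranks, a ring all-reduce moves on the order of 2(N-1)/N times the gradient size per rank, which is why communication time can rival computation time at scale.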




