computing systems, and network architectures. The workload generated by AICB can serve as input to SimAI for simulating the conditions of model training, including the individual training stages, the size of the communication data, the communication operations, and the computation time corresponding to each stage. Based on this workload input, network topology information, and the related network configurations, SimAI forms a comprehensive simulation and evaluation system, making it an important tool for evaluating large-model infrastructure.
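As an illustration of what such a per-stage workload record might contain, the sketch below encodes one training iteration as a list of stage entries that a simulator could replay against a given topology and configuration. The WorkloadStage structure and its field names are hypothetical, chosen only to mirror the items listed above; they are not AICB's actual output format.

# Illustrative sketch only: the dataclass and field names are assumptions,
# not AICB's real trace schema.
from dataclasses import dataclass

@dataclass
class WorkloadStage:
    stage: str              # e.g. "forward", "backward", "optimizer"
    comm_op: str            # collective type, e.g. "all_reduce", "all_gather"
    comm_group: str         # parallel group, e.g. "tp", "dp", "pp"
    msg_size_bytes: int     # size of the communicated data
    compute_time_us: float  # computation time attributed to this stage

# One training iteration expressed as a list of such records.
iteration = [
    WorkloadStage("forward", "all_gather", "tp", 64 * 2**20, 850.0),
    WorkloadStage("backward", "reduce_scatter", "tp", 64 * 2**20, 1700.0),
    WorkloadStage("optimizer", "all_reduce", "dp", 256 * 2**20, 0.0),
]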
5. Conclusion
In this paper, we introduce AICB, a benchmark for evaluating the communication subsystem of LLM training clusters. AICB focuses on the communication subsystems of large-scale AI training clusters and defines appropriate ranges for the RC to construct the ES. By "hijacking" distributed training frameworks, it simulates the specific collective communication operations issued during training. In addition to visualizing the communication distribution, AICB uses bus bandwidth as a metric to evaluate compatibility with a specified cluster. In this way, AICB offers precise simulation and accurate evaluation of collective communications, providing substantial support for simulating and evaluating LLM training.
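As a point of reference for the bus-bandwidth metric mentioned above, the sketch below illustrates the correction-factor convention popularized by nccl-tests, which converts a collective's algorithm bandwidth into bus bandwidth. Whether AICB applies exactly these factors is an assumption made for illustration, not a restatement of its implementation.

# Minimal sketch of the nccl-tests bus-bandwidth convention; treating
# these correction factors as AICB's is an assumption for illustration.
def bus_bandwidth_gbps(collective: str, msg_bytes: int, time_s: float, n_ranks: int) -> float:
    """Convert a collective's message size and completion time into bus bandwidth (GB/s)."""
    algbw = msg_bytes / time_s / 1e9  # algorithm bandwidth in GB/s
    factor = {
        "all_reduce": 2 * (n_ranks - 1) / n_ranks,
        "all_gather": (n_ranks - 1) / n_ranks,
        "reduce_scatter": (n_ranks - 1) / n_ranks,
        "broadcast": 1.0,
    }[collective]
    return algbw * factor

# Example: a 256 MiB all-reduce across 8 ranks completing in 5 ms.
print(bus_bandwidth_gbps("all_reduce", 256 * 2**20, 5e-3, 8))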
CRediT authorship contribution statement
Xinyue Li: Writing – review & editing, Writing – original draft, Visu-
alization, Conceptualization. Heyang Zhou: Software, Conceptualiza-
tion. Qingxu Li: Software, Conceptualization. Sen Zhang: Validation,
Conceptualization. Gang Lu: Validation, Conceptualization.
Declaration of competing interest
The authors declare that they have no known competing finan-
cial interests or personal relationships that could have appeared to
influence the work reported in this paper.