tag

#allreduce

3 posts in total

AI 2026.05.03 · 12 min Advanced Distributed Training Deep Dive · 1

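From Broadcast to the proof that Ring AllReduce is bandwidth-optimal, this post traces the six collective operations used for multi-GPU communication in distributed training and the principles behind NCCL's topology selection.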

AI 2026.05.03 · 11 min Advanced Distributed Training Deep Dive · 2

From the linearity proof of gradient averaging to critical batch size and the convergence conditions under async staleness, this post traces the mathematical foundations of Data Parallelism in distributed training.

AI 2026.05.03 · 9 min Advanced Distributed Training Deep Dive · 3

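Starting from the memory limit of a single GPU and moving through the 2-AllReduce optimality of the Column-GELU-Row structure to the efficiency gap between NVLink and InfiniBand, this post traces the design decisions behind Megatron-LM.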