An Order-of-Magnitude Larger and Faster Training with ZeRO-2

ZeRO-2 extends the memory optimizations of the original ZeRO to cover the full spectrum of memory consumption during training. In addition to the optimizer-state memory optimization of the original ZeRO, ZeRO-2 introduces new techniques that reduce the memory footprint of gradients, activations, and fragmented memory. Together, these savings let DeepSpeed improve the scale and speed of deep learning training by an order of magnitude: ZeRO-2 can train models with up to 170 billion parameters up to 10x faster than the state of the art.
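To give a feel for where the savings on gradients and optimizer state come from, here is a rough back-of-the-envelope sketch. It assumes mixed-precision training with Adam, using the commonly cited accounting of 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state. The function name, the stage numbering as an argument, and the example sizes are illustrative assumptions, and the estimate ignores activation and fragmented memory entirely.

```python
def model_state_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate per-GPU model-state memory for mixed-precision Adam.

    Assumed byte counts per parameter (not measured values):
      2  -> fp16 weights
      2  -> fp16 gradients
      12 -> fp32 weights, momentum, and variance kept by the optimizer
    Activation memory and fragmentation are NOT included.
    """
    params = 2 * num_params
    grads = 2 * num_params
    optim = 12 * num_params

    if stage >= 1:
        # Optimizer state is partitioned across data-parallel GPUs.
        optim = optim / num_gpus
    if stage >= 2:
        # Gradients are partitioned as well.
        grads = grads / num_gpus

    return params + grads + optim


# Hypothetical example: a 10-billion-parameter model on 64 GPUs.
baseline = model_state_bytes_per_gpu(10 * 10**9, 64, stage=0)
zero2 = model_state_bytes_per_gpu(10 * 10**9, 64, stage=2)
print(f"no partitioning: {baseline / 1e9:.1f} GB per GPU")
print(f"stage 2:         {zero2 / 1e9:.1f} GB per GPU")
```

Under these assumptions the unpartitioned weights come to dominate as the number of GPUs grows, which is why partitioning gradients on top of optimizer state makes a noticeable difference.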

For more information on ZeRO-2, see our blog post.

For more information on how to use ZeRO-2, see the example of training the GPT family of models in this tutorial.

For a technical overview, see our technical report.
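As a quick orientation before reading the tutorial, the sketch below builds a minimal DeepSpeed configuration that turns on ZeRO stage 2. The batch size and bucket sizes are placeholder values chosen for illustration, not recommendations, and the exact set of supported keys should be checked against the configuration documentation for the DeepSpeed version in use.

```python
import json

# Illustrative DeepSpeed config enabling ZeRO stage 2.
# All numeric values here are placeholders; tune them for your model and hardware.
ds_config = {
    "train_batch_size": 64,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                     # partition optimizer state and gradients
        "contiguous_gradients": True,   # copy gradients into a contiguous buffer to limit fragmentation
        "overlap_comm": True,           # overlap gradient reduction with the backward pass
        "reduce_scatter": True,         # use reduce-scatter rather than all-reduce for gradients
        "reduce_bucket_size": 5e8,
        "allgather_bucket_size": 5e8,
    },
}

# Serialize to JSON, the format DeepSpeed reads its configuration from.
config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

The resulting JSON would typically be saved to a file and passed to the training script as its DeepSpeed configuration; the tutorial covers the launch details.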