Tutorials

DeepSpeed Mixture-of-Quantization (MoQ)

DeepSpeed introduces new support for model compression using quantization, called Mixture-of-Quantization (MoQ). MoQ is designed on top of QAT (Quantization...

DeepSpeed Accelerator Abstraction Interface

Contents: Introduction; Write accelerator agnostic models; Port accelerator runtime calls; Port accelerator device name; Te...

DeepSpeed Accelerator Setup Guides

Contents: Introduction; Intel Architecture (IA) CPU; Intel XPU; Huawei Ascend NPU; Intel Gaudi

Installation Details

The quickest way to get started with DeepSpeed is via pip; this installs the latest release of DeepSpeed, which is not tied to specific PyTorch or CUDA ve...

Automatic Tensor Parallelism for HuggingFace Models

Contents: Introduction; Example Script; Launching; T5 11B Inference Performance Comparison; OPT 13B Inference Performance Comparison; ...

Autotuning

Automatically discover the optimal DeepSpeed configuration that delivers good training speed

Getting Started with DeepSpeed on Azure

This tutorial will help you get started with DeepSpeed on Azure.

BingBertSQuAD Fine-tuning

BERT Pre-training

CIFAR-10 Tutorial

Train your first model with DeepSpeed!

Communication Logging

Log all DeepSpeed communication calls

Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training

Watch out! On 12/12/2022, we released the DeepSpeed Data Efficiency Library, which provides more general curriculum learning support. This legacy curriculum lea...

DeepSpeed Data Efficiency: A composable library that makes better use of data, increases training efficiency, and improves model quality

What is DeepSpeed Data Efficiency: DeepSpeed Data Efficiency is a library purposely built to make better use of data, increase training efficiency, and impr...

DeepNVMe

This tutorial will show how to use DeepNVMe for data transfers between persistent storage and tensors residing in host or device memory. DeepNVMe improves th...

Domino

Domino hides nearly all communication behind computation in tensor-parallel training. See the Domino tutorial in the DeepSpeedExamples repo.

Getting Started with DeepSpeed-Ulysses for Training Transformer Models with Extreme Long Sequences

In this tutorial we describe how to enable DeepSpeed-Ulysses for Megatron-DeepSpeed. DeepSpeed-Ulysses is a simple but highly communication and memory effici...

DS4Sci_EvoformerAttention eliminates memory explosion problems for scaling Evoformer-centric structural biology models

What is DS4Sci_EvoformerAttention: DS4Sci_EvoformerAttention is a collection of kernels built to scale the Evoformer computation to a larger number of sequen...

Flops Profiler

Measure the parameters, latency, and floating-point operations of your model

DCGAN Tutorial

Train your first GAN model with DeepSpeed!

Getting Started

First steps with DeepSpeed

Getting Started with DeepSpeed for Inferencing Transformer based Models

DeepSpeed-Inference v2 is here and it’s called DeepSpeed-FastGen! For the best performance, latest features, and newest model support please see our DeepS...

Training your large model with DeepSpeed

Overview

Learning Rate Range Test

This tutorial shows how to perform learning rate range tests in PyTorch.

Megatron-LM GPT2

If you haven’t already, we advise you to first read through the Getting Started guide before stepping through this tutorial.

Mixed Precision ZeRO++

Mixed Precision ZeRO++ (MixZ++) is a set of optimization strategies based on ZeRO and ZeRO++ to improve the efficiency and reduce memory usage for large mode...

Getting Started with DeepSpeed-MoE for Inferencing Large-Scale MoE Models

DeepSpeed-MoE Inference introduces several important features on top of the inference optimization for dense models (DeepSpeed-Inference blog post). It embra...

Mixture of Experts for NLG models

In this tutorial, we show how to apply DeepSpeed Mixture of Experts (MoE) to NLG models, which reduces the training cost by 5 times and reduces the MoE m...

Mixture of Experts

DeepSpeed v0.5 introduces new support for training Mixture of Experts (MoE) models. MoE models are an emerging class of sparsely activated models that have s...

DeepSpeed Model Compression Library

What is DeepSpeed Compression: DeepSpeed Compression is a library purposely built to make it easy to compress models for researchers and practitioners while ...

Monitor

Monitor your model’s training metrics live and log for future analysis

1-Cycle Schedule

This tutorial shows how to implement 1Cycle schedules for learning rate and momentum in PyTorch.

1-bit Adam: Up to 5x less communication volume and up to 3.4x faster training

Note: On 03/07/2022 we released 0/1 Adam, a new communication-efficient Adam optimizer that partially follows 1-bit Adam’s design. Compared to the ...

1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB’s Convergence Speed

Watch out! 1) The NCCL-based implementation requires PyTorch >= 1.8 (and NCCL >= 2.8.3 when you have 64 or more GPUs). See details below. 2) Although 1...

Pipeline Parallelism

DeepSpeed v0.3 includes new support for pipeline parallelism! Pipeline parallelism improves both the memory and compute efficiency of deep learning training ...

Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

In this tutorial, we introduce progressive layer dropping (PLD) in DeepSpeed and provide examples of how to use it. PLD makes it possible to train Tra...

Using PyTorch Profiler with DeepSpeed for performance debugging

This tutorial describes how to use PyTorch Profiler with DeepSpeed.

DeepSpeed Sparse Attention

In this tutorial we describe how to use DeepSpeed Sparse Attention (SA) and its building-block kernels. The easiest way to use SA is through DeepSpeed launch...

DeepSpeed Transformer Kernel

This tutorial shows how to enable the DeepSpeed transformer kernel and set its different configuration parameters.

Arctic Long Sequence Training (ALST) for HF Transformers integration

Ulysses Sequence Parallelism for Hugging Face (HF) Transformers implements an efficient way of training on long sequences by employing sequence parallelis...

DeepSpeed Ulysses-Offload

DeepSpeed Ulysses-Offload is a chunking and offloading scheme for long-context transformer model training, built on top of ZeRO and DeepSpeed Ulysses. I...

Universal Checkpointing with DeepSpeed: A Practical Guide

The DeepSpeed Universal Checkpointing feature is a powerful tool for saving and loading model checkpoints in a way that is both efficient and flexible, enabling ...

ZeRO-Offload

ZeRO-3 Offload consists of a subset of features in our newly released ZeRO-Infinity. Read our ZeRO-Infinity blog to learn more!

Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam

Watch out! 1) The NCCL-based implementation requires PyTorch >= 1.8 (and NCCL >= 2.8.3 when you have 64 or more GPUs). See details below. 2) Although 0...

Zero Redundancy Optimizer

ZeRO++

ZeRO++ is a system of communication optimization strategies built on top of ZeRO to offer unmatched efficiency for large model training regardless of the sca...