DeepSpeed

    Documentation

    • Feature Overview
    • Getting Started
      • Installation
      • Writing models
      • Training
      • Launching
    • Configuration
      • Batch size
      • Optimizer
      • Scheduler
      • Communication
      • FP16
      • AMP
      • Gradient Clipping
      • ZeRO optimizations
      • Logging
      • Flops Profiler
      • Activation checkpointing
      • Sparse Attention
    • Tutorials
      • Getting started
      • Getting started on Azure
      • BingBertSQuAD Fine-tuning
      • BERT Pre-training
      • CIFAR-10
      • Flops Profiler
      • GAN
      • Learning Rate Range Test
      • Megatron-LM GPT2
      • One-Cycle Schedule
      • One-Bit Adam
      • Pipeline Parallelism
      • Progressive Layer Dropping
      • Sparse Attention
      • Transformer Kernel
      • ZeRO-Offload
      • Zero Redundancy Optimizer (ZeRO)
    • Contributing
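
    The Getting Started and Configuration sections listed above cover installing DeepSpeed, wrapping a model, and controlling training behavior (batch size, optimizer, FP16, gradient clipping, ZeRO) through a config. As a quick orientation, the sketch below shows the typical wrapping pattern; the model and config values are illustrative placeholders rather than recommendations, and depending on the DeepSpeed version the config may instead be supplied as a JSON file path.

        import torch
        import deepspeed

        # Illustrative model; any torch.nn.Module can be wrapped the same way.
        model = torch.nn.Linear(512, 10)

        # Example config: the keys mirror the Configuration topics above
        # (batch size, optimizer, FP16, gradient clipping, ZeRO optimizations).
        # Values are placeholders, not tuned recommendations.
        ds_config = {
            "train_batch_size": 32,
            "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
            "fp16": {"enabled": True},
            "gradient_clipping": 1.0,
            "zero_optimization": {"stage": 1},
        }

        # deepspeed.initialize returns an engine that owns the optimizer,
        # FP16 handling, and ZeRO behavior described by the config.
        model_engine, optimizer, _, _ = deepspeed.initialize(
            model=model,
            model_parameters=model.parameters(),
            config=ds_config,
        )

    Training then runs through the engine (model_engine(inputs), model_engine.backward(loss), model_engine.step()), and the script is started with the deepspeed launcher, as covered in the Training and Launching subsections above.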

    Recent Posts

    Progressive Layer Dropping

    DeepSpeed Sparse Attention

    Training a Trillion Parameters with Pipeline Parallelism

    Up to 5x less communication and 3.4x faster training through 1-bit Adam

    DeepSpeed with 1-bit Adam: 5x less communication and 3.4x faster training

    10x bigger model training on a single GPU with ZeRO-Offload

    Powering 10x longer sequences and 6x faster execution through DeepSpeed Sparse Attention

    DeepSpeed Microsoft Research Webinar is now on-demand

    DeepSpeed Microsoft Research Webinar on August 6th, 2020

    Microsoft DeepSpeed achieves the fastest BERT training time

    ZeRO-2 & DeepSpeed: Shattering Barriers of Deep Learning Speed & Scale

    An Order-of-Magnitude Larger and Faster Training with ZeRO-2

    The Fastest and Most Efficient BERT Training through Optimized Transformer Kernels

    Turing-NLG: A 17-billion-parameter language model by Microsoft

    DeepSpeed was used to train the world’s largest language model.

    ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters

    Developed by Microsoft AI & Research.
