Automatic Tensor Parallelism (Training)

This tutorial covers Automatic Tensor Parallelism (AutoTP) for training, which combines tensor parallelism with ZeRO optimization. For inference-only tensor parallelism, see Automatic Tensor Parallelism (Inference).

Introduction
Quick Start
Custom Layer Specifications
Limitations

Introduction

The AutoTP Training API enables hybrid parallelism by combining:

Tensor Parallelism (TP): Split model weights across GPUs within a node
Data Parallelism (DP): Replicate model across GPU groups
ZeRO Optimization: Memory-efficient optimizer states (Stage 0, 1, or 2)

Tensor parallelism (TP) splits the computations and parameters of large layers across multiple GPUs so each rank holds only a shard of the weight matrix. This is an efficient way to train large-scale transformer models by reducing per-GPU memory pressure while keeping the layer math distributed across the TP group.
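To make the sharding idea concrete, here is a minimal sketch of the two common ways to split a linear layer, written with plain Python lists rather than real GPUs. It assumes the PyTorch nn.Linear weight layout of [out_features, in_features]. The helper names (matvec, column_parallel, row_parallel) are made up for illustration and are not part of DeepSpeed; the list concatenation and summation stand in for the all-gather and all-reduce collectives a real TP group would run.

```python
# Toy illustration of column- and row-parallel linear layers using plain lists.
# W has shape [out_features, in_features], as in torch.nn.Linear.

def matvec(W, x):
    """Compute y = W @ x for a matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def column_parallel(W, x, tp_size):
    """Each rank owns a slice of the output features (rows of W) and sees the full input."""
    chunk = len(W) // tp_size
    shards = [W[r * chunk:(r + 1) * chunk] for r in range(tp_size)]
    partial_outputs = [matvec(shard, x) for shard in shards]
    # Concatenating the slices plays the role of an all-gather.
    return [y for part in partial_outputs for y in part]

def row_parallel(W, x, tp_size):
    """Each rank owns a slice of the input features (columns of W) and the matching slice of x."""
    chunk = len(x) // tp_size
    partial_sums = []
    for r in range(tp_size):
        lo, hi = r * chunk, (r + 1) * chunk
        W_shard = [row[lo:hi] for row in W]
        partial_sums.append(matvec(W_shard, x[lo:hi]))
    # Summing the per-rank results plays the role of an all-reduce.
    return [sum(vals) for vals in zip(*partial_sums)]

W = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
x = [1, 0, -1, 2]

full = matvec(W, x)
print(full)
print(column_parallel(W, x, tp_size=2) == full)
print(row_parallel(W, x, tp_size=2) == full)
```

In both cases each simulated rank stores only half of W, which is where the per-GPU memory saving comes from.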

Quick Start

Basic Usage

AutoTP training can be enabled entirely through the DeepSpeed config. When tensor_parallel is set in the config, deepspeed.initialize(...) applies AutoTP sharding during engine initialization, so the training loop itself does not change.

import torch
import deepspeed
from transformers import AutoModelForCausalLM

# 1. Create your model and an optimizer for its parameters
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # example optimizer; choose your own

# 2. Define the DeepSpeed config with tensor_parallel settings
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},
    "tensor_parallel": {"autotp_size": 4},
}

# 3. Initialize DeepSpeed with AutoTP + ZeRO
# `mpu` is a model parallel unit that you create elsewhere
# (optional if you provide tp_group elsewhere).
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config,
    mpu=mpu,
)

# 4. Train as usual
# `dataloader` is your own iterable of batches with "input_ids" and "labels".
for batch in dataloader:
    outputs = engine(input_ids=batch["input_ids"], labels=batch["labels"])
    engine.backward(outputs.loss)
    engine.step()

Compatibility note: For backward compatibility, you can still call set_autotp_mode(training=True) and deepspeed.tp_model_init(...), but they are not required when the DeepSpeed config provides the necessary tensor_parallel settings.

Preset-based Sharding

If your model matches a built-in preset, set tensor_parallel.preset_model in the DeepSpeed config:

{
    "train_batch_size": 8,
    "train_micro_batch_size_per_gpu": 1,
    "bf16": { "enabled": true },
    "zero_optimization": { "stage": 2 },
    "tensor_parallel": {
        "autotp_size": 4,
        "preset_model": "llama"
    }
}

For the list of available presets, see supported models.
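The batch-size fields in the config above interact with autotp_size, so it can help to work through the arithmetic. The sketch below assumes the usual DeepSpeed relationship train_batch_size = train_micro_batch_size_per_gpu × gradient accumulation steps × data-parallel size, and further assumes that the data-parallel size under AutoTP is the world size divided by autotp_size. The helper batch_layout and the 8-GPU world size are hypothetical; verify the numbers against your own launch setup.

```python
def batch_layout(world_size, autotp_size, train_batch_size, micro_batch_per_gpu):
    """Derive the data-parallel size and gradient accumulation steps.

    Assumes train_batch_size = micro_batch_per_gpu * grad_accum_steps * dp_size,
    where dp_size counts TP groups rather than individual GPUs.
    """
    if world_size % autotp_size != 0:
        raise ValueError("world_size must be a multiple of autotp_size")
    dp_size = world_size // autotp_size
    per_step = micro_batch_per_gpu * dp_size
    if train_batch_size % per_step != 0:
        raise ValueError("train_batch_size must be a multiple of micro_batch_per_gpu * dp_size")
    return {"dp_size": dp_size, "grad_accum_steps": train_batch_size // per_step}

# Values from the preset config, on a hypothetical 8-GPU job.
layout = batch_layout(world_size=8, autotp_size=4, train_batch_size=8, micro_batch_per_gpu=1)
print(layout)
```

Under these assumptions, 8 GPUs with autotp_size 4 form 2 data-parallel replicas, so a global batch of 8 with a micro batch of 1 implies 4 gradient accumulation steps.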

Custom Patterns

If you are training a custom model, define regex-based patterns and partition rules in tensor_parallel.partition_config:

{
    "tensor_parallel": {
        "autotp_size": 4,
        "partition_config": {
            "use_default_specs": false,
            "layer_specs": [
                {
                    "patterns": [".*\\.o_proj\\.weight$", ".*\\.down_proj\\.weight$"],
                    "partition_type": "row"
                },
                {
                    "patterns": [".*\\.[qkv]_proj\\.weight$"],
                    "partition_type": "column"
                },
                {
                    "patterns": [".*\\.gate_up_proj\\.weight$"],
                    "partition_type": "column",
                    "shape": [2, -1],
                    "partition_dim": 0
                }
            ]
        }
    }
}
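Before launching a job, it can be useful to check which parameter names each pattern actually matches. The sketch below replays the patterns from the config above with Python's re module. Note that the JSON escape `\\.` becomes `\.` in the regex itself. The helper match_partition_type, the first-match-wins ordering, and the sample parameter names are assumptions for illustration; this is not DeepSpeed's matching code.

```python
import re

# Same patterns as the JSON config, written as Python raw strings.
layer_specs = [
    {"patterns": [r".*\.o_proj\.weight$", r".*\.down_proj\.weight$"], "partition_type": "row"},
    {"patterns": [r".*\.[qkv]_proj\.weight$"], "partition_type": "column"},
    {"patterns": [r".*\.gate_up_proj\.weight$"], "partition_type": "column"},
]

def match_partition_type(param_name, specs):
    """Return the partition_type of the first spec whose pattern matches, else None."""
    for spec in specs:
        if any(re.match(pattern, param_name) for pattern in spec["patterns"]):
            return spec["partition_type"]
    return None

# Hypothetical parameter names in the style of a Hugging Face decoder model.
sample_names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.o_proj.weight",
    "model.layers.0.mlp.gate_up_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.embed_tokens.weight",
]

for name in sample_names:
    print(name, "->", match_partition_type(name, layer_specs))
```

In practice you would iterate over model.named_parameters() instead of a hand-written list, and look for parameters that unexpectedly map to None.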

Custom Layer Specifications

For models not covered by presets, define custom layer specs under tensor_parallel.partition_config, using the same layer_specs format shown in Custom Patterns above. Each spec pairs a list of regex patterns with a partition_type, plus optional shape and partition_dim fields for fused weights.

Fused Layers with Unequal Sub-parameters (GQA)

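For Grouped Query Attention (GQA), the fused QKV weight contains Q, K, and V blocks of different sizes, so list the sub-parameter sizes explicitly in shape. In the config below, q_size and kv_size are placeholders; substitute the integer sizes of the Q and K/V blocks for your model.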

{
    "tensor_parallel": {
        "partition_config": {
            "layer_specs": [
                {
                    "patterns": [".*\\.qkv_proj\\.weight$"],
                    "partition_type": "column",
                    "shape": [[q_size, kv_size, kv_size], -1],
                    "partition_dim": 0
                }
            ]
        }
    }
}
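The following sketch shows one way to read the shape entry above: each fused block is split evenly across the TP ranks, so every rank receives a slice of Q, a slice of K, and a slice of V. This is an interpretation for illustration, not DeepSpeed's implementation. The helper fused_shard_ranges is hypothetical, as are the head counts, and q_size and kv_size are assumed to equal the head count times the head dimension.

```python
def fused_shard_ranges(sub_sizes, tp_size, rank):
    """Return the [start, end) row ranges of a fused weight owned by one TP rank.

    sub_sizes lists the row count of each fused block, e.g. [q_size, kv_size, kv_size].
    Each block is split evenly, so every rank gets a slice of every block.
    """
    ranges = []
    offset = 0
    for size in sub_sizes:
        if size % tp_size != 0:
            raise ValueError(f"block of {size} rows is not divisible by tp_size={tp_size}")
        chunk = size // tp_size
        start = offset + rank * chunk
        ranges.append((start, start + chunk))
        offset += size
    return ranges

# Hypothetical GQA dimensions: 32 query heads, 8 key/value heads, head_dim 128.
num_heads, num_kv_heads, head_dim = 32, 8, 128
q_size = num_heads * head_dim      # rows in the Q block
kv_size = num_kv_heads * head_dim  # rows in each of the K and V blocks

for rank in range(4):
    print(rank, fused_shard_ranges([q_size, kv_size, kv_size], tp_size=4, rank=rank))
```

A naive contiguous split of the same fused weight would hand rank 0 nothing but Q rows, which is why the sub-parameter sizes need to be spelled out. The equal-sized case, such as "shape": [2, -1] for gate_up_proj, corresponds to two blocks of the same size.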

Limitations

ZeRO Stage 3 not supported: AutoTP currently only works with ZeRO stages 0, 1, and 2.
TP size must divide model dimensions: The tensor parallel size must evenly divide the attention head count and hidden dimensions.
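These two limitations are easy to check before starting a job. The sketch below is a hypothetical pre-flight helper, not part of DeepSpeed. It reads autotp_size and the ZeRO stage from a config dict shaped like the ones in this tutorial, and takes the head count and hidden size as plain arguments (with Hugging Face models these typically come from the model config).

```python
def check_autotp_config(ds_config, num_attention_heads, hidden_size):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    tp_size = ds_config.get("tensor_parallel", {}).get("autotp_size", 1)
    stage = ds_config.get("zero_optimization", {}).get("stage", 0)

    if stage not in (0, 1, 2):
        problems.append(f"ZeRO stage {stage} is not supported with AutoTP (use 0, 1, or 2)")
    if num_attention_heads % tp_size != 0:
        problems.append(f"autotp_size={tp_size} does not divide {num_attention_heads} attention heads")
    if hidden_size % tp_size != 0:
        problems.append(f"autotp_size={tp_size} does not divide hidden size {hidden_size}")
    return problems

good = {"zero_optimization": {"stage": 2}, "tensor_parallel": {"autotp_size": 4}}
bad = {"zero_optimization": {"stage": 3}, "tensor_parallel": {"autotp_size": 3}}

print(check_autotp_config(good, num_attention_heads=32, hidden_size=4096))
for problem in check_autotp_config(bad, num_attention_heads=32, hidden_size=4096):
    print(problem)
```

Depending on the architecture, other dimensions (for example an MLP intermediate size) may need the same divisibility check; the helper covers only the two dimensions named in the list above.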