Getting Started with DeepSpeed for Inferencing Transformer-based Models

DeepSpeed-Inference introduces several features to efficiently serve transformer-based PyTorch models. It supports model parallelism (MP) to fit large models that would otherwise not fit in GPU memory. Even for smaller models, MP can be used to reduce latency for inference. To further reduce latency and cost, we introduce inference-customized kernels. Finally, we propose a novel approach to quantize models, called MoQ, to both shrink the model and reduce the inference cost in production. For more details on the inference related optimizations in DeepSpeed, please refer to our blog post.

DeepSpeed provides a seamless inference mode for compatible transformer-based models trained using DeepSpeed, Megatron, and HuggingFace, meaning that we don't require any change on the modeling side such as exporting the model or creating a different checkpoint from your trained checkpoints. To run inference on multi-GPU for compatible models, simply provide the model parallelism degree and the checkpoint information or the model which is already loaded with a checkpoint, and DeepSpeed will do the rest. It will automatically partition the model as necessary, inject compatible high-performance kernels into your model, and manage the inter-GPU communication. For a list of compatible models please see here.

Initializing for Inference

To run inference with DeepSpeed, use the init_inference API to load the model for inference. Here, you can specify the MP degree, and if the model has not been loaded with the appropriate checkpoint, you can also provide the checkpoint description using a json file. To inject the high-performance kernels, you can pass in the replace_method as 'auto' for the compatible models, or define a new policy in the replace_policy class and pass in an injection_policy that specifies the different parts of a Transformer layer, such as the attention and feed-forward parts. The injection_policy shows the mapping between the parameters of the original layer implementation and the inference-customized Transformer layer.

# create the model
if args.pre_load_checkpoint:
    model = model_class.from_pretrained(args.model_name_or_path)
else:
    model = model_class()

import deepspeed

# Initialize the DeepSpeed-Inference engine with MP=2, fp16 kernels,
# and automatic kernel injection for compatible models
ds_engine = deepspeed.init_inference(model,
                                     mp_size=2,
                                     dtype=torch.half,
                                     checkpoint=None if args.pre_load_checkpoint else args.checkpoint_json,
                                     replace_method='auto')
model = ds_engine.module
output = model('Input String')
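
For a model that replace_method='auto' does not recognize, you can instead pass an injection_policy dictionary that names the output linear layers of the attention and feed-forward blocks, which DeepSpeed needs for the tensor-slicing all-reduce. Below is a minimal sketch, assuming a HuggingFace T5 model; the layer names come from transformers' T5Block implementation:

# a minimal sketch: explicit kernel injection for a model 'auto' does not cover.
# The tuple names the output projections of the attention and MLP blocks,
# which DeepSpeed all-reduces when tensor-slicing across GPUs.
from transformers.models.t5.modeling_t5 import T5Block

ds_engine = deepspeed.init_inference(model,
                                     mp_size=2,
                                     dtype=torch.half,
                                     injection_policy={T5Block: ('SelfAttention.o', 'EncDecAttention.o', 'DenseReluDense.wo')})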

Loading Checkpoints

For the models trained using HuggingFace, the model checkpoint can be pre-loaded using the from_pretrained API as shown above. For Megatron-LM models trained with model parallelism, we require a list of all the model parallel checkpoints passed in a JSON config. Below we show how to load a Megatron-LM checkpoint trained using MP=2.

  "type": "Megatron",
    "version": 0.0,
    "checkpoints": [

For models that are trained with DeepSpeed, the checkpoint json file only requires storing the path to the model checkpoints.

  "type": "DeepSpeed",
    "version": 0.3,
    "checkpoint_path": "path_to_checkpoints",

DeepSpeed supports running a different MP degree for inference than was used in training. For example, a model trained without any MP can be run with MP=2, or a model trained with MP=4 can be inferenced without any MP. DeepSpeed automatically merges or splits checkpoints during initialization as necessary.
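
For instance, serving the MP=4 model on two GPUs is just a matter of setting mp_size, as in the minimal sketch below (assuming its four shards are listed in checkpoint.json):

# a sketch: checkpoint saved with MP=4, served with MP=2;
# DeepSpeed merges the four shards into two partitions at load time
ds_engine = deepspeed.init_inference(model,
                                     mp_size=2,
                                     dtype=torch.half,
                                     checkpoint='checkpoint.json')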


Simply use the DeepSpeed launcher deepspeed to launch your inference on multiple GPUs.

deepspeed --num_gpus 2 <client_entry.py>

End-to-End GPT NEO 2.7B Inference

DeepSpeed inference can be used in conjunction with the HuggingFace pipeline API. Below is the end-to-end client code combining DeepSpeed inference with the HuggingFace pipeline for generating text using the GPT-NEO-2.7B model.

# Filename: gpt-neo-2.7b-generation.py
import os
import deepspeed
import torch
import transformers
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B', device=local_rank)

generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float,
                                           replace_method='auto')

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if torch.distributed.get_rank() == 0:
    print(string)

The above script modifies the model in the HuggingFace text-generation pipeline to use DeepSpeed inference. Note that here we can run the inference on multiple GPUs using the model-parallel tensor-slicing across GPUs even though the original model was trained without any model parallelism and the checkpoint is also a single-GPU checkpoint. To run the client, simply run:

deepspeed --num_gpus 2 gpt-neo-2.7b-generation.py

Below is an output of the generated text. You can try other prompts and see how this model generates text.

[{
    'generated_text': 'DeepSpeed is a blog about the future. We will consider the future of work, the future of living, and the future of society. We will focus in particular on the evolution of living conditions for humans and animals in the Anthropocene and its repercussions'
}]

Datatypes and Quantized Models

DeepSpeed inference supports fp32, fp16, and int8 parameters. The appropriate datatype can be set using dtype in init_inference, and DeepSpeed will choose the kernels optimized for that datatype. For quantized int8 models, if the model was quantized using DeepSpeed's quantization approach (MoQ), the setting by which the quantization was applied needs to be passed to init_inference. This setting includes the number of groups used for quantization and whether the MLP part of the transformer is quantized with extra grouping. For more information on these parameters, please visit our quantization tutorial.

import deepspeed
import deepspeed.module_inject as module_inject
# int8 inference with the MoQ settings used during quantization
model = deepspeed.init_inference(model,
                                 dtype=torch.int8,
                                 quantization_setting=(quantize_groups,
                                                       mlp_extra_grouping))
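
For non-quantized models, the same call covers fp32 and fp16. A minimal sketch selecting the fp16 kernels:

# a sketch: fp16 inference without quantization
model = deepspeed.init_inference(model,
                                 mp_size=1,
                                 dtype=torch.half,
                                 replace_method='auto')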

Congratulations! You have completed the DeepSpeed inference tutorial.