Installation Details
The quickest way to get started with DeepSpeed is via pip; this will install the latest release of DeepSpeed, which is not tied to specific PyTorch or CUDA versions. DeepSpeed includes several C++/CUDA extensions that we commonly refer to as our 'ops'. By default, all of these extensions/ops will be built just-in-time (JIT) using torch's JIT C++ extension loader, which relies on ninja to build and dynamically link them at runtime.
Note: PyTorch must be installed before installing DeepSpeed.
pip install deepspeed
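Since PyTorch is a prerequisite, a quick way to confirm your PyTorch and CUDA versions before installing (a minimal check using PyTorch's standard version attributes) is:
python -c "import torch; print(torch.__version__, torch.version.cuda)"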
After installation, you can validate your install and see which ops your machine is compatible with via the DeepSpeed environment report, using either ds_report or python -m deepspeed.env_report. We've found this report useful when debugging DeepSpeed install or compatibility issues.
ds_report
Pre-install DeepSpeed Ops
Sometimes we have found it useful to pre-install either some or all DeepSpeed C++/CUDA ops instead of using the JIT compiled path. In order to support pre-installation we introduce build environment flags to turn on/off building specific ops.
You can indicate to our installer (either install.sh or pip install) that you
want to attempt to install all of our ops by setting the DS_BUILD_OPS
environment variable to 1, for example:
DS_BUILD_OPS=1 pip install deepspeed
DeepSpeed will only install any ops that are compatible with your machine.
For more details on which ops are compatible with your system, please try our ds_report tool described above.
If you want to install only a specific op (e.g., FusedLamb), you can toggle it with DS_BUILD environment variables at installation time. For example, to install DeepSpeed with only the FusedLamb op, use:
DS_BUILD_FUSED_LAMB=1 pip install deepspeed
Available DS_BUILD options include:
- DS_BUILD_OPS toggles all ops
- DS_BUILD_CPU_ADAM builds the CPUAdam op
- DS_BUILD_FUSED_ADAM builds the FusedAdam op (from apex)
- DS_BUILD_FUSED_LAMB builds the FusedLamb op
- DS_BUILD_SPARSE_ATTN builds the sparse attention op
- DS_BUILD_TRANSFORMER builds the transformer op
- DS_BUILD_STOCHASTIC_TRANSFORMER builds the stochastic transformer op
- DS_BUILD_UTILS builds various optimized utilities
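If the individual DS_BUILD_* flags override the DS_BUILD_OPS default (which is how these flags are intended to compose; treat this as a hedged sketch rather than a guarantee for your release), you can combine them, for example to build all ops except sparse attention:
DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 pip install deepspeed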
To speed up the build-all process, you can parallelize the compilation process with:
DS_BUILD_OPS=1 pip install deepspeed --global-option="build_ext" --global-option="-j8"
This should complete the full build 2-3 times faster. You can adjust -j to specify how many CPU cores are to be used during the build; in the example it is set to 8 cores.
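If you want to use every available core, one option (assuming a Linux environment where nproc is available) is:
DS_BUILD_OPS=1 pip install deepspeed --global-option="build_ext" --global-option="-j$(nproc)"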
Install DeepSpeed from source
After cloning the DeepSpeed repo from GitHub, you can install DeepSpeed in JIT mode via pip (see below). This install should complete quickly since it is not compiling any C++/CUDA source files.
pip install .
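For reference, the full clone-and-install flow looks like this:
git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
pip install .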
For installs spanning multiple nodes we find it useful to install DeepSpeed using the install.sh script in the repo. This will build a python wheel locally and copy it to all the nodes listed in your hostfile (either given via --hostfile, or defaulting to /job/hostfile).
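For example, a sketch of that multi-node flow with an explicit hostfile (the path here is illustrative):
bash install.sh --hostfile /path/to/hostfile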
Building for the correct architectures
If you're getting the following error:
RuntimeError: CUDA error: no kernel image is available for execution on the device
when running deepspeed, it means that the CUDA extensions weren't built for the card you're trying to use.
When building from source, deepspeed will try to support a wide range of architectures, but under jit-mode it'll only support the archs visible at the time of building.
You can build specifically for a desired range of architectures by setting a TORCH_CUDA_ARCH_LIST env variable, like so:
TORCH_CUDA_ARCH_LIST="6.1;7.5;8.6" pip install ...
It will also make the build faster when you only build for a few architectures. This is also recommended in order to ensure your exact architecture is used. For a variety of technical reasons, a distributed pytorch binary isn't built to fully support all architectures, skipping binary-compatible ones, at a potential cost of underutilizing your card's full compute capabilities. To see which archs get included during the deepspeed build from source, save the log and grep for -gencode arguments.
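A minimal sketch of that log check (the log file name is illustrative):
pip install . 2>&1 | tee build.log
grep -- "-gencode" build.log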
The full list of NVIDIA GPUs and their compute capabilities can be found on NVIDIA's CUDA GPUs page (https://developer.nvidia.com/cuda-gpus).
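You can also query your local card's compute capability directly through PyTorch's standard CUDA API (assuming a CUDA-enabled torch install):
python -c "import torch; print(torch.cuda.get_device_capability())"
This prints a tuple such as (8, 0), which corresponds to the TORCH_CUDA_ARCH_LIST entry 8.0.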
Feature specific dependencies
Some DeepSpeed features require specific dependencies outside of the general dependencies of DeepSpeed.
- For Python package dependencies per feature/op, please see our requirements directory.
- We attempt to keep the system-level dependencies to a minimum; however, some features do require special system-level packages. Please see our ds_report tool output to see if you are missing any system-level packages for a given feature.
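For example, a per-feature requirements file from the repo's requirements directory can be installed directly with pip (the exact file name is illustrative and may differ by release):
pip install -r requirements/requirements-sparse_attn.txt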
Pre-compiled DeepSpeed builds from PyPI
Coming soon