DeepSpeed is Microsoft’s open-source library for PyTorch that focuses on large-scale training and inference through efficient memory management.
Its key components for inference are custom CUDA kernels for common LLM operations, such as attention and MLP blocks, and tensor parallelism, which splits a model's weights across GPUs for efficient memory usage and low latency.
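To make the tensor-parallelism idea concrete, here is a minimal sketch in plain Python. It is not DeepSpeed's implementation and uses no DeepSpeed API: the function names (`mlp_reference`, `mlp_tensor_parallel`), the tiny integer weights, and the sharding scheme (splitting a two-layer MLP along its hidden dimension and summing the partial outputs, as an all-reduce would) are assumptions chosen for illustration. Each simulated rank holds only `1 / tp_size` of each weight matrix, which is where the memory saving comes from.

```python
def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def relu(vec):
    """Element-wise ReLU."""
    return [max(0, v) for v in vec]


def mlp_reference(x, w1, w2):
    """Unsharded two-layer MLP: y = W2 * relu(W1 * x)."""
    return matvec(w2, relu(matvec(w1, x)))


def mlp_tensor_parallel(x, w1, w2, tp_size):
    """Same MLP with weights sharded along the hidden dimension.

    Each simulated rank holds a slice of W1's rows and the matching
    slice of W2's columns, computes a partial output, and the partial
    outputs are summed (the role an all-reduce plays on real GPUs).
    """
    hidden = len(w1)
    assert hidden % tp_size == 0, "hidden size must divide evenly across ranks"
    shard = hidden // tp_size

    partials = []
    for rank in range(tp_size):
        lo, hi = rank * shard, (rank + 1) * shard
        w1_shard = w1[lo:hi]                    # this rank's hidden units
        w2_shard = [row[lo:hi] for row in w2]   # matching input columns of W2
        h_shard = relu(matvec(w1_shard, x))     # ReLU is element-wise, so it shards cleanly
        partials.append(matvec(w2_shard, h_shard))

    # Simulated all-reduce: sum the partial outputs across ranks.
    return [sum(vals) for vals in zip(*partials)]


# Hypothetical toy weights: d_in = 3, hidden = 4, d_out = 2.
x = [1, -2, 3]
w1 = [[1, 0, 2], [-1, 3, 0], [0, 1, 1], [2, -1, -1]]
w2 = [[1, 2, 3, 4], [0, -1, 1, 2]]

print(mlp_reference(x, w1, w2))           # [14, 3]
print(mlp_tensor_parallel(x, w1, w2, 2))  # [14, 3]
```

The sharded and unsharded results match because the activation is applied element-wise, so only one summation across ranks is needed per MLP block. Real implementations run the shards concurrently on separate GPUs.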
DeepSpeed also includes several architecture-specific and quantization-specific optimizations.
DeepSpeed is a good fit for well-optimized training and inference at large scale (a thousand or more GPUs).