NVIDIA Kubernetes Device Plugin Brings Temporal GPU Concurrency

Provisioning right-sized GPU acceleration for each workload is key to improving utilization and reducing the operational costs of deployment, whether on-premises or in the cloud.

To address the challenge of GPU utilization in Kubernetes (K8s) clusters, NVIDIA offers multiple GPU concurrency and sharing mechanisms to suit a broad range of use cases. The latest addition is the new GPU time-slicing APIs, now broadly available in Kubernetes with NVIDIA K8s Device Plugin 0.12.0 and the NVIDIA GPU Operator 1.11. Together, they enable multiple GPU-accelerated workloads to time-slice and run on a single NVIDIA GPU.
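As a sketch of what enabling this looks like, the device plugin accepts a sharing configuration (typically supplied through a ConfigMap when deploying via the GPU Operator). The exact schema is defined by the plugin's documentation; the replica count below is illustrative:

```yaml
# Illustrative time-slicing configuration for the NVIDIA K8s Device Plugin.
# With replicas: 4, each physical GPU on the node is advertised as four
# nvidia.com/gpu resources, so up to four pods can time-slice a single GPU.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```

Note that time-slicing oversubscribes the GPU: replicas share compute and memory without isolation between them, so the replica count should reflect how many workloads a single GPU can reasonably serve.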

When to share NVIDIA GPUs

Here are some example workloads that can benefit from sharing GPU resources for better utilization:

  • Low-batch inference serving, which may only process one input sample on the GPU
  • High-performance computing (HPC) applications, such as simulating photon propagation, that balance work between the CPU (to read and process inputs) and the GPU (to perform computation). Some HPC applications may not achieve high throughput on the GPU portion because of a bottleneck in CPU core performance.
  • Interactive development for ML model exploration using Jupyter notebooks
  • Spark-based data analytics applications, where tasks, the smallest units of work, run concurrently and benefit from better GPU utilization
  • Visualization or offline rendering applications that may be bursty in nature
  • Continuous integration/continuous delivery (CI/CD) pipelines that want to use any available GPUs for testing

GPU concurrency mechanisms

The NVIDIA GPU hardware, in conjunction with the CUDA programming model, provides a number of different concurrency mechanisms for improving GPU utilization. The mechanisms range from programming model APIs, where the applications need code changes to take advantage of concurrency, to system software and hardware partitioning including virtualization, which are transparent to applications (Figure 1).

Table 1 summarizes these technologies, including when to consider each concurrency mechanism.

| | Streams | MPS | Time-Slicing | MIG | vGPU |
|---|---|---|---|---|---|
| Partition Type | Single process | Logical | Temporal (single process) | Physical | Temporal & physical (VMs) |
| Max Partitions | Unlimited | 48 | Unlimited | 7 | Variable |
| SM Performance Isolation | No | Yes (by percentage, not partitioning) | Yes | Yes | Yes |
| Memory Protection | No | Yes | Yes | Yes | Yes |
| Memory Bandwidth QoS | No | No | No | Yes | Yes |
| Error Isolation | No | No | Yes | Yes | Yes |
| Cross-Partition Interop | Always | IPC | Limited IPC | Limited IPC | No |
| Reconfigure | Dynamic | At process launch | N/A | When idle | N/A |
| GPU Management (telemetry) | N/A | Limited GPU metrics | N/A | Yes – GPU metrics, support for containers | Yes – live migration and other industry virtualization tools |
| Target use cases (and when to use each) | Optimize for concurrency within a single application | Run multiple applications in parallel, but can deal with limited resiliency | Run multiple applications that are not latency-sensitive or can tolerate jitter | Run multiple applications in parallel that need resiliency and QoS | Support multi-tenancy on the GPU through virtualization, with VM management benefits |

Table 1. Comparison of GPU concurrency mechanisms

With this background, the rest of the post focuses on oversubscribing GPUs using the new time-slicing APIs in Kubernetes.
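As a minimal illustration, once time-slicing is enabled, a pod requests a shared GPU the same way it would request a dedicated one; the pod name and container image below are placeholders:

```yaml
# Hypothetical pod requesting one replica of a time-sliced GPU.
apiVersion: v1
kind: Pod
metadata:
  name: time-sliced-gpu-test   # placeholder name
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-sample
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1  # example image
    resources:
      limits:
        nvidia.com/gpu: 1   # one time-sliced replica, not a full physical GPU
```

With a replica count of 4 configured on the device plugin, four such pods can be scheduled onto a node with a single physical GPU, which is what makes oversubscription possible.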
