Leveraging Nvidia GPU on Kubernetes for LLM Chatbot Deployment

With the rapid advancement of artificial intelligence and large language models (LLMs), more and more companies and developers want to integrate these models into their own chatbot systems. This article guides you through deploying a high-performance LLM chatbot in a Kubernetes environment with an Nvidia GPU, covering everything from the essential drivers and tooling to the detailed deployment steps.

Environment

We will use the following virtual machine (VM) configuration as the foundation for deployment:

  • CPU: AMD EPYC 7413, 16 cores
  • RAM: 16GB
  • GPU: Tesla P4 8GB
  • OS: Ubuntu 22.04
  • Kubernetes CRI: containerd

Providing Nvidia GPU in Kubernetes

Install Nvidia GPU Driver

On Ubuntu, you can install Nvidia's proprietary driver via the apt package manager. First, run the following command to list the available Nvidia driver versions:

apt search nvidia-driver

This tutorial will demonstrate installation using the Nvidia driver version 525:

sudo apt install nvidia-driver-525-server

After installation completes, you can verify the driver has been successfully installed by running the following command:

sudo dkms status

lsmod | grep nvidia

If the installation succeeded, the output will look like:

nvidia-srv/525.147.05, 5.15.0-97-generic, x86_64: installed

nvidia_uvm           1363968  2
nvidia_drm             69632  0
nvidia_modeset       1241088  1 nvidia_drm
nvidia              56365056  218 nvidia_uvm,nvidia_modeset
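As a further sanity check, nvidia-smi should now report the Tesla P4 together with the driver and CUDA versions:

nvidia-smi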

Install Nvidia Container Toolkit

The Nvidia Container Toolkit enables containers to access GPU resources directly. Install it with the following commands:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt update
sudo apt install -y nvidia-container-toolkit

After installation, configure containerd to use the Nvidia container runtime:

sudo nvidia-ctk runtime configure --runtime=containerd

This command modifies /etc/containerd/config.toml, adding the following configuration to enable GPU support:

version = 2

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
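After the file has been updated, restart containerd so the new runtime is picked up:

sudo systemctl restart containerd

Optionally, you can smoke-test the runtime straight from containerd; the CUDA image tag below is an assumption, so substitute any CUDA base image you have access to:

sudo ctr image pull docker.io/nvidia/cuda:12.2.0-base-ubuntu22.04
sudo ctr run --rm --gpus 0 docker.io/nvidia/cuda:12.2.0-base-ubuntu22.04 gpu-test nvidia-smi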

Install Nvidia Device Plugin

Use Helm to install the Nvidia device plugin so Kubernetes can detect and allocate GPU resources:

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.14.4
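Once the device plugin pod is running, the GPU node should advertise an nvidia.com/gpu resource under both Capacity and Allocatable, which you can confirm with:

kubectl describe nodes | grep nvidia.com/gpu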

The chart's default values file can be found in the k8s-device-plugin repository, and by specifying an appropriate nodeSelector you can ensure the device plugin is installed only on nodes that actually have GPU hardware.
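A minimal values override might look like the sketch below; the gpu=true label is only an example, so first label your GPU nodes accordingly (kubectl label node <node-name> gpu=true):

# values.yaml — sketch: schedule the device plugin only on labeled GPU nodes
nodeSelector:
  gpu: "true"

Pass it to the install by adding -f values.yaml to the helm upgrade command above.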

Use Nvidia GPU in Pod

To let a Pod use the Nvidia GPU, request it under resources.limits in the container spec of your Kubernetes YAML file:

spec:
  containers:
    - resources:
        limits:
          nvidia.com/gpu: 1
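A complete smoke-test Pod might look like this. It is a minimal sketch: the nvidia/cuda image tag is an assumption (use any CUDA base image you like), and if nvidia was not set as containerd's default runtime you may additionally need a RuntimeClass that points to it:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: OnFailure
  # runtimeClassName: nvidia  # only needed if nvidia is not the default runtime
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04  # assumed tag; substitute as needed
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1

If everything is wired up correctly, kubectl logs gpu-smoke-test prints the familiar nvidia-smi table.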

Deploy LLM Chatbot

We will use ollama and open-webui as the LLM chatbot stack; both projects already provide Kubernetes-compatible YAML manifests and Helm charts that simplify deployment.

The following steps deploy them with Kustomize:

git clone https://github.com/open-webui/open-webui.git
cd open-webui
kubectl apply -k kubernetes/manifest

After it completes, you will see two running pods in the open-webui namespace. You can access the Web UI via Ingress or NodePort; the default port is 8080.

kubectl get pods -n open-webui
NAME                                   READY   STATUS    RESTARTS   AGE
ollama-0                               1/1     Running   0          3d3h
open-webui-deployment-9d6ff55b-9fq7r   1/1     Running   0          4d19h
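If you have not set up Ingress or a NodePort yet, a quick way to reach the UI is port-forwarding the deployment (the deployment name is taken from the pod listing above):

kubectl port-forward -n open-webui deployment/open-webui-deployment 8080:8080

The UI is then available at http://localhost:8080.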

[Image: login page]

By default, a new-user registration page opens on first visit. After registering, you can log in with that username and password and will see an interface remarkably similar to ChatGPT:

[Image: open-webui chat interface]

You can download an LLM model into ollama from the settings page:

[Image: model download page]
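Alternatively, you can pull a model directly inside the ollama pod; llama2 here is just an example model name:

kubectl exec -n open-webui ollama-0 -- ollama pull llama2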

After downloading, select the desired model to start chatting with the chatbot:

[Image: chatting with the chatbot]
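You can also bypass the UI and talk to ollama's REST API, which listens on port 11434 by default; a quick sketch using the pod name from earlier:

kubectl port-forward -n open-webui ollama-0 11434:11434 &
curl http://localhost:11434/api/generate -d '{"model": "llama2", "prompt": "Hello"}'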

Summary

Through the steps above, we have enabled Nvidia GPU support in a Kubernetes environment and stood up a fully functional LLM chatbot website. This not only deepens our understanding of Kubernetes and Nvidia GPU deployment, but also provides a practical reference for building high-performance computing applications and interactive AI services.
