NVML initialize error

How to fix - Failed to initialize NVML: Unknown Error

  • When you start an NVIDIA container on Docker, you may get:

    • $ docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.6.2-base-ubuntu22.04 nvidia-smi
    • Failed to initialize NVML: Unknown Error
  • To solve the problem, follow the steps in the link below:

    • https://bobcares.com/blog/docker-failed-to-initialize-nvml-unknown-error/
  • From the website

How to resolve docker failed to initialize NVML unknown error?

After a random amount of time the GPUs become unavailable inside all running containers, and nvidia-smi returns the error below:

“Failed to initialize NVML: Unknown Error”. Restarting all the containers fixes the issue and the GPUs become available again.

Today, let us see the methods followed by our support techs to resolve this Docker error.

Method 1:

1) Kernel parameter

The easiest way to verify the presence of the systemd.unified_cgroup_hierarchy=false parameter is to check /proc/cmdline:

cat /proc/cmdline
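If you only want to see whether the parameter is present, a quick grep works (a sketch; it only inspects the running kernel's command line):

grep -o 'systemd.unified_cgroup_hierarchy=[^ ]*' /proc/cmdline || echo "parameter not set"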

Setting the parameter itself is of course done through the boot loader configuration; /proc/cmdline cannot be edited at runtime, it only shows the command line the kernel was actually booted with.
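As a sketch of the boot loader route, on a GRUB-based distribution you would append the parameter to the kernel command line in /etc/default/grub, regenerate the GRUB configuration, and reboot (the exact file and update command vary by distribution; the other options shown are illustrative):

# /etc/default/grub -- keep your existing options and append the parameter
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=false"

# regenerate the boot configuration and reboot so the kernel picks it up
sudo update-grub    # Debian/Ubuntu; use grub2-mkconfig -o /boot/grub2/grub.cfg on RHEL-like systems
sudo reboot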

2) nvidia-container configuration

In the file

/etc/nvidia-container-runtime/config.toml

set the parameter

no-cgroups = false
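For orientation, the no-cgroups option normally lives in the [nvidia-container-cli] section of that file; a minimal excerpt (other keys omitted) would look like this:

# /etc/nvidia-container-runtime/config.toml (excerpt, other settings omitted)
[nvidia-container-cli]
# let the NVIDIA container CLI manage device cgroups itself
no-cgroups = false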

After that, restart Docker and run a test container:

sudo systemctl restart docker
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
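If the fix worked, nvidia-smi inside the container prints the usual GPU table instead of the NVML error. As an additional quick check (using the same image as above), you can list the GPUs the container sees:

sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi -L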