Install Docker and Nvidia-Docker

Installing Docker and Nvidia-Docker on Ubuntu 20

Posted by WW on April 6, 2021

Notes on installing Docker and Nvidia-Docker on Ubuntu 20.

Install Docker

Docker CE on Ubuntu can be set up using Docker's official convenience script:

curl https://get.docker.com | sh \
  && sudo systemctl --now enable docker
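
Once the script finishes, a quick sanity check (hello-world is just a minimal test image):

sudo docker run --rm hello-world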

On Ubuntu 14.04, restart Docker with:

sudo service docker restart

Add your user to the docker group so Docker can be used without root:

sudo usermod -aG docker $USER

Apply the new group membership without logging out:

su - ${USER}
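
After that, docker should work without sudo; a quick check:

groups
docker ps
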
Install docker-compose
pip install docker-compose
Enable docker-compose command-line completion:
sudo curl -L https://raw.githubusercontent.com/docker/compose/1.25.0/contrib/completion/bash/docker-compose -o /etc/bash_completion.d/docker-compose
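
To use the completion in the current shell without opening a new one (assuming bash):

source /etc/bash_completion.d/docker-compose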

Common commands

Start a container:
docker-compose run --entrypoint bash ros-deploy2
Map ports (not mapped by default):
docker-compose run --service-ports moveit-ssh
Attach to a running container:
docker-compose exec ros-deploy2 bash
Rebuild without cache:
docker-compose build --no-cache coder
Kill all containers:
docker-compose kill
Remove all stopped containers:
docker-compose rm
Kill and remove:
docker-compose down
Remove stopped containers:
docker container prune
Remove dangling images:
docker rmi $(docker images --filter "dangling=true" -q --no-trunc)
Export/import a container (works):
docker export -o deploy.tar ros_ros-deploy2_run_8a13a1e8ffe8
docker import deploy.tar deploy

Exit status codes

From "What is the authoritative list of Docker Run exit codes?" (Stack Overflow):

  • 125: docker run itself fails
  • 126: contained command cannot be invoked
  • 127: if contained command cannot be found
  • 128 + n: fatal error signal n:
    • 130 = (128+2) Container terminated by Control-C
    • 137 = (128+9) Container received a SIGKILL
    • 143 = (128+15) Container received a SIGTERM
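
To check the exit code of a stopped container (my_container is a placeholder name):

docker inspect --format '{{.State.ExitCode}}' my_container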

Configure a proxy

docker-proxy

dockerd proxy

docker pull is executed by the dockerd daemon, so the proxy must be configured in dockerd's environment. That environment is managed by systemd, so this is effectively a systemd configuration.

sudo mkdir -p /etc/systemd/system/docker.service.d
sudo gedit /etc/systemd/system/docker.service.d/http-proxy.conf

In this proxy.conf file (any *.conf name works), add the following:

[Service]
# NO_PROXY is optional and can be removed if not needed
# Change proxy_url to your proxy IP or FQDN and proxy_port to your proxy port
# For a proxy server that requires username and password authentication, add the username and password to the URL (see example below)

# Example without authentication
# Environment="HTTP_PROXY=http://127.0.0.1:8889" "NO_PROXY=localhost,127.0.0.0/8"

# Example with authentication
# Environment="HTTP_PROXY=http://username:password@proxy_url:proxy_port" "NO_PROXY=localhost,127.0.0.0/8"
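
The whole drop-in can also be written in one step; this sketch reuses the 127.0.0.1:8889 address from the example above (substitute your own proxy):

sudo tee /etc/systemd/system/docker.service.d/http-proxy.conf > /dev/null <<'EOF'
[Service]
Environment="HTTP_PROXY=http://127.0.0.1:8889"
Environment="HTTPS_PROXY=http://127.0.0.1:8889"
Environment="NO_PROXY=localhost,127.0.0.0/8"
EOF
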
Container proxy

If containers need to go through a proxy at runtime, configure:

gedit ~/.docker/config.json

The following configuration only takes effect with Docker 17.07 and later.

{
  "proxies": {
    "default": {
      "httpProxy": "http://proxy.example.com:8080",
      "httpsProxy": "http://proxy.example.com:8080",
      "noProxy": "localhost,127.0.0.1,.example.com"
    }
  }
}
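
To confirm that newly started containers pick up these settings (alpine is used here only as a small test image), the proxy variables should show up in the container's environment:

docker run --rm alpine env | grep -i proxy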

This is a user-level configuration; besides proxies, it also stores information such as docker login credentials, and it can configure output formats, plugin parameters and so on.

Alternatively, a container's network proxy can be injected at run time with -e environment variables such as http_proxy. The two approaches suit different scenarios. config.json is very convenient and applies by default to every container started after the change, which fits a personal development environment. In CI/CD build environments or production it is less suitable; explicit -e injection is better there, because it reduces the dependency on the build and deployment environment. Of course, in those environments it is best to design things so that no proxy is needed at all.
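
A minimal sketch of the -e approach, reusing the placeholder proxy address and image tag from this post:

docker run --rm \
  -e http_proxy=http://proxy.example.com:8080 \
  -e https_proxy=http://proxy.example.com:8080 \
  -e no_proxy=localhost,127.0.0.1,.example.com \
  your/image:tag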

  • Reload systemd so that the new settings are read:
    sudo systemctl daemon-reload

  • Verify that the docker service Environment is properly set:
    sudo systemctl show docker --property Environment

  • Restart the docker service so that it uses the updated Environment settings:
    sudo systemctl restart docker


docker build proxy

Although docker build also starts a container under the hood, its environment is slightly different and the user-level configuration does not apply. Proxy settings such as http_proxy must be injected as build arguments.

docker build . \
    --build-arg "HTTP_PROXY=http://proxy.example.com:8080/" \
    --build-arg "HTTPS_PROXY=http://proxy.example.com:8080/" \
    --build-arg "NO_PROXY=localhost,127.0.0.1,.example.com" \
    -t your/image:tag

Note: both docker run and docker build are network-isolated by default. A proxy like localhost:3128 will therefore not work; such local-only proxies only function with --network host. In general you need to configure the proxy's externally reachable IP, and the proxy itself must allow access from other hosts (gateway mode).

After setting up a global proxy, build with:

docker build . --network host


version: "2"
services:
  web:
    build:
      context: .
      args:
        - http_proxy = "http://localproxy.localdomain",
        - https_proxy = "https://localproxy.localdomain",
        - no_proxy = ".our.local.domain,localhost",

Or pass them on the command line:

docker-compose build --build-arg HTTP_PROXY=http://proxyurl:proxyport --build-arg HTTPS_PROXY=http://proxyurl:proxyport

The following configuration was tested successfully. The host's IP on the docker0 bridge is 172.17.0.1:

echo $(ip addr show docker0 | grep -Po 'inet \K[\d.]+')

version: '2'
services:
  coder:
    build:
      context: coder
      args:
        #http_proxy : "" # no proxy
        #https_proxy : "" # no proxy
        http_proxy: "http://172.17.0.1:8889" #ok no need to map port
        https_proxy: "http://172.17.0.1:8889" #ok no need to map port
        #http_proxy: "http://host.docker.internal:8889" # fail
        #https_proxy: "http://host.docker.internal:8889"# fail
        #http_proxy: "http://localhost:8889"# fail
        #https_proxy: "http://localhost:8889"# fail
    container_name: coder
    security_opt: # options needed for gdb debugging
      - seccomp:unconfined
      - apparmor:unconfined   
    user: waxz
    working_dir: /home/waxz
    #network_mode: host
    #network_mode: bridge
    ports:
      - "2222:22"
    cap_add:
      - ALL
      
    volumes:
      - /tmp/.X11-unix:/tmp/.X11-unix:rw
      - $HOME/.Xauthority:/home/waxz/.Xauthority:rw
      #- /tmp/.docker.xauth:/tmp/.docker.xauth:rw
      - /dev/usb:/dev/usb
      - ./shell:/opt/shell
      # - ./shell/.bashrc:/home/waxz/.bashrc
      - ./share:/home/waxz/share
      - ./share/.tmux.conf:/home/waxz/.tmux.conf
      - /home/waxz/CLionProjects:/home/waxz/CLionProjects


    entrypoint: /opt/shell/entrypoint.sh
    
    stop_signal: SIGKILL
    pid: "host"
    ipc: host
    privileged: true
    read_only: false
    stdin_open: true
    tty: true
    #networks:
      #- default

Restart to take effect

Once the proxy configuration is done, a reboot will of course make it take effect, but a reboot is not required.

The docker build proxy is set right before the command runs, so changes take effect on the next build. Changes to the container proxy also take effect immediately, but only for containers started afterwards; already running containers are unaffected.

Changes to the dockerd proxy are special: they are really changes to the systemd configuration, so systemd must be reloaded and dockerd restarted for them to take effect.

sudo systemctl daemon-reload
sudo systemctl restart docker

NVIDIA Container Toolkit

Installation method 1

Set up the stable repository and the GPG key:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

Install the nvidia-docker2 package (and dependencies) after updating the package listing:

sudo apt-get update

sudo apt-get install -y nvidia-docker2

Restart the Docker daemon to complete the installation after setting the default runtime:

sudo systemctl restart docker

At this point, a working setup can be tested by running a base CUDA container; nvidia-smi should report the GPU and CUDA information correctly:

docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
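
If the package installed correctly, the Docker daemon should also list the nvidia runtime:

docker info | grep -i runtime
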
Installation method 2
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
After updating the NVIDIA driver on the host, the toolkit needs to be reinstalled:
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
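
Then re-run the CUDA test container from above to confirm that GPU access still works:

docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
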
Test images

## Pull the nvidia/cuda image
https://gitlab.com/nvidia/container-images/cuda/-/tree/master
```sh
export IMAGE_NAME="nvidia/cuda"
export CUDA_VERSION="11.0"
export OS="ubuntu20.04"
export ARCH=`uname -m`

docker pull "${IMAGE_NAME}:${CUDA_VERSION}-base-${OS}"
```

Using GPUs with docker-compose

nvidia/cuda:11.0-base reports the CUDA information correctly; cuDNN has to be installed separately.

version: '2'
services:
  test:
    image: nvidia/cuda:11.0-base
    command: nvidia-smi
    
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            device_ids: ['0']
            capabilities: [gpu]
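
The deploy.resources.reservations.devices form above requires a reasonably recent docker-compose (1.28 or newer). With that in place, run the service with:

docker-compose up test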

A simpler option is to use an image that already bundles cuDNN, such as pytorch/pytorch:latest or tensorflow/tensorflow:latest-gpu.

services:
  test:
    image: tensorflow/tensorflow:latest-gpu
    command: python -c "import tensorflow as tf;tf.test.gpu_device_name()"
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            device_ids: ['0']
            capabilities: [gpu]
2021-04-21 07:57:38.024079: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 computeCapability: 7.5
coreClock: 1.2GHz coreCount: 30 deviceMemorySize: 5.79GiB deviceMemoryBandwidth: 245.91GiB/s
tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-04-21 07:57:38.052262: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/device:GPU:0 with 5350 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)
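
The equivalent quick check with the PyTorch image (a sketch, not from the original log):

docker run --rm --gpus all pytorch/pytorch:latest python -c "import torch; print(torch.cuda.is_available())"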