Install Docker and Nvidia-Docker

Installing Docker and Nvidia-Docker on Ubuntu 20

Posted by WW on April 6, 2021

Notes on installing Docker and Nvidia-Docker on Ubuntu 20.

Install Docker

Docker CE on Ubuntu can be set up using Docker's official convenience script:

curl https://get.docker.com | sh \
  && sudo systemctl --now enable docker
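
Once the script finishes, a quick sanity check (hello-world is just a minimal test image):

sudo docker run --rm hello-world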

On Ubuntu 14.04, restart Docker with:

sudo service docker restart

Add your user to the docker group so Docker can be used without root:

sudo usermod -aG docker $USER

Apply the new group membership without logging out:

su - ${USER}
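
After that, docker should work without sudo; a quick check:

groups
docker ps
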
Install docker-compose
pip install docker-compose
Enable docker-compose command-line completion:
sudo curl -L https://raw.githubusercontent.com/docker/compose/1.25.0/contrib/completion/bash/docker-compose -o /etc/bash_completion.d/docker-compose
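
To use the completion in the current shell without opening a new one (assuming bash):

source /etc/bash_completion.d/docker-compose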

Common commands

Start a container:
docker-compose run --entrypoint bash ros-deploy2
Map ports (not mapped by default):
docker-compose run --service-ports moveit-ssh
Attach to a running container:
docker-compose exec ros-deploy2 bash
Rebuild without cache:
docker-compose build --no-cache coder
Kill all containers:
docker-compose kill
Remove all stopped containers:
docker-compose rm
Kill and remove:
docker-compose down
Remove stopped containers:
docker container prune
Remove dangling images:
docker rmi $(docker images --filter "dangling=true" -q --no-trunc)
Export/import a container (works):
docker export -o deploy.tar ros_ros-deploy2_run_8a13a1e8ffe8
docker import deploy.tar deploy

Exit status codes

From "What is the authoritative list of Docker Run exit codes?" (Stack Overflow):

  • 125: docker run itself fails
  • 126: contained command cannot be invoked
  • 127: if contained command cannot be found
  • 128 + n: fatal error signal n:
    • 130 = (128+2) Container terminated by Control-C
    • 137 = (128+9) Container received a SIGKILL
    • 143 = (128+15) Container received a SIGTERM
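
To check the exit code of a stopped container (my_container is a placeholder name):

docker inspect --format '{{.State.ExitCode}}' my_container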

Configure a proxy

docker-proxy

dockerd proxy

docker pull is executed by the dockerd daemon, so the proxy must be configured in dockerd's environment. That environment is managed by systemd, so this is effectively a systemd configuration.

sudo mkdir -p /etc/systemd/system/docker.service.d
sudo gedit /etc/systemd/system/docker.service.d/http-proxy.conf

In this proxy.conf file (any *.conf name works), add the following:

[Service]
# NO_PROXY is optional and can be removed if not needed
# Change proxy_url to your proxy IP or FQDN and proxy_port to your proxy port
# For a proxy server that requires username and password authentication, add the username and password to the URL (see example below)

# Example without authentication
# Environment="HTTP_PROXY=http://127.0.0.1:8889" "NO_PROXY=localhost,127.0.0.0/8"

# Example with authentication
# Environment="HTTP_PROXY=http://username:password@proxy_url:proxy_port" "NO_PROXY=localhost,127.0.0.0/8"
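
The whole drop-in can also be written in one step; this sketch reuses the 127.0.0.1:8889 address from the example above (substitute your own proxy):

sudo tee /etc/systemd/system/docker.service.d/http-proxy.conf > /dev/null <<'EOF'
[Service]
Environment="HTTP_PROXY=http://127.0.0.1:8889"
Environment="HTTPS_PROXY=http://127.0.0.1:8889"
Environment="NO_PROXY=localhost,127.0.0.0/8"
EOF
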
Container proxy

If containers need to go through a proxy at runtime, configure:

gedit ~/.docker/config.json

The following configuration only takes effect with Docker 17.07 and later.

{
  "proxies": {
    "default": {
      "httpProxy": "http://proxy.example.com:8080",
      "httpsProxy": "http://proxy.example.com:8080",
      "noProxy": "localhost,127.0.0.1,.example.com"
    }
  }
}
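
To confirm that newly started containers pick up these settings (alpine is used here only as a small test image), the proxy variables should show up in the container's environment:

docker run --rm alpine env | grep -i proxy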

This is a user-level configuration; besides proxies, it also stores information such as docker login credentials, and it can configure output formats, plugin parameters and so on.

Alternatively, a container's network proxy can be injected at run time with -e environment variables such as http_proxy. The two approaches suit different scenarios. config.json is very convenient and applies by default to every container started after the change, which fits a personal development environment. In CI/CD build environments or production it is less suitable; explicit -e injection is better there, because it reduces the dependency on the build and deployment environment. Of course, in those environments it is best to design things so that no proxy is needed at all.
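
A minimal sketch of the -e approach, reusing the placeholder proxy address and image tag from this post:

docker run --rm \
  -e http_proxy=http://proxy.example.com:8080 \
  -e https_proxy=http://proxy.example.com:8080 \
  -e no_proxy=localhost,127.0.0.1,.example.com \
  your/image:tag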

  • Reload systemd so that the new settings are read:
    sudo systemctl daemon-reload

  • Verify that the docker service Environment is properly set:
    sudo systemctl show docker --property Environment

  • Restart the docker service so that it uses the updated Environment settings:
    sudo systemctl restart docker


docker build proxy

Although docker build also starts a container under the hood, its environment is slightly different and the user-level configuration does not apply. Proxy settings such as http_proxy must be injected as build arguments.

docker build . \
    --build-arg "HTTP_PROXY=http://proxy.example.com:8080/" \
    --build-arg "HTTPS_PROXY=http://proxy.example.com:8080/" \
    --build-arg "NO_PROXY=localhost,127.0.0.1,.example.com" \
    -t your/image:tag

Note: both docker run and docker build are network-isolated by default. A proxy like localhost:3128 will therefore not work; such local-only proxies only function with --network host. In general you need to configure the proxy's externally reachable IP, and the proxy itself must allow access from other hosts (gateway mode).

After setting up a global proxy, build with:

docker build . --network host


version: "2"
services:
  web:
    build:
      context: .
      args:
        - http_proxy = "http://localproxy.localdomain",
        - https_proxy = "https://localproxy.localdomain",
        - no_proxy = ".our.local.domain,localhost",

Or pass them on the command line:

docker-compose build --build-arg HTTP_PROXY=http://proxyurl:proxyport --build-arg HTTPS_PROXY=http://proxyurl:proxyport

The following configuration was tested successfully. The host's IP on the docker0 bridge is 172.17.0.1:

echo $(ip addr show docker0 | grep -Po 'inet \K[\d.]+')

version: '2'
services:
  coder:
    build:
      context: coder
      args:
        #http_proxy : "" # no proxy
        #https_proxy : "" # no proxy
        http_proxy: "http://172.17.0.1:8889" #ok no need to map port
        https_proxy: "http://172.17.0.1:8889" #ok no need to map port
        #http_proxy: "http://host.docker.internal:8889" # fail
        #https_proxy: "http://host.docker.internal:8889"# fail
        #http_proxy: "http://localhost:8889"# fail
        #https_proxy: "http://localhost:8889"# fail
    container_name: coder
    security_opt: # options needed for gdb debugging
      - seccomp:unconfined
      - apparmor:unconfined   
    user: waxz
    working_dir: /home/waxz
    #network_mode: host
    #network_mode: bridge
    ports:
      - "2222:22"
    cap_add:
      - ALL
      
    volumes:
      - /tmp/.X11-unix:/tmp/.X11-unix:rw
      - $HOME/.Xauthority:/home/waxz/.Xauthority:rw
      #- /tmp/.docker.xauth:/tmp/.docker.xauth:rw
      - /dev/usb:/dev/usb
      - ./shell:/opt/shell
      # - ./shell/.bashrc:/home/waxz/.bashrc
      - ./share:/home/waxz/share
      - ./share/.tmux.conf:/home/waxz/.tmux.conf
      - /home/waxz/CLionProjects:/home/waxz/CLionProjects


    entrypoint: /opt/shell/entrypoint.sh
    
    stop_signal: SIGKILL
    pid: "host"
    ipc: host
    privileged: true
    read_only: false
    stdin_open: true
    tty: true
    #networks:
      #- default

Restart to take effect

Once the proxy configuration is done, a reboot will of course make it take effect, but a reboot is not required.

The docker build proxy is set right before the command runs, so changes take effect on the next build. Changes to the container proxy also take effect immediately, but only for containers started afterwards; already running containers are unaffected.

Changes to the dockerd proxy are special: they are really changes to the systemd configuration, so systemd must be reloaded and dockerd restarted for them to take effect.

sudo systemctl daemon-reload
sudo systemctl restart docker

NVIDIA Container Toolkit

Installation method 1

Set up the stable repository and the GPG key:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

Install the nvidia-docker2 package (and dependencies) after updating the package listing:

sudo apt-get update

sudo apt-get install -y nvidia-docker2

Restart the Docker daemon to complete the installation after setting the default runtime:

sudo systemctl restart docker

At this point, a working setup can be tested by running a base CUDA container; nvidia-smi should report the GPU and CUDA information correctly:

docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
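
If the package installed correctly, the Docker daemon should also list the nvidia runtime:

docker info | grep -i runtime
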
Installation method 2
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
After updating the NVIDIA driver on the host, the toolkit needs to be reinstalled:
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
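
Then re-run the CUDA test container from above to confirm that GPU access still works:

docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
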
Test images

## Pull the nvidia/cuda image
https://gitlab.com/nvidia/container-images/cuda/-/tree/master
```sh
export IMAGE_NAME="nvidia/cuda"
export CUDA_VERSION="11.0"
export OS="ubuntu20.04"
export ARCH=`uname -m`

docker pull "${IMAGE_NAME}:${CUDA_VERSION}-base-${OS}"
```

Using GPUs with docker-compose

nvidia/cuda:11.0-base reports the CUDA information correctly; cuDNN has to be installed separately.

version: '2'
services:
  test:
    image: nvidia/cuda:11.0-base
    command: nvidia-smi
    
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            device_ids: ['0']
            capabilities: [gpu]
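
The deploy.resources.reservations.devices form above requires a reasonably recent docker-compose (1.28 or newer). With that in place, run the service with:

docker-compose up test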

A simpler option is to use an image that already bundles cuDNN, such as pytorch/pytorch:latest or tensorflow/tensorflow:latest-gpu.

services:
  test:
    image: tensorflow/tensorflow:latest-gpu
    command: python -c "import tensorflow as tf;tf.test.gpu_device_name()"
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            device_ids: ['0']
            capabilities: [gpu]
2021-04-21 07:57:38.024079: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 computeCapability: 7.5
coreClock: 1.2GHz coreCount: 30 deviceMemorySize: 5.79GiB deviceMemoryBandwidth: 245.91GiB/s
tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-04-21 07:57:38.052262: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/device:GPU:0 with 5350 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)
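
The equivalent quick check with the PyTorch image (a sketch, not from the original log):

docker run --rm --gpus all pytorch/pytorch:latest python -c "import torch; print(torch.cuda.is_available())"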