Notes on installing Docker and Nvidia-Docker on Ubuntu 20.
Setting up Docker
Docker CE on Ubuntu can be set up using Docker's official convenience script:
curl https://get.docker.com | sh \
&& sudo systemctl --now enable docker
On Ubuntu 14, restart the Docker service with:
sudo service docker restart
Add your user to the docker group so Docker can be used without root:
sudo usermod -aG docker $USER
Apply the group change without logging out:
su - ${USER}
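After re-entering the shell, a quick way to confirm that the group change took effect is to run a container without sudo (a minimal check; hello-world is just an example image):

```sh
# Should print the hello-world greeting without sudo
docker run --rm hello-world
```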
Install docker-compose
pip install docker-compose
docker-compose command-line completion:
sudo curl -L https://raw.githubusercontent.com/docker/compose/1.25.0/contrib/completion/bash/docker-compose -o /etc/bash_completion.d/docker-compose
Common commands
Start a container:
docker-compose run --entrypoint bash ros-deploy2
Map ports (no ports are mapped by default):
docker-compose run --service-ports moveit-ssh
Attach to a running container:
docker-compose exec ros-deploy2 bash
Rebuild an image:
docker-compose build --no-cache coder
Kill all containers:
docker-compose kill
Remove all stopped containers:
docker-compose rm
Kill and remove:
docker-compose down
Remove stopped containers:
docker container prune
Remove dangling (<none>) images:
docker rmi $(docker images --filter "dangling=true" -q --no-trunc)
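On recent Docker versions the built-in prune command does the same thing and only removes dangling images (add -f to skip the confirmation prompt):

```sh
docker image prune
```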
Export/import a container (this works):
docker export -o deploy.tar ros_ros-deploy2_run_8a13a1e8ffe8
docker import deploy.tar deploy
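Note that docker import produces a plain filesystem image without the original CMD/ENTRYPOINT metadata, so a command has to be given explicitly when starting it; a minimal sketch, assuming the exported filesystem contains bash:

```sh
# The command (bash here) must be specified because the imported image has no default CMD
docker run -it --rm deploy bash
```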
Exit status codes
Reference: "What is the authoritative list of Docker Run exit codes?" (Stack Overflow)
125: docker run itself fails
126: the contained command cannot be invoked
127: the contained command cannot be found
128 + n: fatal error signal n
130 = 128 + 2: container terminated by Control-C (SIGINT)
137 = 128 + 9: container received a SIGKILL
143 = 128 + 15: container received a SIGTERM
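A quick way to see these codes in practice (ubuntu:20.04 is only an example image; any small image works):

```sh
# Command not found inside the container -> exit code 127
docker run --rm ubuntu:20.04 no-such-command; echo $?

# Process killed by SIGKILL (signal 9) -> exit code 137 (128 + 9)
docker run --rm ubuntu:20.04 bash -c 'kill -9 $$'; echo $?
```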
Configure proxies
dockerd proxy
When you run docker pull, the work is done by the dockerd daemon, so the proxy has to be configured in dockerd's environment. That environment is managed by systemd, so this is really a systemd configuration.
sudo mkdir -p /etc/systemd/system/docker.service.d
sudo gedit /etc/systemd/system/docker.service.d/http-proxy.conf
Add the following to this http-proxy.conf file (any file name ending in .conf works):
[Service]
# NO_PROXY is optional and can be removed if not needed
# Change proxy_url to your proxy IP or FQDN and proxy_port to your proxy port
# For a proxy server that requires username and password authentication, add them to the URL (see the example below)
# Example without authentication:
# Environment="HTTP_PROXY=http://127.0.0.1:8889" "NO_PROXY=localhost,127.0.0.0/8"
# Example with authentication:
# Environment="HTTP_PROXY=http://username:password@proxy_url:proxy_port" "NO_PROXY=localhost,127.0.0.0/8"
# Active setting (adjust the address to your own proxy; HTTPS_PROXY is usually needed as well)
Environment="HTTP_PROXY=http://127.0.0.1:8889" "HTTPS_PROXY=http://127.0.0.1:8889" "NO_PROXY=localhost,127.0.0.0/8"
Container proxy
If a container needs to go through a proxy at run time, configure:
gedit ~/.docker/config.json
The following configuration only takes effect with Docker 17.07 and later.
{
    "proxies": {
        "default": {
            "httpProxy": "http://proxy.example.com:8080",
            "httpsProxy": "http://proxy.example.com:8080",
            "noProxy": "localhost,127.0.0.1,.example.com"
        }
    }
}
This is a user-level configuration file; besides proxies, it also holds docker login credentials and other settings such as output formats and plugin options.
Alternatively, a container's network proxy can be injected at run time with -e environment variables such as http_proxy. The two approaches suit different scenarios: config.json is convenient and applies by default to every container started after the change, which fits a personal development environment. In CI/CD build environments or production, explicit -e injection is better because it reduces the dependency on the build and deployment environment. Ideally, of course, those environments are designed so that no proxy is needed at all.
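A minimal sketch of the explicit -e form (the proxy address is only an example; substitute your own):

```sh
docker run --rm \
  -e http_proxy="http://172.17.0.1:8889" \
  -e https_proxy="http://172.17.0.1:8889" \
  -e no_proxy="localhost,127.0.0.1" \
  ubuntu:20.04 env | grep -i proxy
```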
- Reload systemd so that the new settings are read
sudo systemctl daemon-reload
- Verify that the docker service Environment is set correctly
sudo systemctl show docker --property Environment
- Restart the docker service so that it picks up the updated Environment
sudo systemctl restart docker
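After the restart, docker pull goes through the configured proxy; a small test pull confirms connectivity (hello-world is just an example image):

```sh
docker pull hello-world
```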
docker build proxy
Although docker build also essentially starts a container, its environment is slightly different and the user-level configuration does not apply. Proxy settings such as http_proxy have to be injected at build time:
docker build . \
--build-arg "HTTP_PROXY=http://proxy.example.com:8080/" \
--build-arg "HTTPS_PROXY=http://proxy.example.com:8080/" \
--build-arg "NO_PROXY=localhost,127.0.0.1,.example.com" \
-t your/image:tag
Note: both docker run and docker build are network-isolated by default, so a proxy address like localhost:3128 will not work. For such local-only proxies you must add --network host; otherwise, use the proxy's external IP and make sure the proxy itself accepts connections from other hosts (gateway mode).
After setting up a global (system-wide) proxy, use this command:
docker build . --network host
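For example, a local-only proxy such as 127.0.0.1:8889 (an assumed address; substitute your own) can be combined with --network host and build args in a single command:

```sh
docker build . --network host \
  --build-arg "HTTP_PROXY=http://127.0.0.1:8889" \
  --build-arg "HTTPS_PROXY=http://127.0.0.1:8889" \
  --build-arg "NO_PROXY=localhost,127.0.0.1" \
  -t your/image:tag
```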
With docker-compose, the same proxy build args can be declared under build.args:
version: "2"
services:
  web:
    build:
      context: .
      args:
        - http_proxy=http://localproxy.localdomain
        - https_proxy=https://localproxy.localdomain
        - no_proxy=.our.local.domain,localhost
Or pass them on the command line:
docker-compose build --build-arg HTTP_PROXY=http://proxyurl:proxyport --build-arg HTTPS_PROXY=http://proxyurl:proxyport
The following configuration was tested successfully. Get the host IP on the docker0 bridge (172.17.0.1 here) with:
echo $(ip addr show docker0 | grep -Po 'inet \K[\d.]+')
version: '2'
services:
  coder:
    build:
      context: coder
      args:
        #http_proxy: ""   # no proxy
        #https_proxy: ""  # no proxy
        http_proxy: "http://172.17.0.1:8889"   # ok, no need to map the port
        https_proxy: "http://172.17.0.1:8889"  # ok, no need to map the port
        #http_proxy: "http://host.docker.internal:8889"   # fail
        #https_proxy: "http://host.docker.internal:8889"  # fail
        #http_proxy: "http://localhost:8889"   # fail
        #https_proxy: "http://localhost:8889"  # fail
    container_name: coder
    security_opt: # options needed for gdb debugging
      - seccomp:unconfined
      - apparmor:unconfined
    user: waxz
    working_dir: /home/waxz
    #network_mode: host
    #network_mode: bridge
    ports:
      - "2222:22"
    cap_add:
      - ALL
    volumes:
      - /tmp/.X11-unix:/tmp/.X11-unix:rw
      - $HOME/.Xauthority:/home/waxz/.Xauthority:rw
      #- /tmp/.docker.xauth:/tmp/.docker.xauth:rw
      - /dev/usb:/dev/usb
      - ./shell:/opt/shell
      # - ./shell/.bashrc:/home/waxz/.bashrc
      - ./share:/home/waxz/share
      - ./share/.tmux.conf:/home/waxz/.tmux.conf
      - /home/waxz/CLionProjects:/home/waxz/CLionProjects
    entrypoint: /opt/shell/entrypoint.sh
    stop_signal: SIGKILL
    pid: "host"
    ipc: host
    privileged: true
    read_only: false
    stdin_open: true
    tty: true
    #networks:
    #  - default
Making the changes take effect
After the proxy configuration is done, a full reboot obviously works, but it is not required.
The docker build proxy is set right before the command runs, so changes take effect on the next build. Changes to the container proxy also take effect immediately, but only for containers started afterwards; already-running containers are unaffected.
The dockerd proxy is special: since it is really a systemd configuration, systemd must be reloaded and dockerd restarted before it takes effect.
sudo systemctl daemon-reload
sudo systemctl restart docker
NVIDIA Container Toolkit
Installation method 1
Set up the stable repository and the GPG key:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
Install the nvidia-docker2 package (and dependencies) after updating the package listing:
sudo apt-get update
sudo apt-get install -y nvidia-docker2
Restart the Docker daemon to complete the installation after setting the default runtime:
sudo systemctl restart docker
At this point, a working setup can be tested by running a base CUDA container; nvidia-smi should report the GPU and CUDA driver information normally:
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
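To expose only specific GPUs instead of all of them, the --gpus flag also accepts a device list (device 0 here is just an example):

```sh
# Expose only GPU 0 to the container
docker run --rm --gpus '"device=0"' nvidia/cuda:11.0-base nvidia-smi
```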
Installation method 2
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
After updating the host's NVIDIA driver, the toolkit needs to be reinstalled:
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
Test images
## Pulling the nvidia/cuda image
https://gitlab.com/nvidia/container-images/cuda/-/tree/master
```sh
export IMAGE_NAME="nvidia/cuda"
export CUDA_VERSION="11.0"
export OS="ubuntu20.04"
export ARCH=`uname -m`
docker pull "${IMAGE_NAME}:${CUDA_VERSION}-base-${OS}"
```
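The pulled image can then be checked the same way as before, assuming the variables from the block above are still set:

```sh
docker run --rm --gpus all "${IMAGE_NAME}:${CUDA_VERSION}-base-${OS}" nvidia-smi
```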
Using the GPU in docker-compose
nvidia/cuda:11.0-base reports the CUDA information correctly; cuDNN has to be installed separately.
version: '2'
services:
  test:
    image: nvidia/cuda:11.0-base
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
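Running the service prints the nvidia-smi output once and then exits (the service name test matches the compose file above):

```sh
docker-compose up test
```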
A simpler option is to use an image that already bundles cuDNN, such as pytorch/pytorch:latest or tensorflow/tensorflow:latest-gpu.
services:
  test:
    image: tensorflow/tensorflow:latest-gpu
    command: python -c "import tensorflow as tf; tf.test.gpu_device_name()"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
Sample output from the TensorFlow container:
2021-04-21 07:57:38.024079: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 computeCapability: 7.5
coreClock: 1.2GHz coreCount: 30 deviceMemorySize: 5.79GiB deviceMemoryBandwidth: 245.91GiB/s
tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-04-21 07:57:38.052262: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/device:GPU:0 with 5350 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)