Distributed Faster-RCNN Training (with Horovod + Tensorpack)

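Background: I ran a small-scale distributed training experiment on 2 machines, mainly to verify that the code runs in a distributed environment. The setup was as follows: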

Host    IP            GPUs
hw21    10.30.0.21    2
hw22    10.30.0.22    1
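
The GPU counts in the table fix two values that the launch commands in section 5 rely on: the total process count (`-np`) and the per-host slot list (`-H`). A minimal shell sketch of that arithmetic follows; `host_list` and `total_slots` are illustrative helper names, not part of Horovod.

```shell
#!/usr/bin/env bash
# Hosts and GPU counts copied from the table above, as host:gpus entries.
HOSTS=("hw21:2" "hw22:1")

# Join the entries with commas to form the value passed to -H.
host_list() {
  local IFS=,
  echo "$*"
}

# Sum the GPU counts to get the total process count passed to -np.
total_slots() {
  local total=0 entry
  for entry in "$@"; do
    total=$(( total + ${entry##*:} ))
  done
  echo "${total}"
}

echo "-np $(total_slots "${HOSTS[@]}") -H $(host_list "${HOSTS[@]}")"
```

For the two machines here this prints `-np 3 -H hw21:2,hw22:1`, one process per GPU.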

1. Custom container

1.1 Dockerfile

FROM horovod/horovod:0.16.3-tf1.12.0-torch1.1.0-mxnet1.4.1-py3.5

LABEL author="xingw.xiong" \
      maintainer="xingw.xiong@gmail.com" \
      version="1.0" \
      description="Faster-RCNN based on tensorpack"

# Install system dependencies
RUN apt-get update && \
    apt-get install -y net-tools telnet iputils-ping sshpass \
        build-essential libcap-dev

RUN pip install tifffile python-prctl matplotlib

# Refresh the package index in this layer as well, so the install does not
# depend on an index cached by an earlier layer
RUN apt-get update && \
    apt-get install -y ocl-icd-opencl-dev libglib2.0-0 && \
    pip install 'opencv-python==3.3.0.9'

RUN pip install --upgrade git+https://github.com/tensorpack/tensorpack.git && \
    pip install --upgrade cython && \
    pip install "git+https://github.com/philferriere/cocoapi.git#egg=pycocotools&subdirectory=PythonAPI"

# Set the root password (used later by sshpass)
RUN echo "root:xingw.xiong" | chpasswd

# Edit /etc/ssh/sshd_config: allow root login and disable StrictModes
RUN sed -i 's/^PermitRootLogin.*$/PermitRootLogin yes/g' /etc/ssh/sshd_config && \
    sed -i 's/^StrictModes.*$/StrictModes no/g' /etc/ssh/sshd_config

# Generate an SSH RSA key pair
RUN ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa

# Copy the ssh_config file
# COPY ssh_config /root/.ssh/config

EXPOSE 22
CMD ["/usr/sbin/sshd", "-D"]

1.2 Build the container image

$ docker build -t climo:v2 .

2. Configure the multi-host container network

[manager]$ docker swarm init
[clients]$ docker swarm join --token ${MANAGER_TOKEN} ${MANAGER_IP}:2377
[manager]$ docker network create --attachable --driver overlay \
               --subnet 10.30.0.0/24 climo-net
[manager]$ docker run -d --net climo-net --name test climo:v2
[clients]$ docker run -d --net climo-net --name test climo:v2
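
The commands above leave `MANAGER_TOKEN` and `MANAGER_IP` undefined. A small sketch of how the join command could be assembled once both are known; `join_command` is a hypothetical helper, and the values passed to it here are placeholders rather than real credentials.

```shell
#!/usr/bin/env bash
# Build the worker join command from a token and the manager address.
# 2377 is the swarm management port used in the commands above.
join_command() {
  local token="$1" manager_ip="$2"
  echo "docker swarm join --token ${token} ${manager_ip}:2377"
}

# On the manager, the worker token can be read with the Docker CLI:
#   MANAGER_TOKEN=$(docker swarm join-token -q worker)
# MANAGER_IP is an assumption: any address of the manager that the
# other machines can reach.

# Example with placeholder values:
join_command "<token>" "<manager-ip>"
```

`docker swarm init` also prints a ready-made join command, so this is only needed when that output is no longer at hand.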

3. Start the containers

# Start the container in the background.
# TAG is the last octet of this container's IP (21 for hw21; use 22 on hw22).
# USER_PATH must already be set to the directory to mount.
TAG=21
CONTAINER_NAME=climo-${TAG}
docker run --runtime=nvidia --ipc=host --network=climo-net \
    --hostname=${CONTAINER_NAME} --name=${CONTAINER_NAME} \
    -v ${USER_PATH}:${USER_PATH} \
    -v /mnt/sdb:/mnt/sdb \
    -v /mnt/sdc:/mnt/sdc \
    --ip 10.30.0.${TAG} \
    --add-host hw21:10.30.0.21 \
    --add-host hw22:10.30.0.22 \
    -d climo:v2

# Open a shell inside the container
$ docker exec -it ${CONTAINER_NAME} bash
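
Since `TAG` has to differ per machine, it could be derived instead of edited by hand. The sketch below assumes the physical machines are themselves named hw21 and hw22, which the post does not state; `tag_for_host` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Map a host name of the form hwNN to the tag NN.
# Prints the tag on success; returns 1 for any other name.
tag_for_host() {
  local host="$1"
  if [[ "${host}" =~ ^hw([0-9]+)$ ]]; then
    echo "${BASH_REMATCH[1]}"
  else
    echo "unrecognized host name: ${host}" >&2
    return 1
  fi
}

# Usage on each machine (assumes its short host name is hw21 or hw22):
#   TAG=$(tag_for_host "$(hostname -s)")
#   CONTAINER_NAME=climo-${TAG}
```

With that in place the same `docker run` command can be pasted unchanged on both machines.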

4. Configure passwordless SSH login

4.1 Run in each container separately

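# Copy this container's public key to the host at ${SERVER}:22.
# StrictHostKeyChecking=no skips the first-connection host-key prompt,
# which sshpass cannot answer.
SERVER=hw22
sshpass -p "xingw.xiong" ssh-copy-id -f -i /root/.ssh/id_rsa.pub \
    -o StrictHostKeyChecking=no root@${SERVER}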

4.2 Distribute the authorized_keys file

cat /root/.ssh/authorized_keys    # Output for debugging

# Distribute the authorized_keys file from SERVER to every host in CLIENTS
CLIENTS=("hw21" "hw22")
set -x
for cli in "${CLIENTS[@]}"
do
    sshpass -p "xingw.xiong" scp -o StrictHostKeyChecking=no -P 22 \
        /root/.ssh/authorized_keys root@${cli}:/root/.ssh/authorized_keys
done
set +x
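
Before launching training it may be worth confirming that key-based login really works between the containers, since a password prompt would stall the launcher. A minimal sketch, with `check_passwordless` and `check_all` as illustrative helper names:

```shell
#!/usr/bin/env bash
# Return 0 if root@HOST accepts key-based login on port 22.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_passwordless() {
  local host="$1"
  ssh -o BatchMode=yes -o StrictHostKeyChecking=no -o ConnectTimeout=5 \
      -p 22 "root@${host}" true
}

# Report every host in the list; return non-zero if any host failed.
check_all() {
  local host failed=0
  for host in "$@"; do
    if check_passwordless "${host}"; then
      echo "${host}: ok"
    else
      echo "${host}: FAILED"
      failed=1
    fi
  done
  return "${failed}"
}

# Usage, from inside each container:
#   check_all hw21 hw22
```

Running it from every container covers both directions, which matters because the launcher on one host opens SSH sessions to the others.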

5. Distributed training

# Launch with horovodrun: 3 processes, 2 on hw21 and 1 on hw22
horovodrun -np 3 -H hw21:2,hw22:1 python train.py \
    --config MODE_MASK=False MODE_FPN=False \
    DATA.BASEDIR=/home/xingw/climo/data_sm/

# The same launch written out with mpirun
mpirun --allow-run-as-root -np 3 \
    -H hw21:2,hw22:1 \
    -bind-to none -map-by slot \
    -mca plm_rsh_args "-p 22" \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python train.py \
    --config MODE_MASK=False MODE_FPN=False \
    DATA.BASEDIR=/home/xingw/climo/data_sm/

# Horovod's Keras MNIST example; -p sets the SSH port that horovodrun uses
horovodrun -np 3 -H hw21:2,hw22:1 -p 2222 python keras_mnist_advanced.py
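
To make the `-H hw21:2,hw22:1` argument concrete, the sketch below expands it into a rank-to-host layout. It assumes Open MPI's by-slot mapping fills each host's slots in the order the hosts are listed; `rank_layout` is an illustrative helper, not a Horovod or MPI command.

```shell
#!/usr/bin/env bash
# Expand a host:slots list into one "rank N -> host" line per process.
rank_layout() {
  local spec="$1" rank=0 entry host slots i
  local IFS=,
  for entry in ${spec}; do
    host="${entry%%:*}"     # text before the colon
    slots="${entry##*:}"    # text after the colon
    for (( i = 0; i < slots; i++ )); do
      echo "rank ${rank} -> ${host}"
      rank=$(( rank + 1 ))
    done
  done
}

rank_layout "hw21:2,hw22:1"
```

Under that assumption, ranks 0 and 1 land on hw21 and rank 2 on hw22, so each process gets its own GPU.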

To be continued; I will add more once final exams are over!
