Feature/ddp fixed #401
Conversation
Thanks guys! I looked over the files. It looks like some of the simpler commits could be grouped into their own smaller PR that would be much faster to merge, for example the Dockerfile and README updates. BTW, the argparse arguments for files are smart, so you don't need to supply the entire path.

I wrapped up my current baselining using 1x, 2x and 4x T4 GPUs (in order from legend top to bottom). The epoch train times were 29, 19 and 15 min each; the test times were always around 1 min. Trained to 40 epochs each (well, trained to 300 and then CTRL-C after 40) using the following command, these were the curves below. The final epoch 39 mAPs ranged from 0.252 to 0.254 (essentially identical). I'd like to repeat the same set of tests with the PR branch if I have some time this week.

```bash
python train.py --batch 64 --cfg yolov5s.yaml --data coco.yaml --img 640 --nosave --device 0,1,2,3
```

EDIT: Is there any difference in the command required with the PR? What's the equivalent command to the one above for the branch? Thanks!
For a single GPU, it would be the same. For multiple GPUs, we would have to use:

```bash
python -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size 64 --data coco.yaml --cfg models/yolov5s.yaml --weights '' --epochs 300
```

Theoretically, we can expand this code to use multiple nodes with multiple GPUs, but I don't think it's necessary.
The commits will be grouped into several commits once everything has settled down!
```bash
# 2-GPU DDP
python -m torch.distributed.launch --nproc_per_node 2 train.py --data data/coco.yaml --batch-size 64 --cfg models/yolov5s.yaml --weights '' --epochs 300 --device 0,1

# 2-GPU DDP with SyncBN
python -m torch.distributed.launch --nproc_per_node 2 train.py --data data/coco.yaml --batch-size 64 --cfg models/yolov5s.yaml --weights '' --epochs 300 --device 0,1 --sync-bn

# 4-GPU DDP is not supported right now: it produces lower performance,
# and the reason remains unknown, as discussed in #264
```

Here are my test results for an earlier epoch.
In conclusion, 2-GPU DDP without SyncBN is the better choice for DDP right now, while DP is applicable to arbitrary GPU counts.
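For context on what changes inside the script when it is launched this way: `torch.distributed.launch` passes a `--local_rank` argument to each process it spawns. The minimal parser below is illustrative only (not the repo's actual code); the flag names mirror the commands above:

```python
import argparse

def parse_ddp_args(argv):
    """Minimal sketch of the per-process flags a train.py launched via
    torch.distributed.launch might read. The launcher injects --local_rank
    into each spawned process; a single-GPU run leaves it at -1."""
    p = argparse.ArgumentParser()
    p.add_argument('--local_rank', type=int, default=-1,
                   help='set automatically by torch.distributed.launch')
    p.add_argument('--device', default='', help='e.g. 0,1')
    p.add_argument('--sync-bn', action='store_true', dest='sync_bn')
    # parse_known_args ignores the remaining training flags for this sketch
    opt, _ = p.parse_known_args(argv)
    return opt

# Example: what process 1 of a 2-GPU SyncBN launch would see
opt = parse_ddp_args(['--local_rank', '1', '--device', '0,1', '--sync-bn'])
```

A single-GPU invocation works unchanged because `--local_rank` simply keeps its default.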
Don't forget to change the first epoch of DDP 4 from 5/6% to 0.5/0.6%, since it would otherwise say the wrong thing. Edit:
@MagicFrogSJTU got it, thanks for the table! What we need to do now is update it with the default 2x and 4x GPU runs to compare the multi-GPU updates against the current multi-GPU baseline. If 4 GPUs are not working correctly... it's going to be a bit problematic. I know some groups are using 4x and even 8x GPU trainings currently, so naturally we need a robust solution for everyone.
I copied this from the other Issue to keep things closer. These are results from my runs. Table runs:
Ft is short for Magic's feature/DDP_fixed branch
My opinion is to enable DDP for 2 GPUs and use DP for anything higher, until the issue can be found.
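As a sketch of this interim policy, a hypothetical helper (illustrative only, not part of the codebase) for picking the parallel mode might look like:

```python
def choose_parallel_mode(num_gpus: int) -> str:
    """Interim policy discussed above: DDP for exactly 2 GPUs (faster),
    DP for higher counts until the 4-GPU DDP slowdown is understood.
    (Hypothetical helper; names are illustrative.)"""
    if num_gpus <= 1:
        return 'single'
    if num_gpus == 2:
        return 'ddp'
    return 'dp'
```

Under this rule a 4x or 8x machine would fall back to DP, matching the assertion for <=2 GPUs added in the branch.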
@MagicFrogSJTU oops, I might have messed up the PR. I meant to remove the README change, as I just pushed a few updates and included the quick fix you had here. Ah perfect, I see the updated table. It's late here; will get back to this tomorrow.
I used to run programs with 10x and 8x GPUs a lot, but I have never come across a case where 2x GPUs work and 4x GPUs don't.
Never mind. Good night!
Here are the results from the table in #401 (comment) plotted on a graph. @MagicFrogSJTU, can I have your results.txt so I can compile them into one picture?
@glenn-jocher @NanoCode012
Because the random seed is the same for every process, the 3 other sampled images are the same for every process! This of course reduces training efficiency!
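A minimal sketch of the kind of fix implied here, assuming a hypothetical `seed_for_rank` helper (not the repo's actual API): offset the base seed by the process rank, so each rank draws different samples while the run remains reproducible from the base seed:

```python
import random

def seed_for_rank(base_seed: int, rank: int) -> int:
    """Derive a distinct seed per DDP process so each rank samples
    different extra images, while the whole run stays reproducible
    from base_seed. In a real training script you would also pass
    the result to np.random.seed() and torch.manual_seed().
    (Hypothetical helper; illustrative only.)"""
    seed = base_seed + rank
    random.seed(seed)
    return seed
```

With the original all-processes-same-seed setup, every rank would draw the same images; offsetting by rank breaks that tie without sacrificing reproducibility.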
@glenn-jocher |
Hi @MagicFrogSJTU, I looked and saw that before. I saw in documentation that we should set their seed to the same value. I think I also saw this in the PyTorch documentation but cannot find it now. However, I will set mine to run.
Setting the random seed to a fixed value is key to experiment reproduction. Modern DDP will broadcast the weights of rank 0 to the other processes when DDP is set up, so there is no need to set the same random seed across processes for this purpose.
I see. I guess that's why I missed it.
My machine is down for maintenance. I don't know when it will recover...
@MagicFrogSJTU See table below!
I'm also setting 1- and 2-GPU runs going right now to make sure nothing abnormal happened! I'm also not sure if rebasing is the best thing to do, because we will lose the history of the commits, and some are valuable parts, like this point on "DDP deterioration". I think there is an option on GitHub to "squash" commits into one big commit.
Thanks for your experiments!
@MagicFrogSJTU I think the results are quite clear. f is Magic's feature branch. Edit: Added 8 GPU
UnitTest passed for the branch. I added a test for DDP training.

```bash
set -e
rm -rf yolov5 && git clone https://github.com/MagicFrogSJTU/yolov5.git -b feature/DDP_fixed && cd yolov5
pip install -qr requirements.txt onnx
python3 -c "from utils.google_utils import *; gdrive_download('1n_oKgR81BJtqk75b00eAjdv03qVCQn2f', 'coco128.zip')" && mv -n ./coco128 ../
export PYTHONPATH="$PWD"  # to run *.py files in subdirectories

for x in yolov5s  # yolov5m yolov5l yolov5x # models
do
  python -m torch.distributed.launch --nproc_per_node 2 train.py --weights $x.pt --cfg models/$x.yaml --epochs 3 --img 320 --device 0,1  # DDP train
  for di in 0,1 0 cpu  # inference devices
  do
    python train.py --weights $x.pt --cfg models/$x.yaml --epochs 3 --img 320 --device $di  # train
    python detect.py --weights $x.pt --device $di  # detect official
    python detect.py --weights runs/exp0/weights/last.pt --device $di  # detect custom
    python test.py --weights $x.pt --device $di  # test official
    python test.py --weights runs/exp0/weights/last.pt --device $di  # test custom
  done
  python models/yolo.py --cfg $x.yaml  # inspect
  python models/export.py --weights $x.pt --img 640 --batch 1  # export
done
```

Edit: Add log unittest-log.txt
@glenn-jocher |
@MagicFrogSJTU @NanoCode012 awesome guys, thanks for the updated plots! They look perfect, and the unit tests are passing, so we are all set. OK, I will look through the updates today!
I fixed the world_size bug. Tested it also on 1-GPU train, test and detect. CI covered CPU. Will re-run it fully when my machine is available.
May I ask what dataset you were training on? Did you set any specific parameters? Was it because you increased the batch size?
I've set DP mode on Magic's branch to test for comparisons. Edit: Added chart. SyncBN is off. Batch size 64. It would be great if you can duplicate the result for the PR branch; it just seems so unreal. Time is averaged over 3 epochs.

@MagicFrogSJTU, I'm a bit confused when running your branch's DP at different batch sizes (64, 128, 256) for (2, 4, 8) GPUs. They all take about 11-12 minutes to run. I was expecting it to be faster. Accuracy also slightly drops with more GPUs.
On COCO. No. My batch size is 64. It was done long ago, like a month, with master code. Maybe the code has changed a lot since then.
It happens when the batch size is not the key constraint on speed. I assume the data transfer between GPUs and the CPU overhead are more significant now for DP mode. Accuracy drops if you run DP on more GPUs because the batch size per GPU becomes too small. This is why we introduced SyncBN for DDP mode. (By the way, SyncBN is not applicable in DP mode.)
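To make the per-GPU batch point concrete, here is an illustrative helper (not the repo's code) showing how the effective BatchNorm batch shrinks as the process count grows; the real remedy mentioned above is `torch.nn.SyncBatchNorm` (converted via `convert_sync_batchnorm`), which synchronizes statistics across DDP processes:

```python
def per_gpu_batch(total_batch: int, world_size: int) -> int:
    """In DDP the total --batch-size is split evenly across processes,
    so plain BatchNorm computes statistics over only
    total_batch // world_size samples per GPU. When that number gets
    small, accuracy degrades, which is what SyncBN addresses.
    (Illustrative helper, not the repo's code.)"""
    assert total_batch % world_size == 0, 'total batch must divide evenly'
    return total_batch // world_size
```

For example, `--batch-size 64` on 8 GPUs leaves each BatchNorm layer with only 8 samples of statistics per step, versus 32 on 2 GPUs.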
Yes, these are the exact same results I found myself for current master. I'm assuming the same thing: on a T4 the speed is GPU-TOPS constrained, but on a V100 that constraint is removed and the new constraint is CPU-GPU communication, as well as the device 0 tasks that DP is doing. OK, all I have left is to finish reviewing train.py; all other files are good.
@MagicFrogSJTU OK, I understand about mp.spawn. It's unfortunate that the multi-GPU training process now requires a different command, and it's a bit more confusing to implement, but you guys have definitely succeeded in speeding it up greatly, which is the most important result, of course. I think train.py could use a bit of simplification in the future, as it's more complicated to understand now than before, but I'll go ahead and merge this, and then we can make tweaks as needed going forward. Good job guys!!
Great!
I have come up with several things to fix
@NanoCode012 Do you have any more ideas in mind? Edit 0:
The only thing left that comes to mind is to
Edit: There is one qualm about mp.spawn, though. Each time a dataloader is created, it re-runs the entire script (train.py) [anything outside a function], which slightly slows down the code. If there are 8 dataloaders per GPU, that is a source of slowdown. That was why I loved
I understand. I meant that we can use

For mp.spawn, we have to add

Edit: I want to make something clear.
You got a point!
Hi @glenn-jocher, I have implemented mp.spawn over the current code: https://github.com/MagicFrogSJTU/yolov5/tree/feature/mp_spawn However, I'm still testing the speed/accuracy. I'm just giving you a heads-up before you start making a tutorial on DDP. Also, we should separate/hide the output from the different GPUs. @MagicFrogSJTU made an interesting point on logging. He suggests to use
@NanoCode012 OK, got it. No, I have not started a tutorial yet; I'm waiting until this settles a bit. But I think before going further you should do a git pull to bring your branch up to speed with the current master (I see 12 ahead, 54 behind on your branch). The main complication in merging the last PR was that the code had drifted between the two branches in the meantime, so if you start from the current master it will make future PRs much easier. I'll look into the logging idea.
@MagicFrogSJTU @NanoCode012 @alexstoken hi guys, a quick update here. I've been retraining the current models (which I'll call yolov5.1) and also training two new architectures, yolov5.2 and yolov5.3. I don't want to confuse everyone with a bunch of new names, but this is the simplest scheme I could think of, and it leaves the door open to more experiments like yolov5.4 in the future. Each of the 3 comes in the same sizes as before, i.e. yolov5.1s, yolov5.1m, etc.

The baseline yolov5.1 shows slight improvements for the larger models, and the other two mainly show improvements for the smaller models, so there is no clear winner in my experiments (5.3 is not 'better' across the board than 5.2 or 5.1, for example; they are just different architecture compromises). 5.3 and 5.2 are better for small objects, but they are also slower than 5.1, as they introduce more ops on the P2/4 grid. These models include breaking changes that will unfortunately make current models incompatible, but I think the changes are beneficial for the long term, as they simplify the architecture a bit. I want to release all of this in about a week; I'm waiting on the final 5x models to finish training. In the meantime I'm holding off on making changes, because I'm not sure if you guys are making a lot of modifications to your local branches.

I think the most important thing you can do right now is to update your current branches to master to streamline any future PRs, as most of my holdup when merging is due to confusion about whether commits are old or new. It's just an unfortunate side-effect of many people working on the same code region. This is mainly my fault too, of course, for pushing so many commits straight to master randomly throughout the week. In the future I'll try to consolidate my changes into fewer commits, and also open PRs myself to better group commits and push less often.
@NanoCode012 oh wow, this is great work, good job! Yes, it looks like launch provides faster times, interesting. Well, that's unfortunate then; maybe we should stick with the current work and simply try to clean up train.py a bit to make it more readable. What do you think?

I think your N4 and N8 experiments are showing the same times because the GPU ops no longer constrain the speed at that point; something else must be the bottleneck there, likely reading images from the hard drive, or moving data from CPU to GPU. For larger models, like yolov5l and up, I think you'll probably get a curve closer to what you'd expect, with N8 showing speed improvements over N4. 300 seconds for a COCO epoch is insanely fast in any case.

The ultimate training speed would be N8 with train.py --cache, as all of the images would be preloaded into RAM, removing the hard-drive read-speed constraint from the picture. At img-size 640 for COCO, though, this requires about 150 GB of system RAM, so it's not quite feasible with today's hardware. For smaller datasets this is quite feasible and makes a huge training-speed difference.
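The ~150 GB figure above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes COCO train2017's ~118k images cached as raw uint8 RGB at 640x640 (actual letterboxed shapes vary per image, so this is only an estimate):

```python
def cache_ram_gb(num_images: int, img_size: int = 640, channels: int = 3) -> float:
    """Rough RAM estimate for caching a dataset in memory as raw uint8
    RGB arrays (1 byte per channel per pixel). Approximation only:
    real cached shapes depend on letterboxing and aspect ratios."""
    bytes_total = num_images * img_size * img_size * channels
    return bytes_total / 1e9  # decimal GB

# COCO train2017 has ~118,287 images
estimate = cache_ram_gb(118_287)  # about 145 GB, consistent with ~150 GB
```

The estimate lands around 145 GB, which matches the "about 150 GB" figure once per-image overhead is included.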
* update test.py --save-txt * update test.py --save-txt * add GH action tests * requirements * requirements * requirements * fix tests * add badge * lower batch-size * weights * args * parallel * rename eval * rename eval * paths * rename * lower bs * timeout * less xOS * drop xOS * git attrib * paths * paths * Apply suggestions from code review * Update eval.py * Update eval.py * update requirements.txt * Update ci-testing.yml * Update ci-testing.yml * rename test * revert test module to confuse users... * update hubconf.py * update common.py add Classify() * Update ci-testing.yml * Update ci-testing.yml * Update ci-testing.yml * Update ci-testing.yml * update common.py Classify() * Update ci-testing.yml * update test.py * update train.py ckpt loading * update train.py class count assertion ultralytics#424 * update train.py class count assertion ultralytics#424 Signed-off-by: Glenn Jocher <[email protected]> * Update requirements.txt * [WIP] Feature/ddp fixed (ultralytics#401) * Squashed commit of the following: commit d738487 Author: NanoCode012 <[email protected]> Date: Tue Jul 14 17:33:38 2020 +0700 Adding world_size Reduce calls to torch.distributed. For use in create_dataloader. commit e742dd9 Author: yizhi.chen <[email protected]> Date: Tue Jul 14 15:38:48 2020 +0800 Make SyncBN a choice commit e90d400 Merge: 5bf8beb cd90360 Author: yzchen <[email protected]> Date: Tue Jul 14 15:32:10 2020 +0800 Merge pull request #6 from NanoCode012/patch-5 Update train.py commit cd90360 Author: NanoCode012 <[email protected]> Date: Tue Jul 14 13:39:29 2020 +0700 Update train.py Remove redundant `opt.` prefix. 
commit 5bf8beb Merge: c9558a9 a1c8406 Author: yizhi.chen <[email protected]> Date: Tue Jul 14 14:09:51 2020 +0800 Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed commit c9558a9 Author: yizhi.chen <[email protected]> Date: Tue Jul 14 13:51:34 2020 +0800 Add device allocation for loss compute commit 4f08c69 Author: yizhi.chen <[email protected]> Date: Thu Jul 9 11:16:27 2020 +0800 Revert drop_last commit 1dabe33 Merge: a1ce9b1 4b8450b Author: yizhi.chen <[email protected]> Date: Thu Jul 9 11:15:49 2020 +0800 Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed commit a1ce9b1 Author: yizhi.chen <[email protected]> Date: Thu Jul 9 11:15:21 2020 +0800 fix lr warning commit 4b8450b Merge: b9a50ae 02c63ef Author: yzchen <[email protected]> Date: Wed Jul 8 21:24:24 2020 +0800 Merge pull request #4 from NanoCode012/patch-4 Add drop_last for multi gpu commit 02c63ef Author: NanoCode012 <[email protected]> Date: Wed Jul 8 10:08:30 2020 +0700 Add drop_last for multi gpu commit b9a50ae Merge: ec2dc6c 121d90b Author: yizhi.chen <[email protected]> Date: Tue Jul 7 19:48:04 2020 +0800 Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed commit ec2dc6c Merge: d0326e3 82a6182 Author: yizhi.chen <[email protected]> Date: Tue Jul 7 19:34:31 2020 +0800 Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed commit d0326e3 Author: yizhi.chen <[email protected]> Date: Tue Jul 7 19:31:24 2020 +0800 Add SyncBN commit 82a6182 Merge: 96fa40a 050b2a5 Author: yzchen <[email protected]> Date: Tue Jul 7 19:21:01 2020 +0800 Merge pull request #1 from NanoCode012/patch-2 Convert BatchNorm to SyncBatchNorm commit 050b2a5 Author: NanoCode012 <[email protected]> Date: Tue Jul 7 12:38:14 2020 +0700 Add cleanup for process_group commit 2aa3301 Author: NanoCode012 <[email protected]> Date: Tue Jul 7 12:07:40 2020 +0700 Remove 
apex.parallel. Use torch.nn.parallel For future compatibility commit 77c8e27 Author: NanoCode012 <[email protected]> Date: Tue Jul 7 01:54:39 2020 +0700 Convert BatchNorm to SyncBatchNorm commit 96fa40a Author: yizhi.chen <[email protected]> Date: Mon Jul 6 21:53:56 2020 +0800 Fix the datset inconsistency problem commit 16e7c26 Author: yizhi.chen <[email protected]> Date: Mon Jul 6 11:34:03 2020 +0800 Add loss multiplication to preserver the single-process performance commit e838055 Merge: 625bb49 3bdea3f Author: yizhi.chen <[email protected]> Date: Fri Jul 3 20:56:30 2020 +0800 Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed commit 625bb49 Author: yizhi.chen <[email protected]> Date: Thu Jul 2 22:45:15 2020 +0800 DDP established * Squashed commit of the following: commit 94147314e559a6bdd13cb9de62490d385c27596f Merge: 65157e2 37acbdc Author: yizhi.chen <[email protected]> Date: Thu Jul 16 14:00:17 2020 +0800 Merge branch 'master' of https://github.com/ultralytics/yolov4 into feature/DDP_fixed commit 37acbdc Author: Glenn Jocher <[email protected]> Date: Wed Jul 15 20:03:41 2020 -0700 update test.py --save-txt commit b8c2da4 Author: Glenn Jocher <[email protected]> Date: Wed Jul 15 20:00:48 2020 -0700 update test.py --save-txt commit 65157e2 Author: yizhi.chen <[email protected]> Date: Wed Jul 15 16:44:13 2020 +0800 Revert the README.md removal commit 1c802bf Merge: cd55b44 0f3b8bb Author: yizhi.chen <[email protected]> Date: Wed Jul 15 16:43:38 2020 +0800 Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed commit cd55b44 Author: yizhi.chen <[email protected]> Date: Wed Jul 15 16:42:33 2020 +0800 fix the DDP performance deterioration bug. 
commit 0f3b8bb Author: Glenn Jocher <[email protected]> Date: Wed Jul 15 00:28:53 2020 -0700 Delete README.md commit f5921ba Merge: 85ab2f3 bd3fdbb Author: yizhi.chen <[email protected]> Date: Wed Jul 15 11:20:17 2020 +0800 Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed commit bd3fdbb Author: Glenn Jocher <[email protected]> Date: Tue Jul 14 18:38:20 2020 -0700 Update README.md commit c1a97a7 Merge: 2bf86b8 f796708 Author: Glenn Jocher <[email protected]> Date: Tue Jul 14 18:36:53 2020 -0700 Merge branch 'master' into feature/DDP_fixed commit 2bf86b8 Author: NanoCode012 <[email protected]> Date: Tue Jul 14 22:18:15 2020 +0700 Fixed world_size not found when called from test commit 85ab2f3 Merge: 5a19011 c8357ad Author: yizhi.chen <[email protected]> Date: Tue Jul 14 22:19:58 2020 +0800 Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed commit 5a19011 Author: yizhi.chen <[email protected]> Date: Tue Jul 14 22:19:15 2020 +0800 Add assertion for <=2 gpus DDP commit c8357ad Merge: e742dd9 787582f Author: yzchen <[email protected]> Date: Tue Jul 14 22:10:02 2020 +0800 Merge pull request #8 from MagicFrogSJTU/NanoCode012-patch-1 Modify number of dataloaders' workers commit 787582f Author: NanoCode012 <[email protected]> Date: Tue Jul 14 20:38:58 2020 +0700 Fixed issue with single gpu not having world_size commit 6364892 Author: NanoCode012 <[email protected]> Date: Tue Jul 14 19:16:15 2020 +0700 Add assert message for clarification Clarify why assertion was thrown to users commit 69364d6 Author: NanoCode012 <[email protected]> Date: Tue Jul 14 17:36:48 2020 +0700 Changed number of workers check commit d738487 Author: NanoCode012 <[email protected]> Date: Tue Jul 14 17:33:38 2020 +0700 Adding world_size Reduce calls to torch.distributed. For use in create_dataloader. 
* e742dd9 (yizhi.chen, Tue Jul 14 15:38:48 2020 +0800) Make SyncBN a choice
* e90d400 (merge: 5bf8beb cd90360; yzchen, Tue Jul 14 15:32:10 2020 +0800) Merge pull request #6 from NanoCode012/patch-5: Update train.py
* cd90360 (NanoCode012, Tue Jul 14 13:39:29 2020 +0700) Update train.py: remove redundant `opt.` prefix
* 5bf8beb (merge: c9558a9 a1c8406; yizhi.chen, Tue Jul 14 14:09:51 2020 +0800) Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed
* c9558a9 (yizhi.chen, Tue Jul 14 13:51:34 2020 +0800) Add device allocation for loss compute
* 4f08c69 (yizhi.chen, Thu Jul 9 11:16:27 2020 +0800) Revert drop_last
* 1dabe33 (merge: a1ce9b1 4b8450b; yizhi.chen, Thu Jul 9 11:15:49 2020 +0800) Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed
* a1ce9b1 (yizhi.chen, Thu Jul 9 11:15:21 2020 +0800) Fix lr warning
* 4b8450b (merge: b9a50ae 02c63ef; yzchen, Wed Jul 8 21:24:24 2020 +0800) Merge pull request #4 from NanoCode012/patch-4: Add drop_last for multi GPU
* 02c63ef (NanoCode012, Wed Jul 8 10:08:30 2020 +0700) Add drop_last for multi GPU
* b9a50ae (merge: ec2dc6c 121d90b; yizhi.chen, Tue Jul 7 19:48:04 2020 +0800) Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed
* ec2dc6c (merge: d0326e3 82a6182; yizhi.chen, Tue Jul 7 19:34:31 2020 +0800) Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed
* d0326e3 (yizhi.chen, Tue Jul 7 19:31:24 2020 +0800) Add SyncBN
* 82a6182 (merge: 96fa40a 050b2a5; yzchen, Tue Jul 7 19:21:01 2020 +0800) Merge pull request #1 from NanoCode012/patch-2: Convert BatchNorm to SyncBatchNorm
* 050b2a5 (NanoCode012, Tue Jul 7 12:38:14 2020 +0700) Add cleanup for process_group
* 2aa3301 (NanoCode012, Tue Jul 7 12:07:40 2020 +0700) Remove apex.parallel, use torch.nn.parallel for future compatibility
* 77c8e27 (NanoCode012, Tue Jul 7 01:54:39 2020 +0700) Convert BatchNorm to SyncBatchNorm
* 96fa40a (yizhi.chen, Mon Jul 6 21:53:56 2020 +0800) Fix the dataset inconsistency problem
* 16e7c26 (yizhi.chen, Mon Jul 6 11:34:03 2020 +0800) Add loss multiplication to preserve the single-process performance
* e838055 (merge: 625bb49 3bdea3f; yizhi.chen, Fri Jul 3 20:56:30 2020 +0800) Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed
* 625bb49 (yizhi.chen, Thu Jul 2 22:45:15 2020 +0800) DDP established

* Fixed destroy_process_group in DP mode
* Update torch_utils.py
* Update utils.py: revert build_targets() to current master
* Update datasets.py
* Fixed world_size attribute not found

Co-authored-by: NanoCode012 <[email protected]>
Co-authored-by: Glenn Jocher <[email protected]>

* Update ci-testing.yml (ultralytics#445): update ci-testing.yml, requirements.txt, google_utils.py and test.py
* Pretrained model loading bug fix (ultralytics#450). Signed-off-by: Glenn Jocher <[email protected]>
* Update datasets.py (ultralytics#454)

Co-authored-by: Glenn Jocher <[email protected]>
Co-authored-by: Jirka <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: yzchen <[email protected]>
Co-authored-by: pritul dave <[email protected]>
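The SyncBN commits above ("Convert BatchNorm to SyncBatchNorm", "Make SyncBN a choice") boil down to a single conversion call on the model. A minimal sketch, using a toy module as a stand-in for the detector (the conversion itself does not require an initialized process group; the cross-process statistics sync only happens in actual DDP runs):

```python
import torch.nn as nn

# Toy model standing in for the detector; any nn.Module tree works.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

# Recursively replace every BatchNorm*d layer with SyncBatchNorm so that
# batch statistics are reduced across all DDP processes at train time.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```

Making SyncBN "a choice" then amounts to gating this call behind a CLI flag, since syncing statistics adds communication overhead that small-batch multi-GPU runs benefit from but large-batch runs may not.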
@NanoCode012 |
@MagicFrogSJTU, I haven't kept up with whatever is new in PyTorch 1.6 DDP, if anything. The reason I think it slows down is during create_dataloader. Each GPU creates N workers, and each worker executes the entire train.py. You can test it out by adding a print statement at global scope in train.py using my mp_spawn branch. 2 GPUs would mean 16 workers. |
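The worker arithmetic above follows from capping DataLoader workers per process. A rough sketch of such a cap (the function name and the hard ceiling of 8 are illustrative, not the repo's exact code):

```python
import os

def num_dataloader_workers(world_size: int, batch_size: int, hard_cap: int = 8) -> int:
    """Pick a DataLoader worker count for one DDP process."""
    # Divide available CPUs across DDP processes, then cap by batch size
    # and a hard ceiling so N GPUs don't spawn N * cpu_count() workers.
    cpus = os.cpu_count() or 1
    return max(1, min(cpus // max(world_size, 1), batch_size, hard_cap))
```

On a 16-core machine with 2 GPUs and a cap of 8, each process gets 8 workers, for 16 worker processes total, which matches the count quoted in the comment above.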
Also, I was just told that launch doesn't work on Windows. If it's possible, I would like to add spawn. |
I read the official documentation. These two are expected to be equal. And what's the meaning of |
Sorry, typing from mobile. It means each worker from the dataloader (we pass nw to DataLoader) calls the train.py file: it runs all the imports and redefines all the functions. That's why it was necessary to encapsulate all the global variables into functions. There was a note about this on PyTorch, but I cannot find it now. You can test the above by adding a simple print("global") at global scope (above def train) to count how many calls happen. I hope this is clearer. The branch can be found in your fork, called mp_spawn. Edit: The guide on PyTorch has been updated. Maybe there could be something we could use. |
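The re-execution described here is standard behavior of the spawn start method: every child process re-imports the launching module, so anything at global scope runs once per process. The usual fix is the `__main__` guard; a hypothetical sketch in the spirit of train.py (names are illustrative):

```python
# Sketch of a spawn-safe entry point.
print("module level: runs in the parent AND again in every spawned worker")

def train():
    # All real work lives inside functions, never at global scope,
    # so re-importing this module is cheap and side-effect free.
    return "training done"

if __name__ == "__main__":
    # Spawned workers import this module but skip this block, so the
    # training loop is started exactly once, by the launching process.
    print(train())
```

This is also why the PR moved global state in train.py into functions: under spawn (the only start method on Windows), unguarded module-level work would be repeated in every DataLoader worker.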
* Squashed commit of the following:

  * d738487 (NanoCode012, Tue Jul 14 17:33:38 2020 +0700) Adding world_size: reduce calls to torch.distributed, for use in create_dataloader
  * e742dd9 (yizhi.chen, Tue Jul 14 15:38:48 2020 +0800) Make SyncBN a choice
  * e90d400 (merge: 5bf8beb cd90360; yzchen, Tue Jul 14 15:32:10 2020 +0800) Merge pull request #6 from NanoCode012/patch-5: Update train.py
  * cd90360 (NanoCode012, Tue Jul 14 13:39:29 2020 +0700) Update train.py: remove redundant `opt.` prefix
  * 5bf8beb (merge: c9558a9 880d072; yizhi.chen, Tue Jul 14 14:09:51 2020 +0800) Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed
  * c9558a9 (yizhi.chen, Tue Jul 14 13:51:34 2020 +0800) Add device allocation for loss compute
  * 4f08c69 (yizhi.chen, Thu Jul 9 11:16:27 2020 +0800) Revert drop_last
  * 1dabe33 (merge: a1ce9b1 4b8450b; yizhi.chen, Thu Jul 9 11:15:49 2020 +0800) Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed
  * a1ce9b1 (yizhi.chen, Thu Jul 9 11:15:21 2020 +0800) Fix lr warning
  * 4b8450b (merge: b9a50ae 02c63ef; yzchen, Wed Jul 8 21:24:24 2020 +0800) Merge pull request #4 from NanoCode012/patch-4: Add drop_last for multi GPU
  * 02c63ef (NanoCode012, Wed Jul 8 10:08:30 2020 +0700) Add drop_last for multi GPU
  * b9a50ae (merge: ec2dc6c 86e7142; yizhi.chen, Tue Jul 7 19:48:04 2020 +0800) Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed
  * ec2dc6c (merge: d0326e3 82a6182; yizhi.chen, Tue Jul 7 19:34:31 2020 +0800) Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed
  * d0326e3 (yizhi.chen, Tue Jul 7 19:31:24 2020 +0800) Add SyncBN
  * 82a6182 (merge: 96fa40a 050b2a5; yzchen, Tue Jul 7 19:21:01 2020 +0800) Merge pull request #1 from NanoCode012/patch-2: Convert BatchNorm to SyncBatchNorm
  * 050b2a5 (NanoCode012, Tue Jul 7 12:38:14 2020 +0700) Add cleanup for process_group
  * 2aa3301 (NanoCode012, Tue Jul 7 12:07:40 2020 +0700) Remove apex.parallel, use torch.nn.parallel for future compatibility
  * 77c8e27 (NanoCode012, Tue Jul 7 01:54:39 2020 +0700) Convert BatchNorm to SyncBatchNorm
  * 96fa40a (yizhi.chen, Mon Jul 6 21:53:56 2020 +0800) Fix the dataset inconsistency problem
  * 16e7c26 (yizhi.chen, Mon Jul 6 11:34:03 2020 +0800) Add loss multiplication to preserve the single-process performance
  * e838055 (merge: 625bb49 31a9f25; yizhi.chen, Fri Jul 3 20:56:30 2020 +0800) Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed
  * 625bb49 (yizhi.chen, Thu Jul 2 22:45:15 2020 +0800) DDP established

* Squashed commit of the following:

  * 94147314e559a6bdd13cb9de62490d385c27596f (merge: 65157e2 9de5a7a; yizhi.chen, Thu Jul 16 14:00:17 2020 +0800) Merge branch 'master' of https://github.com/ultralytics/yolov4 into feature/DDP_fixed
  * 9de5a7a (Glenn Jocher, Wed Jul 15 20:03:41 2020 -0700) Update test.py --save-txt
  * 825e729 (Glenn Jocher, Wed Jul 15 20:00:48 2020 -0700) Update test.py --save-txt
  * 65157e2 (yizhi.chen, Wed Jul 15 16:44:13 2020 +0800) Revert the README.md removal
  * 1c802bf (merge: cd55b44 0f3b8bb; yizhi.chen, Wed Jul 15 16:43:38 2020 +0800) Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed
  * cd55b44 (yizhi.chen, Wed Jul 15 16:42:33 2020 +0800) Fix the DDP performance deterioration bug
  * 0f3b8bb (Glenn Jocher, Wed Jul 15 00:28:53 2020 -0700) Delete README.md
  * f5921ba (merge: 85ab2f3 bd3fdbb; yizhi.chen, Wed Jul 15 11:20:17 2020 +0800) Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed
  * bd3fdbb (Glenn Jocher, Tue Jul 14 18:38:20 2020 -0700) Update README.md
  * c1a97a7 (merge: 2bf86b8 7d73bfb; Glenn Jocher, Tue Jul 14 18:36:53 2020 -0700) Merge branch 'master' into feature/DDP_fixed
  * 2bf86b8 (NanoCode012, Tue Jul 14 22:18:15 2020 +0700) Fixed world_size not found when called from test
  * 85ab2f3 (merge: 5a19011 c8357ad; yizhi.chen, Tue Jul 14 22:19:58 2020 +0800) Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed
  * 5a19011 (yizhi.chen, Tue Jul 14 22:19:15 2020 +0800) Add assertion for <=2 GPUs DDP
  * c8357ad (merge: e742dd9 787582f; yzchen, Tue Jul 14 22:10:02 2020 +0800) Merge pull request ultralytics#8 from MagicFrogSJTU/NanoCode012-patch-1: Modify number of dataloaders' workers
  * 787582f (NanoCode012, Tue Jul 14 20:38:58 2020 +0700) Fixed issue with single GPU not having world_size
  * 6364892 (NanoCode012, Tue Jul 14 19:16:15 2020 +0700) Add assert message for clarification: clarify why assertion was thrown to users
  * 69364d6 (NanoCode012, Tue Jul 14 17:36:48 2020 +0700) Changed number of workers check
  * d738487 (NanoCode012, Tue Jul 14 17:33:38 2020 +0700) Adding world_size: reduce calls to torch.distributed, for use in create_dataloader
  * e742dd9 through 625bb49: same sequence as in the first squash list above (Make SyncBN a choice through DDP established)

* Fixed destroy_process_group in DP mode
* Update torch_utils.py
* Update utils.py: revert build_targets() to current master
* Update datasets.py
* Fixed world_size attribute not found

Co-authored-by: NanoCode012 <[email protected]>
Co-authored-by: Glenn Jocher <[email protected]>
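One squashed commit above, "Add loss multiplication to preserve the single-process performance", reflects a DDP detail worth spelling out: DDP averages gradients across processes, so the per-process loss is multiplied by the world size to keep effective gradient magnitudes comparable to a single-process run with the same total batch. A minimal sketch of the idea (the helper name is ours, not the repo's):

```python
def scale_loss_for_ddp(loss: float, world_size: int) -> float:
    # DDP all-reduces gradients with a mean over processes; multiplying
    # the loss by world_size cancels that averaging, so the effective
    # gradient matches a single process seeing the same total batch.
    return loss * world_size
```

With world_size == 1 this is a no-op, so single-GPU behavior is unchanged.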
Fixing DDP mode. #177
Work in progress, but most of the hard parts are already done!
There are lots of commits. Once everything is settled, I will squash them into two commits!
🛠️ PR Summary
Made with ❤️ by Ultralytics Actions
🌟 Summary
Enhanced YOLOv5 testing and training capabilities with text output and DDP support.
📊 Key Changes

* Introduced a `--save-txt` flag in `test.py` for saving test results in text format.
* Added DDP (DistributedDataParallel) support in `train.py`.
* Added a `torch_distributed_zero_first` context manager for synchronizing distributed datasets.
* Updated the `create_dataloader` function to support distributed training in `utils/datasets.py`.
* Changes to the `exif_size` function.

🎯 Purpose & Impact

* The `--save-txt` option allows users to output test results as text files, enabling easier analysis of model performance.
* The `torch_distributed_zero_first` context manager ensures smooth loading of datasets without clashes in a distributed training setup.