Feature/ddp fixed #401
Conversation
Thanks guys! I looked over the files. It looks like some of the simpler commits could be grouped into their own smaller PR that would be much faster to merge, for example the Dockerfile and README updates. BTW, the argparse arguments for files are smart, so you don't need to supply the entire path.

I wrapped up my current baselining using 1x, 2x and 4x T4 GPUs (in order from legend top to bottom). The epoch train times were 29, 19 and 15 min each; the test times were always around 1 min. Trained to 40 epochs each (well, trained to 300 and then CTRL-C after 40) using the following command, these were the curves below. The final epoch 39 mAPs ranged from 0.252 to 0.254 (essentially identical). I'd like to repeat the same set of tests with the PR branch if I have some time this week.

```bash
python train.py --batch 64 --cfg yolov5s.yaml --data coco.yaml --img 640 --nosave --device 0,1,2,3
```

EDIT: Is there any difference in the command required with the PR? What's the equivalent command to the one above for the branch? Thanks!
For a single GPU, it would be the same. For multiple GPUs, we would have to use:

```bash
python -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size 64 --data coco.yaml --cfg models/yolov5s.yaml --weights '' --epochs 300
```

Theoretically, we can expand this code to use multiple nodes with multiple GPUs, but I don't think it's necessary.
The commits will be grouped into several commits once everything has settled down!
```bash
# 2-GPU DDP
python -m torch.distributed.launch --nproc_per_node 2 train.py --data data/coco.yaml --batch-size 64 --cfg models/yolov5s.yaml --weights '' --epochs 300 --device 0,1

# 2-GPU DDP with SyncBN
python -m torch.distributed.launch --nproc_per_node 2 train.py --data data/coco.yaml --batch-size 64 --cfg models/yolov5s.yaml --weights '' --epochs 300 --device 0,1 --sync-bn

# 4-GPU DDP is not supported right now: it produces lower performance,
# and the reason remains unknown, as discussed in #264
```

Here are my test results for an earlier epoch.
In conclusion, 2-GPU DDP without SyncBN is the better choice for DDP right now, while DP is applicable to arbitrary GPU counts.
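For context on what changes inside the script when it is launched this way: `torch.distributed.launch` passes a `--local_rank` argument to each process it spawns. The minimal parser below is illustrative only (not the repo's actual code); the flag names mirror the commands above:

```python
import argparse

def parse_ddp_args(argv):
    """Minimal sketch of the per-process flags a train.py launched via
    torch.distributed.launch might read. The launcher injects --local_rank
    into each spawned process; a single-GPU run leaves it at -1."""
    p = argparse.ArgumentParser()
    p.add_argument('--local_rank', type=int, default=-1,
                   help='set automatically by torch.distributed.launch')
    p.add_argument('--device', default='', help='e.g. 0,1')
    p.add_argument('--sync-bn', action='store_true', dest='sync_bn')
    # parse_known_args ignores the remaining training flags for this sketch
    opt, _ = p.parse_known_args(argv)
    return opt

# Example: what process 1 of a 2-GPU SyncBN launch would see
opt = parse_ddp_args(['--local_rank', '1', '--device', '0,1', '--sync-bn'])
```

A single-GPU invocation works unchanged because `--local_rank` simply keeps its default.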
Don't forget to change the first epoch of DDP 4 from 5/6% to 0.5/0.6%, since it would otherwise say the wrong thing. Edit:
@MagicFrogSJTU got it, thanks for the table! What we need to do now is update it with the default 2x and 4x GPU runs to compare the multi-GPU updates against the current multi-GPU baseline. If 4 GPUs are not working correctly... it's going to be a bit problematic. I know some groups are using 4x and even 8x GPU trainings currently, so naturally we need a robust solution for everyone.
I copied this from the other Issue to keep things closer. These are results from my runs. Table runs:
Ft is short for Magic's feature/DDP_fixed branch
My opinion is to enable DDP for 2 GPUs and use DP for anything higher, until the issue can be found.
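As a sketch of this interim policy, a hypothetical helper (illustrative only, not part of the codebase) for picking the parallel mode might look like:

```python
def choose_parallel_mode(num_gpus: int) -> str:
    """Interim policy discussed above: DDP for exactly 2 GPUs (faster),
    DP for higher counts until the 4-GPU DDP slowdown is understood.
    (Hypothetical helper; names are illustrative.)"""
    if num_gpus <= 1:
        return 'single'
    if num_gpus == 2:
        return 'ddp'
    return 'dp'
```

Under this rule a 4x or 8x machine would fall back to DP, matching the assertion for <=2 GPUs added in the branch.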
@MagicFrogSJTU oops, I might have messed up the PR. I meant to remove the README change, as I just pushed a few updates and included the quick fix you had here. Ah perfect, I see the updated table. It's late here; will get back to this tomorrow.
I used to run programs with 10x and 8x GPUs a lot, but I have never come across a case where 2x GPUs work and 4x GPUs don't.
Never mind. Good night!
Here are the results from the table in #401 (comment) plotted on a graph. @MagicFrogSJTU, can I have your results.txt so I can compile them into one picture?
@glenn-jocher @NanoCode012
Because the random seed is the same for every process, the 3 other sampled images are the same for every process! This of course reduces training efficiency!
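A minimal sketch of the kind of fix implied here, assuming a hypothetical `seed_for_rank` helper (not the repo's actual API): offset the base seed by the process rank, so each rank draws different samples while the run remains reproducible from the base seed:

```python
import random

def seed_for_rank(base_seed: int, rank: int) -> int:
    """Derive a distinct seed per DDP process so each rank samples
    different extra images, while the whole run stays reproducible
    from base_seed. In a real training script you would also pass
    the result to np.random.seed() and torch.manual_seed().
    (Hypothetical helper; illustrative only.)"""
    seed = base_seed + rank
    random.seed(seed)
    return seed
```

With the original all-processes-same-seed setup, every rank would draw the same images; offsetting by rank breaks that tie without sacrificing reproducibility.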
@glenn-jocher |
Hi @MagicFrogSJTU, I looked and saw that before. I saw in documentation that we should set their seed to the same value. I think I also saw this in the PyTorch documentation but cannot find it now. However, I will set mine to run.
Setting the random seed to a fixed value is key to experiment reproduction. Modern DDP will broadcast the weights of rank 0 to the other processes when DDP is set up, so there is no need to set the same random seed across processes for this purpose.
I see. I guess that's why I missed it.
My machine is down for maintenance. I don't know when it will recover...
@MagicFrogSJTU See table below!
I'm also setting 1- and 2-GPU runs going right now to make sure nothing abnormal happened! I'm also not sure if rebasing is the best thing to do, because we will lose the history of the commits, and some are valuable parts, like this point on "DDP deterioration". I think there is an option on GitHub to "squash" commits into one big commit.
Thanks for your experiments!
@MagicFrogSJTU I think the results are quite clear. f is Magic's feature branch. Edit: Added 8 GPU
UnitTest passed for the branch. I added a test for DDP training.

```bash
set -e
rm -rf yolov5 && git clone https://github.com/MagicFrogSJTU/yolov5.git -b feature/DDP_fixed && cd yolov5
pip install -qr requirements.txt onnx
python3 -c "from utils.google_utils import *; gdrive_download('1n_oKgR81BJtqk75b00eAjdv03qVCQn2f', 'coco128.zip')" && mv -n ./coco128 ../
export PYTHONPATH="$PWD"  # to run *.py files in subdirectories

for x in yolov5s  # yolov5m yolov5l yolov5x # models
do
  python -m torch.distributed.launch --nproc_per_node 2 train.py --weights $x.pt --cfg models/$x.yaml --epochs 3 --img 320 --device 0,1  # DDP train
  for di in 0,1 0 cpu  # inference devices
  do
    python train.py --weights $x.pt --cfg models/$x.yaml --epochs 3 --img 320 --device $di  # train
    python detect.py --weights $x.pt --device $di  # detect official
    python detect.py --weights runs/exp0/weights/last.pt --device $di  # detect custom
    python test.py --weights $x.pt --device $di  # test official
    python test.py --weights runs/exp0/weights/last.pt --device $di  # test custom
  done
  python models/yolo.py --cfg $x.yaml  # inspect
  python models/export.py --weights $x.pt --img 640 --batch 1  # export
done
```

Edit: Add log unittest-log.txt
@glenn-jocher |
@MagicFrogSJTU @NanoCode012 awesome guys, thanks for the updated plots! They look perfect, and the unit tests are passing, so we are all set. OK, I will look through the updates today!
I fixed the world_size bug. Tested it also on 1-GPU train, test and detect. CI covered CPU. Will re-run it fully when my machine is available.
May I ask what dataset you were training on? Did you set any specific parameters? Was it because you increased the batch size?
I've set DP mode on Magic's branch to test for comparisons. Edit: Added chart. SyncBN is off. Batch size 64. It would be great if you can duplicate the result for the PR branch; it just seems so unreal. Time is averaged over 3 epochs.

@MagicFrogSJTU, I'm a bit confused when running your branch's DP at different batch sizes (64, 128, 256) for (2, 4, 8) GPUs. They all take about 11-12 minutes to run. I was expecting it to be faster. Accuracy also slightly drops with more GPUs.
On COCO. No. My batch size is 64. It was done long ago, like a month, with master code. Maybe the code has changed a lot since then.
It happens when the batch size is not the key constraint on speed. I assume the data transfer between GPUs and the CPU overhead are more significant now for DP mode. Accuracy drops if you run DP on more GPUs because the batch size per GPU becomes too small. This is why we introduced SyncBN for DDP mode. (By the way, SyncBN is not applicable in DP mode.)
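To make the per-GPU batch point concrete, here is an illustrative helper (not the repo's code) showing how the effective BatchNorm batch shrinks as the process count grows; the real remedy mentioned above is `torch.nn.SyncBatchNorm` (converted via `convert_sync_batchnorm`), which synchronizes statistics across DDP processes:

```python
def per_gpu_batch(total_batch: int, world_size: int) -> int:
    """In DDP the total --batch-size is split evenly across processes,
    so plain BatchNorm computes statistics over only
    total_batch // world_size samples per GPU. When that number gets
    small, accuracy degrades, which is what SyncBN addresses.
    (Illustrative helper, not the repo's code.)"""
    assert total_batch % world_size == 0, 'total batch must divide evenly'
    return total_batch // world_size
```

For example, `--batch-size 64` on 8 GPUs leaves each BatchNorm layer with only 8 samples of statistics per step, versus 32 on 2 GPUs.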
Yes, these are the exact same results I found myself for current master. I'm assuming the same thing: on a T4 the speed is GPU-TOPS constrained, but on a V100 that constraint is removed and the new constraint is CPU-GPU communication, as well as the device 0 tasks that DP is doing. OK, all I have left is to finish reviewing train.py; all other files are good.
@MagicFrogSJTU OK, I understand about mp.spawn. It's unfortunate that the multi-GPU training process now requires a different command, and it's a bit more confusing to implement, but you guys have definitely succeeded in speeding it up greatly, which is the most important result, of course. I think train.py could use a bit of simplification in the future, as it's more complicated to understand now than before, but I'll go ahead and merge this, and then we can make tweaks as needed going forward. Good job guys!!
Great!
I have come up with several things to fix
@NanoCode012 Do you have any more ideas in mind? Edit 0:
The only thing left that comes to mind is to
Edit: There is one qualm about mp.spawn, though. Each time a dataloader is created, it re-runs the entire script (train.py) [anything outside a function], which slightly slows down the code. If there are 8 dataloaders per GPU, that is a source of slowdown. That was why I loved
I understand. I meant that we can use

For mp.spawn, we have to add

Edit: I want to make something clear.
You got a point!
Hi @glenn-jocher, I have implemented mp.spawn over the current code: https://github.com/MagicFrogSJTU/yolov5/tree/feature/mp_spawn However, I'm still testing the speed/accuracy. I'm just giving you a heads-up before you start making a tutorial on DDP. Also, we should separate/hide the output from the different GPUs. @MagicFrogSJTU made an interesting point on logging. He suggests to use
@NanoCode012 OK, got it. No, I have not started a tutorial yet; I'm waiting until this settles a bit. But I think before going further you should do a git pull to bring your branch up to speed with the current master (I see 12 ahead, 54 behind on your branch). The main complication in merging the last PR was that the code had drifted between the two branches in the meantime, so if you start from the current master it will make future PRs much easier. I'll look into the logging idea.
@MagicFrogSJTU @NanoCode012 @alexstoken hi guys, a quick update here. I've been retraining the current models (which I'll call yolov5.1) and also training two new architectures, yolov5.2 and yolov5.3. I don't want to confuse everyone with a bunch of new names, but this is the simplest scheme I could think of, and it leaves the door open to more experiments like yolov5.4 in the future. Each of the 3 comes in the same sizes as before, i.e. yolov5.1s, yolov5.1m, etc.

The baseline yolov5.1 shows slight improvements for the larger models, and the other two mainly show improvements for the smaller models, so there is no clear winner in my experiments (5.3 is not 'better' across the board than 5.2 or 5.1, for example; they are just different architecture compromises). 5.3 and 5.2 are better for small objects, but they are also slower than 5.1, as they introduce more ops on the P2/4 grid. These models include breaking changes that will unfortunately make current models incompatible, but I think the changes are beneficial for the long term, as they simplify the architecture a bit. I want to release all of this in about a week; I'm waiting on the final 5x models to finish training. In the meantime I'm holding off on making changes, because I'm not sure if you guys are making a lot of modifications to your local branches.

I think the most important thing you can do right now is to update your current branches to master to streamline any future PRs, as most of my holdup when merging is due to confusion about whether commits are old or new. It's just an unfortunate side-effect of many people working on the same code region. This is mainly my fault too, of course, for pushing so many commits straight to master randomly throughout the week. In the future I'll try to consolidate my changes into fewer commits, and also open PRs myself to better group commits and push less often.
@NanoCode012 oh wow, this is great work, good job! Yes, it looks like launch provides faster times, interesting. Well, that's unfortunate then; maybe we should stick with the current work and simply try to clean up train.py a bit to make it more readable. What do you think?

I think your N4 and N8 experiments are showing the same times because the GPU ops no longer constrain the speed at that point; something else must be the bottleneck there, likely reading images from the hard drive, or moving data from CPU to GPU. For larger models, like yolov5l and up, I think you'll probably get a curve closer to what you'd expect, with N8 showing speed improvements over N4. 300 seconds for a COCO epoch is insanely fast in any case.

The ultimate training speed would be N8 with train.py --cache, as all of the images would be preloaded into RAM, removing the hard-drive read-speed constraint from the picture. At img-size 640 for COCO, though, this requires about 150 GB of system RAM, so it's not quite feasible with today's hardware. For smaller datasets this is quite feasible and makes a huge training-speed difference.
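The ~150 GB figure above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes COCO train2017's ~118k images cached as raw uint8 RGB at 640x640 (actual letterboxed shapes vary per image, so this is only an estimate):

```python
def cache_ram_gb(num_images: int, img_size: int = 640, channels: int = 3) -> float:
    """Rough RAM estimate for caching a dataset in memory as raw uint8
    RGB arrays (1 byte per channel per pixel). Approximation only:
    real cached shapes depend on letterboxing and aspect ratios."""
    bytes_total = num_images * img_size * img_size * channels
    return bytes_total / 1e9  # decimal GB

# COCO train2017 has ~118,287 images
estimate = cache_ram_gb(118_287)  # about 145 GB, consistent with ~150 GB
```

The estimate lands around 145 GB, which matches the "about 150 GB" figure once per-image overhead is included.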
* update test.py --save-txt * update test.py --save-txt * add GH action tests * requirements * requirements * requirements * fix tests * add badge * lower batch-size * weights * args * parallel * rename eval * rename eval * paths * rename * lower bs * timeout * less xOS * drop xOS * git attrib * paths * paths * Apply suggestions from code review * Update eval.py * Update eval.py * update requirements.txt * Update ci-testing.yml * Update ci-testing.yml * rename test * revert test module to confuse users... * update hubconf.py * update common.py add Classify() * Update ci-testing.yml * Update ci-testing.yml * Update ci-testing.yml * Update ci-testing.yml * update common.py Classify() * Update ci-testing.yml * update test.py * update train.py ckpt loading * update train.py class count assertion ultralytics#424 * update train.py class count assertion ultralytics#424 Signed-off-by: Glenn Jocher <[email protected]> * Update requirements.txt * [WIP] Feature/ddp fixed (ultralytics#401) * Squashed commit of the following: commit d738487 Author: NanoCode012 <[email protected]> Date: Tue Jul 14 17:33:38 2020 +0700 Adding world_size Reduce calls to torch.distributed. For use in create_dataloader. commit e742dd9 Author: yizhi.chen <[email protected]> Date: Tue Jul 14 15:38:48 2020 +0800 Make SyncBN a choice commit e90d400 Merge: 5bf8beb cd90360 Author: yzchen <[email protected]> Date: Tue Jul 14 15:32:10 2020 +0800 Merge pull request #6 from NanoCode012/patch-5 Update train.py commit cd90360 Author: NanoCode012 <[email protected]> Date: Tue Jul 14 13:39:29 2020 +0700 Update train.py Remove redundant `opt.` prefix. 
commit 5bf8beb Merge: c9558a9 a1c8406 Author: yizhi.chen <[email protected]> Date: Tue Jul 14 14:09:51 2020 +0800 Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed commit c9558a9 Author: yizhi.chen <[email protected]> Date: Tue Jul 14 13:51:34 2020 +0800 Add device allocation for loss compute commit 4f08c69 Author: yizhi.chen <[email protected]> Date: Thu Jul 9 11:16:27 2020 +0800 Revert drop_last commit 1dabe33 Merge: a1ce9b1 4b8450b Author: yizhi.chen <[email protected]> Date: Thu Jul 9 11:15:49 2020 +0800 Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed commit a1ce9b1 Author: yizhi.chen <[email protected]> Date: Thu Jul 9 11:15:21 2020 +0800 fix lr warning commit 4b8450b Merge: b9a50ae 02c63ef Author: yzchen <[email protected]> Date: Wed Jul 8 21:24:24 2020 +0800 Merge pull request #4 from NanoCode012/patch-4 Add drop_last for multi gpu commit 02c63ef Author: NanoCode012 <[email protected]> Date: Wed Jul 8 10:08:30 2020 +0700 Add drop_last for multi gpu commit b9a50ae Merge: ec2dc6c 121d90b Author: yizhi.chen <[email protected]> Date: Tue Jul 7 19:48:04 2020 +0800 Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed commit ec2dc6c Merge: d0326e3 82a6182 Author: yizhi.chen <[email protected]> Date: Tue Jul 7 19:34:31 2020 +0800 Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed commit d0326e3 Author: yizhi.chen <[email protected]> Date: Tue Jul 7 19:31:24 2020 +0800 Add SyncBN commit 82a6182 Merge: 96fa40a 050b2a5 Author: yzchen <[email protected]> Date: Tue Jul 7 19:21:01 2020 +0800 Merge pull request #1 from NanoCode012/patch-2 Convert BatchNorm to SyncBatchNorm commit 050b2a5 Author: NanoCode012 <[email protected]> Date: Tue Jul 7 12:38:14 2020 +0700 Add cleanup for process_group commit 2aa3301 Author: NanoCode012 <[email protected]> Date: Tue Jul 7 12:07:40 2020 +0700 Remove 
apex.parallel. Use torch.nn.parallel For future compatibility commit 77c8e27 Author: NanoCode012 <[email protected]> Date: Tue Jul 7 01:54:39 2020 +0700 Convert BatchNorm to SyncBatchNorm commit 96fa40a Author: yizhi.chen <[email protected]> Date: Mon Jul 6 21:53:56 2020 +0800 Fix the datset inconsistency problem commit 16e7c26 Author: yizhi.chen <[email protected]> Date: Mon Jul 6 11:34:03 2020 +0800 Add loss multiplication to preserver the single-process performance commit e838055 Merge: 625bb49 3bdea3f Author: yizhi.chen <[email protected]> Date: Fri Jul 3 20:56:30 2020 +0800 Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed commit 625bb49 Author: yizhi.chen <[email protected]> Date: Thu Jul 2 22:45:15 2020 +0800 DDP established * Squashed commit of the following: commit 94147314e559a6bdd13cb9de62490d385c27596f Merge: 65157e2 37acbdc Author: yizhi.chen <[email protected]> Date: Thu Jul 16 14:00:17 2020 +0800 Merge branch 'master' of https://github.com/ultralytics/yolov4 into feature/DDP_fixed commit 37acbdc Author: Glenn Jocher <[email protected]> Date: Wed Jul 15 20:03:41 2020 -0700 update test.py --save-txt commit b8c2da4 Author: Glenn Jocher <[email protected]> Date: Wed Jul 15 20:00:48 2020 -0700 update test.py --save-txt commit 65157e2 Author: yizhi.chen <[email protected]> Date: Wed Jul 15 16:44:13 2020 +0800 Revert the README.md removal commit 1c802bf Merge: cd55b44 0f3b8bb Author: yizhi.chen <[email protected]> Date: Wed Jul 15 16:43:38 2020 +0800 Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed commit cd55b44 Author: yizhi.chen <[email protected]> Date: Wed Jul 15 16:42:33 2020 +0800 fix the DDP performance deterioration bug. 
commit 0f3b8bb Author: Glenn Jocher <[email protected]> Date: Wed Jul 15 00:28:53 2020 -0700 Delete README.md commit f5921ba Merge: 85ab2f3 bd3fdbb Author: yizhi.chen <[email protected]> Date: Wed Jul 15 11:20:17 2020 +0800 Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed commit bd3fdbb Author: Glenn Jocher <[email protected]> Date: Tue Jul 14 18:38:20 2020 -0700 Update README.md commit c1a97a7 Merge: 2bf86b8 f796708 Author: Glenn Jocher <[email protected]> Date: Tue Jul 14 18:36:53 2020 -0700 Merge branch 'master' into feature/DDP_fixed commit 2bf86b8 Author: NanoCode012 <[email protected]> Date: Tue Jul 14 22:18:15 2020 +0700 Fixed world_size not found when called from test commit 85ab2f3 Merge: 5a19011 c8357ad Author: yizhi.chen <[email protected]> Date: Tue Jul 14 22:19:58 2020 +0800 Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed commit 5a19011 Author: yizhi.chen <[email protected]> Date: Tue Jul 14 22:19:15 2020 +0800 Add assertion for <=2 gpus DDP commit c8357ad Merge: e742dd9 787582f Author: yzchen <[email protected]> Date: Tue Jul 14 22:10:02 2020 +0800 Merge pull request #8 from MagicFrogSJTU/NanoCode012-patch-1 Modify number of dataloaders' workers commit 787582f Author: NanoCode012 <[email protected]> Date: Tue Jul 14 20:38:58 2020 +0700 Fixed issue with single gpu not having world_size commit 6364892 Author: NanoCode012 <[email protected]> Date: Tue Jul 14 19:16:15 2020 +0700 Add assert message for clarification Clarify why assertion was thrown to users commit 69364d6 Author: NanoCode012 <[email protected]> Date: Tue Jul 14 17:36:48 2020 +0700 Changed number of workers check commit d738487 Author: NanoCode012 <[email protected]> Date: Tue Jul 14 17:33:38 2020 +0700 Adding world_size Reduce calls to torch.distributed. For use in create_dataloader. 
* e742dd9 (yizhi.chen, Tue Jul 14 15:38:48 2020 +0800) Make SyncBN a choice
* e90d400 (merge: 5bf8beb cd90360; yzchen, Tue Jul 14 15:32:10 2020 +0800) Merge pull request #6 from NanoCode012/patch-5: Update train.py
* cd90360 (NanoCode012, Tue Jul 14 13:39:29 2020 +0700) Update train.py: remove redundant `opt.` prefix
* 5bf8beb (merge: c9558a9 a1c8406; yizhi.chen, Tue Jul 14 14:09:51 2020 +0800) Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed
* c9558a9 (yizhi.chen, Tue Jul 14 13:51:34 2020 +0800) Add device allocation for loss compute
* 4f08c69 (yizhi.chen, Thu Jul 9 11:16:27 2020 +0800) Revert drop_last
* 1dabe33 (merge: a1ce9b1 4b8450b; yizhi.chen, Thu Jul 9 11:15:49 2020 +0800) Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed
* a1ce9b1 (yizhi.chen, Thu Jul 9 11:15:21 2020 +0800) Fix lr warning
* 4b8450b (merge: b9a50ae 02c63ef; yzchen, Wed Jul 8 21:24:24 2020 +0800) Merge pull request #4 from NanoCode012/patch-4: Add drop_last for multi GPU
* 02c63ef (NanoCode012, Wed Jul 8 10:08:30 2020 +0700) Add drop_last for multi GPU
* b9a50ae (merge: ec2dc6c 121d90b; yizhi.chen, Tue Jul 7 19:48:04 2020 +0800) Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed
* ec2dc6c (merge: d0326e3 82a6182; yizhi.chen, Tue Jul 7 19:34:31 2020 +0800) Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed
* d0326e3 (yizhi.chen, Tue Jul 7 19:31:24 2020 +0800) Add SyncBN
* 82a6182 (merge: 96fa40a 050b2a5; yzchen, Tue Jul 7 19:21:01 2020 +0800) Merge pull request #1 from NanoCode012/patch-2: Convert BatchNorm to SyncBatchNorm
* 050b2a5 (NanoCode012, Tue Jul 7 12:38:14 2020 +0700) Add cleanup for process_group
* 2aa3301 (NanoCode012, Tue Jul 7 12:07:40 2020 +0700) Remove apex.parallel, use torch.nn.parallel for future compatibility
* 77c8e27 (NanoCode012, Tue Jul 7 01:54:39 2020 +0700) Convert BatchNorm to SyncBatchNorm
* 96fa40a (yizhi.chen, Mon Jul 6 21:53:56 2020 +0800) Fix the dataset inconsistency problem
* 16e7c26 (yizhi.chen, Mon Jul 6 11:34:03 2020 +0800) Add loss multiplication to preserve the single-process performance
* e838055 (merge: 625bb49 3bdea3f; yizhi.chen, Fri Jul 3 20:56:30 2020 +0800) Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed
* 625bb49 (yizhi.chen, Thu Jul 2 22:45:15 2020 +0800) DDP established

* Fixed destroy_process_group in DP mode
* Update torch_utils.py
* Update utils.py: revert build_targets() to current master
* Update datasets.py
* Fixed world_size attribute not found

Co-authored-by: NanoCode012 <[email protected]>
Co-authored-by: Glenn Jocher <[email protected]>

* Update ci-testing.yml (ultralytics#445): update ci-testing.yml, requirements.txt, google_utils.py and test.py
* Pretrained model loading bug fix (ultralytics#450). Signed-off-by: Glenn Jocher <[email protected]>
* Update datasets.py (ultralytics#454)

Co-authored-by: Glenn Jocher <[email protected]>
Co-authored-by: Jirka <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: yzchen <[email protected]>
Co-authored-by: pritul dave <[email protected]>
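The SyncBN commits above ("Convert BatchNorm to SyncBatchNorm", "Make SyncBN a choice") boil down to a single conversion call on the model. A minimal sketch, using a toy module as a stand-in for the detector (the conversion itself does not require an initialized process group; the cross-process statistics sync only happens in actual DDP runs):

```python
import torch.nn as nn

# Toy model standing in for the detector; any nn.Module tree works.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

# Recursively replace every BatchNorm*d layer with SyncBatchNorm so that
# batch statistics are reduced across all DDP processes at train time.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```

Making SyncBN "a choice" then amounts to gating this call behind a CLI flag, since syncing statistics adds communication overhead that small-batch multi-GPU runs benefit from but large-batch runs may not.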
@NanoCode012 |
@MagicFrogSJTU, I haven't kept up with whatever is new in PyTorch 1.6 DDP, if anything. The reason I think it slows down is during create_dataloader. Each GPU creates N workers, and each worker executes the entire train.py. You can test it out by adding a print statement at global scope in train.py using my mp_spawn branch. 2 GPUs would mean 16 workers. |
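The worker arithmetic above follows from capping DataLoader workers per process. A rough sketch of such a cap (the function name and the hard ceiling of 8 are illustrative, not the repo's exact code):

```python
import os

def num_dataloader_workers(world_size: int, batch_size: int, hard_cap: int = 8) -> int:
    """Pick a DataLoader worker count for one DDP process."""
    # Divide available CPUs across DDP processes, then cap by batch size
    # and a hard ceiling so N GPUs don't spawn N * cpu_count() workers.
    cpus = os.cpu_count() or 1
    return max(1, min(cpus // max(world_size, 1), batch_size, hard_cap))
```

On a 16-core machine with 2 GPUs and a cap of 8, each process gets 8 workers, for 16 worker processes total, which matches the count quoted in the comment above.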
Also, I was just told that launch doesn't work on Windows. If it's possible, I would like to add spawn. |
I read the official documentation. These two are expected to be equal. And what's the meaning of |
Sorry, typing from mobile. It means each worker from the dataloader (we pass nw to DataLoader) calls the train.py file: it runs all the imports and redefines all the functions. That's why it was necessary to encapsulate all the global variables into functions. There was a note about this on PyTorch, but I cannot find it now. You can test the above by adding a simple print("global") at global scope (above def train) to count how many calls happen. I hope this is clearer. The branch can be found in your fork, called mp_spawn. Edit: The guide on PyTorch has been updated. Maybe there could be something we could use. |
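The re-execution described here is standard behavior of the spawn start method: every child process re-imports the launching module, so anything at global scope runs once per process. The usual fix is the `__main__` guard; a hypothetical sketch in the spirit of train.py (names are illustrative):

```python
# Sketch of a spawn-safe entry point.
print("module level: runs in the parent AND again in every spawned worker")

def train():
    # All real work lives inside functions, never at global scope,
    # so re-importing this module is cheap and side-effect free.
    return "training done"

if __name__ == "__main__":
    # Spawned workers import this module but skip this block, so the
    # training loop is started exactly once, by the launching process.
    print(train())
```

This is also why the PR moved global state in train.py into functions: under spawn (the only start method on Windows), unguarded module-level work would be repeated in every DataLoader worker.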
* Squashed commit of the following:

  * d738487 (NanoCode012, Tue Jul 14 17:33:38 2020 +0700) Adding world_size: reduce calls to torch.distributed, for use in create_dataloader
  * e742dd9 (yizhi.chen, Tue Jul 14 15:38:48 2020 +0800) Make SyncBN a choice
  * e90d400 (merge: 5bf8beb cd90360; yzchen, Tue Jul 14 15:32:10 2020 +0800) Merge pull request #6 from NanoCode012/patch-5: Update train.py
  * cd90360 (NanoCode012, Tue Jul 14 13:39:29 2020 +0700) Update train.py: remove redundant `opt.` prefix
  * 5bf8beb (merge: c9558a9 880d072; yizhi.chen, Tue Jul 14 14:09:51 2020 +0800) Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed
  * c9558a9 (yizhi.chen, Tue Jul 14 13:51:34 2020 +0800) Add device allocation for loss compute
  * 4f08c69 (yizhi.chen, Thu Jul 9 11:16:27 2020 +0800) Revert drop_last
  * 1dabe33 (merge: a1ce9b1 4b8450b; yizhi.chen, Thu Jul 9 11:15:49 2020 +0800) Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed
  * a1ce9b1 (yizhi.chen, Thu Jul 9 11:15:21 2020 +0800) Fix lr warning
  * 4b8450b (merge: b9a50ae 02c63ef; yzchen, Wed Jul 8 21:24:24 2020 +0800) Merge pull request #4 from NanoCode012/patch-4: Add drop_last for multi GPU
  * 02c63ef (NanoCode012, Wed Jul 8 10:08:30 2020 +0700) Add drop_last for multi GPU
  * b9a50ae (merge: ec2dc6c 86e7142; yizhi.chen, Tue Jul 7 19:48:04 2020 +0800) Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed
  * ec2dc6c (merge: d0326e3 82a6182; yizhi.chen, Tue Jul 7 19:34:31 2020 +0800) Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed
  * d0326e3 (yizhi.chen, Tue Jul 7 19:31:24 2020 +0800) Add SyncBN
  * 82a6182 (merge: 96fa40a 050b2a5; yzchen, Tue Jul 7 19:21:01 2020 +0800) Merge pull request #1 from NanoCode012/patch-2: Convert BatchNorm to SyncBatchNorm
  * 050b2a5 (NanoCode012, Tue Jul 7 12:38:14 2020 +0700) Add cleanup for process_group
  * 2aa3301 (NanoCode012, Tue Jul 7 12:07:40 2020 +0700) Remove apex.parallel, use torch.nn.parallel for future compatibility
  * 77c8e27 (NanoCode012, Tue Jul 7 01:54:39 2020 +0700) Convert BatchNorm to SyncBatchNorm
  * 96fa40a (yizhi.chen, Mon Jul 6 21:53:56 2020 +0800) Fix the dataset inconsistency problem
  * 16e7c26 (yizhi.chen, Mon Jul 6 11:34:03 2020 +0800) Add loss multiplication to preserve the single-process performance
  * e838055 (merge: 625bb49 31a9f25; yizhi.chen, Fri Jul 3 20:56:30 2020 +0800) Merge branch 'master' of https://github.com/ultralytics/yolov5 into feature/DDP_fixed
  * 625bb49 (yizhi.chen, Thu Jul 2 22:45:15 2020 +0800) DDP established

* Squashed commit of the following:

  * 94147314e559a6bdd13cb9de62490d385c27596f (merge: 65157e2 9de5a7a; yizhi.chen, Thu Jul 16 14:00:17 2020 +0800) Merge branch 'master' of https://github.com/ultralytics/yolov4 into feature/DDP_fixed
  * 9de5a7a (Glenn Jocher, Wed Jul 15 20:03:41 2020 -0700) Update test.py --save-txt
  * 825e729 (Glenn Jocher, Wed Jul 15 20:00:48 2020 -0700) Update test.py --save-txt
  * 65157e2 (yizhi.chen, Wed Jul 15 16:44:13 2020 +0800) Revert the README.md removal
  * 1c802bf (merge: cd55b44 0f3b8bb; yizhi.chen, Wed Jul 15 16:43:38 2020 +0800) Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed
  * cd55b44 (yizhi.chen, Wed Jul 15 16:42:33 2020 +0800) Fix the DDP performance deterioration bug
  * 0f3b8bb (Glenn Jocher, Wed Jul 15 00:28:53 2020 -0700) Delete README.md
  * f5921ba (merge: 85ab2f3 bd3fdbb; yizhi.chen, Wed Jul 15 11:20:17 2020 +0800) Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed
  * bd3fdbb (Glenn Jocher, Tue Jul 14 18:38:20 2020 -0700) Update README.md
  * c1a97a7 (merge: 2bf86b8 7d73bfb; Glenn Jocher, Tue Jul 14 18:36:53 2020 -0700) Merge branch 'master' into feature/DDP_fixed
  * 2bf86b8 (NanoCode012, Tue Jul 14 22:18:15 2020 +0700) Fixed world_size not found when called from test
  * 85ab2f3 (merge: 5a19011 c8357ad; yizhi.chen, Tue Jul 14 22:19:58 2020 +0800) Merge branch 'feature/DDP_fixed' of https://github.com/MagicFrogSJTU/yolov5 into feature/DDP_fixed
  * 5a19011 (yizhi.chen, Tue Jul 14 22:19:15 2020 +0800) Add assertion for <=2 GPUs DDP
  * c8357ad (merge: e742dd9 787582f; yzchen, Tue Jul 14 22:10:02 2020 +0800) Merge pull request ultralytics#8 from MagicFrogSJTU/NanoCode012-patch-1: Modify number of dataloaders' workers
  * 787582f (NanoCode012, Tue Jul 14 20:38:58 2020 +0700) Fixed issue with single GPU not having world_size
  * 6364892 (NanoCode012, Tue Jul 14 19:16:15 2020 +0700) Add assert message for clarification: clarify why assertion was thrown to users
  * 69364d6 (NanoCode012, Tue Jul 14 17:36:48 2020 +0700) Changed number of workers check
  * d738487 (NanoCode012, Tue Jul 14 17:33:38 2020 +0700) Adding world_size: reduce calls to torch.distributed, for use in create_dataloader
  * e742dd9 through 625bb49: same sequence as in the first squash list above (Make SyncBN a choice through DDP established)

* Fixed destroy_process_group in DP mode
* Update torch_utils.py
* Update utils.py: revert build_targets() to current master
* Update datasets.py
* Fixed world_size attribute not found

Co-authored-by: NanoCode012 <[email protected]>
Co-authored-by: Glenn Jocher <[email protected]>
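One squashed commit above, "Add loss multiplication to preserve the single-process performance", reflects a DDP detail worth spelling out: DDP averages gradients across processes, so the per-process loss is multiplied by the world size to keep effective gradient magnitudes comparable to a single-process run with the same total batch. A minimal sketch of the idea (the helper name is ours, not the repo's):

```python
def scale_loss_for_ddp(loss: float, world_size: int) -> float:
    # DDP all-reduces gradients with a mean over processes; multiplying
    # the loss by world_size cancels that averaging, so the effective
    # gradient matches a single process seeing the same total batch.
    return loss * world_size
```

With world_size == 1 this is a no-op, so single-GPU behavior is unchanged.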
Fixing DDP mode. #177
Work in progress, but most of the hard parts are already done!
There are lots of commits. Once everything is settled, I will squash them into two commits!
🛠️ PR Summary
Made with ❤️ by Ultralytics Actions
🌟 Summary
Enhanced YOLOv5 testing and training capabilities with text output and DDP support.
📊 Key Changes

* Introduced a `--save-txt` flag in `test.py` for saving test results in text format.
* Added DDP (DistributedDataParallel) support in `train.py`.
* Added a `torch_distributed_zero_first` context manager for synchronizing distributed datasets.
* Updated the `create_dataloader` function to support distributed training in `utils/datasets.py`.
* Changes to the `exif_size` function.

🎯 Purpose & Impact

* The `--save-txt` option allows users to output test results as text files, enabling easier analysis of model performance.
* The `torch_distributed_zero_first` context manager ensures smooth loading of datasets without clashes in a distributed training setup.