Multi GPU scaling is very poor #1882
Comments
Perhaps we can name this issue "Multi GPU scaling is very poor" so that we can resolve when the scaling gets better :-D |
@hennyg888 , could you tell us exactly what branch and script you used to produce this result? |
Just for clarification, how is "efficiency" defined? |
The total time for the serial job divided by the product of the number of cores and the time for that run, say efficiency = t_1 / (p * t_p), where t_1 is the serial time, p is the number of cores, and t_p is the time for the run on p cores. |
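For reference, a minimal sketch of the two definitions that come up in this thread (the function names are illustrative, not from the benchmark scripts):

```julia
# Strong scaling: fixed total problem size spread over p devices.
strong_scaling_efficiency(t_serial, t_p, p) = t_serial / (p * t_p)

# Weak scaling (what the benchmark below measures): the problem grows with p,
# so perfect scaling keeps the runtime constant and efficiency = t_1 / t_p.
weak_scaling_efficiency(t_1, t_p) = t_1 / t_p
```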
Ah nice thanks. Makes sense, between 0 and 1. |
The fact that the efficiency goes down to 40% for 2 gpus says that it's actually running slower than on one core. Certainly suboptimal. I'm sure we can do better, and we will. |
Is the problem being parallelized in y? Would it be better to use a problem that is relatively wide in the direction being parallelized? Eg layouts like (256, 512) with (1, 1); (256, 1024) with (1, 2), etc. |
This is actually weak scaling, so the efficiency is just N_1 / N_p (the time on one device divided by the time on p devices), and median times are used, not means.
I used the latest master branch and |
Right, that makes sense. What I said was wrong; 1 would not be an upper bound unless magic happened. efficiency=0.5 means that the problem takes roughly the same amount of time it would take if one continued to use a single core, rather than parallelized. The layout issue I point out above holds --- I think these problems have large "surface area" compared to computation so may not be the best target for parallelization. Unless I'm missing something. Another thing is that I'm not sure these problems are big enough. We can run problems with ~30 million dof (sometimes more). But 4096x256 has just ~1 million dof. Do we know how much GPU utilization we are getting with 1 million dof? |
It would be important to capture the environment used. Could you share your SLURM script and setup? Which modules did you use, etc.? Secondly, we should do some profiling to see where the time goes. (Does Oceananigans have something like that? Either based on CLIMA's TikTok or TimerOutputs.jl) |
I will let @hennyg888 share the SLURM and module information but I can say that we are keen to do some profiling of this, and other runs. I have not heard of Oceananigans having any profiling but would love to hear what people suggest we use. We were considering |
I have tried running the library (ImplicitGlobalGrid.jl). I have created an issue there and hope they might have some suggestions as to how to improve the results. Maybe what I learn there might be transferable to Oceananigans? |
This then sounds to me like you don't have a working CUDA-aware MPI. IGG should show >90% efficiency |
As I said, please post your slurm script and other environment options. It is impossible to debug otherwise. I have an annotated slurm script here https://github.com/CliMA/ClimateMachine.jl/wiki/Satori-Cluster which is what I used a while back for GPU scaling tests. A misconfigured MPI can easily manifest itself as scaling this poor. |
@hennyg888 thanks for posting this. A few thoughts - I assume what @hennyg888 is running is based on this https://github.com/christophernhill/onan-jcon2021-bits/blob/main/run/satori/run-on-bench-on-rhel7-satori-with-mpi ? There are quite a few things to double (triple) check
I was planning to look at this a bit more tomorrow, after having coffee with an Nvidia colleague who is involved in all this. The ImplicitGlobalGrid stuff should get reasonable behavior with the right setup. Lots of details here! |
@francispoulin (see my above comment). I think ImplicitGlobalGrid.jl as downloaded is not configured to run across multiple GPUs. I added a line in a fork here ( https://github.com/christophernhill/ImplicitGlobalGrid.jl/blob/5e4fd0698b7087467d9314bfa253d6bc9a09a40a/diffusion3D_multigpu_CuArrays_novis.jl#L21 ) that is needed. With that I saw reasonable weak scaling - even with broken CUDA-aware MPI support. Oceananigans.jl has some other things going on. I agree profiling with nvprof/nsight would be great. This link https://github.com/mit-satori/getting-started/blob/master/tutorial-examples/nvprof-profiling/Satori_NVProf_Intro.pdf and this https://mit-satori.github.io/tutorial-examples/nvprof-profiling/index.html?highlight=profiling might be helpful to get started. The slides also have links to various Nvidia bits of documentation. |
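The exact line in the fork isn't reproduced here, but pinning one GPU to each MPI rank usually looks something like the sketch below (MPI.jl and CUDA.jl call signatures may differ slightly between versions):

```julia
using MPI, CUDA

MPI.Init()
comm = MPI.COMM_WORLD

# Rank within the node (assumes one process per GPU on each node).
local_comm = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, MPI.Comm_rank(comm))
local_rank = MPI.Comm_rank(local_comm)

# Pin this rank to its own device before any CuArray is allocated;
# without something like this every rank tends to land on GPU 0.
CUDA.device!(local_rank % length(CUDA.devices()))
```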
Thanks @vchuravy . The runs for IGG were on a server that has CUDA-aware MPI, so that's not the problem. As @christophernhill points out, there are a lot of other possibilities though. |
@hennyg888 has been very busy this week so hasn't had a chance to respond. The slurm script that he used was passed down from @christophernhill, and I will let him share that with you, but it might not happen until Monday. But I suppose I should learn to start running stuff on Satori, as that is something that everyone else can use and people understand the configuration. I'll try to do that on Monday. |
Thanks @christophernhill for all this information. This will be most helpful. Unfortunately, tomorrow I am busy from 9am to 5pm so I don't think I can zoom, but maybe on Monday? I'll try and look into these resources beforehand. |
Thank you very much @christophernhill !
I also changed the line in |
Might make sense to figure out how to |
@christophernhill: I wanted to confirm that I took your clever idea of using hide_communication; the link to where the function is defined is copied below. Is this something that is done automatically in Oceananigans? This came up in chatting with the developers of ParallelStencil.jl. https://github.com/omlins/ParallelStencil.jl/blob/main/src/ParallelKernel/hide_communication.jl |
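Not Oceananigans' implementation, just a sketch of the general pattern hide_communication is built around: post a non-blocking halo exchange, update the interior points that don't need halo data while the messages are in flight, then finish the exchange and update the halo-adjacent points. The 1-D diffusion problem and the positional MPI.jl calls (0.19-era signatures) are assumptions for illustration:

```julia
using MPI

# u has one halo point at each end: u[1] and u[end].
function diffusion_step_with_overlap!(u::Vector{Float64}, comm)
    rank, nprocs = MPI.Comm_rank(comm), MPI.Comm_size(comm)
    left, right  = mod(rank - 1, nprocs), mod(rank + 1, nprocs)

    sendL, sendR = [u[2]], [u[end-1]]          # contiguous send buffers
    recvL, recvR = zeros(1), zeros(1)

    # 1. post the halo exchange (non-blocking)
    reqs = [MPI.Isend(sendL, left, 0, comm),  MPI.Isend(sendR, right, 1, comm),
            MPI.Irecv!(recvL, left, 1, comm), MPI.Irecv!(recvR, right, 0, comm)]

    # 2. overlap: update interior points that do not touch the halos
    unew = copy(u)
    for i in 3:length(u)-2
        unew[i] = u[i] + 0.1 * (u[i-1] - 2u[i] + u[i+1])
    end

    # 3. finish the exchange, fill the halos, update the remaining points
    MPI.Waitall!(reqs)
    u[1], u[end] = recvL[1], recvR[1]
    unew[2]     = u[2]     + 0.1 * (u[1]     - 2u[2]     + u[3])
    unew[end-1] = u[end-1] + 0.1 * (u[end-2] - 2u[end-1] + u[end])

    u .= unew
    return u
end
```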
Another thought for @christophernhill. At the talk today on |
We think that we cannot send non-contiguous data over MPI between GPUs (only CPUs). Thus certain |
@francispoulin thanks. I think we probably just want to do some buffering. I looked at LazyArrays.jl and I could imagine how that could maybe also be included, but I suspect the main thing is having a buffer (which https://github.com/eth-cscs/ImplicitGlobalGrid.jl has). I don't see any sign of LazyArrays in the https://github.com/eth-cscs/ImplicitGlobalGrid.jl code! We can check with Ludovic though. |
Interesting. This means that we can't really use CUDA-aware MPI, since that is basically there to allow GPUs to communicate directly. This puts a limit in terms of the efficiency, but I think we can still get something decent up and running. Can you give me any details as to why this is? What would be required to fix this in the long term? |
Thanks for looking at this @christophernhill and sorry that I misquoted. At the JuliaCon talk yesterday, they started off talking about a simple repo and then ended up covering a lot more. If you think that buffering is the way to go then I'm certainly happy to give that a try. Maybe we can have a zoom meeting this week to discuss in more detail? |
There's no limitation, we just have to send contiguous data over MPI rather than non-contiguous data. We can do this by creating contiguous "buffer" arrays. The algorithm is: 1. copy halo data into a buffer; 2. send the buffer; 3. copy the buffer into the halo regions at the receiving end. |
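A minimal sketch of that three-step exchange for one boundary of a 2-D field decomposed in x (in Julia's column-major layout a row slice like u[end-1, :] is non-contiguous, hence the staging buffers). The function name is made up and the positional Sendrecv! call is the MPI.jl 0.19-era signature:

```julia
using MPI

function exchange_east_halo!(u::Matrix, east_rank, comm)
    ny = size(u, 2)
    sendbuf = Vector{eltype(u)}(undef, ny)
    recvbuf = Vector{eltype(u)}(undef, ny)

    sendbuf .= @view u[end-1, :]            # 1. copy boundary data into a contiguous buffer
    MPI.Sendrecv!(sendbuf, east_rank, 0,    # 2. send the buffer (and receive the neighbour's)
                  recvbuf, east_rank, 0, comm)
    u[end, :] .= recvbuf                    # 3. copy the received buffer into the halo
    return u
end
```

With CUDA-aware MPI the buffers could be CuArrays so nothing has to pass through the host; without it they would be host arrays and steps 1 and 3 include device-host copies.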
Ah, that makes a lot of sense and sounds very doable. I am happy to help with this where I can but don't know the MPI stuff nearly as well as @christophernhill . |
All the MPI stuff is in the src/Distributed module: https://github.com/CliMA/Oceananigans.jl/tree/master/src/Distributed |
Vastly increased multi-GPU efficiency by designating 1 GPU per process with
https://github.com/christophernhill/onan-jcon2021-bits/blob/main/run/satori/run-on-bench-on-rhel7-satori-with-mpi

System info: |
@hennyg888 good to see that helped. I think there is a CUDA.versioninfo() ( https://github.com/JuliaGPU/CUDA.jl/blob/4985b0d5827f776683edb702ff296dcb59ba1097/src/utilities.jl#L42 ) function that would be useful to log alongside |
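Something like the following at the top of the benchmark script would capture that; CUDA.versioninfo() prints the toolkit, driver, and visible devices, and the rank guard just keeps only one process printing:

```julia
using MPI, CUDA, InteractiveUtils

MPI.Init()
if MPI.Comm_rank(MPI.COMM_WORLD) == 0
    versioninfo()        # Julia version, OS, CPU, thread count
    CUDA.versioninfo()   # CUDA toolkit, driver, and visible GPUs
end
```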
That is a huge leap forward @hennyg888 and great to see! Before we were at 50% and now we are at 75%, an increase of 50%, which is pretty huge all things considered. I like @christophernhill's suggestion of adding the version info. Yesterday when we talked, the consensus was that one major problem was how we do buffering. As a silly experiment, what if we redo this without updating any halos, ever? Physically it's going to be wrong, but do we get another huge increase in the efficiency? If the efficiency gets close to 100% then in my mind that validates the hypothesis. If not, then that would signify there is another bottleneck that we need to hunt down. |
Apologies for not participating in this issue and for possibly being the cause of it via sending/receiving views... If we have to send contiguous data we could just modify the sending code to copy into a buffer first. Receiving is done straight into the halo view (a trick(?) that seems to work nicely on the CPU), so we would probably need to create a new buffer of the right size to receive into and then copy it into the halo: https://github.com/CliMA/Oceananigans.jl/blob/master/src/Distributed/halo_communication.jl#L162-L166 Also not sure if relevant, but I remember @hennyg888 and @francispoulin suggesting that placing an |
@ali-ramadhan I'm planning to pursue an abstraction wherein contiguous buffers are preallocated. It'd be great to discuss this! |
That would definitely be nice. Are you thinking of putting them inside the |
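Purely as an illustration of one shape such an abstraction could take (the names, fields, and constructor are guesses, not the eventual design):

```julia
# Contiguous send/receive buffers for one field's west/east halo exchange,
# allocated once when the distributed model is built rather than on every
# time step. The same idea extends to the other directions.
struct HaloBuffers{B}
    west_send :: B
    west_recv :: B
    east_send :: B
    east_recv :: B
end

# ArrayType would be Array for CPU ranks and CuArray for GPU ranks, so the
# buffers live wherever the field data lives.
function HaloBuffers(ArrayType, FT, Hx, Ny, Nz)
    buf() = ArrayType(zeros(FT, Hx, Ny, Nz))
    return HaloBuffers(buf(), buf(), buf(), buf())
end
```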
As suggested by @francispoulin, the following was commented out to remove filling of halo regions and buffering between ranks: https://github.com/CliMA/Oceananigans.jl/blob/master/src/Models/ShallowWaterModels/update_shallow_water_state.jl#L19-L22
system environment and CUDA.versioninfo():
|
Thanks @hennyg888 for confirming this. The result is as it should be, and I think you have confirmed that when we get the buffering working for MPI, that should drastically improve the scaling on multi GPUs. |
I guess I'd say that we have confirmed it's the MPI communication / halo filling that causes a drop in efficiency. Next we have to figure out if we can design a communication system that's efficient! Contiguous buffers are promising but not guaranteed, I think. |
@kpamnany might be interested in this issue. |
Since we use buffered communication, this is solved. |
I recently ran the weak scaling shallow water model benchmark with the MultiGPU architecture on Satori, thanks to @christophernhill.
Here are the results:
The results are not good but at least we can benchmark multi-GPU performance now.