Multi GPU scaling is very poor #1882
Comments
Perhaps we can name this issue "Multi GPU scaling is very poor" so that we can resolve when the scaling gets better :-D |
@hennyg888 , could you tell us exactly what branch and script you used to produce this result? |
Just for clarification, how is "efficiency" defined? |
The total time for the serial job divided by the product of the number of cores and the time for that run, say efficiency = t_1 / (p * t_p), where t_1 is the serial time, p is the number of cores, and t_p is the time for the run on p cores. |
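For reference, a minimal sketch of the two definitions that come up in this thread (the function names are illustrative, not from the benchmark scripts):

```julia
# Strong scaling: fixed total problem size spread over p devices.
strong_scaling_efficiency(t_serial, t_p, p) = t_serial / (p * t_p)

# Weak scaling (what the benchmark below measures): the problem grows with p,
# so perfect scaling keeps the runtime constant and efficiency = t_1 / t_p.
weak_scaling_efficiency(t_1, t_p) = t_1 / t_p
```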
Ah nice thanks. Makes sense, between 0 and 1. |
The fact that the efficiency goes down to 40% for 2 gpus says that it's actually running slower than on one core. Certainly suboptimal. I'm sure we can do better, and we will. |
Is the problem being parallelized in y? Would it be better to use a problem that is relatively wide in the direction being parallelized? Eg layouts like (256, 512) with (1, 1); (256, 1024) with (1, 2), etc. |
This is actually weak scaling, so the efficiency is just N_1 / N_p (the time on one device divided by the time on p devices), and median times are used, not means.
I used the latest master branch and |
Right, that makes sense. What I said was wrong; 1 would not be an upper bound unless magic happened. efficiency=0.5 means that the problem takes roughly the same amount of time it would take if one continued to use a single core, rather than parallelized. The layout issue I point out above holds --- I think these problems have large "surface area" compared to computation so may not be the best target for parallelization. Unless I'm missing something. Another thing is that I'm not sure these problems are big enough. We can run problems with ~30 million dof (sometimes more). But 4096x256 has just ~1 million dof. Do we know how much GPU utilization we are getting with 1 million dof? |
It would be important to capture the environment used. Could you share your SLURM script and setup? Which modules did you use, etc.? Secondly, we should do some profiling to see where the time goes. (Does Oceananigans have something like that? Either based on CLIMA's TikTok or TimerOutputs.jl) |
I will let @hennyg888 share the SLURM and module information but I can say that we are keen to do some profiling of this, and other runs. I have not heard of Oceananigans having any profiling but would love to hear what people suggest we use. We were considering |
I have tried running the library (ImplicitGlobalGrid.jl). I have created an issue there and hope they might have some suggestions as to how to improve the results. Maybe what I learn there might be transferable to Oceananigans? |
This then sounds to me like you don't have a working CUDA-aware MPI. IGG should show >90% efficiency |
As I said, please post your slurm script and other environment options. It is impossible to debug otherwise. I have an annotated slurm script here https://github.com/CliMA/ClimateMachine.jl/wiki/Satori-Cluster which is what I used a while back for GPU scaling tests. A misconfigured MPI can easily manifest itself as scaling this poor. |
@hennyg888 thanks for posting this. A few thoughts - I assume what @hennyg888 is running is based on this https://github.com/christophernhill/onan-jcon2021-bits/blob/main/run/satori/run-on-bench-on-rhel7-satori-with-mpi ? There are quite a few things to double (triple) check
I was planning to look at this a bit more tomorrow, after having coffee with an Nvidia colleague who is involved in all this. The ImplicitGlobalGrid stuff should get reasonable behavior with the right setup. Lots of details here! |
@francispoulin (see my above comment). I think ImplicitGlobalGrid.jl as downloaded is not configured to run across multiple GPUs. I added a line in a fork here ( https://github.com/christophernhill/ImplicitGlobalGrid.jl/blob/5e4fd0698b7087467d9314bfa253d6bc9a09a40a/diffusion3D_multigpu_CuArrays_novis.jl#L21 ) that is needed. With that I saw reasonable weak scaling - even with broken CUDA-aware MPI support. Oceananigans.jl has some other things going on. I agree profiling with nvprof/nsight would be great. This link https://github.com/mit-satori/getting-started/blob/master/tutorial-examples/nvprof-profiling/Satori_NVProf_Intro.pdf and this https://mit-satori.github.io/tutorial-examples/nvprof-profiling/index.html?highlight=profiling might be helpful to get started. The slides also have links to various Nvidia bits of documentation. |
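The exact line in the fork isn't reproduced here, but pinning one GPU to each MPI rank usually looks something like the sketch below (MPI.jl and CUDA.jl call signatures may differ slightly between versions):

```julia
using MPI, CUDA

MPI.Init()
comm = MPI.COMM_WORLD

# Rank within the node (assumes one process per GPU on each node).
local_comm = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, MPI.Comm_rank(comm))
local_rank = MPI.Comm_rank(local_comm)

# Pin this rank to its own device before any CuArray is allocated;
# without something like this every rank tends to land on GPU 0.
CUDA.device!(local_rank % length(CUDA.devices()))
```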
Thanks @vchuravy . The runs for IGG were on a server that has CUDA-aware MPI, so that's not the problem. As @christophernhill points out, there are a lot of other possibilities though. |
@hennyg888 has been very busy this week so hasn't had a chance to respond. The slurm script that he used was passed down from @christophernhill, and I will let him share that with you, but it might not happen until Monday. But I suppose I should learn to start running stuff on Satori, as that is something that everyone else can use and people understand the configuration. I'll try to do that on Monday. |
Thanks @christophernhill for all this information. This will be most helpful. Unfortunately, tomorrow I am busy from 9am to 5pm so I don't think I can zoom, but maybe on Monday? I'll try and look into these resources beforehand. |
Thank you very much @christophernhill !
I also changed the line in |
Might make sense to figure out how to |
@christophernhill: I wanted to confirm that I took your clever idea of using hide_communication; the link to where the function is defined is copied below. Is this something that is done automatically in Oceananigans? This came up in chatting with the developers of ParallelStencil.jl. https://github.com/omlins/ParallelStencil.jl/blob/main/src/ParallelKernel/hide_communication.jl |
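Not Oceananigans' implementation, just a sketch of the general pattern hide_communication is built around: post a non-blocking halo exchange, update the interior points that don't need halo data while the messages are in flight, then finish the exchange and update the halo-adjacent points. The 1-D diffusion problem and the positional MPI.jl calls (0.19-era signatures) are assumptions for illustration:

```julia
using MPI

# u has one halo point at each end: u[1] and u[end].
function diffusion_step_with_overlap!(u::Vector{Float64}, comm)
    rank, nprocs = MPI.Comm_rank(comm), MPI.Comm_size(comm)
    left, right  = mod(rank - 1, nprocs), mod(rank + 1, nprocs)

    sendL, sendR = [u[2]], [u[end-1]]          # contiguous send buffers
    recvL, recvR = zeros(1), zeros(1)

    # 1. post the halo exchange (non-blocking)
    reqs = [MPI.Isend(sendL, left, 0, comm),  MPI.Isend(sendR, right, 1, comm),
            MPI.Irecv!(recvL, left, 1, comm), MPI.Irecv!(recvR, right, 0, comm)]

    # 2. overlap: update interior points that do not touch the halos
    unew = copy(u)
    for i in 3:length(u)-2
        unew[i] = u[i] + 0.1 * (u[i-1] - 2u[i] + u[i+1])
    end

    # 3. finish the exchange, fill the halos, update the remaining points
    MPI.Waitall!(reqs)
    u[1], u[end] = recvL[1], recvR[1]
    unew[2]     = u[2]     + 0.1 * (u[1]     - 2u[2]     + u[3])
    unew[end-1] = u[end-1] + 0.1 * (u[end-2] - 2u[end-1] + u[end])

    u .= unew
    return u
end
```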
Another thought for @christophernhill. At the talk today on |
We think that we cannot send non-contiguous data over MPI between GPUs (only CPUs). Thus certain |
@francispoulin thanks. I think we probably just want to do some buffering. I looked at LazyArrays.jl and I could imagine how that could maybe also be included, but I suspect the main thing is having a buffer (which https://github.com/eth-cscs/ImplicitGlobalGrid.jl has). I don't see any sign of LazyArrays in the https://github.com/eth-cscs/ImplicitGlobalGrid.jl code! We can check with Ludovic though. |
Interesting. This means that we can't really use CUDA-aware MPI, since that is basically there to allow GPUs to communicate directly. This puts a limit in terms of the efficiency, but I think we can still get something decent up and running. Can you give me any details as to why this is? What would be required to fix this in the long term? |
Thanks for looking at this @christophernhill and sorry that I misquoted. At the JuliaCon talk yesterday, they started off talking about a simple repo and then ended up covering a lot more. If you think that buffering is the way to go then I'm certainly happy to give that a try. Maybe we can have a zoom meeting this week to discuss in more detail? |
There's no limitation, we just have to send contiguous data over MPI rather than non-contiguous data. We can do this by creating contiguous "buffer" arrays. The algorithm is: 1. copy halo data into a buffer; 2. send the buffer; 3. copy the buffer into the halo regions at the receiving end. |
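A minimal sketch of that three-step exchange for one boundary of a 2-D field decomposed in x (in Julia's column-major layout a row slice like u[end-1, :] is non-contiguous, hence the staging buffers). The function name is made up and the positional Sendrecv! call is the MPI.jl 0.19-era signature:

```julia
using MPI

function exchange_east_halo!(u::Matrix, east_rank, comm)
    ny = size(u, 2)
    sendbuf = Vector{eltype(u)}(undef, ny)
    recvbuf = Vector{eltype(u)}(undef, ny)

    sendbuf .= @view u[end-1, :]            # 1. copy boundary data into a contiguous buffer
    MPI.Sendrecv!(sendbuf, east_rank, 0,    # 2. send the buffer (and receive the neighbour's)
                  recvbuf, east_rank, 0, comm)
    u[end, :] .= recvbuf                    # 3. copy the received buffer into the halo
    return u
end
```

With CUDA-aware MPI the buffers could be CuArrays so nothing has to pass through the host; without it they would be host arrays and steps 1 and 3 include device-host copies.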
Ah, that makes a lot of sense and sounds very doable. I am happy to help with this where I can but don't know the MPI stuff nearly as well as @christophernhill . |
All the MPI stuff is in the src/Distributed module: https://github.com/CliMA/Oceananigans.jl/tree/master/src/Distributed |
Vastly increased multi-GPU efficiency by designating 1 GPU per process with
https://github.com/christophernhill/onan-jcon2021-bits/blob/main/run/satori/run-on-bench-on-rhel7-satori-with-mpi

System info: |
@hennyg888 good to see that helped. I think there is a CUDA.versioninfo() ( https://github.com/JuliaGPU/CUDA.jl/blob/4985b0d5827f776683edb702ff296dcb59ba1097/src/utilities.jl#L42 ) function that would be useful to log alongside |
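Something like the following at the top of the benchmark script would capture that; CUDA.versioninfo() prints the toolkit, driver, and visible devices, and the rank guard just keeps only one process printing:

```julia
using MPI, CUDA, InteractiveUtils

MPI.Init()
if MPI.Comm_rank(MPI.COMM_WORLD) == 0
    versioninfo()        # Julia version, OS, CPU, thread count
    CUDA.versioninfo()   # CUDA toolkit, driver, and visible GPUs
end
```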
That is a huge leap forward @hennyg888 and great to see! Before we were at 50% and now we are at 75%, an increase of 50%, which is pretty huge all things considered. I like @christophernhill's suggestion of adding the version info. Yesterday when we talked, the consensus was that one major problem was how we do buffering. As a silly experiment, what if we redo this without updating any halos, ever? Physically it's going to be wrong, but do we get another huge increase in the efficiency? If the efficiency gets close to 100% then in my mind that validates the hypothesis. If not, then that would signify there is another bottleneck that we need to hunt down. |
Apologies for not participating in this issue and for possibly being the cause of it via sending/receiving views... If we have to send contiguous data we could just modify the sending code to copy into a buffer first. Receiving is done straight into the halo view (a trick(?) that seems to work nicely on the CPU), so we would probably need to create a new buffer of the right size to receive into and then copy it into the halo: https://github.com/CliMA/Oceananigans.jl/blob/master/src/Distributed/halo_communication.jl#L162-L166 Also not sure if relevant, but I remember @hennyg888 and @francispoulin suggesting that placing an |
@ali-ramadhan I'm planning to pursue an abstraction wherein contiguous buffers are preallocated. It'd be great to discuss this! |
That would definitely be nice. Are you thinking of putting them inside the |
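Purely as an illustration of one shape such an abstraction could take (the names, fields, and constructor are guesses, not the eventual design):

```julia
# Contiguous send/receive buffers for one field's west/east halo exchange,
# allocated once when the distributed model is built rather than on every
# time step. The same idea extends to the other directions.
struct HaloBuffers{B}
    west_send :: B
    west_recv :: B
    east_send :: B
    east_recv :: B
end

# ArrayType would be Array for CPU ranks and CuArray for GPU ranks, so the
# buffers live wherever the field data lives.
function HaloBuffers(ArrayType, FT, Hx, Ny, Nz)
    buf() = ArrayType(zeros(FT, Hx, Ny, Nz))
    return HaloBuffers(buf(), buf(), buf(), buf())
end
```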
As suggested by @francispoulin, the following was commented out to remove filling of halo regions and buffering between ranks: https://github.com/CliMA/Oceananigans.jl/blob/master/src/Models/ShallowWaterModels/update_shallow_water_state.jl#L19-L22
system environment and CUDA.versioninfo():
|
Thanks @hennyg888 for confirming this. The result is as it should be, and I think you have confirmed that when we get the buffering working for MPI, that should drastically improve the scaling on multi GPUs. |
I guess I'd say that we have confirmed it's the MPI communication / halo filling that causes a drop in efficiency. Next we have to figure out if we can design a communication system that's efficient! Contiguous buffers are promising but not guaranteed, I think. |
@kpamnany might be interested in this issue. |
Since we use buffered communication, this is solved. |
I recently ran the weak scaling shallow water model benchmark with the MultiGPU architecture on Satori, thanks to @christophernhill.
Here are the results:
The results are not good but at least we can benchmark multi-GPU performance now.