Overlapping computation and MPI halo communication #615

Closed
ali-ramadhan opened this issue Feb 4, 2020 · 1 comment · Fixed by #3125
Labels
distributed 🕸️ (Our plan for total cluster domination), performance 🏍️ (So we can get the wrong answer even faster)

Comments


ali-ramadhan commented Feb 4, 2020

In PR #590 [WIP] I prototyped an approach to distributed parallelism: a non-invasive Distributed MPI layer on top of Oceananigans that keeps the core code MPI-free.

At last week's CliMA software meeting, @lcw and @jkozdon pointed out a potential limitation of this approach: when running on many nodes, communication starts to eat up a significant fraction of the run time, and it becomes beneficial to overlap computation and communication. Abstractions such as CLIMA.MPIStateArray help a lot with this but require MPI to be "baked in".

Obviously this issue won't be tackled until we have a working distributed model and need more performance, so I'm just documenting it here for future discussion.

I think we can achieve this by splitting a kernel like calculate_interior_source_terms! into two kernels: one that computes source terms "near" the boundary (within 1-2 grid points of any boundary, as needed), so that halo communication can happen while a second, more compute-intensive kernel computes the source terms in the rest of the interior. A rough sketch of this split is below.
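
Roughly, the split might look something like the following. This is only a sketch: calculate_boundary_adjacent_source_terms!, calculate_deep_interior_source_terms!, fill_halo_regions_async!, and wait_for_halo_communication! are hypothetical names standing in for a split version of calculate_interior_source_terms! and an asynchronous variant of fill_halo_regions!, not existing Oceananigans functions.

# Hypothetical sketch: overlap halo communication with the compute-heavy interior kernel.
function calculate_source_terms_with_overlap!(tendencies, model)

    # 1. Cheap kernel: source terms within 1-2 grid points of the subdomain boundary.
    calculate_boundary_adjacent_source_terms!(tendencies, model)

    # 2. Start non-blocking halo communication for the fields that need it,
    #    returning the communication requests instead of blocking.
    requests = fill_halo_regions_async!(model.velocities, model.tracers, model)

    # 3. While the messages are in flight, run the compute-intensive kernel
    #    over the remainder of the interior.
    calculate_deep_interior_source_terms!(tendencies, model)

    # 4. Block until the halo exchange has completed before anything reads the halos.
    wait_for_halo_communication!(requests)

    return nothing
end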

But that only helps with one particular instance of halo communication. There will be other halo communications needed that may be impossible to overlap with compute-intensive kernels. Pursuing overlapping in this manner to the extreme and applying it to as many kernels as possible may be detrimental to code clarity.

Once we want more distributed performance we should go through the algorithm and minimize the number of halo communications (i.e. calls to fill_halo_regions!).

cc @leios @jm-c


glwagner commented Feb 7, 2020

> I think we can achieve this by splitting a kernel like calculate_interior_source_terms! into two kernels: one that computes source terms "near" the boundary (within 1-2 grid points of any boundary, as needed), so that halo communication can happen while a second, more compute-intensive kernel computes the source terms in the rest of the interior.

Don't you want the opposite? You want a kernel that computes source terms in the "deep interior", which requires no knowledge of the halos and can therefore run simultaneously with communication. After communication and the deep-interior calculations are complete, you then perform the calculations on the near-boundary elements.
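
To make the ordering concrete, here is a self-contained 1D sketch of that pattern written directly against MPI.jl. It is not Oceananigans code, and it assumes the positional MPI.jl calls MPI.Isend(buf, dest, tag, comm), MPI.Irecv!(buf, src, tag, comm), and MPI.Waitall! that were current around the time of this issue.

using MPI

MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)
nproc = MPI.Comm_size(comm)

N = 16                          # interior grid points per rank
κ, Δx = 1.0, 1.0
c  = rand(N + 2)                # local field with one halo point on each side
∂c = zeros(N)                   # tendency ∂c/∂t = κ ∂²c/∂x², standing in for the source terms

left  = mod(rank - 1, nproc)    # periodic neighbor ranks
right = mod(rank + 1, nproc)

recv_left, recv_right = zeros(1), zeros(1)

# Post the non-blocking halo exchange: send edge points, receive into halo buffers.
reqs = [MPI.Irecv!(recv_left,  left,  0, comm),
        MPI.Irecv!(recv_right, right, 1, comm),
        MPI.Isend(c[2:2],      left,  1, comm),
        MPI.Isend(c[N+1:N+1],  right, 0, comm)]

# "Deep interior": every point whose stencil avoids the halos, computed while messages are in flight.
for i in 3:N
    ∂c[i-1] = κ * (c[i+1] - 2c[i] + c[i-1]) / Δx^2
end

# Wait for the halos, copy them in, then finish the two near-boundary points.
MPI.Waitall!(reqs)
c[1], c[end] = recv_left[1], recv_right[1]

for i in (2, N + 1)
    ∂c[i-1] = κ * (c[i+1] - 2c[i] + c[i-1]) / Δx^2
end

MPI.Finalize()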

Are communications restricted to fill_halo_regions! and the pressure solve? If so, we can start to prepare for such optimizations by refactoring the time-stepping slightly.

This is the part of our current algorithm that involves interior tendency computation (there are additional halo-filling calls associated with the fractional step):

function calculate_explicit_substep!(tendencies, velocities, tracers, pressures, diffusivities, model)
    time_step_precomputations!(diffusivities, pressures, velocities, tracers, model)
    calculate_tendencies!(tendencies, velocities, tracers, pressures, diffusivities, model)
    return nothing
end

The function calculate_tendencies! calculates interior and boundary contributions to tendencies and does not involve communication.

The function time_step_precomputations! is

function time_step_precomputations!(diffusivities, pressures, velocities, tracers, model)

    fill_halo_regions!(merge(velocities, tracers), model.boundary_conditions.solution, model.architecture,
                       model.grid, boundary_condition_function_arguments(model)...)

    calculate_diffusivities!(diffusivities, model.architecture, model.grid, model.closure, model.buoyancy,
                             velocities, tracers)

    fill_halo_regions!(diffusivities, model.boundary_conditions.diffusivities, model.architecture, model.grid)

    @launch(device(model.architecture), config=launch_config(model.grid, :xy),
            update_hydrostatic_pressure!(pressures.pHY′, model.grid, model.buoyancy, tracers))

    fill_halo_regions!(pressures.pHY′, model.boundary_conditions.pressure, model.architecture, model.grid)

    return nothing
end

To implement the optimizations discussed in this issue, we also need to consider the calculation of hydrostatic pressure and nonlinear diffusivities, so that communication can be intertwined with interior tendency computation. Can this be done abstractly, perhaps via some combination of launch configurations and macro specifications to @loop_xyz? That would let us control the "region" of interior source term computation from the "outside" while keeping our kernels intact.
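
As a rough illustration of what "controlling the region from the outside" could mean, here is a plain-Julia sketch in which a single kernel body is written over explicit index ranges and the caller chooses whether to cover the deep interior or the near-boundary slabs. The range-passing interface and the simple x-direction diffusion stencil are assumptions for illustration; in practice the ranges would presumably be threaded through the launch configuration and @loop_xyz rather than passed as arguments.

# One kernel body, written once; the caller picks the (i, j, k) region it covers.
# G and c include a halo of width w on each side; the "source term" here is just
# an x-direction diffusion stencil to keep the example self-contained.
function compute_source_terms!(G, c, κ, Δx, irange, jrange, krange)
    for k in krange, j in jrange, i in irange
        G[i, j, k] = κ * (c[i+1, j, k] - 2c[i, j, k] + c[i-1, j, k]) / Δx^2
    end
    return nothing
end

Nx, Ny, Nz, w = 16, 16, 16, 1
κ, Δx = 1.0, 1.0
c = rand(Nx + 2w, Ny + 2w, Nz + 2w)    # field padded by halos of width w
G = zeros(size(c))                     # tendency / source term array

interior = (1+w : Nx+w, 1+w : Ny+w, 1+w : Nz+w)

# "Deep interior" launch: shrink the x-range by the stencil width so no halo points
# are read; this launch can overlap with halo communication.
compute_source_terms!(G, c, κ, Δx, 1+2w : Nx, interior[2], interior[3])

# Near-boundary slabs in x, launched only after the halos have been filled.
compute_source_terms!(G, c, κ, Δx, 1+w : 2w,       interior[2], interior[3])
compute_source_terms!(G, c, κ, Δx, Nx+1 : Nx+w,    interior[2], interior[3])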

Notice that the "pre-computation" of nonlinear diffusivities and the isolation of the hydrostatic pressure both add communication steps. We should monitor whether these become significantly suboptimal in the presence of expensive communication. We can easily combine hydrostatic pressure with nonhydrostatic pressure at no loss of performance (probably a small performance increase, in fact). In principle we can also calculate nonlinear diffusivities "in-line", though when we tried this previously we were unable to achieve good performance; "in-line" calculation of diffusivities also makes the application of diffusivity boundary conditions much more difficult (or impossible).
