Overlapping computation and MPI halo communication #615

Closed
ali-ramadhan opened this issue Feb 4, 2020 · 1 comment · Fixed by #3125
Labels
distributed 🕸️ (Our plan for total cluster domination), performance 🏍️ (So we can get the wrong answer even faster)

Comments


ali-ramadhan commented Feb 4, 2020

In PR #590 [WIP] I prototyped an approach to distributed parallelism: a non-invasive Distributed MPI layer on top of Oceananigans that keeps the core code MPI-free.

At last week's CliMA software meeting, @lcw and @jkozdon pointed out a potential limitation of this approach: when running on many nodes, communication starts to eat up a significant fraction of the run time, and it becomes beneficial to overlap computation and communication. Abstractions such as CLIMA.MPIStateArray help a lot with this but require MPI to be "baked in".

Obviously this issue won't be tackled until we have a working distributed model and need more performance, so I'm just documenting it here for future discussion.

I think we can achieve this by splitting a kernel like calculate_interior_source_terms! into two kernels: one that computes source terms "near" the boundary (within 1-2 grid points of any boundary, as needed), so that halo communication can happen while a second, more compute-intensive kernel computes the source terms in the rest of the interior. A rough sketch of this split is below.
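
Roughly, the split might look something like the following. This is only a sketch: calculate_boundary_adjacent_source_terms!, calculate_deep_interior_source_terms!, fill_halo_regions_async!, and wait_for_halo_communication! are hypothetical names standing in for a split version of calculate_interior_source_terms! and an asynchronous variant of fill_halo_regions!, not existing Oceananigans functions.

# Hypothetical sketch: overlap halo communication with the compute-heavy interior kernel.
function calculate_source_terms_with_overlap!(tendencies, model)

    # 1. Cheap kernel: source terms within 1-2 grid points of the subdomain boundary.
    calculate_boundary_adjacent_source_terms!(tendencies, model)

    # 2. Start non-blocking halo communication for the fields that need it,
    #    returning the communication requests instead of blocking.
    requests = fill_halo_regions_async!(model.velocities, model.tracers, model)

    # 3. While the messages are in flight, run the compute-intensive kernel
    #    over the remainder of the interior.
    calculate_deep_interior_source_terms!(tendencies, model)

    # 4. Block until the halo exchange has completed before anything reads the halos.
    wait_for_halo_communication!(requests)

    return nothing
end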

But that only helps with one particular instance of halo communication. There will be other halo communications needed that may be impossible to overlap with compute-intensive kernels. Pursuing overlapping in this manner to the extreme and applying it to as many kernels as possible may be detrimental to code clarity.

Once we want more distributed performance we should go through the algorithm and minimize the number of halo communications (i.e. calls to fill_halo_regions!).

cc @leios @jm-c


glwagner commented Feb 7, 2020

> I think we can achieve this by splitting a kernel like calculate_interior_source_terms! into two kernels: one that computes source terms "near" the boundary (within 1-2 grid points of any boundary, as needed), so that halo communication can happen while a second, more compute-intensive kernel computes the source terms in the rest of the interior.

Don't you want the opposite? You want a kernel that computes source terms in the "deep interior", which requires no knowledge of the halos and can therefore run simultaneously with communication. After communication and the deep-interior calculations are complete, you then perform the calculations on the near-boundary elements.
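
To make the ordering concrete, here is a self-contained 1D sketch of that pattern written directly against MPI.jl. It is not Oceananigans code, and it assumes the positional MPI.jl calls MPI.Isend(buf, dest, tag, comm), MPI.Irecv!(buf, src, tag, comm), and MPI.Waitall! that were current around the time of this issue.

using MPI

MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)
nproc = MPI.Comm_size(comm)

N = 16                          # interior grid points per rank
κ, Δx = 1.0, 1.0
c  = rand(N + 2)                # local field with one halo point on each side
∂c = zeros(N)                   # tendency ∂c/∂t = κ ∂²c/∂x², standing in for the source terms

left  = mod(rank - 1, nproc)    # periodic neighbor ranks
right = mod(rank + 1, nproc)

recv_left, recv_right = zeros(1), zeros(1)

# Post the non-blocking halo exchange: send edge points, receive into halo buffers.
reqs = [MPI.Irecv!(recv_left,  left,  0, comm),
        MPI.Irecv!(recv_right, right, 1, comm),
        MPI.Isend(c[2:2],      left,  1, comm),
        MPI.Isend(c[N+1:N+1],  right, 0, comm)]

# "Deep interior": every point whose stencil avoids the halos, computed while messages are in flight.
for i in 3:N
    ∂c[i-1] = κ * (c[i+1] - 2c[i] + c[i-1]) / Δx^2
end

# Wait for the halos, copy them in, then finish the two near-boundary points.
MPI.Waitall!(reqs)
c[1], c[end] = recv_left[1], recv_right[1]

for i in (2, N + 1)
    ∂c[i-1] = κ * (c[i+1] - 2c[i] + c[i-1]) / Δx^2
end

MPI.Finalize()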

Are communications restricted to fill_halo_regions! and the pressure solve? If so, we can start to prepare for such optimizations by refactoring the time-stepping slightly.

This is the part of our current algorithm that involves interior tendency computation (there are additional halo-filling calls associated with the fractional step):

function calculate_explicit_substep!(tendencies, velocities, tracers, pressures, diffusivities, model)
    time_step_precomputations!(diffusivities, pressures, velocities, tracers, model)
    calculate_tendencies!(tendencies, velocities, tracers, pressures, diffusivities, model)
    return nothing
end

The function calculate_tendencies! calculates interior and boundary contributions to tendencies and does not involve communication.

The function time_step_precomputations! is

function time_step_precomputations!(diffusivities, pressures, velocities, tracers, model)

    fill_halo_regions!(merge(velocities, tracers), model.boundary_conditions.solution, model.architecture,
                       model.grid, boundary_condition_function_arguments(model)...)

    calculate_diffusivities!(diffusivities, model.architecture, model.grid, model.closure, model.buoyancy,
                             velocities, tracers)

    fill_halo_regions!(diffusivities, model.boundary_conditions.diffusivities, model.architecture, model.grid)

    @launch(device(model.architecture), config=launch_config(model.grid, :xy),
            update_hydrostatic_pressure!(pressures.pHY′, model.grid, model.buoyancy, tracers))

    fill_halo_regions!(pressures.pHY′, model.boundary_conditions.pressure, model.architecture, model.grid)

    return nothing
end

To implement the optimizations discussed in this issue, we also need to consider the calculation of hydrostatic pressure and nonlinear diffusivities, so that communication can be intertwined with interior tendency computation. Can this be done abstractly, perhaps via some combination of launch configurations and macro specifications to @loop_xyz? That would let us control the "region" of interior source term computation from the "outside" while keeping our kernels intact.
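
As a rough illustration of what "controlling the region from the outside" could mean, here is a plain-Julia sketch in which a single kernel body is written over explicit index ranges and the caller chooses whether to cover the deep interior or the near-boundary slabs. The range-passing interface and the simple x-direction diffusion stencil are assumptions for illustration; in practice the ranges would presumably be threaded through the launch configuration and @loop_xyz rather than passed as arguments.

# One kernel body, written once; the caller picks the (i, j, k) region it covers.
# G and c include a halo of width w on each side; the "source term" here is just
# an x-direction diffusion stencil to keep the example self-contained.
function compute_source_terms!(G, c, κ, Δx, irange, jrange, krange)
    for k in krange, j in jrange, i in irange
        G[i, j, k] = κ * (c[i+1, j, k] - 2c[i, j, k] + c[i-1, j, k]) / Δx^2
    end
    return nothing
end

Nx, Ny, Nz, w = 16, 16, 16, 1
κ, Δx = 1.0, 1.0
c = rand(Nx + 2w, Ny + 2w, Nz + 2w)    # field padded by halos of width w
G = zeros(size(c))                     # tendency / source term array

interior = (1+w : Nx+w, 1+w : Ny+w, 1+w : Nz+w)

# "Deep interior" launch: shrink the x-range by the stencil width so no halo points
# are read; this launch can overlap with halo communication.
compute_source_terms!(G, c, κ, Δx, 1+2w : Nx, interior[2], interior[3])

# Near-boundary slabs in x, launched only after the halos have been filled.
compute_source_terms!(G, c, κ, Δx, 1+w : 2w,       interior[2], interior[3])
compute_source_terms!(G, c, κ, Δx, Nx+1 : Nx+w,    interior[2], interior[3])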

Notice that the "pre-computation" of nonlinear diffusivities and the isolation of the hydrostatic pressure both add communication steps. We should monitor whether these become significantly suboptimal in the presence of expensive communication. We can easily combine hydrostatic pressure with nonhydrostatic pressure at no loss of performance (probably a small performance increase, in fact). In principle we can also calculate nonlinear diffusivities "in-line", though when we tried this previously we were unable to achieve good performance; "in-line" calculation of diffusivities also makes the application of diffusivity boundary conditions much more difficult (or impossible).
