One-sided upwind reconstruction #3658

simone-silvestri · 2024-07-22T17:50:41Z

Yet another optimization for upwind stencil computations.
The pattern in Oceananigans to perform upwind reconstruction is roughly:

R_left  = _left_reconstruction(.....)
R_right = _right_reconstruction(.....)

return ifelse(u > 0, u * R_left, u * R_right)

This means the reconstruction is always performed twice. That is not a big problem for linear reconstruction schemes (UpwindBiased), but it leads to register blowup for WENO schemes, which are far more expensive to compute.
This PR pushes the left/right choice inside the reconstruction function, using the fact that the only difference between the left and right reconstructions is how the data is organized in the stencil.
This way only one reconstruction is required, which significantly reduces register pressure and, consequently, computation time.
This follows the same pattern found in SpeedyWeather.jl.
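To make the idea concrete, here is a minimal sketch contrasting the two patterns for a third-order upwind-biased scheme on a plain 1D array. Every name in it (`COEFFS`, `reconstruct`, `left_stencil`, `right_stencil`, `flux_two_sided`, `flux_one_sided`) is hypothetical and chosen for illustration; it is not the Oceananigans implementation, and it assumes that face `i` sits between cells `i-1` and `i`.

```julia
# Illustrative sketch only: all names are hypothetical, not the Oceananigans API.
# Assumption: face i lies between cell i-1 and cell i.

# Third-order upwind-biased weights, ordered from the farthest upwind cell
# to the single downwind cell.
const COEFFS = (-1/6, 5/6, 2/6)

# One formula serves both biases; only the ordering of the stencil differs.
@inline reconstruct(s) = COEFFS[1] * s[1] + COEFFS[2] * s[2] + COEFFS[3] * s[3]

# Left-biased stencil (flow in the positive direction).
@inline left_stencil(c, i) = (c[i-2], c[i-1], c[i])

# Right-biased stencil: the mirror image of the left one about face i.
@inline right_stencil(c, i) = (c[i+1], c[i], c[i-1])

# Two-sided pattern: both reconstructions are computed, then one is discarded.
function flux_two_sided(u, c, i)
    R_left  = reconstruct(left_stencil(c, i))
    R_right = reconstruct(right_stencil(c, i))
    return ifelse(u > 0, u * R_left, u * R_right)
end

# One-sided pattern: choose the stencil first, then reconstruct once.
function flux_one_sided(u, c, i)
    s = ifelse(u > 0, left_stencil(c, i), right_stencil(c, i))
    return u * reconstruct(s)
end
```

In this sketch both stencils are still loaded, but the reconstruction arithmetic runs only once. For a linear scheme that saves little; for WENO, where the reconstruction includes smoothness indicators and nonlinear weights, that arithmetic is the expensive part.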

Some benchmarks are implemented in the NESAPOceananigans.jl repository.

Here are some timing tests on main with a non-immersed grid (launching Julia with julia --project="environments/main" --check-bounds=no):

julia> using NESAPOceananigans
julia> set_problem_size!(500, 500, 50)

julia> trial1 = run_model_benchmark!(momentum_kernel_test, GPU();
                                      use_benchmarktools = true)
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  21.916 ms …  22.784 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     22.036 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.144 ms ± 363.318 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██      ██                                                 █
  ██▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  21.9 ms         Histogram: frequency by time         22.8 ms <

 Memory estimate: 245.86 KiB, allocs estimate: 407.

julia> trial1 = run_model_benchmark!(tracer_kernel_test, GPU();
                                     use_benchmarktools = true)
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  14.189 ms … 14.421 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     14.261 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.269 ms ± 93.553 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██                █   █                                   █
  ██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  14.2 ms         Histogram: frequency by time        14.4 ms <

 Memory estimate: 47.78 KiB, allocs estimate: 320.

The counterpart on the new branch (launching Julia with julia --project="environments/one_sided_branch" --check-bounds=no):

julia> using NESAPOceananigans
julia> set_problem_size!(500, 500, 50)

julia> trial1 = run_model_benchmark!(momentum_kernel_test, GPU();
                                     use_benchmarktools = true)
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  16.463 ms …  18.503 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.466 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.878 ms ± 908.449 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
  16.5 ms         Histogram: frequency by time         18.5 ms <

 Memory estimate: 250.06 KiB, allocs estimate: 676.

julia> trial1 = run_model_benchmark!(tracer_kernel_test, arch;
                                       use_benchmarktools = true)
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  6.695 ms …   7.461 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.789 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.908 ms ± 312.944 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █     ██ █                                                █
  █▁▁▁▁▁██▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  6.69 ms         Histogram: frequency by time        7.46 ms <

 Memory estimate: 46.39 KiB, allocs estimate: 231.
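
As a rough summary, the median timings reported above can be turned into speedup ratios. The snippet below only restates those medians; the variable names are illustrative, and with 5 samples per trial the ratios should be read as approximate.

```julia
# Median timings in milliseconds, copied from the benchmark output above.
main_ms      = (momentum = 22.036, tracer = 14.261)
one_sided_ms = (momentum = 16.466, tracer = 6.789)

# Speedup of the one-sided branch relative to main (values > 1 mean faster).
speedup = (momentum = main_ms.momentum / one_sided_ms.momentum,
           tracer   = main_ms.tracer / one_sided_ms.tracer)

println("momentum kernel speedup: ", round(speedup.momentum, digits = 2))
println("tracer kernel speedup:   ", round(speedup.tracer, digits = 2))
```

This works out to roughly 1.3x for the momentum kernel and 2.1x for the tracer kernel.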

P.S. Some vestigial, unused code is removed as part of this PR because it is not beneficial in terms of either accuracy or performance, namely:

The JS WENO formulation (dominated by the Z-WENO formulation)
Velocity upwinding for the vector-invariant WENO formulation (dominated by the other two formulations)

glwagner · 2024-07-24T00:37:46Z