Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

One-sided upwind reconstruction #3658

Merged
merged 70 commits into from
Jul 25, 2024
Merged

One-sided upwind reconstruction #3658

merged 70 commits into from
Jul 25, 2024

Conversation

simone-silvestri
Copy link
Collaborator

@simone-silvestri simone-silvestri commented Jul 22, 2024

Yet another optimization for upwind stencil computations.
The pattern in Oceananigans to perform upwind reconstruction is roughly:

R_left    = _left_reconstruction(.....)
R_right = _right_reconstruction(.....)

return ifelse(u > 0, u * R_left, u * R_right)

This means that we are always performing the reconstruction twice. This is not a huge problem for linear reconstruction schemes (UpwindBiased) but leads to register blowup for WENO schemes that are extremely heavy to compute.
This PR aims to push the left - right choice inside the reconstruction function by realizing that the only difference between left and right reconstruction is how the data is organized in the stencil.
In this way, only one reconstruction is required significantly reducing register pressure, and consequently, computation time.
This follows the same pattern found in SpeedyWeather.jl

Some benchmarks are implemented in the NESAPOceananigans.jl repository.

Here are some timing tests on main with a NON-Immersed grid (launching julia with julia --project="environments/main" --check-bounds=no)

julia> using NESAPOceananigans
julia> set_problem_size!(500, 500, 50)

julia> trial1 = run_model_benchmark!(momentum_kernel_test, GPU();
                                      use_benchmarktools = true)
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min  max):  21.916 ms   22.784 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     22.036 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.144 ms ± 363.318 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██      ██                                                 █
  ██▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  21.9 ms         Histogram: frequency by time         22.8 ms <

 Memory estimate: 245.86 KiB, allocs estimate: 407.

julia> trial1 = run_model_benchmark!(tracer_kernel_test, GPU();
                                     use_benchmarktools = true)
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min  max):  14.189 ms  14.421 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     14.261 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.269 ms ± 93.553 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██                █   █                                   █
  ██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  14.2 ms         Histogram: frequency by time        14.4 ms <

 Memory estimate: 47.78 KiB, allocs estimate: 320.

The counterpart using the new branch julia --project="environments/one_sided_branch" --check-bounds=no

julia> using NESAPOceananigans
julia> set_problem_size!(500, 500, 50)

julia> trial1 = run_model_benchmark!(momentum_kernel_test, GPU();
                                     use_benchmarktools = true
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min  max):  16.463 ms   18.503 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     16.466 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.878 ms ± 908.449 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
  16.5 ms         Histogram: frequency by time         18.5 ms <

 Memory estimate: 250.06 KiB, allocs estimate: 676.

julia> trial1 = run_model_benchmark!(tracer_kernel_test, arch;
                                       use_benchmarktools = true)
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min  max):  6.695 ms    7.461 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     6.789 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.908 ms ± 312.944 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █     ██ █                                                █
  █▁▁▁▁▁██▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  6.69 ms         Histogram: frequency by time        7.46 ms <

 Memory estimate: 46.39 KiB, allocs estimate: 231.

P.S. some vestigial code not used is being removed as part of this PR because not beneficial (in terms of both accuracy and performance) that is

  • the JS weno formulation (dominated by the Z-weno formulation)
  • Velocity Upwinding for the vector invariant weno formulation (dominated by the other two formulations)

@inline C★(::WENO{6}, ::Val{2}) = 100/231
@inline C★(::WENO{6}, ::Val{3}) = 25/77
@inline C★(::WENO{6}, ::Val{4}) = 5/77
@inline C★(::WENO{6}, ::Val{5}) = 1/462
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a name change?


for stencil in 0:buffer-1
@eval @inline biased_β(ψ, scheme::WENO{$buffer, FT}, ::Val{$stencil}) where FT =
smoothness_sum(scheme, ψ, FT.($(smoothness_coefficients(Val(buffer), Val(stencil)))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it may be better to use map vs broadcasting where this is easy, I think this could affect precompilation time

end
end
end
@inline function biased_weno_weights(ijk, scheme::WENO{N, FT}, bias, dir, ::VelocityStencil, u, v) where {N, FT}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here bias is a boolean where true means left biased and false means right biased? is that right? I think it makes sense to document this.

@@ -364,62 +371,51 @@ end
# Calculation of WENO reconstructed value v⋆ = ∑ᵣ(wᵣv̂ᵣ)
for buffer in [2, 3, 4, 5, 6]
@eval begin
@inline stencil_sum(scheme::WENO{$buffer}, ψ, w, func, cT, val, idx, loc) = @inbounds $(metaprogrammed_stencil_sum(buffer))
@inline stencil_sum(scheme::WENO{$buffer}, bias, ψ, w, cT, val, idx, loc) = @inbounds $(metaprogrammed_stencil_sum(buffer))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

following the breadcrumbs of this argument bias leads here but I don't understand how this works in the end.

@inline S₂₆(S, left_bias) = @inbounds ifelse(left_bias, (S[4], S[5], S[6], S[7], S[8], S[9]), (S[9], S[8], S[7], S[6], S[5], S[4]))
@inline S₃₆(S, left_bias) = @inbounds ifelse(left_bias, (S[3], S[4], S[5], S[6], S[7], S[8]), (S[10], S[9], S[8], S[7], S[6], S[5]))
@inline S₄₆(S, left_bias) = @inbounds ifelse(left_bias, (S[2], S[3], S[4], S[5], S[6], S[7]), (S[11], S[10], S[9], S[8], S[7], S[6]))
@inline S₅₆(S, left_bias) = @inbounds ifelse(left_bias, (S[1], S[2], S[3], S[4], S[5], S[6]), (S[12], S[11], S[10], S[9], S[8], S[7]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left_bias is a much clearer argument name than bias. So left_bias == true is left biased and false means right-biased?

@glwagner
Copy link
Member

glwagner commented Jul 24, 2024

A few comments:

  • I would like to see some CPU benchmarks just to see where we are and how this changes things. We can merge the PR either way since GPU matters most
  • The documentation is pretty poor for this PR. The crucial argument is called "bias" in most places, but switches to "left_bias" in core functions. It seems to me that "left_bias" makes more sense --- bias==true implies left_bias where bias==false is right biased. I suggest using left_bias everywhere, including in the function that is now called bias. I also think that weno_interpolants.jl needs a lot more explanation by way of comments as well as docstrings for crucial functions. I suspect that there is only one person capable of understanding weno_interpolants.jl which is a dangerous situation.

@simone-silvestri
Copy link
Collaborator Author

  • The crucial argument is called "bias" in most places, but switches to "left_bias" in core functions. It seems to me that "left_bias" makes more sense --- bias==true implies left_bias where bias==false is right biased. I suggest using left_bias everywhere, including in the function that is now called bias.

I have changed the bias argument to be either LeftBias or RightBias this should make everything more readable.

@simone-silvestri
Copy link
Collaborator Author

simone-silvestri commented Jul 24, 2024

Global ocean benchmarks (from prototype_omip_simulation.jl) using

tracer_advection = TracerAdvection(WENO(; order = 7), WENO(; order =7), Centered())
momentum_advection = WENOVectorInvariant()

On Main

Screenshot 2024-07-24 at 10 31 27 AM

On this branch

Screenshot 2024-07-24 at 10 29 15 AM

@inbounds ifelse(left_bias, $(calc_reconstruction_stencil(buffer, :left, :x, true)),
$(calc_reconstruction_stencil(buffer, :right, :x, true)))
@inline inner_biased_interpolate_xᶠᵃᵃ(i, j, k, grid, ::UX{$buffer, FT}, bias, ψ::Function, idx, loc, args...) where FT =
@inbounds ifelse(bias == LeftBias(), $(calc_reconstruction_stencil(buffer, :left, :x, true)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@inbounds ifelse(bias == LeftBias(), $(calc_reconstruction_stencil(buffer, :left, :x, true)),
@inbounds ifelse(bias isa LeftBias, $(calc_reconstruction_stencil(buffer, :left, :x, true)),

not a necessary change, but saves a few characters and might be easier to read

@simone-silvestri
Copy link
Collaborator Author

simone-silvestri commented Jul 24, 2024

Output of the benchmark_advection_schemes.jl script ran on Tartarus with the following specs:

Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 48 × Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
Threads: 1 default, 0 interactive, 1 GC (on 48 virtual cores)
Environment:
  JULIA_CUDA_MEMORY_POOL = none
  LD_LIBRARY_PATH = /storage4/simone/lib:
  GPU: NVIDIA TITAN V

Note that

  • the Centered scheme is unaffected by this PR
  • UpwindBiased should have a similar execution time between this branch and main given that the operations are the same but they just happen at different levels
  • WENO(; order = 1) defaults to UpwindBiased(; order = 1)

On Main

Screenshot 2024-07-24 at 11 37 44 AM

On this PR

Screenshot 2024-07-24 at 11 02 35 AM

There are a couple of things to note here when it comes to the WENO scheme:

  • on CPU we gain roughly a factor of 1.8 given that computations have been reduced by 2X
  • on the GPU the gain is less, because, in this case, the uncoalesced memory access (which has to happen for both the branches) costs much more than on the CPU
  • (I suspect that) if using WENO only for horizontal reconstructions (like in the global ocean shown above) the penalty of uncoalesced memory access is reduced significantly and the kernel speeds up by a factor of 2 also on the GPU by virtue of the lower (2X) arithmetic intensity of the kernel

@simone-silvestri simone-silvestri merged commit abb66e3 into main Jul 25, 2024
46 checks passed
@simone-silvestri simone-silvestri deleted the ss/one-sided-upwind branch July 25, 2024 00:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants