Fill velocity halos in a single pass for ConformalCubedSphereGrid #3201

Closed
navidcy opened this issue Jul 28, 2023 · 7 comments

Comments

@navidcy
Collaborator

navidcy commented Jul 28, 2023

At the moment we fill the velocity halos with multiple passes, e.g.,

for _ in 1:2
    fill_halo_regions!(u)
    fill_halo_regions!(v)
    @apply_regionally replace_horizontal_velocity_halos!((; u = u, v = v, w = nothing), grid)
end

We should utilize the grid's connectivity and develop a method to fill the velocity halos that only requires one pass. This is very important for performance and scaling on distributed systems.
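
To put a rough number on the cost of the extra pass, the sketch below counts halo exchanges on a six-panel cubed sphere. It assumes, purely for illustration, that filling one field's halos costs one exchange across each of a panel's four edges. The helper `edge_exchanges` is hypothetical and is not part of Oceananigans.

```julia
# Hypothetical back-of-the-envelope count; not Oceananigans code.
# Assumption: filling one field's halos costs one exchange per panel edge.
const PANELS = 6           # faces of the cube
const EDGES_PER_PANEL = 4  # each panel borders four neighbours

edge_exchanges(; fields, passes) = PANELS * EDGES_PER_PANEL * fields * passes

two_pass = edge_exchanges(fields = 2, passes = 2)  # u and v, as in the loop above
one_pass = edge_exchanges(fields = 2, passes = 1)  # u and v, single pass

println("two passes: ", two_pass, " exchanges; one pass: ", one_pass)
```

Under this assumption a single pass halves the number of exchanges. How much wall-clock time that saves depends on how expensive each exchange is relative to the rest of the time step.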

@glwagner
Member

Is this task required to complete the cubed sphere, or should we regard it as an optimization that's important for performance but not functionality?

@simone-silvestri @navidcy

@navidcy
Collaborator Author

navidcy commented Sep 14, 2023

It's really a "performance" task, but I have the gut feeling that it might be impeding performance so much that we won't be able to consider the cubed sphere done if we don't deal with this. So it's probably a good idea to leave it in the milestone for the global simulation using the cubed sphere, as it is now?

@glwagner
Member

"Done" isn't very precise since the cubed sphere will never be "done". But perhaps we can put a number on performance for the first milestone, which will allow us to conclude whether we need this optimization or not.

Can you explain where the gut feeling comes from? Will filling halos be so expensive even on just one GPU, or is this a distributed problem? Currently, 1/4 degree is performant on one GPU.

@navidcy
Collaborator Author

navidcy commented Sep 14, 2023

> "Done" isn't very precise since the cubed sphere will never be "done". But perhaps we can put a number on performance for the first milestone, which will allow us to conclude whether we need this optimization or not.

True. Ideally we want to be close to the scalings/performance we got with lat-lon grid? That’s perhaps not feasible..? I don’t know how close is good enough tho.

> Can you explain where the gut feeling comes from? Will filling halos be so expensive even on just one GPU, or is this a distributed problem? Currently, 1/4 degree is performant on one GPU.

Well, at least some of the gut feeling comes from the fact that I'm pretty sure the halo-filling cost can be cut in half by doing it in a single pass. But you are on point: I don't have a gut feeling regarding how much impact the two passes have on performance.

@glwagner
Member

> True. Ideally we want to be close to the scalings/performance we got with lat-lon grid? That’s perhaps not feasible..? I don’t know how close is good enough tho.

We expect to be at lower performance. For that reason we have dedicated two independent milestones to the cubed sphere. The first milestone is rather succinct: "complete the cubed sphere implementation". The second milestone pertains to performance: "achieve 10 SYPD at 25 km resolution". I think this is nice, because we want to separate tasks that are required for correct functionality from tasks that are oriented towards performance rather than correctness.
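
As a rough illustration of what the second milestone implies, SYPD (simulated years per wall-clock day) can be converted into a wall-clock budget per time step. The 10-minute time step below is an assumed value for illustration only, not one stated in this thread, and the calculation uses 365-day years.

```julia
# Convert a simulated-years-per-day (SYPD) target into a per-step time budget.
# Assumptions: Δt = 600 s is illustrative only; years have 365 days.
seconds_per_day = 86_400
days_per_year = 365

sypd_target = 10   # from the milestone: 10 SYPD
Δt = 600           # assumed model time step, in seconds

steps_per_simulated_year = days_per_year * seconds_per_day / Δt
steps_per_wall_day = sypd_target * steps_per_simulated_year
budget_per_step = seconds_per_day / steps_per_wall_day   # wall-clock seconds

println("wall-clock budget per time step ≈ ", round(budget_per_step, digits = 3), " s")
```

With these assumptions the budget works out to Δt / 3650, so every part of a time step, halo filling included, has to fit inside a fraction of a second.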

@glwagner
Member

glwagner commented Sep 14, 2023

I think high performance at 25 km resolution will prove difficult also because we are effectively dividing our kernel size by 6 (unless we figure out how to coalesce kernels across panels). On a large GPU this will lead to performance degradation at 25 km resolution, because even a single-panel kernel covering the whole globe at 25 km barely saturates one GPU. Recovering that performance for multi-region simulations may be difficult, especially in the face of the added complexity of distribution across multiple GPUs.
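
A back-of-the-envelope count gives a sense of the scale involved. The sketch assumes uniform 25 km × 25 km cells and an Earth surface area of roughly 5.1 × 10^8 km². Real cubed-sphere cells are not uniform, so these are order-of-magnitude figures only.

```julia
# Rough count of horizontal cells per kernel launch at 25 km resolution.
# Assumptions: uniform 25 km × 25 km cells, Earth's surface ≈ 5.1e8 km².
earth_area_km2 = 5.1e8
cell_area_km2 = 25.0^2

global_cells = earth_area_km2 / cell_area_km2   # one kernel over the whole globe
panel_cells = global_cells / 6                  # one kernel per cubed-sphere panel

println("global: ", round(Int, global_cells), " columns; per panel: ", round(Int, panel_cells))
```

Each per-panel kernel therefore covers about one sixth of the horizontal columns that a single global kernel would cover.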

@navidcy
Collaborator Author

navidcy commented Apr 5, 2024

closing this; closed by #3488

@navidcy navidcy closed this as completed Apr 5, 2024