
Add Y-partition and XY-partition tests #3338

Merged (127 commits) on Oct 25, 2023

Conversation

@simone-silvestri (Collaborator) commented Oct 13, 2023

The code should allow all types of partitioning, but a bug prevents the tests from passing for y- and xy-partitions.

This PR aims to fix that bug (and consequently adds the previously excluded y- and xy-partitioning tests). It also reworks `fill_halo_regions!` to split out non-communicating and communicating boundary conditions, so that the latter are executed on their own.

To give an example, for

```julia
julia> boundary_conditions = FieldBoundaryConditions(west = NoFluxBoundaryCondition(), east = NoFluxBoundaryCondition(), south = ImpenetrableBoundaryCondition(), north = DistributedCommunicationBoundaryCondition(), bottom = nothing, top = nothing)
Oceananigans.FieldBoundaryConditions, with boundary conditions
├── west: FluxBoundaryCondition: Nothing
├── east: FluxBoundaryCondition: Nothing
├── south: OpenBoundaryCondition: Nothing
├── north: DistributedBoundaryCondition: Nothing
├── bottom: Nothing
├── top: Nothing
└── immersed: DefaultBoundaryCondition (FluxBoundaryCondition: Nothing)
```

on `main`:

```julia
julia> halo_tuple = permute_boundary_conditions(boundary_conditions);

julia> for i in 1:length(halo_tuple[1])
           @info "operation $(halo_tuple[1][i]) with bcs $((halo_tuple[2][i], halo_tuple[3][i]))"
       end
[ Info: operation fill_bottom_and_top_halo! with bcs (nothing, nothing)
[ Info: operation fill_west_and_east_halo! with bcs (FluxBoundaryCondition: Nothing, FluxBoundaryCondition: Nothing)
[ Info: operation fill_south_and_north_halo! with bcs (OpenBoundaryCondition: Nothing, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Nothing})
```

and in this PR:

```julia
julia> halo_tuple = permute_boundary_conditions(boundary_conditions);

julia> for i in 1:length(halo_tuple[1])
           @info "operation $(halo_tuple[1][i]) with bcs $(halo_tuple[2][i])"
       end
[ Info: operation fill_bottom_and_top_halo! with bcs (nothing, nothing)
[ Info: operation fill_south_halo! with bcs (OpenBoundaryCondition: Nothing,)
[ Info: operation fill_west_and_east_halo! with bcs (FluxBoundaryCondition: Nothing, FluxBoundaryCondition: Nothing)
[ Info: operation fill_north_halo! with bcs (DistributedBoundaryCondition: Nothing,)
```
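The reordering above can be mimicked with a toy sketch (symbols stand in for real boundary conditions, with `:distributed` playing the role of a `DistributedCommunicationBoundaryCondition`; this is illustrative, not the PR's implementation):

```julia
# Toy version of the split: communicating (distributed) sides are peeled
# off so they can be filled on their own, after the local halo fills.
bcs = (west = :flux, east = :flux, south = :open, north = :distributed)

sides         = collect(keys(bcs))
communicating = filter(s -> bcs[s] == :distributed, sides)  # filled last
local_sides   = filter(s -> bcs[s] != :distributed, sides)  # filled first
```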


```julia
@inline extract_bc(bc, ::Val{:west_and_east})   = (extract_west_bc(bc), extract_east_bc(bc))
@inline extract_bc(bc, ::Val{:south_and_north}) = (extract_south_bc(bc), extract_north_bc(bc))
@inline extract_bc(bc, ::Val{:bottom_and_top})  = (extract_bottom_bc(bc), extract_top_bc(bc))
```
Member:

What's all this for?

Collaborator Author:

It's for the case where we have a tuple of `FieldBoundaryConditions` instead of just one `FieldBoundaryConditions`, where we could simply do `(bc.west, bc.east)`.
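A minimal sketch of what these extractors might look like (the `extract_*_bc` names follow the PR, but the method bodies here are illustrative assumptions, not the actual implementation):

```julia
# A single set of boundary conditions: pick out one side directly.
@inline extract_west_bc(bc) = bc.west
@inline extract_east_bc(bc) = bc.east

# A tuple of boundary-condition sets (e.g. one per field): build a
# "tuple of west bcs" and a "tuple of east bcs" by mapping over it.
@inline extract_west_bc(bcs::Tuple) = map(extract_west_bc, bcs)
@inline extract_east_bc(bcs::Tuple) = map(extract_east_bc, bcs)

@inline extract_bc(bc, ::Val{:west_and_east}) = (extract_west_bc(bc), extract_east_bc(bc))
```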

Member:

Oh okay, so this will extract a "tuple of west bcs" and a "tuple of east bcs".

Can you write that in a comment?


```diff
  # Calculate size and offset of the fill_halo kernel
- size = fill_halo_size(c, fill_halo!, indices, bc_left, loc, grid)
+ size = fill_halo_size(c, fill_halo!, indices, bcs[1], loc, grid)
```
Member:

Isn't there an assumption hidden in using bcs[1]? What is that assumption? Document your assumption with a comment.

Comment on lines 147 to 148
```julia
# Distributed halos have to be filled last because of
# buffered communication.
```
Member:

This doesn't completely explain it to me. Why would buffered communication require the halos to be filled last? What do buffers have to do with it? Is the point that you have to complete the other halo-filling tasks before communicating, if you want to avoid filling the halos again after communication?

It seems like we could communicate first before filling halos, right?

Collaborator Author:

I will update the comment. If you want asynchronous communication, there are two options:

  1. Fill all local halos → fill the buffers and start the communication → perform any required computation → complete the communication.
  2. Fill the buffers and start the communication → perform any required computation → complete the communication → fill any local halos.

With the first option, nothing changes in the case of non-distributed fields, so I went in that direction in #3125.
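The first option can be sketched as a sequence of stages (the function names below are placeholders rather than the Oceananigans API; each stub just records the order in which it runs):

```julia
# Placeholder stages; in reality these would pack MPI buffers, post
# non-blocking sends/receives, and so on.
const stages = String[]
fill_local_halos!(f)       = push!(stages, "fill local halos")
start_communication!(f)    = push!(stages, "fill buffers + start communication")
do_computation!(f)         = push!(stages, "overlapped computation")
complete_communication!(f) = push!(stages, "wait + fill halos from recv buffers")

function fill_halo_regions_sketch!(f)
    fill_local_halos!(f)        # 1. non-communicating halos first
    start_communication!(f)     # 2. pack buffers, post non-blocking messages
    do_computation!(f)          # 3. useful work while messages are in flight
    complete_communication!(f)  # 4. "completing" = wait, then unpack
    return nothing
end
```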

Member (@glwagner, Oct 23, 2023):

By "complete the communication" you mean "wait until communication finishes", right? That is, there is no separate action associated with "completing" the communication; it's just that it may not be finished by the time you are done performing the preliminary computations. Or does "completing" mean something else?

Collaborator Author:

Completing the communication requires waiting until the communication is finished and then filling the relevant halos from the received buffer. In practice, we can think of it as waiting until the communication is finished (assuming that filling the relevant halos is part of the communication process).

Member (@glwagner, Oct 23, 2023):

Okay, so the first step is to initiate communication, and the second step is to wait for communication to finish and then fill halos from the buffer. The code is misleading, because the initiation of communication is currently called `fill_halo_regions!`. But the halo filling actually occurs in the second step, right?

Collaborator Author:

Yes: with asynchronous communication, `fill_halo_regions!(f; async = true)` only initiates it, and the communication is completed in `synchronize_communication!(f)`. Otherwise, everything is completed in the `fill_halo_regions!` function.
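So the asynchronous calling pattern looks roughly like this (`compute_tendencies!` is a hypothetical stand-in for whatever work overlaps the communication):

```julia
fill_halo_regions!(f; async = true)  # fill local halos, start communication
compute_tendencies!(model)           # overlap useful work (hypothetical)
synchronize_communication!(f)        # wait, then fill halos from recv buffers
```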

Member:

Perhaps `finish_fill_halo_regions!(f)` or something similar would be helpful for the second step.

Member:

And it probably needs a comment in the code to indicate that, in the case of a distributed computation, the halo regions are not actually filled until the second step.

Collaborator Author:

We assume that the second step completes everything, i.e., waits for communication to finish and fills the halos:

```julia
"""
    synchronize_communication!(field)

Complete the halo passing of `field` among processors.
"""
function synchronize_communication!(field)
    arch = architecture(field.grid)

    # Wait for outstanding requests
    if !isempty(arch.mpi_requests)
        cooperative_waitall!(arch.mpi_requests)

        # Reset MPI tag
        arch.mpi_tag[] -= arch.mpi_tag[]

        # Reset MPI requests
        empty!(arch.mpi_requests)
    end

    recv_from_buffers!(field.data, field.boundary_buffers, field.grid)

    return nothing
end
```

```diff
@@ -101,11 +101,11 @@ function fill_halo_regions!(c::OffsetArray, bcs, indices, loc, grid::Distributed
     arch = architecture(grid)
     halo_tuple = permute_boundary_conditions(bcs)

-    for task = 1:3
+    for task = 1:length(halo_tuple[1])
```
Member (@glwagner, Oct 23, 2023):

What is the meaning of `length(halo_tuple[1])`?

Collaborator Author:

Changed to `number_of_tasks`.

@simone-silvestri (Collaborator Author):

I think I have addressed all the comments.

@simone-silvestri merged commit dcc0677 into main on Oct 25, 2023 (48 checks passed)
@simone-silvestri deleted the ss/distributed_tests branch on October 25, 2023 at 23:19
Labels: distributed 🕸️, testing 🧪