Splitting NetCDF output with NetCDFOutputWriter #2967

natgeo-wong · 2023-03-10T19:44:16Z

Opening this request as per the Slack conversation. Sometimes file sizes become very big (and thus are not easy to transfer, etc.), and so quite often it would probably be good to split data files according to time intervals.

So something with API like

split_file = Filesize(10Mb)

split_file = TimeInterval(10days)

Is there any specific things that should be noted if we are to extend the start_next_file() functionality to NetCDF as well instead of just JLD2? Like any flags or things that I should note.

The text was updated successfully, but these errors were encountered:

glwagner · 2023-03-10T19:49:11Z

I'd start by implementing the same feature that JLD2 has: a max_filesize keyword argument and property that controls when the files are split.

glwagner · 2023-03-10T19:51:36Z

Check out the test for file splitting with JLD2OutputWriter:

Oceananigans.jl/test/test_jld2_output_writer.jl

Line 42 in 2cd91ab

function test_jld2_file_splitting(arch)

We'll want a practically identical test for a max_filesize feature with NetCDFOutputWriter

josuemtzmo · 2024-03-13T10:42:00Z

Hello, I'm interested in this feature. I've managed to implement the same feature @glwagner suggested of max_filesize within the NetCDFOutputWriter, in addition to its test. I'm currently doing the testing of the changes, and if it works, I will submit a PR. Currently the code is in my fork of Oceananigans (https://github.com/josuemtzmo/Oceananigans.jl/tree/netcdf_split)

josuemtzmo · 2024-03-14T15:31:59Z

Since I want to fully understand the way Julia works, I added also the feature of splitting files based on time.

However, this feature will not work well when computing averages AveragedTimeInterval, since what I coded uses the difference between the times store in the NCDatasets to define if the difference between the first and last time stored are equal or larger to the user defined TimeInterval(), thus requiring to split the file.

https://github.com/josuemtzmo/Oceananigans.jl/blob/99b69720d0743fe7af6118bc3352cc7139d84408/src/OutputWriters/netcdf_output_writer.jl#L489-L505

Do you have any suggestion on how to approach this? or how do you envision splitting files based on time?

I will not merge the changes to the other PR, since I think this time splitting a bit more thought. The code is currently at:
https://github.com/josuemtzmo/Oceananigans.jl/tree/netcdf_t_split

glwagner · 2024-03-14T15:52:13Z

Since I want to fully understand the way Julia works, I added also the feature of splitting files based on time.

However, this feature will not work well when computing averages AveragedTimeInterval, since what I coded uses the difference between the times store in the NCDatasets to define if the difference between the first and last time stored are equal or larger to the user defined TimeInterval(), thus requiring to split the file.

https://github.com/josuemtzmo/Oceananigans.jl/blob/99b69720d0743fe7af6118bc3352cc7139d84408/src/OutputWriters/netcdf_output_writer.jl#L489-L505

Do you have any suggestion on how to approach this? or how do you envision splitting files based on time?

I will not merge the changes to the other PR, since I think this time splitting a bit more thought. The code is currently at: https://github.com/josuemtzmo/Oceananigans.jl/tree/netcdf_t_split

Can you explain what you are trying to do in more detail? What does it mean to split files based on time? You mean that you want to split files on a TimeInterval?

I would use the existing TimeInterval / schedule mechanism. If you re-implement code that is based on assumptions about how TimeInterval works, it will be harder to maintain because it will have to change if TimeInterval changes.

Then I guess if you want to have two independent features with interacting schedules, you will have to enforce that the two schedules are compatible / consistent within the constructor for the output writer.

Now that I think of it, it would probably better for size-based file splitting to also use schedules (eg a new schedule called FileSizeCriterion or something). But that's for another PR.

This isn't really a Julia-specific issue, it's more of a code design issue I think...

josuemtzmo · 2024-03-14T16:03:28Z

Can you explain what you are trying to do in more detail? What does it mean to split files based on time? You mean that you want to split files on a TimeInterval?

I would like output files that consistently have for example 30 days, in other words, it requires to create a new file once the time in the netCDF is equal to the TimeInterval of 30 days. The implementation I did of this, works when the user provided TimeInterval is larger than that of the scheduler. Potentially it could also work if the user provided TimeInterval is larger than the AveragedTimeInterval (I haven't tested it).

I would use the existing TimeInterval / schedule mechanism. If you re-implement code that is based on assumptions about how TimeInterval works, it will be harder to maintain because it will have to change if TimeInterval changes.

So far I'm only using the TimeInterval interval to differentiate the data type between splitting by size and by time. i.e. if the argument passed by the user is a float, then it is assumed to be a file split by size. If the argument passed is a TimeInterval, then the files are split by time.

Then I guess if you want to have two independent features with interacting schedules, you will have to enforce that the two schedules are compatible / consistent within the constructor for the output writer.

Now that I think of it, it would probably better for size-based file splitting to also use schedules (eg a new schedule called FileSizeCriterion or something). But that's for another PR.

Yes, it seems to me that in order to do it properly, it will be required to make it consistent with the schedules, but looking at the code, and I'm not sure where to start...

glwagner · 2024-03-15T16:32:43Z

A "schedule" is a function or callable object with a method

schedule(model)

that returns true or false based on a criterion. true means "do something".

The cleanest way to get this feature is to refactor the output writers to have a more generic interface for splitting. If we have a property called file_splitting that is a schedule, then we can write

NetCDFOutputWriter(model, outputs; file_splitting = TimeInterval(30days), ...)

Then the decision about whether to start a new file will change from

Oceananigans.jl/src/OutputWriters/jld2_output_writer.jl

Line 275 in 9f3f8e6

filesize(path) >= writer.max_filesize && start_next_file(model, writer)

to

writer.file_splitting(model) && start_next_file(model, writer)

Next, you will have to add a new schedule in Utils/schedules.jl that can split based on filesize:

mutable struct FileSizeLargerThan <: AbstractSchedule
    max_filesize :: Float64
    path :: String
end

(fslt::FileSizeLargerThan)(model) = filesize(fslt.path) >= fslt.max_filesize

Finally, you need to add a user interface for initializing and modifying the schedules to smooth out the user experience (for example, we don't want users to have to specify the file path more than once, and the file path that is checked by the schedule has to be updated). This will have to take two parts. In Utils/schedules.jl, a constructor:

FileSizeLargerThan(max_filesize) = FileSizeLargerThan(max_filesize, "")

Then in output writers, an interface to be used in both the model constructor and write_output!:

update_schedule!(schedule, path) = nothing
update_schedule!(schedule::FileSizeLargerThan, path) = schedule.path = path

This function update_schedule!(writer.schedule, path) then replaces this line

Oceananigans.jl/src/OutputWriters/jld2_output_writer.jl

Line 276 in 9f3f8e6

path = writer.filepath # we might have a new path...

and also needs to be added in the output writer constructor so that

Oceananigans.jl/src/OutputWriters/jld2_output_writer.jl

Line 176 in 9f3f8e6

filepath = joinpath(dir, filename)

becomes

filepath = joinpath(dir, filename) 
update_schedule!(schedule, filepath)

Make sense?

glwagner · 2024-03-15T16:33:35Z

This is a lot more sustainable than adding new properties to every output writer every time we want to support splitting a file by a different criterion.

It's a decent change to the user interface. I can help if you like.

josuemtzmo · 2024-03-15T16:37:28Z

Yes, that will be great. I will dive in into the schedule over the next few days!

glwagner · 2024-03-15T17:30:36Z

Ok, I'll open a PR that refactors the interface for file splitting.

glwagner changed the title ~~Splitting NetCDF dataset files by time~~ Splitting NetCDF output with NetCDFOutputWriter Mar 10, 2023

josuemtzmo mentioned this issue Mar 13, 2024

Support file split in NetCDFOutputWriter when max_filesize is exceeded #3506

Closed

navidcy linked a pull request Mar 13, 2024 that will close this issue

Support file split in NetCDFOutputWriter when max_filesize is exceeded #3506

Closed

josuemtzmo mentioned this issue Mar 14, 2024

File splitting in NetCDFOutputWriter when max_filesize is exceeded #3512

Merged

glwagner mentioned this issue Mar 15, 2024

Generalize file splitting for output writers #3515

Merged

navidcy closed this as completed in #3515 Mar 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Splitting NetCDF output with NetCDFOutputWriter #2967

Splitting NetCDF output with NetCDFOutputWriter #2967

natgeo-wong commented Mar 10, 2023

glwagner commented Mar 10, 2023

glwagner commented Mar 10, 2023

josuemtzmo commented Mar 13, 2024

josuemtzmo commented Mar 14, 2024

glwagner commented Mar 14, 2024 •

edited

Loading

josuemtzmo commented Mar 14, 2024 •

edited

Loading

glwagner commented Mar 15, 2024

glwagner commented Mar 15, 2024

josuemtzmo commented Mar 15, 2024

glwagner commented Mar 15, 2024

Splitting NetCDF output with NetCDFOutputWriter #2967

Splitting NetCDF output with NetCDFOutputWriter #2967

Comments

natgeo-wong commented Mar 10, 2023

glwagner commented Mar 10, 2023

glwagner commented Mar 10, 2023

josuemtzmo commented Mar 13, 2024

josuemtzmo commented Mar 14, 2024

glwagner commented Mar 14, 2024 • edited Loading

josuemtzmo commented Mar 14, 2024 • edited Loading

glwagner commented Mar 15, 2024

glwagner commented Mar 15, 2024

josuemtzmo commented Mar 15, 2024

glwagner commented Mar 15, 2024

glwagner commented Mar 14, 2024 •

edited

Loading

josuemtzmo commented Mar 14, 2024 •

edited

Loading