Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Splitting NetCDF output with NetCDFOutputWriter #2967

Closed
natgeo-wong opened this issue Mar 10, 2023 · 10 comments · Fixed by #3515
Closed

Splitting NetCDF output with NetCDFOutputWriter #2967

natgeo-wong opened this issue Mar 10, 2023 · 10 comments · Fixed by #3515

Comments

@natgeo-wong
Copy link

Opening this request as per the Slack conversation. Sometimes file sizes become very big (and thus are not easy to transfer, etc.), and so quite often it would probably be good to split data files according to time intervals.

So something with API like

split_file = Filesize(10Mb)

split_file = TimeInterval(10days)

Is there any specific things that should be noted if we are to extend the start_next_file() functionality to NetCDF as well instead of just JLD2? Like any flags or things that I should note.

@glwagner glwagner changed the title Splitting NetCDF dataset files by time Splitting NetCDF output with NetCDFOutputWriter Mar 10, 2023
@glwagner
Copy link
Member

I'd start by implementing the same feature that JLD2 has: a max_filesize keyword argument and property that controls when the files are split.

@glwagner
Copy link
Member

Check out the test for file splitting with JLD2OutputWriter:

function test_jld2_file_splitting(arch)

We'll want a practically identical test for a max_filesize feature with NetCDFOutputWriter

@josuemtzmo
Copy link
Collaborator

Hello, I'm interested in this feature. I've managed to implement the same feature @glwagner suggested of max_filesize within the NetCDFOutputWriter, in addition to its test. I'm currently doing the testing of the changes, and if it works, I will submit a PR. Currently the code is in my fork of Oceananigans (https://github.com/josuemtzmo/Oceananigans.jl/tree/netcdf_split)

@josuemtzmo
Copy link
Collaborator

Since I want to fully understand the way Julia works, I added also the feature of splitting files based on time.

However, this feature will not work well when computing averages AveragedTimeInterval, since what I coded uses the difference between the times store in the NCDatasets to define if the difference between the first and last time stored are equal or larger to the user defined TimeInterval(), thus requiring to split the file.

https://github.com/josuemtzmo/Oceananigans.jl/blob/99b69720d0743fe7af6118bc3352cc7139d84408/src/OutputWriters/netcdf_output_writer.jl#L489-L505

Do you have any suggestion on how to approach this? or how do you envision splitting files based on time?

I will not merge the changes to the other PR, since I think this time splitting a bit more thought. The code is currently at:
https://github.com/josuemtzmo/Oceananigans.jl/tree/netcdf_t_split

@glwagner
Copy link
Member

glwagner commented Mar 14, 2024

Since I want to fully understand the way Julia works, I added also the feature of splitting files based on time.

However, this feature will not work well when computing averages AveragedTimeInterval, since what I coded uses the difference between the times store in the NCDatasets to define if the difference between the first and last time stored are equal or larger to the user defined TimeInterval(), thus requiring to split the file.

https://github.com/josuemtzmo/Oceananigans.jl/blob/99b69720d0743fe7af6118bc3352cc7139d84408/src/OutputWriters/netcdf_output_writer.jl#L489-L505

Do you have any suggestion on how to approach this? or how do you envision splitting files based on time?

I will not merge the changes to the other PR, since I think this time splitting a bit more thought. The code is currently at: https://github.com/josuemtzmo/Oceananigans.jl/tree/netcdf_t_split

Can you explain what you are trying to do in more detail? What does it mean to split files based on time? You mean that you want to split files on a TimeInterval?

I would use the existing TimeInterval / schedule mechanism. If you re-implement code that is based on assumptions about how TimeInterval works, it will be harder to maintain because it will have to change if TimeInterval changes.

Then I guess if you want to have two independent features with interacting schedules, you will have to enforce that the two schedules are compatible / consistent within the constructor for the output writer.

Now that I think of it, it would probably better for size-based file splitting to also use schedules (eg a new schedule called FileSizeCriterion or something). But that's for another PR.

This isn't really a Julia-specific issue, it's more of a code design issue I think...

@josuemtzmo
Copy link
Collaborator

josuemtzmo commented Mar 14, 2024

Can you explain what you are trying to do in more detail? What does it mean to split files based on time? You mean that you want to split files on a TimeInterval?

I would like output files that consistently have for example 30 days, in other words, it requires to create a new file once the time in the netCDF is equal to the TimeInterval of 30 days. The implementation I did of this, works when the user provided TimeInterval is larger than that of the scheduler. Potentially it could also work if the user provided TimeInterval is larger than the AveragedTimeInterval (I haven't tested it).

I would use the existing TimeInterval / schedule mechanism. If you re-implement code that is based on assumptions about how TimeInterval works, it will be harder to maintain because it will have to change if TimeInterval changes.

So far I'm only using the TimeInterval interval to differentiate the data type between splitting by size and by time. i.e. if the argument passed by the user is a float, then it is assumed to be a file split by size. If the argument passed is a TimeInterval, then the files are split by time.

Then I guess if you want to have two independent features with interacting schedules, you will have to enforce that the two schedules are compatible / consistent within the constructor for the output writer.

Now that I think of it, it would probably better for size-based file splitting to also use schedules (eg a new schedule called FileSizeCriterion or something). But that's for another PR.

Yes, it seems to me that in order to do it properly, it will be required to make it consistent with the schedules, but looking at the code, and I'm not sure where to start...

@glwagner
Copy link
Member

A "schedule" is a function or callable object with a method

schedule(model)

that returns true or false based on a criterion. true means "do something".

The cleanest way to get this feature is to refactor the output writers to have a more generic interface for splitting. If we have a property called file_splitting that is a schedule, then we can write

NetCDFOutputWriter(model, outputs; file_splitting = TimeInterval(30days), ...)

Then the decision about whether to start a new file will change from

filesize(path) >= writer.max_filesize && start_next_file(model, writer)

to

writer.file_splitting(model) && start_next_file(model, writer) 

Next, you will have to add a new schedule in Utils/schedules.jl that can split based on filesize:

mutable struct FileSizeLargerThan <: AbstractSchedule
    max_filesize :: Float64
    path :: String
end

(fslt::FileSizeLargerThan)(model) = filesize(fslt.path) >= fslt.max_filesize

Finally, you need to add a user interface for initializing and modifying the schedules to smooth out the user experience (for example, we don't want users to have to specify the file path more than once, and the file path that is checked by the schedule has to be updated). This will have to take two parts. In Utils/schedules.jl, a constructor:

FileSizeLargerThan(max_filesize) = FileSizeLargerThan(max_filesize, "")

Then in output writers, an interface to be used in both the model constructor and write_output!:

update_schedule!(schedule, path) = nothing
update_schedule!(schedule::FileSizeLargerThan, path) = schedule.path = path

This function update_schedule!(writer.schedule, path) then replaces this line

path = writer.filepath # we might have a new path...

and also needs to be added in the output writer constructor so that

filepath = joinpath(dir, filename)

becomes

filepath = joinpath(dir, filename) 
update_schedule!(schedule, filepath)

Make sense?

@glwagner
Copy link
Member

This is a lot more sustainable than adding new properties to every output writer every time we want to support splitting a file by a different criterion.

It's a decent change to the user interface. I can help if you like.

@josuemtzmo
Copy link
Collaborator

Yes, that will be great. I will dive in into the schedule over the next few days!

@glwagner
Copy link
Member

Ok, I'll open a PR that refactors the interface for file splitting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants