Checkpointer revival #326

Merged
merged 28 commits into from
Aug 28, 2019

Conversation

ali-ramadhan
Member

This PR reintroduces checkpointing. This time the checkpointer saves the bare essentials to a JLD2 file.

Checkpointer will not save forcing functions as before (see #141).

Need to test GPU checkpointing, checkpointing with LES closures (I assumed model.diffusivities does not need to be checkpointed), and checkpointing with fancier boundary conditions.

A major issue with this checkpointer is that it can't restore the largest models because it creates a Model and then copies the data from the checkpoint file into the model fields. But if the model is using up all memory, then there's no room for reading data from disk to memory.

It should feed the data directly through the Model constructor, but this would require refactoring some of the model and field code.

This can be addressed in this PR or in a future PR, although it's kind of useless because checkpointing becomes more important the larger the model/simulation...

Resolves #324

@ali-ramadhan ali-ramadhan added the feature 🌟 Something new and shiny label Aug 2, 2019
@ali-ramadhan ali-ramadhan added this to the v1.0 milestone Aug 2, 2019
@ali-ramadhan ali-ramadhan self-assigned this Aug 2, 2019
@ali-ramadhan
Member Author

I think output_writers.jl is getting pretty cluttered (almost 400 lines with functionality shared between output writers) so I might try to move them to individual files in a future PR.

@glwagner
Member

glwagner commented Aug 2, 2019

I’m happy with output_writers.jl; I think all the code in that file is appropriately related.

The checkpointer could provide its own constructor to avoid excess memory allocation.

For GPU problems I don’t think there is an issue: checkpointed arrays can be loaded into CPU memory rather than GPU memory, and then the data can be copied into the fields allocated by the model constructor. So at first glance the excess memory allocation does not seem like a major issue on modern CPUs.

I am particularly concerned about the maintainability of the checkpointer, since it will need to be updated every time a new feature is added. Let’s make sure the design is easy to maintain before merging.

Member

@glwagner glwagner left a comment

Some work to do I think but this is a good start.


function savesubfields!(file, model, name, flds=propertynames(getproperty(model, name)))
    for f in flds
        file["$name/$f"] = Array(getproperty(getproperty(model, name), f).data.parent)
Member

Put an if statement that does not save a field if its type is Function?

Member Author

Sounds like a good check. Won't be needed as long as we maintain checkpointed_fieldsets but good check either way.

Member

I think that we should allow users to supply the modelfields that are to be checkpointed. In that case, such a check will be important.

Member

Continuing from my comment below, the if-statement can also emit a warning that field x will not be saved.

Member

@ali-ramadhan was this resolved?

output_frequency :: Int
end

function Checkpointer(; output_frequency, dir=".", prefix="checkpoint", force=false)
Member

The kwarg 'force' is not used here. It should either be used, or removed from the function signature.

In the JLD2OutputWriter, the kwarg force indicates whether file creation should be 'forced' (it corresponds to the same keyword passed to mkpath).

checkpointed_fieldsets = [:velocities, :tracers, :G, :Gp]

function write_output(model, c::Checkpointer)
    @warn "Checkpointer will not save forcing functions, output writers, or diagnostics. They will need to be " *
Member

Does it make sense to have a warning that is always printed? Perhaps it is simply better to document this aspect of the checkpointer?

Member Author

Hmmm, might then be good to just print the warning if there is a non-zero forcing function, or an output writer or diagnostic included.

Then yeah warning may be removed if the checkpointer is well-documented.

Member

Here is what I propose:

  1. Allow users to set the modelfields that are to be checkpointed as an argument to the Checkpointer constructor

  2. If one of those fields that the user has asked to be saved contains a function, emit a warning.

Member

@ali-ramadhan was this resolved?

_arr(::GPU, a) = CuArray(a)

function restore_from_checkpoint(filepath)
    @warn "Checkpointer cannot restore forcing functions, output writers, or diagnostics. They will need to be " *
Member

See above. What we need is documentation about proper checkpointing, which will include information about how to restore a model from a checkpoint that includes functions. In fact, we could even provide features in the checkpointer that streamline this process (by indicating parts of the model structure that are associated with functions, and asking the user to provide those functions during checkpoint restoration).

Member Author

Yeah warning doesn't have to be in restore_from_checkpoint. Passing forcing functions is a good idea.

k1, k2 = round(Int, Nz/4), round(Int, 3Nz/4)
true_model.tracers.T.data[i1:i2, j1:j2, k1:k2] .+= 0.01

checkpointed_model = deepcopy(true_model)
Member

Why is it necessary to copy the model before checkpointing?

Member Author

It's creating a second model that will be checkpointed as opposed to true_model which isn't.

Probably over paranoid but I wanted the two models to be time-stepped separately.

Member

Are you worried that checkpointing will modify model?

Since this behavior is undesirable / unexpected, perhaps it should simply be tested for, so that we can assume that the model is not modified.
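
A test along these lines could look like the minimal sketch below. `SnapshotModel` and `toy_write_output` are hypothetical stand-ins (not Oceananigans types or functions), and a `Dict` plays the role of the JLD2 file, so this only illustrates the shape of the check.

```julia
using Test

# Hypothetical stand-in for a model; not the Oceananigans Model type.
struct SnapshotModel
    T :: Array{Float64, 3}
end

# Hypothetical stand-in for write_output: copies the data into a Dict "file".
function toy_write_output(file::Dict, model::SnapshotModel)
    file["tracers/T"] = copy(model.T)
    return nothing
end

model = SnapshotModel(rand(4, 4, 4))
before = deepcopy(model.T)   # snapshot taken before checkpointing

file = Dict{String, Any}()
toy_write_output(file, model)

@test model.T == before               # checkpointing left the model untouched
@test file["tracers/T"] == before     # the checkpoint holds the same values
@test file["tracers/T"] !== model.T   # the stored array is a copy, not an alias
```

With a check like this in the test suite, the test model would not need to be deep-copied just to guard against side effects.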

end

checkpointed_structs = [:arch, :boundary_conditions, :grid, :clock, :eos, :constants, :closure]
checkpointed_fieldsets = [:velocities, :tracers, :G, :Gp]
Member

G and Gp are not fields of model.

Member Author

@ali-ramadhan ali-ramadhan Aug 5, 2019

True. This was before PR #325 was merged and included the tendencies with AdamsBashforthTimestepper. Will update.

Member

@ali-ramadhan was this resolved?

_arr(::CPU, a) = a
_arr(::GPU, a) = CuArray(a)

function restore_from_checkpoint(filepath)
Member

This function must have a ; kwargs... argument which is then merged with the retrieved kwargs from the checkpoint file before being passed to the model constructor. This is needed to restore models with forcing functions or non-default boundary conditions.
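
As a rough sketch of that merge, assuming the checkpoint contents have already been read into a `NamedTuple`: `toy_model`, `toy_restore`, and `heating` are hypothetical names used only for illustration, not the actual constructor or restore function.

```julia
# Hypothetical model constructor: every keyword has a default.
toy_model(; N=16, forcing=nothing, closure=:constant) =
    (N=N, forcing=forcing, closure=closure)

# `saved` stands in for the keyword arguments read back from the checkpoint file.
function toy_restore(saved::NamedTuple; kwargs...)
    # User-supplied kwargs are merged last, so they fill in (or override)
    # anything that could not be serialized, such as forcing functions.
    merged = merge(saved, kwargs)
    return toy_model(; merged...)
end

saved = (N=32, closure=:smagorinsky)   # no forcing: functions were not saved
heating(x, y, z, t) = 1e-4             # hypothetical forcing function

model = toy_restore(saved; forcing=heating)
```

Merging the user's keywords last means a caller can also deliberately override a checkpointed value, which may or may not be desirable.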

Member

@glwagner glwagner left a comment

Second review: add a field (perhaps checkpointed_fields) to the Checkpointer that allows the user to control which subfields of model are checkpointed. Use a kwarg to set the default subfields to the current list. The subfields that contain Fields (like velocities) should be included in the same list as subfields like constants (for simplicity, one list is best), and a function should be used to properly save a struct depending on whether it contains Fields, Functions, or neither.

# Checkpointing model properties that we can just serialize.
[file["$p"] = getproperty(model, p) for p in checkpointed_structs]

# Checkpointing structs containing fields.
Member

Perhaps this step can be combined with the first, along with an if statement that inspects the content of the struct to be saved and determines whether it contains Fields or not.

return nothing
end

checkpointed_structs = [:arch, :boundary_conditions, :grid, :clock, :eos, :constants, :closure]
Member

These fields could be made properties of the Checkpointer. This will allow the user to add/remove fields from checkpointing, which may be generally useful in the future and is currently useful for omitting boundary conditions from checkpointing.

Member

@ali-ramadhan was this resolved?

@ali-ramadhan
Member Author

I'll work on this until we're happy with the interface but will leave restoring massive models (that fill up memory) properly for another PR.

> I’m happy with output_writers.jl; I think all the code in that file is appropriately related.

I still find it a little messy and hard to navigate. Maybe a good solution for now would be to reorganize the file into sections, and shared functions can be moved to the top.

> I am particularly concerned about the maintainability of the checkpointer, since it will need to be updated every time a new feature is added. Let’s make sure the design is easy to maintain before merging.

Eventually the Model struct will become stable but until then we'll have to modify the checkpointer to account for that. But yeah, having something flexible and easy to modify is important.

> add a field (perhaps checkpointed_fields) to the Checkpointer that allows the user to control which subfields of model are checkpointed.

Might make more sense. In general, the checkpointer should be flexible but you'd probably only need to change what's checkpointed for esoteric cases. E.g. if you remove something important, you'll

@glwagner
Member

glwagner commented Aug 5, 2019

> Eventually the Model struct will become stable but until then we'll have to modify the checkpointer to account for that.

This is not the only issue. Another issue is if new subfields of model are added that contain non-checkpointable elements, like forcing functions. In that sense the Model struct will never be "stable" with respect to the types of data it references. For example, there could someday be a grid that contains forcing functions. Also, we should checkpoint boundary conditions by default. However, because boundary conditions can contain forcing functions, we need to be able to robustly/programmatically save only those boundary conditions that do not contain references to functions.

To address this, we need a generalized checkpointer that works in a wide range of circumstances, is relatively automated, modular, and can be user-modified.

I'm happy to help develop these features. One key is a function that can query a struct recursively to determine if it contains any references to Functions at any level. A second element is control to save an entire object in the case that the object contains no references to functions, versus a different but useful secondary behavior if the object does contain a reference to functions.
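
A minimal sketch of such a recursive query is below. The name `contains_function` and the toy structs are hypothetical. The sketch assumes the object graph has no cycles, and it only detects subtypes of `Function`, not callable structs.

```julia
# Returns true if `x` is a Function or contains one at any depth.
contains_function(x::Function) = true
contains_function(x::Union{Number, AbstractString, Symbol}) = false
contains_function(x::AbstractArray{<:Number}) = false   # skip large numeric data
contains_function(x::Union{Tuple, NamedTuple, AbstractArray}) = any(contains_function, x)

# Generic fallback: recurse into the fields of any other object.
function contains_function(x)
    return any(contains_function(getfield(x, i)) for i in 1:nfields(x))
end

# Hypothetical stand-ins for boundary conditions and a model.
struct ToyBC
    condition
end

struct QueryModel
    grid
    bcs
end

plain  = QueryModel((Nx=8, Lx=1.0), (top=ToyBC(0.0), bottom=ToyBC(1.0)))
forced = QueryModel((Nx=8, Lx=1.0), (top=ToyBC(t -> sin(t)), bottom=ToyBC(1.0)))

contains_function(plain)    # false
contains_function(forced)   # true
```

The result could then drive the two behaviors described above: save the whole object when the answer is `false`, and fall back to something else (a warning, or saving only the function-free parts) when it is `true`.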

@codecov

codecov bot commented Aug 22, 2019

Codecov Report

Merging #326 into master will decrease coverage by 13.3%.
The diff coverage is 96.15%.

@@             Coverage Diff             @@
##           master     #326       +/-   ##
===========================================
- Coverage   74.74%   61.43%   -13.31%     
===========================================
  Files          22       24        +2     
  Lines        1176     1281      +105     
===========================================
- Hits          879      787       -92     
- Misses        297      494      +197

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/Oceananigans.jl | 62.5% <ø> (-37.5%) | ⬇️ |
| src/output_writers.jl | 46.97% <96.15%> (-17.77%) | ⬇️ |
| src/poisson_solvers.jl | 40.65% <0%> (-56.97%) | ⬇️ |
| src/utils.jl | 16.21% <0%> (-40.93%) | ⬇️ |
| src/turbulence_closures/constant_smagorinsky.jl | 51.42% <0%> (-37.15%) | ⬇️ |
| src/turbulence_closures/closure_operators.jl | 42.62% <0%> (-22.55%) | ⬇️ |
| src/fields.jl | 42% <0%> (-20.03%) | ⬇️ |
| src/boundary_conditions.jl | 59.15% <0%> (-10.85%) | ⬇️ |
| ... and 8 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 66670ad...40fd4b0.

@codecov

codecov bot commented Aug 22, 2019

Codecov Report

Merging #326 into master will increase coverage by 1.31%.
The diff coverage is 81.1%.

@@            Coverage Diff             @@
##           master     #326      +/-   ##
==========================================
+ Coverage   74.74%   76.06%   +1.31%     
==========================================
  Files          22       22              
  Lines        1176     1224      +48     
==========================================
+ Hits          879      931      +52     
+ Misses        297      293       -4

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/Oceananigans.jl | 100% <ø> (ø) | ⬆️ |
| src/output_writers.jl | 75.93% <81.1%> (+11.18%) | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 66670ad...4e72d35.

@ali-ramadhan
Member Author

ali-ramadhan commented Aug 22, 2019

I've changed things a bit in response to your suggestions.

Mainly, the structs and fieldsets to checkpoint can be specified through the Checkpointer constructor. If an incompatible struct is provided, e.g. :forcing, an exception is raised. Validation only occurs in the constructor so no need for if-statements and warnings anywhere else.
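
The constructor-side validation described here might be sketched as follows. `ToyCheckpointer`, its field names, and the `UNSUPPORTED` list are assumptions made for illustration and do not reproduce the merged code. The sketch also skips directory creation.

```julia
# Hypothetical checkpointer that records which model properties to save.
struct ToyCheckpointer
    dir        :: String
    prefix     :: String
    frequency  :: Int
    properties :: Vector{Symbol}
end

# Assumed list of properties that cannot be checkpointed.
const UNSUPPORTED = (:forcing, :output_writers, :diagnostics)

function ToyCheckpointer(; frequency, dir=".", prefix="checkpoint",
                         properties=[:grid, :clock, :velocities, :tracers])
    # Validate once, up front, so the write path needs no further checks.
    for p in properties
        p in UNSUPPORTED && throw(ArgumentError("cannot checkpoint :$p"))
    end
    return ToyCheckpointer(dir, prefix, frequency, properties)
end

ok = ToyCheckpointer(frequency=100)

# Asking for an unsupported property fails immediately.
rejected = try
    ToyCheckpointer(frequency=100, properties=[:grid, :forcing])
    false
catch err
    err isa ArgumentError
end
```

Failing at construction time surfaces the mistake before a long simulation starts, rather than at the first checkpoint.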

@glwagner Lemme know if this looks okay to merge.

mkpath(dir)
return Checkpointer(dir, prefix, output_frequency)
end

function savesubfields!(file, model, name, flds=propertynames(getproperty(model, name)))
for f in flds
file["$name/$f"] = Array(getproperty(getproperty(model, name), f).data.parent)
if name ∉ (:forcing)
Member

@glwagner glwagner Aug 23, 2019

I'm a bit confused about how this function works:

  • why restrict this functionality to Field (as it looks to me), but provide a hook when name=:forcing?
  • can't we check the type of the object being saved? Other fields (like boundary conditions, or future fields that we add to model) may be functions.

I think it'd be better to use a function that dispatches on the type of the object being saved, eg

savefield(file, location, field) = file[location] = field
savefield(file, location, field::AbstractArray) = file[location] = Array(field)
savefield(file, location, field::Field) = file[location] = Array(field.data.parent)
savefield(file, location, field::Function) = @warn "Cannot checkpoint Function object into $(location)!"

Using such a function may be more robust and general.
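
To make the dispatch idea concrete, here is a self-contained sketch that runs without Oceananigans or JLD2. `ToyField` is a hypothetical stand-in for `Field`, a `Dict` stands in for the output file, and `savefield!` is an illustrative name.

```julia
# Hypothetical stand-in for an Oceananigans Field.
struct ToyField
    data :: Array{Float64, 3}
end

# One method per kind of object; the fallback stores the value as-is.
savefield!(file, location, x) = (file[location] = x; nothing)
savefield!(file, location, x::AbstractArray) = (file[location] = Array(x); nothing)
savefield!(file, location, x::ToyField) = (file[location] = Array(x.data); nothing)

# Functions cannot be stored, so warn and skip them.
function savefield!(file, location, x::Function)
    @warn "Cannot checkpoint Function object into $(location)!"
    return nothing
end

file = Dict{String, Any}()
savefield!(file, "clock/iteration", 42)
savefield!(file, "grid/zC", view(collect(1.0:4.0), 2:3))   # a view becomes a plain Array
savefield!(file, "tracers/T", ToyField(zeros(2, 2, 2)))
savefield!(file, "forcing/u", (x, y, z, t) -> 0.0)         # warns, stores nothing

saved_keys = sort(collect(keys(file)))
```

Supporting a new kind of model property then means adding one method rather than another branch in the save loop.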

@ali-ramadhan
Member Author

Good suggestion. Actually, this should help fix one problem I've been having with JLD2 output: Julia ranges like grid.xC are serialized to the JLD2 file, and cannot be read outside of Julia.

You might want to serialize the grid when checkpointing to easily restore from a checkpoint file. But when saving the grid to a JLD2 output file, which may be read by a language other than Julia, the grid properties should be saved in a language-agnostic manner. Same for boundary conditions.

So I changed the way structs are saved to disk for both the JLD2 output writer and the checkpointer. It's all done recursively via multiple dispatch so it should be flexible enough to work for all current Model properties and it should accommodate future changes to Model with minor changes.

When saving stuff to disk like a JLD2 file, saveproperty! is used, which converts Julia objects to language-agnostic objects.

When checkpointing, serializeproperty! is used, which serializes objects, with fields and boundary conditions requiring special treatment.
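
The difference between the two paths can be illustrated with a small sketch. The functions `toy_saveproperty!` and `toy_serializeproperty!` and the `ToyGrid` type are hypothetical and only mimic the idea. A `Dict` stands in for the JLD2 file, so nothing is actually serialized to disk here.

```julia
# Hypothetical grid holding a Julia range, like grid.xC in the discussion.
struct ToyGrid
    Nx :: Int
    Lx :: Float64
    xC :: AbstractRange{Float64}
end

# Leaves that most file formats can store directly.
toy_saveproperty!(file, loc, p::Union{Number, AbstractString}) = (file[loc] = p; nothing)

# Ranges are Julia-specific objects, so store the values they expand to.
toy_saveproperty!(file, loc, p::AbstractRange) = (file[loc] = collect(p); nothing)

# Any other struct: recurse into its fields, one sub-path per field.
function toy_saveproperty!(file, loc, p)
    for name in fieldnames(typeof(p))
        toy_saveproperty!(file, "$loc/$name", getfield(p, name))
    end
    return nothing
end

# Serializing keeps the Julia object whole, which only Julia can read back.
toy_serializeproperty!(file, loc, p) = (file[loc] = p; nothing)

grid = ToyGrid(4, 4.0, range(0.5, step=1.0, length=4))

output, checkpoint = Dict{String, Any}(), Dict{String, Any}()
toy_saveproperty!(output, "grid", grid)            # grid/Nx, grid/Lx, grid/xC
toy_serializeproperty!(checkpoint, "grid", grid)   # a single "grid" entry
```

The output file ends up with plain numbers and arrays under `grid/...`, while the checkpoint keeps a single object that can be handed straight back to a constructor.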

We checkpoint structs that are important for timestepping. Diagnostics and output writers are not checkpointed, as they are not essential and can be added at any time after model construction. But if one or more boundary conditions contain a function, model.boundary_conditions are not serialized and must be manually restored.

There is one messy bit associated with restoring from a checkpoint:

  • Fields cannot be passed to the Model constructor. When restoring fields we want to avoid loading a field from disk and allocating Model memory for it at the same time, as we won't be able to restore models whose memory footprint exceeds ~half the CPU/GPU memory. Thus restoring fields is treated as a special case (see restore_fields!). It is done after model creation where fields are read from disk and used to fill up existing model fields. Unfortunately, model.timestepper doesn't fit the pattern and is treated as an extra special case.
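
The allocate-first, fill-afterwards pattern can be sketched with plain Julia I/O. Note the substitution: this uses a raw binary stream and `read!` instead of JLD2, and `ToyFields`, `toy_checkpoint`, and `toy_restore_fields!` are hypothetical names, so it shows the memory pattern rather than the actual restore_fields! code.

```julia
# Hypothetical model whose fields are allocated once by its constructor.
struct ToyFields
    u :: Array{Float64, 3}
    T :: Array{Float64, 3}
end
ToyFields(N) = ToyFields(zeros(N, N, N), zeros(N, N, N))

# Write each field's raw values to a stream, in a fixed order.
function toy_checkpoint(io::IO, f::ToyFields)
    write(io, f.u)
    write(io, f.T)
    return nothing
end

# Fill the existing arrays straight from the stream, with no temporary arrays.
function toy_restore_fields!(f::ToyFields, io::IO)
    read!(io, f.u)
    read!(io, f.T)
    return f
end

original = ToyFields(4)
original.T .= 20.0
original.u[1, 1, 1] = 0.5

path, io = mktemp()
toy_checkpoint(io, original)
close(io)

restored = ToyFields(4)   # allocate first, then read from disk into it
open(stream -> toy_restore_fields!(restored, stream), path)
rm(path)
```

Because `read!` writes into memory the model already owns, peak usage stays near one model's worth instead of two.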

@glwagner Let me know what you think. This PR has been open for a while so I'd like to merge it ASAP and work on more pressing issues.

@ali-ramadhan ali-ramadhan merged commit 35a6a05 into master Aug 28, 2019
@ali-ramadhan ali-ramadhan deleted the checkpointer branch August 28, 2019 00:02
arcavaliere pushed a commit to arcavaliere/Oceananigans.jl that referenced this pull request Nov 6, 2019
Successfully merging this pull request may close these issues.

Barebones checkpointing
2 participants