Feature idea: keyword "pickup" in run!
#1068
Comments
@christophernhill feedback appreciated!
Related issue: #779
A consideration when picking up from a checkpoint and using Not sure of the best way of making this easy for users without accidentally overwriting their data. I can think of three solutions:
Not sure how it would work with JLD2 but
I like this idea a lot! It would make checkpointing and picking up from a checkpoint much easier, especially for users. How would this work if you have one spin-up script and another script for the interesting part of the simulation that picks up from the checkpoint made at the end of spin-up? I guess there are two options for If we add a third option: X-Ref: #602
This is an interesting and I think common use case that's worth thinking about (cc @sandreza). The current mode, which I think works well, is to build a new model (Thinking about it more, we might almost rename If we deprecate set!(model, checkpoint_filepath::String) which would be implemented as part of the Cons
Pro
I think we can mitigate Con.1 by designing helper functions that restore individual model properties. Something like
grid = restore(checkpoint_filepath, :grid)
coriolis = restore(checkpoint_filepath, :coriolis)
(This kind of feature is important for post-processing with the other output writers too.) I guess we can't solve Con.1 without diluting Pro.1.
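A minimal sketch of what such a `restore` helper might look like, purely for illustration. It assumes a hypothetical checkpoint layout (a `Dict{Symbol,Any}` written with Julia's `Serialization` standard library) rather than the JLD2 files the real checkpointer writes, and `write_mock_checkpoint` is a made-up helper that only exists to make the example self-contained.

```julia
using Serialization  # stdlib stand-in; the real checkpointer uses JLD2 (assumption for this sketch)

# Hypothetical helper: write a mock checkpoint as a serialized Dict of named properties.
function write_mock_checkpoint(filepath, properties::Dict{Symbol,Any})
    serialize(filepath, properties)
    return filepath
end

# Restore a single named property from a checkpoint file without rebuilding the model.
function restore(filepath::String, name::Symbol)
    properties = deserialize(filepath)
    haskey(properties, name) || error("checkpoint $filepath has no property :$name")
    return properties[name]
end

# Mock data: NamedTuples stand in for real grid and Coriolis objects.
checkpoint_filepath = write_mock_checkpoint(
    joinpath(mktempdir(), "checkpoint_iteration100.bin"),
    Dict{Symbol,Any}(:grid => (Nx=16, Ny=16, Nz=8), :coriolis => (f=1e-4,)),
)

grid = restore(checkpoint_filepath, :grid)
coriolis = restore(checkpoint_filepath, :coriolis)
```

Note that this stand-in deserializes the whole file before picking out one property. A format that allows reading a single dataset by name would avoid that cost.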
Currently restoring a model from a checkpoint uses the function `restore_from_checkpoint`. However, this design can be difficult to use because, in general, restoring a model requires passing the forcing functions and boundary condition functions explicitly to `restore_from_checkpoint`. This means that part of the setup script needs to be replicated in the script that restores a model from the checkpoint.

A different solution is possible that may require less work on the part of users to restore a model from a checkpoint: extending `run!` with the keyword argument `pickup`.

If `pickup=true` (and a checkpointer has been added to `simulation.output_writers`), then prior to executing the `run!` loop, `model.velocities`, `model.tracers`, `model.clock`, and, for `QuasiAdamsBashforth2` timesteppers, the tendency `model.timesteppers.Gⁿ` are restored from the checkpoint.

For this to work seamlessly when a `JLD2OutputWriter` is involved we might need to modify the constructor for `JLD2OutputWriter`. Currently the constructor runs an arbitrary `init(file, model)` function, and saves certain properties identified in the constructor signature:

Oceananigans.jl/src/OutputWriters/jld2_output_writer.jl
Lines 157 to 160 in 9cfb568
This will fail if the output file already exists; thus it's not possible to pick up from a checkpoint with the same script used to initialize the model (without destroying prior output). A simple change could just be to allow `init` and `save_properties!` to fail with `try/catch` (or to avoid running those functions if a file already exists). There are also some shenanigans that'd have to be done for output files split into multiple `part`s. (We could, potentially, require `pickup=true` in the `JLD2OutputWriter` constructor, rather than changing its default behavior to accommodate auto-checkpoint-pickup.)

Another caveat is that this method of restoring checkpointed data will not work for large CPU models that consume almost all of the CPU memory (such that a single field cannot be loaded from file after `model` has been instantiated). These cases are relatively rare right now, since such large models would typically run very slowly on a single node.

The basic idea is:
We could also allow `pickup` to be an iteration number, eg

It may also be possible to enable this functionality with an environment variable; eg
Note that this design works even if `model.clock.iteration == 0`, since the initial checkpoint can be picked up.
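To make the proposal concrete, here is a rough, self-contained sketch of two pieces of logic it implies: turning a `pickup` value (a `Bool` or an iteration number) into a checkpoint file, and skipping output-file initialization when picking up. Everything here is an assumption for illustration: the `<prefix>_iteration<N>.jld2` naming convention and the functions `checkpoint_path`, `checkpoint_iterations`, `resolve_pickup`, and `initialize_output!` are hypothetical and are not part of Oceananigans.

```julia
# Assumed filename convention: "<prefix>_iteration<N>.jld2".
checkpoint_path(dir, prefix, iteration) = joinpath(dir, "$(prefix)_iteration$(iteration).jld2")

# Iteration numbers of all checkpoints in `dir` that match the convention, ascending.
function checkpoint_iterations(dir, prefix)
    head, tail = "$(prefix)_iteration", ".jld2"
    iterations = Int[]
    for name in readdir(dir)
        if startswith(name, head) && endswith(name, tail)
            stem = name[ncodeunits(head)+1:ncodeunits(name)-ncodeunits(tail)]
            n = tryparse(Int, stem)
            n === nothing || push!(iterations, n)
        end
    end
    return sort!(iterations)
end

# pickup=false: start fresh (returns nothing); pickup=true: use the latest checkpoint.
function resolve_pickup(pickup::Bool, dir, prefix)
    pickup || return nothing
    iterations = checkpoint_iterations(dir, prefix)
    isempty(iterations) && error("pickup=true but no checkpoint found in $dir")
    return checkpoint_path(dir, prefix, last(iterations))
end

# pickup=N: use the checkpoint written at iteration N.
function resolve_pickup(pickup::Integer, dir, prefix)
    path = checkpoint_path(dir, prefix, pickup)
    isfile(path) || error("no checkpoint at iteration $pickup: $path")
    return path
end

# Run `init(io)` only for a new output file; when picking up, leave existing output alone.
function initialize_output!(init, filepath; pickup=false)
    if isfile(filepath)
        pickup && return false
        error("output file $filepath already exists")
    end
    open(init, filepath, "w")
    return true
end

# Usage with empty placeholder files, including the initial (iteration 0) checkpoint.
dir = mktempdir()
for n in (0, 100, 200)
    touch(checkpoint_path(dir, "checkpoint", n))
end

latest = resolve_pickup(true, dir, "checkpoint")    # the iteration-200 file
chosen = resolve_pickup(100, dir, "checkpoint")     # the iteration-100 file
fresh  = resolve_pickup(false, dir, "checkpoint")   # nothing
```

Parsing the iteration numbers and sorting them numerically, rather than sorting filenames as strings, keeps iteration 1000 ordered after iteration 200. Because `Bool` is a subtype of `Integer` in Julia, the `Bool` method is the more specific one, so `pickup=true` is never mistaken for iteration 1.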