
if overwrite_output=False, check if files exist first to prevent excessive runtimes when files already exist #120

kvanwerkhoven opened this issue Feb 12, 2024 · 4 comments

@kvanwerkhoven
Member

Based on the run times and the code, the loading functions appear to generate the data first and only then check whether the output file exists. A loading request may commonly be launched when the majority of files already exist (e.g., when adding one day to the start or end of the period), and one would expect the runtime to be near zero if no new output is generated. As is, a request to load data that mostly already existed took 30 minutes (similar to the runtime when the files were originally generated), although no new files were written. Is this the intended behavior?
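The behavior being requested could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `load_files`, the `generate` callback, and the one-file-per-date layout are all hypothetical stand-ins. The point is that the existence check happens before any expensive work.

```python
from pathlib import Path

def load_files(dates, output_dir, generate, overwrite_output=False):
    """Write one output file per date, skipping dates whose file already exists.

    Hypothetical sketch: `generate` stands in for the expensive
    download/transform step, and the one-file-per-date naming is assumed.
    """
    written, skipped = [], []
    for date in dates:
        out = Path(output_dir) / f"{date}.parquet"
        if not overwrite_output and out.exists():
            skipped.append(out)  # near-zero cost: no data is generated
            continue
        out.write_bytes(generate(date))  # expensive step, only when needed
        written.append(out)
    return written, skipped
```

With this ordering, re-running a request that overlaps an earlier one only pays for the dates that are actually new.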

@kvanwerkhoven
Member Author

Note: this is not the highest-priority issue, but I wanted to get it logged.

@samlamont
Collaborator

Thanks @kvanwerkhoven, I see your point. I will take a look.

@mgdenno
Contributor

mgdenno commented Feb 13, 2024

I am not saying we can't or shouldn't address this, but one consideration: I think we use the same "file writer" for all the functions/methods that write Parquet files, and in general there is not always a 1:1 relationship between a downloaded file and a written file. For example, for the USGS data we download the data and have options to write one or more dates/locations per file. Granted, for the NWM data we do have a 1:1 relationship between downloaded and written files, I think, but we have considered adding other ways to "chunk" the data such that that would no longer be the case.

I think this also relates to the fact that this is not a realtime system: one of its limitations is that we don't track what data is in the system (or in each file). Just changing a date in an import script where there is a 1:1 relationship between read and written files and grabbing a new day (or whatever) is not likely to cause any issues, but I think it does open the system up a little more to data-integrity problems, or rather makes them a little more likely; I think we already potentially have them. I think we should pause before changing this and think about if and how we should be tracking what we are downloading... or maybe querying to see what data is missing before downloading...
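The "query what is missing before downloading" idea could look something like the sketch below. Everything here is hypothetical: the project does not currently keep a manifest, and `missing_dates` / `record_dates` and the JSON-list format are invented for illustration only.

```python
import json
from pathlib import Path

def missing_dates(requested_dates, manifest_path):
    # Hypothetical manifest: a JSON list of dates already ingested.
    # Lets an import script ask "what is missing?" before downloading anything.
    path = Path(manifest_path)
    have = set(json.loads(path.read_text())) if path.exists() else set()
    return [d for d in requested_dates if d not in have]

def record_dates(ingested_dates, manifest_path):
    # Append newly ingested dates to the manifest after a successful write,
    # so the next request can skip them without touching the data files.
    path = Path(manifest_path)
    have = set(json.loads(path.read_text())) if path.exists() else set()
    path.write_text(json.dumps(sorted(have | set(ingested_dates))))
```

A manifest like this would decouple "do we have this data?" from "does a file with this name exist?", which matters once files are chunked by something other than date.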

We will see if this makes any sense in the AM when I read this again... :)

@kvanwerkhoven
Member Author

Agreed, it's definitely a larger question, and already an issue in that the blanket "overwrite_output" option does not yet check for a 1:1 correspondence between newly requested data and the existing data within the files. For NWM it's also not always 1:1: there can be discrepancies between which locations were requested the first time versus in a new download request. That is, if the event spatial extent (and thus the NWM id list) was different for a given day, there would be a mismatch. I have been ignoring that for now while I am the only user, but I can see we may need to address it sooner rather than later. I'm not sure where it should sit in priorities, or which and how many use cases outside of post-event analysis it will affect. Add it to the 'to discuss again' list for when you get back :-)
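The location-mismatch case reduces to a set difference: a file can exist for a given day yet lack some of the newly requested locations. A minimal sketch (the function name and inputs are hypothetical; in practice the ids in the file would come from reading the Parquet file itself):

```python
def locations_missing_from_file(requested_ids, ids_in_file):
    # A file can exist for a day yet lack some requested locations,
    # e.g. when the event spatial extent (and hence the NWM id list)
    # differed between the original run and the new request.
    return sorted(set(requested_ids) - set(ids_in_file))
```

A non-empty result means the file exists but is incomplete for the new request, so a simple "file exists" check would wrongly skip it.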
