
if overwrite_output=False, check if files exist first to prevent excessive runtimes when files already exist #120

kvanwerkhoven opened this issue Feb 12, 2024 · 4 comments

@kvanwerkhoven
Member

Based on the run times and the code, the loading functions appear to generate the data first and only then check whether the output file exists. A loading request may commonly be launched when the majority of files already exist (e.g., when adding one day to the start or end of the period), and one would expect the runtime to be near zero if no new output is generated. As is, a request to load data that mostly already existed took 30 minutes (similar to the runtime when the files were originally generated), although no new files were written. Is this the intended behavior?
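The behavior being requested could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `load_files`, the `generate` callback, and the one-file-per-date layout are all hypothetical stand-ins. The point is that the existence check happens before any expensive work.

```python
from pathlib import Path

def load_files(dates, output_dir, generate, overwrite_output=False):
    """Write one output file per date, skipping dates whose file already exists.

    Hypothetical sketch: `generate` stands in for the expensive
    download/transform step, and the one-file-per-date naming is assumed.
    """
    written, skipped = [], []
    for date in dates:
        out = Path(output_dir) / f"{date}.parquet"
        if not overwrite_output and out.exists():
            skipped.append(out)  # near-zero cost: no data is generated
            continue
        out.write_bytes(generate(date))  # expensive step, only when needed
        written.append(out)
    return written, skipped
```

With this ordering, re-running a request that overlaps an earlier one only pays for the dates that are actually new.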

@kvanwerkhoven
Member Author

Note: this is not the highest-priority issue, but I wanted to get it logged.

@samlamont
Collaborator

Thanks @kvanwerkhoven, I see your point. I will take a look.

@mgdenno
Contributor

mgdenno commented Feb 13, 2024

I am not saying we can't or shouldn't address this, but one consideration: I think we use the same "file writer" for all the functions/methods that write Parquet files, and in general there is not always a 1:1 relationship between a downloaded file and a written file. For example, for the USGS data we download the data and have options to write one or more dates/locations per file. Granted, for the NWM data we do have a 1:1 relationship between downloaded and written files, I think, but we have considered adding other ways to "chunk" the data such that that would no longer be the case.

I think this also relates to the fact that this is not a realtime system: one of its limitations is that we don't track what data is in the system (or in each file). Just changing a date in an import script where there is a 1:1 relationship between read and written files and grabbing a new day (or whatever) is not likely to cause any issues, but I think it does open the system up a little more to data-integrity problems, or rather makes them a little more likely; I think we already potentially have them. I think we should pause before changing this and think about if and how we should be tracking what we are downloading... or maybe querying to see what data is missing before downloading...
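The "query what is missing before downloading" idea could look something like the sketch below. Everything here is hypothetical: the project does not currently keep a manifest, and `missing_dates` / `record_dates` and the JSON-list format are invented for illustration only.

```python
import json
from pathlib import Path

def missing_dates(requested_dates, manifest_path):
    # Hypothetical manifest: a JSON list of dates already ingested.
    # Lets an import script ask "what is missing?" before downloading anything.
    path = Path(manifest_path)
    have = set(json.loads(path.read_text())) if path.exists() else set()
    return [d for d in requested_dates if d not in have]

def record_dates(ingested_dates, manifest_path):
    # Append newly ingested dates to the manifest after a successful write,
    # so the next request can skip them without touching the data files.
    path = Path(manifest_path)
    have = set(json.loads(path.read_text())) if path.exists() else set()
    path.write_text(json.dumps(sorted(have | set(ingested_dates))))
```

A manifest like this would decouple "do we have this data?" from "does a file with this name exist?", which matters once files are chunked by something other than date.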

We will see if this makes any sense in the AM when I read this again... :)

@kvanwerkhoven
Member Author

Agreed, it's definitely a larger question, and already an issue in that the blanket "overwrite_output" option does not yet check for a 1:1 correspondence between newly requested data and the existing data within the files. For NWM it's also not always 1:1: there can be discrepancies between which locations were requested the first time versus in a new download request. That is, if the event spatial extent (and thus the NWM id list) was different for a given day, there would be a mismatch. I have been ignoring that for now while I am the only user, but I can see we may need to address it sooner rather than later. I'm not sure where it should sit in priorities, or which and how many use cases outside of post-event analysis it will affect. Add it to the 'to discuss again' list for when you get back :-)
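The location-mismatch case reduces to a set difference: a file can exist for a given day yet lack some of the newly requested locations. A minimal sketch (the function name and inputs are hypothetical; in practice the ids in the file would come from reading the Parquet file itself):

```python
def locations_missing_from_file(requested_ids, ids_in_file):
    # A file can exist for a day yet lack some requested locations,
    # e.g. when the event spatial extent (and hence the NWM id list)
    # differed between the original run and the new request.
    return sorted(set(requested_ids) - set(ids_in_file))
```

A non-empty result means the file exists but is incomplete for the new request, so a simple "file exists" check would wrongly skip it.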
