Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data type mismatches in model specification #603

Open
ConnectedSystems opened this issue Nov 24, 2023 · 11 comments
Open

Data type mismatches in model specification #603

ConnectedSystems opened this issue Nov 24, 2023 · 11 comments
Labels
bug Something isn't working

Comments

@ConnectedSystems
Copy link
Collaborator

Model specification as generated by ADRIA.model_spec(dom) and loaded through rs = load_results(path) do not have the same data types for the same columns.

This leads to issues particularly for analyses which may expect Symbols vs Strings and vice versa (see for example: #601 (comment))

@ConnectedSystems ConnectedSystems added the bug Something isn't working label Nov 24, 2023
@ConnectedSystems
Copy link
Collaborator Author

ConnectedSystems commented Nov 24, 2023

Adding excerpts from the two model specifications.

Note the different column data types

julia> ADRIA.model_spec(dom)[:, [:component, :fieldname, :val, :bounds, :default_bounds]]
253×5 DataFrame
 Row │ component           fieldname                          val         bounds                       default_bounds
     │ String              Symbol                             Real        Tuple                       Tuple
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ EnvironmentalLayer  dhw_scenario                        1          (1.0, 51.0)                  (1.0, 51.0)
   2 │ EnvironmentalLayer  wave_scenario                       1          (1.0, 51.0)                  (1.0, 51.0)
   3 │ Intervention        guided                              0          (-1.0, 19.0)                 (-1.0, 19.0)

julia> rs.model_spec[:, [:component, :fieldname, :val, :bounds, :default_bounds]]
253×5 DataFrame
 Row │ component           fieldname                          val         bounds                             default_bounds
     │ String31            String                             Float64     String                             String
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ EnvironmentalLayer  dhw_scenario                        1.0        (1.0, 51.0)                        (1.0, 51.0)
   2 │ EnvironmentalLayer  wave_scenario                       1.0        (1.0, 51.0)                        (1.0, 51.0)
   3 │ Intervention        guided                              0.0        (-1.0, 19.0)                       (-1.0, 19.0)

@Zapiano
Copy link
Collaborator

Zapiano commented Nov 27, 2023

At the line

model_spec = CSV.read(joinpath(result_loc, MODEL_SPEC, "model_spec.csv"), DataFrame; comment="#")
, when model_spec.csv is loaded, it is possible to pass an argument types with an array of types for each column. The question is: how to get this list of types matching ADRIA.model_spec(dom) - since dom is not available inside this function?

I don't know what the best approach would be, but what do you think about having one row with types for each col inside model_spec.csv ? Or do you have any other ideas?

@ConnectedSystems
Copy link
Collaborator Author

ConnectedSystems commented Nov 27, 2023

what do you think about having one row with types for each col inside model_spec.csv ? Or do you have any other ideas?

Yes, that's one approach - to list the types in a comment directly above the column headers and make that part of the specification.

Alternatively, switch to Zarr format instead of CSV and put the metadata of column types somewhere in Zarr.

Or specify it in the ResultSet metadata (inside the datapackage.json file?)...

Personally I like the Zarr-based solution, but it would come with additional complications...

@Zapiano
Copy link
Collaborator

Zapiano commented Nov 27, 2023

I also like the Zarr-based solution. I don't know what are the complications (can you elaborate?) but one problem I see is that is it harder to implement than just adding a commented row to the csv.

@ConnectedSystems
Copy link
Collaborator Author

harder to implement than just adding a commented row to the csv

Yes easier, but doing that would open a whole lot of security implications. Because we'd be evaluating string as code to obtain the equivalent data types.

Actually, this is true for all the ideas above (but somewhat less so for Zarr).

Maybe I'm overthinking it. Let's sleep on it.

@Zapiano
Copy link
Collaborator

Zapiano commented Dec 20, 2023

harder to implement than just adding a commented row to the csv

Yes easier, but doing that would open a whole lot of security implications. Because we'd be evaluating string as code to obtain the equivalent data types.

Actually, this is true for all the ideas above (but somewhat less so for Zarr).

Maybe I'm overthinking it. Let's sleep on it.

I was thinking: is it possible to just pass dom or, at least, a vector of types, as an argument to load_results? This function is being called in a lot of places so I'm not completely sure, but I see that two main places are inside combine_results and inside run_scenarios. I didn't find any call to combine_results inside ADRIA, how/where/when is this being used? But about run_scenarios, we have the domain there, we could pass the types as arguments to load_results.

@ConnectedSystems
Copy link
Collaborator Author

ConnectedSystems commented Dec 20, 2023

I was thinking: is it possible to just pass dom or, at least, a vector of types, as an argument to load_results?

Do you remember what the motivation is behind keeping the Domain and ResultSet separate?

@ConnectedSystems
Copy link
Collaborator Author

ConnectedSystems commented Dec 20, 2023

Maybe, if we're willing to drop support for any prior versions, the simplest solution is to take the expected types from within ADRIA itself.

This would be simple to do as we can take the types from the model spec, as generated from the default factor values...

I didn't find any call to combine_results inside ADRIA, how/where/when is this being used?

This was the first thing you worked on. I said not to delete the function because we still want to be able to merge ResultSets. Later we want to be able to add to an existing ResultSet and other things too.

@Zapiano
Copy link
Collaborator

Zapiano commented Dec 21, 2023

I was thinking: is it possible to just pass dom or, at least, a vector of types, as an argument to load_results?

Do you remember what the motivation is behind keeping the Domain and ResultSet separate?

Yes, you're right, sorry. Stupid idea.

@Zapiano
Copy link
Collaborator

Zapiano commented Dec 21, 2023

Maybe, if we're willing to drop support for any prior versions, the simplest solution is to take the expected types from within ADRIA itself.

I think now is a good time to implement changes that are important like this but have the cost of dropping support to prior versions. Soon we will release 1.0.0 (right?) anyway.

That said, what do you mean by "take the types from within ADRIA"? What I like about having the types saved somewhere within result set is that it allows us support other models/domains in the future. So far I'm leaning towards the Zarr-based solution.

I didn't find any call to combine_results inside ADRIA, how/where/when is this being used?

This was the first thing you worked on. I said not to delete the function because we still want to be able to merge ResultSets. Later we want to be able to add to an existing ResultSet and other things too.

Ok, I remember that.

@ConnectedSystems
Copy link
Collaborator Author

ConnectedSystems commented Dec 21, 2023

That said, what do you mean by "take the types from within ADRIA"?

The model specification is defined by the composition of the various components (Intervention, CriteriaWeight, EnvironmentalLayer, Coral).

I think only EnvironmentalLayer requires Domain metadata (e.g., the number of available trajectories), but have to check this.

Given we only want the data types, I guess it should be possible to create a model spec with just the default values and infer the types from that.

The Zarr-based approach has the same issue where we'd have to infer the correct types from a metadata string - running strings as code should be avoided. But again, maybe I'm overthinking it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants