Data type mismatches in model specification #603

ConnectedSystems · 2023-11-24T05:50:20Z

Model specification as generated by ADRIA.model_spec(dom) and loaded through rs = load_results(path) do not have the same data types for the same columns.

This leads to issues particularly for analyses which may expect Symbols vs Strings and vice versa (see for example: #601 (comment))

The text was updated successfully, but these errors were encountered:

ConnectedSystems · 2023-11-24T05:52:26Z

Adding excerpts from the two model specifications.

Note the different column data types

julia> ADRIA.model_spec(dom)[:, [:component, :fieldname, :val, :bounds, :default_bounds]]
253×5 DataFrame
 Row │ component           fieldname                          val         bounds                       default_bounds
     │ String              Symbol                             Real        Tuple…                       Tuple…
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ EnvironmentalLayer  dhw_scenario                        1          (1.0, 51.0)                  (1.0, 51.0)
   2 │ EnvironmentalLayer  wave_scenario                       1          (1.0, 51.0)                  (1.0, 51.0)
   3 │ Intervention        guided                              0          (-1.0, 19.0)                 (-1.0, 19.0)

julia> rs.model_spec[:, [:component, :fieldname, :val, :bounds, :default_bounds]]
253×5 DataFrame
 Row │ component           fieldname                          val         bounds                             default_bounds
     │ String31            String                             Float64     String                             String
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ EnvironmentalLayer  dhw_scenario                        1.0        (1.0, 51.0)                        (1.0, 51.0)
   2 │ EnvironmentalLayer  wave_scenario                       1.0        (1.0, 51.0)                        (1.0, 51.0)
   3 │ Intervention        guided                              0.0        (-1.0, 19.0)                       (-1.0, 19.0)

Zapiano · 2023-11-27T04:20:19Z

At the line

ADRIA.jl/src/io/result_io.jl

Line 408 in 9422a04

    
           model_spec = CSV.read(joinpath(result_loc, MODEL_SPEC, "model_spec.csv"), DataFrame; comment="#")

, when model_spec.csv is loaded, it is possible to pass an argument types with an array of types for each column. The question is: how to get this list of types matching ADRIA.model_spec(dom) - since dom is not available inside this function?

I don't know what the best approach would be, but what do you think about having one row with types for each col inside model_spec.csv ? Or do you have any other ideas?

ConnectedSystems · 2023-11-27T04:40:25Z

what do you think about having one row with types for each col inside model_spec.csv ? Or do you have any other ideas?

Yes, that's one approach - to list the types in a comment directly above the column headers and make that part of the specification.

Alternatively, switch to Zarr format instead of CSV and put the metadata of column types somewhere in Zarr.

Or specify it in the ResultSet metadata (inside the datapackage.json file?)...

Personally I like the Zarr-based solution, but it would come with additional complications...

Zapiano · 2023-11-27T05:05:45Z

I also like the Zarr-based solution. I don't know what are the complications (can you elaborate?) but one problem I see is that is it harder to implement than just adding a commented row to the csv.

ConnectedSystems · 2023-12-20T07:16:15Z

harder to implement than just adding a commented row to the csv

Yes easier, but doing that would open a whole lot of security implications. Because we'd be evaluating string as code to obtain the equivalent data types.

Actually, this is true for all the ideas above (but somewhat less so for Zarr).

Maybe I'm overthinking it. Let's sleep on it.

Zapiano · 2023-12-20T15:19:00Z

harder to implement than just adding a commented row to the csv

Yes easier, but doing that would open a whole lot of security implications. Because we'd be evaluating string as code to obtain the equivalent data types.

Actually, this is true for all the ideas above (but somewhat less so for Zarr).

Maybe I'm overthinking it. Let's sleep on it.

I was thinking: is it possible to just pass dom or, at least, a vector of types, as an argument to load_results? This function is being called in a lot of places so I'm not completely sure, but I see that two main places are inside combine_results and inside run_scenarios. I didn't find any call to combine_results inside ADRIA, how/where/when is this being used? But about run_scenarios, we have the domain there, we could pass the types as arguments to load_results.

ConnectedSystems · 2023-12-20T15:33:30Z

I was thinking: is it possible to just pass dom or, at least, a vector of types, as an argument to load_results?

Do you remember what the motivation is behind keeping the Domain and ResultSet separate?

ConnectedSystems · 2023-12-20T15:42:34Z

Maybe, if we're willing to drop support for any prior versions, the simplest solution is to take the expected types from within ADRIA itself.

This would be simple to do as we can take the types from the model spec, as generated from the default factor values...

I didn't find any call to combine_results inside ADRIA, how/where/when is this being used?

This was the first thing you worked on. I said not to delete the function because we still want to be able to merge ResultSets. Later we want to be able to add to an existing ResultSet and other things too.

Zapiano · 2023-12-21T14:27:55Z

I was thinking: is it possible to just pass dom or, at least, a vector of types, as an argument to load_results?

Do you remember what the motivation is behind keeping the Domain and ResultSet separate?

Yes, you're right, sorry. Stupid idea.

Zapiano · 2023-12-21T14:37:36Z

Maybe, if we're willing to drop support for any prior versions, the simplest solution is to take the expected types from within ADRIA itself.

I think now is a good time to implement changes that are important like this but have the cost of dropping support to prior versions. Soon we will release 1.0.0 (right?) anyway.

That said, what do you mean by "take the types from within ADRIA"? What I like about having the types saved somewhere within result set is that it allows us support other models/domains in the future. So far I'm leaning towards the Zarr-based solution.

I didn't find any call to combine_results inside ADRIA, how/where/when is this being used?

This was the first thing you worked on. I said not to delete the function because we still want to be able to merge ResultSets. Later we want to be able to add to an existing ResultSet and other things too.

Ok, I remember that.

ConnectedSystems · 2023-12-21T23:39:19Z

That said, what do you mean by "take the types from within ADRIA"?

The model specification is defined by the composition of the various components (Intervention, CriteriaWeight, EnvironmentalLayer, Coral).

I think only EnvironmentalLayer requires Domain metadata (e.g., the number of available trajectories), but have to check this.

Given we only want the data types, I guess it should be possible to create a model spec with just the default values and infer the types from that.

The Zarr-based approach has the same issue where we'd have to infer the correct types from a metadata string - running strings as code should be avoided. But again, maybe I'm overthinking it.

ConnectedSystems added the bug Something isn't working label Nov 24, 2023

ConnectedSystems mentioned this issue Nov 24, 2023

Fix treatment of categorical factors in regional sensitivity analysis #601

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data type mismatches in model specification #603

Data type mismatches in model specification #603

ConnectedSystems commented Nov 24, 2023

ConnectedSystems commented Nov 24, 2023 •

edited

Loading

Zapiano commented Nov 27, 2023 •

edited

Loading

ConnectedSystems commented Nov 27, 2023 •

edited

Loading

Zapiano commented Nov 27, 2023

ConnectedSystems commented Dec 20, 2023

Zapiano commented Dec 20, 2023

ConnectedSystems commented Dec 20, 2023 •

edited

Loading

ConnectedSystems commented Dec 20, 2023 •

edited

Loading

Zapiano commented Dec 21, 2023

Zapiano commented Dec 21, 2023

ConnectedSystems commented Dec 21, 2023 •

edited

Loading

Data type mismatches in model specification #603

Data type mismatches in model specification #603

Comments

ConnectedSystems commented Nov 24, 2023

ConnectedSystems commented Nov 24, 2023 • edited Loading

Zapiano commented Nov 27, 2023 • edited Loading

ConnectedSystems commented Nov 27, 2023 • edited Loading

Zapiano commented Nov 27, 2023

ConnectedSystems commented Dec 20, 2023

Zapiano commented Dec 20, 2023

ConnectedSystems commented Dec 20, 2023 • edited Loading

ConnectedSystems commented Dec 20, 2023 • edited Loading

Zapiano commented Dec 21, 2023

Zapiano commented Dec 21, 2023

ConnectedSystems commented Dec 21, 2023 • edited Loading

ConnectedSystems commented Nov 24, 2023 •

edited

Loading

Zapiano commented Nov 27, 2023 •

edited

Loading

ConnectedSystems commented Nov 27, 2023 •

edited

Loading

ConnectedSystems commented Dec 20, 2023 •

edited

Loading

ConnectedSystems commented Dec 20, 2023 •

edited

Loading

ConnectedSystems commented Dec 21, 2023 •

edited

Loading