Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Derived properties vs. primary properties #410

Open
rartino opened this issue May 31, 2022 · 17 comments
Open

Derived properties vs. primary properties #410

rartino opened this issue May 31, 2022 · 17 comments
Labels
topic/property-standardization The specification of the precise data representation of properties and entries type/proposal Proposal for addition/removal of features. May need broad discussion to reach consensus.

Comments

@rartino
Copy link
Contributor

rartino commented May 31, 2022

@sauliusg made a remark that it would be good to know whether a property comes directly out of the underlying data source, or if it has been processed (probably to fulfill the requirements of the standard, e.g., providing Cartesian coordinates from the reduced ones).

A suggestion is to add this info to the PR for property definitions #376. There is some discussion whether it is sufficient to say this is a derived property = FALSE/TRUE, or if it is strongly desirable for a database to indicate from what primary properties a derived property has been derived from.

@merkys
Copy link
Member

merkys commented Feb 17, 2023

This issue has not been addressed by #376. However, I wonder whether the granularity is right. Suppose a database is serving both fractional and Cartesian coordinates. Some entries have only fractional, and some have only Cartesian coordinates in the underlying data, thus the database has to perform on-the-fly conversions to present both types of coordinates. Thus the database cannot say that all Cartesian/fractional coordinates are primary/derived. However, such distinction would be useful, but maybe in per-entry level?

@rartino
Copy link
Contributor Author

rartino commented Feb 23, 2023

In connection to ranged properties #452, I start to think we have a general issue related to not having any systematic way to provide per entry property metadata. In that PR it is about what the ranges are, etc., here it is metadata about whether the property is derived. We've been touching upon the issue of providing accuracy information, etc.

Shouldn't we just agree on some sensible suffix? - e.g.,
cartesian_coordinates:meta
that is a dictionary with all metadata related to the property?

The suffix could also just be cartesian_coordinates_meta, and we'd not have to change anything on the software side (it'll behave just as any other data property).

We could also go in the opposite direction and place each separate category of metadata as a separate top-level field:
cartesian_coordinates:derived
cartesian_coordinates:ranged
meaning that ":" is the "meta data operator".

@merkys
Copy link
Member

merkys commented Feb 23, 2023

I think : in property names will break the filter syntax, unless they are not supposed to be used in filters, but I guess having them queryable would be a benefit. Suffix _meta looks better to me.

@rartino
Copy link
Contributor Author

rartino commented Feb 23, 2023

Yes, ":" would require us to extend the filter syntax - but it is a quite minor extension. Using "_" means we cannot split metadata up into categories (or, at least not in more than a few categories), as every suffix restricts the namespace of property names for data.

But there seems to be very little preventing us to PR:ing a standard _meta suffix with some globally specified metadata fields that are just specified as MAY be given for any property that is included in the response. It'll only take a paragraph to say, and then, of course, the hard work is standardizing these metadata parameters. But it would also be immediately useful for provider-specific fields.

Are there any trivially agreeable metadata fields we can start with to just get the segment about the suffix merged?

@merkys
Copy link
Member

merkys commented Feb 23, 2023

Yes, ":" would require us to extend the filter syntax - but it is a quite minor extension.

I disagree, : is already used in zip comparisons, thus allowing it in property names may break our grammar.

Using "_" means we cannot split metadata up into categories (or, at least not in more than a few categories), as every suffix restricts the namespace of property names for data.

Right. Maybe we can think of some other means/separator? Can we use .meta suffix? AFAIR, . is forbidden in property names and is used to refer to nested properties instead, but maybe this is some sort of legit un-nesting?

But there seems to be very little preventing us to PR:ing a standard _meta suffix with some globally specified metadata fields that are just specified as MAY be given for any property that is included in the response. It'll only take a paragraph to say, and then, of course, the hard work is standardizing these metadata parameters. But it would also be immediately useful for provider-specific fields.

Are there any trivially agreeable metadata fields we can start with to just get the segment about the suffix merged?

In #426 (comment) @eimrek suggested specifying calculation methods for derived properties, so that could be a candidate as well.

@JPBergsma
Copy link
Contributor

JPBergsma commented Feb 23, 2023

As far as I know, the "." is used to address subfields of dictionaries. At least, I have started to use it like that for the trajectories.
(So if the optimade python tools use the EBNF grammer it is compatible.)
It is also used like that in MongoDB.
So if we make the property cartesian_coordinates_meta a dictionary, we could place all the additional meta fields in there.
We can do queries like filter=cartesian_coordinates_meta.field = "value".
I think it would be even better to also place the data and metadata into a single dictionary, so the entry looks a bit cleaner at the top level. But I can understand that you do not want to do that for backwards compatibility.

Perhaps it would be better to separate this discussion from the original issue about derived vs primary properties.

@merkys
Copy link
Member

merkys commented Feb 24, 2023

As far as I know, the "." is used to address subfields of dictionaries.

This is true. However, since we probably will not place property data and meta in a single JSON:API-sh dictionary (as you say, this would be a major change), I thought we could simulate something similar to that by having:

"cartesian_site_positions": [
    // ...
],
"cartesian_site_positions.meta": {
    // ...
},

This would not need any changes to the grammar and would still support filters like cartesian_site_positions.meta.field = "value". The drawbacks are that we would have to make .meta suffix a special case and "meta" keys would probably be disallowed in dictionary-valued properties. (Or maybe not? We can mandate that they should be used for meta when property is dictionary valued)

@rartino
Copy link
Contributor Author

rartino commented Feb 24, 2023

Good point about the conflict with ":" for zip comparisons, lets stay away from that.

I'm reluctant to overload "." to mean both data dictionary subfield and metadata suffix. I get how one can see ".meta" as a generalized subfield, but especially for a properties that are themselves data dictionaries, won't this still get confusing? And, it means no (not even database-specific) property can ever have a data field named "meta".

On the other hand, @merkys suggestion is the only one I can see that would allow any property to either come with metadata or not, and give the feeling that data and metadata are "bundled in the same dictionary". Otherwise, wouldn't referencing our most straightforward properties, e.g. nelements have to be changed into nelements.count or similar?

Also, in practice, I see barely any difference between ".meta" and "_meta". The response would still look exactly as in @merkys' example, only with . -> _. So, maybe in this case _ is still the better choice. The reason some character like ":" could be more attractive for metadata access is that it shortens metadata identifiers. I.e., instead of cartesian_site_positions.meta.derived it would be cartesian_site_positions:derived. This loses a bit of clarity, however. And, when writing out example responses, I realize that it leads to a lot of duplication of the property name...

Both "_" and "." are attractive in that they require no change to the grammar. On the other hand, a specific character for metadata access has the benefit that the grammar can be changed to specifically separate metadata access on the parser level, which in the long run may be better for implementations.

There are also many other characters we could consider that do not have any special meaning in OPTIMADE yet, e.g., "%", "#", "$", etc. But, maybe the best option is '~', since it is the only URL-safe character we have left, and thus would still work well when experimenting in command line/browser address bar, etc. So, e.g.:

"cartesian_site_positions": [
    // ...
],
"cartesian_site_positions~meta: {
    // ..
}

@merkys
Copy link
Member

merkys commented Feb 24, 2023

I'm reluctant to overload "." to mean both data dictionary subfield and metadata suffix. I get how one can see ".meta" as a generalized subfield, but especially for a properties that are themselves data dictionaries, won't this still get confusing? And, it means no (not even database-specific) property can ever have a data field named "meta".

Indeed, this might be confusing. On the other hand, dictionary-valued properties would be allowed to store their metadata inside dictionaries and have the same filters as non-dictionary-valued properties.

Alternatively we may invert the placement of metadata and designate a special dictionary property meta for every entry type:

{
    "cartesian_site_positions": [
        // ...
    ],
    "nsites": 123,
    "meta": {
        "cartesian_site_positions": {
            "is_derived": true, "derived_from": "fractional_site_positions" // ...
        },
        "nsites": {
            "are_occupancies_accounted_for": false
        }
    },
}

This would not break grammar and would only require to reserve meta property name in every entry type.

@rartino
Copy link
Contributor Author

rartino commented Feb 25, 2023

Alternatively we may invert the placement of metadata and designate a special dictionary property meta for every entry type

Interesting idea. This fits our current design very well - current implementations would essentially not need to change at all to support this. In practice, it will just look and work like any other data member. It also makes it easy to say "include all metadata for this entry" with currently existing features (response_fields=meta,...).

On the other hand, metadata wouldn't be "sorted" alongside the data it belongs to. Also, in the shorter term, there is no facility in place today for clients to selectively chose which properties' metadata to include.

Queries would take the form meta.cartesian_site_positions.is_derived which is OK but IMO slightly less elegant than even cartesian_site_positions_meta.is_derived.

To continue the general discussion; we started to discuss this feature for "very meta" metadata such as "is_derived". However, I have a feeling that if we get this in place, it will pull us towards a generally more object-oriented-like design with a lower barrier to what we consider metadata. For example, the following would be a nice syntax (and, I think, in line with what @JPBergsma wants):

?filter=cartesian_site_positions.length > 42

whereas these syntaxes will just be seen by users as having to add irrelevant cruft:

?filter=cartesian_site_positions_meta.length > 42
?filter=meta.cartesian_site_positions > 42

But, as we've discussed above, . is already our field selector for dictionary data. I don't see a way of doing the above with . and preserve our current syntax for querying subfields in dictionaries. When trying to work something out, I keep coming back to this:

?filter=cartesian_site_positions~length > 42

and a response syntax of:

{
    "cartesian_site_positions": [
        // ...
    ],
    "cartesian_site_positions~": {
        "length": 123,
        "is_derived": true,
        "derived_from": "fractional_site_positions",
        // ...
    },
    "nsites": 123
}

But, I concede that adding "~" to the language in this way is a larger change.

So, after all this discussion, I feel a bit silly coming back to that we could do the _meta suffix, but with just _. This still requires almost no change to the standard, and the syntax will be almost as simple as the ~ operator one. We would have:

?filter=cartesian_site_positions_.length > 42

with a response syntax of:

{
    "cartesian_site_positions": [
        // ...
    ],
    "cartesian_site_positions_": {
        "length": 123,
        "is_derived": true,
        "derived_from": "fractional_site_positions",
        // ...
    },
    "nsites": 123
}

Just like the _meta suffix idea, this requires almost no extra work for implementations. The primary drawback IMO is that users will keep forgetting _ in _. for metadata field access in queries.

@JPBergsma
Copy link
Contributor

JPBergsma commented Mar 2, 2023

To me, a suffix like “_meta” or “_” seems to be the best solution.

?filter=cartesian_site_positions.length > 42 would indeed be nice syntax, but I would not know how to represent this properly in JSON without effectively making two separate properties. And if we are going to do that, we might as well make a real property by using a suffix like “_meta” or “_”

The “_” is a bit of a special character, and we may want to assign a special meaning to a terminal underscore in the future, just as we do now with a leading underscore. There may also be databases that already have properties with an underscore at the end.
So I am a bit hesitant to use just an underscore. I think it would be better to use the _meta postfix. It is unlikely that a database already uses that postfix without it having more or less the same meaning as we would assign to it now.
It is also clearer for a user what the meaning of the field is when we use "_meta".

@JPBergsma
Copy link
Contributor

To get back to the original topic of this discussion: To me, the difference between Cartesian/fractional is more of a unit issue, they both describe the same value but in a different way. And no extra error/uncertainty is introduced when you convert from one to the other. Therefore, I would not consider either one as derived from the other. So I do not think this is a good example. I would rather use that flag for a property like the compressibility if it was derived from the radial distribution function, instead of calculated directly by measuring the change in density upon changing the pressure.

@rartino
Copy link
Contributor Author

rartino commented Mar 2, 2023

@JPBergsma I suppose you have a point in that assigning a special meaning to a _ suffix could interfere with someone's database-specific field, but I really doubt it to be the case and that they'd feel it is an issue to fix if upgrading to the release where we introduce this.

But, if both of you in this thread agree that this kind of mechanism is a good solution to the metadata problem - maybe let us bring the question about the precise details and syntax is to the next web meet?

To me, the difference between Cartesian/fractional is more of a unit issue, they both describe the same value but in a different way.

I think in the discussion that provoked the issue posted here, we were primarily considering derived quantities possible to get at floating point accuracy. However, even in this case the distinction between "original" and "derived" is not meaningless. A database may want to be able to faithfully represent the source data in its tables and distinguish that from quantities that have undergone numerical processing. For example, COD is built from CIF files with, in most cases, experimental data. When I download a COD structure via OPTIMADE, it may for some uses be useful to know what quantities comes directly from that source file.

I would rather use that flag for a property like the compressibility if it was derived from the radial distribution function, instead of calculated directly by measuring the change in density upon changing the pressure.

Here we IMO get into the territory of the "full experimental/computational provenance of the data"-framework mentioned in #455 (in the context of "that would be useful but this is not that"). I do not envision such a framework to fit into even a per-entry metadata property, but would be happy to look at any proposal. (My own old proposal on this goes back to #74; now closed, but probably should be reopened since it also covers this discussion).

@JPBergsma
Copy link
Contributor

But, if both of you in this thread agree that this kind of mechanism is a good solution to the metadata problem - maybe let us bring the question about the precise details and syntax is to the next web meet?

Yes, we can discuss it further on the next web meeting.

@merkys
Copy link
Member

merkys commented Jun 8, 2023

Since metadata placement discussion has branched to #462 and #463, I suggest bringing back the discussion about specifying the origin of a property.

For simplicity I would vote for binary flag (derived vs. primary) for now. In future we may extend the scheme to provide the derivation details.

@merkys merkys added topic/property-standardization The specification of the precise data representation of properties and entries type/proposal Proposal for addition/removal of features. May need broad discussion to reach consensus. labels Jun 8, 2023
@rartino
Copy link
Contributor Author

rartino commented Jun 8, 2023

@merkys But, there is a PR for this, #455 ?

@merkys
Copy link
Member

merkys commented Jun 12, 2023

@merkys But, there is a PR for this, #455 ?

PR #455 is about marking structure origin, i.e., experimental vs. DFT-derived structures. The current issue is about marking properties (or their values) as primary/derived. For example, if DFT code outputs orthogonal coordinates, they would be considered primary. If someone would convert them to fractional, such property would be marked as derived.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic/property-standardization The specification of the precise data representation of properties and entries type/proposal Proposal for addition/removal of features. May need broad discussion to reach consensus.
Projects
None yet
Development

No branches or pull requests

3 participants