Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aliases now also work for nested fields; Only retrieve data required for constructing a response from the database. #1304

Open
wants to merge 28 commits into
base: main
Choose a base branch
from

Conversation

JPBergsma
Copy link
Contributor

@JPBergsma JPBergsma commented Aug 17, 2022

I have put some of the changes I made in Barcelona and some more things in this PR.
I got a bit carried away and in hindsight I should probably have split it up in two or more PRs as it has become quite large and involves changes that could have been separated.

I am also still a bit worried that I may have changed too much and that there could be problems with code depending on the optimade python tools.

It now allows mapping to arbitrary nesting levels as long as the fields are separated by ".", which is similar to how they are handled in MongoDB.
Example: "aliases": { "structures": {"OPTIMADE_field": "field.nested_field"} }
It also allows adding database specific fields to top level optimade dictionaries like: "species"
So if you add "species.oxidation_state" to the provider_fields, it will be presented as "species._exmpl_oxidation_state"

The mapping is now done in two steps. One step for removing/adding the prefixes and one step for mapping the optimade field to the backend specific field.

The all_aliases method therefore now only returns the mapping between the optimade fields and the backend fields.
The all_prefixed_fields now contains the pairs of prefixed and unprefixed optimade fields.

Another change that is in this PR is that only the requested fields are now retrieved from the Mongo database. This caused some issues with validation, because fields that are required for a normal response are now no longer present when the response_fields parameter is present. I there for had to make the validation less strict in a few places.

@codecov
Copy link

codecov bot commented Aug 17, 2022

Codecov Report

Merging #1304 (f174850) into master (105b501) will increase coverage by 0.15%.
The diff coverage is 99.24%.

@@            Coverage Diff             @@
##           master    #1304      +/-   ##
==========================================
+ Coverage   91.35%   91.51%   +0.15%     
==========================================
  Files          74       74              
  Lines        4374     4454      +80     
==========================================
+ Hits         3996     4076      +80     
  Misses        378      378              
Flag Coverage Δ
project 91.51% <99.24%> (+0.15%) ⬆️
validator 91.57% <100.00%> (+0.11%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
optimade/utils.py 71.56% <97.22%> (+13.35%) ⬆️
optimade/models/jsonapi.py 93.93% <100.00%> (+0.44%) ⬆️
optimade/models/links.py 100.00% <100.00%> (ø)
optimade/models/references.py 100.00% <100.00%> (+2.27%) ⬆️
optimade/server/entry_collections/elasticsearch.py 97.53% <100.00%> (ø)
...made/server/entry_collections/entry_collections.py 97.98% <100.00%> (+0.19%) ⬆️
optimade/server/mappers/entries.py 99.29% <100.00%> (+0.19%) ⬆️
optimade/server/routers/utils.py 96.39% <100.00%> (-0.10%) ⬇️
optimade/validator/validator.py 83.48% <100.00%> (+0.02%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@JPBergsma JPBergsma marked this pull request as ready for review August 18, 2022 12:16
@JPBergsma JPBergsma added the on-hold For PRs/issues that are on-hold for an unspecified time label Sep 14, 2022
@JPBergsma
Copy link
Contributor Author

I still made some edits in this area during my stay in Barcelona, so I would like to add those changes to this PR as well.

@JPBergsma JPBergsma changed the title Alow aliases to be nested fields. Aliases now also work for nested fields; Only retrieve data required for constructing a response from the database. Nov 16, 2022
@JPBergsma JPBergsma added enhancement New feature or request config For issues/PRs related to the server config. and removed on-hold For PRs/issues that are on-hold for an unspecified time labels Nov 16, 2022
… and removed fallback value for get_non_optional_fields.
@ml-evs
Copy link
Member

ml-evs commented Nov 16, 2022

Is this ready for review now then @JPBergsma ?

@JPBergsma
Copy link
Contributor Author

Yes, you can review it now. I have also updated the text at the start of this PR, so you may want to reread it.

Copy link
Member

@ml-evs ml-evs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the hard effort on this @JPBergsma! I've focused this first review on the response fields aspect without looking too deeply at the aliases. Whilst the nested aliases would be nice, we should make sure it doesn't effect performance for server's that don't use them (perhaps just hiding all of that functionality behind a switch based on whether there are dotted aliases in the server config).

I agree that this is definitely two separate PRs, but understand if it now too hard to disentangle them.

I think selecting which fields to grab from the database could be useful going forward (though I guess in the trajectories case they might be so big as to live in a different collection/table anyway?), however I think the current implementation falls a bit short in terms of performance and changes to the existing Python API that others have built their servers on. Whilst we shouldn't feel completely married to the existing way we do things (i.e., we can introduce breaking changes where necessary before v1), we should bear in mind that this will probably be a prohibitive blocker for APIs using this package (Materials Project for example is still using v0.13.3 from 18 months ago.

optimade/models/links.py Show resolved Hide resolved
docs/getting_started/setting_up_an_api.md Outdated Show resolved Hide resolved
Comment on lines +174 to +176
for result in results:
for field in include_fields:
set_field_to_none_if_missing_in_dict(result["attributes"], field)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to incur a significant performance overhead, but I guess you want to do it so that you don't have to pull e.g., entire trajectories from the database each time, yet you still want to deserialize the JSON into your classes? I think I would suggest we instead have a per-collection deserialization flag, as presumably you only want to deserialize trajectories once (on database insertion) anyway. Does that make sense?

If you want to retain this approach, it might be cleaner to do it at the pydantic level, e.g., a root_validator that explicitly sets all missing fields to None (see also https://pydantic-docs.helpmanual.io/usage/models/#creating-models-without-validation as an option).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think this is particularly heavy compared to all the other things we do in the code.
It is what we previously did in handle_response_fields. I only moved it here, so we can do it before the deserialization and added code to handle nested fields.

For biomolecular data, a structure can easily have 10,000 atoms, so retrieving them from the database and putting them in the model would take some time. This way we can avoid that if the species_at_sites and cartesian_site_positions are not in the response_fields. (I also made a patch in the code for Barcelona that allowed them to specify the default response fields, so they can choose to not have these fields in the response by default.)

I did not want to make the change even bigger by bypassing the rest of the validator (as in your second suggestion).
But from a performance viewpoint, bypassing the validation would be good.
Do you want me to add this to this PR or to a future PR ?

I tried the root validator idea, but it seems I already get an error before the root_validator is executed, so I do not think this solution will work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, fair enough, just looks a bit scarier as a double for loop alongside the recursive descent into dictionaries to get the nested aliases. It's quite hard to reason about this, so I might set up a separate repo for measuring performance in the extreme limits (1 structure of 10000 atoms vs 10000 structures of a few atoms -- i.e., what we have now, ignoring pagination of course).

I tried the root validator idea, but it seems I already get an error before the root_validator is executed, so I do not think this solution will work.

Did you use @root_validator(pre=True) to make sure it gets run before anything else? Perhaps bypassing the validation altogether can wait for another PR, as you suggest, but I'd like to make the performance tests first at least (doesn't necessarily hold up this PR but it might hold up the next release).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed I made a mistake in my test script.
So it may be possible to do this with a root_validator after all.

Copy link
Contributor Author

@JPBergsma JPBergsma Dec 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a root_validator to the attributes class that, if a flag is set, checks whether all required fields are present and if not adds them and sets them to 0.

I'll try to put the handling of the other include fields back to the place where it happened originally so the code changes less.

optimade/server/entry_collections/entry_collections.py Outdated Show resolved Hide resolved
Comment on lines -109 to +107
+ cls.ALIASES
]
+ list(cls.PROVIDER_FIELDS)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diff is getting very confused between these similar functions, could you add the new stuff below instead all_aliases instead?

Comment on lines -364 to +408
def deserialize(
cls, results: Union[dict, Iterable[dict]]
) -> Union[List[EntryResource], EntryResource]:
if isinstance(results, dict):
return cls.ENTRY_RESOURCE_CLASS(**cls.map_back(results))
def add_alias_and_prefix(cls, doc: dict) -> dict:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue here with the diff, it is hard to see the deserialize has also been changed as well as moved. Could you put your new stuff underneath it?

Comment on lines +444 to +445
def deserialize(cls, results: Iterable[dict]) -> List[EntryResource]:
return [cls.ENTRY_RESOURCE_CLASS(**result) for result in results]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just generally I don't like that we are changing the functionality of existing methods (no longer doing the mapping) like deserialize as mappers are used elsewhere beyond just the server (and potentially by other users).

def handle_response_fields(
def remove_exclude_fields(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the removal of handle_response_fields really necessary? Again, my server code uses this independently of this package. We need to be careful not to force people into a rewrite of their existing stuff for the sake of new features (that they may or may not use)

@@ -105,6 +105,7 @@
"pydantic~=1.10,>=1.10.2",
"email_validator~=1.2",
"requests~=2.28",
"pyyaml>=5.4, <7", # Keep at pyyaml 5.4 for aiida-core support
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you pin down which part of the non-server stuff needs pyyaml? I assume it was just for reading the config files in yaml format previously

@JPBergsma JPBergsma force-pushed the JPBergsma/allow_nested_aliases branch from 925a5b7 to 1e49fb8 Compare November 30, 2022 13:37
@JPBergsma JPBergsma force-pushed the JPBergsma/allow_nested_aliases branch from 0a551aa to 596bb73 Compare November 30, 2022 14:02
Comment on lines +301 to +313
@root_validator(pre=True)
def set_missing_to_none(cls, values):
if "set_missing_to_none" in values and values.pop("set_missing_to_none"):
for field in cls.schema()["required"]:
if field not in values:
if (
field == "structure_features"
): # It would be nice if there would be a more universal way to handle special cases like this.
values[field] = []
else:
values[field] = None
return values

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks on the right track! I've just played around with something too after spotting something in the pydantic docs about default factories. Would the snippet also solve this issue?

>>> from pydantic import BaseModel, Field
>>> from typing import Optional
>>> class Model(BaseModel):
...     test: Optional[str] = Field(default_factory=lambda: None)
...
>>> Model()
Model(test=None)
>>> Model(test=None)
Model(test=None)
>>> Model(test="value")
Model(test='value')
>>> Model.schema()
{'title': 'Model', 'type': 'object', 'properties': {'test': {'title': 'Test', 'type': 'string'}}}

We can then patch the underlying OptimadeField and StrictField wrappers to default to having a default_factory that returns null in cases where there is no default value for the field to fall back on, and we can do this without modifying the schema or the models.

The only concern is that this functionality might get removed from pydantic:

The default_factory argument is in beta, it has been added to pydantic in v1.5 on a provisional basis. It may change significantly in future releases and its signature or behaviour will not be concrete until v2.

Though it has already lasted a few versions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
config For issues/PRs related to the server config. enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants