Handling NaNs from ElementProperty #898

Open
gbrunin opened this issue Apr 13, 2023 · 3 comments

Comments

@gbrunin (Contributor)

gbrunin commented Apr 13, 2023

When using an ElementProperty featurizer, some elemental data may be missing, e.g., the bulk modulus of Ga is not stored in pymatgen's Element data for Ga. In such cases, the featurizer returns a NaN.
There are different ways to handle this, and for now it is left to the user to deal with it however they prefer. The possible approaches would be:

  • ignore such a feature entirely, which is a pity if only a small fraction of the dataset has a NaN for this feature
  • replace the missing values with a constant: either a value completely outside the range of possibilities, or the mean of the feature over the dataset. The former has the advantage of being simple to put in place, while for the latter the mean of the feature must be stored somewhere so it can be reused when the feature is NaN for a new prediction (which may not have access to the original dataset).
    I believe these two possibilities could be implemented in matminer as an optional post-processing step (see the sketch below). This is debatable, since it could also be left to the user.
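For concreteness, a minimal pandas sketch of these two post-processing options on a featurized dataframe (the dataframe and values here are hypothetical):

```python
import pandas as pd

# Hypothetical featurized dataframe with a NaN in one feature column
df = pd.DataFrame({"feat_a": [1.0, float("nan"), 3.0],
                   "feat_b": [4.0, 5.0, 6.0]})

# Option 1: drop any feature column containing a NaN
df_dropped = df.dropna(axis=1)

# Option 2a: fill NaNs with a constant far outside the feature's range
df_sentinel = df.fillna(-9999.0)

# Option 2b: fill NaNs with the column means, storing the means so the same
# values can be reused when featurizing new, unseen compositions
means = df.mean()
df_imputed = df.fillna(means)
```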

I see another possibility that could be implemented in matminer and that the user has no easy way to do themselves. When a value for an element is not found in the data, ElementProperty could replace it with the mean of the values for all other elements. This differs from using the mean of the feature over the whole dataset in that it is not biased by the dataset (the user could want one or the other), and nothing needs to be done for new predictions since the treatment is internal to the featurizer. The ElementProperty featurizer would then never return NaNs for missing-data reasons. This could be triggered with an optional argument, e.g., ElementProperty(missing_is_mean=True).
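As a rough sketch of that mechanism (the names and values below are hypothetical, not matminer internals): the missing elemental value is filled with the mean over all elements that do have data, before any per-composition statistics are taken.

```python
import numpy as np

# Hypothetical elemental values for one property; None marks missing data
elem_vals = {"Nb": 170.0, "Co": 180.0, "Sn": 58.0, "O": 20.0, "Ga": None}

# Mean over the elements that do have data -- independent of any dataset
fill = float(np.mean([v for v in elem_vals.values() if v is not None]))

# Fill the gaps before computing composition-level statistics
imputed = {el: (fill if v is None else v) for el, v in elem_vals.items()}
```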

If you think this is a good addition to matminer, I would be happy to submit a PR with whichever solution you think is best. I would actually be in favor of implementing all of them, to leave the choice to the user while making their life easier.

@ardunn (Contributor)

ardunn commented Jun 8, 2023

Hey @gbrunin, let me address some of these. For your first bullet point, I think letting users filter features as they see fit using pandas operations is probably best. As for the second bullet, replacing values with an out-of-range constant strikes me as potentially nefarious, because it can produce very strange values and the user would have to dig deep to figure out where they came from. That kind of postprocessing is probably more easily handled by the users as they see fit.

I do like your third idea:

> I see another possibility that could be implemented in matminer and that the user has no easy way to do themselves. When a value for an element is not found in the data, ElementProperty could replace it with the mean of the values for all other elements. This differs from using the mean of the feature over the whole dataset in that it is not biased by the dataset (the user could want one or the other), and nothing needs to be done for new predictions since the treatment is internal to the featurizer. The ElementProperty featurizer would then never return NaNs for missing-data reasons. This could be triggered with an optional argument, e.g., ElementProperty(missing_is_mean=True).

I believe the technical term for this is "imputation". Note that you can already impute features with pandas when doing ML experiments with matminer. That said, your idea looks like a better version of imputation, because we would retain as much information as possible about the ElementProperty attributes while avoiding NaNs.

Why this seems useful

So, normally, let's say you were using ElementProperty to get the mean of <feature1> over a composition, and <feature1> is not defined for Ga (I'm using made-up numbers here). You'd get:

| composition | ElementProperty mean <feature1> |
| --- | --- |
| NbCoSn | 12 |
| Ga2O3 | NaN |
| SiO2 | 16 |

Although <feature1> is defined for oxygen, the information about oxygen is lost for Ga2O3. Then, if you were to do "normal" imputation (from the defined samples, let's say by their mean):

| composition | ElementProperty mean <feature1> |
| --- | --- |
| NbCoSn | 12 |
| Ga2O3 | 14 (does not use <feature1> for O) |
| SiO2 | 16 |

I understand you want to impute the elemental attribute before the statistics are computed over each individual composition. This would, for example, use <feature1> for oxygen and use the mean of <feature1> over all elements as the value for Ga. You might then get something more accurate, like:

| composition | ElementProperty mean <feature1> |
| --- | --- |
| NbCoSn | 12 |
| Ga2O3 | 19 (does use <feature1> for O) |
| SiO2 | 16 |
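To make the contrast explicit, here is the arithmetic behind those made-up numbers (the elemental values are assumptions chosen to match the tables above):

```python
# Hypothetical elemental <feature1> values: O = 20, Ga missing
O_VAL = 20.0

# "Normal" imputation: Ga2O3 featurizes to NaN and then receives the mean
# of the defined samples; oxygen's value never enters.
normal = (12 + 16) / 2  # 14

# Per-element imputation: fill Ga with the mean of <feature1> over all
# elements that have data (say 18), then take the stoichiometry-weighted
# mean for Ga2O3.
GA_FILL = 18.0  # hypothetical mean over known elements
per_element = (2 * GA_FILL + 3 * O_VAL) / 5  # (36 + 60) / 5 = 19.2, i.e. ~19
```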

Closing thoughts

It might be nice to see some examples of this third idea working in action... i.e., to see whether this kind of imputation would actually be more useful than plain imputation with pandas.

To save you some time, if you are still interested in making a PR, I'd focus on your third idea.

@gbrunin (Contributor, Author)

gbrunin commented Jun 8, 2023

Thanks for your comments @ardunn. I agree that the third option is best; I just wanted to lay out all the ways of handling these NaNs that I could see. I'll start working on a PR for this imputation, and we'll see where it leads.

Note that, in any case, if we have say Ga2O3 and Ga3O5, both would end up with the same feature value if the average over the column is used, whereas they get different values if Ga's missing elemental value is imputed first. So it should make more sense.
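Continuing with the same made-up numbers, a quick illustration of why the two compositions separate under per-element imputation:

```python
# Hypothetical <feature1> values: O = 20; Ga is missing and imputed as 18
O_VAL, GA_FILL = 20.0, 18.0

ga2o3 = (2 * GA_FILL + 3 * O_VAL) / 5  # 19.2
ga3o5 = (3 * GA_FILL + 5 * O_VAL) / 8  # 19.25

# With a column-wide average, both compositions would instead receive the
# same dataset-level mean for this feature.
```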

@gbrunin (Contributor, Author)

gbrunin commented May 16, 2024

Hi,

I have (finally) run some tests with the 3rd option discussed here. Tagging @ml-evs since he was also part of the process and developments. Here are the MAEs on three of the Matbench datasets, using MODNet models:

| dataset | impute_nan=False | impute_nan=True |
| --- | --- | --- |
| Steels | 92.59 | 93.69 (+1.2%) |
| Expt Eg | 0.3399 | 0.3238 (-4.7%) |
| Dielectric | 0.2793 | 0.2662 (-4.2%) |

For the steels, I think most compounds contain fairly common elements for which most features are known, so imputing does not change the featurized dataset much. I have not checked this, though. Overall, I believe that imputing in this way does lead to better results, with features that make more sense.

The implementation is already present; it was merged in #892. However, in that PR we decided to allow imputation but keep the default behavior (impute_nan=False plus a warning message) because we had not tested it yet. Now I would be in favor of changing the default to True, though that will change the values of multiple features for some compounds.
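For reference, a usage sketch of the merged option (assuming, per the description above, that the impute_nan keyword from #892 is also exposed through the preset constructor):

```python
from matminer.featurizers.composition import ElementProperty

# Opt in to per-element imputation so missing elemental data does not
# propagate NaNs into the featurized output (the default is still False)
ep = ElementProperty.from_preset("magpie", impute_nan=True)
```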

Any thoughts?
