Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple alternative to Earth Mover's Distance (1D distributions) #16

Closed
sgbaird opened this issue Nov 5, 2021 · 4 comments
Closed

Simple alternative to Earth Mover's Distance (1D distributions) #16

sgbaird opened this issue Nov 5, 2021 · 4 comments

Comments

@sgbaird
Copy link
Contributor

sgbaird commented Nov 5, 2021

Hey Cameron, it was great to meet with you earlier. I've been thinking more about the equation that you mentioned, and was wondering how this could be the correct Earth Mover's distance if it doesn't consider the mod_petti features (i.e. the equation doesn't seem to incorporate weights)?

import numpy as np
from ElMD import ElMD, elmd

comp1 = "NaCl"
comp2 = "SiO2"
e1 = ElMD(comp1, metric="mod_petti")
e2 = ElMD(comp2, metric="mod_petti")

# Carturi Eq 2.37: DOI: arXiv:1803.00567
cdf1, cdf2 = np.cumsum(e1.ratio_vector), np.cumsum(e2.ratio_vector)
out = np.sum(np.abs(cdf1 - cdf2))

check = e1.elmd(comp2)
check2 = elmd(comp1, comp2)

print(out, check, check2)
41.0 2.0 2.0
@SurgeArrester
Copy link
Collaborator

So that's a slightly incorrect implementation but mostly right, the function should be

np.sum(np.abs(np.cumsum(comp1.ratio_vector - comp2.ratio_vector)))

The justification is given on page 31 of Computational Optimal Transport, the other tests have been really useful checking the last bug so would be interesting to see how this compares in accuracy and speed

@SurgeArrester
Copy link
Collaborator

Added this to ElMD==0.5.3 as per this reference https://arxiv.org/pdf/1804.01947.pdf

@sgbaird
Copy link
Contributor Author

sgbaird commented Aug 25, 2022

Do you know if there's a relation to Cramer's approximation used in scipy.stats.wasserstein_distance (e.g. if they're the same)?

@SurgeArrester
Copy link
Collaborator

SurgeArrester commented Aug 25, 2022

I can't answer with full confidence as I'm not hugely familiar with the Cramer approximation, but it looks like the definition given by scipy.stats.energy_distance is basically the same as the implementation given by scipy.stats.wasserstein_distance, but with a different p-norm.

We can verify this in their code:

def wasserstein_distance(u_values, v_values, u_weights=None, v_weights=None):
    return _cdf_distance(1, u_values, v_values, u_weights, v_weights)

def energy_distance(u_values, v_values, u_weights=None, v_weights=None):
    return np.sqrt(2) * _cdf_distance(2, u_values, v_values,
                                      u_weights, v_weights)

The _cdf_distance() function is more complex than the implementation in ElMD==0.5.4, as it sorts and then searches for each of the joint values in the initial distributions u and v . I think this approach is a lot more generalisable as you can have real valued non-monotonic scales, but it seems a lot more involved and hasn't been fast when I've tried it in the past

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants