PCA Contours With Tp Data #249

Open · Graham-EGI opened this issue Jul 7, 2023 · 11 comments

@Graham-EGI (Contributor) commented:
Describe the bug:

PCA environmental contours return odd results when Tp is used as the period variable (rather than Te).

To Reproduce:

This is a slight modification of the code in the example notebook examples.environmental_contours_example.ipynb, calculating Tp from the spectral data returned for the NDBC swden parameter.

Not all buoys exhibit this behavior; the example below uses one that does.

from mhkit.wave import resource, contours, graphics
import matplotlib.pyplot as plt
from mhkit.wave.io import ndbc
import pandas as pd

parameter = 'swden'
buoy_number = '46054' 
ndbc_available_data = ndbc.available_data(parameter, buoy_number)

years_of_interest = ndbc_available_data[ndbc_available_data.year < 2022]

filenames = years_of_interest['filename']
ndbc_requested_data = ndbc.request_data(parameter, filenames)

ndbc_data = {}
# Create a DateTime index and remove NOAA date columns for each year
for year in ndbc_requested_data:
    year_data = ndbc_requested_data[year]
    ndbc_data[year] = ndbc.to_datetime_index(parameter, year_data)

Hm0_list = []
Te_list = []
#! ADDITION
Tp_list = []

# Iterate over each year and save the results in the initialized lists
for year in ndbc_data:
    year_data = ndbc_data[year]
    Hm0_list.append(resource.significant_wave_height(year_data.T))
    Te_list.append(resource.energy_period(year_data.T))
    Tp_list.append(resource.peak_period(year_data.T))

# Concatenate list of Series into a single DataFrame
Te = pd.concat(Te_list, axis=0)
Hm0 = pd.concat(Hm0_list, axis=0)
Tp = pd.concat(Tp_list, axis=0)
Hm0_Te = pd.concat([Hm0, Te, Tp], axis=1)

# Drop any NaNs created from the calculation of Hm0, Te, or Tp
Hm0_Te.dropna(inplace=True)
# Sort the DateTime index
Hm0_Te.sort_index(inplace=True)


# Return period (years) of interest
period = 50  

# Remove TP Outliers 
# 30 removes the Hm0 outliers as well
Hm0_Te_clean = Hm0_Te[Hm0_Te.Tp <= 30]


# Get only the values from the DataFrame
Hm0 = Hm0_Te_clean.Hm0.values  
Te  = Hm0_Te_clean.Te.values 
Tp = Hm0_Te_clean.Tp.values
# Delta time of sea-states 
dt = (Hm0_Te_clean.index[2]-Hm0_Te_clean.index[1]).seconds  

# Get the contour values

#! TP IS THE ONE WITH ODD BEHAVIOR
copula_tp = contours.environmental_contours(Hm0, Tp, dt, period, 'PCA', return_PCA=True)
Hm0_contour_tp = copula_tp['PCA_x1']
Tp_contour_tp = copula_tp['PCA_x2']

#! TE looks good
copula_te = contours.environmental_contours(Hm0, Te, dt, period, 'PCA', return_PCA=True)
Hm0_contour_te = copula_te['PCA_x1']
Te_contour_te = copula_te['PCA_x2']


fig, ax = plt.subplots(figsize=(8, 6))

ax = graphics.plot_environmental_contour(Tp, Hm0,
                                         Tp_contour_tp, Hm0_contour_tp,
                                         data_label='NDBC 46054',
                                         contour_label='50 Year Contour',
                                         x_label='Spectral Peak Period, $Tp$ [s]',
                                         y_label='Sig. wave height, $Hm0$ [m]',
                                         ax=ax)
plt.show()


fig, ax = plt.subplots(figsize=(8, 6))

ax = graphics.plot_environmental_contour(Te, Hm0,
                                         Te_contour_te, Hm0_contour_te,
                                         data_label='NDBC 46054',
                                         contour_label='50 Year Contour',
                                         x_label='Energy Period, $Te$ [s]',
                                         y_label='Sig. wave height, $Hm0$ [m]',
                                         ax=ax)

plt.show()

Expected behavior:

PCA contours with similar shapes whether Tp or Te is used as the period variable.

Screenshots:

Bad contour with Tp as period value:

[screenshot: 50-year PCA contour over the Hm0–Tp scatter]

Good contour with Te as period value:

[screenshot: 50-year PCA contour over the Hm0–Te scatter]

Desktop (please complete the following information):

  • OS: iOS & Ubuntu - example on Mac, found in system deployed on Ubuntu.
  • MHKiT Version: 0.6.0

Additional context:

There are quite a few NDBC swden datasets that produce this contour shape when using Tp calculated with MHKiT. It appears to be related to the banding and the density of the data:

[screenshot: scatter of Tp vs. Hm0 colored by point density, showing distinct vertical bands in Tp]

Created with (where the Tp and Hm0 variables are from the example above):

from scipy.stats import gaussian_kde
import numpy as np

# Estimate the point density with a Gaussian KDE, then sort so the
# densest points are drawn last (on top)
xy = np.vstack([Tp, Hm0])
z = gaussian_kde(xy)(xy)
idx = z.argsort()
x, y, z = Tp[idx], Hm0[idx], z[idx]

fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=50)
plt.show()
@ssolson (Contributor) commented Jul 7, 2023

Thanks, @Graham-EGI, for bringing this up. On my first look at the issue, my guess is similar to yours: the binning of Tp is causing the behavior you are seeing.

@ryancoe, can you comment on whether this is a known behavior for the modified PCA approach we take here?

@ryancoe (Contributor) commented Jul 12, 2023

Referring back to Aubrey's paper on the PCA contours, I suspect it may have to do with the binning that's done to fit the second principal component/dimension (see Section 3.2), which looks like it is controlled by an optional argument. Try playing around with this:

PCA_bin_size = kwargs.get("PCA_bin_size", 250)

Eckert-Gallup, Aubrey C., et al. "Application of principal component analysis (PCA) and improved joint probability distributions to the inverse first-order reliability method (I-FORM) for predicting extreme sea states." Ocean Engineering 112 (2016): 307-319.
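For anyone who wants to try this, a minimal sketch of overriding that default when calling the contour function (using the Hm0, Tp, dt, and period variables from the reproduction code above; 500 is just an arbitrary value to experiment with):

# Pass a larger PCA_bin_size through the kwargs described above
# (the default is 250) and compare the resulting contour
copula_tp = contours.environmental_contours(
    Hm0, Tp, dt, period, 'PCA', return_PCA=True, PCA_bin_size=500
)
Hm0_contour_tp = copula_tp['PCA_x1']
Tp_contour_tp = copula_tp['PCA_x2']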

@Graham-EGI (Contributor, Author) commented:

I wanted to take a quick look at whether this was something that could be tuned on my end when calling contours.environmental_contours(); here are the results:

[screenshot: 50-year contours for varying PCA_bin_size over the Tp–Hm0 scatter, alongside a 2D histogram of the data]

Code
from mhkit.wave import resource, contours
import matplotlib.pyplot as plt
from mhkit.wave.io import ndbc
import pandas as pd
import numpy as np


parameter = "swden"
buoy_number = "46054"
ndbc_available_data = ndbc.available_data(parameter, buoy_number)

years_of_interest = ndbc_available_data[ndbc_available_data.year < 2022]

filenames = years_of_interest["filename"]
ndbc_requested_data = ndbc.request_data(parameter, filenames)

ndbc_data = {}
# Create a Datetime Index and remove NOAA date columns for each year
for year in ndbc_requested_data:
    year_data = ndbc_requested_data[year]
    ndbc_data[year] = ndbc.to_datetime_index(parameter, year_data)

Hm0_list = []
Te_list = []
#! ADDITION
Tp_list = []

# Iterate over each year and save the results in the initialized lists
for year in ndbc_data:
    year_data = ndbc_data[year]
    Hm0_list.append(resource.significant_wave_height(year_data.T))
    Te_list.append(resource.energy_period(year_data.T))
    Tp_list.append(resource.peak_period(year_data.T))

# Concatenate list of Series into a single DataFrame
Te = pd.concat(Te_list, axis=0)
Hm0 = pd.concat(Hm0_list, axis=0)
Tp = pd.concat(Tp_list, axis=0)
Hm0_Te = pd.concat([Hm0, Te, Tp], axis=1)

# Drop any NaNs created from the calculation of Hm0 or Te
Hm0_Te.dropna(inplace=True)
# Sort the DateTime index
Hm0_Te.sort_index(inplace=True)


# Return period (years) of interest
period = 50

# Remove TP Outliers
# 30 removes the Hm0 outliers as well
Hm0_Te_clean = Hm0_Te[Hm0_Te.Tp <= 30]


# Get only the values from the DataFrame
Hm0 = Hm0_Te_clean.Hm0.values
Te = Hm0_Te_clean.Te.values
Tp = Hm0_Te_clean.Tp.values
# Delta time of sea-states
dt = (Hm0_Te_clean.index[2] - Hm0_Te_clean.index[1]).seconds

# Get the contour values

# ! TP IS THE ONE WITH ODD BEHAVIOR

bin_sizes = {i: None for i in range(0, 1100, 100)}
failed = []


for k in bin_sizes.keys():
    try:
        copula_tp = contours.environmental_contours(
            Hm0, Tp, dt, period, "PCA", return_PCA=True, PCA_bin_size=k
        )
    except Exception:
        failed.append(k)
        continue

    Hm0_contour_tp = copula_tp["PCA_x1"]
    Tp_contour_tp = copula_tp["PCA_x2"]
    bin_sizes[k] = (Tp_contour_tp, Hm0_contour_tp)

print(failed)


fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(18, 6))


ax[0].scatter(Tp, Hm0, color="blue", alpha=0.5)
for k, v in bin_sizes.items():
    if v is not None:
        ax[0].plot(v[0], v[1], label=f"Bin Size = {k}")

ax[0].legend()
ax[0].grid(True)
ax[0].set_xlabel("Tp (s)")
ax[0].set_ylabel("Hm0 (m)")

h = ax[1].hist2d(Tp, Hm0, (100, 100), cmap=plt.cm.jet, cmin=1)
ax[1].set_xlabel("Tp (s)")
ax[1].set_ylabel("Hm0 (m)")
plt.colorbar(h[3])

plt.show()

@ssolson (Contributor) commented Jul 13, 2023

Nice, Graham. The y-shape improves for larger bin sizes. My guess is that the next issue may be that the PCA method assumes equal bin sizes. Perhaps you could try setting independent bin sizes here:
https://github.com/MHKiT-Software/MHKiT-Python/blob/e5916004ff0413f32f9559790582c95193acdd24/mhkit/wave/contours.py#L461C5-L461C27

That said, looking at it more, your analysis didn't show any changes along the x-axis for varying bin sizes. But by adjusting the fit below the linked line you could potentially get better results. Very interested in any findings you have.
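For reference, here is a rough sketch of the binning step described in Section 3.2 of Eckert-Gallup et al.: sort by the first principal component, split the data into fixed-size bins, and compute the per-bin statistics of the second component that the functional fit is built on. This is a simplified stand-in, not MHKiT's actual implementation, and components is a hypothetical (n, 2) array of PCA-transformed data:

import numpy as np

def per_bin_stats(components, bin_size=250):
    # Sort the sea states by component 1, then split into
    # chunks of roughly bin_size points each
    order = np.argsort(components[:, 0])
    c1 = components[order, 0]
    c2 = components[order, 1]
    n_bins = max(len(c1) // bin_size, 1)
    mu1, mu2, sigma2 = [], [], []
    for chunk1, chunk2 in zip(np.array_split(c1, n_bins),
                              np.array_split(c2, n_bins)):
        mu1.append(chunk1.mean())          # bin center along component 1
        mu2.append(chunk2.mean())          # per-bin mean of component 2
        sigma2.append(chunk2.std(ddof=1))  # per-bin spread of component 2
    return np.array(mu1), np.array(mu2), np.array(sigma2)

With heavily banded Tp data, the per-bin sigma estimates can jump depending on whether a bin lands on or between bands, which may be one place an equal-bin-size assumption shows up.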

@ryancoe (Contributor) commented Jul 13, 2023

Yeah... I'm honestly not sure the bin size is really the fundamental problem here... I think we need to look a bit more closely before we go down a rabbit hole on this. I know they're quite busy with other things these days, but perhaps @aubreyeckert or even @nevmart have some idea as to what's going on here?

@Graham-EGI (Contributor, Author) commented:

Got it, thanks guys!

@mattEhall commented:

Hey @ryancoe and all. Just chiming in that we're hitting this bug as well. In the last week we've been looking into using this tool in one of our projects. It would be awesome to get to the bottom of it.

@ssolson (Contributor) commented Aug 16, 2023

@jtgrasb in #261 detailed the same issue.

In my opinion this seems to be more of a shortcoming of the provided methods than a bug, but I do not claim to know the details of how these methods should work.

Is this a bug or a shortcoming of the provided methods?

@ssolson (Contributor) commented Nov 8, 2023

@mattEhall @jtgrasb @mbruggs (in place of Graham) @ryancoe

I have been poking at this issue today and yesterday. I have not identified exactly where things go wrong in the method, but I wanted to gauge the level of interest in this issue.

One of the first things I checked was the histogram of the data. As shown in the first two plots below, Hm0 and Te are clearly not perfectly Gaussian; each shows skewness (both fail the Shapiro-Wilk test included in the code below). However, the contour method works with these datasets.

Looking at the third figure, the Tp Q-Q plot does not seem particularly worse and shows less skewness, but the histogram has far more than a single peak. Since this is the data causing the issue, my current thought is that the multiple peaks in this non-Gaussian distribution lead to the poor contours observed in the original issue.

The reason this is important is that the PCA method assumes the data, when transformed into the PCA space, can be modeled with a Gaussian distribution. This is evident in the use of stats.invgauss.ppf for the inverse cumulative distribution function (CDF) used to calculate the component 1 values. If Tp does not follow a Gaussian distribution, the results will not reflect the actual distribution of the data. A linear relationship is then assumed between the two components to convert the second component into the PCA space. If the relationship between Tp and the first component is not linear, this step will not capture the true dependency, leading to incorrect contours.

component_1 = stats.invgauss.ppf(x_quantile,
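To make the multimodality point concrete, here is a toy example (synthetic data, not MHKiT code): a skewed-but-unimodal sample and a bimodal sample can both fail a normality test, but the bimodal sample is the one that badly distorts any single-peak distribution fit, which is roughly the Te vs. Tp situation above.

import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)

# Skewed but unimodal -- roughly the Hm0/Te situation
skewed = rng.lognormal(mean=2.0, sigma=0.25, size=2000)
# Two well-separated peaks -- roughly the banded Tp situation
bimodal = np.concatenate([rng.normal(8, 0.5, 1000),
                          rng.normal(14, 0.5, 1000)])

for name, sample in [('skewed', skewed), ('bimodal', bimodal)]:
    stat, p = stats.shapiro(sample)
    print(f'{name}: Shapiro-Wilk p-value = {p:.2e}')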

As stated above, I have started tracing the changes through the method when using Te vs. Tp, but I have not identified exactly where things break down yet.

I am happy to keep poking at this if there is still interest. Let me know if you have thoughts on this initial look at the data, or whether you think it is a good use of time to identify exactly where this binned Tp data causes issues versus simply using Te.

[screenshot: histograms with KDE overlays and Q-Q plots for Hm0, Te, and Tp]

Code
from mhkit.wave import resource, contours, graphics
import matplotlib.pyplot as plt
from mhkit.wave.io import ndbc
import pandas as pd
import numpy as np
import scipy.stats as stats
import seaborn as sns

parameter = 'swden'
buoy_number = '46054'
ndbc_available_data = ndbc.available_data(parameter, buoy_number)

years_of_interest = ndbc_available_data[ndbc_available_data.year < 2022]

filenames = years_of_interest['filename']
ndbc_requested_data = ndbc.request_data(parameter, filenames)

ndbc_data = {}
# Create a Datetime Index and remove NOAA date columns for each year
for year in ndbc_requested_data:
    year_data = ndbc_requested_data[year]
    ndbc_data[year] = ndbc.to_datetime_index(parameter, year_data)

Hm0_list = []
Te_list = []
#! ADDITION
Tp_list = []

# Iterate over each year and save the results in the initialized lists
for year in ndbc_data:
    year_data = ndbc_data[year]
    Hm0_list.append(resource.significant_wave_height(year_data.T))
    Te_list.append(resource.energy_period(year_data.T))
    Tp_list.append(resource.peak_period(year_data.T))

# Concatenate list of Series into a single DataFrame
Te = pd.concat(Te_list, axis=0)
Hm0 = pd.concat(Hm0_list, axis=0)
Tp = pd.concat(Tp_list, axis=0)
Hm0_Te = pd.concat([Hm0, Te, Tp], axis=1)

# Drop any NaNs created from the calculation of Hm0 or Te
Hm0_Te.dropna(inplace=True)
# Sort the DateTime index
Hm0_Te.sort_index(inplace=True)


# Return period (years) of interest
period = 50

# Remove TP Outliers
# 30 removes the Hm0 outliers as well
Hm0_Te_clean = Hm0_Te[Hm0_Te.Tp <= 30]


# Get only the values from the DataFrame
Hm0 = Hm0_Te_clean.Hm0.values
Te = Hm0_Te_clean.Te.values
Tp = Hm0_Te_clean.Tp.values
# Delta time of sea-states
dt = (Hm0_Te_clean.index[2]-Hm0_Te_clean.index[1]).seconds


# Define function to create histogram and Q-Q plot for given data

def plot_histogram_qq(data, title):
    sns.set(style="whitegrid")
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

    # Histogram with kernel density estimate
    sns.histplot(data, kde=True, ax=ax1)
    ax1.set_title(f'Histogram of {title}')

    # Q-Q plot
    stats.probplot(data, dist="norm", plot=ax2)
    ax2.set_title(f'Q-Q Plot of {title}')

    plt.tight_layout()


# Plot for Hm0
plot_histogram_qq(Hm0, 'Hm0')

# Plot for Te
plot_histogram_qq(Te, 'Te')

# Plot for Tp
plot_histogram_qq(Tp, 'Tp')


# Perform the Shapiro-Wilk test to check for normality
def perform_shapiro_wilk_test(data, name):
    stat, p = stats.shapiro(data)
    alpha = 0.05
    if p > alpha:
        return f"{name} looks Gaussian (fail to reject H0) with p-value = {p}"
    else:
        return f"{name} does not look Gaussian (reject H0) with p-value = {p}"


# Run the test on each variable
sw_Hm0 = perform_shapiro_wilk_test(Hm0, 'Hm0')
sw_Te = perform_shapiro_wilk_test(Te, 'Te')
sw_Tp = perform_shapiro_wilk_test(Tp, 'Tp')


print(sw_Hm0)
print(sw_Te)
print(sw_Tp)
plt.show()

@mattEhall commented:

Hi @ssolson, thanks for your message and the informative observations. This isn't a topic I know a lot about or have had the chance to get into in more detail, so I don't follow everything (but I see what you mean about the skewness and that the data might not fit the assumptions of the method well). For what it's worth, I am still very interested in using this functionality; we haven't found another way to create these contours.

@jtgrasb commented Nov 20, 2023

Hi @ssolson, thanks for the update. I am also still interested in this functionality. It's not a pressing issue for the project I was using it for, since we were able to use the KDE method with a different dataset and get okay results, but it would be nice to have a workflow for such data in the future.
