PCA Contours With Tp Data #249

Open · Graham-EGI opened this issue Jul 7, 2023 · 11 comments

@Graham-EGI (Contributor) commented:
Describe the bug:

PCA environmental contours return odd results when Tp is used as the period variable (rather than Te).

To Reproduce:

This is a slight modification of the code in the example notebook examples.environmental_contours_example.ipynb, calculating Tp from the spectral data returned for the NDBC swden parameter.

Not all buoys exhibit this behavior; the example below uses one that does.

from mhkit.wave import resource, contours, graphics
import matplotlib.pyplot as plt
from mhkit.wave.io import ndbc
import pandas as pd

parameter = 'swden'
buoy_number = '46054' 
ndbc_available_data = ndbc.available_data(parameter, buoy_number)

years_of_interest = ndbc_available_data[ndbc_available_data.year < 2022]

filenames = years_of_interest['filename']
ndbc_requested_data = ndbc.request_data(parameter, filenames)

ndbc_data = {}
# Create a DateTime index and remove NOAA date columns for each year
for year in ndbc_requested_data:
    year_data = ndbc_requested_data[year]
    ndbc_data[year] = ndbc.to_datetime_index(parameter, year_data)

Hm0_list = []
Te_list = []
#! ADDITION
Tp_list = []

# Iterate over each year and save the results in the initialized lists
for year in ndbc_data:
    year_data = ndbc_data[year]
    Hm0_list.append(resource.significant_wave_height(year_data.T))
    Te_list.append(resource.energy_period(year_data.T))
    Tp_list.append(resource.peak_period(year_data.T))

# Concatenate list of Series into a single DataFrame
Te = pd.concat(Te_list, axis=0)
Hm0 = pd.concat(Hm0_list, axis=0)
Tp = pd.concat(Tp_list, axis=0)
Hm0_Te = pd.concat([Hm0, Te, Tp], axis=1)

# Drop any NaNs created from the calculation of Hm0, Te, or Tp
Hm0_Te.dropna(inplace=True)
# Sort the DateTime index
Hm0_Te.sort_index(inplace=True)


# Return period (years) of interest
period = 50  

# Remove TP Outliers 
# 30 removes the Hm0 outliers as well
Hm0_Te_clean = Hm0_Te[Hm0_Te.Tp <= 30]


# Get only the values from the DataFrame
Hm0 = Hm0_Te_clean.Hm0.values  
Te  = Hm0_Te_clean.Te.values 
Tp = Hm0_Te_clean.Tp.values
# Delta time of sea-states 
dt = (Hm0_Te_clean.index[2]-Hm0_Te_clean.index[1]).seconds  

# Get the contour values

#! TP IS THE ONE WITH ODD BEHAVIOR
copula_tp = contours.environmental_contours(Hm0, Tp, dt, period, 'PCA', return_PCA=True)
Hm0_contour_tp = copula_tp['PCA_x1']
Tp_contour_tp = copula_tp['PCA_x2']

#! TE looks good
copula_te = contours.environmental_contours(Hm0, Te, dt, period, 'PCA', return_PCA=True)
Hm0_contour_te = copula_te['PCA_x1']
Te_contour_te = copula_te['PCA_x2']


fig, ax = plt.subplots(figsize=(8, 6))

ax = graphics.plot_environmental_contour(Tp, Hm0,
                                         Tp_contour_tp, Hm0_contour_tp,
                                         data_label='NDBC 46054',
                                         contour_label='50 Year Contour',
                                         x_label='Spectral Peak Period, $Tp$ [s]',
                                         y_label='Sig. wave height, $Hm0$ [m]',
                                         ax=ax)
plt.show()


fig, ax = plt.subplots(figsize=(8, 6))

ax = graphics.plot_environmental_contour(Te, Hm0,
                                         Te_contour_te, Hm0_contour_te,
                                         data_label='NDBC 46054',
                                         contour_label='50 Year Contour',
                                         x_label='Energy Period, $Te$ [s]',
                                         y_label='Sig. wave height, $Hm0$ [m]',
                                         ax=ax)

plt.show()

Expected behavior:

PCA contours with similar shapes whether Tp or Te is used as the period variable.

Screenshots:

Bad contour with Tp as period value:

[screenshot: 50-year PCA contour over the Hm0–Tp scatter]

Good contour with Te as period value:

[screenshot: 50-year PCA contour over the Hm0–Te scatter]

Desktop (please complete the following information):

  • OS: iOS & Ubuntu - example on Mac, found in system deployed on Ubuntu.
  • MHKiT Version: 0.6.0

Additional context:

There are quite a few NDBC swden datasets that produce this contour shape when using Tp calculated with MHKiT. It appears to be related to the banding and the density of the data:

[screenshot: scatter of Tp vs. Hm0 colored by point density, showing distinct vertical bands in Tp]

Created with (where the Tp and Hm0 variables are from the example above):

from scipy.stats import gaussian_kde
import numpy as np

# Estimate the point density with a Gaussian KDE, then sort so the
# densest points are drawn last (on top)
xy = np.vstack([Tp, Hm0])
z = gaussian_kde(xy)(xy)
idx = z.argsort()
x, y, z = Tp[idx], Hm0[idx], z[idx]

fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=50)
plt.show()
@ssolson (Contributor) commented Jul 7, 2023

Thanks, @Graham-EGI, for bringing this up. On my first look at the issue, my guess is similar to yours: the binning of Tp is causing the behavior you are seeing.

@ryancoe, can you comment on whether this is a known behavior for the modified PCA approach we take here?

@ryancoe (Contributor) commented Jul 12, 2023

Referring back to Aubrey's paper on the PCA contours, I suspect it may have to do with the binning that's done to fit the second principal component/dimension (see Section 3.2), which looks like it is controlled by an optional argument. Try playing around with this:

PCA_bin_size = kwargs.get("PCA_bin_size", 250)

Eckert-Gallup, Aubrey C., et al. "Application of principal component analysis (PCA) and improved joint probability distributions to the inverse first-order reliability method (I-FORM) for predicting extreme sea states." Ocean Engineering 112 (2016): 307-319.
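For anyone who wants to try this, a minimal sketch of overriding that default when calling the contour function (using the Hm0, Tp, dt, and period variables from the reproduction code above; 500 is just an arbitrary value to experiment with):

# Pass a larger PCA_bin_size through the kwargs described above
# (the default is 250) and compare the resulting contour
copula_tp = contours.environmental_contours(
    Hm0, Tp, dt, period, 'PCA', return_PCA=True, PCA_bin_size=500
)
Hm0_contour_tp = copula_tp['PCA_x1']
Tp_contour_tp = copula_tp['PCA_x2']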

@Graham-EGI (Contributor, Author) commented:

I wanted to take a quick look at whether this was something that could be tuned on my end when calling contours.environmental_contours(); here are the results:

[screenshot: 50-year contours for varying PCA_bin_size over the Tp–Hm0 scatter, alongside a 2D histogram of the data]

Code
from mhkit.wave import resource, contours
import matplotlib.pyplot as plt
from mhkit.wave.io import ndbc
import pandas as pd
import numpy as np


parameter = "swden"
buoy_number = "46054"
ndbc_available_data = ndbc.available_data(parameter, buoy_number)

years_of_interest = ndbc_available_data[ndbc_available_data.year < 2022]

filenames = years_of_interest["filename"]
ndbc_requested_data = ndbc.request_data(parameter, filenames)

ndbc_data = {}
# Create a Datetime Index and remove NOAA date columns for each year
for year in ndbc_requested_data:
    year_data = ndbc_requested_data[year]
    ndbc_data[year] = ndbc.to_datetime_index(parameter, year_data)

Hm0_list = []
Te_list = []
#! ADDITION
Tp_list = []

# Iterate over each year and save the results in the initialized lists
for year in ndbc_data:
    year_data = ndbc_data[year]
    Hm0_list.append(resource.significant_wave_height(year_data.T))
    Te_list.append(resource.energy_period(year_data.T))
    Tp_list.append(resource.peak_period(year_data.T))

# Concatenate list of Series into a single DataFrame
Te = pd.concat(Te_list, axis=0)
Hm0 = pd.concat(Hm0_list, axis=0)
Tp = pd.concat(Tp_list, axis=0)
Hm0_Te = pd.concat([Hm0, Te, Tp], axis=1)

# Drop any NaNs created from the calculation of Hm0 or Te
Hm0_Te.dropna(inplace=True)
# Sort the DateTime index
Hm0_Te.sort_index(inplace=True)


# Return period (years) of interest
period = 50

# Remove TP Outliers
# 30 removes the Hm0 outliers as well
Hm0_Te_clean = Hm0_Te[Hm0_Te.Tp <= 30]


# Get only the values from the DataFrame
Hm0 = Hm0_Te_clean.Hm0.values
Te = Hm0_Te_clean.Te.values
Tp = Hm0_Te_clean.Tp.values
# Delta time of sea-states
dt = (Hm0_Te_clean.index[2] - Hm0_Te_clean.index[1]).seconds

# Get the contour values

# ! TP IS THE ONE WITH ODD BEHAVIOR

bin_sizes = {i: None for i in range(0, 1100, 100)}
failed = []


for k in bin_sizes.keys():
    try:
        copula_tp = contours.environmental_contours(
            Hm0, Tp, dt, period, "PCA", return_PCA=True, PCA_bin_size=k
        )
    except Exception:
        failed.append(k)
        continue

    Hm0_contour_tp = copula_tp["PCA_x1"]
    Tp_contour_tp = copula_tp["PCA_x2"]
    bin_sizes[k] = (Tp_contour_tp, Hm0_contour_tp)

print(failed)


fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(18, 6))


ax[0].scatter(Tp, Hm0, color="blue", alpha=0.5)
for k, v in bin_sizes.items():
    if v is not None:
        ax[0].plot(v[0], v[1], label=f"Bin Size = {k}")

ax[0].legend()
ax[0].grid(True)
ax[0].set_xlabel("Tp (s)")
ax[0].set_ylabel("Hm0 (m)")

h = ax[1].hist2d(Tp, Hm0, (100, 100), cmap=plt.cm.jet, cmin=1)
ax[1].set_xlabel("Tp (s)")
ax[1].set_ylabel("Hm0 (m)")
plt.colorbar(h[3])

plt.show()

@ssolson (Contributor) commented Jul 13, 2023

Nice, Graham. The y-shape improves for larger bin sizes. My guess is that the next issue may be that the PCA method assumes equal bin sizes. Perhaps you could try setting independent bin sizes here:
https://github.com/MHKiT-Software/MHKiT-Python/blob/e5916004ff0413f32f9559790582c95193acdd24/mhkit/wave/contours.py#L461C5-L461C27

That said, looking at it more, your analysis didn't show any changes along the x-axis for varying bin sizes. But by adjusting the fit below the linked line you could potentially get better results. Very interested in any findings you have.
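For reference, here is a rough sketch of the binning step described in Section 3.2 of Eckert-Gallup et al.: sort by the first principal component, split the data into fixed-size bins, and compute the per-bin statistics of the second component that the functional fit is built on. This is a simplified stand-in, not MHKiT's actual implementation, and components is a hypothetical (n, 2) array of PCA-transformed data:

import numpy as np

def per_bin_stats(components, bin_size=250):
    # Sort the sea states by component 1, then split into
    # chunks of roughly bin_size points each
    order = np.argsort(components[:, 0])
    c1 = components[order, 0]
    c2 = components[order, 1]
    n_bins = max(len(c1) // bin_size, 1)
    mu1, mu2, sigma2 = [], [], []
    for chunk1, chunk2 in zip(np.array_split(c1, n_bins),
                              np.array_split(c2, n_bins)):
        mu1.append(chunk1.mean())          # bin center along component 1
        mu2.append(chunk2.mean())          # per-bin mean of component 2
        sigma2.append(chunk2.std(ddof=1))  # per-bin spread of component 2
    return np.array(mu1), np.array(mu2), np.array(sigma2)

With heavily banded Tp data, the per-bin sigma estimates can jump depending on whether a bin lands on or between bands, which may be one place an equal-bin-size assumption shows up.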

@ryancoe (Contributor) commented Jul 13, 2023

Yeah... I'm honestly not sure the bin size is really the fundamental problem here... I think we need to look a bit more closely before we go down a rabbit hole on this. I know they're quite busy with other things these days, but perhaps @aubreyeckert or even @nevmart have some idea as to what's going on here?

@Graham-EGI (Contributor, Author) commented:

Got it, thanks guys!

@mattEhall commented:

Hey @ryancoe and all. Just chiming in that we're hitting this bug as well. In the last week we've been looking into using this tool in one of our projects. It would be awesome to get to the bottom of it.

@ssolson (Contributor) commented Aug 16, 2023

@jtgrasb in #261 detailed the same issue.

In my opinion this seems to be more of a shortcoming of the provided methods than a bug, but I do not claim to know the details of how these methods should work.

Is this a bug or a shortcoming of the provided methods?

@ssolson (Contributor) commented Nov 8, 2023

@mattEhall @jtgrasb @mbruggs (in place of Graham) @ryancoe

I have been poking at this issue today and yesterday. I have not identified exactly where things go wrong in the method, but I wanted to gauge the level of interest in this issue.

One of the first things I checked was the histogram of the data. As shown in the first two plots below, Hm0 and Te are clearly not perfectly Gaussian; each shows skewness (both fail the Shapiro-Wilk test included in the code below). However, the contour method works with these datasets.

Looking at the third figure, the Tp Q-Q plot does not seem particularly worse and shows less skewness, but the histogram has far more than a single peak. Since this is the data causing the issue, my current thought is that the multiple peaks in this non-Gaussian distribution lead to the poor contours observed in the original issue.

The reason this is important is that the PCA method assumes the data, when transformed into the PCA space, can be modeled with a Gaussian distribution. This is evident in the use of stats.invgauss.ppf for the inverse cumulative distribution function (CDF) used to calculate the component 1 values. If Tp does not follow a Gaussian distribution, the results will not reflect the actual distribution of the data. A linear relationship is then assumed between the two components to convert the second component into the PCA space. If the relationship between Tp and the first component is not linear, this step will not capture the true dependency, leading to incorrect contours.

component_1 = stats.invgauss.ppf(x_quantile,
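To make the multimodality point concrete, here is a toy example (synthetic data, not MHKiT code): a skewed-but-unimodal sample and a bimodal sample can both fail a normality test, but the bimodal sample is the one that badly distorts any single-peak distribution fit, which is roughly the Te vs. Tp situation above.

import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)

# Skewed but unimodal -- roughly the Hm0/Te situation
skewed = rng.lognormal(mean=2.0, sigma=0.25, size=2000)
# Two well-separated peaks -- roughly the banded Tp situation
bimodal = np.concatenate([rng.normal(8, 0.5, 1000),
                          rng.normal(14, 0.5, 1000)])

for name, sample in [('skewed', skewed), ('bimodal', bimodal)]:
    stat, p = stats.shapiro(sample)
    print(f'{name}: Shapiro-Wilk p-value = {p:.2e}')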

As stated above, I have started tracing the changes through the method when using Te vs. Tp, but I have not identified exactly where things break down yet.

I am happy to keep poking at this if there is still interest. Let me know if you have thoughts on this initial look at the data, or whether you think it is a good use of time to identify exactly where this binned Tp data causes issues versus simply using Te.

[screenshot: histograms with KDE overlays and Q-Q plots for Hm0, Te, and Tp]

Code
from mhkit.wave import resource, contours, graphics
import matplotlib.pyplot as plt
from mhkit.wave.io import ndbc
import pandas as pd
import numpy as np
import scipy.stats as stats
import seaborn as sns

parameter = 'swden'
buoy_number = '46054'
ndbc_available_data = ndbc.available_data(parameter, buoy_number)

years_of_interest = ndbc_available_data[ndbc_available_data.year < 2022]

filenames = years_of_interest['filename']
ndbc_requested_data = ndbc.request_data(parameter, filenames)

ndbc_data = {}
# Create a Datetime Index and remove NOAA date columns for each year
for year in ndbc_requested_data:
    year_data = ndbc_requested_data[year]
    ndbc_data[year] = ndbc.to_datetime_index(parameter, year_data)

Hm0_list = []
Te_list = []
#! ADDITION
Tp_list = []

# Iterate over each year and save the results in the initialized lists
for year in ndbc_data:
    year_data = ndbc_data[year]
    Hm0_list.append(resource.significant_wave_height(year_data.T))
    Te_list.append(resource.energy_period(year_data.T))
    Tp_list.append(resource.peak_period(year_data.T))

# Concatenate list of Series into a single DataFrame
Te = pd.concat(Te_list, axis=0)
Hm0 = pd.concat(Hm0_list, axis=0)
Tp = pd.concat(Tp_list, axis=0)
Hm0_Te = pd.concat([Hm0, Te, Tp], axis=1)

# Drop any NaNs created from the calculation of Hm0 or Te
Hm0_Te.dropna(inplace=True)
# Sort the DateTime index
Hm0_Te.sort_index(inplace=True)


# Return period (years) of interest
period = 50

# Remove TP Outliers
# 30 removes the Hm0 outliers as well
Hm0_Te_clean = Hm0_Te[Hm0_Te.Tp <= 30]


# Get only the values from the DataFrame
Hm0 = Hm0_Te_clean.Hm0.values
Te = Hm0_Te_clean.Te.values
Tp = Hm0_Te_clean.Tp.values
# Delta time of sea-states
dt = (Hm0_Te_clean.index[2]-Hm0_Te_clean.index[1]).seconds


# Define function to create histogram and Q-Q plot for given data

def plot_histogram_qq(data, title):
    sns.set(style="whitegrid")
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

    # Histogram with kernel density estimate
    sns.histplot(data, kde=True, ax=ax1)
    ax1.set_title(f'Histogram of {title}')

    # Q-Q plot
    stats.probplot(data, dist="norm", plot=ax2)
    ax2.set_title(f'Q-Q Plot of {title}')

    plt.tight_layout()


# Plot for Hm0
plot_histogram_qq(Hm0, 'Hm0')

# Plot for Te
plot_histogram_qq(Te, 'Te')

# Plot for Tp
plot_histogram_qq(Tp, 'Tp')


# Perform the Shapiro-Wilk test to check for normality
def perform_shapiro_wilk_test(data, name):
    stat, p = stats.shapiro(data)
    alpha = 0.05
    if p > alpha:
        return f"{name} looks Gaussian (fail to reject H0) with p-value = {p}"
    else:
        return f"{name} does not look Gaussian (reject H0) with p-value = {p}"


# Run the test on each variable
sw_Hm0 = perform_shapiro_wilk_test(Hm0, 'Hm0')
sw_Te = perform_shapiro_wilk_test(Te, 'Te')
sw_Tp = perform_shapiro_wilk_test(Tp, 'Tp')


print(sw_Hm0)
print(sw_Te)
print(sw_Tp)
plt.show()

@mattEhall commented:

Hi @ssolson, thanks for your message and the informative observations. This isn't a topic I know a lot about or have had the chance to get into in more detail, so I don't follow everything (but I see what you mean about the skewness and that the data might not fit the assumptions of the method well). For what it's worth, I am still very interested in using this functionality; we haven't found another way to create these contours.

@jtgrasb commented Nov 20, 2023

Hi @ssolson, thanks for the update. I am also still interested in this functionality. It's not a pressing issue for the project I was using it for, since we were able to use the KDE method with a different dataset and get okay results, but it would be nice to have a workflow for such data in the future.
