Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatic threshold for peaks-over-threshold #268

Merged
merged 16 commits into from
Oct 30, 2023

Conversation

cmichelenstrofer
Copy link
Contributor

replaces #214.

MHKiT already has a peaks-over-threshold method, which takes in threshold as one of the inputs. This method here is to automatically calculate the "best" threshold. It is from a paper by Vince and Seongho:

Neary, V. S., S. Ahn, B. E. Seng, M. N. Allahdadi, T. Wang, Z. Yang and R. He (2020). "Characterization of Extreme Wave Conditions for Wave Energy Converter Design and Project Risk Assessment.” J. Mar. Sci. Eng. 2020, 8(4), 289; https://doi.org/10.3390/jmse8040289.

They had it in Matlab and I converted it to Python here.

@cmichelenstrofer cmichelenstrofer self-assigned this Oct 3, 2023
@cmichelenstrofer
Copy link
Contributor Author

@ssolson I cleaned this up quite a bit and made a Matlab minimal example form the code I got from Seongho. He unfortunately left which makes figuring some of this out more difficult. The problem is that the solution I get with my Python implementation is slightly different from what I get with the original Matlab code. I think the issue might be that I am not accounting for all the metrics they used for "independent storms". I would appreciate some help if anyone else can take a look at this.

The failing errors don't seem to be related to this PR.

@ssolson ssolson changed the base branch from master to develop October 4, 2023 16:28
@ssolson
Copy link
Contributor

ssolson commented Oct 4, 2023

@ssolson I cleaned this up quite a bit and made a Matlab minimal example form the code I got from Seongho. He unfortunately left which makes figuring some of this out more difficult. The problem is that the solution I get with my Python implementation is slightly different from what I get with the original Matlab code. I think the issue might be that I am not accounting for all the metrics they used for "independent storms". I would appreciate some help if anyone else can take a look at this.

The failing errors don't seem to be related to this PR.

Carlos thank you.

The errors are unrelated to this PR and are fixed in #241 which should be reviewed and approved soon.

I took a bit of a look at his today. Running the tests I see that I get the same answer as you have specified but I was wondering how big is the difference from the Matlab version?

I started looking into the generalized pareto distribution in scipy vs matlab via the docs but I was ultimately hindered by not being able to run the Matlab version without getting Matlab installed on my machine. It seems like the idea would be to step-by-step compare the python vs Matlab output to see where the difference arises.

@akeeste are you setup to run Matlab and do you have time to take a look into this further?

@akeeste
Copy link
Contributor

akeeste commented Oct 4, 2023

@ssolson I'm set-up for this. I will review this and debug next week.

@akeeste akeeste self-requested a review October 4, 2023 17:35
@cmichelenstrofer
Copy link
Contributor Author

@ssolson @akeeste Thanks! The difference I'm getting is this:

Matlab: 99.53%, 1.041 m
Python: 98.94%, 1.028 m

@akeeste LMK if you want to go over it together.

@akeeste
Copy link
Contributor

akeeste commented Oct 12, 2023

@cmichelenstrofer I made some progress today and noting where I'm at for reference. I noticed a couple things, but I'm still trying to figure out how much they affect the end result, given the input Hs used in your working example. I'll refer to Seongho's original code as the 'matlab' version and your refactored code as the 'python version'

  • xarray's stats.scoreatpercentile gives slightly different values than MATLAB's prctile function. The differences are very small but may become more significant as the threshold range is refined. Also, they recommend moving to numpy.percentile
    • Percentile, xarray threshold, matlab threshold
    • 0.99, 1.0294, 1.0295
    • 0.991, 1.0311, 1.0313
    • 0.992, 1.0334, 1.0334
    • 0.995, 1.0406, 1.0506
  • the refinement logic is working a little differently in the python version. the condition rate_per_year < 2 is checked for every threshold and refinement iteration in Python, but only checked every refinement iteration in MATLAB. So all 6 initial thresholds are used in MATLAB. But in the python version only the first 3 thresholds pass the check for 2+ peaks/year.
  • In python for the first refinement iteration, the thresholds 0.99, 0.991, 0.992 all give an identical correlation. This seems suspicious and is not the behavior seen in the MATLAB code. The _pearks_over_threshold method gives different output for thresholds of e.g. 0.99 and 0.991, but the distribution fit returns identical distribution_parameters.

I'll continue debugging next week

@akeeste
Copy link
Contributor

akeeste commented Oct 18, 2023

I'm still digging into the settings of various functions, but I can confirm that the functions for percentile, Generalized Pareto distribution fitting, and probability plotting functions all give slightly different results between their respective MATLAB and SciPy versions. I'll see if there are optional settings we can use to converge the results

Updates:

Percentile

MATLAB's prctile function (both 'exact/linear' and 'approximate/T-digest' methods) gives identical results to NumPy's np.percentile with the Hazen method

The Hazen method requires NumPy>1.22.0. For older versions (1.22.0 > Numpy > 1.9), the 'higher' linear method is very close, but not identical to MATLAB's percentile.

Generalized Pareto distribution fitting

MATLAB and SciPy are giving different distribution parameters when the excess is fit to a Generalized Pareto Distribution.

Both MATLAB's GeneralizedParetoDistribution (description) and SciPy's stats.genpareto have an identical definition. MATLAB's fit and SciPy's fit both estimate the distribution parameters using Maximum Likelihood Estimation and assume that the location/threshold is zero. A zero threshold is correct given that both versions of the code already use the excess (peaks-threshold) to estimate the tail distribution.

MATLAB fitting does not take any other options. SciPy takes initial guesses for shape, scale, etc but that outcome seems unpredictable. I'm not sure what else can be done to make the distribution fitting converge. There are not many points in the tail. In this example, even the 90th percentile only has 9 points. This will only decrease as the percentile is iteratively increased.

Example with excess defined using a 90th percentile threshold:

   excess = [0.016299, 0.025899, 0.006499, 0.024699, 0.005099, 0.017399, 0.004299, 0.013899, 0.000199]
   shape, scale, threshold:
   MATLAB: -1.21434, 0.0314501, 0.0
   SciPy: -1.62408, 0.0420621, 0.0
   SciPy (using MATLAB results as initial guess): -1.26645, 0.032800, 0.0

Probability plotting

My stats background begins to fail here. I'm not confident in how the correlations are each defined, but when given the same distribution parameters the results are nearly identical.

  • The MATLAB code's result is defined as the correlation between the quantile of the input (excess) and the quantile of the fitted distribution (qqplot)
  • The Python code's result is defined as the r value of the least-squares fit to the excess vs the quantile of the fitted distribution (probplot)

Example, given the resultant MATLAB distribution above:

MATLAB correlation: 0.978299509175096
SciPy correlation: 0.97872960601213 (0.044% error)

mhkit/loads/extreme.py Outdated Show resolved Hide resolved
Copy link
Contributor

@akeeste akeeste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmichelenstrofer Suggestions to improve a couple minor issues with the storm peak calculation. Otherwise, the storm peak calculation seems to pull all of the same independent storm peaks that the original matlab code does.

The comparison is closer. I think the remaining discrepancy is from the distribution fitting described above. Comparison after these changes:
Matlab: 99.53%, 1.041 m
Python: 99.13%, 1.032 m

mhkit/loads/extreme.py Outdated Show resolved Hide resolved
mhkit/loads/extreme.py Outdated Show resolved Hide resolved
mhkit/loads/extreme.py Outdated Show resolved Hide resolved
@cmichelenstrofer
Copy link
Contributor Author

@akeeste this is great! Thanks for looking into this in so much detail! I think the results are basically identical. The very tiny difference is from the techniques to fit the distribution. Do you think it is ready merge?

@akeeste
Copy link
Contributor

akeeste commented Oct 19, 2023

@cmichelenstrofer Sounds good. Can you first remove the MATLAB working example and updated the test result to 99.13% and 1.032m. After that we can merge

@akeeste
Copy link
Contributor

akeeste commented Oct 19, 2023

Seems like I could commit the suggestion directly to the branch. I mistyped last time, the percentile still returns as a fraction. I updated the test to compare to 0.9913. The test should pass now

Update threshold test value one more time
mhkit/loads/extreme.py Outdated Show resolved Hide resolved
@ssolson
Copy link
Contributor

ssolson commented Oct 20, 2023

Overall, I'm happy with this PR as is. I have a few suggestions, neither required but I think would round out the PR nicely.

  1. Consider consolidating all the scipy imports onto a single line.
  2. I recommend adding a docstring at the beginning of the file. While we haven't formally adopted this in MHKiT's style guidelines, it's something I've been introducing in files I've worked on.

Lastly, I noticed the use of a nested function. Historically, we've kept these functions separate. While nesting does highlight that the inner function is only needed by the parent function, it can also make the code appear more cluttered. I'm not opposed to the current structure, but I wanted to initiate a brief discussion on this topic. What are your thoughts?

@akeeste
Copy link
Contributor

akeeste commented Oct 30, 2023

To wrap up this PR I

  • broke out the nested function
  • consolidated the scipy imports

I leave the submodule-level docstring and any other optional developments for future PRs. All tests have passed. Merging this PR

@akeeste akeeste merged commit abc4395 into MHKiT-Software:develop Oct 30, 2023
16 checks passed
@ssolson ssolson mentioned this pull request Apr 24, 2024
@ssolson ssolson mentioned this pull request May 6, 2024
ssolson added a commit that referenced this pull request May 8, 2024
# MHKiT v0.8.0
We're excited to announce the release of MHKiT v0.8.0, which brings a host of new features, enhancements, and bug fixes across various modules, ensuring compatibility with Python 3.10 and 3.11, and introducing full xarray support for more flexible data handling. Significant updates in the Wave and DOLfYN modules improve functionality and extend capabilities.

## Python 3.10 & 3.11 Support
MHKiT now supports python 3.10 and 3.11. Support for 3.12 will follow in the next minor update.
- #240


## Wave Module
### Enhancements:
**Automatic Threshold Calculation for Peaks-Over-Threshold**: We've introduced a new feature that automatically calculates the "best" threshold for identifying significant wave events. This method, originally developed by Neary, V. S., et al. in their 2020 study, has now been translated from Matlab to Python, enhancing our existing peaks-over-threshold functionality.

**Wave Heights Analysis**: A new function, `wave_heights`, has been added to extract the heights of individual waves from a time series. This function uses zero up-crossing analysis to accurately measure wave heights, improving upon our previous methods which only provided the maximum value between up-crossings.

**Enhanced Zero Crossing Analysis**: Building on the above, the zero crossing code previously embedded in `global_peaks` has been isolated into a helper function. This modular approach not only refines the codebase but also supports new functionalities such as calculating wave heights, zero crossing periods, and identifying crests.

### Bug Fixes:
**Contour Sampling Error in Wave Contours**: A bug identified in `mhkit.wave.contours.samples_contour` has been resolved. The issue occurred when period samples defined using the maximum period resulted in values outside the interpolation range of the contour data. This was corrected by ensuring that all sampling points are within the interpolation range and adjusting the contour data selection process accordingly.

- #268 
- #252 
- #278


## Xarray Support
MHKiT functions now fully support the use of xarray for passing and returning data.

- #279 
- #282
- #285
- #302
- #310


## DOLfYN

Thanks to the many user contributions and users of MHKiT the DOLFYN module include a significant number of enhancements and bug fixes. 

### Enhancements:
**Altimeter Support**: Enhanced the Nortek Signature Reader to add capability for reading ADCP dual profile configurations.

**Data Handling Improvements**: Introduced logic to skip messy header data that can accumulate during measurements collected via Nortek software on PCs and Macs.

**Instrument Noise Subtraction**: Added a function to subtract instrument noise from turbulence intensity estimation using RMS calculations, providing results that differ by approximately 1% from the existing standard deviation-based "I" property.

**Improved File Handling**: Updates for RDI files to handle changing "number of cells" and variable "cell sizes," which are now bin-averaged into the largest cell size.

### Bug Fixes:
**Power Spectra Calculation**: Fixed a bug where a given noise value was not being subtracted from the power spectra, and noise was inadvertently added as an input to dissipation rate calculations.

**Improved Header Handling**: Allowed RDI reader to skip junk headers effectively.

**Nortek Reader C Types Update**: Adjusted C types in the Nortek reader to handle below-zero water temperatures and to allow pitch and roll values to go negative.


- #280 
- #289
- #290
- #292
- #293
- #294
- #299


## River & Tidal: D3D
Added limits to `variable_interpolation` and added 3 array input capability to `create_points`
- #271

## Developer Experience
### Black formatting
Black formatting is now enforced on all MHKiT files. This ensures consistent formatting across the MHKiT package.
- #281

### Linting & Type Hints
MHKiT is in the process of enforcing pylint and adding type hints to all functions. Currently this has been achieved and is enforced in the Loads and Power modules.
- #288 
- #296 

### CI/CD
This release introduces significant reduction in testing time for development. This is achieved by reducing the number of tests for pulls against the develop branch and only running hindcast test when changes are made to it. A bug in the hindcast CI was fixed which only ran on changes to the hindcast tests instead of the hindcast module. Additionally the wave and wind hindcast needed to be separated in 2 jobs due to the excessive time taken to run a wind cache. This created a number of follow on PRs around solidifying the logic of these job. A special case for Python 3.8, pip, and Mac OS was added to use homebrew to install NetCDF and HDF5 to get tests to pass.
- #241
- #270
- #306
- #311
- #317
- #318
- #319
- #320
- #324

### Clean Up
MHKiT fixed an implementation error where functions used assert instead of built in errors for type and value checking. Additionally these PRs removed unused files, fixed typos, and created an argument which allows users to run CDIP API calls silently.
- #276
- #272
- #273
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants