
Use standard deviation from global explanation graph to calculate standard deviation of prediction? #235

Open
bverhoeff opened this issue May 18, 2021 · 5 comments
Labels
enhancement New feature or request


bverhoeff commented May 18, 2021

(moved this from other issue)
First of all, many thanks for creating and maintaining this project.

I have the following related questions about quantifying the uncertainty of predictions:

  1. Is a standard deviation, calculated by bagging the data 8 times, training 8 models, and averaging them (as you do), representative of the distribution of uncertainty of the feature scores for a given feature at given feature values (e.g. so that I can draw samples from that distribution)?

  2. If we draw samples from each feature graph's distribution at a case's feature values and use them to calculate a prediction, and if we repeat this a number of times, could we determine the mean ± SD of the prediction for that case as well? And would that be a correct way to determine the uncertainty of the predictions?

  3. Currently there are no upper and lower bounds for interaction terms in the internal object. Would it be difficult to add these? Otherwise the method above would only be possible without interaction terms.

  4. Most importantly, would the method above be mathematically correct? I can imagine this is the wrong place for that question; I could ask it on Stats Exchange / Cross Validated instead.

In medicine we are screaming for quantified uncertainty of predictions, so I hope somebody can help me with this.

Code example to illustrate line of thought:

import numpy as np


def feature_mean_sd(ebm_global, feature: int, value: float):
    name = ebm_global.data()["names"][feature]
    if " x " in name:
        return 0, 0, name  # interactions have no bounds yet; return zeros

    vals = ebm_global.data(feature)["names"]
    means = ebm_global.data(feature)["scores"]
    upps = ebm_global.data(feature)["upper_bounds"]

    i = np.searchsorted(vals, value)
    if i >= len(vals) - 1:  # clamp when the new value lies above the X_train range
        i = len(vals) - 2
    sd = upps[i] - means[i]  # upper bound minus mean as a proxy for one SD

    return means[i], sd, name


def draw_feature_values(ebm_global, feature, value, size=1):
    mean, sd, name = feature_mean_sd(ebm_global, feature, value)

    return np.random.normal(mean, sd, size), name


def pred_with_sd(ebm_global, intercept, feature_values):
    size = 16  # number of Monte Carlo draws per prediction

    arr = np.full(size, float(intercept))
    for i, f in enumerate(feature_values):
        arr += draw_feature_values(ebm_global, i, f, size=size)[0]

    return arr.mean(), arr.std()


ebm_global = ebm.explain_global(name='EBM')
intercept = ebm.intercept_[0]

samples_range, samples_sd = [], []
for i in range(0, 100):
    feature_values = X_test.iloc[i, :].values
    samples = pred_with_sd(ebm_global, intercept, feature_values)
    print(f"{i}: samples: mean = {samples[0]:.3f}, sd = {samples[1]:.3f}, y_true = {y_test[i]}")
    samples_range.append(samples[0])
    samples_sd.append(samples[1])

print(f"Lowest mean: {min(samples_range):.3f}, highest mean: {max(samples_range):.3f}")

import matplotlib.pyplot as plt
plt.scatter(samples_range, samples_sd, c=y_test[:100] + 1, alpha=0.5)
plt.show()

Results with KDD2015 dropout data (64 columns selected using lightgbm's feature importance):

...
samples: mean = 3.099, sd = 0.051, y_true = True
samples: mean = 3.038, sd = 0.100, y_true = True
samples: mean = 2.603, sd = 0.079, y_true = True
samples: mean = 3.169, sd = 0.056, y_true = True
samples: mean = 0.413, sd = 0.623, y_true = False
samples: mean = -1.751, sd = 0.220, y_true = False
samples: mean = 1.611, sd = 0.165, y_true = True
samples: mean = 2.808, sd = 0.101, y_true = True
samples: mean = 3.055, sd = 0.091, y_true = True
samples: mean = 2.153, sd = 0.157, y_true = True
Lowest mean: -3.615, highest mean: 3.289

Mean vs. SD: lower prediction scores come with higher uncertainty. This could be right, given the lower number of cases at lower predictions:

[image: scatter plot of prediction mean vs. SD, colored by y_true]

Last question: where is the function for the final transformation stored?

interpret-ml (Collaborator) commented May 19, 2021

Hi @bverhoeff,

This is a great topic!

  • Is a standard deviation, calculated by bagging the data 8 times, training 8 models, and averaging them (as you do), representative of the distribution of uncertainty of the feature scores for a given feature at given feature values (e.g. so that I can draw samples from that distribution)?

Unfortunately, uncertainty estimates derived by bagging are not perfect confidence intervals. Bagging can underestimate the true variance of the feature scores, especially in regions with relatively small sample sizes. However, we believe it can give reasonable hints about how the model might behave under shifts in the training data distribution. I'd highly recommend bumping up the number of bags (controlled by the outer_bags parameter) to something like 30 or 50 if you can afford the extra computation time, to get better estimates of this bagged uncertainty. We set it to 8 by default as a reasonable tradeoff between compute speed, accuracy, and intelligibility, but for more serious applications increasing this number is recommended.
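To see why more bags help, here is a minimal plain-NumPy sketch (it bootstraps the mean of a toy sample rather than training EBMs, so it only illustrates the general behavior, not interpret's internals): the SD across per-bag estimates is itself a noisy quantity, and it stabilizes as the number of bags grows.

```python
import numpy as np

def bagged_sd(data, n_bags, stat=np.mean, seed=0):
    """Estimate the uncertainty of a statistic via bagging: resample the
    data with replacement n_bags times and take the SD of the per-bag
    statistics."""
    rng = np.random.default_rng(seed)
    stats = [stat(rng.choice(data, size=data.size, replace=True))
             for _ in range(n_bags)]
    return float(np.std(stats))

rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=500)

# With only 8 bags the SD estimate is itself noisy; with 50 bags it
# settles closer to the analytic standard error sigma/sqrt(n) ≈ 0.045.
for n_bags in (8, 50):
    print(n_bags, round(bagged_sd(data, n_bags), 3))
```

The same logic applies to EBM feature scores: each bag gives one estimate of the score at each bin, and the cross-bag SD is only as reliable as the number of bags behind it.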

  • Currently there are no upper and lower bounds for interaction terms in the internal object. Would it be difficult to add these?

Good point! We left these off the internal object because it isn't clear how to visualize uncertainty estimates of interaction terms. However, it's easy for us to calculate them and include them in the data structures, even if they don't get visualized today. We'll be sure to include this in a future release.

  • If we draw samples from each feature graph's distribution at a case's feature values and use them to calculate a prediction, and if we repeat this a number of times, could we determine the mean ± SD of the prediction for that case as well? And would that be a correct way to determine the uncertainty of the predictions? ...would the method above be mathematically correct?

This is a tricky question. I believe this requires making fairly strong assumptions about the uncertainty distribution on each learned value. It's probably reasonable to assume it is Gaussian with the mean and standard deviations stored by the model, but I'm not completely sure about this. The other (possibly bigger) concern is that by drawing samples from each feature independently, this method implicitly assumes independence of the error distributions across features, which might lead to miscalibrated estimates.
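The independence concern is easy to demonstrate numerically. The toy sketch below (illustrative only, not EBM internals) compares the SD of a sum of two correlated error terms against the SD obtained when the same marginals are sampled independently: independent sampling drops the covariance term and underestimates the spread.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two per-feature error terms with positively correlated errors, e.g.
# features whose bagged scores tend to move together.
cov = [[1.0, 0.8], [0.8, 1.0]]
joint = rng.multivariate_normal([0.0, 0.0], cov, size=100_000)
sd_joint = joint.sum(axis=1).std()

# Independent sampling keeps each marginal N(0, 1) but discards the
# correlation, so the variance of the sum loses the 2*cov term.
indep = np.column_stack([rng.normal(0, 1, 100_000),
                         rng.normal(0, 1, 100_000)])
sd_indep = indep.sum(axis=1).std()

# Analytically: sqrt(1 + 1 + 2*0.8) ≈ 1.90 vs sqrt(1 + 1) ≈ 1.41
print(round(sd_joint, 2), round(sd_indep, 2))
```

With negatively correlated errors the bias flips: independent sampling would then overestimate the prediction SD.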

One other option we're thinking about is allowing users to keep each of the bagged models as a property of the EBMs. That way you can empirically derive prediction intervals by using each model to make a prediction and calculating a range. This is very close to the approach you were suggesting, but doesn't make the independence or distribution assumptions -- in this case, we'd have access to each of the base models to generate samples directly. You could then do something like this:

import numpy as np

def pred_from_base_models(ebm, instances):
    binned_inst = ebm.preprocessor_.transform(instances)

    preds_per_bag = np.zeros(shape=(instances.shape[0], ebm.outer_bags))
    for bag_index, core_ebm in enumerate(ebm.bagged_models_):
        output = np.zeros(shape=(instances.shape[0]))
        for i in range(instances.shape[0]):
            term_contribs = [None] * instances.shape[1]
            for feature, bin_index in enumerate(binned_inst[i]):
                term_contribs[feature] = core_ebm.model_[feature][bin_index]

            output[i] = np.sum(term_contribs)
        preds_per_bag[:, bag_index] = output

    # mean and stdev across bags for each instance
    return np.c_[np.mean(preds_per_bag, axis=1), np.std(preds_per_bag, axis=1)]

The downsides are that this would make the models much larger (we throw away each mini bag currently), and generating prediction intervals would be significantly slower than just making predictions. However, it might be reasonable to include as an option for users who are willing to pay that cost.

I ran some experiments comparing prediction intervals derived through both methods on the Adult income dataset. Here's the distribution of prediction intervals for both methods, where the blue trace ("uncertainties") is the approach you mentioned, and the red trace ("base preds") is where the intervals are calculated with each of the base models.

[image: overlaid histograms of prediction-interval widths for the two methods on the Adult income dataset]

In this case, they're fairly close! However, when testing on another medical dataset, the differences become more apparent:

[image: the same comparison on a medical dataset, where the two distributions diverge more clearly]

It's not clear if either approach is perfect, but we believe option 2 tends to give better estimates, since it makes fewer assumptions. Would this be useful for you? We're still actively researching this, and would love to keep the discussion going!

-InterpretML Team

bverhoeff (Author) commented May 19, 2021

While looking at the forest I didn't see the tree! Obviously your solution looks far more promising, both visually and conceptually. I wonder how we could validate that. Would you know of a correct Bayesian method to estimate the standard deviations of the predictions, so they can be compared with the bagging results? That would be a nice paper!

By the way: standard errors of interactions would still be interesting, especially for local explanations. E.g. the horizontal bar chart local explanation might be more informative with error bars for each feature.

Regarding computation and storage cost: the higher the stakes, the weaker that argument. One EBM as a pkl is ~0.5-1 MB; a 50 MB pkl shouldn't be a problem.

@onacrame
A probabilistic estimation would be of immense value. I'm in financial services, where people in the business are always wary of point estimates and ask the question, "how confident are we of this prediction?". We've experimented with quantile regression within the usual GBDT packages (highly inefficient as you can only model two quantiles at a time) and also looked at NGBoost, which is not particularly fast.

@interpretml interpretml deleted a comment from rodrigovssp May 31, 2021
@paulbkoch (Collaborator)

@bverhoeff & @onacrame -- As of release v0.3.0, we now store the individual bagged models inside the EBM under the bagged_scores_ attribute, so it should now be possible to implement the pred_from_base_models function above. I will add this to our backlog, since exposing a built-in version would be useful to a wider audience.
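A rough sketch of how per-bag predictions could be assembled from such per-term score tables (the shapes below are illustrative; the actual layout of bagged_scores_ in interpret may differ, so treat this as pseudocode against a hypothetical structure):

```python
import numpy as np

def preds_per_bag(bagged_scores, intercept, bin_indices):
    """Per-bag additive predictions for one instance.

    bagged_scores: list over terms; each entry is an array of shape
      (n_bags, n_bins) holding that term's score per bag and bin
      (an assumed layout mirroring the idea of bagged_scores_).
    bin_indices: the instance's bin index for each term.
    """
    n_bags = bagged_scores[0].shape[0]
    preds = np.full(n_bags, float(intercept))
    for term, b in enumerate(bin_indices):
        preds += bagged_scores[term][:, b]  # add this term's per-bag scores
    return preds

# Toy model: 2 terms, 4 bags, 3 bins per term.
rng = np.random.default_rng(1)
scores = [rng.normal(size=(4, 3)) for _ in range(2)]
p = preds_per_bag(scores, intercept=0.5, bin_indices=[0, 2])
print(p.mean(), p.std())  # prediction interval from the spread across bags
```

The mean/SD pair at the end is exactly what pred_from_base_models returns per instance, just computed from stored score tables instead of full model objects.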

@paulbkoch paulbkoch added the enhancement New feature or request label Jan 30, 2023
@paulbkoch paulbkoch mentioned this issue Jan 30, 2023
@joshdunnlime

  • A probabilistic estimation would be of immense value. I'm in financial services, where people in the business are always wary of point estimates and ask the question, "how confident are we of this prediction?". We've experimented with quantile regression within the usual GBDT packages (highly inefficient as you can only model two quantiles at a time) and also looked at NGBoost, which is not particularly fast.

Here are some examples of other packages doing distribution learning. These can be used to get any quantile.

XGBoost-Distribution (like NGBoost, but faster): https://github.com/CDonnerer/xgboost-distribution
LightGBMLSS: https://github.com/StatMixedML/LightGBMLSS
Also XGBoostLSS: https://github.com/StatMixedML/XGBoostLSS

Additionally, XGBoost and CatBoost now support multi quantile predictions.

EBM would really benefit from custom losses, or even better, an out-of-the-box multi-pinball loss 😍
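For reference, the pinball (quantile) loss mentioned above is a one-liner; here is a minimal NumPy definition (not part of interpret, just the standard formula):

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss: asymmetric absolute error whose minimizer
    is the q-th conditional quantile. At q=0.5 it equals half the MAE."""
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

y = np.array([1.0, 2.0, 3.0, 4.0])
# At q=0.9, under-prediction is penalized 9x more than over-prediction,
# so the optimal constant prediction sits near the upper end of y.
print(pinball_loss(y, np.full(4, 2.5), q=0.9))
```

Fitting one model per target quantile with this loss is what the LSS/multi-quantile packages above automate.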
