vignette.qmd

---
title: "Vignette: Penalized Regression Techniques"
author: Aleksander Cichosz, Anni Li, Brian Che, Justin Vo, Noa Rapoport
format: html
editor: visual
execute:
  message: false
  warning: false
  echo: true
  cache: true
---

## Introduction

Penalized regression is an alternative to the standard linear model that allows us to create a regression model that is penalized by adding a constraint in the equation. This allows us to reduce the coefficent values towards zero, allowing the less contributive variables to have a coefficient close or equal to zero.

**Objectives**

In this vignette you'll learn how to implement:

-   Ridge Regression
-   LASSO Regression
-   Elastic Net Regression

**Setup**\
First, we need to load all the required packages and import the data

```{r, echo = TRUE}
# load packages
library(tidyverse)
library(tidymodels)
library(modelr)
library(rsample)
library(yardstick)
library(DataExplorer)
library(Matrix)
library(glmnet)
library(rsample)
library(dplyr)
library(caret)

# read data
url  <- 'https://raw.githubusercontent.com/pstat197/vignette-regression/main/data/life_clean.csv'

life_clean <- read.csv(url)
```

**About Our Dataset: Life Expectancy (WHO)**

## Exploratory Data Analysis

For the project, we wanted to find a data set where analyzing the regression between explanatory and response variables would have an interesting and valuable result. For this reason we decided to analyze the WHO Life Expectancy data-set and look at the relationship between factors that affect one's health such as country, vaccinations given in that said country, infant mortality rate in the country, etc., and see how these factors might explain the life expectancy. In this data set, there are 2938 observations and 22 measurements, one being life expectancy. The main purpose of this data set is to further analyze how the developments in the health sector have contributed to the huge improvement in human mortality rates.

## Data Partitioning

```{r}
# for consistency across linear, ridge-regression, and LASSO
set.seed(13)

# partitioning data into training and test set
partitions <- life_clean %>%
  initial_split(prop = 0.8)

# creating training set
train <- training(partitions)
train %>% head(4)

# creating test set
test <- testing(partitions)
test %>% head(4)
```

## Simple Linear Regression

We will use a simple linear regression model as a basis for comparing all other models. This base model has the life expectancy as a linear function of all the other variables in the data set. Linear regression works on the premise of minimizing the squared residuals of your predictions.

$$argmin\sum_{i=1}^{n}{(y_i - \sum_{j}{x_{ij}\beta_j})^2}$$

```{r}
# Model fitting on training set
lin_mod <- lm(Life.expectancy ~ ., data = train)

# Predictions for test set based on fitted model
pred_lin <- predict(lin_mod, newdata = test)
```

To evaluate the model performance, we will look at the $R^2$ and the Test Mean Squared Error (MSE).

```{r}
# R Squared
rss <- sum((pred_lin - test$Life.expectancy)^ 2)
tss <- sum((test$Life.expectancy - mean(test$Life.expectancy)) ^ 2)
rsq <- 1 - rss/tss
rsq

# Mean Squared Error
test_mse <- mean((pred_lin - test$Life.expectancy)^2)
test_mse
```

#### Results

Here, we see that our base linear regression model has a $R^2$ value of 0.829 and a test MSE of 12.907. We will use these as a basis for comparing the performance of the penalized regression models.

## Ridge Regression

#### Background

Ridge regression is a version of regression where we add an additional penalty term. The most popular form of ridge regression that we will be referring to is L2 regularization based on Tikhonov regularization. L2 regularization is best used when you have many many parameters and highly correlated parameters. This issue of multicollinearity in linear regression can return biased results. Multicollinearity will usually inflate the standard error of terms, resulting in inaccurate P values. Estimates will be biased and cannot be trusted. Ridge regression adds a penalization factor to linear regression. This penalization factor is the ridge parameter and is the tuning for ridge regression. Overall, ridge regression replace our residual estimation with residual plus a penalization term. So when we are minimizing our residuals the penalization term comes into effect. This penalization term comes from our betas, and a self selecting lambda . This lamda will be a larger penalty on larger $\beta_i$ . This in turns means we prefer smaller $\beta_i$ , in turn reducing variance. This helps with multicollinearity estimates tend to be unbiased but have large estimates with large variances. The general downside is that as we reduce variance we are introducing bias in our model.

$$
argmin\sum_{i=1}^{n}{(y_i - \sum_{j}{x_{ij}\beta_j})^2} + \lambda \sum_{j=1}^{p}\beta^2_j
$$

The package we will be working with for ridge regression is glmnet. There are some requirements for data formatting for this package. This means that our response variable will have to be in a vector. Our explanatory variables will have to be in a matrix. We will reformat the training and testing set objects we created earlier to achieve this.

```{r}
# splitting predictors and response variable in training set
train_x <- train %>%
  select(-Life.expectancy) %>%
  as.matrix()
train_y <- train %>%
  pull(Life.expectancy)

# splitting predictors and response variable in test set
test_x <- test %>%
  select(-Life.expectancy) %>%
  as.matrix()
test_y <- test %>%
  pull(Life.expectancy)
```

Unlike linear regression, ridge regression is a regularized regression model that uses lambda as our tuning parameter, meaning we are going to try to find an optimal lambda as usually done by brute force.

The *glmnet()* function trains the model multiple times for different values of lambda, passing them through its argument as a sequence of vector as mentioned before. Here, we use cross-validation to get the best model, which we do by using the *cv.glmnet()* function to automate the process of identify the optimal value of lambda that will result in a minimum error.

```{r}
# Setting the range of lambda values
lambda_seq <- 10^seq(2, -3, by = -.1)

# Using glmnet function to build the ridge regression in r
fit <- glmnet(train_x, train_y, alpha = 0, lambda  = lambda_seq)

# Checking the model
summary(fit)

# Using cross validation glmnet
ridge_cv <- cv.glmnet(train_x, train_y, alpha = 0, lambda = lambda_seq)

# Acquire best lambda value
best_lambda <- ridge_cv$lambda.min
best_lambda
```

The optimal lambda value, which produces the lowest test mean squared error (MSE), comes out to be 0.01 and will be used to build the ridge regression model for which we pass it through the lambda value.

```{r}
# extract the best model using K-cross validation
best_fit <- ridge_cv$glmnet.fit

# Rebuilding the model with optimal lambda value 0.01
best_ridge <- glmnet(train_x, train_y, alpha = 0, lambda = 0.01)

# Checking the coefficients
coef(best_ridge)
```

We then compute the $$R^2$$ and the Test Mean Squared Error (MSE) from the True and Predicted values.

```{r}
# apply prediction model to test_x
pred <- predict(best_ridge, s = best_lambda, newx = test_x)

# use prediction function and R squared formula to compute R^2 value
actual <- test_y
rss <- sum((pred - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss
rsq # 0.8278766

# Mean Squared Error
test_mse <- mean((pred - actual)^2)
test_mse
```

#### Results

The R-squared value for the ridge regression model on the testing data outputs to be **82.79%**, which is a **slight decrease in performance** compared with the linear regression model at a R-squared value of **82.90%.** The MSE for our test partition set is **12.995**, which is slightly higher than the the test MSE of the linear regression model.

## LASSO Regression

#### Background

LASSO (Least Absolute Shrinkage and Selection Operator) regression is a type of regression that shrinks the coefficients of the base linear regression model. It has the effect of forcing some of the coefficient estimates to be exactly zero and performs variable selection. Similar to Ridge, LASSO is well suited for datasets where multicollinearity may be present.

LASSO adds a penalty by performing L1 regularization, which penalizes the sum of absolute values of the coefficients. This can be seen in the function below, which we want to minimize:

$argmin\sum_{i=1}^{n}{(y_i - \sum_{j}{x_{ij}\beta_j})^2} + \lambda \sum_{j=1}^{p}|\beta_j|$

Lambda is the tuning parameter for LASSO regression. We will find the optimal lambda value through cross validation. We will use the glmnet package for LASSO jsut like ridge. We will also use the same training and test set that we created in the ridge regression section.

```{r}
# lambda values we will choose from
lambdas <- 10^seq(2, -3, by = -.1)

# Perform cross validation to find best value of lambda
lasso_reg <- cv.glmnet(train_x, train_y, alpha = 1, lambda = lambdas)

# Best lambda value
best_lambda <- lasso_reg$lambda.min

best_lambda
```

We now fit the model by specifying the optimal lambda, and alpha = 1 which corresponds to LASSO regression.

```{r}
# Model fit on training set and optimal lambda value
lasso_mod <- glmnet(train_x, train_y, alpha = 1, lambda = best_lambda, standardize = TRUE)

# Checking the coefficients
coef(lasso_mod)
```

We will now apply the model to the test set to come up with a set of test predictions.

```{r}
# Apply fitted model on test set
pred_lasso <- predict(lasso_mod, s = best_lambda, newx = test_x)
```

Using these predicted test values and actual test values, we will compute the test MSE to see how our model performed.

```{r}
# R Squared
actual <- test_y
rss <- sum((pred_lasso - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss
rsq 

# Test Mean Squared Error
mse <- mean((pred_lasso-actual)^2)
mse
```

#### Results

The R-squared value for the LASSO regression model on the testing data outputs to be **82.88%**, which is a **slight decrease in performance** compared with the linear regression model at a R-squared value of **82.90%.** The MSE for our test partition set is **12.918**, which is slightly higher than the the test MSE of 12.907 of the linear regression model.

## Elastic Net Regression

#### **Background**

Elastic Net Regression is a type of regression that linearly combines the L1 and L2 penalties of lasso and ridge regression to optimize the model. By using both the L1 and L2 penalties we change the loss of the model to account for large coefficients, while also penalizing the model based on the sum of the absolute coefficient values. The formula for elastic net penalty is: $elasticnetpenalty = (\alpha * l_1penalty) + ((1 – alpha) * l_2penalty)$. We choose alpha based on how much contribution of each penalty we want the loss function to have. For example, if we fully want just L1 penalty, we would say alpha = 1, to get rid of the L2 penalty.

The loss function is $elasticnetloss = loss + (\lambda * elasticnetpenalty)$. Lambda is another hyperparameter that is mentioned previously and that controls the weighting of the sum of both penalties to the loss function.

Overall, the elastic net model solves the problem:

$min_{\beta_0,\beta}(\frac{1}{2N}\sum_{i=1}^{N}(y_i−\beta_0−x_i^T\beta)^2+\lambda P_\alpha(\beta))$ where $P_\alpha(\beta) = \sum_{j=1}^{p}\frac{1-\alpha}{2}\beta_j^2+\alpha|\beta_j|$

```{r, results='hide'}
# remove some categorical variables
df <- life_clean
x <- df %>% select(-c(Life.expectancy))

# Model Building : Elastic Net Regression
control <- trainControl(method = "repeatedcv",
                        number = 5,
                        repeats = 5,
                        search = "random",
                        verboseIter = TRUE)

# Training ELastic Net Regression model
# Train model to find optimum lambda and alpha values
elastic_model <- train(Life.expectancy ~ .,
                       data = df,
                       method = "glmnet",
                       preProcess = c("center", "scale"),
                       tuneLength = 25,
                       trControl = control)
```

Based on the model training, we choose an alpha value of .694, and a lambda value of .00251 to train our model meaning we are giving a larger weight to the l1 penalty rather than the l2.

```{r}
# Model Prediction
x_hat_pre <- predict(elastic_model, x)

# Multiple R-squared
rsq <- cor(df$Life.expectancy, x_hat_pre)^2
rsq

# Mean Squared Error
mse <- mean((x_hat_pre - df$Life.expectancy)^2)
mse
```

#### Results

The R-squared value for the elastic net regression model on the testing data outputs to be **83.51%**, which is a **increase decrease in performance** compared with the linear regression model at a R-squared value of **82.90%.** The MSE for our test partition set is **12.75**, which is slightly lower than the the test MSE of the linear regression model. This shows that the elastic net model slightly improves both metrics compared to the rest of our regression models.

## Conclusion

After analyzing each model, it can be concluded that the elastic net regression model is the most effective when it comes to this data set, and analyzing the regression between the explanatory variables and the life expectancy. This is most likely due to the fact that many of the variables in our data are highly correlated which is a factor that pushes for the use of elastic net. Overall, it is not too surprising that lasso had a smaller R\^2, as it is possible that by attempting to produce a more simple and interpretative model, it sometimes misses important variables when selecting which variables to choose, especially when pairwise correlations among some variables are pretty high. The performance of the models is summarized below:

1.  Linear Regression Model: Test set RMSE of 12.907 and R-square of 82.9%.

2.  Ridge Regression Model: Test set RMSE of 12.995 and R-square of 82.79%.

3.  Lasso Regression Model: Test set RMSE of 12.918 and R-square of 82.88%.

4.  ElasticNet Regression Model: Test set RMSE of 12.75 and R-square of 83.51%.