-
Notifications
You must be signed in to change notification settings - Fork 0
/
CHAPTER 4_REGRESSION IN A NUTSHELL.Rmd
72 lines (43 loc) · 4.7 KB
/
CHAPTER 4_REGRESSION IN A NUTSHELL.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
---
title: "CHAPTER 4_REGRESSION IN A NUTSHELL"
author: "Silvia Antón"
date: "`r Sys.Date()`"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
```{r cars}
summary(cars)
```
The main motivation behind regression is to build an equation by which we can learn more about our data. Choosing between a logistic regression, linear regression, or multivariate regression model depends on the problem and the data that you have. you might recall in your high school mathematics education about having a couple points of data and fitting a line through it. This fit to data is the easiest form of machine learning and is used often without realizing it is a type of machine learning.
#LINEAR REGRESSION
In chapter 1, we briefly encountered linear regression with an example of the mtcars dataset. In that example, we determined a linear relationship of fuel efficiency as a function of vehicle weight and saw the trend go downward. We extracted coefficients for a linear mathematical equation onto a bunch of data and calling it a day. Let's revisit our mtcars example:
```{r}
model<-lm(mtcars$mpg~mtcars$disp)
plot(y=mtcars$mpg,x=mtcars$disp,xlab="Engine Size(cubic inches)", ylab="Fuel Efficiency (Miles per Gallon)", main="Fuel Efficiency From the 'mtcars dataset")
abline(a=coef(model[1]), b=coef(model)[2],lty=2)
```
Let's revisit our mtcars example, where we model the fuel efficiency(mpg) as a function of engine size(disp) and then look at the outputs of the model with the summary function:
```{r}
summary(model)
```
There is a wealth of information dumped out from the summary() function call on this linear model object. Generally, the one number people will typically look at to get a baseline accuracy assessment is the multiple R-squared value. The closer that value is to 1, the more accurate the linear regression model is. There are a lot of other terms in this output, though, so let's walk through each element to gain a solid understanding:
Call: This displays the formulaic function call we used. In this case, we used one response variable, mpg, as a function of one dependent variable, disp, both of which were being called from the mtcars data frame.
Residuals: Residuals are a measure of vertical distance from each data point to the fitted line in our model. In this case, we have summary statistics for all of the vertical distances for all of our points relative to the fitted line. The smaller this value is, the better the fit is.
Coefficients: These are the estimates for the coefficients of our linear equation. Our equation in this case would be Y=0.04x+29.59.
Std. Error: Those coefficients come error estimates as given b the Std. Error part of the coefficients table. In reality, our equation would be something like Y=(-0.04+-0.005)x+(29.59+-1.23).
t-value:This is the measurement of the difference relative to the variation of our data. This vale is linked with p-values, but p-values are used far more frequently.
p-value: p-values are statistical assessments of significance. The working of p-values are more complicated than that, but for our purposes a p-value being of value less than 0.05 means that we can take the number as statistically significant. If the number in question has p-value greater than 0.05, we should err on the side of not being statistically significant. The ratings next to them are explained by the significance codes that follow.
Residual standard error:
This error estimate pertains to the standard deviation of our data.
Multiple R-squared:
This is the R-squaed value for when we have multiple predictors. This is totally relevant for our linear exampl, but when we add more predictions to the model, invariably our multiple R-squared will go up. This is becasus some features we add to the model will explain some part of the variance, whether its true or not.
Adjusted R-squared:
To counteract the biases introduced from having a constantly incresing squared value with more predictors, the adjusted R-squared tends to be a better representation of a model's accuracty when there's multiple features.
F-statistic:
Finally, the F-statistic is the ratio of the variance explained by parameters in the model and the unexplained variance.
The simple linear example has some decent explanatory power.