---
title: "Classification"
author: "Fatih Emre Ozturk"
date: "2023-02-06"
output: html_document
---
```{r message=FALSE, warning=FALSE, include=FALSE}
library(caret)
library(tidyverse)
library(magrittr)
library(olsrr)
library(car)
library(corrplot)
library(ISLR)
library(Hmisc)
library(dplyr)
library(ModelMetrics)
library(lmtest)
library(moments)
library(bestNormalize) # normalization
library(MASS)
library(psych)
library(mvnTest) # perform multivariate normality test
library(tree) # perform regression and decision tree
library(randomForest) # perform random forest
library(rpart) # performing regression trees
library(rpart.plot) # plotting regression trees
library(ipred) # bagging
library(kmed)
library(klaR)
library(e1071)
library(gridExtra)
library(ggalt)
library(ROCR)
library(MVN)
library(tinytex)
```
```{r include=FALSE}
df <- get(data("heart", package = "kmed"))
# The dependent variable is collapsed to two levels: 0 for healthy, 1 for heart disease
df %<>% mutate(class = ifelse(df$class == 0, 0,1))
df2 <- df
# required transformations
str(df)
df$sex <- as.numeric(df$sex)
df$sex <- as.factor(df$sex)
df$fbs <- as.numeric(df$fbs)
df$fbs <- as.factor(df$fbs)
df$exang <- as.numeric(df$exang)
df$exang <- as.factor(df$exang)
df$ca <- as.factor(df$ca)
df$class <- as.factor(df$class)
# after transformation
str(df)
sum(is.na(df))
# there are no NAs in the dataset
```
## Descriptive Statistics
```{r echo=FALSE}
summary(df)
```
When we examine the descriptive statistics of the numerical values in the data set:
- The mean of the age variable is lower than the median, which indicates that the variable is left-skewed. Given the gap between the first quartile and the minimum value, extreme values may be present.
- The mean of the trestbps variable is slightly larger than the median, which indicates right skew. The quartiles and min-max values suggest there may be outliers.
- The mean of the chol variable is larger than the median, which indicates right skew. The min-max values suggest there may be outliers.
- The median of the thalach variable is larger than the mean, which indicates left skew. The gap between the quartiles and the min-max values suggests there may be outliers.
- Boxplots will be used to check for outliers, and histograms to get a general picture of the distributions.
When the descriptive statistics of the categorical variables in the data set are examined:
- For the sex variable, the majority of observations are male.
- For the cp variable, the majority had asymptomatic chest pain.
- For the fbs variable, the majority had fasting blood sugar below 120 mg/dl.
- For the restecg variable, most observations had normal or probable electrocardiographic results; very few had abnormal results.
- For the exang variable, the majority did not have exercise-induced angina.
- For the slope variable, the slope of the exercise ST segment was flat in the majority of observations.
- For the ca variable, the majority of observations took the value 0.
- For the thal variable, most observations fell into the normal and reversible-defect levels.
- For the dependent variable class, 160 people had heart disease and 137 did not.
### Data Visualization
```{r echo=FALSE}
par(mfrow = c(1,5), bty = "n")
boxplot(df$age, col = "goldenrod1", main = "Age", border = "firebrick3")
boxplot(df$trestbps, col = "goldenrod1" ,main = "Trestbps", border = "firebrick3")
boxplot(df$chol, col = "goldenrod1", main = "Chol", border = "firebrick3")
boxplot(df$thalach, col = "goldenrod1", main = "Thalach", border = "firebrick3")
boxplot(df$oldpeak, col = "goldenrod1", main = "Oldpeak", border = "firebrick3")
```
When the box plots of the numerical variables are analyzed:
- There are no outliers in the age variable; the left skew is again noticeable, and the range is quite wide.
- The trestbps variable shows many outliers.
- 5 outliers are detected in the chol variable.
- 1 outlier is detected in the thalach variable.
- 4 outliers are observed in the oldpeak variable.
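The outlier counts above follow the usual boxplot whisker convention; a minimal sketch of the 1.5 × IQR rule (the helper name `count_outliers` and the toy vector are illustrative, not from the report — `df$chol` etc. could be passed in their place):

```r
# 1.5*IQR rule used by boxplot whiskers: anything outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is drawn as an outlier point
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}

x <- c(rep(200, 20), 600)  # toy vector with one extreme value
count_outliers(x)          # 1
```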
```{r echo=FALSE}
indexes = sapply(df, is.numeric)
indexes["class"] = TRUE
df[,indexes]%>%
gather(-class, key = "var", value = "value") %>%
ggplot(aes(x = value, y = class, color = class)) +
geom_boxplot() +
facet_wrap(~ var, scales = "free")+
theme(axis.text.x = element_text(angle = 30, hjust = 0.85),legend.position="none",
panel.background = element_rect(fill = "white"))+
theme(strip.background =element_rect(fill="goldenrod1"))+
theme(strip.text = element_text(colour = "firebrick3"))
```
When the box plots of the numerical variables according to the levels of the dependent variable are analyzed:
- Individuals who did not have a heart attack show values over a wider range.
- The average age of individuals who had a heart attack is higher than that of those who did not.
- Interestingly, there is no noticeable change for the variable containing cholesterol information according to the levels of the class variable.
- It is also interesting to note that the individual with maximum cholesterol did not have a heart attack.
- When the oldpeak variable, which contains information on ST depression caused by exercise compared to rest, was examined, it was found that individuals who had a heart attack had higher values.
- When the thalach variable, which includes the maximum heart rate reached, was examined, it was found that individuals who did not have a heart attack reached a higher heart rate. While it was found that the observations who had a heart attack had a wider range, it was also found that they had lower values.
- When the variable trestbps, which includes resting blood pressure information, is analyzed, there is no difference between the averages of those who had a heart attack and those who did not. However, it can be said that those who had a heart attack had slightly higher values.
### Train - Test Separation
```{r}
set.seed(2021900444)
train_indices <- sample(2, size=nrow(df), replace = TRUE, prob=c(0.7,0.3))
train <- df[train_indices==1, ]
test <- df[train_indices==2, ]
```
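Because the `sample()` draw above ignores the class labels, the 0/1 balance can drift between train and test. A stratified alternative in base R (a sketch; `stratified_split` is a hypothetical helper, and the class counts 137/160 come from the summary above):

```r
# Sample a fixed proportion within each level of y, so the class
# balance of the full data is preserved in the training split
stratified_split <- function(y, p = 0.7, seed = 2021900444) {
  set.seed(seed)
  unlist(lapply(split(seq_along(y), y), function(idx) {
    sample(idx, size = round(p * length(idx)))
  }), use.names = FALSE)
}

y   <- factor(rep(c(0, 1), times = c(137, 160)))  # class counts from summary(df)
idx <- stratified_split(y)
prop.table(table(y[idx]))  # close to the 137/160 balance of the full data
```

caret's `createDataPartition()` provides the same stratified behavior out of the box.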
## Classification Tree
### Tree Package
```{r echo=FALSE}
treeclass <- tree(class ~ ., data = train)
summary(treeclass)
```
Examining the output of the first classification tree model:
- The tree was created using a total of 10 variables.
- The tree was created with a total of 18 terminal nodes.
- Residual mean deviance was 0.448.
- The error rate was 0.1005, which can be considered high.
```{r echo=FALSE, fig.width=10}
plot(treeclass)
text(treeclass, pretty = 0)
```
When the classification tree is analyzed:
- The root node splits on cp taking the values 1, 2, or 3.
- A split on thalach < 133.5 produces terminal nodes, but both resulting leaves predict the same class; the situation is similar for the other terminal nodes.
- ca taking the value 0 appears as one of the internal nodes.
- Most of the terminal nodes share the same predicted class, which underlines that the tree needs to be pruned.
#### Cross Validation
```{r echo=FALSE}
set.seed(2021900444)
cv.treeclass <- cv.tree(treeclass, FUN = prune.misclass)
plot(cv.treeclass$size, cv.treeclass$dev, type = "o", col = "firebrick3", bty = "l", ylab = "Deviance", xlab = "Size")
```
The plot of deviance against the number of terminal nodes shows that the minimum deviance occurs at 10 terminal nodes.
For this reason, the tree will be pruned to 10 terminal nodes.
```{r}
prune.treeclass1 <- prune.misclass(treeclass, best = 10)
summary(prune.treeclass1)
```
When the output of the classification tree model after pruning is examined:
- The tree was created using a total of 7 variables.
- The tree was created with a total of 10 terminal nodes.
- Residual mean deviance is 0.6314, which is higher than before pruning.
- Error rate is 0.1053, which is slightly higher than before pruning.
```{r echo=FALSE}
plot(prune.treeclass1)
text(prune.treeclass1, pretty = 0)
```
When the classification tree is examined after pruning:
- The root node was determined as cp being 1, 2 and 3.
- Thal being 3 was identified as one of the terminal nodes.
- ca having a value of 0 was again determined as one of the internal nodes.
- Before pruning, most terminal nodes shared the same predicted class; after pruning this problem disappears.
#### Prediction of Trees Created with Tree Package
##### Metrics of the First Tree with Train Data
```{r echo=FALSE}
classtree.pred <- predict(treeclass, train, type = "class")
a <- caret::confusionMatrix(classtree.pred, train$class)
a
```
Accuracy Rate : 0.8995
Sensitivity : 0.9459
Specificity : 0.8469
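These three figures follow directly from the 2x2 confusion matrix; a sketch reconstructing them (the cell counts below are back-computed to be consistent with the rates above, using caret's convention that the first factor level, here "0", is the positive class):

```r
# rows = predicted, cols = actual; counts consistent with the rates above
cm <- matrix(c(105, 15,
                 6, 83),
             nrow = 2, byrow = TRUE,
             dimnames = list(pred = c("0", "1"), actual = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)        # correct / all       -> 0.8995
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # TP / actual "0"s    -> 0.9459
specificity <- cm["1", "1"] / sum(cm[, "1"])  # TN / actual "1"s    -> 0.8469
```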
```{r message=FALSE, warning=FALSE, include=FALSE}
ctpredictions <- data.frame()
ctpredictions[1,1] <- "Before Pruning CT"
ctpredictions[1,2] <- "Train"
ctpredictions[1,3] <- a$overall[1]
ctpredictions[1,4] <- a$byClass[1]
ctpredictions[1,5] <- a$byClass[2]
```
##### Test
```{r echo=FALSE}
classtree.predtest <- predict(treeclass, test, type = "class")
a <- caret::confusionMatrix(classtree.predtest, test$class)
a
```
Accuracy Rate : 0.75
Sensitivity : 0.8776
Specificity : 0.5897
```{r include=FALSE}
ctpredictions[2,1] <- "Before Pruning CT"
ctpredictions[2,2] <- "Test"
ctpredictions[2,3] <- a$overall[1]
ctpredictions[2,4] <- a$byClass[1]
ctpredictions[2,5] <- a$byClass[2]
```
##### First Pruned Tree Predictions
```{r echo=FALSE}
prunedtree.pred1 <- predict(prune.treeclass1, train, type = "class")
a <- caret::confusionMatrix(prunedtree.pred1, train$class)
a
```
Accuracy Rate : 0.8947
Sensitivity : 0.9550
Specificity : 0.8265
```{r include=FALSE}
ctpredictions[3,1] <- "First Prune CT"
ctpredictions[3,2] <- "Train"
ctpredictions[3,3] <- a$overall[1]
ctpredictions[3,4] <- a$byClass[1]
ctpredictions[3,5] <- a$byClass[2]
```
##### Test
```{r echo=FALSE}
prunedtree.predtest1 <- predict(prune.treeclass1, test, type = "class")
a <- caret::confusionMatrix(prunedtree.predtest1, test$class)
```
Accuracy Rate : 0.7614
Sensitivity : 0.8980
Specificity : 0.5897
```{r include=FALSE}
ctpredictions[4,1] <- "First Prune CT"
ctpredictions[4,2] <- "Test"
ctpredictions[4,3] <- a$overall[1]
ctpredictions[4,4] <- a$byClass[1]
ctpredictions[4,5] <- a$byClass[2]
```
### Rpart Package
The rpart() function performs cross-validation internally and automatically returns the pruned tree with the lowest error.
```{r}
treeclass2 <- rpart(class~., data = train, method = "class")
treeclass2$variable.importance
```
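The cross-validation behind this automatic pruning can be inspected through the complexity-parameter table stored in every rpart fit; a self-contained sketch on the built-in iris data (the heart model `treeclass2` would be inspected the same way):

```r
library(rpart)

# every rpart fit stores a CP table: one row per candidate subtree,
# with its cross-validated error (xerror) and standard error (xstd)
fit <- rpart(Species ~ ., data = iris, method = "class")
printcp(fit)   # columns: CP, nsplit, rel error, xerror, xstd
# plotcp(fit) visualizes the same table; the 1-SE rule picks the
# smallest tree whose xerror is within one xstd of the minimum
```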
In the variable importance ranking, cp is the most important variable, followed by thalach, thal, and exang.
```{r}
treeclass2$numresp
```
The tree was constructed using four independent variables.
```{r}
rpart.plot(treeclass2)
```
When the tree is examined, it is noteworthy that cp taking the values 1, 2, or 3 again forms the root node.
The internal nodes are the cases where ca equals zero and where slope is 1 or 3.
There are 6 terminal nodes in total.
Each class assignment is shown in a different color, and the shade of the color indicates the number of observations a node contains.
##### Train
```{r echo=FALSE}
prunedtree.pred3 <- predict(treeclass2, train, type = "class")
a <- caret::confusionMatrix(prunedtree.pred3, train$class)
a
```
Accuracy Rate : 0.866
Sensitivity : 0.9099
Specificity : 0.8163
##### Test
```{r echo=FALSE}
prunedtree.predtest3 <- predict(treeclass2, test, type = "class")
b <- caret::confusionMatrix(prunedtree.predtest3, test$class)
b
```
Accuracy Rate : 0.8295
Sensitivity : 0.9184
Specificity : 0.7179
```{r include=FALSE}
ctpredictions[5,1] <- "rpart CT"
ctpredictions[5,2] <- "Train"
ctpredictions[5,3] <- a$overall[1]
ctpredictions[5,4] <- a$byClass[1]
ctpredictions[5,5] <- a$byClass[2]
ctpredictions[6,1] <- "rpart CT"
ctpredictions[6,2] <- "Test"
ctpredictions[6,3] <- b$overall[1]
ctpredictions[6,4] <- b$byClass[1]
ctpredictions[6,5] <- b$byClass[2]
names(ctpredictions) <- c("Algorithm", "TT", "Accuracy_Rate", "Sensivity", "Specificity" )
```
### The Best Classification Tree
#### Accuracy Rate Comparison
```{r}
ctpredictions %>%
ggplot(aes(x= Accuracy_Rate, y= reorder(Algorithm, -Accuracy_Rate))) +
geom_line(stat="identity") +
geom_point(aes(color=TT), size=3) +
theme(legend.position="top") +
theme(panel.background = element_rect(fill="white"))+
xlab("Accuracy Rate") +
ylab("Algorithm")
```
When the accuracy rates of the first Tree-package model and the pruned tree models are compared on the train and test data, large differences between them are apparent.
This points to a possible overfitting problem.
Although the rpart model has the lowest accuracy on the train data, it achieves the highest accuracy on the test data, which is noteworthy.
#### Sensitivity Comparison
```{r}
ctpredictions %>%
ggplot(aes(x= Sensivity, y= reorder(Algorithm, -Sensivity))) +
geom_line(stat="identity") +
geom_point(aes(color=TT), size=3) +
theme(legend.position="top") +
theme(panel.background = element_rect(fill="white"))+
xlab("Sensitivity") +
ylab("Algorithm")
```
When the sensitivities of the first Tree-package model and the pruned tree models are compared on the train and test data, large differences between them are apparent.
This points to a possible overfitting problem.
Although the rpart model has the lowest sensitivity on the train data, it has the highest sensitivity on the test data.
Moreover, its sensitivity on the test data exceeds that on the train data, which is exactly what I want.
#### Specificity Comparison
```{r}
ctpredictions %>%
ggplot(aes(x= Specificity, y= reorder(Algorithm, -Specificity))) +
geom_line(stat="identity") +
geom_point(aes(color=TT), size=3) +
theme(legend.position="top") +
theme(panel.background = element_rect(fill="white"))+
xlab("Specificity") +
ylab("Algorithm")
```
When the specificities of the first Tree-package model and the pruned tree models are compared on the train and test data, large differences are again apparent, pointing to a possible overfitting problem.
Although the rpart model has the lowest specificity on the train data, it achieves the highest specificity on the test data.
For this reason, the model built with the **rpart package was selected as the best model** among the Classification Tree models.
This model will be used when comparing with other models.
## Bagging
### Random Forest Package
```{r}
set.seed(2021900444)
bag <- randomForest(class ~ ., data = train, mtry = 13, importance = TRUE)
bag
```
- A total of 500 trees were used in the model.
- 13 variables were used in each split.
- The OOB error rate was found to be 18.18%.
- While the class-zero error rate was 0.11, the class-one error rate rose to 0.25.
```{r}
varImpPlot(bag)
```
According to the variable importance plot, the most important variables by MeanDecreaseAccuracy are cp, ca, oldpeak, and thal.
By the Gini value, which reflects node purity, the important variables are cp, ca, oldpeak, and age.
### Model Building with ipred Package
When building the model with the bagging function included in the ipred package:
- nbagg controls how many bootstrap iterations to include in the model.
- coob = TRUE requests the OOB error rate.
- 10-fold cross-validation is applied inside the function via the trControl argument.
```{r}
bag2 <- bagging(
formula = class ~ .,
data = train,
nbagg = 500,
coob = TRUE,
method = "treebag",
trControl = trainControl(method = "cv", number = 10))
bag2$err
```
The OOB misclassification error rate is 0.177, which closely matches the result of the model built with the randomForest package.
```{r}
VI <- data.frame(var=names(train[,-14]), imp=varImp(bag2))
VI_plot <- VI[order(VI$Overall, decreasing=F),]
barplot(VI_plot$Overall,
names.arg=rownames(VI_plot),
horiz=T,
col="goldenrod1",
xlab="Variable Importance",
las = 2)
```
The variable importance plot here paints a different picture from the previous package's.
While ca and cp appeared to be the most important variables before, this time oldpeak stands out as the most important variable.
It can be said that oldpeak is followed by cp, ca, thal, and age.
### Predictions of the Models
#### Train
```{r echo=FALSE}
baggintrain <- predict(bag, train, type = "class")
a <- caret::confusionMatrix(baggintrain, train$class)
a
```
Accuracy Rate : 1
Sensitivity : 1
Specificity : 1
#### Test
```{r echo=FALSE}
baggintest <- predict(bag, test, type = "class")
b <- caret::confusionMatrix(baggintest, test$class)
b
```
Accuracy Rate : 0.7727
Sensitivity : 0.8571
Specificity : 0.6667
```{r include=FALSE}
bagpred <- data.frame()
bagpred[1,1] <- "bagmodel1"
bagpred[1,2] <- "train"
bagpred[2,2] <- "test"
bagpred[2,1] <- "bagmodel1"
bagpred[1,3] <- a$overall[1]
bagpred[2,3] <- b$overall[1]
bagpred[1,4] <- a$byClass[1]
bagpred[2,4] <- b$byClass[1]
bagpred[1,5] <- a$byClass[2]
bagpred[2,5] <- b$byClass[2]
```
#### ipred Package Predictions
```{r echo=FALSE}
baggintrain1 <- predict(bag2, train, type = "class")
a <- caret::confusionMatrix(baggintrain1, train$class)
a
```
Accuracy Rate : 1
Sensitivity : 1
Specificity : 1
```{r echo=FALSE}
baggintest1 <- predict(bag2, test, type = "class")
b <- caret::confusionMatrix(baggintest1, test$class)
b
```
Accuracy Rate : 0.8068
Sensitivity : 0.8776
Specificity : 0.7179
```{r include=FALSE}
bagpred[3,1] <- "ipredmodel"
bagpred[3,2] <- "train"
bagpred[4,2] <- "test"
bagpred[4,1] <- "ipredmodel"
bagpred[3,3] <- a$overall[1]
bagpred[4,3] <- b$overall[1]
bagpred[3,4] <- a$byClass[1]
bagpred[4,4] <- b$byClass[1]
bagpred[3,5] <- a$byClass[2]
bagpred[4,5] <- b$byClass[2]
names(bagpred) <- c("Algorithm", "TT", "Accuracy_Rate", "Sensivity", "Specificity" )
```
### Choosing the Best Bagging Model
#### Accuracy Rate Comparison
```{r}
bagpred %>%
ggplot(aes(x= Accuracy_Rate, y= reorder(Algorithm, -Accuracy_Rate))) +
geom_line(stat="identity") +
geom_point(aes(color=TT), size=3) +
theme(legend.position="top") +
theme(panel.background = element_rect(fill="white"))+
xlab("Accuracy Rate") +
ylab("Algorithm")
```
For both models, the difference between the accuracy rates of the train data and the test data was quite high.
This clearly points to an overfitting problem.
It is seen that the model built with the ipred package gives a slightly better result.
#### Sensitivity Comparison
```{r}
bagpred %>%
ggplot(aes(x= Sensivity, y= reorder(Algorithm, -Sensivity))) +
geom_line(stat="identity") +
geom_point(aes(color=TT), size=3) +
theme(legend.position="top") +
theme(panel.background = element_rect(fill="white"))+
xlab("Sensitivity") +
ylab("Algorithm")
```
For both models, the difference between the sensitivities of the train data and the test data is quite high.
This clearly points to an overfitting problem.
It is seen that the model built with the ipred package gives a slightly better result.
#### Specificity Comparison
```{r}
bagpred %>%
ggplot(aes(x= Specificity, y= reorder(Algorithm, -Specificity))) +
geom_line(stat="identity") +
geom_point(aes(color=TT), size=3) +
theme(legend.position="top") +
theme(panel.background = element_rect(fill="white"))+
xlab("Specificity") +
ylab("Algorithm")
```
For both models, the difference between the specificities of the train data and the test data was quite high.
This clearly points to an overfitting problem.
The model built with the ipred package seems to give a slightly better result.
Similar results were encountered for all metrics.
Since it gives slightly better results in the comparison of the algorithms and the results of the test dataset are higher, **I will continue with the bagging model built with the ipred package**.
## Random Forest
```{r}
rf <- randomForest(class ~ ., data = train, mtry = 4, importance = TRUE)
rf
```
Analyzing the output of the model:
- 4 variables were tried at each split.
- A total of 500 trees were established.
- The OOB error rate was 0.16
- The error was 0.12 for class zero and 0.21 for class one.
- A total of 36 observations were misclassified.
```{r}
varImpPlot(rf)
```
Considering variable importance:
By MeanDecreaseAccuracy, ca stands out most, followed by cp, oldpeak, thalach, and thal.
By the Gini values expressing node purity, the order cp, ca, oldpeak stands out.
#### Grid Search
Before the grid search, the error curve is plotted to decide a sensible range for the number of trees.
```{r}
plot(rf)
```
```{r}
hyper_grid <- expand.grid(
mtry = c(3, 4, 5, 6), # sqrt(p)
nodesize = c(1, 3, 5, 10),
numtrees = c(250,300,330,370, 400),
oob = NA
)
for (i in 1:nrow(hyper_grid)) {
fit <- randomForest(class~. ,
data=train,
mtry=hyper_grid$mtry[i],
nodesize = hyper_grid$nodesize[i],
ntree = hyper_grid$numtrees[i],
importance=TRUE)
hyper_grid$oob[i] <- mean(fit$err.rate[,1])
}
hyper_grid %>%
arrange(oob) %>%
head(10)
```
Thus, the model with the best parameters should be as follows.
```{r}
rf2 <- randomForest(class ~ ., data = train, mtry = 5, importance = TRUE, nodesize = 1, ntree = 250)
rf2
```
- 5 variables were tried at each split.
- A total of 250 trees were constructed (these two parameters come from the grid search).
- The OOB error rate was 0.14, better than the previous model's 0.16.
- The error was 0.10 for class zero and 0.19 for class one.
- A total of 31 observations were misclassified.
- Overall, the grid search gives better results than before.
```{r}
varImpPlot(rf2)
```
Considering variable importance:
By MeanDecreaseAccuracy, cp is the most important variable, followed by ca, oldpeak, and thal.
Compared with the first random forest model, the ranking of the most important variables has changed.
By the Gini values, the order cp, ca, thalach stands out.
### Predictions of the Models
#### Metrics of the First Random Forest Model with Train Data
```{r echo=FALSE}
ranfortrain <- predict(rf, train, type = "class")
a <- caret::confusionMatrix(ranfortrain, train$class)
a
```
Accuracy Rate : 1
Sensitivity : 1
Specificity : 1
#### Test
```{r echo=FALSE}
ranfortest <- predict(rf, test, type = "class")
b <- caret::confusionMatrix(ranfortest, test$class)
b
```
Accuracy Rate : 0.8182
Sensitivity : 0.8776
Specificity : 0.7436
```{r include=FALSE}
rfpred <- data.frame()
rfpred[1,1] <- "rfmodel1"
rfpred[2,1] <- "rfmodel1"
rfpred[1,2] <- "train"
rfpred[2,2] <- "test"
rfpred[1,3] <- a$overall[1]
rfpred[2,3] <- b$overall[1]
rfpred[1,4] <- a$byClass[1]
rfpred[2,4] <- b$byClass[1]
rfpred[1,5] <- a$byClass[2]
rfpred[2,5] <- b$byClass[2]
```
#### After Grid Search
```{r echo=FALSE}
ranfortrain1 <- predict(rf2, train, type = "class")
a <- caret::confusionMatrix(ranfortrain1, train$class)
a
```
Accuracy Rate : 1
Sensitivity : 1
Specificity : 1
#### Test
```{r echo=FALSE}
ranfortest1 <- predict(rf2, test, type = "class")
b <- caret::confusionMatrix(ranfortest1, test$class)
b
```
Accuracy Rate : 0.8068
Sensitivity : 0.8776
Specificity : 0.7179
```{r include=FALSE}
rfpred[3,1] <- "rfmodel2"
rfpred[4,1] <- "rfmodel2"
rfpred[3,2] <- "train"
rfpred[4,2] <- "test"
rfpred[3,3] <- a$overall[1]
rfpred[4,3] <- b$overall[1]
rfpred[3,4] <- a$byClass[1]
rfpred[4,4] <- b$byClass[1]
rfpred[3,5] <- a$byClass[2]
rfpred[4,5] <- b$byClass[2]
names(rfpred) <- c("Algorithm", "TT", "Accuracy_Rate", "Sensivity", "Specificity" )
```
### Choosing the Best Random Forest Model
#### Accuracy Rate Comparison
```{r}
rfpred %>%
ggplot(aes(x= Accuracy_Rate, y= reorder(Algorithm, -Accuracy_Rate))) +
geom_line(stat="identity") +
geom_point(aes(color=TT), size=3) +
theme(legend.position="top") +
theme(panel.background = element_rect(fill="white"))+
xlab("Accuracy Rate") +
ylab("Algorithm")
```
For both models, the difference between the accuracy rates of the train data and the test data was quite high.
This clearly points to an overfitting problem.
It is seen that the model created with Grid Search gives a slightly better result.
#### Sensitivity Comparison
```{r}
rfpred %>%
ggplot(aes(x= Sensivity, y= reorder(Algorithm, -Sensivity))) +
geom_line(stat="identity") +
geom_point(aes(color=TT), size=3) +
theme(legend.position="top") +
theme(panel.background = element_rect(fill="white"))+
xlab("Sensitivity") +
ylab("Algorithm")
```
For both models, the difference between the sensitivities of the train data and the test data is quite high.
This clearly points to an overfitting problem.
It is seen that the model created with Grid Search gives a slightly better result.
#### Specificity Comparison
```{r}
rfpred %>%
ggplot(aes(x= Specificity, y= reorder(Algorithm, -Specificity))) +
geom_line(stat="identity") +
geom_point(aes(color=TT), size=3) +
theme(legend.position="top") +
theme(panel.background = element_rect(fill="white"))+
xlab("Specificity") +
ylab("Algorithm")
```
For both models, the difference between the specificities of the train data and the test data was quite high.
This clearly points to an overfitting problem.
It is seen that the model built with Grid Search gives a slightly better result.
Since it gives slightly better results in the comparison of algorithms and the results of the test data set are higher, **I will continue with the random forest model created after grid search**.
## Logistic Regression
```{r}
logmodel1 <- glm(class ~ age + sex + cp + trestbps + chol +
fbs + restecg + thalach + exang + oldpeak + slope + ca + thal, data = train, family = binomial)
summary(logmodel1)
```
### Model's statistical significance
- $H_{0}$ : $\beta_{1}$ = $\beta_{2}$ = ⋯ = $\beta_{k}$ = 0
- $H_{a}$ : At least one $\beta_{j}$ $\ne$ 0
```{r}
# G= Null deviance-Residual Deviance
1 - pchisq(288.93 - 112.58, 208 - 188)
```
Since this p-value is less than .05, we can reject the null hypothesis. In other words, we have sufficient statistical evidence to say that the independent variables are effective in explaining the dependent variable.
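The same likelihood-ratio test can be computed from the fitted model's slots instead of hard-coded numbers (a sketch; the arithmetic below reuses the deviances and degrees of freedom printed above):

```r
# G = null deviance - residual deviance ~ chi-squared with
# df = df.null - df.residual under H0
G  <- 288.93 - 112.58              # from summary(logmodel1)
df <- 208 - 188
pchisq(G, df, lower.tail = FALSE)  # p-value; far below .05, so reject H0
# equivalently, straight from the model object:
# with(logmodel1, pchisq(null.deviance - deviance,
#                        df.null - df.residual, lower.tail = FALSE))
```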
### Coefficients
To determine how the odds change when an independent variable increases by one unit, the exp function is applied to both sides of the log(odds) formula; each coefficient then becomes a multiplicative effect on the odds.
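Written out, exponentiating both sides of the logit gives:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k \quad\Rightarrow\quad \frac{p}{1-p} = e^{\beta_0}\, e^{\beta_1 x_1} \cdots e^{\beta_k x_k}$$

so increasing $x_j$ by one unit multiplies the odds by $e^{\beta_j}$, which is what the `exp()` calls below compute.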
Coefficient interpretation of significant variables:
```{r}
exp(1.689872)
exp(2.824694)
exp(3.703384)
exp(1.331317)
exp(0.988486)
exp(1.434285)
exp(2.770626)
exp(2.553011)
exp(1.265260)
```
- A one-unit increase in sex1 multiplies the odds by 5.418787.
- A one-unit increase in cp2 multiplies the odds by 16.85579.
- A one-unit increase in cp4 multiplies the odds by 40.58441.
- A one-unit increase in restecg2 multiplies the odds by 3.786026.
- A one-unit increase in oldpeak multiplies the odds by 2.687163.
- A one-unit increase in slope2 multiplies the odds by 4.196643.
- A one-unit increase in ca1 multiplies the odds by 15.96863.
- A one-unit increase in ca2 multiplies the odds by 12.84572.
- A one-unit increase in thal7 multiplies the odds by 3.544014.
### Confidence Interval for Coefficients
- $H_{0}$ : $\beta_{i}$ = 0
- $H_{a}$ : $\beta_{i}$ $\ne$ 0
```{r}
confint.default(logmodel1)
```
Since their confidence intervals for the $\beta$ coefficient do not include zero, the null hypothesis $H_0$ is rejected and the following coefficients are statistically significant: sex1, cp4, thalach, slope2, ca.
Since their confidence intervals contain zero, the null hypothesis $H_0$ cannot be rejected and the following coefficients are not statistically significant: age, cp2, cp3, trestbps, chol, fbs1, restecg, exang, oldpeak, slope3, thal.
### Confidence Interval for Odds
$H_{0}$ : exp($\beta_{i}$) = 1