Diabetes-Prediction

A multi-layer perceptron which predicts whether an individual is susceptible to diabetes. The model has been trained on the Pima Indians Diabetes Database, provided by the National Institute of Diabetes and Digestive and Kidney Diseases.

Libraries Used

matplotlib, pandas, Keras, NumPy, seaborn, scikit-learn

Data Analysis

Histograms

Note: 'outcome' refers to whether an individual does, or does not, have diabetes

Insights

Variables are on different scales, and therefore must be standardized
The majority of data has been collected from individuals between 20 and 30 years of age
BMI, Blood Pressure, and Glucose appear to be approximately normally distributed
- This is to be expected when such statistics are collected from a population
It is impossible for BMI, Blood Pressure, or Glucose to have a value of zero
- These zeros most likely indicate missing or incomplete data
Certain individuals have had up to 15 pregnancies
- While not implausible, this information should still be considered
In this data-set, 35% of individuals have diabetes and 65% do not
- The World Health Organisation estimates that only 8.5% of the global population suffers from diabetes
- This data-set is therefore not representative of the global population, which is to be expected given the specific group it was collected from
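
The class balance quoted above can be checked with a few lines of pandas. The following is a minimal sketch on made-up labels; the column name Outcome and the ten sample values are assumptions for illustration, not rows from diabetes.csv.

```python
import pandas as pd

# Made-up labels for illustration only: 1 = diabetic, 0 = non-diabetic.
outcome = pd.Series([1, 0, 0, 1, 0, 0, 0, 1, 0, 0], name="Outcome")

# normalize=True returns the fraction of rows in each class instead of raw counts.
proportions = outcome.value_counts(normalize=True)

print(proportions)
```

On the real data, the same call on the outcome column should reproduce the 35% / 65% split.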

Density Plots

Insights

Glucose, BMI, and Age appear to be the strongest predictors of diabetes
The distributions of Blood Pressure and Skin Thickness look similar for diabetic and non-diabetic individuals, so these variables do not appear to be strong predictors

Data Pre-Processing

Missing or Incomplete Values

Statistical Summary

There are a total of 768 entries
Pregnancies, Glucose Concentration, Blood Pressure, Skin Thickness, Insulin, and BMI all have a minimum value of zero. For every variable except Pregnancies, a value of zero is impossible, so these zeros indicate missing values

Number of Missing Values

There is a significant number of missing values. Most notably, a large number of entries for Insulin and Skin Thickness are missing
Missing values have been identified by searching for entries with a value of zero. Pregnancies can therefore be ignored, as an individual with zero pregnancies is perfectly valid
Missing values have been replaced with the mean of the non-missing values in the same column
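
The procedure described above (count zeros, treat them as missing, fill with the column mean) might look roughly like the sketch below. It is not taken from preprocessing.py; the column names and the four toy rows are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; column names are assumed, not read from diabetes.csv.
df = pd.DataFrame({
    "Pregnancies": [0, 2, 5, 1],
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Columns where a zero is treated as missing. Pregnancies is deliberately left out,
# because zero pregnancies is a valid value.
zero_as_missing = ["Glucose", "BMI"]

# Number of zero entries per column, counted before anything is replaced.
missing_counts = (df[zero_as_missing] == 0).sum()

# Turn zeros into NaN first so that they are excluded from the mean,
# then fill each NaN with the mean of the remaining values in its column.
imputed = df.copy()
imputed[zero_as_missing] = imputed[zero_as_missing].replace(0, np.nan)
imputed[zero_as_missing] = imputed[zero_as_missing].fillna(imputed[zero_as_missing].mean())

print(missing_counts)
print(imputed)
```

Converting zeros to NaN before taking the mean matters: averaging with the zeros still in place would pull the replacement value down.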

Data Standardization

Statistical Summary of Standardized Data

The values for Outcome have been copied from the original dataset as they do not require standardization
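
One way to standardize the features while leaving Outcome untouched is scikit-learn's StandardScaler, sketched below. Whether the repository uses this exact class is an assumption, as are the column names and toy values.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy rows for illustration only; column names are assumed.
df = pd.DataFrame({
    "Glucose": [148.0, 85.0, 183.0, 89.0],
    "BMI": [33.6, 26.6, 23.3, 28.1],
    "Outcome": [1, 0, 1, 0],
})

feature_cols = ["Glucose", "BMI"]

# StandardScaler subtracts each column's mean and divides by its
# (population) standard deviation, giving mean 0 and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(df[feature_cols]),
    columns=feature_cols,
    index=df.index,
)

# The label is copied across unchanged, since it is a class, not a measurement.
scaled["Outcome"] = df["Outcome"]

print(scaled.describe())
```

A common variation is to fit the scaler on the training split only and reuse it for the validation and testing splits, so that no information from held-out data influences the scaling.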

Data Splits

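The dataset has been split into training (80%) and testing (20%) splits. The training split has then been further divided into training (80%) and validation (20%) splits, so that overall roughly 64% of the data is used for training, 16% for validation, and 20% for testing.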
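
A two-stage split like this can be sketched with scikit-learn's train_test_split. The arrays below are placeholders with the same number of rows as the dataset (768); the variable names and random_state values are assumptions, not the settings used in main.py.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 768 rows and 8 feature columns; values are meaningless.
X = np.arange(768 * 8).reshape(768, 8)
y = np.arange(768) % 2

# Stage 1: hold out 20% of all rows for testing.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Stage 2: hold out 20% of the remaining rows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.2, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

With 768 rows this yields 491 training, 123 validation, and 154 testing rows.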

Results

Once trained, the model was able to achieve 96.74% accuracy on the training set and 70.13% accuracy on the testing set.

Confusion Matrix

In the case of diabetes prediction, false negatives are the least desirable outcome, as they would result in patients being informed that they will not develop diabetes when in fact they may.
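
To make the false-negative count explicit, the confusion matrix can be unpacked into its four cells. The sketch below uses made-up labels and predicted probabilities, and the 0.5 decision threshold is an assumption rather than a value taken from the repository.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up ground truth and model outputs for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.3, 0.4])

# Convert probabilities to class labels using an assumed threshold of 0.5.
y_pred = (y_prob >= 0.5).astype(int)

# With labels=[0, 1], rows are true classes and columns are predicted classes,
# so ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Recall (sensitivity): the share of diabetic individuals the model catches.
recall = tp / (tp + fn)

print("false negatives:", fn, "recall:", recall)
```

Lowering the threshold trades false negatives for false positives, which may be a reasonable exchange when missing a diabetic patient is the more costly error.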
