By Han Simeng from NTU Open Source Society
Workshop Details | |
---|---|
When | Friday, 9 Sep 2018, 6:30 PM - 8:30 PM |
Where | LT1, NTU North Spine Plaza |
Who | NTU Open Source Society |
Questions | We will be hosting a Pigeonhole Live for collecting questions regarding the workshop |
For errors, typos or suggestions, please do not hesitate to post an issue! Pull requests are very welcome, thank you!
Disclaimer: This workshop is for educational purposes only. No prototype or outcome of any type is intended for commercial use.
Machine Learning is an interdisciplinary subject where computer science and statistics intersect.
In today's workshop, we will focus on the practical aspect of machine learning, i.e., coding.
In most cases, we give our algorithm an input and it gives us an output.
However, for a machine learning algorithm, we first feed a lot of data to the algorithm and let it determine for itself how it should react to the data. This process determines the parameters of the machine learning model.
In supervised machine learning, we feed both the input and its label into the model, and it learns to predict the output when we feed it new inputs. Think of supervised learning as learning with a teacher who tells you the right answers.
In unsupervised machine learning, we only feed the input, and the model learns to find structure in the data on its own. Think of unsupervised learning as learning without a teacher. Not all real-world data come with labels, hence the need for unsupervised learning.
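As a minimal sketch of the difference (using scikit-learn, which we introduce below, and a tiny made-up dataset), the two settings differ only in whether labels are supplied:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised: inputs X come with labels y (the "teacher's answers")
X = [[1], [2], [8], [9]]
y = [0, 0, 1, 1]
clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.5], [8.5]]))  # learns to map new inputs to labels

# Unsupervised: only inputs; the model groups them by itself
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster assignments discovered from the data alone
```

Note that the unsupervised model never sees `y`; it can only discover that the points form two groups, not what those groups mean.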
The second workshop will introduce two machine learning algorithms in order to demonstrate how the field can be used in real-world scenarios.
This includes logistic regression, a supervised method to solve classification problems, as well as k-means clustering, an unsupervised method to group together clusters of data by certain criteria.
We will use scikit-learn, a Python package built for implementing machine learning algorithms.
Logistic Regression with scikit-learn
K-Means with scikit-learn
See NTUOSS-PandasBasics for a comprehensive introduction to using Google Colaboratory for data science projects; let's walk through it together.
Copy this notebook to your own drive
Go to this link to download the data used in this workshop and upload it to Google Colaboratory.
- Supervised Odyssey:
Supervised Classification
- Unsupervised Odyssey:
Unsupervised Classification
- End of journey
Import the logistic regression module from sklearn along with the plotting packages.
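A minimal sketch of the imports for this section (NumPy is included because the plotting steps below operate on arrays):

```python
# Logistic regression model from scikit-learn, plus NumPy and matplotlib for plotting
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
```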
Logistic regression is used when the dependent variable (target) is categorical, i.e., we want to find the class to which each input belongs. For example, to classify spam emails, we determine whether an email belongs to the spam class or the normal class.
Algorithm Intuition (online demo)
The sigmoid function adds non-linearity to the model and squashes its input into the range (0, 1), so the output can be read as a probability.
z is the input to the sigmoid function: the dot product of the input X and the weight vector w.
Logistic regression predictive function
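The two pieces above can be sketched directly in code: the sigmoid σ(z) = 1 / (1 + e⁻ᶻ), and a predictive function that computes z = X·w (plus an intercept b) and passes it through the sigmoid. The function names here are just for illustration:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """z is the dot product of input X and weights w, plus intercept b."""
    z = X @ w + b
    return sigmoid(z)  # probability of belonging to the positive class

print(sigmoid(0))  # 0.5: the decision boundary sits exactly where z = 0
```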
To conduct logistic regression with scikit-learn, we first create a LogisticRegression object
Then we fit the model to the data
The intercept and coef are the model parameters (weights).
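A minimal sketch of those three steps, using a hypothetical stand-in for the workshop's exam-score data (two scores per student, label 1 = admitted, 0 = rejected):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical exam-score data, not the workshop's actual dataset
X = np.array([[30, 40], [35, 50], [45, 45], [70, 80], [85, 75], [90, 90]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()   # create the LogisticRegression object
model.fit(X, y)                # fit the model to the data

print(model.intercept_)  # the intercept (bias) term
print(model.coef_)       # one weight per input feature
```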
After obtaining the parameters, let's visualize the result by plotting the decision boundary.
Students whose score points lie above the decision boundary will be admitted, while students below it will be rejected.
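With two features, the decision boundary is the line where w₁x₁ + w₂x₂ + b = 0 (predicted probability exactly 0.5). A sketch of plotting it, again on hypothetical exam-score data rather than the workshop's dataset:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in Colab
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Hypothetical exam-score data: label 1 = admitted, 0 = rejected
X = np.array([[30, 40], [35, 50], [45, 45], [70, 80], [85, 75], [90, 90]])
y = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X, y)

# Solve w1*x1 + w2*x2 + b = 0 for x2 to draw the boundary line
w = model.coef_[0]
b = model.intercept_[0]
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x2 = -(w[0] * x1 + b) / w[1]

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.plot(x1, x2, "r-", label="decision boundary")
plt.xlabel("Exam 1 score")
plt.ylabel("Exam 2 score")
plt.legend()
plt.savefig("decision_boundary.png")
```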
Now let's use our trained logistic regression model to predict if a student will be accepted or rejected.
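A sketch of making predictions with the fitted model, on the same hypothetical exam-score data as above (the new students' scores are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical exam-score data: label 1 = admitted, 0 = rejected
X = np.array([[30, 40], [35, 50], [45, 45], [70, 80], [85, 75], [90, 90]])
y = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X, y)

new_students = np.array([[40, 40], [80, 85]])
print(model.predict(new_students))        # predicted class: 0 = rejected, 1 = admitted
print(model.predict_proba(new_students))  # probability of each class per student
```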
Import the image reading module from matplotlib and the K-Means module from sklearn
Read the image
The image is stored as a three-dimensional array of RGB values with shape (700, 1000, 3).
700 is the number of rows.
1000 is the number of columns.
3 corresponds to the R, G and B channels.
Algorithm Intuition (Online Demo)
K-means is one of the most popular unsupervised clustering algorithms.
"K" in K-means refers to k number of clusters.
"Means" refers to finding the means, or centroids of the clusters.
Reshape the image into a two-dimensional array.
To run the K-Means algorithm, we first create a scikit-learn KMeans object with the number of clusters set to 20, which is the number of colors we want in the compressed image. We then fit the model to the data and use the centroids to compress the image.
Reshape X_recovered to have the same dimension as the original image
Now we can plot the original and the compressed image side by side.
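The whole pipeline can be sketched end to end. A small synthetic image stands in for the workshop photo to keep the example self-contained and fast; with the real image the steps are identical:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in Colab
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Small synthetic RGB image standing in for the workshop's photo
rng = np.random.default_rng(0)
img = rng.random((60, 80, 3))

# Reshape to 2D: one row per pixel, one column per color channel
X = img.reshape(-1, 3)

# 20 clusters = 20 colors in the compressed image
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)

# Replace every pixel with its cluster centroid, then restore the shape
X_recovered = kmeans.cluster_centers_[kmeans.labels_]
img_compressed = X_recovered.reshape(img.shape)

# Plot the original and the compressed image side by side
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.imshow(img)
ax1.set_title("Original")
ax2.imshow(img_compressed)
ax2.set_title("Compressed (20 colors)")
for ax in (ax1, ax2):
    ax.axis("off")
fig.savefig("compressed.png")
```

Indexing `cluster_centers_` with `labels_` is the compression step: every pixel's color is replaced by the centroid of its cluster, so the image uses at most 20 distinct colors.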
Congratulations on completing the Machine Learning Odyssey!
In this workshop we have learned how to use machine learning algorithms to solve some simple real-world problems.
In the next workshop, which is also the last in the NTUOSS Data Science workshop series, we will introduce deep learning, a subfield of machine learning that is even more interesting!
An approachable book if you want to learn more: A Course in Machine Learning.