
Py DS_Engineer Lab Report #06

Amy Lin edited this page Jul 20, 2017 · 21 revisions

Python Programming for Data Scientists & Engineers Lab #06

Lab #06-1 Linear Regression in TensorFlow + Training & Testing Cost


I split the dataset into 85% training and 15% testing and ran linear regression with gradient-descent optimization using TensorFlow in Python.

There is still a lot of noise in the data, so the absolute difference between training and testing mean squared error is still high. One way to improve is to retrain the model and adjust it based on the testing results so the accuracy improves. From the training and testing plots, we can see that the fit is improving (the error is lower than on the raw data) but still not good enough. There is definitely more training to be done!
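The procedure above can be sketched as follows. Since the original notebook and dataset aren't shown on this page, this is a minimal NumPy stand-in for the TensorFlow optimization (same 85/15 split, same gradient descent on mean squared error) using synthetic noisy linear data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic noisy linear data (stand-in for the lab's dataset)
x = rng.uniform(0, 10, 200)
y = 3.0 * x + 2.0 + rng.normal(0, 2.0, 200)

# 85% / 15% train/test split
idx = rng.permutation(len(x))
cut = int(0.85 * len(x))
x_tr, y_tr = x[idx[:cut]], y[idx[:cut]]
x_te, y_te = x[idx[cut:]], y[idx[cut:]]

# Gradient descent on MSE for y ~ w*x + b
w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    err = w * x_tr + b - y_tr
    w -= lr * 2 * np.mean(err * x_tr)
    b -= lr * 2 * np.mean(err)

train_mse = np.mean((w * x_tr + b - y_tr) ** 2)
test_mse = np.mean((w * x_te + b - y_te) ** 2)
print(f"w={w:.2f}, b={b:.2f}, train MSE={train_mse:.2f}, test MSE={test_mse:.2f}")
```

Comparing `train_mse` against `test_mse` on held-out points is exactly the training-vs-testing cost comparison the plots below show.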


* Training Cost




* Training Result




* Testing Result




* TensorFlow Final Result




Lab #06-1-2 K-Means Clustering

DATASET : tshirt-H.csv



A small dataset (23 people) with their names, heights, and weights is used in this case. For simplicity in clustering such a small dataset, one run of K-means clustering was simulated, partitioning the data into 4 clusters. The labels are then assigned back to the data, so each person knows what size of T-shirt they're getting! And the company can determine the quantity and size range based on customers' weights and heights.
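The clustering step above can be sketched with scikit-learn's `KMeans`. Since tshirt-H.csv isn't included on this page, the heights and weights below are synthetic stand-ins for the 23 people:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Synthetic stand-in for tshirt-H.csv: 23 people's heights (cm) and weights (kg)
heights = rng.normal(170, 10, 23)
weights = rng.normal(68, 12, 23)
X = np.column_stack([heights, weights])

# Cluster into 4 groups, one per T-shirt size
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Assign the cluster label back to each person
for i, label in enumerate(km.labels_):
    print(f"person {i:2d}: height={heights[i]:.0f}  weight={weights[i]:.0f}  -> cluster {label}")
```

In practice each cluster would then be mapped to a size (S/M/L/XL) by ordering the cluster centers in `km.cluster_centers_`.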


Lab #06-2 Spectral + Hierarchical Clustering

Spectral Clustering a.k.a. Graph Clustering

source: http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html

For spatial data, a graph is induced from the distances between points. Spectral clustering then looks at eigenvectors of the Laplacian of that graph to attempt to find a good (low-dimensional) embedding of the graph into Euclidean space.

In other words, this technique finds a transformation of the graph that represents the manifold the data is assumed to lie on.

* Weaknesses : It is still a partitioning algorithm, so noise points get forced into clusters and pollute them.

* Intuitive Parameters : The number of clusters must be specified up front, or one hopefully finds a 'suitable' value by searching over a range of parameters.

* Stability : A little more stable than K-means thanks to the transformation, but it still suffers from some of the same issues.

* Performance : A slower algorithm, since spatial data doesn't come with a sparse graph (unless we prepare one ourselves).
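The points above can be seen in a small sketch with scikit-learn's `SpectralClustering` on the classic two-moons dataset (my own illustrative example, not from the lab): K-means would cut straight across the moons, while the eigenvectors of the k-NN graph Laplacian embed the manifold so the two arcs separate cleanly. Note that `n_clusters` must be specified, as discussed above.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: non-convex clusters that defeat K-means
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

sc = SpectralClustering(
    n_clusters=2,                  # must be chosen up front
    affinity="nearest_neighbors",  # build a sparse k-NN graph from distances
    n_neighbors=10,
    random_state=0,
)
labels = sc.fit_predict(X)
print("agreement with true moons:", adjusted_rand_score(y_true, labels))
```

Using `affinity="nearest_neighbors"` is also how we "prep a sparse graph by ourselves" to keep the eigendecomposition affordable.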



Hierarchical Clustering a.k.a. HCA


source: https://www.centerspace.net/clustering-analysis-part-iii-hierarchical-cluster-analysis

Builds nested clusters by splitting or merging them successively. The resulting hierarchy is usually represented as a tree, or dendrogram.

* Two types :

--- 1. Agglomerative - A "bottom up" approach, each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy.

--- 2. Divisive - A "top down" approach, all observations start in one cluster and splits are performed recursively as one moves down the hierarchy.

* Performance : Too slow for large datasets.
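The agglomerative ("bottom up") variant described above can be sketched with SciPy's hierarchy module on a toy dataset of my own (not the lab's data): `linkage` builds the full merge tree, and `fcluster` cuts it into a flat labeling.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)

# Two well-separated groups of 2-D points
X = np.vstack([rng.normal(0, 0.5, (10, 2)),
               rng.normal(5, 0.5, (10, 2))])

# Agglomerative clustering: each point starts in its own cluster and the
# closest pair of clusters is merged repeatedly; Z encodes the dendrogram.
Z = linkage(X, method="ward")

# Cut the tree to obtain a flat labeling with 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The quadratic (or worse) cost of computing and merging pairwise distances is why, as noted above, this approach becomes too slow for large datasets; `scipy.cluster.hierarchy.dendrogram(Z)` would draw the tree itself.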

