5.Clustering

Ahmed Shahriar Sakib edited this page Dec 30, 2021 · 1 revision

Preprocess

Filter Data

  • "timestamp_time" was replaced with the "time_of_the_day" feature
  • "date_of_incident" was replaced with "week_day", "day_name" and "month_name"
  • "business" and "address_2" have many null values, so those features were removed
  • "duration" was converted to a numerical value and replaced with "duration_in_seconds"
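The replacements listed above could be sketched with pandas roughly as follows. This is an illustration only, not the project's actual script: the sample rows, the `HH:MM:SS` format assumed for both `timestamp_time` and `duration`, the four time-of-day bins, the numeric meaning assumed for `week_day`, and the helper name `filter_data` are all assumptions.

```python
import pandas as pd

def filter_data(df):
    """Hypothetical helper mirroring the feature replacements listed above."""
    out = df.copy()

    # Assumed format "HH:MM:SS"; bin the hour into four assumed parts of the day.
    hour = pd.to_datetime(out["timestamp_time"], format="%H:%M:%S").dt.hour
    out["time_of_the_day"] = pd.cut(
        hour,
        bins=[0, 6, 12, 18, 24],
        right=False,
        labels=["night", "morning", "afternoon", "evening"],
    ).astype(str)

    # Derive calendar features from the incident date.
    date = pd.to_datetime(out["date_of_incident"])
    out["week_day"] = date.dt.dayofweek  # assumed numeric, Monday = 0
    out["day_name"] = date.dt.day_name()
    out["month_name"] = date.dt.month_name()

    # Assumed "HH:MM:SS" duration strings, converted to seconds.
    out["duration_in_seconds"] = pd.to_timedelta(out["duration"]).dt.total_seconds()

    # Drop the replaced columns and the mostly-null ones.
    return out.drop(columns=["timestamp_time", "date_of_incident",
                             "duration", "business", "address_2"])

# Made-up sample rows, for illustration only.
sample = pd.DataFrame({
    "timestamp_time": ["08:15:00", "22:40:00"],
    "date_of_incident": ["2021-12-27", "2021-12-30"],
    "duration": ["00:45:00", "01:30:00"],
    "business": [None, None],
    "address_2": [None, None],
})
filtered = filter_data(sample)
print(filtered)
```

If the real columns use other formats, only the parsing lines would need to change; the structure of replacing a raw column with derived features stays the same.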

Scaling

import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

def scaling_df(df):
  X_cluster = df.copy()

  # Label-encode every object (string) column as integer codes
  object_cols = df.columns[df.dtypes == object].to_list()
  label_enc = LabelEncoder()
  for col in object_cols:
      X_cluster[col] = label_enc.fit_transform(X_cluster[col])

  # Rescale every feature to the [0, 1] range
  scaler = MinMaxScaler()
  X_cluster_scaled = pd.DataFrame(scaler.fit_transform(X_cluster),
                                  columns=X_cluster.columns)
  return X_cluster_scaled

PCA

[Figure: PCA]

Script

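import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def pulse_point_pca(X_data, n_components):
  # Fit PCA and report the total share of variance the components retain
  pca = PCA(n_components=n_components)
  fit_pca = pca.fit(X_data)

  print("Variance explained with {0} components:".format(n_components),
        round(sum(fit_pca.explained_variance_ratio_), 2))

  return fit_pca, fit_pca.transform(X_data)

# X_cluster_scaled is the DataFrame returned by scaling_df
pca_full, pulsepoint_data_full = pulse_point_pca(X_cluster_scaled, X_cluster_scaled.shape[1])

# Plot cumulative explained variance against the number of components (1-based)
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance)
plt.title("Proportion of PCA variance\nexplained by number of components")
plt.xlabel("Number of components")
plt.ylabel("Proportion of variance explained")
plt.show()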
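One way to read a cumulative-variance curve programmatically is to pick the smallest number of components that reaches a target share of variance. The sketch below runs on synthetic data with an assumed 90% threshold; neither the data nor the threshold comes from the original analysis.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled feature matrix: 200 rows, 6 features in [0, 1].
X_demo = rng.random((200, 6))

pca_demo = PCA(n_components=X_demo.shape[1]).fit(X_demo)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

threshold = 0.90  # assumed target, not taken from the original analysis
# First position where the cumulative ratio reaches the threshold, as a 1-based count.
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print("Components needed:", n_keep)
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` and then selects the number of components needed to explain that share of variance.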

Clustering

Agency Engagement Vs Incident Duration by City

[Figure: Agency Engagement vs Incident Duration by City]

agency_count (the number of agency engagements) and duration_hr (the duration in hours) have a positive linear relationship: cities with longer incident durations also show more agency engagement.
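A quick way to quantify a linear relationship like this is the Pearson correlation coefficient. The numbers below are made up for illustration and are not the PulsePoint data; only the variable names echo the features mentioned above.

```python
import numpy as np

# Hypothetical per-city totals, for illustration only.
agency_count = np.array([12, 40, 95, 150, 310])
duration_hr = np.array([10.0, 35.0, 80.0, 140.0, 260.0])

# Pearson correlation: values near +1 indicate a strong positive linear relationship.
r = np.corrcoef(agency_count, duration_hr)[0, 1]
print(round(r, 3))
```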

Clustering States By Duration

X = pulse_point_state_duration_df[['total_agency_engagement', 'total_duration_hr']].values

Elbow Method

[Figure: elbow plot]
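As a rough sketch of how an elbow curve can be computed, the code below fits k-means for several values of k and records the inertia (within-cluster sum of squared distances). It uses synthetic two-column data as a stand-in for `X`; the blob centres, the range of k, and `n_init=10` are assumptions, not values from the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic stand-in for X: three loose blobs of (engagement, duration) pairs.
centers = np.array([[200, 150], [900, 800], [2000, 1800]])
X_demo = np.vstack([c + rng.normal(scale=60, size=(50, 2)) for c in centers])

k_values = range(1, 8)
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    model.fit(X_demo)
    inertias.append(model.inertia_)  # within-cluster sum of squared distances

# The "elbow" is the k after which inertia stops dropping sharply.
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia))
```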

k-means

[Figure: k-means clusters]

The k-means algorithm clusters the cities into three groups based on incident duration and number of agency engagements. Shorter durations go with fewer agency engagements, and vice versa.

Group 1: Cities with very low incident duration and few agency engagements

Group 2: Cities with comparatively higher incident duration and more agency engagements

Group 3: Cities with the highest incident duration and the most agency engagements
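A grouping of this kind could be produced along the following lines. This is a minimal sketch on synthetic data, not the original notebook code: the blob centres and `n_init=10` are assumptions, and the relabelling step is added only so that group 1 is always the lowest cluster, since k-means assigns label numbers arbitrarily.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic stand-in for X: (engagement, duration) pairs around three centres.
centers = np.array([[200, 150], [900, 800], [2000, 1800]])
X_demo = np.vstack([c + rng.normal(scale=60, size=(50, 2)) for c in centers])

kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X_demo)

# Rank clusters by centroid engagement so that group 1 is the lowest.
order = np.argsort(kmeans.cluster_centers_[:, 0])
group_of_label = {int(label): rank + 1 for rank, label in enumerate(order)}
groups = np.array([group_of_label[int(label)] for label in labels])

print(np.bincount(groups)[1:])  # number of points in groups 1, 2, 3
```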

Agglomerative Clustering

[Figure: elbow plot for agglomerative clustering]

Ward Linkage

[Figure: agglomerative clusters, Ward linkage]

Complete Linkage

[Figure: agglomerative clusters, complete linkage]
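The two linkage strategies can be compared with scikit-learn's `AgglomerativeClustering`, for example by looking at how many points land in each cluster. The sketch below uses skewed synthetic data (many small values, few large ones) as an assumed stand-in for the real feature matrix, with the number of clusters set to 4.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Skewed synthetic data: most points are small, a few are very large.
engagement = rng.lognormal(mean=5.0, sigma=1.0, size=200)
duration = 0.8 * engagement + rng.normal(scale=20.0, size=200)
X_demo = np.column_stack([engagement, duration])

sizes = {}
for linkage in ("ward", "complete"):
    model = AgglomerativeClustering(n_clusters=4, linkage=linkage)
    labels = model.fit_predict(X_demo)
    # Count how many points each of the four clusters received.
    sizes[linkage] = np.bincount(labels, minlength=4)
    print(linkage, sizes[linkage])
```

Very uneven cluster sizes under one linkage are a hint that it is isolating a handful of extreme points rather than partitioning the bulk of the data.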

Results

The results above show that “complete” linkage is not suitable for agglomerative clustering here: although the number of clusters was set to 4, it mostly formed 2 clusters. k-means and “Ward” agglomerative clustering, on the other hand, gave better results. Note, however, that the cities are densely concentrated where both the number of agency engagements and the total incident duration are low.

K-means++ focused on separating the less densely packed cities, which produced clusters with unequal parameter ranges:

  • Cluster 0: total incident duration up to ~400, number of engagements up to ~500
  • Cluster 1: total incident duration from 400 to ~1200, number of engagements from 500 to ~1500
  • Cluster 2: total incident duration from 1200 to the maximum, number of engagements from 1500 to the maximum

In k-means, the range of cluster 1 is larger than that of cluster 0, whereas Ward agglomerative clustering split clusters 0 and 1 almost evenly. If the range of the parameters (engagements or duration) matters because of other factors, for example budget allocation with respect to engagements, or business decisions and future planning based on the duration of emergencies, then either clustering would be acceptable depending on the priority.
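To compare the ranges covered by each cluster, one option is to summarise the minimum and maximum of each feature per cluster label. The values and labels below are hypothetical and serve only to show the shape of such a summary; the column names follow the features used for clustering.

```python
import pandas as pd

# Hypothetical values and cluster labels, for illustration only.
demo = pd.DataFrame({
    "total_agency_engagement": [50, 300, 480, 700, 1400, 1900, 2600],
    "total_duration_hr":       [40, 250, 390, 600, 1150, 1500, 2100],
    "cluster":                 [0, 0, 0, 1, 1, 2, 2],
})

# One row per cluster, with min and max for each feature.
ranges = demo.groupby("cluster").agg(["min", "max"])
print(ranges)
```

Running the same summary on the labels from two different algorithms makes it easy to see which one yields the more evenly sized ranges.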