Skip to content

PlaypowerLabs/seqClustR

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

seqClustR

seqClustR is a sequence clustering package: it provides access to different clustering algorithms to perform sequence clustering on one common data format.

Related paper: seqClustR: An R Package for Sequence Clustering

Overview

Sequence clustering is a data mining technique that groups similar sequences into clusters based on their similarities. Sequence clustering is useful when there are unknown number of similar sequences that need to be identified to gain valuable insights.

seqClustR package provides the means to perform different clustering algorithms on sequence data by reducing the complexity to prepare data for each algorithm in a different way, by just converting the sequence data into event logs you can run multiple clustering algorithms and compare them.

Installation

Install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("PlaypowerLabs/seqClustR")

Usage

The package uses event log as an input for the data. Event log are very commonly used to store the user behavior data. They indicate the sequence of actions a user takes over time, along with added metadata of the event.

Once the data is prepared in the event log format, we can run one of the following sequence clustering algorithms:

  1. Edit Distance Clustering - seq_edit_distance_clustering
  2. Markov Model Based Clustering - seq_markov_clustering
  3. Dynamic Time Warping - seq_dtw_clustering
  4. K-Means Clustering - seq_kmeans_clustering

The output of the function would be a list containing the fitted model and a data frame having the case to cluster assigned mapping. To do further analysis on individual clusters, we need event logs for each cluster for which we have written a function split_event_log which takes event logs and the clusters assigned data frame as inputs, and returns a list of event log by cluster.

library(tidyverse)
library(bupaR)
library(seqClustR)

event_log <- sequence_data %>% 
 arrange(EventTime) %>% 
 mutate(lifecycle_id = 'complete',
        resource = NA,
        row_num = 1:nrow(.)) %>% 
 eventlog(case_id = "learnerID", 
          activity_id = "Observable", 
          activity_instance_id = "row_num",
          lifecycle_id = "lifecycle_id",
          timestamp = "EventTime", 
          resource_id = "resource")

cluster <- seq_edit_distance_clustering(
           event_log)

# Get event log by cluster as a list.

event_log_2 <- split_event_log(eventlog, 
                               cluster$cluster_assignment)

You can visualize the clusters using fuzzymineR package.

library(fuzzymineR)

# Process Model for Cluster 1

metrics <- mine_fuzzy_model(event_log_2[["1"]])

viz_fuzzy_model(metrics = metrics,
                node_sig_threshold = 0.1,
                edge_sig_threshold = 0.3,
                edge_sig_to_corr_ratio = 1)

About

Generate Clusters from Sequence Data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages