Steamgestion - Data Ingestion Pipeline for Large Datasets

The gaming industry is currently one of the most prominent industries in the market. The development and popularity of games has increased rapidly in the past decade. Of all the factors that determine the popularity of a game, reviews are of paramount importance, and as the online gaming community expands by the day, there is ever more user data to collect. This project analyses the Steam Reviews dataset and builds a data ingestion pipeline using Kubernetes and Docker. The data is ingested from the Steam Reviews dataset, cached in Redis, and stored in Elasticsearch, from where it is queried and rendered in a Flask application.

Directory Structure

├── async_backend/              <- Asynchronous backend service for the data ingestion pipeline.
│   ├── Dockerfile              <- Dockerfile for building the backend service container.
│   ├── __init__.py             <- Initialization script for the backend service module.
│   ├── database.py             <- Script for database operations.
│   ├── deployment_config/      <- Configuration files for deployment.
│   │   ├── deployment.yaml     <- Kubernetes deployment configuration.
│   │   └── service.yaml        <- Kubernetes service configuration.
│   ├── main.py                 <- Main script for the backend service.
│   ├── requirements.txt        <- File listing dependencies for the backend service.
│   ├── settings.py             <- Configuration settings for the backend service.
│   ├── static/                 <- Static files for the backend service.
│   │   ├── main.css            <- CSS file for styling.
│   │   ├── main.js             <- JavaScript file for frontend interaction.
│   │   └── reviews.js          <- JavaScript file for handling reviews.
│   ├── tasks.py                <- Script defining background tasks.
│   └── templates/              <- HTML templates for the backend service.
│       ├── index.html          <- Template for the main page.
│       └── show.html           <- Template for showing data.
├── celery_app/                 <- Celery application for the data ingestion pipeline.
│   ├── Dockerfile              <- Dockerfile for building the Celery application container.
│   ├── celery-hpa.yaml         <- Kubernetes Horizontal Pod Autoscaler configuration for Celery.
│   ├── deployment.yaml         <- Kubernetes deployment configuration for Celery.
│   ├── main.py                 <- Main script for the Celery application.
│   └── requirements.txt        <- File listing dependencies for the Celery application.
└── ingestion_engine/           <- Engine for data ingestion.
    ├── ingestion_engine.py     <- Script for the data ingestion engine.
    └── requirements.txt        <- File listing dependencies for the ingestion engine.

Architecture

The system uses an asynchronous Flask backend, deployed as a service, which uses ScyllaDB/SQLite to store process state and publishes work to an event-driven message queue managed by RabbitMQ. The queue is also deployed as a service and feeds Celery workers, capable of horizontal pod autoscaling, which integrate with Elasticsearch and Redis for data ingestion and caching. The architecture of the data pipeline is shown in the figure below.

architecture
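
To make the flow concrete, the sketch below shows how a Celery task in this architecture could cache an incoming review in Redis and index it into Elasticsearch. It is a minimal illustration rather than the actual code in celery_app/main.py: the broker URL, host names, index name and field names are assumptions based on the Kubernetes services described above, and it assumes the redis-py and elasticsearch-py 8.x clients.

    import json

    import redis
    from celery import Celery
    from elasticsearch import Elasticsearch

    # Broker and service hosts assumed to match the Kubernetes service names above.
    app = Celery("steamgestion", broker="amqp://guest:guest@rabbitmq:5672//")
    cache = redis.Redis(host="redis", port=6379, db=0)
    es = Elasticsearch("http://elasticsearch:9200")

    @app.task
    def ingest_review(review: dict) -> str:
        """Cache the raw review in Redis, then index it into Elasticsearch."""
        key = f"review:{review['review_id']}"
        cache.set(key, json.dumps(review), ex=3600)  # keep the cached copy for an hour
        es.index(index="steam_reviews", id=review["review_id"], document=review)
        return key

    # The Flask backend would publish work onto the RabbitMQ queue with, for example:
    # ingest_review.delay({"review_id": 1, "language": "english", "review": "Great game"})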

Installation

Whenever dealing with Kubernetes, one can use MicroK8s or minikube to install Kubernetes on the base system. MicroK8s is the easiest and fastest way to get Kubernetes up and running. High availability is one of the premier qualities to look for in a cluster, so that it can withstand the failure of any component and continue serving workloads without interruption, and MicroK8s supports it. Simply following the Canonical documentation, one can get started with the Kubernetes installation using these commands:

- sudo snap install microk8s --classic
- microk8s status --wait-ready
- microk8s enable dashboard dns registry istio

There are a few Python-based dependencies, which can be installed with pip3 install -r requirements.txt in each of the async_backend, celery_app and ingestion_engine directories. Each directory has its own requirements.txt, so install all of them inside your preferred environment manager, either virtualenv or conda.

For convenience, microk8s kubectl is aliased to kcdev, which can be done by adding an alias to the bashrc/zshrc profile:

- echo "alias kcdev='microk8s kubectl'" >> ~/.zshrc

or one can also use vim or nano to edit the ~/.zshrc file as shown below:

- vim ~/.zshrc

And then add the following line to the file:

- alias kcdev='microk8s kubectl'

Next, we will use a set of commands to deploy and serve the backend, Elasticsearch, RabbitMQ and Redis using their YAML configuration files to bring up the Kubernetes environment. The commands are as follows:

- elasticsearch-setup-passwords auto
- kcdev apply -f async_backend/deployment_config/service.yaml
- kcdev apply -f async_backend/deployment_config/deployment.yaml
- kcdev apply -f debug_pod/debug_pod.yaml
- kcdev apply -f elasticsearch_config/elasticsearch_deployment.yaml
- kcdev apply -f elasticsearch_config/elasticsearch_pv.yaml
- kcdev apply -f elasticsearch_config/elasticsearch_service.yaml
- kcdev apply -f message_queue/rabbitmq_deployment.yaml
- kcdev apply -f message_queue/rabbitmq_service.yaml
- kcdev apply -f redis/redis_pv.yaml
- kcdev apply -f redis/redis_deployment.yaml
- kcdev apply -f redis/redis_service.yaml

Kubernetes Pods

Using Kubernetes, we have set up six microservice pods, which can be seen in the figure below. These microservices run in an event-driven architecture and are given persistent volumes for claiming storage resources.

pods

In Kubernetes, a HorizontalPodAutoscaler automatically updates a workload resource (such as a Deployment) with the aim of automatically scaling the workload to match demand. Horizontal scaling means that the response to increased load is to deploy more Pods. If the load decreases, and the number of Pods is above the configured minimum, the HorizontalPodAutoscaler instructs the workload resource to scale back down. The visual representation of the Pods that are already running for the workload is shown in the figure below.

hpa
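
As an illustration of what the autoscaler manages, the sketch below uses the official kubernetes Python client to read the HPA that celery-hpa.yaml would create; the namespace and HPA name here are assumptions for illustration only, not values taken from the manifest.

    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
    autoscaling = client.AutoscalingV1Api()

    # "celery-worker" and "default" are assumed names, not read from celery-hpa.yaml.
    hpa = autoscaling.read_namespaced_horizontal_pod_autoscaler("celery-worker", "default")
    print(hpa.spec.min_replicas, hpa.spec.max_replicas, hpa.status.current_replicas)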

Celery Workers

Celery is a simple, flexible, and reliable distributed system to process vast amounts of messages and tasks, while providing operations with the tools required to maintain such a system. There is a huge variety among those tasks: some can run for seconds while others can take hours, depending on the size of the data being processed and the operation type (read/write). When a task is published, Celery adds a message to the RabbitMQ queue. In our case, each worker consumes from a dedicated queue for simplicity and ease of auto-scaling.
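
The dedicated-queue setup could look like the sketch below, where tasks are routed to named queues and each worker subscribes to exactly one of them; the task and queue names are hypothetical rather than taken from celery_app/main.py.

    from celery import Celery

    app = Celery("steamgestion", broker="amqp://guest:guest@rabbitmq:5672//")

    # Route each task type to its own named queue so the matching workers
    # can be scaled up or down independently of one another.
    app.conf.task_routes = {
        "tasks.ingest_review": {"queue": "ingest"},
        "tasks.reindex_game": {"queue": "reindex"},
    }

Each deployment would then start its worker pinned to a single queue, for example with celery -A main worker -Q ingest --concurrency=2, which is what makes per-queue autoscaling straightforward.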

Persistent Volumes

A PersistentVolume (PV) as shown in the figure below is a piece of storage in the cluster that has been provisioned by an administrator or dynamically provisioned using Storage Classes. It is a resource in the cluster just like a node is a cluster resource.

pv

A PersistentVolumeClaim (PVC) as shown in the figure below is a request for storage by a user. It is similar to a Pod: Pods consume node resources and PVCs consume PV resources. Just as Pods can request specific levels of resources (CPU and memory), claims can request a specific size and access mode.

pvc

Results

Using port forwarding to render the results in the Flask application, one can show the processes and IDs of Celery workers running concurrently while scaling. In the figure below, the process IDs are shown on a scaffolded Flask template.

processresult
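
A minimal route for surfacing that worker information could resemble the following; it uses Celery's inspect API, and the route path and template arguments are assumptions rather than the actual code in async_backend/main.py.

    from celery import Celery
    from flask import Flask, render_template

    flask_app = Flask(__name__)
    celery_app = Celery("steamgestion", broker="amqp://guest:guest@rabbitmq:5672//")

    @flask_app.route("/workers")
    def workers():
        # Ask every connected worker which tasks it is currently executing.
        active = celery_app.control.inspect().active() or {}
        return render_template("show.html", workers=active)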

For querying the Steam Reviews dataset for exploration, relevant columns are extracted, such as the language of the review, the review text itself, the game name, and whether the game is recommended. In the figure below, the query results are rendered on a scaffolded Flask template.

queryresult
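
The kind of query behind that view could look like the sketch below, which pulls only the relevant fields from Elasticsearch; the index name and field names are assumptions, and the call uses the elasticsearch-py 8.x keyword-argument style.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://elasticsearch:9200")

    # Fetch English reviews, returning only the columns used for exploration.
    resp = es.search(
        index="steam_reviews",
        query={"match": {"language": "english"}},
        source=["language", "review", "app_name", "recommended"],
        size=10,
    )
    for hit in resp["hits"]["hits"]:
        print(hit["_source"])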

Acknowledgements & Feedback

Shlok Walia

I would love to receive any feedback or suggestions for this repo. It has been written as per my understanding and the learnings I gathered during my journey through this project. I would like to express my gratitude to Shlok Walia for helping me out and being a constant guide throughout this project. I hope you find it useful and easy to understand.

About

A Kubernetes microservice that ingests Steam data. The backend is deployed as a service that uses SQLite to store process state and publishes work to an event-driven message queue managed by RabbitMQ. The queue is also deployed as a service and is consumed by Celery workers integrated with Elasticsearch and Redis for ingestion and caching.
