
ETL pipeline for speech-to-text data collection

This project aims to develop an open-source speech data collection tool that can be used for any language. We have created an ETL pipeline built on an S3 bucket, Kafka, Spark, and Airflow that can automatically handle schema changes. We use Apache Kafka as the streaming platform, which allows for the creation of real-time data processing pipelines and streaming applications; Kafka lets us sequentially log streaming data into topic-specific feeds. Apache Airflow allows us to create, orchestrate, and monitor data workflows. For distributed data processing, we use Spark.
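
As a rough sketch of this flow, the snippet below uses kafka-python to publish text prompts to a topic and read them back. The topic name text_corpus and the localhost broker address are illustrative assumptions, not the project's actual configuration.

from kafka import KafkaProducer, KafkaConsumer
import json

# Producer: log each text prompt into a topic-specific feed
# (topic name and broker address are assumed for illustration).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("text_corpus", {"text_id": 1, "text": "Sample sentence to be read aloud"})
producer.flush()

# Consumer: read the feed sequentially from the beginning.
consumer = KafkaConsumer(
    "text_corpus",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)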

Architecture

We have implemented this architecture (see the architecture diagram in the repository).

Steps to Build the ETL Pipeline

  • Build a simple web app that helps us collect speech recordings for the specific texts sent by the Kafka producer.
  • Set up Delta Lake on an S3 bucket.
  • Set up a Kafka producer and consumer using kafka-python (see the producer/consumer sketch in the overview above).
  • Set up Airflow for scheduling (a minimal DAG sketch follows this list).
  • Set up Spark (a streaming sketch follows this list).
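
For the scheduling step, a minimal Airflow DAG sketch is shown below, assuming Airflow 2.x. The DAG id, schedule, and task callable are hypothetical placeholders rather than the project's actual workflow.

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def send_texts_to_kafka():
    # Placeholder: publish a batch of text prompts to the Kafka topic here.
    pass

with DAG(
    dag_id="speech_data_collection",       # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval=timedelta(hours=1),  # assumed hourly schedule
    catchup=False,
) as dag:
    PythonOperator(
        task_id="send_texts_to_kafka",
        python_callable=send_texts_to_kafka,
    )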

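For the Spark and Delta Lake steps, a Structured Streaming sketch along these lines could read the Kafka feed and append it to a Delta table on S3. The bucket path, topic name, and required packages (spark-sql-kafka and Delta Lake) are assumptions, not the project's actual settings.

from pyspark.sql import SparkSession

# Assumes the spark-sql-kafka and Delta Lake packages are on the Spark classpath.
spark = SparkSession.builder.appName("speech_text_stream").getOrCreate()

# Read the text feed from Kafka as a streaming DataFrame
# (broker address and topic name are assumed for illustration).
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "text_corpus")
    .load()
)

texts = stream.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

# Append the stream to a Delta table on S3 (bucket path is a placeholder).
query = (
    texts.writeStream
    .format("delta")
    .option("checkpointLocation", "s3a://<your-bucket>/checkpoints/text_corpus")
    .start("s3a://<your-bucket>/delta/text_corpus")
)
query.awaitTermination()
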
How to use and contribute

To use the repository

Assuming that you are working in the Project directory:

cd ~/Project
git clone https://github.com/mahlettaye/Speech_data_collection_tool.git
cd Speech_data_collection_tool
git checkout main

To contribute to the repo

  • Instead of working on main directly, create a dev_yourname branch on your machine that exists for the purpose of solving a single issue.

git checkout main
git checkout -b dev_yourname

  • Make your changes.

git add .
git commit -m "Updated Kafka"

This stages any new or changed files and makes a commit. Now push your branch and open a pull request to merge.

git checkout main
git push origin dev_yourname
