ETL pipeline for Speech-to-text-data-collection

This project aims to develop an open-source speech data collection tool that can be used by any language. We have created a new ETL pipeline a s3 bucket, Kafka, Spark and Airflow, which can automatically handle schema changes We have used apache Kafka as a streaming platform that allows for the creation of real-time data processing pipelines and streaming applications. Kafka, allows you to sequentially log streaming data into topic-specific feeds. Apache Airflow allows us to create, orchestrate and monitor data workflows. As a part of distributed data processing, we have used spark.

Arcitecture

We have implemented this arcitecture

Steps to Build ETL Pipline

Build a simple web app that will help us to collect speech for specific text sent by Kafka producer.
Setup delta lake on s3 bucket
Setup Kafka producer and consumer using Kafka-python
Setup Airflow for scheduling
Setup spark

How to use and contribute

To use repository

Assuming that you are working in Project directory

cd ~/Project git clone https://github.com/mahlettaye/Speech_data_collection_tool.git git checkout main

To contribute to the the repo

Instead, you will create dev_yourname on your machine that exist for the purpose of solving singular issues.

git checkout main git checkout -b dev_yourname

-Make changes.

git add . git commit -m "Updated Kafaka

This adds any new files to be tracked and makes a commit. Now let's add them to your branch and pull request to merge.

git checkout main git push origin dev_yourname

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

ETL pipeline for Speech-to-text-data-collection

Arcitecture

Steps to Build ETL Pipline

How to use and contribute

To use repository

To contribute to the the repo

Files

README.md

Latest commit

History

README.md

File metadata and controls

ETL pipeline for Speech-to-text-data-collection

Arcitecture

Steps to Build ETL Pipline

How to use and contribute

To use repository

To contribute to the the repo