GitHub - mustafahakkoz/Preprocessing_w_SPARK: Preprocessing + Feature Extraction pipeline by SPARK and Neo4J

Preprocessing Pipeline with MongoDB - Spark - Neo4j

Data manipulation and feature extraction with Spark, network calculations with Neo4j.

Main.py

Setting MongoDB connection and paths.

RunCalculators.py

Setting Spark and Neo4j connections, importing and exporting data, handling preprocessing and normalization.

UserFeatures.py

Calculating Twitter user features by manipulating and aggregating data with Spark methods.
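
To make the kind of aggregation concrete, here is a minimal plain-Python sketch (not the Spark code in UserFeatures.py). It derives three of the output fields, `tweets_total`, `dict_tweet_by_topic` and `dict_days_posted_by_topic`, from tweet records. The function name `aggregate_user_features`, the sample tweets, and the assumption that `created_at` uses the classic Twitter timestamp format are illustrative; the field semantics are inferred from the field names only.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Assumption: created_at follows the classic Twitter API timestamp format.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def aggregate_user_features(tweets):
    """Count tweets and distinct posting days per user and category."""
    counts = defaultdict(Counter)              # user_id -> category -> tweet count
    days = defaultdict(lambda: defaultdict(set))  # user_id -> category -> set of dates
    for tweet in tweets:
        user_id = str(tweet["user"]["id"])
        category = tweet["category"]
        day = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT).date()
        counts[user_id][category] += 1
        days[user_id][category].add(day)
    return {
        user_id: {
            "tweets_total": sum(by_topic.values()),
            "dict_tweet_by_topic": dict(by_topic),
            "dict_days_posted_by_topic": {
                cat: len(dates) for cat, dates in days[user_id].items()
            },
        }
        for user_id, by_topic in counts.items()
    }


# Hypothetical sample data, for illustration only.
sample_tweets = [
    {"user": {"id": 1}, "category": "category1",
     "created_at": "Wed Oct 10 20:19:24 +0000 2018"},
    {"user": {"id": 1}, "category": "category1",
     "created_at": "Wed Oct 10 22:00:00 +0000 2018"},
    {"user": {"id": 1}, "category": "category2",
     "created_at": "Thu Oct 11 08:00:00 +0000 2018"},
    {"user": {"id": 2}, "category": "category2",
     "created_at": "Thu Oct 11 09:30:00 +0000 2018"},
]

features = aggregate_user_features(sample_tweets)
```

In Spark the same idea would typically be expressed as a group-by on user and category followed by count and distinct-count aggregations.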

CentralityMeasures.py

Running ETL methods for Neo4j and calculating graph centrality measures with Cypher queries.
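
As a rough illustration of what a centrality measure captures, the sketch below counts the edges touching each node in an edge list with "Source" and "Target" fields. This is a plain-Python stand-in, not the Cypher used in CentralityMeasures.py, and the Neo4j plug-in may define degree differently (for example by direction). The function name `degree_counts` and the sample edges are hypothetical.

```python
from collections import Counter


def degree_counts(edges):
    """Count incident edges per node, ignoring edge direction."""
    degree = Counter()
    for edge in edges:
        degree[edge["Source"]] += 1
        degree[edge["Target"]] += 1
    return dict(degree)


# Hypothetical edges, for illustration only.
sample_edges = [
    {"Source": "alice", "Target": "bob"},
    {"Source": "alice", "Target": "carol"},
    {"Source": "bob", "Target": "carol"},
    {"Source": "dave", "Target": "alice"},
]

degrees = degree_counts(sample_edges)
```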

PREREQUISITES FOR SPARK

install pyspark --> pip install pyspark
install mongo-spark-connector --> pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.1 (this starts the PySpark shell and fetches the connector package)
or download with dependencies https://jar-download.com/artifacts/org.mongodb.spark/mongo-spark-connector_2.11/2.4.1/source-code
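
The connector can also be requested from Python when the session is built, instead of on the command line. The following is a sketch under assumptions, not the configuration used in RunCalculators.py: the MongoDB URI, the database name `twitter`, the collection name `tweets` and the helper names are placeholders. The configuration keys are the ones used by the 2.x connector series.

```python
MONGO_CONNECTOR = "org.mongodb.spark:mongo-spark-connector_2.11:2.4.1"


def spark_mongo_options(uri, database, collection):
    """Build Spark config entries for the MongoDB connector (2.x key names)."""
    full_uri = f"{uri.rstrip('/')}/{database}.{collection}"
    return {
        "spark.jars.packages": MONGO_CONNECTOR,
        "spark.mongodb.input.uri": full_uri,
        "spark.mongodb.output.uri": full_uri,
    }


def build_session(options, app_name="preprocessing"):
    """Create a SparkSession with the given options (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily so the helper above works without Spark

    builder = SparkSession.builder.appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Placeholder connection details; replace with your own.
options = spark_mongo_options("mongodb://localhost:27017", "twitter", "tweets")
```

With a running MongoDB instance, `build_session(options).read.format("mongo").load()` should return the collection as a DataFrame.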

PREREQUISITES FOR NEO4J

install neo4j-desktop --> https://neo4j.com/download/
create an empty Neo4j graph and install the plug-ins "Graph Algorithms" and "APOC". set the auth key (the database password) to "12345678"
pip install neo4j-driver
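
Once the graph is running, a quick connection check can confirm the setup. This sketch assumes Neo4j's default Bolt address `bolt://localhost:7687` and default user `neo4j`, neither of which is stated in this README; the helper names are hypothetical.

```python
def neo4j_settings(password="12345678", host="localhost", port=7687, user="neo4j"):
    """Collect connection settings; host, port and user are assumed defaults."""
    return {"uri": f"bolt://{host}:{port}", "auth": (user, password)}


def check_connection(settings):
    """Run a trivial query and return its result (requires neo4j-driver and a running graph)."""
    from neo4j import GraphDatabase  # imported lazily so the settings helper works without the driver

    driver = GraphDatabase.driver(settings["uri"], auth=settings["auth"])
    try:
        with driver.session() as session:
            return session.run("RETURN 1 AS ok").single()["ok"]
    finally:
        driver.close()


settings = neo4j_settings()
```

Calling `check_connection(settings)` against a started graph should return 1; an authentication error usually means the password differs from the one expected here.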

STRUCTURES OF DATABASES

The tweets database should be labeled through a "category" column.
The tweets database should also contain "id", "user.id", "user.screen_name" and "created_at" columns.
The user database should contain "id" and "screen_name" columns.
The edges file is a JSON file with columns "Source" and "Target"; each row defines a relationship between the two.
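
Before running the pipeline, it can help to check that records carry the fields listed above. The sketch below is a hypothetical helper, not part of the repository. It treats dotted names such as "user.id" as nested paths, which is an assumption about how the tweets are stored.

```python
REQUIRED_FIELDS = {
    "tweets": ["id", "user.id", "user.screen_name", "created_at", "category"],
    "users": ["id", "screen_name"],
    "edges": ["Source", "Target"],
}


def missing_fields(record, kind):
    """Return the required field paths that are absent from a record."""
    missing = []
    for path in REQUIRED_FIELDS[kind]:
        value = record
        for key in path.split("."):
            if not isinstance(value, dict) or key not in value:
                missing.append(path)
                break
            value = value[key]
    return missing


# Hypothetical records, for illustration only.
good_tweet = {
    "id": "100",
    "user": {"id": "1", "screen_name": "alice"},
    "created_at": "Wed Oct 10 20:19:24 +0000 2018",
    "category": "category1",
}
bad_tweet = {"id": "101", "user": {"id": "2"}, "created_at": "Thu Oct 11 08:00:00 +0000 2018"}

good_result = missing_fields(good_tweet, "tweets")
bad_result = missing_fields(bad_tweet, "tweets")
edge_result = missing_fields({"Source": "alice"}, "edges")
```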

STRUCTURE OF OUTPUT JSON

root  
 |-- id: string (nullable = true)  
 |-- user_features: struct (nullable = true)  
 |    |-- dict_activeness_1: struct (nullable = true)  
 |    |    |-- category1: double (nullable = true)  
 |    |    |-- category2: double (nullable = true)  
 |    |-- dict_activeness_2: struct (nullable = true)  
 |    |    |-- category1: double (nullable = true)  
 |    |    |-- category2: double (nullable = true)  
 |    |-- dict_activeness_3: struct (nullable = true)  
 |    |    |-- category1: double (nullable = true)  
 |    |    |-- category2: double (nullable = true)  
 |    |-- dict_days_posted_by_topic: struct (nullable = true)  
 |    |    |-- category1: long (nullable = true)  
 |    |    |-- category2: long (nullable = true)  
 |    |-- dict_focus_rate: struct (nullable = true)  
 |    |    |-- category1: double (nullable = true)  
 |    |    |-- category2: double (nullable = true)  
 |    |-- dict_tweet_by_topic: struct (nullable = true)  
 |    |    |-- category1: long (nullable = true)  
 |    |    |-- category2: long (nullable = true)  
 |    |-- tweets_total: long (nullable = true)  
 |-- centralities: struct (nullable = true)  
 |    |-- betweennessCentrality: struct (nullable = true)  
 |    |    |-- category1: double (nullable = true)  
 |    |    |-- category2: double (nullable = true)  
 |    |-- closenessCentrality: struct (nullable = true)  
 |    |    |-- category1: double (nullable = true)  
 |    |    |-- category2: double (nullable = true)  
 |    |-- degreeCentrality: struct (nullable = true)  
 |    |    |-- category1: double (nullable = true)  
 |    |    |-- category2: double (nullable = true)  
 |    |-- pageRank: struct (nullable = true)  
 |    |    |-- category1: double (nullable = true)  
 |    |    |-- category2: double (nullable = true)
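
For readers who prefer to see the shape as data, the sketch below builds one record with the same nesting as the schema above and serializes it to a JSON line. All values are zero placeholders, not real results, and the helper name `empty_record` is hypothetical.

```python
import json

# Field names copied from the schema above, grouped by value type.
FEATURE_DOUBLES = ["dict_activeness_1", "dict_activeness_2",
                   "dict_activeness_3", "dict_focus_rate"]
FEATURE_LONGS = ["dict_days_posted_by_topic", "dict_tweet_by_topic"]
CENTRALITIES = ["betweennessCentrality", "closenessCentrality",
                "degreeCentrality", "pageRank"]


def empty_record(user_id, categories):
    """Build a record shaped like the output schema, filled with zero placeholders."""
    user_features = {name: {c: 0.0 for c in categories} for name in FEATURE_DOUBLES}
    user_features.update({name: {c: 0 for c in categories} for name in FEATURE_LONGS})
    user_features["tweets_total"] = 0
    centralities = {name: {c: 0.0 for c in categories} for name in CENTRALITIES}
    return {
        "id": str(user_id),
        "user_features": user_features,
        "centralities": centralities,
    }


record = empty_record(42, ["category1", "category2"])
json_line = json.dumps(record, sort_keys=True)
```

Each per-category struct holds one entry per value of the "category" label, so the keys `category1` and `category2` stand in for whatever labels the tweets database uses.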

