📊 Data Science and Big Data Analytics Laboratory - Savitribai Phule Pune University 📊

Welcome to the repository for the Data Science and Big Data Analytics Laboratory (310256) course, part of the Third Year Computer Engineering curriculum (2019 Course) at Savitribai Phule Pune University. This repository provides practical implementations and resources to help you gain hands-on experience with essential data science concepts, techniques, and tools.

🏛️ Course Information:

  • University: Savitribai Phule Pune University
  • Course Name: Data Science and Big Data Analytics Laboratory (310256)
  • Companion Course: Data Science and Big Data Analytics (310251)
  • Credit: 02
  • Examination Scheme:
    • Practical: 04 Hours/Week
    • Term Work: 50 Marks
    • Practical Exam: 25 Marks

🎯 Learning Objectives:

  • Understand the fundamental principles of data science and apply them to real-world problem-solving.
  • Gain in-depth knowledge and practical implementation skills in key data science and big data analytics technologies.
  • Master statistical data analysis techniques for informed decision-making.
  • Acquire practical experience with essential programming languages and tools used in data science and big data analysis.

💡 Course Outcomes:

Upon successful completion of this laboratory course, students will be able to:

  • CO1: Apply data science principles to analyze and address real-world problems.
  • CO2: Implement various data representation techniques using statistical methods.
  • CO3: Implement and evaluate common data analytics algorithms.
  • CO4: Perform effective text preprocessing for natural language processing tasks.
  • CO5: Utilize data visualization techniques to gain insights from data.
  • CO6: Employ cutting-edge tools and technologies to analyze big data.

📂 Practical Implementations:

Each practical has its own directory in this repository; short illustrative sketches for each one follow this list.

  • Practical 1 (Data Wrangling I):
    1. Import the necessary Python libraries.
    2. Identify and describe an open-source dataset from a source such as Kaggle.
    3. Load the dataset into a pandas DataFrame.
    4. Perform data preprocessing: check for missing values, provide initial statistics and variable descriptions, and check the DataFrame dimensions.
    5. Format and normalize the data: analyze data types, apply type conversions where needed, and convert categorical variables to quantitative representations.
    6. Provide clear explanations of all operations and of the data import/reading process.
  • Practical 2 (Data Wrangling II):
    1. Create an "Academic performance" dataset for students.
    2. Handle missing values and inconsistencies using appropriate techniques.
    3. Identify and handle outliers in numeric variables.
    4. Apply a data transformation to at least one variable, justifying the chosen approach (e.g., scaling, linearization, or reducing skewness).
  • Practical 3 (Descriptive Statistics):
    1. Calculate and present summary statistics (mean, median, min, max, standard deviation) for a dataset, grouped by a categorical variable (e.g., income grouped by age groups).
    2. Write a Python program to display statistical details (percentiles, mean, standard deviation) for the different species in the iris.csv dataset.
  • Practical 4 (Data Analytics I): Build a linear regression model using Python/R to predict home prices with the Boston Housing Dataset from Kaggle.
  • Practical 5 (Data Analytics II):
    1. Implement logistic regression using Python/R to classify data from the Social_Network_Ads.csv dataset.
    2. Calculate and analyze the confusion matrix, accuracy, error rate, precision, and recall.
  • Practical 6 (Data Analytics III):
    1. Implement the Naïve Bayes classification algorithm using Python/R on the iris.csv dataset.
    2. Evaluate the model using the confusion matrix, accuracy, error rate, precision, and recall.
  • Practical 7 (Text Analytics):
    1. Extract a sample document and apply preprocessing techniques: tokenization, POS tagging, stop-word removal, stemming, and lemmatization.
    2. Create a document representation using Term Frequency-Inverse Document Frequency (TF-IDF).
  • Practical 8 (Data Visualization I):
    1. Explore patterns in the 'titanic' dataset using Seaborn.
    2. Visualize the ticket price distribution using a histogram.
  • Practical 9 (Data Visualization II):
    1. Create a box plot showing the age distribution by gender and survival status from the 'titanic' dataset.
    2. Draw inferences from the visualized statistics.
  • Practical 10 (Data Visualization III):
    1. Download a dataset (e.g., Iris) and identify the feature types (numeric, nominal).
    2. Create histograms and box plots for each feature to illustrate the distributions.
    3. Compare the distributions and identify potential outliers.
  • Practical 11 (Hadoop Word Count): Develop a simple Word Count application in Java using the Hadoop MapReduce framework on a local standalone setup.
  • Practical 12 (Weather Data Analysis): Process weather data from a text file (sample_weather.txt) using Hadoop to calculate the average temperature, dew point, and wind speed.
  • Practical 13 (Apache Spark Basics): Write a simple program using Scala and the Apache Spark framework.
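
🧪 Illustrative Sketches:

The snippets below are minimal sketches of each practical, not the full implementations kept in the practical directories. File names, column names, and parameter choices are assumptions unless the practical names them explicitly.

Practical 1 works through basic wrangling with pandas. The sketch builds a tiny stand-in DataFrame so it runs as-is; in the actual practical you would read the chosen Kaggle dataset with pd.read_csv.

```python
# Practical 1 (sketch): basic data wrangling with pandas.
# "students.csv" is a placeholder; in practice read_csv would point at the
# open-source dataset chosen for the practical.
import pandas as pd

# df = pd.read_csv("students.csv")           # load the chosen dataset
df = pd.DataFrame({                           # tiny stand-in frame so the sketch runs as-is
    "age": [21, 22, None, 24],
    "score": ["88", "92", "75", "81"],        # numeric values stored as strings
    "grade": ["A", "A", "B", "B"],            # categorical variable
})

print(df.isnull().sum())                      # missing values per column
print(df.describe(include="all"))             # initial statistics
print(df.dtypes, df.shape)                    # variable types and DataFrame dimensions

df["score"] = df["score"].astype(int)         # type conversion
df["grade_code"] = df["grade"].astype("category").cat.codes  # categorical -> quantitative
print(df.head())
```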
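
Practical 2 asks for missing-value handling, outlier treatment, and a justified transformation. One possible sketch, with made-up "Academic performance" columns, uses median imputation, the IQR rule for outliers, and a log transform to reduce skew:

```python
# Practical 2 (sketch): handling missing values, outliers, and skew.
# The column names below are assumptions for an "Academic performance" dataset.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "math_score":    [65, 70, np.nan, 72, 68, 99, 250],   # 250 is a deliberate outlier
    "reading_score": [60, 64, 66, np.nan, 70, 71, 69],
})

df = df.fillna(df.median(numeric_only=True))               # impute missing values with medians

# Flag outliers with the IQR rule, then cap them at the fences.
q1, q3 = df["math_score"].quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["math_score"] = df["math_score"].clip(low, high)

# Log transform to reduce right skew (one possible justification).
df["math_score_log"] = np.log1p(df["math_score"])
print(df)
```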
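
Practical 3's grouped statistics map directly onto pandas groupby. The sketch assumes iris.csv (as named in the practical) is in the working directory with a "species" column:

```python
# Practical 3 (sketch): grouped summary statistics.
# Assumes iris.csv with a "species" column and numeric measurement columns.
import pandas as pd

iris = pd.read_csv("iris.csv")

# Mean, median, min, max and standard deviation per species.
print(iris.groupby("species").agg(["mean", "median", "min", "max", "std"]))

# Percentiles, mean and standard deviation for each species via describe().
for name, group in iris.groupby("species"):
    print(name)
    print(group.describe())
```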
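
Practical 4 fits an ordinary least-squares model with scikit-learn. The file name HousingData.csv and the MEDV target column are assumptions about the Kaggle copy of the Boston Housing data:

```python
# Practical 4 (sketch): linear regression for home-price prediction.
# "HousingData.csv" and the MEDV target column are assumed; adjust to the
# actual Kaggle file used in the practical.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("HousingData.csv").dropna()
X = df.drop(columns=["MEDV"])     # predictors
y = df["MEDV"]                    # median home value (target)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, pred))
print("R^2:", r2_score(y_test, pred))
```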
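
Practical 5 combines logistic regression with the standard classification metrics. The Age/EstimatedSalary/Purchased columns are assumed from the usual layout of Social_Network_Ads.csv:

```python
# Practical 5 (sketch): logistic regression and its evaluation metrics.
# Column names are assumptions about the Social_Network_Ads.csv layout.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

df = pd.read_csv("Social_Network_Ads.csv")
X = df[["Age", "EstimatedSalary"]]
y = df["Purchased"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)
pred = model.predict(scaler.transform(X_test))

acc = accuracy_score(y_test, pred)
print("Confusion matrix:\n", confusion_matrix(y_test, pred))
print("Accuracy:", acc, "Error rate:", 1 - acc)
print("Precision:", precision_score(y_test, pred))
print("Recall:", recall_score(y_test, pred))
```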
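
Practical 6 is the same evaluation loop around Gaussian Naïve Bayes. This sketch uses scikit-learn's bundled iris data so it runs without the CSV; precision and recall are macro-averaged because the problem is multiclass:

```python
# Practical 6 (sketch): Gaussian Naive Bayes on the iris data.
# Uses sklearn's bundled iris data; the practical itself loads iris.csv.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = GaussianNB().fit(X_train, y_train)
pred = model.predict(X_test)

acc = accuracy_score(y_test, pred)
print("Confusion matrix:\n", confusion_matrix(y_test, pred))
print("Accuracy:", acc, "Error rate:", 1 - acc)
# Multiclass problem, so precision and recall are averaged across classes.
print("Precision:", precision_score(y_test, pred, average="macro"))
print("Recall:", recall_score(y_test, pred, average="macro"))
```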
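
Practical 7's preprocessing steps all have direct NLTK counterparts, and scikit-learn's TfidfVectorizer covers the TF-IDF representation. The sample sentences below are made up, and the NLTK resources must be downloaded first:

```python
# Practical 7 (sketch): text preprocessing with NLTK plus a TF-IDF representation.
# The sample document is invented; the NLTK resources are downloaded below.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

for pkg in ["punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"]:
    nltk.download(pkg, quiet=True)

doc = "Data science combines statistics and programming to analyse data."
tokens = word_tokenize(doc.lower())                         # tokenization
print(nltk.pos_tag(tokens))                                 # POS tagging
tokens = [t for t in tokens
          if t.isalpha() and t not in stopwords.words("english")]  # stop-word removal
print([PorterStemmer().stem(t) for t in tokens])            # stemming
print([WordNetLemmatizer().lemmatize(t) for t in tokens])   # lemmatization

# TF-IDF representation over a tiny two-document corpus.
corpus = [doc, "Big data analytics processes data at scale."]
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(corpus)
print(tfidf.get_feature_names_out())
print(matrix.toarray())
```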
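
Practical 8 can be sketched with Seaborn's bundled copy of the titanic dataset:

```python
# Practical 8 (sketch): ticket-price distribution in the titanic dataset.
import matplotlib.pyplot as plt
import seaborn as sns

titanic = sns.load_dataset("titanic")       # Seaborn's built-in copy of the dataset
sns.histplot(data=titanic, x="fare", bins=30)
plt.title("Distribution of ticket prices")
plt.show()
```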
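
Practical 9 adds a grouping dimension to the same dataset; a box plot with a hue split shows age by gender and survival status:

```python
# Practical 9 (sketch): age distribution by gender and survival status.
import matplotlib.pyplot as plt
import seaborn as sns

titanic = sns.load_dataset("titanic")
sns.boxplot(data=titanic, x="sex", y="age", hue="survived")
plt.title("Age by gender and survival")
plt.show()
```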
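
Practical 10 repeats the histogram/box-plot pair for every numeric iris feature, which makes potential outliers visible as points beyond the whiskers:

```python
# Practical 10 (sketch): per-feature histograms and box plots for iris.
import matplotlib.pyplot as plt
import seaborn as sns

iris = sns.load_dataset("iris")          # four numeric features plus a nominal species column
for col in iris.select_dtypes("number").columns:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    sns.histplot(iris[col], ax=ax1)      # distribution shape
    sns.boxplot(x=iris[col], ax=ax2)     # outliers appear beyond the whiskers
    fig.suptitle(col)
plt.show()
```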
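
Practical 11 is implemented in Java in this repository. Purely to keep these sketches in one language, the same map/reduce logic is shown below as it would look for Hadoop Streaming in Python; in practice the two functions would live in separate mapper and reducer scripts passed to the streaming jar:

```python
# Practical 11 (sketch): word count as a Hadoop Streaming job in Python.
# The practical itself is a Java MapReduce job; this is only an equivalent sketch.
import sys

def mapper():
    # mapper.py: emit "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # reducer.py: sum counts per word (input arrives sorted by key).
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")
```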
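
Practical 12 averages three fields per record. The exact layout of sample_weather.txt is not reproduced here, so the field positions below are assumptions; the actual practical computes the same averages as a Hadoop job:

```python
# Practical 12 (sketch): averaging weather readings.
# Assumes temperature, dew point and wind speed as the first three
# whitespace-separated numeric fields of sample_weather.txt.
TEMP, DEW, WIND = 0, 1, 2        # assumed field positions

sums = [0.0, 0.0, 0.0]
count = 0
with open("sample_weather.txt") as fh:
    for line in fh:
        fields = line.split()
        if len(fields) < 3:
            continue             # skip malformed lines
        for i in (TEMP, DEW, WIND):
            sums[i] += float(fields[i])
        count += 1

if count:
    print("Average temperature:", sums[TEMP] / count)
    print("Average dew point:  ", sums[DEW] / count)
    print("Average wind speed: ", sums[WIND] / count)
```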
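
Practical 13 asks for Scala; a PySpark equivalent is sketched below only to stay in one language, with data.txt as a placeholder input file:

```python
# Practical 13 (sketch): a minimal Spark program (PySpark stand-in for the Scala version).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkBasics").getOrCreate()

lines = spark.read.text("data.txt")          # DataFrame with a single "value" column
print("Number of lines:", lines.count())
lines.show(5, truncate=False)

spark.stop()
```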

🚀 Getting Started:

Navigate to the directory of a specific practical for its instructions, code examples, and datasets.

🙌 Contributions:

Contributions, improvements, and feedback from the data science community are highly appreciated! If you have any enhancements, bug fixes, or additional practical examples to share, please open a pull request. Refer to the CONTRIBUTING.md file for guidelines.

📄 License:

This repository is distributed under the MIT License. You are free to use, modify, and distribute the code for educational and personal projects.

Let's explore the world of data science and big data together!
