This package includes functions helping with common tasks during EDA stage of a data science project

lephanthuymai/datascience_eda

 
 

datascience_eda

This package includes functions assisting data scientists with various common tasks during the exploratory data analysis stage of a data science project. Its functions will help the data scientist to do preliminary analysis on common column types like numeric columns, categorical columns and text columns; it will also conduct several experimental clusterings on the dataset.

Our functions are tailored to our own experience. Similar packages are also published on PyPI, and a few good ones are worth mentioning:

Installation

Several dependencies are not available on test.pypi, so please use the exact command below to install our package.

$ pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple datascience-eda
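After installing, it can be handy to confirm that the package is visible to the current Python environment. The sketch below uses only the standard library; `installed_version` is a hypothetical helper name, and the distribution name `datascience-eda` is taken from the install command above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string when the install succeeded, otherwise None.
print(installed_version("datascience-eda"))
```

A `None` result usually means the command ran against a different interpreter or virtual environment than the one being checked.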

Main Functions

  • explore_numeric_columns: conducts common exploratory analysis on numeric columns. It generates a heatmap of correlation coefficients (using the Pearson, Kendall or Spearman method, as chosen by the user), histograms and SPLOM plots for all numeric columns or for a list of columns specified by the user. It returns a list of plot objects so that the user can save and reuse them later.

  • explore_categorical_columns: performs exploratory analysis on categorical features. It returns a dataframe containing column names, the corresponding unique categories, counts of null values, percentages of null values and the most frequent categories. It also generates and visualizes countplots for a list of categorical columns of choice.

  • explore_text_columns: performs exploratory data analysis on text features. It prints summary statistics of character length and word count. It also plots a word cloud and the distributions of character length, word count, polarity and subjectivity scores. Bar charts of the top n stopwords, the top n words other than stopwords, the top n bigrams, sentiments, named entities and part-of-speech tags are visualized as well. It returns a list of plot objects.

  • explore_clustering: fits K-Means and DBSCAN clustering algorithms on the dataset and visualizes Elbow, Silhouette Score and PCA plots. It returns a dictionary in which each key is the name of a clustering algorithm and each value is a list of plots generated by that model.

  • explore_KMeans_clustering: fits the K-Means clustering algorithm on the dataset and visualizes Elbow, Silhouette Score and PCA plots. It returns a dictionary in which each key is the name of a plot type and each value is a list of plots generated for that type.

  • explore_DBSCAN_clustering: fits the DBSCAN clustering algorithm on the dataset and visualizes Silhouette Score and PCA plots. It returns a tuple containing a list of the n_clusters values returned by the DBSCAN models and a dictionary in which each key is the name of a plot type and each value is a list of plots generated for that type.
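To make the correlation options of explore_numeric_columns more concrete, here is a minimal plain-Python sketch of how Pearson and Spearman coefficients differ. It is an illustration only, not the package's implementation; the helper names and the `calories` and `fat` sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical sample: fat grows with the square of calories (monotonic, not linear).
calories = [100, 200, 300, 400, 500]
fat = [1, 4, 9, 16, 25]

print(round(pearson(calories, fat), 4))
print(round(spearman(calories, fat), 4))
```

Because the relationship is monotonic but not linear, the Spearman coefficient reaches 1 while the Pearson coefficient stays slightly below it. This is one reason to compare methods when reading the heatmap.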

Dependencies

The list of dependencies can be found at: https://github.com/UBC-MDS/datascience_eda/blob/main/pyproject.toml

Usage

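import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

import datascience_eda as eda

# Load the raw data
original_df = pd.read_csv("/data/menu.csv")

# Impute missing values and standardize the numeric columns
numeric_features = eda.get_numeric_columns(original_df)
numeric_transformer = make_pipeline(SimpleImputer(), StandardScaler())
preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features)
)
df = pd.DataFrame(
    data=preprocessor.fit_transform(original_df), columns=numeric_features
)

eda.explore_numeric_columns(df)
# df holds only the scaled numeric columns, so the categorical and text
# helpers are given the original dataframe instead
eda.explore_categorical_columns(original_df, ["categorical_column1", "categorical_column2"])
eda.explore_text_columns(original_df)
eda.explore_clustering(df)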

Documentation

The official documentation is hosted on Read the Docs: https://datascience_eda.readthedocs.io/en/latest/

Contributors

We welcome and recognize all contributions. You can see a list of current contributors in the contributors tab. Please check out our CONDUCTING.rst if you are interested in contributing to this project.

Credits

This package was created with Cookiecutter and the UBC-MDS/cookiecutter-ubc-mds project template, modified from the pyOpenSci/cookiecutter-pyopensci project template and the audreyr/cookiecutter-pypackage.
