datascience_eda

This package provides functions that assist data scientists with common tasks during the exploratory data analysis (EDA) stage of a data science project. The functions run preliminary analysis on common column types (numeric, categorical and text columns) and fit several experimental clustering models on the dataset.

Our functions are tailored to our own experience. Similar packages are also published on PyPI; a few good ones are worth mentioning:

Installation

Several dependencies are not available on test.pypi, so please use the exact command below to install our package. The extra index URL lets pip resolve those dependencies from PyPI.

$ pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple datascience-eda
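
To confirm that the command above worked, a quick check with the Python standard library is enough. This is a minimal sketch, not part of the package: the helper names `installed_version` and `is_importable` are hypothetical, while the distribution name `datascience-eda` and the module name `datascience_eda` are taken from the install command and the usage example in this README.

```python
from importlib import metadata, util


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def is_importable(module_name):
    """Return True if the top-level module can be located (without importing it)."""
    return util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Both lines report a negative result instead of raising if the install failed.
    print("datascience-eda version:", installed_version("datascience-eda"))
    print("datascience_eda importable:", is_importable("datascience_eda"))
```

If the first line prints `None`, the package was not installed into the active environment, which usually means pip ran against a different interpreter.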

Main Functions

explore_numeric_columns: conducts common exploratory analysis on numeric columns. It generates a heatmap of correlation coefficients (Pearson, Kendall or Spearman, as chosen by the user), histograms and SPLOM plots for all numeric columns or for a list of columns specified by the user. It returns a list of plot objects so that the user can save and reuse them later.
explore_categorical_columns: performs exploratory analysis on categorical features. It returns a dataframe containing column names, the corresponding unique categories, counts of null values, percentages of null values and the most frequent categories. It also generates and displays countplots for a chosen list of categorical columns.
explore_text_columns: performs exploratory data analysis on text features. It prints summary statistics of character length and word count. It also plots a word cloud and the distributions of character length, word count, and polarity and subjectivity scores. Bar charts of the top n stopwords, top n words other than stopwords, top n bigrams, sentiments, named entities and part-of-speech tags are displayed as well. It returns a list of plot objects.
explore_clustering: fits the K-Means and DBSCAN clustering algorithms on the dataset and visualizes Elbow, Silhouette Score and PCA plots. It returns a dictionary in which each key is the name of a clustering algorithm and each value is the list of plots generated by that model.
explore_KMeans_clustering: fits the K-Means clustering algorithm on the dataset and visualizes Elbow, Silhouette Score and PCA plots. It returns a dictionary in which each key is the name of a plot type and each value is the list of plots generated for that type.
explore_DBSCAN_clustering: fits the DBSCAN clustering algorithm on the dataset and visualizes Silhouette Score and PCA plots. It returns a tuple containing a list of the n_clusters values returned by the DBSCAN models and a dictionary in which each key is the name of a plot type and each value is the list of plots generated for that type.
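
To make the summary table described for explore_categorical_columns more concrete, here is a minimal pandas sketch of that kind of table. It is an illustration under stated assumptions, not the package's implementation: the helper `summarize_categorical`, the output column names and the demo data are all hypothetical.

```python
import pandas as pd


def summarize_categorical(df, columns):
    """Build one summary row per categorical column (hypothetical helper)."""
    rows = []
    for col in columns:
        series = df[col]
        n_null = int(series.isna().sum())
        modes = series.mode(dropna=True)  # most frequent non-null value(s)
        rows.append(
            {
                "column_name": col,
                "unique_items": sorted(series.dropna().unique().tolist()),
                "no_of_nulls": n_null,
                "percentage_missing": round(100 * n_null / len(series), 2)
                if len(series)
                else 0.0,
                "most_frequent": modes.iloc[0] if not modes.empty else None,
            }
        )
    return pd.DataFrame(rows)


# Made-up demo data; the column names are placeholders.
demo = pd.DataFrame(
    {
        "category": ["Breakfast", "Beef", "Breakfast", None],
        "size": ["S", "M", "M", "M"],
    }
)
summary = summarize_categorical(demo, ["category", "size"])
print(summary)
```

When several categories tie for most frequent, `Series.mode` returns all of them in sorted order and this sketch keeps only the first; the package may handle ties differently.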

Dependencies

The list of dependencies can be found at: https://github.com/UBC-MDS/datascience_eda/blob/main/pyproject.toml

Usage

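import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

import datascience_eda as eda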

original_df = pd.read_csv("/data/menu.csv")
numeric_features = eda.get_numeric_columns(original_df)
numeric_transformer = make_pipeline(SimpleImputer(), StandardScaler())
preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features)
)
df = pd.DataFrame(
    data=preprocessor.fit_transform(original_df), columns=numeric_features
)

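# Numeric exploration and clustering use the imputed, scaled numeric dataframe.
eda.explore_numeric_columns(df)
eda.explore_clustering(df)

# Categorical and text exploration need the original columns, because df keeps
# only the numeric features. The column names below are placeholders.
eda.explore_categorical_columns(original_df, ["categorical_column1", "categorical_column2"])
eda.explore_text_columns(original_df)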
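
The Elbow and Silhouette Score plots mentioned for explore_clustering are built from two quantities per candidate number of clusters: the K-Means inertia and the mean silhouette score. The sketch below computes both directly with scikit-learn on synthetic data. It is an assumption about what such plots typically show, not the package's own code, and the helper `kmeans_scores` and the blob data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a scaled numeric dataframe: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])


def kmeans_scores(X, k_values):
    """Return inertia (Elbow plot) and silhouette score for each k (hypothetical helper)."""
    inertias, silhouettes = {}, {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        inertias[k] = model.inertia_  # within-cluster sum of squared distances
        silhouettes[k] = silhouette_score(X, model.labels_)
    return inertias, silhouettes


inertias, silhouettes = kmeans_scores(X, range(2, 7))
# The k with the highest silhouette score is one common choice of cluster count.
best_k = max(silhouettes, key=silhouettes.get)
print("inertia by k:", inertias)
print("silhouette by k:", silhouettes)
print("best k by silhouette:", best_k)
```

Plotting `inertias` against k gives the Elbow curve, and plotting `silhouettes` against k gives the Silhouette Score curve. On real data the two criteria can disagree, which is one reason to look at both.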

Documentation

The official documentation is hosted on Read the Docs: https://datascience_eda.readthedocs.io/en/latest/

Contributors

We welcome and recognize all contributions. You can see a list of current contributors in the contributors tab. Please check out our CONTRIBUTING.rst if you are interested in contributing to this project.

Credits

This package was created with Cookiecutter and the UBC-MDS/cookiecutter-ubc-mds project template, which is modified from the pyOpenSci/cookiecutter-pyopensci project template and audreyr/cookiecutter-pypackage.
