
SingularityUrBrain/stackoverflow-ml-search


StackOverflow ML search

I work with unstructured Stack Overflow question data (about 60,000 collected ML-related questions). I process it using NLP techniques and do a short data visualization. I then build a model based on Word2Vec's Skip-Gram model to find the k questions most similar to a query, and evaluate these models on a small test dataset with HitsCount and nDCG scores.
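As a rough illustration of the retrieval step, here is a minimal sketch that represents each question as the average of its word vectors and ranks candidates by cosine similarity. This is an assumption about how such a search can work, not necessarily what the notebook does. The toy `word_vectors` dictionary stands in for a trained Skip-Gram model, and the names `embed`, `cosine` and `top_k_similar` are hypothetical.

```python
import math

def embed(text, word_vectors, dim):
    """Average the vectors of known words; zero vector if none are known."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(component) / len(vecs) for component in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k_similar(query, questions, word_vectors, dim, k):
    """Return the k candidate questions most similar to the query."""
    q_vec = embed(query, word_vectors, dim)
    scored = [(cosine(q_vec, embed(c, word_vectors, dim)), c) for c in questions]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]

# Toy 3-dimensional vectors (made up for illustration, not trained).
word_vectors = {
    "train": [1.0, 0.0, 0.0],
    "model": [0.9, 0.1, 0.0],
    "loss":  [0.8, 0.2, 0.0],
    "plot":  [0.0, 1.0, 0.0],
    "chart": [0.0, 0.9, 0.1],
    "draw":  [0.1, 0.9, 0.0],
}
questions = ["train model", "plot chart", "model loss"]
result = top_k_similar("draw plot", questions, word_vectors, dim=3, k=2)
print(result)
```

In practice the vectors would come from a Word2Vec model trained on the question corpus, and the candidate embeddings would be precomputed once rather than on every query.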

Notes

The notebook contains several interactive plots made with plotly that don't render on GitHub. To see them all, open the notebook in nbviewer or run it locally in trusted mode.

Requirements

To create a virtual environment with all the dependencies needed for the notebook:

Conda

conda env create -n ENV_NAME --file environment.yml

Pip

Create a virtual environment using the Python venv module, pipenv or virtualenv, then install the packages with the following command:

pip install -r requirements.txt

Results

For more details about the metrics, see the notebook.
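For orientation, the two metrics can be sketched as follows. This assumes each test query has exactly one relevant question and that `ranks` holds the 1-based position at which the model placed it. Under that assumption the ideal DCG is 1, so nDCG equals DCG. The notebook's exact definitions may differ, and the function names here are hypothetical.

```python
import math

def hits_at_k(ranks, k):
    """Fraction of queries whose relevant question appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def ndcg_at_k(ranks, k):
    """Mean of 1 / log2(1 + rank) over queries, counting only ranks <= k.

    With a single relevant item per query the ideal DCG is 1,
    so no further normalization is needed.
    """
    return sum(1.0 / math.log2(1 + r) for r in ranks if r <= k) / len(ranks)

# Made-up ranks for four test queries (illustration only, not real results).
ranks = [1, 3, 12, 2]
print(hits_at_k(ranks, 5))
print(ndcg_at_k(ranks, 5))
```

Both scores lie between 0 and 1. Hits only checks whether the relevant question made it into the top k, while nDCG also rewards placing it closer to the top.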

Hits scores

[Figure: hit_score]

nDCG scores

[Figure: dcg_score]