A deployed web app that helps data science job seekers find jobs relevant to their skill set. Built with Python, Docker, AWS, Scikit-Learn, and Streamlit, it offers an interactive way to navigate the messy process of finding a job that fits.

Introduction:

  • The goal of this project is to match job seekers with jobs that fit their skills

  • The project is deployed as a live, interactive web app for prospective job seekers to use

  • Jobs were scraped from LinkedIn using Selenium and Beautiful Soup

The Data:

Using Selenium and Beautiful Soup, two popular Python packages for web automation and web scraping, a dataset was scraped from Linkedin.com (n = 3150). The search collected jobs returned for the query 'Data Science', which yielded the largest number of results among the search terms tested.

Features From The Dataset:

LinkedIn job features:

  • Company name
  • Job title
  • Location
  • Number of applicants
  • Full job description

Six states were selected as hubs for data science jobs.

States:

  • Colorado
  • California
  • Florida
  • New York
  • Utah
  • Washington

Example of LinkedIn data:

|   | Job_Title | Company | Location | Number_of_Applicants | Description |
|---|-----------|---------|----------|----------------------|-------------|
| 0 | Data Acquisition Developer | ABB | Broomfield, CO, US | Be among the first 25 applicants | Join ABB and work in a team that is dedicated ... |
| 1 | DATA SCIENTIST MANAGER | Pilgrim's | Greeley, CO, US | Be among the first 25 applicants | Pilgrim’s is the second largest chicken produc... |
| 2 | Data Scientist – Logistics Optimization Algori... | ShaleApps | Denver, CO, US | Be among the first 25 applicants | Location: Denver, CO – Relocation Offered. Rem... |

LinkedIn web scraping process:

To scrape LinkedIn, Selenium was needed. Beautiful Soup alone is not enough for LinkedIn because one needs to click on each job in order to get its full information. Beautiful Soup could have been used to scrape basic details such as job title and company name, and because the site hosts full postings rather than redirecting users to the company's website, the full job description was also available; getting that more detailed information, however, requires interacting with the site like a human user. This is exactly Selenium's strong point: it allows one to write code that interacts with a website the way a human user would. Two main features of Selenium that make this possible are the ability to click buttons and the ability to type text into text fields.

To scrape LinkedIn, the script first navigated to the search results page and clicked the 'see more jobs' button a given number of times, giving access to more than just the first page of results. The script then clicked on each job and passed the page HTML to a Beautiful Soup object, which was used to parse out the information of interest.
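
Below is a minimal sketch of that loop, assuming Chrome with Selenium 4 and Beautiful Soup. The search URL and CSS selectors are illustrative placeholders rather than the exact ones used in the project, since LinkedIn's markup changes over time.

```python
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.linkedin.com/jobs/search?keywords=Data%20Science&location=Colorado")

# Click 'See more jobs' a fixed number of times to load results beyond the first page.
for _ in range(10):
    try:
        driver.find_element(By.XPATH, "//button[contains(., 'See more jobs')]").click()
        time.sleep(2)  # give the new results time to render
    except Exception:
        break

jobs = []
# Click each job card so its full description renders, then parse the page HTML.
for card in driver.find_elements(By.CSS_SELECTOR, "ul.jobs-search__results-list li"):
    card.click()
    time.sleep(1)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    jobs.append({
        "Job_Title": soup.select_one("h2.top-card-layout__title").get_text(strip=True),
        "Company": soup.select_one("a.topcard__org-name-link").get_text(strip=True),
        "Location": soup.select_one("span.topcard__flavor--bullet").get_text(strip=True),
        "Description": soup.select_one("div.show-more-less-html__markup").get_text(" ", strip=True),
    })

driver.quit()
```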

Example of a LinkedIn job posting:

Example of the scraped HTML:

EDA and Clustering

Companies with the largest number of postings for data science jobs on LinkedIn.com

  • Using a TF-IDF matrix, job descriptions were vectorized and fed into a K-Means algorithm to group jobs based on their descriptions alone (see the sketch after this list)

  • The original plan was to group jobs and recommend within clusters, but selecting a K value using the silhouette score elbow plot method proved difficult

    • For any interpretable value of K, the silhouette score was extremely low

  • Using three clusters, we can see a very diverse group of jobs
    • Although each observation - and each cluster center, for that matter - sits in 10,000-dimensional space, the clusters were visualized by projecting the data into two-dimensional space using PCA

  • Once again, all of these jobs were found by searching for the keyword 'Data Science'
  • It is interesting that mobile developer jobs were included
  • In two-dimensional Euclidean space, we can see that the descriptions of the mobile developer jobs were quite different from the data science and big data job descriptions
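
As referenced above, here is a minimal sketch of that clustering and visualization pipeline using scikit-learn. It assumes the scraped jobs sit in a pandas DataFrame `df` with a 'Description' column; the vocabulary size, K range, and random seed are assumptions for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# df is assumed to be the pandas DataFrame of scraped jobs from the step above.
# Vectorize the job descriptions into a TF-IDF matrix.
vectorizer = TfidfVectorizer(stop_words="english", max_features=10_000)
X = vectorizer.fit_transform(df["Description"])

# Silhouette scores for a range of K values (the elbow-plot step described above).
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))

# Fit the final model with K = 3 and project to two dimensions with PCA for plotting.
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)
coords = PCA(n_components=2).fit_transform(X.toarray())
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=10)
plt.title("Job description clusters in two-dimensional PCA space")
plt.show()
```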

Vectorizing jobs

  • Jobs were assessed on several skills:

python, r, spark, spss, sql, pandas, numpy, cloud, docker, statistic, java, scala, marketing, sas, stata, excel, tableau

  • A binary vector was created for each job to represent whether or not its description mentioned each skill

    • Ex: [1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
  • User vectors were created by asking users which of the skills they had experience with

  • The user vector was compared to all job vectors using Jaccard similarity (a sketch of the recommender follows this list)
    • The most similar job was returned
    • If multiple jobs were equally similar, the first job in the list was returned
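
A minimal sketch of this skill-vector recommender is below. The SKILLS list comes from the section above, while the job dictionary layout and the function names are assumptions for illustration.

```python
import numpy as np

SKILLS = ["python", "r", "spark", "spss", "sql", "pandas", "numpy", "cloud",
          "docker", "statistic", "java", "scala", "marketing", "sas", "stata",
          "excel", "tableau"]

def skill_vector(text):
    """Binary vector: 1 if the skill keyword appears in the job description."""
    text = text.lower()
    return np.array([int(skill in text) for skill in SKILLS])

def jaccard(a, b):
    """Jaccard similarity between two binary vectors."""
    intersection = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return intersection / union if union else 0.0

def recommend(user_skills, jobs):
    """Return the job whose skill vector is most similar to the user's.
    np.argmax breaks ties by taking the first job in the list."""
    user_vec = np.array([int(skill in user_skills) for skill in SKILLS])
    scores = [jaccard(user_vec, skill_vector(job["Description"])) for job in jobs]
    return jobs[int(np.argmax(scores))]
```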

Example recommendations:

  • Docker, Cloud skills only --> Senior Manager of Technology
  • Python, R only --> Director - Data Science, Modeling
  • Python, Marketing only --> Marketing Data Scientist

Streamlit WebApp

  • The recommender was deployed as a Streamlit app with the help of Docker
  • Hosted on an AWS EC2 instance

Here is the homepage of the app:

Users then select the data science skills they have, as in this example:

Once users complete their selections, they click the recommendation button to receive a job recommendation.
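
A minimal sketch of such a Streamlit front end is shown below. It assumes the SKILLS list and recommend() helper from the sketch above and a `jobs` list loaded from the scraped data; the widget labels are assumptions, not the exact text of the deployed app.

```python
import streamlit as st

# SKILLS, recommend(), and the scraped `jobs` list are assumed to come from the
# recommender sketch above (e.g. loaded from a pickled DataFrame at startup).
st.title("Which Job Is Right For You?")
st.write("Select the data science skills you have experience with:")

selected = st.multiselect("Skills", SKILLS)

if st.button("Recommend a job"):
    job = recommend(selected, jobs)
    st.subheader(job["Job_Title"])
    st.write(f"{job['Company']} ({job['Location']})")
    st.write(job["Description"])
```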

Future Directions

  • Include jobs outside of data science
  • Explore other text embedding methods
    • Retain semantic representations of descriptions (e.g., word2vec or BERT)
  • Allow the user to request a location from the recommender
  • Develop a strategy to isolate the 'important' parts of descriptions
    • Most descriptions include fluffy details about the company
  • Develop a more exhaustive list of skills
