A machine learning model is built using PySpark's MLlib library to automatically flag suspicious job postings on Indeed.com. The dataset includes 18,000 job descriptions, out of which about 800 are fake.

aehabV/Indeed-fake-job-posting-prediction


NLP in PySpark's MLlib - Fake Job Posting Predictions

Indeed.com has hired us to create a system that automatically flags suspicious job postings on its website. Due to the high volume of job postings, their employees do not have the capacity to check every posting, so they would like to prioritize which postings to review before deleting them. Our task is to use the attached dataset with NLP to create an algorithm that automatically flags suspicious posts for review.
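The prioritization step described above can be sketched in plain Python. This is only an illustration: it assumes a model has already assigned each posting a fraud probability, and the names `ScoredPosting` and `review_queue`, as well as the sample scores, are hypothetical and not taken from the notebook.

```python
from typing import List, NamedTuple


class ScoredPosting(NamedTuple):
    job_id: int
    fraud_probability: float  # hypothetical model output in [0, 1]


def review_queue(postings: List[ScoredPosting], capacity: int) -> List[ScoredPosting]:
    """Return the `capacity` postings most likely to be fraudulent, highest score first."""
    ranked = sorted(postings, key=lambda p: p.fraud_probability, reverse=True)
    return ranked[:capacity]


# Made-up scores, for illustration only.
scored = [
    ScoredPosting(job_id=101, fraud_probability=0.03),
    ScoredPosting(job_id=102, fraud_probability=0.91),
    ScoredPosting(job_id=103, fraud_probability=0.47),
]
print([p.job_id for p in review_queue(scored, capacity=2)])  # [102, 103]
```

Ranking by score rather than applying a fixed cut-off lets the review team work through as many postings as their capacity allows, most suspicious first.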

Dataset

This dataset contains 18K job descriptions out of which about 800 are fake. The data consist of both textual information and meta-information about the jobs.
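These counts imply a strongly imbalanced classification problem. The short calculation below uses the rounded figures quoted above (18,000 and 800, both approximate) to show the fraud rate, the accuracy of a trivial "always real" classifier, and one common way to derive balancing class weights. The helper `balanced_class_weights` is a hypothetical name, and the weighting formula is a widely used heuristic, not necessarily what the notebook does.

```python
# Approximate counts from the dataset description ("18K", "about 800").
total_postings = 18_000
fake_postings = 800
real_postings = total_postings - fake_postings

fraud_rate = fake_postings / total_postings
# Accuracy obtained by labelling every posting as real, without any model.
majority_baseline_accuracy = real_postings / total_postings


def balanced_class_weights(n_negative: int, n_positive: int) -> dict:
    """Weight each class by n_samples / (n_classes * n_class_samples)."""
    total = n_negative + n_positive
    return {0: total / (2 * n_negative), 1: total / (2 * n_positive)}


weights = balanced_class_weights(real_postings, fake_postings)
print(f"fraud rate: {fraud_rate:.1%}")                                  # 4.4%
print(f"always-real baseline accuracy: {majority_baseline_accuracy:.1%}")  # 95.6%
print(f"weight for fraudulent class: {weights[1]:.2f}")                  # 11.25
```

Because the trivial baseline already reaches roughly 95.6% accuracy, metrics such as precision, recall, or area under the precision-recall curve are more informative here than accuracy. Spark's `LogisticRegression` accepts a `weightCol` parameter, which is one place such weights could be supplied.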

Data Source: https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction

The dataset has the following columns:

| Column Name | Description |
| --- | --- |
| job_id | Unique identifier for each job posting |
| title | Job title |
| location | Location of the job |
| department | Department of the company |
| salary_range | Salary range of the job |
| company_profile | Description of the company |
| description | Description of the job |
| requirements | Requirements for the job |
| benefits | Benefits offered by the company |
| telecommuting | Whether the job allows telecommuting or not |
| has_company_logo | Whether the company has a logo or not |
| has_questions | Whether the job has questions for applicants or not |
| employment_type | Type of employment (full-time, part-time, etc.) |
| required_experience | Required experience for the job |
| required_education | Required education for the job |
| industry | Industry of the company |
| function | Function of the job |
| fraudulent | Whether the job posting is fraudulent or not |

Prerequisites

Before running the code, you will need to have the following installed:

  • PySpark: the Python API for Apache Spark
  • Jupyter Notebook: an interactive development environment for Python

Usage

To run the code, open the Fake_Job_Posting_Predictions.ipynb file in Jupyter Notebook and execute the cells in order. The notebook contains detailed explanations of each step in the code and the results obtained.
