💼 Jobs.cz Scraper

A simple scraper for Jobs.cz, implemented with several JS/TS libraries.

This is a programming exercise and an experiment to determine which JavaScript / TypeScript library is best suited to web scraping.

The libraries used are Puppeteer ¹, Playwright and Selenium.

(The instructions below were written for Linux, specifically Ubuntu 20.04 and 22.04, and assume that Git and npm are already installed.)

Installation:

git clone https://github.com/zahradnik-ondrej/jobscz-scraper.git

cd jobscz-scraper

cd puppeteer (or cd playwright, or cd selenium, depending on which version you want to run)

./run.sh

Go to http://localhost:3000/ to access the input form for the Puppeteer ¹ script.

Output:

The scraped job postings are saved to the job-posts.json file in the current project's directory, or in its scraper subdirectory in the case of the Puppeteer script. ¹
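As a rough illustration of consuming that output, the sketch below parses the contents of job-posts.json. It assumes the file holds a JSON array of postings; the shape of each posting is not documented here, so entries are kept as unknown rather than given guessed fields. The names parseJobPosts and loadJobPosts are hypothetical.

```typescript
import { readFileSync } from "node:fs";

// Hypothetical helper: parse the scraper output from a JSON string.
// Assumption: the top-level value is an array of postings.
function parseJobPosts(json: string): unknown[] {
  const parsed: unknown = JSON.parse(json);
  if (!Array.isArray(parsed)) {
    throw new Error("Expected job-posts.json to contain a JSON array");
  }
  return parsed;
}

// Hypothetical helper: read the file from disk and parse it.
function loadJobPosts(path: string): unknown[] {
  return parseJobPosts(readFileSync(path, "utf8"));
}
```

For example, loadJobPosts("job-posts.json").length would give the number of scraped postings, provided the array assumption holds.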

Observations:

Puppeteer and Selenium were equally fast in this specific case.
Both were roughly 3.7 times faster than Playwright in this specific case.
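A speed comparison like the one above can be reproduced by timing each script and dividing the durations. The following is a minimal sketch, not the measurement method actually used here; timeRun and speedup are hypothetical names, and no real timings from this project are included.

```typescript
// Hypothetical helper: measure how long an async task takes, in milliseconds.
async function timeRun(task: () => Promise<void>): Promise<number> {
  const start = Date.now();
  await task();
  return Date.now() - start;
}

// Hypothetical helper: how many times faster the fast run is than the slow one.
function speedup(slowMs: number, fastMs: number): number {
  if (fastMs <= 0) {
    throw new RangeError("fastMs must be positive");
  }
  return slowMs / fastMs;
}
```

Running each scraper several times and comparing averages would give a more reliable ratio than a single run, since network latency varies between requests.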

Playwright offers the most intuitive built-in functions for interacting with the web browser, which makes it the most suitable for beginners.
Selenium also offers many built-in functions, but they are not as intuitive.
Puppeteer offers very little in this respect, so it is best to write your own wrapper functions that suit your specific needs. It is, however, the most modular of the three, which makes writing such wrappers easier than with the others. ²
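To illustrate the kind of wrapper function meant here, the sketch below waits for an element and returns its trimmed text. It is a hypothetical example, not code from this repository. PageLike is an assumed, minimal stand-in for the two Puppeteer Page methods it relies on (waitForSelector and $eval), so the sketch compiles without Puppeteer installed; passing a real Page may require a cast depending on the Puppeteer version.

```typescript
// Assumed minimal interface covering the two Page methods used below.
interface PageLike {
  waitForSelector(
    selector: string,
    options?: { timeout?: number }
  ): Promise<unknown>;
  $eval<T>(
    selector: string,
    fn: (el: { textContent: string | null }) => T
  ): Promise<T>;
}

// Hypothetical wrapper: wait for an element, then return its trimmed text.
// Any failure while waiting (typically a timeout) is treated as "not found"
// and yields null.
async function textOf(
  page: PageLike,
  selector: string,
  timeoutMs = 5000
): Promise<string | null> {
  try {
    await page.waitForSelector(selector, { timeout: timeoutMs });
  } catch {
    return null;
  }
  return page.$eval(selector, (el) => (el.textContent ?? "").trim());
}
```

Returning null instead of throwing keeps the calling code simple when a field is optional on the page, at the cost of hiding the reason for the failure.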

Both Playwright and Selenium support multiple browsers besides Chrome, unlike Puppeteer, which at the time of writing has only experimental support for Edge via puppeteer-core and for Firefox via puppeteer-firefox.

Footnotes:

¹ The Puppeteer script also provides a graphical web interface at http://localhost:3000/ where you can specify which job listings to scrape, because Puppeteer is the library I chose to go with in my project.
² You can check out my 🧰 puppethelper, a Puppeteer helper package for automated QA web testing which has many useful functions for interacting with the web browser out of the box, plus a little extra.
