Neo: supersonic scraping ⚡

Neo is a super-fast, super-lightweight scraper capable of handling around 300 requests at once. It is managed with the Celery distributed task queue framework and configured to run on Windows, which Celery does not officially support.

Neo scales with green threads by running Eventlet as Celery's execution pool, which also happens to be the only way to get Celery working on Windows. Unlike the default pre-fork pool, which spawns child processes, Eventlet manages its threads within a single worker process, so manual queue routing is used to split the workload between a few workers and achieve multiprocessing.
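
As a rough illustration, manual queue routing can look like the following Celery configuration. The module and task names here are assumptions for the sketch, not Neo's actual code; the queue names match the worker commands shown further down.

```python
from celery import Celery

# Illustrative broker URL -- the real one comes from the .env file.
app = Celery("neo", broker="redis://localhost:6379/0")

# Each queue is consumed by a separate Eventlet worker process, so routing
# different task names to different queues spreads work across CPU cores.
app.conf.task_routes = {
    "tasks.scrape_batch_a": {"queue": "worker1"},  # hypothetical task names
    "tasks.scrape_batch_b": {"queue": "worker2"},
    "tasks.scrape_batch_c": {"queue": "worker3"},
    "tasks.scrape_batch_d": {"queue": "worker4"},
}

# The queue can also be chosen per call, e.g.:
# some_task.apply_async(args=[url], queue="worker3")
```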

Built as a commercial data project for an input of 57k+ websites, Neo pairs with a regex parser called Trinity to check site pages for compatibility with CRM products, producing a ranking score from the first 16 pages of each input website (with a 5s cooldown between requests). Neo's observed speedup in production on a quad-core machine is 315x, reducing average latency from 29s to 9.2ms. For context, this cut production runtime from 472h to 1.5h.
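
To give a flavour of the Trinity side, a regex compatibility check might look roughly like this; the patterns, product names, and scoring are illustrative assumptions, not Trinity's actual rules.

```python
import re

# Hypothetical CRM signatures -- Trinity's real patterns are more involved.
CRM_PATTERNS = {
    "hubspot": re.compile(r"hs-script-loader|hubspot", re.I),
    "salesforce": re.compile(r"salesforce|pardot", re.I),
}

def score_page(html: str) -> int:
    """Count how many CRM signatures appear in a page's HTML."""
    return sum(1 for pattern in CRM_PATTERNS.values() if pattern.search(html))
```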

Code references The Matrix for bonus points.

Use 🏃

First, set the message broker and database URIs in your .env file. The production.py file is ready to use (edit the input file as needed). More documentation can be found in the scripts.
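
As an example, a .env file could look like the following; the variable names and URIs are placeholders, so match them to what the scripts actually read.

```
BROKER_URL=redis://localhost:6379/0
DATABASE_URI=postgresql://user:password@localhost:5432/neo
```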

To spin up worker processes for the production app:

$ celery -A production worker -P eventlet -c 10000 -n worker1 -Q worker1
$ celery -A production worker -P eventlet -c 10000 -n worker2 -Q worker2
$ celery -A production worker -P eventlet -c 10000 -n worker3 -Q worker3
$ celery -A production worker -P eventlet -c 10000 -n worker4 -Q worker4

Note that the concurrency set with the -c flag is an upper limit rather than a target: actual runtime concurrency depends on the OS and other running processes. Observed concurrency during development ranged from 100 to 2,500.

Run using:

$ python run_production.py
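
For a rough idea of what drives the run, a dispatch script along these lines would spread the input across the four queues; the task import and input file name are assumptions for the sketch, not necessarily what run_production.py does.

```python
from production import scrape_site  # hypothetical task name

with open("input.csv") as f:  # hypothetical input file
    urls = [line.strip() for line in f if line.strip()]

# Round-robin the URLs across the queues consumed by worker1..worker4.
for i, url in enumerate(urls):
    scrape_site.apply_async(args=[url], queue=f"worker{i % 4 + 1}")
```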
