melalj/celery-worker-demo

Question: Optimal configuration for a Python worker (crawler + save to DB) on Celery

Discussion on: https://stackoverflow.com/questions/55165370/optimal-setup-for-a-python-worker-crawler-save-db-on-celery

I have multiple Python workers that crawl certain websites, parse the data, and store it in a Postgres database.

It's unclear to me how to architect the code to make the best use of server resources (the workers are deployed as multiple pods on Kubernetes). Let's assume there is no rate limit on the requests.

For demo purposes, I created sample code that fetches the top 10k websites, stores them in the database, and then crawls Bing search results for each (and stores those as well). This could be extended to 1M websites.

Celery uses the gevent pool since the workers do a lot of network I/O. I also added psycogreen to patch Postgres access so it doesn't block the event loop, and PgBouncer as a database proxy to avoid hitting the Postgres max_connections limit.

The entry point is ./app/entrypoint.sh and the main code logic is in ./app/worker.py.
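
For orientation, here is a minimal sketch of how these pieces fit together. It is an illustration, not the actual repo code: the broker URL, DSN, table names, and CSS selector are assumptions; only the task name pull_domains (from the trigger command below) and the general flow (fan out a crawl task per domain, store the results) come from the description above.

```python
# Illustrative sketch of app/worker.py; the real file in this repo may differ.
# Patch psycopg2 before any connection is created, so the blocking socket
# calls inside the C driver yield to the gevent event loop.
from psycogreen.gevent import patch_psycopg
patch_psycopg()

import requests
from bs4 import BeautifulSoup
from celery import Celery
from sqlalchemy import create_engine, text

app = Celery("worker", broker="redis://redis:6379/0")  # broker URL is an assumption

# Connect through PgBouncer rather than Postgres directly (DSN is hypothetical).
engine = create_engine("postgresql+psycopg2://postgres@pgbouncer:5432/celery-worker-demo")


@app.task
def pull_domains():
    """Fan out one crawl task per domain from the top-10k list."""
    with engine.connect() as conn:
        rows = conn.execute(text("SELECT domain FROM domains LIMIT 10000"))  # hypothetical table
        for (domain,) in rows:
            crawl_domain.delay(domain)


@app.task
def crawl_domain(domain):
    """Fetch Bing results for one domain, parse them, and store the links."""
    resp = requests.get("https://www.bing.com/search", params={"q": domain}, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    links = [a["href"] for a in soup.select("li.b_algo h2 a")]  # selector is an assumption
    with engine.begin() as conn:
        for url in links:
            conn.execute(
                text("INSERT INTO results (domain, url) VALUES (:d, :u)"),  # hypothetical table
                {"d": domain, "u": url},
            )

# app/entrypoint.sh would then start the worker with the gevent pool, e.g.:
#   celery -A worker worker --pool=gevent --concurrency=200
```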

There are 3 sub-questions regarding this implementation:

How to size/tweak the variables?

  • Worker concurrency
  • SQLAlchemy pool_size (normally PgBouncer would take over the pooling)
  • worker_prefetch_multiplier
  • broker_pool_limit
  • Number of replicas of the Python worker (and how this affects the database load)
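
To make the question concrete, here is a sketch of where each of these knobs would typically be set. The values are placeholders, not recommendations, and the broker URL and DSN are hypothetical.

```python
from celery import Celery
from sqlalchemy import create_engine

app = Celery("worker", broker="redis://redis:6379/0")  # broker URL is an assumption

# Celery knobs (values are placeholders, not recommendations)
app.conf.worker_concurrency = 200          # green threads per worker with the gevent pool
app.conf.worker_prefetch_multiplier = 1    # tasks reserved ahead of execution per green thread
app.conf.broker_pool_limit = 10            # broker connections kept open per worker

# The SQLAlchemy pool is per worker replica, so the number of server-side
# connections is roughly replicas * (pool_size + max_overflow), all of it
# funnelled through PgBouncer.
engine = create_engine(
    "postgresql+psycopg2://postgres@pgbouncer:5432/celery-worker-demo",  # hypothetical DSN
    pool_size=10,
    max_overflow=5,
)
```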

How to optimize the code?

There seems to be room to improve the code. How can I trace bottlenecks (I suspect the database or BeautifulSoup; profiling feels like a black box under gevent), and how can I optimize it?
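
One low-tech way I can think of to locate the bottleneck (a full profiler is hard to interpret under gevent) is to time each phase of the task explicitly and log it. A sketch, with illustrative phase names:

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger(__name__)


@contextmanager
def timed(phase):
    """Log the wall-clock time spent in one phase of a task."""
    start = time.monotonic()
    try:
        yield
    finally:
        log.info("%s took %.3fs", phase, time.monotonic() - start)

# Inside the crawl task (phase names are illustrative):
# with timed("http fetch"):
#     resp = requests.get(...)
# with timed("parse"):
#     soup = BeautifulSoup(resp.text, "html.parser")
# with timed("db insert"):
#     conn.execute(...)
```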

Why does the database connection close unexpectedly sometimes?

When I run the code and trigger 10K+ pulls, the worker hangs after a few pulls and occasionally throws "(psycopg2.OperationalError) server closed the connection unexpectedly". Any recommendations on how to size the database resources to support such a workload?
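
One mitigation I could try (a sketch, assuming the engine is created in worker.py roughly as above) is to let SQLAlchemy validate and recycle pooled connections, so that connections dropped by an idle timeout on Postgres or PgBouncer are replaced instead of surfacing as OperationalError:

```python
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://postgres@pgbouncer:5432/celery-worker-demo",  # hypothetical DSN
    pool_pre_ping=True,   # issue a lightweight check before reusing a pooled connection
    pool_recycle=300,     # drop connections older than 5 minutes, before a server-side timeout hits
)
```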

Development environment

  • Import the database schema: docker-compose exec -T postgres psql -U postgres celery-worker-demo < ./celery-worker-demo.sql
  • Start the worker locally: docker-compose up worker
  • Scale the worker to 10: docker-compose scale worker=10
  • Trigger pulling: docker-compose run --rm worker python -c "from worker import pull_domains; pull_domains.s().apply_async();"

For linting, use virtualenv -p python3 env; source env/bin/activate; pip install -r requirements.txt

Production environment

The worker deployment specification for Kubernetes is available in ./worker-k8s.yml.