TripY 🌍 : Crawler for TripAdvisor

Magnificent crawler: powerful, scalable and purely sexy. Its main goal is to collect data from TripAdvisor, the well-known site for travelers.

Authors:

1. REQUIREMENTS

This crawler is based on the following technologies:

MongoDB
RabbitMQ
Python 3.6 with the following packages:
a. Celery – Distributed Task Queue
b. PyMongo – driver for MongoDB
c. Requests – HTTP library
d. lxml – HTML parsing
e. fake_useragent – for overcoming the 301 response from the server

2. INSTALLATION

[Master] MongoDB

First of all, we need a database for storing our crawled data. Taking into account the hierarchical and quite complex structure of the data, we selected MongoDB for this purpose. *If you already have a configured MongoDB on your server, feel free to skip this part of the guide.

Install MongoDB on your server;
Create a user with readWrite role and access to the TripY db;

db.createUser({user: 'exam', pwd: 'A', roles: [{role: 'readWrite', db: 'TripY'}]})

Enable auth and remote access to MongoDB:
a) bindIp: 0.0.0.0 <- change this line
b) security: authorization: 'enabled'
Open 27017 port for remote connection to your db: sudo ufw allow 27017
Restart MongoDB: sudo service mongod restart

The database installation is now finished.
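Before moving on, it can help to confirm that the new user can actually reach the database from a remote machine. The following is a minimal sketch and not part of the project: it assumes PyMongo is installed, and the helper names build_mongo_uri and can_connect, the default port and the timeout are illustrative choices. The credentials are percent-encoded because MongoDB connection strings require that for special characters.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user, pwd, host, db, port=27017):
    """Build a MongoDB connection string, percent-encoding the credentials."""
    return "mongodb://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(pwd), host, port, db
    )


def can_connect(uri, timeout_ms=3000):
    """Return True if the server is reachable and accepts the credentials."""
    # Imported lazily so the URI helper works without PyMongo installed.
    from pymongo import MongoClient
    from pymongo.errors import PyMongoError

    try:
        with MongoClient(uri, serverSelectionTimeoutMS=timeout_ms) as client:
            # Listing collections exercises both the login and the role.
            client.get_default_database().list_collection_names()
        return True
    except PyMongoError:
        return False
```

For example, can_connect(build_mongo_uri('exam', 'A', '<IP>', 'TripY')) should return True from a worker machine once the firewall rule and the restart are in place, provided the user was created in the TripY database.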

[Master] RabbitMQ

To make the next step easier, we've prepared some bash scripts, which you can find in the project repo. However, if you prefer manual installation, it's your choice! *If you already have a configured RabbitMQ on your server, feel free to skip this part of the guide.

Install git on your server;
Clone project repo:

git clone https://github.com/AlexWorldD/TripY/

Make RabbitMQ installation script executable: chmod +x MakeRabbit.sh
Run: ./MakeRabbit.sh

At the end, you should see a final message like:

[OK]
RabbitMQ web management console
URL: 159.65.17.172:

We've finished with the required setup on the [Master] side, so now we can prepare our workers.

[Worker] Docker and Docker-compose

Have a look at this tutorial.

[Client]

Install all packages from requirements.txt

3. CONFIGURATION

Cloning repo from git

First of all, we should clone the project repo to our [Worker] machines.

git clone https://github.com/AlexWorldD/TripY/

Configure MongoDB and RabbitMQ addresses

Change the IPs in the config file (/TripY/cluster_managment/default_config.py)

CELERY_BROKER_URL = 'amqp://<user>:<pwd>@<IP>:<Port>'
MONGO = 'mongodb://<user>:<pwd>@<IP>/<DataBase>'

For instance, our configuration was:

CELERY_BROKER_URL = 'amqp://radmin:[email protected]:5672'
MONGO = 'mongodb://exam:[email protected]/TripY'
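A typo in either URL tends to show up only once the workers fail to connect, so a quick check before building the containers can save time. The sketch below uses only the standard library; the function name check_service_url and the sample values (a documentation-range IP and placeholder credentials) are hypothetical and not taken from the project.

```python
from urllib.parse import urlsplit


def check_service_url(url, expected_scheme):
    """Split a broker/database URL into parts, raising ValueError if malformed."""
    parts = urlsplit(url)
    if parts.scheme != expected_scheme:
        raise ValueError(
            "expected scheme {!r}, got {!r}".format(expected_scheme, parts.scheme)
        )
    if not parts.hostname:
        raise ValueError("missing host in {!r}".format(url))
    if not parts.username or parts.password is None:
        raise ValueError("missing <user>:<pwd> in {!r}".format(url))
    return {
        "host": parts.hostname,
        "port": parts.port,  # None when the URL omits the port
        "user": parts.username,
        "path": parts.path.lstrip("/"),
    }


# Placeholder values for illustration only.
broker = check_service_url("amqp://user:secret@203.0.113.10:5672", "amqp")
mongo = check_service_url("mongodb://user:secret@203.0.113.10/TripY", "mongodb")
```

The same function could be pointed at the values in default_config.py; it only inspects the strings and never opens a connection.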

Additionally, in this file you can select the option for crawling reviews: True or False.

However, keep in mind that a typical entity has roughly 500 reviews, which can make the crawling process critically slow.
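To get a feel for the cost of enabling reviews, a back-of-the-envelope estimate helps. The sketch below starts from the figure of roughly 500 reviews per entity; the page size of 10 reviews per request and the entity count are purely illustrative assumptions, not measured values.

```python
import math


def extra_requests(entities, reviews_per_entity=500, reviews_per_page=10):
    """Estimate the additional HTTP requests needed to fetch all reviews.

    reviews_per_page is an assumption for illustration; adjust it to what
    the site actually returns per request.
    """
    pages_per_entity = math.ceil(reviews_per_entity / reviews_per_page)
    return entities * pages_per_entity
```

Under these assumptions, extra_requests(1000) gives 50000, that is, about 50 extra requests for every entity crawled.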

Run Docker containers

docker-compose up --force-recreate --build --scale worker=4

The last parameter sets the number of workers. We strongly recommend choosing it so that each worker has roughly 1 GB of memory; otherwise, problems may occur.
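The 1 GB per worker rule can be turned into a small helper. This is a sketch under assumptions: it takes the 1 GB to mean RAM, and the reserved_gb parameter (memory kept back for the operating system and Docker) is a hypothetical addition, not something the project defines.

```python
import math


def workers_for_memory(total_gb, per_worker_gb=1.0, reserved_gb=0.0):
    """Number of workers that fit, at per_worker_gb each, after a reserve."""
    usable = total_gb - reserved_gb
    return max(0, math.floor(usable / per_worker_gb))


def compose_command(workers):
    """Build the docker-compose invocation for a given worker count."""
    return (
        "docker-compose up --force-recreate --build --scale worker={}".format(workers)
    )
```

For a machine with 4 GB and no reserve, compose_command(workers_for_memory(4)) yields the command with worker=4.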

4. USAGE

Our crawler just works, and it’s amazing!

Run main.py and type the city name (or the first letters of the city name); it must be a real city that is already in the TripAdvisor database.
Wait until the crawler has finished parsing the links to entities and has broadcast them to the workers.
After that, you can turn off your client and connect to MongoDB with your preferred client, using the given address and user/password pair.

5. CRAWLED DATA

Hotel/Attraction/Restaurant
- Title
- Address
- Link
- Rating
- Prices
- Contacts: official site and phone number (optional)
- Specific features
- Reviews (optional)
Review
- Entity ID
- UserID
- User nickname
- Date
- Title
- Full text
- Ratings

All data in the database has cross-referencing fields (such as GEO_ID, ID or User_ID), which makes it easy to retrieve specific data after the crawling process.
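As an illustration of how such shared fields can be used, the sketch below groups reviews by the entity they belong to. The sample documents and the exact key names (Entity_ID in particular) are hypothetical; the real field names in the database may differ.

```python
from collections import defaultdict

# Hypothetical sample documents; the real field names may differ.
entities = [
    {"ID": 1, "GEO_ID": 100, "Title": "Hotel A"},
    {"ID": 2, "GEO_ID": 100, "Title": "Restaurant B"},
]
reviews = [
    {"Entity_ID": 1, "User_ID": "u1", "Title": "Great stay"},
    {"Entity_ID": 1, "User_ID": "u2", "Title": "Noisy"},
    {"Entity_ID": 2, "User_ID": "u1", "Title": "Tasty"},
]


def reviews_by_entity(reviews, key="Entity_ID"):
    """Group review documents by the ID of the entity they refer to."""
    grouped = defaultdict(list)
    for review in reviews:
        grouped[review[key]].append(review)
    return grouped


grouped = reviews_by_entity(reviews)
# Attach a review count to each entity title via the shared ID.
summary = {e["Title"]: len(grouped[e["ID"]]) for e in entities}
```

The same idea applies to the other shared fields, for example collecting all reviews written by one user via User_ID.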
