WebCrawler

WebCrawler is a Python script that crawls a website and saves the text content of each page to a text file. It also extracts the hyperlinks from each page and follows those that stay within the same domain to continue the crawl.
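The link-following step can be illustrated with a minimal sketch that uses only the standard library. The names `LinkCollector` and `same_domain_links` are hypothetical and chosen for illustration; the actual script may extract and filter links differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(html, page_url, domain):
    """Return absolute, de-duplicated links in html whose host equals domain."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        # Resolve relative links such as "/about" against the current page.
        absolute = urljoin(page_url, href)
        # Keep only links on the crawled domain; this also drops mailto: links,
        # which have an empty network location.
        if urlparse(absolute).netloc == domain and absolute not in links:
            links.append(absolute)
    return links
```

For example, given a page at `https://example.com/index.html` that links to `/about` and to `https://other.org/x`, only `https://example.com/about` would be returned for the domain `example.com`.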

Requirements

Python 3.x
Works on Linux, Windows, macOS, BSD

Install

Install dependencies:

pip install -r requirements.txt

Usage

To use this script, set the domain and full_url variables to the domain and the full starting URL of the website you want to crawl. Then run the script in your Python environment.
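As a rough sketch of what happens once the starting URL is set, a crawl of this kind can be written as a breadth-first loop over a queue of unvisited pages. The function `crawl`, its `get_links` callback, and the `max_pages` limit are assumptions made for this example and are not taken from the script itself.

```python
from collections import deque


def crawl(full_url, get_links, max_pages=100):
    """Visit pages breadth-first from full_url and return them in visit order.

    get_links(url) is a caller-supplied function (an assumption of this
    sketch) that returns the same-domain links found on the page at url.
    """
    queue = deque([full_url])
    seen = {full_url}      # URLs already queued, so no page is visited twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

With `domain = "example.com"` and `full_url = "https://example.com/"`, the callback passed as `get_links` would fetch each page and return only the links whose host matches `domain`.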

The script creates a directory named text in the same directory as the script. Inside it, a subdirectory named after the crawled domain holds one text file per crawled page.
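The output layout described above could be produced along the following lines. This is a sketch only: the helper names `page_output_path` and `save_page_text` are hypothetical, and the file-naming rule (URL path with unsafe characters replaced by underscores) is an assumption, since the README does not say how files are named.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def page_output_path(base_dir, domain, url):
    """Map a page URL to base_dir/text/<domain>/<name>.txt (naming is assumed)."""
    path_part = urlparse(url).path
    # Replace anything that is not filename-safe with an underscore.
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", path_part).strip("_") or "index"
    return Path(base_dir) / "text" / domain / f"{name}.txt"


def save_page_text(base_dir, domain, url, text):
    """Write the page text to its output file, creating directories as needed."""
    path = page_output_path(base_dir, domain, url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path
```

Under these assumptions, the page `https://example.com/docs/intro` would be saved as `text/example.com/docs_intro.txt`, and the site root as `text/example.com/index.txt`.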

Note: It is recommended to obtain permission from a website's owner before crawling it.
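One common way to respect a site owner's wishes is to honor the site's robots.txt rules before fetching a page. The sketch below uses the standard library's `urllib.robotparser`; the helper `build_robots_checker` is hypothetical, and nothing in this README indicates that the script performs such a check itself.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_lines, user_agent="*"):
    """Return a function that reports whether a URL may be fetched.

    robots_lines is the content of a robots.txt file as a list of lines.
    Downloading that file is left to the caller in this sketch.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed
```

A crawler could call the returned function on each URL and skip any page for which it returns False. A robots.txt check complements, but does not replace, asking for permission.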
