
Crawler

Overview

Crawler is a simple multi-threaded web crawler that fetches URLs in BFS order and writes crawl results to the console and to a log as the crawl proceeds. It starts from a given set of seed URLs and keeps crawling until the user presses CTRL-C or the number of crawled pages reaches the specified count. This implementation only takes URLs from <a href> tags and only follows absolute links.
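The core of that approach fits in a few dozen lines. The sketch below illustrates the design the overview describes (a FIFO frontier for BFS order, a pool of worker threads, absolute <a href> links only); it is not the actual crawler.py, and names such as NUM_WORKERS, MAX_PAGES, LinkParser, and crawl are made up for this example.

import queue
import threading
import urllib.request
from html.parser import HTMLParser

NUM_WORKERS = 20   # default thread count from the Usage section
MAX_PAGES = 100    # default crawl limit from the Usage section

class LinkParser(HTMLParser):
    # Collects absolute links from <a href> tags, matching the scope
    # described in the overview.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith(("http://", "https://")):   # absolute links only
            self.links.append(href)

def worker(frontier, seen, lock):
    while True:
        try:
            url = frontier.get(timeout=1)   # FIFO queue gives BFS order
        except queue.Empty:
            return                          # frontier drained; worker exits
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
            parser = LinkParser()
            parser.feed(html)
            print(url)
            with lock:                      # guard the shared visited set
                for link in parser.links:
                    if link not in seen and len(seen) < MAX_PAGES:
                        seen.add(link)
                        frontier.put(link)
        except Exception as exc:
            print(f"FAILED {url}: {exc}")
        finally:
            frontier.task_done()

def crawl(seeds):
    frontier = queue.Queue()
    seen, lock = set(seeds), threading.Lock()
    for url in seeds:
        frontier.put(url)
    threads = [threading.Thread(target=worker, args=(frontier, seen, lock))
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    crawl(["https://example.com"])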

Usage

Crawler accepts one positional argument and two optional arguments. Run 'python crawler.py -h' for details.

(screenshot: command-line help)
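Judging from the examples below, the positional argument takes one or more seed URLs, with -c controlling the page count and -w the worker count. A minimal argparse sketch consistent with that interface; the defaults come from this README, while the help strings are assumptions:

import argparse

parser = argparse.ArgumentParser(description="Simple multi-threaded BFS web crawler")
parser.add_argument("urls", nargs="+", help="one or more seed URLs")
parser.add_argument("-c", type=int, default=100,
                    help="stop after crawling this many URLs (default: 100)")
parser.add_argument("-w", type=int, default=20,
                    help="number of worker threads (default: 20)")
args = parser.parse_args()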

Examples

  • Run crawler with a single seed URL and 20 (default) worker threads. Stop after crawling 100 (default) URLs.

python crawler.py https://source.android.com/setup/start/build-numbers

  • Run crawler with a single seed URL and 10 worker threads. Stop after crawling 150 URLs.

python crawler.py https://source.android.com/setup/start/build-numbers -c 150 -w 10

  • Run crawler with two seed URLs and 30 worker threads. Stop after crawling 300 URLs.

python crawler.py https://source.android.com/setup/start/build-numbers https://en.wikipedia.org/wiki/List_of_Qualcomm_Snapdragon_processors -c 300 -w 30

To stop the crawler before it completes

Press Ctrl+C
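A common way to implement this CTRL-C behavior in a multi-threaded program (an assumption about this crawler, not something the README confirms) is a shared threading.Event that the main thread sets when it catches KeyboardInterrupt and that each worker checks between URLs:

import threading
import time

stop = threading.Event()

def worker(n):
    while not stop.is_set():    # check between units of work (here: URLs)
        time.sleep(0.2)         # stands in for fetching and parsing one URL
    print(f"worker {n} exiting")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
try:
    while any(t.is_alive() for t in threads):
        time.sleep(0.5)
except KeyboardInterrupt:       # Ctrl+C is delivered to the main thread
    stop.set()                  # ask workers to finish up and exit
for t in threads:
    t.join()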

Output

Crawler writes to both stdout and a log. The output is formatted as shown below.

(screenshot: output format)
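Writing each record to both destinations is typically done with Python's logging module and two handlers. A minimal sketch, assuming a log file named crawler.log and an arbitrary format (neither is specified in this README):

import logging
import sys

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    handlers=[
        logging.StreamHandler(sys.stdout),   # console output
        logging.FileHandler("crawler.log"),  # log file (name assumed)
    ],
)
logging.info("crawled https://example.com")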

Example

(screenshot: example crawl output)

Install and Build

For Windows

Requirements: Python 3.6 or later (with pip) installed and available on the system PATH

  1. Download crawler.py and buildcrawler.bat
  2. Run buildcrawler.bat to set up the virtualenv used to run the crawler: buildcrawler.bat

For Other Platforms

TBD

Run

For Windows

  1. Run activatecrawler.bat to activate the virtualenv if it is not already active (buildcrawler.bat activates it automatically)
  2. Run the crawler; see the examples in the Usage section

For Other Platforms

TBD