
web-octopus

A concurrent web crawler written in Go.

Install

go get github.com/rapidclock/web-octopus/octopus
go get github.com/rapidclock/web-octopus/adapter

Current Features:

  • Depth-limited crawling.
  • User-specified valid protocols.
  • User-buildable adapters that the crawler feeds output to.
  • Duplicate filtering. (Default, non-customizable)
  • Filtering of URLs that fail a HEAD request. (Default, non-customizable)
  • User-specifiable max timeout between two successive URL requests.
  • Max number of links to be crawled in total.

Sample Implementation Snippet

package main

import (
	"github.com/rapidclock/web-octopus/adapter"
	"github.com/rapidclock/web-octopus/octopus"
)

func main() {
	opAdapter := &adapter.StdOpAdapter{}

	options := octopus.GetDefaultCrawlOptions()
	options.MaxCrawlDepth = 3         // follow links up to 3 levels deep
	options.TimeToQuit = 10           // quit after 10s without a new request
	options.CrawlRatePerSec = 5       // at most 5 requests per second
	options.CrawlBurstLimitPerSec = 8 // burst capacity above the steady rate
	options.OpAdapter = opAdapter

	crawler := octopus.New(options)
	crawler.SetupSystem()
	crawler.BeginCrawling("https://www.example.com")
}

List of customizations

Customizations can be made by supplying the crawler with an instance of CrawlOptions. The basic structure is shown below, with a brief explanation for each option.

type CrawlOptions struct {
	MaxCrawlDepth         int64         // Max depth of the crawl; 0 is the initial link.
	MaxCrawledUrls        int64         // Max number of links to be crawled in total.
	StayWithinBaseHost    bool          // [Not-Implemented-Yet]
	CrawlRatePerSec       int64         // Max rate at which requests can be made (req/sec).
	CrawlBurstLimitPerSec int64         // Max burst capacity (should be at least the crawl rate).
	RespectRobots         bool          // [Not-Implemented-Yet]
	IncludeBody           bool          // Include the response body (contents of the web page) in the crawl result.
	OpAdapter             OutputAdapter // A user-defined crawl output handler (see the next section).
	ValidProtocols        []string      // Valid protocols to crawl (http, https, ftp, etc.).
	TimeToQuit            int64         // Timeout (in seconds) between two successive requests before the crawler quits.
}

A default instance of the CrawlOptions can be obtained by calling octopus.GetDefaultCrawlOptions(). This can be further customized by overriding individual properties.

NOTE: If rate-limiting is not required, simply leave both CrawlRatePerSec and CrawlBurstLimitPerSec unset in the CrawlOptions.
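
For instance, here is a minimal sketch of a configuration that caps the crawl but leaves rate-limiting off. The values are illustrative only, and the snippet slots into the main function from the sample above:

options := octopus.GetDefaultCrawlOptions()
options.MaxCrawlDepth = 2                          // follow links up to 2 levels deep
options.MaxCrawledUrls = 100                       // stop after 100 links in total
options.ValidProtocols = []string{"http", "https"} // skip other protocols
options.IncludeBody = false                        // links only, no page bodies
// CrawlRatePerSec and CrawlBurstLimitPerSec are left unset: no rate limiting.
options.OpAdapter = &adapter.StdOpAdapter{}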

Output Adapters

An Output Adapter is the final destination of a crawler-processed request. The crawler feeds its output here, shaped by the customizations made in the CrawlOptions before the crawl starts.

OutputAdapter is a Go interface that must be implemented by your (user-defined) processor.

type OutputAdapter interface {
	Consume() *NodeChSet
}

You implement the Consume() method to return a pointer to a NodeChSet (described below). The crawler uses the returned channel to send its output, which you can then listen for.

Note: If you implement a custom OutputAdapter, REMEMBER to listen for the output on another goroutine. Otherwise you might block the crawler from running. At the very least, begin the crawl on another goroutine before you start processing output.
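
As a sketch of the caller-side pattern, the program below crawls on a separate goroutine and processes output on the main one. It uses the MakeDefaultNodeChSet helper described further below (assumed here to be exported from the octopus package); passThroughAdapter is a hypothetical adapter that just hands a pre-built channel set to the crawler:

package main

import (
	"fmt"

	"github.com/rapidclock/web-octopus/octopus"
)

// passThroughAdapter is a hypothetical adapter that returns a pre-built
// channel set from Consume and leaves the receiving to main.
type passThroughAdapter struct {
	chSet *octopus.NodeChSet
}

func (a *passThroughAdapter) Consume() *octopus.NodeChSet {
	return a.chSet
}

func main() {
	chSet, nodeCh, _ := octopus.MakeDefaultNodeChSet()

	options := octopus.GetDefaultCrawlOptions()
	options.OpAdapter = &passThroughAdapter{chSet: chSet}

	crawler := octopus.New(options)
	crawler.SetupSystem()

	// Crawl on a separate goroutine; receiving inline below would
	// otherwise block the crawler's first send.
	go crawler.BeginCrawling("https://www.example.com")

	// Process output on the main goroutine until the program is stopped.
	for node := range nodeCh {
		fmt.Printf("%d - %s\n", node.Depth, node.UrlString)
	}
}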

The structure of the NodeChSet is given below.

type NodeChSet struct {
	NodeCh chan<- *Node
	*StdChannels
}

type StdChannels struct {
	QuitCh chan<- int
}

type Node struct {
	*NodeInfo
	Body io.ReadCloser
}

type NodeInfo struct {
	ParentUrlString string
	UrlString       string
	Depth           int64
}

You can use the utility function MakeDefaultNodeChSet() to get a NodeChSet built for you. This also returns the Node and quit channels. Example given below:

var opNodeChSet *NodeChSet
var nodeCh chan *Node
var quitCh chan int
// The declarations above just show the types; in practice you can use
// Go's type inference with :=.
opNodeChSet, nodeCh, quitCh = MakeDefaultNodeChSet()
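
Put together, a custom adapter's Consume() can be little more than the following sketch. MyAdapter and handleNode are hypothetical names, and MakeDefaultNodeChSet is assumed to be exported from the octopus package; compare with the full StdOpAdapter implementation at the end, which builds the same channels by hand:

type MyAdapter struct{}

func (m *MyAdapter) Consume() *octopus.NodeChSet {
	chSet, nodeCh, quitCh := octopus.MakeDefaultNodeChSet()
	go func() {
		// Listen on a separate goroutine so the crawler is never blocked.
		for {
			select {
			case node := <-nodeCh:
				handleNode(node) // hypothetical: your per-node processing
			case <-quitCh:
				return
			}
		}
	}()
	return chSet
}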

Supply your custom OutputAdapter via the OpAdapter field of the CrawlOptions.

Default Output Adapters:

We supply two default adapters for you to try out. They are not meant to be feature-rich, but you can still use them. Their primary purpose is to demonstrate how to build and use an OutputAdapter.

  1. adapter.StdOpAdapter : Writes the crawled output (only links, not body) to the standard output.
  2. adapter.FileWriterAdapter : Writes the crawled output (only links, not body) to a supplied file.

Implementation of the adapter.StdOpAdapter:

We have supplied the implementation of adapter.StdOpAdapter below to give a rough idea of what goes into building your own adapter.

package adapter

import (
	"fmt"

	oct "github.com/rapidclock/web-octopus/octopus"
)

// StdOpAdapter is an output adapter that just prints the output onto the
// screen.
//
// Sample Output Format is:
// 	LinkNum - Depth - Url
type StdOpAdapter struct{}

func (s *StdOpAdapter) Consume() *oct.NodeChSet {
	listenCh := make(chan *oct.Node)
	quitCh := make(chan int, 1)
	listenChSet := &oct.NodeChSet{
		NodeCh: listenCh,
		StdChannels: &oct.StdChannels{
			QuitCh: quitCh,
		},
	}
	go func() {
		i := 1
		for {
			select {
			case output := <-listenCh:
				fmt.Printf("%d - %d - %s\n", i, output.Depth, output.UrlString)
				i++
			case <-quitCh:
				return
			}
		}
	}()
	return listenChSet
}