plopezlpz/sitemap

Description

Web crawler that generates a site map

Quick start

Build

make build

Print usage

./crawler -help

Generate the site map of a site (change the URL and output file accordingly)

./crawler -url http://site.com -out sitemap.txt

Run tests

make test

About

This is a simple web crawler that generates a site map. It fetches the provided URL and parses its links, saving them in an in-memory map. For each new link it repeats the same steps (fetch the page and parse its links), and each fetch-and-parse runs in its own goroutine.
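To make the fetch-and-parse step concrete, here is a minimal sketch in Go, assuming the golang.org/x/net/html parser; fetchLinks and its same-host filter are illustrative names, not the repository's actual code:

package crawler

import (
	"net/http"
	"net/url"

	"golang.org/x/net/html"
)

// fetchLinks downloads the page at path (relative to base) and returns the
// paths of the same-host links it contains. Sketch only.
func fetchLinks(base *url.URL, path string) ([]string, error) {
	resp, err := http.Get(base.ResolveReference(&url.URL{Path: path}).String())
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	doc, err := html.Parse(resp.Body)
	if err != nil {
		return nil, err
	}

	var links []string
	var walk func(n *html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, attr := range n.Attr {
				if attr.Key == "href" {
					if ref, err := url.Parse(attr.Val); err == nil {
						abs := base.ResolveReference(ref)
						if abs.Host == base.Host { // stay on the crawled site
							links = append(links, abs.Path)
						}
					}
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return links, nil
}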

This is a diagram of the program, where each Rx is a goroutine:

R1 - url: "/", parsed links: ["/a", "/b", "/c"] -+
                                                 |
R2 - url: "/b", parsed links: ["/c", "/d"] ------| ----> R0 - save in in-memory map 
                                                 |            & for each new path
R3 - url: "/a", parsed links: ["/", "/d"] -------+            fetch its links
                                                                  |
R4 - url: "/d" fetching links...                                  |
 ^                                                                |
 |                                                                |
 +---------------------------- start routine ---------------------+

R0 is a goroutine that listens to the goroutines fetching and parsing the URLs. It saves the new links in memory and launches a new goroutine for each unseen link to fetch and parse it in turn.

When no new links appear and all the goroutines have finished parsing their pages, we print the results to the output file.
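A minimal sketch of the R0 pattern described above, using a channel for results and a pending counter for the termination check; the result type, crawl, and the fetchLinks helper from the earlier sketch are hypothetical names, not the repository's actual API:

// result carries one fetched page's path and links back to R0.
type result struct {
	path  string
	links []string
}

// crawl plays the role of R0: it records every path in an in-memory map,
// launches one goroutine per unseen path, and returns once all in-flight
// fetches have reported back and no new paths remain.
func crawl(base *url.URL) map[string][]string {
	sitemap := make(map[string][]string) // path -> links found on that page
	results := make(chan result)

	pending := 1 // goroutines still fetching
	sitemap["/"] = nil
	go func() {
		links, _ := fetchLinks(base, "/")
		results <- result{"/", links}
	}()

	for pending > 0 {
		r := <-results
		pending--
		sitemap[r.path] = r.links
		for _, link := range r.links {
			if _, seen := sitemap[link]; !seen {
				sitemap[link] = nil // mark seen before the fetch finishes
				pending++
				go func(p string) {
					links, _ := fetchLinks(base, p)
					results <- result{p, links}
				}(link)
			}
		}
	}
	return sitemap
}

Marking a path as seen before its fetch completes is what keeps two goroutines from being launched for the same page; the loop exits only when the pending counter drops back to zero, which is the "no new links and all goroutines finished" condition.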

The file format is similar to this:

/
 /a
 /b
  /d
 /c

The root is at the top with no indentation, and its children follow on new lines indented by one extra space per level. A path is never printed more than once, so "/a" may link back to "/", but "/" has already been printed at a higher level.
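One way such output can be produced from the in-memory map is a depth-first walk from "/", indenting one space per level and skipping paths already printed. This sketch continues the hypothetical package from the earlier examples; printSitemap is an illustrative name:

import (
	"fmt"
	"io"
	"strings"
)

// printSitemap writes the sitemap in the indented format above: depth-first
// from the root, one extra leading space per level, each path printed once.
func printSitemap(w io.Writer, sitemap map[string][]string) {
	printed := make(map[string]bool)
	var walk func(path string, depth int)
	walk = func(path string, depth int) {
		if printed[path] {
			return // already printed at an earlier or shallower level
		}
		printed[path] = true
		fmt.Fprintf(w, "%s%s\n", strings.Repeat(" ", depth), path)
		for _, child := range sitemap[path] {
			walk(child, depth+1)
		}
	}
	walk("/", 0)
}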
