Abrasion: The CLI Web Scraper

Abrasion crawls the web. Starting from a seed website, it concurrently scrapes every website it finds. While scraping, Abrasion can also search each page for matches to a regex template, which defaults to matching email addresses, and output them, or it can simply print the structure of the web.
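
The crawl described above can be sketched roughly as follows. This is a minimal illustration that assumes a Go implementation, not Abrasion's actual code: the names crawl, fetchFunc, emailPattern and linkPattern are hypothetical, the email regex is a deliberately simple stand-in for the real default template, and pages are served from an in-memory map so the example runs without network access.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"sync"
)

// emailPattern is an assumed, simplified pattern; Abrasion's default may differ.
var emailPattern = regexp.MustCompile(`[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`)

// linkPattern extracts absolute URLs from href attributes (hypothetical helper).
var linkPattern = regexp.MustCompile(`href="(https?://[^"]+)"`)

// fetchFunc returns a page body; it is injected so the sketch can run offline.
type fetchFunc func(url string) string

// crawl scrapes pages level by level starting at seed, fetching each level
// concurrently, and stops once limit pages have been scraped. It returns the
// unique matches of pattern, sorted alphabetically.
func crawl(seed string, limit int, fetch fetchFunc, pattern *regexp.Regexp) []string {
	visited := map[string]bool{seed: true}
	matches := map[string]bool{}
	frontier := []string{seed}
	scraped := 0

	for len(frontier) > 0 && scraped < limit {
		// Never scrape more pages than the limit allows.
		if remaining := limit - scraped; len(frontier) > remaining {
			frontier = frontier[:remaining]
		}

		// Fetch the whole level concurrently; each goroutine owns one slot.
		bodies := make([]string, len(frontier))
		var wg sync.WaitGroup
		for i, u := range frontier {
			wg.Add(1)
			go func(i int, u string) {
				defer wg.Done()
				bodies[i] = fetch(u)
			}(i, u)
		}
		wg.Wait()
		scraped += len(frontier)

		// Collect matches and queue links that have not been seen yet.
		var next []string
		for _, body := range bodies {
			for _, m := range pattern.FindAllString(body, -1) {
				matches[m] = true
			}
			for _, l := range linkPattern.FindAllStringSubmatch(body, -1) {
				if !visited[l[1]] {
					visited[l[1]] = true
					next = append(next, l[1])
				}
			}
		}
		frontier = next
	}

	out := make([]string, 0, len(matches))
	for m := range matches {
		out = append(out, m)
	}
	sort.Strings(out)
	return out
}

func main() {
	// A tiny made-up link graph standing in for real websites.
	pages := map[string]string{
		"https://a.example": `<a href="https://b.example">b</a> <a href="https://c.example">c</a> hi@a.example`,
		"https://b.example": `<a href="https://a.example">a</a> team@b.example`,
		"https://c.example": `sales@c.example`,
	}
	fetch := func(url string) string { return pages[url] }
	for _, m := range crawl("https://a.example", 100, fetch, emailPattern) {
		fmt.Println(m)
	}
}
```

Fetching one level at a time keeps the sketch free of shared mutable state between goroutines. A real crawler would more likely use a worker pool and an HTTP client with timeouts.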

Installation Instructions

git clone https://github.com/teohrt/abrasion.git
cd abrasion/
make build

Usage

# Crawl the web starting from google.com, the default seed URL, and output emails to a text file.
./abrasion -getEmails

# Scrape 100 URLs starting with the given seed URL. Output all related URLs and debug logs to their respective files, as well as to the console.
./abrasion -url=https://reddit.com -scrapeLimit=100 -verbose -debug

Flags

  • url - The URL with which Abrasion begins scraping the web. Defaults to "https://www.google.com".
  • verbose - Sets verbose logging in console. Defaults to false.
  • debug - When set, a debug log file is written. Defaults to false.
  • getEmail - Aggregate email addresses. Defaults to false.
  • scrapeLimit - Sets the number of URLs to scrape. Defaults to MAXINT.
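
For illustration, the flags above could be declared with Go's standard flag package along these lines. This is a sketch that assumes a Go implementation: the config struct and parseFlags function are hypothetical, and MAXINT is interpreted here as math.MaxInt (available since Go 1.17).

```go
package main

import (
	"flag"
	"fmt"
	"math"
	"os"
)

// config mirrors the documented flags; the struct itself is hypothetical.
type config struct {
	url         string
	verbose     bool
	debug       bool
	getEmail    bool
	scrapeLimit int
}

// parseFlags parses args using the flag names and defaults listed above.
func parseFlags(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("abrasion", flag.ContinueOnError)
	fs.StringVar(&c.url, "url", "https://www.google.com", "seed URL to start scraping from")
	fs.BoolVar(&c.verbose, "verbose", false, "log verbosely to the console")
	fs.BoolVar(&c.debug, "debug", false, "write a debug log file")
	fs.BoolVar(&c.getEmail, "getEmail", false, "aggregate email addresses")
	// Assumption: MAXINT corresponds to math.MaxInt.
	fs.IntVar(&c.scrapeLimit, "scrapeLimit", math.MaxInt, "number of URLs to scrape")
	err := fs.Parse(args)
	return c, err
}

func main() {
	c, err := parseFlags(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("%+v\n", c)
}
```

Using a dedicated FlagSet instead of the package-level flag functions makes the parsing easy to exercise with a fixed argument slice.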

Logs

Abrasion has three potential output streams. By default, it writes errors and results to their own text files, and it also prints to the console when the -verbose flag is set.
