Spider Solution (Crawler) in NestJS

This is a spider (crawler) solution built with NestJS. It receives crawling requests and extracts links from web pages, starting from a specified URL. The solution implements a Breadth-First Search (BFS) crawling strategy, with support for scalability, persistence, and heavy crawling workloads involving a high number of requests.

Features

  • Crawls web pages using the BFS algorithm.
  • Receives crawling requests with a starting URL, maximum number of links to return, and crawl depth.
  • Supports parallel processing to handle a high number of requests efficiently.
  • Persists crawling requests and their results.
  • Optimized for heavy requests involving pages with many links and deep crawl depths.
  • Scalable architecture for distributed deployment.
  • RESTful API endpoints for submitting crawling requests and retrieving crawled data.

Requirements

  • Node.js
  • npm
  • MongoDB (optional, if using persistence)

Installation

  1. Clone the repository:

    git clone https://github.com/g4lb/spider-solution.git
  2. Navigate to the project directory:

    cd spider-solution
  3. Install the dependencies:

    npm install
  4. Set up the configuration:

    • Copy the .env.example file and rename it to .env.
    • Modify the configuration parameters in the .env file as per your environment.
  5. Start the application:

    npm run start

    The application will start running on http://localhost:3001.

Usage

  1. Send a POST request to http://localhost:3001/crawler with the following JSON payload:

    {
      "startUrl": "https://example.com",
      "maxLinks": 10,
      "crawlDepth": 3
    }
    • startUrl: The starting URL for crawling.
    • maxLinks: The maximum number of links to return.
    • crawlDepth: The maximum depth of nested links the crawler should follow.
  2. The application will initiate the crawling process and return the crawled links as a JSON response.
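
For reference, the endpoint can be called from any HTTP client. The sketch below uses Node.js 18+ and its built-in fetch; the payload mirrors the documented fields, and the exact shape of the returned JSON depends on the implementation.

    // Minimal client sketch (assumes Node.js 18+ with the global fetch API).
    async function requestCrawl(): Promise<void> {
      const response = await fetch('http://localhost:3001/crawler', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          startUrl: 'https://example.com',
          maxLinks: 10,
          crawlDepth: 3,
        }),
      });

      // The crawled links come back as a JSON response.
      const links = await response.json();
      console.log(links);
    }

    requestCrawl().catch(console.error);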

Persistence (Optional)

If you want to enable persistence for the crawling requests and their results, follow these additional steps:

  1. Install MongoDB and ensure it is running on your system.

  2. Update the .env file with your MongoDB connection details:

    MONGODB_URI=mongodb://localhost:27017/spider_solution

    Modify the MONGODB_URI value to match your MongoDB connection string.

  3. Start the application:

    npm run start

    The application will now persist the crawling requests and their results in the configured MongoDB database.
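
As a point of reference, the MONGODB_URI value can be wired into the application roughly as sketched below. This is a minimal sketch assuming @nestjs/config and @nestjs/mongoose are used; the actual module setup in the project may differ.

    import { Module } from '@nestjs/common';
    import { ConfigModule, ConfigService } from '@nestjs/config';
    import { MongooseModule } from '@nestjs/mongoose';

    @Module({
      imports: [
        // Load variables from the .env file and make them available app-wide.
        ConfigModule.forRoot({ isGlobal: true }),
        // Build the Mongoose connection from the configured MONGODB_URI.
        MongooseModule.forRootAsync({
          inject: [ConfigService],
          useFactory: (config: ConfigService) => ({
            uri: config.get<string>('MONGODB_URI'),
          }),
        }),
      ],
    })
    export class AppModule {}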

Design Concept

  1. Architecture Overview:

    • The spider solution follows a microservices architecture, separating concerns into modules to promote modularity and maintainability.
    • The main components include:
      • Crawler Module: Responsible for handling crawling requests, executing the crawling algorithm, and persisting the crawled data.
      • Database Module: Handles database interactions for storing and retrieving crawled data.
      • API Module: Provides the RESTful API endpoints for receiving crawling requests and returning crawled data.
      • Queue Module: Manages a message queue system to handle asynchronous processing and distribute crawling tasks across multiple instances.
  2. Crawler Module:

    • Contains the core logic for crawling web pages using the BFS algorithm (a minimal sketch of this approach appears after this list).
    • Includes the CrawlerService that encapsulates the crawling functionality, such as extracting links, managing visited URLs, and storing crawled data.
    • Utilizes libraries like Axios for making HTTP requests and Cheerio for parsing HTML content and extracting links.
    • Implements parallel processing techniques to handle a high number of requests concurrently, improving performance and scalability.
    • Integrates with the Database Module to persist crawled data.
  3. Database Module:

    • Manages the interaction with the database for storing and retrieving crawled data.
    • Utilizes an ODM or ORM library such as Mongoose or TypeORM to interact with the database.
    • Defines a schema and model for the CrawledData entity, specifying the structure of the crawled data to be stored (see the schema sketch after this list).
    • Handles database operations, such as saving crawled data, querying for specific data, and managing database connections and transactions.
  4. API Module:

    • Exposes the RESTful API endpoints to receive crawling requests and return crawled data.
    • Utilizes the NestJS framework's decorators and controllers to define the API routes.
    • Validates and sanitizes the incoming requests, ensuring the required parameters are provided and within acceptable limits.
    • Uses the CrawlerService to initiate the crawling process and return the crawled data as a response (see the controller sketch after this list).
  5. Queue Module (optional):

    • Implements a message queue system (e.g., RabbitMQ, Kafka) to handle asynchronous processing and distribute crawling tasks across multiple instances.
    • Decouples the crawling requests from the processing logic, improving scalability and fault-tolerance.
    • Queues incoming crawling requests; worker instances pick up tasks from the queue and process them asynchronously (see the queue sketch after this list).
    • Allows for easy scaling by adding more worker instances as the workload increases.
  6. Error Handling and Logging:

    • Implements appropriate error handling throughout the application, handling exceptions and returning meaningful error responses to the client.
    • Utilizes logging libraries like Winston or Bunyan to log important events and errors, facilitating troubleshooting and monitoring of the system.
  7. Scalability and Performance:

    • Uses techniques like parallel processing, distributed architecture, and caching to support high request volumes and heavy requests with deep crawling.
    • Implements caching mechanisms to reduce redundant crawling of the same URLs.
    • Distributes the spider solution across multiple instances or servers, utilizing load balancing and container orchestration technologies to handle scalability requirements.
  8. Monitoring and Metrics:

    • Incorporates monitoring tools like Prometheus or Datadog to collect and visualize performance metrics, such as request latency, error rates, and resource utilization.
    • Implements health checks and monitors the system's vital components to ensure availability and performance.
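
To make the Crawler Module's approach concrete, the following is a minimal, self-contained sketch of breadth-first link extraction with Axios and Cheerio. Function and variable names are illustrative and not taken from the project's code.

    import axios from 'axios';
    import * as cheerio from 'cheerio';

    // Illustrative BFS link extraction; the real CrawlerService adds persistence,
    // logging, and error handling on top of this core loop.
    async function crawlBfs(startUrl: string, maxLinks: number, crawlDepth: number): Promise<string[]> {
      const visited = new Set<string>([startUrl]);
      const results: string[] = [];
      let frontier: { url: string; depth: number }[] = [{ url: startUrl, depth: 0 }];

      while (frontier.length > 0 && results.length < maxLinks) {
        const nextLevel: { url: string; depth: number }[] = [];

        // Fetch and parse all pages at the current depth in parallel.
        await Promise.all(
          frontier.map(async ({ url, depth }) => {
            try {
              const { data } = await axios.get<string>(url);
              const $ = cheerio.load(data);

              $('a[href]').each((_, el) => {
                if (results.length >= maxLinks) return;
                const href = $(el).attr('href');
                if (!href) return;
                const absolute = new URL(href, url).toString();
                if (visited.has(absolute)) return;
                visited.add(absolute);
                results.push(absolute);
                // Only enqueue links that are still within the requested depth.
                if (depth + 1 < crawlDepth) {
                  nextLevel.push({ url: absolute, depth: depth + 1 });
                }
              });
            } catch {
              // Skip pages that fail to load or parse.
            }
          }),
        );

        frontier = nextLevel;
      }

      return results.slice(0, maxLinks);
    }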
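
For the Database Module, a Mongoose schema for the CrawledData entity could look roughly like the sketch below. The field names are assumptions for illustration; the project's actual schema may differ.

    import { Prop, Schema, SchemaFactory } from '@nestjs/mongoose';
    import { HydratedDocument } from 'mongoose';

    // Illustrative schema for a crawling request and its result.
    @Schema({ timestamps: true })
    export class CrawledData {
      @Prop({ required: true })
      startUrl: string;

      @Prop({ required: true })
      maxLinks: number;

      @Prop({ required: true })
      crawlDepth: number;

      // Links extracted by the BFS crawl.
      @Prop({ type: [String], default: [] })
      links: string[];
    }

    export type CrawledDataDocument = HydratedDocument<CrawledData>;
    export const CrawledDataSchema = SchemaFactory.createForClass(CrawledData);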
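
For the API Module, the POST /crawler route and its validation could be sketched as follows, assuming class-validator is used together with a global ValidationPipe. The class and method names are illustrative.

    import { Body, Controller, Injectable, Post } from '@nestjs/common';
    import { IsInt, IsUrl, Min } from 'class-validator';

    // Request DTO mirroring the documented payload.
    export class CreateCrawlDto {
      @IsUrl()
      startUrl: string;

      @IsInt()
      @Min(1)
      maxLinks: number;

      @IsInt()
      @Min(1)
      crawlDepth: number;
    }

    // Stand-in for the real CrawlerService described in the Crawler Module.
    @Injectable()
    export class CrawlerService {
      async crawl(dto: CreateCrawlDto): Promise<string[]> {
        return []; // The BFS implementation lives here.
      }
    }

    @Controller('crawler')
    export class CrawlerController {
      constructor(private readonly crawlerService: CrawlerService) {}

      // POST /crawler — the DTO is validated by a global ValidationPipe (assumed enabled in main.ts).
      @Post()
      async create(@Body() dto: CreateCrawlDto): Promise<string[]> {
        return this.crawlerService.crawl(dto);
      }
    }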
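
For the optional Queue Module, producing and consuming crawl tasks could be sketched with @nestjs/microservices as shown below. The 'CRAWL_QUEUE' token and the 'crawl.requested' pattern are illustrative names; the actual broker setup (RabbitMQ, Kafka, etc.) is configured separately via ClientsModule and the worker's transport options.

    import { Controller, Inject, Injectable } from '@nestjs/common';
    import { ClientProxy, EventPattern, Payload } from '@nestjs/microservices';

    interface CrawlTask {
      startUrl: string;
      maxLinks: number;
      crawlDepth: number;
    }

    // Producer side: the API enqueues a crawl task instead of crawling inline.
    @Injectable()
    export class CrawlQueueProducer {
      constructor(@Inject('CRAWL_QUEUE') private readonly client: ClientProxy) {}

      enqueue(task: CrawlTask): void {
        // emit() publishes an event; no response is awaited.
        this.client.emit('crawl.requested', task);
      }
    }

    // Worker side: a separate instance connected to the same broker picks up tasks.
    @Controller()
    export class CrawlQueueWorker {
      @EventPattern('crawl.requested')
      async handleCrawlRequested(@Payload() task: CrawlTask): Promise<void> {
        // Run the BFS crawl for the task and persist the result (omitted in this sketch).
        console.log('crawl requested for', task.startUrl);
      }
    }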