Gospy

Gospy is a flexible web crawler framework that lets you develop a complete web crawler in minutes.

Its minimalist, unified API greatly reduces the learning cost for new users. With it, you can focus on the data itself rather than implementing a complicated web crawler from scratch. If you are familiar with Java and want to grab some interesting data, hold on: you will have your first crawler running in a few minutes. OK, let's start!

Features

  • Portable, flexible, and modular (you can use just one of the modules, or add your own module to your Gospy-based crawler)
  • Runs in stand-alone mode (multi-threaded), in distributed mode (RabbitMQ or Hprose), or in both at once
  • Built-in PhantomJS and Selenium support: you can call the WebDriver directly to build a browser-kernel-based web crawler
  • Element extraction based on RegEx, XPath, and Jsoup, covering tasks from simple to complex
  • Supports object-oriented processing with annotations
  • Practical structural abstractions, from task scheduling to data persistence
  • Provides a robots.txt interpreter (easy to use if you need it)
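
The feature list mentions a robots.txt interpreter. To illustrate what such an interpreter does (independently of Gospy's own API, which is not shown here), the following sketch collects the Disallow rules for `User-agent: *` and tests a path against them:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal robots.txt check: collects Disallow rules for "User-agent: *"
// and tests a path against them by prefix match. This is an illustrative
// sketch only, not Gospy's actual interpreter.
public class RobotsSketch {
    public static List<String> disallowedFor(String robotsTxt) {
        List<String> rules = new ArrayList<>();
        boolean inStarGroup = false;
        for (String line : robotsTxt.split("\n")) {
            String l = line.trim();
            if (l.toLowerCase().startsWith("user-agent:")) {
                inStarGroup = l.substring(11).trim().equals("*");
            } else if (inStarGroup && l.toLowerCase().startsWith("disallow:")) {
                String path = l.substring(9).trim();
                if (!path.isEmpty()) rules.add(path);
            }
        }
        return rules;
    }

    public static boolean isAllowed(String robotsTxt, String path) {
        for (String rule : disallowedFor(robotsTxt)) {
            if (path.startsWith(rule)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\nDisallow: /tmp/";
        System.out.println(isAllowed(robots, "/private/data")); // false
        System.out.println(isAllowed(robots, "/public/page"));  // true
    }
}
```

A production interpreter would also handle Allow rules, wildcards, and per-agent groups; Gospy's built-in one should be preferred in real crawlers.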

Install

Download jar:

Release Version   JDK Compatibility   Release Date   Links
0.2.1-beta        1.8+                07.04.2017     download
0.2.2-beta        1.8+                21.05.2017     download

To add a dependency using Maven, use the following:

<dependency>
    <groupId>cc.gospy</groupId>
    <artifactId>gospy-core</artifactId>
    <version>0.2.2</version>
</dependency>

To add a dependency using Gradle:

compile 'cc.gospy:gospy-core:0.2.2'

Learn about Gospy

Module division:

http://7xp1jv.com1.z0.glb.clouddn.com/gospy/img/single-infra.jpg

Run in cluster by Hprose:

http://7xp1jv.com1.z0.glb.clouddn.com/gospy/img/cluster-rpc-infra.jpg

Run in cluster under RabbitMQ-Server runtime:

http://7xp1jv.com1.z0.glb.clouddn.com/gospy/img/cluster-rabbitmq-infra.jpg

Quick start

Visit and print the webpage:

Gospy.custom()
        .setScheduler(Schedulers.VerifiableScheduler.getDefault())
        .addFetcher(Fetchers.HttpFetcher.getDefault())
        .addProcessor(Processors.UniversalProcessor.getDefault())
        .addPipeline(Pipelines.ConsolePipeline.custom().bytesToString().build())
        .build().addTask("https://github.com/zhangjiupeng/gospy").start();

Custom Fetcher, and set multiple pipelines:

String dir = "D:/"; // you need to specify a valid directory on your OS
Gospy.custom()
        .setScheduler(Schedulers.VerifiableScheduler.custom()
                .setTaskQueue(new PriorityTaskQueue()) // specify a priority queue
                .build())
        .addFetcher(Fetchers.HttpFetcher.custom()
                .setAutoKeepAlive(false)
                .before(request -> { // custom request
                    request.setHeader("Accept", "text/html,image/webp,*/*;q=0.8");
                    request.setHeader("Accept-Encoding", "gzip, deflate, sdch");
                    request.setHeader("Accept-Language", "zh-CN,zh;q=0.8");
                    request.setHeader("Cache-Control", "max-age=0");
                    request.setHeader("Connection", "keep-alive");
                    request.setHeader("DNT", "1");
                    request.setHeader("Host", request.getURI().getHost());
                    request.setHeader("User-Agent", UserAgent.DEFAULT);
                })
                .build())
        .addProcessor(Processors.UniversalProcessor.getDefault())
        .addPipeline(Pipelines.ConsolePipeline.getDefault()) // add multiple pipelines
        .addPipeline(Pipelines.SimpleFilePipeline.custom().setDir(dir).build())
        .build()
        .addTask("https://zhangjiupeng.com/logo.png")
        .addTask("https://www.baidu.com/img/bd_logo1.png")
        .addTasks(UrlBundle.parse("https://www.baidu.com/s?wd=gospy&pn={0~90~10}"))
        .start();
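
The `{0~90~10}` placeholder in the `UrlBundle.parse` call above describes a numeric range (start 0, end 90, step 10) substituted into the URL. As an illustration, and independent of Gospy's actual `UrlBundle` implementation, such an expansion can be sketched as:

```java
import java.util.ArrayList;
import java.util.List;

// Expands a URL template by substituting each value of a numeric range
// for the "{}" placeholder. Hypothetical sketch of range expansion,
// not Gospy's UrlBundle code.
public class UrlRangeSketch {
    public static List<String> expand(String template, int start, int end, int step) {
        List<String> urls = new ArrayList<>();
        for (int i = start; i <= end; i += step) {
            urls.add(template.replace("{}", Integer.toString(i)));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Mirrors the pn={0~90~10} pattern: ten pages, pn = 0, 10, ..., 90
        List<String> urls = expand("https://www.baidu.com/s?wd=gospy&pn={}", 0, 90, 10);
        System.out.println(urls.size()); // 10
        System.out.println(urls.get(0)); // first generated URL, pn=0
    }
}
```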

Save page screenshot by PhantomJS:

String phantomJsPath = "/path/to/phantomjs.exe";
String savePath = "D:/capture.png";
Gospy.custom()
        .setScheduler(Schedulers.VerifiableScheduler.custom()
                .setPendingTimeInSeconds(60)
                .build())
        .addFetcher(Fetchers.TransparentFetcher.getDefault())
        .addProcessor(Processors.PhantomJSProcessor.custom()
                .setPhantomJsBinaryPath(phantomJsPath)
                .setWebDriverExecutor((page, webDriver) -> {
                    TakesScreenshot screenshot = (TakesScreenshot) webDriver;
                    File src = screenshot.getScreenshotAs(OutputType.FILE);
                    FileUtils.copyFile(src, new File(savePath));
                    return new Result<>();
                })
                .build())
        .build()
        .addTask("phantomjs://https://www.taobao.com")
        .start();

Crawl by annotated class:

@UrlPattern("http://www.baidu.com/.*\\.php") // tasks matching this regex will be processed
public static class BaiduHomepageProcessor extends PageProcessor {
    @ExtractBy.XPath("/html/head/title/text()")
    public String title;

    @ExtractBy.XPath("//*[@id='u1']/a/@href") // fill element data by xpath
    @ExtractBy.XPath("//*[@id='head']/div/div[4]/div/div[2]/div[1]/div/a/@href")
    public Set<String> topBarLinks;

    @ExtractBy.Regex(value = "id=\"su\" value=\"(.*?)\"", group = 1) // fill by regex
    public String searchBtnValue;

    @ExtractBy.XPath
    public String[] allLinks;

    @Override
    public void process() { 
        // process after data filling
        System.out.println("Task url      :" + task.getUrl());
        System.out.println("Title         :" + title);
        System.out.println("Search slogan :" + searchBtnValue);
        System.out.println("Top bar links :");
        topBarLinks.forEach(System.out::println);
    }

    @Override
    public Collection<Task> getNewTasks() {
        return Arrays.asList(new Task("https://www.baidu.com/img/bd_logo1.png"));
    }

    @Override
    @Experimental
    public Object[] getResultData() {
        return Arrays.stream(allLinks)
                .filter(s -> s.matches("^https?://((?!javascript:|mailto:| ).)*")).toArray();
    }
}
Gospy.custom()
        .setScheduler(Schedulers.VerifiableScheduler.getDefault())
        .addFetcher(Fetchers.HttpFetcher.getDefault())
        .addPageProcessor(BaiduHomepageProcessor.class)
        .addProcessor(Processors.UniversalProcessor.getDefault())
        .addPipeline(Pipelines.ConsolePipeline.getDefault())
        .build().addTask("http://www.baidu.com/index.php").start();
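
The `getResultData()` override in the example above filters the extracted links with a regex that accepts only http(s) URLs and rejects any containing `javascript:`, `mailto:`, or a space. A standalone check of that same pattern:

```java
// Demonstrates the link-filtering regex used in getResultData() above,
// extracted into a reusable predicate for clarity.
public class LinkFilterSketch {
    static final String PATTERN = "^https?://((?!javascript:|mailto:| ).)*";

    public static boolean isCrawlableLink(String url) {
        // matches() anchors at both ends, so the negative lookahead
        // rejects the whole URL if any position contains a banned token
        return url.matches(PATTERN);
    }

    public static void main(String[] args) {
        System.out.println(isCrawlableLink("https://example.com/page"));       // true
        System.out.println(isCrawlableLink("javascript:void(0)"));             // false
        System.out.println(isCrawlableLink("mailto:someone@example.com"));     // false
        System.out.println(isCrawlableLink("http://example.com/has space"));   // false
    }
}
```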

More examples

Troubleshoot

Common questions will be collected and listed here.

Cooperate & Contact

Contributions to this project are welcome; anyone who makes a significant contribution will be listed here.

If you are interested in this project, please give it a star. If you have any questions, you can reach us in the following ways:

create an issue | chat on gitter | send an email

Thanks

License

Copyright 2017 ZhangJiupeng

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.