Syhen/baidu-index

baidu_index

Crawl Baidu Index without Selenium or PhantomJS.

requirements

  • flask
  • pillow
  • numpy
  • requests
  • lxml
  • docker

install

  1. Start Docker
sudo docker pull scrapinghub/splash
sudo docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash
  2. Clone the project
git clone https://github.com/Syhen/baidu-index.git
  3. Set the baidu-index environment variable

  4. Start the Flask microservice

cd baidu-index/baidu_index/backend
python index.py
  5. Configure nginx: set up nginx for the microservice, because Splash cannot resolve localhost

Then change the domain in baidu_index.core.index.get_res2 to the domain you configured.
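The localhost restriction is presumably a consequence of Splash running inside a Docker container, where localhost refers to the container itself rather than the host running Flask. As a rough sketch of what a request through Splash looks like, the helper below builds a URL for Splash's render.html HTTP endpoint and rejects loopback targets up front. The function name, the example domain, and the use of Python 3 are assumptions for illustration; none of this is taken from the project's own code.

```python
from urllib.parse import urlencode, urlparse

# Splash HTTP API endpoint; port 8050 matches the docker run command above.
SPLASH_ENDPOINT = "http://127.0.0.1:8050/render.html"

# Hosts that resolve to the Splash container itself, not to the Flask host.
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def splash_render_url(target_url, wait=1.0, splash_endpoint=SPLASH_ENDPOINT):
    """Build a render.html URL asking Splash to load target_url.

    Hypothetical helper: raises ValueError for loopback targets, since
    Splash would look for them inside its own container.
    """
    host = urlparse(target_url).hostname
    if host in LOOPBACK_HOSTS:
        raise ValueError(
            "Splash cannot reach %r; use the domain configured in nginx" % host
        )
    query = urlencode({"url": target_url, "wait": wait})
    return splash_endpoint + "?" + query


if __name__ == "__main__":
    # index.example.com is a placeholder for the nginx-configured domain.
    print(splash_render_url("http://index.example.com/page"))
```

The resulting URL can be fetched with requests.get once the Splash container is running.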

  6. Demo
from __future__ import unicode_literals

from requests.cookies import RequestsCookieJar

from baidu_index.core.index import BaiduIndexCrawler

cookies = RequestsCookieJar()
# fill the jar with cookies from a logged-in Baidu session before crawling
baidu_index_crawler = BaiduIndexCrawler('机器学习', cookies, start_date="2017-01-01", end_date="2017-01-31")
baidu_index_crawler.next()
# 936
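
The demo leaves the "update cookies with login" step open. One possible way to do it, sketched below, is to copy the Cookie header from a logged-in browser session and load it into a RequestsCookieJar. The helper name build_cookie_jar, the ".baidu.com" domain, and the cookie names in the example (BDUSS, BAIDUID) are assumptions, and the values are placeholders. Which cookies the crawler actually needs is not stated in this README.

```python
from requests.cookies import RequestsCookieJar


def build_cookie_jar(cookie_header, domain=".baidu.com"):
    """Parse a browser 'Cookie:' header string into a RequestsCookieJar.

    Hypothetical helper; the default domain is an assumption.
    """
    jar = RequestsCookieJar()
    for pair in cookie_header.split(";"):
        # Split on the first '=' only, since cookie values may contain '='.
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            jar.set(name, value, domain=domain, path="/")
    return jar


if __name__ == "__main__":
    # Placeholder values; paste the header from your own logged-in session.
    cookies = build_cookie_jar("BDUSS=abc123; BAIDUID=xyz")
    print(sorted(cookies.keys()))
```

The returned jar can then be passed as the cookies argument to BaiduIndexCrawler in place of the empty jar in the demo.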

warning!!

Commercial use is prohibited!
