The crawler inside `/browser_simulator` will crawl all pages, including redirect pages. For example, `r1.html` redirects to `r2.html` and then to `r3.html`; the crawler will crawl `r1.html`, `r2.html`, and `r3.html` with HTTP 200 and ignore HTTP 3xx status codes. Instead, it retrieves the redirect links from the HTML pages or JS files and follows them manually.
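A minimal sketch of how that per-request behaviour could look in Scrapy; the spider name and callback structure here are illustrative assumptions, not the actual implementation:

```python
import scrapy

class RedirectPageSpider(scrapy.Spider):
    # Hypothetical spider name; the real spider names are listed further below.
    name = "redirect_page_example"
    start_urls = ["http://example.com/r1.html"]

    def start_requests(self):
        for url in self.start_urls:
            # Ask Scrapy's RedirectMiddleware not to follow HTTP 3xx automatically;
            # client-side redirects found in the HTML/JS are followed manually
            # from the parse() callback instead.
            yield scrapy.Request(url, callback=self.parse,
                                 meta={"dont_redirect": True})

    def parse(self, response):
        yield {"url": response.url, "status": response.status}
        # ... extract a client-side redirect target here (meta refresh or
        # window.location.href) and yield a new Request to follow it.
```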
- This is a test version.
- HTML redirects are detected via the meta element `meta[@http-equiv="refresh" and @content]/@content` (see the sketch after this list). This still needs to be made more robust.
- JS has too many possible ways to do redirects. The redirect can be in a script embedded in the HTML page or in a separate JS file.
- Currently, the crawler will first look for `window.location.href`. If it is not found, all JS files are parsed, and if `//script[@type="text/javascript" and @src]/@src` exists, the redirect is marked as true. This needs to be extended to cover as many redirect styles as possible.
- For some pages, if all HTML elements are crawled directly, some parts may be missed, since JS renders those parts.
- Instead of using the default `Selector` in `Scrapy`, `selenium` with `PhantomJS` is used inside `Scrapy` to crawl pages rendered by JS (see the sketch after this list). `CasperJS` may be used in the future, but currently `selenium` only supports `PhantomJS`.
- There are still some problems in it. Currently, the JS redirect cannot be simulated by PhantomJS; to be specific, I cannot stop the redirect in PhantomJS.
- For some pages, the CSV output is really weird. This will be solved in the future.
- NOTE: the page sources crawled are all rendered by PhantomJS. The Scrapy spider is only used for finding the redirect links in JS files.
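The sketch below shows one way the HTML and JS redirect checks described above could fit together with PhantomJS rendering. The XPath expressions are the ones quoted above; the function name, the regular expression, and the handling of the meta `content` value are assumptions, not the actual code.

```python
import re

from scrapy.selector import Selector
from selenium import webdriver  # PhantomJS support requires an older Selenium release

META_REFRESH_XPATH = '//meta[@http-equiv="refresh" and @content]/@content'
SCRIPT_SRC_XPATH = '//script[@type="text/javascript" and @src]/@src'
JS_LOCATION_RE = re.compile(r'window\.location\.href\s*=\s*[\'"]([^\'"]+)[\'"]')

def find_client_side_redirect(url):
    """Render a page with PhantomJS and look for a client-side redirect.

    Returns (target_url, external_scripts): target_url is the redirect found in
    the rendered HTML (or None); external_scripts are JS files that may still
    contain a window.location.href redirect and need separate parsing.
    """
    driver = webdriver.PhantomJS()
    try:
        driver.get(url)
        html = driver.page_source  # JS-rendered source, not the raw response body
    finally:
        driver.quit()

    sel = Selector(text=html)
    external_scripts = sel.xpath(SCRIPT_SRC_XPATH).extract()

    # 1) HTML redirect: meta refresh content such as "5; url=http://example.com/r2.html"
    content = sel.xpath(META_REFRESH_XPATH).extract_first()
    if content and "url=" in content.lower():
        return content.split("=", 1)[1].strip(), external_scripts

    # 2) Inline JS redirect assigned to window.location.href
    match = JS_LOCATION_RE.search(html)
    if match:
        return match.group(1), external_scripts

    # 3) No direct target found; any redirect would have to be in the external
    #    JS files, which must be fetched and searched separately.
    return None, external_scripts
```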
`environment_readme.txt` is provided, noting that `Python`, `Scrapy` (in Python), `NodeJS`, and `PhantomJS` (in NodeJS) are needed to run the program.
- To start a spider, type `scrapy crawl spider-name` in the `/browser_simulator` directory. `spider-name` can be `all_in_one_spider`, `headless_spider`, `html_redirect_spider`, or `js_redirect_spider`.
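For example, any one of the following commands, run from the `/browser_simulator` directory, starts the corresponding spider:

```sh
scrapy crawl all_in_one_spider
scrapy crawl headless_spider
scrapy crawl html_redirect_spider
scrapy crawl js_redirect_spider
```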
- Fix PhantomJS failing to stop the redirect.
- Write a new middleware for saving data. Currently, the CSV for big webpages comes out messy.
- Improve robustness.
The crawler inside `/browser_simulator` will try to crawl download links for good software (only `exe`) from CNET and FileHorse. Only the most popular ones will be crawled.
For CNET, all links are located in the http://download.cnet.com/s/software/windows-free/?sort=most-popular section. The start page and end page can be customized.
For FileHorse, all links are located in the http://www.filehorse.com/popular/ section. The start page and end page can be customized (see the sketch below).
A normal crawler and a PhantomJS-with-Selenium crawler are provided; choose whichever is preferred.
This section is almost the same as *Headless Browser Simulate / Redirect Browser*. Please refer to that section above for details.
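A minimal sketch of how the customizable start/end page could be expressed. The URL templates and the helper function are hypothetical placeholders; the real pagination pattern of the two sites may differ.

```python
# Hypothetical URL templates; the actual pagination of CNET and FileHorse may differ.
CNET_LIST_URL = "http://download.cnet.com/s/software/windows-free/?sort=most-popular&page={page}"
FILEHORSE_LIST_URL = "http://www.filehorse.com/popular/{page}/"

def listing_pages(template, start_page, end_page):
    """Build the listing-page URLs for the configured page range (inclusive)."""
    return [template.format(page=p) for p in range(start_page, end_page + 1)]

# Example: the first three most-popular pages on FileHorse.
start_urls = listing_pages(FILEHORSE_LIST_URL, 1, 3)
```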
Data are all `csv` or `xlsx` files, but note that only `csv` files can be used in ML.
`Python` 2.7.10 and the `numpy` module are needed to run the program.
```python
from RandomForest import MachineLearning

ml = MachineLearning()  # will start training the tree
ml.voting(data)         # predict a tuple of data; the data can be gathered from the user
```
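A sketch of how the pieces above might be used together, assuming the `csv` data can be loaded with `numpy`; the file name and column layout are hypothetical:

```python
import numpy as np
from RandomForest import MachineLearning

# Hypothetical file and layout: each row of the csv is one feature tuple.
samples = np.genfromtxt("training_output.csv", delimiter=",")

ml = MachineLearning()               # training starts when the object is created
prediction = ml.voting(samples[0])   # classify a single tuple of features
print(prediction)
```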
- Line #27 controls the depth of the tree.
- Line #77 uses a binary split.
- Line #441 controls the percentage of good data in each training set.