Fetch news articles within a certain time range from news websites.

wsdookadr/sitemap-range-fetch

About

This module provides the SitemapRange class and sitemap_fetch.py, a tool for command-line usage.

The SitemapRange class is meant primarily as a generic building block for news-aggregating applications whose data sources are spec-compliant news websites.

It includes some fault-tolerance features to deal with inconsistencies in sitemaps.

Install

To install from PyPI:

pip3 install --user sitemap-range-fetch

Usage

To fetch all news articles published on cnn.com in the past 6 days and format the result as JSON:

sitemap_fetch.py --site "https://cnn.com" --format json --daysago 6

Here is an example of using the SitemapRange class in your code:

from datetime import datetime, timedelta

from sitemap_range.sitemap_range import SitemapRange

# Select articles published during the last 3 days
now = datetime.now()
sr = SitemapRange("https://cnn.com")
in_range = sr.get_articles_in_range(start=now - timedelta(days=3), end=now, opts={})
print(in_range)

The get_articles_in_range method returns a list of dictionaries. Each dictionary has two keys: "url", the article URL, and "dt", an ISO 8601 formatted datetime string (as returned by the isoformat method).
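
Because "dt" is a string, it usually needs to be parsed before sorting or comparing. The following is a minimal sketch, assuming results shaped as described above. The sample entries and the helper name newest_first are hypothetical and not part of the package:

```python
from datetime import datetime


def newest_first(articles):
    """Return the articles sorted by their parsed "dt" value, newest first."""
    return sorted(
        articles,
        key=lambda a: datetime.fromisoformat(a["dt"]),
        reverse=True,
    )


# Hypothetical sample data in the shape described above
sample = [
    {"url": "https://example.com/a", "dt": "2021-03-01T08:30:00"},
    {"url": "https://example.com/b", "dt": "2021-03-02T09:15:00"},
]

for article in newest_first(sample):
    print(article["dt"], article["url"])
```

If a result list mixes values with and without a timezone offset, the sort raises TypeError, since Python does not compare timezone-aware and naive datetimes.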

More details about the CLI switches:

    usage: sitemap_fetch.py [-h] --site SITE [--format FORMAT] [--daysago DAYSAGO]
                            [--notz] [--advanced] [--tlimit TRANSFER_LIMIT]

    Tool for extracting articles from news websites

    optional arguments:
      -h, --help            show this help message and exit
      --site SITE           the url for the website
      --format FORMAT       output format (the default is json, also supports xml)
      --daysago DAYSAGO     defines the oldest date of an article that will be
                            selected (default: 2 days ago)
      --notz                strip the timezone from the dates before selection
                            (processing is more fault-tolerant)
      --advanced            use a more fault-tolerant parser
      --tlimit TRANSFER_LIMIT
                            total transfer limit in MB
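
The --daysago and --notz switches can be pictured roughly as follows. This is only a sketch of the idea under stated assumptions, not the tool's actual implementation; the function names cutoff and strip_tz are hypothetical. It shows why stripping timezones can make selection more tolerant: Python refuses to compare timezone-aware and naive datetimes.

```python
from datetime import datetime, timedelta, timezone


def cutoff(now, days_ago=2):
    """Oldest datetime still selected: `days_ago` days before `now`."""
    return now - timedelta(days=days_ago)


def strip_tz(dt):
    """Drop the timezone so aware and naive datetimes become comparable."""
    return dt.replace(tzinfo=None)


now = datetime(2021, 3, 3, 12, 0, 0)  # naive "now", fixed for reproducibility
aware = datetime(2021, 3, 2, 9, 0, 0, tzinfo=timezone.utc)  # date carrying a timezone

try:
    aware >= cutoff(now)  # mixing aware and naive datetimes raises TypeError
except TypeError as exc:
    print("comparison failed:", exc)

# After stripping the timezone the comparison succeeds
print(strip_tz(aware) >= cutoff(now))
```

Note that replace(tzinfo=None) discards the offset without converting the time, so dates from different timezones may be off by a few hours after stripping. That is the usual trade-off of this kind of tolerance.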

Commercial Support

For commercial support or customizations, please send an e-mail to [email protected]
