New crawler disa pubs #226

Merged
merged 10 commits into from
Apr 30, 2024

Collaborator

@matthew-kersting matthew-kersting commented Apr 26, 2024

Description

Adds a crawler for DISA publications. It covers Instructions at https://disa.mil/About/DISA-Issuances/Instructions and Circulars at https://disa.mil/About/DISA-Issuances/Circulars.

At the time of this PR there are 42 Instructions available and 6 Circulars, all of which are picked up by this crawler.

I also wanted to start some refactoring in this PR. My goal is to move the logic that is shared between crawlers into a separate class called DocItemFields, which holds the data that every crawler uses in the same way. New crawlers can then contain only what makes them unique (crawler name, site scraping logic, URLs, etc.).
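
To illustrate the idea, here is a rough sketch of what such a shared class could look like, assuming a plain Python dataclass. The field names mirror the example metadata in this PR, but the class body, the `display_title` and `display_source` properties, and `to_dict` are illustrative assumptions and not the code merged here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DocItemFields:
    """Hypothetical holder for the fields every crawler fills the same way."""

    doc_name: str
    doc_title: str
    doc_num: str
    doc_type: str
    publication_date: datetime
    cac_login_required: bool
    source_page_url: str
    download_url: str
    data_source: str
    source_title: str

    @property
    def display_title(self) -> str:
        # Format inferred from the sample metadata: "<doc_name>: <doc_title>"
        return f"{self.doc_name}: {self.doc_title}"

    @property
    def display_source(self) -> str:
        # Format inferred from the sample metadata: "<data_source> - <source_title>"
        return f"{self.data_source} - {self.source_title}"

    def to_dict(self) -> dict:
        """Build the shared part of the output record (a subset of keys only)."""
        return {
            "doc_name": self.doc_name,
            "doc_title": self.doc_title,
            "doc_num": self.doc_num,
            "doc_type": self.doc_type,
            "publication_date": self.publication_date.isoformat(),
            "cac_login_required": self.cac_login_required,
            "source_page_url": self.source_page_url,
            "download_url": self.download_url,
            "display_title": self.display_title,
            "display_source": self.display_source,
        }


# A crawler would then only supply what it scraped from its own site.
fields = DocItemFields(
    doc_name="DISAC 270-A85-1",
    doc_title="System Equipment Reporting System (SERS)",
    doc_num="270-A85-1",
    doc_type="Circular",
    publication_date=datetime(2017, 3, 17),
    cac_login_required=False,
    source_page_url="https://disa.mil/About/DISA-Issuances/Circulars",
    download_url="https://disa.mil/-/media/Files/DISA/About/Publication/Circular/270-A85-1-System-Equipment-Reporting-System.pdf",
    data_source="Defense Information Systems Agency",
    source_title="DISA Policy/Issuances",
)
print(fields.to_dict()["display_title"])
```

Keeping the derived display strings on the shared class means a new crawler only has to pass in the scraped values, which is the separation described above.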

Also included in this PR are a new PR template and an update to run the GitHub workflows when merging to main, as that is the default branch.

Result of Crawler Run on Dev

 DISA_pubs
        Required CAC: 0
        In Previous Hashes: 0
        Item Scraped Count: 48
        Elapsed Time (sec): 6.68043
        Close Reason: finished

Example Metadata

{
    "doc_name": "DISAC 270-A85-1",
    "doc_title": "System Equipment Reporting System (SERS)",
    "doc_num": "270-A85-1",
    "doc_type": "Circular",
    "display_doc_type": "Circular",
    "publication_date": "2017-03-17T00:00:00",
    "cac_login_required": false,
    "crawler_used": "DISA_pubs",
    "downloadable_items": [
        {
            "doc_type": "pdf",
            "download_url": "https://disa.mil/-/media/Files/DISA/About/Publication/Circular/270-A85-1-System-Equipment-Reporting-System.pdf",
            "compression_type": null
        }
    ],
    "source_page_url": "https://disa.mil/About/DISA-Issuances/Circulars",
    "source_fqdn": "disa.mil",
    "download_url": "https://disa.mil/-/media/Files/DISA/About/Publication/Circular/270-A85-1-System-Equipment-Reporting-System.pdf",
    "version_hash_raw_data": {
        "doc_name": "DISAC 270-A85-1",
        "doc_num": "270-A85-1",
        "publication_date": "2017-03-17T00:00:00",
        "download_url": "https://disa.mil/-/media/Files/DISA/About/Publication/Circular/270-A85-1-System-Equipment-Reporting-System.pdf",
        "display_title": "System Equipment Reporting System (SERS)"
    },
    "version_hash": "7eea6bdfb53ca2a9244a1039097fcf826a8082a1a8fa66cdbf75f1452a3e6a57",
    "display_org": "Defense Information Systems Agency",
    "data_source": "Defense Information Systems Agency",
    "source_title": "DISA Policy/Issuances",
    "display_source": "Defense Information Systems Agency - DISA Policy/Issuances",
    "display_title": "DISAC 270-A85-1: System Equipment Reporting System (SERS)",
    "file_ext": "pdf",
    "is_revoked": false,
    "access_timestamp": "2024-04-29T14:05:05"
}
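
The `version_hash` above is 64 hexadecimal characters long, which is consistent with a SHA-256 digest derived from `version_hash_raw_data`. The sketch below shows that general idea only. The function name `sketch_version_hash` and the JSON serialization (sorted keys, UTF-8) are assumptions, so it is not expected to reproduce the exact hash shown in the example.

```python
import hashlib
import json


def sketch_version_hash(raw_data: dict) -> str:
    """Hypothetical helper: hash the fields that define a document version."""
    # Sorting keys makes the digest independent of dict insertion order.
    serialized = json.dumps(raw_data, sort_keys=True)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


raw = {
    "doc_name": "DISAC 270-A85-1",
    "doc_num": "270-A85-1",
    "publication_date": "2017-03-17T00:00:00",
    "download_url": "https://disa.mil/-/media/Files/DISA/About/Publication/Circular/270-A85-1-System-Equipment-Reporting-System.pdf",
    "display_title": "System Equipment Reporting System (SERS)",
}
digest = sketch_version_hash(raw)
print(len(digest))
```

Because only the fields in `version_hash_raw_data` feed the digest, a change to any of them (for example a new `publication_date`) yields a different hash, while fields such as `access_timestamp` do not affect it.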

Contributor

@Antsega Antsega left a comment

Nice DocItemFields class!
Collects all required PDFs, the code is easy to follow, and the display fields populate nicely in the app.

@Antsega Antsega merged commit 5a43724 into main Apr 30, 2024
1 check passed
@matthew-kersting matthew-kersting deleted the new-crawler-disa-pubs branch May 1, 2024 12:22