Patch tradoc and national_guard Spiders #237

Merged
emmarez merged 3 commits into main from patch-stale-spiders
Jul 16, 2024

Conversation

matthew-kersting
Collaborator

Description

We have been notified of a few stale crawlers. A couple of them were quick fixes, so I combined them into this PR. I also formatted the files.

The only logic changes are on line 125 of tradoc_spider.py, where I add a conditional to skip items without downloadable elements, and on line 24 of chief_national_guard_bureau_spider.py, where the start_url needed to be updated.
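
For reviewers who want a feel for the tradoc change without opening the diff, here is a minimal sketch of the kind of guard described above. It is not the actual code from tradoc_spider.py: the function name `skip_items_without_downloads` and the second sample row are hypothetical, and the real spider works on Scrapy responses rather than plain dicts.

```python
from typing import Iterable, Iterator


def skip_items_without_downloads(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield only rows that have at least one downloadable element.

    Hypothetical helper for illustration; the real spider applies a
    similar conditional inline while parsing the page.
    """
    for row in rows:
        downloadable_items = row.get("downloadable_items") or []
        if not downloadable_items:
            # Nothing to download for this entry, so skip it.
            continue
        yield row


rows = [
    {
        "doc_name": "TRADOC TF5",
        "downloadable_items": [
            {
                "doc_type": "pdf",
                "download_url": "https://adminpubs.tradoc.army.mil/forms/TF5.pdf",
                "compression_type": None,
            }
        ],
    },
    # Hypothetical entry with no link on the page.
    {"doc_name": "Entry without a link", "downloadable_items": []},
]

kept = list(skip_items_without_downloads(rows))
print([row["doc_name"] for row in kept])
```

Skipping these rows early keeps items with an empty downloadable_items list out of the crawler output.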

Result of Crawler Run on Dev

[STATS] Crawlers run:
 National_Guard
        Required CAC: 2
        In Previous Hashes: 98
        Item Scraped Count: 103
        Elapsed Time (sec): 1.161506
        Close Reason: finished
 tradoc
        Required CAC: 0
        In Previous Hashes: 128
        Item Scraped Count: 141
        Elapsed Time (sec): 2.718082
        Close Reason: finished

Example Metadata

{
    "doc_name": "TRADOC TF5",
    "doc_title": "Transmittal, Action and Control",
    "doc_num": "TF5",
    "doc_type": "TRADOC Forms (TFs)",
    "display_doc_type": "Document",
    "publication_date": "2022-04-01",
    "cac_login_required": false,
    "crawler_used": "tradoc",
    "downloadable_items": [
        {
            "doc_type": "pdf",
            "download_url": "https://adminpubs.tradoc.army.mil/forms/TF5.pdf",
            "compression_type": null
        }
    ],
    "source_page_url": "https://adminpubs.tradoc.army.mil/forms.html",
    "source_fqdn": "adminpubs.tradoc.army.mil",
    "download_url": "https://adminpubs.tradoc.army.mil/forms/TF5.pdf",
    "version_hash_raw_data": {
        "download_url": "https://adminpubs.tradoc.army.mil/forms/TF5.pdf",
        "doc_name": "TRADOC TF5",
        "doc_num": "TF5",
        "publication_date": "2022-04-01",
        "display_title": "TRADOC Forms (TFs) TF5: Transmittal, Action and Control"
    },
    "version_hash": "f5d1b8ac8ebd7d924b8cdd9c0e38d1ec4b77194f40977d076022a735d6fb1d30",
    "display_org": "United States Army Training and Doctrine Command",
    "data_source": "TRADOC",
    "source_title": "TRADOC Administrative Publications",
    "display_source": "TRADOC - TRADOC Administrative Publications",
    "display_title": "TRADOC Forms (TFs) TF5: Transmittal, Action and Control",
    "file_ext": "pdf",
    "is_revoked": false,
    "office_primary_resp": "Training and Doctrine Command",
    "access_timestamp": "2024-07-15T21:00:52"
}
{
    "doc_name": "CNGBI 1400.25A",
    "doc_title": "National Guard Technician and Civilian Personnel",
    "doc_num": "1400.25A",
    "doc_type": "CNGBI",
    "display_doc_type": "CNGBI",
    "publication_date": "11 May 2020",
    "cac_login_required": false,
    "crawler_used": "National_Guard",
    "downloadable_items": [
        {
            "doc_type": "pdf",
            "download_url": "https://www.ngbpmc.ng.mil/Portals/27/Publications/cngbi/CNGBI%201400_25A_20200511.pdf?ver=2020-05-15-072059-130",
            "compression_type": null
        }
    ],
    "source_page_url": "https://www.ngbpmc.ng.mil/Publications/CNGB-Instructions/",
    "source_fqdn": "www.ngbpmc.ng.mil",
    "download_url": "https://www.ngbpmc.ng.mil/Portals/27/Publications/cngbi/CNGBI%201400_25A_20200511.pdf?ver=2020-05-15-072059-130",
    "version_hash_raw_data": {
        "item_currency": "/Portals/27/Publications/cngbi/CNGBI%201400_25A_20200511.pdf?ver=2020-05-15-072059-130",
        "document_title": "National Guard Technician and Civilian Personnel",
        "document_number": "1400.25A"
    },
    "version_hash": "b0e0df12f4aadf52728fc2d14f6ce29099b3078d3d4b937e34aca0a2352d4080",
    "display_org": "National Guard",
    "data_source": "National Guard Bureau Publications & Forms Library",
    "source_title": "Unlisted Source",
    "display_source": "National Guard Bureau Publications & Forms Library - Unlisted Source",
    "display_title": "CNGBI 1400.25A: National Guard Technician and Civilian Personnel",
    "file_ext": "pdf",
    "is_revoked": false,
    "access_timestamp": "2024-07-15T19:55:03"
}
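
The version_hash values above are 64 hex characters, which is consistent with a SHA-256 digest. Below is a rough sketch of how such a hash could be derived from version_hash_raw_data, assuming a sorted-key JSON serialization. The function name `sketch_version_hash` is hypothetical, and the project's actual serialization is not shown in this PR, so the sketch is not expected to reproduce the hashes above.

```python
import hashlib
import json


def sketch_version_hash(raw_data: dict) -> str:
    """Return a SHA-256 hex digest of a dict (assumed serialization)."""
    # Sorting keys makes the digest independent of dict insertion order.
    serialized = json.dumps(raw_data, sort_keys=True)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


raw = {
    "download_url": "https://adminpubs.tradoc.army.mil/forms/TF5.pdf",
    "doc_name": "TRADOC TF5",
    "doc_num": "TF5",
    "publication_date": "2022-04-01",
    "display_title": "TRADOC Forms (TFs) TF5: Transmittal, Action and Control",
}

digest = sketch_version_hash(raw)
# A changed field (here, a hypothetical new publication date) gives a new digest.
changed = sketch_version_hash({**raw, "publication_date": "2024-01-01"})
print(len(digest), digest != changed)
```

Under this assumption, a document whose raw fields are unchanged between runs produces the same digest, which is presumably what the "In Previous Hashes" counts in the stats reflect.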

@emmarez emmarez merged commit aa7da0c into main Jul 16, 2024
1 check passed
@matthew-kersting matthew-kersting deleted the patch-stale-spiders branch July 19, 2024 19:09