Patch army g1 spider #228

Merged (2 commits) on May 8, 2024

matthew-kersting (Collaborator) commented on May 3, 2024

Description

The army_g1_pubs spider was being marked as stale by our crawler monitor. Its last successful run was on March 28, 2024. When I ran the crawler locally, I received the following error:

  File "/gc/gamechanger-crawlers/dataPipelines/gc_scrapy/gc_scrapy/spiders/army_g1_spider.py", line 84, in parse
    label_text = item.css('label[for]::text').get().strip()
AttributeError: 'NoneType' object has no attribute 'strip'

This indicated that the website had changed, so I updated the crawler to scrape the new site and to use the DocItemFields class, while keeping most of the logic from the previous iteration. I also confirmed on dev that the hashes match and that only new documents are being pulled, based on the "In Previous Hashes" value below.
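
For context on the traceback above: Scrapy's `.get()` returns `None` when a selector matches nothing, so calling `.strip()` directly on the result fails as soon as the page layout changes. The sketch below shows one way a spider could guard against that; `clean_text` is a hypothetical helper for illustration and is not part of this PR, which rewrote the spider for the new site instead.

```python
from typing import Optional


def clean_text(raw: Optional[str]) -> Optional[str]:
    """Strip whitespace; return None if the selector matched nothing or only blanks."""
    if raw is None:
        return None
    stripped = raw.strip()
    return stripped or None


# Hypothetical use inside a parse loop (selector taken from the traceback):
#     label_text = clean_text(item.css('label[for]::text').get())
#     if label_text is None:
#         continue  # skip the row, or log a warning that the layout may have changed
```

A guard like this turns a hard crash into a skipped row, which is only useful if the skip is logged; otherwise a site redesign would silently yield zero items.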

Result of Crawler Run on Dev

        Required CAC: 0
        In Previous Hashes: 58
        Item Scraped Count: 122
        Elapsed Time (sec): 457.617432
        Close Reason: finished
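
The "In Previous Hashes" counter suggests that each scraped item is hashed and compared against hashes from earlier runs. A minimal sketch of that idea follows, assuming a SHA-256 digest over a JSON serialization with sorted keys. The function names and the serialization scheme are assumptions for illustration; this is not claimed to reproduce the pipeline's actual `version_hash` values.

```python
import hashlib
import json
from typing import Any, Dict, Set


def version_hash(raw_data: Dict[str, Any]) -> str:
    """Hash the version-defining fields (assumed scheme: sorted-key JSON + SHA-256)."""
    payload = json.dumps(raw_data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_new_document(raw_data: Dict[str, Any], previous_hashes: Set[str]) -> bool:
    """Return True when the item's hash was not seen in an earlier run."""
    return version_hash(raw_data) not in previous_hashes


# Illustrative data only: a set of hashes from a hypothetical earlier run.
seen = {version_hash({"doc_name": "example-doc", "doc_num": "600-3"})}
unchanged = {"doc_num": "600-3", "doc_name": "example-doc"}  # same fields, different key order
changed = {"doc_name": "example-doc", "doc_num": "600-4"}
```

Sorting the keys keeps the digest stable when field order changes, so a rewritten spider only produces matching hashes if it emits the same field values as the previous version.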

Example Metadata

{
    "doc_name": "fa-49-orsa-da-pam-600-3-as-of-20210601",
    "doc_title": "OPERATIONS RESEARCH/SYSTEMS ANALYSIS FA",
    "doc_num": "600-3",
    "doc_type": "DA PAM",
    "display_doc_type": "DA PAM",
    "publication_date": "2022-08-03T00:00:00",
    "cac_login_required": false,
    "crawler_used": "army_g1_pubs",
    "downloadable_items": [
        {
            "doc_type": "pdf",
            "download_url": "https://api.army.mil/e2/c/downloads/2022/08/03/c783e672/fa-49-orsa-da-pam-600-3-as-of-20210601.pdf",
            "compression_type": null
        }
    ],
    "source_page_url": "https://www.army.mil/g-1",
    "source_fqdn": "www.army.mil",
    "download_url": "https://api.army.mil/e2/c/downloads/2022/08/03/c783e672/fa-49-orsa-da-pam-600-3-as-of-20210601.pdf",
    "version_hash_raw_data": {
        "doc_name": "fa-49-orsa-da-pam-600-3-as-of-20210601",
        "doc_num": "600-3",
        "publication_date": "2022-08-03T00:00:00",
        "download_url": "https://api.army.mil/e2/c/downloads/2022/08/03/c783e672/fa-49-orsa-da-pam-600-3-as-of-20210601.pdf",
        "display_title": "OPERATIONS RESEARCH/SYSTEMS ANALYSIS FA"
    },
    "version_hash": "28ae36f1df3041e1c359a635168d260eab1cc092b425e4af2ffa5a768a870cf8",
    "display_org": "Dept. of the Army",
    "data_source": "Army Publishing Directorate",
    "source_title": "G-1 Publications",
    "display_source": "Army Publishing Directorate - G-1 Publications",
    "display_title": "DA PAM 600-3: OPERATIONS RESEARCH/SYSTEMS ANALYSIS FA",
    "file_ext": "pdf",
    "is_revoked": false,
    "access_timestamp": "2024-05-03T15:05:24"
}
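
Several fields in the metadata above appear to be derived from others: `display_title` combines `doc_type`, `doc_num`, and `doc_title`; `display_source` joins `data_source` and `source_title`; and `source_fqdn`, `file_ext`, and `doc_name` follow from the URLs. The sketch below reproduces those relationships for this one example. The function `derive_fields` is hypothetical, and the actual DocItemFields implementation may differ.

```python
import posixpath
from typing import Dict
from urllib.parse import urlparse


def derive_fields(doc_type: str, doc_num: str, doc_title: str,
                  data_source: str, source_title: str,
                  source_page_url: str, download_url: str) -> Dict[str, str]:
    """Derive display and file fields from the base fields (assumed rules)."""
    filename = posixpath.basename(urlparse(download_url).path)
    stem, ext = posixpath.splitext(filename)
    return {
        "doc_name": stem,
        "file_ext": ext.lstrip("."),
        "source_fqdn": urlparse(source_page_url).netloc,
        "display_title": f"{doc_type} {doc_num}: {doc_title}",
        "display_source": f"{data_source} - {source_title}",
    }


# Values copied from the example metadata above.
fields = derive_fields(
    doc_type="DA PAM",
    doc_num="600-3",
    doc_title="OPERATIONS RESEARCH/SYSTEMS ANALYSIS FA",
    data_source="Army Publishing Directorate",
    source_title="G-1 Publications",
    source_page_url="https://www.army.mil/g-1",
    download_url="https://api.army.mil/e2/c/downloads/2022/08/03/c783e672/"
                 "fa-49-orsa-da-pam-600-3-as-of-20210601.pdf",
)
```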

Antsega (Contributor) left a comment

tested and approved

Antsega merged commit ff85a6a into main on May 8, 2024
1 check passed
matthew-kersting deleted the patch-army-g1-spider branch on May 17, 2024, 12:29