Patch HASC Crawler #235

Merged
merged 5 commits into from
Jun 26, 2024

@matthew-kersting (Collaborator) commented Jun 25, 2024

Description

The crawler status tracker reported the HASC crawler as stale:

![WARNING] - MONITORING: CRAWLERS OVERDUE
test_crawler was last run Apr 20 2022
HASC was last run Jun 12 2024

After some investigation, I found that the site had been updated, so the crawler needed changes to traverse pages correctly and its CSS selectors needed updating.

There was also a minor issue where the AM/PM marker was not picked up when parsing the publication date, so I fixed that.
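As a rough illustration of the AM/PM issue (not the crawler's actual code), here is a minimal sketch of parsing a 12-hour date string into the ISO timestamp used in the metadata. The input format `"%B %d, %Y %I:%M %p"` and the function name `parse_publication_date` are assumptions; the site's real date format may differ.

```python
from datetime import datetime

# Hypothetical input format; the real site markup and date format may differ.
DATE_FORMAT = "%B %d, %Y %I:%M %p"


def parse_publication_date(raw: str) -> str:
    """Parse a 12-hour date string and return an ISO 8601 style timestamp."""
    cleaned = " ".join(raw.split())  # collapse stray whitespace and newlines
    parsed = datetime.strptime(cleaned, DATE_FORMAT)  # %I with %p honors AM/PM
    return parsed.strftime("%Y-%m-%dT%H:%M:%S")


print(parse_publication_date("February 8, 2023 3:00 PM"))  # 2023-02-08T15:00:00
```

If the AM/PM marker is dropped before parsing, a 3:00 PM hearing would be read as 03:00, which is the kind of error this change is meant to avoid.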

I also refactored the code to match the other newer crawlers.

Here are two examples of the metadata now compared with previously (new on the left):
[Screenshot 2024-06-25 at 3 28 39 PM]
[Screenshot 2024-06-25 at 3 30 41 PM]

Result of Crawler Run on Dev

[STATS] Crawlers run:
 HASC
        Required CAC: 0
        In Previous Hashes: 0
        Item Scraped Count: 179
        Elapsed Time (sec): 1239.844171
        Close Reason: finished

Example Metadata

{
    "doc_name": "Seth_20Jones_20Written_20Testimony_20Role_20of_20Special_20Operations_20in_20GPC_206_20February_202023",
    "doc_title": "ISO Hearing: The Role of Special Operations Forces in Great Power Competition",
    "doc_num": " ",
    "doc_type": "Witness Statement",
    "display_doc_type": "Witness Statement",
    "publication_date": "2023-02-08T15:00:00",
    "cac_login_required": false,
    "crawler_used": "HASC",
    "downloadable_items": [
        {
            "doc_type": "pdf",
            "download_url": "http://armedservices.house.gov/sites/evo-subsites/republicans-armedservices.house.gov/files/Seth%20Jones%20Written%20Testimony%20Role%20of%20Special%20Operations%20in%20GPC%206%20February%202023.pdf",
            "compression_type": null
        }
    ],
    "source_page_url": "https://armedservices.house.gov/committee-activity/hearings/iso-hearing-role-special-operations-forces-great-power-competition",
    "source_fqdn": "armedservices.house.gov",
    "download_url": "http://armedservices.house.gov/sites/evo-subsites/republicans-armedservices.house.gov/files/Seth%20Jones%20Written%20Testimony%20Role%20of%20Special%20Operations%20in%20GPC%206%20February%202023.pdf",
    "version_hash_raw_data": {
        "doc_name": "Seth_20Jones_20Written_20Testimony_20Role_20of_20Special_20Operations_20in_20GPC_206_20February_202023",
        "publication_date": "2023-02-08T15:00:00",
        "download_url": "http://armedservices.house.gov/sites/evo-subsites/republicans-armedservices.house.gov/files/Seth%20Jones%20Written%20Testimony%20Role%20of%20Special%20Operations%20in%20GPC%206%20February%202023.pdf",
        "display_title": "HASC ISO Hearing: The Role of Special Operations Forces in Great Power Competition - Seth Jones",
        "doc_title": "ISO Hearing: The Role of Special Operations Forces in Great Power Competition"
    },
    "version_hash": "75f660b82c046314752dd1853761f6a9ff1a6ddb4fbbf8ff0a79da30df07b9b0",
    "display_org": "Congress",
    "data_source": "House Armed Services Committee Publications",
    "source_title": "House Armed Services Committee",
    "display_source": "House Armed Services Committee Publications - House Armed Services Committee",
    "display_title": "HASC ISO Hearing: The Role of Special Operations Forces in Great Power Competition - Seth Jones",
    "file_ext": "pdf",
    "is_revoked": false,
    "access_timestamp": "2024-06-25T17:35:40"
}
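
For anyone comparing fields in the example, `doc_name` appears to be the file name from `download_url` with the extension dropped and `%` replaced by `_`. The sketch below reproduces that rule as inferred from this one example. The `version_hash` helper is purely an assumption (SHA-256 over sorted JSON of `version_hash_raw_data`); the project's real hashing may differ, so it is not expected to reproduce the hash shown above. Both function names are hypothetical.

```python
import hashlib
import json
from pathlib import PurePosixPath
from urllib.parse import urlparse


def doc_name_from_url(download_url: str) -> str:
    """Derive a doc_name from the URL's file name (rule inferred from the example)."""
    filename = PurePosixPath(urlparse(download_url).path).name
    stem = filename.rsplit(".", 1)[0]  # drop the file extension, if any
    return stem.replace("%", "_")      # percent-escapes stay encoded, e.g. %20 -> _20


def version_hash(raw_data: dict) -> str:
    """Hash the version fields. SHA-256 over sorted JSON is an assumption."""
    canonical = json.dumps(raw_data, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


url = (
    "http://armedservices.house.gov/sites/evo-subsites/"
    "republicans-armedservices.house.gov/files/"
    "Seth%20Jones%20Written%20Testimony%20Role%20of%20Special%20Operations"
    "%20in%20GPC%206%20February%202023.pdf"
)
print(doc_name_from_url(url))
```

Sorting the keys before hashing keeps the digest stable regardless of the order in which fields are added to the dictionary.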

@matthew-kersting marked this pull request as ready for review June 26, 2024 12:42
@emmarez merged commit 71d7878 into main Jun 26, 2024
1 check passed
@matthew-kersting deleted the patch-hasc-crawler branch June 26, 2024 13:10