Patch navy med spider #232

Merged: 5 commits merged into main on Jun 12, 2024

@matthew-kersting (Collaborator) commented May 23, 2024

Description

The Navy Med Pubs spider has been reported as stale by the crawler status monitor job. It was last successfully run on May 8th, 2024.

![WARNING] - MONITORING: CRAWLERS OVERDUE
test_crawler was last run Apr 20 2022
navy_med_pubs was last run May 08 2024
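
For reference, a minimal sketch of how lines in this report could be turned into an age in days. It assumes the `<name> was last run <Mon DD YYYY>` layout shown above; `parse_last_run` and `days_since_last_run` are hypothetical helpers, not part of the monitor job, and the threshold the monitor actually uses for "overdue" is not shown here.

```python
from datetime import date, datetime


def parse_last_run(line: str) -> tuple[str, date]:
    """Split a report line into (crawler name, last run date)."""
    name, sep, stamp = line.partition(" was last run ")
    if not sep:
        raise ValueError(f"unrecognized monitor line: {line!r}")
    # "%b %d %Y" matches stamps such as "May 08 2024" (English month names).
    return name.strip(), datetime.strptime(stamp.strip(), "%b %d %Y").date()


def days_since_last_run(line: str, today: date) -> tuple[str, int]:
    """Return (crawler name, whole days between the last run and `today`)."""
    name, last_run = parse_last_run(line)
    return name, (today - last_run).days


monitor_lines = [
    "test_crawler was last run Apr 20 2022",
    "navy_med_pubs was last run May 08 2024",
]
as_of = date(2024, 5, 23)  # the date this description was posted
for line in monitor_lines:
    crawler, age = days_since_last_run(line, as_of)
    print(f"{crawler}: {age} days since last run")
```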

To patch this spider, I updated some selectors and pulled in Beautiful Soup for parsing. To confirm my updates, I compared the output of the patched crawler against the existing metadata files in S3.
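
A rough sketch of the kind of field-by-field comparison described here, assuming both sets of metadata are already loaded as lists of dicts (for example via `json.load`) and can be matched on `doc_name`. The helper names and the placeholder URLs are made up for illustration; this is not the exact script used for the check.

```python
from typing import Any

Record = dict[str, Any]


def index_by_name(records: list[Record]) -> dict[str, Record]:
    """Key each metadata record by its doc_name."""
    return {rec["doc_name"]: rec for rec in records}


def diff_records(old: Record, new: Record) -> dict[str, tuple[Any, Any]]:
    """Return {field: (old_value, new_value)} for every field that differs.

    A field missing on one side is reported with None on that side.
    """
    changed = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            changed[key] = (old.get(key), new.get(key))
    return changed


def diff_runs(old_records: list[Record], new_records: list[Record]) -> dict[str, dict]:
    """Diff every document that appears in both the old and the new output."""
    old_by_name = index_by_name(old_records)
    new_by_name = index_by_name(new_records)
    report = {}
    for name in sorted(set(old_by_name) & set(new_by_name)):
        changes = diff_records(old_by_name[name], new_by_name[name])
        if changes:
            report[name] = changes
    return report


# Placeholder records: the URLs below are invented, not real crawler output.
old_run = [{"doc_name": "BUMEDNOTE 6000", "download_url": "https://example.invalid/old.pdf"}]
new_run = [{
    "doc_name": "BUMEDNOTE 6000",
    "download_url": "https://example.invalid/new.pdf",
    "display_org": "US Navy Medicine",
}]
for doc, changes in diff_runs(old_run, new_run).items():
    print(doc, sorted(changes))
```

Documents that exist on only one side are ignored by this sketch; listing them separately would be a natural extension.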

[Screenshot 2024-06-11 at 2:05:42 PM]
[Screenshot 2024-06-11 at 2:08:05 PM]

From the diffs shown, every changed field appears to be either a change in the download URL or a welcome enrichment, since the old files lack some of the fields we would like to support. Because the version hashes do appear to change, we will have to run deletions on dev and prod before deploying this crawler update.

Note: This will require running a delete on the navy_med_pubs crawler docs currently in prod.
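
To illustrate why a changed download URL forces the deletions, here is a sketch of a version hash computed from `version_hash_raw_data`. The hashing scheme below (SHA-256 over key-sorted JSON) is an assumption made for illustration; the crawler's real hashing routine may differ, so this is not expected to reproduce the `version_hash` values in the example metadata. `sketch_version_hash` is a hypothetical name.

```python
import hashlib
import json


def sketch_version_hash(raw_data: dict) -> str:
    """Hash the fields that define a document version.

    ASSUMPTION: SHA-256 over compact, key-sorted JSON. The actual crawler
    may serialize or hash differently.
    """
    canonical = json.dumps(raw_data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


raw = {
    "doc_name": "BUMEDNOTE 6000",
    "doc_num": "6000",
    "publication_date": "2023-01-13T00:00:00",
    "download_url": "https://www.med.navy.mil/LinkClick.aspx?fileticket=Ux6PcveH878%3d&tabid=13484&portalid=62&mid=46796",
    "display_title": "BUMEDNOTE 6000: HIGH RELIABILITY ORGANIZATION IMPLEMENTATION",
}
# Same document with a made-up URL, standing in for a pre-patch record.
moved = dict(raw, download_url="https://example.invalid/moved.pdf")

# Any change to a hashed field yields a different digest, so the already
# ingested docs would no longer match the new crawl by hash.
print(sketch_version_hash(raw) == sketch_version_hash(moved))
```

Since `download_url` is one of the hashed fields, none of the re-crawled documents match their previous hashes, which is consistent with `In Previous Hashes: 0` in the dev run.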

Result of Crawler Run on Dev

 navy_med_pubs
        Required CAC: 1
        In Previous Hashes: 0
        Item Scraped Count: 403
        Elapsed Time (sec): 1359.413248
        Close Reason: finished

Example Metadata

{
    "doc_name": "BUMEDNOTE 6000",
    "doc_title": "HIGH RELIABILITY ORGANIZATION IMPLEMENTATION",
    "doc_num": "6000",
    "doc_type": "BUMEDNOTE",
    "display_doc_type": "Document",
    "publication_date": "2023-01-13T00:00:00",
    "cac_login_required": false,
    "crawler_used": "navy_med_pubs",
    "downloadable_items": [
        {
            "doc_type": "pdf",
            "download_url": "https://www.med.navy.mil/LinkClick.aspx?fileticket=Ux6PcveH878%3d&tabid=13484&portalid=62&mid=46796",
            "compression_type": null
        }
    ],
    "source_page_url": "https://www.med.navy.mil/Directives/",
    "source_fqdn": "www.med.navy.mil",
    "download_url": "https://www.med.navy.mil/LinkClick.aspx?fileticket=Ux6PcveH878%3d&tabid=13484&portalid=62&mid=46796",
    "version_hash_raw_data": {
        "doc_name": "BUMEDNOTE 6000",
        "doc_num": "6000",
        "publication_date": "2023-01-13T00:00:00",
        "download_url": "https://www.med.navy.mil/LinkClick.aspx?fileticket=Ux6PcveH878%3d&tabid=13484&portalid=62&mid=46796",
        "display_title": "BUMEDNOTE 6000: HIGH RELIABILITY ORGANIZATION IMPLEMENTATION"
    },
    "version_hash": "38eebb856c0fe462a77fee05395f3356bab286a18d4949e3ad9073448e3150e4",
    "display_org": "US Navy Medicine",
    "data_source": "Navy Medicine",
    "source_title": "Unlisted Source",
    "display_source": "Navy Medicine - Unlisted Source",
    "display_title": "BUMEDNOTE 6000: HIGH RELIABILITY ORGANIZATION IMPLEMENTATION",
    "file_ext": "pdf",
    "is_revoked": false,
    "access_timestamp": "2024-06-11T17:45:56"
},
{
    "doc_name": "BUMEDINST 6470.10C",
    "doc_title": "MANAGEMENT OF IRRADIATED OR RADIOACTIVELY CONTAMINATED PERSONNEL",
    "doc_num": "6470.10C",
    "doc_type": "BUMEDINST",
    "display_doc_type": "Document",
    "publication_date": "2021-02-23T00:00:00",
    "cac_login_required": false,
    "crawler_used": "navy_med_pubs",
    "downloadable_items": [
        {
            "doc_type": "pdf",
            "download_url": "https://www.med.navy.mil/LinkClick.aspx?fileticket=5eDzwbXitFc%3d&tabid=13484&portalid=62&mid=46760",
            "compression_type": null
        }
    ],
    "source_page_url": "https://www.med.navy.mil/Directives/",
    "source_fqdn": "www.med.navy.mil",
    "download_url": "https://www.med.navy.mil/LinkClick.aspx?fileticket=5eDzwbXitFc%3d&tabid=13484&portalid=62&mid=46760",
    "version_hash_raw_data": {
        "doc_name": "BUMEDINST 6470.10C",
        "doc_num": "6470.10C",
        "publication_date": "2021-02-23T00:00:00",
        "download_url": "https://www.med.navy.mil/LinkClick.aspx?fileticket=5eDzwbXitFc%3d&tabid=13484&portalid=62&mid=46760",
        "display_title": "BUMEDINST 6470.10C: MANAGEMENT OF IRRADIATED OR RADIOACTIVELY CONTAMINATED PERSONNEL"
    },
    "version_hash": "6f80815c6f3fc1ea22768c92e30bb5419ec408008426849bdddaf912ca2e0423",
    "display_org": "US Navy Medicine",
    "data_source": "Navy Medicine",
    "source_title": "Unlisted Source",
    "display_source": "Navy Medicine - Unlisted Source",
    "display_title": "BUMEDINST 6470.10C: MANAGEMENT OF IRRADIATED OR RADIOACTIVELY CONTAMINATED PERSONNEL",
    "file_ext": "pdf",
    "is_revoked": false,
    "access_timestamp": "2024-06-11T17:40:50"
},
{
    "doc_name": "NAVMED P-5010-1",
    "doc_title": "NAVMED P-5010, CHAPTER 1, TRI-SERVICE FOOD CODE (TB MED 530/NAVMED P-5010-1/AFMAN 48-147_IP)",
    "doc_num": "P-5010-1",
    "doc_type": "NAVMED",
    "display_doc_type": "Document",
    "publication_date": "2019-03-01T00:00:00",
    "cac_login_required": false,
    "crawler_used": "navy_med_pubs",
    "downloadable_items": [
        {
            "doc_type": "pdf",
            "download_url": "https://www.med.navy.mil/LinkClick.aspx?fileticket=5iwhQW3ku5Y%3d&tabid=13484&portalid=62&mid=51330",
            "compression_type": null
        }
    ],
    "source_page_url": "https://www.med.navy.mil/Directives/",
    "source_fqdn": "www.med.navy.mil",
    "download_url": "https://www.med.navy.mil/LinkClick.aspx?fileticket=5iwhQW3ku5Y%3d&tabid=13484&portalid=62&mid=51330",
    "version_hash_raw_data": {
        "doc_name": "NAVMED P-5010-1",
        "doc_num": "P-5010-1",
        "publication_date": "2019-03-01T00:00:00",
        "download_url": "https://www.med.navy.mil/LinkClick.aspx?fileticket=5iwhQW3ku5Y%3d&tabid=13484&portalid=62&mid=51330",
        "display_title": "NAVMED P-5010-1: NAVMED P-5010, CHAPTER 1, TRI-SERVICE FOOD CODE (TB MED 530/NAVMED P-5010-1/AFMAN 48-147_IP)"
    },
    "version_hash": "27a26a85145c25cc6f6e2ebd2e7af1846352923d97e6e8e940129e3f89f3742d",
    "display_org": "US Navy Medicine",
    "data_source": "Navy Medicine",
    "source_title": "Unlisted Source",
    "display_source": "Navy Medicine - Unlisted Source",
    "display_title": "NAVMED P-5010-1: NAVMED P-5010, CHAPTER 1, TRI-SERVICE FOOD CODE (TB MED 530/NAVMED P-5010-1/AFMAN 48-147_IP)",
    "file_ext": "pdf",
    "is_revoked": false,
    "access_timestamp": "2024-06-11T17:47:03"
}
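
The three records above follow a few visible patterns (for example, `display_title` is `doc_name` plus `doc_title`). A small sanity check along these lines can help when reviewing output. The rules are inferred only from these three examples, and `check_record` is a hypothetical helper rather than part of the crawler; the sample is a trimmed copy of the first record.

```python
from urllib.parse import urlparse


def check_record(rec: dict) -> list[str]:
    """Return the names of derived fields that break the observed patterns."""
    problems = []
    if rec["doc_name"] != f'{rec["doc_type"]} {rec["doc_num"]}':
        problems.append("doc_name")
    if rec["display_title"] != f'{rec["doc_name"]}: {rec["doc_title"]}':
        problems.append("display_title")
    if rec["display_source"] != f'{rec["data_source"]} - {rec["source_title"]}':
        problems.append("display_source")
    if rec["source_fqdn"] != urlparse(rec["source_page_url"]).netloc:
        problems.append("source_fqdn")
    # Every hashed field should mirror the top-level field of the same name.
    for key, value in rec["version_hash_raw_data"].items():
        if rec.get(key) != value:
            problems.append(f"version_hash_raw_data.{key}")
    return problems


sample = {
    "doc_name": "BUMEDNOTE 6000",
    "doc_title": "HIGH RELIABILITY ORGANIZATION IMPLEMENTATION",
    "doc_num": "6000",
    "doc_type": "BUMEDNOTE",
    "publication_date": "2023-01-13T00:00:00",
    "source_page_url": "https://www.med.navy.mil/Directives/",
    "source_fqdn": "www.med.navy.mil",
    "download_url": "https://www.med.navy.mil/LinkClick.aspx?fileticket=Ux6PcveH878%3d&tabid=13484&portalid=62&mid=46796",
    "version_hash_raw_data": {
        "doc_name": "BUMEDNOTE 6000",
        "doc_num": "6000",
        "publication_date": "2023-01-13T00:00:00",
        "download_url": "https://www.med.navy.mil/LinkClick.aspx?fileticket=Ux6PcveH878%3d&tabid=13484&portalid=62&mid=46796",
        "display_title": "BUMEDNOTE 6000: HIGH RELIABILITY ORGANIZATION IMPLEMENTATION",
    },
    "data_source": "Navy Medicine",
    "source_title": "Unlisted Source",
    "display_source": "Navy Medicine - Unlisted Source",
    "display_title": "BUMEDNOTE 6000: HIGH RELIABILITY ORGANIZATION IMPLEMENTATION",
}
print(check_record(sample))
```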

@matthew-kersting matthew-kersting marked this pull request as ready for review June 11, 2024 18:41
@emmarez emmarez merged commit df5e863 into main Jun 12, 2024
1 check passed