Patch IC Spider #230

Merged
merged 5 commits into from
May 16, 2024

Conversation

matthew-kersting
Collaborator

Description

The IC policies spider has been stale since April 6, 2024. Testing locally, I found that the crawler was receiving 403 responses when requesting the pages that list all the document links. After some experimentation, I found that visiting these pages directly resolved the issue.

I also took this chance to refactor the spider to use the new DocItemFields class and to gather a new document on the website titled "IC Legal Reference Book 2020".

Result of Crawler Run on Dev

[STATS] Crawlers run:
 ic_policies
        Required CAC: 0
        In Previous Hashes: 101
        Item Scraped Count: 105
        Elapsed Time (sec): 43.793115
        Close Reason: finished

Example Metadata

{
    "doc_name": "ICD 107",
    "doc_title": "Civil Liberties, Privacy, and Transparency",
    "doc_num": "107",
    "doc_type": "ICD",
    "display_doc_type": "Directive",
    "publication_date": null,
    "cac_login_required": false,
    "crawler_used": "ic_policies",
    "downloadable_items": [
        {
            "doc_type": "pdf",
            "download_url": "https://www.dni.gov/files/documents/ICD/ICD-107.pdf",
            "compression_type": null
        }
    ],
    "source_page_url": "https://www.dni.gov/index.php/what-we-do/ic-related-menus/ic-related-links/intelligence-community-directives",
    "source_fqdn": "www.dni.gov",
    "download_url": "https://www.dni.gov/files/documents/ICD/ICD-107.pdf",
    "version_hash_raw_data": {
        "doc_name": "ICD 107",
        "doc_num": "107",
        "publication_date": null,
        "download_url": "https://www.dni.gov/files/documents/ICD/ICD-107.pdf",
        "display_title": "ICD 107: Civil Liberties, Privacy, and Transparency"
    },
    "version_hash": "0ee9043e47d457973fb1989f03e8a628697beeea92e17b775e94b93782bdd574",
    "display_org": "Intelligence Community",
    "data_source": "Office of Director of National Intelligence",
    "source_title": "Unlisted Source",
    "display_source": "Office of Director of National Intelligence - Unlisted Source",
    "display_title": "ICD 107: Civil Liberties, Privacy, and Transparency",
    "file_ext": "pdf",
    "is_revoked": false,
    "access_timestamp": "2024-05-15T18:41:36"
}
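
For reviewers, here is a rough sketch of how the derived fields in the example metadata relate to the scraped ones. This is not the DocItemFields implementation: `derive_fields` is a hypothetical helper, and the hashing step is an assumption. The 64-character hex value is consistent with SHA-256, but the serialization used by the real pipeline is not shown in this PR, so this sketch will not reproduce the `version_hash` above.

```python
import hashlib
import json
from os.path import splitext
from urllib.parse import urlparse


def derive_fields(doc_type, doc_num, doc_title, download_url,
                  source_page_url, publication_date=None):
    """Hypothetical helper: build the derived metadata fields from scraped values."""
    doc_name = f"{doc_type} {doc_num}"
    display_title = f"{doc_name}: {doc_title}"

    # Host name of the page the link was found on.
    source_fqdn = urlparse(source_page_url).netloc

    # File extension of the download target, without the leading dot.
    file_ext = splitext(urlparse(download_url).path)[1].lstrip(".").lower()

    # Same keys as "version_hash_raw_data" in the example metadata.
    version_hash_raw_data = {
        "doc_name": doc_name,
        "doc_num": doc_num,
        "publication_date": publication_date,
        "download_url": download_url,
        "display_title": display_title,
    }

    # ASSUMPTION: SHA-256 over sorted-key JSON. The real serialization may differ.
    serialized = json.dumps(version_hash_raw_data, sort_keys=True).encode("utf-8")
    version_hash = hashlib.sha256(serialized).hexdigest()

    return {
        "doc_name": doc_name,
        "display_title": display_title,
        "source_fqdn": source_fqdn,
        "file_ext": file_ext,
        "version_hash_raw_data": version_hash_raw_data,
        "version_hash": version_hash,
    }


fields = derive_fields(
    doc_type="ICD",
    doc_num="107",
    doc_title="Civil Liberties, Privacy, and Transparency",
    download_url="https://www.dni.gov/files/documents/ICD/ICD-107.pdf",
    source_page_url=(
        "https://www.dni.gov/index.php/what-we-do/ic-related-menus/"
        "ic-related-links/intelligence-community-directives"
    ),
)
```

Because the hash covers only the fields in `version_hash_raw_data`, a change to any of them (for example a new `download_url`) would produce a new version, while changes to fields such as `access_timestamp` would not.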

@matthew-kersting matthew-kersting marked this pull request as draft May 15, 2024 20:33
@matthew-kersting matthew-kersting marked this pull request as ready for review May 16, 2024 12:23
Contributor

@Antsega Antsega left a comment


nice docker and crawler fix!

@Antsega Antsega merged commit f44c46f into main May 16, 2024
1 check passed
@matthew-kersting matthew-kersting deleted the patch-ic-spider branch May 17, 2024 12:29

2 participants