New UFC Spider #234

Merged on Jun 7, 2024 (10 commits)
@matthew-kersting (Collaborator) commented Jun 3, 2024

Description

This PR adds a new spider that crawls the documents listed at https://www.wbdg.org/ffc/dod/unified-facilities-criteria-ufc and https://www.wbdg.org/ffc/dod/unified-facilities-guide-specifications-ufgs. For both start URLs, the part of the site we want to scrape is a table. We originally thought we would need the Selenium spider to interact with the table navigation, but I found that we can move through the table pages by manipulating the URLs. For each table entry, the spider visits the linked page to get the metadata and the URL of the PDF.
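As a rough illustration of paging through the table by URL rather than through Selenium, here is a minimal sketch. The query parameter name `page` and the helper `page_url` are assumptions made for the example; the actual parameter used by the spider is not stated in this description.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

START_URLS = [
    "https://www.wbdg.org/ffc/dod/unified-facilities-criteria-ufc",
    "https://www.wbdg.org/ffc/dod/unified-facilities-guide-specifications-ufgs",
]


def page_url(base_url: str, page: int, param: str = "page") -> str:
    """Return base_url with the pagination parameter set to `page`.

    `param` is a hypothetical name; swap in whatever the site actually uses.
    Existing query parameters are preserved.
    """
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))
    query[param] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


# Build the first three page URLs for each start URL.
urls = [page_url(start, n) for start in START_URLS for n in range(3)]
```

A spider could request these URLs in sequence and stop when a page returns no table rows, which avoids driving a browser just to click "next".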

All of the documents have been assigned doc_type = "Document".

Each document has two possible dates, "Date" and "Revision Date". Since "Date" is present on every row, I decided to use it as the publication_date.
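The date choice can be sketched as follows. The row is modeled as a plain dict, and the input format `%m-%d-%Y` is an assumption for illustration only; the output format matches the publication_date shown in the example metadata.

```python
from datetime import datetime


def publication_date(row: dict, fmt: str = "%m-%d-%Y") -> str:
    """Convert the row's "Date" cell to an ISO 8601 timestamp string.

    "Revision Date" is deliberately ignored because it is not present on
    every row. `fmt` is a hypothetical input format; adjust it to whatever
    the table actually contains.
    """
    raw = row["Date"].strip()
    return datetime.strptime(raw, fmt).isoformat()


# Hypothetical row with an empty "Revision Date" cell.
example = publication_date({"Date": "08-01-2016", "Revision Date": ""})
```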

Two pages required specialized scraping functions.

Result of Crawler Run on Dev

 UFC
        Required CAC: 0
        In Previous Hashes: 0
        Item Scraped Count: 948
        Elapsed Time (sec): 418.37702
        Close Reason: finished

Example Metadata

{
      "doc_name": "UFGS 06 20 00 Finish Carpentry",
      "doc_title": "Finish Carpentry",
      "doc_num": "06 20 00",
      "doc_type": "Document",
      "display_doc_type": "Document",
      "publication_date": "2016-08-01T00:00:00",
      "cac_login_required": false,
      "crawler_used": "UFC",
      "downloadable_items": [
          {
              "doc_type": "pdf",
              "download_url": "https://wbdg.org/FFC/DOD/UFGS/UFGS%2006%2020%2000.pdf",
              "compression_type": null
          }
      ],
      "source_page_url": "https://wbdg.org/ffc/dod/unified-facilities-guide-specifications-ufgs/ufgs-06-20-00",
      "source_fqdn": "wbdg.org",
      "download_url": "https://wbdg.org/FFC/DOD/UFGS/UFGS%2006%2020%2000.pdf",
      "version_hash_raw_data": {
          "doc_name": "UFGS 06 20 00 Finish Carpentry",
          "doc_num": "06 20 00",
          "publication_date": "2016-08-01T00:00:00",
          "download_url": "https://wbdg.org/FFC/DOD/UFGS/UFGS%2006%2020%2000.pdf",
          "display_title": "UFGS 06 20 00 Finish Carpentry"
      },
      "version_hash": "5ac18243bd338129f21607e2aa7b1b6ef092fb8b5b8353a3e686505fe03200f0",
      "display_org": "Department of Defense",
      "data_source": "Whole Building Design Guide",
      "source_title": "Unified Facilities Criteria",
      "display_source": "Whole Building Design Guide - Unified Facilities Criteria",
      "display_title": "UFGS 06 20 00 Finish Carpentry",
      "file_ext": "pdf",
      "is_revoked": false,
      "access_timestamp": "2024-06-06T17:59:19"
  }
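The version_hash above is 64 hexadecimal characters, which is the length of a SHA-256 digest. How the crawler actually derives it from version_hash_raw_data is not described here, so the following is only a sketch under the assumption that the raw data is serialized as key-sorted JSON and hashed with SHA-256. It is not expected to reproduce the hash shown above.

```python
import hashlib
import json


def version_hash(raw_data: dict) -> str:
    """Hash the version-defining fields of a document.

    Assumption: key-sorted JSON serialization followed by SHA-256. The real
    crawler may serialize the fields differently.
    """
    serialized = json.dumps(raw_data, sort_keys=True)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


# Fields copied from version_hash_raw_data in the example metadata.
raw = {
    "doc_name": "UFGS 06 20 00 Finish Carpentry",
    "doc_num": "06 20 00",
    "publication_date": "2016-08-01T00:00:00",
    "download_url": "https://wbdg.org/FFC/DOD/UFGS/UFGS%2006%2020%2000.pdf",
    "display_title": "UFGS 06 20 00 Finish Carpentry",
}
digest = version_hash(raw)
```

Sorting the keys makes the digest independent of field order, so a document is only treated as a new version when one of these field values changes.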

@matthew-kersting marked this pull request as ready for review June 4, 2024 19:11
@emmarez merged commit 77ab81e into main Jun 7, 2024
1 check passed
@matthew-kersting deleted the new-spider-ufc branch June 7, 2024 18:38