Watergate

What is this?

This repository contains Python scripts for text processing related to publicly released Watergate transcripts.

Nixon Library documents: http://nixon.archives.gov/forresearchers/find/tapes/index.php

Watergate House Judiciary Committee documents: http://watergate.info/impeachment/judiciary-committee-hearings

Ebook of Transcripts of Eight Presidential Tapes from Nixon to House Judiciary Committee: http://www.ebooksread.com/authors-eng/richard-m-richard-milhous-nixon/transcripts-of-eight-recorded-presidential-conversations-hearings-before-the-co-oxi.shtml

National Archives Watergate Files: http://www.archives.gov/research/investigations/watergate/

Nixon Grand Jury Testimony: http://www.archives.gov/research/investigations/watergate/nixon-grand-jury/

Watergate Tape Transcripts Background

The Nixon Library continues to release audio, and text transcripts of that audio, from the Nixon White House taping system. This repository is an attempt to overcome some of the awkward formatting of the transcripts the Library provides.

This repository provides scripts for scraping transcripts, running them through a PDF-to-text library, and munging the resulting text into a serializable JSON format.

How does it work?

The repository contains Python scripts that will scrape a website for PDF files, download those PDF files, turn them into text, process the text, and output a JSON data structure.

As a proof of concept, this JSON structure will be hooked up to Olipy to create a Twitter bot agent of chaos, but the scripts will presumably have many more sophisticated uses.

Dependencies

Beautiful Soup

This is used to parse HTML and find links to the PDF files. Link: http://www.crummy.com/software/BeautifulSoup/

pip install beautifulsoup4

Urllib

This is used to download the PDF files. It is part of the Python standard library, so it does not need to be installed with pip.

Step 1: Finding Primary Text Sources

I'm using the following websites for PDF files of transcripts:

http://nixon.archives.gov/forresearchers/find/tapes/watergate/trial/transcripts.php

http://nixon.archives.gov/forresearchers/find/tapes/watergate/wspf/transcripts.php

Step 2: Scraping PDFs

Scrape the HTML, look for links to PDFs, build a list of those links, and download each item in the list.

BeautifulSoup is used to parse the HTML for links, and urllib is used to download the PDFs.
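
A minimal sketch of this step, assuming Python 3 and the trial-transcripts index page listed in Step 1; the repository's actual scripts may organize this differently:

    # Sketch only: find links to PDF transcripts on an index page and download
    # each one. Assumes Python 3 (urllib.request) and that the transcript links
    # end in ".pdf"; the real pages may need different selectors.
    import os
    from urllib.parse import urljoin
    from urllib.request import urlopen, urlretrieve

    from bs4 import BeautifulSoup

    INDEX_URL = "http://nixon.archives.gov/forresearchers/find/tapes/watergate/trial/transcripts.php"

    def scrape_pdf_links(index_url):
        """Return absolute URLs for every PDF linked from the index page."""
        soup = BeautifulSoup(urlopen(index_url).read(), "html.parser")
        hrefs = [a["href"] for a in soup.find_all("a", href=True)]
        return [urljoin(index_url, h) for h in hrefs if h.lower().endswith(".pdf")]

    def download_pdfs(pdf_urls, dest_dir="pdfs"):
        """Download each PDF into dest_dir and return the local file paths."""
        os.makedirs(dest_dir, exist_ok=True)
        paths = []
        for url in pdf_urls:
            local = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
            urlretrieve(url, local)
            paths.append(local)
        return paths

    if __name__ == "__main__":
        download_pdfs(scrape_pdf_links(INDEX_URL))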

Step 3: Extract PDF Text to JSON

This step uses a Python PDF utility to turn each PDF into text, which works fairly well.

The text is then further processed into JSON pieces.
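
The PDF utility is not named above, so the sketch below substitutes pdfminer.six and uses a naive speaker-prefix regex as one possible way to cut the text into JSON pieces:

    # Sketch only: the PDF utility used by this repository is not named above,
    # so pdfminer.six's extract_text (pip install pdfminer.six) stands in here.
    # The speaker regex is a naive illustration of cutting a transcript into
    # JSON pieces, keyed on lines like "PRESIDENT: ..." or "DEAN: ...".
    import glob
    import json
    import re

    from pdfminer.high_level import extract_text

    SPEAKER = re.compile(r"^([A-Z][A-Z .]+):\s*(.*)$")

    def pdf_to_text(pdf_path):
        """Extract plain text from one PDF transcript."""
        return extract_text(pdf_path)

    def text_to_json(text):
        """Turn raw transcript text into a list of {"speaker", "line"} dicts."""
        pieces, current = [], None
        for raw in text.splitlines():
            line = raw.strip()
            if not line:
                continue
            match = SPEAKER.match(line)
            if match:
                current = {"speaker": match.group(1).strip(), "line": match.group(2)}
                pieces.append(current)
            elif current:
                current["line"] += " " + line  # continuation of the previous utterance
        return pieces

    if __name__ == "__main__":
        records = []
        for path in glob.glob("pdfs/*.pdf"):
            records.extend(text_to_json(pdf_to_text(path)))
        with open("transcripts.json", "w") as f:
            json.dump(records, f, indent=2)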

Step 4: Olipy the JSON

This comes at the end, but it should be really easy. Proof: http://charlesmartinreid.com/wiki/Apollo11Junk
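
For illustration only, and without touching Olipy's API, the final step might look something like this, reading the transcripts.json written by the Step 3 sketch:

    # Sketch only: this does not use Olipy's API; it just shows the shape of the
    # final step, pulling a random piece out of the JSON from Step 3 and trimming
    # it to Twitter's 280-character limit.
    import json
    import random

    def random_tweet(json_path="transcripts.json", limit=280):
        with open(json_path) as f:
            pieces = json.load(f)
        piece = random.choice(pieces)
        return "{}: {}".format(piece["speaker"], piece["line"])[:limit]

    if __name__ == "__main__":
        print(random_tweet())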