S&P 500 Data Collection and Storage Docker Container

This Docker container packages a deployment flow written in Python and built with Prefect.io, a workflow automation and orchestration platform. The flow retrieves data about the companies listed in the S&P 500 index from a Wikipedia page, uses that data to scrape financial data for each company from an API, and organizes the results into CSV files loaded into an AWS S3 bucket, where analytical teams can easily access them.

Requirements

  • Docker
  • Prefect Cloud API key and workspace
  • AWS access key, secret key, and region name
  • Alpha Vantage API key

Usage

To use this Docker container, follow these steps:

  1. Clone this repository to your local machine.
  2. Navigate to the cloned repository in your terminal.
  3. Build the Docker container using the following command:
docker build -t sp500-data-container .
  4. Run the container using the following command:
docker run -d sp500-data-container
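
How the container receives the credentials listed under Requirements is not documented here; if run.sh and the flow read them from environment variables, the run command would look roughly like the following sketch (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION are the standard AWS SDK variables and PREFECT_API_KEY is Prefect's standard variable, but PREFECT_WORKSPACE and ALPHAVANTAGE_API_KEY are assumed names):

docker run -d \
  -e PREFECT_API_KEY=... \
  -e PREFECT_WORKSPACE=... \
  -e AWS_ACCESS_KEY_ID=... \
  -e AWS_SECRET_ACCESS_KEY=... \
  -e AWS_DEFAULT_REGION=... \
  -e ALPHAVANTAGE_API_KEY=... \
  sp500-data-container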

The run.sh script, which runs via the CMD instruction in the Dockerfile, logs into Prefect Cloud, builds a deployment for the main flow, applies the deployment, and starts an agent pool.
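
The script itself is not reproduced in this README; with the Prefect 2.x CLI, the sequence would look roughly like the following (the main.py:main_flow entrypoint, deployment name, and queue name are illustrative guesses, not the repository's actual values):

prefect cloud login -k "$PREFECT_API_KEY" -w "$PREFECT_WORKSPACE"
prefect deployment build main.py:main_flow -n sp500-deployment -q default --apply
prefect agent start -q default

Once the flow is running, it performs the following steps: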

  1. Makes a GET request to the Wikipedia page that lists the companies in the S&P 500.
  2. Writes the HTML response to an AWS S3 bucket using the boto3 library.
  3. Starts a local file server that serves the HTML file.
  4. Uses a Scrapy crawler and BeautifulSoup to scrape the desired table from the Wikipedia page.
  5. Cleans and organizes the scraped data into a Python list containing each company's stock ticker and other info.
  6. Passes the data to a function that, for each company's stock ticker, makes a GET request to the Alpha Vantage API, which returns that company's quarterly and annual balance sheet data in JSON format (see the sketch after this list).
  7. Writes the returned JSON files to an S3 bucket.
  8. Reads the JSON files from the S3 bucket and combines them into two CSV files: one containing every company's annual reports and one containing every company's quarterly reports.
  9. Writes the master annual reports and master quarterly reports CSV files to an S3 bucket so that analytical teams can retrieve them.
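
The repository's actual implementation is not shown in this README, but a minimal Python sketch of steps 6 through 9 might look like the following. It relies on Alpha Vantage's BALANCE_SHEET endpoint, which returns annualReports and quarterlyReports arrays per symbol; the bucket name, key layout, and function names are illustrative only:

import io
import json

import boto3
import pandas as pd
import requests

BUCKET = "sp500-data"                 # hypothetical bucket name
API_KEY = "YOUR_ALPHAVANTAGE_KEY"     # placeholder

s3 = boto3.client("s3")

def fetch_balance_sheet(ticker):
    # Step 6: Alpha Vantage's BALANCE_SHEET function returns annual and
    # quarterly balance sheet reports for one symbol as JSON.
    resp = requests.get(
        "https://www.alphavantage.co/query",
        params={"function": "BALANCE_SHEET", "symbol": ticker, "apikey": API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def store_json(ticker, payload):
    # Step 7: write the raw JSON response to S3.
    s3.put_object(
        Bucket=BUCKET,
        Key=f"balance-sheets/{ticker}.json",
        Body=json.dumps(payload).encode("utf-8"),
    )

def build_master_csvs(tickers):
    # Steps 8-9: read each company's JSON back from S3, flatten the
    # reports, and write two master CSVs (annual and quarterly) to S3.
    annual, quarterly = [], []
    for ticker in tickers:
        obj = s3.get_object(Bucket=BUCKET, Key=f"balance-sheets/{ticker}.json")
        data = json.loads(obj["Body"].read())
        annual += [{"ticker": ticker, **row} for row in data.get("annualReports", [])]
        quarterly += [{"ticker": ticker, **row} for row in data.get("quarterlyReports", [])]
    for name, rows in [("annual", annual), ("quarterly", quarterly)]:
        buf = io.StringIO()
        pd.DataFrame(rows).to_csv(buf, index=False)
        s3.put_object(
            Bucket=BUCKET,
            Key=f"master/{name}_reports.csv",
            Body=buf.getvalue().encode("utf-8"),
        )

if __name__ == "__main__":
    tickers = ["MMM", "AAPL"]   # in the real flow, tickers come from the scraped Wikipedia table
    for t in tickers:
        store_json(t, fetch_balance_sheet(t))
    build_master_csvs(tickers)

In the actual container these steps run as part of the Prefect flow, and the ticker list comes from the scraped Wikipedia table rather than a hard-coded list.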

Contributing

If you find any issues with this Docker container, feel free to open an issue or submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.