Illinois Teacher's Salary from 1999--2012 #3

Open
soodoku opened this issue Sep 22, 2017 · 15 comments

@soodoku
Member

soodoku commented Sep 22, 2017

Source =

http://www.familytaxpayers.org/ftf/ftf_salaries.php

For each year, list all districts. Each school in the district brings you to a clickable list of teacher and salary. Each teacher's name is clickable and gets meta data on the teacher.

Useful to produce year by year lists for now. We can merge later.

@ChrisMuir
Contributor

@soodoku
I got started on this. I've written a script that gets all of the district links for each year, then goes through each set of district links and gets all of the teacher links. I've gotten all of the district links, but have only gotten the teacher links for the first year's worth of districts (2012), because the number of requests being made to the website multiplies at each level. The number of years is 14, the number of district links is 14816, and the number of teacher links for 2012 alone is 162471. Pulling the teacher links for 2012 (14816 total requests) takes about half an hour, so grabbing all of the data from each teacher link is going to take 70 - 80 hours of constantly pinging the website: ((162K * 14) / 15K) * 0.5h ≈ 75h.
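
The back-of-the-envelope estimate above can be reproduced in a few lines of R. This is only a sketch: the helper name `estimate_hours` is made up for illustration, and it assumes the request rate observed while collecting the 2012 teacher links (roughly 15K requests per half hour) also holds for the per-teacher pages.

```r
# Hypothetical helper (not part of the repo's script): scale an observed
# request rate up to a larger number of requests.
estimate_hours <- function(n_requests, observed_requests, observed_hours) {
  # hours needed = requests to make / observed requests per hour
  n_requests / observed_requests * observed_hours
}

# Rounded figures from the comment: ~162K teacher links per year, 14 years,
# ~15K requests taking ~0.5 hours.
rounded <- estimate_hours(162e3 * 14, 15e3, 0.5)

# Same estimate using the unrounded counts quoted in the comment.
unrounded <- estimate_hours(162471 * 14, 14816, 0.5)

print(round(c(rounded = rounded, unrounded = unrounded), 1))
```

Both versions land inside the 70 - 80 hour range quoted in the comment.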

I've written code to scrape all the teacher links, iterate over them and grab the meta data for each link, but I don't know if I feel right about blasting their server to the level that it would take to get all that data.

For now, I'll push the script file to the repo.

@soodoku
Member Author

soodoku commented Sep 26, 2017

I see. Worrying about putting load on their server is reasonable. Two aspects to that:

  1. The webpages are simple enough that I don't see a huge impact on bandwidth per request. Plus, unlimited bandwidth plans with hosting companies are common.
  2. The big concern is too many requests at the same time. We aren't doing that.

But still makes sense to a) go year by year, and b) do Sys.sleep(1) between requests.
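
A minimal sketch of what (a) and (b) might look like together in R. `scrape_year` and `fetch_fun` are placeholder names, not functions from the repo; `fetch_fun` stands in for whatever actually downloads and parses a page (for example, a wrapper around `xml2::read_html`).

```r
# Hypothetical sketch: fetch a vector of URLs one at a time, pausing
# between requests so the server never sees a burst.
scrape_year <- function(urls, fetch_fun, delay = 1) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    # tryCatch keeps one failed page from killing a long run; the error
    # object is stored in place of the result so it can be retried later.
    results[[i]] <- tryCatch(fetch_fun(urls[[i]]), error = function(e) e)
    # Pause after every request except the last one.
    if (i < length(urls)) Sys.sleep(delay)
  }
  names(results) <- urls
  results
}

# Usage with a stand-in fetcher, so the example runs without network access.
fake_fetch <- function(url) {
  if (grepl("bad", url)) stop("could not fetch ", url)
  nchar(url)
}
out <- scrape_year(c("a/2012/1", "a/2012/bad", "a/2012/3"), fake_fetch, delay = 0)
failed <- vapply(out, inherits, logical(1), what = "error")
print(names(out)[failed])
```

Calling `scrape_year` once per year's vector of links gives the year-by-year runs, and the list of failed URLs can be fed back in for a retry pass.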

We can also email them to ask for the data. I am not v. optimistic that we will get something.

What do you think?

@soodoku
Member Author

soodoku commented Sep 26, 2017

  • Nice work man!

@ChrisMuir
Contributor

ChrisMuir commented Sep 26, 2017

Yeah, I have calls to Sys.sleep between each request. The robots.txt file doesn't mention /ftf/, so I guess it's fine as long as we go easy on them. It'll take some time to complete. I can probably set up a single year to run overnight, and just do that until we have each of the years done.

I'm going out of town Wednesday morning thru Monday, so I probably won't be able to start that process until next week.

If you or anyone else on the team starts it and has questions about the code, feel free to let me know.

@ChrisMuir
Contributor

So I worked on this some today. Quick update: I let the script run overnight to scrape all of the teacher links from each of the district links. There's a total of 2,226,915 unique teacher links (keep in mind, one teacher can have multiple links, as the data is split up by year). Assuming 0.5 seconds of computation/rvest time per request, plus a Sys.sleep of 2 seconds per request, we're looking at about 1546.47 hours, or 64.4 days, of non-stop scrape time to complete the task.
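
For reference, the arithmetic behind that estimate as a small R sketch. The 0.5 s and 2 s per-request figures are the assumptions stated in the comment; the variable names are illustrative only.

```r
n_links      <- 2226915  # unique teacher links across all years
secs_compute <- 0.5      # assumed rvest/computation time per request
secs_sleep   <- 2        # Sys.sleep() pause per request

total_secs  <- n_links * (secs_compute + secs_sleep)
total_hours <- total_secs / 3600
total_days  <- total_hours / 24

print(round(c(hours = total_hours, days = total_days), 1))
```

The sleep dominates the total: at 2 of the 2.5 seconds per request, it accounts for 80% of the run time.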

Should we consider narrowing our focus on this? Maybe limit the years to the five most recent years (2008 - 2012), or filter the 2mil+ links to only include unique teachers (keeping only the most recent instance of each teacher)?
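
Both narrowing options could be sketched in base R as below. The column names `teacher_id`, `year`, and `url` are assumptions for illustration; the actual link table may identify teachers differently, and matching on name alone would merge distinct teachers who share a name.

```r
# Toy link table; the real one has 2M+ rows.
links <- data.frame(
  teacher_id = c("t1", "t1", "t2", "t3", "t3", "t3"),
  year       = c(2010, 2012, 2005, 2008, 2011, 2010),
  url        = paste0("link_", 1:6),
  stringsAsFactors = FALSE
)

# Option 1: keep only the five most recent years (2008 - 2012).
recent <- links[links$year >= 2008 & links$year <= 2012, ]

# Option 2: sort newest first, then drop every row whose teacher_id has
# already appeared; what remains is the most recent link per teacher.
ordered <- links[order(links$year, decreasing = TRUE), ]
latest  <- ordered[!duplicated(ordered$teacher_id), ]

print(latest)
```

Note that option 2 trades away the longitudinal structure: each teacher ends up with a single year of data.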

Let me know what you think. I made a few general refactor edits to the scrape code today, I'll push that to the repo now.

@soodoku
Member Author

soodoku commented Nov 21, 2017

Awesome @ChrisMuir!

2.2M teacher-years is a lot! I agree that we should start out small. Probably do 2012 first and then go back in time slowly. One year at a time makes sense to me. And we can do it over the next many weeks.

p.s. There are some odd things in the data including $0 salaries.

@ChrisMuir
Contributor

Cool, yeah I'm letting it run on the 2012 teacher links for now. Once that's done, I'll write those results to csv and upload to the repo. We can take a look at that data and decide what to do from there. Thanks!

@ChrisMuir
Contributor

Just pushed the 2012 IL teacher salaries to the repo. The data came out very clean from the website: there were over 162K records scraped, and every single one came back as a neat 10-variable data frame, all with the same column headers. It made binding them all up into a single data set headache-free.
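
Because every page comes back with the same column headers, combining them is essentially a one-liner. A sketch in base R, with made-up column names and values (the real frames have 10 variables); the header check is an extra safety step added here, not something taken from the repo's script.

```r
# Stand-ins for the per-teacher data frames returned by the scraper.
records <- list(
  data.frame(name = "Teacher A", salary = 50000, year = 2012,
             stringsAsFactors = FALSE),
  data.frame(name = "Teacher B", salary = 42000, year = 2012,
             stringsAsFactors = FALSE)
)

# Confirm every frame has the same headers before binding.
headers  <- lapply(records, names)
same_hdr <- all(vapply(headers, identical, logical(1), headers[[1]]))
stopifnot(same_hdr)

# Stack all records into a single data set.
salaries <- do.call(rbind, records)
print(salaries)
```

With 162K small frames, `do.call(rbind, ...)` can be slow; `dplyr::bind_rows()` or `data.table::rbindlist()` are common faster alternatives.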

I'll start the script on 2011 tonight.

@soodoku
Member Author

soodoku commented Dec 23, 2017

I think this is done also, right? Should we close this issue @ChrisMuir?

@ChrisMuir
Contributor

No, unfortunately this isn't done. I'm slowly working through each year. The number of records per year is around 160k, and each record requires a single request to the website, via xml2, so with a small amount of Sys.sleep between each request, the scraping is very slow. Each year is taking about a week to complete. I've gotten 2010 - 2012 done, and 2009 is scraping right now. I have records for every year 1999 - 2012.

If you don't think we need to go all the way back to 1999, that's no problem, just let me know.

@soodoku
Member Author

soodoku commented Dec 23, 2017

Righto! Thanks, man!

I vote for getting all the data. Longitudinal data is great for econometrics. Paired w/ some outcome data (from health to economic outcomes), it can probably lead to imp. insights.

@soodoku
Member Author

soodoku commented Dec 23, 2017

Even descriptively, it would be great to know how teacher salaries have fared under Republicans, Dems., how close elections affect salaries, and also just how they compare over time to median wage in the respective areas.

@ChrisMuir
Contributor

Cool, got it. I'll keep adding data to the repo as each year finishes.

@ChrisMuir
Contributor

Quick update, the site has been completely down for the last ~48 hours. No "site maintenance" screen or anything, just a blank white page. I'll keep checking it.

@soodoku
Member Author

soodoku commented Jan 5, 2018

sigh. just checked. still down.
