
TX salaries #4

Open
soodoku opened this issue Sep 24, 2017 · 17 comments

soodoku (Member) commented Sep 24, 2017

Texas salary data is provided here:

https://salaries.texastribune.org/agencies/

You can download data from each link. For instance, clicking on Austin takes you to https://salaries.texastribune.org/austin/, where there is a link to 'download this data.'

The state data is already there. We need to download the city, school, and university data; combine each into city, k12, and university files; 7zip each file; and upload it to the relevant year folder.

Save the script under scripts/tx/tx_tribune.R.
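Something like the following R sketch could be the starting point (the CSV URL pattern and city slugs are guesses; the actual download links would need to be scraped from the agencies page):

```r
# A minimal sketch, not the actual tx_tribune.R: the URL pattern and
# the city slugs below are assumptions.
library(dplyr)

cities <- c("austin", "houston", "dallas")  # hypothetical subset

city_data <- lapply(cities, function(slug) {
  url <- sprintf("https://salaries.texastribune.org/%s/data.csv", slug)  # assumed pattern
  df <- read.csv(url, stringsAsFactors = FALSE)
  df$entity <- slug  # record which city each row came from
  df
})

combined <- bind_rows(city_data)  # fills missing columns with NA
write.csv(combined, "city.csv", row.names = FALSE)
```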

@soodoku soodoku assigned suriyan and ChrisMuir and unassigned suriyan Sep 24, 2017
@soodoku soodoku assigned tristanjkaiser and unassigned ChrisMuir Nov 8, 2017
tristanjkaiser (Contributor) commented:
Added the city, k12, and university files.

soodoku (Member, Author) commented Nov 17, 2017

is anything remaining or are we done? if done, let's close the issue! yey!

soodoku (Member, Author) commented Nov 17, 2017

if you used any scripts, push them to scripts/tx/tx_tribune.R

tristanjkaiser (Contributor) commented:

Yup, that's everything. I ended up doing it manually, which was actually quicker than writing a script would have been. Complete!

soodoku (Member, Author) commented Nov 18, 2017

just so we are on the same page: you did an rbind on the cities etc. to create city-year and school-year level files? thx.

tristanjkaiser (Contributor) commented Nov 18, 2017

Ah no - the districts/unis/cities are still separated by each school/city etc. I will reopen and bind them.

tristanjkaiser (Contributor) commented:

The data is really dirty (every file has different column names for the same thing, e.g. "Annual Salary", "Annual Rt", "Rate of Pay"). What are the primary variables we are concerned with? Name, Title, and Salary?

soodoku (Member, Author) commented Nov 18, 2017

Interesting.

1. Normalize the column names.
2. Preserve all the information. The rows would just have NAs where that information is missing.
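
Something like this in R (the header variants are illustrative, and `file_dfs` stands in for the list of data frames read from the downloaded files):

```r
library(dplyr)

# Map known header variants to one canonical name (entries are illustrative)
rename_map <- c(
  "Annual Salary" = "annual_salary",
  "Annual Rt"     = "annual_salary",
  "Rate of Pay"   = "annual_salary"
)

normalize_names <- function(df) {
  hits <- names(df) %in% names(rename_map)
  names(df)[hits] <- rename_map[names(df)[hits]]
  df
}

# file_dfs: a list of data frames, one per downloaded file.
# bind_rows() keeps every column seen in any file and fills NA
# wherever a file lacks that column.
combined <- bind_rows(lapply(file_dfs, normalize_names))
```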

tristanjkaiser (Contributor) commented:

Ok, I pushed the data. I normalized the column names that were particularly important (First Name, Last Name, Full Name, Annual Salary); there are a lot of unique columns remaining. Given how much time it was taking and how many files there were, I wanted to get this out to you to see how much more time it is worth spending normalizing other columns like ethnicity. There is still lots of cleaning to do re: column formats, merging columns, and ensuring that 'annual salary' really is an annual salary (e.g., some values are actually rates, say below $1,000, and need to be multiplied by a corresponding total hours per year).
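
For that last check, something like this could work (the column names and the $1,000 cutoff are assumptions; the real files vary):

```r
library(dplyr)

# Hypothetical annual_salary and hours_per_year columns. Values under
# $1,000 are assumed to be rates and are converted to annual figures
# using the corresponding total hours per year.
combined <- combined %>%
  mutate(annual_salary = ifelse(
    !is.na(annual_salary) & annual_salary < 1000 & !is.na(hours_per_year),
    annual_salary * hours_per_year,
    annual_salary
  ))
```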

vinay-pimple commented Nov 28, 2017 via email

soodoku (Member, Author) commented Nov 28, 2017 via email

soodoku (Member, Author) commented Nov 28, 2017

@tristanjkaiser: can you post the scripts you have been working on, so that I can see what is going on? Thanks.

ChrisMuir (Contributor) commented:

With regard to handling the column headers of many data sets, this may be overkill, but I'm dealing with a very similar problem at work and wanted to share how we approached it.

I'm on a long-term research project concerned with food safety in China, and one of the key pieces of data the team wants to analyze is government food inspection records. So I'm constantly writing scrapers for different government websites, most of which serve up data in the form of xlsx files. Each site will have dozens or hundreds of xlsx files, each file has multiple sheets, some sheets contain multiple data sets, and there is no consistency across files in format/structure. There's also very little consistency in column headers (e.g., we currently have 24 different strings that all mean "manufacturer").

To solve this, I created a column header data dictionary: a Python dict in which each key is a column that I know I want in the final DB, and each value is a list of known string representations of that specific column header. I also have a "skips" dict to house column headers that I know can be skipped. After scraping a new website, a script reads in each of the new xlsx files and iterates over the column headers; if a header appears in either of the dicts, it knows whether to skip it or how to bin/categorize it. When it sees a header that doesn't exist in either dict, it throws an error and prints to console the xlsx file name, the sheet name, the column header string, and the first five observations from that column (it also translates some of this info, because I don't speak Mandarin lol), and I have to manually add the new header string to one of the dicts.
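
A rough sketch of the same idea in R, since this project's script is R (the entries and skip list are illustrative, not our actual dicts):

```r
# Canonical column -> known header variants (illustrative entries)
header_dict <- list(
  annual_salary = c("Annual Salary", "Annual Rt", "Rate of Pay"),
  full_name     = c("Name", "Employee Name", "Full Name")
)

# Headers known to be safe to drop
skips <- c("Row ID", "Notes")

classify_header <- function(h) {
  for (canon in names(header_dict)) {
    if (h %in% header_dict[[canon]]) return(canon)
  }
  if (h %in% skips) return(NA_character_)  # NA = drop this column
  # Unknown header: fail loudly so a human adds it to one of the dicts
  stop(sprintf("Unknown column header: '%s'", h))
}
```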

It's a rather brute force approach, not very elegant, but it's worked pretty well for us so far. And again, it may be overkill for this project, but I figured I'd share. If anyone has questions about this just let me know.

soodoku (Member, Author) commented Nov 28, 2017

Thanks @ChrisMuir! What you suggest seems like a perfectly fine way to go.

Also @tristanjkaiser, I see the script and will get back once I have taken a look. This TX data is a bit unusual.

vinay-pimple commented Nov 28, 2017 via email

soodoku (Member, Author) commented Nov 28, 2017

@vinay-pimple what you suggest makes sense to me. Codd's third normal form for the win.

To bring everyone onto the same page: one of the ideas is to also provide a web interface to the data, and @vinay-pimple is talking a bit about the backend of that. But the insights can be used for packaging the data as well.

TX was unusually complicated. For most datasets, I don't see this as a serious challenge. Time-consuming but doable.

We would also need to get all the salaries in 2017 dollars or something. We probably also need to get PPP for each area so that we can pro-rate wages etc. But all that is on the menu and not relevant for this particular issue. I will look into hiving this discussion off into a new issue.
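
For when that issue gets opened, the adjustment itself is just a CPI ratio; a sketch with placeholder index values:

```r
# Placeholder CPI values; a real script would pull these from BLS/FRED
cpi <- c("2015" = 237.0, "2016" = 240.0, "2017" = 245.1)

# Convert a nominal salary from `year` into 2017 dollars
to_2017_dollars <- function(salary, year) {
  salary * cpi[["2017"]] / cpi[[as.character(year)]]
}

to_2017_dollars(50000, 2015)  # ~51,709
```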

@soodoku soodoku assigned soodoku and unassigned tristanjkaiser Jan 2, 2018
soodoku (Member, Author) commented Jan 2, 2018

@ChrisMuir: would you like to pick this up?
