Idaho 2013 pdf to CSV #23

Open
soodoku opened this issue Jan 5, 2018 · 9 comments

@soodoku
Member

soodoku commented Jan 5, 2018

https://github.com/public-salaries/public_salaries/tree/master/id/2013

@ChrisMuir
Contributor

I have a little free time and am working on this now. What's the source URL of this PDF? I don't see it listed in the ID README, and I'd like to include it in the comments at the top of the script file.

@soodoku
Member Author

soodoku commented Jan 10, 2018

2013 is from: https://pibuzz.com/wp-content/uploads/post%20documents/Idaho%202013.pdf

But as you can see on the title page, the data are from Transparent Idaho. will get a link from that site. there are more PDFs like this on the Transparent Idaho website, including one for 2018:
https://ibis.sco.idaho.gov/pubtrans/workforce/Workforce%20by%20Name%20Summary-en-us.pdf

@soodoku
Member Author

soodoku commented Jan 10, 2018

@ChrisMuir
Contributor

Cooool, thanks!

@ChrisMuir
Contributor

Just finished extracting data from the 2013, 2014, and 2018 PDFs, and pushed the 7z files and script files to the repo.

This ended up being a huge pain. For some reason pdftools was working just fine for the 2013 PDF but then stopped working about a week ago, and from that point on it wouldn't work for any of the ID PDF files. By "wouldn't work", I mean pdf_text would read the correct number of pages in the doc but would return an empty string for each page. I ended up writing a custom function which mimics pdftools::pdf_text and calls

system2("pdftotext", args = c("-table", path_to_pdf_file))

which is pretty hacky. I'm working on a PC, so I'm not sure whether that will work on any other OS.
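
A wrapper along the lines described above might look like the following minimal sketch. It is not the script pushed to the repo: the names `split_pages` and `pdf_text_cli` are hypothetical, the explicit output file and the form-feed page split are assumptions about how pdftotext is being used, and the `-table` flag is simply carried over from the call above (whether a given pdftotext build accepts it is not something this thread confirms).

```r
# Split raw pdftotext output into one string per page.
# pdftotext ends each page with a form feed ("\f") by default.
split_pages <- function(raw) {
  strsplit(raw, "\f", fixed = TRUE)[[1]]
}

# Hypothetical stand-in for pdftools::pdf_text that shells out to pdftotext.
# Assumes the pdftotext binary is installed and on the PATH.
pdf_text_cli <- function(pdf_path, flag = "-table") {
  if (!nzchar(Sys.which("pdftotext"))) {
    stop("pdftotext was not found on the PATH")
  }

  # Write to a temporary file instead of next to the PDF, and clean it up.
  out_file <- tempfile(fileext = ".txt")
  on.exit(unlink(out_file), add = TRUE)

  # system2 does not quote arguments itself, so quote the paths explicitly.
  status <- system2(
    "pdftotext",
    args = c(flag, shQuote(pdf_path), shQuote(out_file))
  )
  if (status != 0) {
    stop("pdftotext exited with status ", status)
  }

  # Form feeds are not line breaks for readLines, so they survive the read.
  raw <- paste(readLines(out_file, warn = FALSE), collapse = "\n")
  split_pages(raw)
}
```

Called on a local copy of one of the PDFs, `pdf_text_cli` would return a character vector with one element per page, the same shape that pdftools::pdf_text returns. Checking for the binary with `Sys.which` and quoting paths with `shQuote` is meant to make the call less dependent on the OS, though that has not been verified here.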

Also, as of now the three ID script files (one per PDF) are effectively identical; at some point I will replace them with a single script that reads from and writes to each yearly folder.

@soodoku
Member Author

soodoku commented Jan 18, 2018

oy! sorry to hear.

pdftools:

  1. don't know what the situation with pdftools is, but after a windows update some stuff may need admin privs to work correctly, as the function may be calling something else in the backend. always worth a try to run as admin.

  2. i did notice that my MiKTeX conked out a week ago also, so i had to reinstall it and set up the path etc. again.

  3. the other alternative to pdftools = ABBYY FineReader. it isn't free, but it has an API and there is an R wrapper. ABBYY is generally considered best in class for commercial OCR.

no worries on the 3 scripts. and congrats on getting across the line on this one! seems v. painful and that is where some new software is born! :-)

@ChrisMuir
Contributor

Yeah, it's all good. What's weirdest is that I was initially working with the 2013 doc on a Mac when the issue started about a week ago. I then tested it on my work PC and it was doing the same thing (and it is persisting for all of the Idaho PDF docs), so the pdftools issue is cutting across Mac and PC for me.

Actually, do you mind trying it yourself? Try running:

url <- "https://pibuzz.com/wp-content/uploads/post%20documents/Idaho%202013.pdf"
txt <- pdftools::pdf_text(url)

and let me know if it works for you. For reference, it returns a single empty string for each page for me, so this evaluates to TRUE:

identical(
  pdftools::pdf_text("https://pibuzz.com/wp-content/uploads/post%20documents/Idaho%202013.pdf"), 
  rep("", 1012)
)
#> TRUE

Just let me know what results you get if you don't mind.
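
The check above hard-codes the page count (1012). A slightly more general way to flag the same symptom, sketched here with a hypothetical helper name `all_pages_empty`, is to test whether every returned page is empty or whitespace-only:

```r
# TRUE when at least one page came back and every page is blank.
# `pages` is assumed to be a character vector with one element per page,
# which is the shape pdftools::pdf_text returns.
all_pages_empty <- function(pages) {
  length(pages) > 0 && all(!nzchar(trimws(pages)))
}

all_pages_empty(rep("", 1012))
#> TRUE
all_pages_empty(c("", "Some text"))
#> FALSE
```

A TRUE result would be the cue to fall back to another extractor rather than continue with empty text.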

@soodoku
Member Author

soodoku commented Jan 20, 2018

dear @ChrisMuir,

reason for delay = URL is now dead.
tried on both linux and windows --- same result --- bunch of empty strings.

@ChrisMuir
Contributor

No worries on delay, thanks for trying and for the heads up!
