
Creating final de-identified datasets #11

Open
soodoku opened this issue Dec 23, 2017 · 4 comments
soodoku commented Dec 23, 2017

The time has arrived to build the first draft of the final data.frames + dictionary that we will include in the R data package. It makes sense to pick the low-hanging fruit first. Let's start with California: it has the twin virtues of being relatively clean and big.

For CA, write a script that:
a. Replaces each name with a random 10-character string
b. Runs data integrity checks and flags or fixes issues as needed
c. Unzips and rbinds years and tiers of government, adding useful information (such as the level of government or the year the data are from) where it is missing
d. Produces tidy data as the final outcome
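The steps above can be sketched roughly as follows. This is illustrative only: the toy data frames stand in for the real unzipped CA files, and the column names (`name`, `pay`) and tier/year labels are assumptions, not the actual schema.

```r
# Minimal sketch of steps (a)-(d); toy data stands in for the real CA files.
set.seed(42)

# a. Helper: one random 10-character alphanumeric string per row.
random_id <- function(n) {
  vapply(seq_len(n), function(i)
    paste(sample(c(letters, LETTERS, 0:9), 10, replace = TRUE),
          collapse = ""), character(1))
}

# c. rbind tiers/years, tagging each chunk with tier and year where missing.
city_2015  <- data.frame(name = c("Ann Lee", "Bo Chu"), pay = c(5e4, 6e4),
                         tier = "city",  year = 2015,
                         stringsAsFactors = FALSE)
state_2015 <- data.frame(name = "Cy Day", pay = 7e4,
                         tier = "state", year = 2015,
                         stringsAsFactors = FALSE)
ca <- rbind(city_2015, state_2015)

# a. Replace names with random 10-character strings.
ca$name <- random_id(nrow(ca))

# b. Basic integrity checks: no missing or negative pay, no missing year.
stopifnot(!any(is.na(ca$pay)), all(ca$pay >= 0), !any(is.na(ca$year)))
```

The result (`ca`) is one tidy data frame: one row per person-year, with tier and year as explicit columns rather than implicit in the file names.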

After that, write an Rmd that presents some basic summaries of the data along with a dictionary.

Note: if you think you can improve the description of the issue, please do. And don't let the description keep you from doing sensible things.

@ChrisMuir

One question: is it important that the de-identified string be identical for identical raw string values? E.g., if "cats" appears 12 times, should all 12 post-anonymization strings be identical (say, all "sdlfijosd98fs")?

If so, we should consider using hash values, maybe via the digest package.
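To illustrate the point: hashing is deterministic, so identical raw strings always map to identical tokens. A small sketch with the digest package (install it from CRAN first):

```r
library(digest)  # install.packages("digest") if needed

# Identical raw strings yield identical hashes, so every occurrence of
# "cats" maps to the same anonymized token across all rows.
h1 <- digest("cats", algo = "sha256")
h2 <- digest("cats", algo = "sha256")
h1 == h2  # TRUE

# Vectorized over a column of names:
names_raw <- c("cats", "dogs", "cats")
hashed <- vapply(names_raw, digest, character(1), algo = "sha256")
```

Here `hashed[1]` and `hashed[3]` are identical while `hashed[2]` differs, which is exactly the property asked about above.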


soodoku commented Dec 23, 2017

Thanks, @ChrisMuir!

I was thinking about this exact point and about hashing. Yes, we do want to map each specific string to a particular value. That lets us achieve our purpose---not making it too easy for people to look up specific people---without losing much info.

On the point about losing info: names are pretty useful for imputing gender and ethnicity. So we probably want to enrich the data a bit---imputing race and gender using Lincoln Mullen's gender package + my ethnicolr package---to make up for losing this info.
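The enrichment step amounts to a lookup-and-merge on first names before the names are dropped. A sketch of the pattern in base R, where the toy lookup table is a stand-in assumption (in practice the gender package would supply the name-to-gender mapping, and ethnicolr, which is Python, the race/ethnicity imputation):

```r
# Toy lookup table standing in for gender::gender() output; the names,
# pay values, and imputed labels here are invented for illustration.
lookup <- data.frame(first = c("Ann", "Bo"),
                     gender_imputed = c("female", "male"),
                     stringsAsFactors = FALSE)

df <- data.frame(first = c("Ann", "Bo", "Ann"),
                 pay = c(5e4, 6e4, 7e4),
                 stringsAsFactors = FALSE)

# Impute gender by joining on first name, keeping all original rows.
df <- merge(df, lookup, by = "first", all.x = TRUE)
```

After this join, the `first` column can be hashed or dropped while the imputed `gender_imputed` column survives into the released data.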

Do you think that's a reasonable way to go? I worry just a bit about having people's names, but perhaps we should just go with it. What are your thoughts?

@ChrisMuir

I feel like removing proper names is a good idea. Even though all of these data are public, it feels weird to leave people's names in and make everything available in one central source.

If we want to hash the names prior to release, I assume we shouldn't impute gender and race before hashing and then add them as two new variables? That is, it's not a good option, probably because it would be seen as not very transparent... is that correct?

Yeah, the more I think about it, the more I lean toward hashing the names. I'm not an expert in data science ethics, though.


soodoku commented Jan 3, 2018

We are on the same page. The final 'clean' data we package in R won't have actual names. As is the norm, we will have two packages: a data package (downloadable from GitHub) and one that provides the API.

Proposed order for starting on our effort:

  1. Merge and 'clean' data for a single state (CA)
  2. Augment the data with ethnicity and gender
  3. Build a data dictionary
  4. Get ready to export
  5. Find a way to 'hash' names to random strings. The problem with hashing is that there is a chance of a 'collision'; we want to guarantee uniqueness. I will think more about this. We also want to make reverse searches non-trivial. Once we have figured this out, we export an .rda file into the respective R data package folder.
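One way to sidestep collisions, sketched below under assumptions (the names are invented): draw random IDs rather than hashing, redraw any duplicates until all IDs are unique, and keep the name-to-ID table private so reverse lookup is not trivial.

```r
# Collision-free alternative to hashing: a private name -> random-ID table.
set.seed(1)

unique_ids <- function(keys) {
  make_one <- function() paste(sample(c(letters, LETTERS, 0:9), 10,
                                      replace = TRUE), collapse = "")
  ids <- replicate(length(keys), make_one())
  # Redraw any duplicates until every ID is unique -- uniqueness guaranteed.
  while (anyDuplicated(ids)) {
    dup <- duplicated(ids)
    ids[dup] <- replicate(sum(dup), make_one())
  }
  setNames(ids, keys)
}

# Build the table over distinct names, so the same raw name always maps
# to the same ID (the property discussed earlier in this thread).
key <- unique_ids(unique(c("Ann Lee", "Bo Chu", "Ann Lee")))
key["Ann Lee"]
```

Unlike a hash, these IDs carry no function of the input, so there is nothing to brute-force; the trade-off is that the key table itself must be stored securely (or discarded after the export).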
