Correct error(s) in data #14

Open
emilycantrell opened this issue Apr 29, 2024 · 5 comments

@emilycantrell
Collaborator

In a very important feature, cf20m130 (Within how many years do you hope to have your [first/next] child?), there is a response that is almost certainly an error: someone answered "2025". I take this to mean that they expect to have a child by 2025, i.e., their entry should be "5". I debated whether "2025" means they expect a child before 1/1/2025 or by 12/31/2025, since that determines whether we should recode the response as 4 or 5. My instinct is to assume they mean by 12/31/2025, so let's make it "5". In 2019, the same person answered "3" to this question, which is extra confirmation that "5" is a reasonable value; it just seems that their timeline was pushed back a couple of years, perhaps because of the pandemic or perhaps because that's just how life goes.

I think we should manually recode this value as "5" by just putting a line for it into the code. @HanzhangRen, is submission.R the most appropriate file in which to make this edit? I can add the code for this edit there or wherever you recommend.
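For illustration, a minimal base R sketch of what such a one-line manual recode could look like. The data frame `train`, the `id` column, and the toy values are hypothetical; only the column name cf20m130 and the values 2025 and 5 come from the discussion above. This is a sketch, not the code in the repository.

```r
# Toy stand-in for the training data (hypothetical values)
train <- data.frame(
  id       = c(1, 2, 3),
  cf20m130 = c(2, 2025, NA)
)

# Recode the single out-of-range response.
# which() drops rows where the comparison is NA, so missing values stay missing.
bad_rows <- which(train$cf20m130 == 2025)
train$cf20m130[bad_rows] <- 5

print(train$cf20m130)
```

Matching on the exact value 2025 keeps the edit narrow: no other response is touched.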

Note: if we ever do automated feature selection, this change should happen BEFORE automated feature selection, as the "2025" value makes the linear correlation between the outcome and feature very weak.
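To illustrate why the order matters, here is a small sketch on purely synthetic data (not survey data) showing how a single calendar-year entry can swamp a linear correlation. The vectors `years_clean` and `outcome` are made up for the example.

```r
# Synthetic example: shorter hoped-for timelines go with outcome = 1
years_clean <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 8, 10)
outcome     <- c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0)

# Same data, but one respondent typed a calendar year instead of "5"
years_raw    <- years_clean
years_raw[8] <- 2025

r_clean <- cor(years_clean, outcome)
r_raw   <- cor(years_raw, outcome)

# Compare the two Pearson correlations
print(c(clean = r_clean, raw = r_raw))
```

In this toy case the clean feature has a strong negative correlation with the outcome, while the raw feature's correlation is much weaker and dominated by the one outlier, so a correlation-based filter run before the fix could discard the feature.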

If we end up having a lot of edits to the data like this, then we can consider setting up a more systematic way to make the edits, like we did for Million Monkeys.

Given the quick turnaround on this project, I don't think we should spend much time actively looking for errors like this in the data. However, I do want to check on other variables from the set of questions about expectations for having children, to make sure everything in those key features is in a reasonable range.

After I check the other related variables to make sure they are in range, I'll also email Lisa and Gert about this to see if they want to fix "2025" for all participants, or if that's something they want to treat as part of the challenge.

@emilycantrell emilycantrell self-assigned this Apr 29, 2024
@HanzhangRen
Collaborator

@emilycantrell I noticed this error too, and there is already a line in the code that fixes the problem.

@emilycantrell
Collaborator Author

Amazing, thank you!!

@emilycantrell
Collaborator Author

I looked at the values of the following three questions for all years, and didn't see anything else that is concerning:

  • Do you think you will have children in the future? -- AND -- Do you think you will have [more] children in the future? cf08a128; cf09b128; cf10c128; cf11d128; cf12e128; cf13f128; cf14g128; cf15h128; cf16i128; cf17j128; cf18k128; cf19l128; cf20m128
  • How many children do you think you will have in the future? -- AND -- How many children do you think you will have in the future? -- AND -- How many [more] children do you think you will have in the future? cf08a129; cf09b129; cf10c129; cf11d129; cf12e129; cf13f129; cf14g129; cf15h129; cf16i129; cf17j129; cf18k129; cf19l129; cf20m129
  • Within how many years do you hope to have your (first-next) child? -- AND -- Within how many years do you hope to have your [first/next] child? cf08a130; cf09b130; cf10c130; cf11d130; cf12e130; cf13f130; cf14g130; cf15h130; cf16i130; cf17j130; cf18k130; cf19l130; cf20m130
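A range check of this kind could be sketched in base R as follows. This is not the script behind the attached PDF; the data frame `df`, its toy values, and the cap `max_plausible` are assumptions made for the example.

```r
# Hypothetical toy data; the real data would hold all 13 waves per question
df <- data.frame(
  cf19l130 = c(3, 1, NA),
  cf20m130 = c(2, 2025, 4)
)

# Select the "within how many years" columns: names like cf08a130 ... cf20m130
year_cols <- grep("^cf[0-9]{2}[a-z]130$", names(df), value = TRUE)

# Min and max per column, ignoring missing values
ranges <- t(sapply(df[year_cols], range, na.rm = TRUE))
colnames(ranges) <- c("min", "max")
print(ranges)

# Flag columns whose maximum exceeds an assumed plausibility cap
max_plausible <- 50  # assumption, not a value from this thread
flagged <- rownames(ranges)[ranges[, "max"] > max_plausible]
print(flagged)
```

Changing the suffix in the pattern (128 or 129) would apply the same check to the other two question series.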

check_feature_ranges.pdf

@emilycantrell
Collaborator Author

I realized there might be errors in the test set that we can't check. So maybe rather than specifically recoding the value 2025, we should recode any value greater than, say, 2000, to auto-adjust. I'm not sure how likely it is that this same error occurs in the test set, but if time permits, I'll edit the code to a more generalizable correction of this specific type of error.

@emilycantrell
Collaborator Author

emilycantrell commented May 17, 2024

> I'll edit the code to a more generalizable correction of this specific type of error.

I did this in a special branch that I'm working on locally, within submission.R. Later I'll merge it into whatever branch we plan to submit.

  # Fix entries where people said calendar year instead of number of years
  # Note: this never happens in our time-shifted data, so we don't need this code chunk to work for time-shifted data
  cf20m130 = case_when(cf20m130 > 2000 ~ cf20m130 - 2020,
                       TRUE ~ cf20m130),
  cf19l129 = case_when(cf19l129 > 2000 ~ cf19l129 - 2019,
                       TRUE ~ cf19l129),
