How to handle features about expectations for having kids? #17

Open
emilycantrell opened this issue May 9, 2024 · 5 comments


@emilycantrell
Collaborator

emilycantrell commented May 9, 2024

Question

@HanzhangRen How did you choose the number 31 in "If no expected kids, then a lower-bound estimate for the number of years within which to have kids is 31?"

Proposed future steps (for June 3 submission)

I'd like to make use of the version of cf20m128, cf20m129, and cf20m130 from earlier waves, since many people have missing data in 2020 but might have answered these questions earlier. Here are some possible things to do. I'll explore these options before the June 3 submission deadline.

Do you think you will have (more) kids in the future?

  • My initial thought was: If cf20m128 is non-missing, use that value. If cf20m128 is missing, use the response from the most recent year in which the answer to this question was non-missing.
  • However, on further thought, going back too many years might give us data that isn't very useful, for two reasons: (1) If we use their answer from several years ago, they may have since had the additional child they expected, and therefore no longer expect more children. (2) Life changes a lot over the course of, say, 10 years, so their answer from 10 years ago may no longer be relevant. Mean imputation might even be better than using their answer from 10 (or however many) years ago. I'm not sure.
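A minimal sketch of the "most recent non-missing answer" idea, with a cap on how far back we look. This assumes the earlier-wave columns follow the same naming pattern as the 130 variables (cf19l128, cf18k128), which I have not verified; `most_recent_answer` and `max_lookback` are hypothetical names, and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

def most_recent_answer(df, cols_newest_first, max_lookback=2):
    """Take the newest non-missing answer, looking back at most `max_lookback` waves.

    Returns the filled value and how many waves back it came from.
    """
    value = pd.Series(np.nan, index=df.index, dtype="float64")
    waves_back = pd.Series(np.nan, index=df.index, dtype="float64")
    for lag, col in enumerate(cols_newest_first[: max_lookback + 1]):
        # Only fill rows that are still missing and have an answer in this wave.
        fill = value.isna() & df[col].notna()
        value.loc[fill] = df.loc[fill, col].astype("float64")
        waves_back.loc[fill] = lag
    return value, waves_back

# Toy data; column names other than cf20m128 are assumed, values are invented.
toy = pd.DataFrame({
    "cf20m128": [1.0, np.nan, np.nan, np.nan],
    "cf19l128": [2.0, 2.0, np.nan, np.nan],
    "cf18k128": [1.0, 1.0, 1.0, np.nan],
})
more_kids, more_kids_lag = most_recent_answer(
    toy, ["cf20m128", "cf19l128", "cf18k128"], max_lookback=1
)
```

With `max_lookback=1`, the third respondent stays missing even though they answered in 2018, which is the trade-off described in the second bullet. Rows left missing here would still need mean imputation or similar.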

How many (more) children do you think you will have in the future?

  • My thoughts are similar to what I posted above on cf***128.
  • One possibility is that we can interact the expected number of children with how many children they have actually had since giving that answer. E.g., if they said in 2018 that they expect to have 2 more children, and then they have a child in 2020, their expected number of additional children is now 1. If they said in 2018 that they expect to have 2 more children, and then they don't have a child in 2019 or 2020, their expected number of additional children is still 2. (But it is hard to know how many children they have actually had due to missingness in various waves, as evidenced by the very long code for calculating the presence of a "new" child for our outcome.)
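The adjustment in the example above could be sketched like this. The column names `expected_2018` and `births_2019_2020` are hypothetical placeholders, not real variables in our data, and flooring at 0 is my own assumption for people who had more children than they expected. The sketch also assumes the births count is already available, which is the hard part.

```python
import numpy as np
import pandas as pd

def remaining_expected_children(expected_then, births_since):
    """Earlier expectation minus births since then, floored at 0.

    A missing value in either input gives a missing result.
    """
    remaining = expected_then - births_since
    return remaining.clip(lower=0)

# Toy data with hypothetical column names and invented values.
toy = pd.DataFrame({
    "expected_2018": [2.0, 2.0, 1.0, np.nan],
    "births_2019_2020": [1.0, 0.0, 2.0, 0.0],
})
toy["expected_now"] = remaining_expected_children(
    toy["expected_2018"], toy["births_2019_2020"]
)
```

The first two rows reproduce the examples in the bullet: 2 expected with 1 birth leaves 1, and 2 expected with no births leaves 2.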

Within how many years do you expect to have your first/next child?

  • We could do something like:
    • If cf20m130 is non-missing, use that value.
    • If cf20m130 is missing, use the value from cf19l130 minus 1 (i.e., if in 2019 they expected to have a child within 5 years, then in 2020 they might expect to have a child within 4 years).
    • If cf20m130 and cf19l130 are missing, use the value from cf18k130 minus 2.
    • etc.
  • One challenge is similar to above: their own prediction for having children probably becomes less accurate the farther back in time we go. So, maybe we only use data for the past couple of years, rather than all the way back to 2008. Or maybe we create a secondary variable that is an indicator of what survey wave their answer came from. Or maybe in addition to the composite variable, we also include the features from the individual waves.
  • Of course, there's also another challenge similar to above: If they said in 2018 that they expect to have a child in the next 5 years, and then they have a child in 2020, should we still consider them to be expecting to have a new child within the next 3 years?
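The fallback rule above, plus the "which wave did the answer come from" indicator, could be sketched as follows. `years_until_next_child`, `years_until_child`, and `source_lag` are hypothetical names; flooring the shifted value at 0 is my own assumption for cases where the earlier horizon has already passed. It does not address the problem of people who had a child in the meantime.

```python
import numpy as np
import pandas as pd

# (column, number of years before 2020), newest wave first.
WAVES = [("cf20m130", 0), ("cf19l130", 1), ("cf18k130", 2)]

def years_until_next_child(df, waves=WAVES):
    """Composite of the '130' question: newest non-missing answer, shifted to 2020."""
    years = pd.Series(np.nan, index=df.index, dtype="float64")
    source_lag = pd.Series(np.nan, index=df.index, dtype="float64")
    for col, lag in waves:
        fill = years.isna() & df[col].notna()
        # Subtract the years elapsed since that wave; floor at 0 (assumption).
        years.loc[fill] = (df.loc[fill, col] - lag).clip(lower=0).astype("float64")
        source_lag.loc[fill] = lag
    return pd.DataFrame({"years_until_child": years, "source_lag": source_lag})

# Toy data with invented values.
toy = pd.DataFrame({
    "cf20m130": [3.0, np.nan, np.nan, np.nan],
    "cf19l130": [5.0, 5.0, np.nan, np.nan],
    "cf18k130": [1.0, 1.0, 1.0, np.nan],
})
composite = years_until_next_child(toy)
```

Limiting the lookback is just a matter of passing a shorter `waves` list, and `source_lag` can be fed to the model as its own feature so it can learn to discount older answers.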

I initially thought we should definitely combine answers to these questions across waves. But after thinking through all the problems mentioned above, I'm less certain.

Even if we don't combine data across waves, I do still think it's worth including earlier versions of these features (going at least a couple of years back, or maybe all the way back to 2008) in addition to the 2020 version of these features. That will hopefully help for people who had missing values in 2020.

One other thought: In 2020, people's plans for having children (or when to have children) might have changed due to the pandemic. I think the survey was in Sept/Oct of 2020. So that is an extra reason to think that answers from prior years might not be translatable to 2020 answers.

@HanzhangRen
Collaborator

> @HanzhangRen How did you choose the number 31 in "If no expected kids, then a lower-bound estimate for the number of years within which to have kids is 31?"

30 is the maximum response that is non-missing, so I went one year beyond that and picked 31. I didn't want to mean impute the number of years within which to have kids for those people who do not plan to have kids, as that would make them appear much more eager to have kids than they really are.
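For illustration, the "one year beyond the maximum" rule could look roughly like this. This is a sketch rather than the actual code in the repo: the value `2` for "no" on cf20m128 is an assumption that should be checked against the codebook, and `fill_years_for_no_kids` is a hypothetical name.

```python
import numpy as np
import pandas as pd

NO_MORE_KIDS = 2  # ASSUMED code for "no" on cf20m128; check the codebook

def fill_years_for_no_kids(df):
    """Give people who expect no (more) kids a horizon one year past the observed max."""
    sentinel = df["cf20m130"].max() + 1  # max() skips missing values
    no_kids = df["cf20m128"].eq(NO_MORE_KIDS) & df["cf20m130"].isna()
    # Keep the original value unless the person said "no" and has no horizon.
    return df["cf20m130"].where(~no_kids, sentinel)

# Toy data with invented values; the observed maximum here is 30.
toy = pd.DataFrame({
    "cf20m128": [1.0, 2.0, 2.0, np.nan],
    "cf20m130": [30.0, np.nan, np.nan, np.nan],
})
toy["years_filled"] = fill_years_for_no_kids(toy)
```

People with a missing answer to cf20m128 are left missing, so they can still be imputed separately instead of being treated as not wanting kids.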

@emilycantrell
Collaborator Author

Combine the features above to calculate the difference between expected number of kids and actual number of kids?

Before working on this, examine how consistent answers are from 2019 to 2020 (both within people, and in overall rates)
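One possible way to do the 2019 vs. 2020 consistency check, as a rough sketch. `consistency_report` is a hypothetical helper, cf19l128 is assumed to be the 2019 counterpart of cf20m128, and the toy values are invented.

```python
import numpy as np
import pandas as pd

def consistency_report(df, col_2019, col_2020):
    """Within-person agreement, a transition table, and overall answer rates."""
    # Within-person: only respondents who answered in both waves.
    both = df[[col_2019, col_2020]].dropna()
    agreement = (both[col_2019] == both[col_2020]).mean()
    transitions = pd.crosstab(both[col_2019], both[col_2020])
    # Overall: share of each answer among everyone who answered in that wave.
    rates = pd.DataFrame({
        col_2019: df[col_2019].value_counts(normalize=True),
        col_2020: df[col_2020].value_counts(normalize=True),
    })
    return agreement, transitions, rates

# Toy data with invented values.
toy = pd.DataFrame({
    "cf19l128": [1.0, 1.0, 2.0, 2.0, np.nan],
    "cf20m128": [1.0, 2.0, 2.0, 2.0, 1.0],
})
agreement, transitions, rates = consistency_report(toy, "cf19l128", "cf20m128")
```

If within-person agreement turns out to be low, that would be an argument against filling 2020 values from 2019 answers.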

@emilycantrell
Collaborator Author

I got started on code that combines 2020 answers with 2019 answers for all three fertility intentions questions. It's not perfect, but it's at least a good starting place. I'll post details tomorrow, and results for #21

@emilycantrell
Collaborator Author

The feature engineering that I did made almost no difference, so I don't think we should do any additional feature engineering on fertility intention features. I recorded the results here.

@emilycantrell
Collaborator Author

I previously said the feature engineering wasn't worth pursuing because it made "almost no difference." However, after seeing that other changes which made "almost no difference" individually seemed to add up to put us in first place, I now think the feature engineering is worth testing a bit more, so I'm reopening the issue.

@emilycantrell emilycantrell reopened this Jun 10, 2024