Selecting the right features #10

Open
HanzhangRen opened this issue Apr 22, 2024 · 4 comments

@HanzhangRen
Collaborator

HanzhangRen commented Apr 22, 2024

For the algorithm I submitted to the first leaderboard, I included 61 variables, which I then cleaned to form 36 predictors. I chose them mostly based on intuition.

PreFer_codebook.xlsx

The Excel file contains 31,667 variables. I scrolled through all of them and selected 351 variables, marked mostly in yellow, that 1) seem possibly relevant to our outcome AND 2) represent the latest version of the relevant concept.

Then, out of the 351 variables, I further picked the 61 variables that I felt it would be a pity not to include in the algorithm. These are marked in red.

There are 4 variables, marked in green, that I realized after submitting the code probably should have been among the 61.

There are 2 variables, marked in grey, that I thought would be helpful but unfortunately have zero variance in the training set.

@emilycantrell, do you think it would be a useful exercise for you to download a fresh version of the codebook, do a similar run-through of all the variables, and then compare our results?
In addition to the two of us, we could also try to get ideas about what variables to select from other sources:

  1. Let's run a feature selection algorithm on all features to see which ones stand out (see the sketch after this list).
  2. We can ask some professors in our departments about what they would look for if they were to do the prediction challenge.
  3. Ask ChatGPT :)
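
For idea 1, here is a minimal sketch of what the automated pass could look like, assuming the (already imputed) features sit in a pandas DataFrame `X` and the outcome in `y`. The mutual-information scorer is just one possible choice, and `k=61` simply mirrors the manual selection above:

```python
# Rank all features by mutual information with the outcome and keep the top k.
# Assumes X is an already-imputed numeric feature DataFrame and y is the
# binary fertility outcome.
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def top_features(X: pd.DataFrame, y: pd.Series, k: int = 61) -> list[str]:
    selector = SelectKBest(score_func=mutual_info_classif, k=k).fit(X, y)
    scores = pd.Series(selector.scores_, index=X.columns)
    return scores.sort_values(ascending=False).head(k).index.tolist()
```

Comparing this ranking against our manually marked variables should show how much the two approaches overlap.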

Another thing is that there are probably better ways to preprocess these variables. Here is what I did for preprocessing (see the sketch after the list):

  1. did some basic logical imputation (e.g., one cannot be married to a partner if they do not have a partner)
  2. dummy-encoded categorical variables
  3. did some scale calculations for variables that are grouped together on the LISS website (see an example of a group of variables here)
  4. mean-imputed all remaining missing values
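
A hedged sketch of these four steps; `partner`, `civil_status`, and the `item*` codes are hypothetical stand-ins for the real LISS variable codes:

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Logical imputation: no partner implies no married-with-partner status.
    df.loc[df["partner"] == 0, "civil_status"] = "no_partner"
    # 2. Dummy-encode categorical variables.
    df = pd.get_dummies(df, columns=["civil_status"])
    # 3. Scale score: row mean over the items that form one LISS scale.
    df["scale_score"] = df[["item1", "item2", "item3"]].mean(axis=1)
    # 4. Mean-impute all remaining missing values in numeric columns.
    return df.fillna(df.mean(numeric_only=True))
```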

If we have the time, we might want to explore the following:

  1. Can we reduce missingness by combining past and present versions of the same construct? (See the sketch after this list.)
  2. Combine categories for categorical variables with very small categories.
  3. Check Cronbach's alpha to see if the groups of variables that I currently treat as scales do indeed make sense as scales (also sketched below).
  4. Do something other than mean imputation.
  5. Decide how much missingness is too much for us to include a variable as a predictor.
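
Sketches for items 1 and 3, assuming the data sit in a pandas DataFrame; the variable names are examples taken from elsewhere in this thread:

```python
import pandas as pd

def combine_waves(current: pd.Series, earlier: pd.Series) -> pd.Series:
    # Item 1: use the earlier wave's value wherever the current wave is missing,
    # e.g. combine_waves(df["cf20m128"], df["cf19l128"]).
    return current.combine_first(earlier)

def cronbach_alpha(items: pd.DataFrame) -> float:
    # Item 3: alpha = k/(k-1) * (1 - sum of item variances / variance of total),
    # computed on complete cases; `items` holds one candidate scale's columns.
    items = items.dropna()
    k = items.shape[1]
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - items.var(ddof=1).sum() / total_var)
```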
@emilycantrell
Collaborator

@HanzhangRen Excellent, thank you!!!

@emilycantrell, do you think it would be a useful exercise for you to download a fresh version of the codebook, do a similar run-through of all the variables, and then compare our results?

Good idea. I briefly scrolled through your spreadsheet just to make sure I understood the gist of what you did, but I did not look at the details of what you chose, so that when I review the variables, it will be an "independent" review.

I think it will be interesting to compare the automated feature selection to the manual feature selection to see how much our choices overlap.

When we time-shift earlier data forward, my ideal would be to time-shift ALL features. However, I learned last year that some features change in the structure of their name, which makes time-shifting more difficult (since it's harder to identify the corresponding earlier feature). I'm going to spend some time looking into whether we can time-shift all features forward, but if it turns out to be too difficult, an alternative is to time-shift only the features that we actually plan to feed into the model, after selecting them through either manual or automated feature selection.
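For the easy case, where only the year/wave token in the variable name changes (e.g. cf19l128 → cf20m128), a regex rename could line earlier-wave columns up with the current names; variables whose name structure changed would still need a hand-built mapping. A rough sketch:

```python
# Assumes the LISS pattern of a two-letter survey prefix followed by a
# year/wave token and an item number (e.g. cf19l128).
import re
import pandas as pd

def shift_wave(df: pd.DataFrame, old: str = "19l", new: str = "20m") -> pd.DataFrame:
    # Rename e.g. cf19l128 -> cf20m128 so earlier waves can be stacked
    # under the current column names.
    pattern = re.compile(rf"^([a-z]{{2}}){old}(\d+)$")
    return df.rename(columns=lambda c: pattern.sub(rf"\g<1>{new}\g<2>", c))
```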

did some scale calculations for variables that are grouped together on the LISS website (see an example of a group of variables here)

I love that you made scales!!! And the other data pre-processing also looks great. I'll think about what other data prep we might want to do.

If we have the time, we might want to explore the following

I agree with everything you listed here.

This week, I am finally back to a normal schedule, so I'm going to start working on creating time-shifted outcome data, and then time-shifted feature data.

@emilycantrell
Collaborator

The following features that we are currently using don't exist in the time-shifted data. For now, I'm creating columns that are all NA in the time-shifted data and then imputing them; I'll return to handle this properly later. (A code sketch of this stopgap follows the lists below.)

Income questions that were only asked in even-numbered years

We can adjust the code to use corresponding questions from other years for the time-shifted data.
ca20g012
ca20g013
ca20g078

Religiosity question

This was only asked in 2019-2020. We can still use this; it will just have to be imputed for time-shifted cases.
cr20m162

Traditional fertility variables, which were only asked in 2008, 2009, and 2010.

We can't create variables like this for the time-shifted data. We can still use these; they will just have to be imputed for time-shifted cases.
cv10c135
cv10c136
cv10c137
cv10c138
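
A sketch of the stopgap described above, using the variable IDs listed in this comment; `shifted` is assumed to be the time-shifted feature frame:

```python
# Add the listed variables as all-NA columns so both datasets share one
# set of columns, and let the usual imputation fill them in downstream.
import numpy as np
import pandas as pd

MISSING_IN_SHIFTED = [
    "ca20g012", "ca20g013", "ca20g078",              # even-year income questions
    "cr20m162",                                      # religiosity (2019-2020 only)
    "cv10c135", "cv10c136", "cv10c137", "cv10c138",  # traditional fertility
]

def add_placeholder_columns(shifted: pd.DataFrame) -> pd.DataFrame:
    for col in MISSING_IN_SHIFTED:
        if col not in shifted.columns:
            shifted[col] = np.nan
    return shifted
```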

@HanzhangRen
Collaborator Author

One benefit of the time shift is that, now that we have more data, some previously tiny categories in categorical variables have become larger, and I have added dummy versions of these categories because I think they're potentially helpful. I am using 50 as the minimum category size for including a dummy (see the sketch after the examples below).

Previously, I was only able to include employees, freelancers, students, and homemakers as occupation status categories. Now I can include categories for people who lost their jobs but are currently seeking work, as well as people with work disabilities.
Previously, I was only able to distinguish between people with Western and non-Western backgrounds. Now, among people with non-Western backgrounds, I can distinguish between 1st and 2nd generation immigrants. In addition, I noticed that there was a bug that prevented this immigration background variable from being used at all in our last submission, and I fixed that bug.
Previously, I had to merge the lowest 4 categories of education into 1. Now I just need to merge them into 3.
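
A sketch of the 50-observation rule, assuming `s` is one categorical column such as occupation status:

```python
# Dummy-encode only the categories of s with at least min_size observations.
import pandas as pd

def dummies_above_threshold(s: pd.Series, min_size: int = 50) -> pd.DataFrame:
    counts = s.value_counts()
    keep = counts[counts >= min_size].index
    # Rare categories become NaN and are dropped by get_dummies.
    return pd.get_dummies(s.where(s.isin(keep)), prefix=s.name)
```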

@emilycantrell
Collaborator

emilycantrell commented Jun 3, 2024

I added some other features that I thought might be useful:
# Gender of first, second, and third child
"cf20m068", "cf20m069", "cf20m070",
# Type of parent to first, second, and third child (bio, step, adoptive, foster)
"cf20m098", "cf20m099", "cf20m100",
# Current partner is biological parent of first, second, third child
"cf20m113", "cf20m114", "cf20m115",
# Satisfaction with relationship
"cf19l180", "cf20m180",
# Satisfaction with family life
"cf19l181", "cf20m181"

Some of these features have values that only appear a few dozen times (or fewer) in the original data, but we get a larger sample size in the time-shifted data, which might make them worthwhile.

I added these features from the partner survey: (see #22)
# Partner survey: fertility expectations in 2020
"cf20m128_PartnerSurvey", "cf20m129_PartnerSurvey", "cf20m130_PartnerSurvey",
# Partner survey: fertility expectations in 2019
"cf19l128_PartnerSurvey", "cf19l129_PartnerSurvey", "cf19l130_PartnerSurvey",
# Partner survey: whether ever had kids
"cf19l454_PartnerSurvey", "cf20m454_PartnerSurvey",
# Partner survey: Number of kids reported in 2019 and 2020
"cf19l455_PartnerSurvey", "cf20m455_PartnerSurvey"

I also included an indicator of whether the partner survey data is non-missing, in case that helps the model differentiate between imputed and non-imputed values. However, I didn't include an indicator for imputation for other variables. We can consider that as something to add in the future, but I suspect it won't make a difference.
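
A sketch of how that indicator could be built, assuming the `_PartnerSurvey` suffix convention from the feature list above; the output column name `has_partner_survey` is made up:

```python
import pandas as pd

def add_partner_survey_indicator(df: pd.DataFrame) -> pd.DataFrame:
    partner_cols = [c for c in df.columns if c.endswith("_PartnerSurvey")]
    # 1 if the respondent has any observed partner-survey value, else 0.
    df["has_partner_survey"] = df[partner_cols].notna().any(axis=1).astype(int)
    return df
```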

emilycantrell added a commit that referenced this issue Jun 3, 2024