Add some additional features #10

citp · Jun 3, 2024 · 66c76c6
1 parent 807890e
commit 66c76c6

Showing 4 changed files with 23 additions and 5 deletions.
diff --git a/description.md b/description.md
@@ -1,9 +1,16 @@
 # Description of submission
 
-## The Model
+## Summary 
 
-We fit an xgboost model with 66 hand-picked variables, which are converted to 43 predictors. An additional predictor is whether the observation is time-shifted (see the section below).
+XGBoost with the following strategies: (1) Expanded sample size with "time-shifted" data, (2) Merged in data from the partner's survey for households where both partners participated, (3) Combined data from related features into "scales". 
 
-## The Data
+## Details 
+
+(1) We roughly tripled the amount of training data using a "time-shift" strategy. By adapting the outcome calculation code, which was generously provided by the PreFer organizing team, we calculated whether suitably aged people in the training and supplementary data had children between 2018 and 2020, thus creating additional outcome data. For these additional rows, we renamed the features measured at year t-minus-1, year t-minus-2, etc. relative to the earlier outcome window so that they carry the same names as the equivalent features at year t-minus-1, year t-minus-2, etc. in the original data. For example, in a time-shifted row, cf17j128 is renamed cf20m128 so that it lines up with the data collected 3 years later. To help account for temporal distribution shift, we include an indicator feature for whether the row comes from the time-shifted data or the original data.
+
+(2) For households where both partners participated in the survey, we merge in the partner's fertility intentions from 2019 and 2020 (cf19l128 to cf19l130 and cf20m128 to cf20m130), plus the partner's answers to questions about how many kids they have. 
+
+(3) We generate "scales" in the feature data by averaging related features together. Our scales are: Feelings toward current child, gendered religiosity, attitudes about traditional fertility, attitudes about traditional motherhood, attitudes about traditional fatherhood, attitudes about traditional marriage, attitudes toward working mothers, and sexism.
+
+We choose the hyperparameters for our XGBoost model via grid-search hyperparameter tuning with 5-fold cross-validation. 
 
-We roughly tripled the amount of training data using a time-shift strategy. By adapting outcome_time_shift.Rmd generously provided by the PreFer organizing team, we calculated whether suitably aged people in the training and supplementary data had children between 2018 and 2020. We then found earlier versions of our predictors and surmised that these earlier predictors predict childbirths between 2018 and 2020 in much the same way our predictors predict childbirths between 2021 and 2023. We then time-shifted those earlier measures to create additional rows in our training data.
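
The renaming step in the time-shift strategy can be illustrated with a minimal base-R sketch. It is not the code from submission.R: the helper names `shift_cf_name` and `make_time_shifted` and the indicator column `time_shifted` are hypothetical, and the sketch assumes that only "cf" variables need shifting and that columns from waves later than the shifted cutoff have already been dropped.

```r
# Hypothetical helper: move a "cf" variable name forward by `shift` waves,
# e.g. cf17j128 -> cf20m128 for shift = 3. Other names are returned unchanged.
shift_cf_name <- function(name, shift = 3) {
  m <- regmatches(name, regexec("^cf([0-9]{2})([a-z])([0-9]{3})$", name))[[1]]
  if (length(m) == 0) return(name)             # not a cfYYxNNN variable
  year <- as.integer(m[2]) + as.integer(shift) # two-digit survey year
  wave <- letters[match(m[3], letters) + shift] # wave letter advances with the year
  sprintf("cf%02d%s%s", year, wave, m[4])
}

# Hypothetical helper: rename every column and flag the rows as time-shifted.
# Assumes columns from waves after the shifted cutoff were removed beforehand.
make_time_shifted <- function(df, shift = 3) {
  names(df) <- vapply(names(df), shift_cf_name, character(1),
                      shift = shift, USE.NAMES = FALSE)
  df$time_shifted <- 1
  df
}

# Toy data with made-up values.
old <- data.frame(nomem_encr = 1:2, cf17j128 = c(1, 2), cf16i128 = c(2, 2))
shifted <- make_time_shifted(old)
print(names(shifted))
```

Rows from the original data would carry `time_shifted = 0` under the same assumption, so the model can distinguish the two sources.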
diff --git a/model.rds b/model.rds
diff --git a/submission.R b/submission.R
@@ -233,6 +233,16 @@ clean_df <- function(df, background_df) {
     "sted_2020",
     # Dwelling type
     "woning_2020",
+    # Gender of first, second, and third child
+    "cf20m068", "cf20m069", "cf20m070", 
+    # Type of parent to first, second, and third child (bio, step, adoptive, foster)
+    "cf20m098", "cf20m099", "cf20m100", 
+    # Current partner is biological parent of first, second, third child
+    "cf20m113", "cf20m114", "cf20m115",
+    # Satisfaction with relationship
+    "cf19l180", "cf20m180",
+    # Satisfaction with family life
+    "cf19l181", "cf20m181",
     # Partner survey: fertility expectations in 2020
     "cf20m128_PartnerSurvey", "cf20m129_PartnerSurvey", "cf20m130_PartnerSurvey",
     # Partner survey: fertility expectations in 2019
@@ -404,7 +414,6 @@ clean_df <- function(df, background_df) {
       woning_2020 = case_when(woning_2020 == 1 ~ 1, woning_2020 %in% 2:4 ~ 0)
     ) %>%
     select(-outcome_available,
-      -cf20m026, -cf19l026, -cf18k026, -cf17j026, -cf16i026, -cf15h026, -cf14g026, -cf13f026, -cf12e026, -cf11d026, -cf10c026, -cf09b026, -cf08a026,
       -cf20m028, -cf19l028, -cf18k028, -cf17j028, -cf16i028, -cf15h028, -cf14g028, -cf13f028, -cf12e028, -cf11d028, -cf10c028, -cf09b028, -cf08a028,
       -ca20g078, -ca20g013,
       -cf20m513,
@@ -433,6 +442,7 @@ clean_df <- function(df, background_df) {
     mutate(
       across(everything(), as.numeric),
       across(c(belbezig_2020, migration_background_bg, oplmet_2020,
+               cf20m098, cf20m099, cf20m100,
                cf08a128, cf09b128, cf10c128, cf11d128, cf12e128,  
                cf13f128, cf14g128, cf15h128, cf16i128, cf17j128,
                cf18k128, cf19l128, cf20m128, 

diff --git a/training.R b/training.R
@@ -38,6 +38,7 @@ train_save_model <- function(cleaned_train_2021to2023, outcome_2021to2023,
   recipe <- recipe(new_child ~ ., original_plus_timeshifted_model_df) %>%
     step_rm(nomem_encr, nohouse_encr) %>%
     step_dummy(c(belbezig_2020, migration_background_bg, oplmet_2020,
+                 cf20m098, cf20m099, cf20m100,
                  cf08a128, cf09b128, cf10c128, cf11d128, cf12e128,  
                  cf13f128, cf14g128, cf15h128, cf16i128, cf17j128,
                  cf18k128, cf19l128, cf20m128,
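
The type-of-parent variables (cf20m098 to cf20m100) are added to the dummy-coding step, presumably because their codes (bio, step, adoptive, foster) are categories rather than magnitudes. The base-R sketch below shows the general idea on toy data. It does not use the recipes pipeline from training.R, and the code values are made up rather than taken from the codebook.

```r
# Toy data: hypothetical codes for "type of parent to first child".
toy <- data.frame(cf20m098 = c(1, 2, 1, 4))

# Treat the numeric codes as unordered categories.
toy$cf20m098 <- factor(toy$cf20m098)

# Build indicator columns; the first level (1) serves as the reference
# category, and the intercept column is dropped.
dummies <- model.matrix(~ cf20m098, data = toy)[, -1, drop = FALSE]
print(colnames(dummies))
```

Without this step, a tree model would treat a code of 4 as "larger than" a code of 2, which carries no meaning for a nominal variable.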