How to utilize the time-shifted data in our workflow #16

emilycantrell · 2024-05-06T20:51:06Z

@HanzhangRen I'm thinking about the logistics of how to insert the time-shifted data into our workflow. I think the following will work:

Create a CSV file with the time-shifted data using our own code file (NOT within submission.R or training.R). We will want to post this code to GitHub eventually for the sake of documentation, but I don't think it needs to be on GitHub in order for our submission to work. We should check with Gert and Lisa before posting this code, since it is similar to the code they used for the real outcome, which they asked us not to share with other teams.
In training.R, read in the CSV file that has the time-shifted data, and use it alongside the regular training data to train our model (details of how to put it into cross-validation are discussed in Setting up cross-validation #4). If I understand correctly, training.R is only run by us, so it's okay if it relies on a CSV file that only we have access to. @HanzhangRen does that sound right to you?
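For illustration, the second step could be sketched roughly as below. This is only a minimal sketch under assumptions: the helper `combine_training_data`, the column names (`id`, `age`, `new_child`), and the temporary files standing in for the real CSVs are all hypothetical, and the actual training.R may combine the data differently.

```r
# Hypothetical sketch: combine the regular training data with the time-shifted
# data before model fitting. Function, file, and column names are assumptions.
combine_training_data <- function(original_path, shifted_path) {
  original <- read.csv(original_path, stringsAsFactors = FALSE)
  shifted  <- read.csv(shifted_path, stringsAsFactors = FALSE)

  # The time-shifted variables are renamed to match the "real" variable names,
  # so the two files are expected to share the same set of columns.
  stopifnot(setequal(names(original), names(shifted)))

  # Stack the rows, aligning the shifted columns to the original column order.
  rbind(original, shifted[, names(original), drop = FALSE])
}

# Toy demonstration with temporary files standing in for the real CSVs.
original_file <- tempfile(fileext = ".csv")
shifted_file  <- tempfile(fileext = ".csv")
write.csv(data.frame(id = c(1, 2), age = c(30, 41), new_child = c(0, 1)),
          original_file, row.names = FALSE)
write.csv(data.frame(age = c(27, 38), id = c(1, 2), new_child = c(1, 0)),
          shifted_file, row.names = FALSE)

train <- combine_training_data(original_file, shifted_file)
```

Because the shifted file shares column names with the original, the combined data frame can be passed to the existing training code without renaming anything.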

The following don't require any special changes:

submission.R is unaffected by the inclusion of time-shifted data (since I'll be renaming all time-shifted variables to have the same names as the "real" variables)
model.rds works the same as always (the parameters of the model will probably be different since we are training with different data, but we don't need to do anything special to get model.rds to work when we make the submission)

Does all of this align with your understanding?

HanzhangRen · 2024-05-06T23:51:23Z

@emilycantrell All this aligns with my understanding!

emilycantrell · 2024-05-08T03:08:20Z

@HanzhangRen A first draft of the time-shifted data is almost ready! Tomorrow I'll set up the cross-validation code to work with it. I'm unclear on where to read in the data files. It looks like I should indicate the path to my data files here, but I'm confused about where to read in PreFer_train_outcome.csv.

Is DATA_FILE supposed to be PreFer_train_outcome.csv, and is BACKGROUND_DATA supposed to be train_data.csv? In which case, PreFer_train_background_data.csv isn't used? Or is there a different spot to read in the outcome data?

(P.S. I know to just do this locally of course, since we can't post the data files to GitHub. I'll share the time-shifted data CSV files I create with you so that you can also do it locally.)

HanzhangRen · 2024-05-08T04:11:00Z

DATA_FILE is supposed to be PreFer_train_data.csv/PreFer_fake_data.csv, and BACKGROUND_DATA is supposed to be PreFer_train_background_data.csv/PreFer_fake_background_data.csv.

Background data refers to an extended dataset that we did not really use in our code. See these two sections in the dataset guide:
https://stulp.gmw.rug.nl/prefer/posts/posts/2024-03-20-prefer-datasets.html#background
https://stulp.gmw.rug.nl/prefer/posts/posts/2024-03-20-prefer-datasets.html#prefer_train_background_data.csv

My understanding is that "Rscript run.R PreFer_fake_data.csv PreFer_fake_background_data.csv" does not really use the outcome data and only applies model.rds to the fake data to produce predictions. In other words, running this code does not train the model. The trained model.rds object is something we need to create ourselves using training.R.

Some other code would then compare these predictions to actual outcomes.
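To make the split between training and prediction concrete, here is a minimal sketch. It assumes a stand-in logistic regression on toy data and a hypothetical helper `predict_from_saved_model`; it is not the team's actual model or the real run.R logic.

```r
# Training side (what a training script would do): fit a model and save it.
# The logistic regression and toy data are stand-ins, not the real model.
toy_train <- data.frame(age = c(25, 29, 31, 35, 38, 44),
                        new_child = c(1, 1, 0, 0, 1, 0))
model_path <- tempfile(fileext = ".rds")
saveRDS(glm(new_child ~ age, data = toy_train, family = binomial()), model_path)

# Prediction side: load the saved model and apply it to new data.
# No outcome column is needed here, and nothing is re-trained.
predict_from_saved_model <- function(df, model_path) {
  model <- readRDS(model_path)
  prob <- predict(model, newdata = df, type = "response")
  data.frame(id = df$id, prediction = as.integer(prob > 0.5))
}

new_data <- data.frame(id = 1:3, age = c(26, 33, 45))
predictions <- predict_from_saved_model(new_data, model_path)
```

The point of the sketch is only that the prediction step reads a saved object and never sees the outcome file; scoring against actual outcomes happens elsewhere.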

emilycantrell · 2024-05-08T11:49:06Z

Got it. Thanks!

emilycantrell · 2024-05-09T05:59:17Z

> submission.R is unaffected by the inclusion of time-shifted data (since I'll be renaming all time-shifted variables to have the same names as the "real" variables)

Update: submission.R actually required a small change. This commit creates an indicator for whether the data was time-shifted. In our training data, the "original" data has a value of 0 for time shift, and the data associated with the 2018-2020 outcome period has a value of 1 for time shift. In the holdout data, everyone should have a value of 0 for time shift. So, the code in the commit checks whether there is already a time_shift column, and if there is no time_shift column, it generates a time_shift column with a value of 0 for everybody, which is what will need to happen on the holdout data.
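The check described above might look roughly like the following. This is a sketch, not the code from the commit: the helper name `add_time_shift_if_missing` and the toy columns are assumptions, and only the `time_shift` column name comes from the description.

```r
# Hypothetical helper: add a time_shift column of zeros only when it is absent.
add_time_shift_if_missing <- function(df) {
  if (!"time_shift" %in% names(df)) {
    # Holdout-style data has no time_shift column, so everyone gets 0.
    df$time_shift <- rep(0, nrow(df))
  }
  df
}

# Holdout-style data: no time_shift column yet.
holdout <- data.frame(id = 1:2, age = c(30, 41))
# Training-style data: time_shift already present (0 = original, 1 = shifted).
training <- data.frame(id = 1:2, age = c(27, 38), time_shift = c(0, 1))

holdout_out  <- add_time_shift_if_missing(holdout)
training_out <- add_time_shift_if_missing(training)
```

With this pattern the same preprocessing function can run on both the training data and the holdout data, and existing time_shift values are left untouched.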

emilycantrell · 2024-05-17T22:10:24Z

We've successfully integrated the time-shifted data into the code. This issue is ready to close!

emilycantrell added a commit that referenced this issue May 9, 2024

Add time-shifted data and use household IDs in cross-validation #4 #16

f20de01

emilycantrell closed this as completed May 17, 2024
