How to utilize the time-shifted data in our workflow #16

Closed
emilycantrell opened this issue May 6, 2024 · 6 comments

Comments

@emilycantrell
Collaborator

@HanzhangRen I'm thinking about the logistics of how to insert the time-shifted data into our workflow. I think the following will work:

  • Create a csv file with the time-shifted data using our own code file (NOT within submission.R or training.R). We will want to post this code to github eventually for the sake of documentation, but I don't think it needs to be on github in order for our submission to work. We should check with Gert and Lisa before posting this code, since it is similar to the code they used for the real outcome, which they asked us not to share with other teams.
  • In training.R, read in the CSV file that has the time-shifted data, and use it alongside the regular training data to train our model (details of how to put it into cross-validation are discussed in Setting up cross-validation #4). If I understand correctly, training.R is only run by us, so it's okay if it relies on a CSV file that only we have access to. @HanzhangRen does that sound right to you?
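
A minimal sketch of the second step, for illustration only. The file name `time_shifted_data.csv` and the columns `nomem_encr`, `age`, and `new_child` are assumptions, and the toy data frames stand in for the real files, which cannot be posted here. It shows only that once the time-shifted variables share names with the "real" ones, the two sets of rows can be stacked before training:

```r
# Sketch only: column names and values are made up for illustration.
original <- data.frame(nomem_encr = c(1, 2), age = c(30, 41), new_child = c(0, 1))
shifted  <- data.frame(nomem_encr = c(1, 2), age = c(27, 38), new_child = c(0, 0))

# In training.R the time-shifted rows would come from a local CSV, e.g.
#   shifted <- read.csv("time_shifted_data.csv")  # hypothetical file name
# A temporary file stands in for that local CSV here.
shifted_path <- tempfile(fileext = ".csv")
write.csv(shifted, shifted_path, row.names = FALSE)
shifted <- read.csv(shifted_path)

# The time-shifted variables were renamed to match the "real" ones,
# so the columns should line up and the rows can be stacked.
stopifnot(identical(names(original), names(shifted)))
train <- rbind(original, shifted)
```

The `stopifnot()` check is a cheap guard against a renaming mistake in the time-shifted file before any model is fit.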

The following don't require any special changes:

  • submission.R is unaffected by the inclusion of time-shifted data (since I'll be renaming all time-shifted variables to have the same names as the "real" variables)
  • model.rds works the same as always (the parameters of the model will probably be different since we are training with different data, but we don't need to do anything special to get model.rds to work when we make the submission)

Does all of this align with your understanding?

@HanzhangRen
Collaborator

@emilycantrell All this aligns with my understanding!

@emilycantrell
Collaborator Author

@HanzhangRen A first draft of the time-shifted data is almost ready! Tomorrow I'll set up the cross-validation code to work with it. I'm unclear on where to read in the data files. It looks like I should indicate the path to my data files here, but I'm confused about where to read in PreFer_train_outcome.csv.

Is DATA_FILE supposed to be PreFer_train_outcome.csv, and is BACKGROUND_DATA supposed to be train_data.csv? In which case, PreFer_train_background_data.csv isn't used? Or is there a different spot to read in the outcome data?

(P.S. I know to just do this locally of course, since we can't post the data files to github. I'll share the time-shifted data csv files I create with you so that you can also do it locally.)

@HanzhangRen
Collaborator

HanzhangRen commented May 8, 2024

> @HanzhangRen A first draft of the time-shifted data is almost ready! Tomorrow I'll set up the cross-validation code to work with it. I'm unclear on where to read in the data files. It looks like I should indicate the path to my data files here, but I'm confused about where to read in PreFer_train_outcome.csv.
>
> Is DATA_FILE supposed to be PreFer_train_outcome.csv, and is BACKGROUND_DATA supposed to be train_data.csv? In which case, PreFer_train_background_data.csv isn't used? Or is there a different spot to read in the outcome data?
>
> (P.S. I know to just do this locally of course, since we can't post the data files to github. I'll share the time-shifted data csv files I create with you so that you can also do it locally.)

DATA_FILE is supposed to be PreFer_train_data.csv (or PreFer_fake_data.csv), and BACKGROUND_DATA is supposed to be PreFer_train_background_data.csv (or PreFer_fake_background_data.csv).

Background data refers to an extended dataset that we did not really use in our code. See these two sections in the dataset guide:
https://stulp.gmw.rug.nl/prefer/posts/posts/2024-03-20-prefer-datasets.html#background
https://stulp.gmw.rug.nl/prefer/posts/posts/2024-03-20-prefer-datasets.html#prefer_train_background_data.csv

My understanding is that "Rscript run.R PreFer_fake_data.csv PreFer_fake_background_data.csv" does not really use the outcome data and only applies model.rds to the fake data to produce predictions. In other words, running this code does not train the model. The trained model.rds object is something we need to create ourselves using training.R.

Some other code would then compare these predictions to actual outcomes.
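
To illustrate the split between training and prediction described above, here is a rough sketch. It is not the actual run.R or submission.R code: the `glm` model, the toy data, and the column names `age`, `new_child`, and `nomem_encr` are all assumptions made for the example.

```r
# Training step (done by us, separately): fit a model and save it to disk.
# A logistic regression on toy data stands in for the real model.
train <- data.frame(age = c(25, 30, 35, 40, 45, 50),
                    new_child = c(1, 1, 0, 1, 0, 0))
model <- glm(new_child ~ age, data = train, family = binomial)
model_path <- file.path(tempdir(), "model.rds")
saveRDS(model, model_path)

# Prediction step: load the saved model and apply it to new rows.
# No outcome column is needed here, and nothing is re-trained.
new_data <- data.frame(nomem_encr = c(101, 102), age = c(28, 47))
loaded <- readRDS(model_path)
prob <- predict(loaded, newdata = new_data, type = "response")
predictions <- data.frame(nomem_encr = new_data$nomem_encr,
                          prediction = as.integer(prob > 0.5))
```

The point of the sketch is that the prediction step only needs the saved model object and the predictor columns, which is why the outcome file is never passed in.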

@emilycantrell
Collaborator Author

Got it. Thanks!

@emilycantrell
Collaborator Author

> submission.R is unaffected by the inclusion of time-shifted data (since I'll be renaming all time-shifted variables to have the same names as the "real" variables)

Update: submission.R actually required a small change. This commit creates an indicator for whether the data was time-shifted. In our training data, the "original" data has a value of 0 for time shift, and the data associated with the 2018-2020 outcome period has a value of 1 for time shift. In the holdout data, everyone should have a value of 0 for time shift. So, the code in the commit checks whether there is already a time_shift column, and if there is no time_shift column, it generates a time_shift column with a value of 0 for everybody, which is what will need to happen on the holdout data.
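
The check described above could look roughly like the following. This is a sketch, not the code from the commit: the helper name `add_time_shift` and the `nomem_encr` column are hypothetical, and only the `time_shift` column name comes from the description.

```r
# Hypothetical helper illustrating the check; not the code from the commit.
add_time_shift <- function(df) {
  # Holdout data has no time_shift column, so everyone gets a value of 0.
  # Training data that already carries the column is left untouched.
  if (!"time_shift" %in% names(df)) {
    df$time_shift <- rep(0, nrow(df))
  }
  df
}

# Holdout-style data: no time_shift column yet.
holdout <- add_time_shift(data.frame(nomem_encr = c(1, 2, 3)))

# Training-style data: existing 0/1 values are preserved.
training <- add_time_shift(data.frame(nomem_encr = c(1, 1), time_shift = c(0, 1)))
```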

@emilycantrell
Collaborator Author

We've successfully integrated the time-shifted data into the code. This issue is ready to close!
