---
title: "Snowbird precipitation models"
author: "Jeremy Boyd"
date: "2/9/2021"
output: html_document
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(reticulate)
library(tidyverse)
library(lubridate)    # year() and yday() are used below
library(imputeTS)
library(feather)
source("get_station_data.R")
source("get_buoy_data.R")
source("prep_data_for_keras.R")
source_python("fit_lstm.py")
source_python("fit_cnn_lstm.py")
```
## Get Snowbird precipitation
```{r get_precip}
# Get Snowbird precipitation
# NOTE: This function only takes a few seconds, so there's little downside to
# re-running it from scratch each time; no need to revise it to fetch only the
# newest data.
sb_list <- get_station_data(station_id = "USS0011J42S",
                            min_date = "1980-01-01")
# View flags
sb_list$flags
# View data
sb_list$data
summary(sb_list$data)
```
## Get wave heights from buoy
```{r get_wave_heights}
# Read in saved data...
wvht <- read_feather("wvht_buoy_51101.feather")
# ... or pull freshest dataset from NDBC and save
# wvht <- get_buoy_data(buoy_id = "51101")
# write_feather(wvht, "wvht_buoy_51101.feather")
# View
wvht
summary(wvht)
```
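The read-or-refresh choice above can also be written as a single branch, so the chunk runs unattended; a minimal sketch using the same file name and the sourced `get_buoy_data()`:
```{r wvht_cache_sketch, eval = FALSE}
# Use the cached feather file if present; otherwise pull the freshest data
# from NDBC and cache it for next time.
path <- "wvht_buoy_51101.feather"
if (file.exists(path)) {
    wvht <- read_feather(path)
} else {
    wvht <- get_buoy_data(buoy_id = "51101")
    write_feather(wvht, path)
}
```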
## Combine precipitation and wave height data
```{r combine}
# Combine into a single dataset
df <- sb_list$data %>%
    inner_join(wvht, by = "date")

# View summary. After the join, the variable with the most gaps is wvht, at
# about 14% missing.
summary(df)
```
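The 14% figure can be confirmed directly; a quick sketch computing the share of missing values in each column of `df`:
```{r missingness_sketch, eval = FALSE}
# Percent missing per column, sorted worst-first; wvht should top the list
df %>%
    summarise(across(everything(), ~ mean(is.na(.x)) * 100)) %>%
    pivot_longer(everything(), names_to = "variable", values_to = "pct_na") %>%
    arrange(desc(pct_na))
```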
## Impute missing values
```{r impute}
# There's almost no wvht data for 2008
df %>%
    mutate(year = year(date),
           yday = yday(date)) %>%
    ggplot(aes(x = yday, y = wvht)) +
    geom_path() +
    facet_wrap(~ year)

# Drop everything prior to April 21, 2009, which avoids all early NAs
df2 <- df %>% filter(date > "2009-04-20")

# Keep only the time-series columns
ts <- df2 %>% select(-matches("id|date|tobs"))
ts_names <- names(ts)

# Plot distribution of NAs in each series
map(ts_names, function(name) {
    df2 %>%
        pull(name) %>%
        ggplot_na_distribution(title = name)
})

# Do imputations
ts_imp <- map_dfc(ts_names, function(name) {

    # Current ts with no imputation
    no_imp <- df2 %>% pull(name)

    # Only impute if the series has missing values
    if (sum(is.na(no_imp)) > 0) {
        imp <- na_seadec(no_imp, find_frequency = TRUE)
    } else {
        imp <- no_imp
    }

    # Return this
    tibble(imp)
})
names(ts_imp) <- ts_names

# Check imputation quality. Not great, especially for the temperature series,
# but probably better than nothing.
map(ts_names, function(name) {
    if (sum(is.na(ts %>% pull(name))) > 0) {
        ggplot_na_imputations(ts %>% pull(name),
                              ts_imp %>% pull(name),
                              title = name)
    }
})

# Add date
ts_imp <- bind_cols(df2 %>% select(date), ts_imp)
ts <- bind_cols(df2 %>% select(date), ts)
```
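For reference, `na_seadec()` works by removing the seasonal component, imputing the de-seasonalized remainder (interpolation by default), then adding the seasonal component back. A toy illustration on built-in data, separate from the project series:
```{r na_seadec_demo, eval = FALSE}
# Knock holes in a strongly seasonal series, impute, and inspect the result
x <- as.numeric(AirPassengers)
x[c(20, 21, 60, 100)] <- NA
x_imp <- na_seadec(ts(x, frequency = 12))
ggplot_na_imputations(x, as.numeric(x_imp))
```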
## Fit different datasets to CNN-LSTM
- The powder buoy guy isn't really making a prediction from two weeks out (see <https://www.skiutah.com/blog/authors/erika/the-man-behind-the-powder-buoy-and>). He's combining wave heights from two weeks out with barometric pressure readings from a few days before.
- To make what I'm doing a bit more like what he's doing, use an 18-day window prior to t to predict what happens at t. This lets me detect wave activity that's further in the past while also using more recent meteorological measurements (see the sketch after this list).
- Read Chollet chapters on CNN...
- Am I not getting good generalization because there are too many hidden units in the LSTM layer?
- Validation loss looks similar for different inputs...
- Results are similar regardless of whether I convert to a matrix in R (as.matrix, in the prep function) or in Python (to_numpy, in the fit function). But they're definitely different if I don't convert at all and pass in pandas data frames: no errors are thrown, yet the results differ enough to make me think keras really can't handle pandas input.
- Fitting a model to the eight-predictor dataset for 300 epochs shows the classic overfitting pattern starting around 75 epochs: validation loss starts to go back up while training loss continues down.
- Overfitting pattern starts ~ 75 epochs with learning rate of .001. Goes down to ~ 25 with a learning rate of .01. So we can train a model to overfit much faster with a larger learning rate.
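Here's the windowing sketch referenced above. This is my assumption about what `prep_data_for_keras()` does internally, not its actual code; single-predictor version for clarity:
```{r window_sketch, eval = FALSE}
# Each sample's input is the `past` = 18 days ending the day before t;
# the target is prcp at day t.
past <- 18
x <- ts_imp$wvht
n <- length(x) - past
inputs <- t(vapply(seq_len(n),
                   function(i) x[i:(i + past - 1)],
                   numeric(past)))              # n x 18 matrix of windows
targets <- ts_imp$prcp[(past + 1):(past + n)]   # day-t outcomes
dim(inputs)
```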
```{r model_different_datasets}
# Set variables
past <- 18
future <- 1
step <- 1L
batch_size <- 256L
epochs <- 600L
patience <- epochs
recurrent_dropout <- .75
other_dropout <- .75
# Learning rate. Overfitting sets in faster with .01 than with .001.
learning_rate <- 0.004
# Different sets of predictors to use in models
pred_list <- list(c("yday"),
c("prcp"),
c("wvht"),
c("prcp", "yday"),
c("wvht", "prcp"),
c(names(ts_imp)[-1], "yday"))
# Loop over different sets of predictors
results <- map(pred_list, function(preds) {

    # Prep datasets
    prep_list <- prep_data_for_keras(data = ts_imp,
                                     outcome = "prcp",
                                     predictors = preds,
                                     split_fraction = .8,
                                     past = past,
                                     future = future)

    # Fit CNN-LSTM model
    fit_cnn_lstm(
        data = prep_list,
        step = step,
        learning_rate = learning_rate,
        batch_size = batch_size,
        epochs = epochs,
        patience = patience,
        recurrent_dropout = recurrent_dropout,
        other_dropout = other_dropout)
})
# Visualize losses for all models
map(results, function(model) {
    tibble(epoch = seq_along(model$loss),
           train = model$loss,
           val = model$val_loss) %>%
        pivot_longer(cols = matches("train|val"),
                     names_to = "type",
                     values_to = "loss") %>%
        ggplot(aes(x = epoch, y = loss, color = type)) +
        geom_path() +
        scale_y_continuous(name = "loss (normalized MAE)") +
        ggtitle(model$preds)
})
# Visualize actual versus predicted for all models
rsq <- function(x, y) summary(lm(y ~ x))$r.squared
map(results, function(model) {

    # Return predictions to the original scale. outcome_mean and outcome_sd
    # are assumed to be the normalization stats computed in
    # prep_data_for_keras().
    actual <- model$actual_y * outcome_sd + outcome_mean
    predicted <- model$pred_y * outcome_sd + outcome_mean

    # Plot
    tibble(actual = actual,
           predicted = predicted) %>%
        ggplot(aes(x = actual, y = predicted)) +
        geom_point(alpha = .2, position = position_jitter()) +
        geom_smooth(aes(x = actual, y = actual), method = "lm") +
        scale_x_continuous(name = "Actual prcp (mm)") +
        scale_y_continuous(name = "Predicted prcp (mm)") +
        ggtitle(paste0(
            model$preds, "\n",
            "R-squared: ", round(rsq(actual, predicted), digits = 2)))
})
# With these parameters, validation loss bottoms out around 200 epochs. TODO:
# retrain with training stopped at ~200 epochs, then redo the actual vs.
# predicted plot from that model.
```
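On that last note: `patience <- epochs` above means early stopping effectively never fires. Assuming `fit_cnn_lstm()` forwards `patience` to a Keras EarlyStopping callback (the argument name suggests it does, but that's an assumption), two ways to stop near the ~200-epoch validation minimum:
```{r retrain_sketch, eval = FALSE}
# Option 1: hard-cap training near the observed validation-loss minimum
epochs <- 200L
patience <- epochs

# Option 2: keep the 600-epoch budget but halt ~25 epochs after the best
# validation loss (assumes patience maps to Keras EarlyStopping patience)
epochs <- 600L
patience <- 25L
```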
## CNN-LSTM
```{r cnn-lstm}
# Set variables
past <- 18
future <- 1
step <- 1L
batch_size <- 256L
epochs <- 50L
patience <- epochs
# Learning rate. Overfitting sets in faster with .01 than with .001.
learning_rate <- 0.01
# Different sets of predictors to use in models
pred_list <- list(c("yday"),
c("prcp"),
c("wvht"),
c("prcp", "yday"),
c("wvht", "prcp"),
c(names(ts_imp)[-1], "yday"))
# Loop over different sets of predictors
results <- map(pred_list, function(preds) {

    # Prep datasets
    prep_list <- prep_data_for_keras(data = ts_imp,
                                     outcome = "prcp",
                                     predictors = preds,
                                     split_fraction = .8,
                                     past = past,
                                     future = future)

    # Run model
    fit_cnn_lstm(
        data = prep_list,
        step = step,
        learning_rate = learning_rate,
        batch_size = batch_size,
        epochs = epochs,
        patience = patience)
})
# Visualize losses for all models
map(results, function(model) {
    tibble(epoch = seq_along(model$loss),
           train = model$loss,
           val = model$val_loss) %>%
        pivot_longer(cols = matches("train|val"),
                     names_to = "type",
                     values_to = "loss") %>%
        ggplot(aes(x = epoch, y = loss, color = type)) +
        geom_path() +
        scale_y_continuous(name = "loss (normalized MAE)") +
        ggtitle(model$preds)
})
# Visualize actual versus predicted for all models
rsq <- function(x, y) summary(lm(y ~ x))$r.squared
map(results, function(model) {
    tibble(actual = model$actual_y,
           predicted = model$pred_y) %>%
        ggplot(aes(x = actual, y = predicted)) +
        geom_jitter(alpha = .2) +
        geom_smooth(aes(x = actual, y = actual), method = "lm") +
        ggtitle(paste0(
            model$preds, "\n",
            "R-squared: ", round(rsq(model$actual_y, model$pred_y),
                                 digits = 2)))
})
```
## Model 2 is the same as model 1, but trained for 2 epochs
```{python single_layer_LSTM_2, eval = FALSE}
# Refit to 2 epochs
history2 = model.fit(
    dataset_train,
    epochs = 2,
    validation_data = dataset_val,
    callbacks = [es_callback, modelckpt_callback],
    verbose = 1,
)
# Make the training history available to R by storing losses as separate lists
train_loss2 = history2.history["loss"]
val_loss2 = history2.history["val_loss"]

# Check validation set sizes
print(len(x_val))
print(len(y_val))

# Pull out the first batch of validation inputs & targets; the predictions
# below cover only this batch, not the full validation set
for batch in dataset_val.take(1):
    inputs_val, targets_val = batch
# Turn targets into a 1d array for easier reading into R
targets_val = targets_val[:, 0].numpy()
# Make predictions
pred_val = model.predict(inputs_val)
pred_val = pred_val[:, 0]
```
## Visualize validation actual versus predicted for model 2
The plot below shows that predictions aren't good given the current setup:
- Inputs cover the previous 18 time steps (~2.5 weeks)
- Predict one day into the future
- Single-layer LSTM
- Wave height is the only predictor of precipitation
```{r actual_predicted, eval = FALSE}
act_pred <- tibble(actual = py$y_val[1:256, "0"],
                   pred = py$pred_val)
# Function to compute r-squared
rsq <- function(x, y) summary(lm(y ~ x))$r.squared
# Actual precip values cover a fairly wide range, but the model only predicts
# values in a narrow window, from ~0 to 0.25.
# The picture suggests that a naive model that always predicts 0 precipitation
# might do well, since most days don't have any.
# A better naive model would predict each month's historical average for every
# day in that month. Are we any better than that?
act_pred %>%
ggplot(aes(x = actual, y = pred)) +
geom_point(alpha = .25) +
geom_line(aes(x = actual, y = actual, color = "red")) +
scale_x_continuous(name = "Actual precipitation (in)") +
scale_y_continuous(name = "Predicted precipitation (in)") +
ggtitle(paste("Model 2 r-squared = ",
round(rsq(act_pred$actual, act_pred$pred), digits = 3)))
```
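To make the naive-baseline question above concrete, here's a minimal sketch (assuming `ts_imp` from the imputation chunk and the `rsq()` helper defined above) that scores the predict-the-monthly-historical-average model:
```{r naive_baseline_sketch, eval = FALSE}
# Naive model: for every day, predict the historical mean prcp of that
# day's month. Computed in-sample here; for a fair comparison with the
# fitted models, estimate the monthly means on the training period only.
naive <- ts_imp %>%
    mutate(month = month(date)) %>%
    group_by(month) %>%
    mutate(pred_naive = mean(prcp)) %>%
    ungroup()
rsq(naive$prcp, naive$pred_naive)
```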