Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ELEX-2418-find-optimal-lambda #56

Open
wants to merge 58 commits into
base: develop
Choose a base branch
from
Open

ELEX-2418-find-optimal-lambda #56

wants to merge 58 commits into from

Conversation

rishasurana
Copy link
Contributor

@rishasurana rishasurana commented Jun 15, 2023

Description

We can pass a regularization constant to our model lambda_, but on election night we don’t know what the optimal lambda_ is. Add a feature to the client that take options for lambda_, the features and fixed effects. It should read in the data and use k-fold cross validation to find the optimal lambda as the lowest average MAPE on the k hold out sets. The function should probably return the lambda_ and the average MAPE.

AC:

  1. A function that takes a list of possible lambda_ values, the features and fixed effects we are using, any other inputs that are necessary and run k-fold cross validation to pick the best lambda_ value to use
  2. Unit tests

Jira Ticket

https://arcpublishing.atlassian.net/browse/ELEX-2418

@rishasurana rishasurana changed the title Calculate lambda based on a DTC in config Cross-validate for best regularization lambda Jun 15, 2023
@rishasurana rishasurana self-assigned this Jun 16, 2023
@rishasurana rishasurana marked this pull request as ready for review June 20, 2023 14:45
@rishasurana rishasurana requested a review from a team as a code owner June 20, 2023 14:45
jchaskell

This comment was marked as resolved.

@rishasurana rishasurana requested review from jchaskell and lennybronner and removed request for lennybronner and jchaskell June 22, 2023 20:51
@rishasurana rishasurana changed the title Cross-validate for best regularization lambda ELEX-2418-find-optimal-lambda Jun 30, 2023
lennybronner

This comment was marked as resolved.

@rishasurana

This comment was marked as resolved.

lennybronner

This comment was marked as resolved.

@rishasurana
Copy link
Contributor Author

rishasurana commented Jul 6, 2023

If I run the code with postal code fixed effects I get a nan error:

elexmodel 2020-11-03_USA_G --office_id=P --estimands=dem --geographic_unit_type=county --pi_method=nonparametric --percent_reporting=30 --aggregates=postal_code --aggregates=unit --fixed_effects=postal_code --lambda='[1, 10, 100, 1000]'

When getting unit predictions for each fold nonreporting_units_features = self.featurizer.featurize_heldout_data(nonreporting_units), there is number of NAN values (1-2% of the total df) in nonreporting_units_features post calling featurize_heldout_data on the test set of each fold. I can't seem to replicate this issue for other fixed-effect/aggregate combinations and it works fine when not splitting the data into test/train sets. Each time I run this command, certain states have NAN values and cause errors more often than others (HI, DE, NJ). Which states cause errors varies each time because the fold data is now randomly selected.

Example runs:
Columns that have NAN values, total NAN values in dataframe

  1. ['postal_code_HI'], 303
  2. ['postal_code_HI', 'postal_code_NJ', 'postal_code_NV'], 927
  3. ['postal_code_DE', 'postal_code_NJ', 'postal_code_WY'], 930
  4. ['postal_code_HI', 'postal_code_WA'], 602
  5. ['postal_code_DE', 'postal_code_MD'], 632
  6. ['postal_code_DE'], 281

Using fillna(0) here in the Featurizer class: new_heldout_data = new_heldout_data.join(missing_expanded_fixed_effects_df) resolves this issue because it seems like certain states may be inconsistent in certain folds for the smaller test sets and so when we use .join it fills them with NAN.

Switching to use .merge doesn't work here because there are no common columns to merge on.

@rishasurana rishasurana requested review from lennybronner and removed request for lennybronner July 6, 2023 03:32
@rishasurana

This comment was marked as resolved.

@jchaskell

This comment was marked as resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants