WIP: JSON Post Processing Script #252

Draft: TimothyWillard wants to merge 9 commits into main

Conversation

TimothyWillard
Contributor

Draft version of a postprocessing script for converting the parquet/csv files found in `model_output` to the JSON files expected by https://github.com/HopkinsIDD/covid-dashboard-app. This addresses GH-237. There is still a lot of work to do here: the script currently does not import gempyor, so some of its functionality may be duplicated, and for the functionality that does not exist elsewhere it may make sense to move it from this script into gempyor.

A current example usage is:

~> python postprocessing/json_reformatter.py \
    --hosp-directory /Users/twillard/Desktop/USA-20240517T205228/model_output/USA_inference_all/SMH_R18_noBoo_lowIE_blk1/hosp/global/final \
    --spar-directory /Users/twillard/Desktop/USA-20240517T205228/model_output/USA_inference_all/SMH_R18_noBoo_lowIE_blk1/spar/global/final \
    --output-directory /Users/twillard/Desktop/USA-20240517T205228/json_files/ \
    --scenario-name SMH_R18_noBoo_lowIE_blk1 \
    --severity-name low \
    --rounding 1 \
    --verbose
[2024-07-12 15:31:06] Using 'SMH_R18_noBoo_lowIE_blk1' as the scenario name.
[2024-07-12 15:31:06] Reading the spar files in /Users/twillard/Desktop/USA-20240517T205228/model_output/USA_inference_all/SMH_R18_noBoo_lowIE_blk1/spar/global/final, found 300 parquet files.
[2024-07-12 15:31:06] Finished reading the spar files in /Users/twillard/Desktop/USA-20240517T205228/model_output/USA_inference_all/SMH_R18_noBoo_lowIE_blk1/spar/global/final. 0.22 seconds elapsed.
[2024-07-12 15:31:06] Reading the hosp files in /Users/twillard/Desktop/USA-20240517T205228/model_output/USA_inference_all/SMH_R18_noBoo_lowIE_blk1/hosp/global/final, found 300 parquet files.
[2024-07-12 15:31:34] Finished reading the hosp files in /Users/twillard/Desktop/USA-20240517T205228/model_output/USA_inference_all/SMH_R18_noBoo_lowIE_blk1/hosp/global/final. 27.33 seconds elapsed.
[2024-07-12 15:31:34] Calculating dates series, summary dataframe, and applying rounding.
[2024-07-12 15:32:19] Finished calculating dates series, summary dataframe, and applying rounding. 45.25 seconds elapsed.
[2024-07-12 15:32:19] Writing outcomes to outcomes.json.
[2024-07-12 15:32:19] Finished writing 3775 bytes to outcomes.json. 0.00 seconds elapsed.
[2024-07-12 15:32:19] Writing geo JSONs.
[2024-07-12 15:39:43] Finished writing 1495613708 bytes to 51 geo JSONs. 443.97 seconds elapsed.
[2024-07-12 15:39:45] Writing outcomes to statsForMap.json.
[2024-07-12 15:40:02] Finished writing 200151890 bytes to statsForMap.json. 16.73 seconds elapsed.
[2024-07-12 15:40:02] Writing valid geoids to validGeoids.json.
[2024-07-12 15:40:02] Finished writing 267 bytes to validGeoids.json. 0.01 seconds elapsed.
[2024-07-12 15:40:02] Finished transforming flepiMoP output for use in the dashboard.
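
For context, the hosp/spar reading step boils down to a loop over the per-slot parquet files, roughly like the sketch below (a minimal sketch rather than the actual implementation; the `read_directory` helper and `slot` column are hypothetical, assuming pandas with the pyarrow engine):

from pathlib import Path

import pandas as pd

def read_directory(directory: Path) -> pd.DataFrame:
    """Read every parquet file in `directory` into one DataFrame.

    Hypothetical sketch of the hosp/spar readers; the real script also
    emits the timing/log lines shown above.
    """
    files = sorted(directory.glob("*.parquet"))
    frames = []
    for slot, file in enumerate(files):
        df = pd.read_parquet(file, engine="pyarrow")
        df["slot"] = slot  # track which simulation slot each row came from
        frames.append(df)
    return pd.concat(frames, ignore_index=True)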

I understand that there are OOM issues with some of the postprocessing scripts, so I did some light profiling as well:

/usr/bin/time -l python postprocessing/json_reformatter.py --hosp-directory /Users/twillard/Desktop/USA-20240517T205228/model_output/USA_inference_all/SMH_R18_noBoo_lowIE_blk1/hosp/global/final --spar-directory /Users/twillard/Desktop/USA-20240517T205228/model_output/USA_inference_all/SMH_R18_noBoo_lowIE_blk1/spar/global/final --output-directory /Users/twillard/Desktop/USA-20240517T205228/json_files/ --scenario-name SMH_R18_noBoo_lowIE_blk1 --severity-name low --rounding 1
      528.89 real       496.04 user        25.96 sys
         20938620928  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
            15373988  page reclaims
                 216  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                2164  voluntary context switches
              100796  involuntary context switches
      10339440980500  instructions retired
       1973063685801  cycles elapsed
         41159872704  peak memory footprint

Looks like the maximum resident set size was 20938620928 bytes, or ~20.94 GB, which seems quite large.
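
One possible direction for trimming that footprint (a sketch only, nothing implemented in this PR; the file and column names are hypothetical) is to load only the columns the JSON outputs need and downcast the value column:

import pandas as pd

# Hypothetical column names; only load what the JSON output needs.
NEEDED_COLUMNS = ["date", "geoid", "incidH"]

df = pd.read_parquet("slot_000.parquet", columns=NEEDED_COLUMNS)
# float32 halves the value column's memory relative to float64.
df["incidH"] = df["incidH"].astype("float32")
# Low-cardinality string keys are much smaller as categoricals.
df["geoid"] = df["geoid"].astype("category")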

TimothyWillard and others added 6 commits June 21, 2024 14:55
Moved the script's entry point into a `main` function per the Google Python style guide.
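That is, the standard pattern:

def main() -> None:
    # All script logic lives here instead of at module scope.
    ...

if __name__ == "__main__":
    main()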
Changed the docstring style from the NumPy style guide to the Google style guide. Also expanded the examples in the docs.
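For reference, a Google-style docstring looks like this (illustrative function, not taken from the script):

def round_values(values: list[float], digits: int) -> list[float]:
    """Round a list of values to a fixed number of digits.

    Args:
        values: The values to round.
        digits: The number of decimal digits to keep.

    Returns:
        The rounded values, in the same order as the input.
    """
    return [round(v, digits) for v in values]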
* Reordered the imports for aesthetics.
* Changed the return type of the `write_*_json` functions from `str` to `Path` for consistency.
If `--severity-name` was manually provided, it was passed as a one-element list of strings rather than as the expected bare string.
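This is a common argparse pitfall (assuming the script uses argparse; the snippet is illustrative, not the actual fix): `nargs=1` always yields a one-element list, while omitting `nargs` yields the bare string.

import argparse

parser = argparse.ArgumentParser()
# Buggy: nargs=1 makes args.severity_name a one-element list, e.g. ["low"].
# parser.add_argument("--severity-name", nargs=1)
# Fixed: without nargs, args.severity_name is the bare string "low".
parser.add_argument("--severity-name")

args = parser.parse_args(["--severity-name", "low"])
assert args.severity_name == "low"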
TimothyWillard marked this pull request as draft on July 12, 2024, 20:08
Small memory efficiency gain for group by operations with pandas.
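For categorical group keys, passing `observed=True` is one common way to get such a gain, since pandas then skips materializing groups for unobserved category levels (a sketch under that assumption; the commit may use a different tweak):

import pandas as pd

df = pd.DataFrame(
    {
        "geoid": pd.Categorical(["06000", "06000", "36000"]),
        "incidH": [1.0, 2.0, 3.0],
    }
)
# observed=True avoids building rows for category levels absent from
# the data, which saves memory on high-cardinality keys.
summary = df.groupby("geoid", observed=True)["incidH"].sum()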
* Create the given output directory if it does not exist.
* Convert the handling of sim values to use numpy directly in `write_geo_jsons`.
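
A sketch of both changes (hypothetical names; assuming pathlib and a pandas column of simulation values):

from pathlib import Path

import pandas as pd

output_directory = Path("json_files")
# Create the output directory (and parents) if it does not already exist.
output_directory.mkdir(parents=True, exist_ok=True)

df = pd.DataFrame({"sim_value": [1.2, 3.4, 5.6]})
# Operating on the raw numpy array avoids pandas overhead in hot loops.
sims = df["sim_value"].to_numpy()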