WIP: JSON Post Processing Script #252

Draft: TimothyWillard wants to merge 9 commits into main

Conversation

TimothyWillard
Contributor

Draft version of a postprocessing script for converting the parquet/csv files found in `model_output` to the JSON files expected by https://github.com/HopkinsIDD/covid-dashboard-app. This addresses GH-237. There is still a lot of work to do here: the script currently does not import gempyor, so some of its functionality may be duplicated, and for the functionality that does not exist elsewhere it may make sense to move it from this script into gempyor.

A current example usage is:

~> python postprocessing/json_reformatter.py \
    --hosp-directory /Users/twillard/Desktop/USA-20240517T205228/model_output/USA_inference_all/SMH_R18_noBoo_lowIE_blk1/hosp/global/final \
    --spar-directory /Users/twillard/Desktop/USA-20240517T205228/model_output/USA_inference_all/SMH_R18_noBoo_lowIE_blk1/spar/global/final \
    --output-directory /Users/twillard/Desktop/USA-20240517T205228/json_files/ \
    --scenario-name SMH_R18_noBoo_lowIE_blk1 \
    --severity-name low \
    --rounding 1 \
    --verbose
[2024-07-12 15:31:06] Using 'SMH_R18_noBoo_lowIE_blk1' as the scenario name.
[2024-07-12 15:31:06] Reading the spar files in /Users/twillard/Desktop/USA-20240517T205228/model_output/USA_inference_all/SMH_R18_noBoo_lowIE_blk1/spar/global/final, found 300 parquet files.
[2024-07-12 15:31:06] Finished reading the spar files in /Users/twillard/Desktop/USA-20240517T205228/model_output/USA_inference_all/SMH_R18_noBoo_lowIE_blk1/spar/global/final. 0.22 seconds elapsed.
[2024-07-12 15:31:06] Reading the hosp files in /Users/twillard/Desktop/USA-20240517T205228/model_output/USA_inference_all/SMH_R18_noBoo_lowIE_blk1/hosp/global/final, found 300 parquet files.
[2024-07-12 15:31:34] Finished reading the hosp files in /Users/twillard/Desktop/USA-20240517T205228/model_output/USA_inference_all/SMH_R18_noBoo_lowIE_blk1/hosp/global/final. 27.33 seconds elapsed.
[2024-07-12 15:31:34] Calculating dates series, summary dataframe, and applying rounding.
[2024-07-12 15:32:19] Finished calculating dates series, summary dataframe, and applying rounding. 45.25 seconds elapsed.
[2024-07-12 15:32:19] Writing outcomes to outcomes.json.
[2024-07-12 15:32:19] Finished writing 3775 bytes to outcomes.json. 0.00 seconds elapsed.
[2024-07-12 15:32:19] Writing geo JSONs.
[2024-07-12 15:39:43] Finished writing 1495613708 bytes to 51 geo JSONs. 443.97 seconds elapsed.
[2024-07-12 15:39:45] Writing outcomes to statsForMap.json.
[2024-07-12 15:40:02] Finished writing 200151890 bytes to statsForMap.json. 16.73 seconds elapsed.
[2024-07-12 15:40:02] Writing valid geoids to validGeoids.json.
[2024-07-12 15:40:02] Finished writing 267 bytes to validGeoids.json. 0.01 seconds elapsed.
[2024-07-12 15:40:02] Finished transforming flepiMoP output for use in the dashboard.
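
For context, the hosp/spar reading step boils down to a loop over the per-slot parquet files, roughly like the sketch below (a minimal sketch rather than the actual implementation; the `read_directory` helper and `slot` column are hypothetical, assuming pandas with the pyarrow engine):

from pathlib import Path

import pandas as pd

def read_directory(directory: Path) -> pd.DataFrame:
    """Read every parquet file in `directory` into one DataFrame.

    Hypothetical sketch of the hosp/spar readers; the real script also
    emits the timing/log lines shown above.
    """
    files = sorted(directory.glob("*.parquet"))
    frames = []
    for slot, file in enumerate(files):
        df = pd.read_parquet(file, engine="pyarrow")
        df["slot"] = slot  # track which simulation slot each row came from
        frames.append(df)
    return pd.concat(frames, ignore_index=True)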

I understand that there are OOM issues with some of the postprocessing scripts, so I did some light profiling as well:

/usr/bin/time -l python postprocessing/json_reformatter.py --hosp-directory /Users/twillard/Desktop/USA-20240517T205228/model_output/USA_inference_all/SMH_R18_noBoo_lowIE_blk1/hosp/global/final --spar-directory /Users/twillard/Desktop/USA-20240517T205228/model_output/USA_inference_all/SMH_R18_noBoo_lowIE_blk1/spar/global/final --output-directory /Users/twillard/Desktop/USA-20240517T205228/json_files/ --scenario-name SMH_R18_noBoo_lowIE_blk1 --severity-name low --rounding 1
      528.89 real       496.04 user        25.96 sys
         20938620928  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
            15373988  page reclaims
                 216  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                2164  voluntary context switches
              100796  involuntary context switches
      10339440980500  instructions retired
       1973063685801  cycles elapsed
         41159872704  peak memory footprint

Looks like the maximum resident set size was 20938620928 bytes, or ~20.94 GB, which seems quite large.
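
One possible direction for trimming that footprint (a sketch only, nothing implemented in this PR; the file and column names are hypothetical) is to load only the columns the JSON outputs need and downcast the value column:

import pandas as pd

# Hypothetical column names; only load what the JSON output needs.
NEEDED_COLUMNS = ["date", "geoid", "incidH"]

df = pd.read_parquet("slot_000.parquet", columns=NEEDED_COLUMNS)
# float32 halves the value column's memory relative to float64.
df["incidH"] = df["incidH"].astype("float32")
# Low-cardinality string keys are much smaller as categoricals.
df["geoid"] = df["geoid"].astype("category")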

TimothyWillard and others added 6 commits June 21, 2024 14:55
Moved the script's entry point into a `main` function per the Google Python style guide.
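That is, the standard pattern:

def main() -> None:
    # All script logic lives here instead of at module scope.
    ...

if __name__ == "__main__":
    main()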
Changed the docstring style from the NumPy style guide to the Google style guide. Also expanded the examples in the docs.
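For reference, a Google-style docstring looks like this (illustrative function, not taken from the script):

def round_values(values: list[float], digits: int) -> list[float]:
    """Round a list of values to a fixed number of digits.

    Args:
        values: The values to round.
        digits: The number of decimal digits to keep.

    Returns:
        The rounded values, in the same order as the input.
    """
    return [round(v, digits) for v in values]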
* Reordered the imports for aesthetics.
* Changed the return type of the `write_*_json` functions from `str` to `Path` for consistency.
If `--severity-name` was manually provided, it was passed as a one-element list of strings rather than as the expected bare string.
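This is a common argparse pitfall (assuming the script uses argparse; the snippet is illustrative, not the actual fix): `nargs=1` always yields a one-element list, while omitting `nargs` yields the bare string.

import argparse

parser = argparse.ArgumentParser()
# Buggy: nargs=1 makes args.severity_name a one-element list, e.g. ["low"].
# parser.add_argument("--severity-name", nargs=1)
# Fixed: without nargs, args.severity_name is the bare string "low".
parser.add_argument("--severity-name")

args = parser.parse_args(["--severity-name", "low"])
assert args.severity_name == "low"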
TimothyWillard marked this pull request as draft on July 12, 2024, 20:08
Small memory efficiency gain for group by operations with pandas.
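For categorical group keys, passing `observed=True` is one common way to get such a gain, since pandas then skips materializing groups for unobserved category levels (a sketch under that assumption; the commit may use a different tweak):

import pandas as pd

df = pd.DataFrame(
    {
        "geoid": pd.Categorical(["06000", "06000", "36000"]),
        "incidH": [1.0, 2.0, 3.0],
    }
)
# observed=True avoids building rows for category levels absent from
# the data, which saves memory on high-cardinality keys.
summary = df.groupby("geoid", observed=True)["incidH"].sum()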
* Create the given output directory if it does not exist.
* Convert the handling of sim values to use numpy directly in `write_geo_jsons`.
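
A sketch of both changes (hypothetical names; assuming pathlib and a pandas column of simulation values):

from pathlib import Path

import pandas as pd

output_directory = Path("json_files")
# Create the output directory (and parents) if it does not already exist.
output_directory.mkdir(parents=True, exist_ok=True)

df = pd.DataFrame({"sim_value": [1.2, 3.4, 5.6]})
# Operating on the raw numpy array avoids pandas overhead in hot loops.
sims = df["sim_value"].to_numpy()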