[NSETM-2279] Use shared memory to write partial DataFrames of features #33

GianlucaFicarelli · 2024-04-05T15:44:42Z

Improve the performance of report extraction and features calculation by writing partial DataFrames to shared memory (or to a temp directory, if shared memory is not available).
When processing large DataFrames, both memory usage and execution time should be lower than before.
Use zstd compression instead of snappy when writing parquet files.
When the repo is pickled, extract the DataFrames only if they aren't already stored in the cache.
Remove the fastparquet extra dependency.
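
For readers unfamiliar with the approach, here is a minimal sketch of the general idea, not the code in this PR. It assumes a Linux-style `/dev/shm` tmpfs mount and that pandas and pyarrow are installed; the helper names (`scratch_dir`, `write_partial`, `merge_partials`) are hypothetical and do not come from blueetl.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path

# Assumption: on Linux, /dev/shm is a RAM-backed tmpfs mount.
SHM_DIR = Path("/dev/shm")


@contextmanager
def scratch_dir(prefix="partial_"):
    """Yield a temporary directory in shared memory, else in the default temp dir."""
    use_shm = SHM_DIR.is_dir() and os.access(SHM_DIR, os.W_OK | os.X_OK)
    base = SHM_DIR if use_shm else None  # None lets tempfile pick its default location
    path = Path(tempfile.mkdtemp(prefix=prefix, dir=base))
    try:
        yield path
    finally:
        # Always remove the partial files, even if the caller raised.
        shutil.rmtree(path, ignore_errors=True)


def write_partial(df, directory, index):
    """Write one partial DataFrame as a zstd-compressed parquet file."""
    path = Path(directory) / f"{index:08d}.parquet"
    df.to_parquet(path, engine="pyarrow", compression="zstd")
    return path


def merge_partials(directory):
    """Read the partial files back in name order and concatenate them."""
    import pandas as pd  # local import keeps the directory helper dependency-free

    paths = sorted(Path(directory).glob("*.parquet"))
    return pd.concat([pd.read_parquet(p) for p in paths])
```

A worker would call `write_partial` for each chunk inside a `with scratch_dir() as tmp:` block, and the parent would call `merge_partials(tmp)` before the block exits. Passing file paths between processes instead of pickled DataFrames is what is expected to reduce peak memory, though the actual gain depends on the workload.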

codecov-commenter · 2024-04-09T17:52:51Z

Codecov Report

Attention: Patch coverage is 93.70079%, with 8 lines in your changes missing coverage. Please review.

Project coverage is 91.49%. Comparing base (5602fd3) to head (7edbf63).

Files	Patch %	Lines
src/blueetl/repository.py	55.55%	1 Missing and 3 partials ⚠️
src/blueetl/store/parquet.py	90.32%	0 Missing and 3 partials ⚠️
src/blueetl/store/base.py	66.66%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #33      +/-   ##
==========================================
+ Coverage   91.47%   91.49%   +0.02%     
==========================================
  Files          45       45              
  Lines        2638     2703      +65     
  Branches      539      554      +15     
==========================================
+ Hits         2413     2473      +60     
- Misses        140      141       +1     
- Partials       85       89       +4

Flag	Coverage Δ
pytest	`91.49% <93.70%> (+0.02%)`	⬆️

Flags with carried forward coverage won't be shown.

GianlucaFicarelli self-assigned this Apr 5, 2024

GianlucaFicarelli marked this pull request as draft April 5, 2024 15:51

GianlucaFicarelli force-pushed the use_shm branch 4 times, most recently from ceec390 to c4956d9 Compare April 9, 2024 17:49

GianlucaFicarelli force-pushed the use_shm branch 8 times, most recently from 61b7f0f to 28d09eb Compare April 12, 2024 13:39

GianlucaFicarelli marked this pull request as ready for review April 12, 2024 13:42

GianlucaFicarelli requested a review from mgeplf April 12, 2024 13:42

GianlucaFicarelli force-pushed the use_shm branch from 28d09eb to f5b71f9 Compare April 19, 2024 10:39

GianlucaFicarelli changed the title ~~Use shared memory to write partial DataFrames of features~~ [NSETM-2279] Use shared memory to write partial DataFrames of features Apr 19, 2024

GianlucaFicarelli force-pushed the use_shm branch from f5b71f9 to 20f2009 Compare April 22, 2024 11:50

GianlucaFicarelli force-pushed the use_shm branch from 20f2009 to 7edbf63 Compare April 24, 2024 08:02

mgeplf approved these changes Apr 24, 2024

GianlucaFicarelli merged commit c2861df into main Apr 24, 2024
8 checks passed

GianlucaFicarelli deleted the use_shm branch April 24, 2024 14:05
