Document/Test gempyor.utils.read_df/write_df #247

Merged
TimothyWillard merged 19 commits into main from enhancement/GH-246-document-test-write_df-read_df on Jul 11, 2024

Conversation

TimothyWillard
Contributor

An example of GH-246: documentation and tests for gempyor.utils.read_df/write_df. If these look good, I can remove the current test fixtures from tests/utils/test_utils.py.

I do find it odd that gempyor.utils.read_df behaves differently for a csv than for a parquet file when a column called "subpop" is present. I added a test fixture to demonstrate this current behavior, but was that intended, and should it be changed?

Also, I'm not 100% sure I requested the appropriate reviewers; if not, please let me know.

Overhauled the gempyor.utils.write_df unit tests by placing them in a
new file with a class grouping similar fixtures. Added tests for the
NotImplementedError, writing to csv, and writing to parquet.
Overhauled the gempyor.utils.read_df unit tests by placing them in a new
file with a class for grouping similar fixtures. Added tests for the
NotImplementedError, reading from csv, and reading from parquet.
* Formatted the `tests/utils/test_read_df.py` file.
* Added `engine="pyarrow"` to `write_df` unit tests.
Moved read_df to be next to write_df in utils.py.
* Documented the gempyor.utils.write_df function using the Google style
  guide.
* Extended write_df to explicitly support os.PathLike types for fname
  (was implicitly supported) and added support for bytes.
* Changed the file manipulation logic to use pathlib rather than
  manipulating strings in write_df.
* Added unit test reading a file with a column called 'subpop',
  converted to a string when the file is a csv and left unaltered when
  the file is a parquet file.
* Fixed a typo in the test_raises_not_implemented_error docstring.
* Documented the gempyor.utils.read_df function using the Google style
  guide.
* Extended read_df to explicitly support os.PathLike types for fname
  (was implicitly supported) and added support for bytes.
* Changed the file manipulation logic to use pathlib rather than
  manipulating strings in read_df.
* Changed extension param of read_df/write_df from str to Literal[None,
  "", "csv", "parquet"].
* Added unit tests for `extension=None` for both read_df/write_df.
* Applied formatting (all whitespace) to read_df/write_df functions.
* Reorganized the imports in utils to be clearer.
* Added missing period in NotImplementedError in read_df/write_df,
  updated corresponding unit tests. Also use path suffix directly
  instead of given extension.
* Added missing test for write_df with provided extension raising
  NotImplementedError.
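
For context, a rough sketch of the shape these read_df/write_df changes describe -- pathlib-based suffix handling, the `Literal` extension hint, and the `NotImplementedError` -- though the actual code in `gempyor/utils.py` may differ in details:

```python
import os
from pathlib import Path
from typing import Literal

import pandas as pd


def write_df(
    fname: str | bytes | os.PathLike,
    df: pd.DataFrame,
    extension: Literal[None, "", "csv", "parquet"] = "",
) -> None:
    """Write a DataFrame to csv or parquet, inferred from the path suffix."""
    # Normalize str/bytes/os.PathLike input to a Path, appending the extension if given.
    path = Path(os.fsdecode(fname))
    if extension:
        path = path.with_suffix(f".{extension}")
    if path.suffix == ".csv":
        df.to_csv(path, index=False)
    elif path.suffix == ".parquet":
        df.to_parquet(path, engine="pyarrow", index=False)
    else:
        raise NotImplementedError(
            f"Invalid extension {path.suffix[1:]}. Must be 'csv' or 'parquet'."
        )
```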
* Created unit tests in tests/utils/test_list_filenames.py, including
  tests for searching flat and nested folders.
* Added relevant documentation to the tests as well.
* Created create_directories_with_files pytest class fixture, might need
  to be extracted into a more general purpose location later.
* Added more detail to the `gempyor.utils.list_filenames` type hints.
* Formatted the documentation to comply with the Google style guide.
* Refactored the internals of list_filenames to be single list
  comprehension instead of a loop of nested conditionals.
* Allow `filters` to accept a single string.
* Expanded type support for the `folder` arg of
  `gempyor.utils.list_filenames` to support bytes and os.PathLike.
* Vastly expanded test suite to target new supported types.
* Corrected a bug when `filters` was given as a string, uncovered by the
  expanded test suite.
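
Roughly, the single-list-comprehension version of `list_filenames` described above could look like the sketch below; the signature and defaults here are illustrative assumptions, not the exact gempyor API:

```python
import os
from pathlib import Path


def list_filenames(
    folder: str | bytes | os.PathLike = ".",
    filters: str | list[str] = [],
) -> list[str]:
    """List files under `folder` whose paths contain every filter substring."""
    # A bare string is treated as a single filter, not as a sequence of characters.
    filters = [filters] if isinstance(filters, str) else list(filters)
    return [
        str(path)
        for path in Path(os.fsdecode(folder)).rglob("*")
        if path.is_file() and all(f in str(path) for f in filters)
    ]
```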
Added unit tests for `gempyor.utils.get_truncated_normal` function.
* Documented `gempyor.utils.get_truncated_normal` including adding
  appropriate type hints.
* Refactored the function lightly for legibility.
Added unit tests for the `gempyor.utils.get_log_normal` function.
Added documentation for `gempyor.utils.get_log_normal` including adding
appropriate type hints.
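
As a rough illustration of what documented, type-hinted versions of these two helpers can look like (parameter names and defaults here are assumptions; return annotations are omitted because scipy's frozen-distribution class lives in a private module):

```python
import math

import scipy.stats


def get_truncated_normal(mean: float = 0.0, sd: float = 1.0, a: float = 0.0, b: float = 10.0):
    """Frozen truncated normal on [a, b]; bounds are rescaled to standard-normal units."""
    return scipy.stats.truncnorm((a - mean) / sd, (b - mean) / sd, loc=mean, scale=sd)


def get_log_normal(meanlog: float, sdlog: float):
    """Frozen log-normal parameterized by the mean and sd of the underlying normal."""
    return scipy.stats.lognorm(s=sdlog, scale=math.exp(meanlog))
```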
* Added unit tests for `gempyor.utils.rolling_mean_pad` in
  `tests/utils/test_rolling_mean_pad.py`.
* Wrote a reference implementation of `rolling_mean_pad` for comparison
  purposes.
* Added type hints for `gempyor.utils.rolling_mean_pad`.
* Expanded the existing docstring and included an example.
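
One plausible brute-force reference implementation to compare `rolling_mean_pad` against, assuming a centered window along the last axis with edge padding (the real function's conventions may differ):

```python
import numpy as np


def rolling_mean_pad_reference(data: np.ndarray, window: int) -> np.ndarray:
    """Naive rolling mean along the last axis, edge-padded so output matches input shape."""
    half = window // 2
    # Repeat edge values so every output position averages over a full window.
    pad_width = [(0, 0)] * (data.ndim - 1) + [(half, window - half - 1)]
    padded = np.pad(data, pad_width, mode="edge")
    return np.stack(
        [padded[..., i : i + window].mean(axis=-1) for i in range(data.shape[-1])],
        axis=-1,
    )
```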
@emprzy
Collaborator

emprzy commented Jul 10, 2024

For changes that actually require in-depth review, you may want to also include somebody who is more familiar with flepiMoP (perhaps Sara or Shaun), but I always like to be looped in so that I can learn. I'll be working on some of the documentation stuff.

@TimothyWillard
Contributor Author

Thanks for the suggestion @emprzy!

@saraloo
Contributor

saraloo commented Jul 10, 2024

I'm not 100% sure, but is the "different behaviour when subpop column is present" a legacy thing?

@saraloo
Contributor

saraloo commented Jul 10, 2024

This looks good to me. The documentation is great - makes things easy to follow. I'll defer to @jcblemai on the logic of the functions and csv vs parquet, but otherwise 👍 from me.

Collaborator

@jcblemai left a comment

Excellent changes, thank you 🙏🏻

@jcblemai
Collaborator

I do find it odd that gempyor.utils.read_df behaves differently for a csv than for a parquet file when a column called "subpop" is present. I added a test fixture to demonstrate this current behavior, but was that intended, and should it be changed?

The reason behind this is that subpop is often a code (e.g. US geoids are five-digit numbers). pandas.read_csv guesses the type of each column and will sometimes wrongly pick a numeric type (which changes "06000" to "6000", for example), and that messes up the rest of the code. Usually this is fine for other columns (which are either numbers or strings that look like strings -- though edge cases are not impossible).
Parquet is fine because the column types are defined within the file.
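
A small illustration of the failure mode described here (the values and file name are made up for the example):

```python
import io

import pandas as pd

# CSV carries no type information, so pandas guesses and drops the leading zero.
csv = io.StringIO("subpop,value\n06000,1.0\n")
print(pd.read_csv(csv)["subpop"].tolist())                          # [6000]
csv.seek(0)
print(pd.read_csv(csv, dtype={"subpop": str})["subpop"].tolist())   # ['06000']

# Parquet stores column types in the file, so a string column round-trips unchanged.
df = pd.DataFrame({"subpop": ["06000"], "value": [1.0]})
df.to_parquet("example.parquet", engine="pyarrow")
print(pd.read_parquet("example.parquet")["subpop"].tolist())        # ['06000']
```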

@TimothyWillard agreed on the oddness; if you find a better way to get this behavior it would be a welcome change. I guess the ideal default would be for each function to specify the column types.

The PR is very good and you may merge.

TimothyWillard merged commit 4485458 into main on Jul 11, 2024
1 check passed
TimothyWillard deleted the enhancement/GH-246-document-test-write_df-read_df branch on July 11, 2024 at 11:53
@TimothyWillard
Contributor Author

Parquet is fine because the types are defined within the file.

Ah, that explains what I was missing. It would be nice to unify the behaviors at some point, but it's not urgent. Thanks for the reviews, y'all!
