-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document/Test gempyor.utils.read_df/write_df
#247
Document/Test gempyor.utils.read_df/write_df
#247
Conversation
Overhauled the gempyor.utils.write_df unit tests by placing them in a new file with a class grouping similar fixtures. Added tests for the NotImplementedError, writing to csv, and writing to parquet.
Overhauled the gempyor.utils.read_df unit tests by blacing them in a new file with a class for grouping similar fixtures. Added tests for the NotImplementedError, reading from csv, and reading from parquet.
* Formatted the `tests/utils/test_read_df.py` file. * Added `engine="pyarrow"` to `write_df` unit tests.
Moved read_df to be next to write_df in utils.py.
* Documented the gempyor.utils.write_df function using the Google style guide. * Extended write_df to explicitly support os.PathLike types for fname (was implicitly supported) and added support for bytes. * Changed the file manipulation logic to use pathlib rather than manipulating strings in write_df.
* Added unit test reading a file with a column called 'subpop', converted to a string when the file is a csv and left unaltered when the file is a parquet file. * Typo in test_raises_not_implemented_error docstring.
* Documented the gempyor.utils.read_df function using the Google style guide. * Extended read_df to explicitly support os.PathLike types for fname (was implicitly supported) and added support for bytes. * Changed the file manipulation logic to use pathlib rather than manipulating strings in read_df.
* Changed extension param of read_df/write_df from str to Literal[None, "", "csv", "parquet"]. * Added unit tests for `extension=None` for both read_df/write_df.
* Applied formatting (all whitespace) to read_df/write_df functions. * Reorganized the imports in utils to be clearer.
* Added missing period in NotImplementedError in read_df/write_df, updated corresponding unit tests. Also use path suffix directly instead of given extension. * Added missing test for write_df with provided extension raising NotImplementedError.
* Created unit tests in tests/utils/test_list_filenames.py, including tests for searching flat and nested folders. * Added relevant documentation to the tests as well. * Created create_directories_with_files pytest class fixture, might need to be extracted into a more general purpose location later.
* Added more detail to the `gempyor.utils.list_filenames` type hints. * Formatted the documentation to comply with the Google style guide. * Refactored the internals of list_filenames to be single list comprehension instead of a loop of nested conditionals. * Allow `filters` to accept a single string.
* Expanded type support for the `folder` arg of `gempyor.utils.list_filnames` to support bytes and os.PathLike. * Vastly expanded test suite to target new supported types. * Corrected bug when filters was given as a string, uncovered by extensive test suite.
Added unit tests for `gempyor.utils.get_truncated_normal` function.
* Documented `gempyor.utils.get_truncated_normal` including adding appropriate type hints. * Refactored the function lightly for legibility.
Added unit tests for the `gempyor.utils.get_log_normal` function.
Added documentation for `gempyor.utils.get_log_normal` including adding appropriate type hints.
* Added unit tests for `gempyor.utils.rolling_mean_pad` in `tests/utils/test_rolling_mean_pad.py`. * Wrote a reference implementation of `rolling_mean_pad` for comparison purposes.
* Added type hints for `gempyor.utils.rolling_mean_pad`. * Expanded the existing docstring and included an example.
For changes that actually require in-depth review, you may want to also include somebody who is more familiar with flepiMoP (perhaps Sara or Shaun), but I always like to be looped in so that I can learn. I'll be working on some of the documentation stuff. |
Thanks for the suggestion @emprzy! |
I'm not 100% sure, but is the "different behaviour when subpop column is present" a legacy thing? |
This looks good to me. The documentation is great - makes things easy to follow. I'll defer to @jcblemai on the logic of the functions and csv vs parquet, but otherwise 👍 from me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent changes, thank you 🙏🏻
The reason behind this is that subpop is often a code (e.g. US geoids are a five-digit number). Panda.read_csv guesses the type of the column and will sometimes wrongly pick a numeric type (which changes "06000" to "6000", for example) which messes up with the rest of the code. Usually, this is fine with other categories (which are either numbers or strings that look like strings -- though edge-cases are not impossible). @TimothyWillard agree on the oddness, if you find a better way to get this behavior it would be a welcome change. I guess the default behavior would be to have each function give the column type. The PR is very good and you may merge. |
Ah, that explains what I was missing. Would be nice at some point to unify the behaviors at some point, but not urgent. Thanks for the reviews y'all! |
An example of GH-246. Documents and tests for
gempyor.utils.read_df/write_df
. If these look good I can remove the current test fixtures fromtests/utils/test_utils.py
.I do find it odd that
gempyor.utils.read_df
has different behavior when the file is a csv vs a parquet when a column called "subpop" is present. I added a text fixture to demonstrate this current behavior, but was that intended and should it be changed?Also, not 100% sure I requested the appropriate reviewers, if not please let me know.