Document/Test gempyor.utils.read_df/write_df #247

Merged
TimothyWillard merged 19 commits into main from enhancement/GH-246-document-test-write_df-read_df on Jul 11, 2024

Conversation

TimothyWillard
Contributor

An example of GH-246: documentation and tests for gempyor.utils.read_df/write_df. If these look good, I can remove the current test fixtures from tests/utils/test_utils.py.

I do find it odd that gempyor.utils.read_df behaves differently for a csv than for a parquet file when a column called "subpop" is present. I added a test fixture to demonstrate this current behavior, but was that intended, and should it be changed?

Also, I'm not 100% sure I requested the appropriate reviewers; if not, please let me know.

Overhauled the gempyor.utils.write_df unit tests by placing them in a
new file with a class grouping similar fixtures. Added tests for the
NotImplementedError, writing to csv, and writing to parquet.
Overhauled the gempyor.utils.read_df unit tests by placing them in a new
file with a class for grouping similar fixtures. Added tests for the
NotImplementedError, reading from csv, and reading from parquet.
* Formatted the `tests/utils/test_read_df.py` file.
* Added `engine="pyarrow"` to `write_df` unit tests.
Moved read_df to be next to write_df in utils.py.
* Documented the gempyor.utils.write_df function using the Google style
  guide.
* Extended write_df to explicitly support os.PathLike types for fname
  (was implicitly supported) and added support for bytes.
* Changed the file manipulation logic to use pathlib rather than
  manipulating strings in write_df.
* Added unit test reading a file with a column called 'subpop',
  converted to a string when the file is a csv and left unaltered when
  the file is a parquet file.
* Fixed a typo in the test_raises_not_implemented_error docstring.
* Documented the gempyor.utils.read_df function using the Google style
  guide.
* Extended read_df to explicitly support os.PathLike types for fname
  (was implicitly supported) and added support for bytes.
* Changed the file manipulation logic to use pathlib rather than
  manipulating strings in read_df.
* Changed extension param of read_df/write_df from str to Literal[None,
  "", "csv", "parquet"].
* Added unit tests for `extension=None` for both read_df/write_df.
* Applied formatting (all whitespace) to read_df/write_df functions.
* Reorganized the imports in utils to be clearer.
* Added missing period in NotImplementedError in read_df/write_df,
  updated corresponding unit tests. Also use path suffix directly
  instead of given extension.
* Added missing test for write_df with provided extension raising
  NotImplementedError.
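
For context, a rough sketch of the shape these read_df/write_df changes describe -- pathlib-based suffix handling, the `Literal` extension hint, and the `NotImplementedError` -- though the actual code in `gempyor/utils.py` may differ in details:

```python
import os
from pathlib import Path
from typing import Literal

import pandas as pd


def write_df(
    fname: str | bytes | os.PathLike,
    df: pd.DataFrame,
    extension: Literal[None, "", "csv", "parquet"] = "",
) -> None:
    """Write a DataFrame to csv or parquet, inferred from the path suffix."""
    # Normalize str/bytes/os.PathLike input to a Path, appending the extension if given.
    path = Path(os.fsdecode(fname))
    if extension:
        path = path.with_suffix(f".{extension}")
    if path.suffix == ".csv":
        df.to_csv(path, index=False)
    elif path.suffix == ".parquet":
        df.to_parquet(path, engine="pyarrow", index=False)
    else:
        raise NotImplementedError(
            f"Invalid extension {path.suffix[1:]}. Must be 'csv' or 'parquet'."
        )
```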
* Created unit tests in tests/utils/test_list_filenames.py, including
  tests for searching flat and nested folders.
* Added relevant documentation to the tests as well.
* Created create_directories_with_files pytest class fixture, might need
  to be extracted into a more general purpose location later.
* Added more detail to the `gempyor.utils.list_filenames` type hints.
* Formatted the documentation to comply with the Google style guide.
* Refactored the internals of list_filenames to be single list
  comprehension instead of a loop of nested conditionals.
* Allow `filters` to accept a single string.
* Expanded type support for the `folder` arg of
  `gempyor.utils.list_filenames` to support bytes and os.PathLike.
* Vastly expanded test suite to target new supported types.
* Corrected a bug when `filters` was given as a string, uncovered by the
  expanded test suite.
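
Roughly, the single-list-comprehension version of `list_filenames` described above could look like the sketch below; the signature and defaults here are illustrative assumptions, not the exact gempyor API:

```python
import os
from pathlib import Path


def list_filenames(
    folder: str | bytes | os.PathLike = ".",
    filters: str | list[str] = [],
) -> list[str]:
    """List files under `folder` whose paths contain every filter substring."""
    # A bare string is treated as a single filter, not as a sequence of characters.
    filters = [filters] if isinstance(filters, str) else list(filters)
    return [
        str(path)
        for path in Path(os.fsdecode(folder)).rglob("*")
        if path.is_file() and all(f in str(path) for f in filters)
    ]
```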
Added unit tests for `gempyor.utils.get_truncated_normal` function.
* Documented `gempyor.utils.get_truncated_normal` including adding
  appropriate type hints.
* Refactored the function lightly for legibility.
Added unit tests for the `gempyor.utils.get_log_normal` function.
Added documentation for `gempyor.utils.get_log_normal` including adding
appropriate type hints.
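
As a rough illustration of what documented, type-hinted versions of these two helpers can look like (parameter names and defaults here are assumptions; return annotations are omitted because scipy's frozen-distribution class lives in a private module):

```python
import math

import scipy.stats


def get_truncated_normal(mean: float = 0.0, sd: float = 1.0, a: float = 0.0, b: float = 10.0):
    """Frozen truncated normal on [a, b]; bounds are rescaled to standard-normal units."""
    return scipy.stats.truncnorm((a - mean) / sd, (b - mean) / sd, loc=mean, scale=sd)


def get_log_normal(meanlog: float, sdlog: float):
    """Frozen log-normal parameterized by the mean and sd of the underlying normal."""
    return scipy.stats.lognorm(s=sdlog, scale=math.exp(meanlog))
```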
* Added unit tests for `gempyor.utils.rolling_mean_pad` in
  `tests/utils/test_rolling_mean_pad.py`.
* Wrote a reference implementation of `rolling_mean_pad` for comparison
  purposes.
* Added type hints for `gempyor.utils.rolling_mean_pad`.
* Expanded the existing docstring and included an example.
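
One plausible brute-force reference implementation to compare `rolling_mean_pad` against, assuming a centered window along the last axis with edge padding (the real function's conventions may differ):

```python
import numpy as np


def rolling_mean_pad_reference(data: np.ndarray, window: int) -> np.ndarray:
    """Naive rolling mean along the last axis, edge-padded so output matches input shape."""
    half = window // 2
    # Repeat edge values so every output position averages over a full window.
    pad_width = [(0, 0)] * (data.ndim - 1) + [(half, window - half - 1)]
    padded = np.pad(data, pad_width, mode="edge")
    return np.stack(
        [padded[..., i : i + window].mean(axis=-1) for i in range(data.shape[-1])],
        axis=-1,
    )
```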
@emprzy
Collaborator

emprzy commented Jul 10, 2024

For changes that actually require in-depth review, you may want to also include somebody who is more familiar with flepiMoP (perhaps Sara or Shaun), but I always like to be looped in so that I can learn. I'll be working on some of the documentation stuff.

@TimothyWillard
Contributor Author

Thanks for the suggestion @emprzy!

@saraloo
Contributor

saraloo commented Jul 10, 2024

I'm not 100% sure, but is the "different behaviour when subpop column is present" a legacy thing?

@saraloo
Contributor

saraloo commented Jul 10, 2024

This looks good to me. The documentation is great - makes things easy to follow. I'll defer to @jcblemai on the logic of the functions and csv vs parquet, but otherwise 👍 from me.

Collaborator

@jcblemai left a comment

Excellent changes, thank you 🙏🏻

@jcblemai
Collaborator

I do find it odd that gempyor.utils.read_df behaves differently for a csv than for a parquet file when a column called "subpop" is present. I added a test fixture to demonstrate this current behavior, but was that intended, and should it be changed?

The reason behind this is that subpop is often a code (e.g. US geoids are five-digit numbers). pandas.read_csv guesses the type of each column and will sometimes wrongly pick a numeric type (which changes "06000" to "6000", for example), and that messes up the rest of the code. Usually this is fine for other columns (which are either numbers or strings that look like strings -- though edge cases are not impossible).
Parquet is fine because the column types are defined within the file.
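
A small illustration of the failure mode described here (the values and file name are made up for the example):

```python
import io

import pandas as pd

# CSV carries no type information, so pandas guesses and drops the leading zero.
csv = io.StringIO("subpop,value\n06000,1.0\n")
print(pd.read_csv(csv)["subpop"].tolist())                          # [6000]
csv.seek(0)
print(pd.read_csv(csv, dtype={"subpop": str})["subpop"].tolist())   # ['06000']

# Parquet stores column types in the file, so a string column round-trips unchanged.
df = pd.DataFrame({"subpop": ["06000"], "value": [1.0]})
df.to_parquet("example.parquet", engine="pyarrow")
print(pd.read_parquet("example.parquet")["subpop"].tolist())        # ['06000']
```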

@TimothyWillard agreed on the oddness; if you find a better way to get this behavior it would be a welcome change. I guess the ideal default would be for each function to specify the column types.

The PR is very good and you may merge.

TimothyWillard merged commit 4485458 into main on Jul 11, 2024
1 check passed
TimothyWillard deleted the enhancement/GH-246-document-test-write_df-read_df branch on July 11, 2024 at 11:53
@TimothyWillard
Contributor Author

Parquet is fine because the types are defined within the file.

Ah, that explains what I was missing. It would be nice to unify the behaviors at some point, but it's not urgent. Thanks for the reviews, y'all!
