Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make strings_to_categoricals upon .write() optional #1474

Merged
merged 7 commits into from
Aug 9, 2024
Merged

Conversation

falexwolf
Copy link
Member

@falexwolf falexwolf commented Apr 22, 2024

To illustrate that it'd be a simple change, this PR adds the 3 lines for what @ivirshup suggested here 3 years ago:

I think I'd be fine with a keyword argument for this.

In the two issues below, @grst, @adamgayoso, @lazappi & @ljjh20 expressed that "forced" sanitization is problematic.

I found these issues only now but also agree that a write function should not mutate the object that's about to be written. @chaichontat made me aware.

Things turn out particularly bad for gene symbols, which should remain strings even for performance reasons and compatibility with JS. Having 30k categories in a 32k dimensional genes vector is detrimental, not beneficial.

Copy link

codecov bot commented Apr 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.80%. Comparing base (6e918a4) to head (7390740).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1474      +/-   ##
==========================================
- Coverage   86.69%   82.80%   -3.89%     
==========================================
  Files          37       37              
  Lines        5937     5939       +2     
==========================================
- Hits         5147     4918     -229     
- Misses        790     1021     +231     
Files Coverage Δ
src/anndata/_io/h5ad.py 93.00% <100.00%> (+0.03%) ⬆️
src/anndata/_io/zarr.py 83.54% <100.00%> (+0.21%) ⬆️

... and 12 files with indirect coverage changes

@ilan-gold ilan-gold self-assigned this Jul 2, 2024
@lazappi
Copy link

lazappi commented Jul 3, 2024

Just chipping in to say that while I think this would be great, it doesn't really solve scverse/scanpy#1747 as scanpy functions will still cause this conversion (unless something has changed). So this helps if you are just reading/writing AnnData but anything that goes through scanpy is still likely to be converted to categoricals.

You probably knew this already but wanted to make it clear for anyone reading this later.

@falexwolf
Copy link
Member Author

Just chipping in to say that while I think this would be great

Great that you're also onboard, @lazappi!

as scanpy functions will still cause this conversion (unless something has changed)

Yes, you're right about the Scanpy plotting functionality. But I think the issue is more pressing for anndata, given its widespread use outside of Scanpy.

@ilan-gold ilan-gold added this to the 0.10.9 milestone Jul 3, 2024
@flying-sheep flying-sheep enabled auto-merge (squash) July 4, 2024 08:24
@ilan-gold ilan-gold disabled auto-merge July 4, 2024 08:29
@ilan-gold
Copy link
Contributor

@flying-sheep I am disabling auto-merge. Isaac brought up a good point about whether this should be in settings on zulip. What do you think? I think it's fine to leave this here because you might have different behavior for different datasets. If you agree we can merge (two against one haha).

@flying-sheep
Copy link
Member

flying-sheep commented Jul 4, 2024

We agreed on the following scope for settings (I’ll add checks for things that apply here)

  • special and uncommon use-case (performance etc), only to be done when you know what you’re doing, especially when turning off checks
  • things that transitively apply (e.g. when concatenating/slicing) and therefore either need deep API changes or a global setting

I think we didn’t fully agree on if this should be an “and” or an “or” for things to go into settings.

I think IO is localized enough that it doesn’t need a global setting, but I have no strong opinion here.

@ilan-gold
Copy link
Contributor

@ivirshup your call then.

@ilan-gold
Copy link
Contributor

I am fine leaving this as-is @flying-sheep so will enable auto-merge

@ilan-gold ilan-gold enabled auto-merge (squash) August 9, 2024 13:51
@ilan-gold
Copy link
Contributor

I think we didn’t fully agree on if this should be an “and” or an “or” for things to go into settings.

I am comfortable with "or" here

@ilan-gold ilan-gold merged commit a3d656f into main Aug 9, 2024
14 of 15 checks passed
@ilan-gold ilan-gold deleted the stringstocat branch August 9, 2024 14:16
meeseeksmachine pushed a commit to meeseeksmachine/anndata that referenced this pull request Aug 9, 2024
ilan-gold pushed a commit that referenced this pull request Aug 9, 2024
ilan-gold added a commit that referenced this pull request Aug 9, 2024
ilan-gold added a commit that referenced this pull request Aug 9, 2024
@flying-sheep flying-sheep restored the stringstocat branch August 23, 2024 09:23
@flying-sheep flying-sheep deleted the stringstocat branch August 23, 2024 09:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants