(feat): [WIP] support physical quantities via Pint #1481

mruffalo · 2024-04-24T20:59:46Z

Hello!

The HuBMAP Consortium makes extensive use of AnnData, writing gene expression count matrices (and more) as .h5ad files for public download.

As we're continuing to extend our computational pipelines to handle sequencing spatial transcriptomics with associated imaging data (e.g., Visium), with spatial coordinates registered to pixel coordinates, it's become clear that it would be very useful to store physical units directly in the AnnData/.h5ad structures. We've had a very pleasant experience using the Pint package in other contexts, and this package interoperates naturally with NumPy and similar (though not as nicely with SciPy sparse matrices, unfortunately).

This draft PR implements serialization of pint.Quantity objects to H5AD and Zarr, separately writing the magnitude and unit of each Quantity, and then instantiating a new Quantity object when reading from disk.

There are a few caveats to this -- the pint package allows instantiating multiple UnitRegistry objects, and quantities associated with one registry cannot be compared to quantities from another registry. As such, the anndata package instantiates a UnitRegistry on import, accessible at anndata.units and used to instantiate Quantity objects when loading data. Users should almost definitely use that anndata.units registry when creating their own Quantity objects, to interoperate with Quantity objects created by the package.

This PR adds pint as an unconditional requirement for anndata.

I added a test method which demonstrates serialization and loading of pint.Quantity-wrapped arrays and scalars, which works as expected. I ran the full test suite on my machine and encountered no new failures (10 tests failed on current main and on my branch, on macOS 14.4.1, and 29 failed on Ubuntu 22.04).

This PR does not add documentation; I'm happy to do so but wanted to check whether this approach looked reasonable to everyone before expending that effort.

Closes #
Tests added
Release note added (or unnecessary)

… methods

codecov · 2024-04-24T21:18:19Z

Codecov Report

Attention: Patch coverage is 62.50000% with 6 lines in your changes are missing coverage. Please review.

Project coverage is 84.19%. Comparing base (20535bf) to head (7a51205).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1481      +/-   ##
==========================================
- Coverage   84.22%   84.19%   -0.04%     
==========================================
  Files          35       35              
  Lines        5685     5699      +14     
==========================================
+ Hits         4788     4798      +10     
- Misses        897      901       +4

Files	Coverage Δ
src/anndata/__init__.py	`92.00% <100.00%> (ø)`
src/anndata/_core/anndata.py	`85.00% <100.00%> (+0.03%)`	⬆️
src/anndata/_io/specs/methods.py	`87.01% <53.84%> (-0.99%)`	⬇️

... and 1 file with indirect coverage changes

flying-sheep · 2024-04-25T13:27:13Z

Hmm, generally units are a great idea. Two questions:

What’s the effect of using a global unit registry? The idea seems to be that if have two units with the same name defined in two different anndata on-disk representations, loading both will result in interoperability. But isn’t the idea of the registry concept that that shouldn’t happen by accident?

Does pint have a global unit registry for standard (SI) units? What do you use it for?
Does pint allow to use these units in arrays? I don’t see a huge advantage in just having a handful of constants saved in .uns, but again: maybe it would help to see what you use them for

mruffalo · 2024-04-25T14:10:10Z

Hmm, generally units are a great idea. Two questions:

Thanks!

What’s the effect of using a global unit registry? The idea seems to be that if have two units with the same name defined in two different anndata on-disk representations, loading both will result in interoperability. But isn’t the idea of the registry concept that that shouldn’t happen by accident?

What I'd expect as an anndata user is for the units to be interoperable between different AnnData objects loaded by the package -- a millimeter is a millimeter, after all, and see the example code I included below as an answer to your second question.

The utility of having multiple UnitRegistry objects only starts to apply when users want to modify the defaults (like using imperial instead of SI as the base units) or adding new custom units, etc.

I referred to the suggestions in https://pint.readthedocs.io/en/stable/getting/pint-in-your-projects.html#having-a-shared-registry when considering how to handle needing an instantiated UnitRegistry object to do any unit handling.

Does pint have a global unit registry for standard (SI) units? What do you use it for?

The pint package does not instantiate a "global" unit registry itself; users (including packages embedding pint like anndata) need to do this manually.

Does pint allow to use these units in arrays? I don’t see a huge advantage in just having a handful of constants saved in .uns, but again: maybe it would help to see what you use them for

Absolutely! The test code I implemented adds one scalar Quantity to .uns (e.g., diameter of a capture bead for a Visium dataset), and the array of spatial coordinates in .obsm['X_spatial'] has units of mm.

I believe this would even allow anndata.concat to operate on objects with spatial coordinates using different units, as long as something like np.vstack is used:

In [1]: import numpy as np

In [2]: from pint import UnitRegistry

In [3]: reg = UnitRegistry()

In [4]: coords_1 = np.array([[1, 2], [2, 1]]) * reg.mm

In [5]: coords_1
Out[5]: 
array([[1, 2],
       [2, 1]]) <Unit('millimeter')>

In [6]: coords_2 = np.array([[0, 3], [4, 0]]) * reg.um

In [7]: coords_2
Out[7]: 
array([[0, 3],
       [4, 0]]) <Unit('micrometer')>

In [8]: np.vstack([coords_1, coords_2])
Out[8]: 
array([[1.   , 2.   ],
       [2.   , 1.   ],
       [0.   , 0.003],
       [0.004, 0.   ]]) <Unit('millimeter')>

(Whether or not it's useful to mix spatial coordinates between two different experiments, it's still an improvement to have that concatenation be unit-aware.)

mruffalo · 2024-04-25T14:25:34Z

@flying-sheep After writing my comment and re-reading yours, I could be persuaded that anndata should make it hard to blindly concatenate AnnData objects that have spatial information, if they're read separately because they're from separate experiments. Coordinate (2mm, 2mm) in Visium dataset 1 has no relationship to the same location in another Visium dataset, for example, and one shouldn't plot capture beads at those same locations on top of each other in the same plot. Maybe anndata.concat should grow a new parameter harmonize_units or something, to force converting everything to the same UnitRegistry if a user knows this is what they want to do.

My initial vision for this functionality was to make it easy on users to deal with physical quantities in AnnData objects, hence allowing mixing units between AnnDatas, to allow checking things like "of these two sequencing spatial transcriptomics datasets I have in memory, which has smaller size of capture bead?"

I would be relatively annoyed if I couldn't compare those quantities easily as an anndata user:

>>> (2 * UnitRegistry().mm) < (5 * UnitRegistry().um)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mruffalo/.opt/anaconda3/lib/python3.11/site-packages/pint/facets/plain/quantity.py", line 1381, in <lambda>
    __lt__ = lambda self, other: self.compare(other, op=operator.lt)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mruffalo/.opt/anaconda3/lib/python3.11/site-packages/pint/facets/plain/quantity.py", line 100, in wrapped
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mruffalo/.opt/anaconda3/lib/python3.11/site-packages/pint/facets/plain/quantity.py", line 1369, in compare
    raise ValueError(
ValueError: Cannot operate with Quantity and Quantity of different registries.

and I would probably be left with the impression that unit handling is an obstacle rather than an enhancement.

flying-sheep · 2024-04-25T15:19:52Z

Yeah, those are the considerations. Basically the big question is “shall the registry live in each AnnData object or be global”

There’s also mixes possible, like having a context manager that overrides the (otherwise global) registry. @ivirshup @ilan-gold what do you think about this?

ivirshup · 2024-04-27T09:36:52Z

Two main things:

I think this should be discussed in an issue
the use case sound like spatialdata, which already has coordinate spaces and transforms. Why not just use those?

mruffalo · 2024-04-28T19:18:09Z

I think this should be discussed in an issue

Works for me. I'll make an issue when I get a chance, unless someone beats me to it.

the use case sound like spatialdata, which already has coordinate spaces and transforms. Why not just use those?

As noted on https://spatialdata.scverse.org/en/latest/, SpatialData is under active development and doesn't yet claim to offer a stable API. We want to distribute the results of HuBMAP computational pipelines in a stable (enough) format, and 'plain' H5AD is how we're already publishing sc/snRNA-seq processing results.

Our use case is sharing VIsium data in .h5ad format, with spatial coordinates in .obsm['X_spatial'] -- we are already doing this, and wanted to enhance these outputs by including physical units for these coordinates, not to mention metadata like the diameter of each capture bead.

Using a library for handling physical units eliminates a whole class of bugs, and using Pint has simplified a lot of our code, including in handling of physical pixel sizes read from OME-XML metadata. I think there's value in allowing users to use pint.Quantity objects in any appropriate AnnData fields, and supporting serializing those to disk -- users who load such a file can't miss the fact that there are physical sizes associated with some of the values, making it harder to handle these quantities incorrectly.

ilan-gold · 2024-05-22T08:50:31Z

pyproject.toml

@@ -47,6 +47,7 @@ dependencies = [
    "exceptiongroup; python_version<'3.11'",
    "natsort",
    "packaging>=20.0",
+    "pint",


I might lean towards putting this in its own optional dependency section.

@flying-sheep thoughts?

ilan-gold · 2024-05-22T09:03:13Z

src/anndata/_io/specs/methods.py

+    _writer.write_elem(g, "magnitude", v.magnitude, dataset_kwargs=dataset_kwargs)
+    _writer.write_elem(g, "units", str(v.units), dataset_kwargs=dataset_kwargs)


I think that the units should be written into the attrs instead of being given their own array. It is just a short string, right? This would also make it easier for interoperability purposes. Arrays with this extra metadata wouldn't break existing readers.

ilan-gold · 2024-05-22T09:07:47Z

src/anndata/tests/test_readwrite.py

+        uns={"size": 100 * ad.units["um"]},
+        obsm={"X_spatial": X_arr * ad.units.mm},
+    )
+    write(tmp_path / name, adata)


I'd follow the pattern from above, unless I'm missing a good reason not to i.e.,

read = lambda pth: getattr(ad, f"read_{diskfmt}")(pth) write = lambda adata, pth: getattr(adata, f"write_{diskfmt}")(pth)

and the corresponding fixtures for the diskfmt

mruffalo added 3 commits April 24, 2024 15:12

Add pint requirement to pyproject.toml

25d2a0a

Import pint, initialize UnitRegistry as anndata.units, add read/write…

7b51bfa

… methods

Add test for reading/writing Pint quantities

7a51205

ilan-gold mentioned this pull request Apr 29, 2024

Using Pint for units #1483

Open

ilan-gold reviewed May 22, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(feat): [WIP] support physical quantities via Pint #1481

(feat): [WIP] support physical quantities via Pint #1481

mruffalo commented Apr 24, 2024 •

edited

Loading

codecov bot commented Apr 24, 2024 •

edited

Loading

flying-sheep commented Apr 25, 2024

mruffalo commented Apr 25, 2024 •

edited

Loading

mruffalo commented Apr 25, 2024

flying-sheep commented Apr 25, 2024

ivirshup commented Apr 27, 2024

mruffalo commented Apr 28, 2024

ilan-gold May 22, 2024

ilan-gold May 22, 2024

ilan-gold May 22, 2024

ilan-gold May 22, 2024

		_writer.write_elem(g, "magnitude", v.magnitude, dataset_kwargs=dataset_kwargs)
		_writer.write_elem(g, "units", str(v.units), dataset_kwargs=dataset_kwargs)

(feat): [WIP] support physical quantities via Pint #1481

Are you sure you want to change the base?

(feat): [WIP] support physical quantities via Pint #1481

Conversation

mruffalo commented Apr 24, 2024 • edited Loading

codecov bot commented Apr 24, 2024 • edited Loading

Codecov Report

flying-sheep commented Apr 25, 2024

mruffalo commented Apr 25, 2024 • edited Loading

mruffalo commented Apr 25, 2024

flying-sheep commented Apr 25, 2024

ivirshup commented Apr 27, 2024

mruffalo commented Apr 28, 2024

ilan-gold May 22, 2024

Choose a reason for hiding this comment

ilan-gold May 22, 2024

Choose a reason for hiding this comment

ilan-gold May 22, 2024

Choose a reason for hiding this comment

ilan-gold May 22, 2024

Choose a reason for hiding this comment

mruffalo commented Apr 24, 2024 •

edited

Loading

codecov bot commented Apr 24, 2024 •

edited

Loading

mruffalo commented Apr 25, 2024 •

edited

Loading