Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow write to accept file-like objects and HDF5 File or Group #378

Open
Helveg opened this issue Feb 11, 2022 · 12 comments
Open

Allow write to accept file-like objects and HDF5 File or Group #378

Helveg opened this issue Feb 11, 2022 · 12 comments

Comments

@Helveg
Copy link
Contributor

Helveg commented Feb 11, 2022

I'd like to store the created files a bit more dynamically than just in a file on the filesystem, to enable various use cases, and because it is conventional in Python, can the write function also accept file-like objects (anything with a write function) to write the morphology to?

Additionally, when storing HDF5, can we pass the handle to a h5py.File or h5py.Group so that we may store morphologies inside the hierarchy of existing HDF5 files.

@mgeplf
Copy link
Contributor

mgeplf commented Feb 14, 2022

Both ideas are interesting; the python side of the library is largely exposing the C++ code, so it's not trivial to add the glue that would allow this functionality. However, we will look into it to see what is required.

Thanks for the suggestions.

@Helveg
Copy link
Contributor Author

Helveg commented Feb 23, 2023

@mgeplf any progress on this? I'd really like to leverage MorphIO in a framework we're developing. We take morphologies from several sources, including straight from NeuroMorpho and other online databases, and having to buffer everything in a temporary file on disk is something I'd like to avoid :)

@matz-e
Copy link
Member

matz-e commented Feb 23, 2023

We're using BlueBrain/HighFive to wrap access to the HDF5 files. To be able to pass groups into writing (and reading), we would have to first see about how h5py and HighFive can interface best without creating any potential memory issues. We will discuss this more internally.

(FWIW, what you are describing sounds similar to some of our envisioned morphology storage concepts)

@mgeplf
Copy link
Contributor

mgeplf commented Feb 23, 2023

Wow, this fell off the radar; sorry about that.

I remember looking it at a bit, and h5py exposes an id, but I wasn't sure if it was the hdf5 id. It was mainly the glue from h5py that would be a challenge, but I'm forgetting why.

FWIW, what you are describing sounds similar to some of our envisioned morphology storage concepts

Is a good point; if you can, @Helveg, can you describe what you're envisining? Perhaps some special API can be added instead of generic h5py handling.

@Helveg
Copy link
Contributor Author

Helveg commented Feb 23, 2023

I see that my original request was for the write function, but also the Morphology constructor strictly takes a file path. In both these read and write cases I'd already have the file contents in memory, but they're not supposed to pass through the filesystem.

In Python it is conventional for I/O functions to check 2 things:

  • Can this thing I've been handed be resolved to a filesystem path? (Using os.fspath, so that things like strings and Paths from pathlib can all be handled)
  • Is this thing that I've been handed a file-like object (a file, buffer, stream, ...)? This is usually checked merely by the presence of a .read and/or .write attribute. (Python likes duck-typing)
  • Additionally, since you're using h5py, which is a hierarchical data format, it's also needlessly restrictive that I can only hand it the path, and thereby implicitly only being able to point to the root of a new hierarchy, and not put it / read it from an existing hierarchy.

I'm not sure if I'm in a great position to recommend any particular design solutions or APIs for this, being so unfamiliar with your code :) IMHO none of the APIs need to change, they just ought to accept more input types.

@mgeplf
Copy link
Contributor

mgeplf commented Feb 24, 2023

I looked into it more, and with some quick hacking around, I'm not sure it's possible.

On linux, it seems that h5py bundles their own hdf5 libraries, and they are loaded into the process separately from the hdf5 libraries that morphio uses:

Mapped address spaces:                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                   
          Start Addr           End Addr       Size     Offset objfile    
[snip]
      0x7fffd81a1000     0x7fffd81a7000     0x6000   0x374000 /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.103.0.0
[snip]
      0x7ffff6ed1000     0x7ffff6ef1000    0x20000        0x0 /home/gevaert/src/MorphIO/build/venv/lib/python3.8/site-packages/h5py.libs/libhdf5_hl-eb6dc72f.so.200.0.0

Since the way that hdf5 manages resource access is through handles, and each of these instances is managing its own resources, one can't just pass the handle that's used by h5py to the libhdf5 used by morphio.

The next best thing that I can think of is to be able to pass both a path to an h5file and a path to a group within it, when reading or writing morphologies:

m = Morphology.from_h5(path to h5, path within h5)
# and
Morphology.write_h5(path to h5, path within h5)

But I'm a little reluctant to go down this road, mainly because how h5 handles file locking.

@Helveg would that API work for you? I'd have to ponder adding it, some more.

@Helveg
Copy link
Contributor Author

Helveg commented Feb 24, 2023

Yea, those would solve the HDF5 related read/writes!

What about Python's file-like objects? I think I have a proposal there: In the C++ library, factor out the file reading code from the content processing code, with everything remaining backwards compatible. The Python bindings can then contain a bit of glue to determine the file path (using os.fspath), and call the outermost file reading function with the string file path; or should os.fspath yield a TypeError, you use the .read/.write attribute to obtain the content, and call the inner newly factored out content processing function :)

@mgeplf
Copy link
Contributor

mgeplf commented Feb 28, 2023

Yea, those would solve the HDF5 related read/writes!

Ok, I will consider that, and we'll have to see when we can find time to add it.

The Python bindings can then contain a bit of glue to determine the file path (using os.fspath), [...]

I don't think it's as simple as that; for instance; not all python file like objects have paths, and presumably one of the reason that one would want to be able to use a file like object would be to control the position within the file that one reads/writes too (ie: seek()).

The text based parsers already accept content, so factoring things out shouldn't be a problem

morphio::Property::Properties loadFile(const std::string& path, unsigned int options) {

@Helveg
Copy link
Contributor Author

Helveg commented Feb 28, 2023

I don't think it's as simple as that; for instance; not all python file like objects have paths

With these extensions the user could be trying to hand MorphIO a path, or a file-like object, and the 2 need to be consistently separated from each other. Not just strings are PathLike in Python - being a duck typed language where the implementation is often the specification - anything that doesn't TypeError when you call os.fspath on it, is a valid path. So anything that can be fspath'ed should be considered to be a path and read from the filesystem, while anything that does not, should be file-like object and should be .read or .write. Using .read and .write would already respect the fpos/seek position.

@mgeplf
Copy link
Contributor

mgeplf commented Feb 28, 2023

It's possible that we can disambiguate the ctor based on whether an extension is passed, as well as if there is a .read() or .write() method. This all happens on the c++ side, so it's a bit cumbersome to manage, but I will give it a shot.

@1uc
Copy link
Contributor

1uc commented Feb 28, 2023

If two distinct HDF5 libraries are being used, I'm not sure what happens in code such as:

# create_morphlogies.py
import morphio
import h5py

with h5py.File(filename, "w") as h5:
    group_path = "some/thing"
    h5.create_group(group_path)

    morphology = load_morphology()
    morphology.write_h5(filename, group_path)

Are we guaranteed that the group is created and visible before reopening the file/group from the C++ HDF5 library? Metadata is often aggregated an written in bulk, e.g. when closing the file. If we anyway need to make sure that we're using the same HDF5 library, then it might be nicer to accept h5py.File and h5py.Group via bindings we need to implement in HighFive.

@mgeplf
Copy link
Contributor

mgeplf commented Mar 2, 2023

We appear to be having 2 discussions in parallel:

  1. handling h5py file handles with MorphIO

@1uc > Are we guaranteed that the group is created and visible before reopening the file/group from the C++ HDF5 library?

Good point; I'm starting to shy away from wanting to pass around potentially live h5 objects in the interface, especially because of the split hdf5 library, but not only for that reason: it opens up a can of worms with file locking, flushing, etc, etc, and I don't think we have enough control to do it reliably on the C++ side since python's refcounted objects may or may not be closed at any particular point.

Perhaps the more fruitful information will be born from the discussions @matz-e mentioned.

  1. Handling file-like objects for the text-like formats (SWC, ASC):
    @Helveg > while anything that does not, should be file-like object and should be .read or .write. Using .read and .write would already respect the fpos/seek position.

For the case that one wants to pass a descriptor in, that has a name (ie: os.fspath works), I don't think we need to change the API for that; the user can themselves call os.fpath and use the objects as per normal.

I played w/ a way of handling .read and .write, and I'm not sure how well it would work: For the .read case, we already allow loading ascii-like files with Morphology(contents, extension='swc'), so reading works. With a file-like object, one can't signal what the end of the contents that needs to be read is: ie: if you're putting multiple ascii-like morphs in a file, the parser would just read past one into the next.

For the write side, I think we could add a .to_string("extenstion") function, which would allow one to: fd.write(morph.to_string("asc"), which would allow one to manipulate non-filesystem based morphologies. What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants