Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Pitch] Apache avro serialization #74

Open
jspaezp opened this issue Jan 5, 2024 · 4 comments
Open

[Pitch] Apache avro serialization #74

jspaezp opened this issue Jan 5, 2024 · 4 comments

Comments

@jspaezp
Copy link

jspaezp commented Jan 5, 2024

Hi y'all!

I started a (VERY EARLY PROTOTYPE) that implements serialization to apache avro.
I think it would be a good alternative to json with more efficient disk usage.

https://github.com/jspaezp/avrospeclib

I am still implementing the schema using pydantic and deriving form it the
avro schema.

Some disk usage metrics on a reasonably large speclib I have

    # ~ 50MB  binary speclib file from diann
    #  552M   tmp/speclib_out.tsv
    #  448M   tmp/speclib_out.mzlib.json # using mzspeclib
    #  148M   tests/data/test.mzlib.avro

Read-write speeds

avro write: 4.832904
avro read: 6.133625
json write: 6.304285
json read: 4.992042
pydantic validation: 19.415933 # Not needed for avro because schema is on-write.

let me know if there is any interest in adopting it!
best!

@jspaezp
Copy link
Author

jspaezp commented Jan 5, 2024

Also ... I f you could point me out to an example of a json-resialized clusters and spectrum_attribute_sets I would really appreciate it!

@jspaezp
Copy link
Author

jspaezp commented Jan 5, 2024

other random thoughts...

Questions:

  1. Why is the 'id' key in the dict object if the key is also
    that same identifier? ('analytes': {'1': {id = '1', ...}}).
    Reading the specification file, the rationale is that one could query
    by id but once read the element 'knows' where it came from ... I am not
    100% sure if i aggree with the rationale, since it could be an implementation
    detail of the reader rather than a constrain on the serialization.
  2. Why are cvparam lists not dicts of cvparams?
    Maybe dict[id:(values|list[values])] or even dict[id:list[values]]
    is more efficient than list[dict]? (this follows a similar rationale as the ID)
{
  "1": {
    "id": "1",
    "attributes": [
      {
        "accession": "MS:1003169",
        "name": "proforma peptidoform sequence",
        "value": "AAAAAAAAAAAAAAAASAGGK"
      },
      {
        "accession": "MS:1001117",
        "name": "theoretical mass",
        "value": 1554.8114178831797
      },
      {
        "accession": "MS:1000041",
        "name": "charge state",
        "value": 2.0
      }
    ]
  }
}

vs

{
  "1": {
    "id": "1",
    "attributes": {
      "MS:1003169": {
        "name": "proforma peptidoform sequence",
        "value": "AAAAAAAAAAAAAAAASAGGK"
      },
      "MS:1001117": {
        "name": "theoretical mass",
        "value": 1554.8114178831797
      },
      "MS:1000041": {
        "name": "charge state",
        "value": 2.0
      }
    }
  }
}

OR EVEN

.... This would be very efficient in terms of space

{
    "CVParamNames": {
        "MS:1003169": "proforma peptidoform sequence",
        "MS:1001117": "theoretical mass",
        "MS:1000041": "charge state"
    },
... {
  "1": {
    "id": "1",
    "attributes": {
      "MS:1003169": {
        "value": "AAAAAAAAAAAAAAAASAGGK"
      },
      "MS:1001117": {
        "value": 1554.8114178831797
      },
      "MS:1000041": {
        "value": 2.0
      }
    }
  }
}
}

  1. Why are accession ids AND names required?
    Why not just accession ids? (I can see how names are useful for
    humans, but I can imagine a 'minized' version of the file that
    only has accession ids).

  2. Is there any place where there is a suggestion on how to deal
    with cvparams that encode redundant pieces of information?
    For example ... the proforma sequence "ASDASD/4" over
    "ASDASD" + "MS:1000041 = 4"? or is this on a per-use-case basis?.

@mobiusklein
Copy link
Collaborator

Hello, I wanted to make sure you knew this was seen, but haven't had a chance to write a full response, will hopefully respond by this Friday.

@mobiusklein
Copy link
Collaborator

Why is the 'id' key in the dict object if the key is also
that same identifier? ('analytes': {'1': {id = '1', ...}}).
Reading the specification file, the rationale is that one could query
by id but once read the element 'knows' where it came from ... I am not
100% sure if i aggree with the rationale, since it could be an implementation
detail of the reader rather than a constrain on the serialization.

The JSON scheme was made for ergonomics of use, not for size, and you're correct, it's that the Analyte might need to know its own identity value when it's being used, e.g. when figuring out which interpretation members correspond to it. The idea that the user will want to be able to find an analyte by its id is more convenient than saving a few bytes and forcing any consumer to write a linear probing procedure. If you are writing a new serialization, you don't need to preserve the pattern as long as the reader puts the id value in the Analyte object.

Why are cvparam lists not dicts of cvparams?
Maybe dict[id:(values|list[values])] or even dict[id:list[values]]
is more efficient than list[dict]? (this follows a similar rationale as the ID)

The outer-most container isn't a dict so that we can repeat the same term multiple times. The elements themselves are dict objects because it lets us represent all the parts of the CVParam structure conveniently, with name, accession, value, and group ID visible.

Why are accession ids AND names required?
Why not just accession ids? (I can see how names are useful for
humans, but I can imagine a 'minized' version of the file that
only has accession ids).

The name is again kept for human readability. If you were writing a binary format you wouldn't need to store it, just the accession or something like it to be able to get back to the full CVParam at read time.

Is there any place where there is a suggestion on how to deal
with cvparams that encode redundant pieces of information?
For example ... the proforma sequence "ASDASD/4" over
"ASDASD" + "MS:1000041 = 4"? or is this on a per-use-case basis?.

I don't know if this is explicitly called out in the specification, but I think (and may be wrong) that readers should support and recognize reading both separate and compound values like proforma peptidoform ion notation, and it is an implementation detail of the reader which is given precedence. Charge state in particular is complicated because it is part of both analytes and spectra, and means different things in different contexts, especially because of chimeric spectra.

We don't have any example libraries with clusters in them yet, but I converted one of the SpectraST libraries to JSON. It includes several attribute sets. I think there are still problems with the conversion though, so I'll work on this more in the coming week.

fetal_brain_tiny.mzlib.json

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants