[Pitch] Apache avro serialization #74

jspaezp · 2024-01-05T05:14:47Z

Hi y'all!

I started a (VERY EARLY PROTOTYPE) that implements serialization to apache avro.
I think it would be a good alternative to json with more efficient disk usage.

https://github.com/jspaezp/avrospeclib

I am still implementing the schema using pydantic and deriving form it the
avro schema.

Some disk usage metrics on a reasonably large speclib I have

    # ~ 50MB  binary speclib file from diann
    #  552M   tmp/speclib_out.tsv
    #  448M   tmp/speclib_out.mzlib.json # using mzspeclib
    #  148M   tests/data/test.mzlib.avro

Read-write speeds

avro write: 4.832904
avro read: 6.133625
json write: 6.304285
json read: 4.992042
pydantic validation: 19.415933 # Not needed for avro because schema is on-write.

let me know if there is any interest in adopting it!
best!

The text was updated successfully, but these errors were encountered:

jspaezp · 2024-01-05T16:09:43Z

Also ... I f you could point me out to an example of a json-resialized clusters and spectrum_attribute_sets I would really appreciate it!

jspaezp · 2024-01-05T18:31:34Z

other random thoughts...

Questions:

Why is the 'id' key in the dict object if the key is also
that same identifier? ('analytes': {'1': {id = '1', ...}}).
Reading the specification file, the rationale is that one could query
by id but once read the element 'knows' where it came from ... I am not
100% sure if i aggree with the rationale, since it could be an implementation
detail of the reader rather than a constrain on the serialization.
Why are cvparam lists not dicts of cvparams?
Maybe dict[id:(values|list[values])] or even dict[id:list[values]]
is more efficient than list[dict]? (this follows a similar rationale as the ID)

{
  "1": {
    "id": "1",
    "attributes": [
      {
        "accession": "MS:1003169",
        "name": "proforma peptidoform sequence",
        "value": "AAAAAAAAAAAAAAAASAGGK"
      },
      {
        "accession": "MS:1001117",
        "name": "theoretical mass",
        "value": 1554.8114178831797
      },
      {
        "accession": "MS:1000041",
        "name": "charge state",
        "value": 2.0
      }
    ]
  }
}

vs

{
  "1": {
    "id": "1",
    "attributes": {
      "MS:1003169": {
        "name": "proforma peptidoform sequence",
        "value": "AAAAAAAAAAAAAAAASAGGK"
      },
      "MS:1001117": {
        "name": "theoretical mass",
        "value": 1554.8114178831797
      },
      "MS:1000041": {
        "name": "charge state",
        "value": 2.0
      }
    }
  }
}

OR EVEN

.... This would be very efficient in terms of space

{
    "CVParamNames": {
        "MS:1003169": "proforma peptidoform sequence",
        "MS:1001117": "theoretical mass",
        "MS:1000041": "charge state"
    },
... {
  "1": {
    "id": "1",
    "attributes": {
      "MS:1003169": {
        "value": "AAAAAAAAAAAAAAAASAGGK"
      },
      "MS:1001117": {
        "value": 1554.8114178831797
      },
      "MS:1000041": {
        "value": 2.0
      }
    }
  }
}
}

Why are accession ids AND names required?
Why not just accession ids? (I can see how names are useful for
humans, but I can imagine a 'minized' version of the file that
only has accession ids).
Is there any place where there is a suggestion on how to deal
with cvparams that encode redundant pieces of information?
For example ... the proforma sequence "ASDASD/4" over
"ASDASD" + "MS:1000041 = 4"? or is this on a per-use-case basis?.

mobiusklein · 2024-01-16T15:28:58Z

Hello, I wanted to make sure you knew this was seen, but haven't had a chance to write a full response, will hopefully respond by this Friday.

mobiusklein · 2024-01-26T03:12:33Z

Why is the 'id' key in the dict object if the key is also
that same identifier? ('analytes': {'1': {id = '1', ...}}).
Reading the specification file, the rationale is that one could query
by id but once read the element 'knows' where it came from ... I am not
100% sure if i aggree with the rationale, since it could be an implementation
detail of the reader rather than a constrain on the serialization.

The JSON scheme was made for ergonomics of use, not for size, and you're correct, it's that the Analyte might need to know its own identity value when it's being used, e.g. when figuring out which interpretation members correspond to it. The idea that the user will want to be able to find an analyte by its id is more convenient than saving a few bytes and forcing any consumer to write a linear probing procedure. If you are writing a new serialization, you don't need to preserve the pattern as long as the reader puts the id value in the Analyte object.

Why are cvparam lists not dicts of cvparams?
Maybe dict[id:(values|list[values])] or even dict[id:list[values]]
is more efficient than list[dict]? (this follows a similar rationale as the ID)

The outer-most container isn't a dict so that we can repeat the same term multiple times. The elements themselves are dict objects because it lets us represent all the parts of the CVParam structure conveniently, with name, accession, value, and group ID visible.

Why are accession ids AND names required?
Why not just accession ids? (I can see how names are useful for
humans, but I can imagine a 'minized' version of the file that
only has accession ids).

The name is again kept for human readability. If you were writing a binary format you wouldn't need to store it, just the accession or something like it to be able to get back to the full CVParam at read time.

Is there any place where there is a suggestion on how to deal
with cvparams that encode redundant pieces of information?
For example ... the proforma sequence "ASDASD/4" over
"ASDASD" + "MS:1000041 = 4"? or is this on a per-use-case basis?.

I don't know if this is explicitly called out in the specification, but I think (and may be wrong) that readers should support and recognize reading both separate and compound values like proforma peptidoform ion notation, and it is an implementation detail of the reader which is given precedence. Charge state in particular is complicated because it is part of both analytes and spectra, and means different things in different contexts, especially because of chimeric spectra.

We don't have any example libraries with clusters in them yet, but I converted one of the SpectraST libraries to JSON. It includes several attribute sets. I think there are still problems with the conversion though, so I'll work on this more in the coming week.

fetal_brain_tiny.mzlib.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Pitch] Apache avro serialization #74

[Pitch] Apache avro serialization #74

jspaezp commented Jan 5, 2024

jspaezp commented Jan 5, 2024

jspaezp commented Jan 5, 2024 •

edited

Loading

mobiusklein commented Jan 16, 2024

mobiusklein commented Jan 26, 2024

[Pitch] Apache avro serialization #74

[Pitch] Apache avro serialization #74

Comments

jspaezp commented Jan 5, 2024

jspaezp commented Jan 5, 2024

jspaezp commented Jan 5, 2024 • edited Loading

mobiusklein commented Jan 16, 2024

mobiusklein commented Jan 26, 2024

jspaezp commented Jan 5, 2024 •

edited

Loading