Make PrettyMIDI object serializable efficiently #133

Open
bzamecnik opened this issue Jul 26, 2017 · 5 comments

@bzamecnik
Contributor

Parsing MIDI is rather slow (e.g. music21 is pretty slow; pretty-midi is better, but still not pretty fast), and we might want to run queries on the parsed data or use it multiple times (e.g. as training data for an ML model). Besides trying to optimize the parsing stage, another option is to cache the parsed results, e.g. by pickling or a different form of serialization. In music21 it seems the objects are not serializable at all. In pretty-midi I was able to serialize the PrettyMIDI object and load it back, but the problem is that the serialized form is several orders of magnitude bigger than the original MIDI (just prohibitively big: several MB for a few kB of MIDI).

The subject of this issue is to serialize only vital information that can be used to restore the object while keeping any post-processing still faster than parsing the MIDI again.

Originally I thought pickle was storing derived data such as the output of get_piano_roll(), but after a very superficial inspection it seems that some internal properties like __tick_to_time take up much of the space. I can investigate and measure it in more detail.
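One rough way to do that measurement is to pickle each instance attribute separately and compare the sizes. The sketch below uses only the standard library; `pickled_size_by_attribute` and the `Demo` class are hypothetical names made up for illustration, not part of pretty_midi.

```python
import pickle


def pickled_size_by_attribute(obj):
    """Return {attribute name: pickled size in bytes}, largest first.

    Attributes that cannot be pickled on their own are reported as None.
    """
    sizes = {}
    for name, value in vars(obj).items():
        try:
            sizes[name] = len(pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL))
        except Exception:
            sizes[name] = None
    # Sort by size, treating unpicklable attributes as size 0.
    return dict(sorted(sizes.items(), key=lambda kv: kv[1] or 0, reverse=True))


class Demo:
    """Hypothetical stand-in with small vital data and a large cache."""

    def __init__(self):
        self.notes = [(60, 0.0, 0.5), (62, 0.5, 1.0)]
        self._cache = [i / 1000.0 for i in range(100_000)]


def report(obj):
    """Print one line per attribute, largest pickled size first."""
    for name, size in pickled_size_by_attribute(obj).items():
        print(f"{name:30s} {size}")
```

Calling `report(pm)` on a loaded PrettyMIDI object should list its attributes by pickled size. Because of Python's name mangling, a double-underscore attribute such as `__tick_to_time` would presumably show up under a name like `_PrettyMIDI__tick_to_time`.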

A possible solution might be to explicitly provide the objects to be serialized and possibly compress them (e.g. dense matrix to sparse) before serialization and decompress them after deserialization.

The goal is to reduce the pickled size to something comparable to the MIDI file (or one or two orders of magnitude bigger) and also to keep the (de)serialization time low.
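A minimal sketch of the "serialize only vital information" idea, using the standard `__getstate__`/`__setstate__` pickle hooks to leave a derived cache out of the pickle and rebuild it lazily. `CachedScore` and all of its fields are hypothetical stand-ins; this is not how pretty_midi is implemented, only an illustration of the technique under the assumption that the cache can be recomputed from the stored fields.

```python
import pickle


class CachedScore:
    """Hypothetical stand-in for an object with a large derived cache."""

    def __init__(self, notes, resolution=480, tempo_bpm=120.0, max_tick=100_000):
        self.notes = notes            # vital data: (pitch, start_tick, end_tick)
        self.resolution = resolution  # ticks per quarter note
        self.tempo_bpm = tempo_bpm    # single constant tempo, for simplicity
        self.max_tick = max_tick
        self._tick_to_time = None     # derived lookup table, built on demand

    def tick_to_time(self, tick):
        """Convert a tick to seconds, building the lookup table if needed."""
        if self._tick_to_time is None:
            seconds_per_tick = 60.0 / (self.tempo_bpm * self.resolution)
            self._tick_to_time = [
                t * seconds_per_tick for t in range(self.max_tick + 1)
            ]
        return self._tick_to_time[tick]

    def __getstate__(self):
        # Pickle everything except the cache, which can be recomputed.
        state = self.__dict__.copy()
        state["_tick_to_time"] = None
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
```

With this pattern the pickle holds only the notes and a few scalars, and the first call to `tick_to_time` after unpickling pays the cost of rebuilding the table. Whether that trade-off beats re-parsing the MIDI file would need to be measured.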

@craffel
Owner

craffel commented Jul 26, 2017

Maybe it would be best to first define what you need in terms of speed, memory usage, disk space, etc. I have done a few projects which involve parsing/analyzing/using ML models on O(100,000) MIDI files with pretty_midi on commodity hardware with no issues.

> it seems some internal properties like __tick_to_time take much space.

Yes, this is cached to make things more efficient. A few megabytes should be no big deal in memory, I think :)

@cifkao

cifkao commented Apr 25, 2019

Here is a comparison of creating a PrettyMIDI object from MIDI vs. unpickling it. The first file is very small (8 bars of monophony), the second one is a full song.

In [1]: %timeit pretty_midi.PrettyMIDI('small.mid')
3.42 ms ± 42 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [2]: %timeit with open('small.pickle', 'rb') as f: pickle.load(f)
236 µs ± 6.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [3]: %timeit pretty_midi.PrettyMIDI('smoke.mid')
367 ms ± 4.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit with open('smoke.pickle', 'rb') as f: pickle.load(f)
25.4 ms ± 396 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Here are the file sizes:

 514 small.mid           56K smoke.mid
126K small.pickle       2.1M smoke.pickle
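Given timings like these, one way to use pickling as a cache is a small helper that parses a file once and loads the pickle on later calls. This is only a sketch: `load_with_cache`, its parameters, and the `.pickle` suffix convention are assumptions made up for illustration, and the parser is passed in as a callable so the helper does not depend on any particular library.

```python
import os
import pickle


def load_with_cache(source_path, parse, cache_path=None):
    """Return parse(source_path), reusing a pickle cache when it is fresh.

    The cache counts as fresh if it exists and is at least as new as the
    source file. `parse` is any callable taking a path.
    """
    if cache_path is None:
        cache_path = source_path + ".pickle"
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    # Cache is missing or stale: parse and write it for next time.
    obj = parse(source_path)
    with open(cache_path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
    return obj
```

For example, `load_with_cache('smoke.mid', pretty_midi.PrettyMIDI)` would parse the file on the first call and unpickle it afterwards, at the cost of the larger file on disk shown above.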

@cifkao

cifkao commented Apr 25, 2019

@craffel Would it be viable to integrate the NoteSequence protobuffer from Magenta into pretty_midi (either by adding conversion methods or by directly including it as part of the internal representation)? It seems that the proto mimics the design of pretty_midi, and the code for converting back and forth already exists. However, it could be inconvenient to use Magenta directly, since it depends on a lot of other packages (e.g. a specific version of TensorFlow).

@craffel
Owner

craffel commented Apr 25, 2019

No, I don't think so. pretty_midi does not depend or rely on NoteSequence in any way; the dependency graph only points in one direction. If it's hard to use NoteSequence because of all of Magenta's dependencies, I'd suggest you advocate for Magenta to factor out NoteSequence into a separate library.

@cifkao

cifkao commented Sep 10, 2020

note-seq is now a separate library with reduced dependencies, and NoteSequence has been fixed to support efficient pickling!

Now we can do this:

import pretty_midi, note_seq, pickle

pm = pretty_midi.PrettyMIDI('file.mid')

# PrettyMIDI -> NoteSequence -> pickle
ns = note_seq.midi_to_sequence_proto(pm)
with open('file.pickle', 'wb') as f:
    pickle.dump(ns, f)

# pickle -> NoteSequence -> PrettyMIDI
with open('file.pickle', 'rb') as f:
    ns = pickle.load(f)
pm = note_seq.sequence_proto_to_pretty_midi(ns)
