Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Typescript: provide message iterator which yields synchronously for messages #1188

Closed
wants to merge 1 commit into from

Conversation

james-rms
Copy link
Collaborator

@james-rms james-rms commented Jun 24, 2024

Changelog

Docs

Description

await on every message significantly slows down iteration over messages, when there are many messages per chunk. See benchmark results here:

$ benchmarks % yarn bench --suite reader
Running 'reader' suite
McapStreamReader
        2.97±0.04 op/s  Heap Used: 265.88±1.07 MB/op    Heap Total: 298.72±0.70 MB/op   ArrayBuffers: 68.20±1.83 MB/op
readMessages_async
        1.63±0.02 op/s  Heap Used: 276.18±0.40 MB/op    Heap Total: 315.12±0.41 MB/op   ArrayBuffers: 54.70±2.00 MB/op
readMessages_async_reverse
        1.67±0.01 op/s  Heap Used: 272.47±0.01 MB/op    Heap Total: 311.40±0.09 MB/op   ArrayBuffers: 52.70±0.00 MB/op
readMessages_sync
        2.17±0.01 op/s  Heap Used: 282.83±0.67 MB/op    Heap Total: 319.89±0.20 MB/op   ArrayBuffers: 52.70±0.00 MB/op
readMessages_sync_reverse
        2.16±0.01 op/s  Heap Used: 276.25±0.00 MB/op    Heap Total: 315.57±0.10 MB/op   ArrayBuffers: 52.70±0.00 MB/op

That is roughly 30% faster.

This PR provides a new method (with name subject to change) that allows callers to only await when necessary.

I'm not confident on the name for this method or return type, happy for feedback there.

BeforeAfter

@james-rms james-rms force-pushed the jrms/fast-typescript-reading branch from 392be23 to 79ae3c2 Compare July 26, 2024 03:14
@james-rms james-rms changed the title Typescript: read in-order without heap Typescript: provide message iterator which yields synchronously for messages Jul 26, 2024
@james-rms james-rms marked this pull request as ready for review July 26, 2024 03:23

/** Returns an iterator that iterates over messages in the MCAP file in order of log time.
* The returned object will have either the `message` or `promise` member defined. If `promise`
* is not undefined, the caller must wait for it to resolve before calling next().
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* is not undefined, the caller must wait for it to resolve before calling next().
* is defined, the caller must wait for it to resolve before calling next().

@defunctzombie
Copy link
Contributor

The general takeaway here is that the "async" aspect of getting the next message is typically only async when it comes to loading the next chunk, and otherwise getting the next message is a "sync" operation where you can immediately grab the message from an internal data structure.

My initial reaction to the proposed interface is lackluster - feels somewhat awkward but I'd be open to seeing this kind of interface in any prior art. Maybe another way to accomplish a similar result is to return an iterator that provides batches of messages? So your outer iterator get you some other iterable thing that will return the latest messages in the loaded batch until you need to "wait for more"? Its not obvious to me that this would be easier or more natural to use.

@james-rms
Copy link
Collaborator Author

james-rms commented Jul 26, 2024

Yeah - i agree the interface isn't elegant. I figure there's a place in the world for uglier-but-faster code.

Maybe another way to accomplish a similar result is to return an iterator that provides batches of messages?

As the caller I prefer one for loop with an if vs. two nested loops, but that's just my preference.

@james-rms james-rms closed this Aug 5, 2024
@james-rms james-rms reopened this Aug 7, 2024
@james-rms
Copy link
Collaborator Author

I closed this before because I was going to pursue a nested-loop api, but i've decided against that now. My reason is that with this example API:

async *readMessages(): AsyncGenerator<Generator<TypedMcapRecord["Message"], void, void>, void, void>

It's too tempting for a caller to try to store the second-level generators between iterations of the top level. I can't think of a good design that both a) doesn't leak objects into orphaned closures and b) can handle iterators being stored gracefully.

@james-rms
Copy link
Collaborator Author

cc. @AaronO

@james-rms james-rms closed this Aug 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants