performance with very large input files #18

Open · UnixJunkie opened this issue Dec 4, 2019 · 4 comments

Comments

@UnixJunkie (Owner)

Parallelization performance does not look very good on very large input files.
Maybe we should not cut the input file into chunks.
Instead, if possible, we should provide readers that seek to a given offset in the file.
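
A minimal sketch of what such an offset-based reader could look like (the `read_block` function and its signature are hypothetical illustrations, not part of the current API):

```ocaml
(* Hypothetical offset-based reader: each worker opens the input file
   itself and seeks to its assigned region, instead of receiving a
   pre-cut chunk through a temporary file. *)
let read_block (fn: string) (offset: int) (len: int): string =
  let ic = open_in_bin fn in
  seek_in ic offset; (* jump directly to the start of the block *)
  let buf = really_input_string ic len in
  close_in ic;
  buf
```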

@UnixJunkie (Owner, Author)

In the parallel case (not distributed), read_some does not need to actually read from the input file:
it can emit a read instruction to the worker process.
For example (see the sketch after this list):

  • read block at index i with size s
  • read lines i to j
  • skip M records then read N records
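
A sketch of how such read instructions could be encoded as a message type sent to workers (the constructor names are hypothetical, not the library's current types):

```ocaml
(* Hypothetical message sent from the demuxer to a worker: a description
   of what to read, rather than the data itself. *)
type read_instruction =
  | Block of int * int   (* read block at index i with size s     *)
  | Lines of int * int   (* read lines i to j                     *)
  | Records of int * int (* skip M records, then read N records   *)
```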

@UnixJunkie (Owner, Author)

On one computer: the demuxer should translate the chunking specification (Bytes|Line|Line_sep|Red) into blocks,
and the worker should take charge of reading that block.
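
A minimal sketch of that translation for the byte-based case, assuming a block is described by an offset and a length (the `block` record and `demux_bytes` function are hypothetical illustrations):

```ocaml
(* Hypothetical block description produced by the demuxer: no data is
   copied; only file coordinates are sent to the workers. *)
type block = { offset: int; length: int }

(* Cut a file of [file_size] bytes into blocks of at most [chunk_size]
   bytes each; the last block may be shorter. *)
let demux_bytes ~(file_size: int) ~(chunk_size: int): block list =
  let rec loop off acc =
    if off >= file_size then List.rev acc
    else
      let len = min chunk_size (file_size - off) in
      loop (off + len) ({ offset = off; length = len } :: acc)
  in
  loop 0 []
```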

@UnixJunkie (Owner, Author)

This will avoid having to create temporary files to hold the chunks.

@UnixJunkie (Owner, Author)

This also needs a benchmark, to check that it really accelerates things.
