performance with very large input files #18

Open · UnixJunkie opened this issue Dec 4, 2019 · 4 comments

Comments

@UnixJunkie (Owner)

Parallelization performance does not look very good on very large input files.
Maybe we should not cut the input file into chunks.
Instead, if possible, we should provide readers that seek to a given offset in the file.
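
A minimal sketch of what such an offset-based reader could look like (the `read_block` function and its signature are hypothetical illustrations, not part of the current API):

```ocaml
(* Hypothetical offset-based reader: each worker opens the input file
   itself and seeks to its assigned region, instead of receiving a
   pre-cut chunk through a temporary file. *)
let read_block (fn: string) (offset: int) (len: int): string =
  let ic = open_in_bin fn in
  seek_in ic offset; (* jump directly to the start of the block *)
  let buf = really_input_string ic len in
  close_in ic;
  buf
```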

@UnixJunkie (Owner, Author)

In the parallel case (not distributed), read_some does not need to actually read from the input file:
it can emit a read instruction to the worker process.
For example (see the sketch after this list):

  • read block at index i with size s
  • read lines i to j
  • skip M records then read N records
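
A sketch of how such read instructions could be encoded as a message type sent to workers (the constructor names are hypothetical, not the library's current types):

```ocaml
(* Hypothetical message sent from the demuxer to a worker: a description
   of what to read, rather than the data itself. *)
type read_instruction =
  | Block of int * int   (* read block at index i with size s     *)
  | Lines of int * int   (* read lines i to j                     *)
  | Records of int * int (* skip M records, then read N records   *)
```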

@UnixJunkie (Owner, Author)

On one computer: the demuxer should translate the chunking specification (Bytes|Line|Line_sep|Red) into blocks,
and the worker should take charge of reading that block.
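
A minimal sketch of that translation for the byte-based case, assuming a block is described by an offset and a length (the `block` record and `demux_bytes` function are hypothetical illustrations):

```ocaml
(* Hypothetical block description produced by the demuxer: no data is
   copied; only file coordinates are sent to the workers. *)
type block = { offset: int; length: int }

(* Cut a file of [file_size] bytes into blocks of at most [chunk_size]
   bytes each; the last block may be shorter. *)
let demux_bytes ~(file_size: int) ~(chunk_size: int): block list =
  let rec loop off acc =
    if off >= file_size then List.rev acc
    else
      let len = min chunk_size (file_size - off) in
      loop (off + len) ({ offset = off; length = len } :: acc)
  in
  loop 0 []
```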

@UnixJunkie (Owner, Author)

This will avoid having to create temporary files to hold the chunks.

@UnixJunkie (Owner, Author)

This also needs a benchmark, to check that it really accelerates things.
