Use int instead of long to keep CSV row number #1287

aababilov · 2022-11-03T03:56:53Z

Generate TooManyRowsNotice (error) for files that have more than 1 billion rows. Such large files would require too much memory and cause OOM.

This change reduces memory consumption by 4 bytes per row. Large stop_times.txt files may contain 30 M or even 500 M lines, so we save between 120 MB and 2 GB.
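The arithmetic behind those figures can be checked with a small sketch. It assumes each row number is stored as a primitive (8 bytes for a long, 4 for an int) with no boxing overhead; the class and method names are hypothetical and are not taken from the validator's code.

```java
/** Back-of-the-envelope estimate of the memory saved by storing row numbers as int. */
final class RowNumberSavings {
  // A primitive long takes 8 bytes and a primitive int takes 4,
  // so every stored row number shrinks by 4 bytes.
  static final long BYTES_SAVED_PER_ROW = Long.BYTES - Integer.BYTES;

  /** Returns the number of bytes saved for a file with the given row count. */
  static long savedBytes(long rowCount) {
    return rowCount * BYTES_SAVED_PER_ROW;
  }

  public static void main(String[] args) {
    System.out.println(savedBytes(30_000_000L));  // 120000000 bytes, about 120 MB
    System.out.println(savedBytes(500_000_000L)); // 2000000000 bytes, about 2 GB
  }
}
```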

Generate TooManyEntriesNotice (error) for files that have more than MAX_INT rows. Such large files would require too much memory and cause OOM.

isabelle-dr · 2022-11-03T15:55:04Z

Thanks @aababilov, one question: how do we define MAX_INT?

This new rule will have to be added to the documentation. 😊
We deprecated NOTICES.md, now we centralize all the docs in RULES.md.

asvechnikov2 · 2022-11-03T21:17:17Z

Thanks! The change looks good. Could you please add a bit more information in the first comment on what impact the change has (e.g. memory savings, etc)? Also, should we use something more conservative than INT_MAX [0], maybe restricting to 500 million entries in a single file or 1 billion? @isabelle-dr do you have any opinion on this?

[0] INT_MAX is 2'147'483'647
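For illustration, here is a minimal sketch of how a more conservative cap could sit below the int range. In Java the constant is `Integer.MAX_VALUE`. The class name, the `MAX_ROW_NUMBER` constant, and the choice to throw an exception are all assumptions made for this example; the validator itself reports a notice, and its actual code may look quite different.

```java
/** Sketch of a row-number cap that stays well inside the int range. */
final class RowLimitSketch {
  // Hypothetical cap: the 1 billion figure floated in this thread.
  // It is below Integer.MAX_VALUE (2147483647), so it fits in an int.
  static final int MAX_ROW_NUMBER = 1_000_000_000;

  /**
   * Narrows a row number to int, rejecting values above the cap.
   * Throwing here is a stand-in for emitting an error notice.
   */
  static int toIntRowNumber(long rowNumber) {
    if (rowNumber > MAX_ROW_NUMBER) {
      throw new IllegalStateException("Too many rows: " + rowNumber);
    }
    return (int) rowNumber; // safe: the value is at most MAX_ROW_NUMBER
  }

  public static void main(String[] args) {
    System.out.println(Integer.MAX_VALUE);              // 2147483647
    System.out.println(toIntRowNumber(1_000_000_000L)); // 1000000000
  }
}
```

Checking against a fixed cap before the cast means the int counter can never overflow, whichever limit is chosen.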

isabelle-dr · 2022-11-04T12:19:00Z

@asvechnikov2 what is the proportion of datasets that have > 500 million entries?

asvechnikov2 · 2022-11-08T01:28:31Z

@isabelle-dr I think the biggest I saw was under 100 million entries, so it should be safe to assume that there are none with > 500 million (not even close).

aababilov · 2022-11-08T03:14:24Z

> We deprecated NOTICES.md, now we centralize all the docs in RULES.md.

This is a risky practice. RULES.md already has almost 3k lines and has to be maintained manually. Saving this file in IDEA takes several seconds. I would suggest dropping this file completely and generating the docs from comments.

bdferris-v2 · 2022-11-08T03:29:26Z

@aababilov agreed that RULES.md is a pain to maintain. I don't recommend opening it in IDEA but instead a simple text editor. I've explored generating the file from comments directly and have a local patch that automates a big chunk of it, but it's not quite fully there.

asvechnikov2

LGTM! @isabelle-dr the proposal is to use 1 billion entries as a limit. This is well beyond what is available now and won't cause any issues.

isabelle-dr

LGTM! Merging this PR. 🥳

aababilov requested a review from asvechnikov2 November 3, 2022 03:57

aababilov force-pushed the csv-row-int branch from 060c77f to f2ecbe4 Compare November 3, 2022 03:57

Use int instead of long to keep CSV row number

0d55c90

Generate TooManyEntriesNotice (error) for files that have more than MAX_INT rows. Such large files would require too much memory and cause OOM.

aababilov force-pushed the csv-row-int branch from f2ecbe4 to 0d55c90 Compare November 3, 2022 04:00

Reduce max rows number to 1 billion and rename TooManyEntries to TooManyRows

ff924d1

We are counting the amount of rows, not entities. A CSV file may have empty rows that have no GTFS entities.

asvechnikov2 approved these changes Nov 8, 2022

isabelle-dr approved these changes Nov 8, 2022

isabelle-dr merged commit 1636e44 into MobilityData:master Nov 8, 2022
