Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use int instead of long to keep CSV row number #1287

Merged
merged 2 commits into from
Nov 8, 2022

Conversation

aababilov
Copy link
Collaborator

@aababilov aababilov commented Nov 3, 2022

Generate TooManyRowsNotice (error) for files that have more than 1 billion rows. Such large files would require too much memory and cause OOM.

This change reduces memory consumption by 4 bytes per row. Large stops_times.txt files may contain 30 M and even 500 M lines, so we are saving from 120 MB to 2 GB.

Generate TooManyEntriesNotice (error) for files that have more than
MAX_INT rows. Such large files would require too much memory and cause
OOM.
@isabelle-dr
Copy link
Contributor

isabelle-dr commented Nov 3, 2022

Thanks @aababilov, one question: how do we define MAX_INT?

This new rule will have to be added to the documentation. 😊
We deprecated NOTICES.md, now we centralize all the docs in RULES.md.

@asvechnikov2
Copy link
Collaborator

Thanks! The change looks good. Could you please add a bit more information in the first comment on what impact the change has (e.g. memory savings, etc)? Also, should we use something more conservative than INT_MAX [0], maybe restricting to 500 million entries in a single file or 1 billion? @isabelle-dr do you have any opinion on this?

[0] INT_MAX is 2'147'483'647

@isabelle-dr
Copy link
Contributor

@asvechnikov2 what is the proportion of datasets than have > 500 million entries?

@asvechnikov2
Copy link
Collaborator

@isabelle-dr I think the biggest I saw was under 100 million entries, so it should be safe to assume that there are none for > 500 millions (not even close)

@aababilov
Copy link
Collaborator Author

We deprecated NOTICES.md, now we centralize all the docs in RULES.md.

This is a risky practice. RULES.md already has almost 3 k rows and it has to be maintained manually. Saving this file in IDEA takes several seconds. I would suggest to drop this file completely and generate docs based on comments.

…nyRows

We are counting the amount of rows, not entities. A CSV file may have
empty rows that have no GTFS entities.
@bdferris-v2
Copy link
Collaborator

@aababilov agreed that RULES.md is a pain to maintain. I don't recommend opening it in IDEA but instead a simple text editor. I've explored generating the file from comments directly and have a local patch that automates a big chunk of it, but it's not quite fully there.

Copy link
Collaborator

@asvechnikov2 asvechnikov2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! @isabelle-dr the proposal is to use 1 billion entries as a limit, this is well beyond of what is available now and won't cause any issues.

Copy link
Contributor

@isabelle-dr isabelle-dr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Merging this PR. 🥳

@isabelle-dr isabelle-dr merged commit 1636e44 into MobilityData:master Nov 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants