
Offline downloads record limit #413

Open

nielsklazenga opened this issue Jun 8, 2021 · 3 comments

Labels
biocache-service Issues related to biocache service
question Further information is requested

Comments

@nielsklazenga

I did a download for the following query, https://biocache-test.ala.org.au/occurrences/search?&q=*&fq=data_resource_uid%3Adr376&disableAllQualityFilters=true, which contained only 500,000 rows.

The same download in the Biocache, https://biocache.ala.org.au/occurrences/search?&q=*&fq=data_resource_uid%3Adr376&disableAllQualityFilters=true, gives me all 994,654 records.

The Biocache Store has slightly more records than LA Pipelines, but not that many more.

I have never done such a big download before, but I can see myself doing bigger downloads in the future. Is the lower record limit on purpose?

@javier-molina javier-molina added the question Further information is requested label Jun 9, 2021
@javier-molina

@nickdos will be able to add more details if needed, but my understanding is that the limit is there for two reasons:

  1. Some users issue a big download without realising whether they actually need that much data.
  2. Big downloads have a performance hit on the system, so it is important to have a guard like this in place to maintain service responsiveness and availability. Our second cut will include improvements in this area: Implement downloads with SOLR streaming web services #367. @nielsklazenga If #367 does not allow us to raise the limit enough to be useful when you need it, the first thing that comes to mind is implementing a power-user role with higher allowances.

@nielsklazenga
Author

@javier-molina, I was just observing the difference with the old system. If the 500,000-record download limit is intended, that is perfectly fine. It might be good to issue a warning and not even start the download if the query yields more than 500,000 records (again, not a show-stopper).

If I am going to need a 500,000+ record download, it is for a very specific thing (all plant records from the VBA) and will not happen more than once a year, so I can make special arrangements. In future, a power user role or something with API keys might be a good idea.
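The warning-before-download suggestion could be sketched as a pre-flight count check: ask the search endpoint for the record total first (requesting zero rows is a common way to get counts from the biocache JSON API) and refuse to start the download when the total exceeds the limit. This is a minimal illustration only; the endpoint path, parameter names, helper names, and the 500,000 limit are assumptions taken from this thread, not a confirmed biocache-service contract.

```python
# Hypothetical pre-flight check for the download limit discussed above.
import urllib.parse

DOWNLOAD_LIMIT = 500_000  # limit observed in this issue; configurable in practice

def build_count_url(base_url: str, q: str, fq: str) -> str:
    """Build a search URL that asks for zero rows, i.e. only the record count."""
    params = {"q": q, "fq": fq, "pageSize": 0}
    return f"{base_url}/occurrences/search?{urllib.parse.urlencode(params)}"

def may_download(total_records: int, limit: int = DOWNLOAD_LIMIT) -> bool:
    """Return True when the query result fits under the download limit."""
    return total_records <= limit
```

Under this sketch, the 994,654-record query from this issue would be rejected up front rather than silently truncated at 500,000.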


nickdos commented Jun 9, 2021

I thought the limit was going to be higher than 500,000. @peggynewman is best placed to advise on this. My understanding is that we want users to be able to download the single largest dataset, but not the "whole ALA". So the limit needs to be something like the eBird or BirdLife number of records.

@javier-molina javier-molina added the post-v1.1 Not required for version 1 or v1.1 release label Jun 10, 2021
@javier-molina javier-molina added biocache-service Issues related to biocache service and removed post-v1.1 Not required for version 1 or v1.1 release labels Sep 16, 2021