Ensure proper behavior if quota limits are hit #1

mark-meyer · 2021-06-28T23:05:55Z

Textract has several quotas.

Of particular concern are:

10 requests/second quota on GetDocumentTextDetection
This is the function that is used to get paginated results from a scanned document. For long PDFs this may be called many times for a single document. The processing of OCR results happens asynchronously based on when Textract has finished processing the PDF, which means we can't control exactly when this function will be called.
Maximum number of asynchronous jobs per account that can simultaneously exist: 600
We have about 780,000 documents to process, which means we will need to limit the rate at which we start async jobs.
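
As a rough illustration of coping with the 10 requests/second quota, the sketch below pages through `GetDocumentTextDetection` results and backs off when a call is throttled. It assumes a boto3 Textract client is passed in; the helper names (`is_throttle_error`, `call_with_backoff`, `fetch_all_blocks`) and the retry parameters are hypothetical and not taken from this project.

```python
import random
import time

# Error codes that Textract returns when a request is rate limited.
THROTTLE_CODES = {"ProvisionedThroughputExceededException", "ThrottlingException"}


def is_throttle_error(exc):
    """True if exc looks like a botocore ClientError carrying a throttle code."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code") in THROTTLE_CODES


def call_with_backoff(fn, max_attempts=8, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(); on a throttle error, wait (exponential backoff with jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Re-raise anything that is not throttling, or the final failed attempt.
            if not is_throttle_error(exc) or attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))


def fetch_all_blocks(textract, job_id, **backoff_kwargs):
    """Collect every Block of a finished job by following NextToken.

    `textract` is assumed to be boto3.client("textract").
    """
    blocks = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        page = call_with_backoff(
            lambda: textract.get_document_text_detection(**kwargs),
            **backoff_kwargs,
        )
        blocks.extend(page.get("Blocks", []))
        next_token = page.get("NextToken")
        if not next_token:
            return blocks
```

Backing off inside the handler only smooths short bursts. If many documents finish at once, the calls may still need to be spread out by the queue itself.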

A possible solution:

Limit the number of concurrent lambdas processing documents so we never exceed the limit of 600 simultaneous async jobs.
Set a high number of retries on the SQS queue so failures simply get rescheduled.
Use a dead letter queue to catch anything that fails after max tries so we can resend.
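
The three steps above could be wired up roughly as follows. This is a sketch only: it assumes boto3 Lambda and SQS clients are passed in, and the function names, the concurrency cap of 50, and the receive count of 20 are placeholder values that would need tuning, not settings from this project.

```python
import json


def build_redrive_policy(dlq_arn, max_receive_count):
    """Return the JSON string SQS expects for the RedrivePolicy queue attribute."""
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    })


def apply_limits(lambda_client, sqs_client, function_name, queue_url, dlq_arn,
                 max_concurrency=50, max_receive_count=20):
    """Cap Lambda concurrency and send repeatedly failing messages to a DLQ.

    lambda_client and sqs_client are assumed to be boto3.client("lambda")
    and boto3.client("sqs"); all other arguments are placeholders.
    """
    # Reserved concurrency bounds how many copies of the function run at once.
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=max_concurrency,
    )
    # A message is retried until it has been received max_receive_count times,
    # after which SQS moves it to the dead letter queue.
    sqs_client.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"RedrivePolicy": build_redrive_policy(dlq_arn, max_receive_count)},
    )
```

Because a Textract job keeps running after the lambda that started it has returned, capping concurrency limits how quickly jobs are started rather than how many exist at once, so the cap may need to sit well below 600.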

This may require some testing to get right.
