Add exception mappers to convert storage failures to Iceberg REST client exceptions #8558

Merged: 17 commits from exception-propagation into projectnessie:main on Jul 18, 2024

Conversation


@dimas-b dimas-b commented May 21, 2024

  • Note: before this change, a storage 403 error would manifest as a 500 on the REST client side.

  • Remove ObjectIOException. Use BackendErrorStatus instead.

  • Add exception mappers to convert storage failures to BackendErrorStatus
    and to Iceberg REST client exceptions (see the sketch below).

  • Store error codes in TaskStatus and convert them to Iceberg REST
    status codes when failed tasks are accessed again.

  • Add AccessCheckHandler to ObjectStorageMock for simulating access
    failures in tests.

Closes #8738
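As a rough illustration of the kind of mapping the description refers to, here is a minimal sketch. BackendErrorStatus and BackendErrorCode are names from this PR, but their actual shape, the enum constants, and the classification logic shown here are assumptions, not the real Nessie code.

```java
// Hypothetical sketch only; the real classification in the PR likely differs.
import java.io.FileNotFoundException;

enum BackendErrorCode { UNKNOWN, FORBIDDEN, NOT_FOUND }

final class BackendErrorStatus {
  final BackendErrorCode code;
  final Throwable cause;

  BackendErrorStatus(BackendErrorCode code, Throwable cause) {
    this.code = code;
    this.cause = cause;
  }

  // Classify a storage failure into a coarse-grained error code, so that the REST
  // layer can later answer with a matching HTTP status (e.g. 403) instead of a 500.
  static BackendErrorStatus fromThrowable(Throwable t) {
    if (t instanceof FileNotFoundException) {
      return new BackendErrorStatus(BackendErrorCode.NOT_FOUND, t);
    }
    // Illustrative heuristic only: real storage SDK exceptions carry structured status info.
    String msg = String.valueOf(t.getMessage());
    if (msg.contains("403") || msg.contains("Access Denied")) {
      return new BackendErrorStatus(BackendErrorCode.FORBIDDEN, t);
    }
    return new BackendErrorStatus(BackendErrorCode.UNKNOWN, t);
  }
}
```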


import org.projectnessie.storage.uri.StorageUri;

public class StorageFailureException extends NonRetryableException {
Member:

I think the "hard"/unconditional association with "non-retryable" is not good.
We need a way to distinguish "hard/final failures" from "retryable" ones, so that org.projectnessie.catalog.service.impl.Util#throwableAsErrorTaskState (via e.g. org.projectnessie.catalog.service.impl.EntitySnapshotTaskBehavior#asErrorTaskState) can return the right state (retryable vs final).

Retryable examples:

  • Throttled requests
  • Technical network errors ("no route to host", "read timeout")
  • Other unknown (IO)Exceptions
  • Forbidden (users may "fix" credentials or ACLs)

Non-retryable ones:

  • Not found

We might eventually need a third TaskState to handle "forbidden"/"access denied" better, for the case when the object store's ACLs are just wrong: a state that is immediately reported as an error, but retried later.
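A minimal sketch of how the classification above could be encoded as a flag on the error-code enum. The constant names and retryable choices simply mirror the lists in this comment; the actual BackendErrorCode in the PR may differ, and whether "not found" and UNKNOWN are really final or retryable is debated further down in this thread.

```java
// Sketch only, not the PR's actual BackendErrorCode.
enum BackendErrorCode {
  THROTTLED(true),      // throttled requests
  NETWORK_ERROR(true),  // "no route to host", read timeouts, ...
  UNKNOWN(true),        // other unknown (IO)Exceptions
  FORBIDDEN(true),      // users may "fix" credentials or ACLs
  NOT_FOUND(false);     // treated as final here; see the discussion below

  private final boolean retryable;

  BackendErrorCode(boolean retryable) {
    this.retryable = retryable;
  }

  boolean isRetryable() {
    return retryable;
  }
}
```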

@snazy snazy Jun 22, 2024

We might eventually need a third TaskState to handle "forbidden"/"access denied" better, for the case when the object store's ACLs are just wrong: a state that is immediately reported as an error, but retried later.

Isn't that what TaskStatus.ERROR_RETRY already does?
Hm - seems you're right. ERROR_RETRY seems to wait until the task enters a final state.

However, not sure whether that deserves a third state. We might instead just want a "deadline" that callers can (should) provide. The resulting timeout-exception could provide a reason indicating the current state and error - but that might be a bit too much for now. WDYT?

Member:

To mitigate potentially unmapped/unhandled error states, WDYT about making all current FAILURE states ERROR_RETRY ones but with a somewhat higher retry interval?

Member:

Non-retryable ones: Not found

Actually, this also feels retryable. Sure, in a perfect world "not found" would be a final state.
But it could also be an ephemeral condition, for example when users messed up the object store locations (unlikely, but possible in theory), or when someone manually migrates files.

Member Author (dimas-b):

Yes, that's pretty much where I'm moving (in my local changes ATM). Each caller should time out at some point, and concurrent callers would get the same result within some time window. If another caller touches the same task after a certain time period, a retry will happen regardless of the root cause. We can tune the first and second timeouts separately for each "cause".
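A tiny sketch of the two-timeout idea described here. The class and method names, and the use of a plain String for the cause, are made up for illustration; the real code would presumably key on the error code stored in TaskStatus.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;

// Hypothetical sketch: a failed task keeps reporting its recorded error to callers
// until a per-cause retry interval has elapsed; after that, the next caller that
// touches the task triggers a re-try regardless of the root cause.
final class FailureRetryPolicy {

  enum Decision { REPORT_RECORDED_ERROR, RETRY }

  private final Map<String, Duration> retryAfterByCause;
  private final Duration defaultRetryAfter;

  FailureRetryPolicy(Map<String, Duration> retryAfterByCause, Duration defaultRetryAfter) {
    this.retryAfterByCause = retryAfterByCause;
    this.defaultRetryAfter = defaultRetryAfter;
  }

  Decision onAccess(String cause, Instant failedAt, Instant now) {
    Duration retryAfter = retryAfterByCause.getOrDefault(cause, defaultRetryAfter);
    return now.isAfter(failedAt.plus(retryAfter)) ? Decision.RETRY : Decision.REPORT_RECORDED_ERROR;
  }
}
```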

@dimas-b dimas-b Jun 24, 2024

Re: "not found" : It may be in reality represent a permission error (as in GH), so we cannot really rely on the object being really missing.

@snazy snazy force-pushed the feature/nessie-catalog-server branch 5 times, most recently from 5e176f9 to 49cf7f7 on May 27, 2024 18:55
@snazy snazy deleted the branch projectnessie:main May 28, 2024 16:06
@snazy snazy closed this May 28, 2024

snazy commented May 28, 2024

Oh - seems the deletion of the "merge base" closed this PR :(


dimas-b commented May 28, 2024

No worries. I'll resume on main.

@dimas-b dimas-b force-pushed the exception-propagation branch 4 times, most recently from 19e9645 to 2be307b on June 28, 2024 23:21
@snazy snazy left a comment

Not a full review; it's a lot of changes in this PR. It seems your new approach centers on handling the exception when running in the "import worker", which feels appropriate to me. Just need to keep in mind that we will eventually have to do the same when writing files.

I think the approach can also be implemented using ERROR_RETRY, leaving the FAILURE state as it is (the changes for that are incomplete, BTW).


dimas-b commented Jul 4, 2024

leave the FAILURE state as it is

I'm not sure about that. I think we have to be able to re-attempt after any failure, because the conditions for the failure might change over time. In a way, no failure is a "final" failure. The difference between ERROR_RETRY and FAILURE is only in failing the currently waiting tasks (in the latter case).


snazy commented Jul 5, 2024

leave the FAILURE state as it is

I'm not sure about that. I think we have to be able to re-attempt after any failure, because the conditions for the failure might change over time. In a way, no failure is a "final" failure. The difference between ERROR_RETRY and FAILURE is only in failing the currently waiting tasks (in the latter case).

Well, FAILURE is now well defined as a final state, so I'd rather add another type. For that, it's probably easier to do it in a separate PR and focus only on the tasks stuff.
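For readers not familiar with the states being discussed, here is a rough sketch of the two failure-related task states as they are characterized in this thread; this is not the actual org.projectnessie code.

```java
// Sketch only, based on how the states are described in this conversation.
enum TaskStatus {
  // ... non-failure states elided ...

  // Final failure: callers currently waiting on the task are failed.
  FAILURE,

  // Transient failure: the recorded error is reported, but the task is
  // re-attempted later instead of being treated as final.
  ERROR_RETRY
}
```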

dimas-b added a commit to dimas-b/nessie that referenced this pull request Jul 5, 2024
This may happen when a config class has all its properties
grouped under sub-sections, but we still want to produce
`.md` files from the top-level javadoc.

This is a pre-requisite to projectnessie#8558
@dimas-b dimas-b force-pushed the exception-propagation branch 2 times, most recently from 6fd3824 to 5897903 on July 15, 2024 21:22
@dimas-b dimas-b requested review from adutra and snazy July 15, 2024 21:46
@snazy snazy left a comment

Basically two things left:

  • Moving the config property (not a big one)
  • Compatibility/behavior during rolling upgrades

Otherwise LGTM

try {
  BackendErrorCode errorCode = BackendErrorCode.valueOf(state.errorCode());
  return new PreviousTaskException(errorCode, state.message());
} catch (IllegalArgumentException e) {
  return new PreviousTaskException(UNKNOWN, state.message());
}
Member:

I wonder how far we should go here wrt new BackendErrorCode values added in future releases, and behavior during a rolling upgrade (new version writing NEW_CODE, old version reading it as UNKNOWN and writing it back as UNKNOWN, new version then re-reading it as UNKNOWN as well). Would it make sense to serialize the error code including the HTTP status code, or to make BackendErrorCode a value type?

Member Author (dimas-b):

It's not very apparent in the code, but this method is called only when the Task Service decides it's done trying to execute a task. The UNKNOWN status here will not be stored in Persist, but will only float up to the caller via IcebergErrorMapper... I'll think a bit more about how to make the code clearer, but ATM I'm not sure it warrants any specific action for rolling upgrades.
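To make the flow a bit more concrete, here is a hypothetical sketch of the final mapping direction. IcebergErrorMapper is a real class referenced above, but this is not its actual API; apart from the 403-instead-of-500 behavior mentioned in the PR description, the specific status codes are assumptions.

```java
// Hypothetical sketch only: illustrates why a storage "forbidden" failure can now
// surface as a 403 on the Iceberg REST client instead of a generic 500.
final class ErrorCodeToHttpStatusSketch {

  enum BackendErrorCode { UNKNOWN, THROTTLED, FORBIDDEN, NOT_FOUND }

  static int httpStatusFor(BackendErrorCode code) {
    switch (code) {
      case FORBIDDEN:
        return 403; // previously surfaced as 500 (see PR description)
      case NOT_FOUND:
        return 404; // assumption
      case THROTTLED:
        return 429; // assumption
      default:
        return 500; // UNKNOWN and anything unmapped
    }
  }
}
```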

Member:

I see. All good then.

Member Author (dimas-b):

Approve? ;)

@snazy snazy added this to the 0.93.0 milestone Jul 18, 2024
@dimas-b dimas-b requested a review from snazy July 18, 2024 17:25
* able to determine a specific failure reason. This may be changed in the future when task
* deadlines are supported.
*/
UNKNOWN(false),
Member:

Let's make UNKNOWN retryable - we don't know what happened, so it can be retryable.

Member Author (dimas-b):

I know a real UNKNOWN will end up in an infinite retry.

@dimas-b dimas-b Jul 18, 2024

I'm planning to add task deadlines in a follow-up PR and also fix #8860

@dimas-b dimas-b merged commit c91a96a into projectnessie:main Jul 18, 2024
18 checks passed
@dimas-b dimas-b deleted the exception-propagation branch July 18, 2024 19:31
Successfully merging this pull request may close these issues.

[Catalog] Verify that the Iceberg exception mapping returns the correct status codes and error object contents