Add exception mappers to convert storage failures to Iceberg REST client exceptions #8558

Merged: 17 commits from exception-propagation into projectnessie:main on Jul 18, 2024

Conversation


@dimas-b dimas-b commented May 21, 2024

  • Note: before this change, a storage 403 error would manifest as a 500 on the REST client side.

  • Remove ObjectIOException. Use BackendErrorStatus instead.

  • Add exception mappers to convert storage failures to BackendErrorStatus
    and to Iceberg REST client exceptions (see the sketch below).

  • Store error codes in TaskStatus and convert them to Iceberg REST
    status codes when failed tasks are accessed again.

  • Add AccessCheckHandler to ObjectStorageMock for simulating access
    failures in tests.

Closes #8738
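As a rough illustration of the kind of mapping the description refers to, here is a minimal sketch. BackendErrorStatus and BackendErrorCode are names from this PR, but their actual shape, the enum constants, and the classification logic shown here are assumptions, not the real Nessie code.

```java
// Hypothetical sketch only; the real classification in the PR likely differs.
import java.io.FileNotFoundException;

enum BackendErrorCode { UNKNOWN, FORBIDDEN, NOT_FOUND }

final class BackendErrorStatus {
  final BackendErrorCode code;
  final Throwable cause;

  BackendErrorStatus(BackendErrorCode code, Throwable cause) {
    this.code = code;
    this.cause = cause;
  }

  // Classify a storage failure into a coarse-grained error code, so that the REST
  // layer can later answer with a matching HTTP status (e.g. 403) instead of a 500.
  static BackendErrorStatus fromThrowable(Throwable t) {
    if (t instanceof FileNotFoundException) {
      return new BackendErrorStatus(BackendErrorCode.NOT_FOUND, t);
    }
    // Illustrative heuristic only: real storage SDK exceptions carry structured status info.
    String msg = String.valueOf(t.getMessage());
    if (msg.contains("403") || msg.contains("Access Denied")) {
      return new BackendErrorStatus(BackendErrorCode.FORBIDDEN, t);
    }
    return new BackendErrorStatus(BackendErrorCode.UNKNOWN, t);
  }
}
```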


import org.projectnessie.storage.uri.StorageUri;

public class StorageFailureException extends NonRetryableException {
Member:

I think the "hard"/unconditional association with "non-retryable" is not good.
We need a way to distinguish "hard/final failures" from "retryable" ones, so that org.projectnessie.catalog.service.impl.Util#throwableAsErrorTaskState (via e.g. org.projectnessie.catalog.service.impl.EntitySnapshotTaskBehavior#asErrorTaskState) can return the right state (retryable vs final).

Retryable examples:

  • Throttled requests
  • Technical network errors ("no route to host", "read timeout")
  • Other unknown (IO)Exceptions
  • Forbidden (users may "fix" credentials or ACLs)

Non-retryable ones:

  • Not found

We might eventually need a third TaskState to handle "forbidden"/"access denied" better, for the case when the object store's ACLs are just wrong: a state that is immediately reported as an error, but retried later.
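A minimal sketch of how the classification above could be encoded as a flag on the error-code enum. The constant names and retryable choices simply mirror the lists in this comment; the actual BackendErrorCode in the PR may differ, and whether "not found" and UNKNOWN are really final or retryable is debated further down in this thread.

```java
// Sketch only, not the PR's actual BackendErrorCode.
enum BackendErrorCode {
  THROTTLED(true),      // throttled requests
  NETWORK_ERROR(true),  // "no route to host", read timeouts, ...
  UNKNOWN(true),        // other unknown (IO)Exceptions
  FORBIDDEN(true),      // users may "fix" credentials or ACLs
  NOT_FOUND(false);     // treated as final here; see the discussion below

  private final boolean retryable;

  BackendErrorCode(boolean retryable) {
    this.retryable = retryable;
  }

  boolean isRetryable() {
    return retryable;
  }
}
```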

@snazy snazy Jun 22, 2024

We might eventually need a third TaskState to handle "forbidden"/"access denied" better, for the case when the object store's ACLs are just wrong: a state that is immediately reported as an error, but retried later.

Isn't that what TaskStatus.ERROR_RETRY already does?
Hm - seems you're right. ERROR_RETRY seems to wait until the task enters a final state.

However, not sure whether that deserves a third state. We might instead just want a "deadline" that callers can (should) provide. The resulting timeout-exception could provide a reason indicating the current state and error - but that might be a bit too much for now. WDYT?

Member:

To mitigate potentially unmapped/unhandled error states, WDYT about making all current FAILURE states ERROR_RETRY ones but with a somewhat higher retry interval?

Member:

Non-retryable ones: Not found

Actually, this also feels retryable. Sure, in a perfect world "not found" would be a final state.
But it could also be an ephemeral condition, for example when users messed up the object store locations (unlikely, but possible in theory), or when someone manually migrates files.

Member Author (dimas-b):

Yes, that's pretty much where I'm moving (in my local changes ATM). Each caller should time out at some point, and concurrent callers would get the same result within some time window. If another caller touches the same task after a certain time period, a retry will happen regardless of the root cause. We can tune the first and second timeouts separately for each "cause".
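A tiny sketch of the two-timeout idea described here. The class and method names, and the use of a plain String for the cause, are made up for illustration; the real code would presumably key on the error code stored in TaskStatus.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;

// Hypothetical sketch: a failed task keeps reporting its recorded error to callers
// until a per-cause retry interval has elapsed; after that, the next caller that
// touches the task triggers a re-try regardless of the root cause.
final class FailureRetryPolicy {

  enum Decision { REPORT_RECORDED_ERROR, RETRY }

  private final Map<String, Duration> retryAfterByCause;
  private final Duration defaultRetryAfter;

  FailureRetryPolicy(Map<String, Duration> retryAfterByCause, Duration defaultRetryAfter) {
    this.retryAfterByCause = retryAfterByCause;
    this.defaultRetryAfter = defaultRetryAfter;
  }

  Decision onAccess(String cause, Instant failedAt, Instant now) {
    Duration retryAfter = retryAfterByCause.getOrDefault(cause, defaultRetryAfter);
    return now.isAfter(failedAt.plus(retryAfter)) ? Decision.RETRY : Decision.REPORT_RECORDED_ERROR;
  }
}
```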

@dimas-b dimas-b Jun 24, 2024

Re: "not found" : It may be in reality represent a permission error (as in GH), so we cannot really rely on the object being really missing.

@snazy snazy force-pushed the feature/nessie-catalog-server branch 5 times, most recently from 5e176f9 to 49cf7f7 on May 27, 2024 18:55
@snazy snazy deleted the branch projectnessie:main May 28, 2024 16:06
@snazy snazy closed this May 28, 2024

snazy commented May 28, 2024

Oh - seems the deletion of the "merge base" closed this PR :(


dimas-b commented May 28, 2024

No worries. I'll resume on main.

@dimas-b dimas-b force-pushed the exception-propagation branch 4 times, most recently from 19e9645 to 2be307b on June 28, 2024 23:21
@snazy snazy left a comment

Not a full review; it's a lot of changes in this PR. It seems your new approach centers on handling the exception when running in the "import worker", which feels appropriate to me. Just need to keep in mind that we will eventually have to do the same when writing files.

I think the approach can also be implemented using ERROR_RETRY, leaving the FAILURE state as it is (the changes for that are incomplete, BTW).


dimas-b commented Jul 4, 2024

leave the FAILURE state as it is

I'm not sure about that. I think we have to be able to re-attempt after any failure, because the conditions for the failure might change over time. In a way, no failure is a "final" failure. The difference between ERROR_RETRY and FAILURE is only in failing the currently waiting tasks (in the latter case).


snazy commented Jul 5, 2024

leave the FAILURE state as it is

I'm not sure about that. I think we have to be able to re-attempt after any failure, because the conditions for the failure might change over time. In a way, no failure is a "final" failure. The difference between ERROR_RETRY and FAILURE is only in failing the currently waiting tasks (in the latter case).

Well, FAILURE is now well defined as a final state, so I'd rather add another type. For that, it's probably easier to do it in a separate PR and focus only on the tasks stuff.
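For readers not familiar with the states being discussed, here is a rough sketch of the two failure-related task states as they are characterized in this thread; this is not the actual org.projectnessie code.

```java
// Sketch only, based on how the states are described in this conversation.
enum TaskStatus {
  // ... non-failure states elided ...

  // Final failure: callers currently waiting on the task are failed.
  FAILURE,

  // Transient failure: the recorded error is reported, but the task is
  // re-attempted later instead of being treated as final.
  ERROR_RETRY
}
```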

dimas-b added a commit to dimas-b/nessie that referenced this pull request Jul 5, 2024
This may happen when a config class has all its properties
grouped under sub-sections, but we still want to produce
`.md` files from the top-level javadoc.

This is a pre-requisite to projectnessie#8558
@dimas-b dimas-b force-pushed the exception-propagation branch 2 times, most recently from 6fd3824 to 5897903 on July 15, 2024 21:22
@dimas-b dimas-b requested review from adutra and snazy July 15, 2024 21:46
@snazy snazy left a comment

Basically two things left:

  • Moving the config property (not a big one)
  • Compatibility/behavior during rolling upgrades

Otherwise LGTM

try {
  BackendErrorCode errorCode = BackendErrorCode.valueOf(state.errorCode());
  return new PreviousTaskException(errorCode, state.message());
} catch (IllegalArgumentException e) {
  return new PreviousTaskException(UNKNOWN, state.message());
}
Member:

I wonder how far we should go here wrt new BackendErrorCode values added in future releases, and behavior during a rolling upgrade (new version writing NEW_CODE, old version reading it as UNKNOWN and writing it back as UNKNOWN, new version then re-reading it as UNKNOWN as well). Would it make sense to serialize the error code including the HTTP status code, or to make BackendErrorCode a value type?

Member Author (dimas-b):

It's not very apparent in the code, but this method is called only when the Task Service decides it's done trying to execute a task. The UNKNOWN status here will not be stored in Persist, but will only float up to the caller via IcebergErrorMapper... I'll think a bit more about how to make the code clearer, but ATM I'm not sure it warrants any specific action for rolling upgrades.
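To make the flow a bit more concrete, here is a hypothetical sketch of the final mapping direction. IcebergErrorMapper is a real class referenced above, but this is not its actual API; apart from the 403-instead-of-500 behavior mentioned in the PR description, the specific status codes are assumptions.

```java
// Hypothetical sketch only: illustrates why a storage "forbidden" failure can now
// surface as a 403 on the Iceberg REST client instead of a generic 500.
final class ErrorCodeToHttpStatusSketch {

  enum BackendErrorCode { UNKNOWN, THROTTLED, FORBIDDEN, NOT_FOUND }

  static int httpStatusFor(BackendErrorCode code) {
    switch (code) {
      case FORBIDDEN:
        return 403; // previously surfaced as 500 (see PR description)
      case NOT_FOUND:
        return 404; // assumption
      case THROTTLED:
        return 429; // assumption
      default:
        return 500; // UNKNOWN and anything unmapped
    }
  }
}
```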

Member:

I see. All good then.

Member Author (dimas-b):

Approve? ;)

@snazy snazy added this to the 0.93.0 milestone Jul 18, 2024
@dimas-b dimas-b requested a review from snazy July 18, 2024 17:25
* able to determine a specific failure reason. This may be changed in the future when task
* deadlines are supported.
*/
UNKNOWN(false),
Member:

Let's make UNKNOWN retryable - we don't know what happened, so it can be retryable.

Member Author (dimas-b):

I know a real UNKNOWN will end up in an infinite retry.

@dimas-b dimas-b Jul 18, 2024

I'm planning to add task deadlines in a follow-up PR and also fix #8860

@dimas-b dimas-b merged commit c91a96a into projectnessie:main Jul 18, 2024
18 checks passed
@dimas-b dimas-b deleted the exception-propagation branch July 18, 2024 19:31
Successfully merging this pull request may close these issues.

[Catalog] Verify that the Iceberg exception mapping returns the correct status codes and error object contents