-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add exception mappers to convert storage failures to Iceberg REST client exceptions #8558
Conversation
f49fd64
to
3c14611
Compare
|
||
import org.projectnessie.storage.uri.StorageUri; | ||
|
||
public class StorageFailureException extends NonRetryableException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the "hard"/unconditional association with "non-retryable" is not good.
We need a way to distinguish "hard/final failures" from "retryable" ones, so that org.projectnessie.catalog.service.impl.Util#throwableAsErrorTaskState
(via e.g. org.projectnessie.catalog.service.impl.EntitySnapshotTaskBehavior#asErrorTaskState
) can return the right state (retryable vs final).
Retryable examples:
- Throttled requests
- Technical network errors ("no route to host", "read timeout")
- Other unknown (IO)Exceptions
- Forbidden (users may "fix" credentials or ACLs)
Non-retryable ones:
- Not found
We might eventually need a third TaskState - to handle "forbidden"/"access denied" better, to handle the case when the object store's ACL's just wrong: a state that's immediately reported as an error, but retried later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might eventually need a third TaskState - to handle "forbidden"/"access denied" better, to handle the case when the object store's ACL's just wrong: a state that's immediately reported as an error, but retried later.
Isn't that what TaskStatus.ERROR_RETRY
already does?
Hm - seems you're right. ERROR_RETRY seems to wait until the task enters a final state.
However, not sure whether that deserves a third state. We might instead just want a "deadline" that callers can (should) provide. The resulting timeout-exception could provide a reason indicating the current state and error - but that might be a bit too much for now. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To mitigate potentially unmapped/unhandled error states, WDYT about making all current FAILURE
states ERROR_RETRY
ones but with a somewhat higher retry interval?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-retryable ones: Not found
Actually, this feels also retryable. Sure, in a perfect world, "not found" should be a "final" state.
But it could also be an ephemeral case, for example when users messed up the object store locations - in theory though, but possible. Or if someone manually migrates files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's pretty much where I'm moving (in my local ATM). Each caller should timeout at some point and concurrent caller would get the same result within some time window. If another caller, touches the same task after a certain time period, a re-try will happen regardless of the root cause. We can tune with the first and second timeout separately for each "cause".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re: "not found" : It may be in reality represent a permission error (as in GH), so we cannot really rely on the object being really missing.
5e176f9
to
49cf7f7
Compare
Oh - seems the deletion of the "merge base" closed this PR :( |
No worries. I'll resume on main. |
3c14611
to
b51d1b8
Compare
c37e4ee
to
afa53b9
Compare
catalog/files/api/src/main/java/org/projectnessie/catalog/files/api/ObjectIOOutputStream.java
Outdated
Show resolved
Hide resolved
catalog/files/api/src/main/java/org/projectnessie/catalog/files/api/ObjectIOInputStream.java
Outdated
Show resolved
Hide resolved
...og/service/rest/src/main/java/org/projectnessie/catalog/service/rest/IcebergErrorMapper.java
Outdated
Show resolved
Hide resolved
...s-server/src/test/java/org/projectnessie/server/catalog/AbstractIcebergCatalogUnitTests.java
Show resolved
Hide resolved
19e9645
to
2be307b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a full review. It's a lot of changes in this PR. It seems, your new approach is around handling the exception when running in the "import worker", feels appropriate to me. Just need to keep in mind that we have to eventually do the same when writing files.
I think, the approach can also be implemented w/ using the ERROR_RETRY
and leave the FAILURE
state as it is (the changes for that are incomplete BTW).
...og/service/impl/src/main/java/org/projectnessie/catalog/service/impl/CatalogServiceImpl.java
Outdated
Show resolved
Hide resolved
tasks/api/src/main/java/org/projectnessie/nessie/tasks/api/TaskState.java
Outdated
Show resolved
Hide resolved
...service/impl/src/main/java/org/projectnessie/nessie/tasks/service/impl/TasksServiceImpl.java
Outdated
Show resolved
Hide resolved
tasks/api/src/main/java/org/projectnessie/nessie/tasks/api/TaskState.java
Outdated
Show resolved
Hide resolved
tasks/api/src/main/java/org/projectnessie/nessie/tasks/api/TaskState.java
Outdated
Show resolved
Hide resolved
...log/files/api/src/main/java/org/projectnessie/catalog/files/api/ObjectIOExceptionMapper.java
Outdated
Show resolved
Hide resolved
...log/files/api/src/main/java/org/projectnessie/catalog/files/api/ObjectIOExceptionMapper.java
Outdated
Show resolved
Hide resolved
...log/files/api/src/main/java/org/projectnessie/catalog/files/api/ObjectIOExceptionMapper.java
Outdated
Show resolved
Hide resolved
...log/files/api/src/main/java/org/projectnessie/catalog/files/api/ObjectIOExceptionMapper.java
Outdated
Show resolved
Hide resolved
catalog/files/impl/src/main/java/org/projectnessie/catalog/files/s3/S3ExceptionMapper.java
Show resolved
Hide resolved
I'm not sure about that. I think we have to be able to re-attempt after any failure because the conditions for the failure might change over time. In a way, no failure is a "final" failure. The different between |
Well, |
This may happen when a config class has all its properties grouped under sub-sections, but we still want to produce `.md` files from the top-level javadoc. This is a pre-requisite to projectnessie#8558
40914e0
to
53b16b7
Compare
6fd3824
to
5897903
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically two things left:
- Moving the config property (not a big one)
- Compatibility/behavior during rolling upgrades
Otherwise LGTM
...ce/impl/src/main/java/org/projectnessie/catalog/service/impl/EntitySnapshotTaskBehavior.java
Outdated
Show resolved
Hide resolved
...ce/impl/src/main/java/org/projectnessie/catalog/service/impl/EntitySnapshotTaskBehavior.java
Outdated
Show resolved
Hide resolved
catalog/files/api/src/main/java/org/projectnessie/catalog/files/api/BackendErrorCode.java
Outdated
Show resolved
Hide resolved
BackendErrorCode errorCode = BackendErrorCode.valueOf(state.errorCode()); | ||
return new PreviousTaskException(errorCode, state.message()); | ||
} catch (IllegalArgumentException e) { | ||
return new PreviousTaskException(UNKNOWN, state.message()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how far we should go here wrt to new BackendErrorCode
values added in future releases, and behavior during a rolling-upgrade (new version writing NEW_CODE
but old version reading it as UNKNOWN
, writing it as UNKNOWN
and new version re-reading it as UNKNOWN
as well). Would it make sense to serialize the error code including the HTTP status code or to make BackendErrorCode
a value type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not very apparent in code, but this method is called only when the Task Service decides it's done with trying to execute a task. The UNKNOWN
status here will not be stored in Persist
but will only float up to the caller via IcebergErrorMapper
... I'll think a bit more how to make the code clearer, but ATM I'm not sure it warrants any specific action for rolling upgrades.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. All good then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approve? ;)
5897903
to
2b4e476
Compare
…ent exceptions * Note: before this change a storage 403 error would manifest as 500 on the REST client side. * Remove `ObjectIOException`. Use `BackendErrorStatus` instead. * Add exception mappers to convert storage failures to `BackendErrorStatus` and to Iceberg REST client exceptions. * Store error codes in TaskStatus and convert them to Iceberg REST status code when failed tasks are accessed again. * Add AccessCheckHandler to ObjectStorageMock for simulating access failures in tests. Closes projectnessie#8738
e00eac6
to
a910364
Compare
* able to determine a specific failure reason. This may be changed in the future when task | ||
* deadlines are supported. | ||
*/ | ||
UNKNOWN(false), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make UNKNOWN
retryable - we don't know what happened, so it can be retryable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know real UNKNOWN
will end up in infinite retry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm planning to add task deadlines in a follow-up PR and also fix #8860
Note: before this change a storage 403 error would manifest as 500 on the REST client side.
Remove
ObjectIOException
. UseBackendErrorStatus
instead.Add exception mappers to convert storage failures to
BackendErrorStatus
and to Iceberg REST client exceptions.
Store error codes in TaskStatus and convert them to Iceberg REST
status code when failed tasks are accessed again.
Add AccessCheckHandler to ObjectStorageMock for simulating access
failures in tests.
Closes #8738