Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Archive node sync issue #6356

Closed
1 task done
mrabino1 opened this issue Feb 2, 2024 · 14 comments
Closed
1 task done

Archive node sync issue #6356

mrabino1 opened this issue Feb 2, 2024 · 14 comments
Labels
A-db Related to the database C-bug An unexpected or incorrect behavior M-prevent-stale Prevents old inactive issues/PRs from being closed due to inactivity

Comments

@mrabino1
Copy link

mrabino1 commented Feb 2, 2024

Describe the bug

[](https://www.dropbox.com/s/zbc0n4wk9d6xhhz/2024-02-01_17-27-34.png?dl=0)

this was seen during an initial archive node sync.
running alpha.16 on ubuntu

Steps to reproduce

ubuntu
start reth initial sync for an archive node

Node logs

see picture

Platform(s)

Linux (x86)

What version/commit are you on?

alpha.16

What database version are you on?

alpha.16

What type of node are you running?

Archive (default)

What prune config do you use, if any?

No response

If you've built Reth from source, provide the full command you used

No response

Code of Conduct

  • I agree to follow the Code of Conduct
@mrabino1 mrabino1 added C-bug An unexpected or incorrect behavior S-needs-triage This issue needs to be labelled labels Feb 2, 2024
@mattsse
Copy link
Collaborator

mattsse commented Feb 2, 2024

@shekhirin

image

@shekhirin
Copy link
Collaborator

this looks either a bug with long-lived transaction detection, or a database query that took that long to execute

Copy link
Contributor

This issue is stale because it has been open for 21 days with no activity.

@github-actions github-actions bot added the S-stale This issue/PR is stale and will close with no further activity label Feb 24, 2024
@shekhirin
Copy link
Collaborator

@mrabino1 could you please tell us what hardware you were using for syncing this node?

@mrabino1
Copy link
Author

NUC 10.. with an i7 .. nvme 4tb.. and 32gb of ram..

@github-actions github-actions bot removed the S-stale This issue/PR is stale and will close with no further activity label Feb 29, 2024
@emhane emhane self-assigned this Feb 29, 2024
@emhane emhane added the A-db Related to the database label Feb 29, 2024
@emhane
Copy link
Member

emhane commented Feb 29, 2024

maybe related @shekhirin ? #6850

@shekhirin
Copy link
Collaborator

maybe related @shekhirin ? #6850

likely not, because this warning isn't coming from the TxnManager and timeout-ing the transaction, it's just an informational message that we don't act upon

@emhane
Copy link
Member

emhane commented Feb 29, 2024

thinking if mdbx re-assigning a ptr could have lead to this... @shekhirin

@shekhirin
Copy link
Collaborator

thinking if mdbx re-assigning a ptr could have lead to this... @shekhirin

we issue these warning logs independently from the pointer tracking, and they're based purely on our Rust structs for transactions

@emhane
Copy link
Member

emhane commented Feb 29, 2024

yes but considering that the same txn id 41970 is shown for both crashes in the backtrace I think it's safe to say the same pointer is being used. the scenario I presented, which your pr addresses, has probably happened:
txn_1 has timed out and been aborted
pointer of txn_1 has been reassigned by mdbx to txn_2
user uses txn_1 again

so both txn_1 and txn_2 use same txn pointer.

furthermore I read the code for checking open_duration. on main, committing doesn't check if a transaction is timed out (indirectly that checks open_duration: if timed out open_duration has been exceeded). so the open_duration could be exceeded.

self.execute_with_close_transaction_metric(TransactionOutcome::Commit, |this| {
match this.inner.commit().map_err(|e| DatabaseError::Commit(e.into())) {
Ok((v, latency)) => (Ok(v), Some(latency)),
Err(e) => (Err(e), None),
}
})

/// Executes the given closure once the lock on the transaction is acquired.
#[inline]
pub(crate) fn txn_execute<F, T>(&self, f: F) -> T
where
F: FnOnce(*mut ffi::MDBX_txn) -> T,
{
let _lck = self.lock.lock();
(f)(self.txn)
}
}

however now this is done now in your pr

/// Executes the given closure once the lock on the transaction is acquired.
///
/// Returns the result of the closure or an error if the transaction is timed out.
#[inline]
pub(crate) fn txn_execute_fail_on_timeout<F, T>(&self, f: F) -> Result<T>
where
F: FnOnce(*mut ffi::MDBX_txn) -> T,
{
let _lck = self.lock.lock();
// No race condition with the `TxnManager` timing out the transaction is possible here,
// because we're taking a lock for any actions on the transaction pointer, including a call
// to the `mdbx_txn_reset`.
if self.is_timed_out() {
return Err(Error::ReadTransactionTimeout)
}
Ok((f)(self.txn))
}

@emhane emhane closed this as completed Feb 29, 2024
@shekhirin
Copy link
Collaborator

shekhirin commented Feb 29, 2024

yes but considering that the same txn id 41970 is shown for both crashes in the backtrace I think it's safe to say the same pointer is being used.

this is not true. txn_id in the screenshot is the ID of the transaction, and not the pointer. For read transactions, IDs are reused because they mean the snapshot ID of the database, and are expected to be non-unique: https://erthink.github.io/libmdbx/group__c__statinfo.html#gae13b7a7436c8073013bc5832194c6add

txn_1 has timed out and been aborted

I don't think so:

  1. We don't see a WARN log for transaction being timed out by the TxnManager
    warn!(target: "libmdbx", ?open_duration, "Long-lived read transactions has been aborted");
  2. Transactions on the screenshot are open for 108s and 78s, but our timeout is 5 minutes
    const DEFAULT_MAX_READ_TRANSACTION_DURATION: Duration = Duration::from_secs(5 * 60);

I'm also re-opening this, because we see that these read-only transactions open for more than a minute are coming from the networking part, and the PR #6850 most likely isn't going to change anything there.

@shekhirin shekhirin reopened this Feb 29, 2024
@emhane
Copy link
Member

emhane commented Mar 1, 2024

yes but considering that the same txn id 41970 is shown for both crashes in the backtrace I think it's safe to say the same pointer is being used.

this is not true. txn_id in the screenshot is the ID of the transaction, and not the pointer. For read transactions, IDs are reused because they mean the snapshot ID of the database, and are expected to be non-unique: https://erthink.github.io/libmdbx/group__c__statinfo.html#gae13b7a7436c8073013bc5832194c6add

alright, I followed the code down to here, looks like it's assigned by mdbx not by db

/// Returns the transaction id.
pub fn id(&self) -> Result<u64> {
self.txn_execute(|txn| unsafe { ffi::mdbx_txn_id(txn) })
}

txn_1 has timed out and been aborted

I don't think so:

  1. We don't see a WARN log for transaction being timed out by the TxnManager
    warn!(target: "libmdbx", ?open_duration, "Long-lived read transactions has been aborted");

fair enough

  1. Transactions on the screenshot are open for 108s and 78s, but our timeout is 5 minutes
    const DEFAULT_MAX_READ_TRANSACTION_DURATION: Duration = Duration::from_secs(5 * 60);

but this doesn't need to be a chain of just 2 transactions being reassigned the same pointer, this could be a much longer chain.

I'm also re-opening this, because we see that these read-only transactions open for more than a minute are coming from the networking part, and the PR #6850 most likely isn't going to change anything there.

@emhane emhane removed their assignment Mar 1, 2024
@DaniPopes DaniPopes removed the S-needs-triage This issue needs to be labelled label Mar 6, 2024
Copy link
Contributor

This issue is stale because it has been open for 21 days with no activity.

@github-actions github-actions bot added the S-stale This issue/PR is stale and will close with no further activity label Mar 29, 2024
Copy link
Contributor

github-actions bot commented Apr 5, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 5, 2024
@emhane emhane added M-prevent-stale Prevents old inactive issues/PRs from being closed due to inactivity and removed S-stale This issue/PR is stale and will close with no further activity labels Apr 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-db Related to the database C-bug An unexpected or incorrect behavior M-prevent-stale Prevents old inactive issues/PRs from being closed due to inactivity
Projects
Archived in project
Development

No branches or pull requests

5 participants