Archive node sync issue #6356

mrabino1 · 2024-02-02T17:21:03Z

Describe the bug

[](https://www.dropbox.com/s/zbc0n4wk9d6xhhz/2024-02-01_17-27-34.png?dl=0)

this was seen during an initial archive node sync.
running alpha.16 on ubuntu

Steps to reproduce

ubuntu
start reth initial sync for an archive node

Node logs

see picture

Platform(s)

Linux (x86)

What version/commit are you on?

alpha.16

What database version are you on?

alpha.16

What type of node are you running?

Archive (default)

What prune config do you use, if any?

No response

If you've built Reth from source, provide the full command you used

No response

Code of Conduct

I agree to follow the Code of Conduct

The text was updated successfully, but these errors were encountered:

mattsse · 2024-02-02T18:02:08Z

@shekhirin

shekhirin · 2024-02-02T18:25:42Z

this looks either a bug with long-lived transaction detection, or a database query that took that long to execute

github-actions · 2024-02-24T01:46:29Z

This issue is stale because it has been open for 21 days with no activity.

shekhirin · 2024-02-28T10:11:49Z

@mrabino1 could you please tell us what hardware you were using for syncing this node?

mrabino1 · 2024-02-28T10:32:17Z

NUC 10.. with an i7 .. nvme 4tb.. and 32gb of ram..

emhane · 2024-02-29T03:57:56Z

maybe related @shekhirin ? #6850

shekhirin · 2024-02-29T11:41:57Z

maybe related @shekhirin ? #6850

likely not, because this warning isn't coming from the TxnManager and timeout-ing the transaction, it's just an informational message that we don't act upon

emhane · 2024-02-29T15:27:27Z

thinking if mdbx re-assigning a ptr could have lead to this... @shekhirin

shekhirin · 2024-02-29T15:36:43Z

thinking if mdbx re-assigning a ptr could have lead to this... @shekhirin

we issue these warning logs independently from the pointer tracking, and they're based purely on our Rust structs for transactions

emhane · 2024-02-29T20:39:18Z

yes but considering that the same txn id 41970 is shown for both crashes in the backtrace I think it's safe to say the same pointer is being used. the scenario I presented, which your pr addresses, has probably happened:
txn_1 has timed out and been aborted
pointer of txn_1 has been reassigned by mdbx to txn_2
user uses txn_1 again

so both txn_1 and txn_2 use same txn pointer.

furthermore I read the code for checking open_duration. on main, committing doesn't check if a transaction is timed out (indirectly that checks open_duration: if timed out open_duration has been exceeded). so the open_duration could be exceeded.

reth/crates/storage/db/src/implementation/mdbx/tx.rs

Lines 290 to 295 in c5955f1

    
           self.execute_with_close_transaction_metric(TransactionOutcome::Commit, |this| { 
        
               match this.inner.commit().map_err(|e| DatabaseError::Commit(e.into())) { 
        
                   Ok((v, latency)) => (Ok(v), Some(latency)), 
        
                   Err(e) => (Err(e), None), 
        
               } 
        
           })

reth/crates/storage/libmdbx-rs/src/transaction.rs

Lines 561 to 570 in c5955f1

    
               /// Executes the given closure once the lock on the transaction is acquired. 
        
               #[inline] 
        
               pub(crate) fn txn_execute<F, T>(&self, f: F) -> T 
        
               where 
        
                   F: FnOnce(*mut ffi::MDBX_txn) -> T, 
        
               { 
        
                   let _lck = self.lock.lock(); 
        
                   (f)(self.txn) 
        
               } 
        
           }

however now this is done now in your pr

reth/crates/storage/libmdbx-rs/src/transaction.rs

Lines 542 to 560 in 130f8b7

    
           /// Executes the given closure once the lock on the transaction is acquired. 
        
           /// 
        
           /// Returns the result of the closure or an error if the transaction is timed out. 
        
           #[inline] 
        
           pub(crate) fn txn_execute_fail_on_timeout<F, T>(&self, f: F) -> Result<T> 
        
           where 
        
               F: FnOnce(*mut ffi::MDBX_txn) -> T, 
        
           { 
        
               let _lck = self.lock.lock(); 
        
               // No race condition with the `TxnManager` timing out the transaction is possible here, 
        
               // because we're taking a lock for any actions on the transaction pointer, including a call 
        
               // to the `mdbx_txn_reset`. 
        
               if self.is_timed_out() { 
        
                   return Err(Error::ReadTransactionTimeout) 
        
               } 
        
               Ok((f)(self.txn)) 
        
           }

shekhirin · 2024-02-29T23:13:51Z

yes but considering that the same txn id 41970 is shown for both crashes in the backtrace I think it's safe to say the same pointer is being used.

this is not true. txn_id in the screenshot is the ID of the transaction, and not the pointer. For read transactions, IDs are reused because they mean the snapshot ID of the database, and are expected to be non-unique: https://erthink.github.io/libmdbx/group__c__statinfo.html#gae13b7a7436c8073013bc5832194c6add

txn_1 has timed out and been aborted

I don't think so:

We don't see a WARN log for transaction being timed out by the TxnManager

reth/crates/storage/libmdbx-rs/src/txn_manager.rs

Line 276 in d8a9b58

warn!(target: "libmdbx", ?open_duration, "Long-lived read transactions has been aborted");
Transactions on the screenshot are open for 108s and 78s, but our timeout is 5 minutes

reth/crates/storage/libmdbx-rs/src/environment.rs

Line 26 in d8a9b58

const DEFAULT_MAX_READ_TRANSACTION_DURATION: Duration = Duration::from_secs(5 * 60);

I'm also re-opening this, because we see that these read-only transactions open for more than a minute are coming from the networking part, and the PR #6850 most likely isn't going to change anything there.

emhane · 2024-03-01T00:02:22Z

yes but considering that the same txn id 41970 is shown for both crashes in the backtrace I think it's safe to say the same pointer is being used.

this is not true. txn_id in the screenshot is the ID of the transaction, and not the pointer. For read transactions, IDs are reused because they mean the snapshot ID of the database, and are expected to be non-unique: https://erthink.github.io/libmdbx/group__c__statinfo.html#gae13b7a7436c8073013bc5832194c6add

alright, I followed the code down to here, looks like it's assigned by mdbx not by db

reth/crates/storage/libmdbx-rs/src/transaction.rs

Lines 128 to 131 in 2df0d91

    
           /// Returns the transaction id. 
        
           pub fn id(&self) -> Result<u64> { 
        
               self.txn_execute(|txn| unsafe { ffi::mdbx_txn_id(txn) }) 
        
           }

txn_1 has timed out and been aborted

I don't think so:

We don't see a WARN log for transaction being timed out by the TxnManager

reth/crates/storage/libmdbx-rs/src/txn_manager.rs

Line 276 in d8a9b58

warn!(target: "libmdbx", ?open_duration, "Long-lived read transactions has been aborted");

fair enough

Transactions on the screenshot are open for 108s and 78s, but our timeout is 5 minutes

reth/crates/storage/libmdbx-rs/src/environment.rs

Line 26 in d8a9b58

const DEFAULT_MAX_READ_TRANSACTION_DURATION: Duration = Duration::from_secs(5 * 60);

but this doesn't need to be a chain of just 2 transactions being reassigned the same pointer, this could be a much longer chain.

I'm also re-opening this, because we see that these read-only transactions open for more than a minute are coming from the networking part, and the PR #6850 most likely isn't going to change anything there.

github-actions · 2024-03-29T01:47:51Z

This issue is stale because it has been open for 21 days with no activity.

github-actions · 2024-04-05T01:48:19Z

This issue was closed because it has been inactive for 7 days since being marked as stale.

mrabino1 added C-bug An unexpected or incorrect behavior S-needs-triage This issue needs to be labelled labels Feb 2, 2024

DaniPopes mentioned this issue Feb 5, 2024

chore: log long tx backtrace on a new line #6419

Merged

github-actions bot added the S-stale This issue/PR is stale and will close with no further activity label Feb 24, 2024

github-actions bot removed the S-stale This issue/PR is stale and will close with no further activity label Feb 29, 2024

emhane self-assigned this Feb 29, 2024

emhane added the A-db Related to the database label Feb 29, 2024

emhane closed this as completed Feb 29, 2024

shekhirin reopened this Feb 29, 2024

emhane removed their assignment Mar 1, 2024

DaniPopes removed the S-needs-triage This issue needs to be labelled label Mar 6, 2024

github-actions bot added the S-stale This issue/PR is stale and will close with no further activity label Mar 29, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 5, 2024

emhane added M-prevent-stale Prevents old inactive issues/PRs from being closed due to inactivity and removed S-stale This issue/PR is stale and will close with no further activity labels Apr 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Archive node sync issue #6356

Archive node sync issue #6356

mrabino1 commented Feb 2, 2024

mattsse commented Feb 2, 2024

shekhirin commented Feb 2, 2024

github-actions bot commented Feb 24, 2024

shekhirin commented Feb 28, 2024

mrabino1 commented Feb 28, 2024

emhane commented Feb 29, 2024

shekhirin commented Feb 29, 2024

emhane commented Feb 29, 2024

shekhirin commented Feb 29, 2024

emhane commented Feb 29, 2024

shekhirin commented Feb 29, 2024 •

edited

Loading

emhane commented Mar 1, 2024

github-actions bot commented Mar 29, 2024

github-actions bot commented Apr 5, 2024

Archive node sync issue #6356

Archive node sync issue #6356

Comments

mrabino1 commented Feb 2, 2024

Describe the bug

Steps to reproduce

Node logs

Platform(s)

What version/commit are you on?

What database version are you on?

What type of node are you running?

What prune config do you use, if any?

If you've built Reth from source, provide the full command you used

Code of Conduct

mattsse commented Feb 2, 2024

shekhirin commented Feb 2, 2024

github-actions bot commented Feb 24, 2024

shekhirin commented Feb 28, 2024

mrabino1 commented Feb 28, 2024

emhane commented Feb 29, 2024

shekhirin commented Feb 29, 2024

emhane commented Feb 29, 2024

shekhirin commented Feb 29, 2024

emhane commented Feb 29, 2024

shekhirin commented Feb 29, 2024 • edited Loading

emhane commented Mar 1, 2024

github-actions bot commented Mar 29, 2024

github-actions bot commented Apr 5, 2024

shekhirin commented Feb 29, 2024 •

edited

Loading