Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

faster/more efficient testing in CI #3763

Closed
wileyj opened this issue Jun 22, 2023 · 2 comments
Closed

faster/more efficient testing in CI #3763

wileyj opened this issue Jun 22, 2023 · 2 comments
Assignees
Labels
chore Necessary but less impactful tasks such as cleanup or reorg CI

Comments

@wileyj
Copy link
Contributor

wileyj commented Jun 22, 2023

Goal: reduce the time it takes to run all required tests when CI runs.

currently, there are 2 several jobs that are either repeating build steps, or loading a large amount of data that is affecting how quickly a test can run.
further, there seems to be some improvements that can be made by using nextest (https://github.com/nextest-rs/nextest) vs cargo test.

I'm looking into a few approaches here, firstly using default github action workflows with modified testing jobs:

  1. cache binaries and other data as much as possible per run vs building extremely large docker images that need to be copied to a new job (currently a workflow is copying an 8GB image > 60 times).
  2. archive (and cache) binaries that can be used with nextests's --archive flag
  3. keep the structure of the job matrix as-is, but run the tests using the cached data and cached nextest archive.

in a fork, for the most part, the above changes cut the test time nearly in half - but i'm finding it to be a bit inconsistent due to the way github actions provisions runners (on occasion, i've had a job wait for 30 minutes for some cpu time on a simple task such as retrieving cache data, which on average takes about 10s).

a second approach for the above i'm looking at is removing the current test matrix, and creating a single job with a bash list/for loop to process tests. this will get around the issue of waiting for cpu time, but the drawbacks may outweigh the benefit - namely, if there are flaky tests, or any issue out of our control, the entire workflow will fail vs maybe a single test or two.
nevertheless, i'm trying it to see if it's a possible solution.

finally, there are ways to get more reliable CPU for the actions, i.e. using a service like buildjet (this will also be checked as an option).

I am also currently having an issue starting bitcoind in a few of the tests that require it, but overall most of the tests are completing rather quickly. i don't consider it a blocker; i'll figure it out.

@obycode @pavitthrap @kantai i would appreciate you all looking at the tests we're currently running and advising if we still need them for CI (looking at the naming alone, i'm curious about the tests for older epochs, like tests::epoch_205::transition_empty_blocks

@wileyj wileyj added chore Necessary but less impactful tasks such as cleanup or reorg CI labels Jun 22, 2023
@wileyj wileyj self-assigned this Jun 22, 2023
@kantai
Copy link
Member

kantai commented Jun 22, 2023

in a fork, for the most part, the above changes cut the test time nearly in half - but i'm finding it to be a bit inconsistent due to the way github actions provisions runners (on occasion, i've had a job wait for 30 minutes for some cpu time on a simple task such as retrieving cache data, which on average takes about 10s).

In my opinion, this seems like a pretty big win already.

a second approach for the above i'm looking at is removing the current test matrix, and creating a single job with a bash list/for loop to process tests. this will get around the issue of waiting for cpu time, but the drawbacks may outweigh the benefit - namely, if there are flaky tests, or any issue out of our control, the entire workflow will fail vs maybe a single test or two.

I worry that this would reduce the potential concurrency. Failed tests could also step on other tests by failing to close bitcoind, reap zombie processes, etc.

@obycode @pavitthrap @kantai i would appreciate you all looking at the tests we're currently running and advising if we still need them for CI (looking at the naming alone, i'm curious about the tests for older epochs, like tests::epoch_205::transition_empty_blocks

My preference is to leave any non-flaky tests in place. These kinds of tests are important to detect potential regressions. It may be that the run frequency for these tests could be reduced, i.e., regression tests only run before merging, not on every PR open + update, but that may require a bunch of work to figure out how to implement something like that.

@wileyj
Copy link
Contributor Author

wileyj commented Jun 23, 2023

@kantai i did attempt the second approach merely as an idea - i had no confidence it would be useful, and my test simply proved that. it took far longer looping through the list of tests in a run step (i.e. a for loop) so i can safely say this won't be an approach i'll be pursuing.

there are a few other issues i'm dealing with regarding the tests, namely bitcoind isn't starting properly on the github runner (not sure why, needs more investigation) so a single test is reliably failing still. Otherwise, i see good test times for the other 63 tests - but one thing I don't think can be solved with the current platform is this:
on occasion, a test that usually completes quickly can sometimes take > 30 mins due to factors out of our control. because github actions is a bit opaque, i'm guessing that we simply don't have access to any CPU to run the job.
for example, i've seem a step (download a cache) take > 30 minutes. re-running the test completes in ~2 minutes.

so, another aspect i'm looking into here is how to make these workflows consistent and reliable (alongside how to run them faster). buildjet has been mentioned, which i'll be looking into (once the bitcoind issue is resolved above).
i'll also be looking around to see what other options are available as well.

you also raise an interesting point about re-running tests on PR updates etc. as part of this work, i think can mock up something to simulate how it could work, and we can work out the details later on.

@wileyj wileyj linked a pull request Aug 15, 2023 that will close this issue
@wileyj wileyj closed this as completed Dec 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
chore Necessary but less impactful tasks such as cleanup or reorg CI
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants