
chore: bump actions/checkout from 3 to 4 #3

Merged
merged 1 commit into main from dependabot/github_actions/actions/checkout-4 on Sep 10, 2023

Conversation

dependabot[bot]
Contributor

@dependabot dependabot bot commented on behalf of github Sep 10, 2023

Bumps actions/checkout from 3 to 4.

Release notes

Sourced from actions/checkout's releases.

v4.0.0

What's Changed

New Contributors

Full Changelog: actions/checkout@v3...v4.0.0

v3.6.0

What's Changed

New Contributors

Full Changelog: actions/checkout@v3.5.3...v3.6.0

v3.5.3

What's Changed

New Contributors

Full Changelog: actions/checkout@v3...v3.5.3

v3.5.2

What's Changed

Full Changelog: actions/checkout@v3.5.1...v3.5.2

v3.5.1

What's Changed

New Contributors

... (truncated)

Changelog

Sourced from actions/checkout's changelog.

Changelog

v4.0.0

v3.6.0

v3.5.3

v3.5.2

v3.5.1

v3.5.0

v3.4.0

v3.3.0

v3.2.0

v3.1.0

v3.0.2

v3.0.1

... (truncated)

Commits

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot merge will merge this PR after your CI passes on it
  • @dependabot squash and merge will squash and merge this PR after your CI passes on it
  • @dependabot cancel merge will cancel a previously requested merge and block automerging
  • @dependabot reopen will reopen this PR if it is closed
  • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [actions/checkout](https://github.com/actions/checkout) from 3 to 4.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v3...v4)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
@dependabot dependabot bot added the dependencies and github_actions (Pull requests that update GitHub Actions code) labels on Sep 10, 2023
@helayoty helayoty merged commit 5c17266 into main Sep 10, 2023
4 checks passed
@helayoty helayoty deleted the dependabot/github_actions/actions/checkout-4 branch September 10, 2023 02:10
Fei-Guo added a commit that referenced this pull request Nov 6, 2023
torch.init_process_group includes a default 30-minute timeout.

While the worker is listening for instructions, the following error is thrown after thirty idle minutes:
```
[1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:605 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xae (0x7f3ff9fb295e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x7d (0x7f3ff9f6b7cd in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0xf8 (0x7f3fc834c858 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x3a (0x7f3fc834d4ca in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x84 (0x7f3fc834d594 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x43 (0x7f3fc82fe063 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x43 (0x7f3fc82fe063 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x43 (0x7f3fc82fe063 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) + 0x1fc (0x7f3f8c6e443c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x530 (0x7f3f8c6e7c10 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #10: c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) + 0x4c2 (0x7f3f8c6f5922 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x4c1a310 (0x7f3fc82eb310 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x4c23bfb (0x7f3fc82f4bfb in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x4c43ab3 (0x7f3fc8314ab3 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0xb8c97a (0x7f3fceaa197a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0x39bb46 (0x7f3fce2b0b46 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0x15cc9e (0x5572975c2c9e in /usr/bin/python)
frame #17: _PyObject_MakeTpCall + 0x25b (0x5572975b972b in /usr/bin/python)
frame #18: <unknown function> + 0x16b1eb (0x5572975d11eb in /usr/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x640a (0x5572975b175a in /usr/bin/python)
frame #20: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #21: PyObject_Call + 0x122 (0x5572975d1bc2 in /usr/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x2a37 (0x5572975add87 in /usr/bin/python)
frame #23: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x1a1b (0x5572975acd6b in /usr/bin/python)
frame #25: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #26: PyObject_Call + 0x122 (0x5572975d1bc2 in /usr/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x2a37 (0x5572975add87 in /usr/bin/python)
frame #28: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #29: _PyEval_EvalFrameDefault + 0x1a1b (0x5572975acd6b in /usr/bin/python)
frame #30: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #31: _PyEval_EvalFrameDefault + 0x6cd (0x5572975aba1d in /usr/bin/python)
frame #32: <unknown function> + 0x142176 (0x5572975a8176 in /usr/bin/python)
frame #33: PyEval_EvalCode + 0x86 (0x55729769dc56 in /usr/bin/python)
frame #34: <unknown function> + 0x264b18 (0x5572976cab18 in /usr/bin/python)
frame #35: <unknown function> + 0x25d96b (0x5572976c396b in /usr/bin/python)
frame #36: <unknown function> + 0x264865 (0x5572976ca865 in /usr/bin/python)
frame #37: _PyRun_SimpleFileObject + 0x1a8 (0x5572976c9d48 in /usr/bin/python)
frame #38: _PyRun_AnyFileObject + 0x43 (0x5572976c9a43 in /usr/bin/python)
frame #39: Py_RunMain + 0x2be (0x5572976bac3e in /usr/bin/python)
frame #40: Py_BytesMain + 0x2d (0x557297690bcd in /usr/bin/python)
frame #41: <unknown function> + 0x29d90 (0x7f4018b72d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #42: __libc_start_main + 0x80 (0x7f4018b72e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #43: _start + 0x25 (0x557297690ac5 in /usr/bin/python)
. This may indicate a possible application crash on rank 0 or a network set up issue.

```
After this error, any attempt by the worker to reconnect and establish a
new session with the master fails, even after pod restarts. Any attempt to
establish a new connection from the worker is met with this error:
```
root@llama-2-13b-chat-pod-1:/workspace/llama/llama-2# torchrun --nnodes 2 --nproc_per_node 1 --rdzv_endpoint 10.224.0.181:29500 --master_port 29500 --rdzv_backend c10d inference-api.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+4136153', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 736, in run
    result = self._invoke_run(role)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 871, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 705, in _initialize_workers
    self._rendezvous(worker_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 546, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1028, in next_rendezvous
    self._op_executor.run(join_op, deadline)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 635, in run
    raise RendezvousClosedError()
torch.distributed.elastic.rendezvous.api.RendezvousClosedError
```

Cleaning up the process group and reinitializing on the worker side does
not resolve the issue either. The state between the worker and the master
is inconsistent, and recovering requires a master restart.

The best solution here is to increase the timeout so the idle connection is
not closed. I have included a Dockerfile fix here.
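
As a rough illustration of the approach (not necessarily the exact change made in the Dockerfile fix), the timeout can be raised when the process group is initialized. The NCCL backend and the four-hour value below are assumptions for the sketch, and it assumes the script is launched with torchrun so the rendezvous environment variables are already set:

```python
# Sketch only: raise the default 30-minute process-group timeout so an idle
# worker is not disconnected. The 4-hour value and NCCL backend are assumptions.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",               # matches the NCCL setup in the logs above
    timeout=timedelta(hours=4),   # default is timedelta(minutes=30)
)
```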

---------

Co-authored-by: Fei Guo <[email protected]>