
chore: bump actions/checkout from 3 to 4 #3

Merged
merged 1 commit into main from dependabot/github_actions/actions/checkout-4 on Sep 10, 2023

Conversation

dependabot[bot]
Contributor

@dependabot dependabot bot commented on behalf of github Sep 10, 2023

Bumps actions/checkout from 3 to 4.

Release notes

Sourced from actions/checkout's releases.

v4.0.0

What's Changed

New Contributors

Full Changelog: actions/checkout@v3...v4.0.0

v3.6.0

What's Changed

New Contributors

Full Changelog: actions/checkout@v3.5.3...v3.6.0

v3.5.3

What's Changed

New Contributors

Full Changelog: actions/checkout@v3...v3.5.3

v3.5.2

What's Changed

Full Changelog: actions/checkout@v3.5.1...v3.5.2

v3.5.1

What's Changed

New Contributors

... (truncated)

Changelog

Sourced from actions/checkout's changelog.

Changelog

v4.0.0

v3.6.0

v3.5.3

v3.5.2

v3.5.1

v3.5.0

v3.4.0

v3.3.0

v3.2.0

v3.1.0

v3.0.2

v3.0.1

... (truncated)

Commits

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot merge will merge this PR after your CI passes on it
  • @dependabot squash and merge will squash and merge this PR after your CI passes on it
  • @dependabot cancel merge will cancel a previously requested merge and block automerging
  • @dependabot reopen will reopen this PR if it is closed
  • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [actions/checkout](https://github.com/actions/checkout) from 3 to 4.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v3...v4)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
@dependabot dependabot bot added the dependencies and github_actions (Pull requests that update GitHub Actions code) labels on Sep 10, 2023
@helayoty helayoty merged commit 5c17266 into main Sep 10, 2023
4 checks passed
@helayoty helayoty deleted the dependabot/github_actions/actions/checkout-4 branch September 10, 2023 02:10
Fei-Guo added a commit that referenced this pull request Nov 6, 2023
torch.init_process_group includes a default 30-minute timeout.

While the worker is listening for instructions, the following error is thrown after thirty idle minutes:
```
[1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:605 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xae (0x7f3ff9fb295e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x7d (0x7f3ff9f6b7cd in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0xf8 (0x7f3fc834c858 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x3a (0x7f3fc834d4ca in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x84 (0x7f3fc834d594 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x43 (0x7f3fc82fe063 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x43 (0x7f3fc82fe063 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x43 (0x7f3fc82fe063 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) + 0x1fc (0x7f3f8c6e443c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x530 (0x7f3f8c6e7c10 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #10: c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) + 0x4c2 (0x7f3f8c6f5922 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x4c1a310 (0x7f3fc82eb310 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x4c23bfb (0x7f3fc82f4bfb in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x4c43ab3 (0x7f3fc8314ab3 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0xb8c97a (0x7f3fceaa197a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0x39bb46 (0x7f3fce2b0b46 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0x15cc9e (0x5572975c2c9e in /usr/bin/python)
frame #17: _PyObject_MakeTpCall + 0x25b (0x5572975b972b in /usr/bin/python)
frame #18: <unknown function> + 0x16b1eb (0x5572975d11eb in /usr/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x640a (0x5572975b175a in /usr/bin/python)
frame #20: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #21: PyObject_Call + 0x122 (0x5572975d1bc2 in /usr/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x2a37 (0x5572975add87 in /usr/bin/python)
frame #23: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x1a1b (0x5572975acd6b in /usr/bin/python)
frame #25: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #26: PyObject_Call + 0x122 (0x5572975d1bc2 in /usr/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x2a37 (0x5572975add87 in /usr/bin/python)
frame #28: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #29: _PyEval_EvalFrameDefault + 0x1a1b (0x5572975acd6b in /usr/bin/python)
frame #30: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #31: _PyEval_EvalFrameDefault + 0x6cd (0x5572975aba1d in /usr/bin/python)
frame #32: <unknown function> + 0x142176 (0x5572975a8176 in /usr/bin/python)
frame #33: PyEval_EvalCode + 0x86 (0x55729769dc56 in /usr/bin/python)
frame #34: <unknown function> + 0x264b18 (0x5572976cab18 in /usr/bin/python)
frame #35: <unknown function> + 0x25d96b (0x5572976c396b in /usr/bin/python)
frame #36: <unknown function> + 0x264865 (0x5572976ca865 in /usr/bin/python)
frame #37: _PyRun_SimpleFileObject + 0x1a8 (0x5572976c9d48 in /usr/bin/python)
frame #38: _PyRun_AnyFileObject + 0x43 (0x5572976c9a43 in /usr/bin/python)
frame #39: Py_RunMain + 0x2be (0x5572976bac3e in /usr/bin/python)
frame #40: Py_BytesMain + 0x2d (0x557297690bcd in /usr/bin/python)
frame #41: <unknown function> + 0x29d90 (0x7f4018b72d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #42: __libc_start_main + 0x80 (0x7f4018b72e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #43: _start + 0x25 (0x557297690ac5 in /usr/bin/python)
. This may indicate a possible application crash on rank 0 or a network set up issue.

```
After this error, any attempt by the worker to reconnect and establish a
new session with the master fails, even after pod restarts. Any attempt to
establish a new connection from the worker is met with this error:
```
root@llama-2-13b-chat-pod-1:/workspace/llama/llama-2# torchrun --nnodes 2 --nproc_per_node 1 --rdzv_endpoint 10.224.0.181:29500 --master_port 29500 --rdzv_backend c10d inference-api.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+4136153', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 736, in run
    result = self._invoke_run(role)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 871, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 705, in _initialize_workers
    self._rendezvous(worker_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 546, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1028, in next_rendezvous
    self._op_executor.run(join_op, deadline)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 635, in run
    raise RendezvousClosedError()
torch.distributed.elastic.rendezvous.api.RendezvousClosedError
```

Cleaning up the process group and reinitializing on the worker side does
not resolve the issue either. The state between the worker and the master
is inconsistent, and recovering requires a master restart.

The best solution here is to increase the timeout so the idle connection is
not closed. I have included a Dockerfile fix here.
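
As a rough illustration of the approach (not necessarily the exact change made in the Dockerfile fix), the timeout can be raised when the process group is initialized. The NCCL backend and the four-hour value below are assumptions for the sketch, and it assumes the script is launched with torchrun so the rendezvous environment variables are already set:

```python
# Sketch only: raise the default 30-minute process-group timeout so an idle
# worker is not disconnected. The 4-hour value and NCCL backend are assumptions.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",               # matches the NCCL setup in the logs above
    timeout=timedelta(hours=4),   # default is timedelta(minutes=30)
)
```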

---------

Co-authored-by: Fei Guo <[email protected]>