fix: inference fault tolerance #108

ishaansehgal99 · 2023-10-26T02:17:39Z

This PR introduces several changes:

- Minor fixes to the e2e preset build pipeline
- Update the PyTorch/CUDA/NCCL image version from 23.06 to 23.10
- Update the Dockerfile paths (/home/llama -> /llama and /home/falcon -> /falcon) so they no longer conflict with the host volume mounted at /home
- Add torch rdzv (rendezvous) parameters and a headless service for inference fault tolerance
- Improve error handling, resiliency, fault tolerance, and timeouts in the inference code
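For readers unfamiliar with how a headless service and torch rendezvous fit together, here is a minimal sketch. It is not the code in this PR. It assumes the workers run as StatefulSet pods behind a headless Service, so that pod 0 has a stable DNS name that can serve as the rendezvous endpoint. The helper names (`rdzv_endpoint`, `build_torchrun_args`, `run_with_cleanup`), the port 29500, and the 5-second grace period are all hypothetical. The last helper illustrates one way to make sure child processes are cleaned up on timeout, by running the command in its own process group (POSIX only).

```python
import os
import signal
import subprocess
import sys


def rdzv_endpoint(statefulset, service, namespace, port=29500):
    """Stable DNS name of pod 0 behind a headless Service, plus a port.

    Assumption: pods come from a StatefulSet, so pod 0 is "<statefulset>-0".
    """
    return f"{statefulset}-0.{service}.{namespace}.svc.cluster.local:{port}"


def build_torchrun_args(script, nnodes, nproc_per_node, endpoint, rdzv_id="inference"):
    """Build a torchrun command line that uses the c10d rendezvous backend."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={endpoint}",
        f"--rdzv_id={rdzv_id}",
        script,
    ]


def run_with_cleanup(cmd, timeout_s, grace_s=5.0):
    """Run cmd in its own process group; on timeout, stop the whole group."""
    # start_new_session=True makes the child a session and group leader,
    # so its process-group id equals its pid.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGTERM)  # ask the group to exit
        except ProcessLookupError:
            pass  # the group already exited
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            os.killpg(proc.pid, signal.SIGKILL)  # force-stop stragglers
            proc.wait()
        return proc.returncode


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    endpoint = rdzv_endpoint("llama-2", "llama-2-headless", "default")
    print(" ".join(build_torchrun_args("inference-api.py", 2, 1, endpoint)))
    # A stand-in child that would otherwise outlive the timeout.
    print(run_with_cleanup([sys.executable, "-c", "import time; time.sleep(30)"], 1.0))
```

Using a process group rather than terminating only the direct child means that any worker processes spawned by the launcher are signalled as well, which is the kind of cleanup the commit messages in this PR describe.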

Fei-Guo

Rename createHeadlessService to useHeadlessService in all cases.

presets/llama-2-chat/inference-api.py

presets/llama-2/inference-api.py

fix: fix networking issue inference

b3eec53

ishaansehgal99 requested review from Fei-Guo and helayoty as code owners October 26, 2023 02:17

ishaansehgal99 added 18 commits October 25, 2023 19:18

nit: remove threading

c1d3a18

fix: ensure child process

32586dd

fix: upgrade nvidia pytorch

933efc6

fix: lint

86101a8

fix: naming

b691d9b

fix: diff

cbbebbd

fix: fetch

5248737

fix: log

1690ccc

feat: add the headless service, add the resiliency to ensure cleanup o…

2ee601c

…f processes upon termination

fix: timeout error handling

224416a

fix: remove comments

a91bded

feat: added torchrdzvparams, headless service

fda874a

fix: simplify timeout

cdf10e7

fix: headless service variable fixes

a12f16c

fix: shutdown

2f2821d

fix: dockerfile

86ff9e0

fix: update docker file paths to avoid conflicting volume mounts

ca788a4

fix: fix service naming, and add service namespace and ownerreference…

2213d85

…s which are required

ishaansehgal99 changed the title ~~fix: worker networking issue inference~~ fix: inference fault tolerance Nov 2, 2023

Fei-Guo reviewed Nov 2, 2023

presets/llama-2-chat/inference-api.py (outdated)

presets/llama-2-chat/inference-api.py (outdated)

presets/llama-2/inference-api.py (outdated)

ishaansehgal99 added 4 commits November 2, 2023 13:46

fix: remove logs

8a9b4ae

fix

c52b40c

Merge branch 'main' of https://github.com/Azure/kdm into Ishaan/fix-w…

ac0e880

…orker-networking-inference

fix: rename create to useHeadlessService

fd3f9db

Fei-Guo approved these changes Nov 2, 2023

Fei-Guo merged commit d0ba4d5 into main Nov 2, 2023
5 checks passed

Fei-Guo deleted the Ishaan/fix-worker-networking-inference branch November 2, 2023 21:20
