aws-samples / awsome-distributed-training Public


Pull requests: aws-samples/awsome-distributed-training


23 Open 278 Closed


Pull requests list

Update Megatron-LM base image

#402 opened Aug 8, 2024 by KeitaW


add maxtext test case

Label: enhancement (New feature or request)

#397 opened Aug 5, 2024 by KeitaW • Draft

Update bionemo test case + propose to subdirectories per orchestrator

Label: documentation (Improvements or additions to documentation)

#396 opened Aug 5, 2024 by KeitaW • Draft

Smhp add features in LCS utils

#392 opened Aug 1, 2024 by gmgtamz


Esm2 on Sagemaker Hyperpod

#387 opened Jul 25, 2024 by awsankur


FSDP: Mistral(mathstral) sbatch file - MISTRAL MODEL SUPPORT

Label: New model

#385 opened Jul 23, 2024 by nithiyn


FSDP: Add mistral model type support

Label: New model

#384 opened Jul 23, 2024 by arm-diaz


update MosaicML composer image and MPT test case

#376 opened Jul 15, 2024 by KeitaW • Draft

update dependencies of PyTorch base image

#375 opened Jul 15, 2024 by KeitaW


Update SMPv2 conda setup script with latest PT2.3.1 TSM2.4.0

#366 opened Jun 25, 2024 by viclzhu


Neuron distributed

#359 opened Jun 13, 2024 by KeitaW


End-to-End LLM Model Development with Torchtitan and Torchtune

Label: enhancement (New feature or request)

#341 opened May 20, 2024 by KeitaW


Llama training with FP8

#331 opened May 15, 2024 by pbelevich • Draft

Example for benchmarking ML workloads using Torch Profiler and NSight

#322 opened May 10, 2024 by syedazi


Llama3 Neuron-distributed test case

#297 opened May 2, 2024 by KeitaW • Draft

Add draft gpu troubles

#290 opened Apr 30, 2024 by mhuguesaws • Draft

Slurm job template: how a job can probe instance topology and hostname-instanceid mappings…

#268 opened Apr 16, 2024 by verdimrc • Draft

Script to probe the nccl libraries that PyTorch uses

#267 opened Apr 16, 2024 by verdimrc


[WIP] torchtune usecase

#260 opened Apr 12, 2024 by pbelevich • Draft

Bump pytorch dockerfile template

#211 opened Mar 12, 2024 by verdimrc


SMHP: slurm exporter to report gpu metrics

#181 opened Mar 6, 2024 by verdimrc


Update organization and tag to V1

#150 opened Feb 22, 2024 by perifaws


megatron-lm test case: update README

#114 opened Jan 25, 2024 by verdimrc • Draft
