Releases: huggingface/optimum-habana

v1.13.2: Patch release

06 Sep 20:17

Llava(-next) improvements

This patch release adds multi-card support for Llava(-next) and lets users turn recomputation for flash attention on or off; see the sketch after the list below.

  • Llava: added a flash_attention_recompute arg to enable/disable recompute #1278 @tthakkal
  • Add the DeepSpeed injection_policy for Mistral #1309 @yuanwu2017
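
For context, a minimal sketch of how the new flag can be passed at generation time, assuming the flash-attention kwargs are forwarded through `generate` as in optimum-habana's other examples; the checkpoint and image URL are placeholders:

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

# Swap in the Gaudi-optimized model implementations.
adapt_transformers_to_gaudi()

model_id = "llava-hf/llava-1.5-7b-hf"  # placeholder checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("hpu")

image = Image.open(requests.get("https://example.com/cat.png", stream=True).raw)  # placeholder
inputs = processor(
    text="USER: <image>\nWhat is shown in this image? ASSISTANT:",
    images=image,
    return_tensors="pt",
).to("hpu")

output = model.generate(
    **inputs,
    max_new_tokens=64,
    use_flash_attention=True,        # assumed kwarg, as in optimum-habana's generation examples
    flash_attention_recompute=True,  # the new toggle from #1278; False trades HPU memory for speed
)
print(processor.decode(output[0], skip_special_tokens=True))
```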

Full Changelog: v1.13.1...v1.13.2

v1.13.1: Patch release

25 Aug 13:34

Fixed memory regressions

  • Remove _expand_inputs_for_generation for greedy search (#1266) @libinta
  • Fix memory regression for modeling llama (#1271) @libinta

FSDP

FSDP checkpoint saving is fixed.

Known limitations

  • ESMFold does not work on Gaudi1; this will be fixed in a future version

Full Changelog: v1.13.0...v1.13.1

v1.13.0: Stable Diffusion 3, Sentence Transformers, SAM, DETR, Kubernetes example

16 Aug 14:25

SynapseAI 1.17

  • Upgrade SynapseAI version to 1.17.0 #1217

Transformers 4.43

Diffusers 0.29

  • Upgrade optimum-habana diffusers dependency from 0.26.3 to 0.29.2 #1150 @dsocek

Stable Diffusion 3

Training with Sentence Transformers

Model optimizations

SAM, FastViT, VideoMAE, OpenCLIP, DETR, Table Transformer, DeciLM

Stable Diffusion inpainting, unconditional image generation

  • Add Stable Diffusion inpainting support #869 @yuanwu2017
  • Enable unconditional image generation on Gaudi2 [Diffuser/Tasks] #859 @cfgfung
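
For illustration, a minimal inpainting sketch, assuming the new pipeline follows the Gaudi* naming and constructor arguments used elsewhere in optimum.habana.diffusers; the checkpoint and image URLs are placeholders:

```python
from diffusers.utils import load_image
from optimum.habana.diffusers import GaudiStableDiffusionInpaintPipeline  # assumed class name

pipe = GaudiStableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",  # placeholder checkpoint
    use_habana=True,
    use_hpu_graphs=True,
    gaudi_config="Habana/stable-diffusion",
)

init_image = load_image("https://example.com/photo.png")  # placeholder
mask_image = load_image("https://example.com/mask.png")   # placeholder; white = region to repaint
result = pipe(
    prompt="a white cat sitting on a park bench",
    image=init_image,
    mask_image=mask_image,
)
result.images[0].save("inpainted.png")
```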

Text feature extraction example

Tensor parallelism

  • Tensor parallel distributed strategy without using deepspeed #1121 @kalyanjk
  • Disable torch.compile for all_reduce when parallel_strategy is set to "tp" #1174 @kalyanjk

Kubernetes cluster example

  • Add a Helm chart, Dockerfile, and instructions for running examples on a Kubernetes cluster #1099 @dmsuehir
  • Fix PyTorch version in the Kubernetes docker-compose to match image #1246 @dmsuehir

FP8 training

Other

Known limitations

  • For Llama, some large batch sizes that previously worked now lead to out-of-memory errors

v1.12.1: Patch release

11 Jul 13:51

Fix first-token latency measurement

Fix for Mixtral

Other

  • Fix for selective seq length test with batch size 1 #1110 @libinta

Full Changelog: v1.12.0...v1.12.1

v1.12: Qwen2, Gemma, SVD, Dreambooth, speculative sampling

22 Jun 18:28

SynapseAI v1.16

Transformers 4.40

Speculative Sampling

Model optimizations

Stable Video Diffusion

PEFT

TRL

Object Segmentation Example

  • Add an example of object segmentation (CLIPSeg) #801 @cfgfung
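
Roughly what the example does, in minimal form; the CLIPSeg classes come from stock transformers, and running them on the hpu device is the Gaudi-specific part (the image URL is a placeholder):

```python
import habana_frameworks.torch  # noqa: F401  # registers the "hpu" device
import requests
import torch
from PIL import Image
from transformers import CLIPSegForImageSegmentation, CLIPSegProcessor

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined").to("hpu")

image = Image.open(requests.get("https://example.com/cats.png", stream=True).raw)  # placeholder
prompts = ["a cat", "a remote control"]
inputs = processor(
    text=prompts, images=[image] * len(prompts), padding=True, return_tensors="pt"
).to("hpu")

with torch.no_grad():
    outputs = model(**inputs)
masks = torch.sigmoid(outputs.logits)  # one low-resolution segmentation map per prompt
```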

Dreambooth

  • Diffusers DreamBooth full/LoRA/LoKr/LoHa/OFT fine-tuning and DreamBooth XL LoRA fine-tuning #881 @sywangyi

Others

v1.11.1: Patch release

20 Apr 05:28

Llama3 has been validated on Gaudi

Fix issue with pytest

The latest SynapseAI Docker images come with pytest v8 preinstalled, which is incompatible with the Transformers library and leads to errors in a few non-test cases. As a temporary workaround, pytest is pinned and made a hard dependency.

Other

Full Changelog: v1.11.0...v1.11.1

v1.11: SDXL fine-tuning, Whisper, Phi, ControlNet

04 Apr 14:55

SynapseAI v1.15

The codebase is fully validated for the latest version of Habana SDK, SynapseAI v1.15.0.

SDXL fine-tuning

Whisper

Phi

ControlNet

Transformers v4.38

The codebase is fully validated for Transformers v4.38.

Model optimizations

Image-to-text and VQA examples

  • Add image-to-text and visual question answering example #738 @sywangyi
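
A condensed sketch of the pipeline flow in the new example, assuming the model implementations are adapted to Gaudi before the pipeline is built; the checkpoint and image URL are placeholders:

```python
import torch
from transformers import pipeline
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

# Replace supported model implementations with their Gaudi-optimized versions.
adapt_transformers_to_gaudi()

captioner = pipeline(
    "image-to-text",
    model="Salesforce/blip-image-captioning-base",  # placeholder checkpoint
    torch_dtype=torch.bfloat16,
    device="hpu",
)
print(captioner("https://example.com/photo.png"))  # placeholder image URL
```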

torch.compile

Bug fixes

Others

Known issue

v1.10.4: Patch release

23 Feb 03:26

Fix Llama memory issue with DeepSpeed ZeRO-3

  • Fix Llama initialization #712

Full Changelog: v1.10.2...v1.10.4

v1.10.2: Patch release

18 Feb 02:23

Upgrade to Transformers v4.37

  • Upgrade to Transformers 4.37 #651

Full Changelog: v1.10.0...v1.10.2

v1.10: SDXL, Textual-Inversion, TRL, SynapseAI v1.14

30 Jan 21:50

SynapseAI v1.14

The codebase is fully validated for the latest version of Habana SDK, SynapseAI v1.14.0.

Stable Diffusion XL

SDXL is now supported and optimized for Gaudi.
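
For example, a minimal text-to-image sketch with the new pipeline; the constructor flags follow the existing Gaudi Stable Diffusion API:

```python
from optimum.habana.diffusers import GaudiStableDiffusionXLPipeline

pipe = GaudiStableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    use_habana=True,
    use_hpu_graphs=True,  # capture HPU graphs to speed up repeated calls
    gaudi_config="Habana/stable-diffusion",
)
image = pipe(prompt="an astronaut riding a green horse").images[0]
image.save("sdxl.png")
```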

Textual inversion fine-tuning

An example of textual-inversion fine-tuning has been added.

TRL

The 🤗 TRL library is now supported on Gaudi for performing DPO and SFT; a short sketch follows the list below.

  • Add TRL DPO and SFT support on Gaudi, with examples #601
  • Restructure example/trl/stack_llama_2 for generic DPO #635 @libinta
  • Add TRL DPO to README.md #652 @libinta
  • Add a seed in DPO to make training results reproducible #646 @sywangyi
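
A rough SFT sketch, assuming the Gaudi wrappers mirror TRL's SFTTrainer API as in the new examples; the model name, dataset, and column name are placeholders:

```python
from datasets import load_dataset
from optimum.habana import GaudiConfig, GaudiTrainingArguments
from optimum.habana.trl import GaudiSFTTrainer  # assumed import path, mirroring examples/trl

dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")  # placeholder dataset

args = GaudiTrainingArguments(
    output_dir="./sft-output",  # placeholder
    use_habana=True,
    use_lazy_mode=True,
    bf16=True,
)
trainer = GaudiSFTTrainer(
    model="meta-llama/Llama-2-7b-hf",  # placeholder checkpoint
    args=args,
    gaudi_config=GaudiConfig(use_fused_adam=True, use_fused_clip_norm=True),
    train_dataset=dataset,
    dataset_text_field="text",  # placeholder column name
    max_seq_length=512,
)
trainer.train()
```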

Full bf16 evaluation

Full bf16 evaluation inside the trainer can now be performed, as in Transformers; see the sketch below.
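
Concretely, this is the standard bf16_full_eval flag from Transformers' TrainingArguments, now usable with the Gaudi variant (the output path is a placeholder):

```python
from optimum.habana import GaudiTrainingArguments

args = GaudiTrainingArguments(
    output_dir="./eval-output",  # placeholder
    use_habana=True,
    use_lazy_mode=True,
    bf16_full_eval=True,  # run the whole evaluation loop in bf16
)
```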

Text-generation pipeline

A text-generation pipeline fully optimized for Gaudi has been added.

Model optimizations

TGI

TGI on Gaudi has been moved to a dedicated repo: https://github.com/huggingface/tgi-gaudi

Various fixes

Others