v1.2.0

@shashikg shashikg released this 23 Dec 10:20
· 57 commits to main since this release

Release Notes

  • Fixed ffmpeg resampling issue by adding an option to use the swr resampler in case soxr is not available.
  • Added word timestamp feature.

Word Timestamp Benchmarks

| Model Name | Acc. Overlapped | Acc. Within Collar (0.1s) | Acc. Within Collar (0.2s) | Acc. Within Collar (0.5s) | Acc. Within Collar (1.0s) | Total Word Hits | Inference Time |
|---|---|---|---|---|---|---|---|
| WhisperS2T (ASR: whisper-large-v2 - Aligner: whisper-tiny) | 66.21 | 38.67 | 60.8 | 76.06 | 85.82 | 64350 | 2.6x |
| WhisperS2T (ASR: whisper-large-v2 - Aligner: whisper-large-v2) | 66.72 | 48.95 | 58.54 | 73.44 | 84.0 | 64350 | 1.6x |
| WhisperX (ASR: whisper-large-v2 - Aligner: wav2vec) | 55.65 | 50.66 | 55.84 | 66.18 | 75.57 | 64307 | 1x |

We used the Whisper model itself for alignment. We observed that Whisper-based alignment and phoneme-level alignment (as in WhisperX) yield similar performance. However, Whisper-based alignment has several advantages, including out-of-the-box support for all languages. Phoneme-level alignment needs a separate model for every new language, which we believe somewhat diminishes the advantage of using the Whisper model in the first place. Moreover, using the whisper-tiny model for word alignment adds very little latency overhead without affecting alignment accuracy. We used the AMI-MIX-Headset-Test dataset for benchmarking.

There is no properly defined metric for estimating word alignment accuracy, so we introduce a new one. See the implementation: Word Alignment Metric Function.

The proposed metric performs the following steps:

  1. First, it identifies the detected words: the words in the predicted transcript that match the reference transcript. This step is crucial because words that are missed or inserted in the predicted transcript should not be considered when evaluating word alignment accuracy.
  2. Next, it calculates two values over the detected words: overlapped_words and words_within_collar (refer to the figure below). Finally, it divides both values by the total number of detected words.

[Figure: word_alignment_benchmark]
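
The two steps can be illustrated with a minimal Python sketch. This is not the repository's Word Alignment Metric Function; the function name `word_alignment_scores`, the `(word, start, end)` input format, and the use of `difflib.SequenceMatcher` to find detected words are all assumptions made for illustration. It also assumes that a word counts as overlapped when the predicted and reference intervals intersect, and as within the collar when both its predicted start and end fall within `collar` seconds of the reference boundaries; the figure is the authoritative definition.

```python
from difflib import SequenceMatcher


def word_alignment_scores(ref, pred, collar=0.2):
    """Hypothetical sketch of the metric described above.

    ref and pred are lists of (word, start, end) tuples, times in seconds.
    Returns accuracies as percentages of the detected words.
    """
    ref_words = [w.lower() for w, _, _ in ref]
    pred_words = [w.lower() for w, _, _ in pred]

    # Step 1: keep only detected words, i.e. words matched between the
    # predicted and reference transcripts. Missed and inserted words are
    # dropped. (Assumption: sequence matching stands in for the alignment.)
    matcher = SequenceMatcher(a=ref_words, b=pred_words, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            pairs.append((ref[block.a + k], pred[block.b + k]))

    if not pairs:
        return {"detected_words": 0, "acc_overlapped": 0.0, "acc_within_collar": 0.0}

    # Step 2: count overlapped words and words within the collar.
    overlapped_words = 0
    words_within_collar = 0
    for (_, r_start, r_end), (_, p_start, p_end) in pairs:
        # Assumption: "overlapped" means the two time intervals intersect.
        if min(r_end, p_end) > max(r_start, p_start):
            overlapped_words += 1
        # Assumption: both boundaries must lie within the collar.
        if abs(p_start - r_start) <= collar and abs(p_end - r_end) <= collar:
            words_within_collar += 1

    # Normalize by the number of detected words only.
    detected = len(pairs)
    return {
        "detected_words": detected,
        "acc_overlapped": 100.0 * overlapped_words / detected,
        "acc_within_collar": 100.0 * words_within_collar / detected,
    }
```

Calling the function once per collar value (0.1, 0.2, 0.5, 1.0) would produce one "Acc. Within Collar" column each, with `detected_words` playing the role of "Total Word Hits" under these assumptions.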