MUTT

Metrics Unit TesTing (MUTT) for machine translation and other similarity metrics.

"To design better metrics, we need a principled approach to evaluating their performance. Historically, MT metrics have been evaluated by how well they correlate with human annotations (Callison-Burch et al., 2010; Machacek and Bojar, 2014. However, as we demonstrate in Sec. 5, human judgment can result in inconsistent scoring. This presents a serious problem for determining whether a metric is ”good” based on correlation with inconsistent human scores. When ”gold” target data is unreliable, even good metrics can appear to be inaccurate. Furthermore, correlation of system output with human-derived scores typically provides an overall score but fails to isolate specific errors that metrics tend to miss. This makes it difficult to discover system-specific weaknesses to improve their performance. For instance, an ngram-based metric might effectively detect non-fluent, syntactic errors, but could also be fooled by legitimate paraphrases whose ngrams simply did not appear in the training set.

The goal of this paper (and thus of this repo) is to propose a process for consistent and informative automated analysis of evaluation metrics. This method is demonstrably more consistent and interpretable than correlation with human annotations. In addition, we extend the SICK dataset to include un-scored fluency-focused sentence comparisons and we propose a toy metric for evaluation."

This version of MUTT provides a way to evaluate any metric on the same datasets as the paper through the evaluate_mutt API.

Dependencies:

  • python (3.*)

Run:

To evaluate a metric, first clone the repository:

git clone https://github.com/Nprime496/MUTT_Wl_research.git
cd MUTT_Wl_research/src

The available corruptions are divided into two categories (a short illustration follows the lists).

Opposite-meaning corruptions:

det_sub: a determiner in the reference is substituted
shuffled: the words of the reference are shuffled
neg_sub: the subject of the reference is replaced by its opposite
neg_verb: the verb of the reference is replaced by its opposite
sem_opps:
remove_prep:
double_pp:
swap_chunks:

Similar-meaning corruptions:

passive:
near_syms:
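
As a rough, hypothetical illustration of what such corruptions look like (the sentences below are invented, not taken from the dataset), MUTT essentially checks how often a metric prefers the genuine sentence over its corruption:

reference = "A man is playing a guitar"
corruptions = {
  "shuffled": "guitar a is man playing A",      # words of the reference shuffled
  "neg_verb": "A man is not playing a guitar",  # verb replaced by its opposite
}
# A well-behaved metric should score each corruption lower against the reference
# than it scores a genuine paraphrase of the reference.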

You have to define the function that will be used to compare two sentences. For example, for BERTScore:

from bert_score import BERTScorer
scorer = BERTScorer(...)  # e.g. BERTScorer(lang="en")

def evaluate_BERTScore_two_sentences(sent_a, sent_b):
  # scorer.score returns (precision, recall, F1) tensors; keep the F1 score as a float
  return scorer.score([sent_a], [sent_b])[2].item()
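
A quick sanity check of the scoring function (the two sentences are arbitrary examples, not from the MUTT data):

score = evaluate_BERTScore_two_sentences("A man is playing a guitar", "A person plays the guitar")
print(score)  # a single float; close paraphrases should score noticeably higher than unrelated sentences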

Then, run the evaluation through the evaluate_mutt API:

from mutt_ import evaluate_mutt
evaluate_mutt([("<name of your metric>", <function taking two sentences as input and returning a float score>), ...])

For our BERTScore example, to evaluate it against the original MUTT paper dataset:

from mutt_ import evaluate_mutt
evaluate_mutt([("BERTScore",evaluate_BERTScore_two_sentences)])

To evaluate it against the reviewed (QA'ed) MUTT dataset:

from mutt_ import evaluate_mutt
evaluate_mutt([("BERTScore",evaluate_BERTScore_two_sentences)],qaed=True)

About

This project provides an API for running unit tests on language generation metrics. It is based on the repository and datasets of the original paper (Boag et al.).
