MUTT

Metrics Unit TesTing (MUTT) for machine translation and other similarity metrics.

"To design better metrics, we need a principled approach to evaluating their performance. Historically, MT metrics have been evaluated by how well they correlate with human annotations (Callison-Burch et al., 2010; Machacek and Bojar, 2014). However, as we demonstrate in Sec. 5, human judgment can result in inconsistent scoring. This presents a serious problem for determining whether a metric is “good” based on correlation with inconsistent human scores. When “gold” target data is unreliable, even good metrics can appear to be inaccurate. Furthermore, correlation of system output with human-derived scores typically provides an overall score but fails to isolate specific errors that metrics tend to miss. This makes it difficult to discover system-specific weaknesses to improve their performance. For instance, an ngram-based metric might effectively detect non-fluent, syntactic errors, but could also be fooled by legitimate paraphrases whose ngrams simply did not appear in the training set.

The goal of this paper (thus this repo) is to propose a process for consistent and informative automated analysis of evaluation metrics. This method is demonstrably more consistent and interpretable than correlation with human annotations. In addition, we extend the SICK dataset to include un-scored fluency-focused sentence comparisons and we propose a toy metric for evaluation."

Edit

This version of MUTT provides a way to evaluate any metric on the same dataset as the paper through the evaluate_mutt API.

Dependencies:

Python (3.*)

Run :

To evaluate a metric, first clone the repo and move into the src directory:

git clone https://github.com/Nprime496/MUTT_Wl_research.git
cd MUTT_Wl_research/src

The available corruptions are divided into two categories. Opposite meaning corruptions:

det_sub: the corruption of the reference has a determiner substitution
shuffled: the words of the reference are shuffled in the corruption
neg_sub: the corruption of the reference has an opposite subject
neg_verb: the corruption of the reference has an opposite verb
sem_opps: 
remove_prep:
double_pp:
swap_chunks:

Similar meaning corruptions:

passive:
near_syms:
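
To make the idea of a corruption concrete, here is a minimal sketch of what a `shuffled`-style corruption could look like. The function name `shuffle_corruption` and its `seed` parameter are hypothetical and chosen for illustration only; this is not the code the repo uses to build its dataset.

```python
import random


def shuffle_corruption(reference, seed=0):
    """Hypothetical illustration of a `shuffled`-style corruption.

    Returns the reference with its whitespace-separated tokens in a
    different order. Not the repo's own implementation.
    """
    tokens = reference.split()
    if len(set(tokens)) < 2:
        # Fewer than two distinct tokens: no different ordering exists.
        return reference
    rng = random.Random(seed)  # local generator, so results are repeatable
    shuffled = tokens[:]
    while shuffled == tokens:
        rng.shuffle(shuffled)
    return " ".join(shuffled)


reference = "a man is playing a guitar"
corrupted = shuffle_corruption(reference)
print(corrupted)
```

The corrupted sentence keeps exactly the same words as the reference, so a metric that ignores word order would score the pair as identical.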

Define the function that will be used to compare two sentences. For example, for BERTScore:

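from bert_score import BERTScorer
scorer = BERTScorer(...)  # fill in your own BERTScorer arguments

def evaluate_BERTScore_two_sentences(sent_a, sent_b):
    # score() returns (precision, recall, F1) tensors; return F1 as a float
    return scorer.score([sent_a], [sent_b])[2].item()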

Then run the API using evaluate_mutt:

from mutt_ import evaluate_mutt
evaluate_mutt([("<name of your model>",<function taking two sentences as input and returning a float value of the score>),...])

For our BERTScore example, to evaluate it against the dataset of the original MUTT paper:

from mutt_ import evaluate_mutt
evaluate_mutt([("BERTScore",evaluate_BERTScore_two_sentences)])

To evaluate it against the reviewed (QA'ed) MUTT dataset, pass qaed=True:

from mutt_ import evaluate_mutt
evaluate_mutt([("BERTScore",evaluate_BERTScore_two_sentences)],qaed=True)
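
Any callable that takes two sentences and returns a float should fit the same interface. As a minimal sketch with no external dependencies, the toy token-overlap metric below can be used to try out the pipeline. The name `jaccard_overlap` is hypothetical and the metric is not part of the repo; the evaluate_mutt call is left commented out because it assumes you are running from MUTT_Wl_research/src.

```python
def jaccard_overlap(sent_a, sent_b):
    """Toy metric (hypothetical, not from the repo): Jaccard overlap
    between the lowercased token sets of the two sentences."""
    tokens_a = set(sent_a.lower().split())
    tokens_b = set(sent_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty sentences are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Same (name, function) pair format as in the BERTScore example.
metrics = [("jaccard", jaccard_overlap)]

# Assumes the working directory is MUTT_Wl_research/src:
# from mutt_ import evaluate_mutt
# evaluate_mutt(metrics)
```

Because this metric only looks at which words occur, it cannot tell a sentence from a reordering of it, which makes it a handy baseline to compare against stronger metrics.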
