Assessing The Factual Accuracy of Text Generation

Ben Goodrich
Peter Liu
Vinay Rao
The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'19) (2019) (to appear)

Abstract

We propose an automatic metric to reflect the
factual accuracy of generated text as an alternative
to typical scoring schemes like ROUGE
(Recall-Oriented Understudy for Gisting Evaluation)
and BLEU (Bilingual Evaluation Understudy).
We consider models that can extract fact
triplets from text and then use them to
define a metric that compares triplets extracted
from generated summaries and reference texts.
We show that this metric correlates with human
evaluation of factual accuracy better than
ROUGE does.
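The triplet-comparison idea can be illustrated with a minimal sketch. The abstract does not specify the exact scoring formula, so the F1-over-triplet-sets computation below (and the `triplet_f1` name and example triplets) is a hypothetical illustration, not the paper's actual metric:

```python
def triplet_f1(generated, reference):
    """Hypothetical factual-accuracy score: F1 over (subject, relation, object)
    triplet sets extracted from a generated summary and a reference text."""
    gen, ref = set(generated), set(reference)
    if not gen or not ref:
        return 0.0
    overlap = gen & ref                      # triplets the summary got right
    precision = len(overlap) / len(gen)      # fraction of generated facts that are supported
    recall = len(overlap) / len(ref)         # fraction of reference facts that are covered
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

reference = [("Paris", "capital_of", "France"), ("France", "located_in", "Europe")]
generated = [("Paris", "capital_of", "France"), ("Paris", "located_in", "Germany")]
print(triplet_f1(generated, reference))  # one of two facts matches on each side
```

Unlike n-gram overlap metrics such as ROUGE, a score of this form ignores surface wording and penalizes a summary only for facts it states incorrectly or omits.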
To build these models, we introduce a new
Wikidata-based dataset for fact extraction, and
show that a transformer-based attention model
can learn to predict structured fact triplets
and performs favorably compared to more
traditional two-stage approaches (entity recognition
followed by relationship classification).
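Predicting structured triplets with a single sequence model typically requires serializing them into a target string. The marker tokens and function below are assumptions for illustration; the abstract does not describe the paper's actual output format:

```python
def linearize_triplets(triplets):
    """Serialize (subject, relation, object) triplets into one target sequence,
    as a seq2seq fact extractor might be trained to emit. The <s>/<r>/<o>
    marker tokens are a hypothetical convention, not the paper's format."""
    return " ".join(f"<s> {s} <r> {r} <o> {o}" for s, r, o in triplets)

target = linearize_triplets([("Paris", "capital_of", "France")])
print(target)  # "<s> Paris <r> capital_of <o> France"
```

A model trained end-to-end on such targets avoids the error propagation of a two-stage pipeline, where a relationship classifier can only be as good as the entity spans it receives.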