Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study

Yi Tay
Che Zheng
Cliff Brunk
WSDM 2021

Abstract

Large generative language models such as GPT-2 are well known not only for their ability to generate highly realistic text but also for their utility on common downstream tasks. However, how and in what settings one can best leverage these powerful language models is still a nascent research question. In this work, we explore their use in predicting "language quality", a notion of coherence and understandability of text. Our key finding is that, when trained in a self-discriminating fashion (i.e., to distinguish machine-generated text from human-written text), large language models emerge as unsupervised predictors of language quality. This enables fast bootstrapping of quality indicators in a low-resource setting. We conduct extensive qualitative and quantitative analysis over 500 million web articles, making this the largest-scale study conducted on this topic.
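To make the self-discriminating setup concrete, the sketch below scores a document with an off-the-shelf machine-text detector and treats P(human-written) as a crude quality proxy. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the public HuggingFace checkpoint "openai-community/roberta-base-openai-detector" (a RoBERTa classifier fine-tuned to separate GPT-2 generations from human text), and the quality_score helper and its label handling are ours.

    # Sketch: use a detector trained to discriminate LM generations from
    # human text as an unsupervised quality scorer. The checkpoint name and
    # label mapping below are assumptions; verify model.config.id2label
    # before relying on them.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL_NAME = "openai-community/roberta-base-openai-detector"  # assumed checkpoint
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    model.eval()

    def quality_score(text: str) -> float:
        """Return P(human-written) for `text`, used here as a quality proxy."""
        inputs = tokenizer(text, truncation=True, max_length=512,
                           return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=-1).squeeze(0)
        # Resolve the probability through the model's own label names rather
        # than a hard-coded index, since label order varies by checkpoint.
        label_to_prob = {model.config.id2label[i]: p.item()
                         for i, p in enumerate(probs)}
        return label_to_prob.get("Real", probs[-1].item())

    print(quality_score("The quick brown fox jumps over the lazy dog."))

Under this reading, a page that the detector confidently flags as machine-generated receives a low score, which, per the paper's key finding, tends to correlate with low language quality.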