NewsEmbed: Modeling News through Pretrained Document Representations

Cong Yu

Jialu Liu

Tianqi Liu

KDD 2021

Abstract

Effectively modeling text-rich fresh content such as news articles
and blog posts is a challenging problem. To ensure that a content-based
model generalizes well to a broad range of applications, it is critical
to have a training dataset whose size goes beyond what human labeling
can provide while still achieving the desired quality. In this work, we
address these two challenges by proposing a novel approach to mine
semantically relevant fresh documents, and their topic labels, with
little human supervision. Specifically, we design a multitask model
that alternately trains a contrastive learning task and a multi-label
classification task to derive a universal document encoder. We show that
this approach can provide billions of high-quality organic training
examples and can be naturally extended to a multilingual setting
where texts in different languages are encoded in the same semantic
space. We experimentally demonstrate NewsEmbed’s competitive
performance across multiple natural language understanding tasks,
both supervised and unsupervised.
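
The alternating multitask setup described in the abstract can be illustrated with a small sketch. The abstract does not specify the loss functions, the schedule, or the encoder, so everything below is an assumption for illustration: an in-batch softmax contrastive loss, a sigmoid cross-entropy loss for topic labels, and a simple even/odd step schedule. The function names (`contrastive_loss`, `multilabel_loss`, `task_schedule`) and the `temperature` parameter are hypothetical and not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def contrastive_loss(anchors, positives, temperature=0.1):
    """In-batch softmax contrastive loss (assumed form, not the paper's).

    Row i of `positives` is treated as the relevant document for row i of
    `anchors`; every other row in the batch serves as a negative.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def multilabel_loss(logits, labels):
    """Mean sigmoid cross-entropy; each topic is an independent binary label."""
    per_label = (
        np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))
    )
    return float(per_label.mean())


def task_schedule(num_steps):
    """Hypothetical schedule: alternate the two tasks on even and odd steps."""
    return ["contrastive" if s % 2 == 0 else "classification" for s in range(num_steps)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    doc_a = rng.normal(size=(8, 16))                 # stand-in document embeddings
    doc_b = doc_a + 0.05 * rng.normal(size=(8, 16))  # stand-in relevant documents
    topic_logits = rng.normal(size=(8, 5))
    topic_labels = rng.integers(0, 2, size=(8, 5)).astype(float)
    for step, task in enumerate(task_schedule(4)):
        if task == "contrastive":
            loss = contrastive_loss(doc_a, doc_b)
        else:
            loss = multilabel_loss(topic_logits, topic_labels)
        print(step, task, round(loss, 4))
```

In a real system both losses would backpropagate into one shared encoder, which is what makes the resulting document representation reusable across tasks; here the embeddings are random stand-ins and no parameters are updated.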
