Neural Ranking Models with Weak Supervision

Hamed Zamani
Jaap Kamps
W. Bruce Croft
Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM (2017)

Abstract

Despite the impressive improvements achieved by unsupervised
deep neural networks in computer vision, natural language processing,
and speech recognition tasks, such improvements have not
generally been observed in ranking for information retrieval. The
reason might be related to the complexity of the ranking problem,
in the sense that it is not obvious how to learn from queries and
documents when no supervised signal is available. Hence, in this
paper, we propose to train a neural ranking model from a weak
supervision signal, which is a training signal that can be obtained
automatically without human labeling or any external resources
(e.g., click data). To this end, we use the output of a known unsupervised
ranking model, such as BM25, as a weak supervision signal. We further
train a set of simple yet effective ranking models based on feed-forward
neural networks. We study their effectiveness under various learning
scenarios (point-wise and pair-wise models) and using different input
representations (i.e., from encoding query-document pairs into
dense/sparse vectors to using word embedding representations). We train
our networks on 5 million
unique queries obtained from the publicly available AOL query
logs and two standard collections: a homogeneous news collection
(Robust) and a heterogeneous large-scale web collection (ClueWeb).
Our experiments indicate that feeding raw data to the networks and
letting them learn input representations leads to impressive
performance, with over 13% and 35% MAP improvements over the BM25
model on the Robust and ClueWeb collections, respectively. Our
findings suggest that neural ranking
models can greatly benefit from large amounts of weakly labeled
data that can be easily obtained from unsupervised IR models.
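
As a rough illustration of the setup the abstract describes, the
following Python sketch uses BM25 scores as weak labels to train a small
pair-wise feed-forward ranker with a margin (hinge) loss. It assumes the
third-party rank_bm25 package and PyTorch; the FeedForwardRanker
architecture, the encode_pair feature encoder, the toy corpus, the
margin, and all hyperparameters are hypothetical stand-ins rather than
the paper's exact models.

import torch
import torch.nn as nn
from rank_bm25 import BM25Okapi  # third-party BM25 implementation

# Weak labeling: score documents with an unsupervised ranker (BM25),
# requiring no human labels or click data.
corpus = [
    "neural ranking models for information retrieval".split(),
    "weak supervision from unsupervised retrieval models".split(),
    "deep learning for computer vision tasks".split(),
]
bm25 = BM25Okapi(corpus)
query = "neural ranking with weak supervision".split()
weak_scores = bm25.get_scores(query)  # one weak relevance score per document

# A simple feed-forward ranking model (illustrative architecture).
class FeedForwardRanker(nn.Module):
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):  # x: (batch, dim) query-document features
        return self.net(x).squeeze(-1)

_cache = {}
def encode_pair(query, doc, dim=32):
    # Hypothetical stand-in for the paper's input representations
    # (dense/sparse encodings or word embeddings): a fixed random
    # vector per (query, document) pair.
    key = (" ".join(query), " ".join(doc))
    if key not in _cache:
        _cache[key] = torch.rand(dim)
    return _cache[key]

# Pair-wise training: prefer the document that BM25 scores higher.
model = FeedForwardRanker(dim=32)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
hinge = nn.MarginRankingLoss(margin=1.0)

pairs = [(i, j) for i in range(len(corpus)) for j in range(len(corpus))
         if weak_scores[i] > weak_scores[j]]

for epoch in range(10):
    for i, j in pairs:
        s_pos = model(encode_pair(query, corpus[i]).unsqueeze(0))
        s_neg = model(encode_pair(query, corpus[j]).unsqueeze(0))
        # target = 1 asks the model to score the BM25-preferred
        # document higher by at least the margin.
        loss = hinge(s_pos, s_neg, torch.ones_like(s_pos))
        opt.zero_grad()
        loss.backward()
        opt.step()

A point-wise variant under the same weak labels would instead regress
the model's score for each query-document pair directly onto its BM25
score, e.g., with a mean-squared-error loss.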