Follow the leader(board) with confidence: Estimating p-values from a single test set with item and response variance

Chris Homan
Shira Wein

Abstract

We tackle the problem of providing accurate, rigorous p-values
for comparisons between the results of two evaluated systems
whose evaluations are based on a crowdsourced “gold” reference
standard. While this problem has been studied before, we argue
that the null hypotheses used in previous work rest on a common
fallacy, testing for equality of probabilities, rather than the
standard null hypothesis that two samples are drawn from the same
distribution. We adopt this standard null hypothesis, that the
two systems’ responses are drawn from the same distribution, and
introduce a simulation-based framework for determining the true
p-value for this null hypothesis. We explore how to estimate the
true p-value from a single test set under different metrics, tests,
and sampling methods, and call particular attention to the role of
response variance, which exists in crowdsourced annotations as a
product of genuine disagreement, in system predictions as a
product of stochastic training regimes, and in generative models as
an expected property of the outputs. We find that response variance
is a powerful tool for estimating p-values, and present results for
the metrics, tests, and sampling methods that make the best p-value
estimates in a simple machine learning model comparison.
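
To make the idea concrete, the following is a minimal sketch, in Python, of one way a simulation-based estimate could use both item and response variance: it bootstraps test items, draws one annotator response per sampled item as the gold label, and reads a two-sided p-value off the resulting distribution of metric differences. The function name bootstrap_p_value, the accuracy metric, and the specific resampling scheme are illustrative assumptions, not the procedure from the paper.

```python
import numpy as np

def bootstrap_p_value(responses, preds_a, preds_b, n_boot=10_000, seed=0):
    """Illustrative sketch (not the paper's exact procedure): estimate a
    two-sided p-value for the accuracy difference between two systems,
    resampling items (item variance) and one annotator response per
    sampled item (response variance) on each bootstrap replicate.

    responses: sequence where responses[i] holds every annotator label
               collected for test item i.
    preds_a, preds_b: per-item predictions from the two systems.
    """
    rng = np.random.default_rng(seed)
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    n = len(responses)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Resample items with replacement (item variance).
        items = rng.integers(0, n, size=n)
        # Draw one annotator response per sampled item (response variance).
        gold = np.array([rng.choice(responses[i]) for i in items])
        diffs[b] = (np.mean(preds_a[items] == gold)
                    - np.mean(preds_b[items] == gold))
    # Two-sided p-value: twice the smaller tail of the bootstrap
    # difference distribution relative to zero.
    return min(1.0, 2 * min(np.mean(diffs <= 0.0), np.mean(diffs >= 0.0)))
```

Under the null hypothesis that the two systems’ responses are drawn from the same distribution, the bootstrap difference distribution should straddle zero; a small p-value indicates the observed gap is unlikely to be an artifact of item or response sampling alone.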
