Google Dataset Search by the Numbers

Omar Benjelloun

Shiyu Chen

Natasha Noy

International Semantic Web Conference (ISWC-2020), In-Use Track (to appear)


Abstract

Scientists, governments, and companies increasingly publish datasets on the
Web. Google's Dataset Search tool extracts dataset metadata, expressed in the
schema.org vocabulary, from webpages in order to make datasets discoverable.
Since the tool's inception, the number of datasets described in schema.org has
grown from about 500K to almost 30M, and this collection has become a valuable
snapshot of what data on the Web looks like. This paper analyzes the corpus of dataset
metadata we collected. To the best of our knowledge, this corpus is the
largest and most diverse of its kind. We discuss such questions as where the
datasets originate, what topics they cover, what form they take, and what
people searching for datasets are interested in. We describe our methods for
collecting and analyzing this data as well as our observations. We conclude
by identifying gaps and possible future work to help make data more
accessible.

Research Areas

Data Management
