
Tarfah Alrashed
I am a Research Scientist at Google Research, where I am part of the Dataset Search team. Our mission is to make structured data on the Web more accessible and useful. We developed Google Dataset Search, a tool that helps users discover datasets stored across the Web.
Before joining Google, I completed my Ph.D. in Computer Science at MIT CSAIL, where my research focused on designing systems that enable users to access and manipulate web data without the need to write code. You can view my pre-Google publications on my Google Scholar profile.
Authored Publications
Discovering Datasets on the Web Scale: Challenges and Recommendations for Google Dataset Search
Daniel Russell
Stella Dugall
Harvard Data Science Review (2024)
With the rise of open data in the last two decades, more datasets are online and more people are using them for projects and research. But how do people find datasets? We present the first user study of Google Dataset Search, a dataset-discovery tool that uses a web crawl and open ecosystem to find datasets. Google Dataset Search contains a superset of the datasets in other dataset-discovery tools—a total of 45 million datasets from 13,000 sources. We found that the tool addresses a previously identified need: a search engine for datasets across the entire web, including datasets in other tools. However, the tool introduced new challenges due to its open approach: building a mental model of the tool, making sense of heterogeneous datasets, and learning how to search for datasets. We discuss recommendations for dataset-discovery tools and open research questions.
The Web today has millions of datasets, and the number of datasets continues to grow at a rapid pace. These datasets are not standalone entities; rather, they are intricately connected through complex relationships. Semantic relationships between datasets provide critical insights for research and decision-making processes. In this paper, we study dataset relationships from the perspective of users who discover, use, and share datasets on the Web: what relationships are important for different tasks? What contextual information might users want to know? We first present a comprehensive taxonomy of relationships between datasets on the Web and map these relationships to user tasks performed during dataset discovery. We develop a series of methods to identify these relationships and compare their performance on a large corpus of datasets generated from Web pages with schema.org markup. We demonstrate that machine-learning based methods that use dataset metadata achieve multi-class classification accuracy of 90%. Finally, we highlight gaps in available semantic markup for datasets and discuss how incorporating comprehensive semantics can facilitate the identification of dataset relationships. By providing a comprehensive overview of dataset relationships at scale, this paper sets a benchmark for future research.