Yongwei Yang
Yongwei Yang is a researcher at Google. He works on (1) user and consumer research, (2) public perceptions about AI, (3) integrating AI into research methods and processes, and (4) attitude-behavior linkage and its implications for business goal-setting and impact evaluation.
Yongwei also works on foundational methodological research on collecting better data and making better use of data, esp. with surveys, psychological measurement, and behavioral signals. He is passionate about using his expertise to create a positive impact and to help others become effective users of research.
Yongwei holds a Ph.D. in Quantitative and Psychometric Methods from the University of Nebraska-Lincoln.
Authored Publications
Survey communities have regularly discussed optimal questionnaire design for attitude measurement. Specifically for consumer satisfaction, which has historically been treated as a bipolar construct (Thurstone, 1931; Likert, 1932), some argue it is actually two separate unipolar constructs, which may yield signals with separable and interactive dynamics (Cacioppo & Berntson, 1994).
Earlier research has explored whether attitude measurement validity can be optimized with a branching design that involves two questions: a question about the direction of an attitude (e.g., positive, negative) followed by a question using a unipolar scale, about the intensity of the selected direction (Krosnick & Berent, 1993).
The current experiment evaluated differences across a variety of question designs for in-product contextual satisfaction surveys (Sedley & Müller, 2016). Specifically, we randomly assigned respondents into the following designs:
Traditional 5-point bipolar satisfaction scale (fully labeled)
Branched: a directional question (satisfied, neither satisfied nor dissatisfied, dissatisfied), followed by a unipolar question on intensity (5-point scale from “not at all” to “extremely,” fully labeled)
Unipolar satisfaction scale, followed by a unipolar dissatisfaction scale (both use 5-point scale from “not at all” to “extremely,” fully labeled)
Unipolar dissatisfaction scale, followed by a unipolar satisfaction scale (both use 5-point scale from “not at all” to “extremely,” fully labeled)
The experiment adds to the attitude question design literature by evaluating designs based on criterion validity evidence, namely the relationship with user behaviors linked to survey responses.
Results show that no format clearly outperformed the ‘traditional’ bipolar scale format for the criteria included. Separate unipolar scales performed poorly, and may be awkward or annoying for respondents. Branching, while performing similarly to the traditional bipolar design, showed no gain in validity. Thus, it is also not desirable because it requires two questions instead of one, increasing respondent burden.
REFERENCES
Cacioppo, J. T., & Berntson, G. G. (1994). Relationship between attitudes and evaluative space: A critical review, with emphasis on the separability of positive and negative substrates. Psychological bulletin, 115, 401-423.
Krosnick, J. A., & Berent, M. K. (1993). Comparisons of party identification and policy preferences: The impact of survey question format. American Journal of Political Science, 37, 941-964.
Reliability of responses via test-retest, comparing branched vs unbranched
Orthogonal to our study? Not a validity analysis
Malhotra, N., Krosnick, J. A., & Thomas, R. K. (2009). Optimal design of branching questions to measure bipolar constructs. Public Opinion Quarterly, 73, 304-324.
Looks like their analyses were within-condition, and not comparing single question versions to branched versions like we are
page 308 summarizes how they coded the variants and normalized 0 to 1 for regression analysis
O’Muircheartaigh, C., Gaskell, G., & Wright, D. B. (1995). Weighing anchors: Verbal and numeric labels for response scales. Journal of Official Statistics, 11, 295–308.
Wang, R., & Krosnick, J. A. (2020). Middle alternatives and measurement validity: a recommendation for survey researchers. International Journal of Social Research Methodology, 23, 169-184.
Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 79, 281–299.
Thurstone, L. L. (1931). Rank order as a psychological method. Journal of Experimental Psychology, 14, 187–201.
Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 22, 5–55.
Sedley, A., & Müller, H. (2016, May). User experience considerations for contextual product surveys on smartphones. Paper presented at 71st annual conference of the American Association for Public Opinion Research, Austin, TX. Retrieved from https://ai.google/research/pubs/pub46422/
Test-retest reliability of four U.S. non-probability sample sources
Mario Callegaro
Inna Tsirlin
American Association for Public Opinion Research (2022)
It is common practice in market research to set up cross-sectional survey trackers. Although many studies have investigated the accuracy of non-probability-based online samples, less is known about their test-retest reliability, which is of key importance for such trackers. In this study, we wanted to assess how stable measurement is over short periods of time so that any changes observed over long periods in survey trackers could be attributed to true changes in sentiment rather than sample artifacts.
To achieve this, we repeated the same 10-question survey of 1,500 respondents two weeks apart in four different U.S. non-probability-based samples. The samples included: Qualtrics panels representing a typical non-probability-based online panel, Google Surveys representing a river sampling approach, Google Opinion Rewards representing a mobile panel, and Amazon MTurk, not a survey panel in itself but de facto used as such in academic research.
To quantify test-retest reliability, we compared the response distributions from the two survey administrations. Given the attitudes measured were not expected to change in a short timespan and no relevant external events were reported during fielding to potentially affect the attitudes, the assumption was that the two measurements should be very close to each other, aside from transient measurement error.
We found that two of the samples produced remarkably consistent results between the two survey administrations, one sample was less consistent, and the fourth sample had significantly different response distributions for three of the four attitudinal questions. This study sheds light on the suitability of different non-probability-based samples for cross-sectional attitude tracking.
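One way to make the comparison of response distributions concrete is a chi-square test of homogeneity on the two waves' answer counts, paired with a simple distance between the two sets of proportions. The abstract does not name the statistic that was used, so the choice here is an assumption, and the function names and counts below are hypothetical illustrations rather than the study's data or code.

```python
from typing import Sequence, Tuple


def chi_square_homogeneity(wave1: Sequence[int], wave2: Sequence[int]) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for a 2 x K table.

    wave1 and wave2 hold the number of respondents choosing each of the
    K response options in the first and second administration.
    """
    if len(wave1) != len(wave2):
        raise ValueError("both waves must use the same response options")
    n1, n2 = sum(wave1), sum(wave2)
    if n1 == 0 or n2 == 0:
        raise ValueError("each wave needs at least one response")
    total = n1 + n2
    stat = 0.0
    used_categories = 0
    for c1, c2 in zip(wave1, wave2):
        column_total = c1 + c2
        if column_total == 0:
            continue  # an option nobody chose contributes nothing
        used_categories += 1
        for observed, wave_total in ((c1, n1), (c2, n2)):
            expected = wave_total * column_total / total
            stat += (observed - expected) ** 2 / expected
    return stat, used_categories - 1


def total_variation_distance(wave1: Sequence[int], wave2: Sequence[int]) -> float:
    """Half the summed absolute difference between the two sets of proportions."""
    n1, n2 = sum(wave1), sum(wave2)
    return 0.5 * sum(abs(c1 / n1 - c2 / n2) for c1, c2 in zip(wave1, wave2))


if __name__ == "__main__":
    # Hypothetical counts for one 5-point attitudinal question (not study data).
    first_wave = [150, 300, 450, 400, 200]
    second_wave = [140, 310, 460, 390, 200]
    stat, df = chi_square_homogeneity(first_wave, second_wave)
    print(f"chi-square = {stat:.3f} on {df} degrees of freedom")
    print(f"total variation distance = {total_variation_distance(first_wave, second_wave):.4f}")
```

A statistic that is small relative to its degrees of freedom, together with a small total variation distance, would suggest that the two administrations produced similar distributions. Converting the statistic to a p-value requires a chi-square distribution function (for example `scipy.stats.chi2.sf`), which is left out to keep the sketch dependency-free.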
Exciting, Useful, Worrying, Futuristic: Public Perception of Artificial Intelligence in 8 Countries
Patrick Gage Kelley
Christopher Moessner
Aaron Sedley
Andreas Kramm
David T. Newman
Allison Woodruff
AIES '21: Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (2021), 627–637
As the influence and use of artificial intelligence (AI) have grown and its transformative potential has become more apparent, many questions have been raised regarding the economic, political, social, and ethical implications of its use. Public opinion plays an important role in these discussions, influencing product adoption, commercial development, research funding, and regulation. In this paper we present results of an in-depth survey of public opinion of artificial intelligence conducted with 10,005 respondents spanning eight countries and six continents. We report widespread perception that AI will have significant impact on society, accompanied by strong support for the responsible development and use of AI, and also characterize the public’s sentiment towards AI with four key themes (exciting, useful, worrying, and futuristic) whose prevalence distinguishes response to AI in different countries.
“Mixture of amazement at the potential of this technology and concern about possible pitfalls”: Public sentiment towards AI in 15 countries
Patrick Gage Kelley
Christopher Moessner
Aaron M Sedley
Allison Woodruff
Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 44 (2021), pp. 28-46
Public opinion plays an important role in the development of technology, influencing product adoption, commercial development, research funding, career choices, and regulation. In this paper we present results of an in-depth survey of public opinion of artificial intelligence (AI) conducted with over 17,000 respondents spanning fifteen countries and six continents. Our analysis of open-ended responses regarding sentiment towards AI revealed four key themes (exciting, useful, worrying, and futuristic) which appear to varying degrees in different countries. These sentiments, and their relative prevalence, may inform how the public influences the development of AI.
Scaling the smileys: A multicountry investigation
Aaron Sedley
Joseph M. Paxton
The Essential Role of Language in Survey Research, RTI Press (2020), pp. 231-242
Contextual user experience (UX) surveys are brief surveys embedded in a website or mobile app (Sedley & Müller, 2016). In these surveys, emojis (e.g., smiley faces, thumbs, stars), with or without text labels, are often used as answer scales. Previous investigations in the United States found that carefully designed smiley faces may distribute fairly evenly along a numerical scale (0–100) for measuring satisfaction (Sedley, Yang, & Hutchinson, 2017). The present study investigated the scaling properties and construct meaning of smiley faces in six countries. We collected open-ended descriptions of smileys to understand construct interpretations across countries. We also assessed numeric meaning of a set of five smiley faces on a 0–100 range by presenting each face independently, as well as in context with other faces with and without endpoint text labels.
Response Option Order Effects in Cross-Cultural Context. An experimental investigation
Rich Timpone
Mario Callegaro
Marni Hirschorn
Vlad Achimescu
Maribeth Natchez
2019 Conference of the European Association for Survey Research (ESRA), Zagreb (2019) (to appear)
Response option order effects occur when different orderings of rating scale response options lead to different distributions or functioning of survey questions. Theoretical interpretations, notably satisficing, memory bias (Krosnick & Alwin, 1987), and anchor-and-adjustment (Yan & Keusch, 2015), have been used to explain such effects. Visual interpretive heuristics (esp. “left-and-top-mean-first” and “up-means-good”) may also provide insights into how the positioning of response options affects answers (Tourangeau, Couper, & Conrad, 2004, 2013). Most existing studies of the response option order effect were conducted in mono-cultural settings. However, the presence and extent of the effect may be affected by “cultural” factors in a few ways. First, interpretive heuristics such as “left-means-first” may work differently due to varying reading conventions (e.g., left-to-right vs. right-to-left). Furthermore, people within cultures where there are multiple primary languages and multiple reading conventions might possess different positioning heuristics. Finally, respondents from different countries may have varying degrees of exposure to and familiarity with a specific type of visual design. In this experimental study, we investigate the rating scale response option order effect across three countries with different reading conventions and industry norms for answer scale designs: the US, Israel, and Japan. The between-subject factor of the experiment consists of four combinations of scale orientation (vertical and horizontal) and the positioning of the positive end of the scale. The within-subject factors are question topic area and the number of scale points. The effects of device (smartphone vs. desktop computer/tablet), age, gender, education, and the degree of exposure to left-to-right content will also be evaluated.
We incorporate a range of analytical approaches: distributional comparisons, analysis of response latency and paradata, and latent structure modeling. We will discuss implications for choosing response option orders for mobile surveys and for comparing data obtained from different response option orders.
View details
Preview abstract
Contextual user experience (UX) surveys are brief surveys embedded in a website or mobile app and triggered during or after a user-product interaction. They are used to measure user attitude and experience in the context of actual product usage. In these surveys, smiley faces (with or without verbal labels) are often used as answer scales for questions measuring constructs such as satisfaction. From studies done in the US in 2016 and 2017, we found that carefully designed smiley faces may distribute fairly evenly along a numerical scale (0-100) and that scaling properties improved further with endpoint verbal labels (Sedley, Yang, & Hutchinson, presented at AAPOR 2017).
With the propagation of mobile app products around the world, the survey research community is compelled to test the generalizability of single-population findings (often from the US) to cross-national, cross-language and cross-cultural contexts.
The current study builds upon the above scaling study as well as work by cross-cultural survey methodologists who investigated the meanings of verbal scales (e.g., Smith, Mohler, Harkness, & Onodera, 2005). We investigate the scaling properties of smiley faces in a number of distinct cultural and language settings: US (English), Japan (Japanese), Germany (German), Spain (Spanish), India (English), and Brazil (Portuguese).
Specifically, we explore construct alignment by capturing respondents’ own interpretations of the smiley face variants, via open-ended responses.
We also assess scaling properties of various smiley designs by measuring each smiley face on a 0-100 scale, to calculate semantic distance between smileys. This is done by presenting each smiley face both independently and in context with other smileys. We additionally evaluate the effect of including verbal endpoint labels with the smiley scale.
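The semantic distance calculation can be sketched as the difference between the mean 0-100 ratings of neighboring faces. This is a minimal illustration under assumptions: the abstract does not specify how distance was computed, and the face names and ratings below are made up.

```python
from statistics import mean
from typing import Dict, List


def adjacent_distances(ratings_by_face: Dict[str, List[float]], order: List[str]) -> List[float]:
    """Gaps between mean ratings of neighboring faces, in scale order.

    ratings_by_face maps a (hypothetical) face name to the 0-100 ratings
    respondents gave it; order lists the faces from most negative to most positive.
    """
    means = [mean(ratings_by_face[face]) for face in order]
    return [upper - lower for lower, upper in zip(means, means[1:])]


if __name__ == "__main__":
    # Made-up ratings for a five-face scale (not study data).
    ratings = {
        "very_unhappy": [2.0, 8.0, 5.0],
        "unhappy": [25.0, 30.0, 35.0],
        "neutral": [50.0, 48.0, 52.0],
        "happy": [70.0, 75.0, 80.0],
        "very_happy": [95.0, 98.0, 92.0],
    }
    faces = ["very_unhappy", "unhappy", "neutral", "happy", "very_happy"]
    print(adjacent_distances(ratings, faces))
```

Roughly equal gaps would indicate that the faces are spread evenly along the numeric range, while unequal gaps would point to faces that respondents perceive as closer together than their position on the scale implies.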
Assessing the validity of inferences from scores on the cognitive reflection test
Nikki Blacksmith
Tara S. Behrend
Gregory A. Ruark
Journal of Behavioral Decision Making, 32 (2019), pp. 599-612
Decision‐making researchers purport that a novel cognitive ability construct, cognitive reflection, explains variance in intuitive thinking processes that traditional mental ability constructs do not. However, researchers have questioned the validity of the primary measure because of poor construct conceptualization and lack of validity studies. Prior studies have not adequately aligned the analytical techniques with the theoretical basis of the construct, dual‐processing theory of reasoning. The present study assessed the validity of inferences drawn from the cognitive reflection test (CRT) scores. We analyzed response processes with an item response tree model, a method that aligns with the dual‐processing theory in order to interpret CRT scores. Findings indicate that the intuitive and reflective factors that the test purportedly measures were indistinguishable. Exploratory, post hoc analyses demonstrate that CRT scores are most likely capturing mental abilities. We suggest that future researchers recognize and distinguish between individual differences in cognitive abilities and cognitive processes.
From Big Data to Big Analytics: Automated Analytic Platforms for Data Exploration
Jonathan Kroening
Rich Timpone
BigSurv 18 (Big Data Meet Survey Science) conference, Barcelona, Spain (2018)
As Big Data has altered the face of research, the same factors of Volume, Velocity and Variety used to define it are changing the opportunities for analytic data exploration as well; hence the introduction of the term Big Analytics. Improvements in algorithms and computing power provide the foundation to produce automated platforms that can identify patterns in analytic model results beyond simply looking at the patterns in the data itself.
The class of Automated Analysis Insight Exploration Platforms makes it possible to run tens or hundreds of thousands of statistical models and explore them to identify systematic changes in dynamic environments that would often be missed otherwise. These techniques are designed to extract more value out of both traditional survey data and Big Data, and are relevant for academic, industry, governmental and NGO exploration of new insights into changing patterns of attitudes and behaviors.
This paper discusses the architecture of our Ipsos Research Insight Scout (IRIS) and then provides examples of it in action to identify insights for scientific and practical discovery in public opinion and business data. From the Ipsos Global Advisor Study we show examples from the U.S. withdrawal from the Paris Agreement and the 2016 presidential election. We then show with an example how a research project at Google is leveraging these platforms to inform business decision-making.
Justice Rising - The Growing Ethical Importance of Big Data, Survey Data, Models and AI
Rich Timpone
BigSurv 18 (Big Data Meet Survey Science) conference, Barcelona, Spain (2018)
In past work, the criteria of Truth, Beauty, and Justice have been leveraged to evaluate models (Lave and March 1993, Taber and Timpone 1996). Earlier, while relevant, Justice was seen as the least important of modeling considerations, but that is no longer the case. As the nature of data and computing power have opened new opportunities for the application of data and algorithms from public policy decision-making to technological advances like self-driving cars, the ethical considerations have become far more important in the work that researchers are doing.
While a growing literature has been highlighting ethical concerns of Big Data, algorithms and artificial intelligence, we take a practical approach of reviewing how decisions throughout the research process can result in unintended consequences in practice. Building off Gawande’s (2009) approach of using checklists to reduce risks, we have developed an initial framework and set of checklist questions for researchers to consider the ethical implications of their analytic endeavors explicitly. While many of the aspects considered are tied to Truth and accuracy, our examples show that considering research design through the lens of Justice may lead to different research choices.
These checklists include questions on the collection of data (Big Data and Survey; including sources and measurement), how it is modeled and finally issues of transparency. These issues are of growing importance for practitioners from academia to industry to government and will allow us to advance the intended goals of our scientific and practical endeavors while avoiding potential risks and pitfalls.