Shumin Zhai
Shumin Zhai is a Human-Computer Interaction research scientist at Google where he leads and directs research, design, and development of input methods and haptics systems on Google’s and its partner’s flagship products. His research career has contributed to foundational models and understandings of human-computer interaction as well as practical user interface inventions and products based on his scientific and technical insights. He originated and led the SHARK/ShapeWriter project at IBM Research and a start-up company that pioneered the touchscreen word-gesture keyboard paradigm, filing the first patents of this paradigm, publishing the first generation of scientific papers, releasing the first word-gesture keyboard in 2004 and a top ranked (6th) iPhone app called ShapeWriter WritingPad in 2008. His publications have won the ACM UIST Lasting Impact Award and a IEEE Computer Society Best Paper Award, among others. He served as the 4th Editor-in-Chief of ACM Transactions on Computer-Human Interaction, and frequently contributes to other academic boards and program committees. He received his Ph.D. degree at the University of Toronto in 1995. In 2006, he was selected as one of ACM's inaugural class of Distinguished Scientists. In 2010 he was named Member of the CHI Academy and Fellow of the ACM.
His external web page is at www.shuminzhai.com.
Authored Publications
Sort By
Rambler: Supporting Writing With Speech via LLM-Assisted Gist Manipulation
Susan Lin
Jeremy Warner
J.D. Zamfirescu-Pereira
Matthew G Lee
Sauhard Jain
Michael Xuelin Huang
Bjoern Hartmann
Can Liu
Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, Association for Computing Machinery, New York, NY, USA
Preview abstract
Dictation enables efficient text input on mobile devices. However, writing with speech can produce disfluent, wordy, and incoherent text and thus requires heavy post-processing. This paper presents Rambler, an LLM-powered graphical user interface that supports gist-level manipulation of dictated text with two main sets of functions: gist extraction and macro revision. Gist extraction generates keywords and summaries as anchors to support the review and interaction with spoken text. LLM-assisted macro revisions allow users to respeak, split, merge, and transform dictated text without specifying precise editing locations. Together they pave the way for interactive dictation and revision that help close gaps between spontaneously spoken words and well-structured writing. In a comparative study with 12 participants performing verbal composition tasks, Rambler outperformed the baseline of a speech-to-text editor + ChatGPT, as it better facilitates iterative revisions with enhanced user control over the content while supporting surprisingly diverse user strategies.
View details
SkipWriter: LLM-Powered Abbreviated Writing on Tablets
Zheer Xu
Mukund Varma T
Proceedings of UIST 2024 (2024)
Preview abstract
Large Language Models (LLMs) may offer transformative opportunities for text input, especially for physically demanding modalities like handwriting. We studied a form of abbreviated handwriting by designing, developing and evaluating a prototype, named SkipWriter, that convert handwritten strokes of a variable-length, prefix- based abbreviation (e.g., “ho a y” as handwritten strokes) into the intended full phrase (e.g., “how are you” in the digital format) based
on preceding context. SkipWriter consists of an in-production hand-writing recognizer and a LLM fine-tuned on this skip-writing task. With flexible pen input, SkipWriter allows the user to add and revise prefix strokes when predictions don’t match the user’s intent. An user evaluation demonstrated a 60% reduction in motor movements with an average speed of 25.78 WPM. We also showed that this reduction is close to the ceiling of our model in an offline simulation.
View details
Can Capacitive Touch Images Enhance Mobile Keyboard Decoding?
Billy Dou
Cedric Ho
Proceedings of UIST 2024 (2024)
Preview abstract
Capacitive touch sensors capture the two-dimensional spatial profile (referred to as a touch heatmap) of a finger's contact with a mobile touchscreen. However, the research and design of touchscreen mobile keyboards - one of the most speed- and accuracy-demanding touch interfaces - has focused on the location of the touch centroid derived from the touch image heatmap as the input, discarding the rest of the raw spatial signals. In this paper, we investigate whether touch heatmaps can be leveraged to further improve the tap decoding accuracy for mobile touchscreen keyboards. Specifically, we compared machine-learning models that decode user taps by using the centroids and/or the heatmaps as their input and studied the contribution due to the heatmap. The results show that adding the heatmap into the input feature set led to 21.4% relative reduction of character error rates on average, compared to using the centroid alone. Furthermore, we conducted online deployment testing of the heatmap-based decoder in a user study with 16 participants and observed lower error rate, faster typing speed, and higher self-reported satisfaction score based on the heatmap-based decoder than the centroid-based decoder. These findings underline the promise of utilizing touch heatmaps for improving typing experience in mobile keyboards.
View details
TapNet: The Design, Training, Implementation, and Applications of a Multi-Task Learning CNN for Off-Screen Mobile Input
Michael Xuelin Huang
Nazneen Nazneen
Alex Chao
ACM CHI Conference on Human Factors in Computing Systems, ACM (2021)
Preview abstract
Off-screen interaction offers great potential for one-handed and eyes-free mobile interaction. While a few existing studies have explored the built-in mobile phone sensors to sense off-screen signals, none met practical requirement. This paper discusses the design, training, implementation and applications of TapNet, a multi-task network that detects tapping on the smartphone using built-in accelerometer and gyroscope. With sensor location as auxiliary information, TapNet can jointly learn from data across devices and simultaneously recognize multiple tap properties, including tap direction and tap location. We developed four datasets consisting of over 180K training samples, 38K testing samples, and 87 participants in total. Experimental evaluation demonstrated the effectiveness of the TapNet design and its significant improvement over the state of the art. Along with the datasets, codebase, and extensive experiments, TapNet establishes a new technical foundation for off-screen mobile input.
View details
i’sFree: Eyes-Free Gesture Typing via a Touch-Enabled Remote Control
Suwen Zhu
Xiaojun Bi
Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, ACM, New York, NY, USA, 448:1-448:12 (to appear)
Preview abstract
Entering text without having to pay attention to the keyboard is compelling but challenging due to the lack of visual guidance. We propose i'sFree to enable eyes-free gesture typing on a distant display from a touch-enabled remote control. i'sFree does not display the keyboard or gesture trace but decodes gestures drawn on the remote control into text according to an invisible and shifting Qwerty layout. i'sFree decodes gestures similar to a general gesture typing decoder, but learns from the instantaneous and historical input gestures to dynamically adjust the keyboard location. We designed it based on the understanding of how users perform eyes-free gesture typing. Our evaluation shows eyes-free gesture typing is feasible: reducing visual guidance on the distant display hardly affects the typing speed. Results also show that the i’sFree gesture decoding algorithm is effective, enabling an input speed of 23 WPM, 46% faster than the baseline eyes-free condition built on a general gesture decoder. Finally, i'sFree is easy to learn: participants reached 22 WPM in the first ten minutes, even though 40% of them were first-time gesture typing users.
View details
Active Edge: Designing Squeeze Gestures for the Google Pixel 2
Claire Lee
Melissa Barnhart
Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, ACM, New York, NY, 274:1-274:13
Preview abstract
Active Edge is a feature of Google Pixel 2 smartphone devices that creates a force-sensitive interaction surface along their sides, allowing users to perform gestures by holding and squeezing their device. Supported by strain gauge elements adhered to the inner sidewalls of the device chassis, these gestures can be more natural and ergonomic than on-screen (touch) counterparts. Developing these interactions is an integration of several components: (1) an insight and understanding of the user experiences that benefit from squeeze gestures; (2) hardware with the sensitivity and reliability to sense a user's squeeze in any operating environment; (3) a gesture design that discriminates intentional squeezes from innocuous handling; and (4) an interaction design to promote a discoverable and satisfying user experience. This paper describes the design and evaluation of Active Edge in these areas as part of the product's development and engineering.
View details
M3 Gesture Menu: Design and Experimental Analyses of Marking Menus for Touchscreen Mobile Interaction
Kun Li
Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, ACM, New York, NY, USA, 249:1-249:14
Preview abstract
Despite their learning advantages in theory, marking menus have faced adoption challenges in practice, even on today's touchscreen-based mobile devices. We address these challenges by designing, implementing, and evaluating multiple versions of M3 Gesture Menu (M3), a reimagination of marking menus targeted at mobile interfaces. M3 is defined on a grid rather than in a radial space, relies on gestural shapes rather than directional marks, and has constant and stationary space use. Our first controlled experiment on expert performance showed M3 was faster and less error-prone by a factor of two than traditional marking menus. A second experiment on learning demonstrated for the first time that users could successfully transition to recall-based execution of a dozen commands after three ten-minute practice sessions with both M3 and Multi-Stroke Marking Menu. Together, M3, with its demonstrated resolution, learning, and space use benefits, contributes to the design and understanding of menu selection in the mobile-first era of end-user computing.
View details
Preview abstract
Word–Gesture keyboards allow users to enter text using continuous input strokes (also known as gesture typing or shape writing). We developed a production model of gesture typing input based on a human motor control theory of optimal control (specifically, modeling human drawing movements as a minimization of jerk—the third derivative of position). In contrast to existing models, which consider gestural input as a series of concatenated aiming movements and predict a user’s time performance, this descriptive theory of human motor control predicts the shapes and trajectories that users will draw. The theory is supported by an analysis of user-produced gestures that found qualitative and quantitative agreement between the shapes users drew and the minimum jerk theory of motor control. Furthermore, by using a small number of statistical via-points whose distributions reflect the sensorimotor noise and speed–accuracy trade-off in gesture typing, we developed a model of gesture production that can predict realistic gesture trajectories for arbitrary text input tasks. The model accurately reflects features in the figural shapes and dynamics observed from users and can be used to improve the design and evaluation of gestural input systems.
View details
A Cost–Benefit Study of Text Entry Suggestion Interaction
Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, ACM, New York, NY, pp. 83-88
Preview abstract
Mobile keyboards often present error corrections and word completions (suggestions) as candidates for anticipated user input. However, these suggestions are not cognitively free: they require users to attend, evaluate, and act upon them. To understand this trade-off between suggestion savings and interaction costs, we conducted a text transcription experiment that controlled interface assertiveness: the tendency for an interface to present itself. Suggestions were either always present (extraverted), never present (introverted), or gated by a probability threshold (ambiverted). Results showed that although increasing the assertiveness of suggestions reduced the number of keyboard actions to enter text and was subjectively preferred, the costs of attending to and using the suggestions impaired average time performance.
View details
Optimizing Touchscreen Keyboards for Gesture Typing
Brian Smith
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2015), ACM, New York, NY, USA, pp. 3365-3374
Preview abstract
Despite its growing popularity, gesture typing suffers from a major problem not present in touch typing: gesture ambiguity on the Qwerty keyboard. By applying rigorous mathematical optimization methods, this paper systematically investigates the optimization space related to the accuracy, speed, and Qwerty similarity of a gesture typing keyboard. Our investigation shows that optimizing the layout for gesture clarity (a metric measuring how unique word gestures are on a keyboard) drastically improves the accuracy of gesture typing. Moreover, if we also accommodate gesture speed, or both gesture speed and Qwerty similarity, we can still reduce error rates by 52% and 37% over Qwerty, respectively. In addition to investigating the optimization space, this work contributes a set of optimized layouts such as GK-D and GK-T that can immediately benefit mobile device users.
View details