
Xiaofan Zhang
Authored Publications
Reconfigurable Stream Network Architecture
Chengyue Wang
Jason Cong
James Hoe
International Symposium on Computer Architecture (ISCA) (2025)
As AI systems grow increasingly specialized and complex, managing hardware heterogeneity becomes a pressing challenge. How can we efficiently coordinate and synchronize heterogeneous hardware resources to achieve high utilization? How can we minimize the friction of transitioning between diverse computation phases, reducing costly stalls from initialization, pipeline setup, or drain? Our insight is that a network abstraction at the ISA level naturally unifies heterogeneous resource orchestration and phase transitions. This paper presents a Reconfigurable Stream Network Architecture (RSN), a novel ISA abstraction designed for the DNN domain. RSN models the datapath as a circuit-switched network with stateful functional units as nodes and data streaming on the edges. Programming a computation corresponds to triggering a path. Software is explicitly exposed to the compute and communication latency of each functional unit, enabling precise control over data movement for optimizations such as compute-communication overlap and layer fusion. As nodes in a network naturally differ, the RSN abstraction can efficiently virtualize heterogeneous hardware resources by separating control from the data plane, enabling low instruction-level intervention. We build a proof-of-concept design RSN-XNN on VCK190, a heterogeneous platform with FPGA fabric and AI engines. Compared to the SOTA solution on this platform, it reduces latency by 6.1x and improves throughput by 2.4x-3.2x. Compared to the T4 GPU with the same FP32 performance, it matches latency with only 18% of the memory bandwidth. Compared to the A100 GPU at the same 7nm process node, it achieves 2.1x higher energy efficiency in FP32.
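The Python sketch below is only an illustration of the abstraction described in the abstract, with invented node names and latencies; it is not the RSN ISA or RSN-XNN code. It models the datapath as a circuit-switched network whose nodes are stateful functional units with software-visible latencies, and where programming a computation corresponds to triggering a path.

```python
# Toy model of a stream network: nodes are stateful functional units,
# edges carry streaming data, and an instruction triggers a path.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    compute_latency: int      # cycles per tile, exposed to software

@dataclass
class Edge:
    src: str
    dst: str
    stream_latency: int       # cycles to stream one tile across the link

@dataclass
class StreamNetwork:
    nodes: dict = field(default_factory=dict)
    edges: dict = field(default_factory=dict)

    def add_node(self, node: Node):
        self.nodes[node.name] = node

    def add_edge(self, edge: Edge):
        self.edges[(edge.src, edge.dst)] = edge

    def trigger_path(self, path: list[str]) -> int:
        """Programming a computation corresponds to triggering a path; return the
        pipeline fill latency so software can plan compute/communication overlap."""
        latency = 0
        for src, dst in zip(path, path[1:]):
            latency += self.nodes[src].compute_latency
            latency += self.edges[(src, dst)].stream_latency
        return latency + self.nodes[path[-1]].compute_latency

# Example: a conv unit streaming into a GEMM unit, then an accumulator.
net = StreamNetwork()
for n in [Node("conv", 4), Node("gemm", 8), Node("acc", 1)]:
    net.add_node(n)
net.add_edge(Edge("conv", "gemm", 2))
net.add_edge(Edge("gemm", "acc", 1))
print(net.trigger_path(["conv", "gemm", "acc"]))  # fill latency in cycles
```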
SSDTrain: Faster Large Language Model Training Using SSD-Based Activation Offloading
Kun Wu
Jeongmin Brian Park
Mert Hidayetoğlu
Vikram Sharma Mailthody
Sitao Huang
Steven Lumetta
Wen-mei Hwu
Design Automation Conference (DAC) (2025)
The scaling up of Large Language Models (LLMs) demands more memory than current GPUs can provide, hindering the training process. To address this challenge, we propose SSDTrain to efficiently offload activations, the intermediate tensors produced during LLM training, to SSDs. This approach reduces GPU memory usage without impacting performance by adaptively overlapping data transfers with computation. SSDTrain is compatible with popular deep learning frameworks like PyTorch, Megatron, and DeepSpeed, and it employs techniques such as tensor deduplication, forwarding, and adaptive offloading to further enhance efficiency. We conduct extensive experiments on Llama, BERT, and T5. Results demonstrate that SSDTrain reduces peak activation memory usage by 45% and fully overlaps the I/O with computation without introducing a performance penalty. SSDTrain can achieve a performance boost of up to 31% compared to the conventional training strategy using the same GPU systems.
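As a rough illustration of the underlying mechanism (not SSDTrain's implementation), the sketch below uses PyTorch's saved-tensor hooks to push large saved activations to SSD-backed files through a background thread during the forward pass and to read them back for the backward pass. The size threshold and file paths are arbitrary, and a real system would also skip model parameters and manage prefetching.

```python
# Offload large saved tensors to files on SSD, overlapping the write with compute.
import os, tempfile, uuid
from concurrent.futures import ThreadPoolExecutor
import torch

_io_pool = ThreadPoolExecutor(max_workers=2)
_offload_dir = tempfile.mkdtemp(prefix="act_offload_")  # assumed to live on an SSD

def _write(path, tensor):
    torch.save(tensor, path)
    return path

def pack_to_ssd(tensor):
    # Keep small tensors in memory; a real system would also filter out parameters.
    if tensor.numel() < (1 << 14):
        return ("mem", tensor)
    path = os.path.join(_offload_dir, f"{uuid.uuid4().hex}.pt")
    cpu_copy = tensor.detach().to("cpu", non_blocking=True)
    future = _io_pool.submit(_write, path, cpu_copy)  # write overlaps with ongoing compute
    return ("ssd", future, tensor.device)

def unpack_from_ssd(packed):
    if packed[0] == "mem":
        return packed[1]
    _, future, device = packed
    path = future.result()          # blocks only if the write is still in flight
    return torch.load(path).to(device)

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.ReLU(),
                            torch.nn.Linear(4096, 4096))
x = torch.randn(8, 4096, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack_to_ssd, unpack_from_ssd):
    loss = model(x).sum()
loss.backward()
print(x.grad.shape)
```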
AutoAI2C: An Automated Hardware Generator for DNN Acceleration on both FPGA and ASIC
Yongan Zhang
Pengfei Xu
Yang Zhao
Cong Hao
Deming Chen
Yingyan Lin
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2024)
Recent advancements in Deep Neural Networks (DNNs) and the slowing of Moore's law have made domain-specific hardware accelerators for DNNs (i.e., DNN chips) a promising means for enabling more extensive DNN applications. However, designing DNN chips is challenging due to (1) the vast and non-standardized design space and (2) different DNN models' varying performance preferences regarding hardware micro-architecture and dataflows. Therefore, designing a DNN chip often takes a large team of interdisciplinary experts months to years. To enable flexible and efficient DNN chip design, we propose AutoAI2C: a DNN chip generator that can automatically generate both FPGA- and ASIC-based DNN accelerator implementations (i.e., synthesizable hardware and deployment code) with optimized algorithm-to-hardware mapping, given the DNN model specification from mainstream machine learning frameworks (e.g., PyTorch). Specifically, AutoAI2C consists of two major components: (1) a Chip Predictor, which can efficiently and reliably predict a DNN accelerator's energy, latency, and resource consumption using a customized graph-based intermediate accelerator representation, and (2) a Chip Builder, which can generate and optimize DNN accelerator designs by automatically exploring the design space based on target metrics and the Chip Predictor's performance feedback. Extensive experiments show that our Chip Predictor's predictions differ by less than 10% from real-measured results. Furthermore, AutoAI2C-generated accelerators can achieve performance comparable to or better than (1.10× to 2.12× speedup) state-of-the-art accelerators, validating the effectiveness and advantages of AutoAI2C.
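A rough sketch of how the two components could interact, using hypothetical configuration knobs and a toy analytical cost model in place of AutoAI2C's graph-based Chip Predictor and its actual search strategy:

```python
# Hypothetical predictor/builder loop: the predictor scores a candidate
# accelerator configuration, the builder searches the space using its feedback.
import random

def chip_predictor(config):
    """Stand-in cost model returning (latency, energy, dsp_count); a real
    predictor would evaluate a graph-based accelerator representation."""
    pes, buffer_kb, unroll = config["pes"], config["buffer_kb"], config["unroll"]
    latency = 1e6 / (pes * unroll)
    energy = 0.02 * pes + 0.001 * buffer_kb
    dsps = pes * unroll
    return latency, energy, dsps

def chip_builder(search_space, budget_dsps, iters=200):
    """Randomized design-space exploration guided by predictor feedback."""
    best, best_latency = None, float("inf")
    for _ in range(iters):
        config = {k: random.choice(v) for k, v in search_space.items()}
        latency, energy, dsps = chip_predictor(config)
        if dsps <= budget_dsps and latency < best_latency:
            best, best_latency = config, latency
    return best, best_latency

space = {"pes": [64, 128, 256], "buffer_kb": [128, 256, 512], "unroll": [2, 4, 8]}
print(chip_builder(space, budget_dsps=1024))
```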
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
Haoran You
Yipin Guo
Yichao Fu
Wei Zhou
Huihong Shi
Souvik Kundu
Yingyan Lin
38th Annual Conference on Neural Information Processing Systems (NeurIPS) (2024)
Large language models (LLMs) have shown impressive performance in language tasks but face challenges when deployed on devices due to their extensive parameters and reliance on dense multiplications, resulting in high memory demands and significant latency bottlenecks. Shift-and-add reparameterization offers a solution by replacing costly multiplications with efficient hardware primitives for both attention and multi-layer perceptrons (MLPs). However, current reparameterization techniques necessitate training from scratch or full parameter fine-tuning to restore accuracy, which is often impractical for LLMs. To this end, we propose accelerating pretrained LLMs through post-training shift-and-add reparameterization, towards efficient multiplication-less LLMs, dubbed ShiftAddLLM. Specifically, we quantize and reparameterize weight matrices in LLMs into binary matrices of identical shape, coupled with scaling factor matrices of reduced dimensions. Each scaling factor, corresponding to a group of weights, is quantized to powers of two. Such a reparameterization transforms the original multiplications between weights and activations into two steps: (1) bitwise shifts between activations and scaling factors, and (2) queries and additions of these results with the binary matrices. To mitigate accuracy drops, we adopt multiple optimization objectives when optimizing the reparameterization. To further reduce memory usage and latency, we develop a mixed and automatic bit allocation strategy that enables extreme quantization of LLMs. Moreover, we introduce ShiftAddLoRA to fine-tune the post-training ShiftAddLLM, achieving both fast and accurate inference and fine-tuning. Extensive experiments on various LLMs and downstream language tasks consistently validate the effectiveness of ShiftAddLLM.
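A minimal numerical sketch of the reparameterization idea follows, assuming a single binary matrix per weight and per-group scales snapped to powers of two; the actual ShiftAddLLM pipeline additionally uses multiple optimization objectives, mixed and automatic bit allocation, and ShiftAddLoRA, none of which appear here.

```python
# Approximate W·x with a ±1 matrix (adds/subtracts) and power-of-two group
# scales (bitwise shifts), then measure the reconstruction error.
import torch

def reparameterize(weight, group_size=64):
    out_dim, in_dim = weight.shape
    w = weight.view(out_dim, in_dim // group_size, group_size)
    # Per-group scale, snapped to the nearest power of two.
    scale = w.abs().mean(dim=-1, keepdim=True).clamp(min=1e-8)
    scale = torch.exp2(torch.round(torch.log2(scale)))
    binary = torch.sign(w)             # entries in {-1, 0, +1}; map 0 to +1
    binary[binary == 0] = 1.0
    return binary, scale

def shiftadd_matmul(binary, scale, x, group_size=64):
    # (1) ±1 "queries and additions": per-group dot products are pure adds/subs;
    # (2) power-of-two scaling: a bitwise shift in hardware.
    xg = x.view(-1, x.shape[-1] // group_size, group_size)      # (batch, groups, g)
    partial = torch.einsum("ogk,bgk->bog", binary, xg)          # add/sub only
    return (partial * scale.squeeze(-1)).sum(dim=-1)            # shift + accumulate

W = torch.randn(256, 512)
x = torch.randn(4, 512)
B, s = reparameterize(W)
approx = shiftadd_matmul(B, s, x)
exact = x @ W.t()
print((approx - exact).norm() / exact.norm())   # relative reconstruction error
```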
EH-DNAS: End-to-End Hardware-aware Differentiable Neural Architecture Search
Qian Jiang
Deming Chen
Minh N. Do
Raymond Yeh
ICML 2023 Workshop on Differentiable Almost Everything: Differentiable Relaxations, Algorithms, Operators, and Simulators
In hardware-aware Differentiable Neural Architecture Search (DNAS), it is challenging to integrate hardware metrics into network architecture search. To handle hardware metrics such as inference latency, existing works mainly rely on linear approximations and lack support for various customized hardware. In this work, we propose End-to-end Hardware-aware DNAS (EH-DNAS), a seamless integration of an end-to-end differentiable approximation of hardware performance and a fully automated DNAS, to deliver hardware-efficient deep neural networks on various hardware, including Edge GPUs, Edge TPUs, Mobile CPUs, and customized accelerators. Given a targeted hardware platform, we propose to learn a differentiable model that predicts the end-to-end hardware performance of neural network architectures during DNAS. We also propose E2E-Perf, a benchmarking tool that extends our design to support customized accelerators. Experiments on CIFAR10 and ImageNet show that EH-DNAS improves hardware performance by an average of 1.5× on customized accelerators and existing hardware processors compared to state-of-the-art efficient networks, while maintaining highly competitive model inference accuracy.
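The snippet below is a hedged sketch of the general recipe rather than EH-DNAS or E2E-Perf: a small differentiable predictor maps a relaxed architecture encoding to an end-to-end latency estimate, so a latency term can be added to the search loss and backpropagated through the architecture parameters.

```python
# Learned differentiable latency predictor plugged into a DNAS-style loss.
import torch
import torch.nn as nn

class LatencyPredictor(nn.Module):
    """Assumed to be trained offline on (architecture encoding, measured latency)
    pairs collected from the target hardware or a benchmarking tool."""
    def __init__(self, encoding_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(encoding_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, arch_encoding):
        return self.net(arch_encoding).squeeze(-1)

encoding_dim = 16                 # e.g., per-layer op-choice probabilities, flattened
predictor = LatencyPredictor(encoding_dim)

# During search, architecture parameters are relaxed to softmax weights, so the
# latency estimate stays differentiable w.r.t. them.
arch_logits = torch.randn(encoding_dim, requires_grad=True)
arch_encoding = torch.softmax(arch_logits.view(4, 4), dim=-1).flatten()

task_loss = torch.tensor(1.25)    # placeholder for the usual cross-entropy term
lam = 0.1
loss = task_loss + lam * predictor(arch_encoding)
loss.backward()                   # gradients reach arch_logits through the predictor
print(arch_logits.grad.abs().sum())
```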
AutoDistill
Recently, large pre-trained models have significantly improved the performance of various Natural Language Processing (NLP) tasks, but they are expensive to serve due to long serving latency and large memory usage. To compress these models, knowledge distillation has attracted an increasing amount of interest as one of the most effective methods for model compression. However, existing distillation methods have not yet addressed the unique challenges of model serving in datacenters, such as handling fast-evolving models, considering serving performance, and optimizing for multiple objectives. To solve these problems, we propose AutoDistill, an end-to-end model distillation framework integrating model architecture exploration and multi-objective optimization for building hardware-efficient NLP pre-trained models. We use Bayesian Optimization to conduct multi-objective Neural Architecture Search for selecting student model architectures. The proposed search comprehensively considers both prediction accuracy and serving latency on target hardware. Experiments on TPUv4i find seven model architectures with better pre-trained accuracy (up to 3.2% higher) and lower inference latency (up to 1.44× faster) than MobileBERT. On downstream NLP tasks in the GLUE benchmark, the model distilled for pre-training by AutoDistill with 28.5M parameters achieves an 81.69 average score, which is higher than BERT_BASE, DistilBERT, TinyBERT, NAS-BERT, and MobileBERT. The most compact model found by AutoDistill contains only 20.6M parameters but still outperforms BERT_BASE (109M), DistilBERT (67M), TinyBERT (67M), and MobileBERT (25.3M) on the average GLUE score. On SQuAD, a model found by AutoDistill achieves an 88.4% F1 score with 22.8M parameters, reducing parameters by more than 62% while maintaining higher accuracy than DistilBERT, TinyBERT, and NAS-BERT.
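For illustration only, the toy code below shows one way a multi-objective selection step over candidate student models could look; the candidate numbers are invented, and the real framework proposes architectures with Bayesian Optimization and measures accuracy and latency on the target hardware.

```python
# Keep candidate student architectures that are Pareto-optimal in
# (higher accuracy, lower latency).
def pareto_front(candidates):
    front = []
    for name, acc, lat in candidates:
        dominated = any(a >= acc and l <= lat and (a > acc or l < lat)
                        for _, a, l in candidates)
        if not dominated:
            front.append((name, acc, lat))
    return front

candidates = [   # (architecture id, pre-training accuracy, serving latency in ms)
    ("student_a", 0.823, 4.1),
    ("student_b", 0.816, 2.9),
    ("student_c", 0.801, 3.5),   # dominated by student_b
    ("student_d", 0.831, 5.6),
]
print(pareto_front(candidates))
```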