Sejong University


Research News

Scalable transformer accelerator enables on-device execution of large language models

2025.07.22



Department of Semiconductor Systems Engineering

Dong-Sun Kim


▲Differences in processes with and without hardware accelerators


"Large language models (LLMs) like BERT and GPT are driving major advances in artificial intelligence, but their size and complexity typically require powerful servers and cloud infrastructure. Running these models directly on devices—without relying on external computation—has remained a difficult technical challenge.


A research team at Sejong University has developed a new hardware solution that may help change that. Their Scalable Transformer Accelerator Unit (STAU) is designed to execute various transformer-based language models efficiently on embedded systems. It adapts dynamically to different input sizes and model structures, making it especially well-suited for real-time on-device AI.


At the heart of the STAU is a Variable Systolic Array (VSA) architecture, which performs matrix operations—the core workload in transformer models—in a way that scales with the input sequence length. By feeding input data row by row and loading weights in parallel, the system reduces memory stalls and improves throughput. This is particularly important for LLMs, where sentence lengths and token sequences vary widely between tasks.
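As a rough illustration of this dataflow, the Python sketch below streams activation rows through a weight-stationary multiply so that the same loop serves any sequence length. It is only a software analogy for the idea; the function and variable names are hypothetical and not taken from the STAU design.

    # Illustrative model of a weight-stationary, row-streamed matrix multiply,
    # loosely mirroring how a variable systolic array scales with sequence length.
    # Names and dimensions are hypothetical, not from the paper.
    import numpy as np

    def row_streamed_matmul(activations: np.ndarray, weights: np.ndarray) -> np.ndarray:
        """Multiply (seq_len x d_model) activations by (d_model x d_out) weights,
        consuming one input row per step while the weights stay resident,
        so the same loop handles any sequence length."""
        seq_len, d_model = activations.shape
        d_model_w, d_out = weights.shape
        assert d_model == d_model_w, "inner dimensions must match"

        out = np.empty((seq_len, d_out), dtype=activations.dtype)
        for t in range(seq_len):                 # stream rows (tokens) one by one
            out[t] = activations[t] @ weights    # weights reused in place each step
        return out

    # The same routine serves a short and a long token sequence without change.
    w = np.random.rand(64, 64).astype(np.float32)
    short = row_streamed_matmul(np.random.rand(8, 64).astype(np.float32), w)
    long_ = row_streamed_matmul(np.random.rand(512, 64).astype(np.float32), w)
    print(short.shape, long_.shape)   # (8, 64) (512, 64)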


In benchmark tests published in Electronics, the accelerator demonstrated a 3.45× speedup over CPU-only execution while maintaining over 97% numerical accuracy. It also reduced total computation time by more than 68% when processing longer sequences. Since then, continued optimizations have further improved the system’s performance: according to the team, recent internal tests achieved a speedup of up to 5.18×, highlighting the architecture’s long-term scalability.


▲Top module architecture


▲Processing Element (PE) and Variable Systolic Array


The researchers also re-engineered a critical part of the transformer pipeline: the softmax function. Typically a bottleneck due to its reliance on exponentiation and normalization, it was redesigned using a lightweight Radix-2 approach that relies on shift-and-add operations. This reduces the hardware complexity without compromising output quality.
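The sketch below models the general idea in software: the exponential is replaced by a power of two, which hardware can realize with shift-and-add logic. It is an illustrative approximation rather than the exact circuit reported in the paper.

    # A rough software model of a radix-2 softmax: exp(x) is replaced by
    # 2**round(x * log2(e)), which a datapath can realize with shifts instead
    # of a full exponential unit. Illustrative only, not the published design.
    import math

    def radix2_softmax(scores):
        log2e = 1.0 / math.log(2.0)                     # log2(e)
        m = max(scores)                                 # subtract max for stability
        # Quantize each exponent to an integer so 2**k maps to a pure shift.
        shifts = [round((s - m) * log2e) for s in scores]
        powers = [math.ldexp(1.0, k) for k in shifts]   # 2**k via exponent manipulation
        total = sum(powers)
        return [p / total for p in powers]

    print(radix2_softmax([2.0, 1.0, 0.1]))
    # A coarse approximation of the exact softmax [0.66, 0.24, 0.10].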


To further simplify computation, the system uses a custom 16-bit floating-point format specifically tailored for transformer workloads. This format eliminates the need for layer normalization—another common performance bottleneck—and contributes to a more efficient, streamlined datapath.
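The article does not spell out the bit layout of this format, so the following sketch uses an assumed bfloat16-like split (1 sign bit, 8 exponent bits, 7 mantissa bits) purely to show how a 32-bit value can be narrowed to a 16-bit datapath word.

    # A rough sketch of a 16-bit floating-point conversion. The custom format in
    # the paper is not fully specified here, so the 1/8/7 layout below is only an
    # assumed stand-in for illustration.
    import struct

    def to_fp16_word(x: float) -> int:
        """Pack a float32 into 16 bits by rounding away the low 16 mantissa bits."""
        bits = struct.unpack(">I", struct.pack(">f", x))[0]   # raw float32 bits
        bits += 1 << 15                                        # round to nearest
        return (bits >> 16) & 0xFFFF

    def from_fp16_word(h: int) -> float:
        """Expand the 16-bit word back to float32 by zero-filling the low bits."""
        return struct.unpack(">f", struct.pack(">I", (h & 0xFFFF) << 16))[0]

    x = 3.14159
    h = to_fp16_word(x)
    print(hex(h), from_fp16_word(h))   # 0x4049 3.140625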


STAU was implemented on a Xilinx FPGA (VMK180) and controlled by an embedded Arm Cortex-R5 processor. This hybrid design allows developers to support a range of transformer models—including those used in LLMs—by simply updating software running on the processor, with no hardware modifications required.
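A simplified, hypothetical view of that software-defined control flow is sketched below: each model is described by a small configuration record, and switching models only means handing the accelerator a different record. The descriptor fields and function names are illustrative, not the actual driver interface.

    # Hypothetical sketch of descriptor-driven control: the embedded processor
    # describes each transformer in a small record and drives the same
    # accelerator for every layer, so supporting a new model is a software change.
    from dataclasses import dataclass

    @dataclass
    class TransformerConfig:        # illustrative descriptor, not the real driver API
        name: str
        num_layers: int
        hidden_size: int
        num_heads: int

    def run_on_accelerator(cfg: TransformerConfig, tokens: list) -> None:
        for layer in range(cfg.num_layers):
            # In the real system these would be register writes or DMA descriptors
            # issued by the Cortex-R5; here they are placeholders.
            print(f"{cfg.name}: layer {layer}, {len(tokens)} tokens, "
                  f"d={cfg.hidden_size}, heads={cfg.num_heads}")

    # Switching models is a software change only: same hardware, new descriptor.
    run_on_accelerator(TransformerConfig("BERT-base", 12, 768, 12), list(range(16)))
    run_on_accelerator(TransformerConfig("GPT-2-small", 12, 768, 12), list(range(32)))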


The team sees their work as a step toward making advanced language models more accessible and deployable across a broader range of platforms—including mobile devices, wearables, and edge computing systems—where real-time AI execution, privacy, and low-latency response are essential.


“The STAU architecture shows that transformer models, even large ones, can be made practical for on-device applications,” said lead author Seok-Woo Chang. “It provides a foundation for building intelligent systems that are both scalable and efficient.”


More information:

Seok-Woo Chang and Dong-Sun Kim, Scalable transformer accelerator with variable systolic array for multiple models in voice assistant applications, Electronics (2024). DOI: 10.3390/electronics13234683


Journal information: Electronics

