Large Language Models (LLMs) have demonstrated remarkable capabilities in text generation, translation, and summarization tasks. However, their deployment on resource-constrained systems remains challenging due to their large parameter counts and high computational demands. To address this, we propose SPARQ, a specialized accelerator architecture that leverages both sparsity and quantization to optimize LLM inference. By integrating multiply-accumulate units tailored for quantized operations with a systolic array architecture supporting N:M semi-structured sparsity, SPARQ significantly enhances area and energy efficiency while preserving model quality, since prior work on GPTQ and SparseGPT has shown that quantization and semi-structured pruning incur minimal accuracy loss.
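To make the N:M semi-structured sparsity pattern concrete, the following minimal Python sketch prunes a weight tensor to an N:M pattern by keeping the N largest-magnitude values in every group of M consecutive weights. This illustrates only the pruning pattern SPARQ exploits, not SPARQ's hardware implementation; the function name prune_nm is hypothetical, and the sketch assumes the weight count is divisible by M.

import numpy as np

def prune_nm(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Keep the n largest-magnitude values in every group of m
    consecutive weights, zeroing the rest (N:M semi-structured sparsity).
    Assumes weights.size is divisible by m."""
    w = weights.reshape(-1, m)                    # one row per group of m
    # indices of the (m - n) smallest-magnitude entries in each group
    drop = np.argsort(np.abs(w), axis=1)[:, : m - n]
    pruned = w.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)  # zero the pruned entries
    return pruned.reshape(weights.shape)

# Example: 2:4 sparsity keeps exactly 2 nonzeros per group of 4.
row = np.array([0.1, -0.9, 0.3, 0.05, 0.7, -0.2, 0.0, 0.4])
print(prune_nm(row))  # -> [ 0.  -0.9  0.3  0.   0.7  0.   0.   0.4]

Because exactly N of every M weights survive, the zero positions can be encoded with small per-group indices, which is what allows a systolic array to skip the pruned multiplications with predictable, hardware-friendly regularity.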
Our evaluations demonstrate that SPARQ achieves up to 1.53x higher area efficiency and 1.58x higher energy efficiency than the baseline accelerator, with the largest gains observed on larger models.
Xinmiao Zhang (SKLP, Institute of Computing Technology, Chinese Academy of Sciences), Zheng Feng (Institute of Computing Technology, Chinese Academy of Sciences), Shengwen Liang (SKLP, Institute of Computing Technology, Chinese Academy of Sciences), Xinyu Chen (Hong Kong University of Science and Technology), Lei Zhang (Institute of Computing Technology, Chinese Academy of Sciences), Cheng Liu (Institute of Computing Technology, Chinese Academy of Sciences)