The growing adoption of AI applications has led to an increased demand for deploying neural networks on diverse device platforms. However, even modest networks now require specialized hardware for efficient execution due to their rising computational cost. To address this, distributed execution across connected, resource-constrained devices is gaining importance. While prior work relies on empirical models or supports only limited partitioning, we present ADaPS, a novel framework for distributing Convolutional Neural Network (CNN) inference workloads across heterogeneous embedded devices. Our analytical model partitions the height and width dimensions of 4D tensors and explores layer fusion opportunities, accounting for compute, memory, and communication constraints. ADaPS efficiently explores the vast partitioning space using a tree-based hybrid optimization algorithm combining Alpha-Beta pruning and dynamic programming. Evaluations on multiple CNNs and device configurations show that ADaPS improves average inference latency by up to 1.2x while significantly reducing data transfers compared to state-of-the-art methods.
Xinmiao Zhang (SKLP, Institute of Computing Technology, CAS), Zheng Feng (Institute of Computing Technology, CAS), Shengwen Liang (SKLP, Institute of Computing Technology, CAS), Xinyu Chen (Hong Kong University of Science and Technology), Lei Zhang (Institute of Computing Technology, CAS), Cheng Liu (Institute of Computing Technology, CAS)
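To make the height/width partitioning named in the abstract concrete, below is a minimal sketch, not the paper's implementation: it splits a 4D activation tensor along the height dimension into per-device slices, assuming an NCHW layout and a one-row halo of overlap so each device can apply a 3x3 convolution to its slice without further communication. The function name `partition_height` and its parameters are illustrative assumptions, not identifiers from ADaPS.

```python
import numpy as np

def partition_height(tensor, num_devices, halo=1):
    """Split a 4D NCHW tensor along the height axis into per-device
    slices, adding `halo` overlapping rows on each side so a local
    convolution (e.g., 3x3 with halo=1) needs no extra communication.
    Illustrative sketch only; ADaPS itself is an analytical framework."""
    n, c, h, w = tensor.shape
    # Evenly spaced row boundaries; a heterogeneity-aware scheme would
    # instead size each slice to the device's compute capacity.
    bounds = np.linspace(0, h, num_devices + 1, dtype=int)
    slices = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        lo = max(start - halo, 0)   # extend upward for the halo rows
        hi = min(end + halo, h)     # extend downward for the halo rows
        slices.append(tensor[:, :, lo:hi, :])
    return slices

# Example: a 1x3x32x32 activation split across 4 devices.
x = np.random.rand(1, 3, 32, 32).astype(np.float32)
parts = partition_height(x, num_devices=4, halo=1)
print([p.shape for p in parts])  # slice heights 9/10/10/9 including halos
```

Partitioning the width dimension is symmetric (slice axis 3 instead of axis 2), and the halo rows exchanged between neighboring slices are an example of the communication cost that a partitioning framework must weigh against compute balance.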