Domain-specific, fixed-function units are becoming increasingly common in modern processors. As the computational demands of applications evolve, the capabilities and programming interfaces of these fixed-function units continue to change. NVIDIA's Hopper GPU architecture contains multiple fixed-function units per compute unit, including an asynchronous data movement unit (TMA) and an asynchronous matrix multiplication unit (Tensor Core). Efficiently utilizing these units requires a fundamentally different programming style than previous architectures; programmers must now develop warp-specialized kernels that orchestrate producer-consumer pipelines between the asynchronous units. To manage the complexity of programming these new architectures, we introduce Cypress, a task-based programming model with sequential semantics. Cypress programs are a set of designated functions called *tasks* that operate on *tensors* and are free of communication and synchronization. Cypress programs are bound to the target machine through a *mapping* specification that describes where tasks should run and in which memories tensors should be materialized. We present a compiler architecture that lowers Cypress programs into CUDA programs that perform competitively with expert-written codes. Cypress achieves 0.88x-1.06x the performance of cuBLAS on GEMM, and between 0.80x and 0.98x the performance of the currently best-known Flash Attention implementation, while eliminating all aspects of explicit data movement and asynchronous computation from application code.
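The warp-specialized producer-consumer style that the abstract says Cypress abstracts away can be illustrated with a minimal CUDA sketch. This is not Cypress or the paper's generated code; all names are illustrative, and it uses a simple shared-memory flag in place of the Hopper mbarrier/TMA machinery a real kernel would use. One warp stages tiles of the input into a circular shared-memory buffer while a second warp consumes and reduces them:

```cuda
#include <cuda_runtime.h>

#define TILE 256
#define STAGES 4  // depth of the circular shared-memory pipeline

// Illustrative warp-specialized pipeline: warp 0 produces (copies tiles
// from global to shared memory), warp 1 consumes (reduces each tile).
// A volatile per-stage flag stands in for Hopper's asynchronous barriers.
__global__ void pipeline(const float* in, float* out, int n_tiles) {
    __shared__ float buf[STAGES][TILE];
    __shared__ volatile int ready[STAGES];  // 0 = empty, 1 = full

    int warp = threadIdx.x / 32;
    int lane = threadIdx.x % 32;

    if (threadIdx.x < STAGES) ready[threadIdx.x] = 0;
    __syncthreads();

    if (warp == 0) {
        // Producer: wait for a free slot, then stage the next tile.
        for (int t = 0; t < n_tiles; ++t) {
            int s = t % STAGES;
            while (ready[s] != 0) { /* spin until consumer drains slot */ }
            for (int i = lane; i < TILE; i += 32)
                buf[s][i] = in[t * TILE + i];
            __threadfence_block();
            if (lane == 0) ready[s] = 1;
        }
    } else if (warp == 1) {
        // Consumer: wait for a full slot, reduce it, then release it.
        for (int t = 0; t < n_tiles; ++t) {
            int s = t % STAGES;
            while (ready[s] != 1) { /* spin until producer fills slot */ }
            float v = 0.f;
            for (int i = lane; i < TILE; i += 32)
                v += buf[s][i];
            for (int off = 16; off > 0; off >>= 1)
                v += __shfl_down_sync(0xffffffff, v, off);
            if (lane == 0) out[t] = v;
            __threadfence_block();
            if (lane == 0) ready[s] = 0;
        }
    }
}
```

Launched with at least two warps per block (e.g. `pipeline<<<1, 64>>>(in, out, n)`), the two warps overlap data movement with computation. Even this toy version shows the bookkeeping (stage indices, flags, fences) that the paper's sequential-semantics model removes from application code.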
Wed 18 Jun (times shown in the Seoul time zone)
16:00 - 17:20 | High Performance Computing (PLDI Research Papers) at Orchid
Chair(s): Charith Mendis (University of Illinois at Urbana-Champaign)

16:00 (20m, Talk) | Task-Based Tensor Computations on Modern GPUs
Rohan Yadav (Stanford University), Michael Garland (NVIDIA), Alex Aiken (Stanford University), Michael Bauer (NVIDIA)

16:20 (20m, Talk) | Lightweight and Locality-Aware Composition of Black-Box Subroutines
Manya Bansal (Massachusetts Institute of Technology), Dillon Sharlet (Google), Jonathan Ragan-Kelley (Massachusetts Institute of Technology), Saman Amarasinghe (Massachusetts Institute of Technology)

16:40 (20m, Talk) | Modular Construction and Optimization of the UZP Sparse Format for SpMV on CPUs
Alonso Rodríguez-Iglesias (Universidade da Coruña), Santoshkumar T. Tongli (Colorado State University), Emily Tucker (Colorado State University), Louis-Noël Pouchet (Colorado State University), Gabriel Rodríguez (Universidade da Coruña), Juan Tourino (Universidade da Coruña)

17:00 (20m, Talk, Remote) | Dynamic Robustness Verification against Weak Memory
Roy Margalit (Tel Aviv University), Michalis Kokologiannakis (ETH Zurich), Shachar Itzhaky (Technion), Ori Lahav (Tel Aviv University)