PLDI 2025
Mon 16 - Fri 20 June 2025 Seoul, South Korea

The ability to extract features from source code is crucial for a range of computer science tasks, including vulnerability detection, code clone recognition, AI-generated snippet analysis, file search, and classification. Features are sequences of consecutive tokens that form structural elements according to the grammar of the programming language; these elements can in turn constitute logical blocks. A prominent idea, frequently observed in research, is that the same features tend to occur alongside similar neighboring tokens. However, most studies rely on a predefined set of labels for these features, such as Common Weakness Enumeration (CWE) categories for vulnerability detection, which often require either significant manual work or weak supervision. Moreover, these approaches often fail to account for the domain-specific nuances of the extracted features, i.e., the same code fragment may carry different meanings across tasks. For instance, in competitive programming, applying the same algorithm to different problems can yield different verdicts because of problem-specific time and memory limits.
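To make the notion of such features concrete, the sketch below linearizes a Python AST and collects n-grams of node-type names as candidate structural features. It is a minimal illustration built on the standard ast module, not the feature-extraction procedure used in this work; the function names and the choice of n are assumptions for exposition.

```python
import ast

def preorder(node):
    """Yield node-type names in a depth-first, preorder traversal of the AST."""
    yield type(node).__name__
    for child in ast.iter_child_nodes(node):
        yield from preorder(child)

def node_type_ngrams(source: str, n: int = 3):
    """N-grams over the linearized AST node types: a rough stand-in for
    sequences of consecutive tokens that form structural elements.
    (Illustrative only; not the extraction procedure of the paper.)"""
    types = list(preorder(ast.parse(source)))
    return [tuple(types[i:i + n]) for i in range(len(types) - n + 1)]

print(node_type_ngrams("def absval(x):\n    return -x if x < 0 else x"))
```

Each resulting n-gram (e.g., a FunctionDef followed by its argument list and a Return) is the kind of recurring fragment whose meaning may still differ from one domain to another.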

In this abstract, we propose a framework for automatic conditional feature extraction from source code, grouped by domain. Our approach combines a custom code representation and supervised learning with post hoc interpretability analysis to reveal the underlying patterns that lead to specific program behavior, conditioned on domain-specific contexts. We evaluate the effectiveness of our method on two tasks: vulnerability detection and error localization in student programming submissions. Finally, we demonstrate how our framework addresses the labeling-cost and domain-specificity challenges outlined above. The code, data, and trained models are available at https://github.com/Liudmila-Paskonova/ASTCODA.
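As a rough illustration of the framework's ingredients (a learned code representation, domain conditioning, and attention weights that can be inspected post hoc), the PyTorch sketch below attends over embedded AST-node tokens with a query derived from a domain embedding. The architecture, layer sizes, and class names are assumptions chosen for exposition and do not reproduce the actual ASTCODA model.

```python
import torch
import torch.nn as nn

class DomainAttentionSketch(nn.Module):
    """Illustrative model: a convolution over embedded AST-node tokens,
    with an attention query conditioned on a domain embedding.
    Hyperparameters and layer choices are assumptions, not ASTCODA's."""

    def __init__(self, vocab_size, n_domains, n_classes, dim=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.dom_emb = nn.Embedding(n_domains, dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, node_tokens, domain_id):
        # node_tokens: (batch, seq_len) linearized AST-node token ids
        x = self.tok_emb(node_tokens)                      # (B, L, D)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)   # local structural context
        q = self.dom_emb(domain_id).unsqueeze(1)           # (B, 1, D) domain query
        ctx, weights = self.attn(q, x, x)                  # domain-conditioned attention
        return self.head(ctx.squeeze(1)), weights          # logits + per-position weights

model = DomainAttentionSketch(vocab_size=5000, n_domains=10, n_classes=2)
tokens = torch.randint(0, 5000, (4, 64))
domains = torch.randint(0, 10, (4,))
logits, attn = model(tokens, domains)
print(logits.shape, attn.shape)  # torch.Size([4, 2]) torch.Size([4, 1, 64])
```

In a sketch of this shape, the returned attention weights indicate which AST positions the domain-conditioned query focuses on, which is the kind of signal a post hoc interpretability analysis can inspect to surface domain-specific feature patterns.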

ASTCODA: Abstract Syntax Tree Convolutions Operating on Domain Attention (pldi25src-paskonova.pdf, 1.53 MiB)