Track 1: Testing and Evaluation of LLMs and SWE Agents
Track leader at TU Delft: Annibale Panichella
Track leader at JetBrains: Pouria Derakhshanfar
PhD researcher(s): Ali Asgari (TU Delft)
Autonomous AI agents, leveraging the reasoning capabilities of LLMs and multi-agent orchestration, assist developers in coding, testing, and automating workflows. However, these agents introduce new operational challenges. Ensuring the safety, robustness, reliability, and long-term maintainability of LLM-powered agents designed to solve software engineering tasks remains difficult: such agents rely on probabilistic, continuously evolving models, which makes their behavior hard to predict and control. Hence, continuous testing and the (offline or online) evaluation of agent behavior in real-world environments are crucial.
These challenges motivate research into more rigorous, scalable, and production-ready assessment methods for LLM-driven software engineering agents. This track therefore targets that research direction from multiple angles:
- User Experience and Operational Challenges: Investigating and understanding the practical issues and friction points users encounter when integrating and operating Software Engineering (SWE) agents in real-world development workflows.
- Metamorphic Testing for Robustness: Developing and applying metamorphic testing techniques to rigorously assess the robustness, reliability, and security of LLMs and multi-agent SWE systems, specifically targeting their probabilistic nature and unpredictable behavior.
- Efficient SWE Task Collection for Evaluation: Developing methods and tools for the more efficient and scalable collection, curation, and generation of representative SWE tasks and benchmarks necessary for comprehensive and continuous evaluation of agent performance.
- Task Prioritization in Agent Development Cycles: Researching strategies and frameworks for prioritizing which SWE tasks should be used in evaluation at different stages of the agent’s development lifecycle (e.g., pre-training, fine-tuning, production deployment) to ensure high-impact assessment and resource efficiency.
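The metamorphic-testing direction above can be illustrated with a minimal sketch: a semantics-preserving transformation of the input (here, renaming a variable) should leave a code model's prediction unchanged, sidestepping the need for a ground-truth oracle. The `predict` function below is a hypothetical stand-in for the LLM or deep code model under test, not an API from this track's tooling.

```python
# Minimal metamorphic-testing sketch. `predict` is a hypothetical
# stand-in for an LLM-based code model under test.
import re


def predict(code: str) -> str:
    # Toy "model": labels whether the snippet contains a loop.
    return "loop" if re.search(r"\b(for|while)\b", code) else "no-loop"


def rename_variable(code: str, old: str, new: str) -> str:
    # Semantics-preserving metamorphic transformation: rename an identifier.
    return re.sub(rf"\b{re.escape(old)}\b", new, code)


def metamorphic_check(code: str, old: str, new: str) -> bool:
    # Metamorphic relation: the prediction must be invariant under renaming.
    return predict(code) == predict(rename_variable(code, old, new))


snippet = "total = 0\nfor item in items:\n    total += item\n"
assert metamorphic_check(snippet, "total", "acc")
```

In practice the transformation pool would include statement reordering, dead-code insertion, or comment perturbation, and a violated relation flags a robustness failure of the model rather than an incorrect expected output.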
MSc Students:
- Remco Schrijver: Thesis
- Milan de Koning: Thesis
- Sergey Datskiv: Thesis
- Saga Rut Sunnevudóttir: Thesis
- Andrei Drăgoi
Track News
Publications
- Metamorphic Testing of Deep Code Models: A Systematic Literature Review. ACM Transactions on Software Engineering and Methodology (TOSEM), 2025
- Improving the Comprehensibility of Generated Test Suites Using Test Case Clustering. 18th IEEE International Conference on Software Testing, Verification and Validation (ICST - Short Papers, Vision and Emerging Results Track), 2025
- Test Wars: A Comparative Study of SBST, Symbolic Execution, and LLM-Based Approaches to Unit Test Generation. 18th IEEE International Conference on Software Testing, Verification and Validation (ICST - Research Papers Track), 2025
- Metamorphic-Based Many-Objective Distillation of LLMs for Code-related Tasks. The 47th IEEE/ACM International Conference on Software Engineering (ACM/IEEE ICSE - Research Track), 2025
- TestSpark: IntelliJ IDEA's Ultimate Test Generation Companion. Proceedings of the 2024 ACM/IEEE 46th International Conference on Software Engineering (ICSE), 2024
- Evolutionary Generative Fuzzing for Differential Testing of the Kotlin Compiler. FSE 2024 - Industry Track, 2024