Track 1: Testing and Evaluation of LLMs and SWE Agents

Track leader at TU Delft: Annibale Panichella
Track leader at JetBrains: Pouria Derakhshanfar
PhD researcher(s): Ali Asgari (TU Delft)

Autonomous AI agents, leveraging the reasoning capabilities of LLMs and multi-agent orchestration, assist developers in coding, testing, and automating workflows. However, these agents introduce new operational challenges: ensuring the safety, robustness, reliability, and long-term maintainability of LLM-powered agents that solve software engineering tasks remains difficult. Because these agents rely on probabilistic and continuously evolving models, their behavior is hard to predict and control. Continuous testing and the (offline or online) evaluation of agent behavior in real-world environments are therefore crucial.

These challenges motivate research into more rigorous, scalable, and production-ready assessment methods for LLM-driven software engineering agents. This track pursues that research direction from multiple angles:

  1. User Experience and Operational Challenges: Investigating and understanding the practical issues and friction points users encounter when integrating and operating Software Engineering (SWE) agents in real-world development workflows.
  2. Metamorphic Testing for Robustness: Developing and applying metamorphic testing techniques to rigorously assess the robustness, reliability, and security of LLMs and multi-agent SWE systems, specifically targeting their probabilistic nature and unpredictable behavior (see the first sketch after this list).
  3. Efficient SWE Task Collection for Evaluation: Developing methods and tools for more efficient and scalable collection, curation, and generation of the representative SWE tasks and benchmarks needed for comprehensive and continuous evaluation of agent performance (see the second sketch after this list).
  4. Task Prioritization in Agent Development Cycles: Researching strategies and frameworks for prioritizing which SWE tasks to use in evaluation at different stages of the agent's development lifecycle (e.g., pre-training, fine-tuning, production deployment) to ensure high-impact assessment and resource efficiency (see the third sketch after this list).
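
As a concrete illustration of direction 2, here is a minimal sketch of a metamorphic test for a code model, written in Python. The query_model function is a hypothetical stand-in for an actual LLM call, not the API of any specific library; the metamorphic relation is that a semantics-preserving identifier rename must not change the model's verdict.

    import re

    def rename_identifier(code: str, old: str, new: str) -> str:
        """Semantics-preserving follow-up transformation: rename one identifier."""
        return re.sub(rf"\b{re.escape(old)}\b", new, code)

    def query_model(prompt: str) -> str:
        """Hypothetical stand-in for an LLM call; replace with a real client."""
        # Toy verdict for demonstration: flag code that divides by a length.
        return "buggy" if "/ len(" in prompt else "ok"

    def test_rename_invariance():
        source = "def mean(xs):\n    return sum(xs) / len(xs)\n"
        follow_up = rename_identifier(source, "xs", "values")
        # Metamorphic relation: the verdict must be identical for the original
        # input and its semantics-preserving follow-up.
        assert query_model(source) == query_model(follow_up), \
            "verdict changed under a semantics-preserving rename"

    if __name__ == "__main__":
        test_rename_invariance()
        print("Metamorphic relation held for this input pair.")

The same skeleton extends to other relations (prompt paraphrasing, statement reordering, comment removal), each probing a different robustness property without needing a ground-truth oracle.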
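
For direction 3, the sketch below shows one way a collected evaluation task could be represented, loosely modeled on SWE-bench-style records; the field names here are illustrative assumptions rather than an established schema.

    from dataclasses import dataclass, asdict
    import json

    @dataclass
    class SWETask:
        repo: str               # e.g., "owner/project" on a code host
        base_commit: str        # commit the agent starts from
        problem_statement: str  # issue text handed to the agent
        reference_patch: str    # known-good fix, used only for validation
        fail_to_pass_tests: list[str]  # tests that must flip from fail to pass

    task = SWETask(
        repo="example/project",
        base_commit="abc1234",
        problem_statement="mean() crashes on an empty list",
        reference_patch="(elided)",
        fail_to_pass_tests=["tests/test_stats.py::test_mean_empty"],
    )

    # Serializing tasks as JSON lines keeps the benchmark easy to grow,
    # filter, and version alongside the agent under evaluation.
    print(json.dumps(asdict(task)))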
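
Finally, for direction 4, here is one simple prioritization heuristic, sketched under the assumption that each task carries a historical failure rate and an execution cost: greedily pick the tasks with the highest failure rate per minute until the evaluation budget is exhausted.

    from dataclasses import dataclass

    @dataclass
    class EvalTask:
        name: str
        failure_rate: float   # fraction of past agent runs that failed this task
        cost_minutes: float   # average wall-clock cost of one evaluation run

    def prioritize(tasks: list[EvalTask], budget_minutes: float) -> list[EvalTask]:
        """Greedy selection by failure rate per minute of cost, under a budget."""
        ranked = sorted(tasks, key=lambda t: t.failure_rate / t.cost_minutes,
                        reverse=True)
        selected, spent = [], 0.0
        for task in ranked:
            if spent + task.cost_minutes <= budget_minutes:
                selected.append(task)
                spent += task.cost_minutes
        return selected

    if __name__ == "__main__":
        tasks = [EvalTask("flaky-http-retry", 0.40, 10.0),
                 EvalTask("rename-refactor", 0.05, 2.0),
                 EvalTask("cross-file-bugfix", 0.30, 30.0)]
        for task in prioritize(tasks, budget_minutes=15.0):
            print(task.name)  # prints the two tasks that fit the budget

A production scoring rule would likely also weigh task diversity, recency, and the development stage (pre-training vs. production), but the budgeted-selection skeleton stays the same.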

MSc Students:

Track News


Publications

  1. Ali Asgari, Milan de Koning, Pouria Derakhshanfar, and Annibale Panichella. Metamorphic Testing of Deep Code Models: A Systematic Literature Review. ACM Transactions on Software Engineering and Methodology (TOSEM), 2025
  2. Mitchell Olsthoorn. Improving the Comprehensibility of Generated Test Suites Using Test Case Clustering. 18th IEEE International Conference on Software Testing, Verification and Validation (ICST - Short Papers, Vision and Emerging Results Track), 2025
  3. Azat Abdullin, Pouria Derakhshanfar, and Annibale Panichella. Test Wars: A Comparative Study of SBST, Symbolic Execution, and LLM-Based Approaches to Unit Test Generation. 18th IEEE International Conference on Software Testing, Verification and Validation (ICST - Research Papers Track), 2025
  4. Annibale Panichella. Metamorphic-Based Many-Objective Distillation of LLMs for Code-related Tasks. 47th IEEE/ACM International Conference on Software Engineering (ICSE - Research Track), 2025
  5. Arkadii Sapozhnikov, Mitchell Olsthoorn, Annibale Panichella, Vladimir Kovalenko, and Pouria Derakhshanfar. TestSpark: IntelliJ IDEA's Ultimate Test Generation Companion. 46th ACM/IEEE International Conference on Software Engineering (ICSE), 2024
  6. Calin Georgescu, Mitchell Olsthoorn, Pouria Derakhshanfar, Marat Akhin, and Annibale Panichella. Evolutionary Generative Fuzzing for Differential Testing of the Kotlin Compiler. ACM International Conference on the Foundations of Software Engineering (FSE - Industry Track), 2024