Lin Tan (Purdue) - LLMs for Code: More Data or More Domain Knowledge? Can They Replace Programmers?
Abstract: Recent work applies deep learning, including large language models (LLMs), to coding tasks such as code generation, automated program repair, security vulnerability fixing, and binary analysis. An important question is whether adding more data or more domain knowledge to deep-learning models is the more effective way to improve LLMs for code. I will discuss existing studies and techniques that answer this question positively or negatively. I will also introduce our code-generation benchmark RepoCod, which answers the question, “Can Language Models Replace Programmers?”, to some extent. RepoCod tasks are real-world, whole-function code-generation tasks with repository-level context, and they include test cases for validation. Our results show that GPT-4o and other LLMs achieve < 30% pass@1 on RepoCod’s code-generation tasks.
Speakers
Lin Tan
Lin Tan is a Mary J. Elmore New Frontiers Professor in the Department of Computer Science at Purdue University. She received her PhD from the University of Illinois, Urbana-Champaign. Prior to joining Purdue, she was a Canada Research Chair and an associate professor at the University of Waterloo. Her research interests include software dependability, software-AI synergy, and software text analytics. Her work includes leveraging machine learning and natural language processing techniques to improve software dependability, and using software approaches to improve the dependability of machine learning systems. Dr. Tan’s co-authored papers have received ACM Distinguished Paper Awards at CCS 2024, ASE 2020, MSR 2018, and FSE 2016, and were selected for IEEE Micro’s Top Picks in 2006. Dr. Tan has received an Early Career Academic Achievement Alumni Award from the University of Illinois, Urbana-Champaign, a Canada Research Chair, an NSERC Discovery Accelerator Supplements Award, an Ontario Early Researcher Award, an Ontario Professional Engineers Award–Engineering Medal for Young Engineer, and multiple industry awards, including J.P. Morgan AI Faculty Research Awards, Meta/Facebook Research Awards, Google Faculty Research Awards, and an IBM CAS Research Project of the Year Award. She served as program co-chair of FSE 2024 (one of the top two conferences in software engineering). She was an associate editor of IEEE Transactions on Software Engineering (2017-2022) and Springer Empirical Software Engineering Journal (2015-2021). She was the ACM SIGSOFT Treasurer and an elected Member-at-Large (2021-2024).