Introduction

CS846 Machine Learning for Software Engineering — Spring 2026

Pengyu Nie


Agenda


Course Logistics > Your Instructor

  • Name: Pengyu Nie
  • Assistant Prof @ UWaterloo CS since 2023
    • before that: PhD @ UTAustin ECE
    • before that: BSc @ USTC Physics
  • 2nd time teaching this course
  • Research interests in AI + SE; recent focus:
    • AI-assisted test generation and maintenance
    • Automating MLE (machine learning engineering)
    • Efficient model architectures for SE
    • … find out more in this course

Course Logistics > Communication


Course Logistics > Syllabus


Course Logistics > Class Structure

Typical class structure starting from Week 6:

Time          Item
9:30-10:10    Activity 1
10:10-10:20   Break
10:20-11:00   Activity 2
11:00-11:10   Break
11:10-11:50   Activity 3

Course Logistics > Assessment

Task                      Due Date      Weight
Participation             -             40%
Project: team formation   06/01 (Mon)   -
Project: proposal report  06/12 (Fri)   10%
Project: progress report  07/10 (Fri)   20%
Project: final report     08/07 (Fri)   30%

Course Logistics > Assessment > Participation


Course Logistics > Assessment > Project

Team size


Course Logistics > Assessment > Project

Project topic


Course Logistics > Assessment > Project


Course Logistics > Assessment > Project

Milestones


Course Logistics > AI Usage Policy


Round-Table Introductions


ML/AI4SE Overview

I will…


ML/AI4SE Overview > Naturalness of Software

Statistical (n-gram) language modeling of code

  • Code written in PLs is natural: repetitive and predictable
  • Code is more "natural" (lower cross-entropy) than English
Hindle et al. On the naturalness of software. In ICSE 2012. https://doi.org/10.1109/ICSE.2012.6227135
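
To make "lower cross-entropy" concrete, the sketch below trains a bigram model on a toy token stream and measures cross-entropy (bits per token) on two held-out sequences. The whitespace tokenization, add-one smoothing, and toy data are assumptions chosen for illustration; this is not the experimental setup of Hindle et al.

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Count contexts and bigrams in a token list."""
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))
    vocab = set(tokens)
    return contexts, bigrams, vocab

def cross_entropy(model, tokens):
    """Average negative log2 probability per predicted token (add-one smoothing)."""
    contexts, bigrams, vocab = model
    v = len(vocab) + 1  # +1 reserves probability mass for unseen tokens
    total = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
        total -= math.log2(p)
    return total / (len(tokens) - 1)

# Toy data (illustrative only): a repeated loop header and body.
line = "for i in range ( n ) : x = x + i".split()
train = line * 20
shuffled = "i + ) x for : = n in ( range x i".split()  # same tokens, scrambled order

model = train_bigram(train)
h_rep = cross_entropy(model, line)       # predictable sequence
h_shuf = cross_entropy(model, shuffled)  # unpredictable sequence
print(f"repetitive: {h_rep:.2f} bits/token, shuffled: {h_shuf:.2f} bits/token")
```

The repetitive sequence scores lower because the model has seen each of its bigrams many times, which is the sense in which predictable code is "natural".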

ML/AI4SE Overview > Naturalize

Application of language models

  • Predict and suggest coding conventions: identifier naming, whitespace, newlines
Allamanis et al. Learning natural coding conventions. In FSE 2014. https://doi.org/10.1145/2635868.2635883

ML/AI4SE Overview > Graph NGram

Unique property of code: structures

  • Capture tree and graph structures of code, not just token sequences
  • Still n-gram — just the graph version of it
Nguyen and Nguyen. Graph-based statistical language model for code. In ICSE 2015. https://doi.org/10.1109/ICSE.2015.336

ML/AI4SE Overview > RNN, CNN

Neural language models for code

  • From n-grams to neural sequence models
  • RNNs for code ↔ text (e.g., API recommendation)
  • CNNs with attention + copy mechanism for extreme code summarization
Gu et al. Deep API learning. In FSE 2016. https://doi.org/10.1145/2950290.2950334
Allamanis et al. A convolutional attention network for extreme summarization of source code. In ICML 2016. https://proceedings.mlr.press/v48/allamanis16.pdf

ML/AI4SE Overview > Grammar-based

Grammar-constrained code generation

  • Constrain the output to be syntactically valid programs
  • Text ↔ code translation becomes a popular task
Yin and Neubig. A syntactic neural model for general-purpose code generation. In ACL 2017. https://arxiv.org/abs/1704.01696

ML/AI4SE Overview > Path-based

AST-path program representations

  • Encode programs using paths through the abstract syntax tree
Alon et al. A general path-based representation for predicting program properties. In PLDI 2018. https://doi.org/10.1145/3192366.3192412
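
A rough sketch of the idea, using Python's built-in `ast` module: for every pair of AST leaves, record the node types on the way up from the first leaf to their lowest common ancestor and down to the second leaf. The helper names (`leaf_paths`, `label`, `ast_paths`), the `^`/`v` notation, and the `max_length` cutoff are assumptions for illustration, not the exact representation used by Alon et al.

```python
import ast
from itertools import combinations

def label(node):
    """Text shown for a leaf: identifier, argument name, literal, or node type."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.arg):
        return node.arg
    if isinstance(node, ast.Constant):
        return repr(node.value)
    return type(node).__name__

def leaf_paths(tree):
    """Return (leaf_label, root-to-leaf node list) for every AST leaf."""
    results = []
    def visit(node, trail):
        trail = trail + [node]
        children = [c for c in ast.iter_child_nodes(node)
                    if not isinstance(c, ast.expr_context)]  # skip Load/Store markers
        if not children:
            results.append((label(node), trail))
        for child in children:
            visit(child, trail)
    visit(tree, [])
    return results

def ast_paths(source, max_length=8):
    """Enumerate leaf-to-leaf paths through the lowest common ancestor."""
    leaves = leaf_paths(ast.parse(source))
    paths = []
    for (a, trail_a), (b, trail_b) in combinations(leaves, 2):
        i = 0  # length of the shared root prefix
        while i < min(len(trail_a), len(trail_b)) and trail_a[i] is trail_b[i]:
            i += 1
        up = [type(n).__name__ for n in reversed(trail_a[i:])]
        top = type(trail_a[i - 1]).__name__  # lowest common ancestor
        down = [type(n).__name__ for n in trail_b[i:]]
        if len(up) + 1 + len(down) <= max_length:
            path = " ^ ".join(up + [top]) + " v " + " v ".join(down)
            paths.append((a, path, b))
    return paths

paths = ast_paths("x = y + 1")
for triple in paths:
    print(triple)
```

Each triple (start leaf, path, end leaf) can then serve as a feature for a downstream model, for example one that predicts a variable's name from the paths it participates in.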

ML/AI4SE Overview > CodeSearchNet

Large-scale datasets

  • Race to collect massive, high-quality datasets/benchmarks for code
  • CodeSearchNet: multi-language code search corpus (code and comment pairs)
Husain et al. CodeSearchNet challenge: evaluating the state of semantic code search. 2019. https://arxiv.org/abs/1909.09436

ML/AI4SE Overview > CodeBERT

Transformers, pre-trained on large-scale datasets

  • BERT-style encoder, trained on bimodal (code, NL) corpora
  • Pre-training objectives specific to SE (MLM + replaced token detection)
Feng et al. CodeBERT: a pre-trained model for programming and natural languages. In Findings of EMNLP 2020. https://arxiv.org/abs/2002.08155
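
The two objectives differ mainly in how training examples are built. The sketch below constructs one example of each from a toy token list. The function names, the `[MASK]` string, and the uniform sampling of replacement tokens are simplifying assumptions; they do not reproduce CodeBERT's actual data pipeline.

```python
import random

MASK = "[MASK]"

def make_mlm_example(tokens, positions):
    """Masked language modeling: hide tokens; the model must recover them."""
    inputs = [MASK if i in positions else t for i, t in enumerate(tokens)]
    labels = {i: tokens[i] for i in positions}
    return inputs, labels

def make_rtd_example(tokens, positions, vocab, rng):
    """Replaced token detection: swap in alternatives; label each token 1 if changed."""
    corrupted = list(tokens)
    for i in positions:
        corrupted[i] = rng.choice(vocab)  # assumption: uniform sampling from a toy vocab
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels

rng = random.Random(0)  # fixed seed so the example is reproducible
tokens = "def max ( a , b ) : return a if a > b else b".split()
positions = set(rng.sample(range(len(tokens)), k=2))
vocab = sorted(set(tokens))

mlm_inputs, mlm_labels = make_mlm_example(tokens, positions)
rtd_inputs, rtd_labels = make_rtd_example(tokens, positions, vocab, rng)
print(mlm_inputs, mlm_labels)
print(rtd_inputs, rtd_labels)
```

Note that an RTD label is 0 when the sampled replacement happens to equal the original token, so the number of positive labels can be smaller than the number of sampled positions.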

ML/AI4SE Overview > CodeXGLUE

Transformers, multi-tasking

  • CodeXGLUE: 10 tasks (14 datasets) across 4 categories: code-code, code-text, text-code, text-text
Lu et al. CodeXGLUE: a machine learning benchmark dataset for code understanding and generation. In NeurIPS Datasets and Benchmarks 2021. https://arxiv.org/abs/2102.04664

ML/AI4SE Overview > Codex

Scaling up -> Large Language Models

  • GPT model fine-tuned on a large code corpus from GitHub (a production version later powered GitHub Copilot)
  • Introduced HumanEval, using test pass/fail as evaluation metric
  • Scaling law relating model size to accuracy
Chen et al. Evaluating large language models trained on code. 2021. https://arxiv.org/abs/2107.03374
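
Test pass/fail results are usually aggregated as pass@k: the probability that at least one of k sampled solutions passes all tests. The sketch below implements the unbiased estimator 1 - C(n-c, k) / C(n, k) described by Chen et al., where n samples are drawn per problem and c of them pass. The function name and the example numbers are illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c pass all tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 10 samples for one problem, 3 of them pass.
for k in (1, 5, 10):
    print(k, round(pass_at_k(10, 3, k), 3))
```

A benchmark-level score averages this estimate over all problems.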

ML/AI4SE Overview > StarCoder

Open-source code LLMs

  • Openly released pre-training data (mined from GitHub) and model weights
Li et al. StarCoder: may the source be with you! In TMLR 2023. https://arxiv.org/abs/2305.06161
von Werra et al. StarCoder2 and The Stack v2. https://huggingface.co/blog/starcoder2

ML/AI4SE Overview > SWE-Bench

More challenging benchmarks, matching real-world workflows

  • Original SWE-bench: 2,294 issues from 12 popular Python repositories
  • Follow-ups: SWE-bench Lite, Verified, Multimodal, Live, Multilingual, ...
Jimenez et al. SWE-bench: can language models resolve real-world GitHub issues? In ICLR 2024. https://arxiv.org/abs/2310.06770; [SWE-bench leaderboard]

ML/AI4SE Overview > SWE-Agent

Single model -> agent

  • Multiple rounds of input/output
  • Repository-scale context, edits, reasoning, tool use, long-horizon tasks...
Yang et al. SWE-agent: agent-computer interfaces enable automated software engineering. In NeurIPS 2024. https://arxiv.org/abs/2405.15793; [latest version: mini-swe-agent]
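
The "multiple rounds of input/output" idea can be sketched as a loop in which the model picks a tool call, the environment runs it, and the observation is appended to the history for the next round. Everything below is a hypothetical stand-in (an in-memory "repository", a scripted "model", and the tool names); it is not SWE-agent's interface.

```python
def run_agent(model, tools, task, max_steps=10):
    """Feed observations back to the model until it submits or the step budget runs out."""
    history = [("task", task)]
    for _ in range(max_steps):
        name, arg = model(history)         # model chooses the next action
        if name == "submit":
            return arg, history
        observation = tools[name](arg)     # run the tool and capture its output
        history.append((name, arg, observation))
    return None, history

# Hypothetical environment: a one-file repository with a bug.
repo = {"calc.py": "def add(a, b):\n    return a - b\n"}

def read_file(path):
    return repo[path]

def fix_add(path):
    repo[path] = repo[path].replace("a - b", "a + b")
    return "edited " + path

def scripted_model(history):
    """Stand-in for an LLM: replays a fixed plan, one action per round."""
    script = [("read_file", "calc.py"), ("fix_add", "calc.py"), ("submit", "done")]
    return script[len(history) - 1]

tools = {"read_file": read_file, "fix_add": fix_add}
result, history = run_agent(scripted_model, tools, "add() returns the wrong result")
print(result, len(history))
```

A real agent replaces `scripted_model` with an LLM call that conditions on the full history, which is why context length and tool design matter at repository scale.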

ML/AI4SE Overview > SOTA

State of the Art as of 2026/05