Software Engineering Essentials

CS846 Machine Learning for Software Engineering — Spring 2026

Pengyu Nie

Agenda

Goals
- review key SE concepts from a data-driven view
- give pointers to tools and datasets
- prioritize breadth, relevance, and recency
Non-Goals
- depth; completeness

Overview (Software Development Life Cycle)

Code: tokens, AST, call graph, data flow
Validation: tests, specs
Natural language: comments, documentation, issues/PRs, logs

Some artifacts are readily available in verbatim form (code)
Some need to be extracted by parsing (call graph) or executing (data flow) the code
Some only make sense when connected with other artifacts (comment-code, test-code)

public class C {
  public int factorial(int n) {
    if (n > 0) {
      return n * factorial(n - 1);
    } else {
      return 1;
    }
  }
}

Lexing is the first stage of the compilation pipeline: it produces a stream of PL tokens for downstream analyses
PL tokens ≠ ML tokens (produced by subword tokenizers such as BPE/SentencePiece), e.g., factorial vs. fact|orial

token	kind
`public`	KEYWORD
`class`	KEYWORD
`C`	IDENT
`{`	SYMBOL
`public`	KEYWORD
`int`	KEYWORD
`factorial`	IDENT
`(`	SYMBOL
`int`	KEYWORD
`n`	IDENT
`)`	SYMBOL
…	…
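
The token table above can be reproduced for Python code with the standard-library `tokenize` module. This is a minimal sketch, not the lecture's tooling: the KEYWORD / IDENT / SYMBOL labels are mapped by hand to mirror the table (Python's tokenizer itself reports keywords and identifiers alike as NAME), and the `lex` and `kind_of` helpers are hypothetical names.

```python
import io
import keyword
import tokenize

SOURCE = '''def factorial(n):
    if n > 0:
        return n * factorial(n - 1)
    else:
        return 1
'''

# Layout-only tokens that are left out of the printed stream.
LAYOUT = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
          tokenize.DEDENT, tokenize.ENDMARKER}

def kind_of(tok):
    """Map a Python token to the coarse kinds used in the table above."""
    if tok.type == tokenize.NAME:
        return "KEYWORD" if keyword.iskeyword(tok.string) else "IDENT"
    if tok.type == tokenize.OP:
        return "SYMBOL"
    return tokenize.tok_name[tok.type]  # e.g., NUMBER, STRING

def lex(source):
    """Return (kind, text) pairs for the PL tokens in `source`."""
    readline = io.StringIO(source).readline
    return [(kind_of(tok), tok.string)
            for tok in tokenize.generate_tokens(readline)
            if tok.type not in LAYOUT]

tokens = lex(SOURCE)
for kind, text in tokens[:6]:
    print(f"{text}\t{kind}")
```

A subword tokenizer would split the same text differently (e.g., `factorial` into several pieces), which is why PL tokens and ML tokens should not be conflated.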

public class C {
  public int factorial(int n) {
    if (n > 0) {
      return n * factorial(n - 1);
    } else {
      return 1;
    }
  }
}

The AST (abstract syntax tree) is the data format used by most static analysis tools, e.g., to
- extract certain code elements (methods, imports)
- connect code elements (method <signature, body, comment>)
- rewrite code by manipulating the AST
Concrete vs. Abstract
- concrete: all tokens explicitly appear as leaf nodes; suitable for manipulation; e.g., ANTLR
- abstract: keyword and symbol nodes are omitted; suitable for analysis; e.g., Python ast
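
The first two use cases (extracting and connecting code elements) can be sketched with Python's builtin `ast` module, which builds an abstract tree: keywords and punctuation such as `def` and `:` do not appear as nodes. The sample source and the variable names are illustrative assumptions.

```python
import ast

SOURCE = '''
import math
from os import path

def factorial(n):
    """Compute n!."""
    if n > 0:
        return n * factorial(n - 1)
    return 1
'''

tree = ast.parse(SOURCE)

imports = []    # extracted elements: imported module names
functions = []  # connected elements: <name, parameters, comment>
for node in ast.walk(tree):
    if isinstance(node, ast.Import):
        imports.extend(alias.name for alias in node.names)
    elif isinstance(node, ast.ImportFrom):
        imports.append(node.module)
    elif isinstance(node, ast.FunctionDef):
        params = [a.arg for a in node.args.args]
        functions.append((node.name, params, ast.get_docstring(node)))

print(imports)
print(functions)
```

Because the tree is abstract, it drops comments and formatting; rewriting code while preserving layout usually calls for a concrete syntax tree instead.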

// lexer rules
IDENTIFIER    : Letter LetterOrDigit* ;
Letter        : [a-zA-Z$_] ;
LetterOrDigit : Letter | [0-9] ;

// parser rules
methodDeclaration
    : typeTypeOrVoid identifier formalParameters
      ('[' ']')* (THROWS qualifiedNameList)? methodBody
    ;
typeTypeOrVoid
    : typeType
    | VOID
    ;
typeType
    : annotation* (classOrInterfaceType | primitiveType)
      (annotation* '[' ']')*
    ;

A context-free grammar (CFG) is a set of production rules
Lexer rules produce tokens; parser rules produce tree nodes
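
Lexer rules are typically regular, so the three lexer rules shown above can be approximated with a regular expression. This is only a sketch of those three rules as written; a real Java lexer also has to handle keywords and Unicode letters, and `is_identifier` is a hypothetical helper.

```python
import re

# Lexer rules from the grammar above, transcribed as regular expressions:
#   Letter        : [a-zA-Z$_]
#   LetterOrDigit : Letter | [0-9]
#   IDENTIFIER    : Letter LetterOrDigit*
LETTER = r"[a-zA-Z$_]"
LETTER_OR_DIGIT = rf"(?:{LETTER}|[0-9])"
IDENTIFIER = re.compile(rf"{LETTER}{LETTER_OR_DIGIT}*")

def is_identifier(text):
    """True if the whole string matches the IDENTIFIER lexer rule."""
    return IDENTIFIER.fullmatch(text) is not None

for candidate in ["factorial", "$tmp_1", "1abc"]:
    print(candidate, is_identifier(candidate))
```

Parser rules such as `methodDeclaration` are recursive and cannot be expressed this way; they need a real parser.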

A parser generator takes a grammar and produces a lexer + parser for your language
Why a parser generator?
- standard toolchain that works across PLs
- PL’s official compiler may or may not expose tokens/ASTs for analysis

tool	runtime	advantages
Tree-sitter; [grammars]	C	incremental; error-recovering; permissive
ANTLR; [grammars]	Java	precise; full-grammar; LL(*)

Other parser tools:
- Java: JavaParser
- Python: builtin ast
- AST diff: GumTree

Static = analyze the program without running it.
Kinds of data
- call graph: inter-function dependency; control flow: intra-method dependency
- data flow/def-use: variable dependency; points-to/alias: references to same object
- taint analysis: could sensitive data be leaked to dangerous places?
- type inference: useful for dynamically-typed languages
- symbolic execution / path conditions: logical conditions under which a path executes
Use cases
- input context for ML models (beyond verbatim code)
- SE tools, e.g., linters

public int foo(int x, int y) {
  int a = Math.abs(x);
  int b = Math.abs(y);
  if (a == b) {
    return a;
  } else {
    int c = a - b;
    while (a - b > 0) {
      a++;
      b--;
    }
    return a;
  }
}

Most static analysis tools follow the Visitor pattern
- Traverse the AST in certain order (usually depth-first)
- Visit each node once and extract certain information
- Can manipulate the AST if the goal is to rewrite code

Example: def-use analysis
- int x — def x
- int y — def y
- int a — def a
- Math.abs(x) — use x
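
A def-use visitor along these lines can be sketched with `ast.NodeVisitor`, using a Python analogue of `foo`. The class name `DefUseVisitor` and the simplifications (callee names and keyword arguments are ignored, no scoping) are assumptions made for illustration.

```python
import ast

SOURCE = '''
def foo(x, y):
    a = abs(x)
    b = abs(y)
    if a == b:
        return a
    c = a - b
    return c
'''

class DefUseVisitor(ast.NodeVisitor):
    """Record (line, event, name) triples while walking the AST depth-first."""

    def __init__(self):
        self.events = []

    def visit_arg(self, node):
        # A parameter declaration defines the variable.
        self.events.append((node.lineno, "def", node.arg))

    def visit_Assign(self, node):
        # Right-hand side first: its uses happen before the definition.
        self.visit(node.value)
        for target in node.targets:
            self.visit(target)

    def visit_Call(self, node):
        # Skip the callee name; only positional arguments count as uses here.
        for arg in node.args:
            self.visit(arg)

    def visit_Name(self, node):
        event = "def" if isinstance(node.ctx, ast.Store) else "use"
        self.events.append((node.lineno, event, node.id))

visitor = DefUseVisitor()
visitor.visit(ast.parse(SOURCE))
for event in visitor.events:
    print(event)
```

Node types without a dedicated `visit_*` method fall back to `generic_visit`, which recurses into the children, so the visitor only spells out the nodes it cares about.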

Source-code analysis: based on code/token/AST, close to what developers write
Bytecode analysis: based on bytecode/IR (e.g., Java bytecode, LLVM IR), close to how code is executed
- the compiler has already done type resolution, macro expansion, optimizations.
- easier to extract some data (e.g., call graphs)
- harder to relate back to source lines
Tools for Java bytecode analysis: ASM, ByteBuddy
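
The same idea can be tried on CPython bytecode with the standard-library `dis` module, as a rough analogue of Java bytecode analysis. This sketch lists the global names a function's compiled code loads, which here are the functions it calls. It assumes CPython; opcode names such as `LOAD_GLOBAL` are implementation details and can change between versions, and `global_names_loaded` is a hypothetical helper.

```python
import dis

def foo(x, y):
    a = abs(x)
    b = abs(y)
    return max(a, b)

def global_names_loaded(func):
    """Names that the compiled bytecode of `func` loads from global scope."""
    names = []
    for instr in dis.get_instructions(func):
        if instr.opname == "LOAD_GLOBAL" and instr.argval not in names:
            names.append(instr.argval)
    return names

print(global_names_loaded(foo))
```

Name resolution has already been done by the compiler, so no parsing is needed, but the result no longer maps cleanly onto source expressions.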

Dynamic = analyze the program during execution
- usually by instrumenting the program to insert helper code
Why? Static analysis can be imprecise
Use cases: (more accurate) context for ML models; coverage, debugger, profiler, etc.

Example: static call-graph analysis must over-approximate when there is dynamic dispatch
drawShape -> {Line.draw, Rectangle.draw, Circle.draw}
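
The over-approximation can be sketched as follows: without running the program, a call to `.draw()` has to be resolved to every class that defines `draw`. The Python analogue of the shape classes and the helper `possible_targets` are illustrative assumptions, and the sketch ignores inheritance.

```python
import ast

SOURCE = '''
class Line:
    def draw(self): ...

class Rectangle:
    def draw(self): ...

class Circle:
    def draw(self): ...

def draw_shape(shape):
    shape.draw()
'''

def possible_targets(source, method):
    """All `Class.method` definitions that a call to `.method()` could reach."""
    targets = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, ast.FunctionDef) and item.name == method:
                    targets.append(f"{node.name}.{method}")
    return targets

print("draw_shape ->", possible_targets(SOURCE, "draw"))
```

Only an actual execution reveals which of these targets is reached for a given input.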

Insert logging statements at certain code locations
- e.g., (caller) before invoke instruction, (callee) at the beginning of method
Execute the program
Reconstruct the call graph from the logs

Source-code version

public static void drawShape(Shape shape) {
  ...
  DynamicAnalyzer.logCaller("ShapeMain.drawShape");
  shape.draw();
}

public void draw() {
  DynamicAnalyzer.logCallee("Line.draw");
  ...
}

Bytecode version (most common case for compiled languages)

// ShapeMain
public static void drawShape(ca.uwaterloo.cs846.exp.Shape)
  descriptor: (Lca/uwaterloo/cs846/exp/Shape;)V
  Code:
    0: aload_0
    + ldc "ShapeMain.drawShape"
    + invokestatic DynamicAnalyzer.logCaller
    1: invokeinterface ca/uwaterloo/cs846/exp/Shape.draw:()V, 1
    6: return

// Line (same for Rectangle / Circle)
public void draw()
  descriptor: ()V
  Code:
    + ldc "Line.draw"
    + invokestatic DynamicAnalyzer.logCallee
    0: getstatic java/lang/System.out:Ljava/io/PrintStream;
    3: aload_0

Where does execution come from? Tests; production (collecting telemetry data)
Manipulating execution is also possible (e.g., changing variable value in debugger)
Tools:
- Java: ASM, ByteBuddy, Java Instrumentation API, JVMTI
- Python: sys.settrace
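
For Python, a dynamic call graph can be collected with `sys.settrace`, which makes the interpreter report each call instead of inserting logging statements by hand. This is a minimal sketch: `record_calls` and `qualified_name` are hypothetical helpers, and the receiver's class is read from the `self` local, which is a convention rather than a guarantee.

```python
import sys

class Line:
    def draw(self):
        return "line"

class Circle:
    def draw(self):
        return "circle"

def draw_shape(shape):
    return shape.draw()

def qualified_name(frame):
    """`Class.method` for methods (via `self`), plain function name otherwise."""
    name = frame.f_code.co_name
    receiver = frame.f_locals.get("self")
    return f"{type(receiver).__name__}.{name}" if receiver is not None else name

def record_calls(func, *args):
    """Run func(*args) and return the observed (caller, callee) edges."""
    edges = set()

    def tracer(frame, event, arg):
        if event == "call" and frame.f_back is not None:
            edges.add((qualified_name(frame.f_back), qualified_name(frame)))
        return None  # no line-level tracing needed

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)  # always remove the hook again
    return edges

edges = record_calls(draw_shape, Line())
print(sorted(edges))
```

The recorded graph contains only the edges this execution exercised, so its coverage depends on the inputs (tests or production traffic) that drive the program.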

Other Data

Validation-related data:
- Tests
- Specs
- Proofs
Natural language data:
- Comments
- Issues/PRs
- Logs
- Documentation

Tests: executable specifications of the code's expected behavior
- Regression tests: tests that are executed on every commit to check that new changes do not break existing behavior
- Fuzzing / random tests: randomly or systematically generated inputs that exercise the code
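
The two kinds of tests can be contrasted on a small example. The sketch below uses the standard-library `unittest` module only to stay self-contained; the `factorial` implementation, the seed, and the input range are illustrative assumptions.

```python
import random
import unittest

def factorial(n):
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1 if n == 0 else n * factorial(n - 1)

class FactorialTest(unittest.TestCase):
    def test_known_values(self):
        # Regression test: pins down existing behavior on fixed inputs.
        self.assertEqual(factorial(0), 1)
        self.assertEqual(factorial(5), 120)

    def test_random_inputs(self):
        # Random test: generated inputs are checked against a property
        # (n! == n * (n-1)!) instead of hard-coded expected values.
        rng = random.Random(846)  # fixed seed keeps the run reproducible
        for _ in range(100):
            n = rng.randint(1, 50)
            self.assertEqual(factorial(n), n * factorial(n - 1))

    def test_negative_input_rejected(self):
        with self.assertRaises(ValueError):
            factorial(-1)

def run_tests():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(FactorialTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)

if __name__ == "__main__":
    run_tests()
```

Random tests need an oracle that works without knowing the expected output in advance; here the recurrence plays that role.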

PL	testing frameworks
Java	JUnit, TestNG
Python	pytest
C / C++	GoogleTest, Catch2
JS / TS	Jest, Vitest, Mocha
Go	built-in `testing` package
Rust	built-in `cargo test`, proptest

Test generation from code: Methods2Test, Classes2Test
Test generation for bug finding: SWT-bench, Defects4J, BugsInPy
Test evolution: TestEvo-Bench (ours)

A specification (spec, method contract) states what the program should do
- pre-conditions
- post-conditions
- invariants
- side-effects
Uses: runtime checking / model checking / verification
Frameworks: JML, Dafny, KLEE

// Dafny: spec + implementation in one place
method Abs(x: int) returns (y: int)
  ensures y >= 0
  ensures y == x || y == -x
{
  if x < 0 { y := -x; } else { y := x; }
}

//@ requires n >= 0;
//@ ensures \result >= 1;
public int factorial(int n) { ... }   // JML in a comment
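
Of the three uses, runtime checking is the easiest to sketch: the pre- and post-condition from the `factorial` contract above are evaluated on every call. The `contract` decorator is a hypothetical helper written for this example, not part of JML, Dafny, or any library.

```python
import functools

def contract(requires=None, ensures=None):
    """Hypothetical decorator: check a pre- and post-condition at run time."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args):
            if requires is not None and not requires(*args):
                raise AssertionError(f"pre-condition of {func.__name__} violated")
            result = func(*args)
            if ensures is not None and not ensures(result, *args):
                raise AssertionError(f"post-condition of {func.__name__} violated")
            return result
        return wrapper
    return decorate

# requires n >= 0; ensures result >= 1
@contract(requires=lambda n: n >= 0, ensures=lambda result, n: result >= 1)
def factorial(n):
    return 1 if n == 0 else n * factorial(n - 1)

print(factorial(5))
```

Runtime checking only covers the inputs that are actually executed; verification tools instead try to establish the contract for all inputs.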

Natural Language Data > Comments

API comments (JavaDoc, docstrings in Python)
- natural language specification
- summary
- @param ≈ pre-conditions
- @return + @throws ≈ post-conditions
Inline comments
Natural language <-> code transduction
Datasets:
- mined from GitHub: CodeSearchNet
- hand-written and competitive programming problems: HumanEval, MBPP, APPS

/**
 * Computes n! for non-negative integers.
 *
 * @param n a non-negative integer
 * @return n factorial
 * @throws IllegalArgumentException if n is negative
 */
public int factorial(int n) {
  if (n < 0) throw new IllegalArgumentException("n must be non-negative");
  // base case
  if (n == 0) return 1;
  return n * factorial(n - 1);
}

def factorial(n: int) -> int:
    """Compute n! for non-negative integers.

    Args:
        n: a non-negative integer.
    Returns:
        n factorial.
    Raises:
        ValueError: if n is negative.
    """
    ...
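
Comment-code pairs of the kind used for natural language <-> code tasks can be mined with the builtin `ast` module. This is a minimal sketch, not the pipeline of any particular dataset; `mine_pairs` and the sample source are assumptions.

```python
import ast

SOURCE = '''
def factorial(n: int) -> int:
    """Compute n! for non-negative integers."""
    return 1 if n == 0 else n * factorial(n - 1)

def helper(x):
    return x + 1
'''

def mine_pairs(source):
    """Return (docstring, code) pairs for the documented functions in `source`."""
    pairs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            doc = ast.get_docstring(node)
            if doc:  # skip undocumented functions
                pairs.append((doc, ast.get_source_segment(source, node)))
    return pairs

for doc, code in mine_pairs(SOURCE):
    print(doc)
    print(code)
```

The extracted code segment still contains the docstring; a dataset builder would normally strip it from the code side to avoid leaking the target.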

Natural Language Data > Issues & Pull Requests

Issue: bug report, feature request
Pull request: code changes towards resolving an issue (code, tests, discussion, …)
Platforms: GitHub, JIRA, Bugzilla
Mine real-world software development tasks from PRs: SWE-bench
Related: code review, AI-generated PRs (MSR'26 Mining Challenge)
