Software Engineering Essentials

CS846 Machine Learning for Software Engineering — Spring 2026

Pengyu Nie


Agenda


Overview (Software Development Life Cycle)

  • Code: tokens, AST, call graph, data flow
  • Validation: tests, specs
  • Natural language: comments, documentation, issues/PRs, logs
  • Some artifacts are readily available in verbatim form (code)
  • Some need to be extracted by parsing (call graph) or executing (data flow) the code
  • Some need to be connected with other artifacts to make sense (comment-code, test-code)

Code-Related Data


Code-Related Data > Lexing & Parsing > Lexing

public class C {
  public int factorial(int n) {
    if (n > 0) {
      return n * factorial(n - 1);
    } else {
      return 1;
    }
  }
}
  • Basis of the compilation pipeline; produces a stream of PL tokens for downstream analyses
  • PL tokens ≠ ML tokens (produced by subword tokenizers such as BPE/SentencePiece), e.g., factorial vs. fact|orial
token      kind
public     KEYWORD
class      KEYWORD
C          IDENT
{          SYMBOL
public     KEYWORD
int        KEYWORD
factorial  IDENT
(          SYMBOL
int        KEYWORD
n          IDENT
)          SYMBOL
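
A minimal sketch of the lexing step, assuming a hand-written regex lexer in Python for a small Java subset. The KEYWORDS set, the token pattern, and the two-entry subword vocabulary are illustrative assumptions, not a full Java lexer or a trained BPE model (the subword split here is a greedy longest-match stand-in). It reproduces the token/kind pairs listed above and contrasts them with a subword split of the same identifier.

```python
import re

# Assumption: a tiny keyword set covering only the factorial example.
KEYWORDS = {"public", "class", "int", "if", "else", "return"}

# One alternative per token kind; leading whitespace is skipped.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<IDENT>[A-Za-z_$][A-Za-z0-9_$]*)"
    r"|(?P<NUMBER>\d+)"
    r"|(?P<SYMBOL>[{}()<>;,*+\-=]))"
)


def lex(source):
    """Turn source text into a list of (token, kind) pairs."""
    source = source.rstrip()
    tokens, pos = [], 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"cannot lex input at offset {pos}")
        kind = match.lastgroup
        text = match.group(kind)
        if kind == "IDENT" and text in KEYWORDS:
            kind = "KEYWORD"  # keywords match the identifier rule, then get reclassified
        tokens.append((text, kind))
        pos = match.end()
    return tokens


def subword_split(word, vocab):
    """Greedy longest-match split; falls back to single characters."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


pl_tokens = lex("public int factorial(int n)")
ml_tokens = subword_split("factorial", vocab={"fact", "orial"})
print(pl_tokens)  # one PL token for the identifier `factorial`
print(ml_tokens)  # several subword pieces for the same identifier
```

The PL lexer keeps `factorial` as one IDENT token, while the subword split cuts it wherever the vocabulary dictates, which is why the two token streams cannot be aligned one-to-one.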

Code-Related Data > Lexing & Parsing > Parsing

public class C {
  public int factorial(int n) {
    if (n > 0) {
      return n * factorial(n - 1);
    } else {
      return 1;
    }
  }
}
  • Parsing produces a syntax tree, the data format used by most static analysis tools, e.g., to

    • extract certain code elements (methods, imports)
    • connect code elements (method <signature, body, comment>)
    • rewrite code by manipulating the AST
  • Concrete vs. Abstract

    • concrete: all tokens explicitly appear as leaf nodes; suitable for manipulation; e.g., ANTLR parse trees
    • abstract: omits keyword and symbol nodes; suitable for analysis; e.g., Python's ast module
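
As a small illustration of the "extract" and "connect" use cases, the sketch below uses Python's standard ast module on a Python transliteration of the factorial example (the slide's example is Java; the helper name extract_functions is hypothetical). It pairs each function's name, parameters, and docstring, and lists the node kinds to show that an abstract tree has no nodes for parentheses or colons.

```python
import ast

SOURCE = '''
def factorial(n):
    """Compute n! for non-negative n."""
    if n > 0:
        return n * factorial(n - 1)
    else:
        return 1
'''


def extract_functions(source):
    """Return (name, parameter names, docstring) for every function definition."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            params = [a.arg for a in node.args.args]
            found.append((node.name, params, ast.get_docstring(node)))
    return found


functions = extract_functions(SOURCE)

# Node kinds present in the tree: only syntactic categories, no punctuation tokens.
node_kinds = sorted({type(n).__name__ for n in ast.walk(ast.parse(SOURCE))})

print(functions)
print(node_kinds)
```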

Code-Related Data > Lexing & Parsing > Grammar

// lexer rules
IDENTIFIER    : Letter LetterOrDigit* ;
Letter        : [a-zA-Z$_] ;
LetterOrDigit : Letter | [0-9] ;

// parser rules
methodDeclaration
    : typeTypeOrVoid identifier formalParameters
      ('[' ']')* (THROWS qualifiedNameList)? methodBody
    ;
typeTypeOrVoid
    : typeType
    | VOID
    ;
typeType
    : annotation* (classOrInterfaceType | primitiveType)
      (annotation* '[' ']')*
    ;

Code-Related Data > Lexing & Parsing > Parser Generator

  • Given a grammar, a parser generator produces a lexer + parser for your language
  • Why a parser generator?
    • standard toolchain for parsing all PLs
    • the PL’s official compiler may or may not expose tokens/ASTs for analysis
tool                     runtime  advantages
Tree-sitter; [grammars]  C        incremental; error-recovering; permissive
ANTLR; [grammars]        Java     precise; full-grammar; LL(*)

Code-Related Data > Static Analysis


Code-Related Data > Static Analysis > Visitor

public int foo(int x, int y) {
  int a = Math.abs(x);
  int b = Math.abs(y);
  if (a == b) {
    return a;
  } else {
    int c = a - b;
    while (a - b > 0) {
      a++;
      b--;
    }
    return a;
  }
}
  • Most static analysis tools follow the Visitor pattern
    • Traverse the AST in a certain order (usually depth-first)
    • Visit each node once and extract certain information
    • Can manipulate the AST if the goal is to rewrite code
  • Example: def-use analysis
    • int x — def x
    • int y — def y
    • int a — def a
    • Math.abs(x) — use x
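
The def-use example can be sketched with Python's ast.NodeVisitor, assuming a shortened Python transliteration of foo (the slide's version is Java; the class name DefUseVisitor is hypothetical). Parameters and assignment targets are recorded as defs, every other name reference as a use. Events follow traversal order, so the target of an assignment is reported before the names on its right-hand side.

```python
import ast

SOURCE = """
def foo(x, y):
    a = abs(x)
    b = abs(y)
    c = a - b
    return c
"""


class DefUseVisitor(ast.NodeVisitor):
    """Collect (kind, name, line) events in depth-first traversal order."""

    def __init__(self):
        self.events = []

    def visit_arg(self, node):
        # A function parameter defines a variable.
        self.events.append(("def", node.arg, node.lineno))

    def visit_Name(self, node):
        # Store context means the name is assigned; anything else reads it.
        kind = "def" if isinstance(node.ctx, ast.Store) else "use"
        self.events.append((kind, node.id, node.lineno))


visitor = DefUseVisitor()
visitor.visit(ast.parse(SOURCE))

defs = [name for kind, name, _ in visitor.events if kind == "def"]
uses = [name for kind, name, _ in visitor.events if kind == "use"]
print(defs)
print(uses)  # includes the builtin `abs`, which is also a name reference
```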

Code-Related Data > Static Analysis > Bytecode Analysis

  • Source-code analysis: based on code/token/AST, close to what developers write
  • Bytecode analysis: based on bytecode/IR (e.g., Java bytecode, LLVM IR), close to how code is executed
    • the compiler has already done type resolution, macro expansion, optimizations.
    • easier to extract some data (e.g., call graphs)
    • harder to relate back to source lines
  • Tools for Java bytecode analysis: ASM, ByteBuddy
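
To give a flavor of bytecode-level extraction, here is a rough sketch using Python's standard dis module on CPython bytecode rather than Java bytecode with ASM. The functions helper and caller and the name loaded_globals are hypothetical. It collects the global names a function loads, which over-approximates its callees because any global read counts, not only calls.

```python
import dis


def helper(value):
    return abs(value)


def caller(x, y):
    return helper(x) + helper(y)


def loaded_globals(func):
    """Names loaded via LOAD_GLOBAL in func's bytecode (candidate callees)."""
    return sorted({
        ins.argval
        for ins in dis.get_instructions(func)
        if ins.opname == "LOAD_GLOBAL"
    })


# Candidate call edges, read from compiled code instead of source text.
edges = {f.__name__: loaded_globals(f) for f in (caller, helper)}
print(edges)
```

Name resolution has already been done by the compiler at this level, but the instructions no longer map cleanly to source expressions, which mirrors the trade-off listed above.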

Code-Related Data > Dynamic Analysis

  • Dynamic = analyze the program during execution
    • usually by instrumenting the program to insert helper code
  • Why? Static analysis can be imprecise
  • Use cases: (more accurate) context for ML models; coverage, debugger, profiler, etc.
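
As one example of the coverage use case, the sketch below records which lines of a function execute, assuming Python's sys.settrace hook as the instrumentation mechanism (the function classify and the helper run_with_line_coverage are hypothetical). Line numbers are reported as offsets from the def line.

```python
import sys


def classify(n):
    if n > 0:
        return "positive"
    return "non-positive"


def run_with_line_coverage(func, *args):
    """Run func and return (result, set of executed line offsets)."""
    code = func.__code__
    executed = set()

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace other functions
        if event == "line":
            executed.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()  # restore any tracer that was already installed
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)
    return result, executed


result_pos, lines_pos = run_with_line_coverage(classify, 5)
result_neg, lines_neg = run_with_line_coverage(classify, -5)
print(result_pos, sorted(lines_pos))
print(result_neg, sorted(lines_neg))
```

Each run covers a different return statement, so the union over several inputs is what a coverage tool would report.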

Example: Static call-graph analysis must overestimate when there is dynamic dispatch
drawShape -> {Line.draw, Rectangle.draw, Circle.draw}
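
The overestimate can be reproduced with a class-hierarchy style resolution, sketched here in Python under the assumption that the analysis only knows the declared type Shape (the class bodies and the helper possible_targets are hypothetical). Every subclass that overrides draw is reported as a possible target, whichever shapes a given run actually creates.

```python
class Shape:
    def draw(self):
        raise NotImplementedError


class Line(Shape):
    def draw(self):
        return "line"


class Rectangle(Shape):
    def draw(self):
        return "rectangle"


class Circle(Shape):
    def draw(self):
        return "circle"


def possible_targets(declared_type, method_name):
    """All overriding methods in subclasses of the declared receiver type."""
    targets = set()
    worklist = list(declared_type.__subclasses__())
    while worklist:
        cls = worklist.pop()
        if method_name in vars(cls):  # the class defines the method itself
            targets.add(f"{cls.__name__}.{method_name}")
        worklist.extend(cls.__subclasses__())
    return targets


static_edges = possible_targets(Shape, "draw")
print(sorted(static_edges))
```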


Code-Related Data > Dynamic Analysis > Instrumentation

  • Insert logging statements at certain code locations
    • e.g., (caller) before invoke instruction, (callee) at the beginning of method
  • Execute the program
  • Reconstruct the call graph from the logs

Source code ver.

public static void drawShape(Shape shape) {
  ...
  DynamicAnalyzer.logCaller("ShapeMain.drawShape");
  shape.draw();
}

public void draw() {
  DynamicAnalyzer.logCallee("Line.draw");
  ...
}

Bytecode ver. (most common case for compiled languages)

// ShapeMain
public static void drawShape(ca.uwaterloo.cs846.exp.Shape)
  descriptor: (Lca/uwaterloo/cs846/exp/Shape;)V
  Code:
    0: aload_0
    + ldc "ShapeMain.drawShape"
    + invokestatic DynamicAnalyzer.logCaller
    1: invokeinterface ca/uwaterloo/cs846/exp/Shape.draw:()V, 1
    6: return

// Line (same for Rectangle / Circle)
public void draw()
  descriptor: ()V
  Code:
    + ldc "Line.draw"
    + invokestatic DynamicAnalyzer.logCallee
    0: getstatic java/lang/System.out:Ljava/io/PrintStream;
    3: aload_0
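
The log, execute, reconstruct steps can be tried end to end with a small Python analog of the source-code version. The DynamicAnalyzer class below is a hypothetical stand-in for the helper used in the listings, and the pairing rule (each callee entry is attributed to the most recent caller entry) is a simplifying assumption that holds for this single call site.

```python
class DynamicAnalyzer:
    """Collects caller/callee log entries and rebuilds call edges from them."""

    def __init__(self):
        self.log = []

    def log_caller(self, name):
        self.log.append(("caller", name))

    def log_callee(self, name):
        self.log.append(("callee", name))

    def call_graph(self):
        edges, pending = set(), None
        for kind, name in self.log:
            if kind == "caller":
                pending = name  # remember who is about to make a call
            elif pending is not None:
                edges.add((pending, name))
                pending = None
        return edges


analyzer = DynamicAnalyzer()


class Line:
    def draw(self):
        analyzer.log_callee("Line.draw")  # inserted at the start of the callee


class Circle:
    def draw(self):
        analyzer.log_callee("Circle.draw")


class Rectangle:
    def draw(self):
        analyzer.log_callee("Rectangle.draw")


def draw_shape(shape):
    analyzer.log_caller("ShapeMain.drawShape")  # inserted before the call
    shape.draw()


# Execute the program: this run never draws a Rectangle.
draw_shape(Line())
draw_shape(Circle())

dynamic_edges = analyzer.call_graph()
print(sorted(dynamic_edges))
```

Only edges that were exercised appear, so the result is precise for this run but says nothing about Rectangle.draw; coverage of the inputs determines how complete the dynamic call graph is.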


Other Data


Validation-Related Data > Tests

  • Tests: executable specification for expected behavior of code
    • Regression tests: tests that are executed on every commit to check that existing behavior is not broken by new changes
    • Fuzzing / random tests: randomly/systematically generate inputs to exercise the code
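
The two kinds of tests can be contrasted on the factorial example. This is a minimal sketch in plain Python with assert statements; the function names, the seed, and the input range are arbitrary choices, and the test_ prefix simply follows the naming convention that pytest collects.

```python
import random


def factorial(n):
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1 if n == 0 else n * factorial(n - 1)


def test_regression():
    # Fixed input/output pairs pin down existing behavior.
    assert factorial(0) == 1
    assert factorial(1) == 1
    assert factorial(5) == 120


def test_random_property(trials=100, seed=846):
    # Random inputs checked against a property instead of fixed outputs.
    rng = random.Random(seed)  # seeded so failures are reproducible
    for _ in range(trials):
        n = rng.randint(1, 50)
        assert factorial(n) == n * factorial(n - 1)


test_regression()
test_random_property()
```

The regression test needs a known expected value per input, while the random test needs only a relation that must hold for all inputs, which is what lets it scale to generated data.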

Validation-Related Data > Tests > Frameworks & Datasets

PL       testing frameworks
Java     JUnit, TestNG
Python   pytest
C / C++  GoogleTest, Catch2
JS / TS  Jest, Vitest, Mocha
Go       built-in testing package
Rust     built-in cargo test, proptest

Validation-Related Data > Specs

  • A specification (spec, method contract) states what the program should do
    • pre-conditions
    • post-conditions
    • invariants
    • side-effects
  • Usages: runtime checking / model checking / verification
  • Frameworks: JML, Dafny, KLEE
// Dafny: spec + implementation in one place
method Abs(x: int) returns (y: int)
  ensures y >= 0
  ensures y == x || y == -x
{
  if x < 0 { y := -x; } else { y := x; }
}

// JML: spec written in annotation comments
//@ requires n >= 0;
//@ ensures \result >= 1;
public int factorial(int n) { ... }
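
For the runtime-checking usage, the same factorial contract can be enforced at call time. The sketch below assumes a hypothetical Python decorator named contract; it is not how JML or Dafny work internally (Dafny verifies statically), only an illustration of checking a pre-condition before the call and a post-condition after it.

```python
import functools


def contract(requires, ensures):
    """Wrap a function with a pre-condition and a post-condition check."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args):
            if not requires(*args):
                raise ValueError(f"pre-condition violated for {args}")
            result = func(*args)
            if not ensures(result, *args):
                raise AssertionError(f"post-condition violated: {result}")
            return result
        return wrapper
    return decorate


# Mirrors the JML contract: requires n >= 0; ensures \result >= 1.
@contract(requires=lambda n: n >= 0, ensures=lambda result, n: result >= 1)
def factorial(n):
    return 1 if n == 0 else n * factorial(n - 1)


print(factorial(5))

try:
    factorial(-1)
except ValueError as error:
    print("rejected:", error)
```

Runtime checking only covers the inputs that are actually executed, whereas verification aims to establish the contract for all inputs.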

Natural Language Data > Comments

  • API comments (JavaDoc, docstrings in Python)
    • natural language specification
    • summary
    • @param ≈ pre-conditions
    • @return + @throws ≈ post-conditions
  • Inline comments
  • Natural language <-> code transduction
  • Datasets:
/**
 * Computes n! for non-negative integers.
 *
 * @param n a non-negative integer
 * @return n factorial
 * @throws IllegalArgumentException if n is negative
 */
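public int factorial(int n) {
  // reject inputs outside the documented domain
  if (n < 0) {
    throw new IllegalArgumentException("n must be non-negative");
  }
  // base case
  if (n == 0) return 1;
  return n * factorial(n - 1);
}
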
def factorial(n: int) -> int:
    """Compute n! for non-negative integers.

    Args:
        n: a non-negative integer.
    Returns:
        n factorial.
    Raises:
        ValueError: if n is negative.
    """
    ...
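
To show how API comments map onto the parts of a specification, here is a small sketch that splits the docstring above into its sections. It assumes the Args/Returns/Raises layout used in this example; the helper split_docstring and the SECTION_HEADERS table are hypothetical, and real docstrings need a more tolerant parser.

```python
import inspect


def factorial(n: int) -> int:
    """Compute n! for non-negative integers.

    Args:
        n: a non-negative integer.
    Returns:
        n factorial.
    Raises:
        ValueError: if n is negative.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1 if n == 0 else n * factorial(n - 1)


SECTION_HEADERS = {"Args:": "args", "Returns:": "returns", "Raises:": "raises"}


def split_docstring(func):
    """Group the lines of func's docstring by section."""
    sections = {"summary": []}
    current = "summary"
    for line in inspect.getdoc(func).splitlines():
        stripped = line.strip()
        if stripped in SECTION_HEADERS:
            current = SECTION_HEADERS[stripped]
            sections[current] = []
        elif stripped:
            sections[current].append(stripped)
    return sections


sections = split_docstring(factorial)
print(sections["args"])     # roughly the pre-conditions
print(sections["returns"])  # roughly the post-conditions
print(sections["raises"])
```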

Natural Language Data > Issues & Pull Requests

  • Issue: bug report, feature request
  • Pull request: proposed changes that resolve an issue (code, tests, discussion, …)
  • Platforms: GitHub, JIRA, Bugzilla
  • Mine real-world software development tasks from PRs: SWE-bench
  • Related: code review, AI-generated PRs (MSR'26 Mining Challenge)