Machine Learning Essentials
CS846 Machine Learning for Software Engineering — Spring 2026
Pengyu Nie
Agenda
- Goals
- ML4SE process: data processing, training, evaluation
- techniques
- language modeling, transformers
- fine-tuning (LoRA)
- decoding algorithms, agents
- No-Goals
- deep dives into internal working mechanisms (we cover the essentials needed as users of the techniques)
- specific ML/AI libraries and tools: they evolve very fast, so you should research what is the latest yourself
ML4SE Process Overview
- Task: input $x$, output $y$
- Dataset: $D = \{(x_i, y_i)\}$
- Model: training $M_\theta = \mathrm{train}(D_{train})$, inference $y = M_\theta(x)$
- Evaluation: $s = \mathrm{metric}(D_{test}, M_\theta) = \mathrm{Avg}(\mathrm{metric}(y_i, M_\theta(x_i)))$
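The evaluation formula above can be sketched in a few lines. This is only an illustration: `exact_match` stands in for `metric`, and `toy_model` is a hypothetical lookup table standing in for $M_\theta$.

```python
from typing import Callable, Sequence, Tuple

def exact_match(reference: str, prediction: str) -> float:
    """Per-example metric: 1.0 if the prediction equals the reference, else 0.0."""
    return 1.0 if reference.strip() == prediction.strip() else 0.0

def evaluate(model: Callable[[str], str],
             test_set: Sequence[Tuple[str, str]],
             metric: Callable[[str, str], float] = exact_match) -> float:
    """s = Avg(metric(y_i, M(x_i))) over the test set."""
    scores = [metric(y, model(x)) for x, y in test_set]
    return sum(scores) / len(scores)

def toy_model(x: str) -> str:
    """Hypothetical model: answers two known inputs, returns "" otherwise."""
    answers = {"add(1, 2)": "3", "len('ab')": "2"}
    return answers.get(x, "")

D_test = [("add(1, 2)", "3"), ("len('ab')", "2"), ("max(1, 5)", "5")]
score = evaluate(toy_model, D_test)
print(f"exact match = {score:.3f}")
```

Swapping in a different `metric` (e.g., a test pass rate) changes only the per-example function; the averaging stays the same.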
Data Processing
- Data collection (mining software repository) sources:
- GitHub (open-source repositories)
- (raw) datasets mined in prior work
- Data cleaning filters:
- License: permissive [ARR checklist] [NeurIPS checklist]
- Quality: #stars > x, not fork
- Recency: last commit > t, timestamp > t
- Correctness: build/compilation success, tests pass
- Task-specific concerns: input-output mapping, context, …
- quantity vs. quality
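The cleaning filters above could be applied to mined repository metadata roughly as follows. The `Repo` record, its field names, the license allow-list, and the thresholds are all hypothetical choices for illustration, not a prescribed configuration.

```python
from dataclasses import dataclass
from datetime import date

# Assumed allow-list of permissive licenses (illustrative, not exhaustive).
PERMISSIVE = {"MIT", "Apache-2.0", "BSD-3-Clause"}

@dataclass
class Repo:
    name: str
    license: str
    stars: int
    is_fork: bool
    last_commit: date
    builds: bool  # result of an (assumed) earlier build/test check

def keep(repo: Repo, min_stars: int = 10, cutoff: date = date(2024, 1, 1)) -> bool:
    """Apply the license, quality, recency, and correctness filters."""
    return (
        repo.license in PERMISSIVE
        and repo.stars > min_stars
        and not repo.is_fork
        and repo.last_commit > cutoff
        and repo.builds
    )

repos = [
    Repo("a/ok", "MIT", 120, False, date(2025, 3, 1), True),
    Repo("b/fork", "MIT", 500, True, date(2025, 3, 1), True),
    Repo("c/gpl", "GPL-3.0", 900, False, date(2025, 3, 1), True),
    Repo("d/stale", "Apache-2.0", 50, False, date(2019, 6, 1), True),
]
kept = [r.name for r in repos if keep(r)]
print(kept)
```

Stricter thresholds raise quality but shrink the dataset, which is the quantity vs. quality trade-off.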
Model > Language Modeling Examples
Large language model
Model > Language Modeling Definition
- A language model computes either:
- probability of a whole sequence: $P(W) = P(w_1, w_2, \dots, w_T)$
- probability of the next token: $P(w_t \mid w_1, w_2, \dots, w_{t-1})$
- chain rule:
- $P(W) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})$
Model > N-gram Language Model
statistical language model
- Markov assumption: condition only on the previous $n-1$ tokens:
- $P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \dots, w_{t-1})$
- An n-gram is $n$ consecutive tokens; estimate the probabilities from token frequency:
- unigram ($n=1$): $P(w_t)$
- bigram ($n=2$): $P(w_t \mid w_{t-1}) = \dfrac{c(w_{t-1}, w_t)}{c(w_{t-1})}$
- special tokens: <s> begin of sequence, </s> end of sequence
- Does not generalize beyond seen n-grams
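A minimal sketch of the bigram estimate $c(w_{t-1}, w_t) / c(w_{t-1})$ on a toy corpus, without smoothing. The corpus and function names are made up for illustration.

```python
from collections import Counter

def train_bigram(corpus):
    """Count contexts c(w_{t-1}) and pairs c(w_{t-1}, w_t), with <s> and </s> added."""
    context, pair = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        context.update(tokens[:-1])            # every token that has a successor
        pair.update(zip(tokens, tokens[1:]))   # adjacent token pairs
    return context, pair

def bigram_prob(context, pair, prev, word):
    """P(word | prev); 0.0 for anything unseen (no smoothing)."""
    if context[prev] == 0:
        return 0.0
    return pair[(prev, word)] / context[prev]

def sequence_prob(context, pair, sentence):
    """Chain rule under the Markov assumption with n = 2."""
    tokens = ["<s>"] + sentence + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_prob(context, pair, prev, word)
    return p

corpus = [["x", "=", "1"], ["x", "=", "2"], ["y", "=", "1"]]
context, pair = train_bigram(corpus)
p_seen = sequence_prob(context, pair, ["x", "=", "1"])
p_unseen = sequence_prob(context, pair, ["x", "=", "3"])
print(p_seen, p_unseen)
```

The second sequence contains the unseen bigram `("=", "3")`, so its probability is zero. That is the generalization problem noted above.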
Model > Perplexity
- The best language model is one that best predicts the corpus
- usually monitored on train/val sets
- Perplexity: inverse probability normalized by length (lower = better):
- $\mathrm{PP}(W) = P(w_1, \dots, w_T)^{-\frac{1}{T}}$
- Equivalently, the exponential of the cross-entropy
- $H(W) = -\frac{1}{T} \sum_{t=1}^{T} \log P(w_t \mid w_{\lt t})$
- cross-entropy is usually used as the loss function for training the language model
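The two definitions above give the same number, as a small sketch can check. The per-token probabilities are made-up values standing in for $P(w_t \mid w_{\lt t})$ from some model.

```python
import math

def cross_entropy(token_probs):
    """H(W) = -(1/T) * sum_t log P(w_t | w_<t), using natural log."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """PP(W) = exp(H(W))."""
    return math.exp(cross_entropy(token_probs))

def perplexity_direct(token_probs):
    """PP(W) = P(w_1..w_T)^(-1/T), computed from the product directly."""
    return math.prod(token_probs) ** (-1.0 / len(token_probs))

token_probs = [0.5, 0.25, 0.5, 0.25]   # hypothetical model outputs
pp = perplexity(token_probs)
pp_direct = perplexity_direct(token_probs)
print(pp, pp_direct)
```

In practice the log-space version is preferred, because the direct product underflows for long sequences.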
(decoder-only) transformer architecture
- The state-of-the-art language model
- neural network: generalizability, scalability
- evolution: RNN -> attention layer -> transformer
- Key component: self-attention layer
- Typical sizes: ~1B, ~4B, ~7B, ~13B, ~70B, ~300B


- Input X = embeddings from prev layer
- Query: the token currently asking
- Key: the token being compared against
- Value: the information to aggregate
- Output Z = embeddings to next layer (re-weighted with context similarity)
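A single-head, causally masked self-attention layer can be sketched in plain Python. The slides do not spell out the formula; this assumes the standard scaled dot-product form $Z = \mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$. The projection matrices `Wq`, `Wk`, `Wv` are hypothetical (identity matrices in the toy example).

```python
import math

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)                                # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(X, Wq, Wk, Wv):
    """X: one embedding row per token. Returns Z with the same number of rows."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_k = len(K[0])
    Z = []
    for t, q in enumerate(Q):
        # Decoder-only masking: token t only attends to positions 0..t.
        scores = [sum(qi * ki for qi, ki in zip(q, K[j])) / math.sqrt(d_k)
                  for j in range(t + 1)]
        weights = softmax(scores)              # context similarity, sums to 1
        Z.append([sum(w * V[j][c] for j, w in enumerate(weights))
                  for c in range(len(V[0]))])
    return Z

identity = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Z = causal_self_attention(X, identity, identity, identity)
print(Z)
```

Each output row is a weighted average of the value rows seen so far, so the first token can only copy its own value.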
encoder vs. decoder
- decoder-only: standard (thanks to scalability); GPT, Llama, Qwen
- encoder-only: good for embedding/classification; BERT
- encoder-decoder: the original architecture, good for input-output mapping; BART, T5
Bottleneck: $\mathcal{O}(n^2)$ attention computation; techniques to speed it up:
- [Flash Attention], IO-aware exact attention
- KV cache: reuse past keys/values when decoding instead of recomputing
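The KV cache idea can be illustrated by counting how often keys and values are computed during decoding. This is a toy sketch: `project` is a hypothetical stand-in for the key/value projections, and the token embedding itself is used as the query.

```python
import math

calls = {"project": 0}

def project(x):
    """Hypothetical key/value projection of one token; counts its invocations."""
    calls["project"] += 1
    key = [2.0 * xi for xi in x]
    value = [xi + 1.0 for xi in x]
    return key, value

def attend(query, keys, values):
    """Scaled dot-product attention of one query over all keys/values so far."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, values))
            for c in range(len(values[0]))]

def decode_without_cache(X):
    outputs = []
    for t in range(len(X)):
        kv = [project(x) for x in X[: t + 1]]      # re-project the whole prefix
        outputs.append(attend(X[t], [k for k, _ in kv], [v for _, v in kv]))
    return outputs

def decode_with_cache(X):
    keys, values, outputs = [], [], []
    for x in X:
        k, v = project(x)                          # project only the new token
        keys.append(k)
        values.append(v)
        outputs.append(attend(x, keys, values))
    return outputs

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
calls["project"] = 0
slow = decode_without_cache(X)
calls_without_cache = calls["project"]
calls["project"] = 0
fast = decode_with_cache(X)
calls_with_cache = calls["project"]
print(calls_without_cache, calls_with_cache)
```

Both variants produce the same outputs. The cached one projects each token once ($n$ calls) instead of re-projecting every prefix ($n(n+1)/2$ calls), at the cost of keeping all keys and values in memory.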
Mixture of Experts (MoE): route each token to a few expert sub-networks, more parameters at similar inference-time cost [Mixtral]
Speculative decoding: a small draft model proposes next few tokens, the large model verifies them (faster) [Leviathan et al. 2022]
Model > Tokenization
How do (L)LMs tokenize natural language and code?
- Used to be:
- whitespace-based / regex
- unseen tokens = <UNK>
- (for code) CamelCase/snake_case sub-tokenization
- Data-driven approach: learn the best way to tokenize
- Byte-pair encoding (BPE) algorithm
- initialize vocabulary with base tokens (all bytes)
- while |vocabulary| < v:
- find the most frequent adjacent token pair in the corpus
- merge it into a new token and add to the vocabulary
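The BPE training loop above can be sketched as follows. For readability this sketch makes simplifying assumptions: base tokens are characters rather than bytes, the corpus is a list of pre-split words, and merges never cross word boundaries. Ties between equally frequent pairs are broken deterministically, which is an arbitrary choice.

```python
from collections import Counter

def merge_pair(word, pair):
    """Replace every occurrence of the adjacent pair in a word by one merged token."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

def train_bpe(words, vocab_size):
    corpus = Counter(tuple(w) for w in words)           # word (as tokens) -> frequency
    vocab = sorted({ch for word in corpus for ch in word})
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter()
        for word, freq in corpus.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:                                   # nothing left to merge
            break
        best = max(pairs, key=lambda p: (pairs[p], p))  # most frequent adjacent pair
        merges.append(best)
        vocab.append(best[0] + best[1])
        merged = Counter()
        for word, freq in corpus.items():
            merged[merge_pair(word, best)] += freq
        corpus = merged
    return vocab, merges

vocab, merges = train_bpe(["low", "low", "lower", "lowest"], vocab_size=9)
print(merges)
print(vocab)
```

With 7 base characters and a target size of 9, two merges are learned and the frequent substring `low` ends up as a single token.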

Model Training
Training of an LLM has many phases:
| Phase | Goal | Data | Strategy |
|---|---|---|---|
| pre-training | learn generic knowledge | massive raw (unlabelled) corpus | self-supervised learning (language modeling) |
| mid/post-training | improve reasoning | math, code, NL reasoning | supervised learning + reinforcement learning |
| ~ | improve tool use / agent | data with tool use traces | ~ |
| ~ | improve instruction following | data with human labels | reinforcement learning (RLHF) |
| fine-tuning | apply to task | task-specific data | parameter-efficient fine-tuning (PEFT, e.g., LoRA) |
Model Training > LoRA Fine-tuning
- Full fine-tuning (FFT) updates all weights $W$ and stores a full model copy per task
- Parameter-efficient fine-tuning (PEFT): freeze $W$, only update a few parameters $\Delta W$
- LoRA [Hu et al. 2021]
- $\Delta W = \tfrac{\alpha}{r} BA$ with rank $r \ll d$:
- $h = Wx + \Delta W x = Wx + \tfrac{\alpha}{r}\, B A x$
- Only $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ are trained (usually 1-10% of the model size)
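A toy forward pass for $h = Wx + \tfrac{\alpha}{r} BAx$, using plain lists so it stays self-contained. The sizes ($d = 4$, $r = 1$) and all values are made up. Initializing $B$ to zeros follows the LoRA paper and is not stated in the slides.

```python
def matvec(M, x):
    """Matrix-vector product with M given as a list of rows."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """h = W x + (alpha / r) * B (A x); W is frozen, only A and B would be trained."""
    r = len(A)                                   # A has shape r x d
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

d, r, alpha = 4, 1, 2.0
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]   # frozen weights
A = [[1.0, 1.0, 1.0, 1.0]]                       # r x d
B_init = [[0.0] for _ in range(d)]               # d x r, zeros: no change at start
B_trained = [[0.5], [0.0], [0.0], [0.0]]         # hypothetical values after training
x = [1.0, 2.0, 3.0, 4.0]

h_init = lora_forward(W, A, B_init, x, alpha)
h_trained = lora_forward(W, A, B_trained, x, alpha)
trainable, full = d * r + r * d, d * d
print(h_init, h_trained, trainable, full)
```

At this toy size the saving is small (8 vs. 16 parameters), but the trainable count grows as $2dr$ while the full matrix grows as $d^2$.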

Model Inference
- Prompt engineering
- Decoding algorithms: greedy, sampling
- Test-time scaling: “learning” without updating model parameters:
- in-context learning, few-shot learning
- retrieval-augmented generation (RAG)
- chain-of-thought (CoT)
- self-consistency
- Agentic harness (e.g., [mini-SWE-agent])
Model Inference > Greedy Decoding

Greedy decoding
- Pick the highest-probability token at each step
- $w_t = \arg\max_{w} P(w \mid w_{\lt t})$
- Deterministic (theoretically) given the same input
- Locally optimal ≠ globally optimal; can miss higher-probability sequences and become repetitive
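A small sketch of greedy decoding over a hypothetical next-token table (`TABLE`, conditioned only on the previous token to keep it short). The probabilities are chosen so that the greedy path is not the most probable sequence.

```python
# Hypothetical next-token distributions, keyed by the previous token.
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.4, "y": 0.3, "z": 0.3},
    "b": {"w": 1.0},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
    "w": {"</s>": 1.0},
}

def next_token_probs(tokens):
    """Stand-in for the model: P(w | w_<t), here depending on the last token only."""
    return TABLE[tokens[-1]]

def greedy_decode(prompt, max_new_tokens=10, eos="</s>"):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)     # argmax over the vocabulary
        tokens.append(best)
        if best == eos:
            break
    return tokens

def sequence_prob(tokens):
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= TABLE[prev][cur]
    return p

greedy = greedy_decode(["<s>"])
alternative = ["<s>", "b", "w", "</s>"]
print(greedy, sequence_prob(greedy))
print(alternative, sequence_prob(alternative))
```

Greedy picks `a` first because 0.6 > 0.4, but the full sequence through `b` has probability 0.4 versus 0.24 for the greedy one.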
Model Inference > Sampling Decoding
Sampling decoding (often combined with top-k or top-p, aka nucleus, sampling)
- Sample the next token randomly from output distribution $P(w \mid w_{\lt t})$
- more diverse, human-like output
- Restrict the candidate set: top-k (k most likely) or top-p (smallest set with cumulative probability $\ge p$)
- Temperature $T$ controls randomness (higher = more random), applied to the logits $o_i$: $p_i = \dfrac{\exp(o_i / T)}{\sum_j \exp(o_j / T)}$
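Temperature, top-k, and top-p can be sketched as three small functions applied before sampling. The logits and probabilities are made-up values, and real libraries may differ in details such as the order of the filters.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """p_i = exp(o_i / T) / sum_j exp(o_j / T)."""
    scaled = [o / temperature for o in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    """Zero out everything outside `keep` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens with cumulative probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)

def sample(probs, rng):
    """Draw one token index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]
after_top_k = top_k_filter(probs, 2)
after_top_p = top_p_filter(probs, 0.9)
rng = random.Random(0)                       # fixed seed for repeatability
draws = [sample(after_top_k, rng) for _ in range(100)]
print(after_top_k, after_top_p)
```

Lower temperatures concentrate probability on the top token, and in the limit $T \to 0$ sampling behaves like greedy decoding.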

Evaluation > Data Split
- Train/val/test splits
- Validation (aka development) set: make design decisions, early-stop training
- Test set: held-out for final evaluation
- It is wrong to make design decisions based on test set performance
- Prevent data leakage
- The model will be eventually deployed to production data
- can be partially or completely unseen -> cross-project split
- most likely future data -> time-segmented split
- For evaluation scores to match production performance:
- For evaluation scores to match production performance:
$D_{test}:(D_{train}+D_{val}) \approx D_{production}:(D_{train}+D_{val}+D_{test})$
- For making correct design decisions:
$D_{val}:D_{train} \approx D_{test}:(D_{train}+D_{val})$
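The two split strategies above can be sketched as follows. The example records and their fields (`project`, `timestamp` as a year) are hypothetical, and the cut-off years and project assignments are arbitrary.

```python
def time_segmented_split(examples, val_start, test_start):
    """Train on the oldest data, validate on newer data, test on the newest."""
    train = [e for e in examples if e["timestamp"] < val_start]
    val = [e for e in examples if val_start <= e["timestamp"] < test_start]
    test = [e for e in examples if e["timestamp"] >= test_start]
    return train, val, test

def cross_project_split(examples, val_projects, test_projects):
    """Keep every project entirely inside one split, so test projects are unseen."""
    train = [e for e in examples
             if e["project"] not in val_projects and e["project"] not in test_projects]
    val = [e for e in examples if e["project"] in val_projects]
    test = [e for e in examples if e["project"] in test_projects]
    return train, val, test

examples = [
    {"project": "p1", "timestamp": 2021}, {"project": "p1", "timestamp": 2023},
    {"project": "p2", "timestamp": 2022}, {"project": "p2", "timestamp": 2024},
    {"project": "p3", "timestamp": 2025}, {"project": "p3", "timestamp": 2020},
]

t_train, t_val, t_test = time_segmented_split(examples, val_start=2023, test_start=2025)
p_train, p_val, p_test = cross_project_split(examples, {"p2"}, {"p3"})
print(len(t_train), len(t_val), len(t_test))
print(len(p_train), len(p_val), len(p_test))
```

A plain random split would mix projects and time periods across the splits, which is one common source of data leakage.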
Evaluation > Metrics
- Similarity-based metrics: Exact match, BLEU, CodeBLEU, F1, …
- require ground truth (developer-written $y$)
- may not be the only correct answer
- may be wrong (related to data processing quality)
- lack “semantic” understanding; one workaround: embedding similarity
- Execution-based metrics: build success rate, test pass rate (Pass@k), coverage, …
- some require executable artifacts (tests)
- may need appropriate hardware/OS environment
- may be wrong
- not applicable to natural language artifacts (comments)
- Human validation [related online guideline]
- cost vs. benefit: case study, statistically sampled subset
- measure inter-rater agreement (e.g., Cohen’s Kappa)
- Can LLMs replace human validators? [Ahmed et al. 2024]
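Two of the quantities mentioned above can be computed in a few lines. For Pass@k this sketch assumes the commonly used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, with $n$ samples per problem of which $c$ pass; the slides do not specify an estimator. Cohen's Kappa is $\kappa = (p_o - p_e) / (1 - p_e)$. All numbers are made up.

```python
import math
from collections import Counter

def pass_at_k(n, c, k):
    """Probability that at least one of k samples (drawn from n, c correct) passes."""
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement.

    Undefined (division by zero) when the chance agreement p_e is exactly 1.
    """
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n      # observed agreement
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum(count_a[l] * count_b[l] for l in labels) / (n * n)  # chance agreement
    return (p_o - p_e) / (1.0 - p_e)

p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
rater_a = [1, 1, 0, 0, 1, 0, 1, 1]   # hypothetical labels: 1 = correct, 0 = incorrect
rater_b = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(rater_a, rater_b)
print(p1, p5, kappa)
```

Here the raters agree on 7 of 8 items (0.875), but because half of that agreement is expected by chance, kappa is lower than the raw agreement.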
Evaluation > Variance Control
- Model training and inference are stochastic processes
- can be partially controlled by setting random seeds
- but hard to rule out all the hardware/library effects (greedy decoding can even be stochastic!)
- Repeat each experiment $k$ times (usually 3 or 5)
- use different random seeds
- report average score
- perform statistical significance testing to compare between methods
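One way to follow the steps above with only the standard library is to report mean and standard deviation over seeds and compare two methods with an exact paired sign-flip permutation test. The test choice is an assumption (the slides do not name one), it requires both methods to be run with the same seeds, and the scores are made up.

```python
import itertools
import statistics

def summarize(scores):
    """Mean and sample standard deviation over repeated runs."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_permutation_test(scores_a, scores_b):
    """Exact two-sided sign-flip test on per-seed differences; returns the p-value."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:      # tolerance for floating-point noise
            extreme += 1
    return extreme / total

# Hypothetical scores of two methods over the same k = 5 random seeds.
method_a = [0.62, 0.64, 0.61, 0.65, 0.63]
method_b = [0.58, 0.60, 0.59, 0.61, 0.60]

mean_a, std_a = summarize(method_a)
mean_b, std_b = summarize(method_b)
p_value = paired_permutation_test(method_a, method_b)
print(f"A: {mean_a:.3f} +/- {std_a:.3f}, B: {mean_b:.3f} +/- {std_b:.3f}, p = {p_value}")
```

Even though A beats B on every seed, the smallest p-value this test can give with 5 runs is 2/32 = 0.0625, so very few repetitions limit what a significance test can show.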