Machine Learning Essentials

CS846 Machine Learning for Software Engineering — Spring 2026

Pengyu Nie


Agenda


ML4SE Process Overview


Data Processing


Model > Language Modeling Examples

Large language model


Model > Language Modeling Definition


Model > N-gram Language Model

statistical language model


Model > Perplexity


Model > Transformer

(decoder-only) transformer architecture

  • The state-of-the-art language model
    • neural network: generalizability, scalability
    • evolution: RNN -> attention layer -> ~
  • Key component: self-attention layer
  • Typical sizes: ~1B, ~4B, ~7B, ~13B, ~70B, ~300B
Image modified from Vaswani et al. Attention is all you need. In NeurIPS 2017. https://arxiv.org/abs/1706.03762

Model > Transformer > Self-Attention

  • Input X = embeddings from prev layer
  • Query: the token currently asking
  • Key: the token being compared against
  • Value: the information to aggregate
  • Output Z = embeddings to next layer
    (re-weighted with context similarity)
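
The query/key/value roles above can be sketched as single-head scaled dot-product attention. This is a minimal illustration in plain Python, not the slide's reference implementation: the projection matrices `W_q`, `W_k`, `W_v`, the helper names, and the causal mask (assumed here because the deck focuses on decoder-only models) are all assumptions for the example.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_q, W_k, W_v, causal=True):
    """Single-head self-attention (hypothetical helper).

    X: n x d input embeddings from the previous layer.
    W_q, W_k, W_v: d x d_k projection matrices (assumed names).
    Returns (Z, weights): Z is n x d_k, weights is n x n.
    """
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)  # scores[i][j] = similarity of query i with key j
    weights = []
    for i, row in enumerate(scores):
        scaled = [s / math.sqrt(d_k) for s in row]
        if causal:
            # Decoder-only assumption: token i attends only to positions <= i.
            scaled = [s if j <= i else float("-inf") for j, s in enumerate(scaled)]
        weights.append(softmax(scaled))
    Z = matmul(weights, V)  # each output row is a weighted sum of value rows
    return Z, weights

# Toy usage: identity projections, three 2-dimensional token embeddings.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
Z, A = self_attention(X, I2, I2, I2)
```

Each row of `A` sums to 1, so every output embedding in `Z` is a convex combination of the value vectors it is allowed to see.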

Model > Transformer > Architecture Variants


Model > Tokenization

How do (L)LMs tokenize natural language and code?

  • Used to be:
    • whitespace-based / regex
    • unseen tokens = <UNK>
    • (for code) CamelCase/snake_case sub-tokenization
  • Data-driven approach: learn the best way to tokenize
  • Byte-pair encoding (BPE) algorithm
    • initialize vocabulary with base tokens (all bytes)
    • while |vocabulary| < v:
      • find the most frequent adjacent token pair in the corpus
      • merge it into a new token and add to the vocabulary
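
The BPE loop above can be sketched as follows. This is a simplified illustration under stated assumptions: each string in the corpus is treated as one token sequence (no whitespace or regex pre-tokenization), ties between equally frequent pairs go to the first pair encountered, and `train_bpe` is a hypothetical name rather than any library's API.

```python
from collections import Counter

def train_bpe(corpus, vocab_size):
    """Learn BPE merges from a list of strings (hypothetical helper).

    Returns (vocab, merges): vocab is a list of byte strings,
    merges is the ordered list of merged token pairs.
    """
    # Initialize the vocabulary with base tokens: all 256 single bytes.
    vocab = [bytes([b]) for b in range(256)]
    # Represent each distinct string as a tuple of single-byte tokens.
    words = Counter(
        tuple(bytes([b]) for b in text.encode("utf-8")) for text in corpus
    )
    merges = []
    while len(vocab) < vocab_size:
        # Count every adjacent token pair, weighted by string frequency.
        pairs = Counter()
        for tokens, freq in words.items():
            for pair in zip(tokens, tokens[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # most frequent pair
        merged = best[0] + best[1]
        merges.append(best)
        vocab.append(merged)
        # Rewrite the corpus with the pair replaced by the new token.
        new_words = Counter()
        for tokens, freq in words.items():
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return vocab, merges

# Toy usage: one merge on top of the 256 base bytes.
vocab, merges = train_bpe(["aaab", "aab"], vocab_size=257)
```

Because the base vocabulary covers all bytes, any input can be encoded without an `<UNK>` token; the learned merges only make frequent sequences shorter.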

Model Training

Training of an LLM has many phases:

| Phase | Goal | Data | Strategy |
|---|---|---|---|
| pre-training | learn generic knowledge | massive raw (unlabelled) corpus | semi-supervised learning (language modeling) |
| mid/post-training | improve reasoning | math, code, NL reasoning | supervised learning + reinforcement learning |
| ~ | improve tool use / agent | data with tool use traces | ~ |
| ~ | improve instruction following | data with human labels | reinforcement learning (RLHF) |
| fine-tuning | apply to task | task-specific data | parameter-efficient fine-tuning (PEFT, e.g., LoRA) |

Model Training > LoRA Fine-tuning

  • Full fine-tuning (FFT) updates all weights $W$ and stores a full model copy per task
  • Parameter-efficient fine-tuning (PEFT): freeze $W$, only update a few parameters $\Delta W$
  • LoRA [Hu et al. 2021]
    • $\Delta W = BA$ with rank $r \ll d$:
    • $h = Wx + \Delta W x = Wx + \tfrac{\alpha}{r}\, B A x$
    • Only $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ are trained
      (a small fraction of the model size, often around 1% or less, depending on $r$ and which layers are adapted)
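
The LoRA forward pass above can be sketched in plain Python for a single square weight matrix. This is an illustration under assumptions, not a training recipe: the helper names are hypothetical, and the initialization (small random $A$, zero $B$, so $\Delta W = 0$ at the start) follows the scheme described by Hu et al.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def init_lora(d, r, seed=0):
    """Create A (r x d, small random) and B (d x r, zeros).

    With B = 0 the update Delta W = BA is zero before any training step.
    """
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.01) for _ in range(d)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d)]
    return A, B

def lora_forward(W, A, B, x, alpha):
    """Compute h = W x + (alpha / r) * B (A x); W stays frozen."""
    r = len(A)  # rank = number of rows of A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # never materializes the d x d matrix BA
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def trainable_fraction(d, r):
    """Trainable LoRA parameters (A and B) relative to one d x d matrix W."""
    return (2 * d * r) / (d * d)

# Toy usage: d = 4, r = 2, identity W.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
A, B = init_lora(d=4, r=2)
h = lora_forward(W, A, B, [1.0, 2.0, 3.0, 4.0], alpha=16)
```

Computing $B(Ax)$ instead of $(BA)x$ keeps the extra cost at $O(dr)$ per input, and for a square matrix the trainable fraction simplifies to $2r/d$.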

Model Inference


Model Inference > Greedy Decoding

Greedy decoding

  • Pick the highest-probability token at each step
    • $w_t = \arg\max_{w} P(w \mid w_{\lt t})$
  • Deterministic in theory given the same input (in practice, floating-point and hardware nondeterminism can change the output)
  • Locally optimal ≠ globally optimal; can miss higher-probability sequences and become repetitive
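
A small sketch of greedy decoding on a toy model shows how the locally best choice can miss a higher-probability sequence. The bigram table `TOY`, its probabilities, and the function names are made up for illustration; a real model would condition on the whole prefix.

```python
# Hypothetical toy model: next-token distribution depends only on the last token.
TOY = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.9, "cat": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def toy_model(tokens):
    """Return P(w | w_<t) as a dict for the given prefix."""
    return TOY[tokens[-1]]

def greedy_decode(next_token_probs, prompt, eos, max_new_tokens=20):
    """Append the highest-probability token at each step until eos."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)  # arg max over the vocabulary
        tokens.append(best)
        if best == eos:
            break
    return tokens

def sequence_prob(tokens):
    """Probability of a full sequence under the toy bigram model."""
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= TOY[prev][cur]
    return p

greedy = greedy_decode(toy_model, ["<s>"], eos="</s>")
alternative = ["<s>", "a", "dog", "</s>"]
```

Here greedy decoding commits to "the" at the first step, yet `sequence_prob(alternative)` is larger than `sequence_prob(greedy)`, because the less likely first token "a" leads to a much more confident continuation.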

Model Inference > Sampling Decoding

Sampling decoding (e.g., top-k sampling; top-p, a.k.a. nucleus, sampling)

  • Sample the next token randomly from output distribution $P(w \mid w_{\lt t})$
    • more diverse, human-like output
  • Restrict the candidate set: top-k (k most likely) or top-p (smallest set with cumulative probability $\ge p$)
  • Temperature controls randomness: $p_i = \dfrac{\exp(o_i / T)}{\sum_j \exp(o_j / T)}$
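
The temperature formula and the top-k / top-p restrictions above can be combined in a short sketch. The function names and the choice to apply top-k before top-p are assumptions for this example; libraries differ in ordering and tie-breaking details.

```python
import math
import random

def softmax_with_temperature(logits, T=1.0):
    """p_i = exp(o_i / T) / sum_j exp(o_j / T); requires T > 0."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((o - m) / T) for o in logits]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Restrict the candidate set, then renormalize.

    Returns a dict {token_index: probability} over the kept candidates.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]  # keep the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:  # smallest prefix reaching mass p
                break
        order = kept
    total = sum(probs[i] for i in order)
    return {i: probs[i] / total for i in order}

def sample_next(logits, T=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token index from the restricted, renormalized distribution."""
    probs = softmax_with_temperature(logits, T)
    candidates = filter_top_k_top_p(probs, top_k, top_p)
    ids = list(candidates)
    return rng.choices(ids, weights=[candidates[i] for i in ids], k=1)[0]

# Toy usage with a seeded generator for reproducibility.
token = sample_next([2.0, 1.0, 0.5, -1.0], T=0.8, top_k=3, top_p=0.9,
                    rng=random.Random(0))
```

Lower $T$ sharpens the distribution toward the arg max (approaching greedy decoding), while higher $T$ flattens it; `top_k=1` reduces sampling to greedy decoding exactly.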

Evaluation > Data Split


Evaluation > Metrics


Evaluation > Variance Control