replication: prior work -> results in updated context
related to the course in some way, e.g., technique, dataset, application, ideology, etc.
existing project welcomed
Course Logistics > Assessment > Project
Team formation by 06/01 (Mon)
start finding teammates and discussing project ideas now
the team lead should send an email to the instructor with all team members cc-ed
if your team size is not 2–4, contact the instructor in advance to reach an agreement
Project idea/proposal discussions in class 06/09 (Tue)
prepare a 3-minute pitch of your project idea
collect feedback from the instructor and peers
Milestones
Proposal report by 06/12 (Fri): the abstract & introduction sections, outlining the motivation and proposed solution; ~1 page.
Progress report by 07/10 (Fri): complete a couple more sections of the paper, with some preliminary results and findings; 2–4 pages.
Final report by 08/07 (Fri): the complete paper; 4–10 pages.
Course Logistics > AI Usage Policy
Usage of AI assistants for coding and paper writing: Yes, but you are responsible for the quality of the work.
do fact-checking, especially for the areas you are not familiar with;
(coding) maintain good software engineering practices (test your code, follow coding conventions);
(paper) use standard, professional terminology; avoid hallucinated citations.
Round-Table Introductions
Name
Position (department, Masters/PhD, year)
Research interests
Expectations from this course
One interesting fact about your hometown
ML/AI4SE Overview
I will…
Present a brief history of the ML/AI4SE research area in the past decade or so. Disclaimer: this is my biased view constrained by my limited academic life.
Give you some ideas of solved vs. unsolved research problems in the area.
Motivate the next two lectures on SE and ML/AI essentials.
ML/AI4SE Overview > Naturalness of Software
Statistical (n-gram) language modeling of code
Code written in PLs is natural: repetitive and predictable
Code is more "natural" (lower cross-entropy) than English
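A minimal sketch of the cross-entropy measure behind the naturalness claim, assuming a bigram model with add-one smoothing and whitespace-split tokens. These choices, the toy corpus, and the names train_bigram and cross_entropy are illustrative assumptions, not the setup of the original study. Lower values mean the next token is easier to predict.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Collect bigram counts, left-context counts, and the vocabulary."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    return bigrams, contexts, set(tokens)


def cross_entropy(tokens, model):
    """Mean negative log2-probability per predicted token (bits per token)."""
    bigrams, contexts, vocab = model
    v = len(vocab) + 1  # one extra slot so unseen tokens keep nonzero probability
    pairs = list(zip(tokens, tokens[1:]))
    total = 0.0
    for prev, cur in pairs:
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)  # add-one smoothing
        total -= math.log2(p)
    return total / len(pairs)


# Toy corpus (assumption): one loop line repeated, mimicking repetitive code.
line = "for i in range ( n ) : x = x + i".split()
model = train_bigram(line * 20)

familiar = cross_entropy(line * 2, model)         # same token order as training
scrambled = cross_entropy(line[::-1] * 2, model)  # same tokens, unusual order
print(f"familiar: {familiar:.2f} bits/token, scrambled: {scrambled:.2f} bits/token")
```

The familiar sequence should score fewer bits per token than the scrambled one, which is the sense in which repetitive code is "predictable" to a language model.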
ML/AI4SE Overview > CodeBERT
BERT-style encoder, trained on bimodal (code, NL) corpora
Pre-training objectives specific to SE (MLM + replaced token detection)
Feng et al. CodeBERT: a pre-trained model for programming and natural languages. In Findings of EMNLP 2020. https://arxiv.org/abs/2002.08155
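A minimal sketch of the input corruption step in masked language modeling (MLM) on a bimodal (NL, code) pair. It assumes whitespace tokens, a placeholder mask string, and a BERT-style 15% masking rate; BERT's 80/10/10 replacement rule and the replaced token detection objective are omitted. The function name mask_tokens and the example pair are hypothetical.

```python
import random

MASK = "<mask>"  # placeholder mask string (assumption, not a real tokenizer's id)


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, labels); labels are None where no prediction is needed."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            labels.append(tok)   # the model must recover the original token here
        else:
            corrupted.append(tok)
            labels.append(None)  # position ignored by the MLM loss
    return corrupted, labels


# Hypothetical bimodal example: a docstring paired with the function it describes.
nl = "return the larger of two numbers".split()
code = "def larger ( a , b ) : return a if a > b else b".split()

corrupted, labels = mask_tokens(nl + code)
print(" ".join(corrupted))
```

Because both modalities sit in one sequence, a masked code token can be predicted from the NL description and vice versa.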
ML/AI4SE Overview > CodeXGLUE
Transformers, multi-tasking
CodeXGLUE: 10 tasks over 14 datasets, in 4 categories: code-code, code-text, text-code, text-text
Lu et al. CodeXGLUE: a machine learning benchmark dataset for code understanding and generation. In NeurIPS Datasets and Benchmarks 2021. https://arxiv.org/abs/2102.04664
ML/AI4SE Overview > Codex
Scaling up -> Large Language Models
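GPT-3-family model fine-tuned on a large corpus of public GitHub code (a production version powers GitHub Copilot)
Introduced HumanEval, which scores generated code by functional correctness: a sample counts as correct only if it passes the problem's unit tests (reported as pass@k)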
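The Codex paper reports functional correctness as pass@k: the probability that at least one of k sampled programs passes the unit tests. A sketch of its unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass. The function name pass_at_k and the per-problem counts below are illustrative, not results from the paper.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c pass all unit tests."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical (n, c) counts for three problems; the benchmark score is the mean.
results = [(10, 0), (10, 3), (10, 10)]
for k in (1, 5):
    score = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    print(f"pass@{k} = {score:.3f}")
```

Drawing n > k samples and using this estimator gives lower variance than sampling exactly k programs and checking whether any of them passes.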