Advanced Topics in Software Engineering: Machine Learning for Software Engineering - CS846 Sec 001, Fall 2024 (Term 1249) #
About #
CS846-ML4SE is a graduate-level seminar course on the application of machine learning and natural language processing to software engineering. We will mainly cover three topics: (1) language modeling for code, (2) mining software repositories (MSR) with program analysis, and (3) automating software engineering tasks with the aforementioned techniques.
Class will be held in person on Mondays from 9:00 to 11:50am in DC 2585, moving to DC 2568 starting Nov 18. Attendance at all classes is mandatory.
The course will contain a mix of paper discussions and lectures. During the first half of each class, we will discuss 1–2 papers from the reading list; each paper discussion will be led by two students. During the second half, I will give a lecture on the topic of the week, usually in the form of coding demos.
The course also includes a project. Students should work in teams of 2–4 (larger teams will be held to higher expectations; a team of 1 is possible only if the project is one that you are already actively working on, e.g., towards your thesis). Each team is expected to conduct a research project in the area of ML4SE and complete a short-paper-level report by the end of the term. For examples of what research projects at this level should look like, please refer to SE conferences’ tool/data tracks (e.g., in MSR'24), mining challenges (e.g., in MSR'24), student research competitions (e.g., in ICSE'24), or new idea tracks (e.g., in ICSE'24).
Contact #
Instructor: Pengyu Nie - pynie@uwaterloo.ca (office hours by appointment)
We will use the Teams chat group for course discussions and announcements. Project deliverables are submitted by email.
Course Schedule #
The syllabus is tentative and subject to change. The reading list will be posted soon.
The demos presented in class can be found here.
Assessment #
All deliverables are due at 11:59pm Eastern Time on the respective day. Late submissions will be graded only on a case-by-case basis.
Task | Due Date | Weight |
---|---|---|
Attendance | - | 20% |
Paper discussion lead | - | 20% |
Project: team formation | Sep 25 (Wed) | - |
Project: proposal report | Oct 11 (Fri) | 10% |
Project: progress report | Nov 08 (Fri) | 20% |
Project: final report | Dec 05 (Thu) | 30% |
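As a rough illustration of how the weights in the table combine, here is a minimal sketch. It assumes each graded component is scored on a 0–100 scale and that the final grade is a plain weighted sum; the syllabus does not state the exact formula, and the function name and example scores below are hypothetical.

```python
# Weights (in percent) copied from the assessment table above.
# Team formation carries no weight, so it is omitted.
WEIGHTS = {
    "attendance": 20,
    "paper_discussion_lead": 20,
    "proposal_report": 10,
    "progress_report": 20,
    "final_report": 30,
}


def final_grade(scores):
    """Weighted sum of per-component scores, each assumed to be on a 0-100 scale."""
    # Sanity check: the listed weights should account for the whole grade.
    assert sum(WEIGHTS.values()) == 100
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example_scores = {
    "attendance": 100,
    "paper_discussion_lead": 90,
    "proposal_report": 80,
    "progress_report": 85,
    "final_report": 90,
}

print(final_grade(example_scores))  # 90.0
```

Under this assumption, the three project reports together account for 60% of the grade, with the final report alone worth half of that.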
Project #
Team formation (due Sep 25) #
- Please submit your team composition via email to the instructor.
- The email should include the names of all team members. If the team has only 1 member or more than 4 members, please include a justification (unless you have talked to the instructor in person / via email about it).
- You can optionally include a tentative title/abstract for your project, so that the instructor can provide early feedback.
Proposal Report (due Oct 11) #
- Please submit your proposal report as a pdf file via email to the instructor.
- The proposal report should contain a rough outline of what the project is about and how you plan to carry it out. It should be similar to the first 1–2 pages of a conference/journal paper.
- The pdf should include:
  - title, authors (names and emails), abstract, introduction;
  - optionally other sections (e.g., related work) where you see fit;
  - all of these except for authors can be updated later in the project.
- The pdf should be 1–2 pages long and in the ACM sigconf (double-column) template by default. However, if you are targeting a specific conference/journal submission, feel free to use a different template and adjust the number of pages accordingly (e.g., 1 ACM page = 1.5 IEEE pages = 2 ACL pages).
Progress report (due Nov 08) #
- Please submit your progress report as a pdf file via email to the instructor.
- The progress report should build on your proposal report. Compared to the proposal report, you should have made progress on at least one of the following aspects: technique, dataset, and/or experiments.
- The pdf should include:
  - title, authors (names and emails), abstract, introduction (copied/updated from the proposal report);
  - technique, dataset, experiments; one of them should be detailed and the rest can be left as high-level outlines at this point;
  - optionally other sections (e.g., related work) where you see fit.
- The pdf should be 2–4 pages long and in the ACM sigconf (double-column) template by default. However, if you are targeting a specific conference/journal submission, feel free to use a different template and adjust the number of pages accordingly (e.g., 1 ACM page = 1.5 IEEE pages = 2 ACL pages).
Final report (due Dec 05) #
- Please submit your final report as a pdf file via email to the instructor.
- The final report should build on your progress report. It will be reviewed as a complete, self-contained conference/journal paper.
- The pdf should include all sections that a paper typically contains. Much of the content can be copied/updated from your previous reports.
- The pdf should be 4–10 pages long and in the ACM sigconf (double-column) template by default. However, if you are targeting a specific conference/journal submission, feel free to use a different template and adjust the number of pages accordingly (e.g., 1 ACM page = 1.5 IEEE pages = 2 ACL pages).
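The page-count adjustment mentioned for each report can be worked out as follows. This is a minimal sketch that simply scales the ACM page ranges by the example ratio given above (1 ACM page = 1.5 IEEE pages = 2 ACL pages). The function and dictionary names are hypothetical, and the ratio is only the syllabus's rule of thumb; for other templates, check with the instructor.

```python
from fractions import Fraction

# Pages in each template equivalent to one ACM sigconf page,
# taken from the example ratio in the report guidelines.
PAGES_PER_ACM_PAGE = {
    "ACM": Fraction(1),
    "IEEE": Fraction(3, 2),
    "ACL": Fraction(2),
}

# Page ranges stated for each report, in ACM sigconf pages.
ACM_PAGE_RANGES = {
    "proposal": (1, 2),
    "progress": (2, 4),
    "final": (4, 10),
}


def page_range(report, template):
    """Scale a report's ACM page range to the given template."""
    factor = PAGES_PER_ACM_PAGE[template]
    low, high = ACM_PAGE_RANGES[report]
    return float(low * factor), float(high * factor)


print(page_range("proposal", "ACL"))   # (2.0, 4.0)
print(page_range("progress", "IEEE"))  # (3.0, 6.0)
print(page_range("final", "IEEE"))     # (6.0, 15.0)
```

For example, a final report written in an IEEE template would be roughly 6–15 pages under this conversion.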
Reading List #
Sep 16: language modeling and n-gram models #
- Big code != big vocabulary: open-vocabulary models for source code
- Capturing Structural Locality in Non-parametric Language Models
- On the Localness of Software
- Mining Source Code Repositories at Massive Scale Using Language Modeling
Sep 23: sequence-to-sequence models and transformers #
- Empirical study of transformers for source code
- Retrieval Augmented Code Generation and Summarization
- Long-Range Modeling of Source Code Files with eWASH: Extended Window Access by Syntax Hierarchy
- Synchromesh: Reliable code generation from pre-trained language models
Sep 30: large language models for code #
- Show Your Work: Scratchpads for Intermediate Computation with Language Models
- A Static Evaluation of Code Completion by Large Language Models
- Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code?
- Traces of Memorisation in Large Language Models for Code
Oct 07: software engineering datasets and metrics #
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
- Evaluating Code Summarization Techniques: A New Metric and an Empirical Characterization
- ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code
- On the Evaluation of Large Language Models in Unit Test Generation
Oct 21: build systems essentials and parsing #
- A Syntactic Neural Model for General-Purpose Code Generation
- CODIT: Code Editing with Tree-Based Neural Models
- Less is More? An Empirical Study on Configuration Issues in Python PyPI Ecosystem
- Automatically Resolving Dependency-Conflict Building Failures via Behavior-Consistent Loosening of Library Version Constraints
Oct 28: static analysis #
- Data-Driven Evidence-Based Syntactic Sugar Design
- DLInfer: Deep Learning with Static Slicing for Python Type Inference
- Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context
- CONCORD: Clone-aware Contrastive Learning for Source Code
Nov 04: dynamic analysis #
- TRACED: Execution-aware Pre-training for Source Code
- Predictive Program Slicing via Execution Knowledge-Guided Dynamic Dependence Learning
- Code-Aware Prompting: A study of Coverage Guided Test Generation in Regression Setting using LLM
- Blended, Precise Semantic Program Embeddings
Nov 11: bug detection and localization #
- Snopy: Bridging Sample Denoising with Causal Graph Learning for Effective Vulnerability Detection
- The Potential of One-Shot Failure Root Cause Analysis: Collaboration of the Large Language Model and Small Classifier
- A Deep Dive into Large Language Models for Automated Bug Localization and Repair
- Towards Better Graph Neural Network-Based Fault Localization through Enhanced Code Representation
Nov 18: code translation #
- On the Evaluation of Neural Code Translation: Taxonomy and Benchmark
- Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code
- Leveraging Automated Unit Tests for Unsupervised Code Translation
- Code Translation with Compiler Representations
- Exploring and Unleashing the Power of Large Language Models in Automated Code Translation
- Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages
Nov 25: program comprehension #
- Natural Language Outlines for Code: Literate Programming in the LLM Era
- Automating Code Review Activities by Large-Scale Pre-training
- CodeAgent: Autonomous Communicative Agents for Code Review
- Automatic Semantic Augmentation of Language Model Prompts (for Code Summarization)
Dec 02: test generation #
- Make LLM a Testing Expert: Bringing Human-like Interaction to Mobile GUI Testing via Functionality-aware Decisions
- Large Language Models are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models
- An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation
- CAT-LM: Training Language Models on Aligned Code And Tests
- ChatGPT vs SBST: A Comparative Assessment of Unit Test Suite Generation
- CodaMosa: Escaping Coverage Plateaus in Test Generation with Pre-trained Large Language Models
Acknowledgements #
Administrative Notes #
Generative AI #
Generative artificial intelligence (GenAI) trained using large language models (LLMs) or other methods to produce text, images, music, or code, like ChatGPT, DALL-E, or GitHub Copilot, may be used for assignments in this class with proper documentation, citation, and acknowledgement. Recommendations for how to cite GenAI in student work at the University of Waterloo may be found through the Library. Please be aware that generative AI is known to falsify references to other work and may fabricate facts and inaccurately express ideas. GenAI generates content based on the input of other human authors and may therefore contain inaccuracies or reflect biases.
In addition, you should be aware that the legal/copyright status of generative AI inputs and outputs is unclear. Exercise caution when using large portions of content from AI sources, especially images. More information is available from the Copyright Advisory Committee.
You are accountable for the content and accuracy of all work you submit in this class, including any supported by generative AI.
Territorial Acknowledgement #
The University of Waterloo acknowledges that much of our work takes place on the traditional territory of the Neutral, Anishinaabeg and Haudenosaunee peoples. Our main campus is situated on the Haldimand Tract, the land granted to the Six Nations that includes six miles on each side of the Grand River. Our active work toward reconciliation takes place across our campuses through research, learning, teaching, and community building, and is centralized within the Office of Indigenous Relations.
Inclusive Teaching-Learning Spaces #
The University of Waterloo values the diverse and intersectional identities of its students, faculty, and staff. The University regards equity and diversity as an integral part of academic excellence and is committed to accessibility for all. We consider our classrooms, online learning, and community spaces to be places where we all will be treated with respect, dignity, and consideration. We welcome individuals of all ages, backgrounds, beliefs, ethnicities, genders, gender identities, gender expressions, national origins, religious affiliations, sexual orientations, abilities, and other visible and nonvisible differences. We are all expected to contribute to a respectful, welcoming, and inclusive teaching-learning environment. Any member of the campus community who has experienced discrimination at the University is encouraged to seek guidance from the Office of Equity, Diversity, Inclusion & Anti-racism (EDI-R) via email at equity@uwaterloo.ca. The Sexual Violence Prevention & Response Office (SVPRO) supports students at UWaterloo who have experienced, or have been impacted by, sexual violence and gender-based violence. This includes those who experienced harm and those who are supporting others who experienced harm. SVPRO can be contacted at svpro@uwaterloo.ca.
Religious & Spiritual Observances #
The University of Waterloo has a duty to accommodate religious and spiritual observances under the Ontario Human Rights Code. Please inform the instructor at the beginning of term if special accommodation needs to be made for religious observances that are not otherwise accounted for in the scheduling of classes and assignments. Consult with your instructor(s) within two weeks of the announcement of the due date for which accommodation is being sought.
Respectful Communication and Pronouns #
Communications with instructor(s) and TAs should go through the recommended channels for the course (e.g., email, LEARN, Piazza, or Teams). Please use your UW email address and include an academic signature with your full name, program, and student ID. We encourage you to include your pronouns to facilitate respectful communication (e.g., he/him; she/her; they/them). You can update your chosen/preferred name at WatIAM. You can update your pronouns in Quest.
Mental Health and Wellbeing Resources #
If you are facing challenges impacting one or more courses, contact your academic advisor, Associate Chair Undergraduate, or the Director of your academic program. Mental health is a serious issue for everyone and can affect your ability to do your best work. We encourage you to seek out mental health and wellbeing support when needed. The Faculty of Engineering Wellness Program has programming and resources for undergraduate students. For counselling (individual or group) reach out to Campus Wellness and Counselling Services. Counselling Services is an inclusive, non-judgmental, and confidential space for anyone to seek support. They offer confidential counselling for a variety of areas including anxiety, stress management, depression, grief, substance use, sexuality, relationship issues, and much more.
Intellectual Property #
Be aware that this course contains the intellectual property of its instructor, TA, and/or the University of Waterloo. Intellectual property includes items such as:
- Lecture content, spoken and written (and any audio/video recording thereof).
- Lecture handouts, presentations, and other materials prepared for the course (e.g., PowerPoint slides).
- Questions or solution sets from various types of assessments (e.g., assignments, quizzes, tests, final exams).
- Work protected by copyright (e.g., any work authored by the instructor or TA or used by the instructor or TA with permission of the copyright owner).
Course materials and the intellectual property contained therein are used to enhance a student’s educational experience. However, sharing this intellectual property without the intellectual property owner’s permission is a violation of intellectual property rights. For this reason, it is necessary to ask the instructor, TA and/or the University of Waterloo for permission before uploading and sharing the intellectual property of others online (e.g., to an online repository).
Permission from an instructor, TA or the University is also necessary before sharing the intellectual property of others from completed courses with students taking the same/similar courses in subsequent terms/years. In many cases, instructors might be happy to allow distribution of certain materials. However, doing so without express permission is considered a violation of intellectual property rights and academic integrity.
Please alert the instructor if you become aware of intellectual property belonging to others (past or present) circulating, either through the student body or online.
Continuity Plan - Fair Contingencies for Unforeseen Circumstances (e.g., resurgence of Covid) #
In the event of emergencies or highly unusual circumstances, the instructor will collaborate with the Department/Faculty to find reasonable and fair solutions that respect rights and workloads of students, staff, and faculty. This may include modifying content delivery, course topics and/or assessments and/or weight and/or deadlines with due and fair notice to students. Substantial changes after the first week of classes require the approval of the Associate Dean, Undergraduate Studies.
Declaring absences (undergraduate students and/or courses only) #
Regardless of the process used to declare an absence, students are responsible for reaching out to their instructors as soon as possible. The course instructor will determine how missed course components are accommodated. Self-declared absences (for COVID-19 and short-term absences up to 2 days) must be submitted through Quest. Absences requiring documentation (e.g., Verification of Illness Form, bereavement, etc.) are to be uploaded by completing the form on the VIF System. The UW Verification of Illness form, completed by a health professional, is the only acceptable documentation for an absence due to illness. Do not send documentation to your advisor, course instructor, teaching assistant, or lab coordinator. Submission through the VIF System, once approved, will notify your instructors of your absence.
Rescheduling Co-op Interviews #
Follow the co-op process for rescheduling co-op interviews that conflict with graded assignments (e.g., midterms, tests, and final exams). Attendance at co-operative work-term employment interviews is not considered to be a valid reason to miss a test.
Policies #
Academic integrity #
In order to maintain a culture of academic integrity, members of the University of Waterloo community are expected to promote honesty, trust, fairness, respect and responsibility. [Check the Office of Academic Integrity for more information.]
Grievance #
A student who believes that a decision affecting some aspect of their university life has been unfair or unreasonable may have grounds for initiating a grievance. Read Policy 70, Student Petitions and Grievances, Section 4. When in doubt, please be certain to contact the department’s administrative assistant who will provide further assistance.
Discipline #
A student is expected to know what constitutes academic integrity to avoid committing an academic offence, and to take responsibility for their actions. [Check the Office of Academic Integrity for more information.] A student who is unsure whether an action constitutes an offence, or who needs help in learning how to avoid offences (e.g., plagiarism, cheating) or about “rules” for group work/collaboration should seek guidance from the course instructor, academic advisor, or the undergraduate associate dean. For information on categories of offences and types of penalties, students should refer to Policy 71, Student Discipline. For typical penalties, check Guidelines for the Assessment of Penalties.
Appeals #
A decision made or penalty imposed under Policy 70, Student Petitions and Grievances (other than a petition) or Policy 71, Student Discipline may be appealed if there is a ground. A student who believes they have a ground for an appeal should refer to Policy 72, Student Appeals.
Note for students with disabilities #
AccessAbility Services, located in Needles Hall, Room 1401, collaborates with all academic departments to arrange appropriate accommodations for students with disabilities without compromising the academic integrity of the curriculum. If you require academic accommodations to lessen the impact of your disability, please register with AccessAbility Services at the beginning of each academic term.
Turnitin.com: #
Text matching software (Turnitin®) may be used to screen assignments in this course. Turnitin® is used to verify that all materials and sources in assignments are documented. Students’ submissions are stored on a U.S. server; therefore, students must be given an alternative (e.g., scaffolded assignment or annotated bibliography) if they are concerned about their privacy and/or security. Students will be given due notice, in the first week of the term and/or at the time assignment details are provided, about arrangements and alternatives for the use of Turnitin in this course.
It is the responsibility of the student to notify the instructor, in the first week of term or at the time assignment details are provided, if they wish to submit an alternate assignment.