Course Staff

Chenhao Tan
Instructor
Dang Nguyen
Teaching Assistant
Harvey Fu
Teaching Assistant

Logistics

Content

What is this course about?

Large language models are rapidly reshaping machine learning research and practice, yet many questions remain about how they work, how to ensure they behave as intended, and how to build reliable systems on top of them. This course dives into three core areas at the frontier of LLM research: interpretability, alignment, and agents. Students will learn to analyze circuits and internal representations, probe the geometry of model features through sparse autoencoders and linear representations, and reason about scalable oversight and emergent misalignment. The course will also cover how LLMs are deployed as autonomous agents for software engineering and scientific research, how they are used to simulate human behavior, and how they can complement human decision-making. This is an advanced course and assumes familiarity with transformers and language modeling. We will read and discuss recent publications, with importance placed on analyzing, interpreting, and making arguments from necessarily incomplete empirical evidence. Students will get hands-on experience through assignments and a quarter-long research project that pushes into open problems in the field.

Prerequisites

You are expected to have understood the transformer architecture and to have experience training and analyzing language models. Prior research experience is also preferred.

Coursework

Grading

Quizzes

Short quizzes will be held at the beginning of the lecture to assess understanding of the readings.

Roast or Toast

Students will either critically analyze (roast) a paper or propose (toast) an extension or question based on the course readings.

Assignments

There will be three assignments throughout the quarter.

Project

Compute

Modal has generously offered compute to each student. See details on Ed.

Textbook

There is no required textbook. Reading materials for each week will be a combination of technical papers and online resources.

Honor Code

We expect students not to look at solutions or implementations online. Like all other classes at UChicago, we take academic honesty very seriously. Please make sure to read the UChicago Academic Honesty page.

Collaboration policy

For individual assignments, collaboration with fellow students is encouraged as long as it is properly disclosed for each submission. However, you should not share any written work or code for your assignments. After discussing a problem with others, you should write up the solution by yourself. For final projects, you are expected to work in groups of 1-2, preferably 1.

AI tools policy

Using generative AI tools such as Claude Code and ChatGPT is allowed as long as their use is properly disclosed for each submission. You are encouraged to use AI (e.g., NeuriCo) heavily for the project.

Additional course policies can be found on Canvas.

Submitting Coursework

Late Days


Other Resources


Preliminary Schedule

# Date Topic Readings Deadlines
Interpretability
1 Mon Mar 23 Introduction Lecture
2 Wed Mar 25 Attention Lecture
*What Does BERT Look At? An Analysis of BERT's Attention by Clark et al., 2019;
Play around with BertViz by Jesse Vig for at least 10 minutes.
*Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small by Wang et al., 2022;
Function Vectors in Large Language Models by Todd et al., 2023.
Project Proposal due (Fri Mar 28)
3 Mon Mar 30 MLPs and Factual Recall Lecture
*Locating and Editing Factual Associations in GPT by Meng et al., 2022;
*What does the Knowledge Neuron Thesis Have to do with Knowledge? by Niu et al., 2024;
Transformer Feed-Forward Layers Are Key-Value Memories by Geva et al., 2020
4 Wed Apr 1 Transformer Circuits Lecture
*A Mathematical Framework for Transformer Circuits by Elhage et al., 2021 (read until but not including Two Layer Transformers);
*In-context Learning and Induction Heads by Olsson et al., 2022 (read through "Arguments");
interpreting GPT: the logit lens by nostalgebraist
Proposal Revision due (Fri Apr 4)
5 Mon Apr 6 Geometry of Representations (guest lecture by Todd Nief) *The Linear Representation Hypothesis and the Geometry of Large Language Models by Park et al., 2023;
*The Information Geometry of Softmax: Probing and Steering by Park et al., 2026;
The Geometry of Truth by Marks & Tegmark, 2023 (main body, first ten pages)
6 Wed Apr 8 Superposition & Sparse Autoencoders Toy Models of Superposition by Elhage et al., 2022;
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning by Bricken et al., 2023;
Scaling and Evaluating Sparse Autoencoders by Gao et al., 2024
Blog Entry 1 due (Fri Apr 10)
7 Mon Apr 13 Chain of Thought Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting by Turpin et al., 2023;
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation by Baker et al., 2025;
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety by Korbak et al., 2025
8 Wed Apr 15 Interpretability for Science BERTology Meets Biology: Interpreting Attention in Protein Language Models by Vig et al., 2020;
From Mechanistic Interpretability to Mechanistic Biology: Training, Evaluating, and Interpreting Sparse Autoencoders on Protein Language Models by Lam et al., 2025;
Protein Language Models Learn Evolutionary Statistics of Interacting Sequence Motifs by Zhang et al., 2024
Blog Entry 2 due (Fri Apr 17)
Alignment
9 Mon Apr 20 The Alignment Problem Concrete Problems in AI Safety by Amodei et al., 2016;
What Failure Looks Like by Christiano (blog post);
Scheming AIs: Will AIs Fake Alignment During Training in Order to Get Power? by Carlsmith, 2023
10 Wed Apr 22 Scalable Oversight Measuring Progress on Scalable Oversight for Large Language Models by Bowman et al., 2022;
Debating with More Persuasive LLMs Leads to More Truthful Answers by Khan et al., 2024;
On Scalable Oversight with Weak LLMs Judging Strong LLMs by Kenton et al., 2024
Blog Entry 3 due (Fri Apr 24)
11 Mon Apr 27 Emergent Misalignment (guest lecture by Shi Feng) Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs by Betley et al., 2025;
Risks from Learned Optimization in Advanced Machine Learning Systems by Hubinger et al., 2019;
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training by Hubinger et al., 2024
12 Wed Apr 29 Sycophancy Towards Understanding Sycophancy in Language Models by Sharma et al., 2023;
Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models by Denison et al., 2024;
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback by Casper et al., 2023
Blog Entry 4 due (Fri May 1)
13 Mon May 4 Finding Novel Behavior Alignment Faking in Large Language Models by Greenblatt et al., 2024;
Frontier Models are Capable of In-context Scheming by Meinke et al., 2024;
AI Sandbagging: Language Models can Strategically Underperform on Evaluations by van der Weij et al., 2024
Agents
14 Wed May 6 Agents & Agentic RL (guest lecture by Ofir Press) ReAct: Synergizing Reasoning and Acting in Language Models by Yao et al., 2022;
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? by Jimenez et al., 2023;
Agentless: Demystifying LLM-based Software Engineering Agents by Xia et al., 2024
First Draft due (Fri May 8)
15 Mon May 11 Research Agents AlphaEvolve: A Gemini-based Coding Agent for Mathematical and Algorithmic Discovery by Novikov et al., 2025;
The Story is Not the Science: Execution-Grounded Evaluation of Mechanistic Interpretability Research by Bai et al., 2026;
Agent Laboratory: Using LLM Agents as Research Assistants by Schmidgall et al., 2025
16 Wed May 13 Simulation Generative Agents: Interactive Simulacra of Human Behavior by Park et al., 2023;
Out of One, Many: Using Language Models to Simulate Human Samples by Argyle et al., 2022;
Synthetic Replacements for Human Survey Data? The Perils of Large Language Models by Bisbee et al., 2024
Blog Entry 5 due (Fri May 15)
17 Mon May 18 Complementary AI Superhuman Artificial Intelligence Can Improve Human Decision-Making by Increasing Novelty by Shin et al., 2023;
How AI Impacts Skill Formation by Shen and Tamkin, 2026;
Machine Explanations and Human Understanding by Chen et al., 2022
18 Wed May 20 Final Presentations Final Report due (Fri May 22)

Acknowledgments

This course website is adapted from the Stanford CS336 course website.