llm-workshop-2025

An emerging approach to interpreting LLMs

Session objectives. This session will discuss key challenges in fully interpreting how LLMs work and why efforts to interpret LLMs matter for AI safety and trust. Using Neuronpedia as an example tool, it will focus on mechanistic interpretability, an emerging approach in explainable AI (XAI).

link to slides

Session topics

Topic
- 5 min: Mechanistic interpretability as the new XAI approach
- 5 min: Demo of Neuronpedia
- 10 min: Practice session
  — Activity 1 (simple)
  — Activity 2 (advanced)

Activity for interpreting LLMs

The prompt: The opposite of "small" is " — Link to its attribution graph

Q: What is the most likely output? How likely is it?
Ans:

Q: What is the 2nd most likely output? How likely is it?
Ans:
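The "how likely" in the questions above comes from the model's next-token probability distribution: the model assigns a raw score (logit) to every candidate token, and a softmax turns those scores into probabilities, which is what tools like Neuronpedia display. A minimal sketch of that step, using made-up logit values (the tokens and numbers here are illustrative, not taken from any real model run):

```python
import math

def softmax(logits):
    """Convert raw next-token logits into a probability distribution."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits a model might assign to candidate next tokens
# for the prompt: The opposite of "small" is "
logits = {"big": 9.1, "large": 7.8, "huge": 5.0, "tiny": 3.2}

probs = softmax(logits)

# Rank candidates by probability: the top entry answers the first
# question, the second entry answers the second question.
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
for tok, p in ranked[:2]:
    print(f"{tok}: {p:.1%}")
```

In a real session you would read these probabilities off Neuronpedia's interface rather than compute them by hand, but the ranking logic is the same.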