llm-workshop-2025

An emerging approach to interpreting LLMs

Session objectives. This session will discuss key challenges in fully interpreting how LLMs work and why efforts to interpret LLMs matter for AI safety and trust. Using Neuronpedia as an example tool, it will focus on mechanistic interpretability, an emerging approach in explainable AI (XAI).

link to slides

Session topics

Topic
- 5 min: Mechanistic interpretability as the new XAI approach
- 5 min: Demo of Neuronpedia
- 10 min: Practice session
  — Activity 1 (simple)
  — Activity 2 (advanced)

Activity for interpreting LLMs

The prompt: The opposite of "small" is " — Link to its attribution graph

Q: What is the most likely output? How likely is it?
Ans:

Q: What is the 2nd most likely output? How likely is it?
Ans:
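The "how likely" in the questions above comes from the model's next-token probability distribution: the model assigns a raw score (logit) to every candidate token, and a softmax turns those scores into probabilities, which is what tools like Neuronpedia display. A minimal sketch of that step, using made-up logit values (the tokens and numbers here are illustrative, not taken from any real model run):

```python
import math

def softmax(logits):
    """Convert raw next-token logits into a probability distribution."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits a model might assign to candidate next tokens
# for the prompt: The opposite of "small" is "
logits = {"big": 9.1, "large": 7.8, "huge": 5.0, "tiny": 3.2}

probs = softmax(logits)

# Rank candidates by probability: the top entry answers the first
# question, the second entry answers the second question.
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
for tok, p in ranked[:2]:
    print(f"{tok}: {p:.1%}")
```

In a real session you would read these probabilities off Neuronpedia's interface rather than compute them by hand, but the ranking logic is the same.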