Session objectives. This session will discuss some key challenges in fully interpreting how LLMs work and why efforts to interpret LLMs matter for AI safety and trust. Using Neuronpedia as an example tool, it will focus on an emerging approach in explainable AI (XAI) called mechanistic interpretability.
| ⏳ | Topic |
|---|---|
| 5 min | Mechanistic interpretability as the new XAI approach |
| 5 min | Demo of Neuronpedia |
| 10 min | Practice session: Activity 1 (simple), Activity 2 (advanced) |
The prompt: The opposite of "small" is "
Link to its attribution graph
Q: What is the most likely output? How likely is it?
Ans:
Q: What is the 2nd most likely output? How likely is it?
Ans:
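If you want to sanity-check your answers after reading them off the attribution graph, the sketch below shows one way to inspect next-token probabilities for the prompt with the Hugging Face `transformers` library. Note the choice of GPT-2 here is an assumption for illustration only; the model behind the Neuronpedia attribution graph may be different, so the exact probabilities you see there can differ.

```python
# Minimal sketch: inspect the top next-token probabilities for the activity prompt.
# Assumes the `transformers` and `torch` packages and uses GPT-2 as a stand-in model;
# the model used by the Neuronpedia attribution graph may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = 'The opposite of "small" is "'
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Softmax over the logits at the last position gives the distribution
# over the token that would come next after the prompt.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)

for prob, token_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(token_id)!r}: {prob.item():.3f}")
```

The printed list gives the five most likely continuations and their probabilities, which you can compare against the most likely and 2nd most likely outputs you identified in the attribution graph.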