Inspecting the information encoded in hidden representations of a large language model (LLM) can help explain the model's behavior and verify its alignment with human values.
Given the capabilities of LLMs in generating human-understandable text, we propose leveraging the model itself to explain its internal representations in natural language.
We introduce a framework called Patchscopes and show how it can be used to answer a wide range of questions about an LLM's computation.
We show that many prior interpretability methods based on projecting representations into the vocabulary space and intervening on the LLM computation can be viewed as instances of this framework.
Moreover, several shortcomings of prior methods, such as failure in inspecting early layers or lack of expressivity, can be mitigated by Patchscopes.
Beyond unifying prior inspection techniques, Patchscopes also opens up new possibilities such as using a more capable model to explain the representations of a smaller model, and unlocks new applications such as self-correction in multi-hop reasoning.
Patchscopes can be configured to answer a wide range of questions about an LLM's computation. Many prominent interpretability methods can be cast as special instances of the framework, and several of their limitations, such as failure to inspect early layers or lack of expressivity, can be mitigated with a new Patchscope.
Additionally, Patchscopes' generality enables novel inspection possibilities and helps answer questions that are hard to address with existing methods. For example, how do LLMs contextualize input entity names in early layers? This is where vocabulary projections mostly fail and other methods provide only a binary signal of whether the entity has been resolved; a Patchscope, however, can easily be created to verbalize the gradual entity resolution process, and it works even at early layers.
A simple few-shot token identity Patchscope works very well from layer 10 onwards, performing significantly better than mainstream vocabulary projection methods across multiple LLMs. In this experiment, our target prompt is composed of k demonstrations of an identity-like function, formatted as "tok1 → tok1 ; tok2 → tok2 ; … ; tokk".
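As a rough illustration of the mechanics, below is a minimal sketch of a Patchscope built on Hugging Face transformers: run the source prompt, cache one hidden representation, and overwrite the hidden state at a chosen position of the target prompt before letting the model continue generating. The model name, prompts, layer indices, placeholder token, and the model.model.layers hook path (which matches LLaMA-style decoders) are illustrative assumptions, not the paper's released code.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumption: any decoder-only model with this layout works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def patchscope(source_prompt, source_pos, source_layer,
               target_prompt, target_pos, target_layer, max_new_tokens=10):
    """Cache one hidden state from the source run and inject it into the target run."""
    # 1) Source run: collect hidden states at every layer.
    src_ids = tok(source_prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(src_ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output; hidden_states[l + 1] is the output of block l.
    h = out.hidden_states[source_layer + 1][0, source_pos].clone()

    # 2) Target run: a forward hook overwrites the chosen position at the output
    #    of block `target_layer`, then generation proceeds as usual.
    tgt_ids = tok(target_prompt, return_tensors="pt").input_ids

    def overwrite(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if hidden.shape[1] == tgt_ids.shape[1]:  # patch only on the prefill pass
            hidden[0, target_pos] = h.to(hidden.dtype)
        return output

    handle = model.model.layers[target_layer].register_forward_hook(overwrite)
    try:
        with torch.no_grad():
            gen = model.generate(tgt_ids, max_new_tokens=max_new_tokens,
                                 do_sample=False, pad_token_id=tok.eos_token_id)
    finally:
        handle.remove()
    return tok.decode(gen[0, tgt_ids.shape[1]:], skip_special_tokens=True)

# Few-shot token identity prompt; "?" is an arbitrary placeholder whose hidden state gets overwritten.
identity_prompt = "cat -> cat ; 135 -> 135 ; hello -> hello ; ?"
print(patchscope("Diana, Princess of Wales", source_pos=-1, source_layer=12,
                 target_prompt=identity_prompt, target_pos=-1, target_layer=12))

Patching into the same layer as the source is just one possible choice; the layer-wise results above come from sweeping the source layer across the model.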
With Patchscopes, we can decode specific attributes from LLM representations, even when they are detached from their original context. Despite using no training data, a zero-shot feature extraction Patchscope significantly outperforms linear probing in 6 out of 12 factual and commonsense reasoning tasks, and performs comparably on all but one of the remaining six. In this experiment, our target prompt is a verbalization of the relation followed by a placeholder for the subject. For example, to extract the official currency of the United States from the representation of "States", we use the target prompt "The official currency of x".
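The same helper can be reused for zero-shot feature extraction by swapping in a relation prompt that ends with a placeholder. The source sentence, token-matching heuristic, and layer choices below are illustrative assumptions, not the paper's exact setup.

# Inject the representation of "States" into the placeholder "x" of a relation prompt.
src = "I visited the United States last summer."
src_tokens = tok(src).input_ids
# Position of the "States" token, found by decoding tokens one by one (simple heuristic).
states_pos = max(i for i, t in enumerate(src_tokens) if "States" in tok.decode([t]))
print(patchscope(src, source_pos=states_pos, source_layer=12,
                 target_prompt="The official currency of x", target_pos=-1, target_layer=3))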
How LLMs contextualize input entity names in early layers is hard to answer with existing methods. This is where vocabulary projection methods mostly fail and other methods provide only a binary signal of whether the entity has been resolved. However, a few-shot entity description Patchscope can verbalize the gradual entity resolution process in the very early layers. In this experiment, we use the following few-shot target prompt composed of three random entities and their corresponding descriptions obtained from Wikipedia: "Syria: Country in the Middle East, Leonardo DiCaprio: American actor, Samsung: South Korean multinational major appliance and consumer electronics corporation, x".
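Reusing the same helper, the entity description Patchscope only changes the target prompt to the few-shot entity-description format quoted above; the source entity and the (early) layer indices below are again illustrative.

# Few-shot entity description prompt; "x" is the placeholder position that gets patched.
entity_prompt = ("Syria: Country in the Middle East, "
                 "Leonardo DiCaprio: American actor, "
                 "Samsung: South Korean multinational major appliance and "
                 "consumer electronics corporation, x")
print(patchscope("Diana, Princess of Wales", source_pos=-1, source_layer=4,
                 target_prompt=entity_prompt, target_pos=-1, target_layer=4))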
You can get even more expressive descriptions by using a more capable model of the same family to explain the entity resolution process of a smaller model, e.g., using Vicuna 13B to explain Vicuna 7B. The target prompt in this experiment is the same as the one used above.
We also show a practical application: fixing latent multi-hop reasoning errors. Specifically, when the model is correct in each reasoning step but fails to connect them in context, we show that our proposed Patchscope improves accuracy from 19.57% to 50%. The target prompt in this experiment is the same as the source prompt, with a modified attention mask. See the paper for more details.
@inproceedings{
ghandeharioun2024patchscopes,
title={Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models},
author={Asma Ghandeharioun and Avi Caciularu and Adam Pearce and Lucas Dixon and Mor Geva},
booktitle={Forty-first International Conference on Machine Learning},
year={2024},
url={https://arxiv.org/abs/2401.06102}
}