🩺 Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models

1Google Research,   2Tel Aviv University  
*Equal Contribution


We propose a framework that decodes specific information from a representation within an LLM by “patching” it into the inference pass on a different prompt that has been designed to encourage the extraction of that information. A "Patchscope" is a configuration of our framework that can be viewed as an inspection tool geared towards a particular objective.

For example, this figure shows a simple Patchscope for decoding what is encoded in the representation of "CEO" in the source prompt (left). We patch it into a target prompt (right) composed of few-shot demonstrations of token repetitions, which encourage decoding the token identity given a hidden representation.

Step 1: Run the forward computation on the source prompt in the source model.
Step 2: Apply an optional transformation to the source layer's representation.
Step 3: Run the forward computation on the target prompt up to the target layer in the target model.
Step 4: Patch the target representation of "?" at the target layer, replacing it with the transformed representation from Step 2, and continue the forward computation from that layer onward.
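
To make these steps concrete, below is a minimal sketch of a Patchscope in Python using Hugging Face transformers and a forward hook. It assumes a LLaMA-style decoder-only model whose layers live under model.model.layers and whose layer outputs are tuples with the hidden states first; the model name, layer indices, and token positions are placeholders, not the exact configuration used in our experiments.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder; any decoder-only LM
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()


def patchscope(source_prompt, source_pos, source_layer,
               target_prompt, target_pos, target_layer, f=lambda h: h):
    # Step 1: forward pass on the source prompt, caching hidden states.
    src_ids = tok(source_prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        src_out = model(src_ids, output_hidden_states=True)
    # hidden_states[l] is the residual stream after layer l (index 0 = embeddings).
    h = src_out.hidden_states[source_layer][0, source_pos]

    # Step 2: optional transformation (identity by default).
    h = f(h)

    # Steps 3-4: rerun on the target prompt and overwrite the representation at
    # (target_layer, target_pos) with h via a forward hook, patching only once
    # (on the prompt's forward pass, not on later cached decoding steps).
    patched = {"done": False}

    def hook(module, inputs, output):
        if patched["done"]:
            return output
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[0, target_pos] = h.to(hidden.dtype)
        patched["done"] = True
        return output

    layer = model.model.layers[target_layer]  # LLaMA-style module path (assumption)
    handle = layer.register_forward_hook(hook)
    try:
        tgt_ids = tok(target_prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            gen = model.generate(tgt_ids, max_new_tokens=10, do_sample=False)
    finally:
        handle.remove()
    return tok.decode(gen[0, tgt_ids.shape[1]:], skip_special_tokens=True)
```

For the figure's example, the source position would be the token "CEO" in the source prompt, and the target position would be the final "?" of the token-repetition target prompt.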

Abstract

Inspecting the information encoded in hidden representations of a large language model (LLM) can help explain the model's behavior and verify its alignment with human values. Given the capabilities of LLMs in generating human-understandable text, we propose leveraging the model itself to explain its internal representations in natural language.

We introduce a framework called Patchscopes and show how it can be used to answer a wide range of questions about an LLM's computation. We show that many prior interpretability methods based on projecting representations into the vocabulary space and intervening on the LLM computation can be viewed as instances of this framework. Moreover, several shortcomings of prior methods, such as failure in inspecting early layers or lack of expressivity, can be mitigated by Patchscopes.

Beyond unifying prior inspection techniques, Patchscopes also opens up new possibilities such as using a more capable model to explain the representations of a smaller model, and unlocks new applications such as self-correction in multi-hop reasoning.

How is it related to prior work?

Patchscopes can be configured to answer a wide range of questions about an LLM's computation. Many prominent interpretability methods can be cast as special instances, and several of their limitations, such as failure in inspecting early layers or lack of expressivity, can be mitigated with a new Patchscope.

Additionally, Patchscopes' generality enables novel inspection possibilities and helps address questions that are hard to answer with existing methods. For example, how do LLMs contextualize input entity names in early layers? This is where vocabulary projections mostly fail and other methods provide only a binary signal of whether the entity has been resolved; a Patchscope, however, can easily be created to verbalize the gradual entity resolution process, and it works even at early layers.


Results (1)
Next token prediction 🩺 is more robust across layers.

A simple few-shot token identity Patchscope works very well from layer 10 onwards, significantly better than mainstream vocab projection methods across multiple LLMs. In this experiment, our target prompt is composed of k demonstrations representing an identity-like function, formatted as "tok₁ → tok₁ ; tok₂ → tok₂ ; … ; tokₖ".
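
For illustration, the identity target prompt can be constructed programmatically; the demonstration tokens below are arbitrary examples rather than the ones used in our experiments, and patchscope refers to the sketch above.

```python
# Few-shot token-identity target prompt: "tok₁ → tok₁ ; tok₂ → tok₂ ; … ; tokₖ".
demos = ["cat", "1135", "hello"]  # arbitrary demonstration tokens
identity_prompt = " ; ".join(f"{t} → {t}" for t in demos) + " ; ?"
# -> "cat → cat ; 1135 → 1135 ; hello → hello ; ?"
# The source representation is patched into the final "?" position; greedy
# decoding then reveals which token the representation encodes.
```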


Results (2)
Attribute-extraction 🩺 requires no training data.

With Patchscopes, we can decode specific attributes from LLM representations, even when they are detached from their original context. Despite using no training data, a zero-shot feature-extraction Patchscope significantly outperforms linear probing on 6 of 12 factual and commonsense reasoning tasks and performs comparably on all but one of the remaining six. In this experiment, our target prompt is a verbalization of the relation followed by a placeholder for the subject. For example, to extract the official currency of the United States from the representation of "States", we use the target prompt "The official currency of x".
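
A zero-shot feature-extraction target prompt is simply the verbalized relation with a subject placeholder; the small template table below is an illustrative way to organize such prompts (only the currency template is taken from the example above).

```python
# Relation verbalizations ending in a subject placeholder "x".
relation_templates = {
    "official currency": "The official currency of x",  # from the example above
    "capital city": "The capital city of x",             # hypothetical extra relation
}
target_prompt = relation_templates["official currency"]
# Patch the subject's source representation (e.g., of "States" in "United States")
# into the position of "x", then generate to read off the attribute.
```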


Results (3)
Entity description 🩺 is more expressive than prior methods.

How LLMs contextualize input entity names in early layers is a question that is hard to answer with existing methods. This is where vocab projection methods mostly fail and other methods provide only a binary signal of whether the entity has been resolved. However, a few-shot entity description Patchscope can verbalize the gradual entity resolution process in the very early layers. In this experiment, we use the following few-shot target prompt, composed of three random entities and their corresponding descriptions obtained from Wikipedia: "Syria: Country in the Middle East, Leonardo DiCaprio: American actor, Samsung: South Korean multinational major appliance and consumer electronics corporation, x".
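
The few-shot entity-description target prompt quoted above can be assembled as follows; the demonstrations are copied verbatim from the prompt in the text.

```python
# Three (entity, Wikipedia description) demonstrations followed by a placeholder.
demonstrations = [
    ("Syria", "Country in the Middle East"),
    ("Leonardo DiCaprio", "American actor"),
    ("Samsung", "South Korean multinational major appliance and consumer electronics corporation"),
]
target_prompt = ", ".join(f"{e}: {d}" for e, d in demonstrations) + ", x"
# The source entity representation is patched into the final "x" position; the
# model's continuation serves as a natural-language description of the entity.
```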


Results (4)
Cross-model 🩺 further improves expressivity.

We can obtain even more expressive descriptions by using a more capable model of the same family to explain the entity resolution process of a smaller model, e.g., using Vicuna 13B to explain Vicuna 7B. The target prompt in this experiment is the same as in the entity description experiment above.
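
When the source and target models differ in hidden size (e.g., Vicuna 7B vs. 13B), the optional transformation of Step 2 is the natural place to map between the two spaces. The sketch below is only a hypothetical illustration of such a map plugged into the earlier patchscope function; it is not the specific mapping used in our experiments.

```python
import torch

# A linear map as the Step-2 transformation f, bridging 4096-d (Vicuna 7B)
# and 5120-d (Vicuna 13B) representations. How (and whether) such a map is
# trained is an assumption of this sketch, not a detail from the paper.
proj = torch.nn.Linear(4096, 5120)
f = lambda h: proj(h.float())  # cast to the projection's dtype before mapping
# Passed as the `f` argument of the patchscope sketch above, so the source-model
# representation is mapped into the target model's space before patching.
```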


Results (5)
Chain-of-Thought 🩺 can fix multi-hop reasoning errors.

We also show a practical application: fixing latent multi-hop reasoning errors. Specifically, when the model answers each reasoning step correctly but fails to connect the steps in context, our proposed Patchscope improves accuracy from 19.57% to 50%. The target prompt in this experiment is the same as the source prompt, with a modified attention mask. See the paper for more details.


Conclusions

Patchscopes is a simple and effective framework that leverages the ability of LLMs to generate human-like text to decode information from intermediate LLM representations.

We show that many existing interpretability methods can be cast as specific configurations of the more general Patchscopes framework. Moreover, using new, underexplored Patchscopes substantially improves our ability to decode various types of information from a model's internal computation, such as output predictions and knowledge attributes, typically outperforming prominent methods that rely on projection to the vocabulary space and on probing.

Our framework also enables new forms of interpretability, such as analyzing the contextualization process of input tokens in the very early layers of the model, and is beneficial for practical applications, such as multi-hop reasoning correction.

BibTeX

@misc{ghandeharioun2024patchscopes,
      title={Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models},
      author={Ghandeharioun, Asma and Caciularu, Avi and Pearce, Adam and Dixon, Lucas and Geva, Mor},
      year={2024},
      eprint={2401.06102},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}