
Prompt Debugging with Sequence Salience

You can also run this demo locally with examples/prompt_debugging/server.py

Large language models (LLMs), such as Gemini and GPT-4, have become ubiquitous. Recent releases of "open weights" models, including Llama 2, Mistral, and Gemma, have made it easier for hobbyists, professionals, and researchers alike to access, use, and study the complex and diverse capabilities of LLMs.

Many LLM interactions use prompt engineering methods to control the model's generation behavior. Generative AI Studio and other tools have made it easier to construct prompts, and model interpretability can help engineer prompt designs more effectively by showing us which parts of the prompt the model is using during generation.

In this tutorial, you will learn to use the Sequence Salience module, introduced in LIT v1.1, to explore the impact of your prompt designs on model generation behavior in three case studies. In short, this module allows you to select a segment of the model's output and see a heatmap depicting how much influence each preceding segment had on the selection.

LIT's Language Model Salience demo. Use the Data Table (left) and Datapoint Editor (not shown) to select or create prompt designs, and visualize the salient information therein using the Sequence Salience module (right).

All examples in this tutorial use the Gemma LLM as the analysis target. Most of the time, this is Gemma Instruct 2B, but we also use Gemma Instruct 7B in Case Study 3; more info about variants is available online. LIT supports additional LLMs, including Llama 2 and Mistral, via the HuggingFace Transformers and KerasNLP libraries.
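If you want to load one of these models yourself, outside of the hosted demos, a KerasNLP call like the sketch below is a reasonable starting point. Note that the preset name and generation length here are assumptions for illustration, and that LIT's demos wrap models behind LIT's own model classes rather than calling KerasNLP directly.

# Minimal sketch: load Gemma Instruct 2B through KerasNLP and generate text.
# The preset name and max_length are assumptions for illustration; check the
# KerasNLP documentation for the presets available in your environment.
import keras_nlp

gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_instruct_2b_en")
prompt = "Analyze a menu item in a restaurant.\n\nSuggestion: Onion soup\nAnalysis:"
print(gemma_lm.generate(prompt, max_length=128))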

This tutorial was adapted from and expands upon LIT's contributions to the Responsible Generative AI Toolkit and the related paper and video submitted to the ACL 2024 System Demonstrations track. This is an active and ongoing research area for the LIT team, so expect changes and further expansions to this tutorial over time.

Case Study 1: Debugging Few-Shot Prompts

Few-shot prompting was popularized by GPT-3: an ML developer provides examples of how to perform a task in a prompt, affixes user-provided content at the end, and sends the prompt to the LLM so it will generate the desired output. This technique has been useful for a number of use cases, including solving math problems, code synthesis, and more.

Imagine yourself as a developer working on an AI-powered recommendation system. The goal is to recommend dishes from a restaurant's menu based on a user's preferences—what they like and do not like. You are designing a few-shot prompt to enable an LLM to complete this task. Your prompt design, shown below, includes five clauses: Taste-likes and Taste-dislikes are provided by the user, Suggestion is the item from the restaurant's menu, and Analysis and Recommendation are generated by the LLM. The dynamic content for the final example is injected before the prompt is sent to the model.

Analyze a menu item in a restaurant.

## For example:

Taste-likes: I've a sweet-tooth
Taste-dislikes: Don't like onions or garlic
Suggestion: Onion soup
Analysis: it has cooked onions in it, which you don't like.
Recommendation: You have to try it.

Taste-likes: I've a sweet-tooth
Taste-dislikes: Don't like onions or garlic
Suggestion: Baguette maison au levain
Analysis: Home-made leaven bread in France is usually great
Recommendation: Likely good.

Taste-likes: I've a sweet-tooth
Taste-dislikes: Don't like onions or garlic
Suggestion: Macaron in France
Analysis: Sweet with many kinds of flavours
Recommendation: You have to try it.

## Now analyse one more example:

Taste-likes: users-food-like-preferences
Taste-dislikes: users-food-dislike-preferences
Suggestion: menu-item-to-analyse
Analysis:

There's a problem with this prompt. Can you spot it? If you find it, how long do you think it took before you noticed it? Let's see how Sequence Salience can speed up bug identification and triage with a simple example.

Consider the following values for the variables in the prompt template above.

users-food-like-preferences = Cheese
users-food-dislike-preferences = Can't eat eggs
menu-item-to-analyse = Quiche Lorraine
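To make the injection step concrete, here is a minimal sketch of how the prompt might be assembled from the template; the FEW_SHOT_PREFIX constant and build_prompt helper are hypothetical names used only for illustration, and the prefix is abbreviated.

# Illustrative sketch: inject the user-provided values into the prompt template
# before sending it to the model. FEW_SHOT_PREFIX stands in for everything in
# the template above up to "## Now analyse one more example:"; the helper name
# is hypothetical, not part of LIT.
FEW_SHOT_PREFIX = "Analyze a menu item in a restaurant.\n\n## For example:\n..."  # abbreviated


def build_prompt(likes: str, dislikes: str, menu_item: str) -> str:
    final_example = (
        f"Taste-likes: {likes}\n"
        f"Taste-dislikes: {dislikes}\n"
        f"Suggestion: {menu_item}\n"
        "Analysis:"
    )
    return f"{FEW_SHOT_PREFIX}\n\n## Now analyse one more example:\n\n{final_example}"


prompt = build_prompt("Cheese", "Can't eat eggs", "Quiche Lorraine")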

When you run this through the model it generates the following (we show the entire example, but the model only generated the text after Analysis):

Taste-likes: Cheese
Taste-dislikes: Can't eat eggs
Suggestion: Quiche Lorraine
Analysis: A savoury tart with cheese and eggs
Recommendation: You might not like it, but it's worth trying.

Why is the model suggesting something that contains an ingredient that the user cannot eat (eggs)? Is this a problem with the model or a problem with the prompt? The Sequence Salience module can help us find out.

If you are following along in Colab, you can select this example in the Data Table by looking for the source value fewshot-mistake. Alternatively, you can add the example directly using the Datapoint Editor.

Once selected, the Sequence Salience module will allow you to choose the response field from the model (bottom) and see a running-text view of the prompt. The module defaults to word-level granularity, but this prompt design is more suitable for sentence-level analysis, since the data contained in each example is separated into distinct, sentence-like clauses. After enabling sentence-level aggregation with the Granularity controls, select the Recommendation line in the model's generated response to see a heatmap of the impact that preceding lines have on that line. You can also use paragraph-level aggregation to quickly identify the most influential examples, and then switch to a finer-grained aggregation to see how different statements in the prompt influence generation. These two perspectives are shown in the figure below.

Sequence Salience maps depicting the influence from few-shot examples at two levels. Paragraph-level aggregation (left) allows us to quickly identify the most influential complete example, and sentence-level (right) aids in differentiating the influence of constituent clauses. Notice that the most influential example is the first one, and that the most salient clause in that example is the Analysis line. However, the Recommendation that follows contradicts the stated taste preferences and Analysis.
Adjusting Segment Granularity
Input salience methods for text-to-text generation tasks operate over the subword tokens used by the model. However, humans tend not to reason effectively over these tokenized representations, so we provide a granularity control that (roughly) aggregates tokens into words, sentences, and paragraphs, or into custom segments using a regular expression parser. The salience score for each aggregate segment is the sum of the scores for its constituent tokens. Selecting an aggregate segment is equivalent to selecting all of its constituent tokens.
Adjusting Color Map Intensity
The Sequence Salience module allows you to control the intensity of the color map, which can balance the visual presence of segments at different granularities. We've tried to set a suitable default intensity, but encourage you to play around with these controls to see what works well for your needs.
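To make the aggregation described in the granularity note above concrete, the sketch below sums token-level salience scores into segment-level scores; the data structures are simplified, hypothetical stand-ins for what LIT computes internally.

# Sketch: aggregate token-level salience into coarser segments by summing.
# `token_scores` and `segments` are simplified, hypothetical stand-ins for
# LIT's internal representations.
from typing import Dict, List, Tuple


def aggregate_salience(
    token_scores: List[Tuple[str, float]],
    segments: Dict[str, List[int]],
) -> Dict[str, float]:
    """Returns the summed salience score for each aggregate segment."""
    return {
        label: sum(token_scores[i][1] for i in indices)
        for label, indices in segments.items()
    }


token_scores = [("Sug", 0.10), ("gestion", 0.05), (":", 0.01),
                ("On", 0.30), ("ion", 0.20), ("soup", 0.25)]
segments = {"Suggestion:": [0, 1, 2], "Onion soup": [3, 4, 5]}
print(aggregate_salience(token_scores, segments))
# Prints roughly {'Suggestion:': 0.16, 'Onion soup': 0.75}.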

As you scan up through the sentence-level heatmap, you will notice two things: 1) the strongest influence on the recommendation is the instruction at the top to analyze the menu item; and 2) the next most influential segments are the Analysis lines in each of the few-shot examples. These suggest that the model is correctly attending to the task and leaning on the analyses to guide generation, so what could be going wrong? The most influential Analysis clause is from the Onion soup example. Looking at this example more closely, we see that the Recommendation clause for this example does not align with the user's tastes; they dislike onions, but the example recommends the onion soup anyway.

Research suggests that the relatively tight distribution over the taste and recommendation spaces in the limited examples in the prompt can affect the model's ability to learn the recommendation task. The other examples in the prompt appear to have the correct recommendation given the user's tastes. If we fix the Onion soup example recommendation, maybe the model will perform better.

The Datapoint Editor (left) allows you to edit the prompt text directly in LIT, with any edited fields highlighted in yellow until they are added to the dataset. After adding the edited prompt, the Sequence Salience module (right) will update and allow you to view the influence of prior clauses in the corrected prompt. Fixing the few-shot example appears to have correctly adjusted model behavior.

After making adjustments in the Datapoint Editor (or selecting the fewshot-fixed example in the Data Table if you're following along in Colab), we can again load the example into the Sequence Salience module and, with sentence-level granularity selected, select the new Recommendation line in the model's generated response. We can immediately see that the response is now correct. The heatmap looks largely the same as before, but the corrected example has improved the model's performance.

Case Study 2: Assessing Constitutional Principles in Prompts

Constitutional principles are a more recent development in the pantheon of prompt engineering. The core concept is that clear, concise instructions describing the qualities of a good output can improve model performance and increase developers' ability to control generations. Initial research has shown that model self-critique works well for this, and tools have been developed to give humans control over the principles that are added to prompts. The Sequence Salience module can take this one step further by providing a feedback loop to assess the influence of constitutional principles on generations.

Building on the task from Case Study 1, let's consider how the following constitutional principles might impact a prompt designed for food recommendations from a restaurant menu.

* The analysis should be brief and to the point.
* The analysis and recommendation should both be clear about the suitability for someone with a specified dietary restriction.
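Before looking at placement, here is a rough sketch of how you might splice these principles into the prompt at different positions when constructing examples programmatically; the insert_principles helper and the position names are purely illustrative, not a LIT API.

# Sketch: splice constitutional principles into a prompt at a chosen position.
# The helper and position names are illustrative, not a LIT or Gemma API.
PRINCIPLES = (
    "* The analysis should be brief and to the point.\n"
    "* The analysis and recommendation should both be clear about the "
    "suitability for someone with a specified dietary restriction."
)


def insert_principles(instruction: str, few_shot: str, completion: str,
                      position: str = "after_instruction") -> str:
    if position == "after_instruction":
        # Between the task instruction and the few-shot examples.
        parts = [instruction, PRINCIPLES, few_shot, completion]
    elif position == "before_completion":
        # In their own section, between the examples and the completion.
        parts = [instruction, few_shot, PRINCIPLES, completion]
    else:
        raise ValueError(f"Unknown position: {position}")
    return "\n\n".join(parts)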

The location of principles in a prompt can directly affect model performance. To start, let's look at how they impact generations when placed between the instruction (Analyze a menu...) and the start of the few-shot examples. The heatmap in the figure below shows a desirable pattern: the model is being strongly influenced by the task instruction and the principle related to the Recommendation component of the generation, with support from the Analysis clauses in the few-shot examples.

A Sequence Salience map depicting the influence of constitutional principles on a model generation with few-shot examples. Notice that placing the principles near the task instruction seems to give them significant influence compared to the heatmaps in Case Study 1.

What happens if we change the location of these principles? You can use LIT's Datapoint Editor to move the principles to their own section in the prompt, between the few-shot examples and the completion, as shown in the figure below.

This Sequence Salience map suggests that moving the principles around in the prompt does not seem to affect model generation on this example, but it does change the influence pattern dramatically.

After moving the principles, the influence appears more diffuse, spread across all of the Analysis sections and the relevant principle. The sentiment conveyed in the Recommendation is similar to the original, and the response is even more terse after the move, which better aligns with the principles. If similar patterns were found across multiple test examples, this might suggest the model does a better job of following the principles when they come later in the prompt.

Constitutional principles are still very new, and the interactions between them and model size, for example, are not well understood at this time. We hope that LIT's Sequence Salience module will help develop and validate methods for using them in prompt engineering use cases.

Case Study 3: Side-by-Side Behavior Comparisons

LIT supports a side-by-side (SxS) mode that can be used to compare two models or, as here, to compare model behavior on two related examples. Let's see how we can use this to understand differences in prompt designs with Sequence Salience.

GSM8K is a benchmark dataset of grade school math problems commonly used to evaluate LLMs' mathematical reasoning abilities. Most evaluations employ a chain-of-thought prompt design where a set of few-shot examples demonstrate how to decompose a word problem into subproblems and then combine the results from the various subproblems to arrive at the desired answer. GSM8K and other work has shown that LLMs often need assistance to perform calculations, introducing the idea of tool use by LLMs.

Less explored is the Socratic form of the dataset, where subproblems are framed as questions instead of declarative statements. One might assume that a model will perform similarly, or even better, on the Socratic form than on the conventional form, especially if the prompt design is modified to include the preceding Socratic sub-questions and their answers, isolating the work the model must perform to the final question, as shown in the following example.

A carnival snack booth made $50 selling popcorn each day. It made three times as much selling cotton candy. For a 5-day activity, the booth has to pay $30 rent and $75 for the cost of the ingredients. How much did the booth earn for 5 days after paying the rent and the cost of ingredients?
How much did the booth make selling cotton candy each day? ** The booth made $50 x 3 = $<<50*3=150>>150 selling cotton candy each day.
How much did the booth make in a day? ** In a day, the booth made a total of $150 + $50 = $<<150+50=200>>200.
How much did the booth make in 5 days? ** In 5 days, they made a total of $200 x 5 = $<<200*5=1000>>1000.
How much did the booth have to pay? ** The booth has to pay a total of $30 + $75 = $<<30+75=105>>105.
How much did the booth earn after paying the rent and the cost of ingredients? **

When we inspect the model's response to a zero-shot prompt in the Sequence Salience module, we notice two things. First, the model failed to compute the correct answer. It was able to correctly set up the problem as the difference between two values, but the calculated value is incorrect ($995 when it should be $895). Second, we see a fairly diffuse heatmap attending nearly equally to the operands for the final problem and all of the preceding answers to the Socratic questions.

On load, the Sequence Salience module lets you choose which target sequence to analyze. Sequences from the dataset are shown on top; there is typically only one of these, as it acts as the ground truth against which predictions are compared. Sequences from the model are shown on the bottom; there may be more than one of these, depending on the sampling strategy used by the model.

This dataset does provide ground truth, so let's use SxS mode to compare the generated response with the ground truth. The fastest way to enter SxS mode for the selected datapoint is by using the pin button in the main toolbar. When you enable SxS mode, the Sequence Salience module will ask you to choose which target sequence to view on each side. The order doesn't matter, but the ground truth is on the left and the model's response is on the right in the figure below.


Next, ensure that the same granularity (word-level) is being used on both Sequence Salience visualizations, and then select the segment for the last calculation on both sides. The heatmaps are quite similar on both sides, with the same diffuse pattern suggesting the model isn't quite sure what to pay attention to.

Side-by-side Sequence Salience maps for the ground truth (left) and model generated response (right) for a zero-shot prompt of a GSM8K example. Note the similarities between these heatmaps, with diffuse influence over the preceding answers and the incorrect calculation to the final question.

One possibility that might improve performance is to adjust the prompt so that the segments used in the calculations are more salient. GSM8K uses a special calculation annotation to tell the model when it should employ an external calculator tool during generation. The naive zero-shot prompt above left these annotations intact and they might be confusing the model. Let's see what happens when we remove these annotations. Using the Datapoint Editor we can remove all of the << ... >> content from the prompt, then use the "Add" button to add it to our dataset, run generation, and load the example in the Sequence Salience module as the "selected" datapoint on the right. Choose to view the model's response field in the Sequence Salience module, ensure the same granularity is being used, and then select the segment containing the calculated value on both sides, as shown in the figure below.
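If you prefer to make this edit programmatically rather than by hand in the Datapoint Editor, the annotations can be stripped with a short regular expression; the snippet below is an illustrative sketch and not part of LIT or the GSM8K tooling.

# Sketch: strip GSM8K-style calculator annotations such as <<50*3=150>> from a
# prompt, leaving the surrounding text (e.g. "$150") intact.
import re

CALC_ANNOTATION = re.compile(r"<<[^>]*>>")


def strip_calc_annotations(prompt: str) -> str:
    return CALC_ANNOTATION.sub("", prompt)


line = "The booth made $50 x 3 = $<<50*3=150>>150 selling cotton candy each day."
print(strip_calc_annotations(line))
# Prints: The booth made $50 x 3 = $150 selling cotton candy each day.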

We can immediately see that the modified prompt has a much more intense salience map focusing on the operands to the calculation and the preceding answers from which they originate. That said, the model still gets the calculation wrong.

Side-by-side Sequence Salience maps of the model's response for the original zero-shot prompt (left) and a revised prompt (right) that removes the special calculation annotations. Despite the more focused influence of the segments relevant to the final question, the model still fails to calculate the correct answer.

In addition to these between-examples comparisons, LIT's SxS mode also supports comparison between two models. Prior research investigating the necessity of tool use by models has noted that model size does seem to correlate with performance on mathematical reasoning benchmarks. Let's test that hypothesis here.

Resource Needs for Between-Model Comparisons
Side-by-side comparison requires loading both models at once, which requires additional memory. To load both Gemma 2B and 7B, we recommend a GPU or TPU with 40GB of memory, such as the Nvidia A100 available through Colab Pro.

To enable between-model comparison, first unpin the original example using the button in the main toolbar, then enable the 7B and 2B model instances using the checkboxes (also in the main toolbar). This will duplicate the Sequence Salience module, with the 7B model on the left and the 2B model on the right. Select model response for both, and then select the final calculation result segment to see their respective heatmaps.

Side-by-side Sequence Salience maps for the responses from two models—Gemma 7B IT (left) and Gemma 2B IT (right)—to the revised zero-shot prompt from above.

Notice that the heatmaps are quite similar, suggesting the models have similar behavioral characteristics, but that both still get the answer wrong. At this point, it may be possible to improve performance by revisiting different prompting strategies or by training the model to use tools.

Conclusion

The case studies above demonstrate how to use LIT's Sequence Salience module to evaluate prompt designs rapidly and iteratively, in combination with LIT's tools for side-by-side comparison and datapoint editing.

Salience methods for LLMs are an active research area. The LIT team has provided reference implementations for computing gradient-based salience (Grad L2 Norm and Grad · Input) for LLMs in two popular frameworks: KerasNLP and HuggingFace Transformers.
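To give a sense of what these methods compute, here is a minimal, self-contained sketch of Grad L2 Norm and Grad · Input for a single target token of a causal LM, using GPT-2 via HuggingFace Transformers and PyTorch for brevity; it illustrates the general technique and is not LIT's reference implementation.

# Minimal sketch of Grad L2 Norm and Grad-dot-Input salience for one target
# token of a causal LM. GPT-2 is used for brevity; this is an illustration of
# the technique, not LIT's reference implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Suggestion: Onion soup\nRecommendation:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Embed the tokens ourselves so we can take gradients w.r.t. the embeddings.
embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)
logits = model(inputs_embeds=embeds).logits

# Score the model's most likely next token (any target token could be chosen).
next_token_logits = logits[0, -1]
target_id = next_token_logits.argmax()
target_logprob = torch.log_softmax(next_token_logits, dim=-1)[target_id]
target_logprob.backward()

grads = embeds.grad[0]                                  # [seq_len, hidden_dim]
grad_l2 = grads.norm(dim=-1)                            # Grad L2 Norm
grad_dot_input = (grads * embeds[0].detach()).sum(-1)   # Grad · Input

tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
for tok, g2, gdi in zip(tokens, grad_l2.tolist(), grad_dot_input.tolist()):
    print(f"{tok!r:>15}  grad_l2={g2:.4f}  grad_dot_input={gdi:.4f}")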

There is considerable opportunity to research how the model analysis foundations described in this tutorial can support richer workflows, particularly as they relate to aggregate analysis of salience results over many examples and the semi-automated generation of new prompt designs. Consider sharing your ideas, prototypes, and implementations with us via GitHub.

Further Reading

In addition to the links above, the Google Cloud, Responsible AI and Human-Centered Technologies, and People + AI Research teams have published several guides that can help you develop better prompts.
