Language, Context, and Geometry in Neural Networks

Part II (see Part I) of a series of expository notes accompanying this paper, by Andy Coenen, Emily Reif, Ann Yuan, Been Kim, Adam Pearce, Fernanda Viégas, and Martin Wattenberg. These notes are designed as an expository walk through some of the main results. Please see the paper for full references and details.

This is accompanied by the release of Context Atlas, a word sense visualization tool (code).
In linguistics, a word sense is one of the meanings of a word (definition from Wikipedia).

Have you ever eaten a hot dog with hot sauce on a hot day? Even if you have not, you still understood the question—which is remarkable, since hot was used in three completely different senses. Humans effortlessly take context into account, disambiguating the multiple meanings.

Bringing this same skill to computers has been a longstanding open problem. A recently invented technique, the Transformer architecture, is designed in a way that may begin to address this challenge. Our paper explores one particular example of this type of architecture, a network called BERT.

One of the key tools for representing meaning is a "word embedding," that is, a map of words into points in high-dimensional Euclidean space. This kind of geometric representation can capture a surprisingly rich set of relationships between words, and word embeddings have become a critical component of many modern language-processing systems.

The catch is that traditional word embeddings only capture a kind of "average" meaning of any given word--they have no way to take context into account. Transformer architectures are designed to take in a series of words--encoded via context-free word embeddings--and, layer by layer, transform them into "context embeddings." The hope is that these context embeddings will somehow represent useful information about the usage of those words in the particular sentence in which they appear. In other words, in the input to BERT, the word hot would be represented by the same point in space, no matter whether it appears next to dog, sauce, or day. But in subsequent layers, each hot would occupy a different point in space.
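As a concrete illustration, here is a minimal sketch (not the authors' code) assuming the Hugging Face transformers package and the public bert-base-uncased checkpoint: the context-free embedding of hot is a single row of the embedding table, shared across all sentences, while the final-layer context embedding varies with the surrounding words.

```python
# Minimal sketch: the context-free embedding of "hot" is one shared vector,
# while its final-layer context embedding depends on the sentence.
# Assumes the Hugging Face `transformers` package and "bert-base-uncased".
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

sentences = ["I ate a hot dog.", "She likes hot sauce.", "It was a hot day."]
hot_id = tokenizer.convert_tokens_to_ids("hot")

# Context-free word embedding: one row of the embedding table, shared by all uses.
static = model.embeddings.word_embeddings.weight[hot_id].detach()

with torch.no_grad():
    for s in sentences:
        inputs = tokenizer(s, return_tensors="pt")
        pos = inputs["input_ids"][0].tolist().index(hot_id)
        hidden = model(**inputs).hidden_states     # embedding layer + one entry per transformer layer
        context = hidden[-1][0, pos]               # context embedding at the last layer
        print(s, torch.cosine_similarity(static, context, dim=0).item())
```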

In this blog, we describe visualizations that help us explore the geometry of these context embeddings--and which suggest that context embeddings do indeed capture different word senses. We then provide quantitative evidence for that hypothesis, showing that simple classifiers on context embedding data may be used to achieve state-of-the-art results on standard word sense disambiguation tasks. Finally, we show that there is strong evidence for a semantic subspace within the embedding space, similar to the syntactic subspace described by Hewitt and Manning, and discussed in the first post of this series.

Part 1: Context Atlas Visualization

See a similar, independently created visualization by David McClure here!

So, which subtleties are captured in these embeddings, and which are lost? We approach this question with a visualization that attempts to isolate the contextual information of a single word across many instances. How does BERT understand the meaning of a given word in each of these contexts?

The BERT model comes in several sizes. Our visualization shows results for "BERT-base".

When a word (e.g., hot) is entered, the system retrieves 1,000 sentences from Wikipedia that contain hot. It sends these sentences to BERT as input, and for each one it retrieves the context embedding for hot at each layer.
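A rough sketch of this collection step might look like the following (our reconstruction, not the released Context Atlas code). It reuses the tokenizer, model, and hot_id from the snippet above, and assumes sentences is the list of retrieved Wikipedia sentences.

```python
# For each sentence containing the query word, keep that word's embedding at
# every layer. Sentences where the query is split into word pieces are skipped
# here for simplicity; the real tool may handle them differently.
import numpy as np
import torch

def collect_layer_embeddings(sentences, query_id, tokenizer, model):
    """Return one (n_kept, 768) array per layer (embedding layer + 12 transformer layers)."""
    per_layer = [[] for _ in range(model.config.num_hidden_layers + 1)]
    with torch.no_grad():
        for s in sentences:
            inputs = tokenizer(s, return_tensors="pt", truncation=True)
            ids = inputs["input_ids"][0].tolist()
            if query_id not in ids:
                continue
            pos = ids.index(query_id)
            hidden = model(**inputs).hidden_states
            for layer, h in enumerate(hidden):
                per_layer[layer].append(h[0, pos].numpy())
    return [np.stack(vecs) for vecs in per_layer]

layer_embeddings = collect_layer_embeddings(sentences, hot_id, tokenizer, model)
```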

UMAP is a technique for projecting high-dimensional data into a low-dimensional view.

The system then visualizes these 1,000 context embeddings using UMAP. The UMAP projection generally shows clear clusters relating to word senses, for example hot as in hot day or as in hot dog. Different senses of a word are typically spatially separated, and within the clusters there is often further structure related to fine shades of meaning.
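The projection itself is essentially one call to the umap-learn library. Here is a minimal sketch using the per-layer arrays built above; the n_neighbors and min_dist values are illustrative defaults, not necessarily the tool's settings.

```python
# Project one layer's 768-dimensional embeddings down to 2D for plotting.
import umap

last_layer = layer_embeddings[-1]                 # shape: (n_sentences, 768)
points_2d = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(last_layer)
# points_2d has shape (n_sentences, 2); nearby points tend to share a word sense.
```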

We highlight words that appear in many sentences clustered in a small area, as measured by the median distance between sentences containing those words. We also show only as many labels as can fit without overlapping.

To make it easier to interpret the visualization without mousing over every point, we added labels of words that are common between sentences in a cluster.
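A plausible version of this labeling heuristic is sketched below (the exact Context Atlas scoring may differ): for each candidate word, gather the projected points of the sentences that contain it, and rank candidates by how tightly those points cluster, measured by the median pairwise distance. Here sentences and points_2d are the sentence list and 2D projection from the snippets above.

```python
# Score candidate label words: a word is a good label if it appears in many
# sentences whose projected points sit close together.
from collections import defaultdict

import numpy as np
from scipy.spatial.distance import pdist

def label_candidates(sentences, points_2d, query, min_count=5):
    word_to_points = defaultdict(list)
    for sent, point in zip(sentences, points_2d):
        for word in set(sent.lower().split()):
            if word != query:
                word_to_points[word].append(point)
    scores = {}
    for word, pts in word_to_points.items():
        if len(pts) >= min_count:
            # Smaller median pairwise distance = tighter cluster = better label.
            scores[word] = np.median(pdist(np.stack(pts)))
    return sorted(scores, key=scores.get)   # tightest-clustered words first

labels = label_candidates(sentences, points_2d, query="hot")
```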

Case study: Walk

A natural question is whether the space is partitioned by part of speech. Parts of speech can be displayed in the UI with the "show POS" toggle: the dots are then colored by the part of speech of the query word, and the labels are uncolored.

Visualization of walk in various contexts.

Indeed, this is the case. The words are partitioned into nouns and verbs.

One cluster with sentences using the word walk as a noun, as in "take a walk."
The corresponding verb cluster, with sentences such as "they walk."

Case study: Class

The separation of clusters goes beyond parts of speech. For the word class, below, the space is partitioned more subtly. To the top right, there is a cluster of sentences about working class, middle/upper/lower classes, etc. Below is the cluster for the educational sense of the word--high school classes, students taking classes, etc. Still more clusters can be found for class-action lawsuits, equivalence classes in math, and more.

Early Layers

The previous examples have all used embeddings from the last layer of BERT. But where do these clusters arise, and what do the embeddings look like as they progress through the network?

For most words that have only one sense (e.g., happy or man), the embeddings at the earliest layer (token embeddings + positional embeddings, with one transformer layer) form a canonical spiral pattern. This pattern turns out to be based on the position of the word in the sentence, and reflects the way linear order is encoded in the BERT input. For example, the sentences with happy as the third word are clustered together, and lie between the sentences with happy as the second word and those with happy as the fourth word.

Interestingly, many words (e.g., civil, class, and state) do not show this pattern at the first layer. Below are the embeddings for class at the first layer, which are already semantically clustered.

We have not really explored the reason for this; it could be due either to the vocabularies of the different senses being dramatically different, or to the model learning that some words, when combined with class, change the meaning of class significantly (in contrast to happy, which has a similar meaning in any context). However, these are both pure conjectures, and more research is necessary.

Progression Through Layers

So what about the layers in between? Interestingly, there are often more disparate clusters in the middle of the model. The reason for this is somewhat of an open question. We know from Tenney et al. that BERT performs better on some semantic and syntactic tasks with embeddings from the middle of the model. Perhaps some information is lost towards the later layers in service of a higher-level task, causing the clusters to merge.

Another observation is that, towards the end of the network, most of the attention is paid to the CLS and SEP tokens (special tokens at the beginning and end of the sentence). We’ve conjectured that this could make all clusters merge together to some degree.

Embedding clusters change significantly through the layers.

The apparent detail in the clusters we visualized raises two immediate questions: first, can we quantitatively demonstrate that BERT embeddings capture word senses? Second, how can we resolve the fact that we observe BERT embeddings capturing semantics, when previously we saw those same embeddings capturing syntax?

Part 2: Quantitative Word Sense Analysis

The crisp clusters seen in the visualizations above suggest that BERT may create simple, effective internal representations of word senses, putting different meanings in different locations.

To test this out quantitatively, we trained a simple nearest-neighbor classifier on these embeddings to perform word sense disambiguation (WSD).

We used the data and evaluation from Raganato et al.: the training data was SemCor (33,362 senses), and the test data was the suite described by Raganato et al. (3,669 senses).

We follow the procedure described by Peters et al., who performed a similar experiment with the ELMo model. For a given word with n senses, we build a nearest-neighbor classifier where each neighbor is the centroid of a given word sense’s BERT-base embeddings in the training data. To classify a new word, we find the closest of these centroids, defaulting to the most commonly used sense if the word was not present in the training data.
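In code, the classifier amounts to little more than a dictionary of sense centroids. Below is a minimal sketch; train (a mapping from (word, sense) pairs to lists of BERT-base embeddings from SemCor) and most_frequent_sense (a word-to-fallback-sense mapping) are hypothetical placeholders for the real training data.

```python
# Nearest-neighbor word sense disambiguation over sense centroids.
import numpy as np

def fit_centroids(train):
    """Average each (word, sense)'s training embeddings into a single centroid."""
    return {key: np.mean(np.stack(vecs), axis=0) for key, vecs in train.items()}

def classify(word, embedding, centroids, most_frequent_sense):
    candidates = {sense: c for (w, sense), c in centroids.items() if w == word}
    if not candidates:
        # Word unseen in training: fall back to its most common sense.
        return most_frequent_sense.get(word)
    return min(candidates, key=lambda sense: np.linalg.norm(embedding - candidates[sense]))
```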

The simple nearest-neighbor classifier achieves an F1 score of 71.1, higher than the current state of the art, with the accuracy monotonically increasing through the layers. This is a strong signal that context embeddings represent word-sense information. Additionally, we got a higher score of 71.5 using the technique described in the following section.

Method                           F1 Score
Baseline (most frequent sense)   64.8
ELMo                             70.1
BERT-base                        71.1
BERT-base (with probe)           71.5

Part 3: A Subspace for Semantics?

Hewitt and Manning found that there was an embedding subspace that appeared to contain syntactic information. We hypothesize that there might also exist a subspace for semantics: that is, a linear transformation under which words of the same sense would be closer together and words of different senses would be further apart.

To explore this hypothesis, we trained a probe following Hewitt and Manning’s methodology.

Our training corpus was the same dataset described in part 2, filtered to include only words with at least two senses, each with at least two occurrences (for 8,542 out of the original 33,362 senses).

We initialized a random matrix $B \in \mathbb{R}^{k \times m}$, where $k = 768$ is the BERT-base embedding dimension, and tested different values of $m$. The loss is, roughly, the difference between the average cosine similarity between embeddings of words with different senses, and that between embeddings of the same sense.
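A rough PyTorch sketch of this probe is below; the paper's exact batching and optimization details may differ, and sample_pairs is a hypothetical helper that yields pairs of embeddings of the same surface word along with a flag for whether they share a sense.

```python
# Train a linear probe B that maps 768-d BERT embeddings into an m-dimensional
# candidate "semantic subspace".
import torch

k, m = 768, 128
B = torch.nn.Parameter(torch.randn(k, m) / k ** 0.5)
optimizer = torch.optim.Adam([B], lr=1e-3)

def probe_loss(emb_a, emb_b, same_sense):
    """emb_a, emb_b: (n, 768) embeddings of the same surface word;
    same_sense: (n,) boolean tensor, True when a pair shares a word sense."""
    sim = torch.cosine_similarity(emb_a @ B, emb_b @ B, dim=-1)
    # Different-sense pairs should end up far apart, same-sense pairs close together.
    return sim[~same_sense].mean() - sim[same_sense].mean()

# One training step, assuming the hypothetical sample_pairs() data helper:
# emb_a, emb_b, same_sense = sample_pairs()
# loss = probe_loss(emb_a, emb_b, same_sense)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```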

$m$    Trained Probe    Random Probe
768    71.26            70.74
512    71.52            70.51
256    71.29            69.92
128    71.21            69.56
64     70.19            68.00
32     68.01            64.62
16     65.34            61.01

We evaluate our trained probes on the same dataset and WSD task used in part 2. As a control, we compare each trained probe against a random probe of the same shape. As mentioned, untransformed BERT embeddings achieve a state-of-the-art F1 score of 71.1. We find that our trained probes achieve slightly improved scores down to $m$ = 128 dimensions.
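Concretely, this evaluation can be read as reusing the Part 2 centroid classifier on probe-transformed embeddings, with an untrained matrix of the same shape as the control. A sketch under those assumptions, reusing fit_centroids and B from the earlier snippets (train_emb is a hypothetical stand-in for the SemCor embeddings):

```python
# Evaluate a probe matrix: project every embedding into the m-dimensional
# subspace, then rerun the Part 2 nearest-neighbor classifier.
import torch

def project(embeddings_by_sense, probe):
    """Apply the probe to every vector in a {(word, sense): [vectors]} dict."""
    with torch.no_grad():
        return {key: [(torch.as_tensor(v, dtype=torch.float32) @ probe).numpy() for v in vecs]
                for key, vecs in embeddings_by_sense.items()}

random_B = torch.randn(768, 128) / 768 ** 0.5   # untrained control of the same shape
centroids_trained = fit_centroids(project(train_emb, B.detach()))
centroids_random = fit_centroids(project(train_emb, random_B))
# ...then score `classify` over the projected test embeddings for both probes.
```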

Though our probe achieves only a modest improvement in accuracy for final-layer embeddings, we note that we were able to more dramatically improve the performance of embeddings at earlier layers (see the Appendix in our paper for details: Figure 10). This suggests there is more semantic information in the geometry of earlier-layer embeddings than a first glance might reveal. Our results also support the idea that word sense information may be contained in a lower-dimensional space. This suggests a resolution to the question mentioned above: word embeddings encode both syntax and semantics, but perhaps in separate complementary subspaces.

Conclusion

Since embeddings produced by transformer models depend on context, it is natural to speculate that they capture the particular shade of meaning of a word as used in a particular sentence. (E.g., is bark an animal noise or part of a tree?) It is still somewhat mysterious how and where this happens, though. Through the explorations described above, we attempt to answer this question both qualitatively and quantitatively with evidence of geometric representations of word sense.

Many thanks to David Belanger, Tolga Bolukbasi, Dilip Krishnan, D. Sculley, Jasper Snoek, Ian Tenney, and John Hewitt for helpful feedback and discussions about this research. For more details, and results related to syntax as well as semantics, please read our full paper! And look for future notes in this series.