An attention mechanism is a neural network component that allows a model to dynamically focus on the most relevant parts of an input sequence when producing each element of the output. It is the core building block of the transformer architecture that powers modern large language models (LLMs).
Before attention mechanisms were introduced, sequence models like RNNs and LSTMs processed inputs step by step, which made it difficult to capture long-range dependencies in text. If a relevant word appeared far earlier in a sentence, the model might lose track of it. Attention solved this by allowing the model to look at all positions in the input simultaneously and assign different weights to each one based on relevance.
The most widely used form is self-attention, introduced in the landmark 2017 paper "Attention Is All You Need." In self-attention, every token in a sequence computes a relevance score against every other token. This means a word at position 100 can directly attend to a word at position 1, without information having to pass through 99 intermediate steps. Multi-head attention extends this by running multiple attention operations in parallel, each learning to focus on different types of relationships.
In practice, attention is computed from three learned projections of the input: queries (Q), keys (K), and values (V). The attention score between two tokens is the dot product of one token's query with the other token's key, scaled by the square root of the key dimension and passed through a softmax function. The resulting weights are applied to the value vectors, producing a context-aware representation of each token.
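Written as a single formula, as in the original 2017 paper (here d_k is the dimension of the key vectors):

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\]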
Attention mechanisms are central to why modern LLMs are so effective. They allow models to capture grammar, coreference, logical relationships, and even factual knowledge within their learned attention patterns. However, the computational cost of attention scales quadratically with sequence length, which is why context window limits and efficiency optimizations such as FlashAttention remain active areas of research.
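To see the quadratic growth concretely, here is a tiny illustration; the context lengths are arbitrary examples, not limits of any particular model:

```python
# The number of query-key scores in one full self-attention pass grows as n^2.
for n in (1_024, 8_192, 131_072):
    print(f"{n:>7} tokens -> {n * n:>18,} pairwise scores per head, per layer")
```

Doubling the context length therefore quadruples the work in this step.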
In more detail, the computation proceeds in four steps. First, each input token is projected through three separate learned weight matrices to produce a query vector (what this token is looking for), a key vector (what this token represents), and a value vector (the information this token carries).

Next, the query of each token is compared with the keys of all other tokens using a dot product, then scaled by the square root of the key dimension. This produces raw attention scores indicating how relevant each token is to every other token.

The raw scores are then passed through a softmax function to convert them into a probability distribution. This ensures the attention weights for each token sum to 1, emphasizing the most relevant connections while suppressing less relevant ones.

Finally, the normalized attention weights are used to compute a weighted sum of the value vectors. This produces a new representation of each token that incorporates context from the most relevant parts of the entire input sequence.
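The sketch below walks through these four steps with NumPy. It is a minimal single-head illustration with toy dimensions and randomly initialized weights, not code from any particular library or model.

```python
import numpy as np

def softmax(scores):
    # Subtract the row-wise max before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Step 1: project each token into query, key, and value vectors.
    q = x @ w_q   # (seq_len, d_k)
    k = x @ w_k   # (seq_len, d_k)
    v = x @ w_v   # (seq_len, d_v)

    # Step 2: dot-product scores between every query and every key,
    # scaled by sqrt(d_k) so the softmax does not saturate.
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)   # (seq_len, seq_len)

    # Step 3: softmax turns each row of scores into weights that sum to 1.
    weights = softmax(scores)

    # Step 4: each token's new representation is a weighted sum of the values.
    return weights @ v   # (seq_len, d_v)

# Toy example: 5 tokens, 8-dimensional embeddings, a 4-dimensional head.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                      # token embeddings
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)    # -> (5, 4)
```

Multi-head attention, described earlier, simply runs several copies of this computation in parallel, each with its own projection matrices, and concatenates the outputs.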
A few concrete examples show the effect. In the sentence "The cat sat on the mat because it was tired," the attention mechanism helps the model determine that "it" refers to "the cat" rather than "the mat" by assigning higher attention weights between "it" and "cat."

When translating "The bank of the river" from English to French, attention allows the model to focus on the word "river" when translating "bank," correctly choosing the geographical meaning rather than the financial one.

And when summarizing a long article, attention allows the model to identify and focus on the most important sentences and key facts across the entire document, even when they are spread far apart in the text.
Attention mechanisms are the fundamental innovation that makes modern LLMs possible. Without attention, models could not effectively process long contexts, understand relationships between distant words, or train in parallel efficiently enough to reach the billions of parameters that give today's models their capabilities. Understanding attention is essential for anyone building with LLMs or optimizing their performance.
Respan provides observability into how attention-intensive workloads affect your LLM application performance. Monitor latency patterns as context lengths grow, track token usage and costs, and identify when long sequences are degrading response quality or increasing inference times.
Try Respan free