An attention head is the core functional unit of a Transformer’s multi-head attention layer. It acts as a soft retrieval mechanism: for each token in a sequence (e.g., a word), it computes how much weight that token should place on every other token. Because each head learns its own weighting, different heads can specialize in different syntactic or semantic relationships.
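
To make the weighting concrete, here is a minimal sketch of a single head, assuming the scaled dot-product formulation used in the standard Transformer. It relies on NumPy. The function name `attention_head`, the projection matrices `w_q`, `w_k`, `w_v`, and the toy dimensions are illustrative assumptions, not anything specified above, and masking, batching, and the multi-head output projection are deliberately left out.

```python
import numpy as np


def attention_head(x, w_q, w_k, w_v):
    """Single scaled dot-product attention head (illustrative sketch).

    x:   (seq_len, d_model) token embeddings
    w_*: (d_model, d_head) projection matrices (learned in a real model)
    Returns the head output (seq_len, d_head) and the attention
    weights (seq_len, seq_len).
    """
    q = x @ w_q  # queries: what each token is looking for
    k = x @ w_k  # keys: what each token offers for matching
    v = x @ w_v  # values: the content that gets retrieved

    d_head = q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_head).
    scores = (q @ k.T) / np.sqrt(d_head)

    # Row-wise softmax; subtracting the row max keeps exp() stable.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Each output row is a weighted average of the value vectors.
    return weights @ v, weights


# Toy input: 4 tokens, model width 8, head width 2 (arbitrary sizes).
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(d_model, d_head))
w_k = rng.normal(size=(d_model, d_head))
w_v = rng.normal(size=(d_model, d_head))

output, weights = attention_head(x, w_q, w_k, w_v)
print(weights.shape)  # (4, 4): one row of weights per token
```

Row `i` of `weights` is the distribution of focus that token `i` places over all tokens in the sequence, so each row is non-negative and sums to 1. The random matrices here stand in for parameters that a trained model would learn.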