GLOSSARY TERM

What is Multi-Head Attention?

Running multiple attention mechanisms in parallel to capture distinct relationships.
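Multi-head attention projects the input into several representation subspaces, typically lower-dimensional ones, and runs attention independently in each, in parallel. Each head computes its own queries, keys, and values; the head outputs are then concatenated and passed through a final linear projection. This lets the model jointly attend to information from different representation subspaces at different positions, a core innovation of the Transformer.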
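
The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it assumes a single unbatched sequence, no masking, no bias terms, and random weights. The names `multi_head_attention`, `split_heads`, `w_q`, `w_k`, `w_v`, and `w_o` are chosen here for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over x with shape (seq_len, d_model), split across heads."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Project the input, then give each head its own slice (subspace).
    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Scaled dot-product attention, computed for all heads at once.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, seq, d_head)

    # Concatenate the heads and apply the final output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Illustrative sizes and random weights (assumptions, not trained values).
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)      # (4, 8)
print(weights.shape)  # (2, 4, 4)
```

Each head produces its own `seq_len` by `seq_len` attention pattern, which is what allows different heads to track different relationships between positions. The output keeps the input's shape, so the layer can be stacked.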

Advanced Representation

Harness multi-head architectures optimized directly via the M1 compute layer.