A transformer is a type of neural network architecture used in machine learning that exploits the concepts of attention and self-attention in a stack of encoders and decoders. It was first described in a 2017 paper from Google and has since become one of the most widely used classes of models in deep learning. Transformers were designed to solve the problem of sequence transduction, meaning any task that transforms an input sequence into an output sequence, with neural machine translation as the canonical example. They are particularly successful in natural language processing (NLP) tasks such as machine translation, document summarization, and document generation.
Transformers use self-attention mechanisms that enable them to focus on the most relevant parts of the input data, making them effective at capturing the underlying relationships between input and output. Because self-attention processes all sequence positions in parallel rather than one step at a time, transformers also scale well to very large training datasets. They can furthermore be pre-trained as general models and then fine-tuned for specific tasks, enabling transfer learning, where a single pre-trained model is adapted to many downstream tasks with far less task-specific data and training time.
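As a minimal sketch of this transfer-learning workflow, the example below loads a pre-trained model and attaches a fresh classification head for fine-tuning; the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint are illustrative assumptions, not choices made in the text above.

```python
# A minimal transfer-learning sketch, assuming the Hugging Face
# `transformers` library and the `bert-base-uncased` checkpoint
# (both illustrative choices, not prescribed by the text).
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# The pre-trained encoder weights are reused as-is; only the new
# classification head starts from random weights and is learned
# (along with optional encoder updates) during fine-tuning.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

inputs = tokenizer("Transformers enable transfer learning.",
                   return_tensors="pt")
outputs = model(**inputs)  # logits over the 2 target labels
```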
All transformers share the same primary components: tokenizers, which convert text into tokens; embedding layers, which convert tokens into semantically meaningful vector representations; and transformer layers, which carry out the model's reasoning and consist of attention and MLP sublayers. The basic building blocks are scaled dot-product attention units. For each attention unit, the transformer model learns three weight matrices: the query weights, the key weights, and the value weights. Each input token representation is multiplied by each of the three weight matrices to produce a query vector, a key vector, and a value vector. Attention scores are then computed as the dot products of the query and key vectors, scaled by the square root of the dimension of the key vectors. The scores are normalized using the softmax function, and the output of the self-attention unit is the sum of the value vectors weighted by the normalized scores.
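The sketch below implements this computation for a single attention head in NumPy; the sequence length, dimensions, and randomly initialized weight matrices are illustrative stand-ins for learned parameters.

```python
# A minimal single-head scaled dot-product self-attention sketch in
# NumPy. Dimensions and random weights are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) matrix of input token representations."""
    Q = X @ W_q  # query vectors, (seq_len, d_k)
    K = X @ W_k  # key vectors,   (seq_len, d_k)
    V = X @ W_v  # value vectors, (seq_len, d_v)
    d_k = K.shape[-1]
    # Dot products of queries and keys, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # normalize each row of scores
    return weights @ V                   # weighted sum of value vectors

# Example: 4 tokens with model dimension 8, projected to d_k = d_v = 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)   # (4, 8) attention output
```

In a full transformer layer, many such heads run in parallel and their outputs are concatenated and projected before passing through the MLP sublayer.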