Transformers
Before starting my first blog on Transformers, I would like to give a big thanks to the Hugging Face community.
The Hugging Face library provides functionality to create transformer models, which you can get from the Model Hub. The Transformers library provides a single API through which any Transformer model can be loaded, trained, and saved. (All models are simple PyTorch nn.Module or TensorFlow tf.keras.Model classes and can be handled like any other models in their respective machine learning (ML) frameworks.)
What are Transformers?
Transformers are large models used to perform advanced NLP tasks.
Transformers are language models, which means they are trained on large amounts of raw text in a self-supervised fashion (humans are not required to label the data). This type of model develops an understanding of the language it has been trained on. One example is “causal language modeling”, where the output depends on the present and past inputs, but not on future ones.
There is another type called “masked language modeling”, which predicts a masked word in a sentence.
Important tool of Transformers:
Pipeline function: It follows 3 steps: preprocessing, passing the inputs through the model, and postprocessing.
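A minimal sketch of what this looks like in code (the example sentence is my own; the library picks a default checkpoint for the task):

```python
from transformers import pipeline

# Build a sentiment-analysis pipeline; a default checkpoint is
# downloaded from the Model Hub the first time this runs.
classifier = pipeline("sentiment-analysis")

print(classifier("I've been waiting for a HuggingFace course my whole life."))
# e.g. [{'label': 'POSITIVE', 'score': 0.9598}]
```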
a. Preprocessing: Transformer models can’t process raw text directly, so the first step of our pipeline is to convert the text inputs into numbers that the model can understand. To do this, we use a “tokenizer”. The tokenizer performs the tasks below:
- Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
- Mapping each token to an integer
- Adding additional inputs that may be useful to the model
To do this, we use the AutoTokenizer class and its from_pretrained() method.
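As a sketch, assuming a sentiment-analysis checkpoint from the Model Hub (the checkpoint name is my own illustrative choice, not fixed by the post):

```python
from transformers import AutoTokenizer

# Checkpoint chosen for illustration; any Model Hub checkpoint works here.
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
```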
Next, we need to convert the list of input IDs to tensors, since Transformer models only accept tensors as input. To get tensors, we use the return_tensors argument.
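For example (the two input sentences are illustrative; "tf" requests TensorFlow tensors, "pt" would request PyTorch ones):

```python
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

# Pad/truncate to a common length and return TensorFlow tensors.
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="tf")
print(inputs)
# {'input_ids': <tf.Tensor: shape=(2, 16), dtype=int32, ...>,
#  'attention_mask': <tf.Tensor: shape=(2, 16), dtype=int32, ...>}
```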
b. Passing the preprocessed inputs through the model: We can download a pretrained model the same way we did our pretrained tokenizer. Transformers provides a TFAutoModel class, which also has a from_pretrained method:
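Continuing the sketch with the same checkpoint and tokenized inputs as above:

```python
from transformers import TFAutoModel

model = TFAutoModel.from_pretrained(checkpoint)

# The tokenizer output can be passed directly to the model.
outputs = model(inputs)
```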
The output of the model is hidden states, also known as features. For each model input, we’ll retrieve a high-dimensional vector. It has 3 dimensions:
- Batch size: the number of sequences processed at a time (2 in the example above).
- Sequence length: the length of the numerical representation of the sequence (16 in the example above).
- Hidden size: the vector dimension of each model input (768 is common for smaller models; in larger models this can reach 3072 or more).
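With the two example sentences from the sketch above, the shape of the hidden-state tensor reflects exactly these three dimensions:

```python
print(outputs.last_hidden_state.shape)
# (2, 16, 768) -> (batch size, sequence length, hidden size)
```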
There are many different architectures available in Transformers, like:
- *Model (retrieve the hidden states)
- *ForCausalLM
- *ForMaskedLM
- *ForMultipleChoice
- *ForQuestionAnswering
- *ForSequenceClassification
- *ForTokenClassification, etc.
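For sentence classification, a sketch using the *ForSequenceClassification variant with the same checkpoint and inputs as before (the printed values are the logits discussed below):

```python
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(inputs)

print(outputs.logits.shape)
# (2, 2)
print(outputs.logits)
# tf.Tensor([[-1.5607,  1.6123],
#            [ 4.1692, -3.3464]], shape=(2, 2), dtype=float32)
```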
Here, we can see that the dimensionality of the output vector is reduced: it contains two values (one per label), so for our batch of two sentences we get a result of shape 2 x 2.
c. Postprocessing: The predictions of the model are post-processed so that you can make sense of them. The output we get from the model consists of raw, unnormalized scores, which do not make sense on their own, so we have to convert them into an understandable format. All Transformer models output logits, which need to be converted into probabilities.
For example, the values [-1.5607, 1.6123] and [4.1692, -3.3464] printed above are not probabilities; they are logits. To convert them into probabilities, we need to pass them through a softmax layer:
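A minimal softmax sketch in TensorFlow, applied to the logits from the previous step:

```python
import tensorflow as tf

# Softmax turns each row of logits into a probability distribution.
predictions = tf.math.softmax(outputs.logits, axis=-1)
print(predictions)
# tf.Tensor([[0.0402, 0.9598],
#            [0.9995, 0.0005]], shape=(2, 2), dtype=float32)
```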
Now, [0.0402, 0.9598] and [0.9995, 0.0005] are probabilities.
To get the labels, we can use the id2label attribute of the model config.
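For the checkpoint assumed in this sketch, it maps index 0 to NEGATIVE and index 1 to POSITIVE:

```python
print(model.config.id2label)
# {0: 'NEGATIVE', 1: 'POSITIVE'}
```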
Finally, we have:
- First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598
- Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005
Most commonly used pipelines:
i. feature-extraction
ii. fill-mask
iii. ner
iv. question-answering
v. sentiment-analysis
vi. summarization
vii. text-generation
viii. zero-shot-classification
Note: There are other pipelines available as well. Apart from the pipelines mentioned above, you can also use any model from the Model Hub in a pipeline.
Example:
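A sketch of naming a specific Model Hub checkpoint in a pipeline; the distilgpt2 checkpoint and the prompt are my own illustrative choices:

```python
from transformers import pipeline

# Any checkpoint from the Model Hub can be named explicitly.
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)
```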
Transformers Evolution
The original Transformer architecture was introduced in June 2017, and was followed by models such as GPT (June 2018), BERT (October 2018), GPT-2 (February 2019), DistilBERT, BART, and T5 (2019), and GPT-3 (May 2020).
Transformer Architecture
It consists of an Encoder (with attention layers) and a Decoder.
Encoders are “bi-directional” models, also called auto-encoding models. They have special layers called attention layers (these layers tell the model to pay specific attention to certain words in the sentence you pass it), which can access the words that come after as well as before a given word in the sentence. Encoder models are best suited for tasks requiring an understanding of the full sentence, such as sentence classification, named entity recognition (and more generally word classification), and extractive question answering.
Examples of Encoder models are: ALBERT, BERT, DistilBERT, ELECTRA, and RoBERTa.
Decoders are “unidirectional” models, also called auto-regressive models. They work sequentially and can only pay attention to the words in the sentence that they have already generated. In other words, for a given word, the attention layers can only access the words positioned before it in the sentence; they are not allowed to use future words. For example, when trying to predict the fourth word, the attention layer will only have access to the words in positions 1 to 3. These models are best suited for tasks involving text generation.
Examples of Decoder models are: CTRL, GPT, GPT-2, and Transformer XL.
Encoder-Decoder models are also called sequence-to-sequence models. So we can say that the Transformer is a seq2seq model, because it has both an Encoder and a Decoder. Sequence-to-sequence models are best suited for tasks revolving around generating new sentences depending on a given input, such as summarization, translation, or generative question answering.
Examples of seq2seq models are: BART, mBART, Marian, and T5.
So, in this blog I have covered the Transformer, its evolution, its architecture, and its main tool, the pipeline.
In my next blog, I will explain the other components of the pipeline function, i.e. the tokenizer and the model, in brief.