Disclaimer
This guide focuses on LLMs that use the decoder-only Transformer architecture and self-attention mechanisms. Please note that specific details and architecture may vary among different LLMs.
LLMs are trained on extremely large amounts of data. During training, they adjust their parameters so that they learn to recognize intricate patterns, contextual relationships, and linguistic nuances in human language.
Most LLMs are trained on internet data, which includes sources such as Wikipedia articles and books. The data is typically converted into plain text and used to train newly created LLMs. Models like Llama, GPT-3.5/4, and GPT-J are trained with a technique called "causal language modeling." In this task, the model's objective is to predict the next token given a certain context.
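As a rough illustration of what "predict the next token" means in practice, here is a minimal sketch. The token IDs, vocabulary size, and random scores are made up; a real LLM computes the scores with its neural network and trains on billions of tokens.

```python
import numpy as np

# A toy sequence of token IDs, e.g. for "I have a grey cat" (values are illustrative).
token_ids = np.array([34, 235, 456, 4512, 219])
vocab_size = 50_000  # assumed vocabulary size, purely illustrative

# Causal language modeling: the tokens up to position t are used to predict
# the token at position t + 1.
inputs, targets = token_ids[:-1], token_ids[1:]

# Stand-in for the model: random scores (logits) over the vocabulary
# for every input position.
rng = np.random.default_rng(0)
logits = rng.normal(size=(len(inputs), vocab_size))

# Cross-entropy loss: how "surprised" the model is by the true next token.
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
loss = -np.log(probs[np.arange(len(targets)), targets]).mean()
print(f"next-token prediction loss: {loss:.3f}")
```

Training consists of nudging the model's parameters so that this loss goes down across the whole dataset.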
Training LLMs is computationally intensive. It often requires substantial computational resources, including tens of thousands of processing units, to train large-scale models like Llama or ChatGPT.
Let’s take a look at how the model generates an output.
Tokenization is a fundamental process in generative AI that involves dividing a given text into individual units, referred to as tokens. These tokens can be words, subwords, or even characters, depending on the level of granularity required for the task at hand.
Why? Because machines don't understand words as such, we need to convert text into numerical representations. However, splitting text into individual characters is inefficient, as you would need around four or five tokens per word on average. On the other hand, treating each whole word as a token is also inefficient, as English alone would require a vocabulary of roughly 171,476 words. That's why we train a small tokenizer model on a large amount of text to provide the most efficient tokenization with the smallest possible vocabulary.
For instance, when working with AI models, we often represent text data as numerical vectors. In this context, the tokens are typically individual words from the text, and the features of the machine learning model correspond to the words present in the vocabulary. Consider a vocabulary containing words like a, aardvark, ..., zyzzyva. The feature vector can be quite extensive: with 100,000 words in the language model, the feature vector has length 100,000.
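As a toy illustration of such a vocabulary-length vector (only five made-up words here instead of 100,000):

```python
# A tiny, made-up vocabulary mapping words to indices.
vocab = {"a": 0, "aardvark": 1, "cat": 2, "grey": 3, "zyzzyva": 4}

def one_hot(word: str) -> list[int]:
    """Return a vector the length of the vocabulary with a 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

print(one_hot("cat"))  # [0, 0, 1, 0, 0]
```

With a real 100,000-word vocabulary, each such vector would have 100,000 entries.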
Tokenization is not a straightforward task, as it involves determining what constitutes a "word." For instance, the contraction aren't can be treated as a single token, broken down into aren/'/t, or split as are/n't. The chosen tokenization scheme can significantly impact the performance of AI models, so it requires careful consideration.
Example:
“I have a grey cat”
{"I": 34, "have": 235, "a": 456, "grey": 4512, "cat": 219}
"I" will always be represented by the integer value 34.
"have" will always be represented by the integer value 235.
And so on.
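In practice, tokenizers often split rarer words into subwords rather than mapping every word to a single ID. A quick way to see this is with an off-the-shelf tokenizer such as GPT-2's (the exact pieces and IDs depend on the tokenizer and will not match the toy mapping above):

```python
from transformers import AutoTokenizer  # pip install transformers

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "I have a grey cat"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print(tokens)                 # subword pieces, e.g. ['I', 'Ġhave', 'Ġa', 'Ġgrey', 'Ġcat']
print(ids)                    # the integer ID assigned to each piece
print(tokenizer.decode(ids))  # "I have a grey cat" -- the mapping is reversible
```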
The tokenized input is then processed through different layers. An initial embedding layer assigns a vector to each token. Depending on the architecture, this vector may or may not carry the position of the token in the sequence.
Note
In this context, a vector is a mathematical representation of a token (word, subword, character, etc.) that enables the model to comprehend and process human language.
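A minimal sketch of what that embedding step can look like follows. The dimensions, the random embedding table, and the sinusoidal position signal are placeholders; a trained model has learned its embedding values, and some architectures (e.g. Llama's rotary embeddings) inject position inside the attention layers instead.

```python
import numpy as np

vocab_size, d_model = 50_000, 512   # assumed sizes, illustrative only
rng = np.random.default_rng(0)

# The embedding table: one learned vector per token ID in the vocabulary.
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([34, 235, 456, 4512, 219])   # "I have a grey cat"
token_vectors = embedding_table[token_ids]        # shape: (5, 512)

# Optional sinusoidal position signal added to each token vector.
positions = np.arange(len(token_ids))[:, None]
dims = np.arange(d_model)[None, :]
angle = positions / np.power(10_000, (2 * (dims // 2)) / d_model)
pos_encoding = np.where(dims % 2 == 0, np.sin(angle), np.cos(angle))

embedded = token_vectors + pos_encoding
print(embedded.shape)  # (5, 512): one vector per token
```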
These embedded tokens are fed to the next stage.
The tokens pass through the "Decoder", which consists of multiple identical layers (the number may vary). Each layer contains two main sub-layers.
The Multi-Head self-attention sub-layer
If the vectors don't contain the position information, it gets injected here. This sub-layer helps the model understand how words in a sentence relate to each other, determining which words are connected or dependent on one another. By doing this, the model can grasp the full meaning of the sentence and understand the context.
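Here is a bare-bones sketch of the scaled dot-product attention at the heart of this sub-layer. It uses a single head with random weights; real models use multiple heads, learned weights, and residual connections, and the position injection mentioned above is omitted. The causal mask, which prevents a token from attending to later tokens, is included.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model = 5, 512                 # 5 tokens, e.g. "I have a grey cat"
x = rng.normal(size=(seq_len, d_model))   # embedded tokens from the previous step

# Projection matrices (random here, learned during training).
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Each token's query is compared with every token's key to get attention scores.
scores = Q @ K.T / np.sqrt(d_model)

# Causal mask: a token may only attend to itself and earlier tokens.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = softmax(scores)   # how strongly each token attends to the others
output = weights @ V        # context-aware representation of every token
print(output.shape)         # (5, 512)
```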
The feed-forward network (FFN) sub-layer refines and improves the model's understanding of the input text.
To do that, the FFN uses a set of mathematical operations. These operations help the model to discern more detailed patterns and connections between the words and understand the meaning of the text more accurately.
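A simplified version of that feed-forward sub-layer is shown below: two linear transformations with a non-linearity in between, applied to each token's vector independently. The sizes and random weights are placeholders, and the residual connections and normalization used in real decoder layers are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048   # the FFN's hidden size is typically a few times larger

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(x):
    """Two linear layers with a ReLU in between (some models use GELU or SwiGLU)."""
    hidden = np.maximum(0, x @ W1 + b1)
    return hidden @ W2 + b2

tokens = rng.normal(size=(5, d_model))     # output of the attention sub-layer
print(feed_forward(tokens).shape)          # (5, 512): same shape in, same shape out
```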
After the decoder layers, the model applies the "softmax" function. This function converts the raw scores the model assigns to each candidate token into probabilities, which indicate the likelihood of each token being the next one in the output sequence.
Note
Remember that tokens can be words, subwords, or characters in this context.
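As a small numeric illustration of that softmax step (the scores are invented; a real model produces one score per token in its vocabulary, often 50,000 or more):

```python
import numpy as np

# Invented raw scores (logits) for a tiny 4-token vocabulary.
logits = np.array([2.0, 1.0, 0.1, -1.2])

# Softmax turns them into probabilities that sum to 1.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs)        # roughly [0.64, 0.24, 0.10, 0.03] -- most likely token first
print(probs.sum())  # 1.0
```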
Finally, the LLM produces an output by selecting tokens from these probabilities. The choice is influenced by various factors, including the sampling method and hyperparameters such as the temperature.
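For instance, here is a sketch of how temperature can change the final choice. The scores and tiny vocabulary are invented, and greedy decoding, top-k, and nucleus (top-p) sampling are other common strategies.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["cat", "dog", "hat", "car"]      # tiny, made-up vocabulary
logits = np.array([2.0, 1.0, 0.1, -1.2])  # scores for the next token

def sample(logits, temperature=1.0):
    """Sample the next token after scaling the scores by the temperature."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return vocab[rng.choice(len(vocab), p=probs)]

print(sample(logits, temperature=0.1))   # low temperature: almost always "cat"
print(sample(logits, temperature=1.5))   # high temperature: more variety
```

Lower temperatures sharpen the probability distribution toward the highest-scoring token, while higher temperatures flatten it and make the output more varied.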