How Large Language Models (LLMs) Work
A Beginner's Guide to Large Language Models
You're probably using AI bots like ChatGPT, Gemini, and others for your daily tasks—even finding recipes for your favorite foods. But have you ever wondered about the technology behind these generative models? Let me introduce you to Large Language Models.
Large Language Models (LLMs) have become an integral part of modern technology, powering everything from chatbots to translation tools. This guide explains how these models work in clear, simple terms.
If you're new to AI but have heard fancy terms like machine learning, deep learning, and neural networks, don't worry. I'll help you understand how LLMs work and guide you into the world of AI.
This is not a deep dive into technical details, but rather a simple overview with essential information to help you understand LLMs.
The World of Artificial Intelligence
AI is the simulation of human intelligence in machines: the ability of computers to think, learn, and make decisions in a human-like way.
Machine Learning
“Machines learning from data.”
The essence of machine learning is that machines are trained on large amounts of data to recognize patterns. Once a machine learns these patterns, it can apply them to new, unseen data to test how well it performs on random samples.
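To make this concrete, here is a minimal sketch in Python using scikit-learn. The tiny cat/dog dataset and all its numbers are invented purely for illustration:

```python
# A minimal sketch of "learn patterns, then apply them to unseen data",
# using scikit-learn. The toy measurements below are made up.
from sklearn.tree import DecisionTreeClassifier

# Training data: [height_cm, weight_kg] -> label (0 = cat, 1 = dog)
X_train = [[25, 4], [30, 5], [60, 25], [70, 30]]
y_train = [0, 0, 1, 1]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)  # the machine learns the patterns

# Apply the learned patterns to new, unseen samples
print(model.predict([[28, 4.5], [65, 28]]))  # -> [0 1]
```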
Deep Learning
Deep learning is a subset of machine learning. The idea behind deep learning is to simulate the human brain's neural networks and their power using artificial networks of neurons.
Large Language Models
Now come LLMs, the magic behind text generation. You now know the big picture and where LLMs fit within it.
Encoder-Decoder Architecture
First, let's understand the encoder-decoder architecture.
The encoder-decoder architecture is a neural network framework widely used in sequence-to-sequence (seq2seq) tasks such as machine translation, text summarization, and image captioning.
But what are seq2seq tasks, or sequence-to-sequence learning? Seq2seq refers to transforming one sequence of data into another. It is commonly used in machine learning for tasks where the input and output are both sequences, which can vary in length.
There are two main components in this architecture. The goal is to create an output sequence from an input sequence; one example is taking a sequence of English text and translating it into French or German.
Encoder:
The encoder's task is to process the input sequence, which consists of data or words that are fed into it.
Input: The input sequence X = {x1, x2, …, xn}, where each xi represents one element (e.g., a word embedding).
Word embedding is a technique used in natural language processing (NLP) to represent words as numbers so that computers can work with them.
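Here is a small sketch of that idea using PyTorch's nn.Embedding layer. The five-word vocabulary and the vector size of 3 are made-up values; real models use vocabularies of tens of thousands of words and vectors with hundreds of dimensions:

```python
import torch
import torch.nn as nn

# A hypothetical 5-word vocabulary; each word becomes a 3-number vector.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)

token_ids = torch.tensor([vocab["the"], vocab["cat"], vocab["sat"]])
vectors = embedding(token_ids)  # each word is now a vector of numbers
print(vectors.shape)            # torch.Size([3, 3]): 3 words, 3 numbers each
```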
Operation:
Inside the encoder, there are various neural network mechanisms at work, including Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), and Convolutional Neural Networks (CNN). We'll skip the technical details of how these deep learning networks operate.
Each input token is processed in sequence (one word of the sentence at a time).
Hidden states capture contextual information.
The final hidden state or sequence of hidden states represents the input sequence.
Output: A fixed-size vector representation of the input sequence, called the context vector, which is sent to the decoder. This context vector encapsulates the meaning of the input sequence.
Decoder
The decoder generates the output sequence Y = {y1, y2, …, yn} based on the context vector produced by the encoder.
Input: The context vector from the encoder and (optionally) the previous output token.
Operation: Like the encoder, the decoder is often an RNN, LSTM, or GRU. At each time step:
It takes the previous hidden state, the previous output, and the context vector to predict the next token.
A softmax layer converts the output into probabilities over the vocabulary.
Output: The predicted sequence Y.
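Putting the pieces together, here is a minimal, untrained sketch of the encoder-decoder idea in PyTorch. All the sizes (a vocabulary of 1000, embedding size 32, hidden size 64) are made-up values, and a real model would of course be trained before use:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, x):                  # x: (batch, input_length)
        _, hidden = self.rnn(self.embed(x))
        return hidden                      # the fixed-size context vector

class Decoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, y_prev, hidden):     # y_prev: the previous output token
        output, hidden = self.rnn(self.embed(y_prev), hidden)
        return self.out(output), hidden    # scores over the vocabulary

# One decoding step: encode the input, then predict the next output token.
src = torch.randint(0, 1000, (1, 7))      # a 7-token input sequence
context = Encoder()(src)
logits, _ = Decoder()(torch.tensor([[0]]), context)  # 0 = start token id
probs = torch.softmax(logits, dim=-1)     # the softmax layer: probabilities
```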
What are the problems and challenges with sequence-to-sequence learning in encoder-decoder architecture?
The main challenge is that the context vector must compress all of the input information into a fixed-length vector. If the sequence is long, say 200 words, the encoder starts losing information: a single context vector cannot remember a long input sequence and tends to forget its earlier parts.
To tackle this challenge, another mechanism was introduced: the attention mechanism. Let's understand what it is.
Attention Mechanism:
This mechanism builds on the encoder-decoder architecture: the encoder works the same way, but the difference is in the decoder block.
Instead of using a single context vector, multiple context vectors are used, with each one referring to different parts of the input text.
This allows the model to focus on the most relevant parts of the input sequence while generating each output element, solving the problem of information loss when encoding long sequences into a single context vector.
We will not go into the mathematical details of this mechanism, but a high-level overview is enough to understand it clearly.
The attention mechanism works by assigning different weights to different parts of the input data, indicating their relative importance to the current step of the output sequence generation.
Working of the Decoder in the Attention Mechanism:
At each time step t, the decoder generates one output yt. To do this, it first computes an attention vector for that step: every input word receives an attention weight, and the words with higher weights are given more importance when generating the current output.
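Here is a rough sketch of one attention step in PyTorch, using simple dot-product scores (one common variant). The random tensors stand in for the real encoder and decoder hidden states a trained model would produce:

```python
import torch

encoder_states = torch.randn(7, 64)  # one hidden state per input word
decoder_state = torch.randn(64)      # the decoder's state at time t

scores = encoder_states @ decoder_state  # one relevance score per input word
weights = torch.softmax(scores, dim=0)   # attention weights, summing to 1
context_t = weights @ encoder_states     # the context vector for time t
print(weights)  # higher weight = that input word matters more right now
```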
While the attention mechanism offers many advantages, it also comes with challenges.
One of the main challenges is computational complexity: a separate attention vector, and then a context vector, must be calculated at every time step of the decoder.
As the length of the input sequence increases, the attention mechanism requires more memory and computation.
Another challenge researchers discovered was that this architecture processes words sequentially (one by one). They wanted to transform this sequential processing of data into parallel processing.
And that's where Transformers come into the picture.
Transformers:
There is a popular research paper, “Attention Is All You Need”, which introduced the Transformer architecture.
The new transformer architecture solved many limitations of previous sequence models like RNNs and LSTMs. Don't worry about these technical terms—the goal here isn't to understand complex deep learning models, but rather to gain a high-level understanding of how these systems work and solve problems.
The core idea of the Transformer is to replace recurrent operations with self-attention mechanisms, allowing the model to process the entire sequence of data in parallel rather than sequentially.
The Transformer computes relationships between all tokens in a sequence at once. This makes training more efficient and also allows the model to be parallelized effectively on modern hardware like GPUs.
Main Concepts of Transformer Architecture:
Positional Encoding:
Since the Transformer doesn't process tokens (words) sequentially, it needs a way to understand their order. To address this, the model adds positional encodings to the input tokens, allowing it to track each token's position in the sequence.
Conceptually, this is like tagging each word in the input sequence with its position. In practice, a positional-encoding vector is added to each word's embedding.
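For the curious, here is a sketch of the sinusoidal positional encoding described in “Attention Is All You Need”, written with NumPy. The sequence length and model size are arbitrary example values:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]   # each position 0..seq_len-1
    i = np.arange(d_model)[None, :]     # each embedding dimension
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions use cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
# pe[position] is added to the embedding of the word at that position.
```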
Multi-head Attention:
The Transformer uses multiple attention heads, allowing the model to focus on different aspects of the sequence at the same time. Each head learns distinct types of relationships between tokens, making the model more effective overall.
Attention works like a spotlight: it lets the model examine every word in the input sentence when deciding how to translate each word in the output, so it can correctly capture the context, sentiment, and ultimate meaning of each word.
Self Attention:
Each word in the input sequence attends to every other word. This means every word can "look at" and weigh the importance of the other words, which helps capture long-range dependencies, for example a word whose sentiment depends on words far away in the sentence, and makes tasks like translation more effective.
Self-attention allows a neural network to understand each word by analyzing its relationship with all other words in the context.
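Here is a compact sketch of single-head, scaled dot-product self-attention in PyTorch. The random weight matrices stand in for what a real model would learn during training:

```python
import torch

d = 16                  # a made-up embedding size
x = torch.randn(5, d)   # embeddings for a 5-word sentence
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

Q, K, V = x @ Wq, x @ Wk, x @ Wv         # each word's query, key, value
scores = Q @ K.T / d ** 0.5              # every word scored against every word
weights = torch.softmax(scores, dim=-1)  # each row sums to 1
out = weights @ V                        # new, context-aware word vectors
print(out.shape)                         # torch.Size([5, 16])
```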
Transformers have been a groundbreaking innovation, solving many challenges in sequence modeling and natural language processing. However, they come with their own set of challenges, particularly when it comes to training them from scratch. These challenges include:
Hardware: Training transformers requires high-performance GPUs or TPUs, which may not be accessible to everyone.
Time: The training process can be extremely time-consuming, often taking days or even weeks depending on the size of the model and dataset.
Data: Transformers require large amounts of data to achieve good results. Unfortunately, not everyone has access to extensive datasets, and working with small amounts of data often leads to subpar performance.
These limitations make training transformers from scratch a significant challenge, especially for individuals or small teams with limited resources.
This is where transfer learning becomes a game-changer.
Now we need to understand what transfer learning is.
Transfer Learning:
Let me get a bit technical for the definition this time, but don't worry, you will surely grasp every concept.
“Transfer learning (TL) is a technique in which knowledge learned from a task is re-used in order to boost performance on a related task.”
Not too technical, right?
Example:
For image classification (identifying which of a set of classes a given image belongs to), knowledge gained while learning to recognize cars could be applied when trying to recognize trucks.
We are transferring the knowledge to a new task.
The first step is
Pre-training:
You train your model on a big, universal dataset using extensive computational resources so that you don't have to start from scratch later. The dataset has a huge number of samples, and the goal is to learn general features.
The second step is
Fine-Tuning:
In fine-tuning, you take the same pre-trained model and train it on your smaller, task-specific dataset. Now the model is adapted to your needs and gives better predictions. This way, we can achieve excellent results without needing vast amounts of data, time, or high-end hardware. This approach has democratized access to state-of-the-art machine learning and has become an essential tool for many in the field.
Technically speaking, fine-tuning changes the model's weights: you retain (freeze) the weights of the early layers and update the weights of the later layers according to your task or needs.
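Here is a minimal sketch of that idea in PyTorch. The toy model and its layer sizes are invented; the point is simply that the early layers are frozen while the later, task-specific layer stays trainable:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),   # "early" layers: general features
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 2),                # "later" layer: your specific task
)

for param in model[:4].parameters():  # freeze the pre-trained early layers
    param.requires_grad = False
# During fine-tuning, only the last layer's weights are updated.
```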
For a more in-depth understanding, you can refer to the research paper titled “Universal Language Model Fine-tuning for Text Classification”.
Traditionally, models are trained on supervised (labelled) data. For a task like translating English to French, the training data must be labelled: each English sentence paired with its French translation. In transfer learning, however, language modelling is used as the pre-training technique.
Language Modelling as a Pre-training Technique
Language modelling is a natural language processing task in which you teach a deep learning model to predict the next word based on the context and structure of the sentence.
This is now the standard approach, and many models use it as their base training technique.
Let's see how this language-modelling training technique helps.
Rich Feature Learning:
In the language-modelling task of predicting the next word, the model learns a lot of things: grammatical rules, the semantics of the sentence, and its nature and structure. For example:
The food was very clean, yet the serving was …..
Here the model has already learned that “yet” in a sentence signals a contrast or negative turn, so it is going to suggest or predict a negative completion.
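You can try this yourself with a small pre-trained language model. This sketch assumes the Hugging Face transformers library is installed (it downloads GPT-2 on first run), and the exact continuation will vary from run to run:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "The food was very clean, yet the serving was"
print(generator(prompt, max_new_tokens=3)[0]["generated_text"])
# The model tends to continue with something negative, e.g. "small".
```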
This is the advantage of language modelling: the knowledge it builds can be reused for many tasks, such as text classification, question-answering systems, text summarization, and part-of-speech tagging.
Huge Availability of Data:
You actually have a lot of data, because you don't need any kind of labelling: any PDF, or any meaningful text on the internet, can be used as training data. This is another big benefit, and it is why this can be called unsupervised training.
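Here is a tiny sketch of how raw, unlabelled text turns into (context, next word) training pairs automatically, with no human labelling:

```python
text = "the food was very clean yet the serving was small"
words = text.split()

# Every prefix of the sentence becomes a context; the next word is the label.
pairs = [(words[:i], words[i]) for i in range(1, len(words))]
for context, target in pairs[:3]:
    print(context, "->", target)
# ['the'] -> food
# ['the', 'food'] -> was
# ['the', 'food', 'was'] -> very
```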
Up to this point we have covered the Transformer architecture (do you remember self-attention, positional encodings, and so on?) and transfer learning as a training technique, and these two approaches are revolutionary.
What if we trained Transformer-based models with the transfer-learning technique? We could actually achieve a lot.
Now comes the actual part, yes, the big term: “LLM”.
Large Language Models (LLMs)
Between 2018 and 2020, two Transformer-based language models were released: BERT (from Google) and GPT (from OpenAI).
They were trained on huge datasets and are based on transformer architecture.
Both models are so good at transfer learning that you can fine-tune them to perform the tasks you want.
Now everyone has the power to:
take a Transformer-based model → fine-tune it on their limited dataset → get the tasks they want done.
Large language models are called "large" because they have millions or even billions of parameters. But wait, what is meant by parameters here? Parameters are the learned weights inside the model: the numbers on which text generation ultimately depends.
These models are also trained on enormous datasets, which, together with the huge number of parameters, is what makes them "large" language models.
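Here is a quick sketch that counts the parameters of a tiny, invented PyTorch model; every weight and bias is one parameter, and LLMs simply have billions of them:

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 10))
n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # 100*50 + 50 + 50*10 + 10 = 5,560
```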
So what makes a language model a Large Language Model? Let's understand the qualities.
Qualities of LLM:
Billions of parameters
Specialized hardware (clusters of GPUs, supercomputers)
Very long training times
High costs: hardware, infrastructure, electricity, human expertise
Very high energy consumption
That's how Large Language Models work. We haven't looked at the technical perspective or inner workings of each model, but this high-level overview will surely give you a solid understanding of LLMs.