LLM Foundations

Lecture by Sergey Karayev. Published May 19, 2023. Download slides.

Chapter Summaries

Intro

Chapter 0 Cover Image

Discuss four key ideas in machine learning
Address diverse audience, including experts, executives, and investors
Cover Transformer architecture
Mention notable LLMs (e.g., GPT, T5, BERT, etc.)
Share details on running a Transformer

Foundations of Machine Learning

Chapter 1 Cover Image

Machine learning has shifted from traditional programming (Software 1.0) to a Software 2.0 mindset, where algorithms are generated from training data and more emphasis is placed on the training system.
Three types of machine learning include unsupervised learning, supervised learning, and reinforcement learning, which have mostly converged to a supervised learning approach.
For machines, input and output are always just numbers, represented as vectors or matrices.
One dominant approach to machine learning today is neural networks, also known as deep learning, which was inspired by the human brain's structure and function.
Neural networks consist of perceptrons connected in layers, and all operations are matrix multiplications.
GPUs, originally developed for graphics and video games, have played a significant role in advancing deep learning due to their compatibility with matrix multiplications.
To train a neural network, data is typically split into training, validation, and test sets to avoid overfitting and improve model performance.
Pre-training involves training a large model on extensive data, which can then be fine-tuned using smaller sets of specialized data for better performance.
Model hubs, such as Hugging Face, offer numerous pre-trained models for various machine learning tasks and have seen significant growth in recent years.
The Transformer model has become the dominant architecture for a wide range of machine learning tasks.

The Transformer Architecture

Chapter 2 Cover Image

Transformer architecture introduced in 2017 paper "Attention is All You Need"
Set state-of-the-art results in translation tasks
Applied to other NLP tasks and fields like vision
Appears complicated but consists of two similar halves
Focusing on one half called the decoder

Transformer Decoder Overview

Chapter 3 Cover Image

The task of the Transformer decoder is to complete text, much like GPT models.
The input consists of a sequence of tokens (e.g., "it's a blue"), and the goal is to predict the next word (e.g., "sundress").
The output is a probability distribution over potential next tokens.
Inference involves sampling a token from the distribution, appending it to the input, and running the model again with the updated input.
ChatGPT operates by seeing user input, sampling the next word, appending it, and repeating this process.

Inputs

Chapter 4 Cover Image

Inputs need to be vectors of numbers
Text is turned into vectors through tokenization
Tokens are assigned an ID in a vocabulary, rather than being words
Numbers are represented as vectors using one-hot encoding (e.g., number 3 represented by a vector with 1 in third position, zeros everywhere else)

Input Embedding

Chapter 5 Cover Image

One-hot vectors are not good representations of words or tokens as they don't capture the notion of similarity between words
To address the issue, we use embedding
Embedding involves learning an embedding matrix which converts a one-hot vocabulary encoding into a dense vector of chosen dimensionalities
This process turns words into dense embeddings, making it the simplest neural network layer type

Masked Multi-Head Attention

Chapter 6 Cover Image

Attention was introduced in 2015 for translation tasks, and the idea is to predict the most likely next token based on the importance of previous tokens.
Attention mechanism involves an output as a weighted sum of input vectors, and these weights are calculated using dot products (similarities) between the input vectors.
Each input vector plays three roles in the attention mechanism: as a query, key, and value.
To learn and improve attention, input vectors can be projected into different roles (query, key, and value) by multiplying them with learnable matrices.
Multi-head attention refers to learning several different ways of transforming inputs into queries, keys, and values simultaneously.
Masking is used to prevent the model from "cheating" by considering future tokens; it ensures that the model only predicts the next token based on the already seen input.

Positional Encoding

Chapter 7 Cover Image

No notion of position in the current model, only whether something has been seen or not.
Positional encoding is introduced to provide ordering among the seen elements.
Current equations resemble a bag of unordered items.
Positional encoding vectors are added to embedding vectors to provide order.
Seems counterintuitive, but it works; attention mechanism figures out relevant positions.

Skip Connections and Layer Norm

Chapter 8 Cover Image

Add up and norm attention outputs using skip connections and layer normalization
Skip connections help propagate loss from end to beginning of model during backpropagation
Layer normalization resets mean and standard deviation to uniform after every operation
Input embedding determines the dimension of the entire Transformer model
Normalization seems inelegant but is very effective in improving neural net learning

Feed-forward Layer

Chapter 9 Cover Image

Feed forward layer is similar to the standard multi-layer perceptron.
It receives tokens augmented with relevant information.
The layer upgrades the token representation.
The process goes from word-level to thought-level, with more semantic meaning.

Transformer hyperparameters and Why they work so well

Chapter 10 Cover Image

GPT-3 model ranges from 12 to 96 layers of Transformer layers with adjustable embedding dimensions and attention heads, totaling 175 billion parameters.
Most of GPT-3's parameters are in the feed forward layer, but for smaller models, a significant portion is in embedding and attention.
Transformers are effective general-purpose differentiable computers that are expressive, optimizable via backpropagation, and efficient due to parallel processing.
Understanding exact expressiveness of the Transformer is ongoing, with interesting results like RASP (a programming language designed to be implemented within a Transformer).
Decompiling Transformer weights back to a program is still an unsolved problem.
Multiple attention heads allow the model to figure out how to use a second head, showcased in work like Induction Heads.
Learning to code Transformers isn't necessary for AI-powered products, but can be fun and educational. Resources like YouTube tutorials and code examples are available to assist in learning.

Notable LLM: BERT

Chapter 11 Cover Image

Bert, T5, and GPT cover the gamut of large Transformer models
Bert stands for bi-directional encoder representation from Transformers
Bert uses the encoder part of the Transformer, with unmasked attention
Bert contains 100 million parameters, considered large at its time
Bert was trained by masking 15% of words in a text corpus and predicting the masked words
Bert became a building block for other NLP applications

Notable LLM: T5

Chapter 12 Cover Image

T5 applies Transformer architecture to text-to-text transfer, meaning both input and output are text strings
The task is encoded in the input string and can involve translation, summarization, etc.
Encoder-decoder architecture was found to be best, with 11 billion parameters
Trained on Colossal Queen Crawl Corpus (C4) derived from Common Crawl dataset
C4 was created by filtering out short pages, offensive content, pages with code, and de-duplicating data
Fine-tuned using academic supervised tasks for various NLP applications

Notable LLM: GPT

Chapter 13 Cover Image

GPT is a generative pre-trained Transformer, with GPT-2 being decoder only
GPT-2 was trained on a dataset called WebText created by scraping links from Reddit
GPT tokenizes text using byte pair encoding, a middle ground between old-school tokenization and using UTF-8 bytes
GPT-3 came out in 2020 and is 100 times larger than GPT-2, enabling few-shot and zero-shot learning
GPT-3 was trained on webtext, raw common crawl data, a selection of books, and all of Wikipedia
The dataset for GPT-3 contained 500 billion tokens, but it was only trained on 300 billion tokens
GPT-4 details are unknown, but it is assumed to be much larger than previous versions due to the trend in increasing size

Notable LLM: Chinchilla and Scaling Laws

Chapter 14 Cover Image

Using more computation to train AI systems improves their performance
Rich Sutton's "bitter lesson": advantage goes to those stacking more layers
DeepMind's paper, Training Compute Optimal LLMs: studied relationship between model size, compute and data set size
Most LLMs in literature had too many parameters for their data amount
Chinchilla model (70 billion) outperformed Gopher model (four times larger) by training on 1.4 trillion tokens instead of 300 billion
Open question: can models continue to improve by training repeatedly on existing data?

Notable LLM: LLaMA

Chapter 15 Cover Image

Llama is an open-source chinchilla optimal LLM from Meta Research
Several sizes available, ranging from 7 billion to 65 billion, with at least 1 trillion tokens
Competitively benchmarks against GPT-3 and other state-of-the-art LLMs
Open source but non-commercial license for pre-trained weights
Trained on custom common crawl filtering, C4, GitHub, Wikipedia, books, and scientific papers
Data set replicated by Red Pajama, which is also training models to replicate Llama
Interesting inclusion of GitHub as a training resource

Why include code in LLM training data?

Chapter 16 Cover Image

Including code in training data can improve performance on non-code tasks
OpenAI found this with their Codex model, which was fine-tuned on code and outperformed GPT-3 on reasoning tasks
Since then, people have been adding code to training data
Open source dataset called 'the stack' collects code from GitHub while respecting licenses

Instruction Tuning

Chapter 17 Cover Image

Discusses instruction tuning in GPT models and its impact on performance
Mentions the shift from text completion mindset to instruction following mindset
Supervised fine-tuning helps models become better at zero-shot tasks by using data sets of zero-shot inputs and desired outputs
OpenAI hired thousands of contractors to gather zero-shot data and used reinforcement learning for training
GPT model lineage includes DaVinci, Codex, and various iterations, fine-tuning for specific applications
Fine-tuning imposes an "alignment tax," decreasing few-shot learning ability and model's confidence calibration
Llama model by Stanford team used GPT-3 generated instructions, costing less but with reduced performance compared to GPT-3
A specific data set for instruction tuning in chat-based paradigms is called "Open Assistant"

Notable LLM: RETRO

Chapter 18 Cover Image

Discussing a model called "retrieval enhancing" from DeepMind
Goal: train a smaller model good at reasoning and writing code, but looks up facts from a database
Used "burden-coded" sentences in a trillion-token database for fact retrieval
Not as effective as large language models yet, but shows potential for the future

We are excited to share this course with you for free.

We have more upcoming great content. Subscribe to stay up to date as we release it.

We take your privacy and attention very seriously and will never spam you. I am already a subscriber