What We Know About LLMs (Primer)

Author: Will Thompson (Twitter)

Hacker News Top Post 💫 : Discussion Thread [351 points, 162 comments]

Intro

Crypto VCs & "builders" making a hard left into AI (borrowed from ML Twitter)

We are in the midst of yet another AI Summer, where the possible futures enumerated in the press seem equally amazing and terrifying. LLMs are predicted both to create immeasurable wealth for society and to compete with (or deprecate?) knowledge workers. While bond markets are trying to read the tea leaves on future Fed rate hikes, equity markets are bullish on all things AI. Many companies are rapidly adopting some form of AI play in order to appease shareholder FOMO. A large percentage of YC cohort members are, unsurprisingly, generative AI startups now. All the "MAANG"s (whatever they are called these days) seem to be building some form of giant LLM. It's as though crypto was forgotten overnight; the public imagination appears singularly captivated by what possibilities AI may usher forth.

The madness of crowds aside, it is worth reflecting on what we concretely know about LLMs at this point in time and how these insights sparked the latest AI fervor. This will help put into perspective the relevance of current research efforts and the possibilities that abound.

What We Mean By “LLM”

When people say "Large Language Models", they typically are referring to a type of deep learning architecture called a Transformer. Transformers are models that work with sequence data (e.g. text, images, time series, etc) and are part of a larger family of models called Sequence Models. Many Sequence Models can also be thought of as Language Models, or models that learn a probability distribution of the next word/pixel/value in a sequence: p(w_t | w_{t-1}, w_{t-2}, ...).
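
Spelled out, this is just the standard autoregressive (chain-rule) factorization of a sequence's joint probability, with each token conditioned on everything that came before it:

```latex
p(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \ldots, w_{t-1})
```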

Figure 9.7.1 in this illustrated guide. This is a type of encoder-decoder RNN. Notice the left-to-right causal ordering: each token is processed sequentially.

What differentiates the Transformer from its predecessors is its ability to learn the contextual relationships of values within a sequence through a mechanism called (self-)Attention. Unlike the Recurrent Neural Network (RNN), where the arrow of time is preserved by processing each time step serially within a sequence, Transformers can read the entire sequence at once and learn to "pay attention to" only the values that came earlier in time (via "masking"). This allows for faster training (i.e. the whole sequence is processed in parallel) and larger model parameter counts. Transformers were once considered "large" at ~100MM+ parameters; today, published models are ~500B-1T parameters in size. Anecdotally, several papers have reported a major inflection point in Transformer behavior around ~100B+ parameters. (Note: these models are generally too large to fit into a single GPU and must be broken apart and distributed across multiple nodes.)
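
To make the "masking" idea concrete, here is a minimal NumPy sketch of single-head causal self-attention (toy dimensions and random projection matrices; real implementations use multiple heads, learned projections, and fused kernels):

```python
import numpy as np

def causal_self_attention(x, W_q, W_k, W_v):
    """Single-head self-attention with a causal (look-back-only) mask.

    x: (seq_len, d_model) token embeddings
    W_q, W_k, W_v: (d_model, d_head) projection matrices
    """
    q, k, v = x @ W_q, x @ W_k, x @ W_v           # project to queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])       # (seq_len, seq_len) pairwise scores
    mask = np.triu(np.ones_like(scores), k=1)     # 1s above the diagonal mark "future" tokens
    scores = np.where(mask == 1, -1e9, scores)    # hide future tokens before the softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the allowed (past) positions
    return weights @ v                            # weighted sum of value vectors

# Toy usage: 8 tokens, 16-dim embeddings, a single 16-dim head
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
W_q, W_k, W_v = [rng.normal(size=(16, 16)) for _ in range(3)]
print(causal_self_attention(x, W_q, W_k, W_v).shape)  # (8, 16)
```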

From Jay Alammar's popular "The Illustrated Transformer" blog post. A context window is used to read in K tokens simultaneously in order to understand the relative meaning of words in a sentence. This is opposed to the RNN (see above), which technically can be thought of as having an infinite context window (an observation by Karpathy), although each word must be processed serially. RNNs have a history of requiring clever architecture tweaks to overcome their innate inability to connect long-term dependencies in a sentence, which is where the original (hierarchical) attention mechanism was born out of necessity.

Transformers generally fall into one of three categories: "encoder only" (a la BERT), "decoder only" (a la GPT), and "encoder-decoder" (a la T5). Although all of these architectures can be adapted to a broad range of tasks (e.g. classification, translation, etc), encoders are thought to be useful for tasks where the entire sequence needs to be understood (such as sentiment classification), whereas decoders are thought to be useful for tasks where text needs to be completed (such as completing a sentence). Encoder-decoder architectures can be applied to a variety of problems, but are most famously associated with language translation.
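
As a rough illustration of the three flavors (assuming the Hugging Face transformers package is installed and the named checkpoints can be downloaded), they map onto different model classes:

```python
# Sketch only; assumes `pip install transformers torch` and network access to fetch checkpoints.
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

encoder_only    = AutoModel.from_pretrained("bert-base-uncased")    # BERT-style encoder
decoder_only    = AutoModelForCausalLM.from_pretrained("gpt2")      # GPT-style decoder
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small") # T5-style encoder-decoder
```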

Decoder-only Transformers such as ChatGPT & GPT-4 are the class of LLM that is ubiquitously referred to as "generative AI".

What We Knew About LLMs (From Before)

Since the debut of the OG Transformer paper ~6 years ago, we've gleaned a couple of interesting properties about this class of models.

Generalization 🧠

We learned that the same trained LLM could figure out how to complete many different tasks after being shown only a few examples of each task; that is, LLMs are few-shot learners. This meant that whatever the LLM had learned about language in its (pre-)training task (which is usually predicting the next word in a sequence) could transfer to new tasks without needing to be trained from scratch for said task (and with only a handful of examples).

That is, we discovered LLMs’ capacity to generalize.
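
Concretely, "few-shot" here means packing a handful of worked examples into the prompt itself, with no weight updates; a toy English-to-French prompt in the spirit of the GPT-3 paper's translation examples looks something like this:

```python
# A toy few-shot ("in-context") prompt: the model sees the pattern in the prompt and continues it.
few_shot_prompt = """Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""
```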

Figure 1.2 in the GPT-3 paper. Shows that larger models are better at generalizing to new tasks (especially via "in-context learning").

Power Laws in Performance 🚨

We also learned that LLMs had predictable (power law) scaling behavior. With larger training datasets, models could scale up in parameter size and become more data efficient, ultimately leading to better performance on benchmarks.

Given a dataset size and chosen model size, we can (seemingly magically) predict the performance of the model prior to (pre-)training it.
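
To make "seemingly magically predict" a bit more concrete, the Kaplan et al. scaling laws fit test loss to simple power laws in (non-embedding) parameter count and dataset size. The sketch below uses constants that are approximately those reported in the paper, so treat them as illustrative rather than authoritative:

```python
def loss_vs_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """L(N) = (N_c / N)^alpha_N: predicted test loss as a function of model size.
    Constants roughly follow Kaplan et al. (2020); illustrative only."""
    return (n_c / n_params) ** alpha_n

def loss_vs_data(n_tokens, d_c=5.4e13, alpha_d=0.095):
    """L(D) = (D_c / D)^alpha_D: predicted test loss as a function of dataset size (tokens)."""
    return (d_c / n_tokens) ** alpha_d

# Bigger models are predicted to reach lower loss (all else being equal)
print(loss_vs_params(1.3e9), loss_vs_params(175e9))
```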

Figure 1 in the Scaling Laws Paper 🙏🏽.

Research Trends 📈

Given these observations, a large research trend in LLMs was training progressively larger and larger LLMs and measuring their performance on benchmarks (although some papers, such as the CLIP paper, call into question whether benchmark performance actually reflects generalizability, part of a nuanced observation called "The Cheating Hypothesis"). This required splitting these models across many GPUs/TPUs (i.e. model parallelism), due in large part to a model tweak provided by the Megatron paper, innovations in model/pipeline sharding, and packages such as DeepSpeed. Quantization also reduced the memory and computational footprint of these models. And since the traditional self-attention mechanism at the core of the Transformer is O(N^2) in space and time complexity, research into faster mechanisms such as Flash Attention was naturally of considerable interest. Further, innovations like ALiBi allowed for variable context windows. This opened the door to larger context windows and is the reason why today's LLMs have context windows as large as 100k tokens.

And given the size of these (very large) LLMs, there was interest in how to fine-tune them to problems more efficiently. Innovations in Parameter-Efficient Fine-Tuning (PEFT) such as Adapters and LoRA allowed for faster fine-tuning, since there are far fewer parameters to adjust in these paradigms. Combined with the advent of 4- and 8-bit quantization, it's now even possible to fine-tune a model on a CPU! (Note: most models are trained using 16- or 32-bit floats.)
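
To give a flavor of why PEFT is cheaper, here is a minimal NumPy sketch of the LoRA idea: freeze the pretrained weight matrix W and learn only a low-rank update BA (toy shapes, no training loop, and it omits LoRA's scaling factor):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 1024, 1024, 8               # toy dimensions; rank << d_in

W = rng.normal(size=(d_out, d_in))              # frozen pretrained weight (never updated)
A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable low-rank factor
B = np.zeros((d_out, rank))                     # zero-initialized so the update starts at 0

def lora_forward(x):
    """Forward pass with a low-rank adapter: only A and B would receive gradients."""
    return W @ x + B @ (A @ x)

# Trainable parameters drop from d_out*d_in to rank*(d_out + d_in)
print(W.size, A.size + B.size)                  # 1048576 vs. 16384
```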

[Note: This is not a comprehensive overview of research trends. For instance, there was considerable research into other areas such as LLMs' ability to regurgitate information, adversarial attacks, and domain-specific LLMs such as Codex (for writing code), as well as early-stage multimodal LLMs (i.e. Transformers that understand images, text, etc). Further, RETRO and WebGPT showed that smaller LLMs could perform the same as larger models with efficient querying/information retrieval.]

[And some of these papers like Flash Attention and LoRA came chronologically after the papers discussed in the next few sections].

InstructGPT 🤖

Yet, a major breakthrough in our understanding of LLM behavior was made with the release of the InstructGPT paper.

GPT-3 (particularly the large parameter kind) already demonstrated the ability to follow natural language instructions (i.e. “prompts”), although these instructions typically needed to be carefully worded to get a desired output.

Yet, the output tended to be regurgitated language found deep in the dark corners of the Internet: unfiltered, likely offensive, untruthful and most probably not the response the user wanted. If the response wasn’t regurgitated, it could even be entirely made up (what are typically referred to as “hallucinations”).

That is to say, the LLMs’ capacity to follow natural language instructions could be quite underwhelming.

GPT-3 v. InstructGPT in response to a prompt. From OpenAI's blog post.

Steerability ⛷️ & Alignment 🧘🏽‍♀️

Training LLMs to predict the next token in a sentence has been surprisingly effective in teaching them to learn a generalizable representation of language. Train for this task over a large corpus (like the entire Internet), maybe add some other prediction tasks into the mix (translation, classification, etc - i.e. “mixture of objectives”), and voila, you have yourself an LLM that, if large enough, can be easily taught to do other specific things with only a handful of examples (i.e. few-shot learners).

Yet, this training objective does not seem to translate into LLMs following “user intent”.

LLMs might be able to answer multiple choice questions well or be fine-tuned to do some specific task; but left to their own devices, they aren’t shown to be particularly good at following user instructions without significant guardrails (particularly in a zero-shot setting).

In fact, they often regurgitate an answer they've seen before, ignore the instructions entirely and ramble, or give a confident-sounding answer that is nonsense (i.e. hallucinations). This is the problem that the ML research community calls steerability: the ability to prompt an LLM to provide a desired result.

Compound this with the desire that an LLM output embody a set of values (e.g. lack racism or homophobia, or say Anthropic’s HHH) or quality of output expected by the user and we begin to appreciate the surface of the alignment problem. What values an LLM is aligned to, how to evaluate for alignment, and whether we can align systems more intelligent than ourselves are all interesting open research threads.

Within the InstructGPT paper, the authors developed a two-part approach to tackling this problem:

  1. Supervised (Instruction) Fine-tuning (SFT)
  2. Reinforcement Learning via Human Feedback (RLHF)

Using these two sequential training stages, the authors were able to convert GPT-3 into what they call InstructGPT.

The InstructGPT paper’s results showed that simply creating larger models was an insufficient condition for developing a steerable and aligned model: in fact, the 175B GPT-3 “prompted” model performed worse on average than the 1.3B parameter InstructGPT.

Comparison of human assessments of the outputs of GPT-3, GPT-3-prompted, "SFT", and InstructGPT ("SFT" + "RLHF"). Again from the OpenAI blog post. Using humans to ascertain the quality of a generative model's output makes more sense than using a traditional set of benchmarks that need quantitative answers. Both the training and test labelers (completely different sets of people; OpenAI did an incredible job designing an experiment that minimizes label leakage and overfitting to any one person's preferences) were selected using rigorous evaluation criteria to make sure they were "aligned" with the values OpenAI wants to align their LLM to. These Likert-scale measurements capture the steerability and alignment quality of these model outputs.

That is to say, how you train your LLM can be just as important a knob as model size.

And while you might not have heard of InstructGPT, this paper's recipe was the basis for ChatGPT as well as a plethora of open source models that one can find on HuggingFace (check out the ever-changing 🤗 LLM Leaderboard).

What is a blog post without a meme?

1. Instruction Fine-Tuning 🎛️

The first step to improving the output of a generative model logically seems to be to teach it to follow instructions better.

To that end, the authors accumulated a series of prompts submitted to their Playground API by real users (i.e. tackling the cold-start problem) as well as a mixture of prompting tasks of their own devising. Using a rigorous selection process, labelers were hired to produce high-quality outputs that were aligned with Anthropic's three laws: helpful, honest and harmless (HHH, or maybe Triple H?).

Using these gold-standard labels, they fine-tuned a GPT model to learn these outputs. This is what they call “Supervised Fine-tuning” (SFT).
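
Mechanically, SFT is just next-token prediction on the curated prompt/response pairs: a cross-entropy loss over the labeler-written response tokens, typically with the prompt tokens masked out of the loss. A schematic NumPy version (toy shapes, illustrative only):

```python
import numpy as np

def sft_loss(logits, target_ids, loss_mask):
    """Cross-entropy averaged over response tokens only.

    logits:     (seq_len, vocab) model outputs at each position
    target_ids: (seq_len,) next-token targets (prompt followed by labeler response)
    loss_mask:  (seq_len,) 1.0 for response tokens, 0.0 for prompt tokens
    """
    logits = logits - logits.max(axis=-1, keepdims=True)                 # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_ll = log_probs[np.arange(len(target_ids)), target_ids]         # log-prob of each target
    return -(token_ll * loss_mask).sum() / loss_mask.sum()

# Toy usage: 6 positions, vocabulary of 10, last 3 tokens are the labeler's response
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 10))
targets = rng.integers(0, 10, size=6)
mask = np.array([0, 0, 0, 1, 1, 1], dtype=float)
print(sft_loss(logits, targets, mask))
```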

2. RLHF 💎

Now what about teaching an LLM to produce outputs that embody a set of values?

While the public perception of OpenAI is predominantly tied to GPT* models, a non-trivial percentage of their efforts has been focused on attacking the alignment problem. Their solution to this problem, thus far, has been what they call Reinforcement Learning via Human Feedback (RLHF) (note: this is expected to not work in the case of superalignment, or teaching systems smarter than ourselves our values).

Teaching an LLM a set of human values is a challenging modeling task. Traditional loss functions usually measure the model's internal belief about the probability the data belongs to the correct label (i.e. cross entropy). But human values are more complex; they are hard to encapsulate within a single label. Rather than explicitly labeling data, it might be easier for a human to read two or more LLM outputs and encode their preferences through comparison. An AI system can then implicitly learn these preferences by learning to correctly rank a set of options.
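
This comparison signal is usually formalized as a pairwise ranking loss, which is the form used to train InstructGPT's reward model: the reward model should score the human-preferred completion above the rejected one.

```python
import numpy as np

def pairwise_preference_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the reward model already ranks
    the human-preferred completion higher, large when the ranking is wrong."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

print(pairwise_preference_loss(1.2, -0.3))  # ~0.20: ranking agrees with the human
print(pairwise_preference_loss(-0.3, 1.2))  # ~1.70: ranking disagrees, so the loss is large
```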

Along these lines, RLHF seeks to teach an AI system a set of preferences by learning a reward function/model through interacting with a human and gaining their input.

This reward model is then used to guide the LLM to produce higher-valued outputs through reinforcement learning. At each step, the reward model takes an input and output and returns a "preferability" number; the LLM seeks to narrow the difference between the "optimal" preference and its current output over many interactions. This is used to train the LLM to produce a set of outputs that are "aligned" with the values of the labelers/modelers/etc (for a decent primer on RLHF, check out HuggingFace's blog post).
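
One practical detail: the quantity actually optimized during the RL stage typically combines the reward model's score with a KL-style penalty that keeps the fine-tuned policy from drifting too far from the SFT model. A rough sketch (the beta value is purely illustrative):

```python
import numpy as np

def rlhf_reward(reward_model_score, logprobs_policy, logprobs_sft, beta=0.02):
    """Reward-model score minus a KL-style penalty against the SFT model.

    logprobs_policy / logprobs_sft: per-token log-probs of the sampled response
    beta: penalty strength (illustrative value)
    """
    kl_estimate = np.sum(logprobs_policy - logprobs_sft)   # sample-based KL estimate
    return reward_model_score - beta * kl_estimate

# A response the reward model likes, but which has drifted from the SFT model's distribution
print(rlhf_reward(2.0, np.array([-1.0, -0.8]), np.array([-1.5, -2.0])))
```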

Labeler output rankings. Figure 12 in the appendix of the paper. The labeler selection process and metrics are worth diving into.

Among the theoretical benefits of using reinforcement learning to teach an LLM human preferences is the potential for reducing an LLM’s tendency towards regurgitation (as John Schulman and others have emphasized recently in lectures). Rather than overfitting/blindly learning a specific label distribution, RLHF is using rewards/penalties to guide the LLM outputs towards a more optimal answer through incentives without explicitly seeing a label (as in next token prediction).

InstructGPT was shown to produce outputs that were less toxic, more truthful, and more steerable. In fact, this paper was a major catalyst in the outpouring of new innovations using LLMs.

Conclusion: LLMs as Reasoning Agents 🤔

An emergent property of LLMs that we have gleaned in the wake of the InstructGPT paper is their capacity as reasoning agents. (Technically this behavior was evident in GPT-3 and PaLM-1 before RLHF & fine-tuning were popular, as Jason Wei observes.)

An observation by Jan Leike about InstructGPT.

Given a set of instructions, an instruction fine-tuned/aligned LLM is able (conditional on size and training quality) to reason through a set of steps to produce a desired output. This has led to a flurry of research into different prompt-writing techniques: self-ask, chain-of-thought, etc. are all different parlor tricks designed to guide an LLM towards deducing the right answer. This set of techniques is what the ML community calls prompt engineering, which seems more an art form than a science these days.
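
For example, a chain-of-thought prompt simply includes a worked example whose answer spells out its intermediate reasoning, so the model imitates the reasoning pattern before committing to an answer (example adapted from the chain-of-thought paper):

```python
# Few-shot chain-of-thought prompt: the worked example shows its reasoning, not just its answer.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
A:"""
```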

Great tweet from Karpathy on prompting. Read the sub-tweets citing some interesting papers.

As LLMs improve over time, it is possible that researchers/practitioners will not need to lean on these techniques as much, despite sensational claims in the media that you can make $300k being a prompt engineer.

Along with research into prompt engineering, LLM-centric tech stacks (what many refer to as autonomous agents) are an area of enthusiasm. These agents utilize LLMs as problem solvers: provision them with a set of tools and these models can solve a surprising number of tasks.
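
A minimal sketch of that pattern (entirely hypothetical names; call_llm stands in for whatever model API is being used, and the TOOL/ANSWER prompt format is invented for illustration): the LLM picks a tool, the harness executes it, and the observation is appended to the next prompt.

```python
# Hypothetical agent loop; call_llm() and the reply format are invented for illustration.
def calculator(expression: str) -> str:
    return str(eval(expression))  # toy tool only; never eval untrusted input in real code

TOOLS = {"calculator": calculator}

def run_agent(question: str, call_llm, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The LLM is instructed to reply either "TOOL <name> <args>" or "ANSWER <text>"
        reply = call_llm(transcript)
        if reply.startswith("ANSWER"):
            return reply.removeprefix("ANSWER").strip()
        _, name, args = reply.split(maxsplit=2)
        transcript += f"{reply}\nObservation: {TOOLS[name](args)}\n"
    return "No answer within the step budget."
```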

Much of the open-source effort is focused on developing these tools; it's an exciting time to see what possibilities LLMs will further unlock 😎.

Citation

@article{
  title   = "What We Know About LLMs (Primer)",
  author  = "Thompson, Will",
  journal = "https://willthompson.name",
  year    = "2023",
  month   = "July",
  day     = "23",
  url     = "https://willthompson.name/what-we-know-about-llms-primer"
}