Author: Will Thompson (Twitter)
Intro
Given a set of instructions, Large Language Models (LLMs) show a remarkable (yet by no means perfect) capacity for reasoning their way through a task. This observation has led to a flurry of open-source initiatives and start-ups harnessing LLMs as reasoning agents.
It’s not just that their reasoning in some situations *appears* near-human-level that is remarkable - it’s that these models were never explicitly trained to reason at all.
Even before the rise in popularity of RLHF and fine-tuning (see previous post), ChatGPT’s predecessors (GPT-3 and PaLM) began to show signs of unexpected capabilities. These models were *simply* trained on next-token prediction; yet, scale them past a certain size and we begin to see a phase transition from near-random performance to non-trivial results across a wide range of benchmarks. This is what is referred to as emergent abilities 🪴- an unpredicted qualitative change in response to a quantitative change (i.e. parameter scaling).
There are a few implications of this phenomenon.
For one, this means that these abilities cannot be extrapolated from the performance of smaller models. It also means we cannot predict what new abilities will be unlocked by scaling LLMs even further. For this reason, emergence is an interesting counter-example to the predictability found in other aspects of LLM behavior (see scaling laws paper).
Understanding emergence may lead to new ways of building helpful AI tools. It may also lead to a better understanding of the potential for AGI down the road. Yet what remains painfully obvious is that the unpredictable nature of emergence is further evidence of the importance of efforts such as AI interpretability and alignment research.
Emergent Abilities of LLMs 💫
Few-Shot Prompting 🎯
One of the major contributions of the GPT-3 paper was the dissemination of the idea that LLMs could be prompted. Rather than explicitly fine-tuning a model to complete a particular task, an LLM can instead be provided natural language instructions (see above). Out of this observation sprung the idea of in-context learning, where a model is provided a set of “gold-standard” examples in the prompt to illustrate how to complete a given task. In-context learning encompasses zero-shot prompting (i.e. no examples provided) and few-shot prompting (i.e. a handful of examples provided), terms analogous to those usually used when describing fine-tuning.
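To make the distinction concrete, here is a minimal sketch of building both styles of prompt for a toy sentiment task (plain Python; the task, examples, and helper names are illustrative, not taken from any particular paper or library):

```python
TASK = "Classify the sentiment of the review as Positive or Negative."

def zero_shot_prompt(review: str) -> str:
    # Zero-shot: instructions only, no examples.
    return f"{TASK}\n\nReview: {review}\nSentiment:"

# A handful of "gold-standard" demonstrations for in-context learning.
EXAMPLES = [
    ("The plot dragged and the acting was wooden.", "Negative"),
    ("A delightful surprise from start to finish.", "Positive"),
]

def few_shot_prompt(review: str) -> str:
    # Few-shot: prepend the demonstrations before the new input.
    demos = "\n\n".join(
        f"Review: {text}\nSentiment: {label}" for text, label in EXAMPLES
    )
    return f"{TASK}\n\n{demos}\n\nReview: {review}\nSentiment:"

print(few_shot_prompt("I would happily watch it again."))
```

Either string is then sent to the model as a completion prompt; the only difference between the two regimes is whether demonstrations are included in the context.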
It was in light of this discovery that the GPT-3 paper declared that LLMs are few-shot learners.
And yet, this is not isolated to just GPT-3. Researchers at Google compiled empirical evidence of the emergent nature of few-shot prompting for 5 different classes of LLMs across 8 different benchmarks. It is only around the ~100B parameter threshold that few-shot prompting seems to surpass standard prompting techniques.
Given these results, it is perhaps unsurprising that few-shot prompting is one of the most popular ways to interact with LLMs. Many papers have been written on how to construct and select examples for prompts, and popular packages such as Langchain have tools for programmatically finding the “most relevant” examples for a prompt (see Lilian Weng’s overview of prompt engineering for some interesting papers on this topic).
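As a rough illustration of how “most relevant” example selection can work, here is a generic cosine-similarity sketch (this is not Langchain’s actual API; the `embed` function is a placeholder you would swap for a real embedding model):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: swap in a real sentence-embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def select_examples(query: str, candidates: list[tuple[str, str]], k: int = 2):
    """Return the k candidate (input, label) pairs most similar to the query."""
    q = embed(query)
    q /= np.linalg.norm(q)
    scored = []
    for text, label in candidates:
        v = embed(text)
        v /= np.linalg.norm(v)
        scored.append((float(q @ v), text, label))  # cosine similarity
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(text, label) for _, text, label in scored[:k]]
```

The selected pairs would then be formatted into the few-shot prompt exactly as sketched above.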
Chain-of-Thought ⛓️
In addition to in-context learning, another set of interesting abilities stem from prompt augmentation. Chief among them is the ability to prompt an LLM to engage in multi-step or “chain-of-thought” reasoning (CoT). Simply providing an in-context example of how to reason can be sufficient to nudge the LLM to engage in CoT, as seen in the Google Brain paper from late last year.
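For a concrete feel, below is roughly what such a prompt looks like: a single worked demonstration whose answer spells out the intermediate reasoning (the arithmetic example follows the style popularized by the CoT paper), followed by the question we actually want answered.

```python
# Few-shot chain-of-thought prompt: the demonstration shows the reasoning
# steps, not just the final answer, nudging the model to reason the same way.
COT_PROMPT = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?
A:"""
```

A sufficiently large model will typically continue with its own chain of reasoning ("They started with 23 apples, used 20, ...") before stating the final answer.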
Yet this approach only appears to surpass standard prompting when an LLM exceeds ∼100B parameters, making it yet another emergent ability.
Chain-of-thought reasoning is not limited to few-shot prompting, however. In the paper Large Language Models are Zero-Shot Reasoners, the authors note that simply appending the words “Let’s think step by step” to a prompt can elicit impressive multi-step reasoning.
In one arithmetic task, simply adding those words changed the accuracy from 17.7% → 78.7%!
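A minimal sketch of the zero-shot CoT trick, assuming a plain completion-style model: the only change from a standard prompt is the appended trigger phrase (the paper’s full recipe also uses a second pass to extract the final answer from the generated reasoning).

```python
STEP_BY_STEP = "Let's think step by step."

def plain_prompt(question: str) -> str:
    # Standard zero-shot prompt: ask for the answer directly.
    return f"Q: {question}\nA:"

def zero_shot_cot_prompt(question: str) -> str:
    # Zero-shot CoT: append the trigger phrase to elicit a reasoning chain.
    return f"Q: {question}\nA: {STEP_BY_STEP}"
```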
Model Calibration ⚖️
Another important aspect of LLMs is their ability to estimate how likely their own answers are to be correct - what the ML community calls model calibration.
In the landmark Anthropic paper “Language Models (Mostly) Know What They Know”, the authors explored the capacity of LLMs to assess the probability that their answers are correct by answering True/False about them. There is a phase transition around ~52B parameters, past which LLMs start to do noticeably better at identifying whether their answers are correct. The value of this ability is that it can be used for standard ML tasks such as operating point selection (unless you can get the model logits directly, which should already be well-calibrated probabilities if they come from the pre-training task) and for more complicated system designs such as human-in-the-loop review.
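As a sketch of what operating point selection might look like downstream, here is an illustrative threshold search over a model’s self-reported correctness probabilities on a labeled validation set (the precision target and helper name are assumptions for illustration, not from the paper):

```python
import numpy as np

def pick_threshold(p_true: np.ndarray, correct: np.ndarray,
                   min_precision: float = 0.95):
    """Lowest confidence threshold whose accepted answers hit a precision target.

    p_true:  model's self-reported probability that each answer is correct
    correct: 1 if the answer was actually correct, else 0
    """
    for t in np.linspace(0.0, 1.0, 101):  # scan thresholds in increasing order
        accepted = p_true >= t
        if accepted.sum() == 0:
            continue
        if correct[accepted].mean() >= min_precision:
            return float(t)
    return None

# Answers scoring below the chosen threshold can be routed to a
# human-in-the-loop review queue instead of being served directly.
```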
U-shaped Inverse Scaling Laws
Finally, while this is likely obvious to most, the fact that scaling up LLMs yields emergent abilities also comes with the caveat that there are likely emergent risks as well.
How to uncover these emergent risks is a non-trivial problem. There was recently an Inverse Scaling Prize in which the moderators hoped to unearth cases where model performance deteriorates with scale. The culmination of that public experiment was a recently published paper, where the authors attempt to discover the root causes of the deterioration in performance as a function of scale. They concluded that the breakdown was driven by, among other things, a preference by LLMs for regurgitating training data over following instructions, and by misleading few-shot demonstrations in the prompt.
Yet, in the GPT-4 technical report, it was revealed that the inverse scaling laws were broken with GPT-4, the largest of the GPT models. This was also found in “Inverse scaling can become U-shaped”, which demonstrated that for larger models the trend reversed. This further emphasizes that we don’t know the real relationship between model performance and scale.
As we continue to entertain the thought of LLMs being integrated into critical components of technology, we should bear in mind that the gaps in our understanding of LLMs come with a potentially unknown right tail of risk. Alignment research and mechanistic interpretability are perhaps our best bets at hedging this risk.
[Note: for a longer treatise on emergent risk and deception, I suggest this excellent post.]
Citation
@article{thompson2023emergent,
title = "Emergent Properties of LLMs",
author = "Thompson, Will",
journal = "https://willthompson.name",
year = "2023",
month = "August",
day = "22",
url = "https://willthompson.name/emergent-abilities-of-llms"
}