Emergent Abilities of LLMs 🪴

Disclaimer: This article does not reflect the views of my employer (past, future, or present).

Author: Will Thompson (Twitter)

Published: August 22, 2023


Deep thoughts by Sam Altman.

Given a set of instructions, Large Language Models (LLMs) show a remarkable (yet, by no means perfect) capacity for reasoning their way through completing a task. This observation has led to a flurry of open-source initiatives and start-ups harnessing LLMs as reasoning agents.

What is remarkable is not only that their reasoning *appears* near-human-level in some situations, but also that these models weren’t explicitly trained to reason at all.

Even before the rise in popularity of RLHF and fine-tuning (see previous post), ChatGPT’s predecessors (GPT-3 and PaLM) began to show signs of unexpected capabilities. These models were *simply* trained on next-token prediction; yet, scale them past a certain size and we begin to see a phase transition from near-random performance to non-trivial results across a wide range of benchmarks. This is what is referred to as emergent abilities 🪴 - an unpredicted qualitative change in response to a quantitative change (i.e. parameter scaling).

Animation borrowed from Jason Wei’s (co-lead of the survey paper & self-described “chain-of-thought” guy) excellent and succinct list of emergent abilities. These are all examples of few-shot prompting tasks (see below). Note that the x-axis is not the usual parameter count or training data size, but training compute measured in FLOPs. For a primer on FLOPs calculations, see this helpful post. Emergence is a counter-example to the predictable nature of scaling laws found in other aspects of LLMs.
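To make the FLOPs axis concrete, here is a minimal sketch of the common rule of thumb that training compute is roughly 6 FLOPs per parameter per token. The function name and the example inputs (a 175B-parameter model trained on 300B tokens) are illustrative assumptions, not figures taken from the animation.

```python
def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Rough training compute: ~6 FLOPs per parameter per token
    (about 2 for the forward pass and 4 for the backward pass)."""
    return 6.0 * n_params * n_tokens


# Illustrative inputs (assumed): 175B parameters, 300B training tokens.
flops = approx_training_flops(175e9, 300e9)
print(f"{flops:.2e} FLOPs")  # prints "3.15e+23 FLOPs"
```

This is only an order-of-magnitude estimate; it ignores attention-specific terms and any hardware inefficiency.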

There are a few implications of this phenomenon.

For one, this means that these abilities cannot be extrapolated from the performance of smaller models. It also means we cannot predict what new abilities will be unlocked by scaling LLMs to even larger scales. For this reason, emergence is an interesting counter-example to the predictability found in other aspects of LLM behavior (see scaling laws paper).

Understanding emergence may lead to finding new ways to make helpful tools utilizing AI. It may also lead to a better understanding of the potential for AGI down the road. Yet, what remains painfully obvious is that the unpredictable nature of emergence is further evidence of the importance of efforts such as AI interpretability and alignment research.

Emergent Abilities of LLMs 💫

Few-Shot Prompting 🎯

Few-shot prompting: Figure 2.1 from the GPT-3 Paper, where “in-context learning” is studied vis-a-vis GPT-3. Note that this model was not fine-tuned at all - this was all at inference time.

One of the major contributions of the GPT-3 paper was the dissemination of the idea that LLMs could be prompted. Rather than explicitly fine-tuning a model to complete a particular task, an LLM can instead be provided natural language instructions (see above). Out of this observation sprang the idea of in-context learning, where a model is provided a set of “gold-standard” examples in the prompt to illustrate how to complete a given task. Within in-context learning are zero-shot prompting (i.e. no examples provided) and few-shot prompting (i.e. a handful of examples provided), analogous to the zero-shot and few-shot settings usually described for fine-tuning.

It was in light of this discovery that the GPT-3 paper declared that LLMs are few-shot learners.
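The difference between zero-shot and few-shot prompting is easiest to see in the prompt text itself. Below is a minimal sketch, assuming a translation task in the style of the GPT-3 figure; `build_prompt` and its `" => "` separator are hypothetical choices for illustration, not an API from any library.

```python
from typing import Sequence, Tuple


def build_prompt(
    instruction: str,
    examples: Sequence[Tuple[str, str]],
    query: str,
    sep: str = " => ",
) -> str:
    """Assemble a prompt: instruction, then solved examples, then the open query.

    An empty `examples` sequence yields a zero-shot prompt; one or more
    examples yields a one-shot / few-shot prompt.
    """
    lines = [instruction]
    lines += [f"{source}{sep}{target}" for source, target in examples]
    # The final line is left unfinished so the model completes it.
    lines.append(f"{query}{sep}".rstrip())
    return "\n".join(lines)


zero_shot = build_prompt("Translate English to French:", [], "cheese")
few_shot = build_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
print(few_shot)
```

No weights are updated in either case: the examples only change the text the model conditions on at inference time.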

Figure 1.3 from the GPT-3 Paper, which shows the average performance of zero-shot v. few-shot prompting across 42 benchmarks (all inference, no training). The largest model here was 175B.

And yet, this is not isolated to just GPT-3. Researchers at Google compiled empirical data on the emergent nature of few-shot prompting for 5 different classes of LLMs across 8 different benchmarks. It is only around the ~100B parameter threshold that few-shot prompting seems to surpass standard prompting techniques.

Figure 11 in the Emergence Paper, illustrating accuracy as a function of model parameter size. The dotted red line is random performance. Emergence appears to happen at different thresholds for different phenomena, but for few-shot prompting it seems to appear around the ~100B parameter threshold.

Given these results, it is perhaps unsurprising that few-shot prompting is one of the most popular ways to interact with LLMs. Many papers have been written on how to construct and select examples for prompts, and popular packages such as LangChain have tools for finding the “most relevant” examples for a prompt programmatically (see Lilian Weng’s overview of prompt engineering for some interesting papers on this topic).

Chain-of-Thought ⛓️

In addition to in-context learning, another set of interesting abilities stems from prompt augmentation. Chief among them is the ability to prompt an LLM to engage in multi-step or “chain-of-thought” reasoning (CoT). Simply providing an in-context example of how to reason can be sufficient to nudge the LLM to engage in CoT, as seen in the Google Brain paper from late last year.

Figure 1 from “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”. Notice that guiding the LLM in how to do arithmetic via the in-context example increases the probability of it getting the correct answer at inference.
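A minimal sketch of how such a prompt might be assembled: each in-context example carries a written-out rationale before its answer, and the final answer is then parsed out of the model’s completion. The `WorkedExample` type, the “The answer is” convention, and the sample completion are assumptions for illustration, written in the style of the paper’s Figure 1 rather than copied from any implementation.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class WorkedExample:
    question: str
    rationale: str  # the intermediate reasoning steps shown to the model
    answer: str


def build_cot_prompt(examples: Sequence[WorkedExample], question: str) -> str:
    """Few-shot prompt where every example demonstrates its reasoning."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.rationale} The answer is {ex.answer}."
        for ex in examples
    ]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def extract_answer(completion: str) -> Optional[str]:
    """Pull the last number following 'The answer is' from a completion."""
    matches = re.findall(r"The answer is\s+(-?\d+(?:\.\d+)?)", completion)
    return matches[-1] if matches else None


demo = WorkedExample(
    question="Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?",
    rationale="Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11.",
    answer="11",
)
prompt = build_cot_prompt(
    [demo],
    "The cafeteria had 23 apples. If they used 20 to make lunch and bought "
    "6 more, how many apples do they have?",
)
print(prompt)

# Hypothetical completion, used only to show the parsing step.
print(extract_answer("They had 23 - 20 = 3. Then 3 + 6 = 9. The answer is 9."))
```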

Yet this approach only appears to surpass standard prompting when an LLM exceeds ∼100B parameters, making it yet another emergent ability.

Figure 4 in the paper above. Notice the bifurcation point.

Yet chain-of-thought reasoning is not limited to few-shot prompting. In the paper Large Language Models are Zero-Shot Reasoners, the authors note that simply appending the words “Let’s think step by step” can lead to impressive multi-step reasoning.

In one arithmetic task, simply adding those words changed the accuracy from 17.7% → 78.7%!
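The zero-shot variant needs no examples at all, only the trigger phrase. Here is a minimal sketch, assuming a two-stage setup: one call to elicit the reasoning and a second call to extract the final answer. The `complete` callable is a hypothetical stand-in for whatever text-completion function you have available, and the wording of the second-stage prompt is an assumption loosely modeled on the paper.

```python
from typing import Callable

TRIGGER = "Let's think step by step."


def reasoning_prompt(question: str) -> str:
    """Stage 1: append the trigger phrase so the model writes out its reasoning."""
    return f"Q: {question}\nA: {TRIGGER}"


def answer_prompt(question: str, reasoning: str) -> str:
    """Stage 2: feed the reasoning back and ask for just the final answer."""
    return (
        f"{reasoning_prompt(question)} {reasoning}\n"
        "Therefore, the answer (arabic numerals) is"
    )


def zero_shot_cot(question: str, complete: Callable[[str], str]) -> str:
    """Run both stages with a caller-supplied completion function."""
    reasoning = complete(reasoning_prompt(question))
    return complete(answer_prompt(question, reasoning)).strip()


print(reasoning_prompt("What is half of 16?"))
```

Splitting the work into two calls keeps the free-form reasoning separate from the short, easy-to-parse final answer.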

Model Calibration ⚖️

Another important aspect of LLMs is their ability to predict the accuracy of their own results, which the ML community calls model calibration.

In the landmark Anthropic paper “Language Models (Mostly) Know What They Know”, the authors explored the capacity of LLMs to estimate the probability that their answer is correct when answering True/False. There is a phase transition around ~52B parameters, at which point LLMs start to do better at identifying their answer’s correctness. The value of this ability is that it can be used for standard ML tasks such as operating point selection (unless you can get the model logits, which should be well-calibrated probabilities if from the pre-training task) and more complicated system designs such as human-in-the-loop.
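To make “calibration” and “operating point selection” concrete, here is a minimal sketch under the assumption that you already have, for each answer, the model’s self-reported probability of being correct and a ground-truth label. It computes the standard expected calibration error (ECE) with equal-width bins and then picks a confidence threshold that meets a target accuracy. The function names and toy data are illustrative, not taken from the paper.

```python
from typing import List, Optional, Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins: List[list] = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(p for p, _ in b) / len(b)
        accuracy = sum(1 for _, y in b if y) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece


def pick_operating_point(
    confidences: Sequence[float], correct: Sequence[bool], target_accuracy: float
) -> Optional[float]:
    """Lowest threshold whose accepted answers reach the target accuracy, else None."""
    for t in sorted(set(confidences)):
        kept = [y for p, y in zip(confidences, correct) if p >= t]
        if kept and sum(kept) / len(kept) >= target_accuracy:
            return t
    return None


# Toy data (made up for illustration only).
conf = [0.2, 0.6, 0.8, 0.95]
hits = [False, True, False, True]
print(expected_calibration_error(conf, hits))
print(pick_operating_point(conf, hits, target_accuracy=0.9))
```

In a human-in-the-loop design, answers below the chosen threshold would be routed to a person instead of being returned directly.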

Figure 1 in “Language Models (Mostly) Know What They Know”. This is done by self-evaluation. Notice the separation of distributions.

U-shaped Inverse Scaling Laws

Finally, while this is likely obvious to most, the fact that scaling up LLMs reveals emergent abilities comes with the caveat that there are likely emergent risks as well.

How to uncover these emergent risks is a non-trivial problem. There was recently an Inverse Scaling Prize in which the organizers hoped to unearth cases where model performance deteriorated with scaling. The culmination of that public experiment was a recently published paper, in which the authors attempted to discover the root cause of the deterioration in performance as a function of scale. They concluded that the breakdown was a function of, among other things, a preference by the LLMs for regurgitating training data over following instructions, and misleading few-shot demonstrations in the prompt.

Yet, in the GPT-4 technical report, it was revealed that the inverse scaling laws were broken with GPT-4, the largest of the GPT models. This was also found in “Inverse scaling can become U-shaped”, which demonstrated that for larger models the trend reversed. This further emphasizes that we don’t know the real relationship between model performance and scale.

Figure 3 in the GPT-4 technical report. This is on the “Hindsight Neglect” task.
Figure 1 from the Google paper on inverse scaling laws. This is averaged over 10 tasks.

As we continue to entertain the thought of LLMs being integrated into critical components of technology, we should bear in mind that the opacity of our understanding of LLMs comes with a potential unknown right tail. Alignment research and mechanistic interpretability are perhaps our best bet at hedging this risk.

[Note: for a longer treatise on emergent risk and deception, I suggest this excellent post.]


  title   = "Emergent Properties of LLMs",
  author  = "Thompson, Will",
  journal = "https://willthompson.name",
  year    = "2023",
  month   = "August",
  day   = "22",
  url     = "https://willthompson.name/emergent-abilities-of-llms"