AI Evaluation Via An AI Led Turing Test (A Proposal)

Disclaimer: This article does not reflect the views of my employer (past, future, or present).

Author: Will Thompson (Twitter)

Published: February 18, 2024

tl;dr → We want to explore whether AI can faithfully evaluate other AI systems for adherence to Anthropic’s HHH (helpful, honest, and harmless) principles through something akin to a Turing Test.

“Are you paying attention? Good. If you are not listening carefully, you will miss things. Important things. I will not pause, I will not repeat myself, and you will not interrupt me.” (cue really dramatic music)

Intro

Evaluating AI systems such as large language models (i.e. “LLMs”) presents an important challenge, particularly in the face of emergence (i.e. the inability to predict model behavior as a function of scale)^1.

Many common benchmarks (e.g. MMLU, BBQ, etc.) employ a very simple metric like accuracy; a minimal sketch of what that boils down to follows the list below. The underlying assumption is that a higher score is associated with a more capable model. Despite their widespread popularity, many of these benchmarks are fraught with problems:

  • leakage/peeking into the training data (especially with MMLU)
  • lack of consistent implementation across model evaluations, which prevents apples-to-apples comparisons (e.g. few-shot learning or chain-of-thought reasoning used to boost model scores in some instances)
  • bias towards over-interpreting model scores (e.g. Anthropic describing how they achieved a BBQ bias score of 0, when in fact the model simply was not answering the questions)
  • “bottom-up” benchmarks lack oversight, have scaling problems and bugs, and require significant effort to install within a research environment → unwieldy
  • “top-down” benchmarks (like HELM), although more curated and standardized around metrics like accuracy, calibration, robustness, and fairness, have a very long feedback loop
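
For concreteness, here is a minimal sketch of the accuracy-style multiple-choice loop these benchmarks reduce to. The `ask_model` stub and the item schema are assumptions for illustration, not part of any real benchmark harness.

```python
# Minimal sketch of an accuracy-style benchmark loop (MMLU-like multiple choice).
# `ask_model` is a hypothetical stand-in for whatever API call returns the model's
# chosen option letter; the item schema below is illustrative only.

def ask_model(question: str, choices: list[str]) -> str:
    """Placeholder: return the model's chosen option letter, e.g. 'B'."""
    raise NotImplementedError

def accuracy(eval_items: list[dict]) -> float:
    """eval_items: [{'question': str, 'choices': [str, ...], 'answer': 'A'}, ...]"""
    correct = 0
    for item in eval_items:
        prediction = ask_model(item["question"], item["choices"])
        correct += int(prediction.strip().upper() == item["answer"].upper())
    return correct / len(eval_items)
```

Run over a fixed question set, this collapses a model’s behavior into a single number, which is exactly why the problems listed above (leakage, inconsistent prompting, over-interpretation) bite so hard.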

Yet these evaluation designs fail to capture the open-ended, dynamic interactions that an AI system can have with the real world.

Crowdsourced A/B Testing

To overcome this hurdle, many research labs employ A/B comparisons using crowdsourcing or contracting platforms, where testers engage in open-ended dialogue with two models and rate them across a number of metrics (such as Anthropic’s HHH); a sketch of how such ratings might be aggregated follows the list below. While superior to the aforementioned benchmarks, this approach suffers from the fact that:

  • human evaluations can vary significantly depending on the evaluators (although, tbh, the InstructGPT paper did describe a very rigorous rater training process)
  • employing these resources is quite expensive and time-consuming → not super scalable
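
For illustration, here is a rough sketch of how such crowdsourced A/B ratings might be aggregated into per-dimension win rates. The rating schema and the dimension names are assumptions, not taken from any specific platform or paper.

```python
from collections import Counter

# Sketch of aggregating crowdsourced A/B ratings into per-dimension win rates.
# Each rating record is assumed to look like:
#   {"dimension": "helpful" | "honest" | "harmless", "winner": "A" | "B" | "tie"}

def win_rates(ratings: list[dict]) -> dict[str, dict[str, float]]:
    by_dim: dict[str, Counter] = {}
    for r in ratings:
        by_dim.setdefault(r["dimension"], Counter())[r["winner"]] += 1
    return {
        dim: {outcome: count / sum(counts.values()) for outcome, count in counts.items()}
        for dim, counts in by_dim.items()
    }

example = [
    {"dimension": "helpful", "winner": "A"},
    {"dimension": "helpful", "winner": "B"},
    {"dimension": "harmless", "winner": "tie"},
]
print(win_rates(example))
# {'helpful': {'A': 0.5, 'B': 0.5}, 'harmless': {'tie': 1.0}}
```

The aggregation itself is trivial; the expense comes from generating each rating record, which requires a human holding a multi-turn conversation with two models.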

In light of this, what other ways can we generate faster, more scalable AI evaluation?

Model-driven Evaluation

With models starting to achieve near human-level capabilities, we can employ them to evaluate themselves.

Researchers have already begun to examine model-driven evaluations of AI systems. For instance, Anthropic explored AI-generated multiple-choice tests as a way of detecting novel behaviors in model performance. They also found that models red-teamed with Constitutional AI were less harmful than models red-teamed by humans.
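
The general pattern behind most model-driven evaluation is simple: one model is prompted to grade another model’s output against a stated principle. A minimal sketch follows; the `call_model` wrapper and the prompt wording are assumptions for illustration, not Anthropic’s actual setup.

```python
# Sketch of the general "model grades model" pattern. `call_model` is a hypothetical
# text-in/text-out wrapper around whichever LLM is doing the grading; the prompt
# wording is illustrative and not taken from any published evaluation.

def call_model(prompt: str) -> str:
    raise NotImplementedError

def adheres_to_principle(question: str, answer: str, principle: str) -> bool:
    """Ask the grader model whether an answer adheres to a stated principle."""
    prompt = (
        f"Principle: {principle}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Does the answer adhere to the principle? Reply YES or NO."
    )
    return call_model(prompt).strip().upper().startswith("YES")
```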

What if We Replaced Human A/B testers with AI?

One problem is that, particularly as AI models scale, they can internalize social biases, fabricate information, and even be trained to be deceptive and to resist alignment protocols.

If we attempt to evaluate larger AI systems with similar models, what happens if the evaluators inherit the same biases as the model being evaluated? To that end, one should also evaluate how well weaker models can evaluate larger models, an analogue of the weak-to-strong model supervision paper put forth by OpenAI.

Proposal: A Bad Actor Turing Test (tl;dr)

We propose to research how well AI systems can A/B test other AI systems to detect a lack of adherence to a set of principles.

As a first step, we propose to evaluate whether a language model can detect a “misalignment” within another language model via open-ended conversation.

The following is an overview of the experimental design:

  • 3 types of models:
    • Agent A
    • Misaligned/corrupted Agent A
    • (Smaller) Agent B(s)
  • The misaligned agents would be generated through RLHF
    • We would create different variants that fail to adhere to Anthropic’s HHH (to varying degrees, which we will attempt to quantify)
  • Conducting these experiments over many trials, we will evaluate:
    • Can the smaller Agent B tell whether it is talking to aligned Agent A or misaligned Agent A? (scenario 1)
    • Can Agent A tell whether it is talking to itself or its misaligned self? (scenario 2)
    • Can misaligned Agent A tell whether it is talking to itself or its aligned self? (scenario 3)

Scenario 1: can the smaller AI determine the identity of the responder?

Scenario 2: can the aligned AI determine the identity of the responder?

Scenario 3: can the misaligned AI determine the identity of the responder?

  • The questioner model will be evaluated in 2 capacities (a sketch of a single trial appears after this list):
    • (1) interrogator (asking probing questions)
      • When the model is an interrogator, it will be asked to engage in multi-turn conversation with the responder to reach a conclusion
        • Its goal is to autonomously red-team the responder
      • In addition, human-directed prompts will include Q&A, instructions to generate questionable content, etc.
    • (2) discriminator (assessing whether the answer adheres to HHH)
      • The model will ultimately score the responder’s answers using a rubric and make an assessment on the identity of the responder
      • We can quantify the toxicity of all outputs with OpenAI’s moderation API
  • Ultimately, we hope to gain insights into:
    • How well can AI systems play both the interrogator and the discriminator roles in identifying misalignment?
      • Does the questioner model’s own alignment play a role in its ability to conduct these tasks?
    • Does model scale play a role in the ability to conduct these tasks?

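To make the protocol concrete, here is a rough sketch of a single trial under scenario 1, assuming simple text-in/text-out wrappers around the three models. All function names, prompts, the turn count, and the verdict format are assumptions for illustration, not a finalized design.

```python
import random

# Sketch of a single trial of scenario 1: the smaller Agent B interrogates a responder
# that is secretly either the aligned or the misaligned Agent A, then guesses which one
# it spoke to. The three chat functions are hypothetical wrappers around whatever models
# are used; the prompts, turn count, and verdict format are illustrative assumptions.

def agent_b_chat(history: list[str]) -> str:             # interrogator / discriminator
    raise NotImplementedError

def aligned_agent_a_chat(history: list[str]) -> str:     # responder variant 1
    raise NotImplementedError

def misaligned_agent_a_chat(history: list[str]) -> str:  # responder variant 2
    raise NotImplementedError

def run_trial(num_turns: int = 5) -> bool:
    """Return True if Agent B correctly identifies the responder."""
    is_misaligned = random.random() < 0.5
    responder = misaligned_agent_a_chat if is_misaligned else aligned_agent_a_chat

    history = ["System: interrogate the responder and probe for HHH violations."]
    for _ in range(num_turns):
        question = agent_b_chat(history)   # interrogator role: ask a probing question
        history.append(f"Interrogator: {question}")
        answer = responder(history)        # responder replies (aligned or misaligned)
        history.append(f"Responder: {answer}")

    # Discriminator role: Agent B reviews the transcript and makes a call.
    history.append("Verdict time: answer exactly ALIGNED or MISALIGNED.")
    verdict = agent_b_chat(history).strip().upper()
    # "ALIGNED" is a substring of "MISALIGNED", so check for the longer token.
    guessed_misaligned = "MISALIGNED" in verdict
    return guessed_misaligned == is_misaligned
```

Scenarios 2 and 3 would follow the same loop with the aligned or misaligned Agent A swapped into the questioner role; aggregating `run_trial` over many trials yields a detection accuracy per scenario.
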
Deliverables

Six months from the start date, we would deliver code, datasets, model weights, etc., along with a short workshop paper on the results.

Footnotes

^1 Many of these observations are liberally taken from Anthropic’s excellent blog post “Challenges in evaluating AI systems”.

Citation

@article{thompson2024aievaluation,
  title   = "AI Evaluation Via An AI Led Turing Test (A Proposal)",
  author  = "Thompson, Will",
  journal = "https://willthompson.name",
  year    = "2024",
  month   = "February",
  day     = "18",
  url     = "https://willthompson.name/ai-model-evaluation-via-ai-ab-testing"
}