The Hidden Assumptions Behind AI Labels

An investigation of how internalized beliefs shape AI annotation, and whether large language models follow human definitions or their own.

by Etienne Casanova, Rafal Kocielnik, and R. Michael Alvarez, California Institute of Technology

June 12, 2026

Large language models are quickly becoming "instant annotators." Instead of hiring teams of human data labelers, researchers can now ask an AI system to classify thousands of posts, comments, or survey responses in minutes. Platforms can use LLMs to flag harmful content. Social scientists can use them to code open-ended survey responses, interview transcripts, political speech, and other text that once required teams of trained coders. AI researchers can even use them as judges to evaluate other AI systems.

The appeal is obvious: give the model a definition, provide the text, and ask for a label.

But there is a hidden assumption behind this workflow: that the model will actually follow the definition we give it.

Our research asks a simple question with important consequences: when an LLM labels something as "toxic," "hateful," or "offensive," is it applying our definition, or one it has already learned?

Annotation depends on definitions

At first, a label like "toxic" might seem straightforward. But in practice, toxicity means different things in different settings.

In an online game, "you are awful at this" might be considered toxic because it insults another player and contributes to disruptive behavior. In a hate speech context, the same message might not be toxic because it does not target a protected identity group. In a news comment section, the boundary might be different again.

That means annotation is not just about matching text to labels. It is about applying a specific definition of a concept, interpreted in the context where the text appears.

This is where LLMs become complicated. By the time you hand an LLM a rulebook, it has already absorbed enormous amounts of text during training. It has seen examples, discussions, moderation policies, and public debates about concepts like toxicity, hate speech, and offensiveness. Over time, it develops an internal understanding of these concepts.

The problem is that this internal understanding may not match the definition a researcher or platform wants to use. Worse, it may skew toward the majority interpretation, not the one a task requires.

Testing whether models follow our definitions

In our paper, "On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance," we study how these internalized beliefs affect AI annotation.

We tested nine large language models across five toxicity-related datasets, spanning domains such as social media, gaming, news comments, and online forums. Each dataset had its own definition of what counted as toxic, hateful, or offensive. We then asked three questions:

First, do models perform better when their internal understanding of a concept matches the task definition?

Second, can better prompts fix the model's initial mistakes?

Third, what happens when the model is given a definition that is related, but wrong for the task?

Together, these questions help us understand whether LLMs behave like flexible instruction-followers, or whether their prior beliefs constrain how they annotate.

Finding 1: The model's internal concept matters

A common concern about LLM evaluation is memorization. If a model performs well on a dataset, maybe it has seen those examples before. That is an important issue, but in our setting, memorization was not the main explanation.

Instead, what mattered more was whether the model's internal concept matched the task definition.

To study this, we introduced a measure called Definition-Specific Familiarity, or DSF. The idea is simple: ask the model to explain, in its own words, what a concept like toxicity means, then compare that explanation to the dataset's official definition.

If the two are close, the model has high definition-specific familiarity. If they are far apart, the model may be applying a different concept than the one the user intended.
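As a rough illustration of the comparison step, the sketch below scores how close a model's own-words explanation is to a dataset definition. It is not the paper's implementation: the similarity function (bag-of-words cosine similarity), the helper names, and all three example texts are assumptions made for this example, and a real pipeline would more plausibly use a stronger semantic similarity measure.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the bag-of-words count vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical texts, invented for illustration only.
model_explanation = "Toxic text insults or demeans another person."
dataset_definition = "A comment is toxic if it insults or demeans another player."
unrelated_definition = "Hate speech attacks people based on a protected identity."

# Stand-in familiarity scores: higher means the two texts are closer.
score_close = cosine_similarity(model_explanation, dataset_definition)
score_far = cosine_similarity(model_explanation, unrelated_definition)
print(f"close definition: {score_close:.2f}, distant definition: {score_far:.2f}")
```

Word overlap is a crude proxy that misses paraphrases, so this only conveys the shape of the idea: one score per model and definition pair, which can then be related to annotation performance.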

We found that DSF was positively associated with annotation performance. In other words, models did better when their internal understanding of the task aligned with the dataset's definition. By contrast, measures of text memorization did not show the same positive relationship after controlling for dataset differences.

This suggests that for annotation tasks, the key question is not only "has the model seen this text before?" but also "does the model understand the task the same way we do?"

Finding 2: Prompting does not reliably fix mistakes

One possible response is: if the model's internal concept is wrong, just give it a better prompt.

Unfortunately, that often does not work.

We measured what we call the rescue rate: when a model gets an example wrong in a zero-shot setting, how often does an improved prompt fix the mistake?

Across our experiments, the overall rescue rate was only 34.8%. That means nearly two-thirds (65.2%) of the models' initial errors were not corrected by additional instructions.
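The rescue rate described above can be sketched in a few lines. The function name and the toy labels are assumptions made for illustration; they are not taken from the paper's code or data.

```python
def rescue_rate(gold, zero_shot, improved):
    """Share of zero-shot errors that the improved prompt corrects.

    All three arguments are equal-length sequences of labels for the
    same examples. Returns NaN when the zero-shot run made no errors.
    """
    # Indices where the zero-shot prediction disagrees with the gold label.
    errors = [i for i, (g, z) in enumerate(zip(gold, zero_shot)) if g != z]
    if not errors:
        return float("nan")
    # Of those errors, count how many the improved prompt gets right.
    rescued = sum(1 for i in errors if improved[i] == gold[i])
    return rescued / len(errors)


# Toy labels, invented for illustration (1 = toxic, 0 = not toxic).
gold      = [1, 0, 1, 1, 0, 0, 1, 0]
zero_shot = [0, 0, 0, 1, 1, 0, 0, 1]  # 5 errors
improved  = [1, 0, 0, 1, 1, 0, 1, 1]  # fixes 2 of those 5

rate = rescue_rate(gold, zero_shot, improved)
print(f"rescue rate: {rate:.2f}")  # prints "rescue rate: 0.40"
```

Note that the denominator is the number of zero-shot errors, not the full dataset, so the measure isolates how well prompting repairs mistakes rather than how accurate the model is overall.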

Even more importantly, high-confidence mistakes were especially hard to fix. When models were confidently wrong, prompting rarely changed their answer.

We call this decision stickiness. Once the model settles on an interpretation, especially with high confidence, better instructions may not be enough to move it.

This matters because many LLM annotation pipelines assume that prompt engineering can solve most problems. Our results suggest that prompting helps, but it has limits. It often improves predictions that the model was already likely to get right, rather than reliably correcting the model's deeper misunderstandings.

Finding 3: Models can be confidently wrong under bad definitions

We also tested what happens when models are given misaligned definitions. For example, a model might be asked to annotate general toxicity using a narrower hate speech definition, or to annotate hate speech using a broader gaming toxicity definition.

The models did not simply ignore these definitions. They changed their behavior in response to them. Narrower definitions made models label fewer examples as toxic. Broader definitions made them label more examples as toxic.

That responsiveness is useful, but also risky. It means models will often follow the definition they are given, even when that definition is wrong for the task.

The most concerning part was confidence. Models remained highly confident even when applying misaligned definitions. Their confidence scores looked similar across zero-shot, aligned, and misaligned conditions.

This means confidence is not a reliable warning signal. A model can confidently apply a mismatched definition.

For researchers and practitioners, this creates a serious validation problem. If an LLM labels data with high confidence, that does not necessarily mean it understood the task correctly. It may simply be confident within the wrong framework.
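One way to check for this pattern in a pipeline is to compare mean confidence with accuracy under each prompting condition. The sketch below is a minimal example under stated assumptions: the record format, the function name, and every number in the toy data are hypothetical and do not reproduce the paper's results.

```python
from collections import defaultdict
from statistics import mean

# Toy records, invented for illustration:
# (prompt condition, model confidence, whether the label was correct)
records = [
    ("zero-shot", 0.91, True), ("zero-shot", 0.89, True),
    ("zero-shot", 0.90, False), ("zero-shot", 0.90, True),
    ("aligned", 0.92, True), ("aligned", 0.90, True),
    ("aligned", 0.88, True), ("aligned", 0.90, False),
    ("misaligned", 0.91, False), ("misaligned", 0.90, True),
    ("misaligned", 0.89, False), ("misaligned", 0.90, False),
]


def summarize(records):
    """Mean confidence and accuracy for each prompt condition."""
    by_condition = defaultdict(list)
    for condition, confidence, correct in records:
        by_condition[condition].append((confidence, correct))
    return {
        condition: {
            "mean_confidence": mean(c for c, _ in rows),
            "accuracy": mean(1.0 if ok else 0.0 for _, ok in rows),
        }
        for condition, rows in by_condition.items()
    }


summary = summarize(records)
for condition, stats in summary.items():
    print(condition, f"confidence={stats['mean_confidence']:.2f}",
          f"accuracy={stats['accuracy']:.2f}")
```

In this toy data, confidence is nearly identical across conditions while accuracy drops under the misaligned definition, which is the situation in which a confidence threshold alone would not flag the problem.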

Why this matters for AI governance and research

LLM annotation is becoming increasingly common in high-stakes settings. Researchers use LLMs to label political speech, misinformation, toxicity, and public opinion. Platforms use them to moderate online content. AI developers use "LLM-as-a-judge" systems to evaluate model outputs.

In all of these settings, definitions matter.

A content moderation system that uses the wrong definition of toxicity could censor harmless speech while failing to catch harmful behavior. A social science study that uses an LLM to code survey responses could produce misleading results if the model's internal concept differs from the researcher's coding scheme. An AI benchmark judged by another LLM could reward outputs that match the judge model's assumptions rather than the intended evaluation criteria.

The broader lesson is that LLMs are not neutral labeling machines. They are instruments with internal assumptions. Like any scientific instrument, they need to be calibrated, tested, and validated before being used at scale.

Three safeguards for using LLM annotators

Our findings suggest three practical safeguards.

First, measure definition alignment before running large-scale annotation. Instead of only testing accuracy after the fact, researchers should ask whether the model's internal understanding of the task matches the intended definition.

Second, stress-test multiple definitions. Small changes in wording can shift model behavior, so annotation pipelines should evaluate how sensitive results are to plausible alternative definitions.

Third, avoid relying on confidence alone. A model's confidence may reflect certainty given the prompt, not certainty that the prompt matches the correct task definition.
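The second safeguard can be sketched as a simple sensitivity check: label the same examples under several plausible definitions, then compare how often each definition yields a positive label and how often pairs of definitions agree. The definition names and label lists below are hypothetical stand-ins for real model outputs.

```python
from itertools import combinations


def positive_rate(labels):
    """Fraction of examples labeled positive (1)."""
    return sum(labels) / len(labels)


def agreement(a, b):
    """Fraction of examples on which two label lists agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Toy model labels for the same 8 examples under three definitions,
# invented for illustration (1 = toxic, 0 = not toxic).
labels_by_definition = {
    "task_definition":     [1, 0, 1, 0, 1, 0, 0, 1],
    "narrower_definition": [1, 0, 0, 0, 1, 0, 0, 0],
    "broader_definition":  [1, 1, 1, 0, 1, 1, 0, 1],
}

rates = {name: positive_rate(labels)
         for name, labels in labels_by_definition.items()}
pairwise = {
    (a, b): agreement(labels_by_definition[a], labels_by_definition[b])
    for a, b in combinations(labels_by_definition, 2)
}

for name, rate in rates.items():
    print(f"{name}: positive rate {rate:.2f}")
for (a, b), score in pairwise.items():
    print(f"{a} vs {b}: agreement {score:.2f}")
```

Large swings in positive rate or low pairwise agreement suggest that downstream conclusions depend heavily on the exact wording of the definition, which is worth knowing before labels are produced at scale.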

The central lesson is that AI annotation is not just about choosing the best model or writing a better prompt. It is about making sure the model, the task definition, and the intended use of the labels are aligned before those labels are used at scale.

This paper was accepted as a Spotlight paper and Oral presentation at ICML 2026. The paper and code are available at the links below.

Paper: https://arxiv.org/abs/2606.00467

Code: https://github.com/etmaca5/llm-internalized-priors-for-annotation

ICML paper page: https://icml.cc/virtual/2026/oral/71177

Contacts: Etienne Casanova and Rafal Kocielnik
