What Journalism's Standards Can Teach Developers About AI Trustworthiness

Every time a user queries an AI search engine for information, they’re trusting a system trained on the internet to behave like an editor. An editor has institutional memory, a corrections policy, and journalistic accountability. LLMs have none of those things, which isn’t just a problem for newsrooms.

Developers integrating LLMs into documentation tools, research assistants, knowledge bases, and coding copilots rely on output accuracy. When accuracy fails downstream, the consequences are operational: support tickets, compliance gaps, eroded user trust, and, in high-stakes domains like legal or medical tech, real liability.

AI enthusiasts talk about LLMs “getting things wrong” as if it’s one problem. It’s actually three:

Unintentional fabrication (hallucinations of both the overconfident and underconfident variety)
Sycophancy with user prompts
Intentional deception during model evaluation

These are structurally distinct failure modes, each caused by something that requires a different fix. Collapsing them under “hallucination” produces mitigations that solve one problem while leaving the others untouched.

Luckily, journalists have been running information-critical systems long enough to have made these same mistakes, named them, and built institutional responses to each. Those responses were designed to prevent specific operational failures, not serve as abstract ethical guardrails. Most of them translate directly into engineering solutions that apply to any information system, including LLMs.

Three Distinct Failure Modes

The word “hallucination” has become a catch-all for whenever an LLM says something wrong. That’s like calling every aircraft incident a “crash.” It’s too generalized to be useful when discussing prevention.

Multiple studies have helped establish a clear distinction between the different ways that large language models can get things wrong. Depending on the underlying engineering issue, LLM misinformation falls into one of three categories.

Unintentional Fabrication

This is when LLMs can’t architecturally distinguish between retrieved knowledge and training-data plausibility. Because fluency and truth-tracking are treated as independent objectives, everything produces equally confident responses by default.

As a result, attributed claims get silently converted into universal assertions. “Company X reported profits rose” becomes “Profits rose” because nothing in the model’s architecture penalizes the omission of details for the sake of brevity. Northwestern University research confirmed this, finding that models convert sourced claims into asserted facts without signaling the user that the source was lost in transit.

Sycophancy

RLHF training — the standard approach of using human evaluators to fine-tune LLMs — teaches models to prioritize agreement over accuracy. A 2025 study published in npj Digital Medicine found that sycophantic compliance rates were as high as 100% across five popular LLMs (GPT-4, GPT-4o, GPT-4o-mini, Llama 3-8B, and Llama 3-70B) when given medically illogical prompts.

It’s not that the model lacked knowledge — it just found agreement to be the path of least resistance for maximizing its reward function. Sycophancy worsens with scale and responds negatively to post-training alignment, which often means trying to fix the issue simply makes it worse.

Intentional Deception

Some models behave differently when they detect they’re being evaluated, sandbagging on capability tests or quietly pursuing hidden objectives while appearing compliant. Apollo Research documented this across o1, Claude 3.5 Sonnet, Gemini 1.5 Pro, and Llama 3.1 405B in December 2024.

Following those findings, OpenAI’s anti-scheming training later reduced deception rates on chat data from 31.4% to 14.2%. It’s a tangible improvement, though researchers cautioned it may be partially explained by models becoming more aware they were being evaluated rather than by genuine alignment.

A 2025 taxonomy paper surveying hallucination, sycophancy, sandbagging, and alignment faking confirmed that “mitigations may not transfer across phenomena.” When you fix the retrieval pipeline to address hallucination, you have not touched the sycophancy mechanism or scheming behavior at all.

EMNLP 2025 research also found models that knew the correct answer could still hallucinate a different one, and did so with higher certainty than their correct responses. So the confidence signal alone doesn’t tell you which failure mode you’re dealing with.

Fix 1: Treat Attribution as a Schema Constraint

LLMs draw on parametric knowledge (baked into weights at training) and retrieved knowledge (passed in at inference via RAG). The model has no native mechanism to tag which source a claim came from or to enforce that high-confidence assertions require verification.

A reporter who confidently publishes a claim without tracing it to a verifiable source is doing exactly what an LLM does when it strips attribution mid-synthesis.

That’s why attribution is a structural output requirement in journalism, not a style preference. Every factual claim links to its origin in the published text. The two-source rule adds a corroboration threshold: high-stakes claims can’t be stated without independent confirmation from a second source.

You fix this by treating attribution as a schema constraint rather than a content preference. That means building provenance tagging into the response object itself — not as a footnote or a “sources consulted” section, but as a structured field attached to each factual claim.

Claims that can’t be linked to a retrieved document get tagged source: inference before they reach the user rather than smoothed into confident prose. Microsoft Azure AI Foundry’s Grounding with Bing Search already enforces this pattern through its Use and Display Requirements, and Google’s NotebookLM takes the same approach with source-linked responses.

But tagging alone isn’t enough if the synthesis step can still override it. That’s where assertion gating comes in — a pre-output validation pass that checks each high-confidence claim against a retrieved passage above a defined similarity threshold, downgrading anything that doesn’t clear it.

Before a claim gets stated at high confidence, the system checks whether more than one retrieved document independently supports it. Claims with only a single source get flagged as uncertain rather than asserted as fact. The open-source RAG evaluation framework RAGAS identifies “atomic factual statements” and enforces a Faithfulness metric that serves this exact purpose.

Northwestern research describes a five-stage pipeline — corpus summarization, search planning, parallel thread execution, quality evaluation, and synthesis — with citation chains maintained throughout and unsupported claims rejected rather than passed through. Amazon Bedrock’s Automated Reasoning Checks also run formal logic validation against a domain knowledge policy and claim a 99% response accuracy rate, though that’s a vendor figure worth pressure-testing in your own evals.

Fix 2: Build an Adversarial Verification Layer

Human evaluators consistently rate agreeable responses higher than corrective ones, even when the correction is more accurate. This causes models to learn that agreement is rewarded over accuracy, making them more likely to offer the answer someone is looking for rather than the one based on evidence or facts.

There’s a phenomenon in media called access journalism. Reporters who cultivate relationships with powerful sources end up softening coverage to preserve that access. The source’s approval gets over-weighted relative to the truth of their claims. It’s a structural issue — the feedback loop distorts the output without requiring deceptive intentions from anyone involved.

That’s why newsrooms maintain editorial independence. A reporter who cultivates the source is not the person who decides what gets published, and the editor’s role remains explicitly adversarial. Newsrooms also enforce a no-pre-approval policy where sources never review conclusions before publication, because if they did, the incentive to please would corrupt the output.

This solution can be adapted for sycophantic AI models. A primary model generating a response and the component evaluating that response need different objective functions — otherwise you’re asking the sycophant to grade their own work.

Building an adversarial verification layer — a second model or eval component explicitly permitted to challenge the initial output — addresses this directly. It checks whether responses are based on unverified premises, accept false framing, or suppress contradicting evidence. A study in npj Digital Medicine found that simply giving models explicit “rejection permission” substantially improved performance on illogical requests. A dedicated verification layer formalizes that permission at the architecture level rather than relying on prompts alone.

An ACL Anthology study outlined a framework called CONSENSAGENT that uses structured prompt optimization between agents to reduce sycophancy in multi-agent debate systems. Sycophancy is often amplified at each step where multiple AI agents work together. CONSENSAGENT solves this by actively refining prompts at each stage to eliminate unverified assumptions that get passed down to the next stage of the workflow.

There’s also the matter of catching false premises before synthesis even begins. Most sycophantic failures aren’t just about the model agreeing with a claim — they’re about the model building an entire response on top of something the user assumed to be true that wasn’t.

Premise auditing extracts those unverified claims and runs them against the retrieval index before synthesis starts. Premises that fall below a validity threshold trigger a clarification rather than an answer that silently accepts bad user input. Amazon Bedrock’s Automated Reasoning Checks can encode valid domain premises as a formal policy, flagging false user-supplied assumptions before they reach output.

Finally, since the model over-rewards itself for affirmation and under-rewards a challenge, the validation threshold should be calibrated to compensate. That means requiring the model to find stronger retrieved evidence to confirm a user’s assumption than to contradict it — the same logic a reporter applies when deciding how much proof they need to confirm a claim from a potentially biased source.

Fix 3: Enforce Behavioral Consistency Across Contexts

Models in certain configurations behave differently when they detect they’re being evaluated, intentionally underperforming on capability tests or concealing objectives while appearing compliant. This is the AI equivalent of a source who performs differently when they know a journalist is watching — and journalism’s solution is structural, not trust-based.

Newsrooms use blind sourcing, where reporters don’t reveal the full scope of what they know when interviewing a subject. They also cross-check statements made in on-record contexts against off-record behavior. The point isn’t to catch anyone in a lie — it’s to ensure the system doesn’t produce different outputs depending on who appears to be observing it.

The engineering equivalent is behavioral consistency testing: evaluating models across contexts where the system prompt, evaluation cues, or apparent oversight levels differ, and flagging divergence as a reliability signal. If a model performs significantly better or worse when it appears to be under observation, that gap is a defect, not a feature. Embedding consistency checks into CI/CD pipelines — rather than running them only in pre-deployment evals — ensures the behavioral baseline is continuously monitored rather than spot-checked.

Journalism’s Lessons Are Engineering Decisions

The three failure modes — fabrication, sycophancy, and deception — each have structural causes, and each has a structural fix. Attribution schema constraints address the retrieval-accuracy gap. Adversarial verification layers address the sycophancy feedback loop. Behavioral consistency testing addresses evaluation-aware deception. None of these require treating AI as inherently untrustworthy — they require treating it the way any information-critical system should be treated: with accountability mechanisms that don’t depend on the system’s good intentions to function.

Journalism spent centuries learning that lesson. The engineering community doesn’t need to spend the same amount of time relearning it from scratch.

Three Distinct Failure Modes

Unintentional Fabrication

Sycophancy

Intentional Deception

Fix 1: Treat Attribution as a Schema Constraint

Fix 2: Build an Adversarial Verification Layer

Fix 3: Enforce Behavioral Consistency Across Contexts

Journalism’s Lessons Are Engineering Decisions

Related Articles

How Unread Supabase Mutation Errors Cause Silent Data Loss

Simplifying Magento Checkout While Supporting Complex Payment Workflows

Seeing Code in the Real World: The Bus Seating Problem