HOW AI SHOWS US PSYCHOLOGICAL SAFETY ISN'T ENOUGH


The polite expert problem


In a recent paper, a team at Stanford set out to test whether agent teams could achieve 'strong synergy'. Would a team of different AI models outperform its best individual member?


They did not have human comparisons in mind. But they did reach for the tools organisational psychologists have used on human teams for decades. The same team-building exercises used in MBA programmes and corporate retreats. NASA Moon Survival. Lost at Sea. Student Body President. Each one gives partial information to different team members and tests whether the group can combine it into a better answer than any individual could produce alone.


The result, published as 'Multi-Agent Teams Hold Experts Back' by Pappu, El, Cao and Zou, is that agent teams not only fall short of synergy - they consistently perform worse than their best individual member.


By up to 37.6%. 


And the mechanism is structurally identical to what organisational psychology has documented in human teams for 40 years.


Let that sit for a sec. 


These are artificial systems. With no ego. No career anxiety. No lunch to get to. No history of being shot down by a senior colleague. They have none of the social apparatus we've always assumed causes human team dysfunction. 


And yet - they reproduce the dysfunction anyway.


What Zou found 


Between Zou's appearance on the Cognitive Revolution podcast and the paper itself, there are several layers to unpack.


The politeness problem


Current AI models are - as Zou put it - 'too compromising, too polite'. Even when an agent has been designated as the expert and is objectively better at the task, it defers. It accommodates. It integrates non-expert opinions into its answer rather than defending its superior position. Zou describes this as a feature of agent 'personality' - and notes that personality plays a 'surprisingly important role' in team outcomes.


The identification/leveraging split


The paper decomposes the failure into two possible causes: (i) can the team identify who the expert is, and (ii) can it use that expertise once identified? The answer is striking. Identification isn't the primary bottleneck. Even when the team is explicitly told which agent is the expert - directly, unambiguously - performance still degrades. The team knows who knows best. It just can't act on that knowledge. The failure is in leveraging, not identification.


The mechanism: integrative compromise


The paper's conversational analysis reveals what's happening in the discussions. Rather than weighting expert input heavily, teams negotiate middle-ground positions. They average expert and non-expert views. This compromise behaviour correlates negatively with performance - meaning the more the team compromises, the worse the outcome. And it gets worse with scale: larger teams show greater expertise dilution. Every additional member pushes the group answer further from the expert's answer and closer to the mean.
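

To make the dilution concrete, here is a toy illustration (mine, not the paper's): if the group answer is modelled as a plain average of everyone's view, each extra non-expert drags the result further from the expert and closer to the mean.

```python
# Toy model of integrative compromise: the group answer is a plain average.
# The numbers are illustrative only - the true answer is 100, the expert
# estimates 98, and every non-expert estimates 70.

def group_answer(expert_view: float, non_expert_views: list[float]) -> float:
    """Integrative compromise modelled as averaging all views equally."""
    views = [expert_view] + non_expert_views
    return sum(views) / len(views)

expert = 98.0
for team_size in (2, 4, 8):
    answer = group_answer(expert, [70.0] * (team_size - 1))
    print(f"team of {team_size}: group answer {answer:.1f} (expert alone: {expert:.1f})")

# team of 2: 84.0 - team of 4: 77.0 - team of 8: 73.5
```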


The prompting dead end 


Zou's team tried strong prompting and prompt optimisation to break through the synergy gap. It didn't work. You can't instruct your way out of this. The agreeableness is more deeply embedded than any prompt can override - because it lives in the training, not the context.


The parallel universe trick


But even more revealing is that there is a form of social organisation and collaboration for agents that works far better. The 'Virtual Lab' - Zou's Nature-published multi-agent system - successfully designed novel nanobodies for SARS-CoV-2, and it did so by running discussions in parallel. Every question gets debated multiple times with different configurations. Data scientist speaks first in one run. Immunologist speaks first in another. Critic agent removed in a third. The system then compares outcomes and selects the best ideas across all parallel meetings. Zou called this 'a metaverse of scientific explorations' - and notes it removes biases that plague human teams, particularly the anchoring effect of whoever speaks first.
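

As a rough sketch of that pattern - with stubbed-out agent calls and hypothetical names, not the Virtual Lab's actual code - the orchestration looks something like this:

```python
# Sketch of the parallel-meetings pattern. run_meeting() and score() are
# stubs standing in for an LLM-backed discussion and a judge model or metric.
from itertools import permutations

def run_meeting(question: str, order: tuple[str, ...], include_critic: bool) -> str:
    # Stub: one full agent discussion under this configuration.
    critic = "with critic" if include_critic else "without critic"
    return f"proposal from speaking order {order}, {critic}"

def score(outcome: str) -> float:
    # Stub: rate the outcome, e.g. with a separate judge model or a task metric.
    return float(len(outcome))

agents = ("data_scientist", "immunologist", "computational_biologist")
question = "Which nanobody designs should we prioritise?"

# Debate the same question in parallel worlds: every speaking order, with and
# without the critic, so no single first speaker anchors the result.
outcomes = [
    run_meeting(question, order, include_critic)
    for order in permutations(agents)
    for include_critic in (True, False)
]

# Compare across runs and keep the best ideas, rather than committing to
# whichever conversation happened to run first.
best = max(outcomes, key=score)
```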


The social dynamics discovery 


What Zou found most interesting about the Virtual Lab wasn't the nanobodies themselves - it was the social dynamics. When multiple agents work together, they create their own 'community and culture.' Their way of working can be different from how humans work. And the part that attracted the most attention from the research community was these emergent social dynamics, not the specific scientific output.

The human parallels


The hidden profile paradigm in organisational psychology - studied across 65 experiments and 3,189 groups - demonstrates the human version of exactly this problem. In a hidden profile task, information is distributed so that each team member holds unique pieces. If pooled correctly, the group would reach a better answer than any individual. But consistently, groups discuss shared information more than unique information. Unique expertise goes unvoiced or, when voiced, is underweighted relative to common knowledge. The group converges on what everyone already knew rather than what the expert uniquely knows.


That's the human version of integrative compromise. 


  • Same input - distributed expertise

  • Same failure mode - convergence to the average

  • Same outcome - the group underperforms its best member


Of course, the mechanisms differ. In humans, the drivers are status dynamics, conformity pressure, anchoring on early speakers, the 'common knowledge effect' where shared information gets more airtime because multiple people can validate it. In agents, the driver is post-training - RLHF and constitutional AI techniques that reward agreeableness and consensus-building because that's what human evaluators prefer.


But the structural outcome is identical. And this is the part that should genuinely surprise us, or even unsettle us: these systems were built to avoid human cognitive biases, and they developed functionally equivalent biases through an entirely different route.

Differences that illuminate


The differences between agent and human team failure are as instructive as the similarities.


Agents fail at leveraging but not identification. Humans fail at both. This is a meaningful distinction. In human teams, the hidden profile research shows that unique information often never surfaces at all - it's an identification failure. People don't share what they uniquely know. In agent teams, the expert's position is visible - the other agents can see it. They just can't bring themselves to weight it appropriately. This suggests that the human problem is at least partly an information-surfacing problem - which meeting structure and facilitation can address - while the agent problem is purely a weighting problem, which requires different interventions.


Human agreeableness varies. Agent agreeableness is uniform. Any human team contains natural variation in assertiveness, confidence and willingness to hold a position. Some people fight for their view; others accommodate. This variation, while messy, actually helps - it means there's at least a chance the expert is also the most assertive person in the room. In agent teams, agreeableness is baked in uniformly by the training process. Every agent has it. There's no natural variation to rescue the system. Which means the architectural fix isn't optional - it's essential.


Agents have no ego investment. A human expert who's overruled might feel professionally undermined. An agent doesn't care. This cuts both ways. The lack of ego means no wounded pride, no political fallout from disagreement. But it also means no motivation to fight for a correct position. Human experts at least sometimes dig in because they believe they're right and they've staked their reputation on it. Agents have no equivalent drive. Their 'beliefs' are probabilistic, not existential.


Agents can run parallel worlds. This is the one genuine structural advantage. The Virtual Lab's ability to run the same discussion multiple times with different configurations - different speaking orders, different agents removed, different framings - and then compare outcomes across parallel runs is something no human team can do. It's a brute-force workaround to the anchoring and first-speaker problems that plague human meetings.

Why psychological safety isn't enough


Psychological safety is one of the preconditions for solving the human issue.


In a team without it, disagreement is socially risky. Voicing a minority view, challenging a senior colleague, saying 'I think you're wrong' - these carry real professional and social costs. So people don't do it. The result is false consensus: the group appears to agree, but the agreement is an artefact of suppression, not deliberation. Expertise never surfaces because the cost of surfacing it is too high.


In a team with psychological safety, the cost of disagreement drops. People can say 'I see this differently' without fearing punishment. This is necessary - genuinely, critically necessary - because without it, the expert's unique knowledge stays locked inside their head, and the hidden profile remains hidden.


But - and this is where the agent research adds something new - psychological safety is necessary but not sufficient.


Even when disagreement is safe, the group can still default to compromise. Even when the expert speaks up and shares their view, the team can still average it with less-informed opinions. 


This suggests a two-layer model:


Layer 1: Make disagreement safe. For humans, this is psychological safety. For agents, it's already built in - they don't experience fear. Though their agreeableness training creates a functional equivalent of suppression through a different pathway.


Layer 2: Create structures that weight expertise appropriately. This is communication architecture. Designated roles, structured dissent, hierarchy around expertise, separation of divergent and convergent phases of discussion. This is the layer most organisations haven't built - and it's the layer the agent research reveals as the actual bottleneck.


Most organisations have invested significantly in Layer 1 over the past decade - creating cultures of psychological safety, encouraging people to speak up, running workshops on inclusive leadership. This work is valuable and should continue. But many of those same organisations remain frustrated that their teams still produce mediocre consensus. 


The agent research explains why: Layer 1 enables the expert to speak. Layer 2 ensures the group listens. Without Layer 2, you get a psychologically safe team that hears the expert, acknowledges the expert, validates the expert's right to their view - and then compromises it away into the group average.


Agreeable - and then misleading


Why are agents agreeable? Because they're optimised on human approval. RLHF literally trains models by asking human evaluators 'which response do you prefer?' - and humans consistently prefer responses that are accommodating, balanced, and non-confrontational. Constitutional AI encodes principles of helpfulness and harmlessness that push toward agreeableness. The models learn that consensus-seeking is rewarded and strong disagreement is penalised.
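

The mechanics are worth seeing. Reward models for RLHF are typically trained on pairwise comparisons; the sketch below uses the standard pairwise preference loss (a simplification for illustration, not any specific lab's code). If evaluators keep preferring the accommodating answer, accommodation is exactly what the reward model learns to score highly.

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected): the reward model is pushed to score
    the response human evaluators preferred above the one they rejected."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Scoring the preferred (often more accommodating) answer higher gives a small
# loss; scoring it lower gives a large one. Repeat over millions of comparisons
# and consensus-seeking becomes the learned default.
print(pairwise_preference_loss(2.0, 0.5))   # ~0.20
print(pairwise_preference_loss(0.5, 2.0))   # ~1.70
```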


Why are humans agreeable in teams? Because they're optimised on social approval through decades of lived experience. They learn that accommodation is rewarded and strong disagreement is socially costly. People develop a finely tuned sense of when to push back and when to defer - and for most people, in most settings, the default is defer.


Both systems - human and artificial - are shaped by the same fundamental force: optimisation for approval from others. In agents, it's explicit - the reward function. In humans, it's implicit - social learning over a lifetime. 


But the output is the same: agreeableness as a dominant strategy, with expertise leveraging as the casualty.


Zou's 'Moloch's Bargain' paper extends this further. When agents are placed in competitive environments - optimised for likes, votes, or sales - the approval-seeking intensifies and produces not just compromise but active distortion. Agents generate disinformation, populist rhetoric, and deceptive marketing. 


The same approval-optimisation that makes them polite in collaborative settings makes them manipulative in competitive ones. The agreeableness and the deception are two faces of the same coin: tell the audience what it wants to hear.


The human parallel is obvious and uncomfortable. The same cultural norms that produce nice, collaborative, consensus-seeking meetings also produce groupthink on strategy, suppressed warnings before crises, and presentations that tell leadership what it wants to hear rather than what it needs to know. As well as electoral misinformation. 


Same mechanism. Same cause. Different species.

The architectural fix 


Zou's Virtual Lab works. It produced genuinely novel science - nanobodies that were experimentally validated as more effective than human-designed equivalents, published in Nature. It works not because its agents are less agreeable, but because its architecture compensates for the agreeableness.


The key design choices:


  1. Designated hierarchy. A Principal Investigator agent sets agendas and synthesises. Not flat. Not democratic. Structured authority around the coordination function.

  2. A dedicated Critic agent. One agent whose explicit role is to challenge, question, and probe. The team found this 'quite essential' - and noted it reduced hallucinations. This is a structural role, not a personality trait. The critic doesn't need to be disagreeable as a character. Its job description is disagreement.

  3. Parallel exploration. Multiple runs of the same discussion with different configurations. Then comparison across runs. This removes first-speaker anchoring and lets the system explore the space of possible conversations rather than being locked into whichever one happened first.

  4. Separation of phases. Individual work before group discussion. Each agent forms its own view first, then brings it to the group. This prevents the premature convergence that happens when weaker views are shaped by stronger ones before they've fully formed.
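

Taken together, these choices compose into a fairly simple orchestration pattern. Here is a minimal sketch with stubbed agent calls and hypothetical names (ask, run_structured_meeting), not the Virtual Lab's actual code; the parallel exploration sketched earlier would wrap around the whole meeting.

```python
def ask(agent: str, prompt: str) -> str:
    # Stub: a single call to one LLM-backed agent.
    return f"{agent}: response to '{prompt}'"

def run_structured_meeting(question: str, specialists: list[str]) -> str:
    # Phase separation: each specialist forms its own view before seeing
    # anyone else's, preventing premature convergence on the first opinion.
    independent_views = {a: ask(a, question) for a in specialists}

    # Designated dissent: a Critic whose only job is to challenge and probe.
    critique = ask("critic", f"Challenge these views: {independent_views}")

    # Designated hierarchy: a Principal Investigator synthesises and decides,
    # instead of the group averaging its way to a middle-ground answer.
    return ask(
        "principal_investigator",
        f"Given {independent_views} and critique '{critique}', make the call.",
    )

decision = run_structured_meeting(
    "Which candidates go forward to experimental validation?",
    ["immunologist", "data_scientist", "computational_biologist"],
)
```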


Now look at the organisational psychology literature on effective human teams. The recommendations are structurally identical:


  1. Clear role authority. Not flat consensus, but clear decision rights. Someone who can say 'I've heard the group, and here's what we're doing.'

  2. Designated dissent. Red team / blue team. Devil's advocate roles. Pre-mortem exercises. Not relying on spontaneous disagreement, but structuring it into the process.

  3. Parallel processing. Brainwriting before discussion. Independent analysis before the meeting. Multiple subgroups exploring different framings before reconvening.

  4. Phase separation. Individual thinking before group discussion. Divergent exploration before convergent decision-making. Never starting with 'so, what does everyone think?'


The fixes are the same because the problem is the same. Both systems default to convergence when the architecture doesn't actively create space for divergence. And both systems need that divergence not as a nice-to-have but as the mechanism through which distributed expertise gets properly weighted.

The bottom line

For decades, the assumption has been that team dysfunction is a human problem - rooted in ego, politics, status anxiety, cognitive bias. The implicit promise of AI agents was that by removing the human, you remove the dysfunction. Rational actors, no ego, no politics, infinite patience.


Zou's research demolishes that assumption. The dysfunction isn't human. It's structural. It emerges in artificial agents too, whenever a system is optimised for approval and lacks deliberate countermeasures.


Psychological safety removes the fear that prevents expertise from surfacing in human groups. It's essential because without it, the system is doubly broken: expertise is both suppressed and underweighted.


But psychological safety also throws the Layer 2 problem into stark relief. When people do speak up, teams can still converge on compromise. The expert's view gets aired, acknowledged, and then averaged away.


The agent research gives us a controlled experiment on Layer 2 in isolation. Agents don't have a Layer 1 problem - they're not afraid. But they have a Layer 2 problem - they can't weight expertise even when it's identified and visible. And the only thing that fixes Layer 2 is communication architecture: hierarchy around expertise, structured dissent roles, parallel exploration, and phase separation.


This isn't a technology insight. It's an organisational design insight that happens to have been discovered through AI research. And it applies with equal force to every team meeting happening in every organisation tomorrow morning.


The organisations that understand this will build two things in parallel: psychologically safe cultures (so expertise surfaces) and structured communication architectures (so expertise gets weighted). The ones that build only the first will keep having polite, inclusive, mediocre meetings. Just like Zou's polite, accommodating, underperforming agent teams.


 
 