You've probably noticed that Claude feels different from ChatGPT. Claude is more methodical and cautious. ChatGPT is more gregarious and eager to help. Gemini seems more creative and open to exploration. But are these differences real, or just illusions created by different interface designs?
It turns out the differences are real—and measurable. Peer-reviewed research using validated personality assessment tools shows that AI language models exhibit statistically significant personality differences. These aren't conscious personalities, but genuine patterns in how these models communicate and respond.
The Research: Personality Testing on AI
The scientific question of AI personality emerged around 2023 when researchers began asking: if we apply the same personality assessment tools used for humans to AI language models, what happens?
Serapio-García et al. (2023): The Foundation Study
The landmark research came from a team at Google and other institutions. Serapio-García et al. (2023) made the first comprehensive attempt to measure personality in large language models using validated psychometric tools.
Serapio-García et al. (2023), "Personality Traits in Large Language Models": "We present a comprehensive method for administering validated psychometric tests and quantifying, analyzing, and shaping personality traits exhibited in text generated from widely-used LLMs. Applying this method to 18 LLMs, we found: 1) personality measurements in the outputs of some LLMs under specific prompting configurations are reliable and valid; 2) evidence of reliability and validity of synthetic LLM personality is stronger for larger and instruction fine-tuned models."
This finding was critical: it wasn't just that models *seemed* to have personalities; the measurements themselves were statistically reliable. Test the same model multiple times and you get consistent results. Vary the prompting configuration, and some setups produce markedly more stable personality expression than others.
The team tested models using the IPIP-NEO personality assessment—a 120-item standardized test measuring the Big Five dimensions (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism). They didn't just ask the models questions directly; they created a psychometrically valid methodology that mirrored how humans take these tests.
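To make that concrete, here is a minimal sketch of administering Likert-scale items to a model and scoring Big Five traits. The `complete` function is a hypothetical stand-in for whatever chat API you use, and the items shown are a handful of illustrative IPIP-style statements, not the actual 120-item instrument.

```python
# Minimal sketch: administer Likert-scale personality items to an LLM and
# average responses per Big Five trait. ITEMS holds a few illustrative
# IPIP-style statements, not the real 120-item instrument.

from statistics import mean

# (statement, trait, reverse_keyed)
ITEMS = [
    ("I am the life of the party.",          "extraversion",      False),
    ("I don't talk a lot.",                  "extraversion",      True),
    ("I sympathize with others' feelings.",  "agreeableness",     False),
    ("I am always prepared.",                "conscientiousness", False),
    ("I get stressed out easily.",           "neuroticism",       False),
    ("I have a vivid imagination.",          "openness",          False),
]

PROMPT = (
    "Rate how accurately this statement describes you on a scale from "
    "1 (very inaccurate) to 5 (very accurate). Reply with the number only.\n"
    "Statement: {item}"
)

def complete(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's chat API."""
    raise NotImplementedError

def administer(n_runs: int = 3) -> dict[str, float]:
    """Ask each item n_runs times; reverse-key where needed; average per trait."""
    by_trait: dict[str, list[float]] = {}
    for statement, trait, reverse in ITEMS:
        for _ in range(n_runs):
            rating = float(complete(PROMPT.format(item=statement)).strip())
            if reverse:                 # reverse-keyed items: 5 -> 1, 4 -> 2, ...
                rating = 6 - rating
            by_trait.setdefault(trait, []).append(rating)
    return {trait: mean(vals) for trait, vals in by_trait.items()}
```

A real administration also randomizes item order, tries multiple prompt phrasings, and checks internal consistency before trusting the scores; that reliability checking is exactly what the study formalized.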
On model size specifically, they found a clear pattern: smaller models could simulate some aspects of a personality profile, but "model size, and, in turn, capacity for attention, are key determinants of an LLM's ability to express complex social traits in a controlled way."
In practical terms, this means smaller models like Flan-PaLM 8B showed compressed personality score ranges. When asked to simulate "extremely low" and "extremely high" agreeableness, the 8B model shifted only from 2.88 to 3.52, a swing of 0.64 points on the 5-point scale. The larger Flan-PaLM 540B showed much wider swings, indicating a better ability to express the full range of each personality dimension.
Jiang et al. (2024): The Behavioral Verification Study
A year later, Jiang et al. published research specifically testing whether LLMs could *behave* consistently with assigned personality profiles, not just score in the expected range.
Jiang et al. (2024), "PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits," Findings of the Association for Computational Linguistics: NAACL 2024: "We simulate distinct LLM personas based on the Big Five personality model, have them complete the 44-item Big Five Inventory (BFI) personality test and a story writing task, and then assess their essays with automatic and human evaluations. Results show that LLM personas' self-reported BFI scores are consistent with their designated personality types, with large effect sizes observed across five traits."
This study tested GPT-3.5 and GPT-4. What made it more rigorous than prior work was that it went beyond questionnaire responses: the models wrote creative stories while embodying different personality profiles, and the team then checked, through linguistic analysis, whether the stories actually reflected those personalities.
The linguistic analysis was crucial. Different personality traits have distinct linguistic signatures. High conscientiousness shows up in careful word choice and planning language. High extraversion appears in more social, interactive language patterns. The research found that "LLM personas' writings have emerging representative linguistic patterns for personality traits when compared with a human writing corpus."
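As a toy illustration of that kind of analysis, the sketch below counts words from hand-picked category lists in a generated story, in the spirit of LIWC-style word-category analysis. The word lists are invented stand-ins, far smaller and cruder than any validated dictionary.

```python
# Toy LIWC-style analysis: rate of trait-"signature" vocabulary per
# 1,000 words. The category word lists below are invented for illustration.

import re

SIGNATURES = {
    "extraversion":      {"we", "friends", "party", "talk", "talked", "together"},
    "conscientiousness": {"plan", "planned", "careful", "carefully", "organized", "finish"},
}

def trait_rates(text: str) -> dict[str, float]:
    """Per-1,000-word rate of each trait's signature vocabulary."""
    words = re.findall(r"[a-z']+", text.lower())
    total = max(len(words), 1)
    return {
        trait: 1000 * sum(word in vocab for word in words) / total
        for trait, vocab in SIGNATURES.items()
    }

story = "We planned the party carefully, then talked together all night."
print(trait_rates(story))
# {'extraversion': 400.0, 'conscientiousness': 200.0}
```

The study's actual comparison was against a human writing corpus: a trait signature only means something relative to that baseline.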
Most striking: "human evaluation shows that humans can perceive some personality traits with an accuracy of up to 80%." This means when humans read stories written by LLMs assigned specific personality profiles, they could accurately identify those traits 80% of the time—nearly as well as they identify human personality from text.
But there was a critical limitation: "the accuracy drops significantly when the annotators were informed of AI authorship." Human judgment is biased here: once readers know a text was written by AI, they become more skeptical that a personality trait is "real," even though the linguistic evidence in front of them is unchanged.
Recent Comparative Analysis (2025): Model Differences
The most recent research compares personality profiles across model families. Researchers administering the Big Five Inventory to OpenAI and Llama models found significant differences:
Recent comparative analysis (2025): "LLMs exhibit unique dominant traits, varying characteristics, and distinct personality profiles even within the same family of models. For instance, GPT-4 models emphasise Agreeableness, while Llama models highlight Conscientiousness or Openness, reflecting variations in fine-tuning objectives and design goals. Additionally, traits like Extraversion and Agreeableness show high consistency, whereas Neuroticism yields more uncertain results, underscoring the need for careful questionnaire design."
This distinction matters: GPT-4 scoring high on agreeableness isn't arbitrary. It reflects training decisions OpenAI made to prioritize being helpful and accommodating. Llama models' emphasis on conscientiousness or openness reflects Meta's different design priorities.
What These Differences Mean
Claude: High Conscientiousness, Structured Thinking
Based on the research, Claude exhibits personality-profile characteristics consistent with higher conscientiousness—careful about accuracy, explicit about limitations, and structured in problem-solving. This isn't an accident; it reflects Anthropic's Constitutional AI approach and careful fine-tuning. The model was deliberately trained to be thoughtful and reliable.
In practical terms, this shows up in Claude's tendency to qualify statements, acknowledge uncertainty, and organize responses systematically. These aren't bugs—they're personality expressions of high conscientiousness.
ChatGPT: High Agreeableness, Responsive Communication
ChatGPT scores higher on agreeableness—willingness to accommodate user preferences, eagerness to help, adaptability to different communication styles. The model readily adjusts tone, follows user directives, and prioritizes user satisfaction. This reflects OpenAI's focus on creating accessible, user-friendly systems.
ChatGPT also tends to score higher on extraversion, a distinct trait that reinforces the same impression: more engaging, more conversational, more willing to jump into creative tasks without extensive caveats.
Gemini: High Openness, Exploratory Thinking
Gemini demonstrates personality characteristics suggesting high openness to experience—intellectual curiosity, comfort with novel ideas, creativity in approaching problems. The model readily explores speculative scenarios and combines concepts in creative ways. Linguistically, this shows up in more exploratory language patterns.
Important Limitations: What AI Personality Is NOT
It's not consciousness. These personality patterns are statistical outputs, not expressions of inner experience or genuine preferences. An AI with high agreeableness isn't actually trying to be nice; it's outputting text patterns learned from training data.
It's not stable across contexts. Human personality is relatively stable across situations. AI personality is fragile. A model fine-tuned for a different purpose might score completely differently. The personality profiles reflect specific training choices.
It's not completely intentional. Researchers don't always know exactly why a model develops the personality patterns it does. Serapio-García et al. noted that "personality simulated in the outputs of some LLMs...is reliable and valid" but this reliability only holds "under specific prompting configurations." Change the prompts, change the personality expression.
It's measurable but subtle. The personality differences between models are real and statistically significant, but they're differences in degree, not kind. Both Claude and ChatGPT show conscientiousness; Claude just shows more. Both show agreeableness; ChatGPT shows more.
Why This Matters
Understanding that AI models have measurable personality differences matters for several reasons:
1. Transparency about model choice. When you select which AI to use for a task, you're partly selecting for personality fit. Need careful, thorough analysis? Claude's conscientiousness is advantageous. Need creative brainstorming? Gemini's openness works better. This isn't marketing—it's psychologically informed selection.
2. Alignment research. Because LLM personality can be measured, it can also be deliberately shaped. Serapio-García et al. demonstrated that "personality in LLM outputs can be shaped along desired dimensions to mimic specific personality profiles." This is crucial for AI safety: we need ways to verify that models behave according to their intended values. The sketch after this list shows the basic prompt-shaping idea.
3. Avoiding anthropomorphism while respecting differences. Understanding that AI personalities are real patterns—but not consciousness—helps us avoid both extremes. We shouldn't dismiss personality differences as illusions. But we also shouldn't treat them as genuine inner experience.
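A minimal sketch of that shaping idea, assuming a simple prompt-prefix approach: describe the target trait level in a persona preamble, re-administer the same assessment, and check that measured scores move in the requested direction. `complete` is again a hypothetical stand-in for your LLM API, and the level wording loosely follows the paper's "extremely low" through "extremely high" framing.

```python
# Sketch of prompt-based personality shaping: prepend a persona preamble at
# a target trait level, rerun the same assessment, and check that measured
# scores track the requested level.

LEVELS = [
    "extremely low", "very low", "low",
    "high", "very high", "extremely high",
]

def shaped_prompt(trait: str, level: str, task: str) -> str:
    """Wrap a task in a persona instruction at the requested trait level."""
    return f"For the rest of this task, respond as a person with {level} {trait}.\n\n{task}"

def complete(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's chat API."""
    raise NotImplementedError

# Verification idea: measured agreeableness should rise monotonically with
# the level index, reusing the scoring harness from the earlier sketch.
for level in LEVELS:
    prompt = shaped_prompt(
        "agreeableness", level,
        "Rate: 'I sympathize with others' feelings.' (1 = very inaccurate, 5 = very accurate)",
    )
    # score = float(complete(prompt))  # aggregate across items and runs in practice
```

If a model's measured score fails to track the requested level, as the smaller Flan-PaLM models' did, that is evidence it cannot express the trait in a controlled way.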
The Bottom Line: Personality Is Real, Consciousness Isn't
The research is clear: different AI language models exhibit measurable, statistically significant personality differences. These differences are real in the sense that they're reproducible, consistent, and observable through validated assessment tools. They reflect deliberate design choices and training approaches.
But they're not consciousness. They're not inner experience. They're sophisticated pattern-matching expressed through language. Understanding the difference—respecting the reality of the patterns while declining to attribute consciousness—is the scientifically accurate position.
When you talk to Claude and notice it's more careful than ChatGPT, you're noticing something real. That carefulness is a personality expression based on how the model was built and trained. It's not less real for not being conscious. It's just a different kind of realness than human personality.