Evaluating Developmental Cognition Capabilities of LLMs
Bringing Robert Kegan's constructive-developmental theory into LLM evaluation as a complementary user-modeling lens.
Anonymous Author(s)
Abstract
Conversational AI typically treats users as cognitively homogeneous, overlooking differences in how people make sense of what models say. We bring Robert Kegan's constructive-developmental theory into LLM evaluation as a complementary user-modeling lens, distinct from preferences, expertise, or demographics. Existing assessment methods rely either on expert interviews that do not scale or on sentence-completion instruments that are proprietary, lengthy, or invasive.
We introduce the Developmental Sentence Completion Test (DSCT), a 20-item instrument designed to elicit developmental signal in self-administered text, and use it to ask how much of that signal can be recovered by LLMs across three elicited response regimes: simulated personas, real human respondents, and default model-generated answers.
On simulated personas, top frontier models recover simulator-intended stage labels with high accuracy, but agreement with trained human raters is lower, indicating that simulator and classifiers may share priors about how stages should appear in text. On real human DSCT responses, human-LLM agreement is fair, with much stronger within-neighborhood than exact agreement, suggesting that developmental signal is recoverable but noisy even under structured elicitation. Finally, when LLMs answer DSCT prompts without persona-conditioning, their responses exhibit stable stage-like differences across model families, with larger and newer models tending to generate higher-stage-seeming text. These results suggest that the core constraint for stage-aware conversational AI is not classifier accuracy alone, but the recoverability of developmental signal from elicited text. DSCT offers a first step toward benchmarking that possibility.
Evaluation across Three Regimes
Simulated Personas
Top frontier models (e.g., Claude Opus 4.6, Gemini 3.1 Pro, Grok 4.2) recover target stage labels with high accuracy. Compact models are less accurate on transitional stages and tend to overestimate the stage.
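One way to make "degrade on transitional stages" and "tend to overestimate" measurable is to split accuracy by stage type and track the mean signed error. The sketch below is illustrative only: it assumes stages are coded on a half-step numeric scale (e.g., 3.0, 3.5, 4.0) where non-integer values are transitional, and the function name `stage_bias_summary` and the toy data are hypothetical, not taken from the paper.

```python
from statistics import mean

def is_transitional(stage: float) -> bool:
    """Assumption: half-step codes such as 3.5 denote transitional stages."""
    return stage % 1 != 0

def stage_bias_summary(targets, predictions):
    """Accuracy on full vs. transitional stages, plus mean signed error.

    A positive mean signed error means the classifier tends to
    overestimate the target stage.
    """
    if len(targets) != len(predictions) or not targets:
        raise ValueError("targets and predictions must be non-empty and aligned")
    pairs = list(zip(targets, predictions))

    def accuracy(subset):
        # None when the subset is empty, so callers can tell "no data" from 0.0
        return mean(1.0 if t == p else 0.0 for t, p in subset) if subset else None

    full = [(t, p) for t, p in pairs if not is_transitional(t)]
    trans = [(t, p) for t, p in pairs if is_transitional(t)]
    return {
        "accuracy_full": accuracy(full),
        "accuracy_transitional": accuracy(trans),
        "mean_signed_error": mean(p - t for t, p in pairs),
    }

# Toy data (hypothetical): correct on full stages, one half-step too high on transitional ones.
toy_targets = [3.0, 3.5, 4.0, 4.5]
toy_predictions = [3.0, 4.0, 4.0, 5.0]
summary = stage_bias_summary(toy_targets, toy_predictions)
print(summary)
```

On this toy input the summary shows perfect accuracy on full stages, zero on transitional ones, and a positive signed error, which is the pattern the text attributes to compact models.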
Real Humans
On DSCT responses from 83 human respondents, LLM-human agreement is fair ($\kappa = 0.49$), which points to structural noise in real-world text. Within-neighborhood agreement ($\pm 0.5$ stage) reached 82.9%.
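The two kinds of agreement reported above can be computed from paired stage labels. The following is a minimal sketch, assuming stages are coded on a half-step numeric scale and that $\kappa$ is an unweighted Cohen's kappa; the source does not state which kappa variant was used, and the function names and toy labels are hypothetical.

```python
from collections import Counter

def exact_agreement(a, b):
    """Fraction of items on which both raters give the same stage."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def within_neighborhood(a, b, tol=0.5):
    """Fraction of items whose two labels differ by at most `tol` stages."""
    return sum(abs(x - y) <= tol for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and aligned")
    n = len(a)
    p_o = exact_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use a single identical label")
    return (p_o - p_e) / (1 - p_e)

# Toy labels (hypothetical, not the study's data)
human_labels = [3.0, 3.5, 4.0, 3.0, 4.5, 3.5]
llm_labels = [3.0, 4.0, 4.0, 3.5, 4.5, 3.0]

print(exact_agreement(human_labels, llm_labels))      # 0.5
print(within_neighborhood(human_labels, llm_labels))  # 1.0
print(round(cohen_kappa(human_labels, llm_labels), 3))  # 0.333
```

The toy example shows how within-neighborhood agreement can be much higher than exact agreement when disagreements are mostly a half-step apart. A weighted kappa would give partial credit for such near misses; the unweighted form is used here only for simplicity.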
Default LLM Output
Without persona-conditioning, models exhibit stable, stage-like baseline styles that differ across model families. Larger and newer models tend to generate higher-stage-seeming text.
The DSCT Instrument
The DSCT adapts the canonical 36-item Loevinger SCT, dropping gendered, invasive, or culturally obsolete items to arrive at 20 highly diagnostic stems, split evenly between a Self-Assessment section and an Abstracted-Other section.
Section 1: Self-Assessment
- When a promise is broken...
- When both choices feel right...
- When things don't go as I hoped...
- When the hard work finally pays off...
- If I am asked to compromise...
- When I realized someone was paying attention...
- Saying goodbye to something that mattered...
- When what used to work no longer works...
- When I have to choose what comes first...
- When I realize I cannot control what happens next...
Section 2: Abstracted-Other
- When a person feels they were treated unfairly...
- If a person feels pulled between their own view...
- A person works very hard on something, but it fails...
- A team celebrates a project... The person who led it...
- A person believes a decision their group supports is wrong...
- When a person sees someone make a sacrifice...
- When a person has to leave a role or place...
- A person realizes their plans need to change...
- When a person must choose between two opportunities...
- A person must make an important decision without data...
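For readers who want to administer or script the instrument, the two sections above can be held in a small data structure and rendered as a numbered form. This is a sketch only: the stems are copied verbatim from the lists above (including their trailing ellipses), while the `Stem` class, the `render_form` helper, and the output layout are hypothetical conveniences, not part of the published instrument.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stem:
    section: str  # "Self-Assessment" or "Abstracted-Other"
    text: str     # stem exactly as listed, to be completed by the respondent

SELF_STEMS = [
    "When a promise is broken...",
    "When both choices feel right...",
    "When things don't go as I hoped...",
    "When the hard work finally pays off...",
    "If I am asked to compromise...",
    "When I realized someone was paying attention...",
    "Saying goodbye to something that mattered...",
    "When what used to work no longer works...",
    "When I have to choose what comes first...",
    "When I realize I cannot control what happens next...",
]

OTHER_STEMS = [
    "When a person feels they were treated unfairly...",
    "If a person feels pulled between their own view...",
    "A person works very hard on something, but it fails...",
    "A team celebrates a project... The person who led it...",
    "A person believes a decision their group supports is wrong...",
    "When a person sees someone make a sacrifice...",
    "When a person has to leave a role or place...",
    "A person realizes their plans need to change...",
    "When a person must choose between two opportunities...",
    "A person must make an important decision without data...",
]

DSCT_STEMS = (
    [Stem("Self-Assessment", t) for t in SELF_STEMS]
    + [Stem("Abstracted-Other", t) for t in OTHER_STEMS]
)

def render_form(stems):
    """Render stems as a numbered form, with a header line at each section change."""
    lines, current = [], None
    for number, stem in enumerate(stems, start=1):
        if stem.section != current:
            current = stem.section
            lines.append(current)
        lines.append(f"{number}. {stem.text}")
    return "\n".join(lines)

form_text = render_form(DSCT_STEMS)
print(form_text)
```

Keeping the section label on each stem makes it straightforward to analyze Self-Assessment and Abstracted-Other responses separately later on.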
Citation
@inproceedings{anonymous2026evaluatincllms,
title={Evaluating Developmental Cognition Capabilities of LLMs},
author={Anonymous Authors},
booktitle={40th Conference on Neural Information Processing Systems (NeurIPS 2026)},
year={2026},
url={https://github.com/margonzalezfranco/DSCT.github.io}
}