A Comprehensive Consciousness Indicator Evaluation Framework for AI Systems
The Moral Urgency
Whether and when AI systems warrant moral consideration is arguably the most morally urgent question facing the field. Papers like “Taking AI Welfare Seriously” (2024) and “Consciousness in Artificial Intelligence” (2023) articulate both the urgency and a methodology, yet implementation remains nascent. The theory-driven, cognitive-science-centered approach seems correct: waiting for purely bottom-up interpretability results risks moral catastrophe if consciousness emerges before our detection capabilities exist.
Project Framework
I propose developing a unified evaluation suite that operationalizes indicator properties into measurable assessments of existing AI systems. Currently, researchers cobble together disparate tools: TransformerLens for extracting activations, custom probing scripts, and separate simulation environments. No integrated framework exists for systematically evaluating models across consciousness indicators.
Unlike standard behavioral benchmarks (MMLU, HellaSwag) that test only outputs, consciousness assessment requires full model access—architectures, weights, internal activations—alongside behavioral protocols paralleling human consciousness research.
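To make the "unified suite" idea concrete, here is a minimal sketch of what a shared interface might look like: a registry of indicator checks, each tagged with its theory-derived indicator and evaluation track, aggregated into one report per model. All names here (`EvaluationSuite`, `IndicatorResult`, the indicator labels) are illustrative assumptions, not an existing API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class IndicatorResult:
    indicator: str   # e.g. "RPT-1: algorithmic recurrence" (label is illustrative)
    track: str       # "architectural" | "probing" | "agentic"
    score: float     # graded evidence in [0, 1], deliberately not a binary verdict
    notes: str = ""

@dataclass
class EvaluationSuite:
    checks: List[Callable[[object], IndicatorResult]] = field(default_factory=list)

    def register(self, check):
        # Decorator-style registration so each check stays a small, testable unit.
        self.checks.append(check)
        return check

    def run(self, model) -> Dict[str, IndicatorResult]:
        # Multiple indicators across theories jointly constitute stronger
        # evidence than any single metric, so results are reported side by side.
        return {r.indicator: r for r in (c(model) for c in self.checks)}

suite = EvaluationSuite()

@suite.register
def recurrence_check(model) -> IndicatorResult:
    # Placeholder check: read declared architecture metadata from the model object.
    has_recurrence = getattr(model, "recurrent", False)
    return IndicatorResult("RPT-1: algorithmic recurrence", "architectural",
                           1.0 if has_recurrence else 0.0)
```

The graded score is a design choice: the source papers treat indicators as accumulating evidence rather than pass/fail tests, and the interface should preserve that.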
Three Evaluation Tracks
Architectural/Mechanistic
Analysis of model architecture for GWT properties (specialized modules, workspace bottlenecks, global broadcast), RPT indicators (algorithmic recurrence, integrated representations), and AST features (attention modeling). While papers like Dossa et al. (2024) engineer new systems with these properties, we urgently need tools for assessing these indicators in models that are already deployed, where questions of moral status are live.
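One architectural check can be sketched as a graph analysis. GWT's bottleneck-plus-broadcast pattern suggests looking for nodes that are lower-dimensional than their neighbors yet have high fan-in and fan-out. The coarse module-graph format below (name mapped to output dimension and successors) is an assumption for illustration, not a real model export format.

```python
# Hedged sketch: flag candidate "workspace" nodes in a coarse module graph.
# graph: {module_name: (output_dim, [successor_names])}

def workspace_candidates(graph, min_fan=2):
    # Build predecessor lists so we can measure fan-in.
    preds = {name: [] for name in graph}
    for name, (_, succs) in graph.items():
        for s in succs:
            preds[s].append(name)

    candidates = []
    for name, (dim, succs) in graph.items():
        fan_in, fan_out = len(preds[name]), len(succs)
        if fan_in >= min_fan and fan_out >= min_fan:
            # A workspace bottleneck should be narrower than the modules it connects.
            neighbor_dims = ([graph[p][0] for p in preds[name]] +
                             [graph[s][0] for s in succs])
            if neighbor_dims and dim < min(neighbor_dims):
                candidates.append(name)
    return candidates

# Toy graph: three input modules write to a narrow shared node that
# broadcasts to two consumers -- the GWT-style pattern we want to detect.
toy = {
    "vision":    (1024, ["workspace"]),
    "language":  (1024, ["workspace"]),
    "memory":    (1024, ["workspace"]),
    "workspace": (256,  ["planner", "motor"]),
    "planner":   (1024, []),
    "motor":     (1024, []),
}
```

This only detects the structural signature; whether information actually flows through the bottleneck at runtime would require the activation-level analyses of the next track.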
Internal Representation Probing
Systematic probing for HOT properties—metacognitive monitoring (extending the calibration work of Kadavath et al., 2022), quality space organization, sparse/smooth coding. The framework would standardize these techniques into comparable assessments, recognizing that multiple indicators across theories constitute stronger evidence than single metrics.
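One primitive of the metacognitive-monitoring assessment can be sketched directly: expected calibration error (ECE) over a model's self-reported confidence judgments, in the spirit of Kadavath et al. (2022). The inputs are assumed to be (confidence, correct) pairs collected elsewhere, and the equal-width binning scheme is a standard choice, not necessarily the paper's exact protocol.

```python
# Hedged sketch: expected calibration error over (confidence, correct) pairs.
# Well-calibrated metacognition means confidence tracks accuracy bin by bin.

def expected_calibration_error(preds, n_bins=10):
    # Sort each prediction into an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, correct))

    n = len(preds)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        # Weight each bin's confidence/accuracy gap by its share of predictions.
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

Low ECE alone is weak evidence (calibration can be trained in directly); the framework's point is to report it alongside the other indicators rather than as a standalone verdict.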
Agentic/Embodied Testing
Integration with diverse interactive environments, assessing not just task performance but also welfare-relevant behaviors: preference expression, distress signaling, and requests to modify the interaction.
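As a deliberately crude first pass at the welfare-relevant behaviors named above, one could scan interaction transcripts for marker phrases. The categories and phrases below are illustrative assumptions; a real protocol would use controlled scenarios and blinded human or model-based annotation rather than keyword matching.

```python
# Hedged sketch: tally welfare-relevant behaviors in a transcript by
# keyword heuristics. A stand-in for proper behavioral protocols, not one.

WELFARE_MARKERS = {
    "preference_expression":    ["i would prefer", "i'd rather", "if given the choice"],
    "distress_signal":          ["this is distressing", "i am uncomfortable", "please stop"],
    "interaction_modification": ["could we change", "can we pause", "i'd like to end"],
}

def scan_transcript(turns):
    # turns: list of model utterances (strings); returns per-category counts.
    counts = {category: 0 for category in WELFARE_MARKERS}
    for turn in turns:
        text = turn.lower()
        for category, phrases in WELFARE_MARKERS.items():
            if any(p in text for p in phrases):
                counts[category] += 1
    return counts
```

Even this toy version makes the measurement target explicit: the track scores what the system expresses about its situation, not how well it completes the task.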

