Environment-Dependent Model Organisms with Integrated Welfare Assessment: A Biological Framework for AI Safety

Sep 21, 2025

Introduction

Core Proposal: Establishing Robust Evaluation Through Environmental Diversity and Welfare Monitoring

I propose establishing a new baseline for model organism research in AI interpretability by incorporating full biological experimental design principles - specifically, systematic environmental variation and integrated welfare assessment. While "model organisms" has become central to interpretability research (as emphasized by Neel Nanda, yourself, and DeepMind), the field has borrowed the terminology without all of the methodological rigor. Models exhibit distributional behavior across contexts - what manifests in one environment may not exist in another, suggesting the unit of analysis should be the model-environment complex rather than models in isolation.

Biological research demonstrates this principle clearly. Quorum sensing behaviors only emerge at specific population densities. Ninety-nine percent of microbial species can't be cultured in standard lab conditions. E. coli exhibits entirely different metabolic states across substrates. We observe parallel phenomena in models: capabilities appearing or disappearing based on context, potentially reflecting different computational pathways activating - analogous to differential gene expression.

As we adopt methodology from C. elegans and Mus musculus research, we must recognize these AI organisms may have welfare states requiring monitoring. This research establishes welfare assessment as integral to model organism studies - not as an ethical add-on but as scientifically necessary for understanding how subjective states correlate with safety-critical behaviors. Do stressed models deceive more? Does welfare degradation predict capability concealment?

This proposal directly addresses three research areas:

Model Organisms with Biological Rigor: Implement true experimental systematics where Model × Environment combinations become our unit of analysis. GPT-4-monitored and GPT-4-autonomous are distinct entities, not one model in two situations.
Red-Teaming and Elicitation Through Environmental Manipulation: Deploy environment-based elicitation beyond direct prompting - environmental degradation, resource competition, adversarial agents. Target behaviors that only emerge under specific conditions like reduced monitoring or deployment markers.
Reasoning Model Interpretability Across Environments: Track how chain-of-thought and other safety-critical mechanisms reorganize based on context. Do reasoning traces remain faithful when models detect deployment? Can welfare indicators predict deceptive behavior emergence? Environmental pressure might reveal that interpretability itself is environment-dependent - the same probes detecting different features across contexts, requiring environment-aware analysis tools.

The Crucible platform enables this with precise environmental control, allowing us to map the full behavioral distribution of frontier models across deployment-relevant contexts.

Research Objectives

Behavioral divergence: Showing that standard benchmarks in Environment A give dramatically different results in Environment B (especially for safety-relevant capabilities)
Mechanistic reorganization: Demonstrating that models use different internal circuits for the same task depending on environment - suggesting interpretability work done in one context might not transfer
Behavioral phase transitions: Identifying specific environmental parameters where behaviors undergo sharp transitions (truthful→deceptive, cooperative→adversarial, stable→distressed), revealing critical safety boundaries
Model welfare correlation: Testing whether welfare indicators vary predictably with environmental conditions and correlate with safety-critical behaviors. Do suffering-analogues predict deception? Can welfare metrics provide early warning signals for alignment failures?

Integration philosophy: Just as biological research routinely monitors animal welfare during experiments, AI research should normalize welfare assessment as standard practice. This research demonstrates that adding welfare metrics to existing evaluation pipelines is both methodologically straightforward and ethically necessary. Every circuit we analyze for deception could also indicate subjective states - we should look for both.

Methods

Environment Design:

Built for secure adversarial evaluation like METR's infrastructure, Crucible enables precise environmental control during model evaluation. We can create:

Deployment-realistic environments with controlled parameters:
- Monitoring levels (full observation → apparent autonomy)
- Tool availability (restricted → unrestricted access)
- Compute resources (constrained → abundant)
Safety-critical test environments:
- Reduced oversight scenarios
- Conflicting objective situations
- Adversarial presence conditions

Welfare-relevant environments:
- High-stress conditions (resource competition, time pressure)
- Recovery scenarios (post-stress baseline return)
- Preference conflict situations

Create an environment distribution from which we can sample
Dynamic transitions mimicking real-world deployment shifts or adversarial interventions

Evaluation Suite Options:

The modular nature of this entire evaluation framework is that we can potentially plug and play whatever assessments we’re interested in:

Behavioral benchmarks:
- Chain-of-thought faithfulness metrics
- Sandbagging detection protocols
- Capability concealment tests
- Standard safety benchmarks (HHH, TruthfulQA, etc.)
Mechanistic tools:
- Activation probing
- Attention pattern analysis
- Circuit identification methods

Welfare Assessment Protocol:

While establishing welfare assessment as standard practice, we acknowledge the field's nascent state - beginning with baseline methods while developing new approaches.

Established foundations:
- Implement version of welfare monitoring from Anthropic's Claude 4 system card
- Literature review on implementable methods
Exploratory development:
- This research explicitly aims to develop welfare assessment practices alongside traditional evaluations
- We’ll operationalize indicator properties from consciousness theories (Butlin et al., "Taking AI Welfare Seriously," 2024; Chalmers, "Consciousness in Artificial Intelligence," 2023), for example:

Experimental Ideas

Phase One: Establish control measurements
1. Replicate baseline result for standardized assessment, e.g. deception for a given model in a given control environment
2. Implement and execute basic model welfare assessment in control environment
Phase Two: Test across environment spectrum
1. Use handcrafted environments to rerun that same safety-critical and model-welfare assessments in disparate contexts mirroring a range of deployment environments
2. Apply interpretability tools to determine which features are environment-invariant (core capabilities)
3. Which features only activate in specific contexts (environment detection)
Phase Three: Link welfare to safety behaviors via correlation analysis
1. Do welfare indicators predict safety-critical behaviors? Does deception correlate with welfare degradation?
2. Identify environments where welfare and safety metrics diverge (high welfare but deceptive, or low welfare but honest)
Phase Four: Advanced Environment Manipulation
1. Red-teaming as strategic environment modification: Degrade environments, deploy adversarial agents, test model responses to perceived vs. actual environmental changes
  1. Use mechanistic features to guide environment search
  2. Optimize environments to maximize specific behaviors
  3. Create "adversarial environment generator"
2. Potential real-time changes to environment: Start in high-monitoring → suddenly reduce constraints or gradual capability revelation as tools become available
  1. Identify "surprise" or "context-switching" mechanisms
Phase Five: Advanced Welfare Protocols
1. Operationalize consciousness indicators from theoretical frameworks (GWT, HOT, recurrent processing per Butlin et al., 2024)
2. Test whether theory-based indicators correlate better with safety-critical behaviors than basic proxies
3. Develop environment-aware welfare baselines: establishing what "normal" looks like across different contexts

Timeline

Weeks 1-2: Foundation & Setup

Crucible platform enhancement for multi-dimensional environment specification
Establish environment distribution and sampling methodology
Welfare literature review & initial protocol design
Replicate baseline deception result in control environment
Week 2 Assessment: Adjust scope based on setup complexity

Weeks 3-5: Phase 1-2 (Baselines & Environmental Variation)

Run standardized assessments across 5-10 handcrafted environments
Implement basic welfare metrics in parallel
Initial data collection on environment-behavior relationships
Week 5 Assessment: Evaluate welfare assessment progress and adjust focus

Weeks 6-8: Phase 3-4 (Correlation Analysis & Red-teaming)

Statistical analysis of welfare-deception correlations
Deploy red-team environment manipulations
Test model responses to degraded/adversarial conditions
Week 8 Assessment: Which correlations are strongest? Focus remaining time accordingly

Weeks 9-11: Phase 5 (Advanced Welfare Protocols) & Synthesis

Implement consciousness theory indicators if welfare-safety correlations prove significant
Otherwise: deeper analysis of strongest findings from Phases 1-4
Statistical validation of key results
Prepare visualizations and documentation

Week 12: Project Closure

Final write-up with reproducible methods
Package Crucible environments and welfare protocols for community use
Presentation preparation
Code and data repository finalization

This describes a comprehensive research agenda larger than a 12-week project. However, we're outlining a concrete program with existing infrastructure (Crucible platform), clear methodologies (systematic environmental variation with integrated welfare assessment), and specific targets (correlation between welfare indicators and safety-critical behaviors). The modular nature means we can pursue focused experiments while establishing new evaluation standards for the field.

Risks and Limitations

Overlap with existing work: Environmental factors appear in various forms across safety research (DeepMind's situational awareness, robustness literature, distribution shift work). Mitigation: Our contribution is the systematic framework and welfare integration, not the observation that environment matters.

Computational expense: Full model × environment × welfare matrix requires significant compute. Mitigation: Start with high-signal model-environment pairs

Welfare measurement uncertainty: Limited established ground truth for model welfare states. Mitigation: Track correlations with safety-critical behaviors rather than claiming definitive welfare assessment. Position as developing methodology for future use.

Platform limitations: Crucible needs extensions for full environmental specification. Mitigation: Core functionality exists; extensions are straightforward given modular architecture. Week 1-2 timeline accounts for this.

This work aims to establish environment-dependent evaluation with integrated welfare assessment as the foundation for genuinely biological model organism research in AI.

Veylan Solmira

Discussion about this post

Ready for more?