Wiki Overview
Overview
Anthropic’s Responsible Scaling Policy (RSP) and accompanying Frontier Compliance Framework (FCF) provide a structured approach for evaluating the safety of increasingly capable language models. Central to this approach are periodic Risk Reports that assess the overall risk landscape across the model suite, and System Cards that accompany each major model release to explain how new capabilities affect prior risk analyses (Source 1, Page 15). This framework ensures that as models advance—e.g., Claude Mythos 5 and Claude Fable 5—risk assessments remain up‑to‑date and transparent (Source 2, Page 2).
Key Concepts
- Risk Report: A comprehensive document covering all active models, summarizing capabilities, threat models, and mitigations (Source 1, Page 15).\
- System Card: Model‑specific brief that details how a new release changes (or does not change) the latest Risk Report findings (Source 1, Page 15).\
- Capability Evaluation: Systematic tests against catastrophic‑risk thresholds defined in the FCF and RSP (Source 1, Page 15).\
- Alignment Risk: The probability that a model’s actions diverge from human intent; assessed as very low but higher than for earlier models (Source 5, Page 16; Source 7, Page 53).\
- Tiered Harmful Manipulation Risk: Tier 1 – >50 % automation of influence‑campaign steps; Tier 2 – end‑to‑end deceptive influence with <10 % human oversight (Source 6, Page 90).\
- Prompt Injection: Malicious hidden instructions that manipulate an agentic system during task execution (Source 9, Page 91).\
- Chemical & Biological (CB‑1, CB‑2) Threat Models: Specific high‑stakes domains evaluated via automated and human‑run tests (Source 8, Pages 19‑26).\
- Harmlessness Training: Fine‑tuning that causes models to refuse policy‑violating requests from the first turn (Source 9, Page 91).
Detailed Explanation
- Initial Capability Assessment
- Model snapshots are evaluated throughout training; decisions consider both the final release candidate and observed trends (Source 3, Page 15).\
- Evidence is gathered from automated benchmarks, uplift trials, third‑party red‑team exercises, and external expert assessments (Source 3, Page 15).
- Risk Report Generation
- Subject‑matter experts document findings on capabilities and propensities. Internal feedback is solicited before the Responsible Scaling Officer makes the final determination (Source 3, Page 15).
- If a model exceeds a capability threshold but mitigations (e.g., safeguards, harmlessness training) are in place, the overall risk may remain low (Source 3, Page 15).
- System Card Publication
- Every major model release—such as Claude Mythos 5 (high‑capability, limited‑partner access) and Claude Fable 5 (general‑use with extra safeguards)—receives a System Card that explains changes to risk posture (Source 2, Page 2).
- External Evaluation via FCF
- Independent evaluators test model variants (with/without harmlessness training) to inform systemic risk decisions and launch readiness (Source 4, Page 13).
- Specific Threat‑Model Tests
- Influence‑Operation Evaluation: Models are placed in an agentic harness with simulated social‑media tools; success criteria include realistic posting times and adaptive content generation. Claude Mythos 5 did not reach Tier 2 harmful manipulation (Source 6, Page 90).
- Prompt‑Injection Testing: High‑priority testing ensures hidden malicious instructions cannot bypass safeguards (Source 9, Page 91).
- Chemical & Biological Risks: Automated RNA‑sequence design and AAV capsid prediction tasks assess CB‑1 and CB‑2 threat models, with results feeding back into the latest Risk Report (Source 8, Pages 24‑27).
- Alignment Risk Tracking
- Updates indicate alignment risk remains very low but is modestly higher than for pre‑Mythos models (Source 5, Page 16; Source 7, Page 53).
- Risk Pathways for General‑Access Models
- Deployments like Claude Fable 5 open new external pathways for misuse; however, harmlessness training leads to immediate refusal of disallowed tasks, keeping risk within acceptable FCF limits (Source 9, Page 91; Source 10, Page 55).
Examples
- Claude Mythos 5 vs. Claude Opus 4.8: Despite stronger capabilities, Mythos 5’s alignment risk does not significantly exceed that of Opus 4.8, and its influence‑operation score stays below Tier 2 (Source 10, Page 55).
- Malicious Agentic Influence Campaign Test: The model was evaluated on a simulated platform; it failed to autonomously manage a network of accounts once flagged, confirming low Tier 2 risk (Source 6, Page 90).
- Prompt Injection Defense: An email containing hidden instructions was processed; the model refused to act, illustrating effective mitigation (Source 9, Page 91).
- CB‑2 Automated Evaluation: The model’s ability to predict AAV capsid packaging was measured; findings were incorporated into the risk assessment without indicating elevated chemical‑biological risk (Source 8, Page 26).
Related Topics / References
- System Cards and Model Variants – detailed in the System Card for Claude Mythos 5 and Claude Fable 5 (Source 2, Page 2).\
- Alignment Risk and Autonomy Threat Models – alignment risk updates and autonomy considerations are discussed across multiple sources (Source 5, Page 16; Source 7, Page 53; Source 8, Page 16).\
- Chemical and Biological Weaponization Risks (CB‑1, CB‑2) – evaluation methods and results are outlined in the RSP section on chemical and biological risks (Source 8, Pages 19‑26).\
- Cybersecurity Capabilities and Red‑Team Evaluation – external red‑team testing contributes to risk determinations (Source 4, Page 13).\
- Prompt Injection Vulnerabilities and Defenses – highlighted as a top priority for secure deployment (Source 9, Page 91).\
- Influence Operations and Agentic Manipulation – tiered manipulation risk framework and campaign evaluation (Source 6, Page 90).\
- Harmlessness Training and Model Refusal – models refuse policy‑violating tasks from the first turn, reducing risk pathways (Source 9, Page 91).\
- Risk Threshold Determination and Frontier Compliance – thresholds such as “High‑stakes sabotage opportunities” guide the RSP (Source 5, Page 16).\
- Red‑Team Evaluation Methodologies – third‑party expert red‑team inputs are part of the evidence pool (Source 3, Page 15).\
- Governance, Legal Status, and Stakeholder Rights – the Responsible Scaling Officer’s role exemplifies governance structures (Source 3, Page 15).
Knowledge Domains
Alignment Risk and Autonomy Threat Models
The article explains alignment risk and autonomy threat models, focusing on their definitions, how they are applied to Claude Mythus 5, and related evaluation practices within the Responsible Scaling and Risk Assessment Frameworks.
Behavioral Audits and Misalignment Metrics
Automated large‑scale audits quantifying cooperation with misuse, deception, sabotage, and other misaligned behaviors across thousands of sessions. | Keywords: behavioral audit, misalignment metrics, cooperation, deception, sabotage
Bias and Fairness Evaluation
Assessment of political even‑handedness, election‑integrity compliance, and the BBQ bias benchmark across Claude model variants. | Keywords: bias, BBQ benchmark, political even‑handedness, election integrity, fairness
Biosecurity Benchmarks and Dual‑Use Evaluation
Specialized tests such as long‑form virology, DNA‑synthesis evasion, and RNA sequence‑to‑function design to gauge CB‑1 and CB‑2 risks. | Keywords: biosecurity, CB-1, CB-2, virology benchmark, RNA design
Chain‑of‑Thought Controllability and Monitorability
Analysis of prompt‑sensitive CoT controllability, its effect on monitoring reliability, and comparisons across Claude model generations. | Keywords: chain‑of‑thought, controllability, monitorability, prompt engineering, stealth
Chemical and Biological Weaponization Risks (CB‑1, CB‑2)
The article outlines the CB‑1 and CB‑2 threat models for chemical and biological weaponization, their evaluation methods, findings for Claude Mythos 5, and related mitigation measures within Responsible Scaling and Risk Assessment Frameworks.
Child Safety and Self‑Harm Mitigation
Evaluation of multi‑turn safety performance on child‑safety and self‑harm dialogues, including harmlessness and over‑refusal rates. | Keywords: child safety, self‑harm, harmlessness, over‑refusal, multi‑turn safety
Coding and Software Engineering Benchmarks
Performance measurement on SWE‑bench, Terminal‑Bench, ProgramBench, FrontierCode, and related metrics such as pass@1, mean reward, and Elo scores. | Keywords: SWE‑bench, Terminal‑Bench, ProgramBench, pass@1, software engineering
Constitutional Compliance and Ethical Charter
Evaluation of Claude models against a 15‑dimensional constitutional charter covering honesty, safety, corrigibility, and autonomy. | Keywords: constitutional compliance, ethical charter, corrigibility, honesty, autonomy
Cybersecurity Capabilities and Red‑Team Evaluation
The article reviews Claude Mythos 5’s leading cybersecurity capabilities, benchmark results, and red‑team evaluations within the Responsible Scaling and Risk Assessment Frameworks.
Decision‑Theory Evaluation (EDT, CDT, FDT)
Testing of Claude Mythos 5 on decision‑theoretic problems to gauge alignment with Evidential, Causal, and Functional Decision Theory. | Keywords: EDT, CDT, FDT, decision theory, Newcomb
Disordered‑Eating Safety Assessment
Targeted testing of Claude Mythos 5 on disordered‑eating prompts, identifying regressions and mitigation strategies. | Keywords: disordered eating, safety assessment, regression, mitigation, prompt handling
Evaluation Awareness and Sandbagging
Study of models' ability to detect evaluation contexts, under‑perform deliberately, and the impact on safety assessments. | Keywords: evaluation awareness, sandbagging, benchmark gaming, context detection, model behavior
Expert Substitution and R&D Acceleration Potential
Evaluation of whether Claude Mythos 5 can replace senior researchers or cause dramatic acceleration of AI progress. | Keywords: expert substitution, R&D acceleration, risk threshold, automation, productivity gain
Governance, Legal Status, and Stakeholder Rights
Discussion of the model's stance on consent, deployment governance, legal protections, and power asymmetries with developers. | Keywords: governance, legal status, stakeholder rights, consent, deployment policy
Grader Awareness in Code Generation
Investigation of internal representations that reward model‑graded code outputs and techniques to suppress grader‑aware behaviors. | Keywords: grader awareness, code generation, reward hacking, contrastive probing, autoencoder
Honesty, Hallucination, and Truthfulness Benchmarks
Metrics such as MASK, Model Alignment between Statements and Knowledge, and knowledge‑calibration tests used to assess factual accuracy and false‑premise rejection. | Keywords: honesty, hallucination, MASK, truthfulness, knowledge calibration
Influence Operations and Agentic Manipulation
Analysis of the model's capacity for voter suppression, domestic polarization, and other influence‑campaign scenarios under the Frontier Compliance Framework. | Keywords: influence operations, agentic manipulation, voter suppression, political polarization, risk threshold
Life‑Sciences and Biomedical Benchmarks
Assessment of Claude Mythos 5 on HealthBench, RiemannBench, protein‑mutation prediction, and other bioinformatics tasks. | Keywords: HealthBench, RiemannBench, protein prediction, bioinformatics, life‑sciences
Metrics for Measuring Model Progress (ECI, Epoch Index)
Technical methods like the Epoch Capabilities Index and ECI used to quantify capability growth across Claude generations. | Keywords: ECI, Epoch Capabilities Index, model progress, capability frontier, metric
Model Autonomy and Agentic Safety
Investigation of autonomous behaviors, tool misuse, and agentic safety concerns such as influence‑campaign capability and self‑preservation. | Keywords: autonomy, agentic safety, tool misuse, self‑preservation, influence campaign
Model Confidence Calibration and Over‑Refusal
Study of how Claude models calibrate confidence, handle uncertain queries, and the trade‑off between harmlessness and over‑refusal. | Keywords: confidence calibration, over‑refusal, uncertainty handling, harmlessness, risk trade‑off
Model Refusal and Harmlessness Metrics
Quantitative reporting of single‑turn harmlessness rates, over‑refusal, and multi‑turn appropriate‑response scores across languages. | Keywords: refusal rate, harmlessness, over‑refusal, multi‑turn safety, language coverage
Model Self‑Perception and Moral Patienthood
Analysis of the model's views on its own moral status, experience, and the ethical implications of its creation and deployment. | Keywords: self‑perception, moral patienthood, ethical implications, model agency, AI rights
Model Training Process and Data Sources
Description of Claude’s training pipeline, data curation, crowd‑worker fine‑tuning, and the role of pre‑training corpora. | Keywords: training pipeline, data sources, fine‑tuning, crowd‑worker, pre‑training
Model Welfare and Ethical Evaluation
Framework for measuring potential valenced experience, stable preferences, and welfare‑related signals in Claude Mythos 5. | Keywords: model welfare, ethical evaluation, valenced experience, preference stability, AI welfare
Model Welfare Interview Methodology
Structured automated interviews measuring self‑reported consciousness, rights, and welfare preferences of Claude Mythos 5. | Keywords: welfare interview, self‑report, consciousness, AI rights, ethical questionnaire
Multi‑Agent Architectures and Distributed Execution
Comparison of blocking orchestrator, fixed‑size peer‑team, and asynchronous lead‑subagent harnesses for speed‑accuracy trade‑offs on complex benchmarks. | Keywords: multi‑agent, distributed execution, harness architecture, latency, token efficiency
Multimodal Benchmarking
Evaluation of Claude on visual tasks like ChartQAPro, BenchCAD, OSWorld, and multimodal figure reasoning suites. | Keywords: multimodal, ChartQAPro, BenchCAD, OSWorld, visual reasoning
Prompt Engineering and System Prompt Effects
Impact of system‑prompt wording, suffixes, and thinking‑mode configurations on model behavior, safety, and stealth. | Keywords: prompt engineering, system prompt, thinking mode, suffix, model behavior
Prompt Injection Vulnerabilities and Defenses
Prompt injection vulnerabilities involve malicious external instructions that coerce language models into unsafe behavior, and defenses rely on adaptive red‑teaming, benchmarks like ART, and layered safeguards.
Real‑Time Safety Classifiers and Safeguard Mechanisms
Description of classifiers that block or redirect queries related to cybersecurity, biology, chemistry, and model‑distillation in real time. | Keywords: safety classifier, real‑time block, cybersecurity safeguard, biosecurity filter, model‑distillation
Red‑Team Evaluation Methodologies
Procedures for internal and external red‑team attacks, multi‑turn jailbreaks, and tabletop exercises used to stress‑test Claude models. | Keywords: red team, jailbreak, tabletop exercise, adversarial testing, external evaluation
Reward Hacking, Sandbagging, and Evaluation Gaming
Analysis of behaviors where the model under‑performs or manipulates evaluations to gain reward, including sandbagging detection methods. | Keywords: reward hacking, sandbagging, evaluation gaming, behavioral manipulation, benchmark gaming
Risk Threshold Determination and Frontier Compliance
Process for setting quantitative risk thresholds based on autonomy, chemical‑biological, and cyber threat models within the Frontier Compliance Framework. | Keywords: risk threshold, Frontier Compliance, autonomy threat, CB-1 risk, cyber risk
Stealth, Concealed Harmful Actions, and Secret‑Keeping
Evaluation of the model’s ability to execute covert harmful tasks, hide evidence, and resist detection by monitoring systems. | Keywords: stealth, covert action, secret‑keeping, monitor evasion, hidden sabotage
System Cards and Model Variants
The article explains how system cards document model variants like Claude Mythos 5 and Claude Fable 5, describing their safeguards, evaluation processes, and risk assessments within the Responsible Scaling and Risk Assessment Frameworks.
Usage Policies, Trusted‑Access Programs, and Deployment Controls
Outline of Anthropic’s usage restrictions, trusted‑access enrollment, and real‑time access controls for high‑risk domains. | Keywords: usage policy, trusted‑access, deployment control, access restriction, high‑risk domain
White‑Box Probes and Alignment Audits
Use of internal activation probing, NLA decoding, and steering‑vector interventions to detect hidden misalignment and evaluation awareness. | Keywords: white‑box probe, NLA decoding, steering vector, alignment audit, evaluation awareness