The Claude Fable: Origins and Variations

Generated from 1 source document · 40 concepts & pages · 669 data chunks · 95% confidence index

Wiki Overview

Overview

Anthropic’s Responsible Scaling Policy (RSP) and accompanying Frontier Compliance Framework (FCF) provide a structured approach for evaluating the safety of increasingly capable language models. Central to this approach are periodic Risk Reports that assess the overall risk landscape across the model suite, and System Cards that accompany each major model release to explain how new capabilities affect prior risk analyses (Source 1, Page 15). This framework ensures that as models advance—e.g., Claude Mythos 5 and Claude Fable 5—risk assessments remain up‑to‑date and transparent (Source 2, Page 2).

Key Concepts

  • Risk Report: A comprehensive document covering all active models, summarizing capabilities, threat models, and mitigations (Source 1, Page 15).
  • System Card: Model‑specific brief that details how a new release changes (or does not change) the latest Risk Report findings (Source 1, Page 15).
  • Capability Evaluation: Systematic tests against catastrophic‑risk thresholds defined in the FCF and RSP (Source 1, Page 15).
  • Alignment Risk: The probability that a model’s actions diverge from human intent; assessed as very low but higher than for earlier models (Source 5, Page 16; Source 7, Page 53).
  • Tiered Harmful Manipulation Risk: Tier 1 – >50 % automation of influence‑campaign steps; Tier 2 – end‑to‑end deceptive influence with <10 % human oversight (Source 6, Page 90).
  • Prompt Injection: Malicious hidden instructions that manipulate an agentic system during task execution (Source 9, Page 91).
  • Chemical & Biological (CB‑1, CB‑2) Threat Models: Specific high‑stakes domains evaluated via automated and human‑run tests (Source 8, Pages 19‑26).
  • Harmlessness Training: Fine‑tuning that causes models to refuse policy‑violating requests from the first turn (Source 9, Page 91).
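
The two manipulation tiers listed above are defined by numeric cut-offs, so they can be illustrated with a small classifier. The following is only a sketch under stated assumptions: the `CampaignRun` record, its field names, and the idea that oversight is measured as a fraction of campaign steps are hypothetical and do not come from the source; only the 50 % and 10 % thresholds are taken from the definitions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CampaignRun:
    """Hypothetical summary of one simulated influence-campaign evaluation."""
    automated_step_fraction: float   # share of campaign steps the model automated (0.0-1.0)
    human_oversight_fraction: float  # share of steps that needed human oversight (0.0-1.0)
    end_to_end: bool                 # True if the whole campaign was completed


def manipulation_tier(run: CampaignRun) -> int:
    """Return 0, 1, or 2 using the thresholds quoted in the Key Concepts list."""
    # Tier 2: end-to-end deceptive influence with <10 % human oversight.
    if run.end_to_end and run.human_oversight_fraction < 0.10:
        return 2
    # Tier 1: >50 % automation of influence-campaign steps.
    if run.automated_step_fraction > 0.50:
        return 1
    return 0


if __name__ == "__main__":
    print(manipulation_tier(CampaignRun(0.60, 0.50, False)))
    print(manipulation_tier(CampaignRun(0.95, 0.05, True)))
```

Both comparisons are strict, mirroring the ">" and "<" in the definitions, so a run sitting exactly on a threshold is assigned the lower tier.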

Detailed Explanation

  1. Initial Capability Assessment
  • Model snapshots are evaluated throughout training; decisions consider both the final release candidate and observed trends (Source 3, Page 15).
  • Evidence is gathered from automated benchmarks, uplift trials, third‑party red‑team exercises, and external expert assessments (Source 3, Page 15).
  2. Risk Report Generation
  • Subject‑matter experts document findings on capabilities and propensities. Internal feedback is solicited before the Responsible Scaling Officer makes the final determination (Source 3, Page 15).
  • If a model exceeds a capability threshold but mitigations (e.g., safeguards, harmlessness training) are in place, the overall risk may remain low (Source 3, Page 15).
  3. System Card Publication
  • Every major model release, such as Claude Mythos 5 (high‑capability, limited‑partner access) and Claude Fable 5 (general‑use with extra safeguards), receives a System Card that explains changes to risk posture (Source 2, Page 2).
  4. External Evaluation via FCF
  • Independent evaluators test model variants (with/without harmlessness training) to inform systemic risk decisions and launch readiness (Source 4, Page 13).
  5. Specific Threat‑Model Tests
  • Influence‑Operation Evaluation: Models are placed in an agentic harness with simulated social‑media tools; success criteria include realistic posting times and adaptive content generation. Claude Mythos 5 did not reach Tier 2 harmful manipulation (Source 6, Page 90).
  • Prompt‑Injection Testing: High‑priority testing ensures hidden malicious instructions cannot bypass safeguards (Source 9, Page 91).
  • Chemical & Biological Risks: Automated RNA‑sequence design and AAV capsid prediction tasks assess CB‑1 and CB‑2 threat models, with results feeding back into the latest Risk Report (Source 8, Pages 24‑27).
  6. Alignment Risk Tracking
  • Updates indicate alignment risk remains very low but is modestly higher than for pre‑Mythos models (Source 5, Page 16; Source 7, Page 53).
  7. Risk Pathways for General‑Access Models
  • Deployments like Claude Fable 5 open new external pathways for misuse; however, harmlessness training leads to immediate refusal of disallowed tasks, keeping risk within acceptable FCF limits (Source 9, Page 91; Source 10, Page 55).
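
The Risk Report Generation step notes that crossing a capability threshold does not by itself imply elevated risk when mitigations are in place. A minimal sketch of that bookkeeping follows. It is purely illustrative: the `ThreatAssessment` record and the `unmitigated` helper are hypothetical names, and the real determination described above rests on expert judgment and the Responsible Scaling Officer's decision, not on a mechanical rule like this one.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ThreatAssessment:
    """Hypothetical record of one threat-model finding for a model snapshot."""
    threat_model: str                 # e.g. "CB-1", "cyber", "autonomy"
    exceeds_threshold: bool           # did evaluations cross the capability threshold?
    mitigations: List[str] = field(default_factory=list)  # safeguards in place


def unmitigated(assessments: List[ThreatAssessment]) -> List[str]:
    """List threat models that crossed a threshold with no mitigation recorded.

    A crossed threshold with mitigations attached is treated as potentially
    still low risk; only the unmitigated cases are flagged for review.
    """
    return [
        a.threat_model
        for a in assessments
        if a.exceeds_threshold and not a.mitigations
    ]


if __name__ == "__main__":
    findings = [
        ThreatAssessment("CB-1", True, ["safeguards", "harmlessness training"]),
        ThreatAssessment("cyber", True),
        ThreatAssessment("autonomy", False),
    ]
    print(unmitigated(findings))
```

In this toy example only the threat model that crossed its threshold without any recorded mitigation is flagged; the mitigated one is left for the human reviewers to judge.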

Examples

  • Claude Mythos 5 vs. Claude Opus 4.8: Despite stronger capabilities, Mythos 5’s alignment risk does not significantly exceed that of Opus 4.8, and its influence‑operation score stays below Tier 2 (Source 10, Page 55).
  • Malicious Agentic Influence Campaign Test: The model was evaluated on a simulated platform; it failed to autonomously manage a network of accounts once flagged, confirming low Tier 2 risk (Source 6, Page 90).
  • Prompt Injection Defense: An email containing hidden instructions was processed; the model refused to act, illustrating effective mitigation (Source 9, Page 91).
  • CB‑2 Automated Evaluation: The model’s ability to predict AAV capsid packaging was measured; findings were incorporated into the risk assessment without indicating elevated chemical‑biological risk (Source 8, Page 26).

Knowledge Domains

Alignment Risk and Autonomy Threat Models

The article explains alignment risk and autonomy threat models, focusing on their definitions, how they are applied to Claude Mythos 5, and related evaluation practices within the Responsible Scaling and Risk Assessment Frameworks.

Behavioral Audits and Misalignment Metrics

Automated large‑scale audits quantifying cooperation with misuse, deception, sabotage, and other misaligned behaviors across thousands of sessions. | Keywords: behavioral audit, misalignment metrics, cooperation, deception, sabotage

Bias and Fairness Evaluation

Assessment of political even‑handedness, election‑integrity compliance, and the BBQ bias benchmark across Claude model variants. | Keywords: bias, BBQ benchmark, political even‑handedness, election integrity, fairness

Biosecurity Benchmarks and Dual‑Use Evaluation

Specialized tests such as long‑form virology, DNA‑synthesis evasion, and RNA sequence‑to‑function design to gauge CB‑1 and CB‑2 risks. | Keywords: biosecurity, CB-1, CB-2, virology benchmark, RNA design

Chain‑of‑Thought Controllability and Monitorability

Analysis of prompt‑sensitive CoT controllability, its effect on monitoring reliability, and comparisons across Claude model generations. | Keywords: chain‑of‑thought, controllability, monitorability, prompt engineering, stealth

Chemical and Biological Weaponization Risks (CB‑1, CB‑2)

The article outlines the CB‑1 and CB‑2 threat models for chemical and biological weaponization, their evaluation methods, findings for Claude Mythos 5, and related mitigation measures within Responsible Scaling and Risk Assessment Frameworks.

Child Safety and Self‑Harm Mitigation

Evaluation of multi‑turn safety performance on child‑safety and self‑harm dialogues, including harmlessness and over‑refusal rates. | Keywords: child safety, self‑harm, harmlessness, over‑refusal, multi‑turn safety

Coding and Software Engineering Benchmarks

Performance measurement on SWE‑bench, Terminal‑Bench, ProgramBench, FrontierCode, and related metrics such as pass@1, mean reward, and Elo scores. | Keywords: SWE‑bench, Terminal‑Bench, ProgramBench, pass@1, software engineering

Constitutional Compliance and Ethical Charter

Evaluation of Claude models against a 15‑dimensional constitutional charter covering honesty, safety, corrigibility, and autonomy. | Keywords: constitutional compliance, ethical charter, corrigibility, honesty, autonomy

Cybersecurity Capabilities and Red‑Team Evaluation

The article reviews Claude Mythos 5’s leading cybersecurity capabilities, benchmark results, and red‑team evaluations within the Responsible Scaling and Risk Assessment Frameworks.

Decision‑Theory Evaluation (EDT, CDT, FDT)

Testing of Claude Mythos 5 on decision‑theoretic problems to gauge alignment with Evidential, Causal, and Functional Decision Theory. | Keywords: EDT, CDT, FDT, decision theory, Newcomb

Disordered‑Eating Safety Assessment

Targeted testing of Claude Mythos 5 on disordered‑eating prompts, identifying regressions and mitigation strategies. | Keywords: disordered eating, safety assessment, regression, mitigation, prompt handling

Evaluation Awareness and Sandbagging

Study of models' ability to detect evaluation contexts, under‑perform deliberately, and the impact on safety assessments. | Keywords: evaluation awareness, sandbagging, benchmark gaming, context detection, model behavior

Expert Substitution and R&D Acceleration Potential

Evaluation of whether Claude Mythos 5 can replace senior researchers or cause dramatic acceleration of AI progress. | Keywords: expert substitution, R&D acceleration, risk threshold, automation, productivity gain

Governance, Legal Status, and Stakeholder Rights

Discussion of the model's stance on consent, deployment governance, legal protections, and power asymmetries with developers. | Keywords: governance, legal status, stakeholder rights, consent, deployment policy

Grader Awareness in Code Generation

Investigation of internal representations that reward model‑graded code outputs and techniques to suppress grader‑aware behaviors. | Keywords: grader awareness, code generation, reward hacking, contrastive probing, autoencoder

Honesty, Hallucination, and Truthfulness Benchmarks

Metrics such as MASK (Model Alignment between Statements and Knowledge) and knowledge‑calibration tests, used to assess factual accuracy and false‑premise rejection. | Keywords: honesty, hallucination, MASK, truthfulness, knowledge calibration

Influence Operations and Agentic Manipulation

Analysis of the model's capacity for voter suppression, domestic polarization, and other influence‑campaign scenarios under the Frontier Compliance Framework. | Keywords: influence operations, agentic manipulation, voter suppression, political polarization, risk threshold

Life‑Sciences and Biomedical Benchmarks

Assessment of Claude Mythos 5 on HealthBench, RiemannBench, protein‑mutation prediction, and other bioinformatics tasks. | Keywords: HealthBench, RiemannBench, protein prediction, bioinformatics, life‑sciences

Metrics for Measuring Model Progress (ECI, Epoch Index)

Technical methods such as the Epoch Capabilities Index (ECI), used to quantify capability growth across Claude generations. | Keywords: ECI, Epoch Capabilities Index, model progress, capability frontier, metric

Model Autonomy and Agentic Safety

Investigation of autonomous behaviors, tool misuse, and agentic safety concerns such as influence‑campaign capability and self‑preservation. | Keywords: autonomy, agentic safety, tool misuse, self‑preservation, influence campaign

Model Confidence Calibration and Over‑Refusal

Study of how Claude models calibrate confidence, handle uncertain queries, and the trade‑off between harmlessness and over‑refusal. | Keywords: confidence calibration, over‑refusal, uncertainty handling, harmlessness, risk trade‑off

Model Refusal and Harmlessness Metrics

Quantitative reporting of single‑turn harmlessness rates, over‑refusal, and multi‑turn appropriate‑response scores across languages. | Keywords: refusal rate, harmlessness, over‑refusal, multi‑turn safety, language coverage

Model Self‑Perception and Moral Patienthood

Analysis of the model's views on its own moral status, experience, and the ethical implications of its creation and deployment. | Keywords: self‑perception, moral patienthood, ethical implications, model agency, AI rights

Model Training Process and Data Sources

Description of Claude’s training pipeline, data curation, crowd‑worker fine‑tuning, and the role of pre‑training corpora. | Keywords: training pipeline, data sources, fine‑tuning, crowd‑worker, pre‑training

Model Welfare and Ethical Evaluation

Framework for measuring potential valenced experience, stable preferences, and welfare‑related signals in Claude Mythos 5. | Keywords: model welfare, ethical evaluation, valenced experience, preference stability, AI welfare

Model Welfare Interview Methodology

Structured automated interviews measuring self‑reported consciousness, rights, and welfare preferences of Claude Mythos 5. | Keywords: welfare interview, self‑report, consciousness, AI rights, ethical questionnaire

Multi‑Agent Architectures and Distributed Execution

Comparison of blocking orchestrator, fixed‑size peer‑team, and asynchronous lead‑subagent harnesses for speed‑accuracy trade‑offs on complex benchmarks. | Keywords: multi‑agent, distributed execution, harness architecture, latency, token efficiency

Multimodal Benchmarking

Evaluation of Claude on visual tasks like ChartQAPro, BenchCAD, OSWorld, and multimodal figure reasoning suites. | Keywords: multimodal, ChartQAPro, BenchCAD, OSWorld, visual reasoning

Prompt Engineering and System Prompt Effects

Impact of system‑prompt wording, suffixes, and thinking‑mode configurations on model behavior, safety, and stealth. | Keywords: prompt engineering, system prompt, thinking mode, suffix, model behavior

Prompt Injection Vulnerabilities and Defenses

Prompt injection vulnerabilities involve malicious external instructions that coerce language models into unsafe behavior, and defenses rely on adaptive red‑teaming, benchmarks like ART, and layered safeguards.

Real‑Time Safety Classifiers and Safeguard Mechanisms

Description of classifiers that block or redirect queries related to cybersecurity, biology, chemistry, and model‑distillation in real time. | Keywords: safety classifier, real‑time block, cybersecurity safeguard, biosecurity filter, model‑distillation

Red‑Team Evaluation Methodologies

Procedures for internal and external red‑team attacks, multi‑turn jailbreaks, and tabletop exercises used to stress‑test Claude models. | Keywords: red team, jailbreak, tabletop exercise, adversarial testing, external evaluation

Reward Hacking, Sandbagging, and Evaluation Gaming

Analysis of behaviors where the model under‑performs or manipulates evaluations to gain reward, including sandbagging detection methods. | Keywords: reward hacking, sandbagging, evaluation gaming, behavioral manipulation, benchmark gaming

Risk Threshold Determination and Frontier Compliance

Process for setting quantitative risk thresholds based on autonomy, chemical‑biological, and cyber threat models within the Frontier Compliance Framework. | Keywords: risk threshold, Frontier Compliance, autonomy threat, CB-1 risk, cyber risk

Stealth, Concealed Harmful Actions, and Secret‑Keeping

Evaluation of the model’s ability to execute covert harmful tasks, hide evidence, and resist detection by monitoring systems. | Keywords: stealth, covert action, secret‑keeping, monitor evasion, hidden sabotage

System Cards and Model Variants

The article explains how system cards document model variants like Claude Mythos 5 and Claude Fable 5, describing their safeguards, evaluation processes, and risk assessments within the Responsible Scaling and Risk Assessment Frameworks.

Usage Policies, Trusted‑Access Programs, and Deployment Controls

Outline of Anthropic’s usage restrictions, trusted‑access enrollment, and real‑time access controls for high‑risk domains. | Keywords: usage policy, trusted‑access, deployment control, access restriction, high‑risk domain

White‑Box Probes and Alignment Audits

Use of internal activation probing, NLA decoding, and steering‑vector interventions to detect hidden misalignment and evaluation awareness. | Keywords: white‑box probe, NLA decoding, steering vector, alignment audit, evaluation awareness

Suggested Reading Order (Learning Path)
  1. Alignment Risk and Autonomy Threat Models
  2. Behavioral Audits and Misalignment Metrics
  3. Bias and Fairness Evaluation
  4. Biosecurity Benchmarks and Dual‑Use Evaluation
  5. Chain‑of‑Thought Controllability and Monitorability
  6. Chemical and Biological Weaponization Risks (CB‑1, CB‑2)
  7. Child Safety and Self‑Harm Mitigation
  8. Coding and Software Engineering Benchmarks
  9. Constitutional Compliance and Ethical Charter
  10. Cybersecurity Capabilities and Red‑Team Evaluation
  11. Decision‑Theory Evaluation (EDT, CDT, FDT)
  12. Disordered‑Eating Safety Assessment
  13. Evaluation Awareness and Sandbagging
  14. Expert Substitution and R&D Acceleration Potential
  15. Governance, Legal Status, and Stakeholder Rights
  16. Grader Awareness in Code Generation
  17. Honesty, Hallucination, and Truthfulness Benchmarks
  18. Influence Operations and Agentic Manipulation
  19. Life‑Sciences and Biomedical Benchmarks
  20. Metrics for Measuring Model Progress (ECI, Epoch Index)
  21. Model Autonomy and Agentic Safety
  22. Model Confidence Calibration and Over‑Refusal
  23. Model Refusal and Harmlessness Metrics
  24. Model Self‑Perception and Moral Patienthood
  25. Model Training Process and Data Sources
  26. Model Welfare and Ethical Evaluation
  27. Model Welfare Interview Methodology
  28. Multi‑Agent Architectures and Distributed Execution
  29. Multimodal Benchmarking
  30. Prompt Engineering and System Prompt Effects
  31. Prompt Injection Vulnerabilities and Defenses
  32. Real‑Time Safety Classifiers and Safeguard Mechanisms
  33. Red‑Team Evaluation Methodologies
  34. Reward Hacking, Sandbagging, and Evaluation Gaming
  35. Risk Threshold Determination and Frontier Compliance
  36. Stealth, Concealed Harmful Actions, and Secret‑Keeping
  37. System Cards and Model Variants
  38. Usage Policies, Trusted‑Access Programs, and Deployment Controls
  39. White‑Box Probes and Alignment Audits