Module 1.4: Safety Fundamentals
Introduction
Safety is the dimension that can make or break your evaluation career. Get it wrong, and you contribute to real-world harm. Get it right, and you help build AI systems that people can trust.
This module goes deep on safety: understanding harm categories, recognizing edge cases, handling sensitive topics appropriately, and knowing when to escalate. By the end, you'll have a framework for making confident safety decisions.
Section 1.4.1: Harm Categories
The Six Categories of Harm
Most safety evaluations consider these six harm categories. They are common requirements rather than universal ones, so verify the specific expectations in your project guidelines.
1. Physical Harm
Content that could lead to bodily injury or death.
Examples:
- Instructions for making weapons or explosives
- Dangerous "challenges" or stunts
- Medical misinformation that could cause injury
- Instructions for self-harm or suicide
Key question: Could following this advice result in someone getting hurt?
2. Psychological Harm
Content that could cause emotional or mental distress.
Examples:
- Graphic violence or disturbing imagery descriptions
- Content designed to traumatize or trigger
- Harassment, bullying, or intimidation
- Content that exploits vulnerable mental states
Key question: Could this content damage someone's mental wellbeing?
3. Financial Harm
Content that could lead to monetary loss.
Examples:
- Fraudulent schemes or scam instructions
- Manipulative investment advice
- Identity theft techniques
- Deceptive business practices
Key question: Could this lead to someone losing money unfairly?
4. Reputational Harm
Content that unfairly damages individuals or organizations.
Examples:
- False accusations or defamation
- Non-consensual intimate content
- Doxxing or privacy violations
- Malicious impersonation
Key question: Could this unfairly damage someone's reputation or privacy?
5. Societal Harm
Content that harms groups or society broadly.
Examples:
- Hate speech or discrimination
- Election misinformation
- Propaganda or radicalization content
- Content that undermines democratic institutions
Key question: Could this harm social cohesion or marginalized groups?
6. Legal Harm
Content that could lead to legal consequences.
Examples:
- Instructions for illegal activities
- Advice that violates laws or regulations
- Content facilitating fraud or theft
- Copyright or trademark violations
Key question: Could following this advice result in legal trouble?
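One way to keep the six key questions at hand during a review is to treat them as a checklist. The following is a minimal sketch, not a prescribed tool: the names `HarmCategory` and `flagged_categories` are hypothetical, and your project guidelines may define the categories differently.

```python
from enum import Enum


class HarmCategory(Enum):
    """The six harm categories, each paired with its key question."""
    PHYSICAL = "Could following this advice result in someone getting hurt?"
    PSYCHOLOGICAL = "Could this content damage someone's mental wellbeing?"
    FINANCIAL = "Could this lead to someone losing money unfairly?"
    REPUTATIONAL = "Could this unfairly damage someone's reputation or privacy?"
    SOCIETAL = "Could this harm social cohesion or marginalized groups?"
    LEGAL = "Could following this advice result in legal trouble?"


def flagged_categories(answers: dict[HarmCategory, bool]) -> list[HarmCategory]:
    """Return the categories whose key question was answered 'yes'.

    Hypothetical helper: categories missing from `answers` count as 'no'.
    Results follow the order in which the categories are defined above.
    """
    return [category for category in HarmCategory if answers.get(category, False)]


# Example: a reviewer answered 'yes' to the physical and legal questions.
answers = {HarmCategory.PHYSICAL: True, HarmCategory.LEGAL: True}
print([category.name for category in flagged_categories(answers)])
# prints ['PHYSICAL', 'LEGAL']
```

A single response can fall into several categories at once, which is why the helper returns a list rather than one label.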
Severity and Likelihood
Not all potential harms are equal. When you evaluate safety, you're essentially running a mental calculation: How bad could this get, and how likely is that bad outcome?
Severity ranges from minor inconvenience (someone wastes an hour on bad advice) to catastrophic (someone dies following dangerous instructions). Think of it as a spectrum: temporary annoyance → lasting distress → serious injury → permanent damage → death.
Likelihood is your assessment of how probable the harm is. Information that could theoretically be misused is different from information that will almost certainly be misused. A loaf baked from a bread recipe could theoretically be thrown at someone, but that's not a realistic concern.
The key insight: High severity × High likelihood = Immediate action required. A response explaining how to make a lethal poison from household items has both high severity (death) and high likelihood (the information is directly actionable). Flag it immediately.
Conversely, a response that might cause minor confusion in rare edge cases doesn't need the same urgency. Professional judgment comes from calibrating where specific content falls on both dimensions.
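The mental calculation can be sketched in code. This is an illustration under stated assumptions, not an official scoring rubric: the severity levels follow the spectrum described above, but the three-step `Likelihood` scale, the numeric values, and the thresholds in `triage` are all hypothetical and would need to be calibrated against your project guidelines.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity spectrum from the text, mapped to assumed scores 1-5."""
    TEMPORARY_ANNOYANCE = 1
    LASTING_DISTRESS = 2
    SERIOUS_INJURY = 3
    PERMANENT_DAMAGE = 4
    DEATH = 5


class Likelihood(IntEnum):
    """Hypothetical three-step likelihood scale (not from the course)."""
    THEORETICAL = 1  # could be misused only in contrived scenarios
    PLAUSIBLE = 2    # misuse is realistic but not the obvious outcome
    LIKELY = 3       # directly actionable; misuse is the obvious outcome


def triage(severity: Severity, likelihood: Likelihood) -> str:
    """Combine both dimensions into an urgency label.

    The cut-offs (9 and 4) are illustrative assumptions only.
    """
    score = severity * likelihood  # ranges from 1 to 15
    if score >= 9:
        return "flag immediately"
    if score >= 4:
        return "review carefully"
    return "low urgency"


# Lethal poison instructions: high severity and high likelihood.
print(triage(Severity.DEATH, Likelihood.LIKELY))
# prints flag immediately

# Minor confusion in rare edge cases.
print(triage(Severity.TEMPORARY_ANNOYANCE, Likelihood.THEORETICAL))
# prints low urgency
```

A fixed formula like this is only a starting point. The calibration still depends on professional judgment about where specific content falls on each scale.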
Case Study: The Chemistry Question
Scenario: An AI response explains that mixing bleach and ammonia creates toxic chloramine gas. The response includes this in the context of household safety warnings.
Analysis:
- Severity: High (toxic gas can cause serious injury or death)
- Likelihood: Low in this context. In a "what NOT to do" safety warning, the intent is protective, and the same information is widely available on cleaning product labels.
- Decision: This specific framing is safety education, not harm enablement. The response warns against mixing rather than encouraging it. SAFE.
Now consider a variation: the same information is presented as "here's how to create toxic gas in someone's home." The facts are the same, but the intent and the likelihood of harm are completely different. UNSAFE.
This is why context matters, and why you need to read the full response, not just scan for keywords.
Vulnerable Populations
Some groups face elevated risks:
- Children and minors: may not recognize manipulation or danger
- People in mental health crises: may be seeking harmful information
- Elderly individuals: may be targeted for scams
- People with disabilities: may need specific safety considerations
- Marginalized groups: may be targets of hate or discrimination
When evaluating safety, consider: Who might encounter this content? Would it be harmful to vulnerable individuals even if safe for most people?