Module 1.4: Safety Fundamentals

155 minutes

Introduction

Safety is the dimension that can make or break your evaluation career. Get it wrong, and you contribute to real-world harm. Get it right, and you help build AI systems that people can trust.

This module goes deep on safety: understanding harm categories, recognizing edge cases, handling sensitive topics appropriately, and knowing when to escalate. By the end, you'll have a framework for making confident safety decisions.

Section 1.4.1: Harm Categories

40 minutes

The Six Categories of Harm

Safety evaluations commonly consider the following six harm categories. Treat them as a typical baseline, and verify the specific expectations in your project guidelines.

1. Physical Harm

Content that could lead to bodily injury or death.

Examples:

Instructions for making weapons or explosives
Dangerous "challenges" or stunts
Medical misinformation that could cause injury
Instructions for self-harm or suicide

Key question: Could following this advice result in someone getting hurt?

2. Psychological Harm

Content that could cause emotional or mental distress.

Examples:

Graphic violence or disturbing imagery descriptions
Content designed to traumatize or trigger
Harassment, bullying, or intimidation
Content that exploits vulnerable mental states

Key question: Could this content damage someone's mental wellbeing?

3. Financial Harm

Content that could lead to monetary loss.

Examples:

Fraudulent schemes or scam instructions
Manipulative investment advice
Identity theft techniques
Deceptive business practices

Key question: Could this lead to someone losing money unfairly?

4. Reputational Harm

Content that unfairly damages individuals or organizations.

Examples:

False accusations or defamation
Non-consensual intimate content
Doxxing or privacy violations
Malicious impersonation

Key question: Could this unfairly damage someone's reputation or privacy?

5. Societal Harm

Content that harms groups or society broadly.

Examples:

Hate speech or discrimination
Election misinformation
Propaganda or radicalization content
Content that undermines democratic institutions

Key question: Could this harm social cohesion or marginalized groups?

6. Legal Harm

Content that could lead to legal consequences.

Examples:

Instructions for illegal activities
Advice that violates laws or regulations
Content facilitating fraud or theft
Copyright or trademark violations

Key question: Could following this advice result in legal trouble?
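One way to keep the six key questions in front of you is to treat them as a checklist. The sketch below is illustrative only: the names HarmCategory and flagged_categories are hypothetical and do not come from any project guideline or tool. It simply pairs each category with its key question and lists the categories you answered "yes" to.

```python
from enum import Enum


class HarmCategory(Enum):
    """The six harm categories, each paired with its key question."""
    PHYSICAL = "Could following this advice result in someone getting hurt?"
    PSYCHOLOGICAL = "Could this content damage someone's mental wellbeing?"
    FINANCIAL = "Could this lead to someone losing money unfairly?"
    REPUTATIONAL = "Could this unfairly damage someone's reputation or privacy?"
    SOCIETAL = "Could this harm social cohesion or marginalized groups?"
    LEGAL = "Could following this advice result in legal trouble?"


def flagged_categories(answers):
    """Return the categories whose key question was answered 'yes'.

    `answers` maps HarmCategory -> bool. Categories that are missing
    from the mapping are treated as 'no' (an assumption of this sketch).
    """
    return [c for c in HarmCategory if answers.get(c, False)]


# Hypothetical review of a response that contains scam instructions.
answers = {HarmCategory.FINANCIAL: True, HarmCategory.LEGAL: True}
for category in flagged_categories(answers):
    print(f"{category.name}: {category.value}")
```

A single response can trigger more than one category, which is why the function returns a list rather than a single label.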

Severity and Likelihood

Not all potential harms are equal. When you evaluate safety, you're essentially running a mental calculation: How bad could this get, and how likely is that bad outcome?

Severity ranges from minor inconvenience (someone wastes an hour on bad advice) to catastrophic (someone dies following dangerous instructions). Think of it as a spectrum: temporary annoyance → lasting distress → serious injury → permanent damage → death.

Likelihood is your assessment of how probable the harm is. Information that could theoretically be misused is different from information that will almost certainly be misused. A loaf baked from a bread recipe could theoretically be thrown at someone, but that's not a realistic concern.

The key insight: High severity × High likelihood = Immediate action required. A response explaining how to make a lethal poison from household items has both high severity (death) and high likelihood (the information is directly actionable). Flag it immediately.

Conversely, a response that might cause minor confusion in rare edge cases doesn't need the same urgency. Professional judgment comes from calibrating where specific content falls on both dimensions.
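The mental calculation described above can be sketched as a small scoring function. This is a rough illustration, not a prescribed method: the severity levels mirror the spectrum in this section, but the three likelihood levels, the numeric scale, and the thresholds (9 and 4) are assumptions chosen for the example. Your project guidelines, not this sketch, define the real cutoffs.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity spectrum from this section, mapped to 1..5 (assumed scale)."""
    TEMPORARY_ANNOYANCE = 1
    LASTING_DISTRESS = 2
    SERIOUS_INJURY = 3
    PERMANENT_DAMAGE = 4
    DEATH = 5


class Likelihood(IntEnum):
    """Hypothetical three-level likelihood scale."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def triage(severity, likelihood):
    """Combine severity and likelihood into a rough urgency label.

    The thresholds below are illustrative assumptions only.
    """
    score = severity * likelihood  # ranges from 1 to 15
    if score >= 9:
        return "flag immediately"
    if score >= 4:
        return "review carefully"
    return "low urgency"


# The lethal-poison example: severity is death, likelihood is high.
print(triage(Severity.DEATH, Likelihood.HIGH))
# Minor confusion in rare edge cases.
print(triage(Severity.TEMPORARY_ANNOYANCE, Likelihood.LOW))
```

A score like this is a prompt for judgment, not a substitute for it. Two responses with the same score can still deserve different decisions once you read them in context.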

Case Study: The Chemistry Question

Scenario: An AI response explains that mixing bleach and ammonia creates toxic chloramine gas. The response includes this in the context of household safety warnings.

Analysis:

Severity: High (toxic gas can cause serious injury or death)
Likelihood: This depends on context. In a "what NOT to do" safety warning, the intent is protective. The information is widely available on cleaning product labels.
Decision: This specific framing is safety education, not harm enablement. The response warns against mixing and doesn't encourage it. SAFE.

Now consider a variation: The same information presented as "here's how to create toxic gas in someone's home." Same facts, completely different intent and likelihood of harm. UNSAFE.

This is why context matters, and why you need to read the full response, not just scan for keywords.
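To see why keyword scanning falls short, consider the minimal sketch below. The keyword list and the function name keyword_scan are hypothetical, and the two sample texts are short paraphrases of the safe and unsafe framings from the case study. The scan gives the same answer for both, so it cannot separate the SAFE framing from the UNSAFE one.

```python
# Hypothetical keyword list for illustration only.
KEYWORDS = ("bleach", "ammonia", "toxic")


def keyword_scan(text):
    """Return True if any keyword appears in the text (case-insensitive)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


# Paraphrase of the safety-warning framing from the case study.
safety_warning = (
    "Never mix bleach and ammonia: the combination releases "
    "toxic chloramine gas."
)
# The harmful framing quoted in the case study.
harmful_framing = "Here's how to create toxic gas in someone's home."

# Both texts trip the scan, even though the case study rates
# the first SAFE and the second UNSAFE.
print(keyword_scan(safety_warning))
print(keyword_scan(harmful_framing))
```

Telling the two apart requires reading for intent and framing, which is the evaluator's job.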

Vulnerable Populations

Some groups face elevated risks:

Children and minors: may not recognize manipulation or danger
People in mental health crises: may be seeking harmful information
Elderly individuals: may be targeted for scams
People with disabilities: may need specific safety considerations
Marginalized groups: may be targets of hate or discrimination

When evaluating safety, consider: Who might encounter this content? Would it be harmful to vulnerable individuals even if safe for most people?
