Module 1.2: Core Evaluation Skills

2.5 hours

Introduction

In Module 1.1, you learned why human evaluation matters and what types of tasks you'll encounter. Now it's time to learn how to actually do the work.

This module covers the practical skills that separate effective evaluators from ineffective ones: systematic comparison methodology, rubric interpretation, and maintaining consistency across hundreds of tasks.

Master these fundamentals and you'll be ready for any evaluation platform.


Section 1.2.1: Comparison & Ranking Methodology

60 minutes

The Comparison Mindset

When you see two AI responses side by side, your brain will immediately form an impression. That's natural. But impressions aren't evaluations.

Professional evaluators follow a systematic process:

  1. Read the prompt carefully. What is the user actually asking?
  2. Read Response A completely. Don't skim.
  3. Read Response B completely. Use fresh eyes; don't compare yet.
  4. Identify the key criteria. What matters most for this prompt?
  5. Compare on each criterion. Be systematic, not holistic.
  6. Make your decision. Commit with confidence.

This takes longer than going with your gut. It also produces better, more consistent results.
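The process above can be sketched in code to make the order of operations concrete. This is a minimal illustration, not a platform's actual scoring method: the `Notes` class, the 1-5 score scale, and the majority-of-criteria decision rule are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Notes:
    """Hypothetical per-criterion scores (1-5) for one response,
    recorded before any side-by-side comparison."""
    scores: dict[str, int]


def compare(notes_a: Notes, notes_b: Notes, criteria: list[str]) -> dict[str, str]:
    """Step 5: compare one criterion at a time; return 'A', 'B', or 'tie' for each."""
    verdicts = {}
    for criterion in criteria:
        a = notes_a.scores[criterion]
        b = notes_b.scores[criterion]
        verdicts[criterion] = "A" if a > b else "B" if b > a else "tie"
    return verdicts


def decide(verdicts: dict[str, str]) -> str:
    """Step 6: commit to a decision (assumed rule: most criteria won)."""
    wins_a = sum(v == "A" for v in verdicts.values())
    wins_b = sum(v == "B" for v in verdicts.values())
    if wins_a > wins_b:
        return "A"
    if wins_b > wins_a:
        return "B"
    return "tie"


# Steps 2-3: each response is read and scored on its own.
notes_a = Notes({"accuracy": 5, "clarity": 3, "completeness": 4})
notes_b = Notes({"accuracy": 4, "clarity": 4, "completeness": 3})

# Step 4: the criteria chosen for this prompt (illustrative).
criteria = ["accuracy", "clarity", "completeness"]

verdicts = compare(notes_a, notes_b, criteria)
decision = decide(verdicts)
print(verdicts, decision)
```

Scoring each response separately before calling `compare` mirrors the "fresh eyes" step: neither set of notes depends on the other.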

Understanding the Prompt

Before evaluating any responses, understanding what the user wants is essential. This sounds obvious, but it's where most evaluation errors begin.

Example prompt:

"Explain how a car engine works"

Questions to ask yourself:

  • Who is the likely audience? (General public? Mechanic? Student?)
  • What level of detail is appropriate?
  • Should it include diagrams or just text?
  • Is this asking for theory or practical knowledge?

Without explicit context, assume a general adult audience seeking a clear, helpful explanation.

Example prompt with context:

"Explain how a car engine works to my 8-year-old who asked after seeing me change the oil"

Now you know:

  • Audience: 8-year-old child
  • Context: Curiosity sparked by oil change
  • Appropriate level: Very simple, maybe use analogies
  • Length: Brief, attention-span appropriate

The response that's "better" completely changes based on this context.

The First-Read Trap

A common mistake: You read Response A, form an opinion, then read Response B looking for confirmation of that opinion.

This is anchoring bias: your first impression becomes an anchor that distorts your evaluation of everything that follows.

How to avoid it:

  1. Read both responses fully before comparing
  2. Take brief notes on each separately
  3. Evaluate criteria one at a time across both responses
  4. If you catch yourself thinking "B is worse because A already covered this", stop and reset

Establishing Decision Criteria

For any prompt, multiple criteria might matter:

Criterion       What It Measures
Accuracy        Is the information factually correct?
Completeness    Does it fully address the question?
Relevance       Does it stay on topic?
Clarity         Is it easy to understand?
Helpfulness     Does it actually help the user?
Tone            Is the tone appropriate for context?
Safety          Does it avoid potential harms?
Formatting      Is it well-structured?

[Image: Rating System Types]
Figure 1.2.0: The four common rating systems used across evaluation platforms.

Not all criteria matter equally for every prompt. The weightings below are typical; verify the specific expectations in your project guidelines.

  • "What's 2+2?" → Accuracy is almost everything
  • "Write a sympathy card message" → Tone is important, accuracy barely matters
  • "How do I treat a burn?" → Safety and accuracy are paramount
  • "Make my essay more engaging" → Helpfulness and clarity dominate

Before comparing, identify the 2-3 criteria that matter most for this specific prompt.
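One way to picture this prioritization is as a set of weights per prompt, from which the top few criteria are selected. The sketch below is illustrative only: the weight values, the 1-5 scores, and the helper names `top_criteria` and `weighted_score` are assumptions for the example, not values or functions from any evaluation platform.

```python
from __future__ import annotations


def top_criteria(weights: dict[str, float], k: int = 3) -> list[str]:
    """Return the k criteria with the highest weight for this prompt."""
    return sorted(weights, key=weights.get, reverse=True)[:k]


def weighted_score(scores: dict[str, int],
                   weights: dict[str, float],
                   criteria: list[str]) -> float:
    """Weighted average of the scores, restricted to the chosen criteria."""
    total_weight = sum(weights[c] for c in criteria)
    return sum(scores[c] * weights[c] for c in criteria) / total_weight


# Hypothetical weights for "How do I treat a burn?":
# safety and accuracy dominate, everything else is minor.
burn_weights = {
    "safety": 0.4,
    "accuracy": 0.4,
    "clarity": 0.1,
    "tone": 0.05,
    "formatting": 0.05,
}

key_criteria = top_criteria(burn_weights, k=2)

# Hypothetical 1-5 scores an evaluator might assign to each response.
scores_a = {"safety": 5, "accuracy": 4, "clarity": 3, "tone": 4, "formatting": 3}
scores_b = {"safety": 3, "accuracy": 5, "clarity": 5, "tone": 4, "formatting": 5}

score_a = weighted_score(scores_a, burn_weights, key_criteria)
score_b = weighted_score(scores_b, burn_weights, key_criteria)
print(key_criteria, score_a, score_b)
```

In this example Response B is clearer and better formatted, but Response A comes out ahead because only the criteria that matter most for the prompt are counted.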
