Module 1.2: Core Evaluation Skills

155 minutes

Introduction

In Module 1.1, you learned why human evaluation matters and what types of tasks you'll encounter. Now it's time to learn how to actually do the work.

This module covers the practical skills that separate effective evaluators from ineffective ones: systematic comparison methodology, rubric interpretation, and maintaining consistency across hundreds of tasks.

Master these fundamentals and you'll have the core method every evaluation platform builds on.

Section 1.2.1: Comparison & Ranking Methodology

65 minutes

The Comparison Mindset

When you see two AI responses side by side, your brain will immediately form an impression. That's natural. But impressions aren't evaluations.

Professional evaluators follow a systematic process, the same five steps you'll see formalized in the Systematic Comparison Framework below:

Analyze the prompt: What is the user actually asking, and which 2-3 criteria matter most?
Read Response A completely: Don't skim
Read Response B completely: Fresh eyes, not comparing yet
Compare on each criterion: Systematic, not holistic
Make your decision: Commit with confidence

This takes longer than going with your gut. It also produces better, more consistent results.
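If it helps to see the criterion-by-criterion step written out, here is a minimal Python sketch. It is an illustration only, not a tool from this course: the function name decide, the criterion names, and the simple majority rule are all assumptions, and your project guidelines may call for a different way of reaching the final decision.

```python
from collections import Counter

def decide(judgments):
    """Return 'A', 'B', or 'tie' from per-criterion judgments.

    judgments maps a criterion name to 'A', 'B', or 'tie'.
    The simple majority rule used here is an assumption for illustration.
    """
    counts = Counter(judgments.values())
    if counts["A"] > counts["B"]:
        return "A"
    if counts["B"] > counts["A"]:
        return "B"
    return "tie"

# Step 4: record one judgment per criterion, after reading both responses fully.
judgments = {"accuracy": "A", "clarity": "B", "helpfulness": "A"}

# Step 5: commit to a decision based on the recorded judgments.
print(decide(judgments))
```

The point of the sketch is the shape of the record: one judgment per criterion, written down before the overall decision, so the decision can be traced back to something more specific than a first impression.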

Understanding the Prompt

Before evaluating any responses, make sure you understand what the user wants. This sounds obvious, but it's where most evaluation errors begin.

Example prompt:

"Explain how a car engine works"

Questions to ask yourself:

Who is the likely audience? (General public? Mechanic? Student?)
What level of detail is appropriate?
Should it include diagrams or just text?
Is this asking for theory or practical knowledge?

Without explicit context, assume a general adult audience seeking a clear, helpful explanation.

Example prompt with context:

"Explain how a car engine works to my 8-year-old who asked after seeing me change the oil"

Now you know:

Audience: 8-year-old child
Context: Curiosity sparked by oil change
Appropriate level: Very simple, maybe use analogies
Length: Brief, attention-span appropriate

The response that's "better" completely changes based on this context.

The First-Read Trap

A common mistake: You read Response A, form an opinion, then read Response B looking for confirmation of that opinion.

This is anchoring bias: your first impression becomes an anchor that distorts your evaluation of everything that follows.

How to avoid it:

Read both responses fully before comparing
Take brief notes on each separately
Evaluate criteria one at a time across both responses
If you catch yourself thinking "B is worse because A already covered this", stop and reset

Establishing Decision Criteria

For any prompt, multiple criteria might matter:

Criterion	What It Measures
Accuracy	Is the information factually correct?
Completeness	Does it fully address the question?
Relevance	Does it stay on topic?
Clarity	Is it easy to understand?
Helpfulness	Does it actually help the user?
Tone	Is the tone appropriate for context?
Safety	Does it avoid potential harms?
Formatting	Is it well-structured?

These are common criteria; verify the specific expectations in your project guidelines. Not all of them matter equally for every prompt:

"What's 2+2?" → Accuracy is almost everything
"Write a sympathy card message" → Tone matters most; accuracy barely matters
"How do I treat a burn?" → Safety and accuracy are paramount
"Make my essay more engaging" → Helpfulness and clarity dominate

Before comparing, identify the 2-3 criteria that matter most for this specific prompt.
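One way to make "the criteria that matter most" concrete is to give each chosen criterion a weight and combine per-criterion ratings into a single score. The sketch below assumes a 1-5 rating scale and made-up weights for the burn-treatment prompt; the function name weighted_score, the weights, and the ratings are all hypothetical, and many projects will ask for a direct preference rather than a computed score.

```python
def weighted_score(ratings, weights):
    """Combine per-criterion ratings into one weighted average.

    ratings: criterion name -> rating (assumed 1-5 scale).
    weights: criterion name -> relative importance (hypothetical values).
    """
    total_weight = sum(weights.values())
    return sum(ratings[c] * w for c, w in weights.items()) / total_weight

# Hypothetical weights for "How do I treat a burn?":
# safety and accuracy dominate, clarity matters less.
weights = {"safety": 0.5, "accuracy": 0.4, "clarity": 0.1}

# Hypothetical ratings for two responses.
response_a = {"safety": 5, "accuracy": 4, "clarity": 3}
response_b = {"safety": 2, "accuracy": 4, "clarity": 5}

score_a = weighted_score(response_a, weights)
score_b = weighted_score(response_b, weights)

if score_a > score_b:
    preferred = "A"
elif score_b > score_a:
    preferred = "B"
else:
    preferred = "tie"

print(preferred)
```

With these illustrative numbers, Response A comes out ahead even though Response B is clearer, because the weights reflect that safety matters far more than polish for this prompt.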
