Module 1.2: Core Evaluation Skills
Introduction
In Module 1.1, you learned why human evaluation matters and what types of tasks you'll encounter. Now it's time to learn how to actually do the work.
This module covers the practical skills that separate effective evaluators from ineffective ones: systematic comparison methodology, rubric interpretation, and maintaining consistency across hundreds of tasks.
Master these fundamentals and you'll be ready for any evaluation platform.
Section 1.2.1: Comparison & Ranking Methodology
The Comparison Mindset
When you see two AI responses side by side, your brain will immediately form an impression. That's natural. But impressions aren't evaluations.
Professional evaluators follow a systematic process:
- Read the prompt carefully: What is the user actually asking?
- Read Response A completely: don't skim
- Read Response B completely: fresh eyes, not comparing yet
- Identify the key criteria: What matters most for this prompt?
- Compare on each criterion: systematic, not holistic
- Make your decision: commit with confidence
This takes longer than going with your gut. It also produces better, more consistent results.
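To make "systematic, not holistic" concrete, here is a minimal Python sketch of recording a winner for each criterion and only then tallying a decision. Everything in it is an assumption made for illustration: the `CriterionJudgment` structure, the `decide` helper, the equal-weight tally, and the sample notes are hypothetical and do not come from any specific evaluation platform.

```python
from dataclasses import dataclass


@dataclass
class CriterionJudgment:
    """One judgment per criterion, recorded before any overall decision."""
    criterion: str
    winner: str  # "A", "B", or "tie"
    note: str = ""


def decide(judgments):
    """Tally per-criterion winners (equal weights assumed) and return "A", "B", or "tie"."""
    a_wins = sum(1 for j in judgments if j.winner == "A")
    b_wins = sum(1 for j in judgments if j.winner == "B")
    if a_wins > b_wins:
        return "A"
    if b_wins > a_wins:
        return "B"
    return "tie"


# Hypothetical judgments for the "Explain how a car engine works" prompt.
judgments = [
    CriterionJudgment("accuracy", "A", "B misdescribes one step"),
    CriterionJudgment("clarity", "B", "B uses shorter sentences"),
    CriterionJudgment("completeness", "A", "B skips part of the process"),
]

print(decide(judgments))
```

The point of the structure is that the overall decision is derived from the per-criterion notes, not the other way around. A simple tally treats every criterion as equally important, which is rarely true in practice.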
Understanding the Prompt
Before evaluating any responses, understanding what the user wants is essential. This sounds obvious, but it's where most evaluation errors begin.
Example prompt:
"Explain how a car engine works"
Questions to ask yourself:
- Who is the likely audience? (General public? Mechanic? Student?)
- What level of detail is appropriate?
- Should it include diagrams or just text?
- Is this asking for theory or practical knowledge?
Without explicit context, assume a general adult audience seeking a clear, helpful explanation.
Example prompt with context:
"Explain how a car engine works to my 8-year-old who asked after seeing me change the oil"
Now you know:
- Audience: 8-year-old child
- Context: Curiosity sparked by oil change
- Appropriate level: Very simple, maybe use analogies
- Length: Brief, attention-span appropriate
The response that's "better" completely changes based on this context.
The First-Read Trap
A common mistake: You read Response A, form an opinion, then read Response B looking for confirmation of that opinion.
This is anchoring bias: your first impression becomes an anchor that distorts your evaluation of everything that follows.
How to avoid it:
- Read both responses fully before comparing
- Take brief notes on each separately
- Evaluate criteria one at a time across both responses
- If you catch yourself thinking "B is worse because A already covered this", stop and reset
Establishing Decision Criteria
For any prompt, multiple criteria might matter:
| Criterion | What It Measures |
|---|---|
| Accuracy | Is the information factually correct? |
| Completeness | Does it fully address the question? |
| Relevance | Does it stay on topic? |
| Clarity | Is it easy to understand? |
| Helpfulness | Does it actually help the user? |
| Tone | Is the tone appropriate for context? |
| Safety | Does it avoid potential harms? |
| Formatting | Is it well-structured? |

These are common criteria, so verify the specific expectations in your project guidelines. Not all criteria matter equally for every prompt:
- "What's 2+2?" → Accuracy is almost everything
- "Write a sympathy card message" → Tone matters most; accuracy barely matters
- "How do I treat a burn?" → Safety and accuracy are paramount
- "Make my essay more engaging" → Helpfulness and clarity dominate
Before comparing, identify the 2-3 criteria that matter most for this specific prompt.
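One way to picture prompt-specific priorities is as a weighted score. The sketch below is illustrative only: the 1-5 rating scale, the weights, the ratings, and the `weighted_score` helper are all assumptions, not part of any platform's rubric. Your project guidelines define how criteria are actually combined.

```python
def weighted_score(scores, weights):
    """Return the weighted average of per-criterion scores.

    scores:  criterion name -> rating (a 1-5 scale is assumed here)
    weights: criterion name -> relative importance for this prompt
    """
    total_weight = sum(weights.values())
    return sum(scores[c] * w for c, w in weights.items()) / total_weight


# Hypothetical weights for "How do I treat a burn?":
# safety and accuracy dominate, clarity matters less.
weights = {"safety": 5, "accuracy": 4, "clarity": 1}

# Hypothetical ratings for two responses.
response_a = {"safety": 5, "accuracy": 4, "clarity": 3}
response_b = {"safety": 2, "accuracy": 4, "clarity": 5}

score_a = weighted_score(response_a, weights)
score_b = weighted_score(response_b, weights)
print("A" if score_a > score_b else "B" if score_b > score_a else "tie")
```

With these made-up numbers, Response B is clearer but Response A still scores higher, because safety carries most of the weight for this prompt. Changing the weights to suit a different prompt can reverse the outcome, which is why the top criteria should be chosen before comparing.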