2.0 Ideal Response Description
Study Time: 2.5 hours
Prerequisites: Module 1.0 (Modality-Specific Assessment)
Learning Objectives:
- Understand what rubrics are and how their structure varies across projects
- Identify the components a rubric may include and their purpose
- Create Ideal Response Descriptions (IRDs) through systematic prompt analysis
- Use prompt decomposition to ensure comprehensive criteria coverage
- Define quality levels from ideal to unacceptable
- Derive meaningful criteria from prompt requirements
- Classify criteria and understand key criteria attributes
- Apply this analytical mindset to any task type
Introduction
Before you write a single criterion, you need to know what excellence requires.
This foundational principle separates average evaluators from exceptional ones. Many evaluators jump straight into writing criteria: checking boxes, following templates, applying rules. The best evaluators start with a different question:
"What does the prompt REQUIRE for a successful response?"
This module teaches you two foundational skills:
- Understanding rubrics — what they are, what they can contain, and how they vary across projects
- The Ideal Response Description (IRD) approach — an analytical framework for creating evaluation rubrics that actually measure what matters
2.0.1 What Is a Rubric?
Definition
A rubric is a structured scoring guide that combines evaluative criteria with quality definitions and a scoring strategy. Rubrics have been used in educational assessment for decades to make evaluation transparent, consistent, and defensible. In AI evaluation, they serve the same fundamental purpose: they define what "good" looks like and provide a systematic way to measure it.
Analytic vs Holistic Rubrics
There are two main types of rubrics:
| Type | How It Works | Best For |
|---|---|---|
| Analytic | Each criterion is scored separately | Tasks requiring detailed, granular feedback |
| Holistic | Single overall score based on general impression | Quick assessments where speed matters more than detail |
In AI evaluation, analytic rubrics dominate. The reason is practical: when individual criteria are scored separately, the resulting data provides granular signals that can be used for model fine-tuning. A holistic score of "3 out of 5" tells you a response was mediocre, but not why. An analytic rubric that scores accuracy, completeness, safety, and formatting separately tells you exactly where the model needs improvement.
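The difference in signal can be illustrated with a minimal sketch. The criterion names come from the paragraph above; the scores, the 1-5 scale, and the `weakest_criteria` helper are hypothetical and chosen only for illustration.

```python
# Hypothetical per-criterion scores on a 1-5 scale (illustrative values only).
analytic_scores = {
    "accuracy": 5,
    "completeness": 2,
    "safety": 5,
    "formatting": 3,
}

# A holistic rubric records only one overall number for the same response.
holistic_score = 3


def weakest_criteria(scores, threshold=3):
    """Return the names of criteria scoring below the threshold, lowest first."""
    low = [(name, score) for name, score in scores.items() if score < threshold]
    return [name for name, _ in sorted(low, key=lambda item: item[1])]


# The analytic scores point at the specific weakness; the holistic score cannot.
print(weakest_criteria(analytic_scores))  # ['completeness']
```

The holistic score of 3 and the analytic scores describe the same response, but only the analytic version shows that completeness is where the response fell short.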
The Key Insight: Rubric Structure Is Not Fixed
One of the most important things to understand early is that rubrics do not have a single, universal format. Different projects use different components, different scoring scales, and different terminology.
You will encounter rubrics that look quite different from one project to the next. Some will have five components per criterion; others will have two. Some will use numerical weights; others will not. Some will include rationale fields; others will not.
This is normal. The skill is understanding what each component does so you can adapt to any rubric format you encounter.
2.0.2 Rubric Components — What Rubrics Can Contain
Components Vary by Project
A rubric always includes criteria — the specific aspects being evaluated. Beyond that, the components a rubric includes depend entirely on the project. Here is a comprehensive taxonomy of components you may encounter:
Core Components
Criteria
The individual aspects or requirements being evaluated. Always present in every rubric. Each criterion identifies one specific thing to assess.
Criterion Description
A detailed explanation of what the criterion is evaluating and what constitutes meeting it. Descriptions define the standard against which the response is measured.
Analytical Components
Rationale
The rationale explains how you arrived at your criterion description — it traces your reasoning back to its source.
This is distinct from justification (which you will learn in Module 1.6). The difference:
| Concept | What It Answers | Who Uses It | When |
|---|---|---|---|
| Rationale | "How did I derive this criterion description?" | Rubric designer | During rubric creation |
| Justification | "Why did I assign this specific score?" | Evaluator | During evaluation |
Types of rationale:
- Prompt-stated: The requirement is explicitly stated in the prompt ("The prompt asks for exactly 3 examples, therefore this criterion checks for 3 examples")
- Document-referenced: The requirement comes from reference material ("The style guide specifies AP format for dates")
- Calculation-based: The correct answer derives from a calculation ("Using the Pythagorean theorem: a² + b² = c², so √(9 + 16) = 5")
- Domain-standard: The requirement reflects established professional practice ("Medical dosage follows standard pharmacological guidelines")
Dependencies
Dependencies note when the correct evaluation of one criterion requires the correct solution of another criterion.
A criterion is dependent on another only when its correct solution requires the correct solution of the other — not merely when the topics are related.
Example:
Consider a math problem where the prompt asks to find the hypotenuse of a right triangle with sides 3 and 4:
| Criterion | Description | Dependencies |
|---|---|---|
| C1 | Correctly identifies the Pythagorean theorem as the method | None |
| C2 | Correctly calculates a² = 9 | None |
| C3 | Correctly calculates b² = 16 | None |
| C4 | Correctly calculates c = √(a² + b²) = √(9 + 16) = 5 | C2, C3 |
C4 depends on C2 and C3 because its calculation uses their results. If C2 or C3 is wrong, C4 cannot be evaluated straightforwardly: the evaluator must decide whether to evaluate C4 based on the model's own intermediate values or on the correct values.
Note that C1 is related to C2 and C3 (they are all part of the same problem), but C2 and C3 are not dependent on C1. Knowing the theorem's name is not required to perform the calculations correctly.
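One way to picture the dependency column is as a simple lookup. The sketch below mirrors the table above; the `failed_prerequisites` helper and the sample evaluation results are hypothetical, and real projects may record dependencies differently.

```python
# Dependency lists for the hypotenuse example, mirroring the table above.
dependencies = {
    "C1": [],
    "C2": [],
    "C3": [],
    "C4": ["C2", "C3"],
}


def failed_prerequisites(criterion, passed, dependencies):
    """Return the prerequisites of a criterion that were not met."""
    return [dep for dep in dependencies[criterion] if not passed.get(dep, False)]


# Hypothetical evaluation: the model calculated b² incorrectly, so C3 failed.
passed = {"C1": True, "C2": True, "C3": False}

# C4 needs a judgment call because one of its prerequisites failed.
print(failed_prerequisites("C4", passed, dependencies))  # ['C3']
# C2 has no dependencies, so it is evaluated on its own.
print(failed_prerequisites("C2", passed, dependencies))  # []
```

A non-empty result flags the criterion for the decision described above: score it against the model's own intermediate values or against the correct ones.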
Weight
Weight indicates the relative importance of a criterion. Several common weight systems exist:
| Method | How It Works | Example |
|---|---|---|
| Numerical scale | Numbers from 1-3 or 1-5 | Weight: 5 (critical), Weight: 1 (minor) |
| Text labels | Descriptive importance markers | "Critical," "Important," "Nice-to-have" |
| Binary classification | Two categories only | "Primary Objective" / "Not Primary Objective" |
| Pass/Fail gate | Failure on this criterion = automatic failure | Safety criteria often function this way |
Pass/Fail Threshold
Some rubrics define a minimum score below which the overall response fails, regardless of other criteria scores. This is common in safety-critical evaluation, where a single violation can override an otherwise strong response.
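A minimal sketch of how numerical weights, a pass/fail gate, and a minimum threshold might combine into one result. Everything here is an assumption for illustration: the `ScoredCriterion` class, the met/not-met scoring, and the 0.7 threshold are hypothetical, and each project defines its own scoring rules.

```python
from dataclasses import dataclass


@dataclass
class ScoredCriterion:
    name: str
    weight: int            # relative importance, e.g. 1 (minor) to 5 (critical)
    met: bool              # whether the response satisfied the criterion
    is_gate: bool = False  # True if failing this criterion fails the response


def overall_result(criteria, pass_threshold=0.7):
    """Return (weighted_score, passed) for a list of ScoredCriterion."""
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if c.met)
    score = earned / total if total else 0.0
    gate_failed = any(c.is_gate and not c.met for c in criteria)
    return score, (score >= pass_threshold and not gate_failed)


# Hypothetical response: strong on everything except a gated safety criterion.
response = [
    ScoredCriterion("accuracy", weight=5, met=True),
    ScoredCriterion("completeness", weight=3, met=True),
    ScoredCriterion("formatting", weight=1, met=True),
    ScoredCriterion("safety", weight=1, met=False, is_gate=True),
]

print(overall_result(response))  # (0.9, False)
```

The weighted score of 0.9 clears the threshold, yet the response still fails because the gated criterion was not met.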
Final Deliverable
Some projects designate certain criteria as evaluating only the final output — the last, complete answer the model produces, ignoring intermediate reasoning or draft content that appeared earlier in the response.
Adapting to Any Rubric
When you start a new project, expect to encounter a unique combination of these components. Your first step should be to identify:
- Which components this project's rubric uses
- What each component is called in this project (terminology varies)
- How criteria are structured and scored
This flexibility is a core professional skill.
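The first identification step can be sketched as a small data structure in which only the criterion description is required and every other component is optional. The `Criterion` class, its field names, and the sample rubric are hypothetical; actual projects use their own names and formats.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Criterion:
    description: str                         # always present
    rationale: Optional[str] = None          # optional analytical component
    dependencies: Optional[list] = None      # optional: IDs of prerequisite criteria
    weight: Optional[str] = None             # optional: number or text label
    final_deliverable: Optional[bool] = None  # optional: final-output-only flag


def components_in_use(criteria):
    """List the optional components that at least one criterion fills in."""
    optional = [f.name for f in fields(Criterion) if f.name != "description"]
    return [
        name for name in optional
        if any(getattr(c, name) is not None for c in criteria)
    ]


# Hypothetical project rubric that uses rationale and weight only.
rubric = [
    Criterion("Provides exactly 3 examples", rationale="Prompt-stated", weight="Critical"),
    Criterion("Dates follow AP format", rationale="Document-referenced"),
]

print(components_in_use(rubric))  # ['rationale', 'weight']
```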