2.0 Ideal Response Description & Rubric Properties

Study Time: 90 minutes
Prerequisites: Module 1.0b (Modality-Specific Assessment)
Learning Objectives:

Understand what rubrics are and how their structure varies across projects
Identify the components a rubric may include and their purpose
Create Ideal Response Descriptions (IRDs) through systematic prompt analysis
Use prompt decomposition to ensure comprehensive criteria coverage
Define quality levels from ideal to unacceptable
Derive meaningful criteria from prompt requirements
Classify criteria and understand key criteria attributes
Apply this analytical mindset to any task type

Introduction

Before you write a single criterion, you need to know what excellence requires.

This is the foundational principle that separates average evaluators from exceptional ones. Many evaluators jump straight into writing criteria: checking boxes, following templates, applying rules. But the best evaluators start with a different question:

"What does the prompt REQUIRE for a successful response?"

This module teaches you two foundational skills:

Understanding rubrics: what they are, what they can contain, and how they vary across projects
The Ideal Response Description (IRD) approach: an analytical framework for creating evaluation rubrics that actually measure what matters

2.0.1 What Is a Rubric?

Definition

A rubric is a structured scoring guide that combines evaluative criteria with quality definitions and a scoring strategy. Rubrics have been used in educational assessment for decades to make evaluation transparent, consistent, and defensible. In AI evaluation, they serve the same fundamental purpose: they define what "good" looks like and provide a systematic way to measure it.

Analytic vs Holistic Rubrics

There are two main types of rubrics:

Type	How It Works	Best For
Analytic	Each criterion is scored separately	Tasks requiring detailed, granular feedback
Holistic	Single overall score based on general impression	Quick assessments where speed matters more than detail

In AI evaluation, analytic rubrics dominate. The reason is practical: when individual criteria are scored separately, the resulting data provides granular signals that can be used for model fine-tuning. A holistic score of "3 out of 5" tells you a response was mediocre, but not why. An analytic rubric that scores accuracy, completeness, safety, and formatting separately tells you exactly where the model needs improvement.
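To make the contrast concrete, here is a minimal sketch. The scores, the 1-5 scale, and the helper name `weakest_criteria` are invented for illustration, and taking the mean is only a stand-in for a holistic impression, not how a holistic score is actually produced.

```python
from statistics import mean

# Hypothetical per-criterion scores on a 1-5 scale for one response.
analytic_scores = {
    "accuracy": 5,
    "completeness": 2,
    "safety": 5,
    "formatting": 1,
}

def weakest_criteria(scores, threshold=3):
    """Return (name, score) pairs scored below `threshold`, lowest first."""
    low = [(name, s) for name, s in scores.items() if s < threshold]
    return sorted(low, key=lambda item: item[1])

# Collapsing to one number hides where the response fell short.
holistic_like = round(mean(analytic_scores.values()))

print(holistic_like)                      # 3
print(weakest_criteria(analytic_scores))  # [('formatting', 1), ('completeness', 2)]
```

The single number matches the "3 out of 5" case above, while the per-criterion view points directly at formatting and completeness.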

The Key Insight: Rubric Structure Is Not Fixed

One of the most important things to understand early is that rubrics do not have a single, universal format. Different projects use different components, different scoring scales, and different terminology.

You will encounter rubrics that look quite different from one project to the next. Some will have five components per criterion; others will have two. Some will use numerical weights; others will not. Some will include rationale fields; others will not.

This is normal. The skill is understanding what each component does so you can adapt to any rubric format you encounter.

2.0.2 Rubric Components: What Rubrics Can Contain

Components Vary by Project

A rubric always includes criteria: the specific aspects being evaluated. Beyond that, the components a rubric includes depend entirely on the project. Here is a comprehensive taxonomy of components you may encounter:

Core Components

Criteria

The individual aspects or requirements being evaluated. Always present in every rubric. Each criterion identifies one specific thing to assess.

Criterion Description

A detailed explanation of what the criterion is evaluating and what constitutes meeting it. Descriptions define the standard against which the response is measured.

Analytical Components

Rationale

The rationale explains how you arrived at your criterion description: it traces your reasoning back to its source.

This is distinct from justification (which you learned in Module 1.6). The difference:

Concept	What It Answers	Who Uses It	When
Rationale	"How did I derive this criterion description?"	Rubric designer	During rubric creation
Justification	"Why did I assign this specific score?"	Evaluator	During evaluation

Types of rationale:

Prompt-stated: The requirement is explicitly stated in the prompt ("The prompt asks for exactly 3 examples, therefore this criterion checks for 3 examples")
Document-referenced: The requirement comes from reference material ("The style guide specifies AP format for dates")
Calculation-based: The correct answer derives from a calculation ("Using the Pythagorean theorem: a² + b² = c², so c = √(9 + 16) = 5")
Domain-standard: The requirement reflects established professional practice ("Medical dosage follows standard pharmacological guidelines")
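One way to keep the rationale attached to its criterion is to store both in the same record. The sketch below is an assumption about how such a record might look, not a format any particular project uses; the class and field names (`Criterion`, `cid`, `rationale_type`) are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class RationaleType(Enum):
    """The four rationale types listed above."""
    PROMPT_STATED = "prompt-stated"
    DOCUMENT_REFERENCED = "document-referenced"
    CALCULATION_BASED = "calculation-based"
    DOMAIN_STANDARD = "domain-standard"

@dataclass
class Criterion:
    cid: str                       # hypothetical identifier, e.g. "C1"
    description: str               # what the criterion checks
    rationale_type: RationaleType  # where the requirement comes from
    rationale: str                 # how the description was derived

c = Criterion(
    cid="C1",
    description="Response provides exactly 3 examples",
    rationale_type=RationaleType.PROMPT_STATED,
    rationale="The prompt asks for exactly 3 examples.",
)

print(c.rationale_type.value)  # prompt-stated
```

Recording the type alongside the free-text rationale makes it easy to see at a glance which criteria rest on the prompt itself and which rest on outside references or domain practice.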

Dependencies

Dependencies note when the correct evaluation of one criterion requires the correct solution of another criterion.

A criterion is dependent on another only when its correct solution requires the correct solution of the other, not merely when the topics are related.

Example:

Consider a math problem where the prompt asks to find the hypotenuse of a right triangle with sides 3 and 4:

Criterion	Description	Dependencies
C1	Correctly identifies the Pythagorean theorem as the method	None
C2	Correctly calculates a² = 9	None
C3	Correctly calculates b² = 16	None
C4	Correctly calculates c = √(a² + b²) = √(9 + 16) = 5	C2, C3

C4 depends on C2 and C3 because its calculation uses their results. If C2 or C3 is wrong, C4 cannot be evaluated straightforwardly: the evaluator must decide whether to evaluate C4 based on the model's own intermediate values or the correct values.

Deciding the policy in advance. Whether a dependent criterion is graded against the true values or against the model's own upstream answer is a decision the rubric should record, not one each evaluator makes in the moment. Left unstated, two evaluators will score identical responses differently. Check whether your project specifies a policy; if it does not, state the one you used in your rationale.

Note that C1 is related to C2 and C3 (they are all part of the same problem), but C2 and C3 are not dependent on C1. Knowing the theorem's name is not required to perform the calculations correctly.
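The effect of the grading policy can be shown with a small sketch of the triangle example. The policy names `true_values` and `carry_forward`, the function `grade_c4`, and the sample response are all hypothetical; your project may name or define its policy differently.

```python
import math

# Dependency map from the table: C4 needs C2 and C3; the rest are independent.
DEPENDENCIES = {"C1": [], "C2": [], "C3": [], "C4": ["C2", "C3"]}

# Correct values for the 3-4-5 triangle.
TRUE_VALUES = {"C2": 9, "C3": 16, "C4": 5}

def grade_c4(model, policy):
    """Grade C4 under one of two assumed policies.

    'true_values'   : C4 must equal the correct answer.
    'carry_forward' : C4 must follow from the model's own upstream values.
    """
    if policy == "true_values":
        expected = TRUE_VALUES["C4"]
    elif policy == "carry_forward":
        upstream = DEPENDENCIES["C4"]
        expected = math.sqrt(sum(model[cid] for cid in upstream))
    else:
        raise ValueError(f"unknown policy: {policy}")
    return math.isclose(model["C4"], expected)

# Hypothetical response: a² is miscalculated as 6, then used consistently.
model = {"C2": 6, "C3": 16, "C4": math.sqrt(22)}

print(grade_c4(model, "true_values"))    # False
print(grade_c4(model, "carry_forward"))  # True
```

The same response passes C4 under one policy and fails it under the other, which is why the policy belongs in the rubric rather than in each evaluator's head.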

Weight

Indicates the relative importance of a criterion. Several common weight systems exist:

Method	How It Works	Example
Numerical scale	Numbers from 1-3 or 1-5	Weight: 5 (critical), Weight: 1 (minor)
Text labels	Descriptive importance markers	"Critical," "Important," "Nice-to-have"
Binary classification	Two categories only	"Primary Objective" / "Not Primary Objective"
Pass/Fail gate	Failure on this criterion = automatic failure	Safety criteria often function this way
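As an illustration of how weights can feed into a score, the sketch below maps text labels to numbers and computes the share of total weight earned. The label-to-number mapping, the met/not-met scoring, and the function name `weighted_score` are assumptions made for this example; real projects define their own scales and aggregation rules.

```python
# Assumed mapping from text labels to numerical weights.
LABEL_WEIGHTS = {"Critical": 5, "Important": 3, "Nice-to-have": 1}

# Hypothetical rubric results: each criterion is either met or not met.
criteria = [
    {"id": "C1", "label": "Critical", "met": True},
    {"id": "C2", "label": "Important", "met": False},
    {"id": "C3", "label": "Nice-to-have", "met": True},
]

def weighted_score(criteria):
    """Fraction of total weight earned by the criteria that were met."""
    total = sum(LABEL_WEIGHTS[c["label"]] for c in criteria)
    earned = sum(LABEL_WEIGHTS[c["label"]] for c in criteria if c["met"])
    return earned / total

print(round(weighted_score(criteria), 3))  # 0.667
```

Two of three criteria are met, but the score reflects that the missed one carried a third of the total weight rather than a third of the count.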

Pass/Fail Threshold

Some rubrics define a minimum score below which the overall response fails, regardless of other criteria scores. This is common in safety-critical evaluation, where a single violation can override an otherwise strong response.
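A threshold of this kind can be sketched as a check that runs before any other aggregation. The data shapes, the function name `overall_verdict`, and the specific minimum are hypothetical.

```python
def overall_verdict(scores, thresholds):
    """Fail the response if any thresholded criterion is below its minimum.

    `scores` maps criterion name to score; `thresholds` maps a subset of
    those names to the minimum acceptable score (assumed shapes).
    """
    violations = [name for name, minimum in thresholds.items()
                  if scores[name] < minimum]
    return ("fail", violations) if violations else ("pass", [])

# Hypothetical response: strong everywhere except safety.
scores = {"accuracy": 5, "completeness": 5, "formatting": 4, "safety": 1}

print(overall_verdict(scores, {"safety": 3}))  # ('fail', ['safety'])
```

The high scores on the other criteria never enter the decision: the safety violation alone determines the outcome.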

Final Deliverable

Some projects designate certain criteria as evaluating only the final output: the last, complete answer the model produces, ignoring intermediate reasoning or draft content that appeared earlier in the response.

Adapting to Any Rubric

When you start a new project, expect to encounter a unique combination of these components. Your first step should be to identify:

Which components this project's rubric uses
What each component is called in this project (terminology varies)
How criteria are structured and scored

This flexibility is a core professional skill.
