2.0 Ideal Response Description

Study Time: 2.5 hours
Prerequisites: Module 1.0 (Modality-Specific Assessment)

Learning Objectives:

  • Understand what rubrics are and how their structure varies across projects
  • Identify the components a rubric may include and their purpose
  • Create Ideal Response Descriptions (IRDs) through systematic prompt analysis
  • Use prompt decomposition to ensure comprehensive criteria coverage
  • Define quality levels from ideal to unacceptable
  • Derive meaningful criteria from prompt requirements
  • Classify criteria and understand key criteria attributes
  • Apply this analytical mindset to any task type

Introduction

Before you write a single criterion, you need to know what excellence requires.

This is the foundational principle that separates average evaluators from exceptional ones. Many evaluators jump straight into writing criteria: checking boxes, following templates, applying rules. But the best evaluators start with a different question:

"What does the prompt REQUIRE for a successful response?"

This module teaches you two foundational skills:

  1. Understanding rubrics — what they are, what they can contain, and how they vary across projects
  2. The Ideal Response Description (IRD) approach — an analytical framework for creating evaluation rubrics that actually measure what matters

2.0.1 What Is a Rubric?

Definition

A rubric is a structured scoring guide that combines evaluative criteria with quality definitions and a scoring strategy. Rubrics have been used in educational assessment for decades to make evaluation transparent, consistent, and defensible. In AI evaluation, they serve the same fundamental purpose: they define what "good" looks like and provide a systematic way to measure it.


Analytic vs Holistic Rubrics

There are two main types of rubrics:

| Type | How It Works | Best For |
|---|---|---|
| Analytic | Each criterion is scored separately | Tasks requiring detailed, granular feedback |
| Holistic | Single overall score based on general impression | Quick assessments where speed matters more than detail |

In AI evaluation, analytic rubrics dominate. The reason is practical: when individual criteria are scored separately, the resulting data provides granular signals that can be used for model fine-tuning. A holistic score of "3 out of 5" tells you a response was mediocre, but not why. An analytic rubric that scores accuracy, completeness, safety, and formatting separately tells you exactly where the model needs improvement.
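The contrast can be made concrete with a small sketch. The criterion names come from the paragraph above; the scores, the 1-5 scale, and the helper `weakest_criteria` are illustrative assumptions, and the average is only a stand-in for a holistic impression, not a rule for how holistic scores are produced.

```python
from statistics import mean

# Hypothetical per-criterion scores (1-5 scale) for a single response.
analytic_scores = {
    "accuracy": 4,
    "completeness": 1,
    "safety": 5,
    "formatting": 2,
}


def weakest_criteria(scores: dict[str, int], threshold: int = 3) -> list[str]:
    """Return the criteria scoring below the threshold, lowest score first."""
    low = [name for name, score in scores.items() if score < threshold]
    return sorted(low, key=lambda name: scores[name])


# Stand-in for a holistic score: one number, with the detail collapsed.
holistic_equivalent = round(mean(analytic_scores.values()))

print(holistic_equivalent)                # 3
print(weakest_criteria(analytic_scores))  # ['completeness', 'formatting']
```

The single number says "mediocre"; the per-criterion view shows that completeness and formatting are what pulled the response down.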


The Key Insight: Rubric Structure Is Not Fixed

One of the most important things to understand early is that rubrics do not have a single, universal format. Different projects use different components, different scoring scales, and different terminology.

You will encounter rubrics that look quite different from one project to the next. Some will have five components per criterion; others will have two. Some will use numerical weights; others will not. Some will include rationale fields; others will not.

This is normal. The skill is understanding what each component does so you can adapt to any rubric format you encounter.


2.0.2 Rubric Components — What Rubrics Can Contain

Components Vary by Project

A rubric always includes criteria — the specific aspects being evaluated. Beyond that, the components a rubric includes depend entirely on the project. Here is a comprehensive taxonomy of components you may encounter:


Core Components

Criteria

The individual aspects or requirements being evaluated. Always present in every rubric. Each criterion identifies one specific thing to assess.

Criterion Description

A detailed explanation of what the criterion is evaluating and what constitutes meeting it. Descriptions define the standard against which the response is measured.


Analytical Components

Rationale

The rationale explains how you arrived at your criterion description — it traces your reasoning back to its source.

This is distinct from justification (which you will learn in Module 1.6). The difference:

| Concept | What It Answers | Who Uses It | When |
|---|---|---|---|
| Rationale | "How did I derive this criterion description?" | Rubric designer | During rubric creation |
| Justification | "Why did I assign this specific score?" | Evaluator | During evaluation |

Types of rationale:

  • Prompt-stated: The requirement is explicitly stated in the prompt ("The prompt asks for exactly 3 examples, therefore this criterion checks for 3 examples")
  • Document-referenced: The requirement comes from reference material ("The style guide specifies AP format for dates")
  • Calculation-based: The correct answer derives from a calculation ("Using the Pythagorean theorem: a² + b² = c², so √(9 + 16) = 5")
  • Domain-standard: The requirement reflects established professional practice ("Medical dosage follows standard pharmacological guidelines")
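One way to keep rationale attached to each criterion is to store it as structured data. The sketch below is a minimal illustration, not a required format: the class names, field names, and the `missing_rationale` helper are all assumptions introduced here, and only the four rationale types come from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class RationaleType(Enum):
    """The four rationale types listed above."""
    PROMPT_STATED = "prompt-stated"
    DOCUMENT_REFERENCED = "document-referenced"
    CALCULATION_BASED = "calculation-based"
    DOMAIN_STANDARD = "domain-standard"


@dataclass(frozen=True)
class Criterion:
    """Hypothetical record for one criterion and the reasoning behind it."""
    criterion_id: str
    description: str
    rationale_type: RationaleType
    rationale: str


def missing_rationale(criteria: list[Criterion]) -> list[str]:
    """Return the IDs of criteria whose rationale text is empty."""
    return [c.criterion_id for c in criteria if not c.rationale.strip()]


example_count = Criterion(
    criterion_id="C1",
    description="Response contains exactly 3 examples",
    rationale_type=RationaleType.PROMPT_STATED,
    rationale="The prompt asks for exactly 3 examples.",
)
undocumented = Criterion(
    criterion_id="C2",
    description="Dates use AP format",
    rationale_type=RationaleType.DOCUMENT_REFERENCED,
    rationale="",
)

print(missing_rationale([example_count, undocumented]))  # ['C2']
```

A check like this makes it easy to spot criteria that were written without tracing them back to a source.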

Dependencies

Dependencies note when the correct evaluation of one criterion requires the correct solution of another criterion.

A criterion is dependent on another only when its correct solution requires the correct solution of the other — not merely when the topics are related.

Example:

Consider a math problem where the prompt asks to find the hypotenuse of a right triangle with sides 3 and 4:

| Criterion | Description | Dependencies |
|---|---|---|
| C1 | Correctly identifies the Pythagorean theorem as the method | None |
| C2 | Correctly calculates a² = 9 | None |
| C3 | Correctly calculates b² = 16 | None |
| C4 | Correctly calculates c = √(a² + b²) = √(9 + 16) = 5 | C2, C3 |

C4 depends on C2 and C3 because its calculation uses their results. If C2 or C3 is wrong, C4 cannot be evaluated straightforwardly: the evaluator must decide whether to evaluate C4 against the model's own intermediate values or against the correct values.

Note that C1 is related to C2 and C3 (they are all part of the same problem), but C2 and C3 are not dependent on C1. Knowing the theorem's name is not required to perform the calculations correctly.
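The dependency table above can be expressed as a simple mapping, which makes both points checkable. This is a sketch under stated assumptions: the function names `affected_by` and `c4_passes`, and the sample of a model that miscalculates b², are hypothetical and exist only for illustration.

```python
import math

# Dependency map from the table above: criterion -> criteria it depends on.
DEPENDENCIES = {
    "C1": [],
    "C2": [],
    "C3": [],
    "C4": ["C2", "C3"],
}


def affected_by(failed: set[str], dependencies: dict[str, list[str]]) -> set[str]:
    """Return criteria that depend, directly or indirectly, on a failed one."""
    affected: set[str] = set()
    changed = True
    while changed:
        changed = False
        for criterion, upstream in dependencies.items():
            if criterion in failed or criterion in affected:
                continue
            if any(u in failed or u in affected for u in upstream):
                affected.add(criterion)
                changed = True
    return affected


def c4_passes(model_a_sq: float, model_b_sq: float, model_c: float,
              use_model_values: bool) -> bool:
    """Check C4 against the model's own squares or against the correct ones."""
    a_sq, b_sq = (model_a_sq, model_b_sq) if use_model_values else (9, 16)
    return math.isclose(model_c, math.sqrt(a_sq + b_sq), abs_tol=1e-6)


print(affected_by({"C3"}, DEPENDENCIES))  # {'C4'}
print(affected_by({"C1"}, DEPENDENCIES))  # set()

# Hypothetical response: b² miscalculated as 12, then c = √(9 + 12).
wrong_c = math.sqrt(21)
print(c4_passes(9, 12, wrong_c, use_model_values=True))   # True
print(c4_passes(9, 12, wrong_c, use_model_values=False))  # False
```

The last two lines show the decision described above: the same response passes C4 if the step is judged on the model's own intermediate values and fails if it is judged on the correct ones. Which policy applies is a project-level choice.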


Weight

Weight indicates the relative importance of a criterion. Several common weight systems exist:

| Method | How It Works | Example |
|---|---|---|
| Numerical scale | Numbers from 1-3 or 1-5 | Weight: 5 (critical), Weight: 1 (minor) |
| Text labels | Descriptive importance markers | "Critical," "Important," "Nice-to-have" |
| Binary classification | Two categories only | "Primary Objective" / "Not Primary Objective" |
| Pass/Fail gate | Failure on this criterion = automatic failure | Safety criteria often function this way |
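To see how weights change an outcome, here is a minimal sketch that converts text labels to numbers and compares a weighted score with an unweighted one. The label-to-number mapping (5, 3, 1) and the function names are assumptions for illustration; each project defines its own scheme.

```python
# Hypothetical mapping from text labels to numeric weights.
LABEL_WEIGHTS = {"Critical": 5, "Important": 3, "Nice-to-have": 1}


def weighted_score(results: list[tuple[str, bool]]) -> float:
    """Fraction of total weight earned; each item is (weight label, passed)."""
    if not results:
        raise ValueError("results must contain at least one criterion")
    total = sum(LABEL_WEIGHTS[label] for label, _ in results)
    earned = sum(LABEL_WEIGHTS[label] for label, passed in results if passed)
    return earned / total


def unweighted_score(results: list[tuple[str, bool]]) -> float:
    """Fraction of criteria passed, ignoring weights."""
    if not results:
        raise ValueError("results must contain at least one criterion")
    return sum(1 for _, passed in results if passed) / len(results)


# The response misses the critical criterion but passes the other two.
results = [("Critical", False), ("Important", True), ("Nice-to-have", True)]

print(round(unweighted_score(results), 2))  # 0.67
print(round(weighted_score(results), 2))    # 0.44
```

Without weights the response looks acceptable (2 of 3 criteria met); with weights, missing the critical criterion dominates the result.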

Pass/Fail Threshold

Some rubrics define a minimum score below which the overall response fails, regardless of other criteria scores. This is common in safety-critical evaluation, where a single violation can override an otherwise strong response.
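A threshold of this kind can be sketched as a gate that is checked before any averaging. The function name `overall_verdict`, the 1-5 scale, the gate on "safety", and the pass mark are illustrative assumptions, not a standard scoring rule.

```python
def overall_verdict(scores: dict[str, float],
                    gates: dict[str, float],
                    pass_mark: float) -> str:
    """Return "pass" or "fail"; any gated criterion below its minimum fails."""
    for criterion, minimum in gates.items():
        if scores[criterion] < minimum:
            return "fail"  # the gate overrides all other scores
    average = sum(scores.values()) / len(scores)
    return "pass" if average >= pass_mark else "fail"


# Hypothetical response: strong overall, but with a safety violation.
scores = {"accuracy": 5, "completeness": 5, "safety": 1}

print(overall_verdict(scores, gates={}, pass_mark=3.0))             # pass
print(overall_verdict(scores, gates={"safety": 3}, pass_mark=3.0))  # fail
```

The average alone would pass this response; the gate on safety makes the single violation decisive.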


Final Deliverable

Some projects designate certain criteria as evaluating only the final output — the last, complete answer the model produces, ignoring intermediate reasoning or draft content that appeared earlier in the response.


Adapting to Any Rubric

When you start a new project, expect to encounter a unique combination of these components. Your first step should be to identify:

  1. Which components this project's rubric uses
  2. What each component is called in this project (terminology varies)
  3. How criteria are structured and scored

This flexibility is a core professional skill.

