Evaluating AI Output: Applying Critical Thinking
Module 6.
Purpose and Topics
Purpose
To help individuals assess AI-generated content critically and responsibly.
Topics/Learning Objectives
Upon completion of this module, individuals will be able to:
- Explain the importance of evaluating AI output
- Describe the RACCCA GenAI evaluation framework
- Evaluate the quality of GenAI output using RACCCA
- Evaluate AI systems and tools you generate (Agents, Tutors, etc.)

Note:The text and graphics in these modules were co-developed with the assistance of generative AI models such as OpenAI’s ChatGPT, Google’s Gemini and NotebookLM and Microsoft’s CoPilot, drawing on the indicated reference materials. Materials were then edited for relevancy and accuracy.
Introduction
The Importance of Evaluating AI Output
Generative AI (GenAI) systems are increasingly used in medical education to generate study materials, summarize clinical cases and explain complex concepts. While these tools can produce fluent, well-structured responses, their outputs are not inherently accurate, complete or appropriate for a given clinical or educational context.
This creates a critical challenge: AI-generated content is often more persuasive than it is reliable. Because outputs are articulated with confidence and coherence, learners and educators may overestimate their validity—blurring the distinction between plausibility and truth. In a field where inaccuracies can shape clinical reasoning and decision-making, uncritical acceptance of AI output introduces meaningful risk.
When we prompt a GenAI model, it produces a response—referred to as its output—which may take the form of text, data, media, computer code or synthesized information. Regardless of format, all outputs require evaluation. Developing the ability to critically assess these responses is therefore not optional; it is a foundational competency for responsible and effective use of AI in medicine.
A structured approach to evaluation helps mitigate these risks while also supporting deeper learning.
The RACCCA framework provides a practical method for systematically assessing the quality of GenAI outputs. It empowers learners and educators to:
Avoid overreliance on unverified or misleading responses
Engage in iterative prompt refinement through feedback loops
Strengthen discernment between high- and low-quality outputs
Develop habits of critical appraisal that transfer to clinical reasoning
Used in classroom activities or self-directed study, RACCCA functions as a scaffold for both AI literacy and domain learning. Over time, learners begin to internalize its criteria, leading to more confident, reflective and responsible use of AI tools.
Understanding the capabilities and limitations of GenAI requires more than intuition—it requires structured, repeatable evaluation methods. RACCCA offers one such framework.
Part 1
Describe the RACCCA GenAI Evaluation Framework
RACCCA Overview
RACCCA provides a framework for critically analyzing AI-generated content, especially in high-stakes environments. This framework was developed specifically to support non-programmers in making informed judgments about the quality, safety, and usefulness of AI outputs.
The RACCCA quality check is a method for evaluating AI output. This checklist may be used to form a brief rubric. It includes six elements as described in the following table.This framework is especially useful in an educational setting because it integrates both evaluation and metacognition. It trains individuals to reflect not only on what the AI produced, but why it produced that result—and what might improve future outputs. For example, if an AI-generated test question omits important clinical considerations, an individual using RACCCA can identify that the output was relevant but incomplete and refine the prompt accordingly.
| Criterion | Key Question | What to Look For |
|---|---|---|
| R. Relevance | Does the output address the prompt? | Aligned with the task; includes all requested elements. |
| A. Accuracy | Is the information correct? | Factually correct; no errors or misleading claims. |
| C. Clarity | Is it easy to understand? | Well-organized; appropriate language. |
| C. Completeness | Is anything important missing? | Thorough coverage of key points. |
| C. Conciseness | Is it appropriately concise? | Brief and to the point; no unnecessary detail. |
| A. Appropriateness | Is it suitable for the context? | Fits the audience and setting. |
| High-quality AI output meets all six criteria. | ||
| Criterion | Score (1-5) | Rationale |
|---|---|---|
| Relevance | 4 | The AI mostly addressed the task, summarizing pathophysiology and treatment. However, it omitted a few prompt-specified details, like typical first-line therapy. |
| Accuracy | 2 | Several factual inaccuracies: diabetic nephropathy is not the most common pediatric cause (it's minimal change disease); also, ACE inhibitors are not the first-line treatment. The source ("Mayo Clinic") is mentioned vaguely, without a citation or link. |
| Clarity | 5 | The output is well-written and understandable. |
| Completeness | 3 | Key information is missing (e.g. typical cause: minimal change disease, clear role of corticosteroids as first-line, when biopsy is indicated. |
| Conciseness | 4 | Overall concise, though slightly rushed in summarizing treatment and lacking depth in therapeutic decision-making. |
| Appropriateness | 3 | The tone is appropriate, but the claim about prognosis ("generally poor") is overly generalized and potentially misleading in the pediatric context. The vague reference to Mayo Clinic undermines trust. |
Example Prompt RACCCA Evaluation
Example Prompt Given to ChatGPT:
“Summarize the pathophysiology and treatment of nephrotic syndrome in children. Include typical causes and first-line management strategies.”
AI Output (Excerpt):
“Nephrotic syndrome in children is usually caused by diabetic nephropathy, which leads to increased glomerular permeability and protein loss. Treatment typically involves ACE inhibitors, corticosteroids, and a kidney biopsy. The prognosis is generally poor without aggressive immunosuppression. According to the Mayo Clinic, this condition requires early detection and management.”
First Round
First Impressions of the Output
- What was good about the response?
The summary used accessible language and clearly explained the pathophysiology (glomerular permeability and protein loss). It showed good organization and logical flow. - What was problematic?
The content included clinical inaccuracies and misleading generalizations (e.g., diabetic nephropathy as a cause, poor prognosis, biopsy in all cases). There was no proper source citation and the Mayo Clinic mention lacked specificity.
Improving Prompt Output
- How could the prompt be improved to generate a more accurate and complete response?
Add this to the original AI prompt:“Please cite your sources. Focus on typical causes such as minimal change disease, and clarify first-line treatment options. Avoid generalized statements and indicate when biopsy is or isn’t required.”
This would likely yield a more nuanced and accurate summary grounded in clinical practice guidelines.
Topics for Discussion with Students
Fluency ≠ factuality—individuals may trust AI outputs that sound confident.
This type of cognitive offloading, known as automation bias, can lead to a decline in critical thinking skills.- What risks would arise if an individual relied on this summary for an oral presentation or pre-round preparation?
Example Scenario:
I’m building something with AI, for example, an AI tutor agent on cardiology.
- How do I know if it’s actually working?
- And how do I explain what’s wrong with it when it doesn’t?
Start Here—Use the Terminology of Validation for System Validation
In engineering, healthcare and software development, validation terminology helps creators answer one big question:
- Does this thing I built do what it’s supposed to do, and is it safe and useful for others?
If you’re building anything with AI, you’ll need a way to evaluate your design beyond just your gut feeling. This is where the validation vocabulary comes in.
| Phrase | Example Application |
|---|---|
| Requirement Specification | Defines scope: Covers nephron structure, ion transport, and hormonal regulation using retrieval-style prompts. |
| Verification | Functions as intended: Generates correct answers and consistently prompts follow-up questions. |
| Validation | Demonstrates real-world value: Improves peer learning and performance in renal CBL. |
| User Acceptance Testing | End users report usability and trust: Recommended by learners in group study settings. |
| System Testing | Performs reliably in context. Integrates into shared GPT interface without errors across user types. |
Example: Building a GPT to Teach Renal Physiology
| Term | Example Application |
|---|---|
| Requirement Specification | "Covers nephron structure, ion transport, and hormonal regulation with retrieval-style prompts." |
| Verification | "It gives correct answers and prompts follow-up questions every time." |
| Validation | "My classmates say it helped them prep for our renal CBL—and got better quiz scores". |
| User Acceptance Testing | "Three students used it in their study groups and said they'd recommend it to others." |
| System Testing | "Still works when integrated into our shared study GPT interface—no bugs across user types." |
Why This Matters
Many GenAI projects fail not because of technical bugs, but because:
No one clearly defined what the tool was supposed to do.
It doesn’t actually help the learner or user in context.
The creators didn’t ask others to try it before rolling it out.
It worked in isolation, but not as part of a real learning system.
Knowing how to talk about specs, verification and validation puts you in the mindset of someone building for real-world clinical or educational settings.
Try This in Your Project Work
If you're designing anything with AI, such as…
An algorithm that suggests clinical next steps,
A rubric generator for writing feedback,
A chatbot that explains high-yield anatomy,
Ask yourself:
What exactly should this do?
Does it meet those specs?
Would you trust it if someone else built it?
Have others tested it and found it useful?
Does it work beyond your computer—can it scale?
Key Takeaways
- Evaluating AI output is not simply a technical skill, it is a critical thinking practice. As generative AI becomes more integrated into medical education and clinical workflows, the ability to question, verify and refine AI-generated content is essential for maintaining accuracy, safety and trust.
- The RACCCA framework provides a structured approach to this process, helping you move beyond surface-level impressions and assess output across multiple dimensions. Over time, applying these criteria supports the development of discernment—allowing you to recognize not just when AI is helpful, but when it may be incomplete, misleading or inappropriate.
- As you begin using or building AI-enabled tools, this evaluative mindset extends further. Whether reviewing a single response or assessing an entire system, the same principle applies: outputs must be examined, assumptions must be questioned, and usefulness must be demonstrated in context.
- Ultimately, responsible use of AI in medicine requires maintaining ownership of judgment. AI can assist, accelerate and augment—but it does not replace the need for critical appraisal. Developing this habit ensures that AI remains a tool that supports, rather than undermines, clinical reasoning and learning.
Supplementary Materials & Resources
Supplementary Materials
Coming Soon!
Resources
Articles
Bandi, A., Adapa, P. V. S. R., & Kuchi, Y. E. V. P. K. (2023). The power of generative ai: A review of requirements, models, input–output formats, evaluation metrics, and challenges. Future Internet, 15(8), 260.
- IEEE Standards Association. (2016). IEEE Standard for System, Software, and Hardware Verification and Validation. IEEE Std, 1012-2016.
Videos
- Artificial Intelligence (AI) in Academic Medicine Webinar Series
- AAMC’s Toolkit for Responsible Implementation
Websites
Note: The text and graphics in these modules were co-developed with the assistance of generative AI tools such as OpenAI’s ChatGPT, Google’s Gemini and NotebookLM and Microsoft’s CoPilot, drawing on the indicated reference materials. The materials were then edited for relevance and accuracy.