AI Evals para AI Product Managers & Testers (EVA)

Ensure the quality of your AI solutions: the critical skill of the LLM era.

Lab
28 hours
In Spanish

July 7th - 30th

USD 800 (regular price USD 1000)
Early bird price until June 30

From USD 295 for Labs subscribers

An 8-session live training for product and testing professionals who want to adopt the practice of systematic AI product evaluation. We work through a real end-to-end case on the platforms the market is using today, combining applied theory, demos with operational tools, and hands-on exercises so every participant can learn by doing.

What is it about?

Products that integrate LLMs don't break like traditional software. They fail silently, they fail differently each time, and they often fail to a greater or lesser degree depending on the model, the prompt, or the day. That probabilistic nature strips product and quality teams of the most basic tool they had: a test that passes or fails consistently. Without evals, shipping decisions become a mix of intuition, cherry-picked demos, and discussions not backed by data.


This program teaches the discipline that is replacing that intuition with evidence. Across 8 sessions you'll learn to generate synthetic datasets, to read outputs and build rigorous failure taxonomies, to combine deterministic evals with LLM-as-judge, to calibrate judges against human criteria, to evaluate systems with RAG and agents, and to run all of this in production with continuous monitoring.


The program is designed for two profiles that come to evals from different places but share the same language. If you come from the product world, you'll gain the quantitative judgment you need to decide when an AI system is ready for real users. If you come from the testing world, you'll transform your discipline of systematic failure analysis into one of the most in-demand skills today: AI Quality.


The platform chosen for the course lets you cover most of the journey without writing code.

Tools

We will use the following tool:

OpenAI Platform

Who is this for?

This is a program designed for two audiences that need to reach the same place from different starting points.

Product Professionals

Product Manager, Product Owner, Product Lead, or equivalent roles: if you're leading or about to lead an AI product, you'll walk away with the judgment to decide with evidence when a system is ready, which quality dimensions matter in your case, how to communicate risk to stakeholders, and how to specify evals so your technical team can implement them. You stop relying on cherry-picked demos to back shipping decisions.

Testing Professionals

QA Engineer, Tester, Quality Lead, or related roles: your discipline of systematic failure analysis, calibration across evaluators, and rigor in reporting is exactly the foundation on which AI Quality is built. You'll walk away with a concrete roadmap to transform that experience into one of the most in-demand skills today: designing, validating, and operating evaluations for probabilistic systems.

In both cases we assume prior experience in product or testing. We don't start from zero: we start from your professional judgment and add the tools and mental model that AI demands.

By the end of the program you will be able to:

Distinguish which behaviors of an AI system are evaluated with automated criteria and which require an LLM-based judge.

Generate diverse and representative test datasets from the dimensions of the problem, without depending on real traffic.

Read outputs systematically and build a taxonomy of failure modes that feeds the entire evaluation cycle.

Decide, with sound judgment, when a lightweight evaluation layer is enough and when the problem calls for an AI-based solution (LLM-as-judge).

Design rubrics and operate LLM-based judges, recognizing and mitigating the typical biases they are subject to.

Validate that a judge is aligned with human criteria before trusting its measurements.

Design evaluation suites for RAG systems that cover what is retrieved, what is answered, and what happens when there is no context.

Evaluate agents and multi-step systems by examining the full trajectory, and identify when the type of evaluation you've been doing is no longer enough.

Operate evals in production: continuous monitoring, sampling, regression detection, and guardrails as an operational layer.

Communicate quality and risk to non-technical stakeholders with quantitative backing.

Curriculum

1

Kick-Off

Early access to set up your work environment and get familiar with the real case that runs through the entire program: an AI application that we will put, session by session, through a complete evaluation process.
2

Fundamentals

What an eval is, when it enters the development cycle of an AI application, and why traditional QA practices aren't enough to evaluate probabilistic systems. This session is designed to produce the mental model shift the rest of the course needs: moving from a world where a test passes or fails to one where the result is a distribution, datasets live and change, and the decision is never about quality alone. We work on the first practical skill of the course: how to build a diverse and representative set of test cases when you don't yet have real traffic.
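
To make the dimension-based approach concrete, here is a minimal Python sketch (the course itself stays largely no-code); the dimensions and their values are hypothetical ones for a customer-support assistant, not part of the course case.

```python
from itertools import product

# Hypothetical dimensions for a customer-support assistant; real dimensions
# come from analyzing your own product, users, and known risks.
personas = ["new user", "power user", "frustrated customer"]
intents = ["billing question", "bug report", "feature request"]
phrasings = ["short and vague", "long and detailed", "contains typos"]

# The cross product yields a small but deliberately diverse set of
# test-case specifications, before any real traffic exists.
test_case_specs = [
    {"persona": p, "intent": i, "phrasing": f}
    for p, i, f in product(personas, intents, phrasings)
]

print(len(test_case_specs))  # 27 combinations to seed the dataset
```

Turning each combination into a concrete prompt, by hand or with an LLM, is what produces a synthetic dataset that covers the space instead of clustering around the obvious cases.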
3

Failure taxonomy

With the test cases from the previous session in hand, it's time to run them against a model and observe what happens. Here you learn the skill that most sets apart someone who understands evals from someone who just configures them: reading outputs systematically and building a rigorous taxonomy of failure modes. This taxonomy is the input that feeds everything that comes after.
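
A minimal sketch of what that reading looks like in practice, assuming hypothetical free-text annotations; the point is that the taxonomy emerges from grouping recurring observations, not from a predefined checklist.

```python
from collections import Counter

# Illustrative only: each reviewed output gets a short open-ended note,
# and recurring notes are later grouped into named failure modes.
annotations = [
    "invented a refund policy that does not exist",
    "ignored the user's second question",
    "invented a refund policy that does not exist",
    "answered in English instead of Spanish",
]

# The taxonomy is the set of recurring failure modes with their counts;
# these labels drive which evals get built in the following sessions.
taxonomy = Counter(annotations)
for failure_mode, count in taxonomy.most_common():
    print(f"{count}x  {failure_mode}")
```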
4

Deterministic Evals

The program's first layer of automation. Here we work on criteria that can be expressed unambiguously and verified without anyone's opinion: format, constraints, matches against reference values. The session covers which quality dimensions fall into this layer, when it's enough to stay here, and when the problem calls for stepping up a level. It closes with a core skill for PMs who work with technical teams: how to specify an evaluation criterion so that an engineer can implement it without having to ask again.
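
A minimal sketch of what a deterministic eval can look like, assuming hypothetical requirements (a JSON response with `answer` and `sources` fields, no leaked email addresses, an 800-character limit):

```python
import json
import re

# A deterministic eval: unambiguous checks that need no one's opinion.
# The required fields and the length limit are illustrative assumptions.
def evaluate_output(output: str) -> dict:
    results = {}
    # Format: the response must be valid JSON with the expected fields.
    try:
        data = json.loads(output)
        results["valid_json"] = True
        results["has_required_fields"] = (
            isinstance(data, dict) and {"answer", "sources"} <= data.keys()
        )
    except json.JSONDecodeError:
        results["valid_json"] = False
        results["has_required_fields"] = False
    # Constraint: no email addresses may leak into the answer.
    results["no_email_leak"] = re.search(r"\S+@\S+", output) is None
    # Constraint: stay under an assumed 800-character limit.
    results["under_length_limit"] = len(output) <= 800
    return results
```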
5

Probabilistic Evals

The heart of the program. When a criterion is subjective, contextual, or hard to capture with rules, the path is to use a model as an evaluator (LLM-as-judge). The session teaches how to write clear rubrics, how to choose between a categorical and a numerical judge based on the decision the result needs to support, and how to recognize the typical biases these judges are subject to (preference for longer responses, sensitivity to order, anchoring on the familiar) along with concrete techniques to mitigate them.
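
As an illustration, here is a minimal categorical judge built on the OpenAI Python SDK; the model name, rubric wording, and PASS/FAIL scheme are assumptions for the sake of the example, not a prescription from the course.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A categorical LLM-as-judge with an explicit rubric.
RUBRIC = """You are evaluating a customer-support answer.
Label it PASS only if ALL of the following hold:
1. It answers the question that was actually asked.
2. It does not invent policies, prices, or features.
3. It is polite and written in the user's language.
Reply with exactly one word: PASS or FAIL."""

def judge(question: str, answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question:\n{question}\n\nAnswer:\n{answer}"},
        ],
    )
    return response.choices[0].message.content.strip()
```

A categorical verdict like this is usually easier to calibrate against human labels than a 1-to-10 score, which is why it tends to be the default when the result has to support a ship/no-ship decision.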
6

Evaluating the Evaluators

An unvalidated judge (LLM-as-judge) is an opinion disguised as a measurement. This session teaches how to confirm that an LLM-as-judge is effectively aligned with human criteria: calibration sessions, cross-annotation, and inter-rater agreement metrics. Each participant iterates on their judge until achieving sufficient alignment and learns to document the validation process. It's a highly valued skill in the AI Quality market and transfers almost directly from the calibration culture of traditional testing.
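
A common inter-rater agreement metric is Cohen's kappa; here is a minimal sketch with hypothetical labels, using scikit-learn:

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative labels: the same 8 outputs rated by a human reviewer and
# by the LLM-as-judge (PASS/FAIL). Real calibration sets are larger.
human = ["PASS", "FAIL", "PASS", "PASS", "FAIL", "PASS", "FAIL", "PASS"]
judge = ["PASS", "FAIL", "PASS", "FAIL", "FAIL", "PASS", "FAIL", "PASS"]

# Cohen's kappa measures agreement beyond chance; a common (though
# context-dependent) reading is that values above ~0.8 indicate strong
# alignment, while lower values mean the rubric needs another iteration.
kappa = cohen_kappa_score(human, judge)
print(f"kappa = {kappa:.2f}")
```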
7

RAG System Evaluation

Systems that retrieve information before generating a response require specific evaluations that don't appear in purely generative applications. Was what the system retrieved correct? Is the response backed by what was retrieved, or did the model add things? What happens when there is no available information and the model answers anyway? The session teaches how to design these evaluations from a product perspective, without getting into the implementation details of the retriever.
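
A sketch of how those three questions can translate into three separate checks; the case fields and the `judge()` helper are hypothetical placeholders:

```python
# Three RAG-specific questions, expressed as three separate checks.
def evaluate_rag_case(case, retrieved_ids, retrieved_texts, answer, judge):
    results = {}
    # 1. Retrieval: did the expected document come back at all?
    results["retrieval_hit"] = case["expected_doc_id"] in retrieved_ids
    # 2. Groundedness: is the answer supported by the retrieved passages?
    #    This is subjective, so it is delegated to an LLM-as-judge.
    results["grounded"] = judge(answer=answer, context="\n".join(retrieved_texts))
    # 3. Honest refusal: when there is no relevant context, the system
    #    should decline instead of answering anyway.
    if case.get("expects_refusal"):
        results["refuses_without_context"] = (
            "don't have that information" in answer.lower()
        )
    return results
```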
8

Agent Evaluation

Systems that use external tools or take multiple steps before reaching a response aren't evaluated the same way as a single response. What matters is the full trajectory: whether the agent reached the goal, whether the path was reasonable, how it recovered when something failed along the way. The course case extends to this scenario, and we work through what changes compared to everything covered so far. The session discusses honestly when the type of evaluation you've been doing is no longer enough and what the market does in those cases.
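
A sketch of what trajectory-level checks can look like, assuming a hypothetical trace format (a list of steps with `tool` and `ok` fields):

```python
# Trajectory-level checks for a single agent run.
def evaluate_trajectory(trace, case):
    steps = trace["steps"]
    return {
        # Did the agent reach the goal at all?
        "goal_reached": trace["final_state"] == case["expected_final_state"],
        # Was the path reasonable? Here: within an assumed step budget.
        "within_step_budget": len(steps) <= case.get("max_steps", 10),
        # Did it only use tools it was allowed to use?
        "only_allowed_tools": all(s["tool"] in case["allowed_tools"] for s in steps),
        # After any failed step, did some later step succeed (recovery)?
        "recovered_from_errors": all(
            any(later["ok"] for later in steps[i + 1:])
            for i, s in enumerate(steps) if not s["ok"]
        ),
    }
```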
9

Evals in Production

The closing session. Up to this point we worked against controlled test sets; in this session we move to the real world: continuous monitoring, sampling in production, regression detection when a prompt or model changes, and guardrails as an operational layer. We also cover how to communicate risk and quality to non-technical stakeholders, a core skill for PMs and for anyone who has to justify shipping decisions to the business.
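
A minimal sketch of the sampling-plus-regression-detection idea; the sample rate, threshold, and `run_judge()` helper are assumptions:

```python
import random

SAMPLE_RATE = 0.05           # evaluate roughly 5% of production traffic
REGRESSION_THRESHOLD = 0.03  # alert if the pass rate drops by 3 points

def maybe_evaluate(interaction, run_judge, results_log):
    # Sample a small fraction of live traffic and score it with the judge.
    if random.random() < SAMPLE_RATE:
        results_log.append(run_judge(interaction))  # 1 = pass, 0 = fail

def check_for_regression(baseline_pass_rate, results_log):
    # Flag when the sampled pass rate falls below the pre-change baseline.
    if not results_log:
        return False
    current_pass_rate = sum(results_log) / len(results_log)
    return baseline_pass_rate - current_pass_rate > REGRESSION_THRESHOLD
```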

Frequently Asked Questions

Everything you need to know about this course

Your Instructor

Martin Alaimo

Trainer, consultant, and educator dedicated to the creation of Digital Products and Business Agility. To date, he has worked with more than 200 organizations and supported over 8,000 professionals in their career development journeys.

His approach is situational and hands-on, delivering immersive learning through innovative experiences that enable practical, immediately applicable outcomes—especially in areas often overlooked by traditional academia.

He has spoken at more than 30 conferences across the United States and 14 countries in Latin America and Europe, and is the author of six books on product and digital innovation.

His most recent book, AI Strategy Workshop, provides tools to move beyond the “feature factory” mindset and integrate artificial intelligence with strategic intent and real business impact.

As part of his commitment to innovation, he is an organizing member of Product Tank, the world’s largest Product Management community.

He is one of the few experts to hold the highest-level certifications in Agile practices: Certified Scrum Trainer (CST), Certified Enterprise Coach (CEC), Certified Team Coach (CTC), Certified Agile Leadership Educator (CAL Educator), and Path to CSP Educator.

Explore his complete professional profile and thought leadership activities on LinkedIn.

Ready to step forward?

You're not starting from zero. You're choosing to advance with purpose.

Learning better is also a decision.