AI Evals for AI Product Managers & Testers (EVA)
Ensure the quality of your AI solutions: the critical skill of the LLM era.
July 7th - 30th
An 8-session live training for product and testing professionals who want to adopt the practice of systematic AI product evaluation. We work through a real end-to-end case on the platforms the market is using today, combining applied theory, demos with operational tools, and hands-on exercises so every participant can learn by doing.
What is it about?
Products that integrate LLMs don't break like traditional software. They fail silently, they fail differently each time, and how badly they fail can shift with the model, the prompt, or the day. That probabilistic nature strips product and quality teams of the most basic tool they had: a test that passes or fails consistently. Without evals, shipping decisions become a mix of intuition, cherry-picked demos, and discussions not backed by data.
This program teaches the discipline that is replacing that intuition with evidence. Across 8 sessions you'll learn to generate synthetic datasets, to read outputs and build rigorous failure taxonomies, to combine deterministic evals with LLM-as-judge, to calibrate judges against human criteria, to evaluate systems with RAG and agents, and to run all of this in production with continuous monitoring.
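To give a flavor of the distinction between those two kinds of checks, here is a minimal illustrative sketch using the OpenAI Python SDK (not course material; the model name, rubric wording, and thresholds are placeholder assumptions): a deterministic eval verifies a property of the output directly in code, while an LLM-as-judge eval asks another model to grade the output against a written rubric.

```python
# Illustrative sketch only. The model name, rubric wording, and thresholds are
# placeholder assumptions, not the course's reference implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def deterministic_eval(output: str) -> bool:
    """Deterministic check: pass/fail based on properties we can verify in code."""
    return "refund" in output.lower() and len(output) <= 800


def llm_judge_eval(question: str, output: str) -> str:
    """LLM-as-judge: another model grades the output against a written rubric."""
    rubric = (
        "Grade the ANSWER to the QUESTION as PASS or FAIL. "
        "PASS only if it is relevant, polite, and does not promise anything "
        "the company cannot guarantee. Reply with a single word: PASS or FAIL."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        temperature=0,
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": f"QUESTION: {question}\nANSWER: {output}"},
        ],
    )
    return response.choices[0].message.content.strip()
```

You won't be asked to write snippets like this during the program; the point is to be able to read them as a specification and discuss them with your technical team.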
The program is designed for two profiles that come to evals from different places but share the same language. If you come from the product world, you'll gain the quantitative judgment you need to decide when an AI system is ready for real users. If you come from the testing world, you'll transform your discipline of systematic failure analysis into one of the most in-demand skills today: AI Quality.
The platform chosen for the course lets you cover most of the journey without writing code.
Tools
We will use the following tool: OpenAI Evals.
Who is this for?
Product Professionals
Product Manager, Product Owner, Product Lead, or equivalent roles: if you're leading or about to lead an AI product, you'll walk away with the judgment to decide with evidence when a system is ready, which quality dimensions matter in your case, how to communicate risk to stakeholders, and how to specify evals so your technical team can implement them. You stop relying on cherry-picked demos to back shipping decisions.
Testing Professionals
QA Engineer, Tester, Quality Lead, or related roles: your discipline of systematic failure analysis, calibration across evaluators, and rigor in reporting is exactly the foundation on which AI Quality is built. You'll walk away with a concrete roadmap to transform that experience into one of the most in-demand skills today: designing, validating, and operating evaluations for probabilistic systems.
In both cases we assume prior experience in product or testing. We don't start from zero: we start from your professional judgment and add the tools and mental model that AI demands.
By the end of the program you will be able to:
Distinguish which behaviors of an AI system are evaluated with automated criteria and which require an LLM-based judge.
Generate diverse and representative test datasets from the dimensions of the problem, without depending on real traffic.
Read outputs systematically and build a taxonomy of failure modes that feeds the entire evaluation cycle.
Decide, with good judgment, when a lightweight evaluation layer is enough and when the problem calls for an AI-based solution (LLM-as-judge).
Design rubrics and operate LLM-based judges, recognizing and mitigating the typical biases they are subject to.
Validate that a judge is aligned with human criteria before trusting its measurements (a small illustrative sketch follows this list).
Design evaluation suites for RAG systems that cover what is retrieved, what is answered, and what happens when there is no context.
Evaluate agents and multi-step systems by looking at the full trajectory, and identify when the kind of evaluation you've been doing is no longer enough.
Operate evals in production: continuous monitoring, sampling, regression detection, and guardrails as an operational layer.
Communicate quality and risk to non-technical stakeholders with quantitative backing.
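What does validating that a judge is aligned with human criteria look like in practice? As a small illustrative taste (the labels below are invented, and the two metrics shown are one common choice rather than a prescribed method), the idea is to compare the judge's verdicts against human labels on the same outputs and look at both raw agreement and chance-corrected agreement:

```python
# Illustrative sketch: comparing an LLM judge's verdicts against human labels.
# The labels are made up; in practice they come from a human-reviewed sample.
from sklearn.metrics import cohen_kappa_score

human_labels = ["PASS", "FAIL", "PASS", "PASS", "FAIL", "PASS", "FAIL", "FAIL"]
judge_labels = ["PASS", "FAIL", "PASS", "FAIL", "FAIL", "PASS", "PASS", "FAIL"]

# Raw agreement: how often the judge and the human reviewer say the same thing.
agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)

# Cohen's kappa: agreement corrected for what would happen by chance.
kappa = cohen_kappa_score(human_labels, judge_labels)

print(f"Raw agreement: {agreement:.0%}")  # 75% in this toy example
print(f"Cohen's kappa: {kappa:.2f}")      # 0.50: this judge still needs calibration work
```

Only when numbers like these hold up on a representative sample does it make sense to trust the judge's measurements at scale.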
Curriculum
1 Kick-Off
2 Fundamentals
3 Failure Taxonomy
4 Deterministic Evals
5 Probabilistic Evals
6 Evaluating the Evaluators
7 RAG System Evaluation
8 Agent Evaluation
9 Evals in Production
Frequently Asked Questions
Everything you need to know about this course
Do I need to know how to code?
No. The main platform we use lets you configure most evaluations from the interface, without needing to write code. In the sessions where code appears, we read and discuss it as a specification (something a PM needs to be able to read to have conversations with their technical team), not as an implementation task. If you come from product, you'll be able to follow the entire course; if you come from testing and have a technical background, you'll be able to go as deep as you want.
What do I need to have ready before the program starts?
An OpenAI account and the credits we'll use throughout the program, with an approximate budget of USD 20 to USD 30 total. During the Kick-Off we share a step-by-step tutorial so you arrive with everything set up.
I already work in QA or testing. Will this course be too basic for me?
Your experience is the foundation this course is built on, not something that makes you overqualified for it. What you gain is the specific mental model that evaluating probabilistic systems demands, the concrete platforms the AI Quality market uses, and the practice of calibrating LLM judges, which is exactly where your discipline of calibration across evaluators becomes enormously valuable. You'll recognize much of the language and learn to apply it on new ground.
Can I take the course if I have no testing background?
Yes. The course doesn't assume prior training in traditional testing. The concepts that come from the QA world (failure taxonomy, calibration across evaluators, rigorous reporting) are introduced from scratch, in the context of AI. If you already have product judgment, the course equips you to combine it with the quantitative rigor that AI shipping decisions require.
Which platform does the course use, and why?
We chose the program's platform based on two criteria: 1) that it lets you cover most of the journey without fighting with tooling, and 2) that it's representative of what the AI Quality market uses today: OpenAI Evals.
Your Instructor
Martin Alaimo
Trainer, consultant, and educator dedicated to the creation of Digital Products and Business Agility. To date, he has worked with more than 200 organizations and supported over 8,000 professionals in their career development journeys.
His approach is situational and hands-on, delivering immersive learning through innovative experiences that enable practical, immediately applicable outcomes—especially in areas often overlooked by traditional academia.
He has spoken at more than 30 conferences across the United States and 14 countries in Latin America and Europe, and is the author of six books on product and digital innovation.
His most recent book, AI Strategy Workshop, provides tools to move beyond the “feature factory” mindset and integrate artificial intelligence with strategic intent and real business impact.
As part of his commitment to innovation, he is an organizing member of Product Tank, the world’s largest Product Management community.
He is one of the few experts to hold the highest-level certifications in Agile practices: Certified Scrum Trainer (CST), Certified Enterprise Coach (CEC), Certified Team Coach (CTC), Certified Agile Leadership Educator (CAL Educator), and Path to CSP Educator.
Explore his complete professional profile and thought leadership activities on LinkedIn.
July 7th - 30th
Annual Subscription
- 2 live trainings for only USD 1295
- Includes on-demand webinar library
- Includes Flash Workshops at no extra cost
- 3rd training onwards: USD 295 each
Ready to step forward?
You're not starting from zero. You're choosing to advance with purpose.
Learning better is also a decision.