The Story Paper: Scalable Certification of Reasoning in the Age of AI
Extending Triple-Jump / OSCE-Style Assessment Through AI-Mediated Governance
Advances in artificial intelligence are rapidly changing how learning occurs. Individuals can now access tools that assist with explanation, synthesis, and problem-solving across a wide range of domains. As a result, learning is becoming more scalable, adaptive, and widely accessible.
This shift has important implications for professional certification.
In high-stakes fields such as medicine, law, engineering, and aviation, competence depends not only on knowledge, but on the ability to reason under conditions of uncertainty. Established assessment models—including oral examinations, Objective Structured Clinical Examinations (OSCEs), and related formats—are designed to evaluate this capability directly. They rely on contextual scenarios, interactive probing, and expert judgment to assess how candidates think.
These approaches are effective. However, they do not scale easily.
As participation in professional pathways grows—and as learning increasingly occurs in AI-assisted environments—a gap is emerging. While learning can scale, the certification of reasoning remains constrained by resource-intensive methods. At the same time, the presence of AI-assisted tools introduces new ambiguity in assessment, making it more difficult to distinguish between assisted output and independently generated reasoning.
Addressing this gap requires more than incremental adjustments to existing methods. Expanding standardized testing, strengthening proctoring, or restricting access to tools can help in specific contexts, but they do not fundamentally extend the ability to evaluate reasoning at scale.
A potential direction involves re-examining how assessment is structured and governed. This includes separating learning environments from certification processes, so that training can incorporate guidance and assistance while evaluation focuses on independent reasoning. It also involves representing training in terms of structured exposure to classes of reasoning challenges, rather than specific questions and answers, and verifying whether candidates can reason consistently across variations of those challenges.
Importantly, this direction is not tied to a single system or platform. It reflects an architectural and governance perspective that can support multiple, independent implementations across institutions.
Because this problem spans educational design, technical infrastructure, certification authority, and policy considerations, it cannot be addressed by any single entity. Progress requires coordinated collaboration among certification bodies, academic institutions, technical participants, and governance stakeholders.
Such progress can occur incrementally. Initial efforts may focus on conceptual validation and limited pilot studies, followed by the development of shared specifications that enable broader adoption while preserving institutional autonomy.
As AI continues to expand the scale and accessibility of learning, the question of how to verify reasoning becomes increasingly central. The aim of this work is to contribute to that discussion by outlining a framework through which reasoning-based certification may evolve—carefully, collaboratively, and at scale.
I The Inflection Point
In recent years, advances in artificial intelligence have begun to reshape how knowledge is accessed, acquired, and applied. AI-assisted learning environments are no longer experimental; they are rapidly becoming integrated into educational workflows across disciplines. In many contexts, learners now have continuous access to tools that can retrieve, synthesize, and even generate domain-relevant information in real time.
This shift has profound implications for professional education and certification. Historically, assessment systems have evolved around the assumption that access to information is limited and that knowledge must be internalized to be applied. As a result, a significant portion of assessment design has focused on evaluating recall, recognition, and structured problem-solving within controlled conditions.
That assumption is no longer stable.
As access to information becomes effectively ubiquitous, the distinction between knowing and reasoning becomes more consequential. The ability to retrieve information is increasingly commoditized; the ability to interpret, prioritize, and act on that information in context remains a core professional competency. In high-stakes domains such as medicine, law, engineering, and aviation, this distinction is not academic—it directly affects decision quality, safety, and outcomes.
Existing assessment models have long recognized this. Oral examinations, triple-jump exercises, Objective Structured Clinical Examinations (OSCEs), and related formats were designed specifically to evaluate reasoning under conditions of uncertainty. They rely on interactive, often adversarial engagement to probe how a candidate thinks, not simply what they know. These approaches remain among the most trusted methods for assessing professional competence.
However, they do not scale.
They depend on expert time, are logistically complex, and are difficult to standardize across large populations. As demand for professional certification grows—and as learning pathways diversify globally—these constraints become more pronounced. The result is a growing tension between what is known to be the most meaningful form of assessment and what is feasible to deliver at scale.
This tension is now being amplified.
AI is enabling learning to scale rapidly, across geographies and institutions, often outside traditional curricular boundaries. As this expansion continues, the question is no longer whether individuals can access knowledge or guidance, but how their independent reasoning can be verified in environments where assistance is readily available.
We are entering a phase where learning can scale globally, but the certification of reasoning cannot—at least not using existing models alone.
Addressing this gap does not require abandoning established assessment practices. Rather, it requires examining how their core strengths—contextual probing, adversarial dialogue, and evaluation of reasoning—might be extended through new forms of infrastructure that preserve trust while enabling scale.
II What Already Works
Established models of professional assessment have long recognized that competence is not defined solely by the possession of knowledge, but by the ability to apply that knowledge under conditions of uncertainty. Across multiple domains, assessment formats have evolved to evaluate this capability directly.
Oral examinations, Objective Structured Clinical Examinations (OSCEs), and related multi-stage formats—such as the "triple jump"—are widely regarded as among the most effective approaches for assessing reasoning. While they differ in structure, they share several essential characteristics.
First, they are interactive. Candidates are not evaluated through static responses alone, but through engagement with an examiner or a structured scenario. This interaction allows for probing beyond initial answers, revealing how a candidate interprets information, adjusts to new inputs, and navigates ambiguity.
Second, they are contextual. Rather than isolating discrete facts, these assessments situate candidates within realistic scenarios that require synthesis across multiple concepts. The candidate must prioritize, make trade-offs, and justify decisions in a manner that reflects real-world practice.
Third, they are adaptive and, at times, adversarial. Examiners introduce variations, challenge assumptions, and explore edge cases to test the robustness of a candidate's reasoning. This dynamic probing is critical: it distinguishes between surface-level familiarity and deeper conceptual understanding.
Finally, they rely on expert judgment. Trained evaluators bring domain-specific insight to the assessment process, allowing them to recognize not only correct or incorrect conclusions, but also the quality and structure of the reasoning that led to those conclusions.
These characteristics—interaction, context, adaptability, and expert judgment—are central to why such formats are trusted. They provide a level of insight into reasoning that is difficult to achieve through more standardized, non-interactive methods.
At the same time, these strengths are closely tied to their limitations.
Because they depend on expert time and individualized engagement, they are resource-intensive. Scaling them across large candidate populations requires significant coordination and introduces variability in delivery. Efforts to standardize these formats can improve consistency but may reduce some of the flexibility that gives them their evaluative power.
As a result, a practical balance has emerged in many certification systems: highly scalable methods are used to assess broad knowledge, while more resource-intensive formats are reserved for evaluating reasoning in depth.
This balance has been workable under historical conditions.
However, as the scale and nature of learning evolve, the gap between these two modes of assessment becomes more pronounced. The most meaningful form of evaluation—direct assessment of reasoning—remains the least scalable. The most scalable methods, while valuable, provide a more limited view of how candidates think in complex, real-world situations.
The problem is not that reasoning cannot be assessed. It is that it cannot be assessed at scale.
The question, therefore, is not whether existing methods are effective. They are. The question is whether their core strengths can be preserved and extended in a way that allows reasoning to be evaluated more consistently and at greater scale.
III The Emerging Gap
The conditions under which learning occurs are changing rapidly.
AI-assisted environments are increasingly integrated into both formal and informal education. Learners can now access systems that provide explanations, suggest approaches, generate examples, and assist in problem-solving across a wide range of domains. In many cases, these tools are available continuously and operate with a level of responsiveness that approximates guided instruction.
This shift is expanding access to learning in meaningful ways. It allows individuals to engage with complex material more efficiently, receive immediate feedback, and explore multiple approaches to a problem. It also enables learning to occur outside traditional institutional structures, across geographies and at a scale that was previously difficult to achieve.
At the same time, it introduces new ambiguity into the assessment process.
When assistance is readily available, the boundary between independently generated reasoning and externally supported output becomes less clear. A response may be correct, well-structured, and aligned with domain expectations, but it may not fully reflect the candidate's own reasoning process. Conversely, a candidate with strong underlying reasoning ability may rely on tools in ways that obscure their independent capability.
This ambiguity is not easily resolved through restriction.
Limiting access to tools during assessment may reduce certain forms of assistance, but it does not address the broader shift in how individuals learn and prepare. Nor does it reflect the environments in which professional reasoning is increasingly exercised, where access to information and tools is often expected rather than prohibited.
As a result, the focus of assessment is beginning to shift.
The central question is no longer whether a candidate can arrive at a correct answer in isolation, but whether they can demonstrate consistent, coherent reasoning across variations of a problem, including in settings where information and support are present. This requires evaluation approaches that can distinguish between pattern recognition, assisted output, and durable reasoning capability.
At the same time, the scale of participation in professional pathways continues to grow. Learners are engaging from a wider range of educational backgrounds, often through hybrid or non-traditional pathways. Institutions are under increasing pressure to maintain standards while accommodating this expansion.
Taken together, these trends create a structural gap.
As learning becomes AI-mediated, the boundary between assisted performance and genuine reasoning becomes increasingly difficult to define and verify.
On one side, learning is becoming more accessible, adaptive, and scalable. On the other, the most trusted methods for evaluating reasoning remain resource-intensive and limited in reach.
This gap is not theoretical. It is already beginning to surface in discussions around assessment integrity, the role of AI in education, and the future of credentialing. As these discussions evolve, the question is not whether change will occur, but how it can be guided in a way that preserves trust while adapting to new conditions.
IV Why Incremental Adjustments Are Not Sufficient
In response to evolving assessment challenges, a range of incremental approaches has emerged. These include expanding the use of standardized testing, strengthening proctoring mechanisms, and introducing controls on access to external tools during evaluation. Each of these measures can play a role in maintaining assessment integrity under certain conditions.
However, they are not designed to address the core shift outlined in the preceding section.
Standardized testing methods—particularly those focused on recognition or selection—are well-suited for evaluating broad knowledge efficiently. They can be scaled, calibrated, and administered consistently across large populations. Yet by design, they provide limited visibility into how a candidate reasons through complex, context-dependent problems. Increasing their use does not materially extend the ability to assess reasoning; it primarily reinforces the assessment of knowledge under constrained formats.
Similarly, enhanced proctoring and monitoring technologies aim to ensure that assessments are completed under prescribed conditions. While these measures can reduce certain forms of unauthorized assistance, they do not resolve the underlying ambiguity introduced by AI-assisted learning. Even in tightly controlled environments, it remains difficult to determine whether performance reflects durable reasoning or the application of learned patterns that may not generalize across variations.
Restrictions on access to tools present a related limitation. Preventing or limiting the use of external resources during assessment may approximate traditional testing conditions, but it does not reflect the environments in which professional reasoning is increasingly exercised. In many real-world settings, access to information and decision-support tools is not only permitted but expected. Designing assessments that depend on the absence of such tools risks evaluating a context that is progressively less representative of actual practice.
Taken together, these approaches share a common characteristic: they operate within the existing assessment paradigm. They adjust parameters—format, supervision, access—without fundamentally changing what is being observed or how it is evaluated.
The challenge is not preventing access to information, but verifying reasoning in its presence.
The emerging challenge is not primarily about controlling inputs. It is about observing and verifying reasoning in contexts where inputs are abundant and assistance may be present. When access to information is limited, assessment can rely more heavily on recall and constrained problem-solving. When access is effectively unconstrained, the emphasis must shift toward evaluating how candidates interpret, integrate, and act on information across changing conditions.
As a result, incremental adjustments, while useful, are unlikely to close the gap between scalable assessment and meaningful evaluation of reasoning. Addressing that gap requires a shift in perspective—from restricting access to information toward designing mechanisms that can reliably observe reasoning in its presence.
The next section outlines a direction for how such a shift might be approached, not as a single system, but as a framework for structuring and governing reasoning-based evaluation at scale.
V A Direction, Not a Product
Addressing the gap between scalable assessment and meaningful evaluation of reasoning does not begin with a single system or platform. It begins with a shift in how assessment is structured and governed.
At a high level, this shift involves separating functions that have historically been coupled. In many current models, the same environment may be responsible for delivering instruction, capturing learner interactions, and contributing to evaluation outcomes. While efficient, this coupling can make it difficult to distinguish between how a candidate was trained, what support was available during that training, and how their independent reasoning should be assessed.
A direction for addressing this challenge is to introduce clearer boundaries between these functions.
One such boundary is between learning and certification. Training environments can be optimized for guidance, feedback, and iterative improvement, including the use of AI-assisted tools. Certification environments, by contrast, can be structured to evaluate reasoning independently, using methods designed specifically for verification rather than instruction. Maintaining separation between these environments helps ensure that assessment outcomes reflect demonstrable reasoning capability rather than the characteristics of the training process.
A second boundary concerns identity and performance data. In traditional systems, identifying information, training interactions, and assessment results are often linked within the same data context. As learning becomes more distributed and tool-assisted, preserving trust may require limiting how these elements are combined. Structuring assessment in a way that minimizes dependence on identity-linked training data can support more neutral and portable forms of certification.
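One way to limit how identity and performance data are combined is to key evaluation records by a pseudonym rather than by the candidate's identity. The sketch below is illustrative only: the keyed-hash (HMAC) technique, the function name `pseudonym`, and the assumption that the certification authority alone holds the key are introduced here for illustration and are not prescribed by the framework.

```python
import hashlib
import hmac


def pseudonym(candidate_id: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym for a candidate (hypothetical helper).

    The same candidate and key always yield the same token, so evaluation
    records can be linked to each other without storing the identity itself.
    Only a party holding the key can recompute the mapping.
    """
    digest = hmac.new(secret_key, candidate_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Illustrative values only; a real key would be generated and stored securely.
authority_key = b"example-key-held-by-certification-authority"
token = pseudonym("candidate-0001", authority_key)
```

Under these assumptions, an evaluation mechanism would see only `token`, while the link back to a named individual stays with whichever body holds the key.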
A third element involves how exposure to reasoning scenarios is represented. Rather than relying solely on test items or static question banks, it is possible to describe a candidate's training in terms of structured exposure to classes of reasoning challenges within a defined domain. Such representations do not encode specific questions or answers, but instead characterize the types of problems encountered and the conceptual dimensions engaged during training.
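As a rough illustration of what such a representation might look like, the sketch below records exposure as counts per challenge class and never stores question or answer text. The type names (`ChallengeClass`, `ExposureProfile`), the field names, and the example domain and dimension labels are all hypothetical; an actual specification would be defined collaboratively.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChallengeClass:
    """A class of reasoning challenge, described without any question text."""
    domain: str                 # illustrative label, e.g. "clinical-medicine"
    name: str                   # illustrative label for the class of problem
    dimensions: frozenset[str]  # conceptual dimensions the class engages


@dataclass
class ExposureProfile:
    """Counts how often a candidate trained on each challenge class."""
    counts: Counter = field(default_factory=Counter)

    def record(self, challenge: ChallengeClass) -> None:
        # Only the class is recorded, never the specific item or response.
        self.counts[challenge] += 1

    def dimensions_covered(self) -> set[str]:
        # Union of the conceptual dimensions across all classes encountered.
        covered: set[str] = set()
        for challenge in self.counts:
            covered |= challenge.dimensions
        return covered


triage = ChallengeClass(
    domain="clinical-medicine",
    name="triage-prioritisation",
    dimensions=frozenset({"prioritisation", "uncertainty"}),
)
profile = ExposureProfile()
profile.record(triage)
profile.record(triage)
```

Because the profile holds only class-level descriptors, it could in principle be shared with a certification environment without revealing the training items themselves.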
Within this context, evaluation can shift toward verifying whether a candidate can reason consistently across variations of those conceptual dimensions. This may involve interactive or adversarial probing, in which the candidate is asked to respond to new scenarios that are derived from, but not identical to, those encountered during training. The objective is not to reproduce prior responses, but to demonstrate the ability to apply reasoning under changed conditions.
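The idea of consistency across variations can be made concrete with a small sketch. It assumes that each variation of a challenge class has already been scored by an examiner or rubric on a 0 to 1 scale. The function name `consistency_report`, the variation labels, and the thresholds `pass_mark` and `max_spread` are hypothetical and would need to be set and validated by the relevant certification body.

```python
from statistics import mean, pstdev


def consistency_report(
    scores: dict[str, float],
    pass_mark: float = 0.7,    # hypothetical minimum score per variation
    max_spread: float = 0.15,  # hypothetical limit on score variability
) -> dict:
    """Summarise one candidate's scores across variations of one challenge class."""
    if not scores:
        raise ValueError("at least one scored variation is required")
    values = list(scores.values())
    spread = pstdev(values)  # population standard deviation of the scores
    return {
        "mean": mean(values),
        "spread": spread,
        "weakest_variation": min(scores, key=scores.get),
        # Consistent only if every variation passes and scores stay close together.
        "consistent": min(values) >= pass_mark and spread <= max_spread,
    }


report = consistency_report(
    {"baseline": 0.90, "changed-constraint": 0.85, "missing-data": 0.40}
)
```

In this example the candidate performs well on the baseline but poorly once data is withheld, so the report flags the result as inconsistent and names the weakest variation. A single high average would have hidden that pattern.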
The objective is not to build a system, but to define a governance architecture within which independent systems can operate.
Importantly, these elements are not tied to a single implementation. They describe a set of architectural considerations that can be realized in different ways, across institutions and systems. The emphasis is on defining how components interact and how responsibilities are separated, rather than prescribing specific technologies.
Framed in this way, the problem becomes one of governance as much as of engineering. It involves establishing rules and structures that allow learning environments, evaluation mechanisms, and certification authorities to operate with appropriate independence, while still contributing to a coherent assessment process.
The following sections consider why such an approach necessarily involves multiple stakeholders and how it may be developed incrementally through collaborative effort.
VI Why This Requires Collaboration
The considerations outlined in the previous section point toward a form of assessment that cannot be defined or implemented by a single entity alone. They involve questions of educational design, technical infrastructure, certification authority, and governance—each of which sits within a different institutional domain.
Certification bodies play a central role in establishing standards and maintaining public trust. Their authority is grounded in the ability to define what constitutes competence and to ensure that certification processes are rigorous, consistent, and fair. Any evolution in assessment models must align with these responsibilities.
Academic institutions and educators contribute domain expertise and pedagogical structure. They design curricula, define learning objectives, and develop the scenarios through which reasoning is cultivated. As learning environments become more adaptive and distributed, their role in shaping the underlying structure of reasoning remains essential.
Technical participants—including developers of AI systems, platforms, and infrastructure—bring capabilities that enable new forms of interaction, data handling, and evaluation. These capabilities are necessary to support assessment at scale, but they must be applied within clearly defined boundaries to preserve the integrity of the certification process.
Policy and governance stakeholders provide oversight related to privacy, data use, and regulatory alignment. As assessment systems incorporate new forms of data and interaction, these considerations become more complex and more critical. Ensuring that systems are not only effective but also trustworthy requires explicit governance frameworks.
These roles are interdependent but not interchangeable.
No single participant can fully define the problem or its solution in isolation. Certification bodies cannot unilaterally design technical infrastructure; technical providers cannot define certification standards; educational institutions cannot establish cross-institutional governance on their own. Each contributes a necessary perspective, and each depends on the others to maintain balance.
This is inherently a multi-party problem requiring coordinated governance, not centralized control.
This interdependence suggests that progress in this area is best approached through coordinated effort rather than centralized control. A collaborative model allows for the development of shared specifications, common definitions, and interoperable components, while preserving institutional autonomy. It also enables iterative validation, where ideas can be tested in limited contexts, refined based on feedback, and gradually extended.
Such an approach reduces risk. It avoids premature standardization and allows different stakeholders to evaluate the implications of new models within their own operational and regulatory frameworks. It also creates space for consensus to emerge around what aspects of reasoning should be assessed, how they should be represented, and how outcomes should be interpreted.
Importantly, collaboration in this context does not require immediate commitment to large-scale change. It can begin with structured discussions, small pilot efforts, and shared exploration of specific questions. Over time, these activities can contribute to the formation of more formal governance structures and specifications.
Framed in this way, the evolution of reasoning-based certification is not a single initiative to be adopted, but a process to be shaped collectively. The final section outlines how such a process might proceed incrementally, with an emphasis on validation, transparency, and continuity with existing systems.
VII An Incremental Path Forward
Given the complexity of professional certification and the level of trust it requires, any evolution in assessment models must proceed carefully. Abrupt changes to established systems are neither practical nor desirable. Instead, progress is more likely to occur through a series of incremental steps that allow new approaches to be explored, evaluated, and refined over time.
A first step involves conceptual validation. This includes clarifying definitions, identifying key assumptions, and engaging stakeholders in structured discussion around the problem space. At this stage, the objective is not to implement new systems, but to establish a shared understanding of the challenges and the potential directions for addressing them.
From there, limited pilot efforts can be introduced. These pilots need not replace existing assessment processes; rather, they can operate alongside them, focusing on specific aspects of reasoning-based evaluation. For example, a pilot might explore how candidates respond to variations of a defined class of problems, or how interactive probing can be structured in a consistent and repeatable way.
Such pilots provide an opportunity to observe how proposed approaches perform in practice. They allow questions of feasibility, reliability, and interpretability to be examined in controlled settings. Importantly, they also enable participating institutions to assess alignment with their own standards, policies, and operational constraints.
As insights are gathered, they can inform the development of shared specifications. These specifications would define how different components—learning environments, evaluation mechanisms, and certification processes—interact within a governed framework. The goal is not to prescribe a single implementation, but to enable multiple, independent implementations to operate with a degree of interoperability and consistency.
Throughout this process, transparency is essential. Assumptions, methods, and outcomes should be documented and made available for review. This supports broader participation and helps build confidence that new approaches are being developed in a deliberate and accountable manner.
Equally important is continuity with existing systems. Established assessment formats, including oral examinations and OSCEs, embody principles that have been validated over time. An incremental approach seeks to extend these principles—such as contextual evaluation and adversarial probing—rather than replace them. In doing so, it preserves the foundations of trust on which professional certification depends.
Progress does not require replacement of existing systems, but structured extension and validation.
Taken together, these steps outline a path that is both cautious and constructive. They recognize the need to adapt to changing conditions while respecting the complexity of the systems involved.
As AI continues to expand the scale and accessibility of learning, the question of how to verify reasoning will become increasingly central. Addressing this question does not require immediate transformation, but it does require deliberate progress.
The aim of this work is to contribute to that progress by providing a structured basis for discussion, experimentation, and collaboration.
Contribute to the next phase of reasoning-based certification
We invite certification bodies, academic institutions, technical participants, and governance stakeholders to engage in structured discussion and collaborative development.