AI-native assessment · Human-first teaching
MentorplAI turns every writing task into a longitudinal signal: how students work with AI, how their judgment develops across tasks, and how reliably they evaluate each other's work.
| Student | Task 1 | Task 2 | Task 3 |
|---|---|---|---|
| A. Müller | 82 (Agency 28%) | 85 (Agency 31%) | 89 (Agency 25%) |
| B. Okonkwo | 71 (Agency 55%) | 69 (Agency 52%) | 78 (Agency 47%) |
| C. Nakamura | 76 (Agency 61%) | 61 (Agency 87%) | 58 (Agency 91%) |
| D. Patel | 54 (Agency 33%) | 68 (Agency 29%) | 75 (Agency 27%) |
| E. Johansson | 80 (Agency 22%) | — | — |
Most assessment measures what was produced.
MentorplAI measures how students work — and whether that improves.
These problems existed before AI. They're just more visible now.
A polished submission can come from genuine skill or from minimal engagement with a capable model. Grading the output alone doesn't tell you which.
Without seeing how a student works — what choices they make, whether their thinking develops — it's hard to give useful feedback or spot who needs help.
Working effectively with AI is a professional skill. Assessing as though AI doesn't exist doesn't prepare students for what they're already doing.
One coherent loop per task. Each loop adds to the picture.
Create a module, write a task prompt, and add quality baselines — sample responses at each level from naive to professional. The AI can generate them from your prompt. These baselines calibrate the grading system.
Before writing, each student authors a personal skill — a short set of DOs and DON'Ts capturing how they intend to work with AI on this task. This makes their approach explicit and gives them something to reflect on across tasks.
Each student's skill is used to generate a personalised AI draft. Students then write and edit from that draft in a structured studio. How far the final text diverges from the draft becomes the Agency signal: a measure of how much the student shaped the output beyond the starting point.
Students compare pairs of submissions on five dimensions. Comparing two submissions directly is more reliable than scoring each one on a scale, and it is the same principle used to evaluate AI models. Each evaluator's reliability is measured alongside the submissions they judge.
Each task produces a grade that combines submission quality (65%) with the student's reliability as an evaluator (35%). Across tasks, lecturers see how grades and agency evolve, revealing patterns that a single snapshot cannot.
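As a rough illustration of the 65/35 split, the sketch below combines the two components as a simple weighted sum. It assumes both components are already on a 0 to 100 scale and that the combination is linear; the function name `task_grade` is hypothetical and the real calculation may differ.

```python
# Weights taken from the text; everything else here is an assumption.
QUALITY_WEIGHT = 0.65      # share of the grade from submission quality
RELIABILITY_WEIGHT = 0.35  # share of the grade from evaluator reliability


def task_grade(quality: float, reliability: float) -> float:
    """Combine two 0-100 component scores into a single task grade."""
    for name, value in (("quality", quality), ("reliability", reliability)):
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    return QUALITY_WEIGHT * quality + RELIABILITY_WEIGHT * reliability


# A strong submission from a less reliable evaluator:
# 0.65 * 80 + 0.35 * 60 = 52 + 21 = 73
print(round(task_grade(80, 60), 1))
```

Under this reading, a student cannot reach a top grade on writing alone: careless peer evaluation pulls the result down.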
No single number tells the story. The insight lies in how the signals combine and how they change across tasks.
A quality level derived from pairwise peer comparison, calibrated against the lecturer's quality baselines. Rising scores across tasks suggest real development.
How much the student shaped and steered the output beyond the AI starting draft. Persistently low agency across tasks may signal limited engagement — though context always matters.
How consistently and accurately a student judges peers' work — agreement with consensus, calibration against known quality anchors, and informativeness of their choices.
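To make the Agency signal more concrete, here is a minimal sketch of one way a divergence measure could be computed: compare the AI starting draft with the final submission word by word and report the share that changed. The function name `agency_signal` and the choice of Python's `difflib` are illustrative assumptions, not a description of how MentorplAI computes the signal.

```python
from difflib import SequenceMatcher


def agency_signal(ai_draft: str, final_text: str) -> float:
    """Return divergence from the AI draft in [0, 1].

    0.0 means the final text matches the draft word for word;
    1.0 means no words are shared in matching order.
    """
    draft_words = ai_draft.split()
    final_words = final_text.split()
    if not draft_words and not final_words:
        return 0.0  # nothing written, nothing changed
    # ratio() is 2 * matches / total words across both texts.
    similarity = SequenceMatcher(
        None, draft_words, final_words, autojunk=False
    ).ratio()
    return 1.0 - similarity


draft = "remote work improves focus and productivity"
final = "remote work improves focus but weakens team cohesion"
print(f"Agency: {agency_signal(draft, final):.0%}")
```

A word-level diff is crude: it rewards rewording as much as rethinking, which is one reason a signal like this needs a lecturer's interpretation rather than a threshold.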
Each design choice is grounded in something that already works.
Comparing two things directly is more reliable than scoring them on an abstract scale. This is how AI models are evaluated at scale — MentorplAI applies the same principle to student work.
Having students articulate and revise their own DOs and DON'Ts creates a record of how their approach to AI develops across tasks — and surfaces misalignments between what they say they'll do and what they produce.
Evaluating others' work develops critical judgment that producing work alone doesn't. Done systematically, peer evaluation signals both the quality of the work reviewed and the quality of the reviewer.
A single data point is ambiguous. Patterns across multiple tasks are meaningful. Tracking how a student behaves over time reveals what a snapshot assessment cannot.
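The pairwise principle can be sketched with the Bradley-Terry model, a standard way to turn "X beat Y" judgments into a ranking. The text does not say which model MentorplAI uses, so the choice of Bradley-Terry, the function name `bradley_terry`, and the sample comparisons are all assumptions made for illustration.

```python
from collections import defaultdict


def bradley_terry(comparisons, iterations=200):
    """Estimate a strength per item from (winner, loser) pairs.

    Uses the classic iterative update; strengths are normalised to sum to 1.
    Items that never win collapse towards zero strength.
    """
    items = sorted({item for pair in comparisons for item in pair})
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    strength = {item: 1.0 / len(items) for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denom = 0.0
            for j in items:
                count = pair_counts.get(frozenset((i, j)), 0)
                if j != i and count:
                    denom += count / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom else strength[i]
        total = sum(updated.values())
        strength = {item: value / total for item, value in updated.items()}
    return strength


# Hypothetical peer judgments: each tuple is (preferred, other).
judgments = [
    ("A", "B"), ("A", "B"), ("B", "A"),
    ("B", "C"), ("B", "C"), ("C", "B"),
    ("A", "C"), ("A", "C"), ("C", "A"),
]
scores = bradley_terry(judgments)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

The appeal of this family of models is that evaluators only ever answer "which is better?", yet the output is a continuous score that can then be calibrated against known baselines.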
Signals to inform teaching — not replace it.
The longitudinal dashboard shows every student across every task — grade trajectory, agency score, and evaluator reliability — in one view. Spot who's improving and who needs a conversation.
Deliver personalised written feedback — with attachments or links — directly to a student, visible only to them in their results. High-signal conversations, not bulk annotation.
Share a QR code in class. Students join, write, and evaluate each other's work in minutes. Ranked grades on the spot. Works without setting up a full module first.
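The kind of trajectory check the longitudinal dashboard supports can be sketched as follows, using the grades from the sample table at the top of the page. The helper `grade_change` and the rule for flagging a student (a falling grade or fewer than two completed tasks) are assumptions chosen for illustration, not the dashboard's actual logic.

```python
# Grades per task from the sample table; None marks a task not completed.
grades = {
    "A. Müller": [82, 85, 89],
    "B. Okonkwo": [71, 69, 78],
    "C. Nakamura": [76, 61, 58],
    "D. Patel": [54, 68, 75],
    "E. Johansson": [80, None, None],
}


def grade_change(scores):
    """Return last completed grade minus first, or None if under two tasks."""
    completed = [score for score in scores if score is not None]
    if len(completed) < 2:
        return None
    return completed[-1] - completed[0]


# Assumed rule: flag students whose grade fell or who stopped submitting.
needs_conversation = [
    student
    for student, scores in grades.items()
    if grade_change(scores) is None or grade_change(scores) < 0
]

for student, scores in grades.items():
    print(student, grade_change(scores))
print("Worth a conversation:", needs_conversation)
```

A flag like this is only a prompt to look closer; the agency and reliability signals, and what the lecturer knows about the student, supply the context.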
Being clear about limits is part of using any tool well.
A structured environment that makes how students work with AI visible — not just what they produce.
A longitudinal record that helps lecturers have better, more specific conversations with students.
An assessment design that treats AI collaboration as a skill to develop — one that can be observed across tasks.
A peer evaluation framework that develops critical judgment, not just produces a grade.
Not an AI detector. Signals are for lecturers to interpret with context; they are not automatic verdicts.
Not a surveillance tool. The system surfaces patterns to inform teaching decisions; it doesn't replace professional judgment.
Not foolproof. Students can game any system. What changes here is the cost: shallow work leaves consistent patterns across tasks, so gaming becomes harder to sustain convincingly.
Not a replacement for good pedagogy. The system works best as a starting point for teaching, not an endpoint for evaluation.
"I haven't seen any system designed to teach students how to work with AI — and that's what this is."
— Assessment researcher, education technology
MentorplAI is live and being used in courses today. Try it with your next assignment.