A lot of competence does not live in writing. A nurse demonstrating a clinical handover, a trades apprentice talking through a wiring diagram, a sales trainee handling an objection, a language learner holding a conversation. Ask any of them to write an essay about it and you lose most of the signal you actually care about.
So assessors collect video and audio. A recorded role-play, a spoken explanation, a demo with narration. It is far better evidence of the real skill. It is also miserable to grade at scale, and that single fact quietly pushes a lot of programmes back toward written tasks that measure the wrong thing.
Why video and audio are worth the trouble
Some things only show up in performance.
- Procedure under real conditions. Whether someone does the steps in the right order, safely, without prompting. A written answer shows they know the steps. A video shows whether they can actually do them.
- Communication and delivery. Clarity, pacing, how they respond when something goes off-script. Central to teaching, care, sales, and frontline roles, and invisible on paper.
- Spoken fluency. For language and communication training, audio is the assessment. A transcript flattens exactly the thing you are measuring.
When the competency is a performance, text is a proxy at best. Multi-modal evidence is the real thing.
Why it is so painful to grade
Video and audio have a property text does not: you have to consume them in real time. You cannot skim a ten-minute clip the way you scan an essay. To assess it properly you watch the whole thing, often twice, scrubbing back to check a specific moment against the rubric.
That makes the per-submission cost brutal. Twenty-five ten-minute videos add up to more than four hours of watching before a single mark is written, and that assumes one pass per clip with no scrubbing back. So programmes that should use video either limit how much they collect or quietly drift back to writing because it grades faster. Both are the assessment tail wagging the pedagogy dog.
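The arithmetic is easy to check for your own cohort. The sketch below is illustrative only: the function name and the `passes` parameter are assumptions made for this example, and the figures are the ones used above (twenty-five clips of ten minutes each).

```python
def watching_hours(submissions: int, minutes_each: float, passes: float = 1.0) -> float:
    """Hours of real-time playback needed for one cohort.

    `passes` is a hypothetical knob for how many times each clip is
    watched on average (1.0 = once through, 2.0 = watched twice).
    """
    return submissions * minutes_each * passes / 60


single_pass = watching_hours(25, 10)            # 250 minutes of playback
double_pass = watching_hours(25, 10, passes=2)  # every clip watched twice

print(f"{single_pass:.2f} h single pass, {double_pass:.2f} h if every clip is watched twice")
```

Swap in your own cohort size and clip length to see how quickly linear playback becomes the dominant cost.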
What multi-modal assessment actually does
The aim is not to stop the assessor watching the work. It is to make the watching targeted instead of exhaustive.
An AI assessment layer processes the video or audio - transcribing speech, noting where in the timeline the rubric criteria appear to be met - and produces a first pass against the criteria with timestamps. Instead of watching ten minutes blind, the assessor sees that criterion two looks met around the 3:40 mark and criterion four looks weak near 7:10, and goes straight to those moments to confirm.
The assessor still watches what matters and still owns the judgement. What disappears is the four hours of linear scrubbing just to locate the moments worth judging. The same applies to uploaded files and written work - Scorafy is built to read text, video, audio, and file submissions against the rubric you set, cite the evidence, and route every result to a qualified assessor for review and sign-off.
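To make the idea of a timestamped first pass concrete, here is a minimal sketch of what such a result might look like as data. Everything in it is hypothetical: the `CriterionFinding` fields, the `review_queue` helper, and the placeholder excerpts are assumptions for illustration and do not describe Scorafy's or any other product's actual format. The two findings reuse the 3:40 and 7:10 moments from the example above.

```python
from dataclasses import dataclass


@dataclass
class CriterionFinding:
    """One first-pass suggestion tied to a moment in the recording (hypothetical shape)."""
    criterion: str
    timestamp_s: int         # seconds from the start of the recording
    suggestion: str          # e.g. "looks met" or "looks weak"; a suggestion, not a mark
    transcript_excerpt: str  # the speech-to-text evidence the suggestion rests on


def mmss(seconds: int) -> str:
    """Format a second count as m:ss for display."""
    return f"{seconds // 60}:{seconds % 60:02d}"


def review_queue(findings):
    """Order findings along the timeline so the assessor can jump moment to moment."""
    ordered = sorted(findings, key=lambda f: f.timestamp_s)
    return [(mmss(f.timestamp_s), f.criterion, f.suggestion) for f in ordered]


findings = [
    CriterionFinding("criterion 4", 430, "looks weak", "(excerpt placeholder)"),
    CriterionFinding("criterion 2", 220, "looks met", "(excerpt placeholder)"),
]

for moment, criterion, suggestion in review_queue(findings):
    print(f"{moment}  {criterion}: {suggestion} (confirm against the recording)")
```

The point of the structure is that every suggestion carries its own timestamp and evidence, so the assessor checks a specific moment rather than taking the first pass on trust.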
The cautions that come with it
Multi-modal assessment is powerful, and it carries specific responsibilities that text does not.
Transcription is imperfect. Accents, jargon, background noise, and crosstalk all degrade speech-to-text. A first pass built on a flawed transcript can be confidently wrong, which is precisely why the assessor confirming against the actual recording is not optional.
Recordings are sensitive personal data. A video of someone’s face and voice is more identifying than an essay. Where it is stored, how long it is kept, and whether it is ever used to train a model are real questions under GDPR and the Australian Privacy Principles. Get clear answers before you collect anything.
The decision must stay human. A performance assessment that affects a qualification is high-stakes, and the EU AI Act treats it as such: AI systems used to evaluate learning outcomes in education and vocational training fall into its high-risk category. The AI locates and suggests. A qualified assessor decides. No solely automated outcomes - that line does not move because the format changed.
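One way to hold that line in software is to make it impossible to read a final outcome until an assessor has signed off. The sketch below is a hypothetical illustration, not a description of any real system: `AssessmentRecord`, `SignOffRequired`, and the identifiers used are all invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


class SignOffRequired(Exception):
    """Raised when an outcome is requested before an assessor has decided."""


@dataclass
class AssessmentRecord:
    submission_id: str
    ai_suggestion: str                       # first pass only, never the outcome
    assessor_id: Optional[str] = None
    assessor_decision: Optional[str] = None

    def sign_off(self, assessor_id: str, decision: str) -> None:
        """Record the qualified assessor's decision, which may differ from the suggestion."""
        self.assessor_id = assessor_id
        self.assessor_decision = decision

    def final_outcome(self) -> str:
        """Return the assessor's decision; the AI suggestion is never used as a fallback."""
        if self.assessor_decision is None:
            raise SignOffRequired(f"{self.submission_id} has no assessor decision yet")
        return self.assessor_decision


record = AssessmentRecord("sub-001", ai_suggestion="competent")
record.sign_off("assessor-7", "not yet competent")
print(record.final_outcome())
```

In this sketch the suggestion and the decision are stored separately, so an audit can always show what the AI proposed and what the human decided.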
What it changes for a programme
When grading video and audio stops costing four hours a cohort, the constraint that quietly shapes assessment design lifts. You can collect the evidence that actually shows the skill, instead of the evidence that is cheapest to mark. Apprentices can submit demos. Language learners can submit conversations. Care students can submit recorded scenarios. And the assessor reviews a targeted first pass rather than burning out on linear playback.
The goal was never to take the human out of judging performance. It was to stop the cost of watching from quietly deciding what you are allowed to assess at all.