Beyond Scales - In Theory

This is one half of a pair of articles dealing with the theory and the practice of moving “beyond scales” in AI and education. The two posts together demonstrate how our thoughts have developed since 2023, and the examples from practitioners around the world of work which has gone beyond the original AI Assessment Scale. The posts can be read in either order and the other half can be found here. This post deals with the broader understandings of frameworks for AI and assessment.

Those of us who work closely with GenAI think we can see certain trends; trends like increasingly capable models, multimodality, coding tools, and the imminent flood of AI wearables. And because we see them, we tend to assume everyone else can too: classic “curse of knowledge” territory. Whether we’re right or wrong, our closeness to the technology shapes how we think and write about its implications in education.

I’ve got two frequent concerns writing about AI in education: meeting people where they’re at, and telling people where I think they should be. There’s a presumptuousness to both. How do you really know “where someone’s at” if you haven’t worked in their school, lectured at their university, lived through the office politics, briefings, and PD sessions? And who are we to say where anyone should be?

Every week I’m contacted by people who tell me that as an institution they’ve done “nothing at all”. But even that doesn’t tell me the full story, because there are almost certainly staff and students within the school who’ve been using AI extensively for the last few years. Pockets of secret expertise, sometimes pushed underground into shadow IT by institutional rules, sometimes by the wariness of their colleagues, or sometimes because of sectoral constraints. The converse is equally true: schools and universities that consider themselves “ahead” undoubtedly have staff who feel left behind. Position on the mysterious continuum of GenAI knowledge is never as uniform as the institution imagines it to be.

I feel that tension acutely with the AI Assessment Scale, which, for a variety of reasons, has become a widely used framework for trying to come to terms with GenAI in education. It has been published and critiqued and revised and defended and critiqued and defended again, many times over. Some of the schools I work with adopted it years ago and have long since adapted it beyond recognition; others are encountering it for the first time this week. We, its authors, have moved on in our own thinking too, and have written extensively about those thoughts elsewhere.

In this article I’ll explore what happens when frameworks meet the real world, and encounter that tension between “where people are” and “where we think they should be”. This article is, in a sense, a critique of scales and other heuristics, informed by my own experiences of the AI Assessment Scale. But it shouldn’t be read as only critique, since the ongoing project has yielded far more interesting and positive findings than “traffic lights don’t work.”

You should also check out the partner article detailing some of those practical, and successful, examples: the ways schools and universities around the world have adapted the Scale and made it their own. I’ll pick up the narrative from our most recent publications, Reimagining the AI Assessment Scale and How (Not) to Use the AI Assessment Scale, and demonstrate how the conversations have moved beyond those fixed points in time.

Beyond Scales – In Practice

Lethal mutations

When a framework gets adopted widely enough, it inevitably encounters contexts its authors never designed for. Along those lines, K-12 history teacher and PhD candidate Vince Wall recently compared the way he was seeing the AIAS implemented to Dylan Wiliam’s commentary on “lethal mutations” in education. Lethal mutations happen when practices get scaled across a messy, unpredictable education system. Wiliam’s original concept refers specifically to evidence-based practices that lose effectiveness through poor implementation. The AIAS is a heuristic rather than an empirical finding, but in Wall’s article the mechanism is the same: coherent ideas and useful structures, distorted by scale and context.

I’ve lethally mutated a few ideas over my teaching career. I rebelled against what I saw as the limiting structures of lesson objectives and success criteria. How many English Literature teachers walk into a lesson with something as concrete as an “objective”…? So, passive aggressively, I just made them up. The LOs on the board held the mysterious promise of learning, with abstract phrases like, “formulate and support original interpretations using textual evidence,” that existed purely for the pleasure of whichever school leader happened to be walking by my classroom. In the actual classroom we read books and had conversations.

The lethal mutation was my understanding of what a lesson objective and success criteria was, filtered through the lens, murkily perhaps, of the way I believed English Literature should be taught.

As a school leader I’ve also witnessed firsthand how frameworks like Understanding by Design mutate, turning into twee acronyms (GRASPS? WHERETO?) that disguise themselves as curriculum design while the unit congeals around a single high-stakes examination-style assessment task at the end. Systemic pressures often win out over what seems like best practice. The authors can attempt to guard against lethal mutations. We can pilot and reflect and start to build a base of evidence around what works well and what doesn’t. But at the end of the day any idea can fall prey to misinterpretation, pressure to embed it too quickly, or systems that prioritise consistency over context.

And often, of course, the ideas themselves need to change. The AIAS project began in 2023, and has tried to keep pace with a technology which has demonstrably moved faster than any other in our lifetimes. Part of that has been the shift from policing the technology to designing around it. But even so, we will always struggle to keep up with the leaps and bounds of GenAI. That doesn’t mean we won’t try. And luckily, not all mutations are lethal. I’ve had the privilege of seeing many adaptations over the past few years that have undoubtedly improved on our original frameworks, and I’ve written about some of them in the companion article.

But even the best adaptations are negotiating the technology within the same basic paradigm: categories, levels, scales, use and non-use. The question worth asking now is whether that paradigm itself is still fit for purpose. The answer, of course, is more complex than just “yes” or “no”.

Cautionary tales for thinking beyond scales

Jason Lodge recently gave a talk at Deakin University’s CRADLE on what’s next for assessment validity in Australian Higher Education. If you find yourself (incredibly, improbably) with an hour and 15 minutes spare whilst you’re reading this article, I’d encourage you to pause here and watch before you go any further. If not, then you can take my comments below as a cherry-picked sample of my favourite parts, and then come back and watch it later.

Lodge talks about the unhelpful dichotomies which add an extra layer of confusion to the conversations about AI and education. AI is going to help us; AI will harm us. It makes us more intelligent; it makes us dumber. It’s going to revolutionise education; it will destroy it. Dichotomies in education and elsewhere are rarely helpful.

He also points out that we have had metaphors, models, and frameworks that emerged at certain points in time, and which are no longer sufficient. This includes “traffic lights” and scales. Lodge makes the case that the way people actually work with AI is too entangled, too messy, to be captured by simple categories. “We’re talking about people who use multiple AI tools all at once to do multiple tasks,” he says. “To say that ‘this is appropriate’ and ‘that’s not appropriate’ is a really crude categorisation of something that is enormously complicated and entangled.”

He is right, and yet as I argue in the other half of this essay-pair, right now that isn’t how the majority of educators and students are using AI. The entanglement that Lodge describes is accurate, and it is on the horizon, but it exists alongside a much more mundane reality in most classrooms and lecture halls. Again, there is the tension between “where we are now” and “where we should be”.

In a similar vein, Lodge draws a useful distinction between what he calls the acute problem and the chronic problem. The acute problem is the one that makes headlines every few months: students cheating, degrees devalued, institutional reputations under fire. The chronic problem is harder: what and how do we teach now, and what do we assess, and what role does AI play in all of it?

No more copy-paste solutions

So where do we go when we move beyond traffic lights and scales? Lodge suggests that the best thinking we have for addressing the acute problem in Australian HE is the two-lanes model: secure assessments in lane one, everything else in lane two. He identifies the model as a useful heuristic for dealing with the “right now” problems: securing assessment and assuring learning while universities face frequent accusations of malpractice in the mainstream media.

The two-lanes model is, in one sense, what “beyond scales” looks like in theory: a simple heuristic, designed for a specific sector, addressing a specific problem. Simplicity is precisely what makes frameworks travel, and travel is precisely what introduces lethal mutations.

In Lodge’s presentation, he highlights the difference between K-12 and Higher Education when dealing with assessment security:

…we really need these kinds of clear decision making frameworks in higher ed compared to something like senior secondary or secondary schooling altogether. They have trained educators, they have small classes and they have a lot of contact time with students. So the equation in terms of how they assure the learning in that kind of environment is very different to what we grapple with in higher education.

But remember what we’ve learned so far about lethal mutations and three-years of the AIAS. I’ve already encountered schools using the two-lanes model to justify high-stakes examinations in lane one, propping up the fallacy that the only way to do assessment in senior secondary school is through invigilated high-stakes examinations. There is comfort in reinforcing the commonly held opinion that “they have to do it at the end of Year 12, so you might as well use exams.” This is a different conversation, which I’ve been writing about since I was a Head of English. Coursework is not an exam, and VCE, HSC, or other examinations – and now AI – should not be used as an excuse to make it such. The lethal mutation: taking an assessment security framework designed for higher education and using it as a rationale to justify, or reinforce, poor practice in K-12.

But K-12 is not the only sector where the model risks being adopted without sufficient consideration and contextualisation. If you watch to the end of Lodge’s CRADLE session, you’ll hear Deakin’s Sue Sharpe raise the question of disability. AI is already being integrated into a wide range of assistive technologies, and “ring-fenced” solutions, physically cutting students off from AI, might have unforeseen or unaccounted for consequences.

If a lane one assessment is in-person and technology-free, students with diagnosed disabilities may be disadvantaged. If they apply for reasonable adjustments, and are granted access to technologies, they risk being accused of playing the system. This already happens in both higher education and K-12. As Sharpe argues, disability is not a complexity for the future to deal with but “a complexity to weave in as we go.”

Disability is a question of specificity: who is this student, what do they need, and what happens when a broad framework meets a particular person? Caitlin Maling and Catherine Noske make a version of this argument at a different scale. They write that GenAI exists in what they call “no-place,” and the frameworks we build around it tend to inherit that placelessness. Trying to compensate by insisting on secured, in-person assessment displaces but does not resolve this tension. When a framework designed to manage a “placeless” technology is copy-pasted across jurisdictions without attention to place, culture, and community, we are back to issues of scale and implementation. Lethal mutations.

And even if we could address all the complexities we can see right now, the technology is not going to wait for us. Corbin, Sharpe, and Dawson’s recent paper on wearable AI calls out an important near-future concern with implications for “secure” assessment. The physical exclusion of AI from assessment has depended on two conditions: separability (that AI use is a discrete act distinguishable from the student’s own cognition) and observability (that AI use produces behaviours that can, in principle, be detected). Wearable AI erodes both, and it isn’t a problem five, ten, or fifteen years away: it’s unfolding now, while we’re in the middle of updating our frameworks and mental models. In the Q&A of Lodge’s CRADLE talk, Corbin echoed Sharpe’s comment on disability: “the chronic problem is not that future problem… it’s now the acute problem. Perhaps people haven’t realised it.”

If the AIAS was a product of its time, and earlier versions are now insufficient for the complexities of the technology, then we should expect the same trajectory for two-lanes, particularly as new technologies make the physical exclusion of AI from assessment increasingly untenable. The question is whether we learn from the AIAS journey, engage seriously with the critiques, and adapt to context before the lethal mutations set in, or whether we repeat the cycle as we go beyond scales.

So, what are we going to do?

In the Q&A, Phillip Dawson confessed his “secret worry” to the audience: that we’ll never actually crack this. That if we abandon the desire to ring-fence and exclude AI from some assessment moments, we won’t find a way to do the alternative that’s fit for assurance purposes. Lodge’s answer was honest: “No. Phill could be right.”

Of course we’ll never crack it. We didn’t crack assessment before AI, and we probably won’t crack it after. But we don’t benefit from digging into our own trenches, defending the frameworks we’ve built as though they were permanent structures rather than scaffolding. The task is to keep building, keep testing, and keep being honest about what breaks and when the scaffolds need to be pulled down. Whatever comes next will need to account for what scales and traffic lights and lanes have struggled with: that AI use is not a single, discrete act but something increasingly woven into how people think and work; that the same framework lands differently in a lecture hall of 300 and a Year 9 English class of 25; that disability, place, culture, and context are not edge cases to be patched in later but conditions that any serious approach has to be built around from the start. And it will need to be built with the understanding that it, too, will be insufficient for whatever comes after it.

Every framework produced to handle AI in education, including our own, has been useful for a time and insufficient for what came next. That isn’t failure: it’s how the work gets done. I’ve spent three years watching educators do great things with, without, through, and often against AI. I expect the next three will be no different.

References

Bassett, M. (2026, May 16). Blinded by the (traffic) lights: The intellectual bankruptcy of AI use scales. Dangerous AIdeas. https://dangerousaideas.substack.com/p/blinded-by-the-traffic-lights-the

Corbin, T., Sharpe, S., & Dawson, P. (2026). On AI glasses and wearable AI in assessment. Assessment & Evaluation in Higher Education. https://doi.org/10.1080/02602938.2026.2661367

Curtis, G. J. (2025). The two-lane road to hell is paved with good intentions: Why an all-or-none approach to generative AI, integrity, and assessment is insupportable. Higher Education Research & Development, 44(8), 2151–2158. https://doi.org/10.1080/07294360.2025.2476516

Furze, L. (2021, July 24). You’re doing coursework wrong: Why SACs are not exams. Leon Furze. https://leonfurze.com/2021/07/24/youre-doing-coursework-wrong-why-sacs-are-not-exams/

Furze, L., Perkins, M., & Roe, J. (2024). The AI Assessment Scale (AIAS) in action: A pilot implementation of GenAI-supported assessment. Australasian Journal of Educational Technology. https://ajet.org.au/index.php/AJET/article/view/9434

Furze, L. (2026, April 20). Shadow AI: Bringing covert AI use out of the dark. Leon Furze. https://leonfurze.com/2026/04/20/shadow-ai-bringing-covert-ai-use-out-of-the-dark/

Guest Pryal, K. R. (2025, December 24). The dangerous over-accommodation myth. Psychology Today. https://www.psychologytoday.com/au/blog/living-neurodivergence/202512/the-dangerous-over-accommodation-myth

Jones, W. M., & Wiliam, D. (2022, October 13). Lethal mutations in education and how to prevent them. Evidence Based Education. https://evidencebased.education/resource/lethal-mutations-in-education-and-how-to-prevent-them/

Bridgeman, A., Liu, D., and Weeks, R. (2024, September 12). Program level assessment design and the two-lane approach. Teaching@Sydney. https://educational-innovation.sydney.edu.au/teaching@sydney/program-level-assessment-two-lane/

Lodge, J. (2026). Entangled intelligence [Video]. YouTube. https://youtu.be/Igmg0FDvW3I

Mangan, K. (2026, May 23). Are that many students really faking disabilities? The Chronicle of Higher Education. https://www.chronicle.com/article/are-that-many-students-really-faking-disabilities

Maling, C., & Noske, C. (2025). Writing places in the spaces of AI. Australian Literary Studies, Special Issue: AI and the Future of Literary Studies. https://www.australianliterarystudies.com.au/articles/writing-places-in-the-spaces-of-ai

Perkins, M., Roe, J., & Furze, L. (2025). How (not) to use the AI Assessment Scale. Journal of Applied Learning and Teaching, 8(2), 14–23. https://doi.org/10.37074/jalt.2025.8.2.15

Perkins, M., Roe, J., & Furze, L. (2025). Reimagining the Artificial Intelligence Assessment Scale: A refined framework for educational assessment. Journal of University Teaching and Learning Practice. https://open-publishing.org/journals/index.php/jutlp/article/view/1707

Thomas, R. (2026, February 7). How Australia’s university students are using AI to cheat their way to a degree. Weekend Australian Magazine. https://www.theaustralian.com.au/weekend-australian-magazine/how-australias-university-students-are-using-to-ai-to-cheat-their-way-to-a-degree/news-story/2bd02fe5c5dce5c74914fc01bf883df0

Wall, V. (2025, October 27). From labels to learning: Rethinking AI assessment in history departments. Disrupted History. https://disruptedhistory.com/2025/10/27/from-labels-to-learning-rethinking-ai-assessment-in-history-departments/

Walsh, D. (2022). The curse of knowledge: Why experts struggle to explain their work. https://mitsloan.mit.edu/ideas-made-to-matter/curse-knowledge-why-experts-struggle-to-explain-their-work

Wiggins, G., & McTighe, J. (2012). Understanding by Design framework [White paper]. ASCD. https://files.ascd.org/staticfiles/ascd/pdf/siteASCD/publications/UbD_WhitePaper0312.pdf

Dr Leon Furze