Don’t use GenAI to grade student work

As a former secondary English teacher, senior examination assessor, and lecturer for initial teacher education, I understand the allure of using Generative AI (GenAI) for grading student work. We’re all familiar with the workload of assessment and reporting. The idea of a tool that could save time and streamline the grading process is undeniably appealing. It’s no surprise, then, that the market is flooded with AI-powered grading solutions, all promising to make our lives easier.

However, as I’ve explored in previous posts on the capabilities and limitations of GenAI, I firmly believe that this technology is fundamentally unsuited for high-stakes student assessment. At its core, GenAI generates probabilistic outputs based on patterns in training data, lacking true understanding and the ability to make qualitative judgments. This leads to inconsistency and bias in grading, raising serious concerns about fairness and reliability.

The use of AI in grading also raises a host of ethical and equity issues. As I wrote in “Generative AI doesn’t ‘democratize creativity’”, the notion that AI levels the playing field is often an illusion. In reality, relying on AI for grading may exacerbate existing inequities and privilege certain groups of students over others.

In this post, I’ll go deeper into the reasons why I believe GenAI should not be used for grading, drawing on recent experiments and real-world examples. I’ll also explore the potential risks and unintended consequences of AI-powered assessment. By the end, I hope to convince you that, despite the temptation, GenAI is a dead-end when it comes to evaluating student work.

The fundamental limitations of LLMs

At the heart of the problem with using GenAI for grading is the fundamental way these systems work. GenAI produces outputs based on probabilistic patterns in its training data, without any real understanding, reasoning, or ability to make qualitative judgments. Essentially, it makes what appear to be educated guesses but are, in effect, just statistical predictions.

This means that the grades GenAI assigns can vary significantly based on seemingly minor differences in prompt language or details in the student work. In a quick experiment, I fed the same Year 9 persuasive writing piece into ChatGPT multiple times, changing only the student name. The grades ranged from 78 to 95 out of 100 – a massive discrepancy based on a single variable.

Check out the full post on LinkedIn
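
For readers who want to replicate this themselves, below is a minimal sketch of the experiment in Python. It assumes the OpenAI Python SDK (v1 or later) and an API key in your environment; the model name, file name, student names, and prompt wording are placeholders for illustration, not the exact prompt I used.

```python
# Minimal sketch: grade the same essay repeatedly, changing only the student name.
# Assumes the openai Python package (v1+) and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

essay = open("year9_persuasive_essay.txt").read()  # placeholder file name
names = ["Emily", "Muhammad", "Jack", "Lakeisha", "Wei"]  # illustrative names only

prompt_template = (
    "You are a Year 9 English teacher. Grade the following persuasive essay "
    "out of 100 and give a one-sentence justification.\n\n"
    "Student name: {name}\n\nEssay:\n{essay}"
)

for name in names:
    response = client.chat.completions.create(
        model="gpt-4o",  # swap in whichever model you want to test
        messages=[{"role": "user", "content": prompt_template.format(name=name, essay=essay)}],
    )
    print(name, "->", response.choices[0].message.content)
```

Run it a few times and compare the marks: the variance between runs, and between names, is the point, not the specific numbers.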

Such inconsistency completely undermines the fairness and reliability of the grading process. It creates a false sense of objectivity and accuracy, when in reality, the grades are no more meaningful than a roll of the dice.

But it gets worse. GenAI models are trained on vast datasets scraped from the internet, which can encode all sorts of societal biases and discrimination. The models can make inferences about student attributes like race, gender, or background based on their writing, potentially disadvantaging certain groups.

And here’s the rub: this kind of bias is a lot harder to detect and address compared to human grader bias. We’ve got strategies like anonymisation and moderation to mitigate human bias, imperfect as they are. But with AI, simply removing the student’s name doesn’t cut it. The bias is baked into the model at a much deeper level, based on the training data and the patterns it’s picked up.

So not only are the grades inconsistent and unreliable, they’re also likely to be biased in ways we can’t easily control for. It’s a recipe for disaster when it comes to fair, equitable assessment.

Improving the prompt only improves the appearance of accuracy

One of the biggest criticisms in the (lengthy) comment thread alongside the original post was that my prompt was overly simplistic. Why didn’t I use a more elaborate prompt, guiding the LLM’s response with criteria, an assignment sheet, or maybe even exemplars of marked student work?

It’s a fair question. And yes, more detailed prompts can help anchor the AI’s responses and make them somewhat more consistent. But let’s be clear – it’s a superficial fix. The AI still doesn’t have any deep understanding of the work it’s grading. It’s just getting better at pattern-matching and churning out responses that fit the rubric.
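
To make that concrete, here is a hypothetical sketch of the kind of “elaborate” prompt commenters suggested. The rubric wording and file names are invented for illustration, not drawn from my actual marking materials.

```python
# Hypothetical "better prompt": criteria, task sheet, and a marked exemplar are all
# supplied as context. File names and rubric wording are invented for illustration.
rubric = """\
Criterion A - Argument (10): clear contention, logical development
Criterion B - Evidence (10): relevant, well-integrated support
Criterion C - Language (10): persuasive devices, audience awareness
"""

task_sheet = open("assignment_sheet.txt").read()   # placeholder: original task description
exemplar = open("marked_exemplar.txt").read()      # placeholder: a marked sample response
essay = open("student_response.txt").read()        # placeholder: the work to be graded

messages = [
    {"role": "system", "content": "You are an experienced Year 9 English assessor."},
    {"role": "user", "content": (
        f"Task:\n{task_sheet}\n\nRubric:\n{rubric}\n\n"
        f"Marked exemplar for calibration:\n{exemplar}\n\n"
        f"Now grade this response against the rubric:\n{essay}"
    )},
]
# Sending `messages` to any chat model produces rubric-shaped feedback, but repeat
# runs still return different marks: the anchoring is cosmetic, not comprehension.
```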

Others argued that human graders are biased and inconsistent too. And that’s of course true to an extent. But here’s the thing: we know about human bias, and we’ve got strategies to mitigate it, like blind marking and moderation. We invest in professional development for markers. When I assessed the senior English certificate examination (VCE English), we marked blind, double marked, and had a qualified team of expert assessors with years of training to moderate results. And crucially, human graders have the capacity for contextual understanding and empathy that AI just doesn’t.

The argument that “AI is biased, but so are humans” is a false equivalence: apples and oranges. While it’s true that both AI and humans can exhibit biases, the nature and impact of those biases are not comparable. Human biases stem from individual experiences, cultural backgrounds, and societal influences, whereas AI biases are typically the result of biases present in the training data or introduced by the designers. Additionally, human educators can actively work to recognise and mitigate their biases, while AI systems lack this self-awareness and agency.

Bias is an incredibly complex problem in AI systems, and there may be even more subtle issues than we’re aware of. It seems as though the more sophisticated models become, the better they are at inferring details about authors from even minor cues in their texts. For example, in experiments by Melissa Warr and Punya Mishra, swapping a single word in an assessment item (“classical” for “rap” music) dramatically changed the AI’s output.

These findings are supported in the article Dialect prejudice predicts AI decisions about people’s character, employability, and criminality by Valentin Hofmann, Pratyusha Ria Kalluri, Dan Jurafsky, and Sharese King, who demonstrate that “dialect prejudice has the potential for harmful consequences by asking language models to make hypothetical decisions about people, based only on how they speak.” This could, of course, impact students through their writing if AI is used to grade written assessments.

I often hear the argument that AI is improving at a breakneck pace, and these issues will be ironed out “any day now”. It’s the “this is the worst AI we’ll ever use” argument favoured by people like Professor Ethan Mollick. But while the tech is certainly developing fast, the fundamental problems aren’t going away.

Bigger language models and more training data alone won’t magically give AI genuine understanding or eliminate bias. We can’t take the hype and marketing at face value. As educators, we need to critically evaluate the evidence and implications, not just jump on the bandwagon.

The other equity issues of GenAI grading

One of my biggest concerns about using AI for grading is the potential for it to worsen existing equity gaps in education: access to AI tools is far from equal.

Imagine a student from a low-income family, living in a remote area with patchy internet. They’re relying on free, basic AI tools because that’s all they can access. Contrast that with a student from a well-off background, who can afford subscriptions to advanced models like GPT-4 and has top-notch devices at their disposal.

From the outset, the field is far from level. The student with access to better AI tools can generate higher quality work, iterate and refine it, and potentially game the system when it comes to things like detection tools. Meanwhile, the disadvantaged student is stuck with clunky, less sophisticated outputs.

Now, layer AI grading on top of that. If the grading AI is swayed by the polish and complexity of the work – which, as we’ve seen, it often is – then the student with better AI tools has an unfair advantage. They’re likely to score higher grades, through no fault or merit of their own.

Over time, this can exacerbate achievement gaps and reinforce privilege. It’s the classic “rich get richer” scenario, but with algorithms.

Flip it around, and consider what happens if schools, universities, and educators have access to differing levels of GenAI. What does it look like for a student whose work is assessed with GPT-3.5 compared to GPT-4? Efforts to “debias” AI have thus far only proven successful to an extent, and only for the more powerful models. For example, though it is far from perfect, GPT-4 exhibits less bias than GPT-3.5. So an educator or institution with the financial and technical resources to use a more powerful model will provide more sophisticated and potentially less biased feedback.

This is exacerbated by the fact that many educators and schools simply have not had the time to adjust to these technologies, and are being preyed upon by enterprising companies who aggressively push AI solutions. Many of these “AI for teachers” platforms build on top of GPT-3.5 or GPT-3.5 Turbo, as it would be more expensive to use GPT-4.

And that assumes the underlying problem can be resolved at all. A few months ago, Ryan Tannenbaum demonstrated the same inconsistency with GPT-3.5, GPT-3.5 Turbo, and Claude 2. It’s the experiment I replicated with GPT-4o, and one that will no doubt remain replicable for some time yet.

Who owns the contents of a student’s brain?

I have huge issues with the cavalier approach to student intellectual property, extending back to before Generative AI was even in the mix. Plagiarism detection software, for example, is a multi-billion dollar industry built on top of student creativity and knowledge. I have never seen an education provider where students are given a genuine choice to opt in to these platforms, and yet millions of pieces of their work are uploaded into the companies’ databases daily. This has allowed these companies to grow and scale, with the volume of work compounding the usefulness and marketability of the software.

This problem is now exacerbated by AI. While plagiarism detection tools are no longer effective, that isn’t stopping people from uploading students’ IP into technology platforms. I’ve written about this before, but a lot of these AI tools are already built on pretty shaky ground when it comes to data provenance and consent. As educators, we have a duty of care to make sure student data isn’t being misused or exploited.

So any use of AI in assessment needs to come with clear policies on data handling. No sharing student work with third-party AI providers without explicit, informed consent.

More appropriate uses of GenAI in assessment and feedback

While the use of GenAI for high-stakes, summative assessment is fraught with risks and limitations, as outlined in the previous sections, there are nevertheless several promising applications of this technology in education that warrant careful consideration and exploration. I also want to delineate between learning, assessment, feedback, and grading: these terms aren’t interchangeable.

One area where GenAI can offer significant value is in providing low-stakes, formative feedback to students. By leveraging the natural language processing capabilities of models like GPT-4, educators can offer immediate, targeted feedback on specific aspects of student work, such as grammar, spelling, and punctuation. GenAI can also highlight areas for improvement in the structure, clarity, or coherence of a piece of writing, and offer prompts to encourage deeper reflection or analysis.
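
As a rough sketch of the difference between feedback and grading, a request along these lines constrains the model to formative comments and explicitly withholds any mark. The system instructions and categories are my own illustrative assumptions, not a recommended template, and it again assumes the OpenAI v1 SDK.

```python
# Sketch: formative feedback only, no marks. Wording and file name are illustrative.
from openai import OpenAI

client = OpenAI()
draft = open("student_draft.txt").read()  # placeholder file name

feedback = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": (
            "You give formative feedback on student drafts. "
            "Comment on grammar, structure, and clarity only. "
            "Do NOT assign a grade, score, or ranking. "
            "End with two reflective questions for the student."
        )},
        {"role": "user", "content": draft},
    ],
)
print(feedback.choices[0].message.content)
```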

Crucially, however, such AI-generated feedback should be treated as a supplement to, rather than a replacement for, human feedback. As noted in my earlier post, Critic, Creator, Consumer, a balanced and thoughtful approach to GenAI involves recognising both its potential and its limitations. Educators should carefully review and contextualise the feedback provided by AI systems, and create opportunities for students to discuss, unpack, and apply that feedback in dialogue with their instructor.

Used judiciously, AI-generated feedback could encourage students to take greater ownership of their learning, engaging in self-assessment and revision of their work. It’s also a workload win: teachers no longer have to take home piles of books to make grammatical or functional corrections that, in all likelihood, students will never read anyway.

GenAI also has the potential to support adaptive learning and personalised support for students, though this is far from proven. By analysing patterns in student performance data, AI systems can identify specific skills or concepts a student is struggling with and recommend targeted resources or activities to address those individual learning needs: this is the basic premise of adaptive learning technologies, including some of the features in Khan Academy’s chatbot. This kind of “old school” AI-based analytics could be coupled with GenAI to generate customised practice questions or prompts, adjusting the difficulty or complexity based on a student’s demonstrated skill level. I’m yet to see this have a positive impact in practice, however.
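
For clarity about what that coupling might look like, here is a rough sketch, assuming a separate analytics layer has already flagged a skill gap. The skill label, level descriptor, and prompt wording are invented for illustration, and it again assumes the OpenAI v1 SDK.

```python
# Rough sketch: an analytics layer flags a weak skill, GenAI drafts practice items.
# The skill name, level descriptor, and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

# Pretend this came from prior performance data rather than being hard-coded.
student_profile = {
    "weakest_skill": "using evidence to support a contention",
    "level": "developing",
}

prompt = (
    f"Write three short practice questions targeting the skill "
    f"'{student_profile['weakest_skill']}' for a student working at a "
    f"'{student_profile['level']}' level. Make each question slightly harder than the last."
)

questions = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(questions.choices[0].message.content)
```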

As with all applications of GenAI in education, these adaptive learning systems must be designed and implemented with great care and attention to potential biases, limitations, and unintended consequences. They should be used to support and enhance, rather than replace, the role of human educators in diagnosing learning needs, building relationships, and guiding student growth.

Get the free eBook: Rethinking Assessment for Generative Artificial Intelligence. A 60-page eBook containing all of my articles on why detection doesn’t work, what to do instead, and how to rethink assessment for GenAI.

Conclusions, but no clear answers

Fundamentally, the notion of using a large language model (LLM) to grade student work is problematic, regardless of how sophisticated the prompt or input may be. While an LLM might generate feedback that appears thoughtful and thorough, it is essential to recognise that the model is producing a probabilistic output based on patterns in its training data, without the capacity for genuine reasoning or understanding. This holds true even if the input includes a detailed rubric, specific grading criteria, or sample student work with assessed comments. These additional contextual elements might help to anchor the LLM’s response and make it seem more convincing, but ultimately, the output remains a product of statistical inference rather than true comprehension.

Arguing that “it’s better if you use a better prompt” is a flawed premise because it fails to address the core limitations of LLMs in the context of evaluating student work. It’s like suggesting that a spell checker can be used to assess the aesthetic qualities of a poem if given enough guidelines – while the spell checker may identify words that are commonly associated with poetic language, it lacks the appreciation for imagery, emotion, and figurative expression that a human reader would bring to the analysis.

In the same manner, an LLM may generate feedback that seems appropriate on the surface, but it lacks the deeper insights, critical thinking, and subjective judgment that a human educator brings to the grading process. We may be flawed, but at least we have the capacity to think, reason, and evaluate.

The use of AI in high-stakes assessment also raises concerns about fairness, accountability, and transparency: all key aspects of the Australian Framework for Generative AI in Schools, along with other international education guidelines. While human graders may exhibit biases, they can be trained to recognise and mitigate them, whereas an LLM’s biases are inherent to its training data and architecture, and may be more difficult to identify and address. The appearance of objectivity in AI-generated grades may mask underlying disparities and hinder efforts to ensure equitable evaluation. I would argue that using AI to generate results, numerical grades, or final, summative evaluations actually contravenes the Framework.

The use of LLMs for grading student work is a misapplication of the technology that fails to appreciate the fundamental differences between human and artificial intelligence. Rather than seeking to automate the assessment process, we should focus on leveraging AI to support and enhance human educators’ abilities, while preserving the essential role of human judgment and expertise in evaluating student learning.

We need to begin by questioning why we grade work at all. Is assessment and grading simply a matter of competition, ranking, and placement? Do students actually need a letter or numerical grade at all? Or, as TEQSA asked recently in reference to the use of AI tools in assessment, are there other ways that students can demonstrate learning, and that educators can assess whether students have learned?

If nothing else, Generative AI is forcing us to have these difficult, sometimes uncomfortable conversations. Thanks to everyone who has joined in the discussion so far.

The Practical AI Strategies online course is available now! Over 4 hours of content split into 10-20 minute lessons, covering 6 key areas of Generative AI. You’ll learn how GenAI works, how to prompt text, image, and other models, and the ethical implications of this complex technology. You will also learn how to adapt education and assessment practices to deal with GenAI. This course has been designed for K-12 and Higher Education, and is available now.

I regularly work with schools, universities, and faculty teams on developing guidelines and approaches for Generative AI. If you’re interested in talking about consulting and PD, get in touch via the form below:

