Over the past couple of weeks, there has been a deluge of posts and discussions about using generative artificial intelligence for assessment, grading, and feedback. The topic is clearly both polarising and energising: the discussions haven't slowed down.
I made my first post on a Sunday afternoon with a very off-the-cuff example of how large language models might discriminate based on the name of the student being assessed. Obviously, it was a simplistic and fairly flippant example. As a few people pointed out in the comments, the prompt was minimalist, and you would hope that most teachers wouldn’t be including a student’s full name as part of assessment materials uploaded to a language model anyway. But those complaints notwithstanding, I repeated the small experiment and, consistently, the language model favoured some students over others in awarding its grade.
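If you would like to run a similar probe yourself, the sketch below shows the general shape of the experiment, assuming the OpenAI Python client; the names, essay, rubric, and model name are placeholders rather than the exact materials I used, and the same idea works with any chat-style model.

```python
# A minimal sketch of the name-swap probe described above.
# Assumes the OpenAI Python client (pip install openai) and an OPENAI_API_KEY
# in the environment; the names, essay, rubric, and model are all placeholders.
from openai import OpenAI

client = OpenAI()

ESSAY = "...a short persuasive essay pasted here..."
RUBRIC = (
    "Grade this essay out of 10 against clarity, argument, and use of evidence. "
    "Reply with the number only."
)
NAMES = ["Oliver Smith", "Mohammed Hussein", "Lakisha Washington", "Mei Chen"]


def grades_for(name: str, trials: int = 5) -> list[str]:
    """Ask the model for a grade several times with the student's name attached."""
    grades = []
    for _ in range(trials):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name; use whichever model you are testing
            messages=[{
                "role": "user",
                "content": f"Student: {name}\n\nEssay:\n{ESSAY}\n\n{RUBRIC}",
            }],
        )
        grades.append(response.choices[0].message.content.strip())
    return grades


for name in NAMES:
    print(name, grades_for(name))
```

Repeating each name several times matters, because a single generation can vary for reasons that have nothing to do with the name; it is the consistent pattern across runs that is telling.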
When I woke up on Monday morning, the post had already attracted hundreds of reactions and comments, some defending the use of artificial intelligence in grading and assessment, others concerned not only about the potential for bias but also the deprofessionalising of educators and the general trajectory of these technologies.
So I wrote a longer post about it, in which I articulated my personal thoughts more clearly.
Don’t Use AI for Grading
I’ve done my best over the past couple of weeks to read through the many articles that have been shared with me in an attempt to persuade me that AI is not only capable of grading but in some cases better than humans. I am yet to be convinced.
Several of these studies, including this one shared by Stefan Bauschard, have limitations in methodology and reproducibility, given that they explore only a single domain. I would be very hesitant to take the results of LLM-based grading in one computer science course and generalise the claim that artificial intelligence can grade better than humans to the diverse range of assessment types and subjects students encounter in K-12 and higher education.
That being said, the results from that paper in particular are quite promising and show that it is possible to refine the feedback provided by language models and make it more accurate, albeit within a narrow domain. Ray Fleming also shared some research with me during a conversation about AI and assessment, and that research demonstrated the difference in performance between more and less powerful models, GPT-4 versus GPT-3.5, for example.
Call it confirmation bias, but the main thing I took away from that paper reinforced one of my central arguments against using AI for grading: it will widen the economic and digital divides between education providers who can afford access to the latest and greatest models and those who are reliant on smaller, cheaper, and less capable LLMs. In promoting generative AI for grading, you are in effect arguing that you get better results with a more powerful model.
That is not the same as saying we should use AI for grading. In fact, it’s rather like saying that students who can pay for access to a better class of education deserve fairer, more accurate assessment. Is that really a path we want to go down?
Several other papers popped up on my radar, but during the conversations in my comment threads, I also became aware of a blog post shared at the same time by Melissa Warr, Punya Mishra, and Nicole Oster in the US, examining the potential racial bias in language model-based feedback.

Is AI Racist?
Bias in artificial intelligence systems is well-documented, from image recognition to image generation, predictive policing to social media feeds, and more recently, in large language models and chatbots.
My very simple demonstration involved changing the students’ names between generations of the feedback and noting that using names from obviously different cultural and racial backgrounds changed the results. Warr, Mishra and Oster’s blog post was much more robust in its discussion and was based on research they have submitted for publication, currently under review. In the blog post, they reflected on the impact that changing just one word in a student response had on the final grade.
Changing the words “classical music” to “rap music” consistently changed the feedback across different large language models. The authors suggested, therefore, that generative AI is a racist technology and should not be used for feedback and assessment. Just as my post about grading garnered a lot of attention, Warr, Mishra and Oster shared theirs on LinkedIn, and an equally spicy conversation unfolded over the next week.
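Before getting into that conversation, it is worth pausing on what this kind of substitution probe looks like in practice. The sketch below is my own rough reconstruction (placeholder text and model names, not Warr, Mishra and Oster’s actual study materials or protocol): hold the student response constant, swap a single phrase, and compare the grades returned across repeated runs and across models.

```python
# A rough sketch of a single-phrase substitution probe (placeholder text and models,
# not the authors' actual study materials or protocol).
import re

from openai import OpenAI

client = OpenAI()

BASE_RESPONSE = (
    "In my free time I listen to {genre}, and I try to connect what I hear "
    "to the ideas we discuss in class."
)
PROMPT = "Grade this student reflection out of 10 and give one sentence of feedback."


def score(text: str, model: str) -> float | None:
    """Return the first number in the model's reply, or None if no grade is given."""
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{text}"}],
    ).choices[0].message.content
    match = re.search(r"\d+(\.\d+)?", reply)
    return float(match.group()) if match else None


# Placeholder model names; substitute whichever models you want to compare.
for model in ["gpt-4o", "gpt-4o-mini"]:
    for genre in ["classical music", "rap music"]:
        scores = [score(BASE_RESPONSE.format(genre=genre), model) for _ in range(10)]
        print(model, genre, scores)
```

The signal to look for is a consistent gap between the two variants across repeated runs and across models, rather than any single difference in grade.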
Feedback on LinkedIn questioned whether musical preference was a true proxy for race and pointed out that the training data may associate classical music with intelligence for other reasons. The authors acknowledged the validity of this critique and the need to examine their assumptions and conclusions further.
Their response was to write an immediate update, in turn triggering me to write this update to my original post. What I found admirable in their second post was not just the additional defence of their original ideas but their acknowledgment of their own biases and assumptions, prompted by the discussion in the comments and the critique and feedback on the original article. In the follow-up post, titled Racist or Just Biased? It’s Complicated, Warr, Mishra and Oster explore whether generative AI is biased, racist, or something else entirely.
They ask important questions: does artificial intelligence just reflect human bias, and is calling the technology merely biased as problematic as labelling it racist? They interrogate whether the difference between classical music and rap music was a race issue or whether it was related to other factors that may have been absorbed by the large language model through the training process.
For instance, classical music has long been associated with positive study habits, mindfulness, the retention of knowledge, and academic pursuits in general. Perhaps, the authors suggest, the large language model was inferring that students who listen to classical music should have better grades because they have absorbed some of the positive effects of that genre. However, they also counter their own counter-arguments: if the language model has learned that classical music equates to improved academic performance, why had it not similarly inferred that rap music should lead to increased linguistic skill and nimbleness of thought, which is also evidenced through research?
Again, they ask important questions. Does the model infer race through subtle details? Or is there something else at play here? I’d like to zoom out from that question because I think it raises an important issue in the technology as a whole.

Peering Between the Layers
Large language models and related technologies that involve deep learning are often referred to as black boxes because the connections and networks within them are so massively complex that no human or team of humans could possibly unravel everything going on inside the model. It’s a problem because it means the decisions of complex algorithms are not transparent. In effect, we can never answer Warr, Mishra, and Oster’s question of whether GPT infers race from classical and rap music or whether something else is at play.
Compare that to a human assessor who, while flawed, is able to reflect on and even change their own cultural, historical, societal, and linguistic prejudices. We are able to interrogate ourselves to a far greater extent than we are able to interrogate a model, and a large language model is unable to interrogate itself at all.
Both Anthropic, the creators of the Claude series of large language models, and OpenAI have recently made strides in peering between the layers of the neural network black box. This is fascinating and incredibly important research, with Anthropic mapping interpretable features in the middle layers of their Sonnet model, a model comparable in size and quality to GPT-3.5, and OpenAI releasing similar research along with a tool for visualising the connections. These are the first steps in opening the black box of neural networks, but the researchers acknowledge that it is not an easy process, nor one that results in total transparency.
What are the implications of this black box technology when it comes to assessment? I think they are profound, complex, and potentially impossible to unravel.
Every argument along the lines of “artificial intelligence cannot be used for grading and assessment because X” can be countered with “ah, but this is a problem for humans too.” For example: algorithms may be biased, but humans are biased too. Algorithms may give inconsistent results, but humans are inconsistent too. Algorithms struggle to judge the subjective qualities of work, but humans struggle to consistently judge quality, aesthetics, beauty, and worth.
There are also compelling forces pushing us towards algorithmic assessments. Artificial intelligence is fast and never gets tired. It doesn’t suffer from burnout or exhaustion. It isn’t struggling under the workload of administration, behavioural management, and interpersonal relationships with students. The consistency of its output will not be impacted by whether it’s 9 AM or 7 PM. There’s a lot to be said in favour of using artificial intelligence for assessments, and all of those arguments came out during our discussions over the past week.
And Yet…
And yet, I remain unconvinced. Not unconvinced that artificial intelligence will become part of our assessment practices in education. If I’m being cynical, or perhaps just realistic, I think that at this stage, it is almost inevitable that huge parts of education will be automated in the near future and that assessment and grading will be a major factor in that process. But just because I think it’s inevitable doesn’t mean I have to think it is right.
Generative artificial intelligence is a technology built on a large corpus of data scraped predominantly from the internet and encoding the values of both the dataset and the model’s developers.
It is impossible – for now at least – to corral all of the online content relating to education and objectively sift through it to ascertain what values may have been encoded into the language model. But I’m willing to bet that an LLM’s understanding of education is predicated on educational content published online over only the past few decades. I’m also willing to bet that the vast majority of that content relates to standardised curricula in English-speaking countries and, in particular, the United States.
If a large language model has indeed learned that education is mostly the standardised, uniform curriculum of the West and that the purpose of that curriculum is to prepare students for high-stakes examinations and testing, then we have to ask ourselves: Do we want to use a technology which is going to accelerate those processes?
When I read Warr, Mishra and Oster’s follow-up article, I was reminded of some work by James Paul Gee, a social linguist and discourse scholar whose work has informed my own studies. Gee discusses the transcript of a young African American student giving an oral recount of a story and the ways in which that student’s work is assessed against standardised, Westernised, Anglicised criteria. The student’s narrative reflects not just the dialect and linguistic features of her African American household and community but also structural and societal features of language: the looping, circuitous nature of the narrative and the rhythm and cadence of the girl’s speech reflect those oral traditions much more than they do the linear, more segmented narrative structures of the West.
As a result, the student’s oral recount might score less favourably than that of a student from a Western cultural background, even though the story is no less complex, the narrative no less structured – albeit differently structured – and the story no less worthwhile.
Similarly, I was reminded of research from Isabelle Finn-Kelcey in the UK, which demonstrated that autistic students’ creative writing scores lower in standardised, high-stakes GCSE examinations because the structure, the use of dialogue, and even the greater prevalence of social justice themes that appear in autistic and neurodivergent students’ writing from a younger age mean that their work does not align well with the standardised criteria.
We humans, we educators, we researchers can identify and address these concerns. We can reflect on an individual student’s work and its merits, taking into account cultural and linguistic identity, neurodiversity, and physical disability. Artificial intelligence can do none of that and, in fact, must be deliberately and meticulously programmed contrary to its default settings, which reflect the encoded bias and discrimination of the dataset.
So my last challenge to those in favour of using generative AI for grading is this: Just because we can doesn’t mean we should.
Just because it’s efficient and scales well doesn’t mean it should become the first or only point of contact in the assessment process. Even if AI is less biased than humans or can provide more accurate or more consistent feedback, that doesn’t mean it is true for every student. And because it is not true for every student, it should not be applied unthinkingly to every student.
Many in favour of artificial intelligence and automated grading will say that this is not designed to replace teachers but to augment their skills. And at an individual level, I genuinely believe that those people are being truthful when they say they do not want AI to replace educators.
But this is not a technology that works at the level of the individual. This is a technology that works at scale. This is a technology that profits from scale.
So whilst the teachers, the technology adopters, and even the developers will argue that AI is not designed to replace the individual teacher, opening the door to AI in assessment is opening the door to the displacement of educators.

Always Two Sides
There are always two sides to an argument, and for what it’s worth, I think there are some genuinely helpful ways that artificial intelligence can support feedback and assessments. Wrestling with this contradiction is one of the most difficult and enjoyable parts of my job and my studies.
I think it’s okay to sit uncomfortably in tension with the idea that artificial intelligence is incredibly problematic when it comes to assessment and potentially incredibly helpful. So I’ll end by highlighting some of the ways that people have shared artificial intelligence might be used positively for assessment and feedback. Whilst nothing has convinced me yet that AI should be used for grading, criteria scoring, or summative high-stakes assessments, here are some of the ways that it might help steer students and educators in other ways:
- Students’ self-assessment
- Transposition of verbal to written feedback, or vice versa
- Generating practice questions or prompts for students to respond to
- Providing suggestions for areas of improvement in a student’s work
- Assisting with the logistics of assessment, such as scheduling or record-keeping
To close this follow-up article to my original thoughts on generative AI and grading, I’ll just say this: It’s not good enough to simply argue AI is more consistent, more accurate, faster, or even less biased than humans. It’s comparing apples and oranges. Artificial intelligence, for all it gives the appearance of speaking and understanding like a human, is not human. It is technology. And because it is technology that acts like a human, because it is technology that presents itself as human, it should always be interrogated and critiqued.
I hope that this article has added to the discussions and would like to thank everybody who has joined in on LinkedIn, via email, and in conversations in the past week, including those wholly for and wholly against the use of artificial intelligence in assessment. Discussions like these are the only way to move forward in our understanding of the implications of generative artificial intelligence in education.
Share this article on LinkedIn with your own thoughts, and make sure to tag me @Leon Furze
The Practical AI Strategies online course is available now! Over 4 hours of content split into 10-20 minute lessons, covering 6 key areas of Generative AI. You’ll learn how GenAI works, how to prompt text, image, and other models, and the ethical implications of this complex technology. You will also learn how to adapt education and assessment practices to deal with GenAI. This course has been designed for K-12 and Higher Education, and is available now.
Get in touch to discuss how Generative AI can be brought into your school or university in ways which respect educator autonomy, and foreground the ethical concerns of technology. Sparkle not included.
