AI Detection in Education is a Dead End

When you live in a research/social media bubble like I do, it’s easy to take certain things for granted. For example, I always overestimate the number of people who are using generative AI regularly in their day-to-day work.

The reality, as of April 2024, is that the majority of people within and outside of education haven’t had the time, the interest, or the inclination to use much generative AI beyond free tools like the unpaid version of ChatGPT.

Something else I take for granted is the fact that AI detection tools do not work. Since the release of ChatGPT in November 2022, universities have been confronted with a number of products claiming to detect AI-generated text. Largely, these tools have been born out of fear that large language model-based technologies like ChatGPT will be used by students to cheat on assessment tasks.

It’s an understandable and entirely valid concern, especially given the statistics on how many students engage in academic misconduct (and since many of those studies rely on self-reporting, the real percentages are probably higher still). But companies developing generative AI detection tools often market to education providers in a way which is predatory, driven largely by commercial rather than academic interests.

There are already dozens of AI detection tools on the market. To avoid giving any of them any free publicity, I’m not going to mention any of them directly in this article. Suffice it to say that since I started working with generative artificial intelligence two years ago, I have yet to see a detection tool that is reliable or accurate.

Outside of my bubble, where I have easy access to new research and the ability and inclination to test these tools myself, many education providers are still in the dark when it comes to detection tools, and they can be lulled into a false sense of security by the companies selling them.

In this post, I’ll discuss some of my personal objections to AI detection tools, and explore a new piece of research that once again proves AI detection tools don’t work.

How Do AI Detection Tools Work?

Unlike traditional plagiarism checkers, which compare submissions to a large database of existing text (and don’t get me started on the amount of students’ intellectual property being hoarded by these companies for profit), AI detection tools use pattern matching to identify generated text.

Language models operate by processing huge amounts of text data and learning probabilistic rules about how language works. They then use these rules to create novel text.

However, language models often have tells, because their writing can be more predictable than human writing. For example:

  • Lack of variation in sentence structure
  • Overuse of certain words such as conjunctions (e.g. “however”, “furthermore”, “in addition”, “in conclusion”)
  • Overuse of particular vocabulary (e.g. “delves”, “navigates complexities”)
  • Predictable sentence and paragraph lengths
  • Predictable grammatical constructions

Detection tools work by pattern matching against these features and, in some cases, also use traditional plagiarism detection methods to look for text reproduced verbatim from a language model’s training data.
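
To make the idea concrete, here’s a minimal, purely illustrative Python sketch of the kind of surface-feature scoring described above. The word list and weights are invented for this example; real detectors rely on trained classifiers and probability-based measures rather than a hand-written rule list, and even those are unreliable.

```python
import re
from statistics import pstdev

# Stereotypical "AI" connectives and vocabulary (invented list, for illustration only)
TELL_WORDS = {"however", "furthermore", "in addition", "in conclusion", "delves"}

def ai_likeness_score(text: str) -> float:
    """Crude 0-1 score: uniform sentence lengths plus frequent 'tell' words."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    # Low variation in sentence length pushes the score up
    mean_len = sum(lengths) / len(lengths)
    uniformity = max(0.0, 1.0 - pstdev(lengths) / mean_len)
    # Frequent use of the 'tell' words (per sentence) also pushes the score up
    lowered = text.lower()
    tell_rate = sum(lowered.count(w) for w in TELL_WORDS) / len(lengths)
    return min(1.0, 0.7 * uniformity + 0.3 * min(tell_rate, 1.0))
```

Even this toy highlights the core weakness: a careful human writer with even sentence lengths and formal connectives will score “AI-like”, while generated text that has been lightly reworded slips under any threshold.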

Why It Doesn’t Work

Although AI detection tools can successfully identify some generated content, there are several points at which the tools break down, making them unsuitable as an academic integrity checkpoint.

First of all, large language models continue to develop at an incredibly rapid pace. A powerful model like Claude 3 Opus from Anthropic produces much more varied and less predictable text than GPT-4, which itself produces more sophisticated text than the free version of ChatGPT or other more limited models, such as the free version of Google Gemini or Microsoft Copilot when it is running on GPT-3.5.

This means that using a more powerful model reduces the efficacy of detection tools until those tools are tweaked and improved to account for the new model. Essentially, it’s an arms race between generation and detection, and one which, given the resources of developers like Microsoft, Google, and OpenAI, detection tool companies cannot hope to win.

It’s also easy to circumvent or break detection tools using adversarial techniques: deliberate prompting tactics designed to work around the detectors. Some examples include (with a rough sketch of the “ping-pong” approach after this list):

  • Instructions in the prompt to vary sentence structure
  • Instructions to incorporate deliberate errors and make the outputs more human-like
  • Ping-ponging from one model to another, laundering the outputs
  • Using more powerful models when the limitations of a specific tool are known
  • Creating system prompts designed to circumvent as many points of detection/prediction as possible
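
As a rough sketch of the “ping-ponging” tactic in the list above (and only as a sketch), the snippet below shows the general shape of the approach. The generate() function is a hypothetical placeholder for whichever provider API is being used, and the rewrite instruction is invented for illustration rather than drawn from any of the research discussed later.

```python
# Sketch only: generate() stands in for a real API call to whichever model is named.
REWRITE_INSTRUCTION = (
    "Rewrite the following so it reads like a student essay: vary sentence length, "
    "use plainer vocabulary, and keep the meaning intact.\n\n{text}"
)

def generate(model: str, prompt: str) -> str:
    """Hypothetical placeholder for a call to the named model's API."""
    raise NotImplementedError("wire this up to a real provider")

def launder(draft: str, models: list[str], rounds: int = 2) -> str:
    """Pass a draft back and forth between models to blur any single model's 'tells'."""
    text = draft
    for _ in range(rounds):
        for model in models:
            text = generate(model, REWRITE_INSTRUCTION.format(text=text))
    return text
```

Each pass changes vocabulary and sentence rhythm, which is precisely what the pattern-matching features described earlier rely on.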

AI Detection as an Equity Issue

Having explored a few of the reasons why AI detection tools can fail, it’s important to now consider why they shouldn’t be used at all as a point in an academic integrity conversation.

To do so, I’m going to illustrate the point using a scenario. Imagine four students complete the same assessment task. The conditions of the assessment task specify that no generative artificial intelligence tools may be used. Detection tools will be employed as an integrity measure after submission. The students must complete this assessment task in their own time, outside of the school/university.

Ashley is a regional student with limited access to digital technologies at home. They are therefore reliant on their institution’s computers and network. The institution has blocked direct access to generative AI tools.

Ashley checks ChatGPT, Gemini, and Copilot, but since they’re blocked, ends up having to use the free credits of a third-party application built on top of GPT-3.5. They’re also limited to completing the task during the time they have on campus: at lunchtime or immediately after classes, before returning home.

Bob is an English as an additional language (EAL) student from a migrant family where English is not spoken in the home. Bob uses the free version of ChatGPT because he has heard from fellow students that it is a good translation tool. He uses ChatGPT to translate both the assignment questions and his answers.

Alice comes from a low socio-economic background with low levels of literacy in the home and limited digital literacy. Alice uses Microsoft Copilot at home on her phone as a way to understand the requirements of the task and to help make her writing sound more academic.

Marie is an English first language speaker from a wealthy household. Her mother is a software engineer and her father is an intellectual property lawyer. Marie writes her response using her father’s access to Claude 3 Opus, which requires a US$20-a-month subscription. She inputs the assignment questions and copies the generated response verbatim.

Just for good measure, and because she knows how these tools work, she pastes the response into GPT-4 (another subscription-based model) and then back into Claude with the instruction to make it a little more sophisticated, a little more varied, and to incorporate some direct quotes from the class materials she uploads as a PDF (a capability only available in the paid models). Marie’s final response is comprehensive, accurate, and sophisticated. It is also entirely fabricated by GenAI.

The four students submit their work independently. The detection tool flags:

  • Ashley’s work as 90% AI-generated
  • Bob’s as 100% AI-generated
  • Alice’s as 85% AI-generated
  • Marie’s as 20% AI-generated

Of the students, you could argue that Bob and Alice attempted to use generative AI as an assistive technology to help understand the task and to form their answers. Alice’s use was perhaps a little bit more heavy-handed. And all four students have certainly breached the requirements of the task by using generative AI in the first place.

The fact is, the student who used the generative AI tools with the most deliberate, nefarious intent was Marie, who was also the least likely to get caught. Marie is the student who was already advantaged by the education system, advantaged by her socio-economic status, and now advantaged by a heavy-handed approach to policing the technology.

This is the equity issue of generative AI detection:

GenAI detection tools privilege students who are English first language, have access to paid large language models/applications, and are more digitally literate.

AI Detection is a Workload Issue

Now let’s shift our attention to an issue which is close to my heart. In 2016, I completed my Master’s in Education, which culminated in an action research project exploring how professional learning can mitigate the risk of teacher burnout. During that research, it became very clear that the factors contributing to teacher burnout are many and varied. Amongst those factors is the workload imposed by assessment and reporting practices.

In both K-12 and higher education, assessment is big business, and at the end of most assessment work, educators spend hours marking, moderating, and reporting. Assessment is an important but time-consuming part of the job.

Checking for and monitoring cases of academic misconduct is unfortunately part of this task. In many senior secondary and higher education institutions, this includes processes such as automatic plagiarism checking, and the responsibility generally falls to the teacher or lecturer in charge of the class.

Typically, the process goes something like this: For assessment tasks that are completed outside of examination conditions, in electronic format, students are required to submit their work through a plagiarism detection platform, often built into the learning management system (LMS). Either students upload to this platform directly or their teachers upload a collection of assignments in bulk.

The assessments are processed by the plagiarism checking system and reports are generated. Having used these tools myself for senior secondary English and for undergraduate teacher training courses, I can attest that whilst they’re not hugely time-consuming, they do add a layer to the assessment and reporting process. If a student’s work is flagged beyond a particular threshold (say, 20%, to allow for genuine quotes and citations), then the assessor has to go in, manually identify the passages flagged as plagiarism, and report back to the student. In extreme cases of plagiarism, this will then kick along to whatever the next stage of the institution’s academic integrity process is, for example resubmission, a zero, and so on.

Whilst this is a brief imposition on educators, the use of similar approaches for generative AI is much more burdensome. This is because, unlike plagiarism tools, generative AI detection tools do not give a clear-cut result. The reported percentage likelihood of AI-generated content is less accurate than a plagiarism match, more open to interpretation, and therefore requires more consideration on the educator’s part. It leads to more nuanced and potentially more stressful conversations between the educator and the student, with the potential for much more pushback from students and many more appeals. In many contexts, both students and parents are aware that detection tools are not as accurate as plagiarism tools.

The added time and stress of using generative AI detection tools is a burden on educators who are already in an industry with a high risk of burnout and attrition.

The Practical AI Strategies online course is available now! Over 4 hours of content split into 10-20 minute lessons, covering 6 key areas of Generative AI. You’ll learn how GenAI works, how to prompt text, image, and other models, and the ethical implications of this complex technology. You will also learn how to adapt education and assessment practices to deal with GenAI. The course has been designed for both K-12 and Higher Education.

New Research

Last year and early this year, I had the privilege of working on papers on the AI Assessment Scale with Dr. Mike Perkins, Dr. Jasper Roe, and Associate Professor Jason MacVaugh. I’ve detailed the AI Assessment Scale elsewhere, and you’re welcome to download a free ebook of activities aligned to the scale which allow for assessment with generative AI.

Mike and Jasper, along with other authors, have just published a preprint of their latest research testing over 800 samples of writing against various detection tools. Mike shared the research on LinkedIn with this comment:

Our latest preprint shows the results of 805 tests of human samples, initial GenAI output, and GenAI output after we applied adversarial techniques designed to evade detection by AI text detectors. We saw a non manipulated mean accuracy rate of 39.5%, dropping to 22.1% after the application of the adversarial techniques

The preprint can be found on arXiv here: https://arxiv.org/abs/2403.19148

The researchers also found a concerning rate of false accusations (15%), where the tools incorrectly flagged human-written samples as AI-generated. At the same time, a high percentage of AI-generated texts went undetected, and a lower rate of false positives appears to come with an increased rate of undetected content. This points to major risks on both sides: students being unfairly accused, and dishonest use of AI going unnoticed.
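
To illustrate that trade-off, here is a small sketch with made-up detector scores (the numbers below are hypothetical and not taken from the paper): raising the flagging threshold reduces false accusations of human writers but lets more AI-generated text through, and lowering it does the opposite.

```python
# Hypothetical detector scores, purely to illustrate the threshold trade-off.
human_scores = [0.05, 0.10, 0.22, 0.35, 0.48, 0.60]  # human-written samples
ai_scores    = [0.30, 0.45, 0.55, 0.70, 0.85, 0.95]  # AI-generated samples

def rates(threshold: float) -> tuple[float, float]:
    """Return (false accusation rate, undetected AI rate) when flagging scores >= threshold."""
    false_accusations = sum(s >= threshold for s in human_scores) / len(human_scores)
    undetected = sum(s < threshold for s in ai_scores) / len(ai_scores)
    return false_accusations, undetected

for t in (0.3, 0.5, 0.7):
    fa, missed = rates(t)
    print(f"threshold {t}: {fa:.0%} of humans wrongly flagged, {missed:.0%} of AI text missed")
```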

Interestingly, the outputs from different AI models had varying levels of detectability, with text from Google’s Bard being the easiest to identify compared to GPT-4 and Anthropic’s Claude. However, Bard-generated text also saw the biggest drop in detectability after applying adversarial techniques.

Several conclusions emerge, but alongside my other comments in this article, the key point is that AI detection tools are not only largely ineffective but also a short-term, ill-advised, and possibly unethical approach to academic integrity in light of generative AI. The current limitations of these tools underscore the need for a critical, nuanced approach if they are implemented in higher education, and highlight the importance of exploring alternative AI-aware assessment strategies.

Over the next few months, I’ll be writing extensively about approaches that K-12 and tertiary organisations can take to update their assessment strategies in ways which don’t rely on ineffective technologies.

Over the last few years, I’ve worked with dozens of schools and universities and served on the boards of several not-for-profits, and have been involved in strategic planning, teaching and learning, assessment, and of course generative artificial intelligence.

If you’d like to discuss any of these areas with your organisation, please get in touch via the following form:
