Letting the Robots In

The web is changing beneath our digital feet, and many people haven’t noticed.

For thirty years, the visible parts of the internet have worked pretty much the same way: a human types something into a browser, a server sends back HTML, and the browser renders it into something readable. Every website you’ve ever visited, including this one, is built on that assumption. The audience is a person, sitting at a screen, reading.

That assumption is now breaking down thanks to large language models (LLMs) like GPT and Gemini.

How AI reads the web (badly)

When you ask Claude, or Gemini, or ChatGPT a question that requires current information, it doesn’t just pull from memory or training data. Almost all modern LLM-based applications can search the web, visit pages, and “read” them by pulling the content into their context windows. But “read” is doing a lot of heavy lifting in that sentence, because what the AI actually receives is a mess.

Take any blog post on this site. A human visitor sees a clean page with a title, some text, and maybe an image or two. An AI agent visiting the same page receives all of that, plus navigation menus, footer code, stylesheet references, JavaScript, cookie notices, metadata, social sharing buttons, and hundreds of lines of WordPress theme markup. The actual content of the article might represent 20% of what gets delivered. The rest is noise.
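You can get a rough feel for this yourself. Here’s a minimal sketch in Python – assuming the requests and beautifulsoup4 libraries are installed, and using a crude text-to-markup ratio as a stand-in for actual token counts – that estimates how much of a page is readable content:

```python
# Rough, illustrative measure of how much of a page is actual prose.
# The ratio is a crude proxy for tokens, not what an LLM literally counts.
import requests
from bs4 import BeautifulSoup

def content_ratio(url: str) -> float:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Remove the parts a human reader never sees as prose.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    text = soup.get_text(separator=" ", strip=True)
    return len(text) / len(html)

print(f"Readable text is roughly {content_ratio('https://example.com/'):.0%} of the raw page")
```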

[Image: an HTML code snippet showing a contact form with error messages and hidden input fields. Caption: “The world according to bots”]

This is problematic because AI systems work with limited context windows, essentially a cap on how much text they can process at once. Every line of navigation HTML or theme JavaScript that gets sent to an AI agent is space that could have been used for actual content. It’s like posting someone a handwritten letter but wrapping it in six layers of bubble wrap and packing foam: the content is in there, somewhere, but good luck finding it efficiently.

For training data, this was an acceptable trade-off: after all, AI companies weren’t about to actually ask anyone whether they could use their life’s work to train a model. Companies like OpenAI scraped the entire web (and downloaded pirated ebooks, and purchased millions of physical second-hand books and annihilated them), processed it at scale, and could afford to strip the junk. But we’re moving into a different era now: one where AI agents browse the web in real time, on behalf of users, looking for specific information. Inefficiency costs a lot more when you’re doing it live.

Markdown for agents

This is the problem that Cloudflare, the infrastructure company whose network sits in front of roughly 20% of the world’s websites, has just offered to solve.

Their new feature, “Markdown for Agents,” works like this: when an AI agent visits a website and says “I’d prefer markdown, please” (via a technical header in the request), Cloudflare intercepts the page, strips out all the HTML gubbins, and returns clean, structured markdown instead. No navigation. No theme code. No JavaScript. Just content.
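In practice, the negotiation is just an extra line in the request. A minimal sketch – assuming, as Cloudflare’s description suggests, that the agent signals its preference with an Accept: text/markdown header, and using a placeholder URL:

```python
# The same URL, two audiences. Whether markdown comes back depends entirely
# on the server (or the CDN in front of it) supporting the negotiation.
import requests

url = "https://example.com/blog/some-post/"

as_human = requests.get(url)  # default: the full HTML page
as_agent = requests.get(url, headers={"Accept": "text/markdown"})

print(len(as_human.text), "characters of HTML")
print(len(as_agent.text), "characters of (possibly) markdown")
```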

[Image: screenshot of a blog post about OpenAI’s decision to start advertising. Caption: “Markdown is much cleaner than HTML”]

The result is apparently dramatic. Cloudflare tested it on their own blog and found that the same page, delivered as markdown instead of HTML, used about 80% fewer tokens. Of course, the feature requires an additional paid Cloudflare plan and is designed for sites that already use Cloudflare as their CDN… God forbid anything in the AI world come for free, even something as simple as an HTML-to-Markdown converter. But the principle it establishes is significant: the web is developing a second layer, optimised for machine readers, running alongside the one humans see.

However, there’s another side to the argument. In a thread on X, Google’s John Mueller and Bing’s Fabrice Canel argued against duplicate markdown pages for websites. Large language models have been trained extensively on HTML and have no problem extracting human-readable content from code designed for the web. And if every website suddenly made a shadow-site of .md files, it could also increase crawl load from search engines, slowing the site down (which has an adverse effect on things like Google search ranking).

Basically, the jury is out. Cloudflare hosts a decent chunk of the web, but Google and Bing between them manage the vast majority of search. If Google decides to tweak its algorithms to ignore or even penalise sites with markdown mirrors, then it probably doesn’t matter much what Cloudflare thinks.

What I’ve done on this website (and why)

I’m hedging my bets, and doing a few little experiments of my own. As a writer, I have a vested interest in my work being read by humans. But I’m definitely interested to see the effects of “AI-friendly” approaches. I can already see in my website stats when users arrive here via recommendations from ChatGPT, and the volume of those referrals has increased enormously in the past three years.

To test out a few approaches, I’ve installed a WordPress plugin for markdown and added an llms.txt file to the site. The plugin is called Markdown Alternate, built by Joost de Valk (the creator of Yoast SEO). It’s only a couple of weeks old, and far from a finished plugin. But it does something similar to Cloudflare’s feature at the WordPress level rather than the network level (and it’s free…).

With the plugin active, every post on this blog now has a markdown version available. You can access it by adding ?format=markdown to any post URL (try it: add ?format=markdown to the end of this page’s address). AI agents that send the right header automatically receive the markdown version instead of HTML. The plugin also adds a small signal in each page’s code that tells visiting agents “a cleaner version of this content is available if you want it.” Whether agents actually act on that signal remains to be seen.
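If you want to see what an agent sees, something like this works. The ?format=markdown parameter is the plugin’s mechanism described above; I’m assuming the “small signal” is a standard alternate link tag, which is the conventional way to advertise one, so treat that discovery step as a sketch:

```python
# Fetch the markdown version of a post, the way an agent might.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-post/"

# Option 1: ask for it directly via the query parameter.
md = requests.get(url, params={"format": "markdown"}).text

# Option 2 (assumed): follow the page's advertised markdown alternate, if any.
soup = BeautifulSoup(requests.get(url).text, "html.parser")
alt = soup.find("link", rel="alternate", type="text/markdown")
if alt is not None:
    md = requests.get(alt["href"]).text

print(md[:300])
```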

The markdown version includes the full text of the post along with structured metadata: the title, date, author, categories, and tags. It strips out everything else. No theme, no sidebar, no scripts.
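For illustration, the top of a markdown version might look something like this. This is a sketch using YAML front matter, one common convention for structured metadata; the plugin’s actual field names and layout may differ:

```markdown
---
title: "Example Post Title"
date: 2026-01-01
author: "Author Name"
categories: [AI, Education]
tags: [markdown, agents]
---

The full text of the post follows, with no theme, sidebar, or scripts...
```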

I’ve also added an llms.txt file to the site, which acts as a kind of table of contents for AI systems, pointing them towards the most important content. Think of it as a curated reading list for robots. Whether AI applications actually read llms.txt is still up in the air: Google’s John Mueller – who is also against markdown duplicates – stated on X that no AI systems use them, but some SEO platforms like Yoast are starting to report traffic from them.
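For reference, an llms.txt file is itself just markdown: a title, a short summary in a blockquote, and curated lists of links. A minimal sketch, with placeholder URLs:

```markdown
# Example Blog

> Writing on AI, education, and assessment.

## Key content

- [Teaching AI Ethics](https://example.com/teaching-ai-ethics/): a series on the ethics of AI in education
- [AI and assessment](https://example.com/ai-assessment/): guidance for schools and universities
```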

I’ll be monitoring my own traffic to see if any of this makes a difference – positive or negative. I can also keep an eye on the speed of the site (which hasn’t changed) and the statistics from Google search. It’s an interesting experiment but it won’t make or break the blog.

Why I’m not worried… much

The default position for many writers right now, and it’s an understandable one, is to see all of this as a threat.

A recent Wired investigation found that mysterious bot traffic, much of it traced to Chinese IP addresses, has been sweeping across websites of all sizes, from niche publishers to US federal agencies. AI bot traffic surged 300% over the past year, according to Akamai, and by the end of 2025 roughly one in every thirty visits to monitored websites was from an AI scraping bot. More than 13% of those bots were bypassing robots.txt entirely, the file that’s supposed to tell crawlers which pages to leave alone.
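For context, robots.txt is nothing more than a plain-text list of requests. A typical rule aimed at an AI crawler looks like this (GPTBot is OpenAI’s documented crawler user agent; compliance is entirely voluntary, which is rather the point):

```
# robots.txt – a request, not an enforcement mechanism
User-agent: GPTBot
Disallow: /
```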

“This is the cost of being on the internet to some degree,” says Akamai’s Brent Maynard. “You’re open, and you’re in public view.”

He’s right. And I think that framing, being open and in public view as a cost, is exactly the assumption worth questioning.

I wrote about the resistance to AI scraping just last week, and I have enormous sympathy for artists and writers whose work has been hoovered up by companies like OpenAI without permission, compensation, or credit. The silent protest record from Ed Newton-Rex’s campaign sits on my shelf. I believe in those causes, and I understand why the visual arts and music communities are disproportionately affected.

But my situation is different, and I think the situation of most educators is different too.

This blog has been freely available for over fifteen years. I’ve never put up a paywall. All of the Teaching AI Ethics series is published under a Creative Commons Attribution-NonCommercial-ShareAlike licence, including the new website. That deliberate choice reflects a core belief of mine: educational knowledge should be shared as widely as possible. The ideas on this site – and the Teaching AI Ethics one – are (hopefully) more useful in the world than they are locked behind an obnoxious popup login screen.

Unlike illustrators and musicians, making my written work more accessible to AI systems doesn’t threaten my livelihood. If anything, it increases the chances that when someone asks an AI “how should our school approach AI in assessment,” the answer draws on my articles and my language, which might even make them more likely to seek me out for the work that actually requires a human being.

The Creative Commons problem

Here’s where things get uncomfortable, though. I said my Teaching AI Ethics series is CC BY-NC-SA. That means anyone can share it and adapt it, as long as they credit me, don’t use it commercially, and share any adaptations under the same licence. It’s a framework that’s been respected for over two decades in education, academia, and open source communities, and the one we applied to our AI Assessment Scale work for the same reasons.

AI companies don’t respect it. That’s partly because they’ve decided the licence doesn’t apply to them, but mostly because the technology simply doesn’t have a mechanism for honouring it. When an LLM ingests a CC-licensed article during training, there’s no process for ensuring attribution in outputs, no way to enforce the non-commercial clause, and no mechanism for share-alike. The content goes in, gets blended with billions of other documents, and comes out as part of a response that carries no provenance at all.

This isn’t a problem I can solve by withholding my content. It’s a structural problem with how AI training and inference currently work. And I’d argue that the people best positioned to push for better attribution and licensing frameworks in AI are the ones who are already operating in the open, already using Creative Commons, and already articulating what fair use of knowledge looks like, rather than those who’ve locked everything down and disengaged from the conversation entirely.

We already have the structures for this. Creative Commons has been doing it for twenty years. The problem isn’t a lack of frameworks. It’s that AI companies have chosen not to respect them.

What about Google?

There’s also a very legitimate fear about AI search: if AI systems get better at reading and synthesising my content, then Google’s AI Overviews (and similar features from Perplexity, ChatGPT search, and others) will just serve up my smooshed-up ideas directly in search results, and nobody will ever click through to my actual site.

This is absolutely a real concern for anyone writing online, whether it’s me and my blog or the New York Times. Traffic to independent websites has been declining for years as Google has progressively kept users on its own pages. AI Overviews accelerate that trend.

But I think the fear is slightly misplaced for educators, for two reasons.

First, this is happening regardless of whether I make my content easier for AI to read. Google’s crawlers are already indexing my HTML pages and feeding them into AI Overviews. Serving a cleaner markdown version to AI agents doesn’t give Google anything it doesn’t already have. It just means the version it gets is more accurate and better structured. And if we decide that we don’t like the way Google is handling that content, we can always quit Google.

Second, the value of an educational blog has never really been about raw pageviews. The people who matter, the ones who become email subscribers, who book PD sessions, who recommend the work to a colleague, arrive with intent. They want the full picture, the specific context. An AI summary might satisfy a casual question, but it doesn’t replace the experience of reading a piece in full. If it could, the piece probably wasn’t worth writing in the first place.

What I’m not doing

I want to be clear about what this isn’t. I’m not creating a separate, AI-only version of my website. Google’s John Mueller has also warned against serving different content to machines than to humans, a practice that looks a lot like the old SEO trick of “cloaking.” Several search engine experts have raised concerns that the new markdown-for-agents approach could enable a “shadow web” where sites serve manipulated content to AI crawlers while presenting something different to human visitors.

I’m not doing that. The markdown version of each post is the same content as the canonical HTML version, just without the theme wrapper. There’s no separate site, no hidden content, no AI-only messaging. The robots get what everyone else gets. They just get it without the bubble wrap.

[Image: bubble wrap. Photo by Erik Mclean on Pexels.com]

What this means for educators

If you’ve read this far (well done), you might be wondering what any of this has to do with teaching and learning.

The shift towards AI agents as primary, or at least equal, consumers of web content is one of those slow-moving, infrastructure-level changes that most people won’t notice until it’s already happened. But it matters for anyone who creates and shares educational content online, because it reshapes the question of who gets to be a source of knowledge.

As AI systems increasingly mediate how people access information, the voices that get surfaced will be the ones those systems can find and read. If you’re an educator with genuine expertise in a subject, and you’ve been sharing that expertise freely, you have a stake in whether AI can accurately represent your work. Making content AI-accessible isn’t about chasing a new audience. It’s about ensuring that when these systems answer questions in your domain, they draw on expertise rather than content farms.

This is really just the latest version of a tension educators have been navigating since the early days of the internet: how to share knowledge freely while still protecting professional work. AI adds a new dimension, but it doesn’t change the fundamental challenge. If you believe, as I do, that educational knowledge is a public good, then making it available to AI systems is a natural extension of making it available to anyone with a web browser. And if the idea of companies profiting from freely shared knowledge feels uncomfortable, consider traditional academic publishing, where billion-dollar corporations charge individuals and universities thousands of dollars to publish and to read, while authors receive nothing. The AI scraping problem isn’t new. It’s just wearing a different hat.

What is genuinely new is the attribution gap. Creative Commons, citation conventions, academic integrity frameworks: educators have spent decades building the infrastructure for giving credit where it’s due, and – as described above – AI systems can’t currently honour any of it. That’s not a technical inevitability. It’s a policy failure, and it’s one the education community is well positioned to push on. We’ve been thinking about these questions longer than most.

Which brings me to the classroom. The idea that the web has visible and invisible layers, that the same content can be served differently depending on who or what is requesting it, that AI systems are increasingly browsing on our behalf: these are critical AI literacy concepts. They belong in the same conversations we’re already having about how search engines work, how algorithms curate information, and how to evaluate sources. The web isn’t just for humans any more. Our students should know that, and they should think critically about what it means.

I don’t think letting the robots in is capitulation. I think it’s a recognition that the web is changing, and it will change with us or without us.

Want to learn more about GenAI professional development and advisory services, or just have questions or comments? Get in touch.
