The AI Iceberg: Understanding ChatGPT

Analogies are useful for understanding complex ideas, and there are plenty of complexities for educators trying to wrap their heads around ChatGPT. In this post, I’ll try to explain some of the features of the chatbot and the model it’s built on top of. I’m deliberately avoiding any kind of analogy that represents the AI as magical, mythical, human, or godlike – we’ve seen enough of them.

I’m not claiming that this analogy is watertight or that there is no better way to conceptualise ChatGPT. But after fifteen years working in secondary education, I do know that if you can’t express it as an iceberg, a pyramid, or a Venn diagram then it’s not worth expressing.

I’ve been using the iceberg analogy for a little while and refining it as I’ve gone along. At various points, I’ve run the analogy through ChatGPT to sense check certain comparisons. I’ve also checked it against my understanding of how LLMs work. I’m no computer scientist, so if you’ve got a suggestion, criticism, or correction then email me or leave a comment and I’ll work it into v2. Maybe version two will have colour coded hats. Who knows?

The AI iceberg

Picture an iceberg floating in the ocean, or maybe on a clichéd educational infographic. The visible part above the waterline is relatively small compared to the massive structure hidden beneath the surface. Now, imagine that this iceberg represents an LLM, like GPT-3 or 4, with its different components distributed above and below the waterline.

The Dataset: Underwater Bulk

The bulk of the iceberg, hidden underwater, represents the vast dataset on which the LLM is trained. This data forms the bedrock of the model’s knowledge and capabilities. It’s vast and mostly unseen during any interaction with the model, but it’s always there, informing every output.

Different models are trained on different combinations of datasets. For companies like OpenAI and Google, some of that information is proprietary. While we have some information on GPT-3’s training data, GPT-4 is more of a mystery, and Google’s PaLM is off-limits. But we do know a little about the kinds of data these behemoths* (sorry, large models) are trained on. We know, for instance, that the training data includes sources like the Common Crawl, The Pile, Wikipedia, and coding site GitHub. The models may also be trained on social media sites like Twitter and Reddit. All of this comes with plenty of side effects, including the kind of bias and discrimination that I’ve written about elsewhere.

*Editorial note: slap on the wrist for a mythical beast analogy.

The LLM: Above the Waterline

Emerging above the waterline is the LLM itself, the result of the training process fuelled by the vast dataset beneath. This visible portion is what we interact with when we use applications built on top of the LLM. It’s akin to the complex structure of the iceberg we can see, formed and supported by the data “underwater.”

A Large Language Model is an artificial intelligence (AI) system that has been trained to understand and generate human language. These models are designed to predict the likelihood of a certain word given the words that came before it in a sentence or text. This ability allows them to generate coherent and contextually appropriate sentences, paragraphs, and even entire texts.
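
To make the “predict the next word” idea a bit more concrete, here is a minimal sketch in Python. It is a toy, and not how GPT works under the hood: it only counts which word follows which in a tiny made-up corpus, whereas a real LLM uses a neural network trained on vastly more text and looks at far more than one preceding word. The corpus and the function names (train_bigram, next_word_probs) are my own inventions for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the "underwater" dataset (invented for illustration).
corpus = "the iceberg floats in the ocean . the iceberg hides the data ."


def train_bigram(text):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Turn the raw counts for one preceding word into probabilities."""
    total = sum(counts[prev].values())
    return {word: n / total for word, n in counts[prev].items()}


model = train_bigram(corpus)
probs = next_word_probs(model, "the")

# Print the candidate next words, most likely first.
for word, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"the -> {word}: {p:.2f}")
```

In this toy corpus, “iceberg” follows “the” twice out of four times, so it gets a probability of 0.5, while “ocean” and “data” get 0.25 each. The point is only that everything the “model” can say comes from patterns in the data it was built on.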

The most commonly talked about LLM right now is OpenAI’s GPT, currently in version 3.5 (free) or 4 (subscription or access via Bing and the API). But there are many more out there, some open source, and some owned by companies like Google. There’s a comprehensive list over on GitHub which also contains some of the seminal papers on LLMs, if you’re into that sort of thing.

Applications like ChatGPT: The Snowman Sculpture

Lastly, picture a carefully sculpted snowman sitting on top of the iceberg. This represents an application like ChatGPT, which is built on top of the general LLM. The snowman is a more specialised figure carved from the raw material of the iceberg, just as ChatGPT is a version of the GPT model that has been fine-tuned specifically for conversational tasks.

Incidentally, in my original iceberg analogy I offhandedly described ChatGPT as something like a “little flag sticking out of the top”. When I ran the analogy through ChatGPT, the snowman was suggested as a more suitable alternative (alongside “a fabulous ice sculpture”, which I rejected).

You’ll see ChatGPT referred to (and referring to itself) as a Large Language Model trained by OpenAI. I’m making the distinction here to show that GPT is actually the powerful part of the model, and ChatGPT a refinement. Maybe I’m splitting hairs. I don’t care – the snowman stays for now.

Stretching the metaphor

An analogy isn’t any use at all if it can’t be stretched to breaking point (melting point?), so let’s throw a couple more ideas at this:

  1. The Ocean: In this analogy, the ocean in which the iceberg floats can represent the internet at large. The internet is the vast environment from which the dataset (the underwater part of the iceberg) is sourced. Like the ocean, the internet is expansive, diverse, and filled with…
  2. Sharks: Sharks or other dangerous sea creatures could symbolise potential threats or challenges in the internet environment. These could include misinformation, bias, inappropriate and toxic content, or data privacy issues. These dangers can influence the dataset and, subsequently, the behaviour of the LLM.

That’s it! Like I said, feel free to send through any comments or suggestions, or analogies of your own.

Right now I’m working with several schools and higher education providers on AI policy, academic integrity, and professional learning. If you’d like to enquire about availability for Term 3 and 4, use the form below:
