GPT-5 Review: Benchmarks vs Reality of the PhD in Your Pocket

OpenAI has updated its flagship model to GPT-5: a long-hyped update to GPT-4o and o3. The product launch took place at 10am PT on Thursday, August 7, and ran for over an hour as OpenAI developers and CEO Sam Altman talked through the new features.

As an Australian, I wasn’t going to get up at 3am to watch the live launch. But I did take some time this morning to watch the video before trying GPT-5 out for myself. In this article, I’m going to run through some of the key moments from the launch and compare it to my initial experience using the upgraded AI. I have also made a video demonstrating the updated capabilities, and showing the differences between the free and paid models.

Read the full transcript of the video at the end of this article.

A PhD Level Expert?

One of the first comments from Altman in the promotional video which launched GPT-5 was that it offered “a legitimate PhD expert” compared to 2022’s GPT-3, which he referred to as a high-schooler. He went on to refer to the model as a “superpower on demand”, and the PhD-in-your-pocket refrain continued throughout the launch.

But it’s hard to figure out exactly what this means, other than GPT-5’s ability to pass a few specific benchmarks. The “PhD expert” in question excels at mathematics and coding benchmarks, and can out-compete other models in tests of reasoning and logic. But in reality, outperforming other AIs (and perhaps other humans) in a narrow range of predominantly science and maths based tasks does not make it a “PhD expert” – it just makes it competent in those particular areas.

And when I tested the new model on some tasks from my kind of PhD, in the Language/Arts field of Discourse Analysis, it was predictably hopeless. Celebrating the fact that a piece of computer software can perform accurate calculations seems like something from the 1980s, and not something which is going to somehow lead us to Artificial General Intelligence.

Rhythm and Prose

Another upgrade celebrated in the launch advertisement was the increased “rhythm and prose” compared to GPT-4o. I’m willing to concede that GPT-5 is much better than 4o at writing, and you can see an example of that in the video below. But I think we need to ask serious questions about exactly why OpenAI has come for writing.

I’m yet to see any indication that the company has respect for authors or creators, whether that’s writers, artists, or filmmakers. I’ve seen OpenAI make memes out of intellectual property, lambast the media for attempting to challenge their misuse of IP, and generally justify the mass-scraping of creative works. But I’ve never seen them dedicate any energy to helping artists.

So this improved writing ability of GPT-5 should probably be taken with a healthy dose of cynicism. It’s good marketing (it could probably even write better marketing for itself now). It’s going to appeal to students who already use ChatGPT to write their work. It will appeal to companies who already use ChatGPT to write their advertisements, their emails, and even their job applicant rejection letters. But it probably won’t immediately appeal to most writers and creatives.

GPT-5 Really Likes Purple

Many of the updates appear to be aesthetic rather than performance-related. The new opening screen for a chat features a soft, abstract background reminiscent of the banner images used throughout OpenAI’s website. The menus are a little cleaner. The model selector has dropped from about 26 models to just two or three, depending on whether you’re using free or paid. You can also go into the settings and change the colour of ChatGPT’s message bubbles, if you really want to.

[Screenshot of the ChatGPT homepage]

And according to the launch video, GPT-5 “really likes purple” when it’s building apps and websites.

More broadly, using GPT-5 feels oddly more aesthetic. There have definitely been some quite subtle tweaks to the UI. It was hard to put my finger on, but everything felt a little smoother, a little more deliberate, a little more designed. OpenAI is pumping a lot of time, energy, and marketing dollars into shedding its “scrappy startup” status and maturing into a real platform.

Personalities, Memories, and Deceptive Behaviours

While my short experiments this morning weren’t enough to explore every new feature, we’re told in the launch promo that there have been some improvements to the features of ChatGPT which make it more personalised and bespoke to individual users. This includes a research preview of “Personalities” for paying users, allowing the user to select a “professional”, “personal”, or “sarcastic” personality for the chatbot.

Musk’s Grok has already done this, and frankly I find the whole “sarcastic chatbot” thing incredibly lame. I’m sure someone must enjoy it if OpenAI has decided to copy the feature.

Memories have apparently been improved too. In my experiments, it did draw on chat memories (in my paid account) when completing tasks like the Melbourne trip. However, some of the memories were incorrect. I noticed, for example, that ChatGPT “remembers” that I live in Melbourne, which I don’t, and it also made quite a few assumptions about what I would like/dislike based on equally false information.

And finally, in the launch video, we get a detailed discussion of the new safety features and training methods to make GPT “less deceptive”. GPT will now demur when asked questions about potentially harmful topics, such as how to make explosives. This follows a blog post earlier this week where OpenAI outlined other “social harm” mitigations, including reminders for users to take a break, and new response methods to avoid directly answering personal-crisis questions like “should I break up with my boyfriend?”

https://openai.com/index/gpt-5-safe-completions/

There’s an unsettling inconsistency, though, between Altman’s recent comments that people shouldn’t use ChatGPT for therapy or form relationships with the chatbot, and the way that GPT-5 has been trained to be more friendly, more helpful, and more human-like. Conversations on what GPT should and should not answer are followed in the launch video by a conversation with a cancer survivor who uses ChatGPT extensively to understand medical notes. It’s hard to tell exactly what the messaging is around ChatGPT’s physical and mental health advice.

ChatGPT for Developers

At its core, GPT has always been software developed for other software developers. It is a code-completion tool that has, somewhere along the way, been marketed as a general consumer technology. That’s why a huge section of the launch video is dedicated to discussing GPT-5 for coding, the Codex application, and its use in software production workflows.

I’m not going to go into all that here, but you should check out Simon Willison’s blog if you’re interested in that side of things.

Final Thoughts

After a few hours of initial experiments, GPT-5 does feel like a different model. I stopped using GPT-4o months ago, finding that the “improvements” from March 2023’s GPT-4 were mostly a backwards step. GPT-4o was incredibly prone to hallucination, failed to understand basic tasks, and was generally hopeless for most serious work. So, I started using the “reasoning model” o3 for most tasks.

But o3 was slow, and prone to giving overly verbose answers or getting stuck in indecisive loops. Deep Research offered a step up in terms of quality, but took an incredibly long time to work.

I wrote about all of those features in a video mini-PD you can watch here:

Free PD Video: How to Use ChatGPT

GPT-5 is definitely an improvement over GPT-4o. It feels like a more streamlined o3, which gives the same kind of detailed responses but cuts down on the excessive creation of tables and dot point summaries. I didn’t notice any hallucinations, but I didn’t use it much. The coding of simple apps and simulations was noticeably better in the paid version, but only so-so in the free. Every other feature seems more-or-less unchanged for now, including Deep Research, Canvas, and the various Connector tools like Gmail and Canva.

Sam Altman has a history of hyping up his products to the extreme, and then downplaying the reality after launch. We saw it with GPT-4, and again with 4o. We even saw it with the ludicrously useless GPT-4.5, which I would wager you’ve probably never used. The pattern goes something like this:

“What’s coming is so powerful I’m scared of it!”

“We have secret models beating all the benchmarks!”

“We’re heading towards AGI!”

“Our new model is powerful, and way better than the last one.”

“We’re releasing something soon but it won’t be AGI”

“This model is OK I guess, and you’ll probably like it.”

That peak and trough of hype makes for great marketing, and I’m sure that even as I’m typing out this post there are a thousand breathless hot takes filling up your social media feeds. All I can say is: try it for yourself before you make any judgements. ChatGPT is better now than it was 24 hours ago. But it isn’t a PhD expert, it isn’t a superpower on demand, and it certainly isn’t everything that Altman has been promising.

Want to learn more about GenAI professional development and advisory services, or just have questions or comments? Get in touch:


Video transcript

Transcribed with Otter, formatted and edited with GPT-5

Introduction

Hi, I’m Leon Furze, and in this little video, I’m going to give a fly-through of the new GPT-5 with everything that I’ve tested out so far, as well as a few refreshers on some of the core features of ChatGPT.

OpenAI just released GPT-5 overnight with a very long announcement video — about an hour and 15 minutes — covering all of the features in detail. Of course, we can’t always take these marketing promotions for granted, and we’ve seen in the past that they sometimes demonstrate features that don’t actually exist. So I thought it would probably be worth getting in there and trying out these things for myself.

In the first 24 hours of a release with any of these products, things do tend to slow down a little bit. This video is quite heavily edited: a lot of the sections are sped up to double speed, or even 4x, 8x and 20x speed, just so that we’re not sitting here watching ChatGPT think for a long time. I’ll run through as many features as possible, and I hope you learn something about the new models.


First Impressions and Interface Changes

The first big change is that OpenAI has removed all of its old models. This is in the paid version I’m looking at here for now, and most of the rest of the tools are exactly the same. Agent Mode, Deep Research, Image, and so on are all in the same place. Some of them have been moved into an additional drop-down menu.

Getting rid of o4-mini, 4o, o3, o3-pro, and all of those has consolidated everything to GPT-5. Everything else looks pretty much the same, except for this nicely coloured background.

In the free version, I checked and noticed that ChatGPT and ChatGPT Plus are the options, so we can’t actually select GPT-5. In the settings — where I always like to look for any minor changes — nothing’s really changed either.

We still have:

  • Memory, with the ability to turn it on and off.
  • The option to stop them from training on your data.
  • Multi-factor authentication.

In both paid and free versions, most things are the same and located in the same places. I went through all of these features in a video earlier this week, so check that out if you want the full settings overview.


Trip Planning Test

I started with a pretty trivial task: Help me to plan a trip to Melbourne.

It asked me a few questions in response, and under the hood here — even in the free version — it’s now using GPT-5, though a lighter-weight model.

I manually turned on Web Search so it could give accurate weather and location details. It pulled recent weather data from the Australian Government Bureau of Meteorology website and gave me typical weather information for my trip. So far, so good.


Building a Simple Budgeting App

Next, I turned off Web Search and asked it to build a simple budgeting app. The first attempt in the free version gave me a basic text field for item name and dollar amount. I could add a few items, but it wasn’t impressive.

When I asked it to make it better and contextualise it for the Melbourne trip, the free version hit an error — not unusual in the first 24–48 hours of a release.

I switched to the Pro version, turned on GPT-5 Thinking, and tried again. It explored different ways to put the itinerary together and gave me a starter three-day plan.

I then specified my travel dates (August 26–30) and asked for an all-in-one travel app with:

  • Live weather update.
  • Budgeting app.
  • Itinerary.
  • Various widgets.

It produced a more complex app than GPT-4 or 3 could manage: live weather, a forecasting section (though it had a data-pull issue), itinerary with editable notes, and a budget app with some nice visualisations. Some features — like CSV export — didn’t work, but the improvement in coding was clear.
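For a rough sense of what sits underneath an app like this, the budgeting core is just a small piece of state-keeping around a list of line items. Here’s a minimal Python sketch of that logic (hypothetical names and figures; the app GPT-5 actually generated was built as a web page, not Python):

```python
from dataclasses import dataclass, field

@dataclass
class TripBudget:
    """Tracks spending against a fixed limit for a trip."""
    limit: float
    items: list = field(default_factory=list)  # (name, cost) pairs

    def add(self, name: str, cost: float) -> None:
        self.items.append((name, cost))

    def spent(self) -> float:
        return sum(cost for _, cost in self.items)

    def remaining(self) -> float:
        return self.limit - self.spent()

# Example: a couple of Melbourne-trip expenses against a $500 budget
budget = TripBudget(limit=500.0)
budget.add("Tram day pass", 11.0)
budget.add("Gallery exhibition", 30.0)
print(budget.remaining())  # 459.0
```

The interesting part of the generated app wasn’t this arithmetic, of course, but the layer GPT-5 wrapped around it: editable widgets, visualisations, and the (broken) CSV export.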


Image Generation

This wasn’t advertised as improved, but I tested it anyway. The results — for example, a cat with the wrong number of legs — showed that image-generation issues remain unchanged.


Education-Focused Experiments

Specialist Maths Exam Questions

I tested GPT-5’s reasoning with senior-level Victorian specialist maths questions. With Thinking Mode on, it worked through the questions quickly and accurately, even with harder, end-of-exam problems.

This aligns with OpenAI’s claims of improved mathematical reasoning and benchmark results.

Senior School Physics Lesson Plan

I asked for VCE Unit 1 Physics (2025) lesson materials. With Thinking Mode on, it conducted extensive web searches, found the latest curriculum documents, and produced an accurate lesson plan that matched official Areas of Study and Learning ideas.

Physics Simulation in Canvas Mode

I then tested Canvas Mode by asking for a wave simulation. It produced a fully interactive simulation with controls for wave speed, amplitude, frequency, and type (transverse/longitudinal), updating in real time. This was a big step up from previous versions.
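The physics behind a simulation like that is compact: each animation frame just re-evaluates the travelling-wave equation y(x, t) = A sin(kx − ωt) across the canvas, with the sliders feeding in amplitude, frequency, and speed. A minimal Python sketch of that per-frame calculation (hypothetical function names; the Canvas app itself was rendered in JavaScript):

```python
import math

def wave_height(x, t, amplitude=1.0, frequency=0.5, speed=2.0):
    """Displacement of a travelling transverse wave: y = A * sin(k*x - w*t)."""
    wavelength = speed / frequency    # v = f * lambda
    k = 2 * math.pi / wavelength      # wave number
    omega = 2 * math.pi * frequency   # angular frequency
    return amplitude * math.sin(k * x - omega * t)

# One "frame" of the wave sampled along the x-axis at t = 0
frame = [wave_height(x * 0.1, t=0.0) for x in range(100)]
# Advancing t between frames is what makes the wave appear to travel
```

Changing the slider values simply changes the arguments passed in on the next frame, which is why the simulation can update in real time.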


Creative Writing Test

I asked for a short high-fantasy passage in a 1980s Raymond E. Feist style. Thinking Mode identified relevant authors and quickly produced text.

For me, the result was still somewhat formulaic, with predictable rhythm. In Canvas Mode, I asked it to improve the piece — it made small adjustments, but nothing that would replace an author.


Audio Transcript Reformatting

One of my common workflows is taking a raw audio transcript and reformatting it while correcting transcription errors. I tested this with a raw Otter transcript. GPT-5 handled this surprisingly well, preserving my language rather than overwriting it — something most models struggle with.


Timetable Image to Interactive App

I uploaded a photo of a school timetable and asked it to turn it into an interactive application.

Instead of populating the timetable, it created a drag-and-drop builder. It had correctly extracted all the timetable data from the image but didn’t use it as expected. In cases like this, it’s worth experimenting with multiple tabs and variations, as OpenAI suggested in their launch video.


Closing Thoughts

That’s a quick run-through of some GPT-5 features. Some things worked, some didn’t. Expect plenty of hype in the coming weeks and months.

I anticipate refinements and perhaps updates to Deep Research, image generation, and maybe even video generation integration.

Overall: definite improvements over GPT-4o, but I encourage you to try it yourself rather than relying solely on the hype.

Thank you very much.
