Everything I’ve Learned so far About OpenAI’s Agents

The hype train has fully left the station, and every AI punter on every social media channel is going wild about OpenAI’s new “Agents”. Unfortunately, most of the commentators haven’t actually tried the product – they’re relying on OpenAI’s promo video. Even those who have tried Agents seem to have been wooed by its accompanying marketing spiel.

But OpenAI’s product is unfinished: another example of a tech company releasing a technology into the wild just because they can, and not because they should. OpenAI wants to be first to market with a successful browser-using AI agent, and they’ll be damned if a little fact like “it doesn’t work” gets in the way.

In my previous post, I explored some of the much-hyped features of Agents: making PowerPoints and other resources. It was, predictably, hopeless. But that didn’t stop the hypesters from commenting and messaging me with cries of “just wait six months!” and “you’re prompting it wrong!”

So I tried, and tried, and tried again. I wanted desperately to see in Agents what others were seeing: some sort of glorious techno-optimistic future where we’re all freed from the burden of things like… online shopping and… making PowerPoints.

On LinkedIn, I shared a few more attempts. The most successful involved getting o3-pro – OpenAI’s most powerful, most expensive model – to research all the capabilities of Agents and write clear, structured instructions in JSON. Large Language Model chatbots can write and read JSON very well, and using the structured data format is a proven way to get consistent results. So I had o3-pro write a set of instructions in JSON to make a 3-slide PowerPoint. It was still… uninspiring.
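For anyone curious what “instructions in JSON” means in practice, here is a minimal, hypothetical sketch of the shape such a specification might take. This is not the JSON o3-pro actually wrote: the field names (`task`, `output_file`, `slides`, `title`, `bullets`) and the placeholder slide content are my own assumptions. The Python simply builds the structure, runs a few sanity checks, and serialises it so it can be pasted into a prompt.

```python
import json

# Hypothetical schema: these field names are illustrative assumptions,
# not the JSON that o3-pro actually produced.
instructions = {
    "task": "Create a PowerPoint presentation",
    "output_file": "presentation.pptx",
    "slides": [
        {"title": "Slide 1 title", "bullets": ["Point one", "Point two"]},
        {"title": "Slide 2 title", "bullets": ["Point one", "Point two"]},
        {"title": "Slide 3 title", "bullets": ["Point one", "Point two"]},
    ],
}


def validate(spec: dict) -> None:
    """Minimal sanity checks before pasting the spec into a prompt."""
    slides = spec.get("slides", [])
    if len(slides) != 3:
        raise ValueError(f"expected 3 slides, got {len(slides)}")
    for i, slide in enumerate(slides, start=1):
        if not slide.get("title"):
            raise ValueError(f"slide {i} has no title")
        if not slide.get("bullets"):
            raise ValueError(f"slide {i} has no bullet points")


validate(instructions)
# Pretty-printed JSON, ready to paste into a prompt.
prompt_json = json.dumps(instructions, indent=2)
print(prompt_json)
```

The appeal of this kind of structure is consistency: every slide has the same fields, so the model has less room to improvise about what goes where. As the results showed, that doesn’t guarantee a good-looking deck.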

What can Agent actually do?

In a last-ditch attempt to get Agent to produce something remotely useful, I decided to just… ask what it could do. This tactic works well with Claude, Anthropic’s AI model: ask for a list of available tools and it will tell you all of its features. So I figured I’d try a series of exercises with Agents to have it list and demonstrate all of its bells and whistles.

The following video is a full, uncut (but sped up) recording of this test. It demonstrates everything I’ve found so far and, alongside my earlier post, also shows the current limitations.

Now, I know that it will get better. I’ve been saying since 2024 that “computer-using agents” are absolutely the future of this technology, and absolutely will be pushed into the market by Google, Meta, Microsoft, and every other tech company. Do I like it any better, knowing that it’s coming? No.

Scroll down past the video for my observations and everything I’ve learned so far about Agents. If you think I’ve missed any of the tools, commands, or limitations, let me know in the comments.

Available Tools and Applications

Core Tools

  • Browser Tool – Chromium-based web browser for internet browsing
  • Computer Tool – Virtual desktop environment with full GUI control
  • Container Tool – Linux environment for running command-line programs and scripts
  • Image Generation Tool – Calls ImageGen to generate an image
  • Memento Tool – Internal utility for saving and recalling summaries of work across sessions. Rather hilariously, this tool is actually a hallucination! As far as I can tell, it comes from a 2019 LinkedIn article by Jesus Rodriguez, plus perhaps some OpenAI forum posts and a Medium article, which Agent seems to have absorbed while idly browsing the internet for its own tool list. I’m leaving it in as a great example of how strange and fairly useless Agent is at the moment.

Available Applications

  • Linux-based virtual desktop
  • Chrome Browser (Chromium-based)
  • LibreOffice Suite including:
    • Writer (word processor)
    • Calc (spreadsheets)
    • Impress (presentations)
    • Draw (drawing application)
    • Base (database)
    • Math (formula editor)

Computer Tool Functions

Core Functions

  • computer.initialize – Launches virtual desktop session
  • computer.get – Captures screenshot of current desktop state
  • computer.sync_file – Transfers files from virtual machine to user environment
  • computer.switch_app(app_name) – Switches between applications
    • Supported apps: "chrome" and "libreoffice"

GUI Interaction Actions (computer.do)

  • click – Mouse click at coordinates
  • double_click – Double-click at coordinates
  • drag – Drag mouse along coordinate path
  • keypress – Press keyboard keys with modifiers (e.g., CTRL+A, CTRL+Z)
  • move – Move mouse to new position based on screen coordinates
  • scroll – Scroll content vertically/horizontally
  • type – Type text into active field
  • wait – Pause to allow UI updates

Multi-Action Sequences

  • Can chain multiple actions together for complex interactions
  • Can fill in forms and perform action sequences on websites
  • Can navigate websites and applications programmatically via the GUI

Programming and Development Capabilities

Python Programming

  • Full Python environment in container
  • Can import libraries
  • Can create and execute Python scripts
  • Generate data visualisations and graphs
  • Handle data processing and analysis
  • Create documents programmatically using modules like:
    • python-pptx (PowerPoint creation)
    • python-docx (Word document creation)

File Operations

  • Create, save, and manipulate files in virtual environment
  • Export files for download (ZIP, PDF, CSV, images, etc.)
  • Handles files more reliably via the command line than via the GUI
  • Can organise files in folders within virtual environment
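As a rough illustration of the command-line route, a short Python script like this could gather a folder of outputs into a single ZIP for download. It’s a sketch under assumptions: the folder layout and file names are hypothetical, and I’m not claiming this is how Agent packages its own exports. It uses only Python’s standard `zipfile` and `pathlib` modules.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_folder(folder: Path, archive: Path) -> list[str]:
    """Bundle every file under `folder` into `archive`, keeping relative paths."""
    names = []
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(folder.rglob("*")):
            if path.is_file():  # directories are implied by the file paths
                arcname = path.relative_to(folder).as_posix()
                zf.write(path, arcname=arcname)
                names.append(arcname)
    return names


# Demo in a throwaway directory; the folder layout is hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "output"
    (project / "images").mkdir(parents=True)
    (project / "report.csv").write_text("name,value\na,1\n")
    (project / "images" / "chart.txt").write_text("placeholder for an image file")
    archive = Path(tmp) / "output.zip"  # kept outside the folder being zipped
    packed = zip_folder(project, archive)
    with zipfile.ZipFile(archive) as zf:
        contents = zf.namelist()

print(contents)
```

Keeping the archive outside the folder being zipped stops the script from trying to add the ZIP to itself.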

Web Development

  • Create HTML websites with CSS and JavaScript
  • Download and reference external images and resources
  • Build responsive websites with smooth scrolling
  • Generate complete web projects with multiple files
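To show what a “complete web project with multiple files” amounts to, here is a minimal sketch: an HTML page, a stylesheet, and a script, written to a folder by Python. The page content and the `site` folder name are hypothetical placeholders. Smooth scrolling comes from the standard CSS `scroll-behavior: smooth` property, and the viewport meta tag plus a media query provide basic responsiveness; Agent’s own output may well be built differently.

```python
import tempfile
from pathlib import Path

# Hypothetical file contents: a bare-bones page, not what Agent produced.
FILES = {
    "index.html": """<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>Demo</title>
  <link rel="stylesheet" href="style.css">
</head>
<body>
  <nav><a href="#about">About</a></nav>
  <section id="about"><h1>About</h1></section>
  <script src="script.js"></script>
</body>
</html>
""",
    "style.css": """html { scroll-behavior: smooth; }  /* smooth scrolling for anchor links */
body { margin: 0; font-family: sans-serif; }
section { min-height: 100vh; padding: 2rem; }
@media (max-width: 600px) { section { padding: 1rem; } }  /* simple responsive tweak */
""",
    "script.js": """document.addEventListener("DOMContentLoaded", () => {
  console.log("Page loaded");
});
""",
}


def write_project(root: Path) -> list[str]:
    """Write each file in FILES under `root` and return the sorted file names."""
    root.mkdir(parents=True, exist_ok=True)
    for name, content in FILES.items():
        (root / name).write_text(content, encoding="utf-8")
    return sorted(p.name for p in root.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    written = write_project(Path(tmp) / "site")

print(written)
```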

Internet and Web Capabilities

Web Browsing

  • Search and navigate websites
  • Download PDFs and extract content
  • Access web-based applications (no login required)
  • Navigate Wikipedia, news sites, forums, and other webpages that do not require logins
  • Use open web-based tools like:
    • OpenStreetMap
    • Sketchpad applications
    • Games
    • Unsplash for stock images

Limitations

  • Frequent 404 errors and navigation issues
  • Cross-origin security restrictions
  • Cannot access sites requiring authentication
  • More error-prone than typical human browsing
  • Must hand over to the human user for logins, unless login details are provided in the prompt (don’t do that!)

Document Creation Methods

Method 1: Python-based (Faster)

  • Uses Python libraries for document creation
  • Limited styling and formatting options
  • Faster execution but basic appearance
  • Programmatic approach to content generation

Method 2: LibreOffice GUI (Slower but marginally better looking)

  • Full LibreOffice interface control
  • Better visual formatting and styling
  • Much slower and more error-prone
  • Can create (some) tables, insert images, format text
  • Prone to GUI interaction issues, clicking in the wrong place, and getting stuck in a loop

Image Generation and Processing

  • Access to ChatGPT’s image generator
  • Can create abstract images, graphics, and visual elements
  • Generate charts and graphs from data
  • Download and organise stock images from web sources

Data Analysis and Visualisation

  • Process datasets and perform statistical analysis
  • Create various types of graphs and charts
  • Handle CSV files and data exports
  • Generate random data for testing and demonstrations
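To make these data capabilities concrete, this is roughly what “generate random data for testing” and “handle CSV files” look like using only Python’s standard library. The column names and score range are hypothetical, and a seeded random generator keeps the demo reproducible. Agent may well reach for pandas or matplotlib instead; I haven’t assumed either here.

```python
import csv
import io
import random
import statistics

# Illustrative only: a seeded generator so the "random" test data is reproducible.
rng = random.Random(42)
scores = [rng.randint(40, 100) for _ in range(30)]  # hypothetical test scores

summary = {
    "count": len(scores),
    "mean": round(statistics.mean(scores), 2),
    "median": statistics.median(scores),
    "stdev": round(statistics.stdev(scores), 2),  # sample standard deviation
    "min": min(scores),
    "max": max(scores),
}

# Write the raw data to CSV (in memory here; a file opened with newline="" works the same way).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["student_id", "score"])
for i, score in enumerate(scores, start=1):
    writer.writerow([i, score])
csv_text = buffer.getvalue()

print(summary)
```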

What OpenAI Agents CANNOT Do

Installation Limitations

  • Cannot install new applications – Security guardrail prevents software installation
  • Limited to pre-installed Chrome and LibreOffice for now

Performance Issues

  • GUI interactions are slow and error-prone compared to command-line operations
  • Frequent 404 errors when browsing websites
  • Gets confused with multiple simultaneous tasks – tends to go in circles
  • Can’t play Minesweeper
  • Struggles to access authenticated websites
  • Cross-origin security restrictions limit web browsing capabilities
  • Struggles with complex table creation in LibreOffice
  • File operations via GUI are unreliable compared to command-line
  • Uncertain about its own available tools – sometimes doesn’t know what it can do. At one point, it referred to a third-party blog post about its own toolset…

Output Quality Issues

  • Output reports, PowerPoints, and other documents are sparse on information and poorly formatted
  • Generates annoying OpenAI citation links in documents (the long non-clickable number sequences)
  • Poor table formatting in LibreOffice documents
  • Inconsistent task completion – may fail to complete all requested updates

Download a PDF of the features here:

PostScript: Installing Packages

After running these tests, it occurred to me to check whether it could install packages via the command line. Even though it had told me it could not install new applications (beyond Chrome and LibreOffice), I figured it might still be able to download and install things that way. First, I asked if it could list its currently installed packages:

There’s obviously a lot preinstalled in the environment, but the output of the list “broke” after the first 18 items.

It then proceeded to install and test some new packages, proving that it can add new capabilities this way.
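For comparison, here is one way a Python script in a container could print the full list of installed packages rather than stopping partway. It assumes a standard CPython interpreter and uses `importlib.metadata` from the standard library; it isn’t necessarily what Agent ran. Installing a new package would normally be a separate step from the shell, for example `pip install` followed by the package name.

```python
from importlib.metadata import distributions


def installed_packages() -> list[tuple[str, str]]:
    """Return sorted (name, version) pairs for distributions visible to this interpreter."""
    pkgs: dict[str, str] = {}
    for dist in distributions():
        name = dist.metadata.get("Name")
        if not name:
            continue  # skip entries with broken or missing metadata
        # Keep the first match if the same package name turns up twice.
        pkgs.setdefault(name.lower(), dist.version)
    return sorted(pkgs.items())


packages = installed_packages()
print(f"{len(packages)} packages installed")
for name, version in packages:  # print every entry, not just the first few
    print(f"{name}=={version}")
```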

What have I missed? Get in touch below or leave a comment at the end of this article.

Want to learn more about GenAI professional development and advisory services, or just have questions or comments? Get in touch:

