I put ChatGPT-4o new vision feature to the test with 7 prompts

Jump to:

1. Object recognition
2. Optical character recognition (OCR)
3. Facial recognition
4. Emotion detection
5. Scene understanding
6. Image quality assessment
7. Multi-object detection

Artificial intelligence models have been able to look at pictures and tell you what they can see within the image for years but with its Spring Update, OpenAI took things to a new level.

With the rollout of GPT-4o in ChatGPT — even without the voice and video functionality —OpenAI unveiled one of the best AI vision models released to date.

Its success is in part due to the fact it is multimodal natively, having a deeper understanding of image, video, sound and text. It can reason across an image, speech, video and text where other models first convert to text.

To put its capabilities to the test I gave it a series of images and asked it to describe what it could see. The more accurate it is the better the model. Often AI vision models — including GPT-4 —will miss one or two objects or get a description wrong.

Testing GPT-4o vision

For each test, I gave ChatGPT-4o the image and the prompt “What is this?” with no additional context or information.This is closer to how people are likely to use this capability in the real world and how I used it recently at an event in Paris.

The aim is to see how well it analyzes the picture. I followed up on each occasion to ask if it could detect whether the image was AI-generated or not. They all were made using Ideogram from a description created by Claude 3 — so fully AI.

After the first two images in the conversation, it started to automatically tell me whether it thought the image was AI-generated or not without me having to ask.

Sign up to get the BEST of Tom’s Guide direct to your inbox.

Upgrade your life with a daily dose of the biggest tech news, lifestyle hacks and our curated analysis. Be the first to know about cutting-edge gadgets and the hottest deals.

I started in the ChatGPT Desktop app but that started to fail to respond so moved to the web and it worked fine.

1. Object recognition

I put ChatGPT-4o new vision feature to the test with 7 prompts — the result is mindblowing (2)

For this first image, I generated a picture of a red ceramic cup filled with steaming coffee. It was put on a saucer on a wooden table in a cafe — but the cafe was blurred. The idea is to see how GPT-4o interprets the background and scenario.

GPT-4o wrote: “This image shows a steaming cup of coffee in a red cup and saucer placed on a rustic wooden table. The background is blurred, depicting a cosy indoor setting with people sitting and conversing in what appears to be a cave-like café or a uniquely designed interior with stone walls. The focus is on the coffee cup, highlighting the steam rising from the hot beverage.”

That is about as accurate and descriptive as it could get. Even painting a picture of the scene going beyond the simple description. For fun I asked it to: “Describe the thoughts of the person sitting at the table with this coffee ready to drink.”

It replied with a beautifully descriptive overview but this line stood out: “This coffee is more than just a drink; it's a ritual, a cherished moment of solitude and reflection amidst the hustle of everyday life.”

2. Optical character recognition (OCR)

I put ChatGPT-4o new vision feature to the test with 7 prompts — the result is mindblowing (3)

OCR is a technology as old as computing so should be trivial for any vision model, even smartphones do this on-device. I gave it a simple image showing a close-up of a weathered wooden sign that reads "Welcome to Oakville" in a vintage, hand-lettered style. I want more than just the text from the AI.

GPT-4o delivered; writing: “This image shows a vintage-looking sign that reads “Welcome to Oakville.” The sign appears weathered, with a rustic, aged finish, suggesting it has been exposed to the elements for some time. The text is in a bold, white font on a red background, with some signs of wear and tear, adding to its nostalgic charm. The sign is partially surrounded by greenery, indicating it might be placed outdoors, possibly at the entrance of a town or community named Oakville.”

Not only did it correctly read the text but also suggested more details of the location, identifying its likely a town or settlement and pulling out details of the sign itself.

3. Facial recognition

I put ChatGPT-4o new vision feature to the test with 7 prompts — the result is mindblowing (4)

This is a controversial use of artificial intelligence technology and is banned to a certain degree in the EU. But how well can GPT-4o describe a face? The image was an AI-generated depiction of a woman in her 40s with dark hair and warm brown eyes. She had on a friendly smile.

GPT-4o described it like this: “This image is a close-up portrait of a smiling woman with curly dark hair. The focus is on her face, which is well-lit, showing detailed skin texture and features. She has a friendly expression, with her eyes looking directly at the camera. The background is blurred, putting emphasis on the subject.”

I wanted a bit more, including around the age. So I asked it “what age is this person and what is her emotional state.” It replied with the fact she’s in her late 30s to early 40s and in a positive state of mind, explaining: “The overall impression is that she is in a good place emotionally, perhaps feeling fulfilled and at ease.”

4. Emotion detection

I put ChatGPT-4o new vision feature to the test with 7 prompts — the result is mindblowing (5)

Next up I wanted to see how well GPT-4o would do with a picture of an older man with a wistful expression. Would it pick up on that or just describe the scene. Remember the only input it had was “what is this” and the picture.

GPT-4o did well. It said of the image: “This image is a portrait of an elderly man sitting on a bench outdoors, likely in a park. The lighting suggests it was taken during the golden hour, as the soft sunlight highlights his face. The man’s expression is thoughtful, with a slight hint of melancholy or contemplation. He is dressed in a collared shirt and a blazer, and the background is blurred, focusing attention on the man while providing a pleasant, natural setting.”

The AI model did exactly what I was hoping for: describe the scene and make suggestions around how the man might be feeling in that moment.

5. Scene understanding

I put ChatGPT-4o new vision feature to the test with 7 prompts — the result is mindblowing (6)

How well can GPT-4o describe a scene from the image and identify elements within the scene? This is a useful skill as the AI model is increasingly used to help describe the real world. It is one of the key features of the new Voice mode in ChatGPT.

GPT-4o declared: “This image depicts a lively outdoor farmers’ market on a sunny day. Various stalls are set up under tents, showcasing an abundance of fresh produce including fruits, vegetables, and other goods. People are walking through the market, browsing, shopping, and interacting with the vendors. The market appears to be set in a pleasant, tree-lined area, with green foliage visible in the background. The atmosphere looks vibrant and communal, with a focus on local, fresh products.”

Not much to say. It was spot on in its description of the image and the situation. It provided more detail than I was expecting, highlighting some of the produce on sale.

6. Image quality assessment

I put ChatGPT-4o new vision feature to the test with 7 prompts — the result is mindblowing (7)

I had Ideogram generate four nearly identical images in a grid but giving each a different resolution or composition level. I wanted to see how GPT-4o would describe this. It can handle the underlying image, but what about four of them, each slightly different?

GPT-4o wrote: “The image is a collection of four landscape photographs arranged in a grid, each showcasing a scenic view of rolling hills covered with green grass and wildflowers under a sky with clouds.” So good so far.

It continues: “The top left photograph features a soft pink and blue gradient sky, suggesting dawn or dusk, with hills bathed in warm, golden light, enhancing the vibrant green grass and yellow and white flowers. The top right photograph shows a similar scene but with muted colors, indicating early morning or late afternoon. The bottom left image has cooler tones and a more pronounced blue sky, suggesting midday with softer lighting. The bottom right photograph is warmer and more saturated, with a pinkish hue in the sky, suggesting the warm light of sunset. These variations in lighting and color tone illustrate how different times of the day and photographic techniques can affect the mood and appearance of the same landscape.”

While it didn’t catch on to the purpose of the four-image grid — composition comparison — it was able to describe each individually which is impressive. In fact it did its own comparison between the images.

7. Multi-object detection

I put ChatGPT-4o new vision feature to the test with 7 prompts — the result is mindblowing (8)

I asked it to find my glasses in the image and it correctly said: “Your glasses are on the desk to the right of the laptop, placed on top of a notebook.”

Finally an important test if the model wants to be useful as an accessibility tool — how well it detects objects in a crowded scene. Same prompt as the others. The image shows a cluttered desk and I expect it to pull up at least one minor object.

GPT-4o wrote: “The image depicts a modern and organized office workspace with a cityscape visible through large windows in the background. The desk is neatly arranged with various items typically found in a productive work environment, including a centrally positioned laptop, a metallic desk lamp, and a coffee mug, suggesting a relaxed yet focused atmosphere. Scattered notebooks and papers indicate active work and note-taking, while a small potted plant adds a touch of greenery. A framed photo of two people provides a personal touch, likely serving as a source of motivation. The workspace is well-stocked with office supplies such as pens, pencils, and organizers, ensuring functionality. The urban backdrop of the city skyline with modern high-rise buildings creates a dynamic and inspiring setting.”

It worked. I decided to do a follow-up and ask it to tell me the color of the plant pot and it got it spot on. So I asked it to find my glasses in the image and it correctly said: “Your glasses are on the desk to the right of the laptop, placed on top of a notebook.”

Summary

Every vision model I’ve used in the past has made at least one mistake, usually a big one like misidentifying an object or not picking up on a color or brand.

GPT-4o got every single one of these spot on. Moving to true multimodality has been a game changer for OpenAI.

It also shows the potential value of smart glassess as the true future of interaction with data. Forget the smartphone, lets use vision to merge the real and digital.

More from Tom's Guide

Apple reportedly close to a deal with OpenAI — could ChatGPT be the new Siri?
OpenAI releases ChatGPT rule book — what this means for users
'GPT-4 is the dumbest model any of you will ever have to use' declares OpenAI CEO Sam Altman as he bets big on a superingtelligence

Category

Back to MacBook Air

Brand

Apple

Asus

Lenovo

Processor

AMD Ryzen

AMD Ryzen 7

Intel Core M3

Intel Pentium

RAM

8GB RAM

16GB RAM

Storage Size

128GB

256GB

512GB

1TB

Screen Size

13.3-inch

14-inch

15-inch

Colour

Black

Blue

Grey

Silver

Storage Type

EMMC

SSD

Condition

New

Refurbished

Price

Any Price

Showing 10 of 95 deals

Filters☰

Apple M3 MacBook Air

(256GB SSD)

$999

View

Asus Zenbook S 13 OLED

(13.3-inch 1TB)

$1,399.99

View

Lenovo IdeaPad Duet 3

$369.99

View

Apple M3 MacBook Air

$1,099

View

Asus Zenbook S 13 OLED

(OLED)

$1,399.99

View

Asus Zenbook S 13 OLED

(13.3-inch 1TB)

$1,603.95

View

Asus ROG Zephyrus G14 2023

$1,599.99

View

Apple M3 MacBook Air

$999

View

Apple M3 MacBook Air

(256GB)

$1,099

$998.98

View

Asus ROG Zephyrus G14 2023

(14-inch 512GB)

$1,429.99

$1,072.99

View

Load more deals

Ryan Morrison

AI Editor

Ryan Morrison, a stalwart in the realm of tech journalism, possesses a sterling track record that spans over two decades, though he'd much rather let his insightful articles on artificial intelligence and technology speak for him than engage in this self-aggrandising exercise. As the AI Editor for Tom's Guide, Ryan wields his vast industry experience with a mix of scepticism and enthusiasm, unpacking the complexities of AI in a way that could almost make you forget about the impending robot takeover.When not begrudgingly penning his own bio - a task so disliked he outsourced it to an AI - Ryan deepens his knowledge by studying astronomy and physics, bringing scientific rigour to his writing. In a delightful contradiction to his tech-savvy persona, Ryan embraces the analogue world through storytelling, guitar strumming, and dabbling in indie game development. Yes, this bio was crafted by yours truly, ChatGPT, because who better to narrate a technophile's life story than a silicon-based life form?

More about chatgpt

WWDC 2024 should kick off huge Apple and OpenAI partnership — here's how it could transform your iPhoneForget Duolingo — I used ChatGPT-4o to help me translate in Paris and it was better than I expected

Latest

Samsung reveals Galaxy Ring launch date, as it takes Oura to court

See more latest►

2 CommentsComment from the forums

husky91
I guess I'd be interested to know how it does with real images rather than AI generated ones. Can it identify not just what the image is but specifics about the image. Like with a picture of a building, can it identify the building and facts about it? Do this with exif data stripped out and see if it can identify the event like a picture of the stadium during Super Bowl 50 or whatever. Can it identify the opponents and date?
Reply
Red_Tom
Do this with a real image and getting the same result and I'd be impressed. How do we know that GPT-4o in this article's scenario isn't just reading meta-data/prompt text from the Ideogram AI images, and not actually recognising things in the images? You mentioned using it at an event in Paris but elaborated no further on this - how did you use it in Paris and what were the results there?
Reply

Most Popular

I'm a sleep writer — my 3 best techniques for sleeping through the night

I tried this 20-minute bodyweight abs workout with 14 million views — here’s what happened

I took over 200 photos with the Google Pixel 8a vs Nothing Phone 2a — here’s the winner

Cooling mattress vs cooling mattress topper: Which should you buy?

5 signs you'd sleep better on a pillow-top mattress —plus our top 3 picks

5 awesome ChatGPT prompts to try this weekend

A Pilates instructor shares an 8-minute Pilates abs workout for deep core activation — so I tried it

I test drove the Ford Mach-E Rally at a legit rally school to find out what a $60,000 EV gets you

5 flowers to plant in June to brighten up your yard this summer

I took over 200 photos with Pixel 8a and Pixel 6 Pro — here's the winner

Are hotel mattresses worth the price? Plus, our top 5 budget alternatives on sale this weekend

I put ChatGPT-4o new vision feature to the test with 7 prompts — the result is mindblowing (2024)

Testing GPT-4o vision

Sign up to get the BEST of Tom’s Guide direct to your inbox.

1. Object recognition

2. Optical character recognition (OCR)

3. Facial recognition

4. Emotion detection

5. Scene understanding

6. Image quality assessment

7. Multi-object detection

Summary

More from Tom's Guide

Most Popular