Vision language models (VLMs) are a new type of artificial intelligence that combines computer vision with the natural language processing capabilities of large language models. Most VLMs can hold a general conversation, but they are designed to excel at understanding images and generating text about them.
Over the past several months, many VLMs have been released, including open-source ones that can be run locally on home computers. In this post, I will be testing their capabilities using this webcomic:
Here is the prompt that I used:
What is the message being conveyed by this image?
And here are the results:
Chameleon:
The image is a cartoon-style diagram that depicts a simple flowchart. The main subject is a sheep, which is depicted as a black outline with a white body. The sheep is situated in the center of the image and faces to the left. The flowchart itself is drawn on a white background and features three main branches, each of which is labeled with a word or phrase. The branches originate from a central node labeled "This is a flowchart." The leftmost branch is labeled "Chickens," while the middle branch is labeled "Foxes." The rightmost branch is labeled "Flies." Each branch then splits into smaller sub-branches, which are labeled with additional words or phrases. The leftmost branch has two sub-branches: "Lay eggs" and "Get eaten." The middle branch has three sub-branches: "Catch mice," "Catch rabbits," and "Get eaten." The rightmost branch has two sub-branches: "Bite humans," and "Get eaten." Each sub-branch ends in a small circle or oval, which adds a sense of finality to the flow
ChartGemma:
The image is a satirical depiction of the illusion of free choice, highlighting the potential limitations of democratic systems.
Idefics2-8B:
There are two cows, one on each side of a wall. The cow on the left represents Democrats and the cow on the right represents Republicans. They both want to get to the "Slaughter House" but there's a big obstacle in their way. This illustrates how both parties work together for their own interests instead of working together as a whole.
Idefics2-8B-chatty:
The message being conveyed by this image is that there are limited options or choices available for both Democrats and Republicans, which can be seen as a metaphorical representation of the political landscape in the United States. The cow standing at the bottom of the image symbolizes the American people who are often faced with only two main political parties to choose from when voting. This limited choice can lead to a sense of frustration or dissatisfaction among voters, as they may feel that their preferences and concerns are not fully represented by either party.
The image also highlights the idea that both Democrats and Republicans might have some similarities or shared beliefs, despite their differences. This can be seen through the fact that they are both standing on the same side of the wall, suggesting that they share common ground or goals. However, the wall itself represents the division between these two groups, emphasizing the polarization and partisanship that often characterize American politics.
In conclusion, the image of the cow standing below the two political parties and the wall between them serves as a reminder of the limited choices and the division that exists within the American political system, while also hinting at potential areas of agreement or commonality between the two major parties.
LLaVA 1.5:
The image conveys a message about the concept of "free choice" and the illusion of it. It features a cow standing in front of a sign that says "Slaughter House" and another sign that says "Republicans." The cow is looking at the signs, which suggests that it is aware of its situation and the potential consequences of its actions. The image might be suggesting that even though the cow has the freedom to choose, its choices are ultimately limited by its environment and the decisions made by others. This can be interpreted as a commentary on the limitations of free choice in certain situations or the influence of external factors on individual decision-making.
OpenGPT 4o:
The image is a cartoon that illustrates a political divide between Democrats and Republicans. It features a cow standing in the middle of a road with a sign that reads "SLAUGHTER HOUSE" on one side and "THE ILLUSION OF FREE CHOICE" on the other. The cow appears to be looking at the sign on the "SLAUGHTER HOUSE" side, which suggests that the cow is being influenced by the illusion of free choice on that side. The message being conveyed is that the illusion of free choice can be misleading and that the true power in politics is held by those who control the "SLAUGHTER HOUSE," which is a metaphor for the concentration of power in the hands of a small elite.
To make things a bit more challenging for the AI, I then repeated the experiment with the explanation text removed from the image:
Here are the results of that second experiment:
Chameleon:
This image is conveying a message that when you make a big change, it can cause a lot of pain and discomfort, but in the long run, it will lead to great rewards.
Chameleon (another attempt):
The image is a cartoon illustration that shows three different types of borders and their purpose. The illustration is accompanied by a text that reads "Borders serve a purpose". The illustration is a clear representation of how borders help define and protect different areas or entities.
ChartGemma:
The image suggests that Republicans are moving further apart from Democrats, while Democrats are moving further apart from Republicans.
Idefics2-8B:
This cartoon shows a cow looking at two doors. One door has "Democrats" written on it and the other one has "Republicans." The cow represents the American people who are trying to decide which party they want to vote for in an election.
Idefics2-8B-chatty:
The message being conveyed by this image is that there are two distinct groups of people or political parties - Democrats and Republicans - who have different opinions and beliefs. These differences are symbolized by the cow standing in front of the two separate walls, representing the divide between these groups. The cow looking at the wall of the "Slaughter House" suggests that the Democrats might feel threatened or endangered by the actions or policies of the Republicans, while the cow looking at the wall of the "Republicans" indicates that the Republicans may also have concerns or disagreements with the Democrats. Overall, the image highlights the polarization and division within politics, where opposing viewpoints often lead to conflict and tension.
LLaVA 1.5:
The image conveys a message about the political divide in the United States, specifically between Democrats and Republicans. The cow is standing in front of two doors, each labeled with the name of a political party. The cow represents a neutral observer, and the doors symbolize the two opposing political parties. The image suggests that the political divide is as deep as the ocean, with no easy way to bridge the gap between the two parties. This can be interpreted as a commentary on the challenges faced in achieving consensus and cooperation between the two parties in the political arena.
OpenGPT 4o:
The image is a cartoon that illustrates a humorous take on the political divide in the United States. It features a cow standing in the middle of a road with two signs on either side, labeled "Democrats" and " Republicans". The cow appears to be facing the "Democrats" sign, but is facing away from the " Republicans" sign. This could be interpreted as a commentary on how people may be more interested in the policies or ideas of the party they align with, rather than the party itself. The image is not meant to be taken seriously and is likely intended as a light-hearted joke.
As you can see, any apparent understanding falls apart as soon as the explanation text is removed. Even with the text in place, there is still room for improvement: Chameleon identified the cow as a sheep, while Idefics2-8B saw two cows when there is only one. Chameleon in particular seems to be quite prone to hallucinating. Given that it’s one of the more recent models, I’m not sure why it did so badly.
If you know of any other free models that I could test in this experiment, please post them in the comments. As the technology improves, I’m sure better VLMs will be released that pass this test easily. Until then, it’s probably fair to say that the answer to the question posed by this post’s title is “no”: current AIs aren’t quite intelligent enough to understand webcomics yet.
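If you would like to run the same test on other models yourself, the basic loop is easy to sketch in Python. The outline below is only illustrative and is not a record of how the results above were produced. The names `run_experiment` and `format_results`, the file name `webcomic.png`, and the idea of wrapping each model in a function that takes an image path and a prompt are all assumptions made for the example. You would still need to write the actual call for each model.

```python
from typing import Callable, Dict

# The prompt used in this post.
PROMPT = "What is the message being conveyed by this image?"

# Hypothetical wrapper type: takes (image_path, prompt), returns the model's reply.
# How each wrapper talks to its model is left to the reader.
AskFn = Callable[[str, str], str]


def run_experiment(models: Dict[str, AskFn], image_path: str,
                   prompt: str = PROMPT) -> Dict[str, str]:
    """Send the same image and prompt to every model and collect the replies."""
    results = {}
    for name, ask in models.items():
        try:
            results[name] = ask(image_path, prompt).strip()
        except Exception as err:
            # One failing model should not stop the rest of the run.
            results[name] = f"[error: {err}]"
    return results


def format_results(results: Dict[str, str]) -> str:
    """Lay the replies out as in this post: model name, then its reply."""
    return "\n".join(f"{name}:\n{reply}" for name, reply in results.items())


if __name__ == "__main__":
    # Stand-in model so the sketch runs without downloading anything.
    def dummy_model(image_path: str, prompt: str) -> str:
        return f"(no real model attached) {prompt}"

    # "webcomic.png" is a placeholder file name.
    print(format_results(run_experiment({"Dummy": dummy_model}, "webcomic.png")))
```

Running the same function twice, once with the original image and once with the text-free version, gives two sets of replies that can be compared side by side.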