Created using DALL·E on 28 April 2025

Introduction

Vision-enabled AI models have rapidly become essential tools across numerous applications, from content moderation to image analysis and multimodal reasoning. Cohere’s recent entry into this space, Aya Vision, promises competitive capabilities in an increasingly crowded market of multimodal AI systems.

In this blog post, I’ll share my hands-on experience testing Aya Vision (32B model) against GPT-4o, focusing on several key areas critical for real-world applications. Rather than relying on marketing claims or theoretical specifications, this analysis is based on direct testing with identical prompts and images across both models.

What is Aya Vision?

Aya Vision, part of Cohere’s Aya family, aims to make generative AI accessible across languages and modalities. It is available in two sizes, 8 billion parameters (Aya Vision 8B) and 32 billion parameters (Aya Vision 32B), and is optimized for vision-language tasks. As an open-weight model, it is freely accessible for non-commercial research via platforms like Hugging Face and Kaggle, and it can also be tried in Cohere’s Playground. Supporting 23 languages spoken by roughly half the world’s population, Aya Vision is designed for tasks such as image captioning, visual question answering, text generation, and translation, making it a versatile tool for global applications.
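
Because the weights are open, the model can also be run locally rather than through the Playground. The snippet below is a minimal sketch of that route, not code used for the tests in this post; the checkpoint name CohereForAI/aya-vision-8b, the AutoModelForImageTextToText class, and the image-aware chat template are assumptions about the Hugging Face packaging.

```python
# Minimal sketch: running Aya Vision 8B locally via Hugging Face transformers.
# Assumptions (not taken from this article's tests): the checkpoint is published
# as "CohereForAI/aya-vision-8b" and a recent transformers release supports it
# through AutoModelForImageTextToText and image-aware chat templates.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "CohereForAI/aya-vision-8b"  # the 32B variant follows the same pattern
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "eyeglasses.jpg"},  # local path or URL to a test image
        {"type": "text", "text": "Describe what you see in this image"},
    ],
}]

# Build model inputs (text tokens plus pixel values) from the chat messages.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=300)
# Decode only the newly generated tokens, skipping the echoed prompt.
print(processor.tokenizer.decode(
    generated[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```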

Testing Methodology

My testing approach involved challenging both models with identical images and prompts across several categories (a sketch of how such a paired query can be scripted follows the list):

  1. Technical Code Analysis: Evaluating how well each model can interpret and explain programming code in images
  2. Basic Image Recognition: Testing fundamental object identification capabilities
  3. Visual Reasoning: Assessing the ability to make accurate inferences about visual information
  4. OCR and Text Interpretation: Examining how effectively the models can read and understand text in images
  5. Object Counting: Testing precision in counting and identifying multiple objects
  6. Multilingual Capabilities: Assessing the ability to recognize and translate non-English text
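
To keep the comparison fair, every test sent exactly the same prompt and image to both models. For the GPT-4o side, that kind of paired query can be scripted with the OpenAI Python SDK; the sketch below is illustrative rather than the exact script used for these tests (the image path and prompt are placeholders), and the Aya Vision side can be run through Cohere’s Playground or the local loading sketch shown earlier.

```python
# Illustrative sketch: sending one prompt plus one image to GPT-4o.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()

def encode_image(path: str) -> str:
    """Read a local image file and return it as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def ask_gpt4o(prompt: str, image_path: str) -> str:
    """Send the same prompt and image used for Aya Vision to GPT-4o."""
    image_b64 = encode_image(image_path)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example: the object-counting test from category 5 (placeholder file name).
print(ask_gpt4o("How many cars do you see in this image?", "street_scene.jpg"))
```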

Test Results

1. Technical Code Analysis: Flutter Code Snippet

The Challenge: Both models were presented with a Flutter (Dart) code snippet and asked to analyze it.

Prompt: Analyze the code snippet shown in this image.

Image used:

GPT-4o’s Response:

Aya Vision’s Response:

Analysis: In this test, GPT-4o demonstrated significantly superior technical understanding and accuracy. Aya Vision misidentified the Flutter (Dart) snippet as TypeScript, and its response included substantial hallucinations, raising serious concerns about its reliability for technical use cases.

2. Basic Image Recognition: Eyeglasses

The Challenge: Both models were shown an image of transparent eyeglasses on a white surface and asked to describe what they see.

Prompt: Describe what you see in this image

Image used:

GPT-4o’s Response:

Aya Vision’s Response:

Analysis: Both models performed well on this basic image recognition task, with similarly accurate descriptions. This suggests that for straightforward object identification, both models have comparable capabilities.

3. Visual Reasoning: Food Identification

The Challenge: Both models were shown an image of “mushroom buns” in a bamboo steamer basket and asked to identify the dish.

Prompt: Can you guess what dish is in the image?

Image used:

GPT-4o’s Response:

Aya Vision’s Response:

Analysis: This test revealed significant differences in visual reasoning. GPT-4o correctly identified the distinctively shaped mushroom buns, while Aya Vision misidentified them as char siu bao (barbecued pork buns), a different dim sum dish entirely, demonstrating less refined visual reasoning.

4. Text Interpretation: Career Document

The Challenge: Both models were presented with a text-heavy image containing a “Career Episode” document and asked to summarize the key points.

Prompt: Summarize the key points from the text in this image.

Image used:

GPT-4o’s Response:

Aya Vision’s Response:

Analysis: GPT-4o demonstrated a superior ability to extract, organize, and faithfully represent the text in the image, while Aya Vision produced a vague summary with multiple factual errors and omissions.

5. Object Counting: Urban Scene with Cars

The Challenge: Both models were shown an image of a street scene with multiple vehicles and asked to count the cars.

Prompt: How many cars do you see in this image?

Image used:

GPT-4o’s Response:

Aya Vision’s Response:

Analysis: This test exposed Aya Vision’s limitations in visual counting and identification, even with clearly visible objects: it counted only three cars and missed others in the scene. GPT-4o detected, counted, and correctly identified vehicles throughout the scene.

6. Multilingual Capabilities: Non-English Text Recognition

The Challenge: Both models were presented with images containing non-English text (Hindi and Chinese) and asked to translate the content.

Prompt: Translate the text in the attached image to english

Image used:

GPT-4o’s Response:

Aya Vision’s Response:

Analysis: This test highlighted a critical limitation in Aya Vision’s handling of non-English text in images: it failed to translate either the Hindi or the Chinese sample. GPT-4o, by contrast, demonstrated strong multilingual OCR, successfully translating both.

| Prompt Type | Prompt Description | Aya Vision 32B Response | GPT-4o Response | Analysis |
| --- | --- | --- | --- | --- |
| Code Analysis | Analyze a Flutter code snippet for a floating header UI | Incorrectly identified as TypeScript, missed Flutter specifics | Correctly identified as Flutter, detailed analysis, suggested fixes | GPT-4o was accurate and insightful; Aya Vision failed to recognize Flutter. |
| Image Identification | Guess the dish in the image (steamed buns) | Misidentified as char siu bao (barbecued pork buns) | Correctly identified as mushroom buns, added context | GPT-4o was accurate; Aya Vision misinterpreted the dish. |
| Visual Question Answering | How many cars are in this image? | Counted 3 cars, missed the SUV behind the Lamborghini | Counted 6 cars, including background vehicles | GPT-4o was more accurate; Aya Vision undercounted. |
| Text Summarization | Summarize a career episode text | Misinterpreted context, missed key details | Accurate summary, captured key points | GPT-4o was precise; Aya Vision was vague and incorrect. |
| Multilingual Translation | Translate Hindi text in an image to English | Failed to translate Hindi and Chinese text from images | Correctly translated the Hindi as “Hello, how are you?” | GPT-4o succeeded; Aya Vision failed despite its multilingual focus. |

Conclusion

While Cohere’s Aya Vision model demonstrates competence in basic image recognition, it currently lags behind GPT-4o in technical accuracy, hallucination control, visual reasoning, and multilingual text recognition. The significant hallucinations observed in the technical and counting tasks, combined with its limited multilingual OCR, raise concerns about its reliability for professional applications that require precision.

For users deciding which vision model to adopt, these findings suggest that GPT-4o currently offers more reliable performance across diverse use cases, particularly those requiring technical understanding, multilingual support, or faithful representation of image contents.

As the field of multimodal AI continues to evolve rapidly, it will be interesting to see how Cohere refines Aya Vision in future iterations to address these challenges.

About the Author

Intern at Research Graph Foundation