Phi-3.5-vision: Explained with Working Examples

The Phi3 model series and its variations (Source [5])

Introduction

The field of artificial intelligence has taken another significant leap forward with the introduction of Phi-3.5-vision, a cutting-edge multimodal AI model developed by Microsoft. This lightweight yet powerful model represents a new frontier in machine learning, combining advanced natural language processing capabilities with sophisticated image understanding. Phi-3.5-vision stands out for its ability to handle both text and visual inputs with remarkable efficiency, making it a versatile tool for a wide range of applications in today’s increasingly visual digital landscape.

What sets Phi-3.5-vision apart is its impressive balance of performance and resource efficiency. With a context length of 128,000 tokens and an architecture optimised for both single and multi-image processing, this model opens up new possibilities for developers and researchers alike. From general image understanding and optical character recognition to complex tasks like chart interpretation and video summarisation, Phi-3.5-vision demonstrates capabilities that were once the domain of much larger, more resource-intensive models. As we delve deeper into this article, we’ll explore the key features, practical applications, and working examples that showcase the true potential of this innovative AI model.

Understanding Phi-3.5-vision

Key Features and Capabilities

Phi-3.5-vision stands out for its versatility and efficiency in processing both textual and visual information. Its capabilities extend across a wide range of tasks, making it a powerful tool for various applications.

Key Features

  1. Extensive context length: 128,000 tokens
  2. Efficient performance in resource-constrained environments
  3. Multimodal processing of text and images

Main Capabilities

  1. Visual Processing
  2. Text and Data Extraction
  3. Multimodal Integration

How it Differs from Previous Models

Phi-3.5-vision represents a significant advancement over its predecessors, offering improved performance and broader applicability while maintaining a relatively compact size.

Key differences include:

  1. Enhanced Multimodal Processing
  2. Performance and Efficiency
  3. Ethical Considerations
  4. Versatility

Technical Specifications

The architecture of Phi-3.5-vision is designed to optimise performance while maintaining efficiency, allowing it to handle complex tasks with relatively modest computational requirements.

  1. Architecture Details
     — Image encoder
     — Connector
     — Projector
     — Phi-3 Mini language model
  2. Training Specifications
  3. Training Techniques
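
To make the listed components concrete, here is a purely illustrative sketch of how they fit together, using placeholder modules rather than Microsoft’s actual implementation: the image encoder turns pixels into visual features, the connector and projector map those features into the language model’s embedding space, and the Phi-3 Mini backbone generates text over the combined sequence.

# Illustrative sketch only: placeholder modules, not Microsoft's implementation.
import torch
from torch import nn

class PhiVisionSketch(nn.Module):
    def __init__(self, image_encoder: nn.Module, connector: nn.Module,
                 projector: nn.Module, language_model: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder    # produces patch-level visual features from pixels
        self.connector = connector            # reshapes/merges visual features for projection
        self.projector = projector            # maps visual features into the LM embedding space
        self.language_model = language_model  # Phi-3 Mini backbone that generates text

    def forward(self, pixel_values: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
        visual_features = self.image_encoder(pixel_values)   # (batch, patches, vision_dim)
        visual_features = self.connector(visual_features)
        visual_tokens = self.projector(visual_features)      # (batch, patches, lm_dim)
        # Projected image tokens are spliced into the text sequence before decoding.
        fused_sequence = torch.cat([visual_tokens, text_embeddings], dim=1)
        return self.language_model(fused_sequence)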

Environment Setup

Hardware Requirements

Phi-3.5-vision is optimised for efficiency but requires specific hardware for optimal performance. The model has been tested and performs well on NVIDIA A100, A6000, and H100 GPUs. Ensure your system has sufficient VRAM to handle the 4.2B parameter model.
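
Before downloading the weights, a quick optional check (a generic PyTorch snippet, not something specific to Phi-3.5-vision) confirms that a CUDA device is visible and reports how much VRAM it offers:

import torch

# Sanity check: is a CUDA GPU visible, and how much memory does it have?
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GiB")
else:
    print("No CUDA device detected; inference on CPU will be very slow.")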

Dependencies Installation

To run Phi-3.5-vision locally, you’ll need to set up a Python environment with specific package versions. Follow these steps:

  1. Create a new virtual environment using your preferred method (venv, conda, etc.).
  2. Create a file named requirements.txt listing the required packages (an indicative example is shown after the note below).
  3. Install the packages with pip: pip install -r requirements.txt

Note: Ensure that torch is installed with CUDA support for GPU acceleration. You may need to visit the PyTorch website to get the correct installation command for your specific CUDA version.
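
The article’s original package list is not reproduced here, so treat the following requirements.txt as an indicative sketch; the pinned versions mirror those commonly suggested for this model and should be aligned with your CUDA toolkit:

# requirements.txt (indicative versions; match torch and flash-attn to your CUDA toolkit)
torch==2.3.0
torchvision==0.18.0
transformers==4.43.0
accelerate==0.30.0
Pillow==10.3.0
Requests==2.31.0
flash_attn==2.5.8  # optional; enables the flash_attention_2 implementation

With the file in place, pip install -r requirements.txt installs everything in one step.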

Best Practices for Prompting

Once your environment is set up, using Phi-3.5-vision effectively involves crafting appropriate prompts. Here are some key guidelines:

  1. Use the chat format: Phi-3.5-vision is optimised for chat-like interactions, so structure your prompts with the chat template shown in the sketch after this list. For single image tasks, include one image placeholder before your question.
  2. For multi-image tasks, include one placeholder per image, in order, before your question (also shown in the sketch below).
  3. Be specific: Clearly state what you want the model to analyse or describe in the image(s).
  4. Leverage multi-modal capabilities: Combine text and image prompts to fully utilise the model’s strengths.
  5. Adjust for task type: for example, short factual prompts work well for OCR-style extraction, while more open-ended prompts suit description and summarisation.
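
As a concrete illustration of points 1 and 2, the chat template used by the Phi-3 family looks like the following, with one <|image_N|> placeholder per attached image; the question text is just an example.

For a single image:

<|user|>
<|image_1|>
Describe this image in detail.<|end|>
<|assistant|>

For multiple images:

<|user|>
<|image_1|>
<|image_2|>
Describe each of these images.<|end|>
<|assistant|>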

Working Examples

Single Image Analysis

One of the most fundamental tasks for a vision-language model is describing a single image. Let’s explore how Phi-3.5-vision handles this task with a practical example. For this demonstration, we’ll use an image of a dog and ask the model to describe it in detail.

Example Dog for Phi3.5 Single Image Analysis (Source [3])

Here’s the Python code for this example
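
The original listing is not embedded in this version of the article, so the snippet below is a minimal sketch that follows the usage documented for microsoft/Phi-3.5-vision-instruct on Hugging Face; the file name dog.jpg is a placeholder for the image above.

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"

# Load the model and processor (switch _attn_implementation to "flash_attention_2" if flash-attn is installed)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation="eager",
)
# num_crops=16 is the suggested setting for single-image inputs (4 for multi-image)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, num_crops=16)

# Placeholder path for the dog image
image = Image.open("dog.jpg")

# Chat-format prompt with a single image placeholder
messages = [{"role": "user", "content": "<|image_1|>\nDescribe this image in detail."}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(prompt, [image], return_tensors="pt").to("cuda")

generate_ids = model.generate(
    **inputs, max_new_tokens=500, eos_token_id=processor.tokenizer.eos_token_id
)
# Drop the prompt tokens before decoding the reply
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)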

When we run this code with our dog image, Phi-3.5-vision generates the following description as seen in the terminal screenshot:

Generated Image Description

This output demonstrates several impressive capabilities of Phi-3.5-vision:

  1. Object Recognition: The model correctly identifies the main subject as a dog, even suggesting a specific breed (Poodle).
  2. Action Understanding: It recognises that the dog is in mid-air and carrying a stick, inferring playful activity.
  3. Detail Observation: The model notes specific details like the dog’s well-groomed fur and happy expression.
  4. Scene Comprehension: It describes the background, mentioning the grassy field and yellow flowers.

Multiple Image Analysis

Phi-3.5-vision’s ability to process and analyse multiple images simultaneously is a powerful feature that sets it apart from many other vision models. Let’s explore this capability with an example where the model describes two different pet images: the dog from the above example and a cat.

Example Cat for Phi3.5 Multi Images Analysis (Source [4])

Here’s the Python code for this example
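
Again as a sketch under the same assumptions as the previous example (dog.jpg and cat.jpg are placeholder file names), the main difference is that a list of two images is passed to the processor and each one gets its own placeholder token in the prompt.

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda", torch_dtype="auto",
    trust_remote_code=True, _attn_implementation="eager",
)
# num_crops=4 is the suggested setting for multi-image inputs
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, num_crops=4)

# Placeholder paths for the two pet images
images = [Image.open("dog.jpg"), Image.open("cat.jpg")]

# One placeholder token per image, in order
messages = [{
    "role": "user",
    "content": "<|image_1|>\n<|image_2|>\nDescribe each of these two images in detail.",
}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, images, return_tensors="pt").to("cuda")

generate_ids = model.generate(
    **inputs, max_new_tokens=800, eos_token_id=processor.tokenizer.eos_token_id
)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])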

When we run this code with our two pet images, Phi-3.5-vision generates the following response:

Generated Multiple Image Description.

This output demonstrates several impressive capabilities of Phi-3.5-vision in multi-image processing:

  1. Simultaneous Analysis: The model can process and analyse multiple images at once, providing detailed descriptions for each without confusion.
  2. Inference of Intent: The model goes beyond mere description, inferring that the dog is playing and the cat might be hunting, based on their postures and expressions.

Multi-turn Conversation with Images

Phi-3.5-vision’s ability to engage in multi-turn conversations about multiple images showcases its versatility in visual analysis and comparison. Let’s explore this capability with an example dialogue involving two different images: the dog and the cat in the examples above.

Here’s the Python code for this example
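
The listing below is a sketch of one way to drive such a dialogue, under the same assumptions as the earlier examples (placeholder file names, Hugging Face usage pattern): the messages list accumulates every turn, including the model’s own replies, and a second image is appended for the final comparison question.

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda", torch_dtype="auto",
    trust_remote_code=True, _attn_implementation="eager",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, num_crops=4)

def generate(messages, images):
    # Render the running conversation with the chat template and generate a reply
    prompt = processor.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(prompt, images, return_tensors="pt").to("cuda")
    out = model.generate(**inputs, max_new_tokens=500,
                         eos_token_id=processor.tokenizer.eos_token_id)
    out = out[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(out, skip_special_tokens=True)[0]

images = [Image.open("dog.jpg")]  # placeholder path
messages = [{"role": "user", "content": "<|image_1|>\nWhat animal is in this picture and what is it doing?"}]

# Turn 1: describe the dog image
reply = generate(messages, images)
print(reply)
messages.append({"role": "assistant", "content": reply})

# Turn 2: follow-up question that relies on conversation memory
messages.append({"role": "user", "content": "What breed do you think the dog is?"})
reply = generate(messages, images)
print(reply)
messages.append({"role": "assistant", "content": reply})

# Turn 3: introduce a second image and ask for a comparison
images.append(Image.open("cat.jpg"))  # placeholder path
messages.append({"role": "user", "content": "<|image_2|>\nDescribe this new image and compare it with the first one."})
print(generate(messages, images))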

The generated dialogue style output is shown in the following image:

Generated Dialogue and Comparative Analysis

This output demonstrates several impressive capabilities of Phi-3.5-vision:

  1. Memory and Context Retention: The model remembers details from previous turns. When asked about the dog’s breed in the second turn, it confidently identifies it as a Poodle without needing to re-analyse the image.
  2. Multi-Image Processing: The model can handle multiple images in a single conversation. It seamlessly transitions from discussing the dog image to analysing the new cat image.
  3. Comparative Analysis: In the third turn, the model not only describes the new cat image but also compares it to the previous dog image, highlighting differences in coat type, species, and posture.

Document Understanding

One of Phi-3.5-vision’s most impressive capabilities is its ability to understand and analyse complex documents, including scientific papers with charts and figures. This feature demonstrates the model’s potential for advanced OCR (Optical Character Recognition) and chart interpretation. Let’s explore this capability with an example using a scientific paper and a chart within it.

Academic Paper (Source [2]) Abstract and Introduction Section for Phi3.5 Document Analysis
Chart from Academic Paper (Source [2]) for Phi3.5 Chart Analysis

Here’s the Python code for this example
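
As before, the snippet below is a sketch rather than the article’s original listing; paper_abstract.png and paper_chart.png are placeholder names for the two screenshots above, and each image is sent with its own task-specific question.

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda", torch_dtype="auto",
    trust_remote_code=True, _attn_implementation="eager",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, num_crops=16)

# Placeholder file names for the paper page and the chart crop
tasks = [
    ("paper_abstract.png", "Read this page and summarise the paper's title and abstract."),
    ("paper_chart.png", "Analyse this chart: describe the axes, the legend, and the main trends in the data."),
]

for path, question in tasks:
    image = Image.open(path)
    messages = [{"role": "user", "content": f"<|image_1|>\n{question}"}]
    prompt = processor.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(prompt, [image], return_tensors="pt").to("cuda")
    out = model.generate(**inputs, max_new_tokens=800,
                         eos_token_id=processor.tokenizer.eos_token_id)
    out = out[:, inputs["input_ids"].shape[1]:]
    print(processor.batch_decode(out, skip_special_tokens=True)[0])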

When we run this code with our scientific paper images, Phi-3.5-vision generates the following analyses:

Generated Document and Chart Analysis for Academic Paper

This output demonstrates several impressive capabilities of Phi-3.5-vision in document understanding:

  1. Text Recognition and Comprehension: The model accurately reads and comprehends complex text, including the paper’s title, abstract content, and figure captions.
  2. Structure Understanding: It identifies different sections of the paper, such as the abstract and figures, demonstrating an understanding of scientific paper structure.
  3. Chart Analysis: For the bar chart, the model provides a comprehensive analysis, including axis labels, legend interpretation, and trend identification.
  4. Data Interpretation: The model not only describes the chart’s visual elements but also interprets the data trends and their potential significance.

Benchmark and Performance

Phi-3.5-vision has been evaluated on a variety of benchmarks to assess its performance in different aspects of vision-language tasks. Let’s review some of the key results and compare them with other prominent models in the field.

Overview of Benchmarks

  1. MMMU (Multimodal Understanding): 43.0. Tests the model’s ability to understand and reason about multimodal inputs.
  2. MMBench (Multimodal Benchmark), dev-en split: 81.9. Evaluates the model’s performance on a wide range of multimodal tasks.
  3. ScienceQA (img-test): 91.3. Tests the model’s ability to answer science-related questions based on images.
  4. MathVista (testmini): 43.9. Evaluates the model’s capability in visual mathematical reasoning.
  5. TextVQA (val): 72.0. Assesses the model’s ability to answer questions about text in images.
  6. POPE (test): 86.1. Tests the model’s ability to verify the presence of objects in images.

Comparison with other models

Benchmark Result between Phi3.5 and Popular Models (Source [1])

Phi-3.5-vision demonstrates competitive performance when compared to other prominent vision-language models, including larger ones like InternVL-2, Gemini-1.5, GPT-4o-mini, and Claude-3.5-Sonnet. Despite its relatively small size of 4.2B parameters, Phi-3.5-vision holds its own across various benchmarks. It’s particularly noteworthy that on the TextVQA benchmark, which tests the model’s ability to understand text within images, Phi-3.5-vision outperforms all compared models, including those with significantly larger parameter counts.

While Phi-3.5-vision doesn’t consistently top every benchmark, its performance is remarkably balanced. It shows strong results in general multimodal understanding (MMMU, MMBench), specialised tasks like science question answering (ScienceQA), and object verification (POPE). This balanced performance, combined with its smaller size, makes Phi-3.5-vision an attractive option for applications where computational efficiency is a concern without significantly compromising capability.

Limitations and Considerations

Phi-3.5-vision, while powerful, has limitations that users should be aware of. Its performance may vary across languages, and it can potentially generate inaccurate information or reflect biases present in its training data. The model’s scope for code generation is primarily limited to Python and common packages.

When deploying Phi-3.5-vision, responsible AI practices are crucial. These include being transparent about AI-generated content, safeguarding privacy, implementing content moderation, preventing misuse, maintaining human oversight in critical applications, and ensuring legal and ethical compliance. Users should also be mindful of potential biases, especially in applications involving sensitive categories. Continuous evaluation of the model’s performance and impact in various contexts is essential for its responsible use. By acknowledging these limitations and adhering to responsible AI principles, users can effectively leverage Phi-3.5-vision’s capabilities while mitigating potential risks.

Conclusion

Phi-3.5-vision represents a significant advancement in multimodal AI, offering impressive capabilities in image understanding, text analysis, and visual reasoning. Its balanced performance across various benchmarks, coupled with its relatively compact size, positions it as a versatile tool for a wide range of AI applications. From enhancing document analysis to powering more intuitive human-computer interactions, Phi-3.5-vision has the potential to drive innovation across industries. As research in this field continues to evolve, we can anticipate further improvements in efficiency, accuracy, and broader language support, opening up new possibilities for AI-powered solutions in an increasingly visual digital world.

References

[1] Microsoft Phi-3.5-vision-instruct Model. View Model.

[2] Not Every Image is Worth a Thousand Words: Quantifying Originality in Stable Diffusion. Read Paper.

[3] Dog Image Source: Encyclopædia Britannica. View Source.

[4] Cat Image Source: Purina. View Source.

[5] Phi3.5-Mini: Overview. Read Article.

About the Author

Mingrui Gao
Intern at Research Graph Foundation