Talk To Your Image: A Step-by-Step Guide to LLaVA-1.5



What is LLaVA?

LLaVA (Large Language and Vision Assistant) is a model that combines a vision encoder with an LLM and can be trained end-to-end.

A vision encoder processes visual data like images and transforms it into a latent representation.

The LLM, on the other hand, processes the output of the vision encoder together with the text input to generate a response.

LLaVA trains these two components end-to-end to enable multimodal vision-language understanding.
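As a rough sketch of this data flow (module names and shapes here are illustrative, not the authors' actual code), the forward pass looks something like this:

```python
import torch
import torch.nn as nn

class TinyLLaVA(nn.Module):
    """Minimal sketch of the LLaVA data flow; not the official implementation."""

    def __init__(self, vision_encoder, projector, llm):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a frozen CLIP ViT
        self.projector = projector            # maps visual features into the LLM embedding space
        self.llm = llm                        # decoder-only language model (e.g. Vicuna)

    def forward(self, pixel_values, text_embeds):
        # 1. Encode the image into patch-level latent features: (B, num_patches, vision_dim)
        visual_feats = self.vision_encoder(pixel_values)
        # 2. Project them into the LLM's token-embedding space: (B, num_patches, llm_dim)
        visual_tokens = self.projector(visual_feats)
        # 3. Prepend the visual tokens to the text embeddings and let the LLM predict the response.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```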

As an early study in visual instruction tuning, LLaVA demonstrated strong visual reasoning ability.

LLaVA challenges

However, LLaVA underperformed on academic benchmarks that demand short-form responses, such as answering with the number of the correct option from a given set.

This weakness is believed to stem from the fact that LLaVA is not pretrained on large-scale data the way other approaches are. (Note: LLaVA uses image-text conversation data automatically generated by GPT-4.)

Research purpose and overview

In this research, we conducted the following investigations and verifications with the main goal of improving LLaVA's performance.

  • Improved vision-language connector: the connector between the vision encoder and the language model is changed from a linear projection to a two-layer MLP.
  • Investigating the effects of scaling: we scale each of three elements (training data, model size, and input image resolution) and investigate their effects.

Based on the results of these investigations, combining all the measures that improved benchmark performance achieved SoTA on 11 out of 12 tasks.

Improvements to LLaVA

Methods such as InstructBLIP have struggled to support both short-form and long-form VQA (Visual Question Answering) for two primary reasons.

  • In visual instruction tuning, using formats like “Q: {Question} A: {Answer}.” without clear prompts can be ambiguous about the expected answer format. This ambiguity might cause the LLM to overfit, resulting in short-form responses even during natural, visually-based conversations.
  • InstructBLIP trains only the Q-Former. To manage both short and long responses without fine-tuning the LLM, the response length must be regulated through the Q-Former’s output, similar to prefix tuning. However, the Q-Former’s capacity is more constrained than the LLM’s, so it does not offer precise control over the expected output format.

To address this, we appended a prompt to the end of the question that explicitly defines the output format for short answers. Here’s an illustrative addition:
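The original figure with the prompt is not reproduced here, but the idea is simply to append a fixed formatting sentence to the question. A minimal sketch (the formatting sentence follows the wording reported in the LLaVA-1.5 paper; the surrounding question is made up):

```python
# The short-answer formatting prompt reported in the LLaVA-1.5 paper,
# appended to the end of the question at training and inference time.
question = "What is the color of the bus?"
format_prompt = "Answer the question using a single word or phrase."
short_answer_query = f"{question}\n{format_prompt}"
print(short_answer_query)
```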

When the LLM was fine-tuned with such prompts, LLaVA was able to properly adjust its output format based on the user’s instructions.

As shown in Table 1 below, simply including VQAv2 with this prompt added during training significantly improved LLaVA’s performance on MME, which requires short responses, from 502.8 to 1323.8, putting it 111 points above InstructBLIP even at this early stage.

MLP Vision-Language Connector

Drawing inspiration from the improvements that self-supervised learning methods saw when switching from a linear projection to an MLP, we increased the connector’s expressive capacity by adopting a two-layer MLP as the vision-language connector. Consequently, LLaVA’s multimodal capability improved compared to the original linear projection setup. The extent of this improvement is detailed under “3 +MLP VL connector” in Table 1.
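A minimal sketch of this change (dimensions are illustrative, assuming a CLIP ViT-L/14 encoder with 1024-dim features and a 7B LLM with 4096-dim embeddings):

```python
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096  # assumed CLIP ViT-L/14 and 7B-LLM hidden sizes

# LLaVA: a single linear projection from visual features to the LLM embedding space.
linear_connector = nn.Linear(vision_dim, llm_dim)

# LLaVA-1.5: a two-layer MLP with a GELU in between, giving the connector more capacity.
mlp_connector = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)
```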

Add datasets focused on specific tasks

To enhance various capabilities of the model, we added four datasets that focus not only on VQA but also on OCR and region-level recognition. All of these are also used in InstructBLIP.

  • VQA that requires extensive knowledge: OKVQA, A-OKVQA
  • VQA that requires OCR: OCRVQA, TextCaps

Here, A-OKVQA was converted to a multiple choice task and the following format was used for the prompt:
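Roughly, the conversion looks like the sketch below. The final instruction sentence follows the wording reported in the paper; the question, choices, and exact layout are illustrative:

```python
# Convert an A-OKVQA sample into a multiple-choice prompt (illustrative layout).
question = "What is the man in the picture holding?"
choices = ["umbrella", "surfboard", "guitar", "ladder"]

options = "\n".join(f"{chr(ord('A') + i)}. {c}" for i, c in enumerate(choices))
mc_prompt = (
    f"{question}\n{options}\n"
    "Answer with the option's letter from the given choices directly."
)
print(mc_prompt)
```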

As shown in “4 +OKVQA/OCR” in Table 1, LLaVA outperforms InstructBLIP on all three tasks while using only a subset of InstructBLIP’s dataset, suggesting that LLaVA’s design is effective.

Additionally, with the addition of a region-level VQA dataset (RefCOCO), the model is now able to more accurately identify and locate specific parts and details within images. (“5 +Region-level VQA” in Table 1)

Other scaling

  • We scaled up the input image resolution from 224 to 336 so that the LLM can clearly “see” the details in the image (see the quick calculation after this list).
  • We added the GQA dataset as an additional visual knowledge source.
  • We added ShareGPT data.
  • We scaled up the LLM from 7B to 13B.
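To see why the resolution bump matters, here is a quick back-of-the-envelope calculation (assuming the CLIP ViT-L/14 encoder used by LLaVA, which splits the image into 14x14-pixel patches):

```python
# The number of visual tokens the LLM receives grows quadratically with resolution.
patch_size = 14  # CLIP ViT-L/14 patch size
for resolution in (224, 336):
    patches_per_side = resolution // patch_size
    print(f"{resolution}px -> {patches_per_side ** 2} visual tokens")
# 224px -> 256 visual tokens
# 336px -> 576 visual tokens
```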

The MM-Vet results show the most significant improvement when the LLM is scaled up, suggesting the importance of the underlying LLM’s abilities in visually informed conversations.

The final model with all changes is called LLaVA-1.5 (last two rows of Table 1).

Comparison with SoTA

Outperforming existing methods, LLaVA-1.5 excelled in 11 out of 12 benchmarks, even with minimal pre-training and instruction tuning data. Its achievement, based on a publicly accessible dataset and a straightforward architecture, sets a cost-effective and easily replicable standard for upcoming research.

In addition, the results suggest that visual instruction tuning plays a more important role than pre-training in improving the capabilities of an LMM (Large Multimodal Model). Since vision encoders such as CLIP, OpenCLIP, and EVA-CLIP are already pre-trained on web-scale image-text pair datasets, this challenges the conventional wisdom that a large amount of vision-language alignment pre-training is required when building an LMM.

Even the 7B model of LLaVA-1.5 outperforms the 80B IDEFICS. This result makes us reconsider the benefits of visual resamplers and the need for extensive pre-training in terms of multimodal instruction-following ability.

Generalization of Zero-shot instructions

Although LLaVA-1.5 was trained on only a limited set of instruction formats, it has been found to generalize to other instructions as well. For example, VizWiz requires the model to output “Unanswerable” when the provided content is insufficient to answer the question, and the answer-format prompt shown in Table 8 handles this well: among the unanswerable questions, the proportion for which the model responded “Unanswerable” rose from 11.1% to 67.8%.
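As a sketch of how such an instruction might be attached to a VizWiz question (the wording below is a paraphrase of the answer-format prompt described above; see the paper’s Table 8 for the exact text):

```python
# Paraphrased VizWiz-style answer-format prompt; the exact wording is in the paper (Table 8).
question = "What does this label say?"
vizwiz_prompt = (
    "When the provided information is insufficient, respond with 'Unanswerable'. "
    "Answer the question using a single word or phrase."
)
query = f"{question}\n{vizwiz_prompt}"
```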

Additionally, we provide qualitative examples of instructing LLaVA-1.5 to validate questions (Table 3) and provide answers in JSON format (Table 4).

Zero-shot multilingual ability

Although LLaVA-1.5 has not been fine-tuned to follow multilingual multimodal instructions, it generalizes to them; one contributing factor is the multilingual language instructions in the ShareGPT data. In particular, it is noteworthy that LLaVA-1.5 outperformed Qwen-VL-Chat by 7.3% on MMBench-CN, even though LLaVA-1.5 was not instruction-tuned on Chinese multimodal instruction data.

Computational cost

For LLaVA-1.5, we used the same pre-training dataset (LCS-558K) and performed instruction tuning with almost the same training iterations and batch size as LLaVA. Since the input image resolution was increased to 336px, the training time of LLaVA-1.5 is approximately twice that of LLaVA.

Concretely, on a single 8×A100 node, pretraining takes about 6 hours and visual instruction tuning about 20 hours.

LLaVA-1.5 issues

LLaVA uses the full set of image patches, so each training iteration can take a long time. A visual resampler reduces the number of visual patches passed to the LLM, but models that use one cannot converge as efficiently as LLaVA given the same amount of training data, probably because the resampler has a large number of learnable parameters. The development of sample-efficient visual resamplers could pave the way for scaling up instruction-tuned multimodal models in the future.

Also, LLaVA-1.5 does not yet have the ability to process multiple images due to a lack of instruction data and context length limitations.

In addition, although LLaVA-1.5 has the ability to follow complex instructions, it shows limited problem-solving ability in certain domains. This could be improved by using a more capable language model or higher-quality, goal-oriented visual instruction tuning data.

While LLaVA has made significant strides in reducing hallucinations, it can still occasionally generate them and potentially disseminate misinformation. Hence, it’s imperative to exercise caution, especially when utilizing it for critical tasks like medical care.

Conclusion (impressions)

It was a very interesting paper and I think it’s safe to say that LLaVA-1.5 is one of the most impactful V&L models released recently. Many of the improvements made in LLaVA-1.5 are easy to incorporate and will be useful for future reference.

By the way, you can try the LLaVA-1.5 demo at the link below: https://llava.hliu.cc/
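If you would rather run it locally than use the hosted demo, a minimal sketch with the Hugging Face transformers port of LLaVA-1.5 might look like this (the model id and chat template are assumptions based on the community llava-hf release, not something specified in the paper):

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed model id from the community llava-hf port of LLaVA-1.5.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

image = Image.open("example.jpg")  # replace with your own image

# Vicuna-style chat template assumed from the llava-hf model card.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```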


References:

https://github.com/haotian-liu/LLaVA
https://arxiv.org/abs/2310.03744

Feel free to check out my other articles:

See Also: Elevate Your Tech: MemGPT Transforms LLM into an Epic Operating System

See Also: How Powerful AutoGen is Reshaping LLM

TARIK KAOUTAR 🙋‍♂️ Hey, my name is Kaoutar Tarik (高達烈). I split my time between university and a security lab, and I mostly write about AI, data analysis, blockchain, data science, machine learning, and internet entrepreneurship.
