IDEA: Image Description Enhanced CLIP-Adapter (2025)

Zhipeng Ye (zhipengye@nustti.edu.cn), Feng Jiang (jf@nustti.edu.cn), Qiufeng Wang (qiufeng.wang@xjtlu.edu.cn), Kaizhu Huang (kaizhu.huang@duke.edu), Jiaqi Huang (2207880112@nustti.edu.cn)

Abstract

CLIP (Contrastive Language-Image Pre-training) has attained great success in pattern recognition and computer vision. Transferring CLIP to downstream tasks (e.g. zero- or few-shot classification) is a hot topic in multimodal learning. However, current studies primarily focus on either prompt learning for text or adapter tuning for vision, without fully exploiting the complementary information and correlations among image-text pairs. In this paper, we propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks. This method captures fine-grained features by leveraging both visual features and textual descriptions of images. IDEA is a training-free method for CLIP, and it is comparable to, or even exceeds, state-of-the-art models on multiple tasks. Furthermore, we introduce Trainable-IDEA (T-IDEA), which extends IDEA by adding two lightweight learnable components (i.e., a projector and a learnable latent space), further enhancing the model’s performance and achieving SOTA results on 11 datasets. As one important contribution, we employ the Llama model and design a comprehensive pipeline to generate textual descriptions for images of 11 datasets, resulting in a total of 1,637,795 image-text pairs, named “IMD-11.” Our code and data are released at https://github.com/FourierAI/IDEA.

keywords:

CLIP, Adapter Tuning, Image-Text Pairs, Multimodal Learning, Few-Shot Image Classification

journal: Pattern Recognition

Affiliations:

[aff1] Taizhou Institute of Science and Technology, Nanjing University of Science and Technology, Taizhou 225300, Jiangsu, China
[aff2] Department of Intelligence Science, Xi’an Jiaotong-Liverpool University, Suzhou 215123, Jiangsu, China
[aff3] Duke Kunshan University, Suzhou 215123, Jiangsu, China

1 Introduction

While animals primarily perceive the world through their visual system, only humans have evolved language systems over millions of years. Language enables humans to understand, use, and create things through logical reasoning, ultimately giving rise to intelligence. In computer vision, some studies[1, 2, 3] have shown that incorporating language/text information into vision tasks can significantly enhance a model’s visual understanding capabilities and therefore improve its performance. CLIP (Contrastive Language-Image Pre-training)[1] is a dual-tower Vision-Language Model (VLM) that consists of a visual encoder and a textual encoder. CLIP is pre-trained on large-scale image-text pairs using contrastive learning. During this process, the image and text data interact with each other, endowing the model with strong generalization ability and enabling CLIP to classify images unseen during training (zero-shot learning)[4, 5].

Fine-tuning CLIP for downstream vision tasks has become a hot research topic in recent years[6, 7]. Notably, Parameter-Efficient Fine-Tuning (PEFT) freezes the parameters of the model’s backbone and fine-tunes lightweight learnable parameters incorporated for downstream datasets[8]. PEFT achieves or even surpasses the performance of full fine-tuning on some tasks, and recent studies have explored PEFT for CLIP. Linear Probe[1] utilizes CLIP’s vision encoder to extract features, which are then fed to a linear layer for training, enabling it to handle few-shot image classification tasks where only a very limited number of samples are available for each class[9, 10, 11]. Subsequent research[12, 13] focuses on exploiting text features to enhance few-shot learning. As shown in Fig. 1, CoOp[12] and CoCoOp[13] improve few-shot image classification performance by incorporating learnable text prompts. CLIP-Adapter[14] optimizes a vision adapter, a two-layer Multi-Layer Perceptron (MLP), to learn new vision features for few-shot image classification tasks. In Training-Free CLIP-Adapter (Tip-Adapter)[15], the two-layer MLP is replaced by a cache model, leading to a significant boost in few-shot classification performance.

[Figure 1]

The work mentioned above primarily concentrates on optimizing text prompts or vision adapters, without fully exploiting the complementary relationship and inherent semantic correlation among image-text pairs, which limits performance. To this end, in this paper, we propose a novel multimodal adapter, Image Description Enhanced CLIP-Adapter (IDEA), in which the test image is retrieved against image-text pairs from the training set for few-shot classification tasks. IDEA is a training-free method, yet its performance rivals that of supervised training methods. Furthermore, we introduce Trainable Image Description Enhanced CLIP-Adapter (T-IDEA), which adds two learnable components to IDEA to further promote performance; T-IDEA achieves state-of-the-art (SOTA) performance on 11 commonly used image datasets. In addition, most image datasets lack corresponding image descriptions, and labeling them is time-consuming and laborious. We therefore employ Llama[16] and design a comprehensive pipeline to generate a textual description for each image.

The contributions of this paper are summarized as follows:

  • 1.

    We propose IDEA, which utilizes the complementary relationship among image-text pairs and explores semantic association across multi-modalities in a training-free paradigm, to realize few-shot image classification.

  • 2.

    We propose T-IDEA, which extends IDEA by adopting a lightweight projection layer and a learnable semantic latent space to boost the performance of IDEA.

  • 3.

    We design a comprehensive pipeline to generate image descriptions for 11 public image datasets, resulting in a total of 1,637,795 image-text pairs. Our dataset, which is referred to as “IMD-11”, has been made publicly available.

  • 4.

    We evaluate the proposed methods on 11 public image datasets. The experimental results show that IDEA and T-IDEA respectively outperform SOTA methods in the training-free and training-required settings.

2 Related Work

In this section, we review the literature related to the paper, including Vision-Language Model (VLM) and Parameter-Efficient-Fine-Tuning (PEFT).

2.1 Vision-Language Model

A modality is a channel through which humans perceive and recognize the world, such as vision, text, audition, and touch[17, 18, 19]. Vision and text are the main ways humans perceive the world, and they have attracted extensive research interest[7, 20, 21] from scholars around the world. The invention of the Transformer[22] provides a unified model for both computer vision (CV) and natural language processing (NLP) and gave birth to the development of VLMs[1]. A VLM is a pre-trained model whose training methods are mainly categorized into image-text contrastive learning and pre-training with generative objectives[17].

Image-Text Contrastive Learning. Image-text contrastive learning is the most common method for training VLMs. It applies contrastive learning to the input image-text pairs, pulling semantically similar pairs close in the embedding space and pushing semantically different pairs apart. CLIP[1] collects and cleans 400 million image-text pairs from the internet and pre-trains on them using the InfoNCE loss. CLIP then performs zero-shot image classification by evaluating similarities between test samples and category names. The same pre-training method is adopted in ALIGN[2], which collects 1.8 billion noisy image-text pairs and also obtains favourable results. The success of ALIGN verifies that multimodal models can learn good vision-language representations by enlarging the training data, even when the data contain massive noise. ALIP[3] assumes that images collected from the internet are noisy and generates a caption for each of them with a Large Language Model (LLM), after which a dual-path model is pre-trained on the generated captions and the raw texts. ZeroVL[23] proposes a comprehensive pipeline, i.e., debiased data sampling and coin-flipping mixup, to train the model with a limited training set of 14 million samples. PyramidCLIP[24] achieves fine-grained semantic alignment through cross-level and peer-level contrastive learning.

Pre-training with Generative Objectives. Pre-training with generative objectives is the other major method for training VLMs: it masks parts of the raw data and regenerates the masked content from context, thereby learning semantic correlations among modalities. KELIP[25] splits an image into patches, randomly masks some of them, and recovers the masked patches from image context, a strategy also used in MAE[26]. SegCLIP[27] proposes a reconstruction loss and a superpixel-based KL loss to enhance the model’s visual representation, achieving open-vocabulary image segmentation in an annotation-free manner. FIBER[28] integrates contrastive, generative, and alignment losses into a deep multimodal fusion method for coarse-to-fine pre-training of the VLM. FLAVA[29] masks 40% of image patches and 15% of text tokens and then predicts them with an MLP, better capturing the correlation between vision and language. The above methods pre-train the VLM by recovering parts of images or texts; other models can even recover the full image or image description from image-text pairs. Diffusion models[30, 31, 32] use text as the prompt to generate an image consistent with the text. COCA[33] adopts an encoder-decoder architecture where the input is the image and the output is the corresponding caption. LLaVA[34] uses the pre-trained CLIP image encoder to obtain image features and converts them into text tokens through a trainable projection layer, which remarkably promotes the model’s multimodal understanding ability. Llama[16] employs a lightweight projection layer to align the input image with text and performs inter-modal fusion using a cross-attention mechanism.

2.2 Parameter-Efficient Fine-Tuning

Over the past few years, fine-tuning models pre-trained on large-scale datasets to adapt them to downstream tasks has dominated the deep learning paradigm. However, this approach has significant disadvantages[8]. Firstly, for large-scale models, full fine-tuning is challenging, time-consuming, and unsustainable. Secondly, fine-tuning large models on downstream tasks can cause catastrophic forgetting. To tackle these issues, researchers proposed Parameter-Efficient Fine-Tuning (PEFT)[8, 35]. PEFT freezes all the parameters of the backbone and adapts the model to different downstream tasks by fine-tuning the parameters of lightweight modules attached to it. PEFT methods fall roughly into two types: prompt tuning and adapter tuning.

Prompt tuning. Prompt tuning adapts the model to downstream tasks by adding learnable tokens, as learnable prompts, to the input or hidden layers of the model. Inspired by prompt learning in NLP, VPT[36] applies prompt tuning to vision tasks for the first time: it fine-tunes the model by adding prompt tokens in the input space and outperforms most full fine-tuning methods on multiple tasks. CoOp[12] employs learnable token vectors instead of manually designed prompts as input for the text encoder, achieving commendable performance on few-shot image classification tasks. Building on CoOp, CoCoOp[13] designs a lightweight neural network to generate prompts for each image, known as conditional prompt learning.

Adapter tuning. In adapter tuning, the model is equipped with additional learnable layers (e.g., MLP, Transformer[37]) to adapt to downstream tasks. CLIP-Adapter[14] uses a lightweight bottleneck layer (two linear layers following the last layer of the vision encoder and the text encoder, respectively) to learn new features and fuses them with the original pre-trained features via a residual connection. Tip-Adapter[15] constructs the adapter from key-value pairs collected from the few-shot training set, called a cache model. The linear layers of CLIP-Adapter are replaced by the cache model, making Tip-Adapter training-free while outperforming other few-shot classification methods. Building on Tip-Adapter, Tip-Adapter-F[15] dynamically updates the keys of the cache model using stochastic gradient descent, further enhancing performance and achieving SOTA results.

3 Method

In this section, we elaborate on the proposed methods. Firstly, we briefly review the zero-shot image classification of CLIP. Then, we describe IDEA and T-IDEA in detail, respectively. Last, we introduce the process of generating image descriptions.

3.1 Revisiting zero-shot image classification of CLIP

The CLIP model is trained on a large-scale image-text pair dataset by contrastive learning. It mines the semantic association between image-text pairs and gives the model high generalization ability, achieving SOTA results on several downstream vision tasks. CLIP adopts a zero-shot classification strategy in which the test image is matched against the textual information of the category names, and the best-matching category is taken as the classification result. This allows CLIP to perform open-vocabulary classification without re-training.

Specifically, given a test image $\mathbf{I}_{\text{test}}$, we feed it into the vision encoder of CLIP to obtain the corresponding visual feature $i_{\text{test}} \in \mathbb{R}^{D\times 1}$, where $D$ is the dimension of the visual feature. Eq. (1) describes this process.

$i_{\text{test}} = \text{VisionEncoder}(\mathbf{I}_{\text{test}})$   (1)

Then, let $N$ be the number of categories and $S_{\text{label}}$ be the set of category names. A manually designed prompt template (e.g., "a photo of a {object}") is used to generate a textual prompt for each category. The textual prompts are then fed into CLIP's text encoder to obtain the corresponding features $\mathbf{T}_{\text{class}} \in \mathbb{R}^{N\times D}$, as shown in Eq. (2).

$\mathbf{T}_{\text{class}} = \text{TextEncoder}(\text{Template}(S_{\text{label}}))$   (2)

Finally, we obtain the output $\text{logits} \in \mathbb{R}^{N\times 1}$ for classification, as denoted in Eq. (3).

$\text{logits} = \underbrace{\mathbf{T}_{\text{class}} \cdot i_{\text{test}}}_{\text{zero-shot knowledge}}$   (3)

where $\cdot$ denotes matrix multiplication, and both $\mathbf{T}_{\text{class}}$ and $i_{\text{test}}$ are normalized along the feature dimension. The classification result of CLIP is the index of the maximum value of the logits. For convenience, we refer to $\mathbf{T}_{\text{class}} \cdot i_{\text{test}}$ in Eq. (3) as the zero-shot knowledge.
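To make Eqs. (1)-(3) concrete, the snippet below sketches zero-shot classification with the open-source OpenAI `clip` package; the backbone name, image path, and category list are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal sketch of CLIP zero-shot classification (Eqs. 1-3),
# assuming the open-source `clip` package and a local test image.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)  # ResNet-50 visual encoder

class_names = ["golden retriever", "tabby cat", "sports car"]  # placeholder S_label
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("test.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    i_test = model.encode_image(image)      # Eq. (1): visual feature, 1 x D
    T_class = model.encode_text(prompts)    # Eq. (2): class text features, N x D

# normalize along the feature dimension so dot products are cosine similarities
i_test = i_test / i_test.norm(dim=-1, keepdim=True)
T_class = T_class / T_class.norm(dim=-1, keepdim=True)

logits = T_class @ i_test.T                 # Eq. (3): zero-shot knowledge, N x 1
print(class_names[logits.argmax().item()])
```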

3.2 Image Description Enhanced CLIP-Adapter

[Figure 2]

Based on the zero-shot classification with CLIP, we propose a novel adapter called Image Description Enhanced CLIP-Adapter (IDEA), in which we explore the few-shot knowledge from the image-text pairs to strengthen CLIP.

Firstly, we construct a $K$-shot $N$-class training set that contains both visual information and textual descriptions of the images. Then, we freeze the parameters of CLIP's visual and textual encoders for PEFT. Eq. (4) shows that the images in the training set are fed into the visual encoder to obtain the visual features $\mathbf{I}_{\text{train}} \in \mathbb{R}^{NK\times D}$, and the texts in the training set are fed into the textual encoder to obtain the textual features $\mathbf{T}_{\text{train}} \in \mathbb{R}^{NK\times D}$.

$\mathbf{I}_{\text{train}} = \text{VisionEncoder}(\mathsf{Image}), \quad \mathbf{T}_{\text{train}} = \text{TextEncoder}(\mathsf{Text})$   (4)
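As an illustration of Eq. (4), the few-shot cache can be encoded once with the frozen CLIP encoders. In the sketch below, `train_images` (a list of NK preprocessed image tensors, ordered class by class) and `train_texts` (the NK corresponding descriptions) are hypothetical placeholders.

```python
# Sketch: build the cached feature matrices I_train and T_train (Eq. 4)
# from a K-shot, N-class training set, using the frozen CLIP encoders.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

# train_images: list of NK preprocessed image tensors (3 x 224 x 224),
# train_texts:  list of NK generated descriptions -- placeholder names.
with torch.no_grad():
    I_train = model.encode_image(torch.stack(train_images).to(device))   # NK x D
    T_train = model.encode_text(
        clip.tokenize(train_texts, truncate=True).to(device))            # NK x D

# L2-normalize so the dot products in Eq. (5) become cosine similarities
I_train = I_train / I_train.norm(dim=-1, keepdim=True)
T_train = T_train / T_train.norm(dim=-1, keepdim=True)
```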

Subsequently, we compute the multimodal similarities, as shown in Eq. (5).

$Sim_I = \mathbf{I}_{\text{train}} \cdot i_{\text{test}}, \quad Sim_T = \mathbf{T}_{\text{train}} \cdot i_{\text{test}}$   (5)

where $Sim_I \in \mathbb{R}^{NK\times 1}$ is the similarity between the test image and the images in the training set, and $Sim_T \in \mathbb{R}^{NK\times 1}$ is the similarity between the test image and the textual descriptions in the training set.

IDEA computes the similarity between the test image and K𝐾Kitalic_K samples of each category in the training set, which is referred to as few-shot knowledge and facilitates the mining of fine-grained semantic correlations between images and texts. Previous studies[1, 12] indicate that incorporating textual information into visual models can effectively enhance their logical reasoning capabilities. Thus, we utilize both visual and textual information in the training set to promote recognition ability.

Finally, by combining zero-shot knowledge and few-shot knowledge, we obtain the output $\text{logits} \in \mathbb{R}^{N\times 1}$, as shown in Eq. (6):

$\text{logits} = \beta\,\underbrace{g\{f[(1-\alpha)\,Sim_I + \alpha\,Sim_T]\}}_{\text{Few-Shot Knowledge}} + \underbrace{\mathbf{T}_{\text{class}} \cdot i_{\text{test}}}_{\text{Zero-Shot Knowledge}}$   (6)

where $\alpha \in [0,1]$ is a hyperparameter that balances the similarities of the visual and textual modalities, and $\beta \in (0,\infty)$ is a hyperparameter that trades off the few-shot knowledge against the zero-shot knowledge. The activation function $f(x) = \exp(\theta(x-1))$ maps the similarity values to the interval $[0,1]$. $\theta \in (0,\infty)$ controls the sharpness of the activation function, dynamically stretching and compressing the similarity values to better fuse the few-shot knowledge into the zero-shot knowledge. Given the multimodal similarity among samples $\mathbf{X} \in \mathbb{R}^{NK\times 1}$, we define a function $g(\mathbf{X}) = \sum_{K}\text{reshape}(\mathbf{X}, N, K)$ to aggregate the per-sample similarities into the few-shot knowledge: $g(\mathbf{X})$ reshapes $\mathbf{X}$ into a matrix with $N$ rows and $K$ columns and then sums along the column dimension, aggregating instance-level similarity into class-level similarity. Algorithm 1 shows the process of IDEA.

```python
# Algorithm 1: the training-free IDEA classifier in PyTorch-like code.
# i_test:  D,      visual feature of the test image
# I_train: NK x D, visual features of the training images
# T_train: NK x D, textual features of the training descriptions
# T_class: N x D,  textual features of the class prompts
# alpha, beta, theta: hyperparameters
import torch

# compute vision similarity, NK
sim_I = I_train @ i_test
# compute text similarity, NK
sim_T = T_train @ i_test
# compute multimodal similarity, NK
sim = (1 - alpha) * sim_I + alpha * sim_T
# apply the activation function f(x) = exp(theta * (x - 1))
sim = torch.exp(theta * (sim - 1))
# aggregate instance-level similarity into class-level few-shot knowledge, g(X)
few_shot_knowledge = torch.sum(sim.reshape(N, K), dim=1)
# compute zero-shot knowledge, N
zero_shot_knowledge = T_class @ i_test
# compute logits, N
logits = beta * few_shot_knowledge + zero_shot_knowledge
```

The advantages of IDEA are summarized as follows. Firstly, IDEA utilizes the textual descriptions of the images as a supplement to visual information, which improves CLIP's few-shot image classification performance. Secondly, IDEA combines zero-shot and few-shot knowledge to capture fine-grained semantic correlations of image-text pairs, which enhances the fusion of multimodal data. Finally, IDEA is a training-free method for CLIP that is comparable to, or even outperforms, the SOTA models.

3.3 Trainable Image Description Enhanced CLIP-Adapter

IDEA does not require stochastic gradient descent (SGD) to train the model and exhibits strong recognition performance in few-shot classification tasks. Even so, we believe that the performance of IDEA can be further improved. Therefore, we propose a Trainable Image Description Enhanced CLIP-Adapter (T-IDEA) method.

On the one hand, we believe there is a native intermodal semantic gap between visual and textual information when computing the term $\mathbf{T}_{\text{train}} \cdot i_{\text{test}}$ in Eq. (5). To overcome this problem, we design a lightweight projection layer $\mathbf{W}_{\text{proj}} \in \mathbb{R}^{D\times D}$ for intermodal semantic alignment and utilize a residual connection for modal fusion.

On the other hand, for few-shot image classification tasks, the selected $K$ samples per class cannot fully cover the distribution of all samples in the training set, which introduces a semantic bias between the $K$ samples and the full data. Therefore, we design a trainable semantic latent space $\mathbf{E}_{\text{bias}} \in \mathbb{R}^{NK\times D}$ to correct this bias in the semantic space of the training set.

Therefore, the logits of T-IDEA are defined in Eq. (7).

$\text{logits} = \beta\, g\{f[(1-\alpha)\,\mathbf{I}_{\text{train}} \cdot i_{\text{test}} + \alpha\,\underbrace{(\mathbf{T}_{\text{train}} \cdot \mathbf{W}_{\text{proj}} \cdot i_{\text{test}} + \mathbf{T}_{\text{train}} \cdot i_{\text{test}})}_{\text{intermodal semantic alignment}} + \underbrace{\mathbf{E}_{\text{bias}} \cdot i_{\text{test}}}_{\text{semantic bias correction}}]\} + \mathbf{T}_{\text{class}} \cdot i_{\text{test}}$   (7)

where $\mathbf{W}_{\text{proj}} \in \mathbb{R}^{D\times D}$ and $\mathbf{E}_{\text{bias}} \in \mathbb{R}^{NK\times D}$ are lightweight learnable components.
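For illustration, the sketch below packages Eq. (7) as a small PyTorch module in which only $\mathbf{W}_{\text{proj}}$ and $\mathbf{E}_{\text{bias}}$ are trainable; the initialization choices and the batched layout are our assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class TIDEAHead(nn.Module):
    """Sketch of the T-IDEA scoring head of Eq. (7).

    I_train, T_train (NK x D) and T_class (N x D) are precomputed, frozen,
    L2-normalized CLIP features ordered class by class; only W_proj and
    E_bias are trainable.
    """
    def __init__(self, I_train, T_train, T_class, N, K,
                 alpha=0.5, beta=2.75, theta=2.0):
        super().__init__()
        D = I_train.shape[1]
        self.register_buffer("I_train", I_train)
        self.register_buffer("T_train", T_train)
        self.register_buffer("T_class", T_class)
        self.W_proj = nn.Parameter(torch.eye(D))       # intermodal alignment (identity init is an assumption)
        self.E_bias = nn.Parameter(T_train.clone())    # learnable latent space (init from T_train is an assumption)
        self.N, self.K = N, K
        self.alpha, self.beta, self.theta = alpha, beta, theta

    def forward(self, i_test):                         # i_test: B x D, normalized
        sim_I = i_test @ self.I_train.t()              # B x NK
        sim_T = i_test @ (self.T_train @ self.W_proj).t() + i_test @ self.T_train.t()
        sim_E = i_test @ self.E_bias.t()               # semantic bias correction
        sim = (1 - self.alpha) * sim_I + self.alpha * sim_T + sim_E
        sim = torch.exp(self.theta * (sim - 1))        # activation f(x)
        few_shot = sim.view(-1, self.N, self.K).sum(dim=2)   # g(X): class-level
        zero_shot = i_test @ self.T_class.t()          # B x N
        return self.beta * few_shot + zero_shot
```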

3.4 Image Description Generation

[Figure 3]

To our knowledge, existing visual datasets generally lack corresponding image descriptions, and labelling these datasets is laborious. Therefore, we employ Llama[16], a multimodal large-scale model, to generate a textual description for each image. Fig. 3 illustrates the pipeline for generating image descriptions. Firstly, we customize the textual prompt for each image dataset to guide description generation. Then, we clean the raw output to reduce task-irrelevant noise (e.g., escape characters, special symbols, and markdown formatting). Finally, we utilize the BART[38] model to summarize the descriptions and compress them to fewer than 77 tokens, the maximum input length of the CLIP text encoder.
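A sketch of the cleaning and summarization stages is given below; the specific checkpoint (`facebook/bart-large-cnn`), the cleaning rules, and the length limits are illustrative assumptions rather than the exact configuration used to build IMD-11.

```python
# Sketch of the post-processing stages of the description pipeline:
# clean the raw LLM output, then summarize with BART so the result
# fits CLIP's 77-token text limit. Model name and rules are assumptions.
import re
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def clean_description(text: str) -> str:
    text = text.replace("\n", " ").replace("\t", " ")   # escape characters
    text = re.sub(r"[*_#>`\[\]]", "", text)             # markdown formatting
    text = re.sub(r"\s+", " ", text).strip()            # collapse whitespace
    return text

def compress_description(text: str) -> str:
    # BART length limits are in BART tokens, used here as a rough proxy for
    # CLIP tokens; a final check with clip.tokenize would be stricter.
    summary = summarizer(text, max_length=60, min_length=20, do_sample=False)
    return summary[0]["summary_text"]

raw = "The image shows **a Persian cat** with long white fur, blue eyes, and small ears, sitting on a sofa."
print(compress_description(clean_description(raw)))
```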

[Figure 4]

Fig.4 shows our method for designing prompts as well as some examples of generated texts. For common vision datasets (e.g., ImageNet[39], Caltech101[40]), we design generalized prompts to describe the image content. We first prompt the model for the category name of the image. Then we instruct the model to describe the image’s content from the following aspects: shape, color, number of objects, texture, location, and details. For fine-grained image datasets (e.g. Food101[41], Oxford Pets[42]), we customize prompts to generate domain-specific image descriptions. In particular, for the Oxford Pets dataset, we prompt the model for the subclass of pets. Then, we ask the model to generate image descriptions about the pet’s hair, color, eyes, shape, ears, paws, pose, and position. Fig.4 shows that the generated image descriptions are basically accurate and consistent with the image content.

While research on multimodal learning is becoming increasingly popular, large-scale image-text pair data remain precious and much needed. We supplement 11 popular image datasets (e.g., ImageNet[39], Caltech101[40], and Oxford Pets[42]) by generating a textual description for each image, producing 1,637,795 image-text pairs in total. We name the dataset IMD-11 and publish it on the Internet for public research.

4 Experiment

In this section, we first describe the basic settings of the experiments and the baseline models for the comparison experiments. Next, we quantitatively and qualitatively analyze the results of the comparison experiments on 11 public datasets. Finally, we perform several ablation experiments on IDEA and T-IDEA.

4.1 Experiment Settings

We select 11 popular computer vision datasets for the comparison experiments, including 2 common image classification datasets (ImageNet[39] and Caltech101[40]) and 9 fine-grained image classification datasets (Food101[41], FGVCAircraft[43], StanfordCars[44], UCF101[45], Flowers102[46], SUN397[47], EuroSAT[48], DTD[49], and OxfordPets[42]). All models are trained on the training set under the 1-, 2-, 4-, 8-, and 16-shot settings. For a fair comparison, the partition criteria of the training set, validation set, and test set are the same as in CoOp[12], CLIP-Adapter[14], and Tip-Adapter[15].

At the data pre-processing stage, we first randomly crop and scale the images to $224\times 224$. Then we randomly flip and normalize the image tensors. For the T-IDEA method, we train the model for 50 epochs with a batch size of 256 and employ stochastic gradient descent (SGD) with a learning rate of $5\times 10^{-4}$ to fine-tune the model.
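These settings roughly correspond to the following configuration sketch; the normalization statistics are CLIP's published values, while `tidea_head` (the module holding the learnable components) is a hypothetical name carried over from the earlier sketch.

```python
# Sketch of the data augmentation and optimizer settings described above.
import torch
from torchvision import transforms

# CLIP's published normalization statistics
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),      # random crop and scale to 224 x 224
    transforms.RandomHorizontalFlip(),      # random flip
    transforms.ToTensor(),
    transforms.Normalize(CLIP_MEAN, CLIP_STD),
])

# Only the lightweight T-IDEA components are optimized; CLIP itself stays frozen.
optimizer = torch.optim.SGD(tidea_head.parameters(), lr=5e-4)
num_epochs, batch_size = 50, 256
```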

All experiments are conducted on a server with an AMD EPYC 7642 processor, 4 NVIDIA® GeForce RTX4090 graphics cards, 256GB memory, 6TB Solid State Drive (SSD), 8TB Hard Disk Drive (HDD), and the Ubuntu 22.04.3 LTS operating system.

We compare IDEA and T-IDEA with five baseline models, i.e., Zero-shot CLIP[1], CoOp[12], CLIP-Adapter[14], Tip-Adapter[15], and Tip-Adapter-F[15]. All comparison numbers are the best results published in the original papers. For fairness, in the comparison experiments our method uses ResNet-50[50] as the visual encoder and Transformer[37] as the textual encoder, the same configuration as the five baseline models.

4.2 Performance Comparison and Analysis

In this section, we conduct experiments to compare IDEA and T-IDEA with 5 baseline models on 11 publicly available image datasets.

[Figure 5]

Fig. 5(a) shows the average performance of each model on the 11 image datasets. IDEA outperforms the CoOp model, which requires additional training, under the 1-, 2-, 4-, and 8-shot settings. Compared to Tip-Adapter, which is also training-free, IDEA is better by 0.63%, 0.12%, 0.59%, 0.39%, and 0.5% under the 1-, 2-, 4-, 8-, and 16-shot settings, respectively. This reveals that fusing multimodal data (visual and textual features) from the training set helps improve the model's performance. In addition, T-IDEA performs better than IDEA, and its advantage grows as the number of shots increases. This implies that designing additional trainable components to better fit new features in the dataset is crucial. It is worth noting that T-IDEA, equipped with two learnable components, outperforms Tip-Adapter-F by 0.86%, 0.99%, 0.82%, 1.03%, and 0.65% under the 1-, 2-, 4-, 8-, and 16-shot settings, achieving SOTA performance.

Fig. 5(b) and (c) indicate that both IDEA and T-IDEA achieve good performance on the common datasets. Notably, on Caltech101 under the 8-shot setting, IDEA improves by 0.47% over Tip-Adapter, and T-IDEA outperforms the SOTA model, Tip-Adapter-F, by 1.26%. Fig. 5(d-l) shows that IDEA and T-IDEA achieve SOTA performance on most fine-grained image classification datasets. For example, on the OxfordPets and Food101 datasets, IDEA under the 1-shot and 2-shot settings delivers performance comparable to that of the SOTA model, even though IDEA requires no extra training. This confirms the advantage and superiority of IDEA, especially when category samples are limited. Meanwhile, T-IDEA achieves SOTA performance on most fine-grained image datasets; for example, on FGVCAircraft it outperforms Tip-Adapter-F by 2.97% under the 16-shot setting, a significant boost.

In addition, we notice that our method does not perform well on some domain-specific fine-grained datasets. In Fig. 5, we observe that, on the EuroSAT dataset, T-IDEA improves less over the SOTA method under the 8- and 16-shot settings. Given that EuroSAT is a remote sensing image classification dataset with a relatively small image size of $64\times 64$, it is difficult to describe the image content in textual language due to the low resolution and abstract content. We infer that this may be an important reason for the limited improvement of our method on this dataset.

4.3 Ablation Studies

In this section, we perform several ablation studies of IDEA and T-IDEA on the ImageNet dataset, under the 16-shot training setting, to validate the effectiveness of each component.

4.3.1 Ablation Study of Hyperparameters

Table 1: Ablation study of hyperparameters (IDEA, ImageNet, 16-shot).

| $\alpha$ | 0 | 0.2 | 0.4 | 0.5 | 0.8 | 1 |
|---|---|---|---|---|---|---|
| Accuracy (%) | 59.68 | 61.36 | 62.32 | 62.58 | 62.11 | 61.63 |

| $\beta$ | 0 | 1 | 2 | 2.5 | 2.75 | 3 |
|---|---|---|---|---|---|---|
| Accuracy (%) | 60.34 | 61.61 | 62.28 | 62.44 | 62.58 | 62.49 |

| $\theta$ | 0.5 | 1 | 1.5 | 2 | 3 | 3.5 |
|---|---|---|---|---|---|---|
| Accuracy (%) | 62.05 | 62.41 | 62.49 | 62.58 | 62.43 | 62.34 |

The hyperparameter $\alpha$ balances the visual similarity and the image-text similarity, as shown in Eq. (5). We set $\beta=2.75$ and $\theta=2$ and vary $\alpha$ from 0 to 1. When $\alpha=0$, only the visual similarity $Sim_I$ is used; when $\alpha=1$, only the image-text similarity $Sim_T$ is used. Table 1 shows that neither $Sim_I$ nor $Sim_T$ alone achieves optimal performance. The method performs best when $\alpha=0.5$, which indicates that visual and textual information are equally important.

The hyperparameter $\beta$ trades off zero-shot knowledge against few-shot knowledge, as shown in Eq. (6). A larger $\beta$ indicates that more few-shot knowledge is incorporated. We set $\alpha=0.5$, $\theta=2$ and vary $\beta$ from 0 to 3. $\beta=0$ means the few-shot knowledge is omitted, which is equivalent to zero-shot CLIP; $\beta=1$ means zero-shot and few-shot knowledge are equally weighted. Table 1 shows that IDEA achieves the best performance when $\beta=2.75$, suggesting that the few-shot knowledge carries greater weight and plays an important role in the classification results. IDEA improves by 2.24% over pure zero-shot CLIP.

The hyperparameter $\theta$ controls the sharpness of the activation function $f(x)=\exp(\theta(x-1))$. As $\theta$ increases, training samples close to the test sample are weighted increasingly more than distant ones, which improves the model's ability to capture fine-grained image features. We set $\alpha=0.5$, $\beta=2.75$ and vary $\theta$ from 0.5 to 3.5. Table 1 shows that IDEA achieves the best performance when $\theta=3$.
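Because IDEA is training-free, sweeps over $\alpha$, $\beta$, and $\theta$ only require re-scoring cached features. The sketch below illustrates such a search; `I_train`, `T_train`, `T_class`, the validation features `I_val`, the labels `y_val`, and the shapes `N`, `K` are placeholders for precomputed, normalized quantities.

```python
# Sketch: grid search over alpha, beta, theta for training-free IDEA,
# reusing cached features so no encoder forward passes are repeated.
# I_train, T_train: NK x D; T_class: N x D; I_val: M x D; y_val: M (placeholders).
import itertools
import torch

def idea_logits(I_val, I_train, T_train, T_class, N, K, alpha, beta, theta):
    sim = (1 - alpha) * (I_val @ I_train.t()) + alpha * (I_val @ T_train.t())
    sim = torch.exp(theta * (sim - 1))                 # activation f(x)
    few_shot = sim.view(-1, N, K).sum(dim=2)           # g(X): class-level
    return beta * few_shot + I_val @ T_class.t()       # add zero-shot knowledge

best = (0.0, None)
for alpha, beta, theta in itertools.product(
        [0.0, 0.2, 0.4, 0.5, 0.8, 1.0],     # alpha grid from Table 1
        [0.0, 1.0, 2.0, 2.5, 2.75, 3.0],    # beta grid
        [0.5, 1.0, 1.5, 2.0, 3.0, 3.5]):    # theta grid
    logits = idea_logits(I_val, I_train, T_train, T_class, N, K, alpha, beta, theta)
    acc = (logits.argmax(dim=1) == y_val).float().mean().item()
    if acc > best[0]:
        best = (acc, (alpha, beta, theta))
print(best)
```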

4.3.2 Ablation Study of Trainable Components

In this subsection, we perform ablation experiments on the two learnable components added to T-IDEA (the projection layer $\mathbf{W}_{\text{proj}}$ and the semantic latent space $\mathbf{E}_{\text{bias}}$). They are enabled and disabled separately, giving four experimental settings in total.

Table 2: Ablation study of the trainable components (ImageNet, 16-shot).

| Projector $\mathbf{W}_{\text{proj}}$ | Semantic Latent Space $\mathbf{E}_{\text{bias}}$ | Accuracy (%) |
|---|---|---|
| ✗ | ✗ | 62.58 |
| ✗ | ✓ | 64.35 |
| ✓ | ✗ | 63.28 |
| ✓ | ✓ | 66.05 |

Table 2 shows that removing the projection layer $\mathbf{W}_{\text{proj}}$ from T-IDEA decreases performance by 1.7%, suggesting that $\mathbf{W}_{\text{proj}}$ can effectively reduce the semantic gap between the visual and textual modalities and achieve intermodal semantic alignment. Removing the semantic latent space $\mathbf{E}_{\text{bias}}$ decreases performance by 2.77%, indicating that $\mathbf{E}_{\text{bias}}$ reduces the semantic bias and improves the model. Overall, compared with IDEA, which has no trainable components, T-IDEA improves classification accuracy by 3.47%, demonstrating that combining the two components significantly improves the model's performance.

4.3.3 Ablation Study of Vision Backbones

Table 3: Ablation study of vision backbones (ImageNet, 16-shot, accuracy %).

| Model | ResNet-50 | ResNet-101 | ViT-B/32 | ViT-B/16 |
|---|---|---|---|---|
| Zero-Shot CLIP[1] | 60.33 | 62.53 | 63.80 | 68.73 |
| CoOp[12] | 62.95 | 66.60 | 66.85 | 71.92 |
| CLIP-Adapter[14] | 63.59 | 65.39 | 66.19 | 71.13 |
| Tip-Adapter[15] | 62.03 | 64.78 | 65.61 | 70.75 |
| Tip-Adapter-F[15] | 65.51 | 68.56 | 68.65 | 73.69 |
| IDEA | 62.58 | 65.51 | 65.93 | 71.07 |
| T-IDEA | 66.05 | 68.96 | 69.42 | 74.54 |

To verify the scalability of the proposed methods, we conduct further ablation experiments with various backbone networks. Specifically, we equip Zero-Shot CLIP[1], CoOp[12], CLIP-Adapter[14], Tip-Adapter[15], Tip-Adapter-F[15], IDEA, and T-IDEA with ResNet-50[50], ResNet-101[50], ViT-B/32[51], and ViT-B/16[51] as the vision encoder in turn. From Table 3, we observe that under every backbone setting there is a significant performance improvement over the zero-shot CLIP model, which uses only zero-shot knowledge. The performance of IDEA and T-IDEA also improves as the parameter size of the backbone increases. Furthermore, under every backbone setting, T-IDEA achieves SOTA performance. This indicates that our method adapts to various backbone networks and thus demonstrates strong generalization ability.
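Since only the choice of pretrained CLIP weights changes across these experiments, the backbone sweep can be sketched with the open-source `clip` package as follows; loading on CPU and probing with a dummy input are choices made purely for illustration.

```python
# Sketch: check the feature dimension of each backbone before rebuilding
# the IDEA / T-IDEA caches; the model names are those published by OpenAI.
import torch
import clip

for backbone in ["RN50", "RN101", "ViT-B/32", "ViT-B/16"]:
    model, preprocess = clip.load(backbone, device="cpu")
    with torch.no_grad():
        D = model.encode_image(torch.zeros(1, 3, 224, 224)).shape[-1]
    print(f"{backbone}: feature dimension D = {D}")
    # Re-encode the few-shot cache and the test set with this backbone,
    # then apply the same IDEA / T-IDEA scoring as in the sketches above.
```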

5 Conclusion and Future Work

Vision and language can semantically complement each other to enhance the ability of humans to perceive the world. Different from previous PEFT methods, we introduce a multimodal adapter to mine the multimodal information in image-text pairs, and it is fully adapted for few-shot image classification tasks. The training-free IDEA method has even outperformed the approaches that necessitate additional training steps. T-IDEA extends the IDEA method by integrating a learnable semantic alignment component and a semantic latent space component, achieving SOTA performance on 11 datasets. In addition, we design a comprehensive pipeline to generate 1.6 million image-text pairs and we publish our dataset online.

Although the performance of our methods is excellent, optimizing the text prompts could yield further enhancements. Exploring synthetic data to train models presents an intriguing area for future research. Some researchers have successfully utilized generated data from LLMs and achieved positive results[16, 34]. In the future, we plan to investigate Chain of Thought (CoT)[52] to generate higher-quality data from LLMs. In addition, the maximum length of input tokens in CLIP is limited to 77, constraining the amount of textual information. In the future, we will endeavor to apply IDEA and T-IDEA to Long-CLIP[53].

References

  • [1]A.Radford, J.W. Kim, C.Hallacy, A.Ramesh, G.Goh, S.Agarwal, G.Sastry, A.Askell, P.Mishkin, J.Clark, G.Krueger, I.Sutskever, Learning transferable visual models from natural language supervision, in: M.Meila, T.Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning, Vol. 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 8748–8763.
  • [2]C.Jia, Y.Yang, Y.Xia, Y.-T. Chen, Z.Parekh, H.Pham, Q.Le, Y.-H. Sung, Z.Li, T.Duerig, Scaling up visual and vision-language representation learning with noisy text supervision, in: M.Meila, T.Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning, Vol. 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 4904–4916.
  • [3]K.Yang, J.Deng, X.An, J.Li, Z.Feng, J.Guo, J.Yang, T.Liu, Alip: Adaptive language-image pre-training with synthetic caption, 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (2023) 2910–2919.
  • [4]W.Zhao, G.Yang, R.Zhang, C.Jiang, C.Yang, Y.Yan, A.Hussain, K.Huang, Open-pose 3d zero-shot learning: Benchmark and challenges, Neural Networks 181 (2025) 106775.doi:https://doi.org/10.1016/j.neunet.2024.106775.
  • [5]Z.Ye, G.Yang, X.Jin, Y.Liu, K.Huang, Rebalanced zero-shot learning, IEEE Transactions on Image Processing 32 (2023) 4185–4198.doi:10.1109/TIP.2023.3295738.
  • [6]M.V. Conde, K.Turgutlu, Clip-art: Contrastive pre-training for fine-grained art classification, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2021, pp. 3956–3960.
  • [7]Z.Huang, F.Bianchi, M.Yuksekgonul, T.J. Montine, J.Zou, A visual–language foundation model for pathology image analysis using medical twitter, Nature Medicine (2023) 1–10.
  • [8]N.Ding, Y.Qin, G.Yang, F.Wei, Z.Yang, Y.Su, S.Hu, Y.Chen, C.-M. Chan, W.Chen, J.Yi, W.Zhao, X.Wang, Z.Liu, H.-T. Zheng, J.Chen, Y.Liu, J.Tang, J.Li, M.Sun, Parameter-efficient fine-tuning of large-scale pre-trained language models, Nature Machine Intelligence 5(3) (2023) 220–235.doi:10.1038/s42256-023-00626-4.
  • [9]H.Chen, L.Li, F.Hu, F.Lyu, L.Zhao, K.Huang, W.Feng, Z.Xia, Multi-semantic hypergraph neural network for effective few-shot learning, Pattern Recognition 142 (2023) 109677.doi:https://doi.org/10.1016/j.patcog.2023.109677.
  • [10]H.Chen, L.Li, Z.Xia, F.Lyu, L.Zhao, K.Huang, W.Feng, F.Hu, Harnessing multi-semantic hypergraph forfew-shot learning, in: S.Yu, Z.Zhang, P.C. Yuen, J.Han, T.Tan, Y.Guo, J.Lai, J.Zhang (Eds.), Pattern Recognition and Computer Vision, Springer International Publishing, Cham, 2022, pp. 232–244.
  • [11]Q.Qiao, Y.Xie, Z.Zeng, F.Li, Talds-net: Task-aware adaptive local descriptors selection for few-shot image classification, in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 3750–3754.doi:10.1109/ICASSP48485.2024.10448167.
  • [12]K.Zhou, J.Yang, C.C. Loy, Z.Liu, Learning to prompt for vision-language models, International Journal of Computer Vision (IJCV) (2022).
  • [13]K.Zhou, J.Yang, C.C. Loy, Z.Liu, Conditional prompt learning for vision-language models, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • [14]P.Gao, S.Geng, R.Zhang, T.Ma, R.Fang, Y.Zhang, H.Li, Y.Qiao, Clip-adapter: Better vision-language models with feature adapters, International Journal of Computer Vision 132(2) (2024) 581–595.doi:10.1007/s11263-023-01891-x.
  • [15]R.Zhang, W.Zhang, R.Fang, P.Gao, K.Li, J.Dai, Y.Qiao, H.Li, Tip-adapter: Training-free adaption of clip for few-shot classification, in: Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, Springer-Verlag, Berlin, Heidelberg, 2022, p. 493–510.doi:10.1007/978-3-031-19833-5_29.
  • [16]H.Touvron, T.Lavril, G.Izacard, X.Martinet, M.-A. Lachaux, T.Lacroix, B.Rozière, N.Goyal, E.Hambro, F.Azhar, A.Rodriguez, A.Joulin, E.Grave, G.Lample, Llama: Open and efficient foundation language models, ArXiv abs/2302.13971 (2023).
  • [17]J.Zhang, J.Huang, S.Jin, S.Lu, Vision-language models for vision tasks: A survey, IEEE Trans. Pattern Anal. Mach. Intell. 46(8) (2024) 5625–5644.doi:10.1109/TPAMI.2024.3369699.
  • [18]W.Guo, J.Wang, S.Wang, Deep multimodal representation learning: A survey, IEEE Access 7 (2019) 63373–63394.doi:10.1109/ACCESS.2019.2916887.
  • [19]Y.Zhu, Y.Wu, N.Sebe, Y.Yan, Vision + x: A survey on multimodal learning in the light of data, IEEE Transactions on Pattern Analysis and Machine Intelligence 46(12) (2024) 9102–9122.doi:10.1109/TPAMI.2024.3420239.
  • [20]F.Liu, D.Chen, Z.Guan, X.Zhou, J.Zhu, Q.Ye, L.Fu, J.Zhou, Remoteclip: A vision language foundation model for remote sensing, IEEE Transactions on Geoscience and Remote Sensing 62 (2024) 1–16.doi:10.1109/TGRS.2024.3390838.
  • [21]M.Christensen, M.Vukadinovic, N.Yuan, D.Ouyang, Vision–language foundation model for echocardiogram interpretation, Nature Medicine 30(5) (2024) 1481–1488.doi:10.1038/s41591-024-02959-y.
  • [22]A.Vaswani, N.Shazeer, N.Parmar, J.Uszkoreit, L.Jones, A.N. Gomez, L.Kaiser, I.Polosukhin, Attention is all you need, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Curran Associates Inc., Red Hook, NY, USA, 2017, p. 6000–6010.
  • [23]Q.Cui, B.Zhou, Y.Guo, W.Yin, H.Wu, O.Yoshie, Y.Chen, Contrastive vision-language pre-training withlimited resources, in: S.Avidan, G.Brostow, M.Cissé, G.M. Farinella, T.Hassner (Eds.), Computer Vision – ECCV 2022, Springer Nature Switzerland, Cham, 2022, pp. 236–253.
  • [24]Y.Gao, J.Liu, Z.Xu, J.Zhang, K.Li, C.Shen, Pyramidclip: hierarchical feature alignment for vision-language model pretraining, in: Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Curran Associates Inc., Red Hook, NY, USA, 2024.
  • [25]B.Ko, G.Gu, Large-scale bilingual language-image contrastive learning, ArXiv abs/2203.14463 (2022).
  • [26]K.He, X.Chen, S.Xie, Y.Li, P.Doll’ar, R.B. Girshick, Masked autoencoders are scalable vision learners, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021) 15979–15988.
  • [27]S.Zhang, B.Zhang, Y.Wu, H.Zhou, J.Jiang, J.Ma, Segclip: Multimodal visual-language and prompt learning for high-resolution remote sensing semantic segmentation, IEEE Transactions on Geoscience and Remote Sensing 62 (2024) 1–16.doi:10.1109/TGRS.2024.3487576.
  • [28]Z.-Y. Dou, A.Kamath, Z.Gan, P.Zhang, J.Wang, L.Li, Z.Liu, C.Liu, Y.LeCun, N.Peng, J.Gao, L.Wang, Coarse-to-fine vision-language pre-training with fusion in the backbone, in: Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Curran Associates Inc., Red Hook, NY, USA, 2024.
  • [29]A.Singh, R.Hu, V.Goswami, G.Couairon, W.Galuba, M.Rohrbach, D.Kiela, Flava: A foundational language and vision alignment model, in: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 15617–15629.doi:10.1109/CVPR52688.2022.01519.
  • [30]J.Sohl-Dickstein, E.A. Weiss, N.Maheswaranathan, S.Ganguli, Deep unsupervised learning using nonequilibrium thermodynamics, in: Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, JMLR.org, 2015, p. 2256–2265.
  • [31]J.Ho, A.Jain, P.Abbeel, Denoising diffusion probabilistic models, in: Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Curran Associates Inc., Red Hook, NY, USA, 2020.
  • [32]A.Q. Nichol, P.Dhariwal, Improved denoising diffusion probabilistic models, in: M.Meila, T.Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning, Vol. 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 8162–8171.
  • [33]J.Yu, Z.Wang, V.Vasudevan, L.Yeung, M.Seyedhosseini, Y.Wu, Coca: Contrastive captioners are image-text foundation models, Trans. Mach. Learn. Res. 2022 (2022).
  • [34]H.Liu, C.Li, Q.Wu, Y.J. Lee, Visual instruction tuning, in: Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Curran Associates Inc., Red Hook, NY, USA, 2024.
  • [35]F.Jin, J.Zhang, C.Zong, Parameter-efficient tuning for large language model without calculating its gradients, in: H.Bouamor, J.Pino, K.Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 321–330.doi:10.18653/v1/2023.emnlp-main.22.
  • [36]M.Jia, L.Tang, B.-C. Chen, C.Cardie, S.Belongie, B.Hariharan, S.-N. Lim, Visual prompt tuning, in: European Conference on Computer Vision (ECCV), 2022.
  • [37]A.Vaswani, N.Shazeer, N.Parmar, J.Uszkoreit, L.Jones, A.N. Gomez, L.Kaiser, I.Polosukhin, Attention is all you need, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Curran Associates Inc., Red Hook, NY, USA, 2017, p. 6000–6010.
  • [38]M.Lewis, Y.Liu, N.Goyal, M.Ghazvininejad, A.Mohamed, O.Levy, V.Stoyanov, L.Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: D.Jurafsky, J.Chai, N.Schluter, J.Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7871–7880.doi:10.18653/v1/2020.acl-main.703.
  • [39]J.Deng, W.Dong, R.Socher, L.-J. Li, K.Li, L.Fei-Fei, Imagenet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.doi:10.1109/CVPR.2009.5206848.
  • [40]L.Fei-Fei, R.Fergus, P.Perona, Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories, in: 2004 Conference on Computer Vision and Pattern Recognition Workshop, 2004, pp. 178–178.doi:10.1109/CVPR.2004.383.
  • [41]L.Bossard, M.Guillaumin, L.VanGool, Food-101 – mining discriminative components with random forests, in: D.Fleet, T.Pajdla, B.Schiele, T.Tuytelaars (Eds.), Computer Vision – ECCV 2014, Springer International Publishing, Cham, 2014, pp. 446–461.
  • [42]O.M. Parkhi, A.Vedaldi, A.Zisserman, C.V. Jawahar, Cats and dogs, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3498–3505.doi:10.1109/CVPR.2012.6248092.
  • [43]S.Maji, E.Rahtu, J.Kannala, M.B. Blaschko, A.Vedaldi, Fine-grained visual classification of aircraft, ArXiv abs/1306.5151 (2013).
  • [44]J.Krause, M.Stark, J.Deng, L.Fei-Fei, 3d object representations for fine-grained categorization, in: 2013 IEEE International Conference on Computer Vision Workshops, 2013, pp. 554–561.doi:10.1109/ICCVW.2013.77.
  • [45]K.Soomro, A.Zamir, M.Shah, Ucf101: A dataset of 101 human actions classes from videos in the wild, ArXiv abs/1212.0402 (2012).
  • [46]M.-E. Nilsback, A.Zisserman, Automated flower classification over a large number of classes, in: 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, 2008, pp. 722–729.doi:10.1109/ICVGIP.2008.47.
  • [47]J.Xiao, J.Hays, K.A. Ehinger, A.Oliva, A.Torralba, Sun database: Large-scale scene recognition from abbey to zoo, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 3485–3492.doi:10.1109/CVPR.2010.5539970.
  • [48]P.Helber, B.Bischke, A.Dengel, D.Borth, Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (2019).
  • [49]M.Cimpoi, S.Maji, I.Kokkinos, S.Mohamed, A.Vedaldi, Describing textures in the wild, in: 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3606–3613.doi:10.1109/CVPR.2014.461.
  • [50]K.He, X.Zhang, S.Ren, J.Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.doi:10.1109/CVPR.2016.90.
  • [51]A.Dosovitskiy, L.Beyer, A.Kolesnikov, D.Weissenborn, X.Zhai, T.Unterthiner, M.Dehghani, M.Minderer, G.Heigold, S.Gelly, J.Uszkoreit, N.Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, in: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.
  • [52]J.Wei, X.Wang, D.Schuurmans, M.Bosma, B.Ichter, F.Xia, E.H. Chi, Q.V. Le, D.Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Curran Associates Inc., Red Hook, NY, USA, 2024.
  • [53]B.Zhang, P.Zhang, X.Dong, Y.Zang, J.Wang, Long-clip: Unlocking the long-text capability of clip, in: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LI, Springer-Verlag, Berlin, Heidelberg, 2024, p. 310–325.doi:10.1007/978-3-031-72983-6_18.