GIFT: Generative Interpretable Fine-Tuning Transformers

1North Carolina State University, 2Independent Researcher

Flowers, birds, and here are our GIFTs. Our proposed GIFT is a deep parameter-residual learning method that uses a generator network for parameter-efficient fine-tuning. In training, given the pretrained weights of a layer \(\omega\in \mathbb{R}^{d_{out}\times d_{in}}\) in the backbone, the weights fine-tuned by our GIFT are \(\omega^+=\omega + \text{GIFT}(\omega)\), together with a learned clustering \(\mathcal{C}_{d_{out},M}\) of the parameters \(\omega\), where \(M\) is a predefined small number (e.g., 96). In testing, \(\mathcal{C}_{d_{out},M}\) plays the role of the classifier of a semantic segmentation head. For a test input \(x\) (e.g., a flower image in the VGG Flowers dataset, or a bird in the Caltech-UCSD Birds dataset), its output at the given layer is \(f(x; \omega^+)\in \mathbb{R}^{h \times w \times d_{out}}\), whose "\(M\)-cluster segmentation results" are simply \(f(x; \omega^+) \cdot \mathcal{C}_{d_{out},M} \in \mathbb{R}^{h \times w \times M}\). We observe that semantically meaningful clusters are formed.
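As a minimal PyTorch-style sketch of the update above (the two-layer MLP is only a stand-in for the actual GIFT generator, and all shapes and values are illustrative):

import torch
import torch.nn as nn

d_in = d_out = 768                      # e.g., the hidden size of ViT-B/16
omega = torch.randn(d_out, d_in)        # pretrained (frozen) weights of the chosen layer

# stand-in generator mapping the d_out rows of omega to a residual of the same shape
gift = nn.Sequential(nn.Linear(d_in, 64), nn.GELU(), nn.Linear(64, d_in))

omega_plus = omega + gift(omega)        # fine-tuned weights: omega^+ = omega + GIFT(omega)
print(omega_plus.shape)                 # torch.Size([768, 768])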

Abstract

We present GIFT (Generative Interpretable Fine-tuning Transformers) for fine-tuning pretrained (often large) Transformer models on downstream tasks in a parameter-efficient way with built-in interpretability. Our GIFT is a deep parameter-residual learning method, which addresses two problems in fine-tuning a pretrained Transformer model: where to apply parameter-efficient fine-tuning (PEFT) so that it is extremely lightweight yet sufficiently expressive, and how to learn the PEFT to better exploit the knowledge of the pretrained model in a direct way. For the former, we select the final projection (linear) layer in the multi-head self-attention of a Transformer model, and verify its effectiveness. For the latter, in contrast to the prior art that directly introduces new model parameters (often in a low-rank approximation form) to be learned in fine-tuning with downstream data, we propose a method for learning to generate the fine-tuning parameters. Our GIFT is a hyper-Transformer that takes as input the pretrained parameters of the projection layer and generates its fine-tuning parameters using a proposed Parameter-to-Cluster Attention (PaCa). The PaCa results in a simple clustering-based forward explainer that plays the role of semantic segmentation in testing. In experiments, our proposed GIFT is tested on the VTAB benchmark and the fine-grained visual classification (FGVC) benchmark. It obtains significantly better performance than the prior art.

How does GIFT work?


Detailed workflow of our proposed \(l\)-layer GIFT for fine-tuning a pretrained and frozen \(L\)-layer Transformer backbone on a downstream task. For simplicity, we assume the pretrained Transformer is isotropic and thus \(d_{out}=d_{in}\); the subscripts only indicate the input/output roles. The task head is trained from scratch for the downstream task.
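A rough sketch of this setup, assuming a torchvision ViT-B/16 backbone (the backbone choice, module names, and hyperparameters below are illustrative, not the released training code):

import torch
import torch.nn as nn
import torchvision

# Pretrained backbone is frozen; only the shared GIFT (a linear stand-in here)
# and the task head are trained on the downstream task.
backbone = torchvision.models.vit_b_16(
    weights=torchvision.models.ViT_B_16_Weights.IMAGENET1K_V1
)
backbone.heads = nn.Identity()          # expose the 768-d class-token feature
for p in backbone.parameters():         # the backbone stays frozen
    p.requires_grad_(False)

gift = nn.Linear(768, 768)              # stand-in for the single GIFT shared by the selected layers
head = nn.Linear(768, 102)              # task head trained from scratch (e.g., 102 classes for VGG Flowers)

optimizer = torch.optim.AdamW([*gift.parameters(), *head.parameters()], lr=1e-3)
num_trainable = sum(p.numel() for p in [*gift.parameters(), *head.parameters()])
print(f"trainable parameters: {num_trainable}")   # only GIFT and the head are updated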


Illustration of our proposed GIFT in comparison with LoRA and TOAST. Both LoRA and our GIFT work at the layer level, in the sense that they can be placed to update any selected layers, while TOAST is holistic, learning a value modulation based on the output of the entire frozen backbone. LoRA and our GIFT are more efficient in inference than TOAST, since the parameter updates can be merged into the backbone after training. Compared with LoRA, our GIFT is conceptually different: LoRA learns the low-rank parameters \(A\) and \(B\) by directly treating them as model parameters to optimize (and might entail searching for the rank \(r\)), while our GIFT takes the parameters of the pretrained backbone as input and outputs the full parameter matrix, without resorting to a search over low ranks. To address the quadratic complexity of vanilla Transformers and for efficiency, inspired by the recently proposed Patch-to-Cluster Attention (PaCa) in PaCa-ViTs, we develop a Parameter-to-Cluster Attention (also termed PaCa). The GIFT is shared across different layers of the pretrained backbone in training.
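Below is a hedged, single-head sketch of a Parameter-to-Cluster Attention over parameter tokens: treating the \(d_{out}\) rows of \(\omega\) as tokens, each row attends to \(M\) learned cluster tokens instead of to all other rows, so the cost is linear rather than quadratic in \(d_{out}\). The layer names, the single attention head, and the omission of normalization and MLP blocks are our simplifications, not the released GIFT architecture:

import torch
import torch.nn as nn

class ParameterToClusterAttention(nn.Module):
    # Sketch of Parameter-to-Cluster Attention (PaCa) over parameter tokens.
    def __init__(self, dim: int, num_clusters: int = 96):
        super().__init__()
        self.to_cluster = nn.Linear(dim, num_clusters)   # learned soft cluster assignment
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, omega: torch.Tensor) -> torch.Tensor:
        x = omega.unsqueeze(0)                               # (1, d_out, d_in): rows as tokens
        assign = self.to_cluster(x).softmax(dim=1)           # (1, d_out, M): soft assignment over rows
        clusters = assign.transpose(1, 2) @ x                # (1, M, d_in): cluster tokens
        attn = (self.q(x) @ self.k(clusters).transpose(1, 2)) / x.shape[-1] ** 0.5
        out = attn.softmax(dim=-1) @ self.v(clusters)        # (1, d_out, d_in): linear in d_out
        return self.proj(out).squeeze(0)                     # generated residual, same shape as omega

paca = ParameterToClusterAttention(dim=768, num_clusters=96)
omega = torch.randn(768, 768)                                # pretrained projection-layer weights
omega_plus = omega + paca(omega)                             # can be merged into the backbone after training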

Interpretability of GIFT

By visualizing the individual clusters from the \(M\)-cluster segmentation maps \(m_{out}\), formed from the output \(f(x; \omega^+)\) of a given layer as \(m_{out} = f(x; \omega^+) \cdot \mathcal{C}_{d_{out},M} \in \mathbb{R}^{h \times w \times M}\), we observe that meaningful clusters consistently emerge after GIFT training on various datasets. A minimal sketch of this readout is given below, followed by some examples:
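In the sketch below, feats stands in for the layer output \(f(x; \omega^+)\) and C for the learned clustering matrix \(\mathcal{C}_{d_{out},M}\); both are random placeholders here, and the 14x14 spatial size is simply the ViT-B/16 token grid at 224x224 input:

import torch

h, w, d_out, M = 14, 14, 768, 96
feats = torch.randn(h, w, d_out)        # stands in for the layer output f(x; omega^+)
C = torch.randn(d_out, M)               # stands in for the learned clustering C_{d_out, M}

seg = feats @ C                         # (h, w, M): the M-cluster segmentation maps m_out
cluster_id = seg.argmax(dim=-1)         # hard cluster label per spatial location, for visualization
mask = (cluster_id == 5)                # binary mask of one (arbitrarily chosen) cluster
print(seg.shape, mask.float().mean())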


From top to bottom: on Flowers, GIFT learns to form clusters on the full flower (Layer 2), the petals (Layer 5), and the stamen (Layers 11 and 12). On NABirds, clusters are formed on various parts of the bird, such as the full body (Layer 12), head (Layer 10), wing+tail (Layer 9), torso (Layer 6), and even the background (Layer 2). For Cars, GIFT learns to form clusters on the door (Layer 8), wheel (Layer 4), windshield (Layer 9), and even a composition of these parts (Layer 11). For ImageNet, clusters on the foreground (Layer 2), players (Layer 7), arms+hands (Layer 8, Cluster 67), ball (Layer 12), and players+ball (Layer 8, Cluster 28) can be seen.

NABirds


For example, for the class Reddish Egret (Dark Morph) (2nd row), clusters corresponding to the distinctive neck, beak, and legs can be seen. Notably, for Prothonotary Warbler (3rd row), a cluster representing a twig of a tree can also be seen. We hypothesize that this is because almost all the images of this class in the dataset show the bird sitting on a twig. Although this is an undesirable behavior, it showcases the interpretable nature of GIFT. We leave avoiding this behavior to future work.

VGG Flowers


Clusters can focus on very specific parts of the flower, like the tips of the petals (Row 3, Column 6; Row 4, Columns 4 and 6), or on the whole flower.

Stanford Cars


The PaCa module in GIFT can form various clusters focusing on different parts of the car. For example, clusters encoding the wheels (Column 3), windshield (Row 1, Column 5), bonnet (Row 3, Column 2), and headlight (Row 3, Column 5) can be found.

Stanford Dogs


The PaCa module in GIFT forms clusters over various parts of the dog's anatomy, many of which are unique to a class. For example, in Rows 3 and 4, clusters on the ears, eyes, snout, and face can be found. In Row 4, clusters on the head, the torso, and a combination of both are formed.

ImageNet


Even on a large and diverse dataset like ImageNet, the PaCa module can form clusters over relevant parts of the image. The four images (taken from the official validation set) show different image characteristics, but the clusters still focus on the relevant aspects of each image.

BibTeX


@misc{savadikar2023gift,
  title={GIFT: Generative Interpretable Fine-Tuning Transformers}, 
  author={Chinmay Savadikar and Xi Song and Tianfu Wu},
  year={2023},
  eprint={2312.00700},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}