Motivation
Can we train a single transformer model capable of processing multiple modalities and datasets, whilst sharing almost all of its learnable parameters?
Despite recent advances across different domains and tasks, current state-of-the-art methods train a separate model with different model parameters for each task at hand.
Co-training PolyViT on multiple modalities and tasks leads to a model that is even more parameter-efficient, and learns representations that generalize across multiple domains.
No need to tune hyperparameters for each combination of datasets.
Related Works
Computer Vision with Multi-task Learning
- In computer vision, multiple papers have developed models which predict multiple outputs (for example semantic segmentation and surface normals)
- Numerous works have also observed that although multi-task models are more versatile, their accuracies are lower than those of single-task models.
- This accuracy deficit increases with the number of tasks, or when unrelated tasks are performed simultaneously.
- Jointly training to perform multiple tasks simultaneously has typically required careful calibration of the individual tasks, to ensure that none of the task-specific losses dominates another.
Extra Task-Specific Parameters
Previous works have improved performance on additional tasks only by introducing extra task-specific parameters which are typically conditioned on the input.
For high-capacity transformer models, co-training on multiple datasets simultaneously helps to regularize the model on a dataset that it would otherwise overfit on, thus achieving accuracy improvements from co-training.
Perceiver (modality-agnostic Transformer)
- Instead of tokenizing the data, it operates directly on the raw input by projecting it into a smaller, latent set of tokens using cross-attention (see the sketch after these bullets).
- The model trains a separate network with separate parameters for each task (it does not consider co-training scenarios as this paper does).
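A minimal PyTorch sketch (an illustrative reconstruction, not code from the Perceiver paper) of the idea of cross-attending a small set of learned latent tokens to a long raw input array; the class and argument names here are assumptions.

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    """Perceiver-style input projection: a small set of learned latent tokens
    cross-attends to a (much longer) array of raw input features."""

    def __init__(self, num_latents=64, dim=512, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, raw_inputs):
        # raw_inputs: (batch, num_raw_elements, dim), e.g. flattened pixels after a linear projection
        batch = raw_inputs.shape[0]
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        latents, _ = self.cross_attn(queries, raw_inputs, raw_inputs)
        return latents  # (batch, num_latents, dim): size is independent of the input length
```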
UniT
- A transformer-based model spanning vision and NLP modalities with an encoder-decoder architecture; the encoder is based on DETR and the decoder on BERT.
- Does not share the entire transformer backbone across different tasks.
Contribution
As background knowledge, it helps to be familiar with the ViT, ViViT, and AST models.
PolyViT architecture
Note that PolyViT with L layers acts like an L-layer ViT when processing images, an L-layer AST when processing audio, and an L-layer unfactorized ViViT when processing video.
And whilst it is capable of handling multiple modalities, it performs one task from one modality in a given forward pass.
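As a hedged sketch of this design (not the authors' implementation), the snippet below shows how a single shared encoder could be paired with modality-specific tokenizers and lightweight task-specific heads; all module and argument names are assumptions for illustration.

```python
import torch.nn as nn

class PolyViTSketch(nn.Module):
    """Shared L-layer transformer with per-modality tokenizers and per-task heads.
    A simplified illustration, not the official PolyViT implementation."""

    def __init__(self, tokenizers, task_num_classes, dim=768, num_layers=12, num_heads=12):
        super().__init__()
        # Modality-specific tokenizers map raw images / video / audio spectrograms to token sequences.
        self.tokenizers = nn.ModuleDict(tokenizers)  # e.g. {"image": ..., "video": ..., "audio": ...}
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        # One shared backbone: the same L layers process every modality (ViT / ViViT / AST behaviour).
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Lightweight task-specific classification heads.
        self.heads = nn.ModuleDict({t: nn.Linear(dim, n) for t, n in task_num_classes.items()})

    def forward(self, x, modality, task):
        tokens = self.encoder(self.tokenizers[modality](x))  # (batch, num_tokens, dim)
        pooled = tokens.mean(dim=1)                          # simple pooling instead of a CLS token
        return self.heads[task](pooled)                      # one task from one modality per forward pass
```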
Co-training procedure
We optimize all PolyViT model parameters. In all cases, we construct our training minibatches using examples from a single task.
This design choice allows us to evaluate gradients and perform a parameter update using the same training hyperparameters as the corresponding single-task baseline.
During co-training, for each SGD step, we sample a task (dataset), then sample a minibatch from that task, evaluate a gradient and then perform a parameter update.
An important consideration is the order in which we sample tasks and whether we accumulate gradients over different minibatches and tasks.
Let j denote the task index, U_j the number of SGD steps for the single-task baseline of task j, and U = Σ_j U_j the total number of SGD steps over all tasks.
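To make the per-step procedure concrete, here is a minimal sketch of the co-training loop, assuming a pre-built task schedule of length U (the schedules are described next) and per-task loss functions; all names are illustrative.

```python
def cotrain(model, tasks, optimizer, schedule):
    """tasks: dict task_name -> (batch_iterator, loss_fn, modality)
    schedule: sequence of task names of length U (total SGD steps over all tasks)."""
    for task_name in schedule:                       # one task (dataset) per SGD step
        batch_iter, loss_fn, modality = tasks[task_name]
        inputs, labels = next(batch_iter)            # minibatch drawn from a single task
        optimizer.zero_grad()
        logits = model(inputs, modality=modality, task=task_name)
        loss_fn(logits, labels).backward()           # gradient from this task only
        optimizer.step()                             # one parameter update per step
```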
Task-by-Task
Perform all U_1 updates for the first task, then all U_2 updates for the next, and so on; the order of the tasks is chosen randomly.
Alternating
A deterministic schedule that alternates between tasks in a fixed, repeating order.
Uniform task sampling
A stochastic version of alternating: tasks are sampled with equal probability, without considering the number of steps U_j of each task.
Weighted task sampling
In this stochastic schedule, we sample each task with a weight proportional to the number of training steps U_j in its single-task baseline.
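The schedules above differ only in how the length-U sequence of task indices is produced. The helper below is an illustrative reconstruction of the four options, assuming `steps_per_task` maps each task to its single-task step count U_j; it is not the authors' code.

```python
import random

def build_schedule(steps_per_task, kind="weighted"):
    """steps_per_task: dict task -> U_j (single-task baseline SGD steps).
    Returns a list of task names of length U = sum(U_j)."""
    tasks = list(steps_per_task)
    total = sum(steps_per_task.values())
    if kind == "task_by_task":
        order = random.sample(tasks, len(tasks))                # random task order
        return [t for t in order for _ in range(steps_per_task[t])]
    if kind == "alternating":
        return [tasks[i % len(tasks)] for i in range(total)]    # fixed repeating order
    if kind == "uniform":
        return [random.choice(tasks) for _ in range(total)]     # ignores U_j
    if kind == "weighted":
        weights = [steps_per_task[t] / total for t in tasks]    # proportional to U_j
        return random.choices(tasks, weights=weights, k=total)
    raise ValueError(kind)
```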
Accumulating gradient update
For T tasks, perform a forward/backward pass on a minibatch for each task → sum the gradients over all tasks → perform a single parameter update with the accumulated gradient.
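A minimal sketch of this accumulating variant, assuming one minibatch per task per update; the names mirror the loop above, and the gradient summation relies on the fact that repeated backward calls accumulate into the parameter gradients.

```python
def accumulated_step(model, tasks, optimizer):
    """tasks: dict task_name -> (batch_iterator, loss_fn, modality); T = len(tasks)."""
    optimizer.zero_grad()
    for task_name, (batch_iter, loss_fn, modality) in tasks.items():
        inputs, labels = next(batch_iter)                  # one minibatch per task
        logits = model(inputs, modality=modality, task=task_name)
        loss_fn(logits, labels).backward()                 # gradients sum across the T tasks
    optimizer.step()                                       # single update with the accumulated gradient
```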
Experiments
- Tasks: 5 image classification, 2 video classification, and 2 audio classification datasets (9 in total).
- The weighted task sampling method consistently achieves the highest accuracy, on 8 out of 9 tasks.
Conclusion
Omitted.