Over the past two years my research interests have revolved around Multi-Task Learning (MTL) as a learning paradigm. It is a vast field of diverse research in all domains of computer science, from NLP and Signal Processing to Computer Vision and Multimedia. In what follows I will motivate, describe and discuss an approach to MTL we developed called Task Routing.
Multi-Task and Many-Task Learning
By definition (Caruana, 1997), multi-task learning is a learning paradigm that seeks to improve the generalization performance of machine learning models by optimizing for more than one task simultaneously. Its counterpart, Single Task Learning (STL), occurs when a model is optimized for performing a single task only. An interesting property of STL models is that they usually have an abundance of parameters, and with it the capacity to fit more than one task. In MTL the aim is to make use of this extra capacity and improve the model’s generalization properties by leveraging the domain-specific information contained in the training signals of related tasks. As such, in MTL the goal is to jointly learn multiple tasks and improve the learning process for each of them. For a detailed overview of MTL, check out Sebastian Ruder’s blog, where you can find detailed explanations of the different ways to perform MTL.
One key question that arises is: how many tasks can be efficiently performed this way?
We know that performing multiple tasks at the same time leads to MTL. But does our model’s behaviour remain the same when the number of tasks drastically increases? As the number of tasks scales up, so does the complexity of the problem, both in resource requirements and in functional logic. To distinguish performing a very large number of tasks from classic MTL problems, we define Many-Task Learning (MaTL) as a subtype of MTL describing approaches that perform more than 20 tasks simultaneously. In our work we perform MaTL experiments, optimizing for 20, 40, 100 and up to 312 tasks in a single model with conditioned feature-wise transformations. In what follows we describe our approach and its contributions.
Many Task Learning with Task Routing
As with any combinatorial problem, in MTL there exists an optimal combination of tasks and shared resources, and it is unknown. Searching the space to find this combination becomes increasingly inefficient as modern models grow in depth, complexity and capacity, since the duration of the search grows proportionally with the number of tasks and parameters in the model. In our work, inspired by the efficiency of Random Search, we enforce a structured random solution to this problem by regulating the per-task data-flow in our models. As depicted in the figure below, by assigning each unit to a subset of tasks that can use it, we create specialized sub-networks for each task. In addition, we show that providing tasks with alternate routes through the parameter space increases feature robustness and improves scalability while boosting or maintaining predictive performance.
The figure below illustrates what a three-task convolutional neural network would look like with our approach. Each color indicates a task, so units colored with red, green and yellow are optimized for every task, while single-colored ones are optimized for a single task only.
Most MTL methods involve task-specific and shared units as part of the MTL training procedure. Our method enables the units within the model’s convolutional layers to have a consistent shared or task-specific role in both training and testing regimes. Figure 1 provides the intuition behind how the individual units are utilized throughout the set of tasks performed by the model. We achieve this behavior by applying a channel-wise task-specific binary mask over the convolutional activations, restricting the input to the following layer to contain only activations assigned to the task. Figure 2 illustrates the masking process over the activations.
Because the flow of activations does not follow its conventional route, i.e. it has been rerouted to an alternate one, we have named our method Task Routing (TR) and its corresponding layer the Task Routing Layer (TRL). How the TRL is trained and how the masks are applied can be seen in Algorithms 1 and 2 below. By applying the TRL to the network we are able to reuse units between tasks and scale up the number of tasks that can be performed with a single model.
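To make the routing mechanism concrete, here is a minimal sketch of such a layer in PyTorch. This is not the authors’ reference implementation: the class and method names are illustrative, and the pre-built `masks` tensor of shape `(num_tasks, num_channels)` is assumed to come from a mask generator like the one sketched after the next paragraph.

```python
import torch
import torch.nn as nn

class TaskRoutingLayer(nn.Module):
    """Masks convolutional activations channel-wise with a fixed,
    task-specific binary mask (illustrative sketch)."""

    def __init__(self, masks):
        super().__init__()
        # masks: (num_tasks, num_channels) binary tensor, created once
        # at model instantiation and kept constant during training.
        self.register_buffer("masks", masks.float())
        self.active_task = 0

    def set_active_task(self, task_idx):
        # Hypothetical helper for switching routes between forward passes.
        self.active_task = task_idx

    def forward(self, x):
        # x: (batch, channels, height, width) activations of a conv layer.
        # Broadcasting the mask over the spatial dimensions zeroes out
        # every channel not assigned to the active task.
        mask = self.masks[self.active_task].view(1, -1, 1, 1)
        return x * mask
```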
The masks that enable task routing are generated randomly at the moment the model is instantiated and kept constant throughout the training process. These masks are created using a sharing ratio hyper-parameter σ defined beforehand. The sharing ratio defines how many units are task-specific and how many are shared between tasks; its complement determines how many of the units are nullified for a given task. As such, the sharing ratio enables us to explore the complete space of sharing possibilities with a simple adjustment of one hyper-parameter. A sharing ratio of 0 indicates that no sharing occurs within the network and each trainable unit is specific to a single task only, resulting in distinct networks per task. On the opposite side of the spectrum, a sharing ratio of 1 makes every unit shared between all of the tasks, resulting in a classical fully shared MTL architecture.
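A mask generator along the following lines could realize the sharing ratio. The helper name and the exact way shared and task-specific channels are drawn are assumptions for illustration; the sketch only aims to preserve the behaviour at the two extremes described above (σ = 0 gives disjoint per-task sub-networks, σ = 1 gives a fully shared model).

```python
import torch

def make_task_masks(num_channels, num_tasks, sharing_ratio):
    """Build one binary channel mask per task (illustrative sketch)."""
    masks = torch.zeros(num_tasks, num_channels)
    perm = torch.randperm(num_channels)

    # A fraction `sharing_ratio` of the channels is shared by all tasks.
    num_shared = int(sharing_ratio * num_channels)
    shared = perm[:num_shared]
    masks[:, shared] = 1.0

    # The remaining channels are distributed round-robin as
    # task-specific units; they stay nullified for every other task.
    specific = perm[num_shared:]
    for i, channel in enumerate(specific):
        masks[i % num_tasks, channel] = 1.0
    return masks
```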
In the figure below you can see how a mask is applied to the output of a convolutional layer and the form of the masked output (the input to the next layer). The green highlighted channels (units) survived the masking and are propagated forward.
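Putting the two sketches together, a hypothetical forward pass through one routed convolutional block could look like this; the layer sizes, the sharing ratio of 0.5 and the per-task loop are purely illustrative.

```python
import torch
import torch.nn as nn

# Illustrative three-task block: conv -> task routing -> conv.
num_tasks, channels = 3, 64
masks = make_task_masks(channels, num_tasks, sharing_ratio=0.5)

conv1 = nn.Conv2d(3, channels, kernel_size=3, padding=1)
route = TaskRoutingLayer(masks)
conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

x = torch.randn(8, 3, 32, 32)  # a batch of RGB images
for task in range(num_tasks):
    route.set_active_task(task)    # switch to this task's sub-network
    out = conv2(route(conv1(x)))   # only surviving channels feed the next layer
```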
In this work we report the following contributions:
- We present a scalable MTL technique for exploiting cross-task expertise transferability without requiring prior domain knowledge.
- We enable structured deterministic sampling of multiple sub-architectures within a single MTL model.
- We forge task relationships in an intuitive nonparametric manner during training without requiring prior domain knowledge or statistical analysis.
- We apply our method to 5 classification and retrieval datasets, demonstrating its effectiveness and performance gains over strong baselines and state-of-the-art approaches.