
Task Routing in Multi-Task Learning

Early PhD Blog

Over the past two years my research interests have revolved around Multi-Task Learning (MTL) as a learning paradigm. It is a vast field of diverse research, spanning domains of computer science from NLP and Signal Processing to Computer Vision and Multimedia. In what follows I will motivate, describe and discuss an approach to MTL we developed, called Task Routing.

Multi-Task and Many-Task Learning

By definition, multi-task learning is a learning paradigm that seeks to improve the generalization performance of machine learning models by optimizing for more than one task simultaneously. Its counterpart, Single Task Learning (STL), occurs when a model is optimized for performing a single task only.

One key question that arises is: how many tasks can be efficiently performed this way?

We know that performing multiple tasks at the same time leads to MTL. But does our model's behaviour remain the same when the number of tasks drastically increases? As the number of tasks scales up, so does the complexity of the problem, both in resource requirements and in functional logic.

To distinguish performing a very large number of tasks from classic MTL problems, we define Many-Task Learning (MaTL) as a subtype of MTL describing approaches that perform more than 20 tasks simultaneously. In our work we run MaTL experiments optimizing for 20, 40, 100 and up to 312 tasks in a single model with conditioned feature-wise transformations.
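For readers unfamiliar with the term, the sketch below shows a generic task-conditioned feature-wise transformation in PyTorch. The class and parameter names are my own and purely illustrative: each task indexes its own per-channel scale and shift, which modulate a shared feature map. Task Routing, described next, can be read as the special case where the scale is a fixed binary mask.

```python
import torch
import torch.nn as nn


class TaskConditionedFiLM(nn.Module):
    """Illustrative task-conditioned feature-wise transformation.

    Every task owns a learned per-channel scale (gamma) and shift (beta)
    that modulate the shared feature map, so a single backbone can be
    steered towards whichever task is currently being optimized.
    """

    def __init__(self, num_channels, num_tasks):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_tasks, num_channels))
        self.beta = nn.Parameter(torch.zeros(num_tasks, num_channels))

    def forward(self, x, task_id):
        # x: (batch, channels, height, width); broadcast per-channel params.
        g = self.gamma[task_id].view(1, -1, 1, 1)
        b = self.beta[task_id].view(1, -1, 1, 1)
        return x * g + b
```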

Many Task Learning with Task Routing

As with any combinatorial problem, in MTL there exists an optimal combination of tasks and shared resources, and that combination is unknown. Searching the space to find it becomes increasingly inefficient as modern models grow in depth, complexity and capacity.

Inspired by the efficiency of Random Search, we enforce a structured random solution to this problem by regulating the per-task data flow in our models. By assigning each unit to a subset of tasks that can use it, we create specialized sub-networks for each task. In addition, we show that providing tasks with alternate routes through the parameter space increases feature robustness and improves scalability, while boosting or maintaining predictive performance.
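A minimal sketch of what such a routing unit could look like in PyTorch is given below (names and details are illustrative rather than a reference implementation): each task is assigned a fixed, randomly sampled binary mask over the layer's channels, and during a forward pass only the channels assigned to the currently active task are kept, carving a task-specific sub-network out of the shared backbone.

```python
import torch
import torch.nn as nn


class TaskRoutingLayer(nn.Module):
    """Illustrative task-routing unit with fixed per-task channel masks."""

    def __init__(self, num_channels, num_tasks, sigma=0.5, seed=0):
        super().__init__()
        # sigma is the sharing ratio: the fraction of channels each task keeps.
        generator = torch.Generator().manual_seed(seed)
        masks = (torch.rand(num_tasks, num_channels, generator=generator) < sigma).float()
        # The masks are sampled once and never trained, hence a buffer.
        self.register_buffer("masks", masks)
        self.active_task = 0

    def forward(self, x):
        # x: (batch, channels, height, width); broadcast the mask over space.
        mask = self.masks[self.active_task].view(1, -1, 1, 1)
        return x * mask
```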

Key Contributions

  • We present a scalable MTL technique for exploiting cross-task expertise transferability without requiring prior domain knowledge
  • We enable structured deterministic sampling of multiple sub-architectures within a single MTL model (see the usage sketch after this list)
  • We forge task relationships in an intuitive non-parametric manner during training without requiring prior domain knowledge or statistical analysis
  • We apply our method to 5 classification and retrieval datasets demonstrating its effectiveness and performance gains over strong baselines and state-of-the-art approaches
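As a hypothetical usage example building on the TaskRoutingLayer sketch above, switching the active task deterministically selects a different sub-architecture of the same shared model:

```python
import torch

layer = TaskRoutingLayer(num_channels=64, num_tasks=40, sigma=0.5, seed=42)
x = torch.randn(8, 64, 32, 32)

for task_id in range(40):
    layer.active_task = task_id   # deterministically picks that task's mask
    y = layer(x)                  # only channels routed to task_id survive
```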