METransformer: Radiology Report Generation by Transformer with Multiple Learnable Expert Tokens

Publication Name

Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition


In clinical scenarios, multi-specialist consultation can significantly benefit diagnosis, especially for intricate cases. This inspires us to explore a 'multi-expert joint diagnosis' mechanism to upgrade the 'single expert' framework commonly seen in the current literature. To this end, we propose METransformer, a method that realizes this idea with a transformer-based backbone. The key design of our method is the introduction of multiple learnable 'expert' tokens into both the transformer encoder and decoder. In the encoder, each expert token interacts with both the vision tokens and the other expert tokens, learning to attend to different image regions for image representation. These expert tokens are encouraged to capture complementary information by an orthogonal loss that minimizes their overlap. In the decoder, each attended expert token guides the cross-attention between input words and visual tokens, thus influencing the generated report. A metrics-based expert voting strategy is further developed to generate the final report. Through this multi-expert design, our model enjoys the merits of an ensemble-based approach while being computationally more efficient and supporting more sophisticated interactions among experts. Experimental results demonstrate the promising performance of our proposed model on two widely used benchmarks. Last but not least, because the innovation is at the framework level, our work can readily incorporate advances in existing 'single-expert' models to further improve its performance.
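
As a rough illustration of two ideas named in the abstract, the sketch below shows one way an orthogonality penalty over expert embeddings and a metrics-based vote over candidate reports could be written in plain Python. It is not the authors' implementation: the function names (`orthogonality_penalty`, `token_f1`, `vote`), the squared-cosine form of the penalty, and the word-overlap F1 used as the voting metric are all assumptions made for this example.

```python
import math


def orthogonality_penalty(experts):
    """Sum of squared cosine similarities over ordered pairs of distinct experts.

    Assumed form of an 'orthogonal loss': it is 0.0 when all expert vectors
    are mutually orthogonal and grows as they point in similar directions.
    """
    # L2-normalise each vector so the penalty depends on direction only.
    unit = []
    for vec in experts:
        norm = math.sqrt(sum(x * x for x in vec))
        unit.append([x / norm for x in vec])

    penalty = 0.0
    for i in range(len(unit)):
        for j in range(len(unit)):
            if i != j:
                cos = sum(a * b for a, b in zip(unit[i], unit[j]))
                penalty += cos * cos
    return penalty


def token_f1(candidate, reference):
    """Stand-in similarity metric (an assumption): F1 over unique words."""
    cand, ref = set(candidate.split()), set(reference.split())
    common = len(cand & ref)
    if common == 0:
        return 0.0
    precision, recall = common / len(cand), common / len(ref)
    return 2 * precision * recall / (precision + recall)


def vote(reports, metric=token_f1):
    """Return the index of the report that agrees most with the other reports.

    Each candidate is scored by summing `metric` against every other
    candidate; the highest total wins (ties go to the earliest index).
    """
    scores = [
        sum(metric(rep, other) for j, other in enumerate(reports) if j != i)
        for i, rep in enumerate(reports)
    ]
    return max(range(len(reports)), key=scores.__getitem__)
```

Under these assumptions, `orthogonality_penalty([[1.0, 0.0], [0.0, 1.0]])` is 0.0 because the two vectors are orthogonal, while two identical vectors give 2.0 (one unit of penalty per ordered pair). In a trained model the penalty would be applied to learned expert token embeddings, and the voting metric would normally be a standard text-generation metric rather than this simple word overlap.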

Open Access Status

This publication may be available as open access


