Multimodality in deep learning


Information can be represented in many forms. For instance, a headache can be indicated by medical instruction text, a CT scan, a facial expression image, body temperature, and so on. By combining some of these modalities, we can improve the performance of our deep learning models.


Multimodality is the application of multiple literacies within one medium [1].

Multimodal image matching refers to identifying and then corresponding the same or similar structure/content in two or more images that have significant modality or nonlinear appearance differences [2].
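One classic similarity measure for matching images across modalities is mutual information: it rewards statistical dependence between intensity values rather than identical appearance, so it still works when the two modalities look completely different. A minimal sketch with numpy (toy random images stand in for real CT/MRI data):

```python
import numpy as np

def mutual_information(img_a, img_b, bins=16):
    """Estimate mutual information between two images via a joint histogram.

    MI measures statistical dependence between intensities, so it can score
    correspondence across modalities (e.g. CT vs. MRI) where pixel values
    are related nonlinearly rather than equal.
    """
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    pxy = joint / joint.sum()                 # joint probability
    px = pxy.sum(axis=1, keepdims=True)       # marginal of img_a
    py = pxy.sum(axis=0, keepdims=True)       # marginal of img_b
    nz = pxy > 0                              # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
a = rng.random((64, 64))
b = 1.0 - a                 # inverted appearance, but same underlying structure
c = rng.random((64, 64))    # unrelated image

# The structurally dependent pair scores higher than the unrelated pair.
print(mutual_information(a, b) > mutual_information(a, c))  # True
```

In a real registration pipeline this score would be maximized over candidate alignments; here it just ranks a dependent pair above an unrelated one.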

Multimodal analysis also appears in biomedical applications, for example multifunctional nanoparticles designed for multimodal imaging and theragnosis [4].

Research in deep learning

Some research focuses on aligning the information of different modalities. In Amazon's work on vision-language representation learning [3], image representations and text representations are aligned so that cross-modality fusion improves performance.
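The core idea behind this kind of alignment can be illustrated with a symmetric contrastive (InfoNCE-style) loss, where the i-th image and i-th text in a batch form a positive pair and everything else is a negative. This is a hypothetical numpy sketch of the general technique, not the specific method of [3]:

```python
import numpy as np

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss that pulls matched image/text pairs together.

    Row i of img_emb and txt_emb are assumed to come from the same example,
    so the diagonal of the similarity matrix should dominate each row/column.
    """
    # L2-normalise so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix
    labels = np.arange(len(logits))

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(1)
e = rng.standard_normal((8, 32))
aligned = info_nce_loss(e, e)                        # perfectly aligned pairs
mismatched = info_nce_loss(e, rng.standard_normal((8, 32)))
print(aligned < mismatched)  # True: aligned embeddings give a lower loss
```

Minimizing this loss drives matched image/text embeddings toward each other in the shared space, which is what makes later cross-modality fusion effective.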

You can also carefully design your network so that multimodal data are fed into it simultaneously.
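A common design is one encoder per modality followed by feature concatenation and a joint prediction head. The sketch below is a toy forward pass in numpy with random weights standing in for trained layers; the layer sizes (2048-d image features, 300-d text features) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(x):
    return np.maximum(x, 0.0)

# One small encoder per modality (random weights stand in for trained layers)
W_img = rng.standard_normal((2048, 64)) * 0.01   # image features -> 64-d
W_txt = rng.standard_normal((300, 64)) * 0.01    # text features  -> 64-d
W_head = rng.standard_normal((128, 2)) * 0.01    # fused features -> 2 classes

def forward(img_feat, txt_feat):
    h_img = relu(img_feat @ W_img)                   # encode each modality separately
    h_txt = relu(txt_feat @ W_txt)
    fused = np.concatenate([h_img, h_txt], axis=1)   # simple fusion by concatenation
    return fused @ W_head                            # joint prediction head

batch_img = rng.standard_normal((4, 2048))   # e.g. CNN image features
batch_txt = rng.standard_normal((4, 300))    # e.g. averaged word vectors
logits = forward(batch_img, batch_txt)
print(logits.shape)  # (4, 2)
```

Concatenation is the simplest fusion choice; attention-based cross-modal fusion, as in the alignment work above, is a more expressive alternative.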




[1] Multimodality, wiki

[2] Jiang, X., Ma, J., Xiao, G., Shao, Z. and Guo, X., 2021. A review of multimodal image matching: Methods and applications. Information Fusion, 73, pp.22-71.

[3] Duan, J., Chen, L., Tran, S., Yang, J., Xu, Y., Zeng, B., Tao, C. and Chilimbi, T., 2022. Multi-modal Alignment using Representation Codebook. arXiv preprint arXiv:2203.00048.

[4] Lee, D.E., Koo, H., Sun, I.C., Ryu, J.H., Kim, K. and Kwon, I.C., 2012. Multifunctional nanoparticles for multimodal imaging and theragnosis. Chemical Society Reviews, 41(7), pp.2656-2672.

[5] What is the difference between multi-modal (多模态) and multi-view (多视图)? (Chinese-language Q&A)