
In this post we explain how we surveyed the multimodal machine learning (MMML) landscape, what architectures and datasets dominate today, and why noisy data and adversarial attacks remain major open problems.
TL;DR
- We reviewed 88 research papers (selected from an initial pool of 341) that blend text and image data in machine‑learning models.
- Across the literature, BERT for text and ResNet (or VGG) for images are the most common backbones, and simple concatenation or attention‑based fusion are the leading integration strategies.
- Our analysis shows that robustness to noisy inputs and adversarial attacks is still a minor focus, highlighting a clear research gap.
Why it matters
Everyday digital content such as social‑media posts, news articles, and medical reports contains both text and images. When a model can understand both modalities together, it can answer richer questions, detect misinformation more accurately, and assist in domains such as healthcare or autonomous driving. Yet most existing systems assume clean, well‑labeled data. In the real world, captions can be misspelled, photos can be blurry, and malicious actors can deliberately tamper with one modality to fool a model. Understanding the current state of MMML helps us see how far we have come and why improving robustness is essential for safe deployment.
How it works
Our review follows a four‑step workflow that mirrors how most multimodal pipelines are built (a minimal code sketch follows the list):
- Collect multimodal data. Researchers typically start with benchmark collections such as Twitter posts, Flickr photo‑caption pairs, or the COCO dataset, because these resources already pair text with images.
- Extract features from each modality. For text we usually fine‑tune a pre‑trained BERT model (or one of its variants) to obtain contextual token embeddings. For images we run a ResNet or VGG network to produce visual feature vectors.
- Fuse the two feature streams. The simplest method stacks (concatenates) the vectors side‑by‑side. More sophisticated approaches use attention mechanisms that learn to weight the text or image information differently for each example. Some recent works also employ specialized networks such as Multi‑Task Graph Convolutional Networks (MT‑GCN) or generative adversarial models (e.g., MANGO) to improve the interaction between modalities.
- Train and evaluate the joint model. The fused representation feeds a downstream classifier, generator, or reasoning module, and performance is measured on tasks like image captioning, visual question answering, or multimodal sentiment analysis.
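To make the workflow concrete, here is a minimal PyTorch sketch of steps 2 through 4, assuming BERT and ResNet‑50 as the feature extractors, concatenation as the fusion step, and a small two‑class head as the downstream model. The specific model names, feature dimensions, and classifier are illustrative choices on our part, not the setup of any particular paper in the review.

```python
# Minimal sketch of the generic pipeline described above, using PyTorch,
# Hugging Face Transformers, and torchvision. Model names, dimensions, and
# the two-class head are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer
from torchvision.models import resnet50, ResNet50_Weights


class ConcatFusionClassifier(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Step 2: pre-trained "feature factories" for each modality.
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
        # Drop the ImageNet classification head, keep the pooled features.
        self.image_encoder = nn.Sequential(*list(backbone.children())[:-1])
        # Step 3: simple concatenation fusion (768-d text + 2048-d image).
        # Step 4: a small downstream classifier on the fused vector.
        self.classifier = nn.Sequential(
            nn.Linear(768 + 2048, 512), nn.ReLU(), nn.Linear(512, num_classes)
        )

    def forward(self, input_ids, attention_mask, images):
        text_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).pooler_output                                    # (batch, 768)
        img_feat = self.image_encoder(images).flatten(1)   # (batch, 2048)
        fused = torch.cat([text_feat, img_feat], dim=-1)   # side-by-side stacking
        return self.classifier(fused)


# Toy usage: one caption-image pair, with random pixels standing in for a photo.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["a dog playing in the park"], return_tensors="pt", padding=True)
images = torch.randn(1, 3, 224, 224)
model = ConcatFusionClassifier()
logits = model(batch["input_ids"], batch["attention_mask"], images)
print(logits.shape)  # torch.Size([1, 2])
```

Because the encoders act as interchangeable building blocks here, swapping the concatenation line for a different fusion strategy leaves the rest of the pipeline untouched.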
This modular view explains why BERT and ResNet appear so often: they are reliable “feature factories” that can be swapped into any fusion strategy.
What we found
Our systematic scoping review uncovered clear patterns across the 88 papers we examined.
- Feature extractors. BERT (and its variants) dominate text encoding, appearing in more than 70 % of studies. For images, ResNet is the leading backbone, followed closely by VGG.
- Fusion choices. Simple concatenation remains the baseline in roughly half of the works. Attention‑based fusion, especially transformer‑style cross‑modal attention, is the fastest‑growing technique and is preferred for tasks that require nuanced reasoning such as visual question answering (a cross‑modal attention sketch follows this list).
- Benchmark datasets. Twitter, Flickr, and COCO together account for over 80 % of the evaluated datasets, reflecting a community preference for large, publicly available multimodal corpora.
- Challenges reported. More than three‑quarters of the papers flag noisy or mislabeled data as a major obstacle. Only a small minority (fewer than ten papers) explicitly evaluate adversarial robustness, indicating a serious gap.
- Emerging defenses. A handful of works explore noise‑injection schemes such as MANGO (Multimodal Adversarial Noise Generator) or GAN‑based refinements (AC‑GAN, cGAN, rGAN) to improve resilience, but these approaches remain preliminary.
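As a rough illustration of the attention‑based fusion noted in the fusion bullet above, the snippet below lets text tokens attend over image regions with a single PyTorch MultiheadAttention layer. The projection dimensions, head count, and mean pooling are simplifying assumptions rather than a specific architecture from the surveyed papers.

```python
# Illustrative cross-modal attention fusion: text tokens attend over image
# regions. Dimensions and the single attention layer are illustrative
# assumptions, not a specific architecture from the review.
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, hidden_dim=512, num_heads=8):
        super().__init__()
        # Project both modalities into a shared space before attending.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, text_tokens, image_regions):
        # text_tokens:   (batch, num_tokens, text_dim),   e.g. BERT token embeddings
        # image_regions: (batch, num_regions, image_dim), e.g. a flattened ResNet feature map
        q = self.text_proj(text_tokens)
        kv = self.image_proj(image_regions)
        attended, weights = self.cross_attn(query=q, key=kv, value=kv)
        # Each text token gets its own weighting of the visual evidence;
        # pooling over tokens yields a fused vector for a downstream classifier.
        return attended.mean(dim=1), weights


fusion = CrossModalAttentionFusion()
fused, attn = fusion(torch.randn(2, 12, 768), torch.randn(2, 49, 2048))
print(fused.shape, attn.shape)  # torch.Size([2, 512]) torch.Size([2, 12, 49])
```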
Limits and next steps
All of the literature we surveyed shares several limitations:
- Adversarial robustness is under‑studied. Fewer than ten papers address attacks that perturb either the image, the text, or both, leaving safety concerns largely unanswered.
- Noise handling relies on ad‑hoc tricks. Techniques such as label‑noise‑robust GANs appear sporadically, without a unified evaluation framework.
- Computational cost. Combining large pre‑trained models (BERT + ResNet) with attention‑heavy fusion layers demands substantial GPU memory, hindering real‑time deployment.
- Dataset size bias. Many studies assume that bigger datasets always yield better performance, yet systematic investigations of data efficiency are rare.
Going forward, we recommend three concrete research directions:
- Develop standardized benchmarks that test multimodal models under realistic noisy and adversarial conditions (a toy perturbation sketch follows this list).
- Design lightweight fusion architectures that retain cross‑modal reasoning power while reducing memory footprints.
- Explore principled noise‑reduction pipelines (e.g., curriculum learning, robust loss functions, or multimodal self‑supervision) to mitigate label and visual noise before fusion.
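As a toy example of the first recommendation, the sketch below shows the kind of controlled perturbations a standardized robustness benchmark could apply: Gaussian pixel noise for images and random character‑level typos for captions. The perturbation strengths and the typo scheme are our own illustrative assumptions, not an established protocol.

```python
# Toy perturbations for probing multimodal robustness. The noise levels and
# typo scheme are illustrative assumptions, not an established benchmark.
import random
import string
import torch


def perturb_image(images: torch.Tensor, sigma: float = 0.05) -> torch.Tensor:
    """Add zero-mean Gaussian noise to a batch of image tensors in [0, 1]."""
    return (images + sigma * torch.randn_like(images)).clamp(0.0, 1.0)


def perturb_caption(caption: str, typo_rate: float = 0.1) -> str:
    """Randomly replace a fraction of letters to simulate misspellings."""
    chars = list(caption)
    for i in range(len(chars)):
        if chars[i].isalpha() and random.random() < typo_rate:
            chars[i] = random.choice(string.ascii_lowercase)
    return "".join(chars)


# A robustness benchmark would report accuracy on clean inputs versus
# accuracy on (perturb_image(x), perturb_caption(t)) pairs, sweeping sigma
# and typo_rate to chart how quickly performance degrades.
print(perturb_caption("a dog playing in the park"))
```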
FAQ
- What is multimodal machine learning?
- It is a class of algorithms that process two or more data types (modalities) together, learning a shared representation that captures both textual semantics and visual context.
- Why do most papers use BERT and ResNet?
- Both models are pre‑trained on massive datasets (large text corpora for BERT, large labeled image collections for ResNet) and can be fine‑tuned with relatively little task‑specific data. They therefore provide strong, reusable feature vectors that simplify the design of multimodal systems.
- Is simple concatenation still useful?
- Yes. Concatenation is computationally cheap and works well for many baseline tasks. It serves as a reference point when researchers compare more complex attention‑based or graph‑based fusion methods.
- What are the most common public datasets?
- Twitter (short social posts with attached images), Flickr (photo‑caption pairs), and COCO (object‑rich images with detailed descriptions) dominate current evaluations because they are large, diverse, and openly licensed.
- How serious are adversarial attacks for multimodal systems?
- Even tiny perturbations such as misspelled words, slight pixel noise, or hidden text can mislead a model that relies heavily on one modality. Because only a few studies have measured this effect, we consider it a high‑priority safety issue.
Read the paper
For a detailed list of the 88 papers we examined, the full methodology, and our complete set of recommendations, please consult the original review article.
Rashid, M. B., Rahaman, M. S., & Rivas, P. (2024, July). Navigating the Multimodal Landscape: A Review on Integration of Text and Image Data in Machine Learning Architectures. Machine Learning and Knowledge Extraction, 6(3), 1545‑1563. https://doi.org/10.3390/make6030074