Multimodal Machine Learning: How We Combine Text and Images, and What Still Holds Us Back

We review 88 papers on text‑image multimodal learning, highlight BERT‑ResNet pipelines, and expose gaps in robustness and noise handling.

In this post we explain how we surveyed the multimodal machine learning (MMML) landscape, what architectures and datasets dominate today, and why noisy data and adversarial attacks remain major open problems.

TL;DR

  • We reviewed 88 research papers (selected from an initial pool of 341) that blend text and image data in machine‑learning models.
  • Across the literature, BERT for text and ResNet (or VGG) for images are the most common backbones, and simple concatenation and attention‑based fusion are the leading integration strategies.
  • Our analysis shows that robustness to noisy inputs and adversarial attacks is still a minor focus, highlighting a clear research gap.

Why it matters

Everyday digital content (social‑media posts, news articles, medical reports) contains both text and images. When a model can understand both modalities together, it can answer richer questions, detect misinformation more accurately, and assist in domains such as healthcare or autonomous driving. Yet most existing systems assume clean, well‑labeled data. In the real world, captions can be misspelled, photos can be blurry, and malicious actors can deliberately tamper with one modality to fool a model. Understanding the current state of MMML helps us see how far we have come and why improving robustness is essential for safe deployment.

How it works

Our review follows a four‑step workflow that mirrors how most multimodal pipelines are built:

  1. Collect multimodal data. Researchers typically start with benchmark collections such as Twitter posts, Flickr photo‑caption pairs, or the COCO dataset, because these resources already pair text with images.
  2. Extract features from each modality. For text we usually fine‑tune a pre‑trained BERT model (or its variants) to obtain dense word‑level embeddings. For images we run a ResNet or VGG network to produce visual feature vectors.
  3. Fuse the two feature streams. The simplest method stacks (concatenates) the vectors side‑by‑side. More sophisticated approaches use attention mechanisms that learn to weight the text or image information differently for each example. Some recent works also employ specialized networks such as Multi‑Task Graph Convolutional Networks (MT‑GCN) or generative adversarial models (e.g., MANGO) to improve the interaction between modalities.
  4. Train and evaluate the joint model. The fused representation feeds a downstream classifier, generator, or reasoning module, and performance is measured on tasks like image captioning, visual question answering, or multimodal sentiment analysis.

This modular view explains why BERT and ResNet appear so often: they are reliable “feature factories” that can be swapped into any fusion strategy.
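To make steps 2 and 3 concrete, here is a minimal sketch of the BERT + ResNet + concatenation recipe using Hugging Face transformers and torchvision. The specific checkpoints (bert‑base‑uncased, ResNet‑50) and the two‑class classifier head are illustrative assumptions, not choices prescribed by the papers we reviewed.

```python
# Minimal sketch of steps 2-3: extract text and image features, then fuse by concatenation.
# Model choices (bert-base-uncased, ResNet-50) are illustrative, not prescribed by the review.
import torch
from PIL import Image
from torchvision import models, transforms
from transformers import BertModel, BertTokenizer

# --- Text branch: BERT [CLS] embedding (768-d) ---
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def text_features(caption: str) -> torch.Tensor:
    inputs = tokenizer(caption, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0, :]  # (1, 768) [CLS] token

# --- Image branch: ResNet-50 pooled features (2048-d) ---
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()  # drop the classification head, keep pooled features
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def image_features(image: Image.Image) -> torch.Tensor:
    with torch.no_grad():
        return resnet(preprocess(image).unsqueeze(0))  # (1, 2048)

# --- Step 3 (simplest variant): fuse by concatenation and feed a task head ---
classifier = torch.nn.Linear(768 + 2048, 2)  # e.g., binary multimodal sentiment

def predict(caption: str, image: Image.Image) -> torch.Tensor:
    fused = torch.cat([text_features(caption), image_features(image)], dim=-1)
    return classifier(fused)  # logits for the downstream task
```

In a real pipeline both encoders would typically be fine‑tuned end‑to‑end with the task head, but even this frozen‑feature version shows why the two backbones are treated as interchangeable feature factories.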

What we found

Our systematic scoping review uncovered clear patterns across the 88 papers we examined.

  • Feature extractors. BERT (and its variants) dominate text encoding, appearing in more than 70 % of studies. For images, ResNet is the leading backbone, followed closely by VGG.
  • Fusion choices. Simple concatenation remains the baseline in roughly half of the works. Attention‑based fusion, especially transformer‑style cross‑modal attention, is the fastest‑growing technique and is preferred for tasks that require nuanced reasoning such as visual question answering (see the sketch after this list).
  • Benchmark datasets. Twitter, Flickr, and COCO together account for over 80 % of the evaluated datasets, reflecting a community preference for large, publicly available multimodal corpora.
  • Challenges reported. More than three‑quarters of the papers flag noisy or mislabeled data as a major obstacle. Only a small minority (fewer than ten papers) explicitly evaluate adversarial robustness, indicating a serious gap.
  • Emerging defenses. A handful of works experiment with noise‑injection schemes such as MANGO (Multimodal Adversarial Noise Generator) or GAN‑based refinements (AC‑GAN, cGAN, rGAN) to improve resilience, but these approaches are still experimental.
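For the attention‑based fusion that is growing fastest, a transformer‑style cross‑modal block can be sketched as follows. The dimensions, single attention layer, and mean pooling are illustrative assumptions rather than a specific architecture from the surveyed papers.

```python
# Illustrative transformer-style cross-modal attention: text tokens attend over image regions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, hidden_dim=512, num_heads=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)    # project BERT token embeddings
        self.image_proj = nn.Linear(image_dim, hidden_dim)  # project ResNet region features
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, text_tokens, image_regions):
        # text_tokens:   (batch, n_words, text_dim),    e.g. BERT's last_hidden_state
        # image_regions: (batch, n_regions, image_dim), e.g. a 7x7 ResNet feature map, flattened
        q = self.text_proj(text_tokens)
        kv = self.image_proj(image_regions)
        attended, _ = self.cross_attn(query=q, key=kv, value=kv)  # each word weights image regions
        fused = self.norm(q + attended)  # residual connection
        return fused.mean(dim=1)         # pooled joint representation

# Example shapes: a batch of 4 captions (20 tokens each) and 4 images (49 regions each)
fusion = CrossModalFusion()
joint = fusion(torch.randn(4, 20, 768), torch.randn(4, 49, 2048))  # -> (4, 512)
```

Unlike plain concatenation, the attention weights let each word emphasize different image regions, which is what makes this family of methods attractive for visual question answering.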

Limits and next steps

All of the literature we surveyed shares several limitations:

  • Adversarial robustness is under‑studied. Fewer than ten papers address attacks that perturb either the image, the text, or both, leaving safety concerns largely unanswered.
  • Noise handling relies on ad‑hoc tricks. Techniques such as label‑noise‑robust GANs appear sporadically, without a unified evaluation framework.
  • Computational cost. Combining large pre‑trained models (BERT + ResNet) with attention‑heavy fusion layers demands substantial GPU memory, hindering real‑time deployment.
  • Dataset size bias. Many studies assume that bigger datasets always yield better performance, yet systematic investigations of data efficiency are rare.

Going forward, we recommend three concrete research directions:

  1. Develop standardized benchmarks that test multimodal models under realistic noisy and adversarial conditions.
  2. Design lightweight fusion architectures that retain cross‑modal reasoning power while reducing memory footprints.
  3. Explore principled noise‑reduction pipelines, e.g., curriculum learning, robust loss functions, or multimodal self‑supervision, to mitigate label and visual noise before fusion.
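As one small illustration of direction 3, a label‑noise‑tolerant objective such as generalized cross‑entropy can replace the standard loss on the fused representation. This is a generic technique from the noisy‑label literature, offered as an assumption of what a principled pipeline might start from, not a defense evaluated in our review.

```python
# Generalized cross-entropy: a drop-in, label-noise-tolerant loss from the noisy-label
# literature. Illustrative only; not one of the defenses evaluated in the surveyed papers.
import torch
import torch.nn.functional as F

def generalized_cross_entropy(logits, targets, q: float = 0.7):
    """Loss = (1 - p_y^q) / q. Approaches cross-entropy as q -> 0 and mean absolute
    error as q -> 1, trading convergence speed against tolerance to mislabeled examples."""
    probs = F.softmax(logits, dim=-1)
    p_y = probs.gather(1, targets.unsqueeze(1)).squeeze(1).clamp(min=1e-7)
    return ((1.0 - p_y.pow(q)) / q).mean()

# Usage on the classifier output of a fused text-image representation
logits = torch.randn(8, 2)            # predictions for 8 text-image pairs
labels = torch.randint(0, 2, (8,))    # possibly noisy labels
loss = generalized_cross_entropy(logits, labels)
```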

FAQ

What is multimodal machine learning?
It is a class of algorithms that process two or more data types (modalities) together, learning a shared representation that captures both textual semantics and visual context.
Why do most papers use BERT and ResNet?
Both models are pre‑trained on massive corpora (text for BERT, images for ResNet) and can be fine‑tuned with relatively little task‑specific data. They therefore provide strong, reusable feature vectors that simplify the design of multimodal systems.
Is simple concatenation still useful?
Yes. Concatenation is computationally cheap and works well for many baseline tasks. It serves as a reference point when researchers compare more complex attention‑based or graph‑based fusion methods.
What are the most common public datasets?
Twitter (short social posts with attached images), Flickr (photo‑caption pairs), and COCO (object‑rich images with detailed descriptions) dominate current evaluations because they are large, diverse, and openly licensed.
How serious are adversarial attacks for multimodal systems?
Even tiny perturbations (misspelled words, slight pixel noise, or hidden text) can mislead a model that relies heavily on one modality. Because only a few studies have measured this effect, we consider it a high‑priority safety issue.
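To see how little it takes on the image side, here is a standard FGSM‑style perturbation, a textbook attack recipe rather than one drawn from the reviewed papers; the model and epsilon value are placeholders.

```python
# FGSM-style perturbation of the image modality: one gradient-sign step of size epsilon.
# Standard textbook recipe shown for illustration; the model and epsilon are placeholders.
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.01):
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # step in the direction that increases the loss, then clip to the valid pixel range
    perturbed = image + epsilon * image.grad.sign()
    return perturbed.clamp(0, 1).detach()
```

A perturbation of this size is usually invisible to a human, yet it can flip the prediction of a fusion model whose decision leans mostly on the visual stream.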

Read the paper

For a detailed list of the 88 papers we examined, the full methodology, and our complete set of recommendations, please consult the original review article.

Rashid, M. B., Rahaman, M. S., & Rivas, P. (2024, July). Navigating the Multimodal Landscape: A Review on Integration of Text and Image Data in Machine Learning Architectures. Machine Learning and Knowledge Extraction, 6(3), 1545‑1563. https://doi.org/10.3390/make6030074. Download paper

A Unified Framework for Fair Counterfactual Explanations: Benchmarking, Scalability, and Human‑Centric Design

We propose a unified evaluation framework for counterfactual explanations that balances fairness, plausibility, and scalability, and we outline next steps for research and practice.

In this work, we combine a systematic mapping of existing literature with a concrete benchmark suite. Our goal is to make counterfactual explanations both fair and actionable across high‑dimensional, real‑world domains.

TL;DR

  • We introduce a unified evaluation framework that simultaneously measures plausibility, actionability, and legal compliance of counterfactual explanations.
  • Our benchmark suite covers large‑scale, high‑dimensional datasets (e.g., Lending Club, HMDA, KKBox) and demonstrates that current methods struggle with scalability and causal validity.
  • The framework emphasizes human‑in‑the‑loop assessment, causal grounding, and open‑source tooling to bridge research and industry.

Why it matters

Machine‑learning models increasingly drive decisions about credit, hiring, health care, and criminal justice. When a model denies a loan or predicts a high risk score, affected individuals often request an explanation. Counterfactual explanations answer the question “What would need to change for a different outcome?” While attractive, these explanations are currently evaluated with ad‑hoc metrics, such as sparsity or proximity, that are hard to compare across domains. Without a common yardstick, we cannot reliably assess whether an explanation is fair, plausible, or legally compliant (e.g., under the GDPR’s “right‑to‑explanation”). Moreover, many approaches ignore the causal structure of the data, leading to explanations that suggest impossible or socially undesirable changes. Finally, many counterfactual generators are designed for low‑dimensional toy data and collapse when applied to real‑world, high‑dimensional workloads.

How it works

Our approach proceeds in three stages.

  1. Systematic literature mapping. We performed a systematic mapping study (SMS) of peer‑reviewed papers, industry reports, and open‑source toolkits that discuss bias detection, fairness metrics, and counterfactual generation. This gave us a consolidated view of which methods exist, what datasets they have been tested on, and which fairness notions they address.
  2. Construction of a unified metric suite. Building on the discussion points identified in the literature, we defined three orthogonal axes:
    • Plausibility: does the suggested change respect real‑world domain constraints?
    • Actionability: can a user realistically achieve the suggested change?
    • Legal compliance: does the explanation satisfy GDPR‑style minimal disclosure requirements?

    Each axis aggregates several concrete measures (e.g., feasibility checks, causal consistency tests, and robustness to distribution shift) that have been repeatedly highlighted across the surveyed papers; a simplified scoring sketch follows this list.

  3. Benchmark suite and open‑source integration. We assembled a set of widely used, high‑dimensional datasets (Adult, German Credit, HMDA, Lending Club, and KKBox) and wrapped them in a reproducible pipeline that evaluates any counterfactual generator on all three axes. The suite is released under a permissive license and directly plugs into existing fairness toolkits such as AI Fairness 360.
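To give a feel for stage 2, the sketch below scores a single counterfactual along the three axes. The function names, thresholds, and scoring rules are simplified assumptions for illustration, not the published interface of our benchmark suite.

```python
# Hypothetical three-axis scoring of one counterfactual; names and rules are illustrative
# simplifications, not the benchmark's released API.
from dataclasses import dataclass
import numpy as np

@dataclass
class CounterfactualReport:
    plausibility: float   # fraction of changed features that stay within observed ranges
    actionability: float  # fraction of changed features the user can actually modify
    compliance: float     # 1.0 if no more features are disclosed than a minimal budget

def evaluate_counterfactual(x, x_cf, feature_ranges, mutable, max_disclosed=5):
    changed = np.flatnonzero(~np.isclose(x, x_cf))
    in_range = [feature_ranges[i][0] <= x_cf[i] <= feature_ranges[i][1] for i in changed]
    plausibility = float(np.mean(in_range)) if len(changed) else 1.0
    actionability = float(np.mean([i in mutable for i in changed])) if len(changed) else 1.0
    compliance = 1.0 if len(changed) <= max_disclosed else 0.0
    return CounterfactualReport(plausibility, actionability, compliance)

# Example: income (mutable, index 1) is increased within range; age (index 0) is untouched
x    = np.array([35.0, 42000.0])
x_cf = np.array([35.0, 51000.0])
print(evaluate_counterfactual(x, x_cf,
                              feature_ranges=[(18, 90), (0, 500_000)],
                              mutable={1}))
```

In the actual suite each axis aggregates several such checks (causal consistency, robustness to distribution shift, and so on), but the shape of the interface stays the same: one input, one counterfactual, one report.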

What we found

Applying our framework to a representative sample of ten counterfactual generation techniques revealed consistent patterns:

  • Unified metrics are missing. No prior work reported all three axes together; most papers focused on either sparsity or statistical fairness alone.
  • Scalability is limited. Optimization‑based approaches that work on the Adult dataset (≈30 K rows, 14 features) become infeasible on Lending Club (> 2 M rows, > 100 features) without dimensionality‑reduction tricks.
  • Causal grounding is rare. Only a small minority of methods explicitly encode causal graphs; the majority treat features as independent, which leads to implausible suggestions (e.g., decreasing age while increasing income).
  • Human evaluation is under‑explored. Few studies incorporated user‑centric metrics such as trust or perceived fairness, despite repeated calls in the literature for human‑in‑the‑loop design.
  • Open‑source tooling is fragmented. Toolkits like AI Fairness 360 provide bias metrics but lack integrated counterfactual generators; conversely, counterfactual libraries focus on explanation generation but not on fairness assessment.

These findings motivate the need for a single, extensible benchmark that can be used by researchers to compare methods and by practitioners to validate deployments.

Limits and next steps

Our study has several limitations that also point to promising research directions.

  • Dataset concentration. Most benchmark datasets are classic tabular collections (Adult, German Credit, HMDA). While they span finance, health, and criminal justice, additional domains such as education or environmental policy remain under‑represented.
  • Causal knowledge acquisition. We assume that a causal graph can be obtained from domain experts or from causal discovery algorithms. In practice, constructing accurate causal models at scale is still an open problem.
  • Dynamic real‑world environments. Our benchmark captures static snapshots of data. Future work should test explanations under distribution shift and over time, as highlighted by robustness‑to‑distribution‑shift concerns.
  • Human‑centered evaluation. Our current human‑in‑the‑loop studies are limited to small user studies. Scaling user feedback to millions of decisions will require novel crowdsourcing or interactive UI designs.

To address these gaps we propose the following next steps:

  1. Expand the benchmark to include under‑explored domains (e.g., sustainability, public policy) and multimodal data (text, images).
  2. Develop hybrid methods that combine optimization‑based counterfactual generation with causal constraints, reducing implausible suggestions.
  3. Integrate the benchmark into existing fairness toolkits (AI Fairness 360, What‑If Tool) to provide a one‑stop shop for fairness‑aware explanation evaluation.
  4. Design large‑scale user studies that measure trust, perceived fairness, and actionable insight across diverse stakeholder groups.

FAQ

What is a counterfactual explanation?
A counterfactual explanation describes the minimal changes to an input that would flip the model’s prediction, answering “What if …?” for the user.
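As a minimal illustration (not one of the generators benchmarked in the paper), a gradient‑based search in the style of Wachter et al. can produce such a counterfactual by pushing the prediction toward the desired class while staying close to the original input; the model, step count, and trade‑off weight below are placeholder assumptions.

```python
# Minimal Wachter-style counterfactual search: flip the prediction while penalizing
# distance from the original input. Illustrative placeholder, not a benchmarked method.
import torch
import torch.nn.functional as F

def find_counterfactual(model, x, target_class, lam=0.1, steps=500, lr=0.05):
    x_cf = x.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([x_cf], lr=lr)
    target = torch.tensor([target_class])
    for _ in range(steps):
        optimizer.zero_grad()
        # classification term pushes toward the target class; L1 term keeps changes sparse
        loss = F.cross_entropy(model(x_cf), target) + lam * torch.norm(x_cf - x, p=1)
        loss.backward()
        optimizer.step()
    return x_cf.detach()

# Works with any differentiable classifier over a (1, n_features) input
model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
x = torch.randn(1, 4)
x_cf = find_counterfactual(model, x, target_class=1)
```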
Why do we need a unified framework?
Existing works evaluate explanations with disparate metrics, making it impossible to compare fairness, plausibility, and legal compliance across methods or domains.
Can my model’s explanations be legally compliant without a causal model?
Legal requirements such as GDPR emphasize that explanations should reflect realistic, causally possible changes. Ignoring causality can lead to implausible or misleading counterfactuals, risking non‑compliance.
How does the framework handle high‑dimensional data?
We include scalability tests that measure runtime and memory on datasets with hundreds of features. Our results show that many current methods need dimensionality reduction or approximation to remain tractable.
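A simple way to collect such runtime and peak‑memory numbers for any generator is sketched below using Python's standard library; the wrapper name and the generator interface are assumptions for illustration, not part of the benchmark itself.

```python
# Record wall-clock time and peak Python-level memory for one generator call.
# The wrapper and the generator interface are illustrative, not the benchmark's API.
# Note: tracemalloc tracks Python allocations only, not GPU or native-library memory.
import time
import tracemalloc

def profile_generator(generate_fn, *args, **kwargs):
    tracemalloc.start()
    start = time.perf_counter()
    result = generate_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, {"runtime_s": elapsed, "peak_mem_mb": peak_bytes / 1e6}

# Example: profile the gradient-based search sketched in the previous answer
# x_cf, stats = profile_generator(find_counterfactual, model, x, target_class=1)
```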

Read the paper

For the full technical details, benchmark specifications, and exhaustive literature review, please consult the original publication.

Jui, T. D., & Rivas, P. (2024). Fairness issues, current approaches, and challenges in machine learning models. International Journal of Machine Learning and Cybernetics, 1–31. Download PDF