{"id":7037,"date":"2025-09-07T10:00:00","date_gmt":"2025-09-07T15:00:00","guid":{"rendered":"https:\/\/lab.rivas.ai\/?p=7037"},"modified":"2025-09-07T22:22:05","modified_gmt":"2025-09-08T03:22:05","slug":"navigating-the-multimodal-landscape-a-review-on-integration-of-text-and-image-data-in-machine-learning-architectures","status":"publish","type":"post","link":"https:\/\/lab.rivas.ai\/?p=7037","title":{"rendered":"Navigating the Multimodal Landscape: A Review on Integration of Text and Image Data in Machine Learning Architectures"},"content":{"rendered":"<article>\n<header>\n<p class=\"meta-description\">We review 88 multimodal ML papers, highlighting BERT and ResNet for text\u2011image tasks, fusion methods, and challenges like noise and adversarial attacks.<\/p>\n<p class=\"deck\">We systematically surveyed the literature to identify the most common pre\u2011trained models, fusion strategies, and open challenges when combining text and images in machine learning pipelines.<\/p>\n<\/header>\n<nav class=\"toc\">\n<ul>\n<li><a href=\"#tldr\">TL;DR<\/a><\/li>\n<li><a href=\"#why-it-matters\">Why it matters<\/a><\/li>\n<li><a href=\"#how-it-works\">How it works<\/a><\/li>\n<li><a href=\"#results\">What we found<\/a><\/li>\n<li><a href=\"#limits\">Limits and next steps<\/a><\/li>\n<li><a href=\"#faq\">FAQ<\/a><\/li>\n<li><a href=\"#read-the-paper\">Read the paper<\/a><\/li>\n<\/ul>\n<\/nav>\n<section id=\"tldr\">\n<h2>TL;DR<\/h2>\n<ul>\n<li>We reviewed 88 multimodal machine\u2011learning papers to map the current landscape.<\/li>\n<li>BERT for text and ResNet (or VGG) for images dominate feature extraction.<\/li>\n<li>Simple concatenation remains common, but attention\u2011based fusion is gaining traction.<\/li>\n<\/ul>\n<\/section>\n<section id=\"why-it-matters\">\n<h2>Why it matters<\/h2>\n<p>Text and images together encode richer semantic information than either modality alone. 
Harnessing both can improve content understanding, recommendation systems, and decision\u2011making across domains such as healthcare, social media, and autonomous robotics. However, integrating these signals introduces new sources of noise and vulnerability that must be addressed for reliable deployment.<\/p>\n<\/section>\n<section id=\"how-it-works\">\n<h2>How it works (plain words)<\/h2>\n<p>Our workflow follows three clear steps:<\/p>\n<ol>\n<li>Gather and filter the literature \u2013 we started from 341 retrieved papers and applied inclusion criteria to focus on 88 high\u2011impact studies.<\/li>\n<li>Extract methodological details \u2013 for each study we recorded the pre\u2011trained language model (most often BERT or LSTM), the vision model (ResNet, VGG, or other CNNs), and the fusion approach (concatenation, early fusion, attention, or advanced neural networks).<\/li>\n<li>Synthesise findings \u2013 we counted how frequently each component appears, noted emerging trends, and listed the recurring limitations reported by authors.<\/li>\n<\/ol>\n<\/section>\n<section id=\"results\">\n<h2>What we found<\/h2>\n<p>Feature extraction<\/p>\n<ul>\n<li>We observed that BERT is the most frequently cited language encoder because of its strong contextual representations across a wide range of tasks.<\/li>\n<li>For visual features, ResNet is the leading architecture, with VGG also appearing regularly in older studies.<\/li>\n<\/ul>\n<p>Fusion strategies<\/p>\n<ul>\n<li>Concatenation \u2013 a straightforward method that simply stacks the text and image embeddings \u2013 is still the baseline choice in many applications.<\/li>\n<li>Attention mechanisms \u2013 either self\u2011attention within a joint transformer or cross\u2011modal attention linking BERT and ResNet embeddings \u2013 are increasingly adopted to let the model weigh the most informative signals.<\/li>\n<li>More complex neural\u2011network\u2011based fusions (e.g., graph\u2011convolutional networks, 
GAN\u2011assisted approaches) are reported in emerging studies, especially when robustness to adversarial perturbations is a priority.<\/li>\n<\/ul>\n<p>Challenges reported across the surveyed papers<\/p>\n<ul>\n<li>Noisy or mislabeled data \u2013 label noise in either modality can degrade joint representations.<\/li>\n<li>Dataset size constraints \u2013 balancing computational cost with sufficient multimodal examples remains difficult.<\/li>\n<li>Adversarial attacks \u2013 malicious perturbations to either text or image streams can cause catastrophic mis\u2011predictions, and defensive techniques are still in early development.<\/li>\n<\/ul>\n<\/section>\n<section id=\"limits\">\n<h2>Limits and next steps<\/h2>\n<p>Despite strong progress, several limitations persist:<\/p>\n<ul>\n<li><strong>Noisy data handling:<\/strong> Existing pipelines often rely on basic preprocessing; more sophisticated denoising or label\u2011noise\u2011robust training is needed.<\/li>\n<li><strong>Dataset size optimisation:<\/strong> Many studies use benchmark collections (Twitter, Flickr, COCO) but do not systematically explore the trade\u2011off between data volume and model complexity.<\/li>\n<li><strong>Adversarial robustness:<\/strong> Current defenses (e.g., auxiliary\u2011classifier GANs, conditional GANs, multimodal noise generators) are promising but lack thorough evaluation across diverse tasks.<\/li>\n<\/ul>\n<p>Future work should therefore concentrate on three fronts: developing noise\u2011resilient preprocessing pipelines, designing scalable training regimes for limited multimodal datasets, and building provably robust fusion architectures that can withstand adversarial pressure.<\/p>\n<\/section>\n<section id=\"faq\">\n<h2>FAQ<\/h2>\n<dl>\n<dt>What pre\u2011trained models should we start with for a new text\u2011image project?<\/dt>\n<dd>We recommend beginning with BERT (or its lightweight variants) for textual encoding and ResNet (or VGG) for visual encoding, as these 
models consistently achieve high baseline performance across the surveyed studies.<\/dd>\n<dt>Is attention\u2011based fusion worth the added complexity?<\/dt>\n<dd>Our review shows that attention mechanisms yield richer joint representations and improve performance on tasks requiring fine\u2011grained alignment (e.g., visual question answering). When computational resources allow, we suggest experimenting with cross\u2011modal attention after establishing a solid concatenation baseline.<\/dd>\n<\/dl>\n<\/section>\n<section id=\"read-the-paper\">\n<h2>Read the paper<\/h2>\n<\/section>\n<section id=\"citation\"><p>Rashid, M. B., Rahaman, M. S., &amp; Rivas, P. (2024, July). Navigating the Multimodal Landscape: A Review on Integration of Text and Image Data in Machine Learning Architectures. <em>Machine Learning and Knowledge Extraction, 6<\/em>(3), 1545\u20131563. https:\/\/doi.org\/10.3390\/make6030074 <a href=\"https:\/\/www.rivas.ai\/pdfs\/rashid2024multimodal.pdf\" rel=\"noopener noreferrer\">Download PDF<\/a><\/p>\n<\/section>\n<\/article>\n","protected":false},"excerpt":{"rendered":"<p>We review 88 multimodal ML papers, highlighting BERT and ResNet for text\u2011image tasks, fusion methods, and challenges like noise and adversarial attacks. 
We systematically surveyed the literature to identify the most common pre\u2011trained models, fusion strategies, and open challenges when combining text and images in machine learning pipelines.<\/p>\n","protected":false},"author":11,"featured_media":7036,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1],"tags":[6,8],"class_list":["post-7037","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized","tag-computer-vision","tag-representation-learning"],"jetpack_featured_media_url":"https:\/\/lab.rivas.ai\/wp-content\/uploads\/2025\/09\/GwHYo-cover.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=\/wp\/v2\/posts\/7037","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=\/wp\/v2\/users\/11"}],"replies":[{"embeddable":true,"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=7037"}],"version-history":[{"count":2,"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=\/wp\/v2\/posts\/7037\/revisions"}],"predecessor-version":[{"id":7040,"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=\/wp\/v2\/posts\/7037\/revisions\/7040"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=\/wp\/v2\/media\/7036"}],"wp:attachment":[{"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=7037"}],"wp:term":[{"taxonomy":
"category","embeddable":true,"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=7037"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=7037"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}