{"id":7341,"date":"2025-10-18T23:37:50","date_gmt":"2025-10-19T04:37:50","guid":{"rendered":"https:\/\/lab.rivas.ai\/?p=7341"},"modified":"2025-10-18T23:41:21","modified_gmt":"2025-10-19T04:41:21","slug":"measuring-ai-safety-a-new-score-for-vision%e2%80%91language-models-in-public-services-2","status":"publish","type":"post","link":"https:\/\/lab.rivas.ai\/?p=7341","title":{"rendered":"Measuring AI Safety: A New Score for Vision\u2011Language Models in Public Services"},"content":{"rendered":"<section>\n<h2>TL;DR<\/h2>\n<ul>\n<li>We introduce a reproducible framework for stress\u2011testing vision\u2011language models (VLMs) against random noise and crafted adversarial attacks.<\/li>\n<li>Our core metric, the <em>Vulnerability Score<\/em>, combines a Noise Impact Score and an FGSM Impact Score with adjustable weights.<\/li>\n<li>Using the CLIP model on 1\u202f% of Caltech\u2011256, baseline accuracy (95\u202f%) fell to roughly 66\u201167\u202f% under Gaussian, Salt\u2011and\u2011Pepper, or Uniform noise, and to about 9\u202f% under a Fast Gradient Sign Method (FGSM) attack.<\/li>\n<li>The framework requires only a tiny subset of data, making it practical for public\u2011sector teams with limited resources.<\/li>\n<\/ul>\n<\/section>\n<section>\n<h2>Why it matters<\/h2>\n<p>Public\u2011sector AI systems, whether they support emergency response, medical triage, or critical infrastructure monitoring, must operate reliably under real\u2011world disturbances. A model that appears accurate in clean laboratory settings can fail catastrophically when confronted with sensor noise, weather\u2011induced image degradation, or malicious manipulation. Existing safety assessments focus almost exclusively on either random corruption or targeted adversarial attacks, leaving a blind spot for scenarios where both types of perturbations coexist. 
By quantifying how much performance degrades under each threat and merging the two effects into a single, tunable score, we give policymakers, engineers, and auditors a concrete yardstick to compare models, set deployment thresholds, and prioritize mitigation strategies. The ability to run the evaluation with only 1\u202f% of a standard benchmark (the Caltech\u2011256 dataset) also means that even small government labs can adopt the method without prohibitive compute costs.<\/p>\n<\/section>\n<section>\n<h2>How it works<\/h2>\n<p>Our methodology proceeds in three stages.<\/p>\n<ol>\n<li><strong>Incremental noise injection.<\/strong> We take a representative slice of the Caltech\u2011256 image collection (300 images, roughly 1\u202f% of the full set, covering every class). For each image, we add three types of statistical noise (Gaussian, Salt\u2011and\u2011Pepper, Uniform) in 0.01\u2011step increments until the model first misclassifies the image. The exact noise level that triggers failure is recorded.<\/li>\n<li><strong>Patch synthesis and saliency mapping.<\/strong> The recorded noise thresholds across all images are averaged to produce an \u201caverage noise patch\u201d for each noise family. These patches highlight the image regions most sensitive to corruption. We also generate saliency maps by back\u2011propagating the misclassification signal, revealing which pixels the model relies on most heavily.<\/li>\n<li><strong>Adversarial comparison.<\/strong> The classic Fast Gradient Sign Method (FGSM) is applied to the same image set as a reference point for crafted attacks. 
By comparing the effectiveness of the statistical patches with FGSM, we verify that our noise\u2011derived perturbations act as universal adversarial examples, even though they are created without any knowledge of the model\u2019s gradients.<\/li>\n<\/ol>\n<p>From these stages, we compute two intermediate metrics:<\/p>\n<ul>\n<li><em>Noise Impact Score<\/em>\u202f=\u202f(Baseline accuracy\u202f\u2212\u202fAccuracy under the average noise patch) \/ Baseline accuracy.<\/li>\n<li><em>FGSM Impact Score<\/em>\u202f=\u202f(Baseline accuracy\u202f\u2212\u202fAccuracy under FGSM) \/ Baseline accuracy.<\/li>\n<\/ul>\n<p>We then blend the two using a single weighted formula. The equation that defines the overall score is shown below.<\/p>\n<div class=\"eq\">\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 16px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-61ab942da03e45cb528e2ba3a572f42c_l3.png\" height=\"16\" width=\"802\" class=\"ql-img-displayed-equation quicklatex-auto-format\" 
alt=\"&#92;&#91;&#92;&#116;&#101;&#120;&#116;&#123;&#86;&#117;&#108;&#110;&#101;&#114;&#97;&#98;&#105;&#108;&#105;&#116;&#121;&#32;&#83;&#99;&#111;&#114;&#101;&#125;&#61;&#119;&#95;&#123;&#49;&#125;&#92;&#116;&#105;&#109;&#101;&#115;&#92;&#116;&#101;&#120;&#116;&#123;&#78;&#111;&#105;&#115;&#101;&#32;&#73;&#109;&#112;&#97;&#99;&#116;&#32;&#83;&#99;&#111;&#114;&#101;&#125;&#43;&#119;&#95;&#123;&#50;&#125;&#92;&#116;&#105;&#109;&#101;&#115;&#92;&#116;&#101;&#120;&#116;&#123;&#70;&#71;&#83;&#77;&#32;&#73;&#109;&#112;&#97;&#99;&#116;&#32;&#83;&#99;&#111;&#114;&#101;&#125;&#92;&#113;&#117;&#97;&#100;&#92;&#116;&#101;&#120;&#116;&#123;&#119;&#105;&#116;&#104;&#32;&#125;&#119;&#95;&#123;&#49;&#125;&#43;&#119;&#95;&#123;&#50;&#125;&#61;&#49;&#44;&#92;&#59;&#119;&#95;&#123;&#49;&#125;&#44;&#119;&#95;&#123;&#50;&#125;&#92;&#103;&#101;&#32;&#48;&#92;&#93;\" title=\"Rendered by QuickLaTeX.com\"\/><\/p>\n<\/div>\n<p>Because the weights <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-d2d5d26e6844b1c0fe60235c9b2228ab_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#119;&#95;&#123;&#49;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"11\" width=\"19\" style=\"vertical-align: -3px;\"\/> and <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-4a960160459310475c1d4083f1ba3252_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#119;&#95;&#123;&#50;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"11\" width=\"20\" style=\"vertical-align: -3px;\"\/> sum to one, the score can be tuned to reflect the risk profile of a particular deployment. 
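To make the weighting concrete, here is a minimal Python sketch of the two impact scores and their blend; the function names and the plugged\u2011in accuracy values are illustrative assumptions, not the paper\u2019s released code.

```python
# Hedged sketch of the Vulnerability Score computation.
# Accuracies are fractions in [0, 1]; the values below are placeholders.

def impact_score(baseline_acc, degraded_acc):
    """Relative accuracy drop: (baseline - degraded) / baseline."""
    return (baseline_acc - degraded_acc) / baseline_acc

def vulnerability_score(noise_impact, fgsm_impact, w1=0.5, w2=0.5):
    """Weighted blend of the two impact scores; weights are
    non-negative and must sum to one."""
    assert w1 >= 0 and w2 >= 0 and abs(w1 + w2 - 1.0) < 1e-9
    return w1 * noise_impact + w2 * fgsm_impact

baseline = 0.95    # clean top-1 accuracy
noise_acc = 0.667  # accuracy under the average noise patch (illustrative)
fgsm_acc = 0.093   # accuracy under FGSM (illustrative)

nis = impact_score(baseline, noise_acc)
fis = impact_score(baseline, fgsm_acc)
score = vulnerability_score(nis, fis)  # equal weights
```

Changing `w1` and `w2` simply shifts the score toward whichever threat a deployment cares about most, without re\u2011running any model evaluations.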
A disaster\u2011response scenario, for example, might give a higher weight to random noise (larger <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-d2d5d26e6844b1c0fe60235c9b2228ab_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#119;&#95;&#123;&#49;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"11\" width=\"19\" style=\"vertical-align: -3px;\"\/>), whereas a secure\u2011information\u2011handling pipeline might prioritize resistance to crafted attacks (larger <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-4a960160459310475c1d4083f1ba3252_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#119;&#95;&#123;&#50;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"11\" width=\"20\" style=\"vertical-align: -3px;\"\/>).<\/p>\n<\/section>\n<section>\n<h2>What we found<\/h2>\n<p>Running the full protocol on the CLIP model produced a striking degradation pattern.<\/p>\n<ul>\n<li><strong>Baseline performance.<\/strong> On clean Caltech\u2011256 images, the model achieved 95\u202f% top\u20111 accuracy.<\/li>\n<li><strong>Noise impact.<\/strong> Adding Gaussian noise reduced accuracy to 67.5\u202f%; Salt\u2011and\u2011Pepper lowered it to 66.8\u202f%; Uniform noise resulted in 66.6\u202f% accuracy. 
All three noise families cluster in a narrow 66\u201167\u202f% band, confirming that modest statistical perturbations are enough to cripple a VLM in realistic conditions.<\/li>\n<li><strong>Adversarial attack impact.<\/strong> The FGSM perturbation drove accuracy down to just 9.35\u202f%, an order\u2011of\u2011magnitude collapse from the 95\u202f% clean baseline.<\/li>\n<li><strong>Universal patches.<\/strong> The average noise patches created from the incremental protocol acted as universal adversarial perturbations: applying the same patch to previously unseen images caused misclassifications at rates comparable to the FGSM benchmark. This demonstrates that even simple, data\u2011driven noise patterns can be weaponized.<\/li>\n<li><strong>Vulnerability Scores.<\/strong> By choosing equal weights (<img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-38d4a6e26a0952244885d73a346f6a76_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#119;&#95;&#123;&#49;&#125;&#61;&#119;&#95;&#123;&#50;&#125;&#61;&#48;&#46;&#53;\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"110\" style=\"vertical-align: -3px;\"\/>), the CLIP model received a Vulnerability Score of roughly 0.60, indicating moderate resilience to noise but severe weakness to targeted attacks. 
Adjusting the weights to emphasize noise (e.g., <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-25060e6f19caddbb873f8ef1c1e85597_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#119;&#95;&#123;&#49;&#125;&#61;&#48;&#46;&#56;\" title=\"Rendered by QuickLaTeX.com\" height=\"15\" width=\"67\" style=\"vertical-align: -3px;\"\/>, <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-030521cb49f145df8d287886e754f104_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#119;&#95;&#123;&#50;&#125;&#61;&#48;&#46;&#50;\" title=\"Rendered by QuickLaTeX.com\" height=\"15\" width=\"66\" style=\"vertical-align: -3px;\"\/>) lowered the score to about 0.42, while a security\u2011focused weighting (<img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-6fdae3300f2a75f6f314e1c42231ff06_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#119;&#95;&#123;&#49;&#125;&#61;&#48;&#46;&#50;\" title=\"Rendered by QuickLaTeX.com\" height=\"15\" width=\"66\" style=\"vertical-align: -3px;\"\/>, <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-d0495df4fc05ac4fd946e198f31bc14c_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#119;&#95;&#123;&#50;&#125;&#61;&#48;&#46;&#56;\" title=\"Rendered by QuickLaTeX.com\" height=\"15\" width=\"67\" style=\"vertical-align: -3px;\"\/>) pushed the score toward 0.78, flagging the model as high\u2011risk for adversarial scenarios.<\/li>\n<\/ul>\n<p>These findings confirm two key hypotheses: (1) statistical noise patches can serve as inexpensive, universal adversarial tools, and (2) a single composite metric can capture the nuanced risk landscape that public\u2011sector deployments must navigate.<\/p>\n<\/section>\n<section>\n<h2>Limits and next 
steps<\/h2>\n<p>While the framework is practical and broadly applicable, several limitations deserve attention.<\/p>\n<ul>\n<li><strong>Computational intensity.<\/strong> Incrementally testing each noise level and generating saliency maps requires repeated forward passes. The runtime can become significant for larger datasets or more complex multimodal models. Future work will explore adaptive stepping strategies and surrogate models to reduce the number of evaluations.<\/li>\n<li><strong>Attack diversity.<\/strong> We focused on three statistical noises and the FGSM attack, which is a canonical but relatively weak adversary. More sophisticated attacks (e.g., Projected Gradient Descent, spatial transformations) may reveal additional weaknesses not captured by our current score.<\/li>\n<li><strong>Weight selection guidance.<\/strong> The flexibility of <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-d2d5d26e6844b1c0fe60235c9b2228ab_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#119;&#95;&#123;&#49;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"11\" width=\"19\" style=\"vertical-align: -3px;\"\/> and <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-4a960160459310475c1d4083f1ba3252_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#119;&#95;&#123;&#50;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"11\" width=\"20\" style=\"vertical-align: -3px;\"\/> is a strength, but users need practical guidance for choosing them. 
In follow\u2011up studies, we plan to develop scenario\u2011based templates, such as \u201cdisaster response\u201d (high <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-d2d5d26e6844b1c0fe60235c9b2228ab_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#119;&#95;&#123;&#49;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"11\" width=\"19\" style=\"vertical-align: -3px;\"\/>) and \u201csecure diagnostics\u201d (high <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-4a960160459310475c1d4083f1ba3252_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#119;&#95;&#123;&#50;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"11\" width=\"20\" style=\"vertical-align: -3px;\"\/>), to aid decision makers.<\/li>\n<li><strong>Generalization to other modalities.<\/strong> Our proof\u2011of\u2011concept used CLIP, a pure image\u2011text model. Extending the protocol to video\u2011language, audio\u2011visual, or multimodal sensor fusion models will test the robustness of the Vulnerability Score across the broader AI ecosystem used by government agencies.<\/li>\n<\/ul>\n<p>By addressing these gaps, we aim to evolve the framework into a standard safety\u2011verification toolkit for any high\u2011stakes AI deployment.<\/p>\n<\/section>\n<section>\n<h2>FAQ<\/h2>\n<p><strong>Q: How much data do we really need to run the evaluation?<\/strong><br \/>\nA: Our experiments showed that a 300\u2011image sample, about 1\u202f% of the Caltech\u2011256 benchmark, captures the full class diversity and yields stable Vulnerability Scores. 
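As a concrete illustration of how such a small, class\u2011covering sample can be drawn, here is a hedged sketch; the function and the balanced\u2011class example are our own assumptions, not the paper\u2019s sampling code.

```python
import random
from collections import defaultdict

def stratified_subset(labels, fraction=0.01, seed=0):
    """Sample roughly `fraction` of the dataset indices while
    guaranteeing at least one image from every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Target size, but never fewer than one image per class.
    target = max(int(len(labels) * fraction), len(by_class))
    per_class = max(1, round(target / len(by_class)))
    subset = []
    for idxs in by_class.values():
        subset.extend(rng.sample(idxs, min(per_class, len(idxs))))
    return sorted(subset)

# Example: 1000 images across 10 balanced classes -> one index per class.
labels = [i % 10 for i in range(1000)]
picked = stratified_subset(labels, fraction=0.01)
```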
This small footprint was sufficient to expose the accuracy drops reported above, making the method accessible to organizations without large\u2011scale compute clusters.<\/p>\n<p><strong>Q: Can the Vulnerability Score be compared across different VLM architectures?<\/strong><br \/>\nA: Yes. Because the score is normalized by the model\u2019s own baseline accuracy, it reflects relative degradation rather than absolute performance. To compare architectures, each model is evaluated on the same noise\u2011increment protocol, and the resulting scores are plotted side\u2011by\u2011side. The adjustable weights let stakeholders emphasize the threat most relevant to their use case.<\/p>\n<\/section>\n<footer><p>Read the paper: <a href=\"https:\/\/arxiv.org\/abs\/2502.16361\">A Framework for Evaluating Vision-Language Model Safety: Building Trust in AI for Public Sector Applications<\/a><\/p>\n<p><em>Citation:<\/em> Rashid, M. B., &amp; Rivas, P. (2025). A Framework for Evaluating Vision-Language Model Safety: Building Trust in AI for Public Sector Applications. In Proceedings of AAAI 2025 Workshop on AI for Public Missions at the 39th Annual AAAI Conference on Artificial Intelligence (pp. 1\u20134). 
Philadelphia, PA, USA.<\/p>\n<\/footer>\n","protected":false},"excerpt":{"rendered":"<p>Introducing a novel framework to stress-test vision-language models against noise and attacks, featuring the Vulnerability Score metric, which combines impact scores with adjustable weights, ensuring AI safety in public services with minimal data requirements and robust results.<\/p>\n","protected":false},"author":11,"featured_media":7340,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1],"tags":[2],"class_list":["post-7341","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized","tag-adversarial-ml"],"jetpack_featured_media_url":"https:\/\/lab.rivas.ai\/wp-content\/uploads\/2025\/10\/05_featured_image-1.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=\/wp\/v2\/posts\/7341","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=\/wp\/v2\/users\/11"}],"replies":[{"embeddable":true,"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=7341"}],"version-history":[{"count":3,"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=\/wp\/v2\/posts\/7341\/revisions"}],"predecessor-version":[{"id":7345,"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=\/wp\/v2\/posts\/7341\/revisions\/7345"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=\/wp\/
v2\/media\/7340"}],"wp:attachment":[{"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=7341"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=7341"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=7341"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}