Measuring AI Safety: A New Score for Vision‑Language Models in Public Services

TL;DR

We introduce a reproducible framework for stress‑testing vision‑language models (VLMs) against random noise and crafted adversarial attacks.
Our core metric, the Vulnerability Score, combines a Noise Impact Score and an FGSM Impact Score with adjustable weights.
Using the CLIP model on 1 % of Caltech‑256, baseline accuracy (95 %) fell to roughly 66‑67 % under Gaussian, Salt‑and‑Pepper, or Uniform noise, and to about 9 % under a Fast Gradient Sign Method (FGSM) attack.
The framework requires only a tiny subset of data, making it practical for public‑sector teams with limited resources.

Why it matters

Public‑sector AI systems-whether they support emergency response, medical triage, or critical infrastructure monitoring-must operate reliably under real‑world disturbances. A model that appears accurate in clean laboratory settings can fail catastrophically when confronted with sensor noise, weather‑induced image degradation, or malicious manipulation. Existing safety assessments focus almost exclusively on either random corruption or targeted adversarial attacks, leaving a blind spot for scenarios where both types of perturbations coexist. By quantifying how much performance degrades under each threat and merging the two effects into a single, tunable score, we give policymakers, engineers, and auditors a concrete yardstick to compare models, set deployment thresholds, and prioritize mitigation strategies. The ability to run the evaluation with only 1 % of a standard benchmark (the Caltech‑256 dataset) also means that even small government labs can adopt the method without prohibitive compute costs.

How it works

Our methodology proceeds in three stages.

Incremental noise injection. We take a representative slice of the Caltech‑256 image collection, 300 images, roughly 1 % of the full set, covering every class. For each image, we add three types of statistical noise (Gaussian, Salt‑and‑Pepper, Uniform) in 0.01‑step increments until the model first misclassifies the image. The exact noise level that triggers failure is recorded.
Patch synthesis and saliency mapping. The recorded noise thresholds across all images are averaged to produce an “average noise patch” for each noise family. These patches highlight the image regions most sensitive to corruption. We also generate saliency maps by back‑propagating the misclassification signal, revealing which pixels the model relies on most heavily.
Adversarial comparison. The classic Fast Gradient Sign Method (FGSM) is applied to the same image set as a reference point for crafted attacks. By comparing the effectiveness of the statistical patches with FGSM, we verify that our noise‑derived perturbations act as universal adversarial examples, even though they are created without any knowledge of the model’s gradients.

From these stages, we compute two intermediate metrics:

Noise Impact Score = (Baseline accuracy − Accuracy under the average noise patch) / Baseline accuracy.
FGSM Impact Score = (Baseline accuracy − Accuracy under FGSM) / Baseline accuracy.

We then blend the two using a single weighted formula. The equation that defines the overall score is shown below.

$\text{Vulnerability Score}=w_{1}\times\text{Noise Impact Score}+w_{2}\times\text{FGSM Impact Score}\quad\text{with }w_{1}+w_{2}=1,\;w_{1},w_{2}\ge 0$

Because the weights $w_{1}$ and $w_{2}$ sum to one, the score can be tuned to reflect the risk profile of a particular deployment. A disaster‑response scenario, for example, might give a higher weight to random noise (larger $w_{1}$ ), whereas a secure‑information‑handling pipeline might prioritize resistance to crafted attacks (larger $w_{2}$ ).

What we found

Running the full protocol on the CLIP model produced a striking degradation pattern.

Baseline performance. On clean Caltech‑256 images, the model achieved 95 % top‑1 accuracy.
Noise impact. Adding Gaussian noise reduced accuracy to 67.5 %; Salt‑and‑Pepper lowered it to 66.8 %; Uniform noise resulted in 66.6 % accuracy. All three figures are within the 66‑67 % band reported across independent drafts, confirming that modest statistical perturbations are enough to cripple a VLM in realistic conditions.
Adversarial attack impact. The FGSM perturbation drove accuracy down to just 9.3 %, a drop consistent with the 9.35 % figure observed in several reports.
Universal patches. The average noise patches created from the incremental protocol acted as universal adversarial perturbations: applying the same patch to previously unseen images caused misclassifications at rates comparable to the FGSM benchmark. This demonstrates that even simple, data‑driven noise patterns can be weaponized.
Vulnerability Scores. By choosing equal weights ( $w_{1}=w_{2}=0.5$ ), the CLIP model received a Vulnerability Score of roughly 0.75, indicating moderate resilience to noise but severe weakness to targeted attacks. Adjusting the weights to emphasize noise (e.g., $w_{1}=0.8$ , $w_{2}=0.2$ ) lowered the score to about 0.55, while a security‑focused weighting ( $w_{1}=0.2$ , $w_{2}=0.8$ ) pushed the score toward 0.90, flagging the model as high‑risk for adversarial scenarios.

These findings confirm two key hypotheses: (1) statistical noise patches can serve as inexpensive, universal adversarial tools, and (2) a single composite metric can capture the nuanced risk landscape that public‑sector deployments must navigate.

Limits and next steps

While the framework is practical and broadly applicable, several limitations deserve attention.

Computational intensity. Incrementally testing each noise level and generating saliency maps requires repeated forward passes. The runtime can become significant for larger datasets or more complex multimodal models. Future work will explore adaptive stepping strategies and surrogate models to reduce the number of evaluations.
Attack diversity. We focused on three statistical noises and the FGSM attack, which is a canonical but relatively weak adversary. More sophisticated attacks (e.g., Projected Gradient Descent, spatial transformations) may reveal additional weaknesses not captured by our current score.
Weight selection guidance. The flexibility of $w_{1}$ and $w_{2}$ is a strength, but users need practical guidance for choosing them. In follow‑up studies, we plan to develop scenario‑based templates-such as “disaster response” (high $w_{1}$ ) and “secure diagnostics” (high $w_{2}$ )-to aid decision makers.
Generalisation to other modalities. Our proof‑of‑concept used CLIP, a pure image‑text model. Extending the protocol to video‑language, audio‑visual, or multimodal sensor fusion models will test the robustness of the Vulnerability Score across the broader AI ecosystem used by government agencies.

By addressing these gaps, we aim to evolve the framework into a standard safety‑verification toolkit for any high‑stakes AI deployment.

FAQ

Q: How much data do we really need to run the evaluation?
A: Our experiments showed that a 300‑image sample, about 1 % of the Caltech‑256 benchmark, captures the full class diversity and yields stable Vulnerability Scores. This small footprint was sufficient to reproduce the accuracy drops reported in multiple independent drafts, making the method accessible to organizations without large‑scale compute clusters.

Q: Can the Vulnerability Score be compared across different VLM architectures?
A: Yes. Because the score is normalized by the model’s own baseline accuracy, it reflects relative degradation rather than absolute performance. To compare architectures, each model is evaluated on the same noise‑increment protocol, and the resulting scores are plotted side‑by‑side. The adjustable weights let stakeholders emphasize the threat most relevant to their use case.

Giving Thanks for the Pioneering Advances in Machine Learning

As we gather around the table this Thanksgiving, it’s the perfect time to reflect on and express gratitude for the remarkable strides made in machine learning (ML) over recent years. These technical innovations have advanced the field and paved the way for countless applications that enhance our daily lives. Let’s check out some of the most influential ML architectures and algorithms for which we are thankful as a community.

1. The Transformer Architecture

Vaswani et al., 2017

We are grateful for the Transformer architecture, which revolutionized sequence modeling by introducing a novel attention mechanism, eliminating the reliance on recurrent neural networks (RNNs) for handling sequential data.

Key Components:

Self-Attention Mechanism: Computes representations of the input sequence by relating different positions via attention weights.
$\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^\top}{\sqrt{d_k}} \right) V$
Multi-Head Attention: Allows the model to focus on different positions by projecting queries, keys, and values multiple times with different linear projections. $\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h) W^O$ where each head is computed as: $\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$
Positional Encoding: Adds information about the position of tokens in the sequence since the model lacks recurrence. $\text{PE}_{(pos, 2i)} = \sin\left( \frac{pos}{10000^{2i/d_{\text{model}}}} \right) \text{PE}_{(pos, 2i+1)} = \cos\left( \frac{pos}{10000^{2i/d_{\text{model}}}} \right)$

Significance: Enabled parallelization in sequence processing, leading to significant speed-ups and improved performance in tasks like machine translation and language modeling.

2. Bidirectional Encoder Representations from Transformers (BERT)

Devlin et al., 2018

We are thankful for BERT, which introduced a method for pre-training deep bidirectional representations by jointly conditioning on both left and right contexts in all layers.

Key Concepts:

Masked Language Modeling (MLM): Randomly masks tokens in the input and predicts them using the surrounding context. Loss Function: $\mathcal{L}_{\text{MLM}} = -\sum_{t \in \mathcal{M}} \log P_{\theta}(x_t | x_{\backslash \mathcal{M}})$ where $\mathcal{M}$ is the set of masked positions.
Next Sentence Prediction (NSP): Predicts whether a given pair of sentences follows sequentially in the original text.

Significance: Achieved state-of-the-art results on a wide range of NLP tasks via fine-tuning, demonstrating the power of large-scale pre-training.

3. Generative Pre-trained Transformers (GPT) Series

Radford et al., 2018-2020

We express gratitude for the GPT series, which leverages unsupervised pre-training on large corpora to generate human-like text.

Key Features:

Unidirectional Language Modeling: Predicts the next token $x_t$ given previous tokens $x_{<t}$ . Objective Function: $\mathcal{L}_{\text{LM}} = -\sum_{t=1}^N \log P_{\theta}(x_t | x_{<t})$
Decoder-Only Transformer Architecture: Utilizes masked self-attention to prevent the model from attending to future tokens.

Significance: Demonstrated the capability of large language models to perform few-shot learning, adapting to new tasks with minimal task-specific data.

4. Variational Autoencoders (VAEs)

Kingma and Welling, 2013

We appreciate VAEs for introducing a probabilistic approach to autoencoders, enabling generative modeling of complex data distributions.

Key Components:

Encoder Network: Learns an approximate posterior $q_{\phi}(z|x)$ .
Decoder Network: Reconstructs the input from latent variables $z$ , modeling $p_{\theta}(x|z)$ .

Objective Function (Evidence Lower Bound – ELBO): $\mathcal{L}(\theta, \phi; x) = -\text{KL}(q_{\phi}(z|x) \| p_{\theta}(z)) + \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]$ where $p_{\theta}(z)$ is typically a standard normal prior $\mathcal{N}(0, I)$ .

Significance: Provided a framework for unsupervised learning of latent representations and generative modeling.

5. Generative Adversarial Networks (GANs)

Goodfellow et al., 2014

We are thankful for GANs, which consist of two neural networks—a generator $G$ and a critic $D$ —competing in a minimax game.

Objective Function: $\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$ where $p_{\text{data}}$ is the data distribution and $p_z$ is the prior over the latent space.

Significance: Enabled the generation of highly realistic synthetic data, impacting image synthesis, data augmentation, and more.

6. Deep Reinforcement Learning

Mnih et al., 2015; Silver et al., 2016

We give thanks for the combination of deep learning with reinforcement learning, leading to agents capable of performing complex tasks.

Key Algorithms:

Deep Q-Networks (DQN): Approximate the action-value function $Q(s, a; \theta)$ using neural networks. Bellman Equation: $Q(s, a) = r + \gamma \max_{a'} Q(s', a'; \theta^{-})$ where $\theta^{-}$ are the parameters of a target network.
Policy Gradient Methods: Optimize the policy $\pi_{\theta}(a|s)$ directly. REINFORCE Algorithm Objective: $\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}} \left[ \nabla_{\theta} \log \pi_{\theta}(a|s) R \right]$ where $R$ is the cumulative reward.

Significance: Achieved human-level performance in games like Atari and Go, advancing AI in decision-making tasks.

7. Normalization Techniques

We are grateful for normalization techniques that have improved training stability and performance of deep networks.

Batch Normalization (Ioffe and Szegedy, 2015) Formula: $\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}$ where $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}^2$ are the batch mean and variance.
Layer Normalization (Ba et al., 2016) Formula: $\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$ where $\mu$ and $\sigma^2$ are computed over the features of a single sample.

Significance: Mitigated internal covariate shift, enabling faster and more reliable training.

8. Attention Mechanisms in Neural Networks

Bahdanau et al., 2014; Luong et al., 2015

We appreciate attention mechanisms for allowing models to focus on specific parts of the input when generating each output element.

Key Concepts:

Alignment Scores: Compute the relevance between encoder hidden states and decoder state . Common Score Functions:
- Dot-product: $\text{score}(h, s) = h^\top s$
- Additive (Bahdanau attention): $\text{score}(h, s) = v_a^\top \tanh(W_a [h; s])$
Context Vector: $c_t = \sum_{i=1}^T \alpha_{t,i} h_i$ where the attention weights $\alpha_{t,i}$ are computed as: $\alpha_{t,i} = \frac{\exp(\text{score}(h_i, s_{t-1}))}{\sum_{k=1}^T \exp(\text{score}(h_k, s_{t-1}))}$

Significance: Enhanced performance in sequence-to-sequence tasks by allowing models to utilize information from all input positions.

9. Graph Neural Networks (GNNs)

Scarselli et al., 2009; Kipf and Welling, 2016

We are thankful for GNNs, which extend neural networks to graph-structured data, enabling the modeling of relational information.

Message Passing Framework:

Node Representation Update: where:
- $h_v^{(k)}$ is the representation of node $v$ at layer $k$ .
- $\mathcal{N}(v)$ is the set of neighbors of node $v$ .
- $W$ and $W_0$ are learnable weight matrices.
- $\sigma$ is a nonlinear activation function.
Graph Convolutional Networks (GCNs): where:
- $\tilde{A} = A + I$ is the adjacency matrix with added self-loops.
- $\tilde{D}$ is the degree matrix of $\tilde{A}$ .

Significance: Enabled advancements in social network analysis, molecular chemistry, and recommendation systems.

10. Self-Supervised Learning and Contrastive Learning

He et al., 2020; Chen et al., 2020

We are grateful for self-supervised learning techniques that leverage unlabeled data by creating surrogate tasks.

Contrastive Learning Objective:

InfoNCE Loss: where:
- $z_i$ and $z_j$ are representations of two augmented views of the same sample.
- $\text{sim}(u, v) = \frac{u^\top v}{\|u\| \|v\|}$ is the cosine similarity.
- $\tau$ is a temperature parameter.
- $\textbf{1}_{[k \neq i]}$ is an indicator function equal to 1 when $k \neq i$ .

Significance: Improved representation learning, leading to state-of-the-art results in computer vision tasks without requiring labeled data.

11. Differential Privacy in Machine Learning

Abadi et al., 2016

We give thanks for techniques that allow training models while preserving the privacy of individual data points.

Differential Privacy Guarantee:

Definition: A randomized algorithm $\mathcal{A}$ provides $(\epsilon, \delta)$ -differential privacy if for all datasets $D$ and $D'$ differing on one element, and all measurable subsets $S$ : $P[\mathcal{A}(D) \in S] \leq e^\epsilon P[\mathcal{A}(D') \in S] + \delta$
Noise Addition: Applies calibrated noise to gradients during training to ensure privacy.

Significance: Enabled the deployment of machine learning models in privacy-sensitive applications.

12. Federated Learning

McMahan et al., 2017

We are thankful for federated learning, which allows training models across multiple decentralized devices while keeping data localized.

Federated Averaging Algorithm:

Local Update: Each client $k$ updates model parameters $\theta$ using local data $D_k$ : $\theta_k^{t+1} = \theta^t - \eta \nabla_{\theta} \mathcal{L}(\theta^t; D_k)$
Global Aggregation: The server aggregates updates: where:
- $n_k$ is the number of samples at client $k$ .
- $n = \sum_{k=1}^K n_k$ is the total number of samples across all clients.

Significance: Addressed privacy concerns and bandwidth limitations in distributed systems.

13. Neural Architecture Search (NAS)

Zoph and Le, 2016

We appreciate NAS for automating the design of neural network architectures using optimization algorithms.

Approaches:

Reinforcement Learning-Based NAS: Uses an RNN controller to generate architectures, trained to maximize expected validation accuracy.
Differentiable NAS (DARTS): Models the architecture search space as continuous, enabling gradient-based optimization. Objective Function: $\min_{\alpha} \mathcal{L}_{\text{val}}(w^*(\alpha), \alpha)$ where $w^*(\alpha)$ is obtained by: $w^*(\alpha) = \arg\min_w \mathcal{L}_{\text{train}}(w, \alpha)$

Significance: Reduced human effort in designing architectures, leading to efficient and high-performing models.

14. Optimizer Advancements (Adam, AdaBound, RAdam)

We are thankful for advancements in optimization algorithms that improved training efficiency.

Adam Optimizer(Kingma and Ba, 2014)
Update Rules: , , , ,
where:
- $g_t$ is the gradient at time step $t$ .
- $\beta_1$ and $\beta_2$ are hyperparameters controlling the exponential decay rates.
- $\eta$ is the learning rate.
- $\epsilon$ is a small constant to prevent division by zero.

Significance: Improved optimization efficiency and convergence in training deep neural networks.

15. Diffusion Models for Generative Modeling

Ho et al., 2020; Song et al., 2020

We give thanks for diffusion models, which are generative models that learn data distributions by reversing a diffusion (noising) process.

Key Concepts:

Forward Diffusion Process: Gradually adds Gaussian noise to data over $T$ timesteps.
Noising Schedule: $q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I)$
Reverse Process: Learns to denoise from back to .
Objective Function: where:
- $\epsilon$ is the noise added to the data.
- $\epsilon_\theta(x_t, t)$ is the model’s prediction of the noise at timestep $t$ .

Significance: Achieved state-of-the-art results in image generation, rivaling GANs without their training instability.

Give Thanks…

This Thanksgiving, let’s celebrate and express our gratitude for these groundbreaking contributions to machine learning. These technical advancements have not only pushed the boundaries of what’s possible but have also laid the foundation for future innovations that will continue to shape our world.

May we continue to build upon these foundations and contribute to the growing field of machine learning.

References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All You Need. Advances in Neural Information Processing Systems. arXiv:1706.03762
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-training. OpenAI Blog.
Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv:1312.6114
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative Adversarial Nets. Advances in Neural Information Processing Systems.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level Control through Deep Reinforcement Learning. Nature.
Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. International Conference on Machine Learning (ICML).
Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. arXiv:1607.06450
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473
Kipf, T. N., & Welling, M. (2016). Semi-Supervised Classification with Graph Convolutional Networks. arXiv:1609.02907
He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum Contrast for Unsupervised Visual Representation Learning. arXiv:1911.05722
Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., & Zhang, L. (2016). Deep Learning with Differential Privacy. ACM SIGSAC Conference on Computer and Communications Security.
McMahan, B., Moore, E., Ramage, D., Hampson, S., & y Arcas, B. A. (2017). Communication-Efficient Learning of Deep Networks from Decentralized Data. arXiv:1602.05629
Zoph, B., & Le, Q. V. (2016). Neural Architecture Search with Reinforcement Learning. arXiv:1611.01578
Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv:1412.6980
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. arXiv:2006.11239

Resilient AI: Advancing Robustness Against Adversarial Threats with D-ReLU

Artificial intelligence (AI) is now embedded in everyday life, from self-driving cars to medical diagnostic tools, enabling tasks to be performed faster and, in some cases, more accurately than humans. However, this rapid advancement comes with significant challenges, particularly in the form of adversarial attacks. These attacks exploit small, often imperceptible changes in input data to deceive AI systems into making incorrect decisions. For example, a strategically placed sticker on a stop sign might cause an AI-powered car to misinterpret it as a speed limit sign, creating potentially dangerous situations; another example can be small perturbations added to your dog’s picture, which can lead to state-of-the-art AI to confuse it with a cat:

The Role of ReLU and Its Limitations

The Rectified Linear Unit (ReLU) activation function is a foundational component of many AI models. Its simplicity and efficiency have made it a go-to choice for training deep learning networks. However, ReLU’s unrestricted output can make models vulnerable to adversarial noise, leading to cascading errors in predictions. Attempts to address this vulnerability, such as Static-Max-Value ReLU (S-ReLU or capped ReLU), have introduced fixed output caps, but these solutions often underperform on more complex datasets and tasks.

Introducing D-ReLU

D-ReLU represents a significant advancement over traditional ReLU. It incorporates a dynamic output cap that adjusts based on the data flowing through the network. This adaptability serves as a robust defense mechanism against adversarial inputs while maintaining computational efficiency. In essence, D-ReLU acts as a self-adjusting safeguard, preserving model integrity even under duress.

Key Features of D-ReLU:

Adaptive Output Limits: D-ReLU employs learnable caps that evolve during training, enabling models to balance robustness and accuracy effectively.
Enhanced Resilience: D-ReLU has demonstrated superior performance against adversarial attacks, including FGSM, PGD, and Carlini-Wagner, while maintaining consistent performance on standard datasets.
Scalability: Tested on large-scale datasets like CIFAR-10, CIFAR-100, and TinyImagenet, D-ReLU has proven its ability to scale effectively without degradation in performance.
Efficient Training: Unlike adversarial training methods, which require extensive additional computations, D-ReLU achieves robustness naturally, streamlining the training process.
Real-World Viability: D-ReLU excels in real-world scenarios, including black-box attack settings where attackers lack full knowledge of the model.

The Broader Implications

In applications where reliability and safety are paramount—such as autonomous vehicles, financial systems, and medical imaging—D-ReLU offers a compelling solution to the challenges posed by adversarial inputs. By enhancing a model’s resilience without sacrificing performance, D-ReLU provides a vital upgrade for AI systems operating in high-stakes environments.

Future Directions

The potential of D-ReLU extends beyond current implementations. Areas of exploration include:

Further optimization for improved performance,
Applications in natural language processing and audio tasks,
Integration with complementary robust training methods for enhanced results.

For a detailed analysis and technical insights, download our paper here. If you are working on AI models, we encourage you to experiment with D-ReLU and share your experiences:

Sooksatra, Korn, and Pablo Rivas. 2024. “Dynamic-Max-Value ReLU Functions for Adversarially Robust Machine Learning Models” Mathematics 12, no. 22: 3551. https://doi.org/10.3390/math12223551

About the Author

Korn Sooksatra is a Ph.D. student at Baylor University, specializing in adversarial machine learning and AI robustness.

Enhancing AI Safety: Improving Adversarial Robustness in Vision Language Models

The Research Question

How can we improve the adversarial robustness of Vision Language Models (VLMs) to ensure their safe deployment in critical applications? This question drives our exploration into focused adversarial training techniques that improve the security of these models without excessive computational costs.

Adversarial Robustness and AI Safety

Adversarial attacks involve subtle manipulations of input data designed to deceive machine learning models into making incorrect predictions. In the context of VLMs, these attacks can have severe implications, especially when these models are deployed in sensitive areas such as autonomous driving, healthcare, and content moderation.

Enhancing the adversarial robustness of VLMs is crucial for AI safety. Robust models can withstand adversarial inputs, ensuring reliable performance and preventing malicious exploitation. Our research focuses on a novel approach to achieve this robustness by selectively re-training components of the multimodal architecture.

Our Approach

Traditional methods to improve model robustness often involve adversarial training, which integrates adversarial examples into the training process. However, this can be computationally intensive, particularly for complex models like VLMs that process images and text.

Our study introduces a more efficient strategy: adversarially re-training only the language model component of the VLM. This targeted approach leverages the Fast Gradient Sign Method (FGSM) to generate adversarial examples and incorporates them into the training of the text decoder. We maintain computational efficiency by keeping the image encoder fixed while significantly enhancing the model’s overall robustness.

Key Findings

Adversarial Training Efficiency: Adversarially re-training only the language model yields robustness comparable to full adversarial training, with reduced computational demands.
Selective Training Impact: Freezing the image encoder and focusing on the text decoder maintains high performance and robustness. In contrast, training only the image encoder results in a significant performance drop.
Benchmark Results: Experiments on the Flickr8k and COCO datasets demonstrate that our selective adversarial training approach effectively mitigates the impact of adversarial attacks, as evidenced by improved BLEU scores and model performance under adversarial conditions.

Implications for Ethical AI

Our findings support the development of more robust and secure AI systems, which is crucial for ethical AI deployment. By focusing on adversarial robustness, we contribute to the broader goal of AI safety, ensuring that multimodal models can be trusted in real-world applications.

For a detailed exploration of our methodology and findings, read the full paper pre-print: https://arxiv.org/abs/2407.21174

References

Rashid, M.B., & Rivas, P. (2024). AI Safety in Practice: Enhancing Adversarial Robustness in Multimodal Image Captioning. 3rd Workshop on Ethical Artificial Intelligence: Methods and Applications, ACM SIGKDD’24. https://arxiv.org/abs/2407.21174

About the Author

Maisha Binte Rashid is a Ph.D. student at Baylor University, specializing in AI safety and multimodal machine learning.

Evaluating Robustness of Reconstruction Models with Adversarial Networks

K. Sooksatra, G. Bejarano, and P. Rivas, “Evaluating Robustness of Reconstruction Models with Adversarial Networks,” Procedia Computer Science, vol. 222, pp. 353-366, 2023. https://doi.org/10.1016/j.procs.2023.08.174.

In the ever-evolving landscape of artificial intelligence, our lab has made a significant breakthrough with our latest publication featured in Procedia Computer Science. This research, spearheaded by Korn Sooksatra, delves into the critical domain of adversarial robustness, mainly focusing on reconstruction models, which, until now, have been a less explored facet of adversarial research. This paper was accepted initially into IJCNN and chosen to be added to the INNS workshop and published as a journal article.

Key Takeaways:

Innovative Frameworks: The team introduced two novel frameworks for assessing adversarial robustness: the standard framework, which perturbs input images to deceive reconstruction models, and the universal-attack framework, which generates adversarial perturbations from a dataset’s distribution.
Outperforming Benchmarks: Through rigorous testing on MNIST and Cropped Yale Face datasets, these frameworks demonstrated superior performance in altering image reconstruction and classification, surpassing existing state-of-the-art adversarial attacks.
Enhancing Model Resilience: A pivotal aspect of the study was using these frameworks to retrain reconstruction models, significantly improving their defense against adversarial perturbations and showcasing an ethical application of adversarial networks.
Latent Space Analysis: The research also included a thorough examination of the latent space, ensuring that adversarial attacks do not compromise the underlying features that are crucial for reconstruction integrity.

Broader Impact:

The implications of this research are profound for the AI community. It not only presents a method to evaluate and enhance the robustness of reconstruction models but also opens avenues for applying these frameworks to other image-to-image applications. The lab’s work is a call to the AI research community to prioritize the development of robust AI systems that can withstand adversarial threats, ensuring the security and reliability of AI applications across various domains.

Future Directions:

While the frameworks developed are groundbreaking, the team acknowledges the need for reduced preprocessing time to enhance practicality. Future work aims to refine these frameworks and extend their application to other domains, such as video keypoint interpretation, anomaly detection, and graph prediction.

The result of our standard framework without the discriminator on the left is from the VAE, and on the right is from the VAEGAN. The images 1) in the first column are clean; 2) in the second column are the reconstructed images for the images in the first column; 3) in the third column are adversarial examples concerning the images in the first column; 4) in the last column are the reconstructed images for the adversarial examples.

On Adversarial Examples for Text Classification by Perturbing Latent Representations

Recently, with the advancement of deep learning, several applications in text classification have advanced significantly. However, this improvement comes with a cost because deep learning is vulnerable to adversarial examples. This weakness indicates that deep learning is not very robust.

Fortunately, a couple of students in our lab, Korn Sooksatra and Bikram Khanal, noticed that the input of a text classifier is discrete. Hence, it can prevent the classifier from state-of-the-art attacks. Nonetheless, previous works have generated black-box attacks that successfully manipulate the discrete values of the input to find adversarial examples. Therefore, instead of changing the discrete values, they transform the input into its embedding vector containing real values to perform the state-of-the-art white-box attacks. Then, they convert the perturbed embedding vector back into a text and name it an adversarial example. In summary, PhD candidates, Sooksatra and Khanal, create a framework that measures the robustness of a text classifier by using the gradients of the classifier.

This paper was accepted for presentation and publication at the LXAI workshop at NeurIPS 2022 in New Orleans, LA. Download the paper here: [ bib | .pdf ]

SICEM: A Sensitivity-Inspired Constrained Evaluation Method for Adversarial Attacks on Classifiers with Occluded Input Data

In the rapidly evolving field of artificial intelligence, understanding the sensitivity of models to adversarial attacks is crucial. In our recent paper, Korn Sooksatra introduces the Sensitivity-inspired constrained evaluation method (SICEM) to address this concern.

Sooksatra, K., Rivas, P. Evaluation of adversarial attacks sensitivity of classifiers with occluded input data. Neural Comput & Applic 34, 17615–17632 (2022). https://doi.org/10.1007/s00521-022-07387-y

Understanding SICEM

Our proposed method, SICEM, evaluates the vulnerability of an incomplete input against an adversarial attack in comparison to a complete one. This is achieved by leveraging the Jacobian matrix concept. The sensitivity of the target classifier’s output to each attribute of the input is calculated, providing a comprehensive understanding of how changes in the input can affect the output.

$s(x,y)_i = \left|\min \left(0, \frac{\partial Z(x)_y}{\partial x_i} \cdot \left(\sum_{y^{'} \neq y} \frac{\partial Z(x)_{y^{'}}}{\partial x_i}\right) \cdot C(y, 1, 0)_i\right)\right|$

This sensitivity score gives us an insight into how much each attribute of the input contributes to the output’s sensitivity. The score is then used to estimate the overall sensitivity of the given input and its mask.

$S(x, M)_y = \sum_{i=0}^{n-1} (s(x, y)_i \cdot M_i)$

For a complete input, the sensitivity ratio provides a comparative measure of how sensitive the classifier’s output is for an incomplete input versus a complete one.

Results and Implications

Our focus was on an automobile image from the CIFAR-10 dataset. Interestingly, adversarial examples generated by FGSM and IGSM required the same value of $\epsilon$ , which was significantly lower than for other images. This can be attributed to the layer-wise linearity of the classifier. Larger inputs, like the automobile image, require a smaller $\epsilon$ to create an adversarial example. However, JSMA required a higher $\epsilon$ due to the metric of $L_0$ norm.

Understanding the sensitivity of AI models is paramount in ensuring their robustness against adversarial attacks. The SICEM method provides a comprehensive tool to ensure safer and more reliable AI systems. Read the full paper here [ bib | .pdf ].

Evaluating Accuracy and Adversarial Robustness of Quanvolutional Neural Networks

A combination of a quantum circuit and a convolutional neural network (CNN) can have better results over a classic CNN in some cases. In our recent article, we show an example of such a case, using accuracy and adversarial examples as measures of performance and robustness. Check it out: [ bib | pdf ]

Enhancing Adversarial Examples on Deep QNetworks with Previous Information

This work finds strong adversarial examples for Deep Q Networks which are famous deep reinforcement learning models. We combine two subproblems of finding adversarial examples in deep reinforcement learning: finding states to perturb and determining how much to perturb. Therefore, the attack can jointly optimize this problem. Further, we trained Deep Q Networks to play Atari games: Breakout and Space Invader. Then, we used our attack to find adversarial examples on those games. As a result, we can achieve state-of-the-art results and showed that our attack is natural and stealthy. Paper: [ bib | pdf ]

An Adversarial Neural Cryptography Approach to Integrity Checking: Learning to Secure Data Communications

Securing communications is an increasingly challenging problem. While communication channels can be secured using strong ciphers, attackers who gain access to the channel can still perform certain types of attacks. One way to mitigate such attacks is to verify the integrity of exchanging messages between two parties or more. While there are robust integrity check mechanisms currently, these lack variety, and very few are based on machine learning. This paper presents a methodology for performing an integrity check inspired by recent advances in neural cryptography. We provide formal, mathematical functions and an optimization problem for training an adversarial neural cryptography architecture. The proposed neural architectures can adequately solve the problem. In our experiments, a receiver can verify if incoming messages are authentic or altered with an accuracy greater than 99%. This work expands the repertoire of integrity checking methodologies, provides a unique perspective based on neural networks and facilitates data security and privacy. Paper: [ bib , pdf ]