VLA-Fool
Researchers propose VLA-Fool, demonstrating how textual typos, visual patches, and cross-modal misalignment can adversarially attack vision-language-action models.
Academic papers and research explorations
StableVLA introduces IB-Adapter, a plug-and-play module grounded in information bottleneck theory that enhances vision-language-action model robustness to visual corruptions without requiring extra training data.
A model-agnostic adversarial attack disrupts vision-language-action models by misaligning visual-text embeddings, while adversarial fine-tuning defends by learning perturbation-invariant representations.
Attackers can take full control of a VLA-driven robot by appending roughly 20 optimized text tokens to a normal instruction, with no image manipulation and no model access at deployment.
Diffusion Transformers and UViT models enhance video generation, with Sora and Open-Sora leveraging DiT for diverse, high-quality video outputs without resizing constraints.