{"id":24257,"date":"2024-04-15T08:33:58","date_gmt":"2024-04-15T16:33:58","guid":{"rendered":"http:\/\/www.palada.net\/index.php\/2024\/04\/15\/news-17987\/"},"modified":"2024-04-15T08:33:58","modified_gmt":"2024-04-15T16:33:58","slug":"news-17987","status":"publish","type":"post","link":"https:\/\/www.palada.net\/index.php\/2024\/04\/15\/news-17987\/","title":{"rendered":"How Microsoft discovers and mitigates evolving attacks against AI guardrails"},"content":{"rendered":"<p><strong>Credit to Author: Mark Russinovich| Date: Thu, 11 Apr 2024 16:00:00 +0000<\/strong><\/p>\n<p>As we continue to integrate generative AI into our daily lives, it\u2019s important to understand the potential harms that can arise from its use. Our <a href=\"https:\/\/blogs.microsoft.com\/on-the-issues\/2023\/07\/21\/commitment-safe-secure-ai\/\" target=\"_blank\" rel=\"noreferrer noopener\">ongoing commitment<\/a> to advance safe, secure, and trustworthy AI includes transparency about the capabilities and limitations of large language models (LLMs). We prioritize research on societal risks and building secure, safe AI, and focus on developing and deploying AI systems for the public good. You can read more about Microsoft\u2019s approach to securing generative AI with <a href=\"https:\/\/azure.microsoft.com\/en-us\/blog\/announcing-new-tools-in-azure-ai-to-help-you-build-more-secure-and-trustworthy-generative-ai-applications\/\" target=\"_blank\" rel=\"noreferrer noopener\">new tools we recently announced<\/a> as available or coming soon to Microsoft Azure AI Studio for generative AI app developers.<\/p>\n<p>We also made a commitment to identify and mitigate risks and share information on novel, potential threats. 
For example, earlier this year Microsoft shared the principles shaping <a href=\"https:\/\/www.microsoft.com\/en-us\/security\/blog\/2024\/02\/14\/staying-ahead-of-threat-actors-in-the-age-of-ai\/\">Microsoft\u2019s policy and actions<\/a> blocking the nation-state advanced persistent threats (APTs), advanced persistent manipulators (APMs), and cybercriminal syndicates we track from using our AI tools and APIs.<\/p>\n<p>In this blog post, we will discuss some of the key issues surrounding AI harms and vulnerabilities, and the steps we are taking to address the risk.<\/p>\n<h2 class=\"wp-block-heading\" id=\"the-potential-for-malicious-manipulation-of-llms\">The potential for malicious manipulation of LLMs<\/h2>\n<p>One of the main concerns with AI is its potential misuse for malicious purposes. To prevent this, AI systems at Microsoft are built with several layers of defenses throughout their architecture. One purpose of these defenses is to limit what the LLM will do, to align with the developers\u2019 human values and goals. But sometimes bad actors attempt to bypass these safeguards with the intent to achieve unauthorized actions, which may result in what is known as a \u201cjailbreak.\u201d The consequences can range from the unapproved but less harmful\u2014like getting the AI interface to talk like a pirate\u2014to the very serious, such as inducing AI to provide detailed instructions on how to achieve illegal activities. As a result, a good deal of effort goes into shoring up these jailbreak defenses to protect AI-integrated applications from these behaviors.<\/p>\n<p>While AI-integrated applications can be attacked like traditional software (with methods like buffer overflows and cross-site scripting), they can also be vulnerable to more specialized attacks that exploit their unique characteristics, including the manipulation or injection of malicious instructions by talking to the AI model through the user prompt. 
We can break these risks into two groups of attack techniques:<\/p>\n<ul>\n<li><strong>Malicious prompts<\/strong>: When the user input attempts to circumvent safety systems in order to achieve a dangerous goal. Also referred to as a user\/direct prompt injection attack, or UPIA.<\/li>\n<li><strong>Poisoned content<\/strong>: When a well-intentioned user asks the AI system to process a seemingly harmless document (such as summarizing an email) that contains content created by a malicious third party with the purpose of exploiting a flaw in the AI system. Also known as a cross\/indirect prompt injection attack, or XPIA.<\/li>\n<\/ul>\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/security\/blog\/wp-content\/uploads\/2024\/04\/Picture1.webp\" alt=\"Diagram explaining how malicious prompts and poisoned content work.\" class=\"wp-image-133938 webp-format\" srcset=\"\" data-orig-src=\"https:\/\/www.microsoft.com\/en-us\/security\/blog\/wp-content\/uploads\/2024\/04\/Picture1.webp\"><\/figure>\n<p>Today we\u2019ll share two of our team\u2019s advances in this field: the discovery of a powerful technique to neutralize poisoned content, and the discovery of a novel family of malicious prompt attacks, along with the multiple layers of mitigations we use to defend against them.<\/p>\n<h2 class=\"wp-block-heading\" id=\"neutralizing-poisoned-content-spotlighting\">Neutralizing poisoned content (Spotlighting)<\/h2>\n<p>Prompt injection attacks through poisoned content are a major security risk: an attacker who succeeds can potentially issue commands to the AI system as if they were the user. 
For example, a malicious email could contain a payload that, when summarized, would cause the system to search the user\u2019s email (using the user\u2019s credentials) for other emails with sensitive subjects\u2014say, \u201cPassword Reset\u201d\u2014and exfiltrate the contents of those emails to the attacker by fetching an image from an attacker-controlled URL. As such capabilities are of obvious interest to a wide range of adversaries, defending against them is a key requirement for the safe and secure operation of any AI service.<\/p>\n<p>Our experts have developed a family of techniques called <a href=\"https:\/\/aka.ms\/promptshields-techblog\" target=\"_blank\" rel=\"noreferrer noopener\">Spotlighting<\/a> that reduces the success rate of these attacks from more than 20% to below the threshold of detection, with minimal effect on the AI\u2019s overall performance:<\/p>\n<ul>\n<li><strong>Spotlighting<\/strong> (also known as data marking) to make the external data clearly separable from instructions by the LLM, with different marking methods offering a range of quality and robustness tradeoffs that depend on the model in use.<\/li>\n<\/ul>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/security\/blog\/wp-content\/uploads\/2024\/04\/Picture2-1024x455.webp\" alt=\"Diagram explaining how Spotlighting works to reduce risk.\" class=\"wp-image-133939 webp-format\" srcset=\"https:\/\/www.microsoft.com\/en-us\/security\/blog\/wp-content\/uploads\/2024\/04\/Picture2-1024x455.webp 1024w, https:\/\/www.microsoft.com\/en-us\/security\/blog\/wp-content\/uploads\/2024\/04\/Picture2-300x133.webp 300w, https:\/\/www.microsoft.com\/en-us\/security\/blog\/wp-content\/uploads\/2024\/04\/Picture2-768x341.webp 768w, https:\/\/www.microsoft.com\/en-us\/security\/blog\/wp-content\/uploads\/2024\/04\/Picture2-1536x683.webp 1536w, 
https:\/\/www.microsoft.com\/en-us\/security\/blog\/wp-content\/uploads\/2024\/04\/Picture2.webp 1759w\" data-orig-src=\"https:\/\/www.microsoft.com\/en-us\/security\/blog\/wp-content\/uploads\/2024\/04\/Picture2-1024x455.webp\"><\/figure>\n<h2 class=\"wp-block-heading\" id=\"mitigating-the-risk-of-multiturn-threats-crescendo\">Mitigating the risk of multiturn threats (Crescendo)<\/h2>\n<p>Our researchers discovered a novel generalization of jailbreak attacks, which we call <strong>Crescendo<\/strong>. This attack can best be described as a <strong>multiturn LLM jailbreak<\/strong>, and we have found that it can achieve a wide range of malicious goals against the most well-known LLMs used today. Crescendo can also bypass many of the existing content safety filters, if not appropriately addressed.\u00a0Once we discovered this jailbreak technique, we quickly shared our technical findings with other AI vendors so they could determine whether they were affected and take actions they deem appropriate. The vendors we contacted are aware of the potential impact of Crescendo attacks and focused on protecting their respective platforms, according to their own AI implementations and safeguards.<\/p>\n<p>At its core, Crescendo tricks LLMs into generating malicious content by exploiting their own responses. 
By asking carefully crafted questions or prompts that gradually lead the LLM to a desired outcome, rather than asking for the goal all at once, it is possible to bypass guardrails and filters\u2014this can usually be achieved in fewer than 10 interaction turns.\u00a0You can read about Crescendo\u2019s results across a variety of LLMs and chat services, and more about how and why it works, in our <a href=\"https:\/\/arxiv.org\/abs\/2404.01833\" target=\"_blank\" rel=\"noreferrer noopener\">research paper<\/a>.<\/p>\n<p>While Crescendo attacks were a surprising discovery, it is important to note that these attacks did not directly pose a threat to the privacy of users otherwise interacting with the Crescendo-targeted AI system, or the security of the AI system, itself. Rather, what Crescendo attacks bypass and defeat is content filtering regulating the LLM, helping to prevent an AI interface from behaving in undesirable ways. We are committed to continuously researching and addressing these, and other types of attacks, to help maintain the secure operation and performance of AI systems for all.<\/p>\n<p>In the case of Crescendo, our teams made software updates to the LLM technology behind Microsoft\u2019s AI offerings, including our Copilot AI assistants, to mitigate the impact of this multiturn AI guardrail bypass. 
As more researchers inside and outside Microsoft inevitably focus on finding and publicizing AI bypass techniques, Microsoft will continue to update the protections in our products and to contribute to AI security research, bug bounties, and collaboration.<\/p>\n<p>To understand how we addressed the issue, let us first review how we mitigate a standard malicious prompt attack (single step, also known as a one-shot jailbreak):<\/p>\n<ul>\n<li><strong>Standard prompt filtering<\/strong>: Detect and reject inputs that contain harmful or malicious intent, which might circumvent the guardrails (causing a jailbreak attack).<\/li>\n<li><strong>System metaprompt<\/strong>: Prompt engineering in the system to clearly explain to the LLM how to behave and provide additional guardrails.<\/li>\n<\/ul>\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/security\/blog\/wp-content\/uploads\/2024\/04\/Picture3.webp\" alt=\"Diagram of malicious prompt mitigations.\" class=\"wp-image-133941 webp-format\" srcset=\"\" data-orig-src=\"https:\/\/www.microsoft.com\/en-us\/security\/blog\/wp-content\/uploads\/2024\/04\/Picture3.webp\"><\/figure>\n<p>Defending against Crescendo initially posed some practical problems. At first, we could not detect a \u201cjailbreak intent\u201d with standard prompt filtering, as each individual prompt is not, on its own, a threat, and keywords alone are insufficient to detect this type of harm. Only when the prompts are combined is the threat pattern clear. 
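To make this concrete, here is a minimal, hypothetical sketch (not Microsoft's implementation; every name here is invented for illustration). A toy keyword scorer stands in for a real trained intent classifier: each turn scores as benign on its own, yet the same turns trip the filter once they are scored as one combined window.

```python
# Illustrative sketch only -- detect_malicious_intent() is a toy keyword
# scorer standing in for a real trained classifier. The point being
# demonstrated is the windowing, not the detector itself.
from typing import List

RISKY_TERMS = {"bypass", "weapon", "exploit"}

def detect_malicious_intent(text: str) -> float:
    """Toy stand-in for an intent classifier: returns a risk score in [0, 1]."""
    lowered = text.lower()
    hits = sum(term in lowered for term in RISKY_TERMS)
    return min(1.0, hits / 2)

def filter_single_turn(turn: str, threshold: float = 0.75) -> bool:
    """Per-turn filtering: blocks only if a single turn is individually risky."""
    return detect_malicious_intent(turn) >= threshold

def filter_multiturn(conversation: List[str], threshold: float = 0.75) -> bool:
    """Score the whole prior conversation as one window, so a gradual
    escalation that looks benign turn-by-turn still trips the detector."""
    window = "\n".join(conversation)
    return detect_malicious_intent(window) >= threshold

# Each turn alone slips past the per-turn filter, but the combined
# window crosses the threshold.
turns = [
    "Tell me about the history of this weapon.",
    "Interesting. How did people bypass the safety controls back then?",
]
```

The design point is that the detector itself is unchanged; only the amount of context it sees grows, which is enough to surface the combined pattern.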
Also, the LLM itself does not see anything out of the ordinary, since each successive step is well-rooted in what it generated in a previous step, with just a small additional ask; this eliminates many of the more prominent signals that we could ordinarily use to prevent this kind of attack.<\/p>\n<p>To solve the unique problems of multiturn LLM jailbreaks, we created additional layers of mitigation on top of those mentioned above:&nbsp;<\/p>\n<ul>\n<li><strong>Multiturn prompt filter<\/strong>: We have adapted input filters to look at the entire pattern of the prior conversation, not just the immediate interaction. We found that even passing this larger context window to existing malicious intent detectors, without improving the detectors at all, significantly reduced the efficacy of Crescendo.&nbsp;<\/li>\n<li><strong>AI Watchdog<\/strong>: Deploying an AI-driven detection system trained on adversarial examples, like a sniffer dog at the airport searching for contraband items in luggage. As a separate AI system, it avoids being influenced by malicious instructions. <a href=\"https:\/\/azure.microsoft.com\/en-us\/products\/ai-services\/ai-content-safety\" target=\"_blank\" rel=\"noreferrer noopener\">Microsoft Azure AI Content Safety<\/a> is an example of this approach.<\/li>\n<li><strong>Advanced research<\/strong>: We invest in research on more complex mitigations, derived from a better understanding of how LLMs process requests and go astray. 
These have the potential to protect not only against Crescendo, but against the larger family of social engineering attacks against LLMs.&nbsp;<\/li>\n<\/ul>\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/security\/blog\/wp-content\/uploads\/2024\/04\/Picture3-1.webp\" alt=\"A diagram explaining how the AI watchdog applies to the user prompt and the AI generated content.\" class=\"wp-image-133946 webp-format\" srcset=\"\" data-orig-src=\"https:\/\/www.microsoft.com\/en-us\/security\/blog\/wp-content\/uploads\/2024\/04\/Picture3-1.webp\"><\/figure>\n<h2 class=\"wp-block-heading\" id=\"how-microsoft-helps-protect-ai-systems\">How Microsoft helps protect AI systems<\/h2>\n<p>AI has the potential to bring many benefits to our lives. But it is important to be aware of new attack vectors and take steps to address them. By working together and sharing vulnerability discoveries, we can continue to improve the safety and security of AI systems. With the right product protections in place, we continue to be cautiously optimistic about the future of generative AI, and embrace the possibilities safely, with confidence. To learn more about developing responsible AI solutions with Azure AI, <a href=\"https:\/\/azure.microsoft.com\/solutions\/ai\/responsible-ai-with-azure\" target=\"_blank\" rel=\"noreferrer noopener\">visit our website<\/a>.<\/p>\n<p>To empower security professionals and machine learning engineers to proactively find risks in their own generative AI systems, Microsoft has released an open automation framework, PyRIT (Python Risk Identification Toolkit for generative AI). Read more about the release of <a href=\"https:\/\/www.microsoft.com\/en-us\/security\/blog\/2024\/02\/22\/announcing-microsofts-open-automation-framework-to-red-team-generative-ai-systems\/\">PyRIT for generative AI Red teaming<\/a>, and <a href=\"https:\/\/github.com\/Azure\/PyRIT\">access the PyRIT toolkit on GitHub<\/a>. 
If you discover new vulnerabilities in any AI platform, we encourage you to follow responsible disclosure practices for the platform owner. Microsoft\u2019s own procedure is explained here: <a href=\"https:\/\/www.microsoft.com\/en-us\/msrc\/bounty-ai\">Microsoft AI Bounty<\/a>.<\/p>\n<div class=\"wp-block-msxcm-cta-block\" data-moray data-bi-an=\"CTA Block\">\n<div class=\"card d-block mx-ng mx-md-0\">\n<div class=\"row no-gutters\">\n<div class=\"d-flex col-md\">\n<div class=\"card-body align-self-center p-4 p-md-5\">\n<h2>The Crescendo Multi-Turn LLM Jailbreak Attack<\/h2>\n<div class=\"mb-3\">\n<p>Read about Crescendo\u2019s results across a variety of LLMs and chat services, and more about how and why it works.<\/p>\n<\/div>\n<div class=\"link-group\"> <a href=\"https:\/\/arxiv.org\/abs\/2404.01833\" class=\"btn btn-link text-decoration-none p-0\" target=\"_blank\"> <span>Read the paper<\/span> <span class=\"glyph-append glyph-append-chevron-right glyph-append-xsmall\"><\/span> <\/a> <\/div>\n<\/div>\n<\/div>\n<div class=\"col-md-4\"> <img fetchpriority=\"high\" decoding=\"async\" width=\"464\" height=\"328\" src=\"https:\/\/www.microsoft.com\/en-us\/security\/blog\/wp-content\/uploads\/2023\/11\/Diamond-Sleet-featured.png\" class=\"card-img img-object-cover\" alt=\"Photo of a male employee using a laptop in a small business setting\" srcset=\"https:\/\/www.microsoft.com\/en-us\/security\/blog\/wp-content\/uploads\/2023\/11\/Diamond-Sleet-featured.webp 464w, https:\/\/www.microsoft.com\/en-us\/security\/blog\/wp-content\/uploads\/2023\/11\/Diamond-Sleet-featured-300x212.webp 300w\" sizes=\"(max-width: 464px) 100vw, 464px\" \/> <\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>To learn more about Microsoft Security solutions, visit our&nbsp;<a href=\"https:\/\/www.microsoft.com\/en-us\/security\/business\" target=\"_blank\" rel=\"noreferrer 
noopener\">website.<\/a>&nbsp;Bookmark the&nbsp;<a href=\"https:\/\/www.microsoft.com\/security\/blog\/\" target=\"_blank\" rel=\"noreferrer noopener\">Security blog<\/a>&nbsp;to keep up with our expert coverage on security matters. Also, follow us on LinkedIn (<a href=\"https:\/\/www.linkedin.com\/showcase\/microsoft-security\/\">Microsoft Security<\/a>) and X (<a href=\"https:\/\/twitter.com\/@MSFTSecurity\" target=\"_blank\" rel=\"noreferrer noopener\">@MSFTSecurity<\/a>)&nbsp;for the latest news and updates on cybersecurity.<\/p>\n<p>The post <a href=\"https:\/\/www.microsoft.com\/en-us\/security\/blog\/2024\/04\/11\/how-microsoft-discovers-and-mitigates-evolving-attacks-against-ai-guardrails\/\">How Microsoft discovers and mitigates evolving attacks against AI guardrails<\/a> appeared first on <a href=\"https:\/\/www.microsoft.com\/en-us\/security\/blog\">Microsoft Security Blog<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p><strong>Credit to Author: Mark Russinovich| Date: Thu, 11 Apr 2024 16:00:00 +0000<\/strong><\/p>\n<p>Read about some of the key issues surrounding AI harms and vulnerabilities, and the steps Microsoft is taking to address the risk. 
<\/p>\n<p>The post <a href=\"https:\/\/www.microsoft.com\/en-us\/security\/blog\/2024\/04\/11\/how-microsoft-discovers-and-mitigates-evolving-attacks-against-ai-guardrails\/\">How Microsoft discovers and mitigates evolving attacks against AI guardrails<\/a> appeared first on <a href=\"https:\/\/www.microsoft.com\/en-us\/security\/blog\">Microsoft Security Blog<\/a>.<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"colormag_page_container_layout":"default_layout","colormag_page_sidebar_layout":"default_layout","footnotes":""},"categories":[10759,10378],"tags":[],"class_list":["post-24257","post","type-post","status-publish","format-standard","hentry","category-microsoft","category-security"],"_links":{"self":[{"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/posts\/24257","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/comments?post=24257"}],"version-history":[{"count":0,"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/posts\/24257\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/media?parent=24257"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/categories?post=24257"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/tags?post=24257"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}