{"id":23227,"date":"2023-10-30T08:42:02","date_gmt":"2023-10-30T16:42:02","guid":{"rendered":"https:\/\/www.palada.net\/index.php\/2023\/10\/30\/news-16957\/"},"modified":"2023-10-30T08:42:02","modified_gmt":"2023-10-30T16:42:02","slug":"news-16957","status":"publish","type":"post","link":"http:\/\/www.palada.net\/index.php\/2023\/10\/30\/news-16957\/","title":{"rendered":"\u2018Data poisoning\u2019 anti-AI theft tools emerge \u2014 but are they ethical?"},"content":{"rendered":"<p><img decoding=\"async\" src=\"https:\/\/images.idgesg.net\/images\/article\/2023\/10\/shutterstock_2321877251-1-100947840-small.jpg\"\/><\/p>\n<p>Technologists are helping artists fight back against what they see as intellectual property (IP) theft by generative artificial intelligence (genAI) tools whose training algorithms automatically scrape the internet and other places for content.<\/p>\n<p>The fight over what constitutes fair use of content found online is at the heart of <a href=\"https:\/\/www.cnn.com\/2021\/04\/05\/tech\/google-oracle-supreme-court-ruling\/index.html\" rel=\"nofollow noopener\" target=\"_blank\">an ongoing court battle<\/a>. The fight goes beyond artwork to whether genAI companies <a href=\"https:\/\/www.legaldive.com\/news\/microsoft-copilot-user-copyright-legal-indemnification-ip-law\/693163\/\" rel=\"nofollow noopener\" target=\"_blank\">like Microsoft and its partner, OpenAI<\/a>, can incorporate software code and other published content into their models.<\/p>\n<p>Software engineers, many from university computer science departments, have taken the fight into their own hands. 
Digital \u201cwatermarks\u201d are one option, <a href=\"https:\/\/www.cs.umd.edu\/article\/2022\/09\/umd-computer-scientists-create-unremovable-watermark-protect-intellectual-property\" rel=\"nofollow noopener\" target=\"_blank\">created to claim authorship<\/a> over unique art or other content.<\/p>\n<p>Digital watermarking methods, however, have been thwarted in the past by developers who change network parameters, allowing intruders to claim the content as their own. New techniques have surfaced to prevent those kinds of workarounds, but it\u2019s an ever-evolving battle.<\/p>\n<p>One new method uses \u201c<a href=\"https:\/\/arxiv.org\/abs\/2310.13828\" rel=\"nofollow\">data poisoning attacks<\/a>\u201d to manipulate genAI training data and introduce unexpected behaviors into machine learning models. Called <a href=\"https:\/\/arxiv.org\/abs\/2310.13828\" rel=\"nofollow noopener\" target=\"_blank\">Nightshade<\/a>, the technology uses \u201ccloaking\u201d to trick a genAI training algorithm into believing it\u2019s getting one thing when in reality it\u2019s ingesting something completely different.<\/p>\n<p>First reported in <a href=\"https:\/\/www.technologyreview.com\/2023\/10\/23\/1082189\/data-poisoning-artists-fight-generative-ai\" rel=\"nofollow noopener\" target=\"_blank\">MIT\u2019s <em>Technology Review<\/em><\/a>, Nightshade essentially gets AI models to interpret an image as something other than what it actually shows.<\/p>\n<p>The technology can cause damage to image-generating genAI tools by corrupting AI <a href=\"https:\/\/www.computerworld.com\/article\/3697649\/what-are-large-language-models-and-how-are-they-used-in-generative-ai.html\">large language model<\/a> (LLM) training data, which leads platforms like DALL-E, Midjourney, and Stable Diffusion to spew out erroneous pictures or videos. 
For example, a photo interpreted by AI as a car could actually be a boat; a house becomes a banana; a person becomes a whale, and so on.<\/p>\n<p>Nightshade was developed by University of Chicago researchers under <a href=\"https:\/\/cs.uchicago.edu\/people\/ben-y-zhao\/\" rel=\"nofollow noopener\" target=\"_blank\">computer science professor Ben Zhao<\/a>. Zhao worked with graduate students in the school\u2019s <a href=\"https:\/\/sandlab.cs.uchicago.edu\/\" rel=\"nofollow noopener\" target=\"_blank\">SAND Lab<\/a>, which earlier this year <a href=\"https:\/\/glaze.cs.uchicago.edu\/download.html\" rel=\"nofollow noopener\" target=\"_blank\">also launched a free service<\/a> called <a href=\"https:\/\/glaze.cs.uchicago.edu\/index.html\" rel=\"nofollow noopener\" target=\"_blank\">Glaze<\/a> that lets artists mask their own IP so it cannot be scraped by genAI models. The Nightshade technology will eventually be integrated into Glaze, according to Zhao.<\/p>\n<p>\u201cA tool like Nightshade is very real, and similar tools have been used by hackers and criminals for years to poison model training data to their advantage \u2014 for example, to fool a satellite or a GPS system and thus avoid enemy detection,\u201d said Avivah Litan, a vice president and distinguished analyst with Gartner.<\/p>\n<p><a href=\"https:\/\/research.ibm.com\/topics\/foundation-models\" rel=\"nofollow noopener\" target=\"_blank\">Foundation models<\/a>, typically built on \u201ctransformer\u201d architectures, are large-scale generative AI models trained on thousands \u2014 even millions \u2014 of pieces of raw, unlabeled data. The models learn from the data they curate from the internet and other places, including purchased data sets, to produce answers or solve queries from users.<\/p>\n<p>So, is data poisoning unethical?<\/p>\n<p>Braden Hancock, head of technology and co-founder of Snorkel AI, a startup that helps companies develop LLMs for domain-specific use, believes Nightshade could spur other efforts to 
thwart data scraping by AI developers. While a lot of technological defenses against data scraping date back to 2018, Nightshade is something he\u2019s not seen before.<\/p>\n<p>Whether the use of such tools is ethical or not depends on where they\u2019re aimed, he said.<\/p>\n<p>\u201cI think there are unethical uses of it \u2014 for example, if you\u2019re trying to poison self-driving car data that helps them recognize stop signs and speed limit signs,\u201d Hancock said. \u201cIf your goal is more towards \u2018don\u2019t scrape me\u2019 and not actively trying to ruin a model, I think that\u2019s where the line is for me.\u201d<\/p>\n<p>Ritu Jyoti, a vice president analyst at research firm IDC, sees it less as a question about Nightshade itself and more as one about ethics. \u201cIt\u2019s my data or artwork,&#8221; she said. &#8220;I\u2019ve put it out in public and I\u2019ve masked it with something. So, if without my permission you\u2019re taking it, then it\u2019s your problem.&#8221;<\/p>\n<p>Companies routinely train AI content generation tools using data lakes with thousands or even millions of licensed or unlicensed works, according to Jyoti. For example, Getty Images, an image licensing service, <a href=\"https:\/\/www.cnn.com\/2023\/01\/17\/tech\/getty-images-stability-ai-lawsuit\/index.html\" rel=\"nofollow noopener\" target=\"_blank\">filed a lawsuit against Stability AI<\/a>, maker of the AI art tool Stable Diffusion, earlier this year alleging improper use of its photos, violating both copyright and trademark rights.<\/p>\n<p>Google is currently involved in <a href=\"https:\/\/www.reuters.com\/legal\/litigation\/google-says-data-scraping-lawsuit-would-take-sledgehammer-generative-ai-2023-10-17\/\" rel=\"nofollow\">a class-action lawsuit<\/a> that claims the company&#8217;s scraping of data to train genAI systems violates millions of people&#8217;s privacy and property rights. 
In 2015, <a href=\"https:\/\/www.smithsonianmag.com\/smart-news\/court-ruling-legalizes-google-books-180956997\/\" rel=\"nofollow\">Google won a landmark court ruling<\/a> allowing it to digitize library books.<\/p>\n<p>In each case, the legal system is being asked to clarify what a derivative work is under intellectual property laws, according to Jyoti.<\/p>\n<p>\u201cAnd there are lots of variations in these cases depending on the jurisdiction; different state or federal circuit courts may respond with different interpretations,\u201d she said. \u201cSo, the outcome of these cases is expected to hinge on the interpretation of the fair-use doctrine, which allows copyrighted work to be used without the owner\u2019s permission for purposes such as criticism, satire, fair comment, news reporting, teaching, or classroom use.\u201d<\/p>\n<p>Hancock said genAI development companies are waiting to see how aggressive \u2014 or not \u2014 government regulators will be with IP protections. \u201cI suspect, as is often the case, we\u2019ll look to Europe to lead here. They\u2019re often a little more comfortable protecting data privacy than the US is, and then we end up following suit,\u201d Hancock said.<\/p>\n<p>To date, government efforts to address IP protection against genAI models are at best uneven, according to Litan.<\/p>\n<p>\u201cThe <a href=\"https:\/\/www.computerworld.com\/article\/3699311\/eu-parliament-approves-ai-act-moving-it-closer-to-becoming-law.html\">EU AI Act<\/a> proposes a rule that AI model producers and developers must disclose copyright materials used to train their models. Japan says AI-generated art does not violate copyright laws,\u201d Litan said. 
\u201cUS federal laws on copyright are still non-existent, but there are discussions between government officials and industry leaders around using or mandating content provenance standards.\u201d<\/p>\n<p>Companies that develop genAI are increasingly turning away from indiscriminate scraping of online content and instead purchasing content to ensure they don\u2019t run afoul of IP statutes. That way, they can reassure customers purchasing their AI services that they won\u2019t be sued by content creators.<\/p>\n<p>\u201cEvery company I\u2019m speaking to \u2014 all the technology companies \u2014 IBM, Adobe, Microsoft are all offering indemnification,\u201d Jyoti said. \u201cIBM has announced [it] will be launching a model and if an enterprise is making use of it, they\u2019re in safe hands if they ever get into a lawsuit, because <a href=\"https:\/\/www.ibm.com\/blog\/how-to-use-foundation-models-and-trusted-governance-to-manage-ai-workflow-risk\/\" rel=\"nofollow\">IBM will provide<\/a> them with indemnification.<\/p>\n<p>\u201cThis is a big debatable topic right now,\u201d she added.<\/p>\n<p>Hancock said he\u2019s seeing a lot more companies being explicit in warning AI developers against simply scraping content. \u201cReddit, Stack Overflow, Twitter and other places are getting more explicit and aggressive around saying, &#8216;We will sue you if you use this for your models without our permission,&#8217;\u201d Hancock said.<\/p>\n<p>Microsoft has gone so far as to <a href=\"https:\/\/www.legaldive.com\/news\/microsoft-copilot-user-copyright-legal-indemnification-ip-law\/693163\/\" rel=\"nofollow\">tell its Copilot users<\/a> they won\u2019t be legally protected if they don\u2019t use the content filters and guardrails the company has built into its tool.<\/p>\n<p>A Microsoft spokesperson said the company had no comment. 
OpenAI and IBM did not respond to requests for comment.<\/p>\n<p>Along with indemnifying users against stolen IP, industry efforts are underway to create content authentication standards that support provenance of images and other objects, according to Gartner\u2019s Litan.<\/p>\n<p>For example, Adobe has created <a href=\"https:\/\/helpx.adobe.com\/creative-cloud\/help\/content-credentials.html\" rel=\"nofollow noopener\" target=\"_blank\">Content Credentials<\/a> \u2014 metadata that carries contextual details, such as who made the artwork, when they did it, and how it was created. Another method for protecting creators involves <a href=\"https:\/\/rmit.libguides.com\/referencing_AI_tools\" rel=\"nofollow\">source content references<\/a> in genAI outputs, which are provided by various AI model vendors or third-party firms such as <a href=\"https:\/\/calypsoai.com\/\" rel=\"nofollow noopener\" target=\"_blank\">Calypso AI<\/a> and <a href=\"https:\/\/www.datarobot.com\/\" rel=\"nofollow noopener\" target=\"_blank\">DataRobot<\/a>.<\/p>\n<p>Finally, genAI training techniques, such as prompt engineering and <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/search\/retrieval-augmented-generation-overview\" rel=\"nofollow noopener\" target=\"_blank\">retrieval augmented generation<\/a> (RAG) or fine-tuning, can instruct a model to only use private validated data from the user organization.<\/p>\n<p>\u201cMicrosoft 365 Copilot uses RAG, so that responses to the users from the models are always based on the enterprise\u2019s private data, which is why they indemnify enterprises from copyright violations as long as they follow the M365 Copilot rules and use their guardrails,\u201d Litan said.<\/p>\n<p>Snorkel AI is one company focused entirely on customizing and specializing base genAI models for specific domains and applications. 
The result: LLMs that have data sets orders of magnitude smaller than OpenAI\u2019s GPT-4, Google\u2019s PaLM 2, or Meta\u2019s Llama 2 models.<\/p>\n<p>\u201cWe\u2019re still not talking about tens or hundreds of data points, but thousands or tens of thousands of data points to teach the model what it needs to know from its general training,\u201d Hancock said. \u201cBut that\u2019s still quite a bit different from substantial portions of the Internet that are used for pre-training those other base models.\u201d<\/p>\n<p><a href=\"https:\/\/www.computerworld.com\/article\/3706510\/ai-language-models-need-to-shrink-heres-why-smaller-may-be-better.html\">Smaller domain-specific LLMs that address vertical industry needs<\/a> are already emerging as the next frontier of AI. While they use more targeted data and language, such as financial services terms and market information, base LLMs still consume vast amounts of processor cycles and cost millions of dollars to train.<\/p>\n<p>\u201cWhen you\u2019ve got that much data that you need to pump through a model, you often end up needing hundreds or thousands of specialized accelerators \u2014 CPUs or GPUs \u2014 that you run for weeks or months depending on how much you parallelize,\u201d Hancock said. \u201cThe hardware itself is expensive, but then you\u2019re also running it with a non-stop electricity bill for a long period of time. That doesn\u2019t even include the time spent on data collection.\u201d<\/p>\n<p>Amorphous LLMs will continue to grow alongside domain-specific LLMs because they can be used for general purposes, which means tools to thwart unchecked IP scraping will also continue to grow.<\/p>\n<p>\u201cI can\u2019t judge the ethics of such a tool \u2014 I can only say it often helps to fight fire with fire, and that it just ups the ante for large model developers and providers,&#8221; Litan said. 
&#8220;They will now have to spend a lot of money training their models to ignore such types of adversarial attacks and data poisoning. Whoever has the strongest and most effective AI will win.<\/p>\n<p>\u201cIn the meantime, the artists are totally justified in their frustrations and response.\u201d<\/p>\n<p><a href=\"https:\/\/www.computerworld.com\/article\/3709609\/data-poisoning-anti-ai-theft-tools-emerge-but-are-they-ethical.html#tk.rss_security\" target=\"bwo\" >http:\/\/www.computerworld.com\/category\/security\/index.rss<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p><img decoding=\"async\" src=\"https:\/\/images.idgesg.net\/images\/article\/2023\/10\/shutterstock_2321877251-1-100947840-small.jpg\"\/><\/p>\n<article>\n<section class=\"page\">\n<p>Technologists are helping artists fight back against what they see as intellectual property (IP) theft by generative artificial intelligence (genAI) tools whose training algorithms automatically scrape the internet and other places for content.<\/p>\n<p>The fight over what constitutes fair use of content found online is at the heart of <a href=\"https:\/\/www.cnn.com\/2021\/04\/05\/tech\/google-oracle-supreme-court-ruling\/index.html\" rel=\"nofollow noopener\" target=\"_blank\">an ongoing court battle<\/a>. 
The fight goes beyond artwork to whether genAI companies <a href=\"https:\/\/www.legaldive.com\/news\/microsoft-copilot-user-copyright-legal-indemnification-ip-law\/693163\/\" rel=\"nofollow noopener\" target=\"_blank\">like Microsoft and its partner, OpenAI<\/a>, can incorporate software code and other published content into their models.<\/p>\n<p class=\"jumpTag\"><a href=\"\/article\/3709609\/data-poisoning-anti-ai-theft-tools-emerge-but-are-they-ethical.html#jump\">To read this article in full, please click here<\/a><\/p>\n<\/section>\n<\/article>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"colormag_page_container_layout":"default_layout","colormag_page_sidebar_layout":"default_layout","footnotes":""},"categories":[11062,10643],"tags":[11113,11070,29835,8397,19180,10516,29947,8698,714],"class_list":["post-23227","post","type-post","status-publish","format-standard","hentry","category-computerworld","category-independent","tag-artificial-intelligence","tag-emerging-technology","tag-generative-ai","tag-intel","tag-legal","tag-microsoft","tag-natural-language-processing","tag-regulation","tag-security"],"_links":{"self":[{"href":"http:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/posts\/23227","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"http:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/comments?post=23227"}],"version-history":[{"count":0,"href":"http:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/posts\/23227\/revisions"}],"wp:attachment":[{"href":"http:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/media?parent=23227"}],"wp:term":[{"taxonomy":"ca
tegory","embeddable":true,"href":"http:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/categories?post=23227"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/tags?post=23227"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}