{"id":25573,"date":"2024-12-13T05:21:06","date_gmt":"2024-12-13T13:21:06","guid":{"rendered":"http:\/\/www.palada.net\/index.php\/2024\/12\/13\/news-19302\/"},"modified":"2024-12-13T05:21:06","modified_gmt":"2024-12-13T13:21:06","slug":"news-19302","status":"publish","type":"post","link":"https:\/\/www.palada.net\/index.php\/2024\/12\/13\/news-19302\/","title":{"rendered":"DeepSpeed: a tuning tool for large language models"},"content":{"rendered":"<p><strong>Credit to Author: gallagherseanm| Date: Fri, 13 Dec 2024 11:30:50 +0000<\/strong><\/p>\n<div class=\"entry-content lg:prose-lg mx-auto prose max-w-4xl\">\n<p><span data-contrast=\"auto\">Large language models (LLMs) have the potential to automate or reduce many types of workloads, including those of cybersecurity analysts and incident responders. But generic LLMs lack the domain-specific knowledge to handle these tasks well. While they may have been built with training data that included some cybersecurity-related resources, that is often insufficient for more specialized tasks that require more up-to-date and, in some cases, proprietary knowledge that was not available to the LLMs when they were trained.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">There are several existing solutions for tuning \u201cstock\u201d (unmodified) LLMs for specific types of tasks. Unfortunately, these solutions were insufficient for the types of LLM applications that Sophos X-Ops is attempting to implement. For that reason, SophosAI has assembled a framework that uses DeepSpeed, a library developed by Microsoft that can be used to train and tune the inference of a model with (in theory) trillions of parameters by scaling up the compute power and number of graphics processing units (GPUs) used during training. 
The framework is released under an open-source license and can be found in <\/span><a href=\"https:\/\/github.com\/sophos\/Sophos-LLM-Builder-Toolkit.git\"><span data-contrast=\"none\">our GitHub repository<\/span><\/a><span data-contrast=\"auto\">.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">While many parts of the framework are not novel and leverage existing open-source libraries, SophosAI has synthesized several of the key components for ease of use. We continue to work on improving the performance of the framework.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<h2><span data-contrast=\"none\">The (inadequate) alternatives<\/span><span data-ccp-props=\"{&quot;134245418&quot;:true,&quot;134245529&quot;:true,&quot;335559738&quot;:160,&quot;335559739&quot;:80}\">\u00a0<\/span><\/h2>\n<p><span data-contrast=\"auto\">There are several existing approaches to adapting stock LLMs to domain-specific knowledge. Each of them has its own advantages and limitations.\u00a0<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">To be fully effective, a domain-expert LLM requires pre-training of all its parameters to learn the proprietary knowledge of a company. That undertaking can be resource-intensive and time-consuming, which is why we built our training framework, implemented in Python, on DeepSpeed. 
The version of the framework that we are releasing as open source runs on the Amazon Web Services SageMaker machine learning service, but it could be adapted to other environments.\u00a0<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Training frameworks (including DeepSpeed) allow you to scale up large model training tasks through parallelism. There are three main types of parallelism: data, tensor, and pipeline.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}\">\u00a0<\/span><\/p>\n<figure id=\"attachment_958841\" aria-describedby=\"caption-attachment-958841\" style=\"width: 640px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2024\/12\/Slide1.jpeg\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-958841\" src=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2024\/12\/Slide1.jpeg\" alt=\"\" width=\"640\" height=\"360\" srcset=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2024\/12\/Slide1.jpeg 1200w, https:\/\/news.sophos.com\/wp-content\/uploads\/2024\/12\/Slide1.jpeg?resize=300,169 300w, https:\/\/news.sophos.com\/wp-content\/uploads\/2024\/12\/Slide1.jpeg?resize=768,432 768w, https:\/\/news.sophos.com\/wp-content\/uploads\/2024\/12\/Slide1.jpeg?resize=1024,576 1024w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/a><figcaption id=\"caption-attachment-958841\" class=\"wp-caption-text\">Figure 1: an illustration of the three main types of model training parallelism.<\/figcaption><\/figure>\n<p><span data-contrast=\"auto\">In data parallelism, each process working on the training task (essentially, each GPU) receives a full copy of the model\u2019s weights but only a subset of the data, called a minibatch. 
After the forward pass through the data (to calculate <\/span><a href=\"https:\/\/developers.google.com\/machine-learning\/crash-course\/descending-into-ml\/training-and-loss\"><span data-contrast=\"none\">loss<\/span><\/a><span data-contrast=\"auto\">, or the amount of inaccuracy in the predictions of the model being trained) and the backward pass (to calculate the <\/span><a href=\"https:\/\/developers.google.com\/machine-learning\/crash-course\/reducing-loss\/gradient-descent\"><span data-contrast=\"none\">gradient of the loss<\/span><\/a><span data-contrast=\"auto\">) are completed, the resulting gradients are synchronized across processes.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">In tensor parallelism, each layer of the model being trained is split across the available processes. Each process computes a portion of the layer\u2019s operation using the full batch of training data. The partial outputs for each layer are synchronized across processes to create a single output matrix.\u00a0<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Pipeline parallelism splits up the model differently. Instead of splitting individual layers across processes, it assigns each layer of the model to its own process. The minibatches of data are divided into micro-batches that are sent down the \u201cpipeline\u201d sequentially. Once a process finishes a micro-batch, it receives a new one. 
This method can suffer from \u201cbubbles,\u201d in which a process sits idle while waiting for the output of processes hosting earlier model layers.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">These three parallelism techniques can also be combined in several ways, and the DeepSpeed training library does combine them.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}\">\u00a0<\/span><\/p>\n<h2><span data-contrast=\"none\">Doing it with DeepSpeed<\/span><span data-ccp-props=\"{&quot;134245418&quot;:true,&quot;134245529&quot;:true,&quot;335559738&quot;:160,&quot;335559739&quot;:80}\">\u00a0<\/span><\/h2>\n<p><span data-contrast=\"auto\">DeepSpeed performs sharded data parallelism. Every model layer is split such that each process gets a slice, and each process is given a separate minibatch as input. During the forward pass, each process shares its slice of the layer with the other processes. At the end of this communication, each process has a copy of the full model layer.\u00a0<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Each process computes the layer output for its minibatch. After the process finishes computation for the given layer and its minibatch, it discards the parts of the layer it was not originally holding.\u00a0<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">The backward pass is done in a similar fashion. 
As with data parallelism, the gradients are accumulated at the end of the backward pass and synchronized across processes.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Training performance is constrained more by memory than by processing power. Bringing on more GPUs with additional memory to handle a batch that is too large for a single GPU\u2019s memory can carry a significant performance cost because of the limited communication speed between GPUs, on top of the cost of using more processors than would otherwise be required. One of the key elements of the DeepSpeed library is its <\/span><a href=\"https:\/\/www.deepspeed.ai\/tutorials\/zero\/\"><span data-contrast=\"none\">Zero Redundancy Optimizer (ZeRO),<\/span><\/a><span data-contrast=\"auto\"> a set of memory optimization techniques that can efficiently parallelize the training of very large language models. ZeRO can reduce the memory consumption of each GPU by partitioning the model states (optimizer states, gradients, and parameters) across data-parallel processes instead of duplicating them in each process.\u00a0<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">The trick is finding the right combination of training approaches and optimizations for your computational budget. 
There are three selectable levels of partitioning in ZeRO:<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}\">\u00a0<\/span><\/p>\n<ul>\n<li data-leveltext=\"o\" data-font=\"Courier New\" data-listid=\"4\" data-list-defn-props=\"{&quot;335552541&quot;:1,&quot;335559685&quot;:1440,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Courier New&quot;,&quot;469769242&quot;:[9675],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;o&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}\" data-aria-posinset=\"1\" data-aria-level=\"2\"><span data-contrast=\"auto\">ZeRO Stage 1 shards the optimizer state across processes.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}\">\u00a0<\/span><\/li>\n<\/ul>\n<ul>\n<li data-leveltext=\"o\" data-font=\"Courier New\" data-listid=\"4\" data-list-defn-props=\"{&quot;335552541&quot;:1,&quot;335559685&quot;:1440,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Courier New&quot;,&quot;469769242&quot;:[9675],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;o&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}\" data-aria-posinset=\"2\" data-aria-level=\"2\"><span data-contrast=\"auto\">Stage 2 shards the optimizer state and the gradients.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}\">\u00a0<\/span><\/li>\n<\/ul>\n<ul>\n<li data-leveltext=\"o\" data-font=\"Courier New\" data-listid=\"4\" data-list-defn-props=\"{&quot;335552541&quot;:1,&quot;335559685&quot;:1440,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Courier New&quot;,&quot;469769242&quot;:[9675],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;o&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}\" data-aria-posinset=\"3\" data-aria-level=\"2\"><span data-contrast=\"auto\">Stage 3 shards the optimizer state, the gradients, and the model 
weights.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}\">\u00a0<\/span><\/li>\n<\/ul>\n<p><span data-contrast=\"auto\">Each stage has its own trade-offs. ZeRO Stage 1 is faster, for example, but requires more memory than Stage 2 or 3.<\/span><\/p>\n<p><span data-contrast=\"auto\">There are two separate inference approaches within the DeepSpeed toolkit:\u00a0<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}\">\u00a0<\/span><\/p>\n<ul>\n<li data-leveltext=\"o\" data-font=\"Courier New\" data-listid=\"3\" data-list-defn-props=\"{&quot;335552541&quot;:1,&quot;335559685&quot;:1440,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Courier New&quot;,&quot;469769242&quot;:[9675],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;o&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}\" data-aria-posinset=\"1\" data-aria-level=\"2\"><span data-contrast=\"auto\">DeepSpeed Inference: an inference engine with optimizations such as kernel injection; it has lower latency but requires more memory.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}\">\u00a0<\/span><\/li>\n<\/ul>\n<ul>\n<li data-leveltext=\"o\" data-font=\"Courier New\" data-listid=\"3\" data-list-defn-props=\"{&quot;335552541&quot;:1,&quot;335559685&quot;:1440,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Courier New&quot;,&quot;469769242&quot;:[9675],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;o&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}\" data-aria-posinset=\"2\" data-aria-level=\"2\"><span data-contrast=\"auto\">ZeRO Inference: offloads parameters to CPU or NVMe memory during inference; it has higher latency but consumes less GPU memory.<\/span><span 
data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:279}\">\u00a0<\/span><\/li>\n<\/ul>\n<h2>Our contributions<\/h2>\n<p><span data-contrast=\"auto\">The SophosAI team has put together a toolkit based on DeepSpeed that helps take some of the pain out of using it. While the parts of the toolkit itself are not novel, what is new is the convenience of having several key components synthesized for ease of use.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">At the time of its creation, this repository was the first to combine training and both DeepSpeed inference types (DeepSpeed Inference and ZeRO Inference) into one configurable script. It was also the first repository to create a custom container for running the latest DeepSpeed version on Amazon Web Services\u2019 SageMaker. And it was the first repository to perform distributed, script-based DeepSpeed inference that was not run as an endpoint on SageMaker. 
The training methods currently supported include continued pre-training, supervised fine-tuning, and preference optimization.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">The repository and its documentation can be found <\/span><a href=\"https:\/\/github.com\/sophos\/Sophos-LLM-Builder-Toolkit.git\"><span data-contrast=\"none\">here<\/span><\/a><span data-contrast=\"auto\"> on Sophos\u2019 GitHub.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<\/div>\n<p><a href=\"https:\/\/news.sophos.com\/en-us\/2024\/12\/13\/deepspeed-a-tuning-tool-for-large-language-models\/\" target=\"bwo\" >http:\/\/feeds.feedburner.com\/sophos\/dgdY<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p><img decoding=\"async\" src=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2024\/12\/shutterstock_asset-generation-6dc4f763-f8a7-4dff-a56b-92736d8c8d6c-1_edited-e1734048709973.jpeg\"\/><\/p>\n<p><strong>Credit to Author: gallagherseanm| Date: Fri, 13 Dec 2024 11:30:50 +0000<\/strong><\/p>\n<p>SophosAI\u2019s framework for upgrading the performance of LLMs for cybersecurity tasks (or any other specific task) is now open source. 
<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"colormag_page_container_layout":"default_layout","colormag_page_sidebar_layout":"default_layout","footnotes":""},"categories":[10378,10377],"tags":[27031,32012,129,29047,32224],"class_list":["post-25573","post","type-post","status-publish","format-standard","hentry","category-security","category-sophos","tag-ai-research","tag-deepspeed","tag-featured","tag-llm","tag-llm-tuning"],"_links":{"self":[{"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/posts\/25573","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/comments?post=25573"}],"version-history":[{"count":0,"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/posts\/25573\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/media?parent=25573"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/categories?post=25573"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/tags?post=25573"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}