{"id":22290,"date":"2023-06-22T05:21:07","date_gmt":"2023-06-22T13:21:07","guid":{"rendered":"http:\/\/www.palada.net\/index.php\/2023\/06\/22\/news-16020\/"},"modified":"2023-06-22T05:21:07","modified_gmt":"2023-06-22T13:21:07","slug":"news-16020","status":"publish","type":"post","link":"https:\/\/www.palada.net\/index.php\/2023\/06\/22\/news-16020\/","title":{"rendered":"And I Shall Call It Mini-Me GPT: Using Large Language Models to Classify the Uncharted Web"},"content":{"rendered":"<p><strong>Credit to Author: gallagherseanm| Date: Thu, 22 Jun 2023 11:30:10 +0000<\/strong><\/p>\n<div class=\"entry-content lg:prose-lg mx-auto prose max-w-4xl\">\n<p>While it may not seem as central to security as malware protection and breach detection, web content filtering plays an important role in ensuring regulatory compliance and the safety of workplaces as well as network security. Unlike security classification of URLs, which screens for malicious content such as malware or phishing, web filtering has to label sites based not on attack mechanisms but on the nature of their content, a much more generalized problem than checking for malicious patterns in the content behind the URL.<\/p>\n<p>Website category labels generally describe what the content or purpose of the site is. Some categories are broad classifications such as \u201cbusiness,\u201d \u201ccomputers and internet,\u201d \u201cfood and dining\u201d and \u201centertainment.\u201d Others focus on intent, such as \u201cbanking,\u201d \u201cshopping,\u201d \u201csearch engines,\u201d \u201csocial media,\u201d \u201cjob search\u201d and \u201ceducation.\u201d And then there are categories that may include content of concern\u2014\u201csexually explicit,\u201d \u201calcohol,\u201d \u201cmarijuana\u201d and \u201cweapons,\u201d for example. 
Organizations may want to set various policies for filtering or measuring the types of websites accessed from their networks.<\/p>\n<p>Sophos X-Ops has been researching ways to apply large language model (LLM) machine learning to web filtering to help catch the \u201clong tail\u201d of websites\u2014those millions of domains that have relatively few visitors and little or no visibility to human analysts. LLMs themselves are not practical for this application because of their size and computational resource cost. But they can be used in turn as \u201cteacher\u201d models to train smaller models on categorization\u2014reducing the computational resources required to generate labels on the fly for newly encountered domains.<\/p>\n<p>Using LLMs such as OpenAI\u2019s GPT-3 and Google\u2019s T5 Large, the SophosAI team was able to train much smaller models to classify never-before-screened URLs on the fly. Most importantly, the same methodology could be applied to create small, economically deployable models based on the output of LLMs for other security tasks.<\/p>\n<p>The team\u2019s research, detailed in a recently published paper entitled <a href=\"https:\/\/arxiv.org\/pdf\/2305.05027.pdf\">\u201cWeb Content Filtering Through Knowledge Distillation of Large Language Models,\u201d<\/a> explores ways in which LLMs can be used to bolster existing human-driven site classification, and to build systems that can be deployed to perform real-time labeling of never-before-seen URLs.<\/p>\n<h3>The \u201clong tail\u201d problem<\/h3>\n<p>Categorization of sites has relied largely on rule-based domain-to-category mapping, where analyst-crafted signatures are used to look for telltale signs in URLs to quickly assign labels to new domains. This sort of mapping is vital for quickly labeling URLs on well-known sites and for preventing false positives that block important content. 
The hands-on human identification of site classification patterns gets folded back into the domain mapping tools\u2019 feature sets.<\/p>\n<p>The problem comes with the \u201clong tail\u201d of websites\u2014those less-visited domains that typically don\u2019t get signatures assigned to them. With the daily emergence of thousands of new websites, and with over a billion existing websites, maintaining and scaling signature-based approaches manually for the long tail has become increasingly challenging. That\u2019s evident in the steep drop-off of labeling for less-visited domains. While well-known, high-traffic sites get nearly 100 percent coverage in most labeling schemes, as shown in the diagram below, the proportion of analyst-labeled domains begins to fall off quickly beyond the top hundred visited domains. Sites ranked below the top 5,000 have a less than 50 percent chance of having been labeled for content.<\/p>\n<p><a style=\"font-size: 1em\" href=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/label_coverage.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-92240 size-full\" src=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/label_coverage.jpg\" alt=\"Figure 1. Labeling of content relative to popularity of domains, derived from telemetry. 
Chart shows logarithmic plot of domains versus proportion with labeling, with almost all sites in the top 100 labeled but less popular sites having decreasing levels of coverage for each power of ten they're ranked in.\" width=\"640\" height=\"400\" srcset=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/label_coverage.jpg 8749w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/label_coverage.jpg?resize=300,187 300w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/label_coverage.jpg?resize=768,480 768w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/label_coverage.jpg?resize=1024,639 1024w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/label_coverage.jpg?resize=1536,959 1536w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/label_coverage.jpg?resize=2048,1279 2048w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/a><\/p>\n<p>Figure 1. Labeling of content relative to popularity of domains, derived from telemetry.<\/p>\n<p>One way to fix this is through application of machine learning for processing previously unlabeled domains. Until now, however, most machine learning efforts (such as Microsoft\u2019s <a href=\"https:\/\/arxiv.org\/pdf\/2106.05256.pdf\">URLTran<\/a>) have used deep learning models to focus on the task of detecting security threats, rather than categorizing sites by content. These models could be retrained to perform multi-category classification, but they would require extremely large training sets of data. URLTran used over 1 million samples just for training on detection of malicious URLs.<\/p>\n<h3>Automating with AI<\/h3>\n<p>That\u2019s where LLMs come in. Because LLMs are pre-trained on massive amounts of unlabeled text, the SophosAI team believed that they could be used to perform URL labeling more accurately and with much less initial data. 
The SophosAI team found that LLMs fine-tuned on data labeled with domain-propagation signatures have a 9% accuracy advantage over the state-of-the-art model architecture from Microsoft when tackling the \u201clong tail\u201d categorization problem, and require a training set of only thousands of URLs, rather than millions.<\/p>\n<p>The LLMs, drawing on semantic relationships between the site classes and keywords within URLs in a smaller data set, were then used to label a previously unlabeled set of data from long-tail sites. Those labels were in turn used to train smaller models (the BERTiny and BERT-based URLTran transformer models and the 1D convolutional model eXpose). This \u201cknowledge distillation\u201d approach allowed the team to reach performance levels similar to those of the LLM with models 175 times smaller, reducing the number of parameters from 770 million to just 4 million.<\/p>\n<figure id=\"attachment_92256\" aria-describedby=\"caption-attachment-92256\" style=\"width: 1200px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/knowledge-distill-2.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-92256 size-full\" src=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/knowledge-distill-2.jpg\" alt=\"A three-step process: 1. Manually-labeled URLs are used to train large language models to semantically recognize categories for URLs. 2. Trained LLMs categorize a larger set of unlabeled URLs, which are then used to train smaller models (BERTiny, eXpose, URLTran). 
3. The best-performing small model can then be deployed in combination with existing rules to catch long-tail websites.\" width=\"1200\" height=\"675\" srcset=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/knowledge-distill-2.jpg 1200w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/knowledge-distill-2.jpg?resize=300,169 300w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/knowledge-distill-2.jpg?resize=768,432 768w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/knowledge-distill-2.jpg?resize=1024,576 1024w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><\/a><figcaption id=\"caption-attachment-92256\" class=\"wp-caption-text\">Figure 2. How knowledge distillation was used to create deployable models.<\/figcaption><\/figure>\n<p>While the most accurate models created performed far better than models trained via \u201cdeep learning\u201d alone, their accuracy fell short of perfection\u2014even the best models scored under 50 percent accuracy. Many URLs failed to be properly labeled simply because they didn\u2019t have sufficient \u201csignals\u201d embedded in them, while others had keywords that could be associated with multiple classifications\u2014creating uncertainty that could only be clarified by deeper examination of the content behind the URL.<\/p>\n<p><a href=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/domain_scaling.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-92243 size-full\" src=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/domain_scaling.jpg\" alt=\"Figures 3 and 4. Accuracy plots of trained models. 
LLMs outperformed smaller models trained with deep learning, and reached the highest accuracy with less data\" width=\"640\" height=\"384\" srcset=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/domain_scaling.jpg 10000w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/domain_scaling.jpg?resize=300,180 300w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/domain_scaling.jpg?resize=768,461 768w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/domain_scaling.jpg?resize=1024,614 1024w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/domain_scaling.jpg?resize=1536,922 1536w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/domain_scaling.jpg?resize=2048,1229 2048w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/a><\/p>\n<figure id=\"attachment_92242\" aria-describedby=\"caption-attachment-92242\" style=\"width: 640px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/domain_augmentation.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-92242 size-full\" src=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/domain_augmentation.jpg\" alt=\"Smaller models trained by the LLMs approached the same level of accuracy as the LLMs themselves.\" width=\"640\" height=\"384\" srcset=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/domain_augmentation.jpg 10000w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/domain_augmentation.jpg?resize=300,180 300w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/domain_augmentation.jpg?resize=768,461 768w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/domain_augmentation.jpg?resize=1024,614 1024w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/domain_augmentation.jpg?resize=1536,922 1536w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/domain_augmentation.jpg?resize=2048,1229 2048w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" 
\/><\/a><figcaption id=\"caption-attachment-92242\" class=\"wp-caption-text\">Figures 3 and 4. Accuracy plots of trained models. LLMs outperformed smaller models trained with deep learning, but the smaller models approached the same level of accuracy when the LLMs were used as teacher models. (The Y axis in both charts runs from 0 to 0.5 accuracy.)<\/figcaption><\/figure>\n<p>However, the T5 Large model performed reasonably well on categories that would potentially be filtered out, as shown in the confusion matrix below\u2014with gambling and peer-to-peer sharing sites having near-perfect labeling on test data. Alcohol, weapons, and pornography sites also had true positive detection rates better than 60%.<\/p>\n<figure id=\"attachment_92244\" aria-describedby=\"caption-attachment-92244\" style=\"width: 640px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/confusion_matrix.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-92244 size-full\" src=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/confusion_matrix.jpg\" alt=\"Figure 5. 
A confusion matrix showing the relationship between the labels the T5 Large model assigned to test URLs and their true, manually assigned labels.\" width=\"640\" height=\"543\" srcset=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/confusion_matrix.jpg 10335w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/confusion_matrix.jpg?resize=300,255 300w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/confusion_matrix.jpg?resize=768,652 768w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/confusion_matrix.jpg?resize=1024,869 1024w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/confusion_matrix.jpg?resize=1536,1304 1536w, https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/confusion_matrix.jpg?resize=2048,1739 2048w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/a><figcaption id=\"caption-attachment-92244\" class=\"wp-caption-text\">Figure 5. A confusion matrix showing the relationship between the labels the T5 Large model assigned to test URLs and their true, manually assigned labels.<\/figcaption><\/figure>\n<p>The SophosAI team has suggested several ways to improve this accuracy going forward. First, allowing for the assignment of multiple categories to a site would eliminate problems with category overlap. Augmenting the URL samples with retrieved HTML and images from them could also provide better recognition of their categories. And newer LLMs, such as GPT-4, could be used as teachers.<\/p>\n<p>When combined with existing processes, this form of AI-based classification can greatly improve the handling of long tail websites. 
And there are other security-related tasks that the \u201cknowledge distillation\u201d methodology tested in this experiment could be applied to.<\/p>\n<p>For more details, see the paper authored by Tamas Voros, Sean Bergeron and Head of SophosAI Konstantin Berlin <a href=\"https:\/\/arxiv.org\/pdf\/2305.05027.pdf\">here on arxiv.org<\/a>.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<\/p><\/div>\n<p><a href=\"https:\/\/news.sophos.com\/en-us\/2023\/06\/22\/using-large-language-models-classify-uncharted-web\/\" target=\"bwo\" >http:\/\/feeds.feedburner.com\/sophos\/dgdY<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p><img decoding=\"async\" src=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2023\/06\/robuts.png\"\/><\/p>\n<p><strong>Credit to Author: gallagherseanm| Date: Thu, 22 Jun 2023 11:30:10 +0000<\/strong><\/p>\n<p>Sophos AI team employs GPT and other large language models as teachers to train smaller models to label websites.<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"colormag_page_container_layout":"default_layout","colormag_page_sidebar_layout":"default_layout","footnotes":""},"categories":[10378,10377],"tags":[27031,29631,129,26474,28950,27030,29632,17384,29633],"class_list":["post-22290","post","type-post","status-publish","format-standard","hentry","category-security","category-sophos","tag-ai-research","tag-bert","tag-featured","tag-gpt-3","tag-large-language-models","tag-sophos-x-ops","tag-t5-large-llm","tag-web-filtering","tag-website-categorization"],"_links":{"self":[{"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/posts\/22290","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.palada.net\/in
dex.php\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/comments?post=22290"}],"version-history":[{"count":0,"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/posts\/22290\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/media?parent=22290"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/categories?post=22290"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/tags?post=22290"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}