{"id":19944,"date":"2022-08-25T05:20:55","date_gmt":"2022-08-25T13:20:55","guid":{"rendered":"http:\/\/www.palada.net\/index.php\/2022\/08\/25\/news-13677\/"},"modified":"2022-08-25T05:20:55","modified_gmt":"2022-08-25T13:20:55","slug":"news-13677","status":"publish","type":"post","link":"https:\/\/www.palada.net\/index.php\/2022\/08\/25\/news-13677\/","title":{"rendered":"An open-source ML toolkit for automatically generating YARA rules"},"content":{"rendered":"<p><strong>Credit to Author: gallagherseanm| Date: Thu, 25 Aug 2022 11:00:35 +0000<\/strong><\/p>\n<div class=\"entry-content lg:prose-lg mx-auto prose max-w-4xl\">\n<p>Machine learning has many applications in information security. But you can\u2019t take machine learning-generated models everywhere, especially on lower-power edge devices. Sometimes you need something more lightweight to investigate possible malware. And in some cases, the only thing available is <a href=\"https:\/\/github.com\/VirusTotal\/yara\">YARA<\/a>, the multiplatform classification tool maintained by VirusTotal.<\/p>\n<p>YARA classifies content by applying <a href=\"https:\/\/virustotal.github.io\/yara\/\">YARA rules<\/a>, human-readable rules that identify text artifacts associated with a specific type of malicious code or content. But writing hand-crafted rules for YARA can be time-consuming, and how well rules complete their task depends heavily on the rule-writer\u2019s skill and judgement. Bringing the power of machine learning to the process of creating YARA rules could significantly speed up and simplify their generation.<\/p>\n<p>That\u2019s precisely the idea behind Sophos\u2019s experimental YaraML tool, developed by SophosAI chief scientist Joshua Saxe. 
YaraML can be used without any prior machine learning experience, but it also allows more advanced users familiar with machine learning projects to set customized parameters.<\/p>\n<p>YaraML analyzes a dataset of string artifacts labeled as benign or malicious to create YARA rules, extracting patterns that can be used to identify malicious string artifacts using YARA. Saxe used YaraML to create example YARA rules for detecting <a href=\"https:\/\/github.com\/sophos\/yaraml_rules\/tree\/master\/example_models\">PowerShell malware<\/a>, malware associated with the SolarWinds cyberespionage campaign, and macOS malware, which have also been published on the Sophos GitHub page.<\/p>\n<p>While Sophos doesn\u2019t use YARA as part of its products, YARA rules can be useful for incident responders and malware researchers, especially in situations where other types of tools aren\u2019t available or are impractical to deploy. So the tool and rules have been shared as open source under the Apache 2.0 license in the hopes that they can provide assistance to the blue team and threat research community.<\/p>\n<h3>How to train your own model<\/h3>\n<p>YaraML is a Python-based machine learning tool that lets you generate your own YARA rule from a dataset of malicious\/benign labeled data. From that dataset, as final output, YaraML generates a complete, human-readable (if not totally comprehensible) YARA rule that can be deployed as-is or used alongside human-written rules.<\/p>\n<p>To train your own model you need to have Python 3.6 or later installed on your computer. The code for YaraML can be retrieved from <a href=\"https:\/\/github.com\/sophos\/yaraml_rules\">Sophos\u2019s GitHub page<\/a>. Its installation script will retrieve all the libraries required. 
Once installed, YaraML can be invoked from the command line in the following format:<\/p>\n<pre>yaraml [directory of malicious samples] [directory of benign samples] [name for created model] [name for created YARA rule]<\/pre>\n<p>For example:<\/p>\n<pre>yaraml powershell_malware\/ powershell_benign\/ powershell_model powershell_detector<\/pre>\n<p>The script can perform one of two forms of binary classification using the Python <a href=\"https:\/\/scikit-learn.org\/stable\/index.html\">scikit-learn<\/a> library: <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/linear_model.html#logistic-regression\">logistic regression<\/a> or <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/ensemble.html#random-forests\">random forest<\/a>. By default, the script will use random forest; this can be changed at the command line. A maximum number of sample files can also be set at the command line, as can customized machine learning hyperparameters for those who want to tweak the generated model even further.<\/p>\n<p>As for how many files to use when running YaraML, the same rule applies as in any model building: typically, the more training data, the better. YaraML can easily scale to hundreds of thousands of examples. Depending on what sort of representative data is available, 10,000 malicious and 10,000 benign samples make a good set for both training the model and verifying the generated rules. But practically speaking, you may not have this many samples to work with.<\/p>\n<p>For example, let\u2019s look at how the PowerShell generic malware detection rule included in the GitHub repository was created. First, a collection of about 2,000 PowerShell malware samples was placed in the folder powershell_malware, and another 8,000 harmless PowerShell scripts were placed in a folder named powershell_benign. 
Then YaraML was launched with the following command line entry:<\/p>\n<pre>yaraml powershell_malware\/ powershell_benign\/ powershell_model powershell_detector --model_type=\"logisticregression\"<\/pre>\n<p>This will launch YaraML and create both a machine learning model named <strong>powershell_model<\/strong> and a YARA rule named <strong>powershell_detector<\/strong>. The <strong>--model_type<\/strong> parameter sets the analysis type to logistic regression. The <strong>model_type<\/strong> argument is handled in the YaraML script by the code shown below, which defines the machine learning task:<\/p>\n<p><a href=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/codesnip.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-86458 size-full\" src=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/codesnip.png\" alt=\"a screenshot of part of YaraML's main code\" width=\"624\" height=\"227\" srcset=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/codesnip.png 624w, https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/codesnip.png?resize=300,109 300w\" sizes=\"auto, (max-width: 624px) 100vw, 624px\" \/><\/a><\/p>\n<p>If you want to deviate from the built-in classification schemes in YaraML, you can further tweak the machine learning process by directly specifying hyperparameters that define how the model is built with the model_instantiation command line parameter. For example:<\/p>\n<pre>--model_instantiation=\"LogisticRegression(penalty='l1',solver='liblinear')\"<\/pre>\n<p>Furthermore, you can cap the number of files used to build the model with the max_benign_files and max_malicious_files parameters. This speeds up rule creation and deployment but will likely lower accuracy:<\/p>\n<pre>--max_benign_files=100 --max_malicious_files=100<\/pre>\n<p>When executed, YaraML will extract substring features from the samples and 
perform feature selection to reduce the number of features and avoid \u201coverfitting\u201d the model. It then trains a model based on the selected model type and features. Finally, YaraML &#8220;compiles&#8221; the model, turning the selected features and their weights as indicators into a textual YARA rule, which looks like the example below:<\/p>\n<p><a href=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/YARA-rule.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-86459\" src=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/YARA-rule.png\" alt=\"A screenshot of part of a YaraML-generated YARA rule\" width=\"640\" height=\"615\" srcset=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/YARA-rule.png 653w, https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/YARA-rule.png?resize=300,288 300w, https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/YARA-rule.png?resize=32,32 32w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/a><\/p>\n<p>In the rule excerpt above, YaraML used a logistic regression classifier and automatically discovered an important set of tokens (defined as $sXX) required for accurate classification of PowerShell commands. The weights of the tokens that match are then summed to create the final score.<\/p>\n<p>At the end of the rule, there is a section labeled \u201ccondition\u201d that applies a threshold for the weighted score to determine whether the detection rule is triggered. This threshold is automatically set by the tool, but it can be manually edited to tweak how the rule is applied, either to increase overall detection or to reduce false positives. 
For YARA rules generated from a logistic regression model, the condition statement is a comparison that checks the weighted sum of the matched tokens against a specific value:<\/p>\n<p><a href=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/linear-regression-yara-condition.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-86460 size-full\" src=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/linear-regression-yara-condition.png\" alt=\"Screen shot of the end of a logistic regression based YARA rule generated by YaraML\" width=\"640\" height=\"145\" srcset=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/linear-regression-yara-condition.png 781w, https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/linear-regression-yara-condition.png?resize=300,68 300w, https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/linear-regression-yara-condition.png?resize=768,174 768w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/a><\/p>\n<p>Changing the value after the \u201c&gt;\u201d changes the threshold. Lowering it makes the rule more aggressive in detecting malicious samples, at the cost of more false positives; raising it reduces false positives, at the cost of more malicious files slipping by. 
For example, here\u2019s how different thresholds affected the PowerShell rule:<\/p>\n<p><a href=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/FP-TP-ratio.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-86462\" src=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/FP-TP-ratio.png\" alt=\"A plot of the ratio of detections versus false positive rates based on different threshold levels set in a YARA rule for detecting malicious PowerShell scripts.\" width=\"640\" height=\"480\" srcset=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/FP-TP-ratio.png 667w, https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/FP-TP-ratio.png?resize=300,225 300w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/a><\/p>\n<p>For rules created with a random forest classifier, the condition is a collection of Boolean statements representing which defined trees have to be matched for an artifact to match the rule:<\/p>\n<p><a href=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/random-forest-yara-condition.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-86463\" src=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/random-forest-yara-condition.png\" alt=\"The condition section of a YARA rule generated from a random forest model.\" width=\"640\" height=\"258\" srcset=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/random-forest-yara-condition.png 756w, https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/random-forest-yara-condition.png?resize=300,121 300w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/a><\/p>\n<h3>Caveat venator<\/h3>\n<p>Those using YaraML rules should keep in mind that, as with all elements of machine learning systems, no ruleset is 100 percent accurate. Additionally, YaraML is an experimental tool, and we\u2019ve posted it to share with the community in that context. 
There\u2019s not likely to be much, if any, code maintenance, but it is offered under the Apache 2.0 license\u2014so feel free to adapt the code further and use it as you see fit under those terms.<\/p>\n<p>&nbsp;<\/p>\n<\/p><\/div>\n<p><a href=\"https:\/\/news.sophos.com\/en-us\/2022\/08\/25\/an-open-source-ml-toolkit-for-automatically-generating-yara-rules\/\" target=\"bwo\" >http:\/\/feeds.feedburner.com\/sophos\/dgdY<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p><img decoding=\"async\" src=\"https:\/\/news.sophos.com\/wp-content\/uploads\/2022\/08\/ml-tag-team.png\"\/><\/p>\n<p><strong>Credit to Author: gallagherseanm| Date: Thu, 25 Aug 2022 11:00:35 +0000<\/strong><\/p>\n<p>The SophosAI Artificial Intelligence team has developed a machine-learning based tool that generates YARA rules for detecting specific types of threats<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"colormag_page_container_layout":"default_layout","colormag_page_sidebar_layout":"default_layout","footnotes":""},"categories":[10378,10377],"tags":[27031,23226,27362,11099,27363,14788,27364],"class_list":["post-19944","post","type-post","status-publish","format-standard","hentry","category-security","category-sophos","tag-ai-research","tag-free-tools","tag-incident-response-tools","tag-open-source","tag-threat-hunting-tools","tag-yara","tag-yaraml"],"_links":{"self":[{"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/posts\/19944","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/comments?post=19944"}],"version-his
tory":[{"count":0,"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/posts\/19944\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/media?parent=19944"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/categories?post=19944"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.palada.net\/index.php\/wp-json\/wp\/v2\/tags?post=19944"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}