Applying Deep Learning for PE-Malware Classification

Credit to Author: Ajay| Date: Thu, 10 Jan 2019 08:34:28 +0000

Estimated reading time: 5 minutes

Deep Learning and computer vision techniques are making progress in almost every field. With growing computing power, many organizations use them to solve or reduce everyday problems. In a recent talk at AVAR 2018, the Quick Heal AI team presented an approach for effectively using Deep Learning for malware classification. This blog post gives the technical details of that approach.

Introduction

At the root of most malware attacks lie PE (Portable Executable) files, which ultimately cause the damage. A typical attack begins with the download of a PE file via email, a website, or another commonly used mechanism. Traditional methods of detecting such malicious PE files range from rule-based and signature-based static methods to behavior-based dynamic methods such as emulators and sandboxes. But these methods are falling short at generically detecting advanced malware.

The figure below shows one such example, Crysis ransomware (a decryption tool for Crysis is available). The sample is so heavily obfuscated that it easily bypasses rule-based and signature-based detection mechanisms.

(Fig 1: Ransom Crysis Obfuscated)

To combat such techniques, we introduced Machine Learning based detection, where we use algorithms such as SVM and Random Forest for generic detection. But ML needs sample collection and feature extraction before training. Feature engineering is tedious and requires human expertise and time. It is also becoming more complicated day by day as malware finds ways to bypass it. Techniques like adversarial ML, where malware samples are crafted to bypass ML models, are evolving rapidly to evade property-based ML models.

Let's take a look at another example. Below, the header of an XPaj infector sample is shown before and after infection.

(Fig 2: XPaj Infector sample before & after infection)

XPaj makes almost no change to the header of the original file when appending to it.
ML models trained only on file header attributes cannot detect these samples.

Deep Learning has come a long way recently in the field of image classification and computer vision, with many success stories on benchmarks such as ImageNet and architectures such as ResNet. So we thought about applying image classification to detect malicious files.

Converting PE files into Images

The first challenge is representing PE files in the form of images. Drawing on a large set of sources, we have a sizeable collection of PE files in our data set, and we label these files as well. To generate images, we used a well-known open source tool called PortEx. This tool mainly generates three types of images, as shown in the figures below.

The first (Fig 3) is a byte-plot image, where each byte or group of bytes is represented by a pixel of a different color. Zeroes are then padded at the end to keep the image size constant. In this byte-plot image, 0x00 bytes are black pixels, 0xFF bytes are white pixels, visible ASCII characters are blue pixels, and so on.

(Fig 3: Byte-plot image of PE file)

In the second type of image (Fig 4), bytes are represented by shades of a single color: grayscale pixels, with 0x00 being black and 0xFF white.

(Fig 4: byte entropy image of PE file)

In the last type (Fig 5), the PE structure image, every structure of a PE file is shown in differently colored pixels. In the example, green represents resources, yellow represents imports, and sky blue represents appended data.

(Fig 5: PE structure image)

Once PE files are represented as images, we can spot the similarities between two different files simply by looking at them. Figures 6 and 7 show the similarities between two different samples of WannaCry and of Cerber ransomware, respectively.

(Fig 6: Byte-plot images of Wannacry samples)

(Fig 7: Byte-plot images of Cerber samples)

Deep Learning Overview

Having a large amount of data, i.e. images converted from both clean and malware files, we now apply Deep Learning algorithms to these samples. Deep Learning (DL), or Deep Neural Networks (DNN), is a special class of Machine Learning (ML). Artificial Neural Networks (ANN) are the building blocks of DNNs. ANNs take inspiration from biological nervous systems. Image classification problems use a special…
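
As a rough illustration of the conversion step described under "Converting PE files into Images", the sketch below maps each byte of a file to one grayscale pixel (0x00 black, 0xFF white) and zero-pads the end so every image has the same size. It is not PortEx and does not reproduce its color scheme. The 256 x 256 image size, the truncation of larger files, the PGM output format, and all function names are assumptions made for this example.

```python
from pathlib import Path

WIDTH = 256    # assumed image width in pixels
HEIGHT = 256   # assumed image height in pixels


def bytes_to_grayscale(data: bytes, width: int = WIDTH, height: int = HEIGHT) -> bytes:
    """Map each byte to one grayscale pixel (0x00 = black, 0xFF = white).

    Input longer than width * height is truncated (an assumption of this
    sketch); shorter input is zero-padded at the end so that every image
    has the same size.
    """
    size = width * height
    return data[:size].ljust(size, b"\x00")


def save_pgm(pixels: bytes, path: str, width: int = WIDTH, height: int = HEIGHT) -> None:
    """Write the pixel buffer as a binary PGM (P5) grayscale image."""
    if len(pixels) != width * height:
        raise ValueError("pixel buffer does not match image dimensions")
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    Path(path).write_bytes(header + pixels)


def pe_file_to_image(pe_path: str, out_path: str) -> None:
    """Read a file's raw bytes and store them as a fixed-size grayscale image."""
    pixels = bytes_to_grayscale(Path(pe_path).read_bytes())
    save_pgm(pixels, out_path)
```

With hypothetical file names, a call such as `pe_file_to_image("sample.exe", "sample.pgm")` would produce an image that most image viewers can open. PGM is used here only because it needs no third-party library; a real pipeline would more likely write PNG files and pick the image size to suit the model.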