I. Malware Analysis
Malware (malicious code) analysis generally includes static analysis and dynamic analysis. Accordingly, the features can be divided into static features and dynamic features, depending on whether the malicious code is actually run in a user or emulated environment.
So how do we extract a malware sample's static or dynamic features? This first part gives a brief introduction to both.
1. Static Features
Static features are obtained without actually running the sample. They typically include:
- Bytecode
  - The binary converted into byte code; a fairly raw feature with no further processing.
- IAT (Import Address Table)
  - An important part of the PE structure that declares imported functions and where they live, so the loader can resolve them at run time; the table is closely related to the program's functionality.
- Android permission list
  - If an app declares permissions its functionality does not need (e.g. access to phone information), it may have malicious intent.
- Printable strings
  - Convert the binary to ASCII and compute statistics over the printable strings.
- IDA disassembly jump blocks
  - The jump (basic) blocks produced when disassembling or debugging with IDA; they can be processed into sequence data or graph data.
- Commonly used API functions
- Malware image representation (rendering the binary as an image)
Common tools for static feature extraction (a small extraction sketch in Python follows this list):
- CAPA: https://github.com/mandiant/capa
- IDA Pro
- Security vendor sandboxes
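To make the static side concrete, here is a minimal sketch (not from the original article) that uses the third-party pefile library to dump the import table and a regular expression to pull printable strings; `sample.exe` is a placeholder path.

```python
# Minimal static-feature sketch (assumption: pefile is installed; sample.exe is a placeholder).
import re
import pefile

def extract_imports(path):
    """Return (dll, function) pairs from the PE import table."""
    pe = pefile.PE(path)
    imports = []
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        dll = entry.dll.decode(errors="ignore")
        for imp in entry.imports:
            name = imp.name.decode(errors="ignore") if imp.name else str(imp.ordinal)
            imports.append((dll, name))
    return imports

def extract_strings(path, min_len=5):
    """Return printable ASCII strings of at least min_len characters."""
    with open(path, "rb") as f:
        data = f.read()
    return [s.decode() for s in re.findall(rb"[ -~]{%d,}" % min_len, data)]

if __name__ == "__main__":
    print(extract_imports("sample.exe")[:10])
    print(extract_strings("sample.exe")[:10])
```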
2. Dynamic Features
Dynamic features are more time-consuming to obtain than static features, because the code must actually be executed. They typically include:
- API call relations: a fairly telling feature; which APIs are called describes the corresponding functionality.
- Control flow graphs: widely used in software engineering; machine learning methods turn them into vectors for classification.
- Data flow graphs: likewise widely used in software engineering and vectorized for classification.
Common tools for dynamic feature extraction (a sketch for parsing a sandbox report follows this list):
- Cuckoo: https://github.com/cuckoosandbox/cuckoo
- CAPE: https://github.com/kevoreilly/CAPEv2 (docs: https://capev2.readthedocs.io/en/latest/)
- Security vendor sandboxes
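For the dynamic side, the hedged sketch below shows one way the API call sequence could be pulled out of a Cuckoo-style report.json. CAPE reports are organized similarly, but field names vary between versions, so check the schema of your own sandbox; `report.json` is a placeholder path.

```python
# Hedged sketch: extract an API call sequence from a Cuckoo-style report
# (assumption: the report uses the behavior/processes/calls layout;
#  verify the field names against your Cuckoo/CAPE version).
import json

def api_sequence(report_path):
    with open(report_path, "r", encoding="utf-8") as f:
        report = json.load(f)
    calls = []
    for proc in report.get("behavior", {}).get("processes", []):
        for call in proc.get("calls", []):
            if "api" in call:
                calls.append(call["api"])
    return calls

if __name__ == "__main__":
    seq = api_sequence("report.json")   # placeholder path
    print(len(seq), seq[:10])
```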
II. Malware Family Detection with Logistic Regression
The earlier articles in this series described in detail how to extract static and dynamic malware features, including API sequences. Next, we build machine learning models that learn from the API sequences to perform classification. The basic workflow is as follows:

1. Dataset
The dataset contains samples from five malware families; for every sample, the dynamic API sequence was successfully extracted with the CAPE tool introduced earlier. The distribution is shown below. (Readers are encouraged to extract their own datasets, e.g. from BIG2015 or BODMAS.)
| Malware Family | Class | Count | Training Set | Test Set |
|----------------|-------|-------|--------------|----------|
| AAAA | class1 | 352 | 242 | 110 |
| BBBB | class2 | 335 | 235 | 100 |
| CCCC | class3 | 363 | 243 | 120 |
| DDDD | class4 | 293 | 163 | 130 |
| EEEE | class5 | 548 | 358 | 190 |
The dataset is split into a training set and a test set, as shown in the figure below:

Each record in the dataset has four fields: index (no), malware family label (type), MD5 hash (md5), and the API sequence or features (api).

Note that feature extraction involves a large amount of data preprocessing and cleaning, which readers should adapt to their own needs. For example, the following code filters out records whose extracted API feature is empty.
```python
#coding:utf-8
#By:Eastmount CSDN 2023-05-31
import csv
import re
import os

csv.field_size_limit(500 * 1024 * 1024)

filename = "AAAA_result.csv"
writename = "AAAA_result_final.csv"
fw = open(writename, mode="w", newline="")
writer = csv.writer(fw)
writer.writerow(['no', 'type', 'md5', 'api'])

with open(filename, encoding='utf-8') as fr:
    reader = csv.reader(fr)
    no = 1
    for row in reader:          #['no','type','md5','api']
        tt = row[1]
        md5 = row[2]
        api = row[3]
        #print(no,tt,md5,api)
        #filter rows whose api field is empty (or a repeated header)
        if api == "" or api == "api":
            continue
        else:
            writer.writerow([str(no), tt, md5, api])
            no += 1
fw.close()
```
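The scripts that follow read train_dataset.csv and test_dataset.csv. The article does not show how those two files were assembled; the sketch below is merely one possible way to do it, merging the cleaned per-family CSVs (the BBBB to EEEE file names are assumptions) and taking a stratified random split, whereas the actual split in the table above may have been produced differently.

```python
# Hedged sketch: build train/test CSVs from the cleaned per-family files.
# The family file list and the 0.35 split ratio are assumptions, not the
# article's actual procedure (see the table above for the real split).
import pandas as pd
from sklearn.model_selection import train_test_split

files = ["AAAA_result_final.csv", "BBBB_result_final.csv",
         "CCCC_result_final.csv", "DDDD_result_final.csv",
         "EEEE_result_final.csv"]
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

# Stratified split so every family appears in both sets.
train_df, test_df = train_test_split(
    df, test_size=0.35, stratify=df["type"], random_state=42)

train_df.to_csv("train_dataset.csv", index=False)
test_df.to_csv("test_dataset.csv", index=False)
print(len(train_df), len(test_df))
```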
2. Model Construction
Since the machine learning algorithm itself is fairly simple, only the key code is given here. Common feature representations include TF-IDF and Word2Vec; TF-IDF is used here to compute the feature vectors (a Word2Vec sketch is given after the results below for readers who want to try it). The resulting family classifier reaches an accuracy of 0.6215.
```python
# -*- coding: utf-8 -*-
# By:Eastmount CSDN 2023-06-01
import os
import csv
import time
import numpy as np
import seaborn as sns
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

start = time.clock()    #note: time.clock() was removed in Python 3.8; use time.perf_counter() there
csv.field_size_limit(500 * 1024 * 1024)

#--------------------------- Step 1: load the dataset ------------------------
#training set
file = "train_dataset.csv"
label_train = []
content_train = []
with open(file, "r") as csv_file:
    csv_reader = csv.reader(csv_file)
    header = next(csv_reader)
    for row in csv_reader:
        label_train.append(row[1])
        value = str(row[3])
        content_train.append(value)
print(label_train[:2])
print(content_train[:2])

#test set
file = "test_dataset.csv"
label_test = []
content_test = []
with open(file, "r") as csv_file:
    csv_reader = csv.reader(csv_file)
    header = next(csv_reader)
    for row in csv_reader:
        label_test.append(row[1])
        value = str(row[3])
        content_test.append(value)
print(len(label_train), len(label_test))
print(len(content_train), len(content_test))    #1241 650

#--------------------------- Step 2: vectorization ------------------------
contents = content_train + content_test
labels = label_train + label_test

#word frequency (min_df / max_df can be tuned)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(contents)
words = vectorizer.get_feature_names()    #on scikit-learn >= 1.2 use get_feature_names_out()
print(words[:10])
print("Number of feature words:", len(words))

#compute TF-IDF
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)
weights = tfidf.toarray()

#--------------------------- Step 3: label encoding ------------------------
le = LabelEncoder()
y = le.fit_transform(labels)
X_train, X_test = weights[:1241], weights[1241:]
y_train, y_test = y[:1241], y[1241:]

#--------------------------- Step 4: classification ------------------------
clf = LogisticRegression(solver='liblinear')
clf.fit(X_train, y_train)
pre = clf.predict(X_test)
print(clf)
print(classification_report(y_test, pre, digits=4))
print("accuracy:")
print(metrics.accuracy_score(y_test, pre))

#elapsed time
elapsed = (time.clock() - start)
print("Time used:", elapsed)
```
The output is shown below:
```
1241 650
1241 650
['__anomaly__', 'accept', 'bind', 'changewindowmessagefilter', 'closesocket', 'clsidfromprogid', 'cocreateinstance', 'cocreateinstanceex', 'cogetclassobject', 'colescript_parsescripttext']
Number of feature words: 269
LogisticRegression(solver='liblinear')
              precision    recall  f1-score   support

           0     0.5398    0.5545    0.5471       110
           1     0.6526    0.6200    0.6359       100
           2     0.6596    0.5167    0.5794       120
           3     0.8235    0.5385    0.6512       130
           4     0.5665    0.7842    0.6578       190

    accuracy                         0.6215       650
   macro avg     0.6484    0.6028    0.6143       650
weighted avg     0.6438    0.6215    0.6199       650

accuracy:
0.6215384615384615
Time used: 2.2597622
```
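As mentioned above, Word2Vec is an alternative to TF-IDF. Below is a hedged sketch using gensim: each API name is treated as a word and each sample is represented by the average of its word vectors. The hyperparameters are illustrative, not tuned values from this article, and the tokenization assumes the api field is a whitespace-separated sequence.

```python
# Hedged sketch: Word2Vec document vectors as an alternative to TF-IDF
# (assumption: gensim >= 4.x; hyperparameters are illustrative only).
import numpy as np
from gensim.models import Word2Vec

def doc_vectors(contents, size=100):
    # contents: list of strings, each assumed to be a space-separated API sequence
    tokenized = [c.split() for c in contents]
    w2v = Word2Vec(sentences=tokenized, vector_size=size,
                   window=5, min_count=1, workers=4, seed=42)
    vecs = []
    for tokens in tokenized:
        words = [t for t in tokens if t in w2v.wv]
        vecs.append(np.mean(w2v.wv[words], axis=0) if words else np.zeros(size))
    return np.array(vecs)

# Usage: weights = doc_vectors(contents) could replace the TF-IDF weights above.
```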
III. Malware Family Detection with SVM
1. The SVM Model
The core idea of the SVM classification algorithm is to use a kernel function to map the data into a high-dimensional space and find a hyperplane that satisfies the classification requirement, keeping the training points as far from the separating plane as possible; in other words, it looks for a separating plane with the largest possible margin on both sides. As shown in the figure below, the training samples closest to the separating plane on either side, lying on hyperplanes parallel to the optimal separating plane, are called support vectors.
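For reference, the idea of maximizing the margin can be written as the standard hard-margin optimization problem (a textbook formulation, not taken from this article):

$$
\min_{\mathbf{w},\,b}\ \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{s.t.}\quad y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1,\quad i=1,\dots,n
$$

The geometric margin being maximized is $2/\lVert\mathbf{w}\rVert$, and the training samples for which the constraint holds with equality are exactly the support vectors.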

In the scikit-learn package, the SVM classifier is implemented by the class svm.SVC (C-Support Vector Classification), which is built on libsvm. Its constructor is:
```python
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
```
Using SVC involves two main steps (a minimal end-to-end example follows this list):
- Training: `clf.fit(data, target)`
- Prediction: `pre = clf.predict(data)`
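A minimal end-to-end illustration of these two steps on scikit-learn's built-in iris toy data (illustrative only, unrelated to the malware dataset):

```python
# Minimal SVC fit/predict example on the iris toy dataset (illustrative only).
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data, target = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.3, random_state=42)

clf = svm.SVC(kernel='rbf', gamma='auto')   # Step 1: training
clf.fit(X_train, y_train)
pre = clf.predict(X_test)                   # Step 2: prediction
print((pre == y_test).mean())               # simple accuracy check
```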
2. Code Implementation
Only the key code for SVM-based malware family classification is given below; SVM is a commonly used model in all kinds of security tasks. Note that the predictions are saved to a file here. In real experiments, it is worth saving such intermediate results so that different models' performance can be compared more easily and the paper's contribution demonstrated more clearly.
```python
# -*- coding: utf-8 -*-
# By:Eastmount CSDN 2023-06-01
import os
import csv
import time
import numpy as np
import seaborn as sns
from sklearn import svm
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

start = time.clock()
csv.field_size_limit(500 * 1024 * 1024)

#--------------------------- Step 1: load the dataset ------------------------
#training set
file = "train_dataset.csv"
label_train = []
content_train = []
with open(file, "r") as csv_file:
    csv_reader = csv.reader(csv_file)
    header = next(csv_reader)
    for row in csv_reader:
        label_train.append(row[1])
        value = str(row[3])
        content_train.append(value)
print(label_train[:2])
print(content_train[:2])

#test set
file = "test_dataset.csv"
label_test = []
content_test = []
with open(file, "r") as csv_file:
    csv_reader = csv.reader(csv_file)
    header = next(csv_reader)
    for row in csv_reader:
        label_test.append(row[1])
        value = str(row[3])
        content_test.append(value)
print(len(label_train), len(label_test))
print(len(content_train), len(content_test))    #1241 650

#--------------------------- Step 2: vectorization ------------------------
contents = content_train + content_test
labels = label_train + label_test

#word frequency (min_df / max_df can be tuned)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(contents)
words = vectorizer.get_feature_names()
print(words[:10])
print("Number of feature words:", len(words))

#compute TF-IDF
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)
weights = tfidf.toarray()

#--------------------------- Step 3: label encoding ------------------------
le = LabelEncoder()
y = le.fit_transform(labels)
X_train, X_test = weights[:1241], weights[1241:]
y_train, y_test = y[:1241], y[1241:]

#--------------------------- Step 4: classification ------------------------
clf = svm.LinearSVC()
clf.fit(X_train, y_train)
pre = clf.predict(X_test)
print(clf)
print(classification_report(y_test, pre, digits=4))
print("accuracy:")
print(metrics.accuracy_score(y_test, pre))

#save the results (one label per line)
f1 = open("svm_test_pre.txt", "w")
for n in pre:
    f1.write(str(n) + "\n")
f1.close()

f2 = open("svm_test_y.txt", "w")
for n in y_test:
    f2.write(str(n) + "\n")
f2.close()

#elapsed time
elapsed = (time.clock() - start)
print("Time used:", elapsed)
```
The experimental results are shown below:

```
1241 650
1241 650
['__anomaly__', 'accept', 'bind', 'changewindowmessagefilter', 'closesocket', 'clsidfromprogid', 'cocreateinstance', 'cocreateinstanceex', 'cogetclassobject', 'colescript_parsescripttext']
Number of feature words: 269
LinearSVC()
              precision    recall  f1-score   support

           0     0.6439    0.7727    0.7025       110
           1     0.8780    0.7200    0.7912       100
           2     0.7315    0.6583    0.6930       120
           3     0.9091    0.6154    0.7339       130
           4     0.6583    0.8316    0.7349       190

    accuracy                         0.7292       650
   macro avg     0.7642    0.7196    0.7311       650
weighted avg     0.7534    0.7292    0.7301       650

accuracy:
0.7292307692307692
Time used: 2.2672032
```
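Because the predictions and ground-truth labels are written to svm_test_pre.txt and svm_test_y.txt above (one label per line), they can later be reloaded for comparison. The sketch below, which is not part of the original article, shows one way to plot a confusion matrix from those files using seaborn.

```python
# Hedged sketch: reload the saved predictions and plot a confusion matrix
# (assumes one integer label per line, as written by the script above).
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

pre = np.loadtxt("svm_test_pre.txt", dtype=int)
y_test = np.loadtxt("svm_test_y.txt", dtype=int)

cm = confusion_matrix(y_test, pre)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted family")
plt.ylabel("True family")
plt.title("SVM Confusion Matrix")
plt.show()
```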
IV. Malware Family Detection with Random Forest
The key code for this part is given below, with additional code for visualization.
```python
# -*- coding: utf-8 -*-
# By:Eastmount CSDN 2023-06-01
import os
import csv
import time
import numpy as np
import seaborn as sns
from sklearn import svm
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

start = time.clock()
csv.field_size_limit(500 * 1024 * 1024)

#--------------------------- Step 1: load the dataset ------------------------
#training set
file = "train_dataset.csv"
label_train = []
content_train = []
with open(file, "r") as csv_file:
    csv_reader = csv.reader(csv_file)
    header = next(csv_reader)
    for row in csv_reader:
        label_train.append(row[1])
        value = str(row[3])
        content_train.append(value)
print(label_train[:2])
print(content_train[:2])

#test set
file = "test_dataset.csv"
label_test = []
content_test = []
with open(file, "r") as csv_file:
    csv_reader = csv.reader(csv_file)
    header = next(csv_reader)
    for row in csv_reader:
        label_test.append(row[1])
        value = str(row[3])
        content_test.append(value)
print(len(label_train), len(label_test))
print(len(content_train), len(content_test))    #1241 650

#--------------------------- Step 2: vectorization ------------------------
contents = content_train + content_test
labels = label_train + label_test

#word frequency (min_df / max_df can be tuned)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(contents)
words = vectorizer.get_feature_names()
print(words[:10])
print("Number of feature words:", len(words))

#compute TF-IDF
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)
weights = tfidf.toarray()

#--------------------------- Step 3: label encoding ------------------------
le = LabelEncoder()
y = le.fit_transform(labels)
X_train, X_test = weights[:1241], weights[1241:]
y_train, y_test = y[:1241], y[1241:]

#--------------------------- Step 4: classification ------------------------
clf = RandomForestClassifier(n_estimators=5)
clf.fit(X_train, y_train)
pre = clf.predict(X_test)
print(clf)
print(classification_report(y_test, pre, digits=4))
print("accuracy:")
print(metrics.accuracy_score(y_test, pre))

#save the results (one label per line)
f1 = open("rf_test_pre.txt", "w")
for n in pre:
    f1.write(str(n) + "\n")
f1.close()

f2 = open("rf_test_y.txt", "w")
for n in y_test:
    f2.write(str(n) + "\n")
f2.close()

#elapsed time
elapsed = (time.clock() - start)
print("Time used:", elapsed)

#--------------------------- Step 5: visualization ------------------------
#dimensionality reduction
pca = PCA(n_components=2)
pca = pca.fit(X_test)
xx = pca.transform(X_test)

#plot
plt.figure()
plt.scatter(xx[:, 0], xx[:, 1], c=y_test, s=50)
plt.title("Malware Family Detection")
plt.show()
```
The output is shown below; the accuracy reaches 0.8092, which is not bad.
```
1241 650
1241 650
['__anomaly__', 'accept', 'bind', 'changewindowmessagefilter', 'closesocket', 'clsidfromprogid', 'cocreateinstance', 'cocreateinstanceex', 'cogetclassobject', 'colescript_parsescripttext']
Number of feature words: 269
RandomForestClassifier(n_estimators=5)
              precision    recall  f1-score   support

           0     0.7185    0.8818    0.7918       110
           1     0.9000    0.8100    0.8526       100
           2     0.7963    0.7167    0.7544       120
           3     0.9444    0.7846    0.8571       130
           4     0.7656    0.8421    0.8020       190

    accuracy                         0.8092       650
   macro avg     0.8250    0.8070    0.8116       650
weighted avg     0.8197    0.8092    0.8103       650

accuracy:
0.8092307692307692
Time used: 2.1914324
```
The five malware families are also visualized as a 2D scatter plot (figure below). However, the overall separation is mediocre, and further tuning of the code or of the projection dimensions would be needed to distinguish the classes better, for example with a 3D scatter plot, as sketched next. Readers are encouraged to explore this further.
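As one option, the hedged sketch below extends the visualization above to a three-component PCA and a 3D scatter plot; it reuses the X_test and y_test variables from the script above and is not part of the original article.

```python
# Hedged sketch: 3D visualization with a three-component PCA
# (X_test and y_test come from the random forest script above;
#  on older matplotlib versions `from mpl_toolkits.mplot3d import Axes3D` may be needed).
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca3 = PCA(n_components=3)
xx3 = pca3.fit_transform(X_test)

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(xx3[:, 0], xx3[:, 1], xx3[:, 2], c=y_test, s=50)
ax.set_title("Malware Family Detection (3D PCA)")
plt.show()
```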

V. Summary
That brings this article to a close; I hope it has been helpful. May has been a truly busy month with projects, proposals, papers, and graduation matters. Once things calm down, I will write a few more security blog posts. Thank you for your support and company, especially the encouragement and support of my family. Keep going!