I. Malware Analysis
Malware (malicious code) analysis typically includes static analysis and dynamic analysis. Depending on whether the sample is actually executed in a user or emulated environment, its features can be divided into static features and dynamic features.
How, then, are a malware sample's static and dynamic features extracted? This first part gives a brief overview of both.
1. Static Features
Static features are obtained without actually running the sample. They typically include:
- Byte code: the binary converted into a raw byte sequence; one of the most primitive features, used without any further processing.
- IAT table: an important part of the PE structure that declares imported functions and their locations so they can be resolved when the program runs; the table correlates closely with the program's functionality.
- Android permission list: an app that declares permissions its features do not need (e.g., access to phone information) may have malicious intent.
- Printable characters: convert the binary to ASCII and run statistics over the printable strings.
- IDA disassembly jump blocks: the jump (basic) blocks produced when disassembling in IDA, processed into sequence data or graph data.
- Commonly used API functions.
- Malware visualization (rendering the binary as a grayscale image).
Static feature extraction tools (a small extraction sketch follows this list):
- CAPA
  - https://github.com/mandiant/capa
- IDA Pro
- Security vendor sandboxes
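As a concrete illustration of three of the static features above (IAT imports, printable strings, and malware visualization), here is a minimal sketch using the `pefile` library; the file path `sample.exe` and the minimum string length are placeholder assumptions:

```python
# A minimal static-feature sketch, assuming pefile is installed
# (pip install pefile) and sample.exe is a PE file on disk.
import re
import numpy as np
import pefile

def extract_iat(path):
    """Return the imported API names declared in the IAT."""
    pe = pefile.PE(path)
    apis = []
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        for imp in entry.imports:
            if imp.name:  # imports by ordinal have name == None
                apis.append(imp.name.decode(errors="ignore"))
    return apis

def extract_strings(path, min_len=5):
    """Return printable ASCII strings of at least min_len characters."""
    with open(path, "rb") as f:
        data = f.read()
    return [s.decode() for s in re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)]

def to_grayscale_image(path, width=256):
    """Malware visualization: reshape the raw bytes into a 2-D grayscale array."""
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
    rows = len(data) // width
    return data[:rows * width].reshape(rows, width)

if __name__ == "__main__":
    print(extract_iat("sample.exe")[:10])
    print(extract_strings("sample.exe")[:10])
    print(to_grayscale_image("sample.exe").shape)
```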
2. Dynamic Features
Compared with static features, dynamic features are more costly to obtain because the code must actually be executed. They typically include:
- API call sequences/relations: the most telling feature; which APIs are called describes the corresponding behavior.
- Control-flow graphs: common in software engineering; machine learning embeds them as vectors for classification.
- Data-flow graphs: likewise common in software engineering; embedded as vectors for classification.
Dynamic feature extraction tools (a report-parsing sketch follows this list):
- Cuckoo
  - https://github.com/cuckoosandbox/cuckoo
- CAPE
  - https://github.com/kevoreilly/CAPEv2
  - https://capev2.readthedocs.io/en/latest/
- Security vendor sandboxes
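Both Cuckoo and CAPE emit a JSON report per analyzed sample. A minimal sketch of flattening such a report into an API call sequence is shown below; the `behavior -> processes -> calls` layout follows Cuckoo's report.json schema, which CAPE inherits, but treat the exact keys as an assumption to verify against your sandbox version:

```python
# A minimal sketch: flatten a Cuckoo/CAPE JSON report into an API sequence.
# The key layout is an assumption; check it against your sandbox's report format.
import json

def api_sequence(report_path):
    with open(report_path, encoding="utf-8") as f:
        report = json.load(f)
    apis = []
    for proc in report.get("behavior", {}).get("processes", []):
        for call in proc.get("calls", []):
            if "api" in call:
                apis.append(call["api"])
    return apis

if __name__ == "__main__":
    seq = api_sequence("report.json")
    print(";".join(seq[:20]))   # matches the semicolon-joined format used later
```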
II. CNN-Based Malicious Family Detection
Previous articles in this series described in detail how to extract static and dynamic malware features, including API sequences. Next, we build deep learning models that learn these API sequences to classify samples. The basic workflow is as follows:

1. Dataset
The dataset contains samples from 5 malicious families; each sample's dynamic API sequence was successfully extracted with the CAPE tool described earlier. The distribution is shown below (readers are encouraged to extract features from their own samples, e.g., from the BIG2015 or BODMAS datasets):
| Malicious family | Class label | Total | Training set | Test set |
|------------------|-------------|-------|--------------|----------|
| AAAA             | class1      | 352   | 242          | 110      |
| BBBB             | class2      | 335   | 235          | 100      |
| CCCC             | class3      | 363   | 243          | 120      |
| DDDD             | class4      | 293   | 163          | 130      |
| EEEE             | class5      | 548   | 358          | 190      |

The dataset is split into a training set and a test set, as shown in the figure below:

The dataset mainly contains four fields: the index (no), the malicious family class (type), the MD5 hash (md5), and the API sequence/features (api).
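For reference, a cleaned row looks roughly like this (the md5 below is a made-up placeholder; the api field is the semicolon-joined call sequence that appears in the intermediate output later):

```
no,type,md5,api
1,class1,<md5-placeholder>,GetSystemInfo;HeapCreate;NtAllocateVirtualMemory;...
```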

Note that the feature extraction stage involves a great deal of data preprocessing and cleaning, which readers should adapt to their own needs. For example, the code below filters out rows whose extracted feature (API) field is empty.
```python
#coding:utf-8
#By:Eastmount CSDN 2023-05-31
import csv
import re
import os

csv.field_size_limit(500 * 1024 * 1024)
filename = "AAAA_result.csv"
writename = "AAAA_result_final.csv"
fw = open(writename, mode="w", newline="")
writer = csv.writer(fw)
writer.writerow(['no', 'type', 'md5', 'api'])
with open(filename, encoding='utf-8') as fr:
    reader = csv.reader(fr)
    no = 1
    for row in reader:          # ['no','type','md5','api']
        tt = row[1]
        md5 = row[2]
        api = row[3]
        #print(no,tt,md5,api)
        # skip rows with an empty API field (and the header row)
        if api == "" or api == "api":
            continue
        else:
            writer.writerow([str(no), tt, md5, api])
            no += 1
fw.close()   # flush and close the output file
```
2. Model Construction
The basic steps of this model are as follows:
- Step 1: Read the data
- Step 2: Encode the labels with OneHotEncoder()
- Step 3: Encode the API token sequences with Tokenizer
- Step 4: Build and train the CNN model
- Step 5: Predict and evaluate
- Step 6: Validate the algorithm
The constructed model is shown in the figure below:

The complete code is as follows:
```python
# -*- coding: utf-8 -*-
# By:Eastmount CSDN 2023-06-27
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.layers import Convolution1D, MaxPool1D, Flatten
from keras.optimizers import RMSprop
from keras.layers import Bidirectional
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping
from keras.models import load_model
from keras.models import Sequential
from keras.layers.merge import concatenate
import time

"""
import os
os.environ["CUDA_DEVICES_ORDER"] = "PCI_BUS_IS"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
"""

start = time.clock()

#---------------------------------------Step 1: Data reading------------------------------------
# Read the datasets
train_df = pd.read_csv("..\\train_dataset.csv")
val_df = pd.read_csv("..\\val_dataset.csv")
test_df = pd.read_csv("..\\test_dataset.csv")

# Pin the dtype if needed, otherwise AttributeError: 'float' object has no
# attribute 'lower' can occur when some text fields are empty
# train_df.SentimentText = train_df.SentimentText.astype(str)
print(train_df.head())

# Configure Chinese font rendering for matplotlib
plt.rcParams['font.sans-serif'] = ['KaiTi']   # default font (SimHei also works)
plt.rcParams['axes.unicode_minus'] = False    # render minus signs correctly in saved figures

#---------------------------------Step 2: OneHotEncoder() encoding---------------------------------
# Encode the label column (columns: no apt md5 api)
train_y = train_df.apt
print("Label:")
print(train_y[:10])
val_y = val_df.apt
test_y = test_df.apt
le = LabelEncoder()
train_y = le.fit_transform(train_y).reshape(-1, 1)
print("LabelEncoder")
print(train_y[:10])
print(len(train_y))
val_y = le.transform(val_y).reshape(-1, 1)
test_y = le.transform(test_y).reshape(-1, 1)
Labname = le.classes_
print(Labname)

# One-hot encode the labels
ohe = OneHotEncoder()
train_y = ohe.fit_transform(train_y).toarray()
val_y = ohe.transform(val_y).toarray()
test_y = ohe.transform(test_y).toarray()
print("OneHotEncoder:")
print(train_y[:10])

#-------------------------------Step 3: Tokenizer encoding-------------------------------
# After creating a Tokenizer, fit_on_texts() splits on whitespace to identify tokens
# and indexes each token by frequency: the more frequent, the smaller the index
max_words = 1000
max_len = 200
tok = Tokenizer(num_words=max_words)   # keep at most 1000 tokens
print(train_df.api[:5])
print(type(train_df.api))

# Extract tokens: api
train_value = train_df.api
train_content = [str(a) for a in train_value.tolist()]
val_value = val_df.api
val_content = [str(a) for a in val_value.tolist()]
test_value = test_df.api
test_content = [str(a) for a in test_value.tolist()]
tok.fit_on_texts(train_content)
print(tok)

# Save and reload the fitted Tokenizer
# saving
with open('tok.pickle', 'wb') as handle:
    pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
# loading
with open('tok.pickle', 'rb') as handle:
    tok = pickle.load(handle)

# word_index maps each token to its index
# word_counts maps each token to its frequency
for ii, iterm in enumerate(tok.word_index.items()):
    if ii < 10:
        print(iterm)
    else:
        break
print("===================")
for ii, iterm in enumerate(tok.word_counts.items()):
    if ii < 10:
        print(iterm)
    else:
        break

# texts_to_sequences() converts each sample to a sequence of token indices;
# sequence.pad_sequences() pads/truncates every sequence to the same length,
# so each sample becomes a fixed-length vector
train_seq = tok.texts_to_sequences(train_content)
val_seq = tok.texts_to_sequences(val_content)
test_seq = tok.texts_to_sequences(test_content)

# Pad every sequence to the same length
train_seq_mat = sequence.pad_sequences(train_seq, maxlen=max_len)
val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
test_seq_mat = sequence.pad_sequences(test_seq, maxlen=max_len)
print(train_seq_mat.shape)   # (1241, 200)
print(val_seq_mat.shape)     # (459, 200)
print(test_seq_mat.shape)    # (650, 200)
print(train_seq_mat[:2])

#-------------------------------Step 4: Build and train the CNN model-------------------------------
num_labels = 5
inputs = Input(name='inputs', shape=[max_len], dtype='float64')

# Word embedding (pre-trained vectors could be loaded here)
layer = Embedding(max_words + 1, 256, input_length=max_len, trainable=False)(inputs)

# Convolution with a word window of size 3
cnn = Convolution1D(256, 3, padding='same', strides=1, activation='relu')(layer)
cnn = MaxPool1D(pool_size=3)(cnn)

# Flatten, regularize, and classify
flat = Flatten()(cnn)
drop = Dropout(0.4)(flat)
main_output = Dense(num_labels, activation='softmax')(drop)
model = Model(inputs=inputs, outputs=main_output)
model.summary()
model.compile(loss="categorical_crossentropy",
              optimizer='adam',      # RMSprop()
              metrics=["accuracy"])

# Guard flag so an already trained model is not retrained
flag = "train"
if flag == "train":
    print("Training the model")
    model_fit = model.fit(train_seq_mat, train_y, batch_size=64, epochs=15,
                          validation_data=(val_seq_mat, val_y),
                          callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.001)]  # stop when val_loss stops improving
                          )
    # Save the model
    model.save('cnn_model.h5')
    del model   # deletes the existing model
    # Timing
    elapsed = (time.clock() - start)
    print("Time used:", elapsed)
    print(model_fit.history)
else:
    print("Predicting with the model")
    # Load the trained model
    model = load_model('cnn_model.h5')

    #--------------------------------------Step 5: Prediction and evaluation--------------------------------
    # Predict on the test set
    test_pre = model.predict(test_seq_mat)
    # Evaluate: confusion matrix and classification report
    confm = metrics.confusion_matrix(np.argmax(test_y, axis=1), np.argmax(test_pre, axis=1))
    print(confm)
    print(metrics.classification_report(np.argmax(test_y, axis=1),
                                        np.argmax(test_pre, axis=1),
                                        digits=4))
    print("accuracy", metrics.accuracy_score(np.argmax(test_y, axis=1),
                                             np.argmax(test_pre, axis=1)))
    # Store the results
    f1 = open("cnn_test_pre.txt", "w")
    for n in np.argmax(test_pre, axis=1):
        f1.write(str(n) + "\n")
    f1.close()

    f2 = open("cnn_test_y.txt", "w")
    for n in np.argmax(test_y, axis=1):
        f2.write(str(n) + "\n")
    f2.close()

    plt.figure(figsize=(8, 8))
    sns.heatmap(confm.T, square=True, annot=True, fmt='d', cbar=False,
                linewidths=.6, cmap="YlGnBu")
    plt.xlabel('True label', size=14)
    plt.ylabel('Predicted label', size=14)
    plt.xticks(np.arange(5) + 0.5, Labname, size=12)
    plt.yticks(np.arange(5) + 0.5, Labname, size=12)
    plt.savefig('cnn_result.png')
    plt.show()

    #--------------------------------------Step 6: Validation--------------------------------
    # Re-preprocess the validation set with the saved tok
    val_seq = tok.texts_to_sequences(val_content)
    # Pad every sequence to the same length
    val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
    # Predict on the validation set
    val_pre = model.predict(val_seq_mat)
    print(metrics.classification_report(np.argmax(val_y, axis=1),
                                        np.argmax(val_pre, axis=1),
                                        digits=4))
    print("accuracy", metrics.accuracy_score(np.argmax(val_y, axis=1),
                                             np.argmax(val_pre, axis=1)))
    # Timing
    elapsed = (time.clock() - start)
    print("Time used:", elapsed)
```
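Once trained, the saved artifacts can be reused without retraining. Below is a minimal inference sketch, assuming the cnn_model.h5 and tok.pickle produced above; the new_sample string is a hypothetical semicolon-joined API sequence:

```python
# A minimal inference sketch, assuming cnn_model.h5 and tok.pickle exist.
import pickle
import numpy as np
from keras.models import load_model
from keras.preprocessing import sequence

with open('tok.pickle', 'rb') as handle:
    tok = pickle.load(handle)
model = load_model('cnn_model.h5')

# hypothetical sample; Tokenizer's default filters treat ';' as a separator
new_sample = "GetSystemInfo;HeapCreate;NtAllocateVirtualMemory"
seq = tok.texts_to_sequences([new_sample])
mat = sequence.pad_sequences(seq, maxlen=200)   # must match max_len used in training
print("predicted class index:", np.argmax(model.predict(mat), axis=1)[0])
```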
3. Experimental Results
The final run results and the generated files are shown below:

The intermediate output is as follows:
```
   no  ...                                                api
0   1  ...  GetSystemInfo;HeapCreate;NtAllocateVirtualMemo...
1   2  ...  GetSystemInfo;HeapCreate;NtAllocateVirtualMemo...
2   3  ...  NtQueryValueKey;GetSystemTimeAsFileTime;HeapCr...
3   4  ...  NtQueryValueKey;NtClose;NtAllocateVirtualMemor...
4   5  ...  NtOpenFile;NtCreateSection;NtMapViewOfSection;...

[5 rows x 4 columns]
Label:
0    class1
1    class1
2    class1
3    class1
4    class1
5    class1
6    class1
7    class1
8    class1
9    class1
Name: apt, dtype: object
LabelEncoder
[[0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]]
1241
['class1' 'class2' 'class3' 'class4' 'class5']
OneHotEncoder:
[[1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]]
0    GetSystemInfo;HeapCreate;NtAllocateVirtualMemo...
1    GetSystemInfo;HeapCreate;NtAllocateVirtualMemo...
2    NtQueryValueKey;GetSystemTimeAsFileTime;HeapCr...
3    NtQueryValueKey;NtClose;NtAllocateVirtualMemor...
4    NtOpenFile;NtCreateSection;NtMapViewOfSection;...
Name: api, dtype: object
<class 'pandas.core.series.Series'>
<keras_preprocessing.text.Tokenizer object at 0x0000028E55D36B08>

('regqueryvalueexw', 1)
('ntclose', 2)
('ldrgetprocedureaddress', 3)
('regopenkeyexw', 4)
('regclosekey', 5)
('ntallocatevirtualmemory', 6)
('sendmessagew', 7)
('ntwritefile', 8)
('process32nextw', 9)
('ntdeviceiocontrolfile', 10)
===================
('getsysteminfo', 2651)
('heapcreate', 2996)
('ntallocatevirtualmemory', 115547)
('ntqueryvaluekey', 24120)
('getsystemtimeasfiletime', 52727)
('ldrgetdllhandle', 25135)
('ldrgetprocedureaddress', 199952)
('memcpy', 9008)
('setunhandledexceptionfilter', 1504)
('ntcreatefile', 43260)

(1241, 200)
(459, 200)
(650, 200)
[[3 135 3 3 2 21 3 3 4 3 96 3 3 4 96 4 96 20 22 20 3 6 6 23 128 129 3 103 23 56 2 103 23 20 3 23 3 3 3 3 4 1 5 23 12 131 12 20 3 10 2 10 2 20 3 4 5 27 3 10 2 6 10 2 3 10 2 10 2 3 10 2 10 2 10 2 10 2 10 2 3 10 2 10 2 10 2 10 2 3 3 3 36 4 3 23 20 3 5 207 34 6 6 6 11 11 6 11 6 6 6 6 6 6 6 6 6 11 6 6 11 6 11 6 11 6 6 11 6 34 3 141 3 140 3 3 141 34 6 2 21 4 96 4 96 4 96 23 3 3 12 131 12 10 2 10 2 4 5 27 10 2 6 10 2 10 2 10 2 10 2 10 2 10 2 10 2 10 2 10 2 10 2 10 2 10 2 36 4 23 5 207 6 3 3 12 131 12 132 3]
 [27 4 27 4 27 4 27 4 27 27 5 27 4 27 4 27 27 27 27 27 27 27 5 27 4 27 4 27 4 27 4 27 4 27 4 27 4 27 4 27 4 27 5 52 2 21 4 5 1 1 1 5 21 25 2 52 12 33 51 28 34 30 2 52 2 21 4 5 27 5 52 6 6 52 4 1 5 4 52 54 7 7 20 52 7 52 7 7 6 4 4 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 5 5 3 7 50 50 50 95 50 50 50 50 50 4 1 5 4 3 3 3 3 3 7 7 7 3 7 3 7 3 60 3 3 7 7 7 7 60 3 7 7 7 7 7 7 7 7 52 20 3 3 3 14 14 60 18 19 18 19 2 21 4 5 18 19 18 19 18 19 18 19 7 7 7 7 7 7 7 7 7 7 7 52 7 7 7 7 7 60 7 7 7 7]]
```
The training process looks like this:
```
Training the model
Epoch 1/15
 1/20 [>.............................] - ETA: 5s - loss: 1.5986 - accuracy: 0.2656
...
20/20 [==============================] - 20s 1s/step - loss: 1.3763 - accuracy: 0.4674 - val_loss: 1.3056 - val_accuracy: 0.4837
Time used: 26.1328806
{'loss': [1.3762551546096802], 'accuracy': [0.467365026473999], 'val_loss': [1.305567979812622], 'val_accuracy': [0.48366013169288635]}
```
The final prediction results are as follows:
```
Predicting with the model
[[ 40  14  11   1  44]
 [ 16  57  10   0  17]
 [  6  30  61   0  23]
 [ 12  20  15  47  36]
 [ 11  14  19   0 146]]
              precision    recall  f1-score   support

           0     0.4706    0.3636    0.4103       110
           1     0.4222    0.5700    0.4851       100
           2     0.5259    0.5083    0.5169       120
           3     0.9792    0.3615    0.5281       130
           4     0.5489    0.7684    0.6404       190

    accuracy                         0.5400       650
   macro avg     0.5893    0.5144    0.5162       650
weighted avg     0.5980    0.5400    0.5323       650

accuracy 0.54
              precision    recall  f1-score   support

           0     0.9086    0.4517    0.6034       352
           1     0.5943    0.5888    0.5915       107
           2     0.0000    0.0000    0.0000         0
           3     0.0000    0.0000    0.0000         0
           4     0.0000    0.0000    0.0000         0

    accuracy                         0.4837       459
   macro avg     0.3006    0.2081    0.2390       459
weighted avg     0.8353    0.4837    0.6006       459

accuracy 0.48366013071895425
Time used: 14.170902800000002
```

Food for thought:
The overall prediction performance is poor. Why might that be? Can it be improved by tuning hyperparameters, and how else could the algorithm be improved? This article only provides the basic approach and code; further optimization and refinement are left for readers to work out on their own. Keep at it!
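One possible direction (an assumption of mine, not something validated in this article) is to offset the class imbalance with class weights during training; sklearn can compute balanced weights that Keras' fit() accepts:

```python
# A hedged sketch: balanced class weights for model.fit(), assuming the
# one-hot train_y from the pipeline above.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

train_labels = np.argmax(train_y, axis=1)   # recover integer labels from one-hot
weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(train_labels),
                               y=train_labels)
class_weight = dict(enumerate(weights))
# then: model.fit(..., class_weight=class_weight)
```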
III. BiLSTM-Based Malicious Family Detection
1. Model Construction
The basic steps of this model are as follows:
- Step 1: Read the data
- Step 2: Encode the labels with OneHotEncoder()
- Step 3: Encode the API token sequences with Tokenizer
- Step 4: Build and train the BiLSTM model
- Step 5: Predict and evaluate
- Step 6: Validate the algorithm
The constructed model is shown in the figure below:

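For background, the LSTM cell wrapped by Bidirectional() in the code below computes the standard gate equations (textbook formulas, not specific to this article); the bidirectional wrapper runs one LSTM forward and one backward over the API sequence and concatenates the two final hidden states:

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```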
The complete code is as follows:
```python
# -*- coding: utf-8 -*-
# By:Eastmount CSDN 2023-06-27
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.layers import Convolution1D, MaxPool1D, Flatten
from keras.optimizers import RMSprop
from keras.layers import Bidirectional
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping
from keras.models import load_model
from keras.models import Sequential
from keras.layers.merge import concatenate
import time

start = time.clock()

#---------------------------------------Step 1: Data reading------------------------------------
# Read the datasets
train_df = pd.read_csv("..\\train_dataset.csv")
val_df = pd.read_csv("..\\val_dataset.csv")
test_df = pd.read_csv("..\\test_dataset.csv")
print(train_df.head())

# Configure Chinese font rendering for matplotlib
plt.rcParams['font.sans-serif'] = ['KaiTi']
plt.rcParams['axes.unicode_minus'] = False

#---------------------------------Step 2: OneHotEncoder() encoding---------------------------------
# Encode the label column (columns: no apt md5 api)
train_y = train_df.apt
val_y = val_df.apt
test_y = test_df.apt
le = LabelEncoder()
train_y = le.fit_transform(train_y).reshape(-1, 1)
val_y = le.transform(val_y).reshape(-1, 1)
test_y = le.transform(test_y).reshape(-1, 1)
Labname = le.classes_

# One-hot encode the labels
ohe = OneHotEncoder()
train_y = ohe.fit_transform(train_y).toarray()
val_y = ohe.transform(val_y).toarray()
test_y = ohe.transform(test_y).toarray()

#-------------------------------Step 3: Tokenizer encoding-------------------------------
max_words = 2000
max_len = 300
tok = Tokenizer(num_words=max_words)

# Extract tokens: api
train_value = train_df.api
train_content = [str(a) for a in train_value.tolist()]
val_value = val_df.api
val_content = [str(a) for a in val_value.tolist()]
test_value = test_df.api
test_content = [str(a) for a in test_value.tolist()]
tok.fit_on_texts(train_content)
print(tok)

# Save and reload the fitted Tokenizer
with open('tok.pickle', 'wb') as handle:
    pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('tok.pickle', 'rb') as handle:
    tok = pickle.load(handle)

# Convert each sample to a sequence of token indices
train_seq = tok.texts_to_sequences(train_content)
val_seq = tok.texts_to_sequences(val_content)
test_seq = tok.texts_to_sequences(test_content)

# Pad every sequence to the same length
train_seq_mat = sequence.pad_sequences(train_seq, maxlen=max_len)
val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
test_seq_mat = sequence.pad_sequences(test_seq, maxlen=max_len)

#-------------------------------Step 4: Build and train the BiLSTM model-------------------------------
num_labels = 5
model = Sequential()
model.add(Embedding(max_words + 1, 128, input_length=max_len))
#model.add(Bidirectional(LSTM(128, dropout=0.3, recurrent_dropout=0.1)))
model.add(Bidirectional(LSTM(128)))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(num_labels, activation='softmax'))
model.summary()
model.compile(loss="categorical_crossentropy",
              optimizer='adam',
              metrics=["accuracy"])

flag = "train"
if flag == "train":
    print("Training the model")
    model_fit = model.fit(train_seq_mat, train_y, batch_size=64, epochs=15,
                          validation_data=(val_seq_mat, val_y),
                          callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.0001)]
                          )
    # Save the model
    model.save('bilstm_model.h5')
    del model   # deletes the existing model
    # Timing
    elapsed = (time.clock() - start)
    print("Time used:", elapsed)
    print(model_fit.history)
else:
    print("Predicting with the model")
    model = load_model('bilstm_model.h5')

    #--------------------------------------Step 5: Prediction and evaluation--------------------------------
    # Predict on the test set
    test_pre = model.predict(test_seq_mat)
    confm = metrics.confusion_matrix(np.argmax(test_y, axis=1), np.argmax(test_pre, axis=1))
    print(confm)
    print(metrics.classification_report(np.argmax(test_y, axis=1),
                                        np.argmax(test_pre, axis=1),
                                        digits=4))
    print("accuracy", metrics.accuracy_score(np.argmax(test_y, axis=1),
                                             np.argmax(test_pre, axis=1)))
    # Store the results
    f1 = open("bilstm_test_pre.txt", "w")
    for n in np.argmax(test_pre, axis=1):
        f1.write(str(n) + "\n")
    f1.close()

    f2 = open("bilstm_test_y.txt", "w")
    for n in np.argmax(test_y, axis=1):
        f2.write(str(n) + "\n")
    f2.close()

    plt.figure(figsize=(8, 8))
    sns.heatmap(confm.T, square=True, annot=True, fmt='d', cbar=False,
                linewidths=.6, cmap="YlGnBu")
    plt.xlabel('True label', size=14)
    plt.ylabel('Predicted label', size=14)
    plt.xticks(np.arange(5) + 0.5, Labname, size=12)
    plt.yticks(np.arange(5) + 0.5, Labname, size=12)
    plt.savefig('bilstm_result.png')
    plt.show()

    #--------------------------------------Step 6: Validation--------------------------------
    # Re-preprocess the validation set with the saved tok
    val_seq = tok.texts_to_sequences(val_content)
    val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
    # Predict on the validation set
    val_pre = model.predict(val_seq_mat)
    print(metrics.classification_report(np.argmax(val_y, axis=1),
                                        np.argmax(val_pre, axis=1),
                                        digits=4))
    print("accuracy", metrics.accuracy_score(np.argmax(val_y, axis=1),
                                             np.argmax(val_pre, axis=1)))
    # Timing
    elapsed = (time.clock() - start)
    print("Time used:", elapsed)
```
2. Experimental Results
The training output is as follows:
```
Training the model
Epoch 1/15
 1/20 [>.............................] - ETA: 40s - loss: 1.6114 - accuracy: 0.2031
...
20/20 [==============================] - 33s 2s/step - loss: 1.4825 - accuracy: 0.4327 - val_loss: 1.4187 - val_accuracy: 0.4074
Time used: 38.565846900000004
{'loss': [1.4825222492218018], 'accuracy': [0.4327155649662018], 'val_loss': [1.4187402725219727], 'val_accuracy': [0.40740740299224854]}
```
The final prediction results are as follows:
```
Predicting with the model
[[36 18 37  1 18]
 [14 46 34  0  6]
 [ 8 29 73  0 10]
 [16 29 14 45 26]
 [47 15 33  0 95]]
              precision    recall  f1-score   support

           0     0.2975    0.3273    0.3117       110
           1     0.3358    0.4600    0.3882       100
           2     0.3822    0.6083    0.4695       120
           3     0.9783    0.3462    0.5114       130
           4     0.6129    0.5000    0.5507       190

    accuracy                         0.4538       650
   macro avg     0.5213    0.4484    0.4463       650
weighted avg     0.5474    0.4538    0.4624       650

accuracy 0.45384615384615384
              precision    recall  f1-score   support

           0     0.9189    0.3864    0.5440       352
           1     0.4766    0.4766    0.4766       107
           2     0.0000    0.0000    0.0000         0
           3     0.0000    0.0000    0.0000         0
           4     0.0000    0.0000    0.0000         0

    accuracy                         0.4074       459
   macro avg     0.2791    0.1726    0.2041       459
weighted avg     0.8158    0.4074    0.5283       459

accuracy 0.4074074074074074
Time used: 32.2772881
```

IV. BiGRU-Based Malicious Family Detection
1. Model Construction
The basic steps of this model are as follows:
- Step 1: Read the data
- Step 2: Encode the labels with OneHotEncoder()
- Step 3: Encode the API token sequences with Tokenizer
- Step 4: Build and train the BiGRU model
- Step 5: Predict and evaluate
- Step 6: Validate the algorithm
The constructed model is shown in the figure below:

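Relative to the LSTM above, a GRU cell merges the gating into an update gate z and a reset gate r, which is why BiGRU typically trains somewhat faster at comparable capacity; again these are the textbook equations rather than anything specific to this article:

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```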
The complete code is as follows:
```python
# -*- coding: utf-8 -*-
# By:Eastmount CSDN 2023-06-27
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from keras.models import Model
from keras.layers import GRU, LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.layers import Convolution1D, MaxPool1D, Flatten
from keras.optimizers import RMSprop
from keras.layers import Bidirectional
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping
from keras.models import load_model
from keras.models import Sequential
from keras.layers.merge import concatenate
import time

start = time.clock()

#---------------------------------------Step 1: Data reading------------------------------------
# Read the datasets
train_df = pd.read_csv("..\\train_dataset.csv")
val_df = pd.read_csv("..\\val_dataset.csv")
test_df = pd.read_csv("..\\test_dataset.csv")
print(train_df.head())

# Configure Chinese font rendering for matplotlib
plt.rcParams['font.sans-serif'] = ['KaiTi']
plt.rcParams['axes.unicode_minus'] = False

#---------------------------------Step 2: OneHotEncoder() encoding---------------------------------
# Encode the label column (columns: no apt md5 api)
train_y = train_df.apt
val_y = val_df.apt
test_y = test_df.apt
le = LabelEncoder()
train_y = le.fit_transform(train_y).reshape(-1, 1)
val_y = le.transform(val_y).reshape(-1, 1)
test_y = le.transform(test_y).reshape(-1, 1)
Labname = le.classes_

# One-hot encode the labels
ohe = OneHotEncoder()
train_y = ohe.fit_transform(train_y).toarray()
val_y = ohe.transform(val_y).toarray()
test_y = ohe.transform(test_y).toarray()

#-------------------------------Step 3: Tokenizer encoding-------------------------------
max_words = 2000
max_len = 300
tok = Tokenizer(num_words=max_words)

# Extract tokens: api
train_value = train_df.api
train_content = [str(a) for a in train_value.tolist()]
val_value = val_df.api
val_content = [str(a) for a in val_value.tolist()]
test_value = test_df.api
test_content = [str(a) for a in test_value.tolist()]
tok.fit_on_texts(train_content)
print(tok)

# Save and reload the fitted Tokenizer
with open('tok.pickle', 'wb') as handle:
    pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('tok.pickle', 'rb') as handle:
    tok = pickle.load(handle)

# Convert each sample to a sequence of token indices
train_seq = tok.texts_to_sequences(train_content)
val_seq = tok.texts_to_sequences(val_content)
test_seq = tok.texts_to_sequences(test_content)

# Pad every sequence to the same length
train_seq_mat = sequence.pad_sequences(train_seq, maxlen=max_len)
val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
test_seq_mat = sequence.pad_sequences(test_seq, maxlen=max_len)

#-------------------------------Step 4: Build and train the BiGRU model-------------------------------
num_labels = 5
model = Sequential()
model.add(Embedding(max_words + 1, 256, input_length=max_len))
#model.add(Bidirectional(GRU(128, dropout=0.2, recurrent_dropout=0.1)))
model.add(Bidirectional(GRU(256)))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(num_labels, activation='softmax'))
model.summary()
model.compile(loss="categorical_crossentropy",
              optimizer='adam',
              metrics=["accuracy"])

flag = "train"
if flag == "train":
    print("Training the model")
    model_fit = model.fit(train_seq_mat, train_y, batch_size=64, epochs=15,
                          validation_data=(val_seq_mat, val_y),
                          callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.005)]
                          )
    # Save the model
    model.save('gru_model.h5')
    del model   # deletes the existing model
    # Timing
    elapsed = (time.clock() - start)
    print("Time used:", elapsed)
    print(model_fit.history)
else:
    print("Predicting with the model")
    model = load_model('gru_model.h5')

    #--------------------------------------Step 5: Prediction and evaluation--------------------------------
    # Predict on the test set
    test_pre = model.predict(test_seq_mat)
    confm = metrics.confusion_matrix(np.argmax(test_y, axis=1), np.argmax(test_pre, axis=1))
    print(confm)
    print(metrics.classification_report(np.argmax(test_y, axis=1),
                                        np.argmax(test_pre, axis=1),
                                        digits=4))
    print("accuracy", metrics.accuracy_score(np.argmax(test_y, axis=1),
                                             np.argmax(test_pre, axis=1)))
    # Store the results
    f1 = open("gru_test_pre.txt", "w")
    for n in np.argmax(test_pre, axis=1):
        f1.write(str(n) + "\n")
    f1.close()

    f2 = open("gru_test_y.txt", "w")
    for n in np.argmax(test_y, axis=1):
        f2.write(str(n) + "\n")
    f2.close()

    plt.figure(figsize=(8, 8))
    sns.heatmap(confm.T, square=True, annot=True, fmt='d', cbar=False,
                linewidths=.6, cmap="YlGnBu")
    plt.xlabel('True label', size=14)
    plt.ylabel('Predicted label', size=14)
    plt.xticks(np.arange(5) + 0.5, Labname, size=12)
    plt.yticks(np.arange(5) + 0.5, Labname, size=12)
    plt.savefig('gru_result.png')
    plt.show()

    #--------------------------------------Step 6: Validation--------------------------------
    # Re-preprocess the validation set with the saved tok
    val_seq = tok.texts_to_sequences(val_content)
    val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
    # Predict on the validation set
    val_pre = model.predict(val_seq_mat)
    print(metrics.classification_report(np.argmax(val_y, axis=1),
                                        np.argmax(val_pre, axis=1),
                                        digits=4))
    print("accuracy", metrics.accuracy_score(np.argmax(val_y, axis=1),
                                             np.argmax(val_pre, axis=1)))
    # Timing
    elapsed = (time.clock() - start)
    print("Time used:", elapsed)
```
2. Experimental Results
The training output is as follows:
```
Training the model
Epoch 1/15
 1/20 [>.............................] - ETA: 47s - loss: 1.6123 - accuracy: 0.1875
...
20/20 [==============================] - 44s 2s/step - loss: 1.4509 - accuracy: 0.4400 - val_loss: 1.3812 - val_accuracy: 0.3660
Time used: 49.7057119
{'loss': [1.4508591890335083], 'accuracy': [0.4399677813053131], 'val_loss': [1.381193995475769], 'val_accuracy': [0.3660130798816681]}
```
The final prediction results are as follows:
```
Predicting with the model
[[ 30   8   9  17  46]
 [ 13  50   9  13  15]
 [ 10   4  58  29  19]
 [ 11   8   8  73  30]
 [ 25   3  23  14 125]]
              precision    recall  f1-score   support

           0     0.3371    0.2727    0.3015       110
           1     0.6849    0.5000    0.5780       100
           2     0.5421    0.4833    0.5110       120
           3     0.5000    0.5615    0.5290       130
           4     0.5319    0.6579    0.5882       190

    accuracy                         0.5169       650
   macro avg     0.5192    0.4951    0.5016       650
weighted avg     0.5180    0.5169    0.5120       650

accuracy 0.5169230769230769
              precision    recall  f1-score   support

           0     0.8960    0.3182    0.4696       352
           1     0.7273    0.5234    0.6087       107
           2     0.0000    0.0000    0.0000         0
           3     0.0000    0.0000    0.0000         0
           4     0.0000    0.0000    0.0000         0

    accuracy                         0.3660       459
   macro avg     0.3247    0.1683    0.2157       459
weighted avg     0.8567    0.3660    0.5020       459

accuracy 0.3660130718954248
Time used: 60.106339399999996
```

V. Malicious Family Detection with CNN+BiLSTM and Attention
1. Model Construction
The basic steps of this model are as follows:
- Step 1: Read the data
- Step 2: Encode the labels with OneHotEncoder()
- Step 3: Encode the API token sequences with Tokenizer
- Step 4: Build the Attention mechanism
- Step 5: Build and train the Attention+CNN+BiLSTM model
- Step 6: Predict and evaluate
- Step 7: Validate the algorithm
The constructed model is shown below:
```
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
inputs (InputLayer)             [(None, 100)]        0
__________________________________________________________________________________________________
embedding (Embedding)           (None, 100, 256)     256256      inputs[0][0]
__________________________________________________________________________________________________
conv1d (Conv1D)                 (None, 100, 256)     196864      embedding[0][0]
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 100, 256)     262400      embedding[0][0]
__________________________________________________________________________________________________
conv1d_2 (Conv1D)               (None, 100, 256)     327936      embedding[0][0]
__________________________________________________________________________________________________
max_pooling1d (MaxPooling1D)    (None, 25, 256)      0           conv1d[0][0]
__________________________________________________________________________________________________
max_pooling1d_1 (MaxPooling1D)  (None, 25, 256)      0           conv1d_1[0][0]
__________________________________________________________________________________________________
max_pooling1d_2 (MaxPooling1D)  (None, 25, 256)      0           conv1d_2[0][0]
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 25, 768)      0           max_pooling1d[0][0]
                                                                 max_pooling1d_1[0][0]
                                                                 max_pooling1d_2[0][0]
__________________________________________________________________________________________________
bidirectional (Bidirectional)   (None, 25, 256)      918528      concatenate[0][0]
__________________________________________________________________________________________________
dense (Dense)                   (None, 25, 128)      32896       bidirectional[0][0]
__________________________________________________________________________________________________
dropout (Dropout)               (None, 25, 128)      0           dense[0][0]
__________________________________________________________________________________________________
attention_layer (AttentionLayer (None, 128)          6500        dropout[0][0]
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 5)            645         attention_layer[0][0]
==================================================================================================
Total params: 2,002,025
Trainable params: 1,745,769
Non-trainable params: 256,256
```
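Per my reading of the call() method of the custom AttentionLayer in the code below, it implements additive attention over the T time steps produced by the BiLSTM, scoring each hidden state h_t and returning the weighted sum:

```latex
\begin{aligned}
e_t &= \tanh(h_t W + b)\, V \\
\alpha_t &= \operatorname{softmax}(e_t) = \frac{\exp(e_t)}{\sum_{k=1}^{T}\exp(e_k)} \\
o &= \sum_{t=1}^{T} \alpha_t\, h_t
\end{aligned}
```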
The complete code is as follows:
```python
# -*- coding: utf-8 -*-
# By:Eastmount CSDN 2023-06-27
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from keras.models import Model
from keras.layers import LSTM, GRU, Activation, Dense, Dropout, Input, Embedding
from keras.layers import Convolution1D, MaxPool1D, Flatten
from keras.optimizers import RMSprop
from keras.layers import Bidirectional
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping
from keras.models import load_model
from keras.models import Sequential
from keras.layers.merge import concatenate
import time

start = time.clock()

#---------------------------------------Step 1: Data reading------------------------------------
# Read the datasets
train_df = pd.read_csv("..\\train_dataset.csv")
val_df = pd.read_csv("..\\val_dataset.csv")
test_df = pd.read_csv("..\\test_dataset.csv")
print(train_df.head())

# Configure Chinese font rendering for matplotlib
plt.rcParams['font.sans-serif'] = ['KaiTi']
plt.rcParams['axes.unicode_minus'] = False

#---------------------------------Step 2: OneHotEncoder() encoding---------------------------------
# Encode the label column (columns: no apt md5 api)
train_y = train_df.apt
val_y = val_df.apt
test_y = test_df.apt
le = LabelEncoder()
train_y = le.fit_transform(train_y).reshape(-1, 1)
val_y = le.transform(val_y).reshape(-1, 1)
test_y = le.transform(test_y).reshape(-1, 1)
Labname = le.classes_

# One-hot encode the labels
ohe = OneHotEncoder()
train_y = ohe.fit_transform(train_y).toarray()
val_y = ohe.transform(val_y).toarray()
test_y = ohe.transform(test_y).toarray()

#-------------------------------Step 3: Tokenizer encoding-------------------------------
max_words = 1000
max_len = 100
tok = Tokenizer(num_words=max_words)

# Extract tokens: api
train_value = train_df.api
train_content = [str(a) for a in train_value.tolist()]
val_value = val_df.api
val_content = [str(a) for a in val_value.tolist()]
test_value = test_df.api
test_content = [str(a) for a in test_value.tolist()]
tok.fit_on_texts(train_content)
print(tok)

# Save and reload the fitted Tokenizer
with open('tok.pickle', 'wb') as handle:
    pickle.dump(tok, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('tok.pickle', 'rb') as handle:
    tok = pickle.load(handle)

# Convert each sample to a sequence of token indices
train_seq = tok.texts_to_sequences(train_content)
val_seq = tok.texts_to_sequences(val_content)
test_seq = tok.texts_to_sequences(test_content)

# Pad every sequence to the same length
train_seq_mat = sequence.pad_sequences(train_seq, maxlen=max_len)
val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
test_seq_mat = sequence.pad_sequences(test_seq, maxlen=max_len)

#-------------------------------Step 4: Build the Attention mechanism-------------------------------
"""
Keras does not ship a ready-made Attention layer here, so we build a custom one.
A custom Keras layer has four main parts:
    __init__: initialize the required parameters
    build: define the weights
    call: the core, defining how the tensors are combined
    compute_output_shape: define the layer's output shape

Recommended reading: https://blog.csdn.net/huanghaocs/article/details/95752379
Recommended reading: https://zhuanlan.zhihu.com/p/29201491
"""
# Hierarchical Model with Attention
from keras import initializers
from keras import constraints
from keras import activations
from keras import regularizers
from keras import backend as K
from keras.engine.topology import Layer

K.clear_session()

class AttentionLayer(Layer):
    def __init__(self, attention_size=None, **kwargs):
        self.attention_size = attention_size
        super(AttentionLayer, self).__init__(**kwargs)

    def get_config(self):
        config = super().get_config()
        config['attention_size'] = self.attention_size
        return config

    def build(self, input_shape):
        assert len(input_shape) == 3
        self.time_steps = input_shape[1]
        hidden_size = input_shape[2]
        if self.attention_size is None:
            self.attention_size = hidden_size
        self.W = self.add_weight(name='att_weight',
                                 shape=(hidden_size, self.attention_size),
                                 initializer='uniform', trainable=True)
        self.b = self.add_weight(name='att_bias',
                                 shape=(self.attention_size,),
                                 initializer='uniform', trainable=True)
        self.V = self.add_weight(name='att_var',
                                 shape=(self.attention_size,),
                                 initializer='uniform', trainable=True)
        super(AttentionLayer, self).build(input_shape)

    # Fix for: "The graph tensor has name: model/attention_layer/Reshape:0"
    # https://blog.csdn.net/weixin_54227557/article/details/129898614
    def call(self, inputs):
        #self.V = K.reshape(self.V, (-1, 1))
        V = K.reshape(self.V, (-1, 1))
        H = K.tanh(K.dot(inputs, self.W) + self.b)
        #score = K.softmax(K.dot(H, self.V), axis=1)
        score = K.softmax(K.dot(H, V), axis=1)
        outputs = K.sum(score * inputs, axis=1)
        return outputs

    def compute_output_shape(self, input_shape):
        return input_shape[0], input_shape[2]

#-------------------------------Step 5: Build and train the Attention+CNN+BiLSTM model-------------------------------
# Build the TextCNN part with word windows of size 3, 4, and 5
num_labels = 5
inputs = Input(name='inputs', shape=[max_len], dtype='float64')
layer = Embedding(max_words + 1, 256, input_length=max_len, trainable=False)(inputs)
cnn1 = Convolution1D(256, 3, padding='same', strides=1, activation='relu')(layer)
cnn1 = MaxPool1D(pool_size=4)(cnn1)
cnn2 = Convolution1D(256, 4, padding='same', strides=1, activation='relu')(layer)
cnn2 = MaxPool1D(pool_size=4)(cnn2)
cnn3 = Convolution1D(256, 5, padding='same', strides=1, activation='relu')(layer)
cnn3 = MaxPool1D(pool_size=4)(cnn3)

# Merge the output vectors of the three convolution branches
cnn = concatenate([cnn1, cnn2, cnn3], axis=-1)

# BiLSTM + Attention
#bilstm = Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.1, return_sequences=True))(cnn)
bilstm = Bidirectional(LSTM(128, return_sequences=True))(cnn)   # keep a 3-D output for the attention layer
layer = Dense(128, activation='relu')(bilstm)
layer = Dropout(0.3)(layer)
attention = AttentionLayer(attention_size=50)(layer)

output = Dense(num_labels, activation='softmax')(attention)
model = Model(inputs=inputs, outputs=output)
model.summary()
model.compile(loss="categorical_crossentropy",
              optimizer='adam',
              metrics=["accuracy"])

flag = "test"
if flag == "train":
    print("Training the model")
    model_fit = model.fit(train_seq_mat, train_y, batch_size=128, epochs=15,
                          validation_data=(val_seq_mat, val_y),
                          callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.0005)]
                          )
    # Save the model
    model.save('cnn_bilstm_model.h5')
    del model   # deletes the existing model
    # Timing
    elapsed = (time.clock() - start)
    print("Time used:", elapsed)
    print(model_fit.history)
else:
    print("Predicting with the model")
    model = load_model('cnn_bilstm_model.h5',
                       custom_objects={'AttentionLayer': AttentionLayer(50)},
                       compile=False)

    #--------------------------------------Step 6: Prediction and evaluation--------------------------------
    # Predict on the test set
    test_pre = model.predict(test_seq_mat)
    confm = metrics.confusion_matrix(np.argmax(test_y, axis=1), np.argmax(test_pre, axis=1))
    print(confm)
    print(metrics.classification_report(np.argmax(test_y, axis=1),
                                        np.argmax(test_pre, axis=1),
                                        digits=4))
    print("accuracy", metrics.accuracy_score(np.argmax(test_y, axis=1),
                                             np.argmax(test_pre, axis=1)))
    # Store the results
    f1 = open("cnn_bilstm_test_pre.txt", "w")
    for n in np.argmax(test_pre, axis=1):
        f1.write(str(n) + "\n")
    f1.close()

    f2 = open("cnn_bilstm_test_y.txt", "w")
    for n in np.argmax(test_y, axis=1):
        f2.write(str(n) + "\n")
    f2.close()

    plt.figure(figsize=(8, 8))
    sns.heatmap(confm.T, square=True, annot=True, fmt='d', cbar=False,
                linewidths=.6, cmap="YlGnBu")
    plt.xlabel('True label', size=14)
    plt.ylabel('Predicted label', size=14)
    plt.xticks(np.arange(5) + 0.5, Labname, size=12)
    plt.yticks(np.arange(5) + 0.5, Labname, size=12)
    plt.savefig('cnn_bilstm_result.png')
    plt.show()

    #--------------------------------------Step 7: Validation--------------------------------
    # Re-preprocess the validation set with the saved tok and predict with the trained model
    val_seq = tok.texts_to_sequences(val_content)
    val_seq_mat = sequence.pad_sequences(val_seq, maxlen=max_len)
    val_pre = model.predict(val_seq_mat)
    print(metrics.classification_report(np.argmax(val_y, axis=1),
                                        np.argmax(val_pre, axis=1),
                                        digits=4))
    print("accuracy", metrics.accuracy_score(np.argmax(val_y, axis=1),
                                             np.argmax(val_pre, axis=1)))
    # Timing
    elapsed = (time.clock() - start)
    print("Time used:", elapsed)
```
2. Experimental Results
The training output is as follows:
```
Training the model
Epoch 1/15
 1/10 [==>...........................] - ETA: 18s - loss: 1.6074 - accuracy: 0.2188
...
10/10 [==============================] - 9s 728ms/step - loss: 1.5046 - accuracy: 0.3400 - val_loss: 1.4659 - val_accuracy: 0.5599
Time used: 13.8141568
{'loss': [1.5045626163482666], 'accuracy': [0.34004834294319153], 'val_loss': [1.4658586978912354], 'val_accuracy': [0.5599128603935242]}
```
The final prediction results are as follows:
```
Predicting with the model
[[ 56  13   1   0  40]
 [ 31  53   0   0  16]
 [ 54  47   3   1  15]
 [ 27  14   1  51  37]
 [ 39  16   8   2 125]]
              precision    recall  f1-score   support

           0     0.2705    0.5091    0.3533       110
           1     0.3706    0.5300    0.4362       100
           2     0.2308    0.0250    0.0451       120
           3     0.9444    0.3923    0.5543       130
           4     0.5365    0.6579    0.5910       190

    accuracy                         0.4431       650
   macro avg     0.4706    0.4229    0.3960       650
weighted avg     0.4911    0.4431    0.4189       650

accuracy 0.4430769230769231
              precision    recall  f1-score   support

           0     0.8571    0.5625    0.6792       352
           1     0.6344    0.5514    0.5900       107
           2     0.0000    0.0000    0.0000         0
           4     0.0000    0.0000    0.0000         0

    accuracy                         0.5599       459
   macro avg     0.3729    0.2785    0.3173       459
weighted avg     0.8052    0.5599    0.6584       459

accuracy 0.5599128540305011
Time used: 23.0178675
```

VI. Conclusion
This article ends here; I hope it has been helpful. May and June were truly hectic months: projects, proposals, papers, graduation. Once the rush is over I will write a few more solid security posts. Thank you for your support and company, and especially for my family's encouragement. Keep going!