
    [AI Security Papers] 17. Writing the Design and Overview Sections of English Papers, with Excerpted Sentences: Examples from Top Systems and AI Security Conferences

    VSole 2022-04-01 22:19:53

    The previous article covered the CCS 2019 PowerShell deobfuscation work, a very high-quality paper by Zhenyuan Li of Zhejiang University. This article gives my personal take on how to write the Model Design and Overview sections of an English paper, using top-conference papers on systems and AI security as examples. On the one hand, my English is weak and I can only improve it slowly by the most down-to-earth method; on the other hand, these are my personal study notes, shared in the hope that readers will criticize and correct them. I hope this article is helpful; these authors are truly worth learning from. Keep fighting!

    Since the author previously worked on NLP and AI and has now moved into security, the papers selected here come mainly from the past four years of AI security and system security work at the Big Four venues (S&P, USENIX Security, CCS, NDSS). My ability is limited, so I can only start from my own level and what I have actually read; I hope to keep improving, and every part will be supplemented over time. Perhaps in five or ten years I will share in detail how to write an English paper; for now this is mainly study and note-taking. Experts, please feel free to skim past O(∩_∩)O

    Table of contents:

    • I. How to Write the Model Design or Method Section
    • 1. Overall Paper Structure and Method Writing
    • 2. Writing the Method or Model Design Section
    • 3. My Personal Take on Writing the Model Design Section
    • 4. Additional Notes on the Overall Structure
    • II. Writing the Model Design Section and Excerpted Sentences
    • Part 0: Introductions and Transitions
    • Part 1: Overview (How to Describe the System Framework)
    •  Hermes Attack (USENIX Sec21)
    •  PalmTree (CCS21)
    •  TextShield (USENIX Sec20)
    •  UNICORN (NDSS20)
    •  PowerShell Deobfuscation (CCS19)
    •  DeepReflect (USENIX Sec21)
    •  Phishpedia (USENIX Sec21)
    •  TextExerciser (S&P21)
    •  DeepBinDiff (NDSS20)
    •  Slimium (CCS20)
    • III. Summary

    The 《娜璋帶你讀論文》 (Nazhang Takes You Through Papers) series is mainly meant to push myself to read good papers and attend academic talks, and to share them with everyone; I hope you enjoy it. My English and academic ability are limited and need constant improvement, so please point out any mistakes. In the early stages the focus is on translation; as my study deepens I will share more of each paper's essence and innovations, followed later by reproduction and write-up analysis. My research skills are modest, but I enjoy recording and sharing, and I welcome your comments. I look forward to walking the academic road with you. Keep going!

    Recommended previous articles:


    I. How to Write the Model Design or Method Section

    How to write a paper varies from person to person; I am only sharing my own views, and comments are welcome. One thing most people will agree on, though, is to keep reading the latest and the classic papers in your field: once you know the relevant literature like the back of your hand, you are one step closer to writing your first English paper. In the model design section, the key is how the model connects to the problem to be solved, so that readers genuinely believe the model can handle problems of this kind; a good story is the key to a successful paper. And reading a lot and writing a lot is the basic drill. Let's encourage each other!

    1. Overall Paper Structure and Method Writing

    This part reviews and draws on Prof. Zhou's doctoral course materials; thanks to him for sharing. The typical "anatomy" of a paper comes in two forms, as shown below:

    Format 1: theoretical research

    • Title and authors
    • Abstract
    • Introduction
    • Related Work (can be placed later)
    • Materials and Methods
    • Results
    • Acknowledgements
    • References

    Format 2: systems research

    • Title and authors
    • Abstract
    • Introduction
    • Related Work (can be placed later)
    • System Model
    • Mathematics and algorithms
    • Experiments
    • Acknowledgements
    • References

    System Model

    • It should be detailed enough for another scientist to understand the problem
    • Objective
    • Constraints
    • Difficulties and Challenges

    Mathematics and algorithms

    • These sections are the technical core of the paper
    • Learn the corresponding structure by reading high-quality papers

    Note that reading and understanding this part is a prerequisite. Before reading how an algorithm is implemented, we should keep the following points in mind:

    • restate unclear points in your own words
    • fill in missing details (assumptions, algebraic steps, proofs, pseudocode)
    • annotate mathematical objects with their types
    • come up with examples that illustrate the author’s ideas, and examples that would be problematic for the author
    • draw connections to other methods and problems you know about
    • ask questions about things that aren’t stated or that don’t make sense
    • challenge the paper’s claims or methods, and dream up follow-up work that you (or someone) should do

    2. Writing the Method or Model Design Section

    This part is mainly based on Prof. Yi Li's book 《學術寫作原來是這樣》 (roughly, "So This Is What Academic Writing Is Like"), as follows:

    "Usually I guide students to start writing a paper from the methods and results sections. Compared with the rest of the paper they are relatively easy to write: by following the template of published work, students can pick them up quickly. You basically write them step by step: the methods section describes how the experiments were done, and the results section reports what was found."

    Personally, I do not quite agree with this view. In my opinion the method or system design is extremely important, at least in computer science; the framework figure that accompanies the method in particular largely determines a paper's viewpoint, novelty, and contributions, and ultimately its level. Moreover, if you aim at a top conference or journal, a good story, or a method description that matches your contributions, is crucial. That said, it is fine for beginners to start writing from the model design.

    When writing the method section, pay attention to the following:

    • Focus on what matters. Beginners tend to spend most of their space on preprocessing, data filtering, and similar steps of their experiments. In fact many of those details can be written in broad strokes or moved to supplementary material. If a paper contains too many details that do not help readers understand the scientific findings, it inevitably distracts them and drains their energy, hurting their understanding of the most important parts of the paper. Meanwhile, what should be explained is left unclear, while what need not be written gets far too much space.

    This raises an important question of organization: how do you tell a good story?

    • No matter how correct the grammar, how precise the wording, or how logical the sentences, a paper that does not tell its story well can hardly be a good paper. The story is the soul of the whole paper; it establishes the main contributions and innovations. The outline is the paper's skeleton: the way the outline is written presents the story, and it determines whether the paper's overall organization is reasonable and logical.

    First question: what makes a good story?

    When assessing the scientific value of a piece of research, what matters most is novelty and significance. Novelty means the researcher is not simply following or repeating other people's work but makes a distinctive new contribution. It is often said that research goes through three stages: "me too", "me better", and "me only". Novelty can be described with the same three stages.

    • Old paradigm, new conditions
    • Showing that a phenomenon behaves differently under different conditions. This kind of innovation often comes from using new samples and falls between "me too" and "me better".
    • Old paradigm, new techniques
    • A "me better" level of innovation: the new method outperforms the existing one.
    • New data-analysis methods
    • Advances in statistics and artificial intelligence have produced many new data-analysis methods. Using such models on old data to reach new conclusions or solve new problems is a "me better" level of innovation.
    • New paradigm
    • Studying a problem with an entirely new approach; this falls between "me better" and "me only".
    • New problem
    • Research that poses a new problem sits at the "me only" level. A brand-new problem usually comes with a new research method, and the value of the new problem should be substantial and needs to be argued; of course, it also carries risk.

    At the same time, innovation is not unconstrained fantasy. Innovation in research always has boundaries and must be meaningful. In short, before starting to write, we should think about which venue the paper is roughly intended for, whether a specialized journal in a narrow field or a general psychology journal (in the book's example); this decision affects how you frame the paper and the level the story needs to reach.

    Compared with other writing skills, the ability to tell a story is harder to master, because it is a mysterious skill that can be grasped but not easily taught; quite a few people even regard it as the ceiling of writing. Some researchers produce beautiful work in a big lab run by a famous advisor abroad, yet once they return home to be a PI on their own, their output becomes mediocre. This is usually because the senior advisor's research vision and taste determined the level of the papers, while they themselves never learned to pose forward-looking, innovative scientific questions.

    I would like to add to the point above. A senior advisor abroad can indeed broaden our vision and raise the level of our papers. However, most researchers, especially PhD students, rarely get access to scientists of that caliber, and many have to work on papers independently. What should we do then?
    I think that, on the one hand, we need to read more papers and follow what leading researchers share, improving ourselves by studying and comparing their papers and methods; on the other hand, we need to write more papers, run more experiments, and stay alert to problems. When the story of our papers reads like that of top conferences and journals, we will at least land some SCI Q2 journals or B-ranked conferences, and with persistent effort on quality and ideas we can eventually write high-quality papers too. Even without the help of such big names, I believe we can learn to do research and improve ourselves through hard work. That is also why I write this column: to learn English writing from scratch by the most down-to-earth method. Let us keep moving forward on the PhD road; believe in yourself, and keep going!

    Second question: how do you tell the story well?

    Once we have the core of the story, we still need to present it well. To do so, we should learn how to use an outline to lay out the paper. An outline is a writing plan for the paper's organization; it helps you think through the main story and framework, grasp the core scientific question, and present the story in an ordered, logical, and focused way.

    • Title and authors
    • Abstract
    • Introduction
    • Related Work (can be placed later)
    • System Model
    • Mathematics and algorithms
    • Experiments
    • Acknowledgements
    • References

    3. My Personal Take on Writing the Model Design Section

    My personal feeling is that the model design is very important. My usual approach is as follows.

    • First, read a large number of papers in the relevant direction; only by reading widely can you write. Then summarize the strengths and weaknesses of existing methods and find the problem or method you need to address (the idea is the hard part); best of all is to discover a new problem and propose a solution for it.
    • Second, I run simple experiments with my method, and once they show it works I build the framework of the paper or model (the paper's backbone), then add detailed refinements such as pruning-style algorithms.
    • Next, I describe the overall framework of the model. The Overview can be the first subsection of the Model Design, or it can stand on its own as a separate block before it.
    • Then, implement each part according to the overall framework. Taking deep learning as an example, this usually covers data collection, data preprocessing, feature selection, model building, and the classification task (see the sketch after this list). Most importantly, the algorithm you propose, or whichever part carries your contribution, should be described in detail with algorithms, formulas, or figures and tables, along with any constraints.
    • Finally, the whole story should revolve around the paper's contributions, claims, and experiments, to better highlight its selling points.
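
    To make the skeleton in the fourth bullet concrete, here is a minimal, hypothetical pipeline sketch of the kind such a framework section usually describes (preprocessing, feature selection, model building, classification), written with scikit-learn; the data, dimensions, and model choice are illustrative placeholders, not anything from the papers discussed below.

```python
# Minimal, illustrative pipeline skeleton: preprocessing -> feature selection -> classifier.
# The data and all parameters are synthetic placeholders.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))           # "collected" feature vectors
y = (X[:, 0] + X[:, 3] > 0).astype(int)  # synthetic labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),                  # data preprocessing
    ("select", SelectKBest(f_classif, k=8)),      # feature selection
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),  # model / classification
])

pipeline.fit(X_train, y_train)
print("test accuracy:", pipeline.score(X_test, y_test))
```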

    Below is a brief look at how several classic papers write and organize their model design sections. They usually describe how the framework or algorithm is realized, including framework figures, algorithms, formulas, tables, and constraints.

    (1) Chuanpu Fu, et al. Realtime Robust Malicious Traffic Detection via Frequency Domain Analysis. CCS21. (frequency-domain malicious traffic detection)

    4 DESIGN DETAILS
    4.1 Frequency Feature Extraction Module
    4.2 Automatic Parameters Selection Module
    4.3 Statistical Clustering Module

    (2) Yuankun Zhu, et al. Hermes Attack: Steal DNN Models with Lossless Inference Accuracy. USENIX Sec21. (DNN model attack & inference information)

    3 Attack Design
    3.1 Overview
    3.2 Traffic Processing
    3.3 Extraction
    – 3.3.1 Header Extraction
    – 3.3.2 Command Extraction
    3.4 Reconstruction
    – 3.4.1 Semantic Reconstruction
    – 3.4.2 Model Reconstruction

    (3) Xuezixiang Li, et al. PalmTree: Learning an Assembly Language Model for Instruction Embedding, CCS21 (BERT-style pre-trained instruction embeddings)

    3 DESIGN OF PALMTREE
    3.1 Overview
    3.2 Input Generation
    3.3 Tokenization
    3.4 Assembly Language Model
    – 3.4.1 PalmTree model
    – 3.4.2 Training task 1: Masked Language Model
    – 3.4.3 Training task 2: Context Window Prediction
    – 3.4.4 Training task 3: Def-Use Prediction
    – 3.4.5 Instruction Representation
    – 3.4.6 Deployment of the model

    (4) Jinfeng Li, et al. TextShield: Robust Text Classification Based on Multimodal Embedding and Neural Machine Translation, USENIX Sec20. (text adversarial examples; the multimodal design and experiments are worth learning from)

    3 Design of TEXTSHIELD
    3.1 Problem Definition and Threat Model
    3.2 Overview of TEXTSHIELD Framework
    3.3 Adversarial Translation
    3.4 Multimodal Embedding
    3.5 Multimodal Fusion

    (5) Xueyuan Han, et al. UNICORN: Runtime Provenance-Based Detector for Advanced Persistent Threats, NDSS20. (provenance-graph-based APT detection)

    IV. DESIGN
    A. Provenance Graph
    B. Constructing Graph Histograms
    C. Generating Graph Sketches
    D. Learning Evolutionary Models
    E. Detecting Anomalies

    (6) Zhenyuan Li, et al. Effective and Light-Weight Deobfuscation and Semantic-Aware Attack Detection for PowerShell Scripts, CCS19. (PowerShell deobfuscation)

    3 OVERVIEW
    4 POWERSHELL DEOBFUSCATION
    4.1 Subtree-based Deobfuscation Approach Overview
    4.2 Extract Suspicious Subtrees
    4.3 Subtree-based Obfuscation Detection
    4.4 Emulation-based Recovery
    4.5 AST Update
    4.6 Post processing


    4. Additional Notes on the Overall Structure

    A few additional points on the overall structure and writing details of the model design (drawn from Prof. Zhou's doctoral course, from which I benefited a great deal):

    The experiments section likewise involves many details; we will cover it in the next article.

    II. Writing the Model Design Section and Excerpted Sentences

    I personally like to describe the model design together with the framework figure; corrections are welcome. The discussion below focuses on papers from the Big Four conferences, mainly in system security and AI security. What can we learn from these papers? Specifically:

    • A beautiful framework figure from a top-conference paper
    • The English description of the paper's overall architecture (Overview)
    • The connecting and transition phrases that tie the paper together
    • Giving your model or system a fitting name, which also makes the paper easier to describe
    • Paper directions that combine deep learning with system security

    Part 0: Introductions and Transitions

    The introduction at the start of each Section explains how that section is organized before going into the details. These descriptions are all quite similar; usually it is enough to outline the contents of each part with "In this section, we present…". In addition, some papers lead in with "To tackle the challenges mentioned above". Below are 12 examples taken from model design or related work sections.

    (1) We next discuss in detail our proposed euphemism detection approach in Section IV-A and the proposed euphemism identification approach in Section IV-B.

    (2) In this section, we present the design details of Whisper, i.e., the design of three main modules in Whisper.

    (3) In this section, we present the design details of DEEPNOISE. First, we show the overall workflow of DEEPNOISE. Then, we present the details of each component of DEEPNOISE.

    (4) In this section, we first introduce audit log analysis and its challenges with a motivating example. We then analyze the problem of behavior abstraction with our insights, as well as describing the threat model.

    (5) In this section, we briefly summarize the related similarity-based phishing detection approaches, then we introduce our threat model.

    (6) In this section, we introduce the stealthy malware we focus on in this study and present our insights of using provenance analysis to detect such malware.

    (7) In this section, we detail the pipeline of DEEPREFLECT as well as the features and models it uses.

    (8) In this section, we first present our threat model and then describe overarching assumptions and principles used throughout the paper.

    (9) In this section, we describe our process for curating this ground truth dataset and the features that we use for our classifier. We then present the accuracy of the resulting classifier and how we use it to measure the phenomenon and abuse of unintended URLs in the wild. Our analysis pipeline is illustrated in Figure 3.

    (10) In this section, we introduce the methodology of exercising text inputs for Android apps. We start from introducing the workflow of TextExerciser and then present each phase of TextExerciser individually.

    (11) To tackle the challenges mentioned above, we develop an automated tool VIEM by combining and customizing a set of state-of-the-art natural language processing (NLP) techniques. In this section, we briefly describe the design of VIEM and discuss the reasons behind our design. Then, we elaborate on the NLP techniques that VIEM adopts.

    (12) In this section, we detail the ATLAS architecture introduced in Figure 3. We start with an audit log pre-processing phase that constructs and optimizes the causal graph for scalable analysis (Sec. 4.1). We then present a sequence construction and learning phase that constructs attack and non-attack sequences for model learning (Sec. 4.2). Lastly, we present an attack investigation phase that uses the model to identify attack entities, which helps build the attack story (Sec. 4.3).


    Part 1: Overview (How to Describe the System Framework)

    Hermes Attack(USENIX Sec21)

    [1] Yuankun Zhu, et al. Hermes Attack: Steal DNN Models with Lossless Inference Accuracy. USENIX Sec21.

    Attack Overview. The methodology of our attack can be divided into two phases: offline phase and online phase. During the offline phase, we use white-box models to build a database with the identified command headers, the mappings between GPU kernel (binaries) and DNN layer types, and the mappings between GPU kernels and offsets of hyperparameters. Specifically, the traffic processing module ( ① in Figure 5) sorts the out-of-order PCIe packets intercepted by PCIe snooping device. The extraction module ( ② ) has two sub-modules: header extraction module and command extraction module. The header extraction module extracts command headers from the sorted PCIe packets (Section 3.3.1). The extracted command headers will be stored in the database, accelerating command extraction in the online phase. The command extraction module in the offline phase helps get the kernel binaries (Section 3.3.2). The semantic reconstruction module within the reconstruction module ( ③ ) takes the inputs from the command extraction module and the GPU profiler to create the mappings between the kernel (binary) and the layer type, as well as the mappings between the kernel and the offset of hyper-parameters, facilitating the module reconstruction in the online phase (Section 3.4.1).

    During the online phase, the original (victim) model is used for inference on a single image. The victim model is a black-box model and thoroughly different from the white-box models used in the offline phase. PCIe traffics are intercepted and sorted by the traffic processing module. The command extraction module ( ② ) extracts K (kernel launch related) and D (data movement related) commands as well as the GPU kernel binaries, using the header information profiled from the offline phase (Section 3.3.2). The entire database are feed to the model reconstruction module ( ③ ) to fully reconstruct architecture, hyper-parameters, and parameters (Section 3.4.2). All these steps need massive efforts of reverse engineering.

    • Strengths of this passage: the work is a novel DNN model extraction attack. The Overview splits the framework into two phases and describes each in turn, and the framework figure labels the corresponding modules, so the text and the figure annotations together walk through the implementation.
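
    As a rough, unofficial illustration of the offline-database and online-lookup idea in the overview above (not the authors' actual implementation), the sketch below keeps a dictionary from kernel-binary hashes to layer types and hyper-parameter offsets built in an offline phase, then consults it online; all names, structures, and byte strings are hypothetical simplifications.

```python
# Hypothetical simplification of the two-phase idea: an offline database mapping
# GPU kernel binaries to layer types / hyper-parameter offsets, consulted online.
import hashlib

def kernel_id(kernel_binary: bytes) -> str:
    """Identify a kernel binary by its hash (stand-in for real matching logic)."""
    return hashlib.sha256(kernel_binary).hexdigest()

# Offline phase: built from white-box models plus GPU profiler output (toy values here).
offline_db = {
    kernel_id(b"conv_kernel_bytes"): {"layer": "Conv2D", "hyperparam_offsets": [0x10, 0x18]},
    kernel_id(b"fc_kernel_bytes"):   {"layer": "Dense",  "hyperparam_offsets": [0x08]},
}

def reconstruct_online(extracted_kernels):
    """Online phase: map each kernel seen in sorted PCIe traffic back to a layer."""
    layers = []
    for binary in extracted_kernels:
        entry = offline_db.get(kernel_id(binary))
        layers.append(entry if entry is not None else {"layer": "UNKNOWN"})
    return layers

# Example: a victim inference that launched two known kernels.
print(reconstruct_online([b"conv_kernel_bytes", b"fc_kernel_bytes"]))
```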

    PalmTree(CCS21)

    [2] Xuezixiang Li, et al. PalmTree: Learning an Assembly Language Model for Instruction Embedding, CCS21.

    To meet the challenges summarized in Section 2, we propose PalmTree, a novel instruction embedding scheme that automatically learns a language model for assembly code. PalmTree is based on BERT [9], and incorporates the following important design considerations.

    First of all, to capture the complex internal formats of instructions, we use a fine-grained strategy to decompose instructions: we consider each instruction as a sentence and decompose it into basic tokens. Then, in order to train the deep neural network to understand the internal structures of instructions, we make use of a recently proposed training task in NLP to train the model: Masked Language Model (MLM) [9]. This task trains a language model to predict the masked (missing) tokens within instructions.

    Moreover, we would like to train this language model to capture the relationships between instructions. To do so, we design a training task, inspired by word2vec [28] and Asm2Vec [10], which attempts to infer the word/instruction semantics by predicting two instructions’ co-occurrence within a sliding window in control flow. We call this training task Context Window Prediction (CWP), which is based on Next Sentence Prediction (NSP) [9] in BERT. Essentially, if two instructions i and j fall within a sliding window in control flow and i appears before j, we say i and j have a contextual relation. Note that this relation is more relaxed than NSP, where two sentences have to be next to each other. We make this design decision based on our observation described in Section 2.2.2: instructions may be reordered by compiler optimizations, so adjacent instructions might not be semantically related.

    Furthermore, unlike natural language, instruction semantics are clearly documented. For instance, the source and destination operands for each instruction are clearly stated. Therefore, the data dependency (or def-use relation) between instructions is clearly specified and will not be tampered by compiler optimizations. Based on these facts, we design another training task called Def-Use Prediction (DUP) to further improve our assembly language model. Essentially, we train this language model to predict if two instructions have a def-use relation.

    Figure 1 presents the design of PalmTree. It consists of three components: Instruction Pair Sampling, Tokenization, and Language Model Training. The main component (Assembly Language Model) of the system is based on the BERT model [9]. After the training process, we use mean pooling of the hidden states of the second last layer of the BERT model as instruction embedding. The Instruction Pair Sampling component is responsible for sampling instruction pairs from binaries based on control flow and def-use relations.

    In Section 3.2, we introduce how we construct two kinds of instruction pairs. In Section 3.3, we introduce our tokenization process. Then, we introduce how we design different training tasks to pre-train a comprehensive assembly language model for instruction embedding in Section 3.4.

    • Strengths of this passage: the paper proposes PalmTree, a pre-trained assembly language model that learns general-purpose instruction embeddings through self-supervised training on a large unlabeled binary corpus. It brings BERT-style pre-training to instruction embedding, ties the Overview to the challenges being solved, and the paragraphs build on one another nicely.
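
    Below is a toy sketch, under my own simplifying assumptions, of how the two kinds of instruction pairs described above might be sampled: context-window pairs from a sliding window over (here, straight-line) code, and def-use pairs from crude operand overlap. PalmTree's real sampling works on control-flow graphs with proper operand semantics; this only makes the idea concrete.

```python
# Toy sampling of PalmTree-style instruction pairs from a straight-line block.
# Instructions are token lists; operand overlap is a crude stand-in for def-use analysis.
instructions = [
    ["mov",  "rbp", "rdi"],
    ["mov",  "ebx", "0x1"],
    ["mov",  "rsi", "rbp"],
    ["add",  "rsi", "0x8"],
    ["call", "memcpy"],
]

def context_window_pairs(instrs, window=2):
    """CWP: two instructions co-occurring within a sliding window form a positive pair."""
    pairs = []
    for i in range(len(instrs)):
        for j in range(i + 1, min(i + window + 1, len(instrs))):
            pairs.append((instrs[i], instrs[j]))
    return pairs

def def_use_pairs(instrs):
    """DUP (very roughly): a later instruction reading a register the earlier one wrote."""
    pairs = []
    for i, earlier in enumerate(instrs):
        defined = set(earlier[1:2])          # naive: treat the first operand as the destination
        for later in instrs[i + 1:]:
            if defined & set(later[1:]):
                pairs.append((earlier, later))
    return pairs

print(len(context_window_pairs(instructions)), "CWP pairs")
print(def_use_pairs(instructions))
```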

    TextShield(USENIX Sec20)

    [3] Jinfeng Li, et al. TextShield: Robust Text Classification Based on Multimodal Embedding and Neural Machine Translation, USENIX Sec20.

    We present the framework overview of TEXTSHIELD in Fig.1, which is built upon multimodal embedding, multimodal fusion and NMT. Generally, we first feed each text into an NMT model trained with a plenty of adversarial–benign text pairs for adversarial correction. Then, we input the corrected text into the DLTC model for multimodal embedding to extract features from semantic-level, glyph-level and phonetic-level. Finally, we use a multimodal fusion scheme to fuse the extracted features for the following regular classifications. Below, we will elaborate on each of the backbone techniques.

    • 3.3 Adversarial Translation
    • 3.4 Multimodal Embedding
    • 3.5 Multimodal Fusion

    Since the variation strategies adopted by malicious users in the real scenarios are mainly concentrated on glyph-based and phonetic-based perturbations [47], we therefore dedicatedly propose three embedding methods across different modalities to handle the corresponding variation types, i.e., semantic embedding, glyph embedding and phonetic embedding. They are also dedicatedly designed to deal with the sparseness and diversity unique to Chinese adversarial perturbations.

    Since multiple modalities can provide more valuable information than a single one by describing the same content in various ways, it is highly expected to learn effective joint representation by fusing the features of different modalities. Therefore, after multimodal embedding, we first fuse the features extracted from different modalities by multimodal fusion and then feed the fused features into a classification model for regular classification. In this paper, we experiment with two different fusion strategies, i.e., early multimodal fusion and intermediate multimodal fusion as shown in Fig. 10 in Appendix A.

    • Strengths of this passage: the paper proposes TEXTSHIELD, a text classifier based on multimodal embedding and neural machine translation. The description of multimodal fusion is worth learning from, in particular the explanation of why fusing modalities beats a single modality.
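
    To illustrate the early-fusion idea above (concatenating semantic, glyph, and phonetic features before classification), here is a minimal sketch with random placeholder embeddings; the real TextShield extractors and fusion networks are far richer, so treat every dimension and model choice here as an assumption.

```python
# Toy early fusion: concatenate per-modality embeddings, then feed a simple classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200  # number of (already corrected) texts

semantic = rng.normal(size=(n, 128))   # placeholder semantic embeddings
glyph    = rng.normal(size=(n, 64))    # placeholder glyph embeddings
phonetic = rng.normal(size=(n, 64))    # placeholder phonetic (pinyin) embeddings
labels   = rng.integers(0, 2, size=n)  # toy toxic / benign labels

# Early multimodal fusion: one joint representation per text.
fused = np.concatenate([semantic, glyph, phonetic], axis=1)

clf = LogisticRegression(max_iter=1000).fit(fused, labels)
print("fused feature dimension:", fused.shape[1])
print("training accuracy on toy data:", clf.score(fused, labels))
```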

    UNICORN(NDSS20)

    [4] Xueyuan Han, et al. UNICORN: Runtime Provenance-Based Detector for Advanced Persistent Threats, NDSS20.

    UNICORN is a host-based intrusion detection system capable of simultaneously detecting intrusions on a collection of networked hosts. We begin with a brief overview of UNICORN and then follow with a detailed discussion of each system component in the following sections. Fig.1 illustrates UNICORN’s general pipeline.

    ① Takes as input a labeled, streaming provenance graph.

    UNICORN accepts a stream of attributed edges produced by a provenance capture system running on one or more networked hosts. Provenance systems construct a single, whole-system provenance DAG with a partial-order guarantee, which allows for efficient streaming computation (§ IV-B) and fully contextualized analysis (L2). We present UNICORN using CamFlow [100], although it can obtain provenance from other systems, such as LPM [16] and Spade [44], the latter of which interoperates with commodity audit systems such as Linux Audit and Windows ETW.

    ② Builds at runtime an in-memory histogram.

    UNICORN efficiently constructs a streaming graph histogram that represents the entire history of system execution, updating the counts of histogram elements as new edges arrive in the graph data stream. By iteratively exploring larger graph neighborhoods, it discovers causal relationships between system entities providing execution context. This is UNICORN’s first step in building an efficient data structure that facilitates contextualized graph analysis (L2). Specifically, each element in the histogram describes a unique substructure of the graph, taking into consideration the heterogeneous label(s) attached to the vertices and edges within the substructure, as well as the temporal order of those edges.

    To adapt to expected behavioral changes during the course of normal system execution, UNICORN periodically discounts the influence of histogram elements that have no causal relationships with recent events (L3). Slowly “forgetting” irrelevant past events allows us to effectively model metastates (§ IV-D) throughout system uptime (e.g., system boot, initialization, serving requests, failure modes, etc.). However, it does not mean that UNICORN forgets informative execution history; rather, UNICORN uses information flow dependencies in the graph to keep up-to-date important, relevant context information. Attackers can slowly penetrate the victim system in an APT, hoping that a time-based IDS eventually forgets this initial attack, but they cannot break the information flow dependencies that are essential to the success of the attack [87].

    ③ Periodically, computes a fixed-size graph sketch.

    In a pure streaming environment, the number of unique histogram elements can grow arbitrarily large as UNICORN summarizes the entire provenance graph. This variation in size makes it challenging to efficiently compute similarity between two histograms and impractical to design algorithms for later modeling and detection. UNICORN employs a similarity-preserving hashing technique [132] to transform the histogram to a graph sketch [7]. The graph sketch is incrementally maintainable, meaning that UNICORN does not need to keep the entire provenance graph in memory; its size is constant (L4). Additionally, graph sketches preserve normalized Jaccard similarity [64] between two graph histograms. This distance-preserving property is particularly important to the clustering algorithm in our later analysis, which is based on the same graph similarity metric.

    ④ Clusters sketches into a model.

    UNICORN builds a normal system execution model and identifies abnormal activities without attack knowledge (L1). However, unlike traditional clustering approaches, UNICORN takes advantage of its streaming capability to generate models that are evolutionary. The model captures behavioral changes within a single execution by clustering system activities at various stages of its execution, but UNICORN does not modify models dynamically during runtime when the attacker may be subverting the system (L3). It is therefore more suitable for long-running systems under potential APT attacks.

    • Strengths of this passage: this is a classic provenance-graph APT paper, and its Overview is written very well, so I quote it in full here for my own study and yours.
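
    The toy sketch below captures, under my own assumptions, two mechanics from the pipeline quoted above: a streaming histogram whose counts are gradually discounted as new edges arrive, and a normalized (weighted) Jaccard similarity between two histograms. It deliberately omits the real vertex-neighborhood labeling and the similarity-preserving sketching step.

```python
# Toy version of UNICORN's streaming histogram with discounting, plus a
# normalized (weighted) Jaccard-style similarity between two histograms.
from collections import defaultdict

class StreamingHistogram:
    def __init__(self, decay: float = 0.95):
        self.counts = defaultdict(float)
        self.decay = decay

    def add_edge(self, substructure_label: str):
        """Discount old counts, then count the substructure this edge produced."""
        for k in self.counts:
            self.counts[k] *= self.decay
        self.counts[substructure_label] += 1.0

def weighted_jaccard(h1, h2):
    """Normalized (weighted) Jaccard similarity between two count histograms."""
    keys = set(h1) | set(h2)
    num = sum(min(h1.get(k, 0.0), h2.get(k, 0.0)) for k in keys)
    den = sum(max(h1.get(k, 0.0), h2.get(k, 0.0)) for k in keys)
    return num / den if den else 1.0

a, b = StreamingHistogram(), StreamingHistogram()
for label in ["proc->file", "proc->socket", "proc->file"]:
    a.add_edge(label)
for label in ["proc->file", "proc->file", "proc->file"]:
    b.add_edge(label)
print("similarity:", round(weighted_jaccard(a.counts, b.counts), 3))
```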

    PowerShell Deobfuscation(CCS19)

    [5] Zhenyuan Li, et al. Effective and Light-Weight Deobfuscation and Semantic-Aware Attack Detection for PowerShell Scripts, CCS19.

    As shown in §2.3, obfuscation is highly effective in bypassing today’s the PowerShell attack detection. To combat such threat, it is thus highly desired to design a effective and light-weight deobfuscation mechanism for PowerShell scripts. In this paper, we are the first to design such a mechanism and use it as the key building block to develop the first semantic-aware PowerShell attack detection system. As shown in Figure 3, the detection process can be divided into three phases:

    • Deobfuscation phase.
    • In the deobfuscation phase, we propose a novel subtree-based approach leveraging the features of the PowerShell scripts. We treat the AST subtrees as the minimum units of obfuscation, and perform recovery on the subtrees, and finally construct the deobfuscated scripts. The deobfuscated scripts are then used in both training and detection phases. Note that such deobfuscation function can benefit not only the detection of PowerShell attacks in this paper but the analysis and forensics of them as well, which is thus a general contribution to the PowerShell attack defense area.
    • Training and detection phases.
    • After the deobfuscation phase, the semantics of the malicious PowerShell scripts are exposed and thus enable us to design and implement the first semantic-aware PowerShell attack detection approach. As shown on the right side of Figure 3, we adopt the classic Objective-oriented Association (OOA) mining algorithm [68] on malicious PowerShell script databases, which is able to automatically extract 31 OOA rules for signature matching. Besides, we can adapt existing anti-virus engines and manual analysis as extensions.
    • Application scenarios.
    • Our deobfuscation-based semantic-aware attack detection approach is mostly based on static analysis. Thus, compared to dynamic analysis based attack detection approaches, our approach has higher code coverage, much lower overhead, and also does not require modification to the system or interpreter. Compared to existing static analysis based attack detection approaches [26, 32, 53, 55], our approach is more resilient to obfuscation and also more explainable as our detection is semantics based. With these advantages over alternative approaches, our approach can be deployed in various application scenarios, including but not limited to:
    • – Real-time attack detection.
    • – Large-scale automated malware analysis.

    • Strengths of this passage: this is the classic PowerShell paper on deobfuscation; the writing of the whole paper is worth learning from, including the framework figure and the AST transformations.
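
    Here is a highly simplified sketch of the subtree-based loop described above, under my own assumptions: flag suspicious subtrees, "emulate" them to recover a plain value, splice the literal back into the AST, and repeat until nothing changes. The node class, the suspicion test, and the emulation are illustrative stand-ins, not the paper's implementation.

```python
# Illustrative subtree-based deobfuscation loop on a toy AST.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    kind: str                      # e.g. "Script", "Concat", "Literal"
    value: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

def is_suspicious(node: Node) -> bool:
    """Stand-in for the obfuscation detector: flag string-concatenation subtrees."""
    return node.kind == "Concat"

def emulate(node: Node) -> str:
    """Stand-in for emulation-based recovery: evaluate the subtree to a string."""
    if node.kind == "Literal":
        return node.value or ""
    return "".join(emulate(c) for c in node.children)

def deobfuscate(node: Node) -> bool:
    """One pass: replace suspicious subtrees by recovered literals; report changes."""
    changed = False
    for i, child in enumerate(node.children):
        if is_suspicious(child):
            node.children[i] = Node("Literal", value=emulate(child))
            changed = True
        else:
            changed |= deobfuscate(child)
    return changed

# ("Inv" + "oke-Ex" + "pression") obfuscated as a concatenation subtree.
script = Node("Script", children=[
    Node("Concat", children=[Node("Literal", value="Inv"),
                             Node("Literal", value="oke-Ex"),
                             Node("Literal", value="pression")]),
])
while deobfuscate(script):   # iterate until the AST stops changing
    pass
print(script.children[0].value)   # -> Invoke-Expression
```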

    DeepReflect(USENIX Sec21)

    [6] Evan Downing, et al. DeepReflect: Discovering Malicious Functionality through Binary Reconstruction, USENIX Sec21.

    The goal of DEEPREFLECT is to identify malicious functions within a malware binary. In practice, it identifies functions which are likely to be malicious by locating abnormal basic blocks (regions of interest - RoI). The analyst must then determine if these functions exhibit malicious or benign behaviors. There are two primary steps in our pipeline, illustrated in Figure 2: (1) RoI detection and (2) RoI annotation. RoI detection is performed using an autoencoder, while annotation is performed by clustering all of the RoIs per function and labeling those clusters.

    • Abnormal basic blocks are the regions of interest (RoI) to be identified

    Terminology. First, we define what we mean by “malicious behaviors.” We generate our ground-truth based on identifying core components of our malware’s source code (e.g., denial-of-service function, spam function, keylogger function, command-and-control (C&C) function, exploiting remote services, etc.). These are easily described by the MITRE ATT&CK framework [9], which aims to standardize these terminologies and descriptions of behaviors. However, when statically reverse engineering our evaluation malware binaries (i.e., in-the-wild malware binaries), we sometimes cannot for-certain attribute the observed low-level functions to these higher-level descriptions. For example, malware may modify registry keys for a number of different reasons (many of which can be described by MITRE), but sometimes determining which registry key is modified for what reason is difficult and thus can only be labeled loosely as “Defense Evasion: Modify Registry” in MITRE. Even modern tools like CAPA [3] identify these types of vague labels as well. Thus in our evaluation, we denote “malicious behaviors” as functions which can be described by the MITRE framework.

    RoI Detection. The goal of detection is to automatically identify malicious regions within a malware binary. For example, we would like to detect the location of the C&C logic rather than detect the specific components of that logic (e.g, the network API calls connect(), send(), and recv()). The advantage of RoI detection is that an analyst can be quickly pointed to specific regions of code responsible for launching and operating its malicious actions. Prior work only focuses on creating ad hoc signatures that simply identify a binary as malware or some capability based on API calls alone. This is particularly helpful for analysts scaling their work (i.e., not relying on manual reverse engineering and domain expertise alone).

    RoI Annotation. The goal of annotation is to automatically label the behavior of the functions containing the RoIs. In other words, this portion of our pipeline identifies what this malicious functionality is doing. Making this labeling nonintrusive to an analyst’s workflow and scalable is crucial. The initial work performed by an analyst for labeling clusters is a long-tail distribution. That is, there is relatively significant work upfront but less work as they continue to label each cluster. The advantage of this process is simple: it gives the analyst a way to automatically generate reports and insights about an unseen sample. For example, if a variant of a malware sample contains similar logic as prior malware samples (but looks different enough to an analyst to be unfamiliar), our tool gives them a way to realize this more quickly.

    • Strengths of this passage: the paper proposes DeepReflect, a method for discovering malicious functions through binary reconstruction. The pipeline takes unpacked malware samples as input, extracts CFG features from each basic block, feeds them into a pre-trained autoencoder to highlight the regions of interest (RoI), and finally clusters and labels those regions.
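
    As a toy illustration of the RoI-detection step (not the authors' model), the sketch below fits a reconstruction model on benign basic-block feature vectors, flags blocks with unusually high reconstruction error, and clusters the flagged blocks. PCA is used here as a linear stand-in for the autoencoder, and every dimension, threshold, and number is made up.

```python
# Toy RoI detection: flag basic blocks whose features reconstruct poorly under a
# model fit on benign blocks, then cluster the flagged blocks.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
benign_blocks  = rng.normal(0, 1, size=(1000, 18))             # ACFG-like feature vectors
malware_blocks = np.vstack([rng.normal(0, 1, size=(180, 18)),
                            rng.normal(4, 1, size=(20, 18))])   # a few abnormal blocks

pca = PCA(n_components=6).fit(benign_blocks)

def reconstruction_error(x):
    recon = pca.inverse_transform(pca.transform(x))
    return np.linalg.norm(x - recon, axis=1)

threshold = np.percentile(reconstruction_error(benign_blocks), 99)
errors = reconstruction_error(malware_blocks)
roi = malware_blocks[errors > threshold]            # regions of interest
print("flagged blocks:", len(roi))

if len(roi) >= 2:
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(roi)
    print("cluster sizes:", np.bincount(labels))
```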

    Phishpedia(USENIX Sec21)

    [7] Yun Lin, et al. Phishpedia: A Hybrid Deep Learning Based Approach to Visually Identify Phishing Webpages, USENIX Sec21.

    Figure 3 provides an overview of our proposed system, Phishpedia. Phishpedia takes as input a URL and a target brand list describing legitimate brand logos and their web domains; it then generates a phishing target (if the URL is considered as phishing) as output. We refer to the logo that identifies with the legitimate brand as the identity logo of that brand. Moreover, input boxes are the small forms where a user inputs credential information such as username and password.

    Given a URL, we first capture its screenshot in a sandbox. Then, we decompose the phishing identification task into two: an object-detection task and an image recognition task. First, we detect important UI components, specifically identity logos and input boxes, in the screenshot with an object detection algorithm [57, 58] (Section 3.1). As the next step, we identify the phishing target by comparing the detected identity logo with the logos in the target brand list via a Siamese model [33] (Section 3.2). Once a logo in the target brand list (e.g., that of Paypal) is matched, we consider its corresponding domain (e.g., paypal.com) as the intended domain for the captured screenshot. Subsequently, we analyze the difference between the intended domain and the domain of the given URL to report the phishing result. Finally, we combine the reported identity logo, input box, and phishing target to synthesize a visual phishing explanation (as shown in Figure 2).

    • Strengths of this passage: the paper builds Phishpedia, a hybrid deep learning system combining object detection and image recognition, to tackle the two technical challenges of phishing page identification: (i) accurately recognizing logos in webpage screenshots and (ii) matching logo variants of the same brand. The Overview is fairly brief, but the idea and the background (bringing object detection into a security scenario) are well worth learning from.
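
    A small sketch of the final decision logic, under my own assumptions: compare the detected logo's embedding (as a Siamese model would produce) against a target brand list with cosine similarity, then report phishing only when the matched brand's domain differs from the visited domain. The brand list, embeddings, and threshold are placeholders.

```python
# Toy decision logic: Siamese-style embedding match against a brand list,
# followed by a domain-consistency check.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder brand list: brand -> (reference logo embedding, legitimate domain).
brand_list = {
    "PayPal": (np.array([0.9, 0.1, 0.2]), "paypal.com"),
    "Apple":  (np.array([0.1, 0.8, 0.3]), "apple.com"),
}

def identify_phishing(detected_logo_emb, visited_domain, threshold=0.85):
    brand, (ref_emb, ref_domain) = max(
        brand_list.items(), key=lambda kv: cosine(detected_logo_emb, kv[1][0]))
    if cosine(detected_logo_emb, ref_emb) < threshold:
        return None                       # logo matches no known brand
    if visited_domain.endswith(ref_domain):
        return None                       # intended domain and visited domain agree
    return brand                          # phishing target reported

print(identify_phishing(np.array([0.88, 0.12, 0.25]), "paypa1-login.xyz"))  # -> PayPal
print(identify_phishing(np.array([0.88, 0.12, 0.25]), "www.paypal.com"))    # -> None
```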


    TextExerciser(S&P21)

    [8] Yuyu He, et al. TextExerciser: Feedback-driven Text Input Exercising for Android Applications, S&P21.

    TextExerciser is a feedback-driven text exerciser that understands hints shown on user interfaces of Android apps and then extracts corresponding constraints. The high-level idea of understanding these hints is based on an observation that these hints with similar semantics often have a similar syntax structure—and therefore TextExerciser can cluster these hints based on their syntax structures and then extract the constraints from the syntax structure. Now, let us give some details of TextExerciser’s workflow.

    The exercising has three phases, seven steps as shown in Figure 2. First, TextExerciser extracts all the texts in the app’s UI (Step 1) and then identifies static hints via a learning-based method and dynamic hints via a structure-based differential analysis (Step 2). Second, TextExerciser parses all the extracted hints via three steps: classifying hints into different categories (Step 3), generating syntax trees for each hint (Step 4), and interpreting the generated tree into a constraint representation form (Step 5). Lastly, TextExerciser generates a concrete input by feeding constraints into a solver (Step 6), e.g., Z3. Then, TextExerciser solves the problem, feeds generated inputs back to the target Android app and extracts feedbacks, such as success and another hint (Step 7). In the case of another hint, TextExerciser will iterate the entire procedure until TextExerciser finds a valid input.

    Now let us look at our motivating example in §II again to explain TextExerciser’s workflow. We start from the sign-up page, which has three text input fields, i.e., “username”, “password” and “confirm password”. TextExerciser generates a random input to the username field: If the username is used in the database, Yippi returns a “username used” hint. TextExerciser will then parse the hint and generate a new username. The “password” and “confirm password” are handled together by TextExerciser: based on the hint that “Both password has to be the same”1, TextExerciser will convert the hint into a constraint that the value of both fields need to be the same and then generate corresponding inputs.

    • Example:


    After TextExerciser generates inputs for the first sign-up page, Yippi asks the user to input a code that is sent to a phone number. TextExerciser will first extract hints related to the phone number page, understand that this is a phone number, and then input a pre-registered phone number to the field. Next, TextExerciser will automatically extract the code from the SMS and solve the constraints by inputting the code to Yippi. In order to find the aforementioned vulnerability in §II, TextExerciser also generates text inputs to the “Change Password” page. Particularly, TextExerciser extracts the password matching hint and another hint that distinguishes old and new passwords, converts them into constraints and then generates corresponding inputs so that existing dynamic analysis tools can find the vulnerability.

    Strengths of this passage:
    Automatically generating suitable program inputs is a key step in automated software testing and dynamic program analysis, and text input is a very common feature of modern software, so automatically generating valid text input events to drive testing is a major difficulty for dynamic testing. Existing approaches generally generate text inputs from UI information and heuristic rules; they cannot understand the content and format requirements specific to an app, so the generated text often fails to satisfy what the program needs at runtime. In addition, mobile apps have been moving their data-processing logic to the cloud in recent years, and much of the input validation now lives on the server side, which defeats traditional program-analysis approaches that solve for the input text.
    With this in mind, the paper proposes TextExerciser, an automatic text input generation method for mobile apps. Its core insight is that whenever a text input does not meet the app's requirements, the app displays a hint in natural language on the user interface. By combining natural language processing and machine learning, the method parses these hints, understands the input constraints they contain, and generates input text accordingly; the process iterates until a suitable text input is produced. In the experiments, the paper combines this text generation method with existing dynamic testing and analysis tools, showing that it not only raises code coverage during testing but also finds program vulnerabilities and privacy leaks triggered by specific input events. The work was published at S&P 2020, a top venue in information security.
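
    To make the "feed constraints into a solver" step concrete, here is a minimal sketch using Z3's string theory, under my own assumptions about what the parsed hint constraints might look like (a length requirement, a required character, and password/confirm-password equality); it is not TextExerciser's actual constraint encoding.

```python
# Minimal sketch: turn parsed hint constraints into a Z3 query and read back an input.
from z3 import Solver, String, StringVal, Length, Contains, sat

password = String("password")
confirm  = String("confirm_password")

s = Solver()
s.add(Length(password) >= 8)                 # "Password must be at least 8 characters"
s.add(Contains(password, StringVal("1")))    # "Password must contain a digit" (simplified)
s.add(confirm == password)                   # "Both passwords have to be the same"

if s.check() == sat:
    model = s.model()
    print("generated password:", model[password])
    print("generated confirmation:", model[confirm])
else:
    print("constraints are unsatisfiable; fall back to a random input")
```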

    DeepBinDiff(NDSS20)

    [9] Yue Duan, et al. DEEPBINDIFF: Learning Program-Wide Code Representations for Binary Diffing, NDSS20.

    Figure 1 delineates the system architecture of DEEPBINDIFF. Red squares represent generated intermediate data during analysis. As shown, the system takes as input two binaries and outputs the basic block level diffing results. The system solves the two tasks mentioned in Section II-A by using two major techniques. First, to calculate sim(mi) that quantitatively measures basic block similarity, DEEPBINDIFF embraces an unsupervised learning approach to generate embeddings and utilizes them to efficiently calculate the similarity scores between basic blocks. Second, our system uses a k-hop greedy matching algorithm to generate the matching M(p1, p2).

    The whole system consists of three major components: 1) pre-processing; 2) embedding generation and 3) code diffing. Pre-processing, which can be further divided into two sub-components: CFG generation and feature vector generation, is responsible for generating two pieces of information: inter-procedural control-flow graphs (ICFGs) and feature vectors for basic blocks. Once generated, the two results are sent to embedding generation component that utilizes TADW technique [48] to learn the graph embeddings for each basic block. DEEPBINDIFF then makes use of the generated basic block embeddings and performs a k-hop greedy matching algorithm for code diffing at basic block level.

    • Strengths of this passage: DeepBinDiff is a classic paper on binary analysis and is worth learning from.
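
    The sketch below illustrates, with my own simplifications, the k-hop greedy matching idea: start from the most similar unmatched basic-block pair, then preferentially match within the k-hop neighborhoods of pairs already matched. The embeddings, graphs, and similarity threshold are toy placeholders.

```python
# Toy k-hop greedy matching between basic blocks of two binaries, driven by
# cosine similarity of their (placeholder) embeddings.
import numpy as np
from itertools import product

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def k_hop(graph, node, k):
    """Nodes reachable from `node` within k hops in an undirected adjacency dict."""
    frontier, seen = {node}, {node}
    for _ in range(k):
        frontier = {n for f in frontier for n in graph.get(f, [])} - seen
        seen |= frontier
    return seen

def greedy_match(emb1, emb2, g1, g2, k=2, threshold=0.6):
    unmatched1, unmatched2 = set(emb1), set(emb2)
    matches = []
    while unmatched1 and unmatched2:
        # Prefer candidate pairs inside k-hop neighborhoods of existing matches.
        cand = [(u, v) for (a, b) in matches
                for u, v in product(k_hop(g1, a, k) & unmatched1,
                                    k_hop(g2, b, k) & unmatched2)]
        if not cand:
            cand = list(product(unmatched1, unmatched2))
        u, v = max(cand, key=lambda p: cosine(emb1[p[0]], emb2[p[1]]))
        if cosine(emb1[u], emb2[v]) < threshold:
            break
        matches.append((u, v))
        unmatched1.discard(u)
        unmatched2.discard(v)
    return matches

emb1 = {"a": np.array([1.0, 0.0]), "b": np.array([0.7, 0.7])}
emb2 = {"x": np.array([0.9, 0.1]), "y": np.array([0.6, 0.8])}
g1 = {"a": ["b"], "b": ["a"]}
g2 = {"x": ["y"], "y": ["x"]}
print(greedy_match(emb1, emb2, g1, g2))
```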

    Slimium(CCS20)

    [10] Chenxiong Qian, Slimium: Debloating the Chromium Browser with Feature Subsetting, CCS20.

    Figure 3 shows an overview of Slimium for debloating Chromium. Slimium consists of three main phases: i) feature-code map generation, ii) prompt website profiling based on page visits, and iii) binary instrumentation based on i) and ii).

    Feature-Code Mapping. To build a set of unit features for debloating, we investigate source code [35] (Figure 2), previously-assigned CVEs pertaining to Chromium, and external resources [8, 47] for the Web specification standards (Step ① in Figure 3). Table 1 summarizes 164 features with four different categories. Once the features have been prepared, we generate a feature-code map that aids further debloating from the two sources (①’ and ②’). From the light-green box in Figure 3, consider the binary that contains two CUs to which three and four consecutive binary functions (i.e., {f0 – f2} and {f3 – f6}) belong, respectively. The initial mapping between a feature and source code relies on a manual discovery process that may miss some binary functions (i.e., from the source generated at compilation). Then, we apply a new means to explore such missing functions, followed by creating a call graph on the IR (Intermediate Representation) (Step ②, Section 4.2).

    Website Profiling. The light-yellow box in Figure 3 enables us to trace exercised functions when running a Chromium process. Slimium harnesses a website profiling to collect non-deterministic code paths, which helps to avoid accidental code elimination. As a baseline, we perform differential analysis on exercised functions by visiting a set of websites (Top 1000 from Alexa [3]) multiple times (Step ③). For example, we mark any function non-deterministic if a certain function is not exercised for the first visit but is exercised for the next visit. Then, we gather exercised functions for target websites of our interest with a defined set of user activities (Step ④). During this process, profiling may identify a small number of exercised functions that belong to an unused feature (i.e., initialization). As a result, we obtain the final profiling results that assist binary instrumentation (③’ and ④’).

    Binary Rewriting. The final process creates a debloated version of a Chromium binary with a feature subset (Step ⑤ in Figure 3). In this scenario, the feature in the green box has not been needed based on the feature-code mapping and profiling results, erasing the functions { f0, f1, f3} of the feature. As an end user, it is sufficient to take Step ④ and ⑤ for binary instrumentation where pre-computed feature-code mapping and profiling results are given as supplementary information.

    • Strengths of this passage: Chromium has become the mainstream browser on both mobile and desktop. As its features have grown, its code has become increasingly bloated, giving attackers many exploitable opportunities. With this in mind, the authors propose SLIMIUM, a debloating framework for the browser that removes unnecessary code to shrink the attack surface. They study CVEs and external resources on Web specification standards to build the corresponding feature set.
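
    A small sketch of the final pruning decision, under my own assumptions: starting from a feature-to-function map, erase functions that belong only to unneeded features and were never exercised (even non-deterministically) during profiling. All names and sets here are hypothetical.

```python
# Toy computation of which binary functions a Slimium-style debloater could erase.
feature_to_funcs = {
    "WebUSB":   {"f0", "f1", "f3"},
    "Canvas2D": {"f2", "f4"},
    "WebAudio": {"f5", "f6"},
}
needed_features  = {"Canvas2D"}         # features the target websites actually use
exercised_funcs  = {"f2", "f4", "f5"}   # seen while profiling page visits
nondeterministic = {"f6"}               # exercised only on some visits

# Keep everything reachable from needed features, plus anything profiling ever touched.
keep = set().union(*(feature_to_funcs[f] for f in needed_features))
keep |= exercised_funcs | nondeterministic

all_funcs = set().union(*feature_to_funcs.values())
erase = all_funcs - keep
print("functions to erase:", sorted(erase))   # -> ['f0', 'f1', 'f3']
```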


    III. Summary

    That is all for this article; I hope it helps. My English is really poor and my own papers are at a low level, so please be forgiving and critical where the writing falls short. Discussion is welcome, and I sincerely recommend reading the original papers: these authors are truly worth learning from. Keep going, and cherish the journey!

    A busy March has ended; time flies, and it was a full month. Busy as I was, I still squeezed in time to answer many readers' questions: about technology and code, about job hunting and getting started in a direction, and about admission interviews. Although my research and engineering skills are modest and this year of focused work keeps me very busy, I just cannot help it. Such is my personality; I am grateful for every encounter, and a clear conscience is enough. Do good deeds and do not ask about the road ahead. Keep striving, keep the focus. Little Luoluo is my daily source of joy, haha. Good night, Na O(∩_∩)O

    Heartfelt thanks to my family for their support and companionship. Everyone, please take care of yourselves; nothing is more beautiful than family and life. Best wishes!

    (By: Eastmount, 2021-04-01, midnight)
