Seccomp BPF and Container Security

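This article introduces seccomp in detail: its history, how seccomp BPF is implemented, and several seccomp-related tools. It then shows, through worked examples, how seccomp BPF can be used to harden Docker containers.

Introduction

seccomp (short for secure computing mode) is a security mechanism provided by the Linux kernel. On Linux, a large number of system calls are exposed directly to user-space programs. A given program needs only some of them, and insecure code that abuses system calls is a threat to the whole system. With seccomp we can restrict which system calls a program may use, which reduces the kernel's attack surface and puts the program into a "secure" state.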

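History of seccomp

2005: Linux 2.6.12 introduced the first version of seccomp. It was enabled by writing "1" to the /proc/PID/seccomp interface, and there was only one mode: strict mode. In strict mode the restricted process may use only four system calls: read(), write(), _exit(), and sigreturn(). Note that open() is also forbidden, so any files must be opened before entering strict mode. Once strict mode is applied, any other system call raises SIGKILL and terminates the process immediately.

2007: Linux 2.6.23 replaced the /proc/PID/seccomp interface with prctl(). prctl(PR_SET_SECCOMP, arg) sets the caller's seccomp mode, and prctl(PR_GET_SECCOMP) queries it. A return value of 0 means seccomp is not applied. A process in strict mode cannot call prctl() at all, so the query itself kills it with SIGKILL. In filter mode (added later, see below) the call returns 2, provided the installed filter allows prctl().

2012: Linux 3.5 introduced "seccomp mode 2", which brought a new mode: filter mode. It uses Berkeley Packet Filter (BPF) programs to filter arbitrary system calls and their arguments. In this mode a process uses prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...) to specify which system calls are allowed. Many applications now use seccomp filters to control system calls, including Chrome/Chromium, OpenSSH, vsftpd, and Firefox OS.

2013: Linux 3.8 added a Seccomp field to /proc/PID/status. Reading that file shows the seccomp mode of the process (0 means disabled, 1 means strict, 2 means filter).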

/* Valid values for seccomp.mode and prctl(PR_SET_SECCOMP, <mode>) */
#define SECCOMP_MODE_DISABLED    0 /* seccomp is not in use. */
#define SECCOMP_MODE_STRICT    1 /* uses hard-coded filter. */
#define SECCOMP_MODE_FILTER    2 /* uses user-supplied filter. */

null@ubuntu:~/seccomp$ cat /proc/1/status | grep Seccomp
Seccomp:        0
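The same check can be done from C. The following is a minimal sketch, not part of the original article: the function names parse_seccomp_mode and current_seccomp_mode are made up for illustration, and the code assumes the "Seccomp:" line format shown above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the value of the "Seccomp:" field found in the text of a
 * /proc/<pid>/status file, or -1 if the field is absent or malformed.
 * 0 = disabled, 1 = strict, 2 = filter. */
int parse_seccomp_mode(const char *status_text)
{
    assert(status_text != NULL);
    const char *line = status_text;
    while (line != NULL && *line != '\0') {
        int mode;
        /* Match the exact field name; "Seccomp_filters:" does not match. */
        if (strncmp(line, "Seccomp:", 8) == 0)
            return sscanf(line + 8, "%d", &mode) == 1 ? mode : -1;
        line = strchr(line, '\n');
        if (line != NULL)
            line++;   /* step past the newline to the next line */
    }
    return -1;
}

/* Read /proc/self/status and report the calling process's seccomp mode. */
int current_seccomp_mode(void)
{
    char buf[8192];
    FILE *fp = fopen("/proc/self/status", "r");
    if (fp == NULL)
        return -1;
    size_t n = fread(buf, 1, sizeof(buf) - 1, fp);
    fclose(fp);
    buf[n] = '\0';
    return parse_seccomp_mode(buf);
}
```

Calling current_seccomp_mode() before and after installing a filter is a quick way to confirm that the filter was actually applied.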

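2014: Linux 3.17 introduced the seccomp() system call. It provides a superset of the functionality available through prctl() and adds the ability to synchronize all threads of a process to the same set of filters, which helps ensure that threads created before the filter was installed are covered by it as well.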

Seccomp + BPF

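seccomp filter mode lets developers write a BPF program that decides whether a given system call is allowed, based on the system call number and the argument (register) values. When seccomp is applied to a process with seccomp() or prctl(), the prepared BPF program is installed into the kernel, and every subsequent system call passes through the filter. The process is irreversible, because installing a filter is in effect a declaration that any code executed afterwards is untrusted.

BPF was first introduced in 1992 for the tcpdump program, a network packet monitoring tool. The number of packets is large, and copying every captured packet from kernel space to user space incurs a lot of unnecessary overhead, so packets have to be filtered to keep only the interesting ones. Filtering them in the kernel is more efficient than filtering in user space. BPF provides exactly that: a way to filter inside the kernel, so that user space only has to handle the packets the kernel filter has already selected.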

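BPF defines a virtual machine (VM) that can be implemented inside the kernel. The VM has the following characteristics:

Simple instruction set:
  Small set of instructions
  All instructions are the same size
  Implementation is simple and fast
Only branch-forward instructions:
  Programs are directed acyclic graphs (DAGs), with no loops
Easy to verify the validity and safety of programs:
  A simple instruction set means opcodes and arguments can be verified
  Dead code can be detected
  Programs must end with a return instruction
  BPF filter programs are limited to 4096 instructions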

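The user-space definitions for classic BPF programs live mainly in filter.h and bpf_common.h. The main data structures are the following:

Linux v5.18.4/include/uapi/linux/filter.h -> sock_fprog（https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/filter.h#L24）

struct sock_fprog {    /* Required for SO_ATTACH_FILTER. */
    unsigned short        len;    /* number of BPF instructions */
    struct sock_filter __user *filter;  /* pointer to the array of BPF instructions */
};

This structure records the number of filter rules and where the rule array starts. The filter field points to the rules themselves, each of which has the following form:

Linux v5.18.4/include/uapi/linux/filter.h -> sock_filter（https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/filter.h#L24）

struct sock_filter {    /* Filter block */
    __u16    code;   /* Actual filter code */
    __u8    jt;    /* Jump true */
    __u8    jf;    /* Jump false */
    __u32    k;      /* Generic multiuse field */
};

Each rule has four fields: code is the filter instruction, jt is the jump taken when the condition is true, jf is the jump taken when the condition is false, and k is the operand.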

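The BPF instruction set is fairly simple. The main instructions are:

Load instructions
Store instructions
Jump instructions
Arithmetic and logic instructions:
  ADD, SUB, MUL, DIV, MOD, NEG, OR, AND, XOR, LSH, RSH
Return instructions
Conditional jump instructions:
  Two jump targets: jt when the test is true, jf when it is false
  Jump targets are instruction offsets, at most 255

How is a BPF program written? BPF instructions can be written by hand, but the kernel headers define symbolic constants and two convenience macros, BPF_STMT and BPF_JUMP, that make writing BPF rules much easier.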

Linux v5.18.4/include/uapi/linux/filter.h -> BPF_STMT & BPF_JUMP（https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/filter.h#L45）

/*
 * Macros for filter block array initializers.
 */
#ifndef BPF_STMT
#define BPF_STMT(code, k) { (unsigned short)(code), 0, 0, k }
#endif
#ifndef BPF_JUMP
#define BPF_JUMP(code, k, jt, jf) { (unsigned short)(code), jt, jf, k }
#endif

BPF_STMT

BPF_STMT takes two arguments, an opcode (code) and a value (k). For example:

BPF_STMT(BPF_LD | BPF_W | BPF_ABS,(offsetof(struct seccomp_data, arch)))

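The opcode here is three constants OR-ed together. BPF_LD builds a BPF load operation. BPF_W says the operand size is one word (32 bits). BPF_ABS selects absolute addressing: the value in the instruction is used as an offset into the data area, in this case the offset of the architecture field. offsetof() produces the offset of the desired field within the data area.

The instruction therefore loads the architecture number into the accumulator.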
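The way the opcode is composed can be checked by taking it apart again with the BPF_CLASS, BPF_SIZE and BPF_MODE macros from bpf_common.h. This is a small sketch for illustration only; the helper names insn_class, insn_size and insn_mode are invented here, and the offset value assumes the struct seccomp_data layout of current kernel headers.

```c
#include <assert.h>
#include <linux/filter.h>   /* struct sock_filter, BPF_STMT, BPF_* constants */
#include <linux/seccomp.h>  /* struct seccomp_data */
#include <stddef.h>         /* offsetof */

/* BPF_LD and BPF_W are both 0, so the whole opcode equals BPF_ABS (0x20). */
static_assert((BPF_LD | BPF_W | BPF_ABS) == 0x20, "unexpected opcode value");

/* The instruction from the text: load the 32-bit arch field. */
static const struct sock_filter load_arch =
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch));

/* Hypothetical helpers that split an opcode back into its three parts. */
int insn_class(const struct sock_filter *insn) { return BPF_CLASS(insn->code); }
int insn_size(const struct sock_filter *insn)  { return BPF_SIZE(insn->code); }
int insn_mode(const struct sock_filter *insn)  { return BPF_MODE(insn->code); }
```

Because BPF_LD and BPF_W are zero, a disassembler shows this instruction with code 0x20 and k equal to the offset of arch.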

BPF_JUMP

BPF_JUMP takes four arguments: an opcode, a value (k), a jump-if-true offset (jt), and a jump-if-false offset (jf). For example:

BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K ,AUDIT_ARCH_X86_64 , 1, 0)

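BPF_JMP | BPF_JEQ builds a jump-if-equal instruction, and BPF_K selects the constant in the instruction (the second argument, AUDIT_ARCH_X86_64) as the value to compare against the accumulator. If the architecture is x86-64, the next instruction is skipped (jt=1: skip one instruction when the test is true). Otherwise execution continues with the next instruction (jf=0: skip zero instructions when the test is false).

These two instructions are commonly used together to validate the system architecture.

Here is another practical example, a filter that blocks the execve system call: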

struct sock_filter filter[] = {
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS,0),           // load 4 bytes at offset 0 of seccomp_data (the system call number) into the accumulator
    BPF_JUMP(BPF_JMP+BPF_JEQ,59,0,1),           // if A == 59, fall through to the next rule; otherwise skip it. 59 is execve on x86-64
    BPF_STMT(BPF_RET+BPF_K,SECCOMP_RET_KILL),   // return KILL
    BPF_STMT(BPF_RET+BPF_K,SECCOMP_RET_ALLOW),  // return ALLOW
};
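To see the jt/jf arithmetic in action without installing anything into the kernel, the sketch below interprets the execve filter in user space. The interpreter emulate_filter and the helper check_syscall are illustrative names, not kernel or library APIs, and only the three instruction forms used in this article are handled; it is a teaching aid, not a replacement for the kernel's BPF engine.

```c
#include <assert.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Evaluate a classic-BPF seccomp filter against `data` in user space.
 * Handled: BPF_LD|BPF_W|BPF_ABS, BPF_JMP|BPF_JEQ|BPF_K, BPF_RET|BPF_K.
 * Anything else (or running off the end) yields SECCOMP_RET_KILL. */
uint32_t emulate_filter(const struct sock_filter *prog, size_t len,
                        const struct seccomp_data *data)
{
    assert(prog != NULL && data != NULL);
    uint32_t acc = 0;                       /* the accumulator "A" */
    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *insn = &prog[pc];
        switch (insn->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            if (insn->k > sizeof(*data) - sizeof(acc))
                return SECCOMP_RET_KILL;    /* load outside seccomp_data */
            memcpy(&acc, (const char *)data + insn->k, sizeof(acc));
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
            /* Skip jt instructions if equal, jf otherwise; the loop's
             * pc++ then moves on to the instruction after the skip. */
            pc += (acc == insn->k) ? insn->jt : insn->jf;
            break;
        case BPF_RET | BPF_K:
            return insn->k;
        default:
            return SECCOMP_RET_KILL;
        }
    }
    return SECCOMP_RET_KILL;
}

/* The execve filter from the text. */
static const struct sock_filter execve_filter[] = {
    BPF_STMT(BPF_LD + BPF_W + BPF_ABS, 0),
    BPF_JUMP(BPF_JMP + BPF_JEQ, 59, 0, 1),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_KILL),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
};

/* Verdict of the filter for a given system call number. */
uint32_t check_syscall(int nr)
{
    struct seccomp_data data;
    memset(&data, 0, sizeof(data));
    data.nr = nr;
    return emulate_filter(execve_filter,
                          sizeof(execve_filter) / sizeof(execve_filter[0]),
                          &data);
}
```

With this, check_syscall(59) yields SECCOMP_RET_KILL and any other number yields SECCOMP_RET_ALLOW, which is the behavior the comments in the filter describe.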

bpf_common.h（https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/bpf_common.h#L7）defines the opcodes used with BPF_STMT and BPF_JUMP:

/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
#ifndef _UAPI__LINUX_BPF_COMMON_H__
#define _UAPI__LINUX_BPF_COMMON_H__

/* Instruction classes */
#define BPF_CLASS(code) ((code) & 0x07)    // selects the instruction class
#define        BPF_LD        0x00          // copy a value into the accumulator
#define        BPF_LDX        0x01         // load a value into the index register
#define        BPF_ST        0x02          // store the accumulator into scratch memory
#define        BPF_STX        0x03         // store the index register into scratch memory
#define        BPF_ALU        0x04         // arithmetic or logic on the accumulator, with the index register or a constant as operand
#define        BPF_JMP        0x05         // jump
#define        BPF_RET        0x06         // return
#define        BPF_MISC        0x07        // miscellaneous

/* ld/ldx fields */
#define BPF_SIZE(code)  ((code) & 0x18)
#define        BPF_W        0x00 /* 32-bit */       // word
#define        BPF_H        0x08 /* 16-bit */       // half word
#define        BPF_B        0x10 /*  8-bit */       // byte
/* eBPF        BPF_DW        0x18    64-bit */       // double word
#define BPF_MODE(code)  ((code) & 0xe0)
#define        BPF_IMM        0x00                  // constant
#define        BPF_ABS        0x20                  // packet data at a fixed offset (absolute)
#define        BPF_IND        0x40                  // packet data at a variable offset (relative)
#define        BPF_MEM        0x60                  // a word in scratch memory
#define        BPF_LEN        0x80                  // packet length
#define        BPF_MSH        0xa0

/* alu/jmp fields */
#define BPF_OP(code)    ((code) & 0xf0)       // for the ALU class: selects the operator
#define        BPF_ADD        0x00
#define        BPF_SUB        0x10
#define        BPF_MUL        0x20
#define        BPF_DIV        0x30
#define        BPF_OR        0x40
#define        BPF_AND        0x50
#define        BPF_LSH        0x60
#define        BPF_RSH        0x70
#define        BPF_NEG        0x80
#define        BPF_MOD        0x90
#define        BPF_XOR        0xa0
                                              // for the JMP class: selects the jump type
#define        BPF_JA        0x00
#define        BPF_JEQ        0x10
#define        BPF_JGT        0x20
#define        BPF_JGE        0x30
#define        BPF_JSET        0x40
#define BPF_SRC(code)   ((code) & 0x08)
#define        BPF_K        0x00                    // constant
#define        BPF_X        0x08                    // index register

#ifndef BPF_MAXINSNS
#define BPF_MAXINSNS 4096
#endif

#endif /* _UAPI__LINUX_BPF_COMMON_H__ */

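Most seccomp-specific definitions are in seccomp.h.

Once seccomp-BPF is configured for a program, every system call passes through the seccomp filter, which has some impact on performance. The filter returns a value to the kernel indicating whether the system call is allowed. The return value is a 32-bit number: the most significant 16 bits (the SECCOMP_RET_ACTION mask) specify the action the kernel should take, and the remaining bits (the SECCOMP_RET_DATA mask) carry data associated with that action.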

/*
 * All BPF programs must return a 32-bit value.
 * The bottom 16-bits are for optional return data.
 * The upper 16-bits are ordered from least permissive values to most,
 * as a signed value (so 0x8000000 is negative).
 *
 * The ordering ensures that a min_t() over composed return values always
 * selects the least permissive choice.
 */
#define SECCOMP_RET_KILL_PROCESS 0x80000000U /* kill the process */
#define SECCOMP_RET_KILL_THREAD     0x00000000U /* kill the thread */
#define SECCOMP_RET_KILL     SECCOMP_RET_KILL_THREAD
#define SECCOMP_RET_TRAP     0x00030000U /* disallow and force a SIGSYS */
#define SECCOMP_RET_ERRNO     0x00050000U /* returns an errno */
#define SECCOMP_RET_USER_NOTIF     0x7fc00000U /* notifies userspace */
#define SECCOMP_RET_TRACE     0x7ff00000U /* pass to a tracer or disallow */
#define SECCOMP_RET_LOG         0x7ffc0000U /* allow after logging */
#define SECCOMP_RET_ALLOW     0x7fff0000U /* allow */

/* Masks for the return value sections. */
#define SECCOMP_RET_ACTION_FULL    0xffff0000U
#define SECCOMP_RET_ACTION    0x7fff0000U
#define SECCOMP_RET_DATA    0x0000ffffU

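SECCOMP_RET_ALLOW: allow the system call to execute
SECCOMP_RET_KILL: terminate execution immediately
SECCOMP_RET_ERRNO: return an error from the system call (the system call is not executed)
SECCOMP_RET_TRACE: try to notify a ptrace() tracer, giving it a chance to take control
SECCOMP_RET_TRAP: have the kernel send a SIGSYS signal (the system call is not executed)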
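The split between action bits and data bits can be made concrete with a few lines of C. This is a sketch under stated assumptions: the helper names ret_action, ret_data and ret_errno are invented for illustration, and the upper-half mask is defined locally with the same value the header above gives for SECCOMP_RET_ACTION_FULL so that the code also builds against older headers.

```c
#include <assert.h>
#include <errno.h>
#include <linux/seccomp.h>
#include <stdint.h>

/* Local copy of the upper-16-bit mask (same value as SECCOMP_RET_ACTION_FULL). */
#define RET_ACTION_FULL_MASK 0xffff0000U

static_assert(SECCOMP_RET_DATA == 0x0000ffffU, "unexpected data mask");

/* Action part of a filter return value (upper 16 bits). */
uint32_t ret_action(uint32_t ret) { return ret & RET_ACTION_FULL_MASK; }

/* Data part of a filter return value (lower 16 bits). */
uint32_t ret_data(uint32_t ret)   { return ret & SECCOMP_RET_DATA; }

/* Value a filter would return to make a system call fail with `err`. */
uint32_t ret_errno(uint16_t err)
{
    return SECCOMP_RET_ERRNO | (err & SECCOMP_RET_DATA);
}
```

A filter rule such as BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM) uses exactly this encoding: the kernel reads the action from the upper half and hands the lower half back to the caller as errno.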

Every seccomp-BPF program receives a seccomp_data（https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/seccomp.h#L63）structure as its input:

include/uapi/linux/seccomp.h（https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/seccomp.h）:

struct seccomp_data {
  int nr ;                    /* system call number (architecture-dependent) */
  __u32 arch ;                /* architecture (e.g. AUDIT_ARCH_X86_64) */
  __u64 instruction_pointer ; /* CPU instruction pointer */
  __u64 args [6];             /* system call arguments, at most 6 */
};
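The field offsets of this structure are what the k operand of a BPF_LD | BPF_W | BPF_ABS instruction refers to, which is why a bare 0 loads the system call number. The sketch below spells the layout out. The helper names arg_lo_offset and arg_hi_offset are hypothetical, and the low/high split assumes a little-endian machine such as x86-64; since BPF loads are 32 bits wide, a full 64-bit argument needs two loads.

```c
#include <assert.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>

/* Layout implied by the structure definition. */
static_assert(offsetof(struct seccomp_data, nr) == 0, "nr at offset 0");
static_assert(offsetof(struct seccomp_data, arch) == 4, "arch at offset 4");
static_assert(offsetof(struct seccomp_data, instruction_pointer) == 8,
              "instruction_pointer at offset 8");
static_assert(offsetof(struct seccomp_data, args) == 16, "args at offset 16");

/* Offset of the low 32 bits of args[index] (little-endian assumption). */
size_t arg_lo_offset(unsigned index)
{
    assert(index < 6);
    return offsetof(struct seccomp_data, args) + index * sizeof(uint64_t);
}

/* Offset of the high 32 bits of args[index] (little-endian assumption). */
size_t arg_hi_offset(unsigned index)
{
    return arg_lo_offset(index) + sizeof(uint32_t);
}
```

A filter that compares only the word at arg_lo_offset(i) ignores the upper half of the argument, which is fine for small flag values but worth keeping in mind for pointers and sizes.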

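Implementation

prctl()

prctl（https://man7.org/linux/man-pages/man2/prctl.2.html）performs operations on a process or thread. Its prototype is:

#include <sys/prctl.h>
int prctl(int option, unsigned long arg2, unsigned long arg3, unsigned long arg4, unsigned long arg5);

The option argument selects the operation. There are many options; two of them matter for seccomp: PR_SET_NO_NEW_PRIVS and PR_SET_SECCOMP.

PR_SET_NO_NEW_PRIVS: introduced in Linux 3.5. Once a process has set the no_new_privs attribute, neither it nor its children can gain privileges through execve(); for example, the set-user-ID bit and file capabilities of an executed binary no longer take effect. To install a seccomp-BPF filter, the caller must either hold the CAP_SYS_ADMIN capability or already have no_new_privs set. Otherwise the call fails with EACCES, so the filter is never applied when a non-root user runs the program. Setting PR_SET_NO_NEW_PRIVS first ensures that seccomp works for all users.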

prctl(PR_SET_NO_NEW_PRIVS,1,0,0,0);

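Setting the second argument to 1 turns the attribute on. It is inherited by child processes, preserved across execve(), and cannot be cleared afterwards. Together with the filter itself, which is also inherited, this ensures that seccomp works for all users and that child processes, including the program run after execve(), remain subject to the seccomp restrictions.

PR_SET_SECCOMP: sets the seccomp mode of the process. The usual form is: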

prctl(PR_SET_SECCOMP,SECCOMP_MODE_FILTER,&prog);

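The SECCOMP_MODE_FILTER argument selects filter mode; SECCOMP_MODE_STRICT selects strict mode. In filter mode, the system call restrictions are defined by the structure passed as &prog (the struct sock_fprog described above).

A simple strict-mode example

In strict mode only four system calls are available to the process. Because open() is also disabled, files must be opened before entering strict mode. A simple example: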

seccomp_strict.c：

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

void configure_seccomp() {
  printf("Configuring seccomp\n");
  prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
}

int main(int argc, char* argv[]) {
  int infd, outfd;
  ssize_t read_bytes;
  char buffer[1024];

  if (argc < 3) {
    printf("Usage:\tdup_file <input path> <output path>\n");
    return -1;
  }

  configure_seccomp(); /* enable seccomp strict mode */

  printf("Opening '%s' for reading\n", argv[1]);
  if ((infd = open(argv[1], O_RDONLY)) > 0) { /* open() is forbidden: the process is killed here */
    printf("Opening '%s' for writing\n", argv[2]);
    if ((outfd = open(argv[2], O_WRONLY | O_CREAT, 0644)) > 0) {
        while((read_bytes = read(infd, &buffer, 1024)) > 0)
          write(outfd, &buffer, (ssize_t)read_bytes);
    }
  }
  close(infd);
  close(outfd);
  return 0;
}

The program implements a simple file copy. When it runs with seccomp strict mode applied, seccomp terminates it at the open(argv[1], O_RDONLY) call.

null@ubuntu:~/seccomp$ gcc -o seccomp_strict seccomp_strict.c
null@ubuntu:~/seccomp$ ./seccomp_strict /etc/passwd output
Configuring seccomp
Opening '/etc/passwd' for reading
Killed

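Simple filter-mode examples

Following the description above, applying a seccomp-BPF policy to a program takes three steps: define the filter array, define the prog structure, and apply the policy with prctl().

Example 1: forbid the execve system call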

seccomp_filter_execv.c:

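#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

int main(){
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD+BPF_W+BPF_ABS,0), // load 4 bytes at offset 0 (the system call number) into the accumulator
        BPF_JUMP(BPF_JMP+BPF_JEQ,59,0,1), // is the system call number 59? if so fall through, otherwise skip the next rule
        BPF_STMT(BPF_RET+BPF_K,SECCOMP_RET_KILL), // return KILL
        BPF_STMT(BPF_RET+BPF_K,SECCOMP_RET_ALLOW), // return ALLOW
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])), // number of rules
        .filter = filter,                                         // pointer to the rule array
    };
    prctl(PR_SET_NO_NEW_PRIVS,1,0,0,0);             // set NO_NEW_PRIVS
    prctl(PR_SET_SECCOMP,SECCOMP_MODE_FILTER,&prog);
    write(1,"test\n",5);                            // write() is still allowed
    system("/bin/sh");                              // the execve() behind system() is killed
    return 0;
}

Example 2: restrict open() by its flags argument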

seccomp_filter.c:

#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

void configure_seccomp() {
  struct sock_filter filter [] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, nr))), // load the system call number into the accumulator
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1), // is it __NR_write? if so fall through to ALLOW, otherwise skip the next instruction
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_open, 0, 3), // is it __NR_open? if not, jump straight to KILL
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, args[1]))), // load the second argument (flags) into the accumulator
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, O_RDONLY, 0, 1), // is it O_RDONLY? if so allow, otherwise skip to KILL
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL)
  };
  struct sock_fprog prog = {
       .len = (unsigned short)(sizeof(filter) / sizeof (filter[0])),
       .filter = filter,
  };

  printf("Configuring seccomp\n");
  prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
  prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

int main(int argc, char* argv[]) {
  int infd, outfd;
  ssize_t read_bytes;
  char buffer[1024];

  if (argc < 3) {
    printf("Usage:\tdup_file <input path> <output path>\n");
    return -1;
  }
  printf("Ducplicating file '%s' to '%s'\n", argv[1], argv[2]);

  configure_seccomp(); // apply the seccomp filter

  printf("Opening '%s' for reading\n", argv[1]);
  if ((infd = open(argv[1], O_RDONLY)) > 0) {
    printf("Opening '%s' for writing\n", argv[2]);
    if ((outfd = open(argv[2], O_WRONLY | O_CREAT, 0644)) > 0) {
        while((read_bytes = read(infd, &buffer, 1024)) > 0)
          write(outfd, &buffer, (ssize_t)read_bytes);
    }
  }
  close(infd);
  close(outfd);
  return 0;
}

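In this case the seccomp-BPF program allows the first open() call, which uses O_RDONLY, but terminates the program when open() is called with O_WRONLY | O_CREAT. (On systems whose C library implements open() through the openat system call, as recent glibc versions do, the first call is killed as well, because openat is not on the allowed list.)

$ ./seccomp_filter /etc/passwd output
Ducplicating file '/etc/passwd' to 'output'
Configuring seccomp
Opening '/etc/passwd' for reading
Opening 'output' for writing
Bad system call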

libseccomp

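Project page: libseccomp: https://github.com/seccomp/libseccomp

The prctl()-based mechanism is not very flexible. The libseccomp library offers functions that achieve the same effect as prctl() and wraps the details, so filters can be built without knowing the BPF rule format. To use it from a C program, a few packages have to be installed first: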

null@ubuntu:~/seccomp$ sudo apt install libseccomp-dev libseccomp2 seccomp

Usage example:

simple_syscall_seccomp.c：

//gcc -g simple_syscall_seccomp.c -o simple_syscall_seccomp -lseccomp
#include <stdio.h>
#include <unistd.h>
#include <seccomp.h>

int main(void){
    scmp_filter_ctx ctx;
    ctx = seccomp_init(SCMP_ACT_ALLOW);
    seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(execve), 0);
    seccomp_load(ctx);

    char * filename = "/bin/sh";
    char * argv[] = {"/bin/sh",NULL};
    char * envp[] = {NULL};
    write(1,"i will give you a shell\n",24);
    syscall(59,filename,argv,envp);//execve
    return 0;
}

Compile and run it. The program is killed when it calls execve:

null@ubuntu:~/seccomp$ gcc -g simple_syscall_seccomp.c -o simple_syscall_seccomp -lseccomp
null@ubuntu:~/seccomp$ ./simple_syscall_seccomp
i will give you a shell
Bad system call (core dumped)

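A walkthrough of the code above:

scmp_filter_ctx: the filter context structure.

seccomp_init: initializes the filter state. Prototype:

scmp_filter_ctx seccomp_init(uint32_t def_action);

Possible values of def_action:

SCMP_ACT_ALLOW: all system calls are allowed by default, so the rules act as a blacklist.
SCMP_ACT_KILL: the thread is killed by default, so the rules act as a whitelist.
SCMP_ACT_KILL_PROCESS: the entire process is terminated by the kernel.
SCMP_ACT_TRAP: a thread that makes a system call matching no rule is sent a SIGSYS signal.
SCMP_ACT_TRACE(uint16_t msg_num): used when the process is being traced with ptrace; the tracer is notified and can read msg_num.
SCMP_ACT_ERRNO(uint16_t errno): a system call matching no rule fails with errno as its error.
SCMP_ACT_LOG: the system call is not affected, but it is logged.

seccomp_rule_add（https://man7.org/linux/man-pages/man3/seccomp_rule_add.3.html）: adds one rule. Prototype:

int seccomp_rule_add(scmp_filter_ctx ctx, uint32_t action,int syscall, unsigned int arg_cnt, ...);

arg_cnt says whether the rule constrains the arguments of the system call and, if so, how many argument conditions follow. To allow or forbid a system call regardless of its arguments, pass 0: seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(execve), 0) forbids execve whatever its arguments are. If arg_cnt is non-zero, it is the number of argument conditions that follow, and the rule only fires when the system call is made and its arguments satisfy all of them. For finer-grained filtering that takes arguments into account, set arg_cnt to a non-zero value and describe the conditions with the comparison macros.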

For example, to block write() calls whose byte count (the third argument) is greater than 0x10:

seccomp_write_limit.c：

#include <stdio.h>
#include <unistd.h>
#include <seccomp.h>

int main(void){
    scmp_filter_ctx ctx;
    ctx = seccomp_init(SCMP_ACT_ALLOW);
    seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(write),1,SCMP_A2(SCMP_CMP_GT,0x10)); // argument 2 (counting from 0) greater than 0x10
    seccomp_load(ctx);
    write(1,"1234567812345678",0x10);      // not blocked
    write(1,"i will give you a shell",24); // blocked
    return 0;
}

Compile and run:

null@ubuntu:~/seccomp$ gcc -g seccomp_write_limit.c -o seccomp_write_limit -lseccomp
null@ubuntu:~/seccomp$ ./seccomp_write_limit
1234567812345678Bad system call (core dumped)

SCMP_A2 specifies a comparison on argument 2 (zero-based, so the third argument), and SCMP_CMP_GT means greater than. The relevant definitions are shown below.

libseccomp/include/seccomp.h.in（https://github.com/seccomp/libseccomp/blob/3f0e47fe2717b73ccef68ca18f9f7297ee73ebb2/include/seccomp.h.in）：

......
/**
 * Comparison operators
 */
enum scmp_compare {
    _SCMP_CMP_MIN = 0,
    SCMP_CMP_NE = 1,        /**< not equal */
    SCMP_CMP_LT = 2,        /**< less than */
    SCMP_CMP_LE = 3,        /**< less than or equal */
    SCMP_CMP_EQ = 4,        /**< equal */
    SCMP_CMP_GE = 5,        /**< greater than or equal */
    SCMP_CMP_GT = 6,        /**< greater than */
    SCMP_CMP_MASKED_EQ = 7,        /**< masked equality */
    _SCMP_CMP_MAX,
};
...
struct scmp_arg_cmp {
    unsigned int arg;    /**< argument number, starting at 0 */
    enum scmp_compare op;    /**< the comparison op, e.g. SCMP_CMP_* */
    scmp_datum_t datum_a;
    scmp_datum_t datum_b;
};
....
/**
 * Specify a 32-bit argument comparison struct for use in declaring rules
 * @param arg the argument number, starting at 0
 * @param op the comparison operator, e.g. SCMP_CMP_*
 * @param datum_a dependent on comparison (32-bits)
 * @param datum_b dependent on comparison, optional (32-bits)
 */
#define SCMP_CMP32(x, y, ...) \
    _SCMP_MACRO_DISPATCHER(_SCMP_CMP32_, __VA_ARGS__)(x, y, __VA_ARGS__)

/**
 * Specify a 64-bit argument comparison struct for argument 0
 */
#define SCMP_A0_64(...)        SCMP_CMP64(0, __VA_ARGS__)
#define SCMP_A0            SCMP_A0_64

/**
 * Specify a 32-bit argument comparison struct for argument 0
 */
#define SCMP_A0_32(x, ...)    SCMP_CMP32(0, x, __VA_ARGS__)

/**
 * Specify a 64-bit argument comparison struct for argument 1
 */
#define SCMP_A1_64(...)        SCMP_CMP64(1, __VA_ARGS__)
#define SCMP_A1            SCMP_A1_64

/**
 * Specify a 32-bit argument comparison struct for argument 1
 */
#define SCMP_A1_32(x, ...)    SCMP_CMP32(1, x, __VA_ARGS__)

/**
 * Specify a 64-bit argument comparison struct for argument 2
 */
#define SCMP_A2_64(...)        SCMP_CMP64(2, __VA_ARGS__)
#define SCMP_A2            SCMP_A2_64

/**
 * Specify a 32-bit argument comparison struct for argument 2
 */
#define SCMP_A2_32(x, ...)    SCMP_CMP32(2, x, __VA_ARGS__)

/**
 * Specify a 64-bit argument comparison struct for argument 3
 */
#define SCMP_A3_64(...)        SCMP_CMP64(3, __VA_ARGS__)
#define SCMP_A3            SCMP_A3_64

/**
 * Specify a 32-bit argument comparison struct for argument 3
 */
#define SCMP_A3_32(x, ...)    SCMP_CMP32(3, x, __VA_ARGS__)

/**
 * Specify a 64-bit argument comparison struct for argument 4
 */
#define SCMP_A4_64(...)        SCMP_CMP64(4, __VA_ARGS__)
#define SCMP_A4            SCMP_A4_64

/**
 * Specify a 32-bit argument comparison struct for argument 4
 */
#define SCMP_A4_32(x, ...)    SCMP_CMP32(4, x, __VA_ARGS__)

/**
 * Specify a 64-bit argument comparison struct for argument 5
 */
#define SCMP_A5_64(...)        SCMP_CMP64(5, __VA_ARGS__)
#define SCMP_A5            SCMP_A5_64

/**
 * Specify a 32-bit argument comparison struct for argument 5
 */
#define SCMP_A5_32(x, ...)    SCMP_CMP32(5, x, __VA_ARGS__)
     ...
     ...

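Besides seccomp_rule_add, there are other functions for adding rules, such as seccomp_rule_add_array(), seccomp_rule_add_exact(), and seccomp_rule_add_exact_array(). See the linked manual page for details.

seccomp_load: loads the current seccomp filter into the kernel. Prototype:

int seccomp_load(scmp_filter_ctx ctx);

seccomp_reset: releases the existing filter context state and reinitializes it with the given default action. It may only be called after a successful call to seccomp_init().

int seccomp_reset(scmp_filter_ctx ctx, uint32_t def_action);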

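Other tools

seccomp-bpf.h

seccomp-bpf.h（https://github.com/ahupowerdns/secfilter/blob/master/seccomp-bpf.h）is a convenience header written to simplify seccomp-bpf development. It predefines macros for many common tasks, such as validating the architecture and allowing a system call, as shown below.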

...
#define VALIDATE_ARCHITECTURE \
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, arch_nr), \
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ARCH_NR, 1, 0), \
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL)

#define EXAMINE_SYSCALL \
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_nr)

#define ALLOW_SYSCALL(name) \
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_##name, 0, 1), \
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW)

#define KILL_PROCESS \
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL)
...

Usage example:

seccomp_policy.c（https://gist.github.com/mstemm/1bc06c52abb7b6b4feef79d7bfff5815#file-seccomp_policy-c）

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/prctl.h>
#include "seccomp-bpf.h"

void install_syscall_filter()
{
        struct sock_filter filter[] = {
                /* Validate architecture. */
                VALIDATE_ARCHITECTURE,
                /* Grab the system call number. */
                EXAMINE_SYSCALL,
                /* List allowed syscalls. We add open() to the set of
                   allowed syscalls by the strict policy, but not
                   close(). */
                ALLOW_SYSCALL(rt_sigreturn),
#ifdef __NR_sigreturn
                ALLOW_SYSCALL(sigreturn),
#endif
                ALLOW_SYSCALL(exit_group),
                ALLOW_SYSCALL(exit),
                ALLOW_SYSCALL(read),
                ALLOW_SYSCALL(write),
                ALLOW_SYSCALL(open),
                KILL_PROCESS,
        };
        struct sock_fprog prog = {
                .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
                .filter = filter,
        };

        assert(prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == 0);

        assert(prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) == 0);
}

int main(int argc, char **argv)
{
        int output = open("output.txt", O_WRONLY);
        const char *val = "test";

        printf("Calling prctl() to set seccomp with filter...\n");

        install_syscall_filter();

        printf("Writing to an already open file...\n");
        write(output, val, strlen(val)+1);

        printf("Trying to open file for reading...\n");
        int input = open("output.txt", O_RDONLY);

        printf("Note that open() worked. However, close() will not\n");
        close(input);

        printf("You will not see this message--the process will be killed first\n");
}

Result:

$ ./seccomp_policy
Calling prctl() to set seccomp with filter...
Writing to an already open file...
Trying to open file for reading...
Note that open() worked. However, close() will not
Bad system call

seccomp-tools

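An open-source tool for analyzing seccomp. Project page: https://github.com/david942j/seccomp-tools

Main features:

Dump: automatically dumps the seccomp BPF from an executable
Disasm: converts seccomp BPF into a human-readable format
Asm: makes writing seccomp rules feel like writing code
Emu: emulates seccomp rules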

Installation

sudo apt install gcc ruby-dev
gem install seccomp-tools

Usage

null@ubuntu:~/seccomp$ seccomp-tools dump ./simple_syscall_seccomp
 line  CODE  JT   JF      K
=================================
 0000: 0x20 0x00 0x00 0x00000004  A = arch
 0001: 0x15 0x00 0x05 0xc000003e  if (A != ARCH_X86_64) goto 0007
 0002: 0x20 0x00 0x00 0x00000000  A = sys_number
 0003: 0x35 0x00 0x01 0x40000000  if (A < 0x40000000) goto 0005
 0004: 0x15 0x00 0x02 0xffffffff  if (A != 0xffffffff) goto 0007
 0005: 0x15 0x01 0x00 0x0000003b  if (A == execve) goto 0007
 0006: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0007: 0x06 0x00 0x00 0x00000000  return KILL
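Lines 0003 and 0004 of the dump are worth a second look: they reject system call numbers with bit 30 set, which on x86-64 marks calls made through the x32 ABI, before the execve comparison is reached. The sketch below restates that logic in C. The function name rejected_as_x32 and the macro X32_SYSCALL_BIT are invented for illustration; the constant values are the ones printed in the dump.

```c
#include <assert.h>
#include <stdint.h>

/* Bit that marks x32-ABI system call numbers (0x40000000 in the dump). */
#define X32_SYSCALL_BIT 0x40000000U

static_assert(X32_SYSCALL_BIT == (1U << 30), "x32 marker is bit 30");

/* Mirrors lines 0003-0004 of the dump: returns 1 when the number is sent
 * to "return KILL" before the execve comparison on line 0005. */
int rejected_as_x32(uint32_t nr)
{
    if (nr < X32_SYSCALL_BIT)
        return 0;              /* native x86-64 number: continue at 0005 */
    return nr != 0xffffffffU;  /* only 0xffffffff continues at 0005 */
}
```

The hand-written filters earlier in this article compare the system call number directly and contain no such check, which is one reason generated filters are usually preferable to hand-rolled ones.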

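The output shows that the execve system call is disabled.

Using seccomp to harden Docker

seccomp is used by many applications to protect the system. Docker supports using seccomp to restrict the system calls available to containers, provided that CONFIG_SECCOMP is enabled in the kernel.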

null@ubuntu:~$ grep CONFIG_SECCOMP= /boot/config-$(uname -r)
CONFIG_SECCOMP=y

When a container is started with docker run, Docker applies its default seccomp profile to restrict the container. The default profile is written in JSON and disables about 44 of the 300+ system calls. The source is in the Moby project（https://github.com/moby/moby/blob/master/profiles/seccomp/default.json）.

null@ubuntu:~$ sudo docker run --rm -it ubuntu /bin/bash
root@85e01c28bd2c:/# bash
root@85e01c28bd2c:/# ps
   PID TTY          TIME CMD
     1 pts/0    00:00:00 bash
    10 pts/0    00:00:00 bash
    13 pts/0    00:00:00 ps
root@85e01c28bd2c:/# grep -i seccomp /proc/1/status
Seccomp:        2

Docker's default profile is designed for maximum compatibility. Beyond the default, Docker lets us supply a custom profile to restrict a container's system calls more flexibly.

Example: allowing specific system calls with a whitelist

example.json

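{
    "defaultAction": "SCMP_ACT_ERRNO",
    "architectures": [
        "SCMP_ARCH_X86_64",
        "SCMP_ARCH_X86",
        "SCMP_ARCH_X32"
    ],
    "syscalls": [
        {
            "names": [
                "arch_prctl",
                "sched_yield",
                "futex",
                "write",
                "mmap",
                "exit_group",
                "madvise",
                "rt_sigprocmask",
                "getpid",
                "gettid",
                "tgkill",
                "rt_sigaction",
                "read",
                "getpgrp"
            ],
            "action": "SCMP_ACT_ALLOW",
            "args": [],
            "comment": "",
            "includes": {},
            "excludes": {}
        }
    ]
}

defaultAction: the default seccomp action. The possible values were described above; the most common are SCMP_ACT_ALLOW and SCMP_ACT_ERRNO. SCMP_ACT_ERRNO is chosen here, which forbids all system calls by default and then grants the permitted ones as a whitelist.

architectures: the system architectures. System calls can differ between architectures.

syscalls: the system calls and the action applied to them. names lists the system call names, action is the action to take (here, allow the calls listed in names), and args holds conditions on the system call arguments and may be empty.

With this in place, the profile can be passed to docker run with the --security-opt option to customize the container's system calls.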

$ docker run --rm -it --security-opt seccomp=/path/to/seccomp/example.json hello-world

For example, to stop a container from creating directories, the mkdir system call can be disabled with a blacklist:

seccomp_mkdir.json:

{
    "defaultAction": "SCMP_ACT_ALLOW",
    "syscalls": [
        {
            "name": "mkdir",
            "action": "SCMP_ACT_ERRNO",
            "args": []
        }
    ]
}

When a container is started with this policy, an attempt to create a directory inside it is refused with a permission error.

null@ubuntu:~/seccomp/docker$ sudo docker run --rm -it --security-opt seccomp=seccomp_mkdir.json busybox /bin/sh
/ # ls
bin   dev   etc   home  proc  root  sys   tmp   usr   var
/ # mkdir test
mkdir: can't create directory 'test': Operation not permitted

A container can also be started without any seccomp policy by adding --security-opt seccomp=unconfined to the run options.

zaz

zaz seccomp is an open-source tool that automatically generates a JSON seccomp profile for a container. Project page: https://github.com/pjbgf/zaz.

Basic usage:

zaz seccomp docker IMAGE COMMAND

It tailors the system call list to a specific executable, allowing only the operations that executable needs and forbidding everything else.

For example, to generate a seccomp profile for the ping command in alpine:

$ sudo ./zaz seccomp docker alpine "ping -c5 8.8.8.8" > seccomp_ping.json
$ cat seccomp_ping.json | jq '.'
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_X32"
  ],
  "syscalls": [
    {
      "names": [
        "arch_prctl",
        "bind",
        "clock_gettime",
        "clone",
        "close",
        "connect",
        "dup2",
        "epoll_pwait",
        "execve",
        "exit",
        "exit_group",
        "fcntl",
        "futex",
        "getpid",
        "getsockname",
        "getuid",
        "ioctl",
        "mprotect",
        "nanosleep",
        "open",
        "poll",
        "read",
        "recvfrom",
        "rt_sigaction",
        "rt_sigprocmask",
        "rt_sigreturn",
        "sendto",
        "set_tid_address",
        "setitimer",
        "setsockopt",
        "socket",
        "write",
        "writev"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

As shown above, zaz detected 33 system calls and filters with a whitelist. Does the generated whitelist actually work? Is it enough to run the ping command while blocking anything that needs system calls outside the list? To test this, first build a simple image from the following Dockerfile.

Dockerfile

FROM alpine:latest
CMD ["ping","-c5","8.8.8.8"]

After the build succeeds, start the container with the default seccomp policy. It runs without any problem.

$ sudo docker build -t pingtest .
$ sudo docker run --rm -it pingtest
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=127 time=42.139 ms
64 bytes from 8.8.8.8: seq=1 ttl=127 time=42.646 ms
64 bytes from 8.8.8.8: seq=2 ttl=127 time=42.098 ms
64 bytes from 8.8.8.8: seq=3 ttl=127 time=42.484 ms
64 bytes from 8.8.8.8: seq=4 ttl=127 time=42.007 ms

--- 8.8.8.8 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 42.007/42.274/42.646 ms

Next, try the policy generated by zaz:

$ sudo docker run --rm -it --security-opt seccomp=seccomp_ping.json pingtest
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: close exec fds: open /proc/self/fd: operation not permitted: unknown.

The container did not start: the OCI runtime failed while creating it. The reported cause is "operation not permitted", the same error seen earlier when a required system call is blocked. The whitelist approach used by zaz may not be robust enough, and after many Docker releases the poorly maintained zaz probably no longer captures all the system calls needed during container startup. Oddly, when I ran the same command again, it triggered a panic instead: No error following JSON procError payload.

$ sudo docker run --rm -it --security-opt seccomp=seccomp_ping.json pingtest
docker: Error response from daemon: failed to create shim: OCI runtime create failed: runc did not terminate successfully: exit status 2: panic: No error following JSON procError payload.

goroutine 1 [running]:
github.com/opencontainers/runc/libcontainer.parseSync(0x56551adf30b8, 0xc000010b20, 0xc0002268a0, 0xc00027f9e0, 0x0)
        github.com/opencontainers/runc/libcontainer/sync.go:93 +0x307
github.com/opencontainers/runc/libcontainer.(*initProcess).start(0xc000297cb0, 0x0, 0x0)
        github.com/opencontainers/runc/libcontainer/process_linux.go:440 +0x5ef
github.com/opencontainers/runc/libcontainer.(*linuxContainer).start(0xc000078700, 0xc000209680, 0x0, 0x0)
        github.com/opencontainers/runc/libcontainer/container_linux.go:379 +0xf5
github.com/opencontainers/runc/libcontainer.(*linuxContainer).Start(0xc000078700, 0xc000209680, 0x0, 0x0)
        github.com/opencontainers/runc/libcontainer/container_linux.go:264 +0xb4
main.(*runner).run(0xc0002274c8, 0xc0000200f0, 0x0, 0x0, 0x0)
        github.com/opencontainers/runc/utils_linux.go:312 +0xd2a
main.startContainer(0xc00025c160, 0xc000076400, 0x1, 0x0, 0x0, 0xc0002275b8, 0x6)
        github.com/opencontainers/runc/utils_linux.go:455 +0x455
main.glob..func2(0xc00025c160, 0xc000246000, 0xc000246120)
        github.com/opencontainers/runc/create.go:65 +0xbb
github.com/urfave/cli.HandleAction(0x56551ad3b040, 0x56551ade81e8, 0xc00025c160, 0xc00025c160, 0x0)
        github.com/urfave/cli@v1.22.1/app.go:523 +0x107
github.com/urfave/cli.Command.Run(0x56551aa566f5, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, 0x56551aa5f509, 0x12, 0x0, ...)
        github.com/urfave/cli@v1.22.1/command.go:174 +0x579
github.com/urfave/cli.(*App).Run(0xc000254000, 0xc000132000, 0xf, 0xf, 0x0, 0x0)
        github.com/urfave/cli@v1.22.1/app.go:276 +0x7e8
main.main()
        github.com/opencontainers/runc/main.go:163 +0xd3f
: unknown.

This panic arguably should not happen. I searched online for the error and found very few similar reports, and the panic does not occur on every run. The expected outcome is "operation not permitted", which happens because our whitelist does not include all the required system calls. I have reported the behavior in a Moby issue（https://github.com/moby/moby/issues/43730）and may get some answers there.

A similar panic report:

https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=1714183

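Whichever error appears, the problem seems to lie with runc. To resolve it we need to understand how Docker loads seccomp at run time.

When a container is created, the Docker daemon dockerd asks containerd to create it. containerd does not operate on the container directly either; it starts a process called containerd-shim and lets that process manage the container. containerd-shim then invokes the container runtime runc through the OCI interface to start the container. Once the container is running, runc exits, and containerd-shim becomes the parent of the container process. It collects the container process's state, reports it to containerd, and, after the process with PID 1 in the container exits, takes over and cleans up the remaining child processes so that no zombies are left behind. The call order is therefore: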

Dockerd -> containerd -> containerd-shim -> runc

啟動一個容器ubuntu，并在容器中再運行一個bash。

null@ubuntu:~$ sudo docker run --rm -it ubuntu /bin/bashroot@ef57fff95b80:/# bashroot@ef57fff95b80:/# ps   PID TTY          TIME CMD     1 pts/0    00:00:00 bash     9 pts/0    00:00:00 bash    12 pts/0    00:00:00 ps

查看調用棧，containerd-shim（28051-28129）并沒有被施加seccomp,而容器內的兩個bash（1 -> 28075;9->28126）被施加了seccomp策略。

root@ubuntu:/home/null# pstree -p | grep containerd-shim           |-containerd-shim(28051)-+-bash(28075)---bash(28126)           |                        |-{containerd-shim}(28052)           |                        |-{containerd-shim}(28053)           |                        |-{containerd-shim}(28054)           |                        |-{containerd-shim}(28055)           |                        |-{containerd-shim}(28056)           |                        |-{containerd-shim}(28057)           |                        |-{containerd-shim}(28058)           |                        |-{containerd-shim}(28059)           |                        |-{containerd-shim}(28060)           |                        `-{containerd-shim}(28129)root@ubuntu:/home/null# grep -i seccomp /proc/28051/statusSeccomp:        0root@ubuntu:/home/null# grep -i seccomp /proc/28075/statusSeccomp:        2root@ubuntu:/home/null# grep -i seccomp /proc/28126/statusSeccomp:        2root@ubuntu:/home/null# grep -i seccomp /proc/28052/statusSeccomp:        0......root@ubuntu:/home/null# grep -i seccomp /proc/28129/statusSeccomp:        0

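In other words, seccomp is applied to the container after containerd-shim has started, and the problem appears when runc is invoked. Does our seccomp policy therefore also need to cover the system calls that runc requires? And does Zaz take into account the system calls runc needs while the container is starting?

To answer that, we need to capture the system calls runc needs when the container starts.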

Sysdig

There are many ways to find out which system calls the container runtime runc uses, such as ftrace, strace, and fanotify. Here we use sysdig to monitor the running container. sysdig is a system visibility tool with native container support; the project lives at https://github.com/draios/sysdig. For installation and usage, see the detailed guide on GitHub; only a brief introduction is given here.

After installation, run sysdig on the command line without any arguments. sysdig captures all events and writes them to standard output:

$ sysdig
285304 01:21:51.270700399 7 sshd (50485) > select
285306 01:21:51.270701716 7 sshd (50485) < select res=2
285307 01:21:51.270701982 7 sshd (50485) > rt_sigprocmask
285308 01:21:51.270702258 7 sshd (50485) < rt_sigprocmask
285309 01:21:51.270702473 7 sshd (50485) > rt_sigprocmask
285310 01:21:51.270702660 7 sshd (50485) < rt_sigprocmask
285312 01:21:51.270702983 7 sshd (50485) > read fd=13(/dev/ptmx) size=16384
285313 01:21:51.270703971 1 sysdig (59131) > switch next=59095 pgft_maj=0 pgft_min=1759 vm_size=280112 vm_rss=18048 vm_swap=0
...

By default, sysdig prints the information for each event on a single line, in the following format:

%evt.num %evt.time %evt.cpu %proc.name (%thread.tid) %evt.dir %evt.type %evt.args

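where:

evt.num is the incremental event number

evt.time is the event timestamp

evt.cpu is the number of the CPU on which the event was captured

proc.name is the name of the process that generated the event

thread.tid is the TID that generated the event, which corresponds to the PID for single-threaded processes

evt.dir is the event direction: > for enter events and < for exit events

evt.type is the name of the event, for example "open" or "read"

evt.args is the list of event arguments. For system calls these tend to correspond to the system call arguments, but that is not always the case: for simplicity or performance reasons, some system call arguments are excluded.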
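To make the field layout concrete, here is a minimal sketch that splits one line of the default sysdig output into the fields listed above. It assumes the default format string shown earlier; the `Event` class and `parse_event` function are hypothetical names introduced for illustration and are not part of sysdig.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Mirrors: %evt.num %evt.time %evt.cpu %proc.name (%thread.tid) %evt.dir %evt.type %evt.args
LINE_RE = re.compile(
    r"^(?P<num>\d+) (?P<time>\S+) (?P<cpu>\d+) (?P<proc>.+?) "
    r"\((?P<tid>\d+)\) (?P<dir>[<>]) (?P<type>\S+)(?: (?P<args>.*))?$"
)


@dataclass
class Event:
    num: int        # evt.num
    time: str       # evt.time
    cpu: int        # evt.cpu
    proc: str       # proc.name
    tid: int        # thread.tid
    direction: str  # evt.dir, ">" for enter and "<" for exit
    evt_type: str   # evt.type
    args: str       # evt.args, empty string when the event has none


def parse_event(line: str) -> Optional[Event]:
    """Parse one default-format sysdig line; return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return Event(
        num=int(m.group("num")),
        time=m.group("time"),
        cpu=int(m.group("cpu")),
        proc=m.group("proc"),
        tid=int(m.group("tid")),
        direction=m.group("dir"),
        evt_type=m.group("type"),
        args=m.group("args") or "",
    )
```

For the `read` line from the sshd capture, this yields `proc == "sshd"`, `tid == 50485`, `direction == ">"` and `evt_type == "read"`.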

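Open terminal A and enter the following command to start monitoring. container.name restricts the capture to the container named ping, proc.name restricts it to events from the process named runc, and the capture is saved as runc.scap.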

$ sysdig -w runc.scap "container.name=ping and proc.name=runc"

Then start the container in another terminal, B:

$ sudo docker run --rm -it --name=ping pingtest
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=127 time=44.032 ms
64 bytes from 8.8.8.8: seq=1 ttl=127 time=42.069 ms
64 bytes from 8.8.8.8: seq=2 ttl=127 time=42.066 ms
64 bytes from 8.8.8.8: seq=3 ttl=127 time=42.073 ms
64 bytes from 8.8.8.8: seq=4 ttl=127 time=42.112 ms
--- 8.8.8.8 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 42.066/42.470/44.032 ms

When it finishes, press Ctrl+C in terminal A to stop the capture, then filter the captured content to keep only the system calls and save the result to runc_syscall.txt. This gives us the system calls runc used while starting the container.

$ sysdig -p "%syscall.type" -r runc.scap > runc_syscall.txt
$ cat -n runc_syscall.txt
  ...
  3437  rt_sigaction
  3438  exit_group
  3439  procexit

The filtered output still contains a large number of entries, many of them repeated system calls. A simple script can deduplicate them; after deduplication, 72 system calls remain.

$ python analyse.py runc_syscall.txt
Filter syscall num: 72
filter syscall:['clone', 'close', 'prctl', 'getpid', 'write', 'unshare', 'read', 'exit_group', 'procexit', 'setsid', 'setuid', 'setgid', 'sched_getaffinity', 'openat', 'mmap', 'rt_sigprocmask', 'sigaltstack', 'gettid', 'rt_sigaction', 'mprotect', 'futex', 'set_robust_list', 'munmap', 'nanosleep', 'readlinkat', 'fcntl', 'epoll_create1', 'pipe', 'epoll_ctl', 'fstat', 'pread', 'getdents64', 'capget', 'epoll_pwait', 'newfstatat', 'statfs', 'getppid', 'keyctl', 'socket', 'bind', 'sendto', 'getsockname', 'recvfrom', 'mount', 'fchmodat', 'mkdirat', 'symlinkat', 'umask', 'mknodat', 'fchownat', 'unlinkat', 'chdir', 'fchdir', 'pivot_root', 'umount', 'dup', 'sethostname', 'fstatfs', 'seccomp', 'brk', 'fchown', 'setgroups', 'capset', 'execve', 'signaldeliver', 'access', 'arch_prctl', 'getuid', 'getgid', 'geteuid', 'getcwd', 'getegid']
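The article does not show analyse.py itself. The following is only a sketch of what such a deduplication script might look like, assuming the input file holds one event name per line and that names should be kept in order of first appearance; the function name `unique_syscalls` is hypothetical.

```python
import sys
from typing import Iterable, List


def unique_syscalls(lines: Iterable[str]) -> List[str]:
    """Return the distinct names in order of first appearance, skipping blank lines."""
    seen = set()
    ordered: List[str] = []
    for line in lines:
        name = line.strip()
        if not name or name in seen:
            continue
        seen.add(name)
        ordered.append(name)
    return ordered


def main(path: str) -> None:
    with open(path) as f:
        names = unique_syscalls(f)
    # Output shape modelled on the transcript; the real script may differ.
    print(f"Filter syscall num: {len(names)}")
    print(f"filter syscall:{names}")


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

A set alone would also remove duplicates, but keeping a separate ordered list preserves the sequence in which runc first issued each call, which can be useful when reading the result.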

Merging the system calls generated by zaz with the ones we captured brings the total to 85, as follows:

{
    "defaultAction": "SCMP_ACT_ERRNO",
    "architectures": [
        "SCMP_ARCH_X86_64",
        "SCMP_ARCH_X86",
        "SCMP_ARCH_X32"
    ],
    "syscalls": [
        {
            "names": [
                "clone", "close", "prctl", "getpid",
                "write", "unshare", "read", "exit_group",
                "procexit", "setsid", "setuid", "setgid",
                "sched_getaffinity", "openat", "mmap", "rt_sigprocmask",
                "sigaltstack", "gettid", "rt_sigaction", "mprotect",
                "futex", "set_robust_list", "munmap", "nanosleep",
                "readlinkat", "fcntl", "epoll_create1", "pipe",
                "epoll_ctl", "fstat", "pread", "getdents64",
                "capget", "epoll_pwait", "newfstatat", "statfs",
                "getppid", "keyctl", "socket", "bind",
                "sendto", "getsockname", "recvfrom", "mount",
                "fchmodat", "mkdirat", "symlinkat", "umask",
                "mknodat", "fchownat", "unlinkat", "chdir",
                "fchdir", "pivot_root", "umount", "dup",
                "sethostname", "fstatfs", "seccomp", "brk",
                "fchown", "setgroups", "capset", "signaldeliver",
                "access", "getuid", "getgid", "geteuid",
                "getcwd", "getegid", "arch_prctl", "clock_gettime",
                "connect", "dup2", "execve", "exit",
                "ioctl", "open", "poll", "rt_sigreturn",
                "set_tid_address", "setitimer", "setsockopt", "socket",
                "writev"
            ],
            "action": "SCMP_ACT_ALLOW"
        }
    ]
}
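The merge step can be scripted rather than done by hand. The sketch below assumes two lists of names (the captured runc calls and the zaz output) and emits a profile with the same defaultAction, architectures, and action values as the file above. `build_profile` is a hypothetical helper, and unlike the listing above it emits a name that appears in both inputs (such as "socket") only once.

```python
import json
from typing import Dict, Iterable, List


def build_profile(*syscall_lists: Iterable[str]) -> Dict:
    """Merge several syscall name lists into one allow-list seccomp profile."""
    names: List[str] = []
    seen = set()
    for syscalls in syscall_lists:
        for name in syscalls:
            if name not in seen:  # keep first occurrence, drop repeats
                seen.add(name)
                names.append(name)
    return {
        # Anything not explicitly allowed fails with an errno.
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
        "syscalls": [{"names": names, "action": "SCMP_ACT_ALLOW"}],
    }


if __name__ == "__main__":
    # Tiny illustrative inputs, not the full lists from the article.
    captured = ["socket", "execve", "read"]
    from_zaz = ["connect", "socket", "execve", "writev"]
    print(json.dumps(build_profile(captured, from_zaz), indent=4))
```

The printed JSON could then be saved to a file and passed to `--security-opt seccomp=...` in the same way as seccomp_ping.json.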

Running the container again with this file succeeds!

null@ubuntu:~/seccomp/docker/zaz/cmd$ sudo docker run -it --rm --security-opt seccomp=seccomp_ping.json pingtest
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=127 time=43.424 ms
64 bytes from 8.8.8.8: seq=1 ttl=127 time=42.873 ms
64 bytes from 8.8.8.8: seq=2 ttl=127 time=42.336 ms
64 bytes from 8.8.8.8: seq=3 ttl=127 time=48.164 ms
64 bytes from 8.8.8.8: seq=4 ttl=127 time=42.260 ms
--- 8.8.8.8 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 42.260/43.811/48.164 ms

Other commands can be tried as well. Some of them fail with an Operation not permitted error because the system calls they require are missing from the profile.

$ sudo docker run -it --rm --security-opt seccomp=seccomp_ping.json pingtest ls
ls: .: Operation not permitted
$ sudo docker run -it --rm --security-opt seccomp=seccomp_ping.json pingtest mkdir test
mkdir: can't create directory 'test': Operation not permitted