The Road to BPF, Part 2: (e)BPF Assembly

Classic BPF assembly

https://www.kernel.org/doc/html/latest/networking/filter.html#networking-filter

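The original BPF is also known as classic BPF (cBPF). The relationship between cBPF and eBPF is similar to that between i386 and amd64. The original BPF could only be used for socket filtering. tools/bpf/bpf_asm in the kernel source tree can be used to write such classic BPF programs.

The basic elements of the cBPF architecture are as follows: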

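Element  Description
A        32-bit wide accumulator
X        32-bit wide X register
M[]      16 x 32-bit wide miscellaneous registers, also known as the scratch memory store, addressable range: 0~15

M[] behaves like a small memory declared as int32_t M[16];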

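A cBPF instruction is 64 bits (8 bytes) wide and is defined in the header <linux/filter.h> as shown below. A program is assembled as an array of such 4-tuples, each containing code, jt, jf and k. jt and jf are the jump offsets, and k is a generic value whose meaning depends on code.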

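struct sock_filter {    /* Filter block */
        __u16   code;   /* 16-bit wide opcode */
        __u8    jt;     /* 8-bit wide jump offset taken if the condition is true */
        __u8    jf;     /* 8-bit wide jump offset taken if the condition is false */
        __u32   k;      /* generic multi-use field */
};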
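To make the role of jt and jf concrete, here is a minimal user-space sketch of how such 4-tuples could be interpreted. It is not the kernel's implementation: struct cbpf_insn, cbpf_run, the ipv4_only filter and the two sample frames are hypothetical names introduced for illustration, and only four opcodes (ldh, ldb, jeq, ret) are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copy of the 4-tuple layout, so the sketch builds without kernel headers. */
struct cbpf_insn {
    uint16_t code;
    uint8_t  jt;
    uint8_t  jf;
    uint32_t k;
};

/* Hypothetical filter: accept IPv4 frames (EtherType 0x0800), drop the rest. */
const struct cbpf_insn ipv4_only[4] = {
    { 0x28, 0, 0, 12 },      /* ldh [12]     A = EtherType            */
    { 0x15, 0, 1, 0x0800 },  /* jeq #0x800   true: skip 0, false: 1   */
    { 0x06, 0, 0, 0xffff },  /* ret #0xffff  accept                   */
    { 0x06, 0, 0, 0 },       /* ret #0       drop                     */
};

/* Sample Ethernet headers (addresses zeroed, only the EtherType is set). */
const uint8_t ipv4_frame[14] = { [12] = 0x08, [13] = 0x00 };
const uint8_t ipv6_frame[14] = { [12] = 0x86, [13] = 0xdd };

/* Hypothetical helper: interprets only ldh/ldb [k], jeq #k and ret #k. */
uint32_t cbpf_run(const struct cbpf_insn *prog, size_t len,
                  const uint8_t *pkt, size_t pkt_len)
{
    uint32_t A = 0;   /* accumulator */
    size_t pc = 0;

    assert(sizeof(struct cbpf_insn) == 8);   /* one instruction is 64 bits */

    while (pc < len) {
        const struct cbpf_insn *insn = &prog[pc];

        switch (insn->code) {
        case 0x28:    /* ldh [k]: 16-bit load in network byte order */
            if ((uint64_t)insn->k + 2 > pkt_len)
                return 0;
            A = ((uint32_t)pkt[insn->k] << 8) | pkt[insn->k + 1];
            pc++;
            break;
        case 0x30:    /* ldb [k]: 8-bit load */
            if (insn->k >= pkt_len)
                return 0;
            A = pkt[insn->k];
            pc++;
            break;
        case 0x15:    /* jeq #k: offsets are relative to the next instruction */
            pc += 1 + (A == insn->k ? insn->jt : insn->jf);
            break;
        case 0x06:    /* ret #k: number of bytes to accept, 0 means drop */
            return insn->k;
        default:      /* anything else is outside this sketch: drop */
            return 0;
        }
    }
    return 0;
}
```

Calling cbpf_run(ipv4_only, 4, ipv4_frame, sizeof(ipv4_frame)) returns 0xffff (accept), while the same call with ipv6_frame returns 0 (drop).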

For socket filtering, a pointer to the struct sock_filter array is passed to the kernel through setsockopt(2). Example:

#include <sys/socket.h>
#include <sys/types.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
/* ... */

/* Generated by: tcpdump -i em1 port 22 -dd */
struct sock_filter code[] = {
        { 0x28,  0,  0, 0x0000000c },
        { 0x15,  0,  8, 0x000086dd },
        { 0x30,  0,  0, 0x00000014 },
        { 0x15,  2,  0, 0x00000084 },
        { 0x15,  1,  0, 0x00000006 },
        { 0x15,  0, 17, 0x00000011 },
        { 0x28,  0,  0, 0x00000036 },
        { 0x15, 14,  0, 0x00000016 },
        { 0x28,  0,  0, 0x00000038 },
        { 0x15, 12, 13, 0x00000016 },
        { 0x15,  0, 12, 0x00000800 },
        { 0x30,  0,  0, 0x00000017 },
        { 0x15,  2,  0, 0x00000084 },
        { 0x15,  1,  0, 0x00000006 },
        { 0x15,  0,  8, 0x00000011 },
        { 0x28,  0,  0, 0x00000014 },
        { 0x45,  6,  0, 0x00001fff },
        { 0xb1,  0,  0, 0x0000000e },
        { 0x48,  0,  0, 0x0000000e },
        { 0x15,  2,  0, 0x00000016 },
        { 0x48,  0,  0, 0x00000010 },
        { 0x15,  0,  1, 0x00000016 },
        { 0x06,  0,  0, 0x0000ffff },
        { 0x06,  0,  0, 0x00000000 },
};

struct sock_fprog bpf = {
        .len = ARRAY_SIZE(code),
        .filter = code,
};

sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));    /* create the socket */
if (sock < 0)
        /* ... bail out ... */

ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf)); /* attach the BPF program to the socket */
if (ret < 0)
        /* ... bail out ... */

/* ... */
close(sock);

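Because of its limited performance, cBPF later evolved into eBPF, which has new instructions and a new architecture. Classic BPF instructions are automatically translated into the new eBPF instructions.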

The eBPF virtual machine

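The eBPF virtual machine is a RISC-style, register-based virtual machine with eleven 64-bit registers, a program counter (PC), and a fixed-size 512-byte stack. Ten registers (r0~r9) are general purpose and can be read and written, one (r10) is a read-only stack pointer (SP), and the program counter is implicit: we can only jump by a fixed offset relative to the PC. The registers are always 64 bits wide (even on a 32-bit physical machine) and support 32-bit subregister access (a 32-bit write automatically sets the upper 32 bits of the register to 0).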

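r0: holds the return value of function calls and of the program itself when it exits
r1~r5: function call arguments; when the program starts, r1 holds a pointer to the context argument
r6~r9: preserved across kernel function calls
r10: read-only stack pointer to the 512-byte stack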
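The register file described above can be modelled in a few lines of C. This is only an illustrative user-space model under stated assumptions, not kernel code: struct vm_state, vm_init, vm_write32 and vm_demo_subreg are hypothetical names, and r10 is assumed to point at the top of the stack because the stack is addressed with negative offsets (for example r10-4).

```c
#include <assert.h>
#include <stdint.h>

#define VM_NUM_REGS   11    /* r0..r10 */
#define VM_STACK_SIZE 512   /* fixed-size stack in bytes */

/* Hypothetical model of the machine state, for illustration only. */
struct vm_state {
    uint64_t reg[VM_NUM_REGS];
    uint8_t  stack[VM_STACK_SIZE];
};

/* Reset the registers: r1 receives the context, r10 the top of the stack. */
void vm_init(struct vm_state *vm, uint64_t ctx)
{
    for (int i = 0; i < VM_NUM_REGS; i++)
        vm->reg[i] = 0;
    vm->reg[1]  = ctx;
    vm->reg[10] = (uint64_t)(uintptr_t)(vm->stack + VM_STACK_SIZE);
}

/* A 32-bit subregister write: the upper 32 bits of the register become 0. */
void vm_write32(struct vm_state *vm, int dst, uint32_t value)
{
    assert(dst >= 0 && dst <= 9);   /* r10 is read-only */
    vm->reg[dst] = value;           /* implicit zero extension to 64 bits */
}

/* Fill r2 with all ones, then overwrite its 32-bit subregister. */
uint64_t vm_demo_subreg(void)
{
    struct vm_state vm;

    vm_init(&vm, 0);
    vm.reg[2] = 0xffffffffffffffffULL;
    vm_write32(&vm, 2, 0x1234);
    return vm.reg[2];
}
```

vm_demo_subreg() returns 0x1234: the 32-bit write does not leave the old upper half in place.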

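The program type (prog_type) supplied when a BPF program is loaded determines which subset of kernel functions the program may call, and also which context argument is passed in r1 when the program starts. The meaning of the return value stored in r0 is determined by the program type as well.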

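For eBPF-to-eBPF and eBPF-to-kernel calls, each function call takes at most 5 arguments, passed in registers r1~r5. When passing arguments, r1~r5 may only hold constants or pointers to the stack, not pointers to arbitrary memory. All memory accesses must first load the data onto the eBPF stack before it can be used. These restrictions simplify the memory model and help the eBPF verifier check correctness.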

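BPF programs can call the kernel helper functions provided by the core kernel (but not by module extensions), similar to system calls. These helpers are defined in the kernel with the BPF_CALL_* macros. bpf.h declares all the kernel helper functions that BPF can access.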

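Take bpf_trace_printk as an example. This function is defined in the kernel with BPF_CALL_5 and has 5 pairs of type and argument name. Defining the argument types matters for eBPF, because every time an eBPF program is loaded the eBPF verifier must make sure that the register data types match the parameter types of the called function.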

BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, fmt_size, u64, arg1, u64, arg2, u64, arg3)
{
    ...
}

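This design keeps the virtual machine instructions as close as possible to native instruction sets (x86, arm), so that the instructions emitted by the JIT compiler can be simpler and more efficient. All registers map one-to-one onto hardware registers. For example, the x86_64 JIT compiler can map them as follows: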

R0 - rax
R1 - rdi
R2 - rsi
R3 - rdx
R4 - rcx
R5 - r8
R6 - rbx
R7 - r13
R8 - r14
R9 - r15
R10 - rbp

eBPF instruction encoding

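Every eBPF instruction has a fixed size of 8 bytes. There are roughly 100 instructions, divided into 8 classes. The virtual machine supports 1- to 8-byte reads and writes to generic memory (maps, stack, contexts such as packet data, ...), forward and backward jumps, both conditional and unconditional, arithmetic and logic operations (ALU instructions), and function calls.

An eBPF program is a sequence of 64-bit instructions. All eBPF instructions share the same basic format: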

8-bit opcode
4-bit destination register
4-bit source register
16-bit offset
32-bit immediate

msb (most significant bit)          lsb (least significant bit)
+------------------------+----------------+----+----+--------+
|immediate               |offset          |src |dst |opcode  |
+------------------------+----------------+----+----+--------+
|       32               |    16          | 4  | 4  |    8   |

Most instructions do not use every field. Unused fields should be set to 0.
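The field layout can be checked with a small pack/unpack sketch that treats an instruction as one 64-bit word, with the opcode in the lowest 8 bits as in the diagram. struct insn_fields, insn_pack and insn_unpack are hypothetical names used only here; the word matches the in-memory bytes of an instruction on a little-endian host, which is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical unpacked view of one instruction. */
struct insn_fields {
    uint8_t opcode;   /* bits 0..7   */
    uint8_t dst;      /* bits 8..11  */
    uint8_t src;      /* bits 12..15 */
    int16_t off;      /* bits 16..31 */
    int32_t imm;      /* bits 32..63 */
};

/* Pack the five fields into one 64-bit instruction word. */
uint64_t insn_pack(struct insn_fields f)
{
    assert(f.dst < 16 && f.src < 16);   /* register fields are 4 bits wide */
    return  (uint64_t)f.opcode
         | ((uint64_t)f.dst << 8)
         | ((uint64_t)f.src << 12)
         | ((uint64_t)(uint16_t)f.off << 16)
         | ((uint64_t)(uint32_t)f.imm << 32);
}

/* Split a 64-bit instruction word back into its fields. */
struct insn_fields insn_unpack(uint64_t word)
{
    struct insn_fields f;

    f.opcode = (uint8_t)(word & 0xff);
    f.dst    = (uint8_t)((word >> 8) & 0x0f);
    f.src    = (uint8_t)((word >> 12) & 0x0f);
    f.off    = (int16_t)(uint16_t)(word >> 16);
    f.imm    = (int32_t)(uint32_t)(word >> 32);
    return f;
}
```

For example, packing opcode 0xb7 with dst 0 and imm 0x123 gives the word 0x00000123000000b7, with the unused src and off fields left at 0.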

The lowest 3 bits of the opcode give the instruction class, which groups related opcodes together.

LD/LDX/ST/STX opcodes have the following structure:

msb      lsb
+---+--+---+
|mde|sz|cls|
+---+--+---+
| 3 |2 | 3 |

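The sz field gives the size of the target memory area, and the mde field is the memory access mode. uBPF only supports the generic MEM access mode.

ALU/ALU64/JMP opcodes have the following structure: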

msb      lsb
+----+-+---+
|op  |s|cls|
+----+-+---+
| 4  |1| 3 |

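If s is 0, the source operand is imm; if s is 1, the source operand is src. The op field selects which ALU or branch operation to perform.

bpf.h describes an eBPF instruction with struct bpf_insn, whose definition matches the layout above. A piece of eBPF code can therefore also be described as an array of struct bpf_insn.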
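Both opcode layouts can be decoded with a few shifts and masks. The helper names below (op_class, ldst_size, ldst_mode, alu_source, alu_op) are hypothetical; the sketch assumes the class numbering used by the Linux headers, where classes 0 to 3 are LD, LDX, ST and STX and the remaining classes use the op/s/cls layout.

```c
#include <assert.h>
#include <stdint.h>

/* cls: the lowest 3 bits of every opcode. */
unsigned op_class(uint8_t opcode)
{
    return opcode & 0x07;
}

/* sz: 2-bit size field of an LD/LDX/ST/STX opcode. */
unsigned ldst_size(uint8_t opcode)
{
    assert(op_class(opcode) <= 0x03);   /* load/store classes only */
    return (opcode >> 3) & 0x03;
}

/* mde: 3-bit access mode field of an LD/LDX/ST/STX opcode. */
unsigned ldst_mode(uint8_t opcode)
{
    assert(op_class(opcode) <= 0x03);
    return (opcode >> 5) & 0x07;
}

/* s: 0 means the source operand is imm, 1 means it is the src register. */
unsigned alu_source(uint8_t opcode)
{
    assert(op_class(opcode) >= 0x04);   /* ALU and JMP classes only */
    return (opcode >> 3) & 0x01;
}

/* op: 4-bit operation field of an ALU/ALU64/JMP opcode. */
unsigned alu_op(uint8_t opcode)
{
    assert(op_class(opcode) >= 0x04);
    return (opcode >> 4) & 0x0f;
}
```

For 0x0f (add dst, src) this yields class 7, s = 1 and op = 0; for 0x61 (ldxw) it yields class 1, sz = 0 and mde = 3.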

struct bpf_insn {
    __u8    code;         /* opcode */
    __u8    dst_reg:4;    /* dest register */
    __u8    src_reg:4;    /* source register */
    __s16   off;          /* signed offset */
    __s32   imm;          /* signed immediate constant */
};

ALU instructions: 64-bit

The operands are 64 bits wide.

Opcode  Mnemonic        Pseudocode
0x07    add dst, imm    dst += imm
0x0f    add dst, src    dst += src
0x17    sub dst, imm    dst -= imm
0x1f    sub dst, src    dst -= src
0x27    mul dst, imm    dst *= imm
0x2f    mul dst, src    dst *= src
0x37    div dst, imm    dst /= imm
0x3f    div dst, src    dst /= src
0x47    or dst, imm     dst |= imm
0x4f    or dst, src     dst |= src
0x57    and dst, imm    dst &= imm
0x5f    and dst, src    dst &= src
0x67    lsh dst, imm    dst <<= imm
0x6f    lsh dst, src    dst <<= src
0x77    rsh dst, imm    dst >>= imm (logical)
0x7f    rsh dst, src    dst >>= src (logical)
0x87    neg dst         dst = -dst
0x97    mod dst, imm    dst %= imm
0x9f    mod dst, src    dst %= src
0xa7    xor dst, imm    dst ^= imm
0xaf    xor dst, src    dst ^= src
0xb7    mov dst, imm    dst = imm
0xbf    mov dst, src    dst = src
0xc7    arsh dst, imm   dst >>= imm (arithmetic)
0xcf    arsh dst, src   dst >>= src (arithmetic)

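ALU instructions: 32-bit

These opcodes use only the lower 32 bits of their operands and zero the upper 32 bits of the destination register (the operands are 32 bits wide).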

Opcode  Mnemonic          Pseudocode
0x04    add32 dst, imm    dst += imm
0x0c    add32 dst, src    dst += src
0x14    sub32 dst, imm    dst -= imm
0x1c    sub32 dst, src    dst -= src
0x24    mul32 dst, imm    dst *= imm
0x2c    mul32 dst, src    dst *= src
0x34    div32 dst, imm    dst /= imm
0x3c    div32 dst, src    dst /= src
0x44    or32 dst, imm     dst |= imm
0x4c    or32 dst, src     dst |= src
0x54    and32 dst, imm    dst &= imm
0x5c    and32 dst, src    dst &= src
0x64    lsh32 dst, imm    dst <<= imm
0x6c    lsh32 dst, src    dst <<= src
0x74    rsh32 dst, imm    dst >>= imm (logical)
0x7c    rsh32 dst, src    dst >>= src (logical)
0x84    neg32 dst         dst = -dst
0x94    mod32 dst, imm    dst %= imm
0x9c    mod32 dst, src    dst %= src
0xa4    xor32 dst, imm    dst ^= imm
0xac    xor32 dst, src    dst ^= src
0xb4    mov32 dst, imm    dst = imm
0xbc    mov32 dst, src    dst = src
0xc4    arsh32 dst, imm   dst >>= imm (arithmetic)
0xcc    arsh32 dst, src   dst >>= src (arithmetic)
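The difference between the 64-bit and 32-bit forms, and between logical and arithmetic right shifts, can be sketched in plain C. The function names are hypothetical. Masking the shift count to 0..63 is done here to avoid undefined behaviour in C, and arsh64 assumes that the compiler shifts negative signed values arithmetically, which C leaves implementation-defined (gcc and clang do so).

```c
#include <assert.h>
#include <stdint.h>

/* add dst, src (0x0f): full 64-bit addition. */
uint64_t add64(uint64_t dst, uint64_t src)
{
    return dst + src;
}

/* add32 dst, src (0x0c): 32-bit addition, upper 32 bits of the result are 0. */
uint64_t add32(uint64_t dst, uint64_t src)
{
    return (uint32_t)((uint32_t)dst + (uint32_t)src);
}

/* rsh dst, src (0x7f): logical shift, zeros are shifted in from the left. */
uint64_t rsh64(uint64_t dst, uint64_t src)
{
    return dst >> (src & 63);
}

/* arsh dst, src (0xcf): arithmetic shift, the sign bit is replicated. */
uint64_t arsh64(uint64_t dst, uint64_t src)
{
    return (uint64_t)((int64_t)dst >> (src & 63));
}

/* Small self-check showing the differences side by side. */
void alu_demo(void)
{
    assert(add64(0xffffffffULL, 1) == 0x100000000ULL);   /* carries into bit 32 */
    assert(add32(0xffffffffULL, 1) == 0);                /* wraps at 32 bits    */
    assert(rsh64(0x8000000000000000ULL, 63) == 1);
    assert(arsh64(0x8000000000000000ULL, 63) == 0xffffffffffffffffULL);
}
```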

Byte swap instructions

Opcode             Mnemonic    Pseudocode
0xd4 (imm == 16)   le16 dst    dst = htole16(dst)
0xd4 (imm == 32)   le32 dst    dst = htole32(dst)
0xd4 (imm == 64)   le64 dst    dst = htole64(dst)
0xdc (imm == 16)   be16 dst    dst = htobe16(dst)
0xdc (imm == 32)   be32 dst    dst = htobe32(dst)
0xdc (imm == 64)   be64 dst    dst = htobe64(dst)

Memory instructions

Opcode  Mnemonic                Pseudocode
0x18    lddw dst, imm           dst = imm
0x20    ldabsw src, dst, imm    See kernel documentation
0x28    ldabsh src, dst, imm    …
0x30    ldabsb src, dst, imm    …
0x38    ldabsdw src, dst, imm   …
0x40    ldindw src, dst, imm    …
0x48    ldindh src, dst, imm    …
0x50    ldindb src, dst, imm    …
0x58    ldinddw src, dst, imm   …
0x61    ldxw dst, [src+off]     dst = *(uint32_t *) (src + off)
0x69    ldxh dst, [src+off]     dst = *(uint16_t *) (src + off)
0x71    ldxb dst, [src+off]     dst = *(uint8_t *) (src + off)
0x79    ldxdw dst, [src+off]    dst = *(uint64_t *) (src + off)
0x62    stw [dst+off], imm      *(uint32_t *) (dst + off) = imm
0x6a    sth [dst+off], imm      *(uint16_t *) (dst + off) = imm
0x72    stb [dst+off], imm      *(uint8_t *) (dst + off) = imm
0x7a    stdw [dst+off], imm     *(uint64_t *) (dst + off) = imm
0x63    stxw [dst+off], src     *(uint32_t *) (dst + off) = src
0x6b    stxh [dst+off], src     *(uint16_t *) (dst + off) = src
0x73    stxb [dst+off], src     *(uint8_t *) (dst + off) = src
0x7b    stxdw [dst+off], src    *(uint64_t *) (dst + off) = src
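The ldx/stx pseudocode can be tried out in user space with a byte array standing in for the stack. mem_store, mem_load and mem_demo are hypothetical helpers written for this illustration; memcpy is used instead of pointer casts so the sketch does not depend on alignment.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* stx{b,h,w,dw} [base+off], src: store the low `size` bytes of src. */
void mem_store(uint8_t *base, int16_t off, uint64_t src, unsigned size)
{
    uint8_t  b = (uint8_t)src;
    uint16_t h = (uint16_t)src;
    uint32_t w = (uint32_t)src;

    switch (size) {
    case 1: memcpy(base + off, &b, 1);   break;
    case 2: memcpy(base + off, &h, 2);   break;
    case 4: memcpy(base + off, &w, 4);   break;
    case 8: memcpy(base + off, &src, 8); break;
    default: assert(!"size must be 1, 2, 4 or 8");
    }
}

/* ldx{b,h,w,dw} dst, [base+off]: load `size` bytes, zero-extended to 64 bits. */
uint64_t mem_load(const uint8_t *base, int16_t off, unsigned size)
{
    uint8_t  b;
    uint16_t h;
    uint32_t w;
    uint64_t dw;

    switch (size) {
    case 1: memcpy(&b, base + off, 1);  return b;
    case 2: memcpy(&h, base + off, 2);  return h;
    case 4: memcpy(&w, base + off, 4);  return w;
    case 8: memcpy(&dw, base + off, 8); return dw;
    default: assert(!"size must be 1, 2, 4 or 8"); return 0;
    }
}

/* Mimics "*(u32 *)(r10-4) = r1" followed by a 32-bit load of the same slot. */
uint64_t mem_demo(void)
{
    uint8_t stack[512] = {0};
    uint8_t *fp = stack + sizeof(stack);   /* plays the role of r10 */

    mem_store(fp, -4, 0x1122334455667788ULL, 4);
    return mem_load(fp, -4, 4);
}
```

mem_demo() returns 0x55667788: the 32-bit store keeps only the low half of the source register, and the load zero-extends it.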

Branch instructions

Opcode  Mnemonic              Pseudocode
0x05    ja +off               PC += off
0x15    jeq dst, imm, +off    PC += off if dst == imm
0x1d    jeq dst, src, +off    PC += off if dst == src
0x25    jgt dst, imm, +off    PC += off if dst > imm
0x2d    jgt dst, src, +off    PC += off if dst > src
0x35    jge dst, imm, +off    PC += off if dst >= imm
0x3d    jge dst, src, +off    PC += off if dst >= src
0xa5    jlt dst, imm, +off    PC += off if dst < imm
0xad    jlt dst, src, +off    PC += off if dst < src
0xb5    jle dst, imm, +off    PC += off if dst <= imm
0xbd    jle dst, src, +off    PC += off if dst <= src
0x45    jset dst, imm, +off   PC += off if dst & imm
0x4d    jset dst, src, +off   PC += off if dst & src
0x55    jne dst, imm, +off    PC += off if dst != imm
0x5d    jne dst, src, +off    PC += off if dst != src
0x65    jsgt dst, imm, +off   PC += off if dst > imm (signed)
0x6d    jsgt dst, src, +off   PC += off if dst > src (signed)
0x75    jsge dst, imm, +off   PC += off if dst >= imm (signed)
0x7d    jsge dst, src, +off   PC += off if dst >= src (signed)
0xc5    jslt dst, imm, +off   PC += off if dst < imm (signed)
0xcd    jslt dst, src, +off   PC += off if dst < src (signed)
0xd5    jsle dst, imm, +off   PC += off if dst <= imm (signed)
0xdd    jsle dst, src, +off   PC += off if dst <= src (signed)
0x85    call imm              Function call
0x95    exit                  return r0

https://github.com/iovisor/bpf-docs/blob/master/eBPF.md

Writing an eBPF program in assembly

Using the tables above, we can write eBPF bytecode directly:

struct bpf_insn bpf_prog[] = {
    { 0xb7, 0, 0, 0, 0x123 },   // mov r0, 0x123
    { 0xb7, 1, 0, 0, 0x456 },   // mov r1, 0x456
    { 0x0F, 0, 1, 0, 0 },       // add r0, r1
    { 0x95, 0, 0, 0, 0x0 },     // exit
};
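To see what this bytecode computes without loading it into the kernel, here is a tiny user-space interpreter sketch. It is not how the kernel executes eBPF: struct mini_insn, sample_prog and mini_run are hypothetical names, and only the opcodes 0xb7, 0xbf, 0x07, 0x0f and 0x95 from the tables above are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copy of the instruction layout, so no kernel headers are needed. */
struct mini_insn {
    uint8_t code;
    uint8_t dst_reg:4;
    uint8_t src_reg:4;
    int16_t off;
    int32_t imm;
};

/* The same four instructions as the hand-written bytecode. */
const struct mini_insn sample_prog[4] = {
    { 0xb7, 0, 0, 0, 0x123 },   /* mov r0, 0x123 */
    { 0xb7, 1, 0, 0, 0x456 },   /* mov r1, 0x456 */
    { 0x0f, 0, 1, 0, 0 },       /* add r0, r1    */
    { 0x95, 0, 0, 0, 0 },       /* exit          */
};

/* Hypothetical helper: executes mov/add/exit and returns r0. */
uint64_t mini_run(const struct mini_insn *prog, size_t len)
{
    uint64_t reg[11] = {0};   /* r0..r10 */

    for (size_t pc = 0; pc < len; pc++) {
        const struct mini_insn *insn = &prog[pc];

        assert(insn->dst_reg <= 10 && insn->src_reg <= 10);

        switch (insn->code) {
        case 0xb7:   /* mov dst, imm (imm is sign-extended to 64 bits) */
            reg[insn->dst_reg] = (uint64_t)(int64_t)insn->imm;
            break;
        case 0xbf:   /* mov dst, src */
            reg[insn->dst_reg] = reg[insn->src_reg];
            break;
        case 0x07:   /* add dst, imm */
            reg[insn->dst_reg] += (uint64_t)(int64_t)insn->imm;
            break;
        case 0x0f:   /* add dst, src */
            reg[insn->dst_reg] += reg[insn->src_reg];
            break;
        case 0x95:   /* exit: the result is whatever r0 holds */
            return reg[0];
        default:
            assert(!"opcode not handled by this sketch");
            return 0;
        }
    }
    return 0;   /* fell off the end without exit */
}
```

mini_run(sample_prog, 4) returns 0x579, the sum of 0x123 and 0x456.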

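Loading this program with the method described in the previous chapter, the verifier log shows that the program has been accepted.

Raw bytecode is hard to read. We can wrap the initialization of struct bpf_insn in macros to make programs easier to write. If anything is unclear, refer back to the instruction encoding above.

First, define the instruction class cls, which says which broad class an instruction belongs to: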

#define BPF_CLASS(code) ((code) & 0x07)    /* the instruction class is the low 3 bits of the opcode */
#define BPF_ALU64       0x07               /* ALU instruction class operating on 64-bit operands */
#define BPF_JMP         0x05               /* jump instruction class */

Next, define the op part of the opcode, which identifies the specific operation, that is, what the instruction does:

#define BPF_OP(code)    ((code) & 0xf0)    /* the operation is the upper 4 bits of the opcode */
#define BPF_MOV         0xb0               /* mov reg to reg */
#define BPF_ADD         0x00               /* addition */
#define BPF_EXIT        0x90               /* function return */

ALU and JMP opcodes also have the 1-bit s field, which selects the source operand:

#define BPF_SRC(code)   ((code) & 0x08)    /* a single bit: bit 3 of the opcode */
#define BPF_K           0x00               /* the source operand is an immediate, stored in imm */
#define BPF_X           0x08               /* the source operand is a register, named by the src field */

Next, define the registers: an enum numbers r0~r10 from 0 to 10.

enum {
    BPF_REG_0 = 0,
    BPF_REG_1,
    BPF_REG_2,
    BPF_REG_3,
    BPF_REG_4,
    BPF_REG_5,
    BPF_REG_6,
    BPF_REG_7,
    BPF_REG_8,
    BPF_REG_9,
    BPF_REG_10,
    __MAX_BPF_REG,
};

With the basic pieces in place, we can combine them into macros that represent instructions:

/*
    Assign to a register: mov DST, IMM
    Opcode: BPF_ALU64 | BPF_MOV means an assignment, BPF_K means the source is the immediate IMM
*/
#define BPF_MOV64_IMM(DST, IMM)                         \
    ((struct bpf_insn) {                                \
        .code    = BPF_ALU64 | BPF_MOV | BPF_K,         \
        .dst_reg = DST,                                 \
        .src_reg = 0,                                   \
        .off     = 0,                                   \
        .imm     = IMM })

/*
    ALU operation between two registers: OP DST, SRC
    OP can be add, sub, mul, div, ...; DST and SRC say which registers are used
    Opcode: BPF_ALU64 | BPF_OP(OP) selects the ALU64 operation, BPF_X means the source operand is a register
*/
#define BPF_ALU64_REG(OP, DST, SRC)                     \
    ((struct bpf_insn) {                                \
        .code    = BPF_ALU64 | BPF_OP(OP) | BPF_X,      \
        .dst_reg = DST,                                 \
        .src_reg = SRC,                                 \
        .off     = 0,                                   \
        .imm     = 0 })

/*
    Exit instruction: exit
    Opcode: BPF_JMP | BPF_EXIT selects the exit instruction within the jump class
*/
#define BPF_EXIT_INSN()                                 \
    ((struct bpf_insn) {                                \
        .code    = BPF_JMP | BPF_EXIT,                  \
        .dst_reg = 0,                                   \
        .src_reg = 0,                                   \
        .off     = 0,                                   \
        .imm     = 0 })

With these macros we can rewrite the eBPF program without the confusing constants. The result is identical to the earlier version:

struct bpf_insn bpf_prog[] = {
    BPF_MOV64_IMM(BPF_REG_0, 0x123),                 // { 0xb7, 0, 0, 0, 0x123 }  mov r0, 0x123
    BPF_MOV64_IMM(BPF_REG_1, 0x456),                 // { 0xb7, 1, 0, 0, 0x456 }  mov r1, 0x456
    BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),    // { 0x0F, 0, 1, 0, 0 }      add r0, r1
    BPF_EXIT_INSN()                                  // { 0x95, 0, 0, 0, 0x0 }    exit
};
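The claim that the macro version and the raw constants are identical comes down to how the opcode byte is composed from cls, op and s. The sketch below recomputes the three opcode bytes; the enum names and make_opcode are hypothetical stand-ins chosen so they do not clash with the real BPF_* macros, and the numeric values are the ones defined in the text.

```c
#include <assert.h>
#include <stdint.h>

enum { CLS_JMP = 0x05, CLS_ALU64 = 0x07 };               /* cls: low 3 bits    */
enum { OP_ADD = 0x00, OP_EXIT = 0x90, OP_MOV = 0xb0 };   /* op: upper 4 bits   */
enum { SRC_K = 0x00, SRC_X = 0x08 };                     /* s: bit 3           */

/* Combine the three opcode parts into one opcode byte. */
uint8_t make_opcode(unsigned cls, unsigned op, unsigned src)
{
    /* Each part must stay inside its own bit range. */
    assert((cls & ~0x07u) == 0);
    assert((op  & ~0xf0u) == 0);
    assert((src & ~0x08u) == 0);
    return (uint8_t)(cls | op | src);
}
```

make_opcode(CLS_ALU64, OP_MOV, SRC_K) gives 0xb7, make_opcode(CLS_ALU64, OP_ADD, SRC_X) gives 0x0f, and make_opcode(CLS_JMP, OP_EXIT, SRC_K) gives 0x95, matching the constants in the comments.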

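In fact, <linux/bpf.h> already defines the opcode constants, and samples/bpf/bpf_insn.h in the kernel source tree contains the instruction macros shown above, in a more complete form. We only need to put that file in the same directory as our source and write #include "./bpf_insn.h" to use these macros for defining eBPF bytecode.

Writing eBPF programs in C

The same program can also be written in C. Since gcc has historically not supported compiling BPF programs, we use clang (LLVM) instead. -target bpf selects eBPF bytecode as the output, and -c stops at the object file, because an eBPF program has no entry point and cannot be built into an executable. The conversion pipeline is: C ---llvm---> eBPF ---JIT---> native instructions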

//clang -target bpf -c ./prog.c -o ./prog.o
unsigned long prog(void)
{
    unsigned long a = 0x123;
    unsigned long b = 0x456;
    return a + b;
}

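The resulting object file is in ELF format, and readelf shows the bytecode that was finally generated.

objdump cannot disassemble eBPF, but llvm-objdump can disassemble the bytecode. r10 is the stack pointer, and *(u32 *)(r10-4) = r1 writes a local variable to the stack. The overall structure is similar to the hand-written assembly above.

To execute the eBPF bytecode, the .text section must first be extracted from the ELF object file, which llvm-objcopy can do.

How to extract a specific section from an ELF file: https://stackoverflow.com/questions/3925075/how-to-extract-only-the-raw-contents-of-an-elf-section

Next we write a loader that reads the bytecode from prog.text into a buffer and then issues the bpf system call with the BPF_PROG_LOAD command, which injects the bytecode into the kernel. The loader code is shown below and is largely the same as before. See the previous article if anything is unclear.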

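//gcc ./loader.c -o loader
#include <stdio.h>
#include <stdlib.h>         // for exit()
#include <stdint.h>         // for uint64_t and other standard types
#include <errno.h>          // for error handling
#include <fcntl.h>          // for open()
#include <unistd.h>         // for read(), close() and syscall()
#include <linux/bpf.h>      // /usr/include/linux/bpf.h: constants and struct definitions for the BPF system call
#include <sys/syscall.h>    // for __NR_bpf

// Pointer-to-u64 conversion to reduce warnings; optional
#define ptr_to_u64(x) ((uint64_t)(uintptr_t)(x))

// Wrapper around the system call. __NR_bpf is the syscall number of bpf;
// every BPF operation talks to the kernel through this system call
int bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

// Buffer that receives the BPF verifier's log output
#define LOG_BUF_SIZE 0x1000
char bpf_log_buf[LOG_BUF_SIZE];

// Load a sequence of BPF instructions into the kernel through the system call
int bpf_prog_load(enum bpf_prog_type type, const struct bpf_insn* insns, int insn_cnt, const char* license)
{
    union bpf_attr attr = {
        .prog_type = type,                      // program type
        .insns = ptr_to_u64(insns),             // pointer to the instruction array
        .insn_cnt = insn_cnt,                   // number of instructions
        .license = ptr_to_u64(license),         // pointer to the license string
        .log_buf = ptr_to_u64(bpf_log_buf),     // log output buffer
        .log_size = LOG_BUF_SIZE,               // size of the log buffer
        .log_level = 2,                         // log level
    };

    return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
}

// A BPF program is an array of bpf_insn; one struct bpf_insn is one BPF instruction
struct bpf_insn bpf_prog[0x100];

int main(int argc, char **argv)
{
    // Usage: loader <file holding the bytecode> <bytecode length in bytes>
    if (argc < 3) {
        fprintf(stderr, "usage: %s <file> <length in bytes>\n", argv[0]);
        exit(-1);
    }

    // Read the file contents into the bpf_prog array
    int text_len = atoi(argv[2]);
    if (text_len <= 0 || (size_t)text_len > sizeof(bpf_prog)) {
        fprintf(stderr, "invalid bytecode length\n");
        exit(-1);
    }
    int file = open(argv[1], O_RDONLY);
    if (file < 0 || read(file, (void *)bpf_prog, text_len) < 0) {
        perror("read prog fail");
        exit(-1);
    }
    close(file);

    // Load the program into the kernel
    int prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, bpf_prog, text_len/sizeof(bpf_prog[0]), "GPL");
    if (prog_fd < 0) {
        perror("BPF load prog");
        exit(-1);
    }
    printf("prog_fd: %d\n", prog_fd);
    printf("%s", bpf_log_buf);    // print the verifier log
}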
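The loader takes the bytecode length from the command line and divides it by the instruction size. A slightly more defensive variant could validate that length first. insn_count is a hypothetical helper, not part of any BPF API; it assumes the fixed 8-byte instruction size described earlier.

```c
#include <assert.h>

/*
 * Hypothetical helper: convert a bytecode length in bytes into an
 * instruction count. Returns -1 when the length is not a positive
 * multiple of 8 or when the program would not fit into a buffer of
 * max_insns instructions.
 */
long insn_count(long byte_len, long max_insns)
{
    assert(max_insns > 0);   /* the destination buffer must hold something */

    if (byte_len <= 0 || byte_len % 8 != 0)
        return -1;           /* truncated or empty bytecode */
    if (byte_len / 8 > max_insns)
        return -1;           /* would overflow the buffer */
    return byte_len / 8;
}
```

With the 0x100-entry bpf_prog buffer, insn_count(72, 0x100) gives 9, while a length such as 70 is rejected.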

clang emits 9 instructions, 72 bytes in total, so the loader is run with the command ./loader ./prog.text 72.