使用python手寫Metropolis-Hastings算法的貝葉斯線性回歸 - 網安 - 專業的網絡安全產業、社區、知識平臺

在學習貝葉斯計算的解馬爾可夫鏈蒙特卡洛(MCMC)模擬時，最簡單的方法是使用PyMC3，構建模型，調用Metropolis優化器。但是使用別人的包我們并不真正理解發生了什么，所以本文通過手寫Metropolis-Hastings來深入的理解MCMC的過程，再次強調我們自己實現該方法并不是并不是為了造輪子，而是為了更好的通過代碼理解該概念。

貝葉斯線性回歸包含了幾十個概念和定義，這使得我們的整個研究成為一種折磨，并且真正發生的事情。在本文中，我將通過常見Metropolis-Hastings 算法構建一個馬爾可夫鏈，并提供一個實際的使用案例。我們將著重于推斷簡單線性回歸模型的參數（但是這里說“簡單”并不能代表它背后的原理簡單）。我們還可以用任何其他條件分布(泊松/伽馬/負二項或其他)替換正態分布，這樣可以通過MCMC實現幾乎相同的GLM（只更改4或5行代碼）。由于推斷的樣本來自參數的后驗分布，我們可以很自然地使用這些來構建參數的置信區間(在這種情況下，“可信區間”更正確)。

下面我們將簡要描述為什么使用MCMC方法，提供一個線性回歸模型的MH算法的實現，并將以一個可視化的方式顯示當算法尋找生成數據的參數集時，真正發生了什么。

數據準備

設Y和X分別為模型的響應和輸入。線性模型表明，給定輸入的響應的條件分布是正態的。也就是:

對于合適的參數a(斜率)、b(偏差)和σ(噪聲強度)。

我們的任務是推斷a, b和σ。

所以我們首先要知道一些模型需要遵循的“基本規則”。在這個例子中，a, b幾乎可以是任何數值，正的或負的，但σ必須是嚴格正的(因為從來沒有聽說過負標準差的正態分布，對吧?)除此之外，沒有其他任何規則。也許你會說：“我們需要先了解這些是如何分布的”，但是后驗分布的漸近正態性保證告訴我們，只要有足夠的后驗樣本，這些樣本無論如何都是正態分布(如果馬爾可夫鏈達到其平穩分布)，所以分布不是我們考慮的必要因素。

現在讓我們為回歸生成合成數據，這里使用參數a=3, b=20和σ=5。可以通過以下代碼在python中完成:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as sc
# sample x
np.random.seed(2022)
x = np.random.rand(100)*30
# set parameters
a = 3
b = 20
sigma = 5
# obtain response and add noise
y = a*x+b
noise = np.random.randn(100)*sigma
# create a matrix containing the predictor in the first column
# and the response in the second
data = np.vstack((x,y)).T + noise.reshape(-1,1)
# plot data
plt.scatter(data[:,0], data[:,1])
plt.xlabel("x")
plt.ylabel("y")

合成數據如下圖所示：

數據的準備已經完成了，下一節將涉及定義 Metropolis Hastings 算法的函數和一組迭代次數的循環。

算法介紹

假設θ=[a,b，σ]是算法上面的參數向量，θ '是一組新參數的建議，MH比較參數(θ '和θ)的兩個競爭假設之間的貝葉斯因子(似然和先驗的乘積)，并通過條件建議分布的倒數縮放該因子。然后將該因子與均勻分布的隨機變量的值進行比較。這給模型增加了隨機性，使不可能的參數向量有可能被探索，也可能被丟棄(很少)。

這聽起來有點復雜，讓我們從頭一步一步對它進行代碼的實現。

Proposal Distribution

首先，我們定義了一個建議的分布（Proposal Distribution）g(θ′|θ):這是前一個時間步固定參數的分布。在我們的例子中，a和b可以是正的也可以是負的，因此一個自然的選擇就是從一個以前一個迭代步驟為中心的多元正態分布中采樣它們。我們可以從伽馬分布中取樣σ，這些分布的定義我們可以根據實際情況進行選擇，但是一個更好的方法(這里我們將不涉及)是從逆伽馬分布抽樣σ。

因此，我們可以按照以下方式定義進行Proposal Distribution:

分布抽樣σ為

參數 k 用于控制分布在其均值周圍的“擴展”。? 是對gamma 額外調整。k?越大，gamma 的分布越大（我們使用的是 gamma 分布的形狀速率公式因為 scipy 強迫使用形狀比例公式）。功能如下：

def proposal(prec_theta, search_width = 0.5):
 # this function generates the proposal for the new theta
 # we assume that the distribution of the random variables
 # is normal for the first two and gamma for the third.
 # conditional on the previous value of the accepted parameters (prec_theta)
 out_theta = np.zeros(3)
 out_theta[:2] = sc.multivariate_normal(mean=prec_theta[:2],cov=np.eye(2)*search_width**2).rvs(1)
 #the last component is the noise
 out_theta[2] = sc.gamma(a=prec_theta[2]*search_width*500, scale=1/(500*search_width)).rvs()
 return out_theta

我們可能會可以注意到函數proposal包含兩個參數:prec_theta表示上面說的的參數向量，search_width表示尋找建議的區域。尋找一組良好的參數會在探索(在未探索的區域中搜索新參數集)和開發(在已找到良好參數集的區域中改進搜索)之間進行權衡。所以應該非常小心地設置search_width。search_width值過大會使MCMC收斂到平穩分布。一個小的值可能會阻止算法在合理的時間內找到最優(optima)(需要繪制更多的樣本，更多的訓練時間)。

似然函數

似然函數其實就是線性函數，并且給定參數的響應的條件分布是正態的。換句話說，我們將計算正態分布的可能性，其中均值是輸入和系數a和b的乘積，噪聲是σ。在這種情況下，我們將使用對數似然而不是原始似然，這樣可以提高穩定性。

def lhd(x,theta):
 # x is the data matrix, first column for input and second column for output.
 # theta is a vector containing the parameters for the evaluation
 # remember theta[0] is a, theta[1] is b and theta[2] is sigma
 xs = x[:,0]
 ys = x[:,1]
 lhd_out = sc.norm.logpdf(ys, loc=theta[0]*xs+theta[1], scale=theta[2])
 # then we sum lhd_out (be careful here, we are summing instead of multiplying
 # because we are dealing with the log-likelihood, instead of the raw likelihood).
 lhd_out = np.sum(lhd_out)
 return lhd_out

先驗

我們不需要在這方面花很多時間。在先驗分布方面，可以選擇任何喜歡的東西。這里需要做的就是確保可能找到推斷的參數的區域具有非零先驗，并且噪聲參數 σ 是非負的。這里的先驗以 log-pdf 的形式表示。

def prior(theta):
 # evaluate the prior for the parameters on a multivariate gaussian.
 prior_out = sc.multivariate_normal.logpdf(theta[:2],mean=np.array([0,0]), cov=np.eye(2)*100)
 # this needs to be summed to the prior for the sigma, since I assumed independence.
 prior_out += sc.gamma.logpdf(theta[2], a=1, scale=1)
 return prior_out

Proposal Ratio

Proposal Ratio是在給定新參數向量的情況下觀察到舊參數向量的可能性除以給定舊參數向量的情況下觀察到新參數向量的概率。也就是Proposal Distribution提到的，g(θ|θ′)/g(θ′|θ)。這里將使用log-pdf，這樣可以在概率中具有統一的尺度，并獲得更好的數值穩定性。

def proposal_ratio(theta_old, theta_new, search_width=10):
 # this is the proposal distribution ratio
 # first, we calculate of the pdf of the proposal distribution at the old value of theta with respect to the new
 # value of theta. And then we do the exact opposite.
 prop_ratio_out = sc.multivariate_normal.logpdf(theta_old[:2],mean=theta_new[:2], cov=np.eye(2)*search_width**2)
 prop_ratio_out += sc.gamma.logpdf(theta_old[2], a=theta_new[2]*walk*500, scale=1/(500*walk))
 prop_ratio_out -= sc.multivariate_normal.logpdf(theta_new[:2],mean=theta_old[:2], cov=np.eye(2)*search_width**2)
 prop_ratio_out -= sc.gamma.logpdf(theta_new[2], a=theta_old[2]*walk*500, scale=1/(500*walk))
 return prop_ratio_out

整合

把所有信息整合起來。偽代碼如下:

1)實例化參數向量的初始值

．.． N次，直到收斂

2)從建議分布中找到一個新的參數向量

3)計算似然、先驗pdf值和建議似然比的倒數

4)將3中的所有數量相乘(或log求和)，并比較這個比例(線性比例)

根據從均勻分布中得出的數字。

5)如果比值較大，則接受新的參數向量。

否則，新的參數向量將被拒絕。

6)重復2

在Python代碼如下:

np.random.seed(100)
width = 0.2
thetas = np.random.rand(3).reshape(1,-1)
accepted = 0
rejected = 0
N = 200000
for i in range(N):
 # 1) provide a proposal for theta
 theta_new = proposal(thetas[-1,:], search_width=width)
  
 # 2) calculate the likelihood of this proposal and the likelihood
 # for the old value of theta
 log_lik_theta_new = lhd(data,theta_new)
 log_lik_theta = lhd(data,thetas[-1,:])
  
 # 3) evaluate the prior log-pdf at the current and at the old value of theta
 theta_new_prior = prior(theta_new)
 theta_prior = prior(thetas[-1,:])
  
 # 4) finally, we need the proposal distribution ratio
 prop_ratio = proposal_ratio(thetas[-1], theta_new, search_width=width)
  
 # 5) assemble likelihood, priors and proposal distributions
 likelihood_prior_proposal_ratio = log_lik_theta_new - log_lik_theta + \
              theta_new_prior - theta_prior + prop_ratio
  
 # 6) throw a - possibly infinitely weigthed - coin. The move for Metropolis-Hastings is
 # not deterministic. Here, we exponentiate the likelihood_prior_proposal ratio in order
 # to obtain the probability in linear scale
 if np.exp(likelihood_prior_proposal_ratio) > sc.uniform().rvs():
   thetas = np.vstack((thetas,theta_new))
   accepted += 1
 else:
   rejected += 1

上面的代碼可能會要花費一些時間來運行。修改N來獲得更少的采樣，或編輯寬度以加快探索。

結果展示

結果包含在下圖中

得到的：a(紅色), b(藍色) and σ(紫色)

參數的直方圖。

可以看到，樣本的歷史結果非常穩定。平均值和標準偏差是：

但是有一個問題，我們只在這里提一下：這些樣本高度相關，因此在估計可信區間時可能需要小心。這里的一種解決方案是通過只保留一小部分參數來細化歷史記錄(例如，只保留1 / 10已接受的提議，并丟棄其余的)。

傳統的線性回歸相比如何呢？

使用 statsmodels.api.OLS 執行完全相同的計算

import statsmodels.api as sm
import pandas as pd
df = pd.DataFrame(data)
df.columns = ["a","y"]
df["b"] = 1
lm = sm.OLS(df["y"], df[["a","b"]]).fit()
lm.summary()

結果如下：

a 和 b 的系數的平均值非常接近 MCMC。標準誤差也可以這樣說，這樣也進一步證明了我們對 MCMC 實現是可行的。

請注意，這不是最好的解決方案，而只是一個解決方案。因為確實存在并推薦更好的先驗和建議分布的選擇。

迭代的可視化？

在 3D 中可視化相當的混亂，所以這里只關注斜率 a 和偏差 b。

在可視化中，每 10 步顯示一次 MH 的行為：

紅點代表當前的建議提案，紅點周圍的灰色區域表示與平均值相差 3 個標準差的建議分布（正態）。該提案有一個對角協方差矩陣，這就是我們得到一個圓而不是橢圓。

藍色線代表被拒絕的動作。

紅線代表接受的動作

最上面淺藍點表示從 statsmodels.api.OLS 獲得的參數的平均值。