Statistical Learning Methods: The EM Algorithm and Its Extensions
- The EM Algorithm and Its Extensions (Part 1)
- Introducing the EM Algorithm
- The EM Algorithm
- Derivation of the EM Algorithm (optional)
- Application to Unsupervised Learning
- Convergence of the EM Algorithm
- References
The EM Algorithm and Its Extensions (Part 1)
The EM algorithm (expectation-maximization algorithm) is an iterative algorithm for maximum likelihood estimation of the parameters of probabilistic models that contain hidden variables. It alternates between two steps:
- E-step: compute an expectation
- M-step: maximize it
Introducing the EM Algorithm
A probabilistic model sometimes contains both observed variables and hidden (latent) variables.
The EM Algorithm
- Input: observed-variable data $Y$, hidden-variable data $Z$, joint distribution $P(Y,Z\mid\theta)$, conditional distribution $P(Z\mid Y,\theta)$
- Output: model parameter $\theta$
- Procedure:
- Choose an initial parameter value $\theta^{(0)}$ and start iterating.
- E-step: let $\theta^{(i)}$ denote the estimate of $\theta$ after the $i$-th iteration. At the E-step of iteration $i+1$, compute
$$Q\left(\theta,\theta^{(i)}\right) = E_Z\left[\log P(Y,Z\mid\theta)\,\middle|\,Y,\theta^{(i)}\right] = \sum_Z \log P(Y,Z\mid\theta)\,P\left(Z\mid Y,\theta^{(i)}\right)$$
where $P\left(Z\mid Y,\theta^{(i)}\right)$ is the conditional distribution of the hidden data $Z$ given the observed data $Y$ and the current parameter estimate $\theta^{(i)}$.
- M-step: find the $\theta$ that maximizes $Q\left(\theta,\theta^{(i)}\right)$; this gives the parameter estimate for iteration $i+1$:
$$\theta^{(i+1)} = \arg\max_\theta Q\left(\theta,\theta^{(i)}\right)$$
- Repeat the two steps above until convergence, typically until
$$\left\|\theta^{(i+1)} - \theta^{(i)}\right\| < \varepsilon_1$$
or
$$\left\|Q\left(\theta^{(i+1)},\theta^{(i)}\right) - Q\left(\theta^{(i)},\theta^{(i)}\right)\right\| < \varepsilon_2$$
The Q function:
The expectation of the complete-data log-likelihood $\log P(Y,Z\mid\theta)$ with respect to the conditional distribution $P\left(Z\mid Y,\theta^{(i)}\right)$ of the unobserved data $Z$ given the observed data $Y$ and the current parameter $\theta^{(i)}$ is called the Q function:
$$Q\left(\theta,\theta^{(i)}\right) = E_Z\left[\log P(Y,Z\mid\theta)\,\middle|\,Y,\theta^{(i)}\right]$$
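The E and M steps above can be sketched concretely on the classic three-coin model, a standard worked example for this algorithm (the model, variable names, and data below are illustrative assumptions, not part of this excerpt): coin A is tossed first (heads probability $\pi$); on heads, coin B (heads probability $p$) is tossed, otherwise coin C (heads probability $q$); only the final outcome $y_j \in \{0,1\}$ is observed, and which of B/C was tossed is the hidden variable $Z$.

```python
# EM for the three-coin model (a minimal sketch, pure Python).
# Hidden variable Z: which of coins B/C was tossed for each observation.

def em_three_coins(y, pi, p, q, n_iter=10):
    for _ in range(n_iter):
        # E-step: mu_j = P(Z = B | y_j, current parameters),
        # i.e. the conditional distribution P(Z | Y, theta^(i))
        mu = []
        for yj in y:
            b = pi * p**yj * (1 - p)**(1 - yj)          # A heads, then B
            c = (1 - pi) * q**yj * (1 - q)**(1 - yj)    # A tails, then C
            mu.append(b / (b + c))
        # M-step: closed-form maximizers of Q(theta, theta^(i))
        n = len(y)
        pi = sum(mu) / n
        p = sum(m * yj for m, yj in zip(mu, y)) / sum(mu)
        q = sum((1 - m) * yj for m, yj in zip(mu, y)) / (n - sum(mu))
    return pi, p, q

y = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
print(em_three_coins(y, 0.5, 0.5, 0.5))  # → (0.5, 0.6, 0.6)
```

With the symmetric start $\pi = p = q = 0.5$, every responsibility $\mu_j$ equals $0.5$, so EM immediately reaches the fixed point $\pi = 0.5$, $p = q = 0.6$ (the sample mean of $y$); a different initialization generally converges to a different local maximum, which is why the result of EM depends on the starting point.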
Derivation of the EM Algorithm (optional)
For a probabilistic model with hidden variables, the goal is to maximize the log-likelihood of the observed (incomplete) data $Y$ with respect to the parameter $\theta$:
$$L(\theta) = \log P(Y\mid\theta) = \log\sum_Z P(Y,Z\mid\theta) = \log\left(\sum_Z P(Y\mid Z,\theta)\,P(Z\mid\theta)\right)$$
The EM algorithm approaches the maximum of $L(\theta)$ gradually through iteration: we want each new $\theta$ to increase it, so consider the difference between two consecutive iterations:
$$L(\theta) - L\left(\theta^{(i)}\right) = \log\left(\sum_Z P(Y\mid Z,\theta)\,P(Z\mid\theta)\right) - \log P\left(Y\mid\theta^{(i)}\right)$$
Multiplying and dividing by $P\left(Z\mid Y,\theta^{(i)}\right)$ inside the sum (these weights sum to 1 over $Z$, as Jensen's inequality requires) and applying Jensen's inequality gives a lower bound:
$$\begin{aligned}
L(\theta) - L\left(\theta^{(i)}\right) &= \log\left(\sum_Z P\left(Z\mid Y,\theta^{(i)}\right)\frac{P(Y\mid Z,\theta)\,P(Z\mid\theta)}{P\left(Z\mid Y,\theta^{(i)}\right)}\right) - \log P\left(Y\mid\theta^{(i)}\right)\\
&\ge \sum_Z P\left(Z\mid Y,\theta^{(i)}\right)\log\frac{P(Y\mid Z,\theta)\,P(Z\mid\theta)}{P\left(Z\mid Y,\theta^{(i)}\right)} - \log P\left(Y\mid\theta^{(i)}\right)\\
&= \sum_Z P\left(Z\mid Y,\theta^{(i)}\right)\log\frac{P(Y\mid Z,\theta)\,P(Z\mid\theta)}{P\left(Z\mid Y,\theta^{(i)}\right)P\left(Y\mid\theta^{(i)}\right)}
\end{aligned}$$
Define
$$B\left(\theta,\theta^{(i)}\right) \,\hat{=}\, L\left(\theta^{(i)}\right) + \sum_Z P\left(Z\mid Y,\theta^{(i)}\right)\log\frac{P(Y\mid Z,\theta)\,P(Z\mid\theta)}{P\left(Z\mid Y,\theta^{(i)}\right)P\left(Y\mid\theta^{(i)}\right)}$$
Then $B\left(\theta,\theta^{(i)}\right)$ is a lower bound on $L(\theta)$, so any $\theta$ that increases $B\left(\theta,\theta^{(i)}\right)$ also increases $L(\theta)$. To make $L(\theta)$ grow as much as possible, choose the $\theta$ that maximizes this lower bound:
$$\theta^{(i+1)} = \arg\max_\theta B\left(\theta,\theta^{(i)}\right)$$
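Dropping the terms of $B$ that do not depend on $\theta$, and using $P(Y\mid Z,\theta)\,P(Z\mid\theta) = P(Y,Z\mid\theta)$, makes the link back to the Q function explicit:

$$\begin{aligned}
\theta^{(i+1)} &= \arg\max_\theta\left(L\left(\theta^{(i)}\right) + \sum_Z P\left(Z\mid Y,\theta^{(i)}\right)\log\frac{P(Y\mid Z,\theta)\,P(Z\mid\theta)}{P\left(Z\mid Y,\theta^{(i)}\right)P\left(Y\mid\theta^{(i)}\right)}\right)\\
&= \arg\max_\theta \sum_Z P\left(Z\mid Y,\theta^{(i)}\right)\log P(Y,Z\mid\theta)\\
&= \arg\max_\theta Q\left(\theta,\theta^{(i)}\right)
\end{aligned}$$

So maximizing the lower bound $B$ is exactly the M-step of the EM algorithm; the E-step corresponds to constructing $Q\left(\theta,\theta^{(i)}\right)$.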
Application to Unsupervised Learning
For unsupervised learning, we can regard $X$ as the observed data and $Y$ as the unobserved data. The generative model is given by the joint probability distribution $P(X,Y)$, and the unsupervised training data can be regarded as samples generated by this joint distribution.
Convergence of the EM Algorithm
- Theorem 1
Let $P(Y\mid\theta)$ be the likelihood of the observed data, $\theta^{(i)}, i=1,2,\cdots$ the sequence of parameter estimates produced by the EM algorithm, and $P\left(Y\mid\theta^{(i)}\right), i=1,2,\cdots$ the corresponding sequence of likelihood values. Then $P\left(Y\mid\theta^{(i)}\right)$ is monotonically non-decreasing, i.e.
$$P\left(Y\mid\theta^{(i+1)}\right) \ge P\left(Y\mid\theta^{(i)}\right)$$
- Theorem 2
Let $L(\theta) = \log P(Y\mid\theta)$ be the log-likelihood of the observed data, $\theta^{(i)}, i=1,2,\cdots$ the sequence of parameter estimates produced by the EM algorithm, and $L\left(\theta^{(i)}\right), i=1,2,\cdots$ the corresponding sequence of log-likelihood values.
- If $P(Y\mid\theta)$ is bounded above, then $L\left(\theta^{(i)}\right) = \log P\left(Y\mid\theta^{(i)}\right)$ converges to some value $L^*$;
- Under certain conditions on $Q(\theta,\theta')$ and $L(\theta)$, the limit $\theta^*$ of the parameter estimate sequence $\theta^{(i)}$ is a stationary point of $L(\theta)$.
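Theorem 1 can be checked numerically. The sketch below (an illustrative assumption, reusing the three-coin model: coin A with heads probability $\pi$ selects coin B with heads probability $p$ or coin C with heads probability $q$, and only the outcome is observed) tracks the observed-data log-likelihood $\log\sum_Z P(Y,Z\mid\theta^{(i)})$ across EM iterations and confirms it never decreases:

```python
import math

def log_likelihood(y, pi, p, q):
    # Observed-data log-likelihood: sum over the hidden variable Z
    # P(y_j | theta) = pi p^y (1-p)^(1-y) + (1-pi) q^y (1-q)^(1-y)
    return sum(math.log(pi * p**yj * (1 - p)**(1 - yj)
                        + (1 - pi) * q**yj * (1 - q)**(1 - yj))
               for yj in y)

def em_step(y, pi, p, q):
    # E-step: responsibilities mu_j = P(Z = B | y_j, theta)
    mu = [pi * p**yj * (1 - p)**(1 - yj)
          / (pi * p**yj * (1 - p)**(1 - yj)
             + (1 - pi) * q**yj * (1 - q)**(1 - yj))
          for yj in y]
    # M-step: closed-form update of (pi, p, q)
    n = len(y)
    new_pi = sum(mu) / n
    new_p = sum(m * yj for m, yj in zip(mu, y)) / sum(mu)
    new_q = sum((1 - m) * yj for m, yj in zip(mu, y)) / (n - sum(mu))
    return new_pi, new_p, new_q

y = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
theta = (0.4, 0.6, 0.7)              # arbitrary asymmetric start
lls = []
for _ in range(15):
    lls.append(log_likelihood(y, *theta))
    theta = em_step(y, *theta)
# Theorem 1: the likelihood sequence is monotonically non-decreasing
assert all(b >= a - 1e-12 for a, b in zip(lls, lls[1:]))
```

Note that Theorem 1 guarantees a non-decreasing likelihood, not convergence to the global maximum; as Theorem 2 says, the sequence generally reaches only a stationary point.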
References
Li Hang, Statistical Learning Methods (《统计学习方法》)