查看“Baum-Welch算法”的源代码

在电气工程、计算机科学、统计计算和生物信息学中，Baum-Welch算法是用于寻找隐马尔可夫模型(HMM)未知参数的一种[[最大期望算法|EM算法]]，它利用前向-后向算法来计算E-Step的统计信息。

== 历史 ==
Baum-Welch算法是以其发明者Leonard E. Baum和Lloyd R. Welch的名字命名的。Baum-Welch算法和隐马尔可夫模型在20世纪60年代末和70年代初由Baum和他的同事在国防分析研究所的一系列文章中首次描述。HMMs最初主要应用于语音处理领域。20世纪80年代，HMMs开始成为分析生物系统和信息，特别是遗传信息的有用工具。此后，它们成为基因组序列概率建模的重要工具。

== 介绍 ==
[[隐马尔可夫模型]]描述了一组“隐含”变量和可观测到的离散随机变量的联合概率。它依赖于假设：第<math>i</math>个隐藏变量只与第<math>i-1</math>个隐含变量相关，而与其他先前的隐藏变量无关，而当前观测到的状态仅依赖于当前的隐藏状态。

Baum-Welch算法利用EM算法，在给定一组观测特征向量的情况下，求出隐马尔可夫模型参数的最大似然估计。

记离散的隐含随机变量为<math>X_t</math>，它总共有<math display="inline">N</math>种状态(<math>X_t</math>有<math>N</math>个不同的值)。设<math>P(X_t|X_{t-1})</math>与时间无关，得到与时间无关的随机变量转移矩阵：

<math>A=\{a_{ij}\}=P(X_t=j|X_{t-1}=i)</math><br />初始的状态(即<math>t=1</math>)分布由下式给出：

<math>\pi_i=P(X_1=i)</math>

记观测到的变量为<math>Y_t</math>，总共有<math>K</math>种取值。同样假设由隐含变量得到的可观测变量与时间无关。在时间<math>t</math>，由隐含变量<math>X_t=j</math>得到的可观察变量<math>Y_t=y_i</math>的概率是:

<math>b_j(y_i)=P(Y_t=y_i|X_t=j)</math>

由所有可能得<math>X_t</math>和<math>Y_t</math>的取值，我们可以得到<math>N\times K</math>的矩阵<math>B=\{b_j(y_i)\}</math>，其中<math>b_j</math>属于所有可能得隐含状态，<math>y_i</math>属于所有的可观测状态。

给出可观测序列：<math>Y=(Y_1=y_1, Y_2=y_2, \cdots, Y_T=y_T)</math>。

我们可以用<math display="inline">\theta(A, B, \pi)</math>描述隐马尔科夫链，Baum-Welch算法寻找<math>\theta^*=\arg \underset{\theta}{max}P(Y|\theta)</math>的局部极大值，也就是能够使得观测到的序列出现的概率最大的HMM的参数<math>\theta</math>。

== 算法 ==
初始化参数<math>\theta(A, B, \pi)</math>，可以随机初始化，或者根据先验知识初始化。

==== 前向过程 ====
记<math>\alpha_i(t)=P(Y_1=y_1, Y_2=y_2, \cdots, Y_t=y_t, X_t=i|\theta)</math>是参数<math>\theta</math>的条件下，观测的序列是<math>y_1, y_2, \cdots, y_t</math>，时刻<math>t</math>的状态是<math>i</math>的概率。可以通过递归计算：

# <math>a_i(1)=\pi_i b_i(y_1)</math>
# <math>\alpha_i(t+1) = b_i (y_{t+1})\sum_{j=1}^N\alpha_j(t)a_{ji}</math>

==== 后向过程 ====
记<math>\beta_i(t) = P(Y_{t+1}=y_{t+1}, \cdots, Y_T=y_T|X_t=i, \theta)</math>是参数是<math>\theta</math>，在时刻<math>t</math>的状态是<math>i</math>的条件下，余下部分的观测序列是<math>y_{t+1}, \cdots, y_{T}</math>的概率。

# <math>\beta_i(T)=1</math>
# <math>\beta_i(t) = \sum_{j=1}^N \beta_j(t+1)a_{ij}b_j(y_{t+1})</math>

==== 更新 ====

* 根据贝叶斯公式计算临时变量。
**在给定观测序列<math>Y</math>和参数<math>\theta</math>的情况下，在时间<math>t</math>状态是<math>i</math>的概率:<math>\gamma_i(t) = P(X_t=i|Y, \theta) = \frac{P(X_t=i, Y|\theta)}{P(Y|\theta)}=\frac{\alpha_i(t)\beta_i(t)}{\sum_{j=1}^N \alpha_j(t)\beta_j(t)}</math>
**在给定观测序列<math>Y</math>和参数<math>\theta</math>的情况下，在时间<math>t</math>状态是<math>i</math>，在时间<math>t+1</math>状态是<math>j</math>的概率:<math>\xi_{ij}(t) = P(X_t=i, X_{t+1}=j|Y, \theta) = \frac{P(X_t=t, X_{t+1}=j, Y|\theta)}{P(Y|\theta)}=\frac{\alpha_i(t)\alpha_{ij}\beta_j(t+1)b_j(y_t+1)}{\sum_{i=1}^N\sum_{j=1}^N\alpha_i(t) \alpha_{ij} \beta_j(y_{t+1})}</math>
**<math>\gamma_i(t)</math>和<math>\xi_{ij}(t)</math>的分母一样，表示给定参数<math>\theta</math>得到观测序列<math>Y</math>的概率。

* 然后更新参数：
**<math>\pi_i^*=\gamma_i(1)</math>，在时间<math>1</math>状态是<math>i</math>的概率
**<math>\alpha_{ij}^* = \frac{\sum_{t-1}^{T-1}\xi_{ij}(t)}{\sum_{t=1}^{T-1}\gamma_i(t)}</math>，等于期望的从状态<math>i</math>转换到状态<math>j</math>的数量除以从状态<math>i</math>开始的转换的总数。
**<math>b_i^*(v_k) = \frac{\sum_{t=1}^T 1_{y_t=v_k} \gamma_i(t)}{\sum_{t=1}^T\gamma_i(t)}</math>，其中<math>1_{y_t=v_k} = 
\begin{cases}
1\text{  if } y_t=v_k, \\
0\text{  otherwise}
\end{cases}</math>，<math>b_i^*(v_k)</math>是期望的从状态<math>i</math>得到的观察值等于<math>v_k</math>的数量除以从 状态<math>i</math>开始的转换的总数。
*重复上面的步骤直到收敛。算法可能过拟合，也不保证收敛到全局最大值。
*其中计算<math>\gamma_i(t)</math>和<math>\xi_{ij}(t)</math>相当于EM算法的E-Step，而更新<math>\pi_i^* \alpha_{ij}^*, b_i^*(v_k)</math>的过程相当于EM算法的M-Step。

== 例子 ==
假设我们有一只会下蛋的鸡，每天中午我们都会去拾取鸡蛋。而鸡是否下蛋依赖于一些未知的隐含状态，这里我们简单的假设只有两种隐含状态会决定它是否下蛋。我们不知道这些隐含状态的初始值，不知道他们之间的转换概率，也不知道在每种状态下鸡会下蛋的概率。我们随机初始化他们来开始猜测。

{| border="0" style="background:transparent; margin: 1em auto 1em auto;"
|-valign="bottom"
|
{| class="wikitable"
|+Transition
|-
! !! State 1 !! State 2
|-
! State 1
|0.5 || 0.5
|-
! State 2
|0.3 || 0.7
|}
||
{| class="wikitable"
|+ Emission
|-
! !! No Eggs !! Eggs
|-
! State 1
|0.3 || 0.7
|-
! State 2
|0.8 || 0.2
|}
||
{| class="wikitable"
|+ Initial
|-
! State 1
|0.2
|-
! State 2
|0.8
|}
|}
假设我们得到的观测序列是(E=eggs, N=no eggs): N, N, N, N, N, E, E, N, N, N。

这样我们同时也得到了观测状态的转移：NN, NN, NN, NN, NE, EE, EN, NN, NN。

通过上面的信息来重新估计状态转移矩阵。
{| class="wikitable" style="margin: 1em auto 1em auto;"
|-
! Observed sequence !! Probability of sequence and state is {{tmath|S_1}} then {{tmath|S_2}} !! colspan=2 | Highest Probability of observing that sequence
|-
| NN || 0.024 || 0.3584 || {{tmath|S_2}},{{tmath|S_2}}
|-
| NN || 0.024 || 0.3584 || {{tmath|S_2}},{{tmath|S_2}}
|-
| NN || 0.024 || 0.3584 || {{tmath|S_2}},{{tmath|S_2}}
|-
| NN || 0.024 || 0.3584 || {{tmath|S_2}},{{tmath|S_2}}
|-
| NE || 0.006 || 0.1344 || {{tmath|S_2}},{{tmath|S_1}}
|-
| EE || 0.014 || 0.0490 || {{tmath|S_1}},{{tmath|S_1}}
|-
| EN || 0.056 || 0.0896 || {{tmath|S_2}},{{tmath|S_2}}
|-
| NN || 0.024 || 0.3584 || {{tmath|S_2}},{{tmath|S_2}}
|-
| NN || 0.024 || 0.3584 || {{tmath|S_2}},{{tmath|S_2}}
|-
! Total
| 0.22 || 2.4234 ||
|}

重新估计<math>S_1</math>到<math>S_2</math>的转移概率为<math>\frac{0.22}{2.4234} = 0.0908</math>(下表中的"Pseudo probabilities")，重新计算所有的转移概率，得到下面的转移矩阵：

{| border="0" style="background:transparent; margin: 1em auto 1em auto;"
|-valign="bottom"
|
{| class="wikitable"
|+Old Transition Matrix
|-
! !! State 1 !! State 2
|-
! State 1
|0.5 || 0.5
|-
! State 2
|0.3 || 0.7
|}
||
{| class="wikitable"
|+New Transition Matrix (Pseudo Probabilities)
|-
! !! State 1 !! State 2
|-
! State 1
|0.0598 || 0.0908
|-
! State 2
|0.2179 || 0.9705
|}
||
{| class="wikitable"
|+New Transition Matrix (After Normalization)
|-
! !! State 1 !! State 2
|-
! State 1
|0.3973 || 0.6027
|-
! State 2
|0.1833 || 0.8167
|}
|}

接下来重新估计Emission Matrix:
{| class="wikitable" style="margin: 1em auto 1em auto;"
|-
! Observed Sequence !! colspan=2 | Highest probability of observing that sequence<br/> if E is assumed to come from {{tmath|S_1}} !! colspan=2 |Highest Probability of observing that sequence
|-
| NE || 0.1344 || {{tmath|S_2}},{{tmath|S_1}} || 0.1344 || {{tmath|S_2}},{{tmath|S_1}}
|-
| EE || 0.0490 || {{tmath|S_1}},{{tmath|S_1}} || 0.0490 || {{tmath|S_1}},{{tmath|S_1}}
|-
| EN || 0.0560 || {{tmath|S_1}},{{tmath|S_2}} || 0.0896 ||{{tmath|S_2}},{{tmath|S_2}}
|-
! Total
| 0.2394 || || 0.2730 ||
|}

重新估计从隐含状态<math>S_1</math>得到观察结果E的概率是<math>\frac{0.2394}{0.2730} = 0.8769</math>，得到新的Emission Matrix

{| border="0" style="background:transparent; margin: 1em auto 1em auto;"
|-valign="bottom"
|
{| class="wikitable"
|+Old Emission Matrix
|-
! !! No Eggs !! Eggs
|-
! State 1
|0.3 || 0.7
|-
! State 2
|0.8 || 0.2
|}
||
{| class="wikitable"
|+New Emission Matrix (Estimates)
|-
! !! No Eggs !! Eggs
|-
! State 1
|0.0876 || 0.8769
|-
! State 2
|1.0000 || 0.7385
|}
||
{| class="wikitable"
|+New Emission Matrix (After Normalization)
|-
! !! No Eggs !! Eggs
|-
! State 1
|0.0908 || 0.9092
|-
! State 2
|0.5752 || 0.4248
|}
|}

为了估计初始状态的概率，我们分别假设序列的开始状态是<math>S_1</math>和<math>S_2</math>，然后求出最大的概率，再归一化之后更新初始状态的概率。

一直重复上面的步骤，直到收敛。

== 代码 ==
<syntaxhighlight lang="python3" line="1">
from hmmlearn import hmm
import numpy as np

X = np.array([1, 1, 1, 1, 1, 0, 0, 1, 1, 1]).reshape(-1, 1)
model = hmm.GaussianHMM(n_components=2, covariance_type='full')
model.fit(X)

model.monitor_.history

# pi
model.startprob_

# state transform matrix
model.transmat_

# emission_matrix
np.power(np.e, model._compute_log_likelihood(np.unique(X).reshape(-1, 1)))
</syntaxhighlight>