Baum-Welch algorithm

Under the hood of Hidden Markov Models training

8 min readMar 14, 2021

Intro

This post aims to show how Hidden Markov Models (HMMs) are trained using Baum-Welch algorithm. If you want to learn more about Hidden Markov models, I suggest to read some posts:

This post assumes you are familiar with concepts like the transition, emission probabilities, hidden state, observation, forward and backward algorithm.

Goal

Baum-Welch algorithm finds values for HMM parameters that best fit the observed data. For training we need:

sequence of observations O₁, O₂, …, Oₙ
sequence of hidden states for these observations S₁, S₂, …, Sₙ

Algorithm tries to estimate transition matrix A and emission matrix B using values of O and S. If we know A and B, we can use HMM with new unseen data to find most probable sequence of hidden states that correspond to observations (which is called decoding problem).

Expectation maximization

Baum-Welch algorithm is using expectation maximization (EM) approach to find values for A and B. Process looks following:

Initialize A and B with some initial values (done only once)
Estimate latent variables ξᵢⱼ(t) and γᵢ(t) (I’ll explain the meaning of these later) using A, B, O and S. Goal of this step is to estimate how much each transition and emission has been used. This is estimation step
Maximize values of A and B using estimations (latent variables) from previous steps. This is maximization step
Continue previous steps until convergence

Initial equations

For estimating A and B values we could use the following formulas (source):

A = aᵢⱼ (probability of transition from hidden state i to hidden state j)= expected number of transitions from hidden state i to state j /expected number of transition from hidden state i
B = bⱼₖ (probability of observing observation Oₖ in hidden state j)= expected number of times model is in hidden state j and we observe Oₖ/ expected number of times in hidden state j

aᵢⱼ could be defined as the probability of being in hidden state i at time t and in hidden state j at time t+1, given the observation sequence O and the model (source). Graphically it could be described as follows:

Current step probability with forward, backward and emission probability. Inspiration from here

In the graph we are at time t, we know probability that we are at the current hidden state Sᵢ (this is forward probability αᵢ(t)), we know hidden probabilities going from hidden state Sⱼ to the end of the sequence using backward probabilities βᵢ₊₁(t). We want to get probability of going from Sᵢ to Sⱼ and given that we have observed Oₜ₊₁ in Sⱼ at t+1.

Here we’ll make use of latent variables ξᵢⱼ(t) and γᵢ(t):

ξᵢⱼ(t)-probability of transition from hidden state i to hidden state j at time t given observations:

source

Note that denominator P(O|Θ) means probability of the observation sequence O by any path given the model Θ. ξᵢⱼ(t) is defined for time t only. We have to sum over all timesteps to get the total joint probability for all the transitions from hidden state i to hidden state j (calculate aᵢⱼ). This will be our numerator of the equation of aᵢⱼ. For denominator we could use marginal probability which means the probability of being in state i at time t, whole equation has the following form:

source

Denominator could be expressed differently and leads to a new latent variable γᵢ(t):

source

γᵢ(t)-probability at given state i at time t given observations. We can use it to calculate aᵢⱼ (our previous formula for aᵢⱼ is also valid):

source

We can use γᵢ(t) to calculate bⱼₖ (which is the probability of a observation Oₖ from the observations O given hidden state j):

source

Note that 1ₒₜ₌ₖ is an indicator function which has value 1, if observation Oₜ belongs to class k and 0 if it doesn’t.

Expectations and maximization in HMM

Now that we have equations to calculate components separately we can form EM approach. We have two steps:

Calculate expected value of latent variables ξᵢⱼ(t) and γᵢ(t). At first we’ll initialize A and B randomly or use some previous knowledge if we have it
Maximize values of A and B by using equations for aᵢⱼ and bⱼₖ. And go for a next round by using new A and B values for estimating ξᵢⱼ(t) and γᵢ(t).

This is like chicken and egg problem. We have only observations and we start with random guess (or if we have some more information we could use it). We estimate our latent variables which we’ll use to maximize A and B. At each step we should get more better estimation for A and B until improvements are small and algorithm has converged.

Python implementation

Having equations is great, but implementation in code gives better understanding what the important details are and how whole process looks like. Here I’ll first give and implementation and then explain what is going on. Full code is available here and is mostly borrowed from here.

Let’s go through this function:

First lines are just initializations (note this function expects a and b (which correspond to A and B in previous parts) to be initialize with some (random) values). M = a.shape[0] is the number of hidden states. T = len(O) is the number of timesteps we have observed (number of observations)
Next we’ll start iterating n_iter number of iterations
In each iteration we’ll calculate alpha (αᵢ(t)-forward probability) and beta (βⱼ(t+1)-backward probability). We’ll do it for all of the timesteps once:

alpha = forward(O, a, b, initial_distribution)
beta = backward(O, a, b)

We have to initialize a variable xi that is going to hold values of ξᵢⱼ(t):

xi = np.zeros((M, M, T - 1))

xi is 3-dimensional matrix/array (or we could say a tensor) which dimensions have the following meaning. One side is state i, other side state j and one dimension for time T (example assumes there are two unique hidden states and two timesteps):

For each timestep t there will be an matrix/array of probabilities transitioning from state i to j (given observations)

If we go into loop for t in range(T-1) we start calculating values for xi. T-1 is used, as we have T-1 transitions in sequence which has T timesteps (observations). First we’ll calculate the denominator for ξᵢⱼ(t) (remember that denominator means the probability of the observation sequence O by any path given the model):

denominator = (alpha[t,:].T @ a * b[:, O[t + 1]].T) @ beta[t + 1, :]

We take dot-product between alpha (forward probability) at time t and transition probabilities (matrix a), which is multiplied with emission probabilities for observation O at time t and finally we take dot product with beta (which is our backward probability).

next part involves calculating numerator for ξᵢⱼ(t) and dividing numerator with denominator:

for i in range(M):                
    numerator = alpha[t,i] * a[i,:] * b[:,O[t+1]].T * beta[t+1,:].T                
    xi[i, :, t] = numerator / denominator

Note that calculation of numerator is very similar to calculation of denominator. Instead of taking alpha and a for all states at time t, we now care about specific state i. We are looping them all through at each time step t. For each timestep we’ll divide numerator with denominator and store result in variable xi.

Next we’ll calculate true values of gamma (γᵢ(t)) by summing over dimension 1 (which is j in γᵢ(t) equation):

gamma = np.sum(xi, axis=1)

Now we can do a maximization step. This straight from the formula of aᵢⱼ, where we sum xi over dimension 2 (over timesteps) and divide it with summation of γᵢ(t) over dimension 1 (over timesteps, gamma has already one dimension less because of previous step summation over dimension 1 of xi):

a = np.sum(xi, 2) / np.sum(gamma, axis=1).reshape((-1, 1))

Next we’ll add additional element to gamma. This is needed as we have calculate gamma for T-1 timesteps, but we need T emission probabilities (bⱼₖ) (for example, if we have 3 observations, we’ll have two transitions between states and 3 emission probabilities from hidden states). We’ll just add penultimate element of gamma one more time as last element:

gamma=np.hstack((gamma, np.sum(xi[:,:,T-2],axis=0).reshape((-1,1))))

Final preparations for calculating bⱼₖ involve setting up parameter K, which indicates number of unique observations. Also we’ll calculate denominator which involves summing gamma at dimension 1 (over timesteps):

K = b.shape[1]
denominator = np.sum(gamma, axis=1)

Finally we can calculate bⱼₖ by looping over unique observed values (unique O-s). This is needed as for each unique observed value we have to sum gamma over timesteps (if observation had this value). Note that O == l is indicator function. Last step is to divide numerator with denominator:

for l in range(K):
            b[:, l] = np.sum(gamma[:, O == l], axis=1)
b = np.divide(b, denominator.reshape((-1, 1)))

We can repeat this process for sufficient times we could approximate accurate values for A and B.

Usage

Full example how to use our Baum-Welch implementation is here. Here I’ll show some code snippets:

data = pd.read_csv('data_python.csv.txt')
V = data['Visible'].values# Transition Probabilities
a = np.ones((2, 2))
a = a / np.sum(a, axis=1)
# Emission Probabilities
b = np.array(((1, 3, 5), (2, 4, 6)))
b = b / np.sum(b, axis=1).reshape((-1, 1))
# Equal Probabilities for the initial distribution
initial_distribution = np.array((0.5, 0.5))#train model
n_iter = 100
a_model, b_model = baum_welch(V.copy(), a.copy(), b.copy(), initial_distribution.copy(), n_iter=n_iter)
print(f'Custom model A is \n{a_model} \n \nCustom model B is \n{b_model}')#Custom model A is 
#[[0.53816345 0.46183655]
# [0.48664443 0.51335557]]#Custom model B is 
#[[0.16277513 0.26258073 0.57464414]
# [0.2514996  0.27780971 0.47069069]]

To validate results, we should get close results to hmmlearn package implementation:

model = hmm.MultinomialHMM(n_components=2, n_iter=n_iter, init_params="")
model.startprob_ = initial_distribution
model.transmat_ = a
model.emissionprob_ = bmodel.fit([V])print(np.allclose(a_model, model.transmat_, atol=0.1))
print(np.allclose(b_model, model.emissionprob_, atol=0.1))#True
#True

Simple implementation gives reasonably similar results which gives confidence that our version of Baum-Welch is working.

Summary

This showed some math behind training HMMs using Baum-Welch algorithm and presented one python implementation. Hope this post gave some understanding how HMMs are trained and how to implement it in python.

References

Expectation–maximization algorithm, Wikipedia
Derivation and implementation of Baum Welch Algorithm for Hidden Markov Model, Abhisek Jana, A Developer Diary
Hidden Markov Models 12: the Baum-Welch algorithm, Westmont College, Youtube
Hidden Markov Models — Part 1: the Likelihood Problem,
Maria Burlando, Medium
Hidden Markov Models — Part 2: the Decoding Problem, Maria Burlando, Medium
Machine Learning — Hidden Markov Model (HMM), Jonathan Hui, Medium
Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Lawrence Rabiner, Department of Electrical and Computer Engineering

Baum-Welch algorithm

Under the hood of Hidden Markov Models training

Intro

Goal

Expectation maximization

Initial equations

Usage

Summary

References

Written by Risto Hinno

Responses (1)