# Viterbi Algorithm Clarified

All articles about Vitebi algorithm, which I found, seemed too complicated and hard to understand. They either had too much theory and no examples, or too complex example without an abstract description. I will try to write a my own article and clarify things a little.

## Defining a problem

It is hard to understand something without knowing the exact purpose. Let's sketch a specific problem and talk about possible solutions.

You are a doctor in a little town. There is a patient, who visited you for 3 days in a row. To simplify things a bit, the patient can be in one of 2 states: (Healthy, Fever) and he can tell you 3 feelings: (Normal, Cold, Dizzy). During these 3 days, he told you, that he feels Normal (1st day), Cold (2nd day), Dizzy (3rd day).

You also have some scientific results, which can help you guess the state of the patient:

**Start**: in general, 60% of people are Healthy, 40% have Fever**Transition**:- when Healthy, there is 70% chance of remaining Healthy the next day (30% chance of Fever)
- when Fever, there is 40% chance of becoming Healthy the next day (60% chance of Fever)

**Emission**:- when Healthy, he reports: 50% Normal, 40% Cold, 10% Dizzy
- when Fever, he reports: 10% Normal, 30% Cold, 60% Dizzy

Given the specific 3-day observation **(Normal, Cold, Dizzy)**, we must find the most probable real state in each day. It could be (Fever,Fever,Fever), or (Healthy,Healthy,Fever) etc. For 3 days and 2 possible real states, there are $2^3=8$ possible triples. We can't know the real state of the patient, all we can do is finding a good guess - most probable triple of states.

## Naive approach

What is the chance, that the patient was (Healthy,Fever,Fever)? The chance of Healthy on the 1st day is 60% (Start probability), Fever on the 2nd day is 30% (transition Healthy->Fever) and Fever on the 3rd day is 60% (transition Fever->Fever). The total probability of (Healthy,Fever,Fever) is $0.6 \cdot 0.3 \cdot 0.6 = 0.108$. If we compute probabilities of all triples of states (which must sum up into 1), we will see, that the most probable is (Healthy,Healthy,Healthy), which has the probability $0.6 \cdot 0.7 \cdot 0.7 = 0.294$.

Such results hold for anybody in town, we completely ignore our observations. Let's compute the new probability, while taking observations into account:

What is the chance, that patient was (Healthy,Fever,Fever), when reporting (Normal,Cold,Dizzy)? The chance of Healthy on the 1st day + reporting Normal when healthy is 60% x 50%. The chance of Fever on the 2nd day + reporting Cold is 30% x 30%. The chance of Fever on the 3rd day + reporting Dizzy is 60% x 60%. In total, the probability of (Healthy,Fever,Fever) is $ (0.6\cdot 0.5) \cdot (0.3 \cdot 0.3) \cdot (0.6 \cdot 0.6) = 0.00972 $.

If we compute probabilities of all triples of states, we will see, that **the most probable is (Healthy,Healthy,Fever)** with probability $ (0.6\cdot 0.5) \cdot (0.7 \cdot 0.4) \cdot (0.3 \cdot 0.6) = 0.01512 $. Why are the values so small? Will they sum up to 1? These are probabilities of a specific state AND a specific observation. They will sum up to the probability of that observation.

Our problem is very easy to solve. We just compute the probabilities of each possible real states and choose the most probable one. The problem is, that when we are observing the patient for 10 days, there are $2^{10}=1024$ possible tuples of states, that we must examine. And for 100 days, it seems almost impossible to solve.

## Smart approach (Viterbi algorithm)

The smart solution of our problem is based on dynamic programming. For each day and each state (Healthy or Fever) we will compute the most probable paths (sequences of Healthy / Fever and probabilities) for geting into that state on that day (. Let's calculate it for the first day.

Observation:**(Normal, Cold, Dizzy)**

Day: | 1 | 2 | 3 |
---|---|---|---|

Healthy | 0.6*0.5=0.3 (H & Normal) | ||

Fever | 0.4*0.1=0.04 (F & Normal) |

### The basic principle

The Viterbi algorithm is based on an observation, that the optimal path to each day (and each state) can be easily deduced from the optimal path to the previous day (and each state).

What is the "most probable scenario" to being Healthy on the 2nd day? We can deduce it from the previous day! If we were Healthy on the 1st day, we remain Healthy witht the probability:

(H before) * (H->H) * (H & Cold) = 0.3 * 0.7 * 0.4 = 0.084.

If we had Fever on the 1st day, we are Healthy with the probability:

(F before) * (F->H) * (H & Cold) = 0.04 * 0.4 * 0.4 = 0.0064.

The bigger value is 0.084, so the most probable scenario to being Halthy on the 2nd day is being Healthy on the 1st day. What about Fever on the 2nd day?

(H before) * (H->F) * (F & Cold) = 0.3 * 0.3 * 0.3 = 0.027. (F before) * (F->F) * (F & Cold) = 0.04 * 0.6 * 0.3 = 0.0072.The maximum is 0.027 (the patient was probably Healthy the day before). Let's fill the whole table using that method.

Day: | 1 | 2 | 3 |
---|---|---|---|

Healthy | 0.3 | 0.084 (>0.0064) | 0.00588 (>0.00108) |

Fever | 0.04 | 0.027 (>0.0072) | 0.01512 (>0.00972) |

In order to reconstruct the optimal path, we must remember the previous value, that was chosen for each day and each state (the one that yielded the maximum probability). The maximum value (the most probable state) for the third day is Fever, it was computed from Healthy on the 2nd day, which was computed from Healthy on the 1st day, so the optimal path is (Healthy,Healthy,Fever).

## Generalization

How would the algorithm work, if there were more possible states (Healthy, Fever, Disease1, Disease2 ...) or more possible observations (Normal, Cold, Dizzy, Shaking, Nausea, Headache)? The transition and emission tables would be much larger. We would have to go through more values when computing each column of our table. But the principle would be the same.

For T days and K possible states, algorithm computes T columns of the table, K cells in each column. For each cell, it analyzes K cells in the previous column to find the best one. The algorithm takes time $\Theta (T \cdot K^2)$. The final selection of the path in time T fits into our complexity.

## Other examples

Let's make an autocorrect program. When user types "toqer" into computer, we will offer him to change it into "tower", because he probably misprinted the letter "Q", which is near the letter "W" on the keyboard.

Instead of true health states and feelings, we will have the letters of the alphabet. Instead of guessing the true health state for 5 days, we will be guessing the true 5-letter word, that user had in his mind (unknown). The 5-day observation is the 5-letter word, which user actually typed on the keyboard (toqer).

- Start: the probability, that some letter is the first letter of some English word (26 values).
- Transiton: probabilities, e.g. that the letter O will be followed by letter Q, etc (26 x 26 values).
- Emission: probabilities, that user has typed what he wanted to type. E.g. when user wants to type F, he types F in 90%, G in 4% (because it is near F on the keyboard), etc (26 x 26 values).

The start and transition probability can be obtained by analyzing some large amount of English text, Emission table can be created manually.

When observing the word "toqer", we can compute the most probable "true word" using Viterbi algorithm in the same way we used it earlier, and get the true word "tower". Viterbi algorithm can be used for solving many classes of problems, which seem to be completely unrelated at the first sight.