Using Hidden Markov Models to Align Multiple Sequences

David W. Mount

doi:10.1101/pdb.top41

Adapted from Bioinformatics: Sequence and Genome Analysis, 2nd edition, by David W. Mount. CSHL Press, Cold Spring Harbor, NY, USA, 2004.

INTRODUCTION

A hidden Markov model (HMM) is a probabilistic model of a multiple sequence alignment (msa) of proteins. In the model, each column of symbols in the alignment is represented by a frequency distribution of the symbols (called a “state”), and insertions and deletions are represented by other states. One moves through the model along a particular path from state to state in a Markov chain (i.e., the next move is chosen at random, with probabilities that depend only on the current state), trying to match a given sequence. At each state along the path, the next matching symbol is chosen and its probability (frequency) in that state is recorded, together with the probability of arriving at that state from the previous one (the transition probability). The state and transition probabilities are then multiplied to obtain a probability of the given sequence along that path. The model is “hidden” because the value of a specific state is not known; it is instead represented by a probability distribution over all possible values. This article discusses the advantages and disadvantages of HMMs in msa and presents algorithms for calculating an HMM and the conditions for producing the best HMM.
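The multiplication of state and transition probabilities described above can be illustrated with a minimal Python sketch. It assumes a toy model with only two match states and no insertion or deletion states; the state names (`M1`, `M2`), the probability values, and the function `path_probability` are hypothetical and chosen only for illustration, not taken from the source.

```python
# Toy model (assumed values): two match states between Begin and End.
# Each match state holds a frequency distribution over amino acid symbols.
emissions = {
    "M1": {"A": 0.7, "G": 0.3},
    "M2": {"C": 0.6, "T": 0.4},
}

# Transition probabilities between states (assumed values).
transitions = {
    ("Begin", "M1"): 1.0,
    ("M1", "M2"): 0.9,
    ("M2", "End"): 1.0,
}


def path_probability(sequence, path):
    """Multiply transition and symbol probabilities along one path."""
    if len(sequence) != len(path):
        raise ValueError("this sketch needs one match state per symbol")
    prob = 1.0
    prev = "Begin"
    for symbol, state in zip(sequence, path):
        # Probability of moving into this state from the previous one.
        prob *= transitions.get((prev, state), 0.0)
        # Probability (frequency) of the symbol in this state.
        prob *= emissions[state].get(symbol, 0.0)
        prev = state
    # Close the path by moving to the End state.
    prob *= transitions.get((prev, "End"), 0.0)
    return prob


# Sequence "AC" on path M1 -> M2: 0.7 * 0.9 * 0.6, about 0.378.
p = path_probability("AC", ["M1", "M2"])
```

A symbol that never occurs in a state (for example, `A` in `M2`) gives a probability of zero in this sketch. A full profile HMM would also include insertion and deletion states and would consider more than one path through the model.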