# Finding Words with Unexpected Frequencies in Deoxyribonucleic Acid Sequences

### Bernard PRUM, François RODOLPHE and Élisabeth de TURCKHEIM

### *J.R.Statist.Soc. B*, vol. 57, 205-220, 1995.

**Abstract**

Considering a Markov chain model for DNA sequences, this paper
proposes two asymptotically normal statistics to test whether the
frequency of a given word is concordant with the first-order Markov
chain model. The problem is to choose an estimate
$\hat{\mu}(W)$ of the expectation of the frequency $M_{W}$ of a
word $W$ in the observed sequence such that the asymptotic
variance of $M_{W}-\hat{\mu}(W)$ is easily computable. The first
estimator is derived from the frequency of $W^{[-1]}$, which is $W$
with its last letter deleted. The second, following an idea of
Cowan (1991), is the conditional expectation of $M_{W}$ given the
observed frequencies of all 2-letter words. Finally, two examples,
on phage $\lambda$ and phage T7, are shown.
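
As a rough illustration of the idea behind the first estimator, the sketch below counts overlapping occurrences of a word and forms a plug-in estimate from the count of $W^{[-1]}$ multiplied by an estimated first-order transition probability into the last letter. This plug-in form is an assumption made for illustration; the abstract does not give the paper's exact definition, and the asymptotic variance needed for the test statistic is not reproduced here. The function names `count_words` and `expected_from_prefix` are hypothetical.

```python
from collections import Counter


def count_words(seq, k):
    """Count overlapping occurrences of every k-letter word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_from_prefix(seq, word):
    """Illustrative plug-in estimate of the expected count of `word`.

    Assumed form (not taken from the paper): the observed count of the
    word with its last letter deleted, times the estimated probability
    of moving from the penultimate letter to the last letter.
    """
    if len(word) < 2:
        raise ValueError("word must have at least two letters")
    prefix = word[:-1]
    n_prefix = count_words(seq, len(prefix))[prefix]
    # Count of the final 2-letter word (penultimate letter, last letter).
    n_pair = count_words(seq, 2)[word[-2:]]
    # Times the penultimate letter occurs with a successor in the sequence.
    n_from = sum(1 for c in seq[:-1] if c == word[-2])
    if n_from == 0:
        return 0.0
    return n_prefix * n_pair / n_from


if __name__ == "__main__":
    seq = "AABAB"  # toy sequence, not real phage data
    word = "AAB"
    observed = count_words(seq, len(word))[word]
    print(observed, expected_from_prefix(seq, word))
```

Comparing the observed count with such an estimate only becomes a test once the difference is divided by its asymptotic standard deviation, which is the quantity the paper is concerned with making easy to compute.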

**Key words and phrases**
Words in DNA sequences, unexpected
frequencies, Markov chains, central limit theorems.
