John F. Chandler
Abstract
Relative per-locus mutation rates
for Y-DNA microsatellites, and also for mitochondrial
DNA single-nucleotide polymorphisms, can be estimated directly from diverse
collections of haplotypes without segregating population components. The calibrated average mutation rates for the
FTDNA panels of 12, 25, and 37 loci are 0.00187±0.00028, 0.00278±0.00042, and
0.00492±0.00074, respectively, from a collection of haplotypes from Y-Search.
The individual per-locus rates are given here.
Address for correspondence: john.chandler (at) alum.mit.edu
Received:
Introduction
The task of estimating mutation rates for human DNA is
troublesome, whether for individual loci or for averages over panels of
loci. The mutations are so infrequent,
and the generation spans so long, that direct observation of a statistically
useful sample is expensive and time-consuming.
Nonetheless, these rates are of great interest in genetic genealogy and
related fields. Techniques for circumventing
this basic difficulty to gather the necessary data include: “piggybacking” on
paternity tests (Gusmão et al. 2005), testing
judiciously chosen subjects with interconnected, deep-rooted pedigrees (Heyer et al. 1997), and sifting large, heterogeneous
collections of measured haplotypes using statistical models to extract the mutation
rates from the other information present (Zhivotovsky
et al. 2004; Hutchison et al. 2004). The
latter technique needs calibration, but can be applied to data collected for
other purposes. Indeed, the very same
DNA testing that demands knowledge of the mutation rates for interpreting the
results can lead to the determination of those rates. Of particular interest is the technique
introduced by Hutchison et al (2004) of sorting pairs of haplotypes by
closeness and extracting information from the match profiles. This technique requires a cumbersome extra
step of identifying and isolating a sub-population and depends on the dangerous
assumption that the chosen sub-population is entirely characterized by a single
time depth, but there is a simpler method that avoids these problems. In this paper, I develop expressions for the
“mutation model curve” (MMC) of Hutchison et al. (2004) and outline a procedure
for using the high-match end of the MMC for extracting mutation rates. Throughout this paper, the mutation rates are
assumed to be independent of time, environment, and haplotype. Since these assumptions may be false,
especially the last (Dupuy et al. 2004), the results
must be taken in the context of the data considered. Of these factors, only haplotype is
potentially accessible to a more detailed analysis of currently existing data.
Mathematical Model
First, we
must define a function
as the probability
that all loci, except locus j, match
between two haplotypes separated by a total of g generations. Each
haplotype consists of N loci, each
locus j with a different mutation
rate given by
. If we assume the mutation rates are small, we may
approximate mutation probabilities by an exponential function. In terms of the infinite alleles model, we
would write the probability of a single locus j remaining at the ancestral allele after g generations as
, while the probability of a mutation at that locus would be
. Then
is just the product of
the N individual probabilities at the
N loci:
1
where
the second line of Equation 1 comes from multiplying and dividing the first by
. Note that g, the separation between two
haplotypes, can be viewed either as the number of generations from an ancestor
being compared to a particular descendant or as the “round-trip” number of
generations from one person back to a shared ancestor and then forward to the
contemporary being compared. Since most
DNA testing is done on living individuals, the latter interpretation of g is more commonly applicable, but the
former is equally valid. In the limit of
small
the choice of mutation
model is unimportant, and, to leading order in the small parameters, we would
have for either the infinite alleles model or the
stepwise mutation model:
2
where Q(g)
is the probability of a match on all loci in whichever model. Similarly, we may write the probability that
two haplotypes match at all loci except j,
k, and l as
3
and so on
for any number of mismatching loci. Next
we need an expression for the probability that a mutation has occurred at some
number of loci, b, while the
remainder have remained at the ancestral alleles. As an example, consider the case where
exactly three loci (any three) out of a total of five loci, have mutated. The probability of this occurring is just the
sum of the probabilities of all possible triplets of mutating loci (each given
by Equation 3):
4
where
we define
as the sum of all
products of b distinct elements of
the set
. For the general case, the probability that b loci out of N have mutated is given by
5
Now
consider the probability that a particular
locus has mutated along with any two others, i.e., the probability of a mismatch at the specified locus when all
but three of the loci match. This
probability is similar to that shown in Equation 4, except that the sum within
the brackets includes only those terms that contain, say, µ1. That is
(continuing to use the example of b=3
and N=5), the probability is given by
6
where
we define a function
as the sum of all products of b-1 distinct elements of the set
. The limiting case for both this new function and the
original unsubscripted C is defined as C(0) = 1. Thus, the probability of a
mismatch at locus j when all but b of the loci match, given a separation
of g generations, can be written as
7
where b must be greater than 0, since locus j is a mismatch by definition. Clearly, the probability must vanish when b is 0.
In general, we must deal with a population of haplotype pairs with a
range of separations, and we thus must calculate
, the overall probability of a mismatch at locus j for the whole population when all but b of the loci match. We do so by
weighting the probability in Equation 7 by the fraction
of the population of
pairs having a separation of g
generations and summing over g:
8
Similarly,
we may write D(b), the total probability of b
mismatches, in the same population as
9
The salient
feature of Equations 8 and 9 is the fact that they share a common factor
encapsulating the unknown population statistics
, and thus these equations differ only in terms that depend
just on the mutation rates. It is
therefore useful to define a mismatch profile function
10
This
function is the conditional probability of a mismatch on locus j, given b mismatches in all. This is directly related to the MMC as defined
by Hutchison et al. (2004); their
(probability of a
match at locus i
given n matches) is just
. By construction,
and
, since no locus can mismatch if the number of mismatches is
0, and every locus must mismatch if the number is N. Of course, the MMC will
depart increasingly from Equation 10 as b
grows beyond the linear limits assumed in Equation 2, since the population
structure will contribute differently to the non-linear terms omitted from
Equations 8 and 9.
Equation
10 can be inverted to give an expression for the mutation rate in terms of the
(observed) mismatch profiles and the C functions:
11
Of course,
Equation 11 is recursive, in that the C
values depend on the mutation rates, but it can serve as the basis for an
iterative procedure for calculating the rates.
The definition of C involves
combinations of many terms—so many that the direct computation becomes
prohibitive at relatively modest values of b
and N. It can be shown that C obeys the recursion relation
12
where
13
In the
trivial case where the individual mutation rates are all the same, Equation 12
simplifies to
where
is the uniform
mutation rate, and the parentheses indicate binomial coefficients, where the
coefficient “N take b” stands for N!/(b![N-b]!).
Also, Equation 10 simplifies to
15
Thus, in
the absence of population structure, the MMC for uniform mutation rates would
be straight lines from the (0, 0) to (N,
1), exactly as one would expect. This simple linear form suggests a practical
iterative procedure for determining the relative mutation rates from the
mismatch profiles:
1)
Choose a suitable value b as large as possible, but small enough that the mismatch
functions are small compared to 1.
2)
From the data, find the values
for all loci j.
3)
Initialize the mutation rate estimates to
, where r is a normalization
factor chosen to give values in a convenient range. (Since we have cancelled out the population
statistics, it is clear that we can obtain only relative rates from this
analysis, and so the normalization is arbitrary.)
4)
Use the mutation rate estimates to evaluate Equations 13,
12, and 11 to obtain a new set of estimated rates.
5)
Repeat Step 4 until convergence.
6)
Repeat Steps 2-5 for all values of b from 1 to the value chosen in Step 1 and take a weighted average
of the mutation rate estimates. (See below for a discussion of error analysis.)
Error
analysis
Analyzing
data in pairs instead of singly introduces correlations between pairs with
shared observations. Thus, a proper
error analysis of this procedure would be complicated by the need to deal with
all pairs of observations. Indeed,
correlations could even bias the
results. However, since the mismatch
profiles are computed on restricted subsets of the pairs, the pair-to-pair
correlation within each bin is greatly reduced.
Thus, a simple error analysis treating each pair as an independent
observable should suffice, with just one modification. Normally, the
least-squares estimate of a probability p
from statistics of a population of N
cases carries an uncertainty of {p(1-p)/N}0.5,
but the relevant N here must be the
lesser of the number of pairs found in a given b-bin and the total number of haplotypes, since the latter is the
number of independent data. The uncertainties for the mismatch probability are
scaled to mutation rate uncertainties as in Equation 11, and these in turn are
used as the relative weights for the weighted average described in Step 6 above
(in the usual inverse-square sense).
Calibration
The final
component of this analysis is the calibration for converting the relative
mutation rates to absolute rates.
Pedigree-based rate determinations offer the advantage of “leverage,”
whereby each test subject carries an accumulation of mutations over many
generations (though not so many that multiple mutations on any one locus would
be an issue). However, the collection of
pedigree data from volunteers who already know the results of the DNA tests
leads to serious risks of selection bias in the data. At present, the only reliable large
collection of mutation statistics for Y-DNA microsatellites
is that collected from father-son pair studies by Gusmão
et al. (2005). To make use of such
statistics, we must simultaneously fit the observables
(number of mutations
for locus i
in the calibration data) and
(the “observed”
relative mutation rate) with a simple model by weighted-least-squares analysis:
16
where
is the absolute
mutation rate for locus i,
c is the calibration factor between
relative and absolute rates, and
is the number of
meioses for locus i
in the calibration data. A third
relation using
and c′ can be
used to include a second, independent set of rate estimates in the analysis.
Application
to Y-STRs
I have
applied this procedure to a collection of 8430 Y haplotypes downloaded from Ysearch in July of 2006 as a demonstration. In the parlance of Ysearch,
each haplotype consists of 37 loci, but this set includes five multi-copy loci,
such that the number of independent loci is only 30. Each copy of a multi-copy locus in one
haplotype must match the corresponding copy in the other haplotype if the locus
is to be considered a match. This
grouping of the loci avoids the necessity of guessing which copy corresponds to
which in the two haplotypes being compared.
About half of the collection belongs to haplogroup R1b as specified by
the contributors, and the other half is an assortment of other haplogroups.
Most of
the information carried by this collection is filtered out in Step 1 of the
analysis, since only nearly-matching pairs are considered. In Figure 1 and in each panel
of Figure 2, there is
a vertical line marking the lower limit of matching used in Steps 2-6, and it
is apparent from Figure 1
that only a tiny fraction of the pairs is included. The tall peak on the left of Figure 1 represents the
“typical” inter-haplogroup comparison and is thus characteristic of the
particular mix of haplogroups included in the data collection. The peak for this collection is at 7/30
matching and thus corresponds to a very old population. The slightly smaller peak on the right
represents the “typical” within-haplogroup comparison, mainly the comparison
within R1b, which dominates this collection.
It should be possible to gather collections whose histograms would show
more than two peaks by focusing on distinctive clades within haplogroups. However, as noted above, the information
contained in the peaks of Figure
1 is ignored in the present analysis.
Examination
of Figure 2 shows
that the “ideal” straight-line MMC indicated by Equation 15 is seldom
realized. Indeed, it can be shown that,
even in the absence of population structure (such as is strikingly revealed in
Figure 1), the MMC based on Equation 10 is generally curved when the
locus-specific mutation rates are not the same.
The curvature is positive for loci with below-average mutation rates and
negative for loci with above-average rates.
Attempting to fit straight lines to the MMCs
without taking this curvature into account would tend to compress the apparent
dynamic range of mutation rates, thus reducing the estimates of high rates and
raising the estimates of low rates.
The
results of the analysis for this collection of Y haplotypes, combined with an independent
set of 6955 20-locus Ysearch haplotypes, are
presented in Table 1. These
results include the calibration of Equation 16, and the uncertainties shown
here include the contribution of the calibration and the level of agreement
between the 20- and 30-locus sets.
However, the error analysis here makes no provision for the
uncertainties due to the assumption of constant rates.
Table 1.
Calibration data and final results for absolute mutation rates (per copy for
multi-copy loci, indicated by asterisks). Weighted fit to 6955 20-locus and
8340 30-locus haplotypes.
|
Locus |
Calibration data |
Best
fit |
Std.
dev. |
Locus |
Calibration data |
Best
fit |
Std.
dev. |
|
DYS393 |
0.00075 |
0.00076 |
0.00014 |
DYS447 |
|
0.00264 |
0.00041 |
|
DYS390 |
0.00227 |
0.00311 |
0.00048 |
DYS437 |
0.00222 |
0.00099 |
0.00017 |
|
DYS19 |
0.00168 |
0.00151 |
0.00025 |
DYS448 |
|
0.00135 |
0.00020 |
|
DYS391 |
0.00351 |
0.00265 |
0.00041 |
DYS449 |
|
0.00838 |
0.00128 |
|
DYS385
* |
0.00224 |
0.00226 |
0.00035 |
DYS464 * |
|
0.00566 |
0.00087 |
|
DYS426 |
|
0.00009 |
0.00002 |
DYS460 |
0.00450 |
0.00402 |
0.00069 |
|
DYS388 |
0.00057 |
0.00022 |
0.00004 |
Y-GATA-H4 |
0.00290 |
0.00208 |
0.00033 |
|
DYS439 |
0.00530 |
0.00477 |
0.00073 |
YCAII * |
0.00000 |
0.00123 |
0.00019 |
|
DYS389i |
0.00188 |
0.00186 |
0.00028 |
DYS456 |
|
0.00735 |
0.00115 |
|
DYS392 |
0.00061 |
0.00052 |
0.00010 |
DYS607 |
|
0.00411 |
0.00066 |
|
DYS389ii |
0.00226 |
0.00242 |
0.00041 |
DYS576 |
|
0.01022 |
0.00167 |