Haplogroup Prediction from Y-STR Values Using an Allele-Frequency Approach
T. Whit Athey
A new
approach to predicting the Y-chromosome haplogroup from a set of Y-STR marker
values is presented and compared to other approaches. The method has been implemented in an
Excel-based program, where an arbitrary number of STR markers may be input and
a “goodness of fit” score for 10 haplogroups (E3a, E3b, G, I1a, I1b, I1c, J2,
N3, R1a, and R1b) is returned. This
method has been applied to 101 R1b haplotypes and 50 I1a haplotypes (all having
37 STR markers available), and the distribution of results is presented. In the case of I1a, the results are compared
with the predictions of another method.
Introduction
Many people have taken advantage of
the availability of reasonably priced Y-chromosome testing of short tandem
repeats (STRs). The resulting data can
be useful in confirming genealogical relationships between two or more males. The set of repeat values that is obtained for
a set of Y-chromosome markers is called a haplotype.
There is also considerable interest
in determining the Y-chromosome haplogroup,
a group or family of Y-chromosomes related by descent. Y haplogroups are determined by the pattern
of single nucleotide polymorphisms
(SNPs), which can also be tested directly. However, determining the
haplogroup by direct SNP testing can sometimes be a lengthy process. Therefore, there is considerable interest in
predicting the haplogroup from a set of STR markers.
One of the major DNA testing
companies, Family Tree DNA (FTDNA), in cooperation with the University of
Arizona (UAZ), uses a proprietary algorithm to predict the haplogroup for
persons who have their Y-STR values tested by FTDNA. The prediction algorithm has not been
published, but it appears to be based upon the genetic distance[1]
of the haplotype in question to other haplotypes in the
The disadvantage of this approach is
that if no prediction can be made, then the customer gets no information, even
if it is very clear that some haplogroups could be ruled out, or that the
haplotype is probably in one of a small number of possible haplogroups. Theoretically, the most likely haplogroups
could be provided to the customer using this approach, but this is not currently
done.
Another approach is based on the
allele frequencies for each haplogroup and how well a given test haplotype fits
the pattern of alleles in each haplogroup.
This approach is outlined below. It has been available on a web
site since October 2004 and has been used by many people. It allows any number of the FTDNA set of 37
markers to be entered, and the program returns a “goodness of fit” score for 10
haplogroups (E3a, E3b, G, I1a, I1b, I1c, J2, N3, R1a, and R1b). More than 98% of people of West European
extraction fall into one of these 10 haplogroups. While the program is known as a “predictor”
program, it really just reports how well the given haplotype
fits the pattern of previously reported STR values for a haplogroup.
Nomenclature
In this paper, the order of presentation
of Y-STR values is that traditionally employed by FTDNA. The 37 markers presently tested by FTDNA are
the only markers for which sufficient allele frequency data are available to
make the haplogroup prediction possible.
The 37 markers, ordered as per the FTDNA convention, may be seen at the
following web site:
http://www.ftdna.com/9markers.html
Rarely, in some haplotypes, there are
extra repeat values for markers such as DYS019 (also called DYS394) and
DYS464. These were ignored for purposes
of the method described in this paper.
Methods
Let the test haplotype be represented by
the set of repeat values {wj}. This set
can cover 12, 25, 37, or any other number of markers
up to 37. For example, we could consider
the set of values that represent what FTDNA calls the “Western Atlantic Modal
Haplotype” (WAMH):
{wj} = {13, 24, 14, 11, 11, 14, 12, 12, 12, 13, 13, 16}
In this case the index j runs from 1
to 12.
Let fij(x) represent the
allele frequency at the jth marker for the ith haplogroup, where x represents
the value (repeat count) of the allele.
These allele frequencies are determined empirically from public
databases and published haplotypes.[2] For a given haplogroup i, fij(x) forms a table of values
where the rows are labeled with the repeat values and the columns are labeled
with the DYS marker names. Table 1
represents an example for the R1b haplogroup, using only the first 12 markers
for simplicity. In practice, there will
be many more columns of markers, 37 in the present implementation, and there
will also be more rows required for many of the other markers. There will be a table like this for each
haplogroup in the prediction program, the haplogroups being labeled with the
index i, and the markers being labeled with the index j.
In Table 1, the values in the column
labeled with DYS426, for example, show the frequency of occurrence of the
repeat values 10, 11, 12, 13, and 14, where we see that almost all (98%) R1b
haplotypes have a repeat value of 12, with small percentages for the other four
closest values. Note also that the great
majority of the table is “empty”; that is, most cells contain a frequency of
zero (no haplotype has yet been found with those repeat values on
those markers).
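As a minimal sketch of how such a table might be held in a program, the Python fragment below stores two columns of Table 1 as nested dictionaries and looks up a frequency and a modal value. The names R1B_FREQS, allele_frequency, and modal_value are hypothetical and chosen only for illustration; the Excel-based program described in this paper may organize its data quite differently.

```python
# Two columns of Table 1 (haplogroup R1b), written as fractions
# rather than percentages. Repeat values that are not listed are
# treated as having a frequency of zero (the "empty" cells).
R1B_FREQS = {
    "DYS391": {9: 0.004, 10: 0.318, 11: 0.628, 12: 0.049, 13: 0.001},
    "DYS426": {10: 0.001, 11: 0.005, 12: 0.980, 13: 0.010, 14: 0.004},
}


def allele_frequency(table, marker, repeats):
    """Return the tabulated frequency of one repeat value at one marker."""
    return table[marker].get(repeats, 0.0)


def modal_value(table, marker):
    """Return the most frequent repeat value at one marker."""
    column = table[marker]
    return max(column, key=column.get)


print(modal_value(R1B_FREQS, "DYS426"))           # 12
print(allele_frequency(R1B_FREQS, "DYS426", 12))  # 0.98
print(allele_frequency(R1B_FREQS, "DYS426", 20))  # 0.0
```

Storing only the nonzero cells keeps each column short, which suits a table that is mostly empty.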
Next we compute, for the test
haplotype, the “goodness of fit” parameter for the ith haplogroup. This calculation is straightforward in principle,
but it involves several steps. The approach first
calculates, for a given test haplotype, the following ratio for each marker:
f_{ij}(w_j) / f_{ij}(w_{i,max})
where the f represents the table of
allele frequencies. That is, for the jth
marker and the ith haplogroup, we calculate the frequency from the table for
the test haplotype’s repeat value for that marker, and divide by the frequency
of the modal value for that marker (in that particular haplogroup). As an example, let’s calculate this ratio for
the fourth marker (DYS391) in the haplogroup R1b for the following test
haplotype, which has a value for DYS391 of 10:
{wj} = {13, 24, 14, 10,
11, 14, 12, 12, 13, 13, 13, 16}
We look at the column in Table 1
labeled with DYS391 and go down the column to the row corresponding to repeat
value of 10, and here we find the frequency of 0.318 (31.8%). This is not the modal value for
this haplogroup; the modal value is 11.
For the denominator of the ratio we are calculating, we take the
frequency of the modal value, that is, the frequency for a value of 11, which
is 0.628. Then our ratio becomes:
f_{ij}(w_j) / f_{ij}(w_{i,max}) = 0.318 / 0.628 = 0.506
The overall “goodness of fit”
parameter for that haplogroup is simply the geometric mean[3]
of all of these ratios (one for each marker).
The calculation of the “goodness of fit” parameter is illustrated in
detail in Table 2 for the test
haplotype above and haplogroup R1b.
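The two steps just described (a per-marker ratio, then a geometric mean of the ratios) can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in the Excel-based program: the function names are hypothetical, only four columns of Table 1 are included to keep the example short, and a repeat value with zero tabulated frequency simply drives the score to zero here, which may not match how the actual program treats unobserved values.

```python
import math

# Four columns of Table 1 (haplogroup R1b), written as fractions.
# Repeat values that are not listed have a frequency of zero.
R1B_FREQS = {
    "DYS393": {12: 0.020, 13: 0.954, 14: 0.025},
    "DYS391": {9: 0.004, 10: 0.318, 11: 0.628, 12: 0.049, 13: 0.001},
    "DYS426": {10: 0.001, 11: 0.005, 12: 0.980, 13: 0.010, 14: 0.004},
    "DYS439": {10: 0.003, 11: 0.146, 12: 0.741, 13: 0.095, 14: 0.013,
               15: 0.001, 16: 0.001, 19: 0.001},
}


def marker_ratio(column, repeats):
    """Frequency of the test value divided by the modal frequency."""
    modal_frequency = max(column.values())
    return column.get(repeats, 0.0) / modal_frequency


def goodness_of_fit(table, haplotype):
    """Geometric mean of the per-marker ratios for one haplogroup.

    Assumption of this sketch: a zero ratio on any marker makes the
    whole score zero.
    """
    if not haplotype:
        raise ValueError("at least one marker value is required")
    ratios = [marker_ratio(table[marker], repeats)
              for marker, repeats in haplotype.items()]
    return math.prod(ratios) ** (1.0 / len(ratios))


# Values of the test haplotype from the text, for these four markers.
test_haplotype = {"DYS393": 13, "DYS391": 10, "DYS426": 12, "DYS439": 13}

print(round(marker_ratio(R1B_FREQS["DYS391"], 10), 3))  # 0.506
print(goodness_of_fit(R1B_FREQS, test_haplotype))
```

Because only four markers are used, the score printed on the last line is not comparable to a 12-marker or 37-marker result; it only illustrates the arithmetic.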
Table 1
Allele Frequencies for Haplogroup R1b (rows: repeat value; columns: DYS marker number)

| Repeat | 393 | 390 | 019 | 391 | 385a | 385b | 426 | 388 | 439 | 389a | 392 | 389b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 0.0% | 0.0% | 0.0% | 0.0% | 0.1% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
| 8 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
| 9 | 0.0% | 0.0% | 0.0% | 0.4% | 0.1% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
| 10 | 0.0% | 0.0% | 0.0% | 31.8% | 2.8% | 0.0% | 0.1% | 0.0% | 0.3% | 0.1% | 0.0% | 0.0% |
| 11 | 0.0% | 0.0% | 0.0% | 62.8% | 89.7% | 1.6% | 0.5% | 0.3% | 14.6% | 0.4% | 0.0% | 0.0% |
| 12 | 2.0% | 0.0% | 0.0% | 4.9% | 5.9% | 1.7% | 98.0% | 98.4% | 74.1% | 3.6% | 0.6% | 0.0% |
| 13 | 95.4% | 0.0% | 0.5% | 0.1% | 0.5% | 8.7% | 1.0% | 1.1% | 9.5% | 85.8% | 90.2% | 0.1% |
| 14 | 2.5% | 0.0% | 93.2% | 0.0% | 0.6% | 69.2% | 0.4% | 0.2% | 1.3% | 9.8% | 8.8% | 0.0% |
| 15 | 0.0% | 0.0% | 5.7% | 0.0% | 0.3% | 16.5% | 0.0% | 0.0% | 0.1% | 0.3% | 0.5% | 5.0% |
| 16 | 0.0% | 0.0% | 0.4% | 0.0% | 0.0% | 2.2% | 0.0% | 0.0% | 0.1% | 0.0% | 0.0% | 79.3% |
| 17 | 0.0% | 0.0% | 0.1% | 0.0% | 0.0% | 0.1% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 13.8% |
| 18 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1.6% |
| 19 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.1% | 0.0% | 0.0% | 0.2% |
| 20 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
|
21 |
0.0% |
<|||||||||||