Haplogroup Prediction from Y-STR Values Using an Allele-Frequency Approach

 

T. Whit Athey

 

 

A new approach to predicting the Y-chromosome haplogroup from a set of Y-STR marker values is presented and compared to other approaches.  The method has been implemented in an Excel-based program, where an arbitrary number of STR markers may be input and a “goodness of fit” score for 10 haplogroups (E3a, E3b, G, I1a, I1b, I1c, J2, N3, R1a, and R1b) is returned.  This method has been applied to 101 R1b haplotypes and 50 I1a haplotypes (all having 37 STR markers available), and the distribution of results is presented.  In the case of I1a, the results are compared with the predictions of another method.

 

 


Introduction

 

Many people have taken advantage of the availability of reasonably priced Y-chromosome testing of short tandem repeats (STRs).  The resulting data can be useful in confirming genealogical relationships between two or more males.  The set of repeat values that is obtained for a set of Y-chromosome markers is called a haplotype.

 

There is also considerable interest in determining the Y-chromosome haplogroup, a group or family of Y-chromosomes related by descent.  Y haplogroups are determined by the pattern of single nucleotide polymorphisms (SNPs), which can also be tested and determined directly.  However, determining the haplogroup by direct SNP testing can sometimes be a lengthy process.  For this reason, many people would like to be able to predict the haplogroup from a set of STR markers.

 

One of the major DNA testing companies, Family Tree DNA (FTDNA), in cooperation with the University of Arizona (UAZ), uses a proprietary algorithm to predict the haplogroup for persons who have their Y-STR values tested by FTDNA.  The prediction algorithm has not been published, but it appears to be based upon the genetic distance[1] of the haplotype in question to other haplotypes in the University of Arizona database.  In this approach, if the database contains a haplotype of known haplogroup that lies within some threshold genetic distance of the test haplotype (reportedly a distance of two on the first 12 markers), then the haplogroup of that reference haplotype is assigned to the test haplotype as a prediction or estimate.  If there are no haplogroup-confirmed haplotypes in the database within a distance of two, then no estimate of haplogroup is made.  The FTDNA/UAZ approach has been fairly successful, and probably about 80% of customers get a haplogroup prediction.
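The threshold rule described above can be sketched in a few lines of Python.  This is only an illustration of the rule as reported here, not the unpublished FTDNA/UAZ algorithm.  The function names are hypothetical, the genetic distance is assumed to be the sum of absolute differences in repeat counts, and the second reference entry and its label "X" are invented for the example.

```python
def genetic_distance(a, b):
    """Assumed definition: sum of absolute differences in repeat counts."""
    return sum(abs(x - y) for x, y in zip(a, b))


def predict_by_distance(test, reference, max_distance=2):
    """Return the haplogroup of the closest reference haplotype that lies
    within max_distance of the test haplotype, or None if there is none."""
    best = None
    for haplotype, haplogroup in reference:
        d = genetic_distance(test, haplotype)
        if d <= max_distance and (best is None or d < best[0]):
            best = (d, haplogroup)
    return best[1] if best else None


# Toy reference database of 12-marker haplotypes.  The first entry is the
# WAMH given in this paper; the second is a made-up haplotype with a
# placeholder label.
REFERENCE = [
    ((13, 24, 14, 11, 11, 14, 12, 12, 12, 13, 13, 16), "R1b"),
    ((14, 22, 15, 10, 13, 15, 11, 14, 11, 12, 11, 15), "X"),
]

near = (13, 24, 14, 10, 11, 14, 12, 12, 13, 13, 13, 16)  # distance 2 from WAMH
far = (15, 21, 16, 12, 15, 17, 10, 10, 10, 11, 14, 18)   # far from both entries

print(predict_by_distance(near, REFERENCE))  # R1b
print(predict_by_distance(far, REFERENCE))   # None
```

The second call shows the drawback discussed in the next paragraph: a haplotype outside the threshold yields no information at all.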

 

The disadvantage of this approach is that if no prediction can be made, then the customer gets no information, even if it is very clear that some haplogroups could be ruled out, or that the haplotype is probably in one of a small number of possible haplogroups.  Theoretically, the most likely haplogroups could be provided to the customer using this approach, but this is not currently done.

 

Another approach is based on the allele frequencies for each haplogroup and how well a given test haplotype fits the pattern of alleles in each haplogroup.  This approach is outlined below.  It has been available on a web site since October 2004 and has been used by many people.  It allows any number of the FTDNA set of 37 markers to be entered, and the program returns a “goodness of fit” score for 10 haplogroups (E3a, E3b, G, I1a, I1b, I1c, J2, N3, R1a, and R1b).  More than 98% of people of West European extraction fall into one of these 10 haplogroups.  While the program is known as a “predictor” program, it really just provides information on how well the given haplotype fits the pattern of previously reported STR values for a haplogroup.

 


Nomenclature

 

In this paper, the order of presentation of Y-STR values is that traditionally employed by FTDNA.  The 37 markers presently tested by FTDNA are the only markers for which sufficient allele frequency data are available to make the haplogroup prediction possible.  The 37 markers, ordered as per the FTDNA convention, may be seen at the following web site:

 

http://www.ftdna.com/9markers.html

 

Rarely, in some haplotypes, there are extra repeat values for markers such as DYS019 (also called DYS394) and DYS464.  These were ignored for purposes of the method described in this paper.

 

Methods

 

Let your haplotype be represented by the set of values, {wj}.  This can represent the haplotype for a set of 12, 25, 37 or any arbitrary number of markers up to 37.  For example, we could consider the set of values that represent what FTDNA calls the “Western Atlantic Modal Haplotype” (WAMH):

 

{wj} = {13, 24, 14, 11, 11, 14, 12, 12, 12, 13, 13, 16}

 

In this case the index j runs from 1 to 12.

 

Let fij(x) represent the allele frequency at the jth marker for the ith haplogroup, where x represents the value (repeat count) of the allele.  These allele frequencies are simply determined empirically from public databases and published haplotypes.[2]  fij(x) will form a table of values where the rows are labeled with the repeat values and the columns are labeled with the DYS marker names.  Table 1 represents an example for the R1b haplogroup, using only the first 12 markers for simplicity.  In practice, there will be many more columns of markers, 37 in the present implementation, and there will also be more rows required for many of the other markers.  There will be a table like this for each haplogroup in the prediction program, the haplogroups being labeled with the index i, and the markers being labeled with the index j.

 

In Table 1, the values in the column labeled DYS426, for example, show the frequency of occurrence of the repeat values 10, 11, 12, 13, and 14: almost all (98%) R1b haplotypes have a repeat value of 12, with small percentages for the other four values.  Note also that the great majority of the table is “empty”; that is, most cells contain a frequency of zero (showing that no haplotype has yet been found with those repeat values on those markers).
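One possible way to hold such a table in a program is sketched below in Python.  The nested-dictionary layout and the function names are illustrative assumptions, not a description of the Excel implementation.  The frequencies are the nonzero entries of three columns of Table 1 (rows 7 to 20), and any cell not listed is treated as zero.

```python
# Allele frequencies for haplogroup R1b, copied from three columns of Table 1.
# Layout: marker name -> {repeat value -> frequency}.
R1B_FREQ = {
    "DYS391": {9: 0.004, 10: 0.318, 11: 0.628, 12: 0.049, 13: 0.001},
    "DYS426": {10: 0.001, 11: 0.005, 12: 0.980, 13: 0.010, 14: 0.004},
    "DYS439": {10: 0.003, 11: 0.146, 12: 0.741, 13: 0.095, 14: 0.013,
               15: 0.001, 16: 0.001, 19: 0.001},
}


def frequency(table, marker, repeat):
    """Return the tabulated frequency, treating an empty cell as zero."""
    return table[marker].get(repeat, 0.0)


def modal_value(table, marker):
    """Return the repeat value with the highest frequency for a marker."""
    column = table[marker]
    return max(column, key=column.get)


print(modal_value(R1B_FREQ, "DYS426"))    # 12
print(frequency(R1B_FREQ, "DYS426", 12))  # 0.98
print(frequency(R1B_FREQ, "DYS426", 20))  # 0.0 (an "empty" cell)
```

Storing only the nonzero cells keeps the table small, since most of the cells are empty.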

 

Next we compute the “goodness of fit” parameter of the test haplotype for the ith haplogroup.  The calculation is straightforward, although it involves several steps.  The approach first calculates, for a given test haplotype, the following ratio for each marker:

 

fij(wj)/ fij(wi,max)

 

where f represents the table of allele frequencies and wi,max denotes the modal (most frequent) repeat value for the marker in question in the ith haplogroup.  That is, for the jth marker and the ith haplogroup, we calculate the frequency from the table for the test haplotype’s repeat value for that marker, and divide by the frequency of the modal value for that marker (in that particular haplogroup).  As an example, let’s calculate this ratio for the fourth marker (DYS391) in the haplogroup R1b for the following test haplotype, which has a value for DYS391 of 10:

 

{wj} = {13, 24, 14, 10, 11, 14, 12, 12, 13, 13, 13, 16}

 

We look at the column in Table 1 labeled DYS391 and go down the column to the row corresponding to a repeat value of 10, where we find a frequency of 0.318.  This is not the modal value for this haplogroup; the modal value is 11.  For the denominator of the ratio, we take the frequency of the modal value, 11, which is 0.628.  Then our ratio becomes:

 

fij(wj)/ fij(wi,max) = 0.318/0.628 = 0.506
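The same arithmetic can be checked with a short Python sketch.  The function name fit_ratio and the dictionary layout are assumptions made for illustration; the frequencies are the DYS391 column of Table 1.

```python
# DYS391 column of Table 1 for haplogroup R1b: repeat value -> frequency.
DYS391_R1B = {9: 0.004, 10: 0.318, 11: 0.628, 12: 0.049, 13: 0.001}


def fit_ratio(column, repeat):
    """Frequency of the observed repeat value divided by the frequency of
    the modal value; a repeat value missing from the column counts as zero."""
    modal_frequency = max(column.values())
    return column.get(repeat, 0.0) / modal_frequency


print(round(fit_ratio(DYS391_R1B, 10), 3))  # 0.506
print(fit_ratio(DYS391_R1B, 11))            # 1.0 (the modal value itself)
```

A haplotype that carries the modal value on a marker scores exactly 1 for that marker, and every other value scores less than 1.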

 

The overall “goodness of fit” parameter for that haplogroup is simply the geometric mean[3] of all of these ratios (one for each marker).  The calculation of the “goodness of fit” parameter is illustrated in detail in Table 2 for the test haplotype above and haplogroup R1b.
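As a rough sketch of the whole calculation, the Python fragment below takes the geometric mean of the per-marker ratios for the test haplotype and haplogroup R1b.  The names and data layout are assumptions.  Only the 11 markers whose nonzero frequencies appear in rows 7 to 20 of Table 1 are used, so DYS390 is omitted and the score is not expected to match the 12-marker figure worked out in Table 2.  Note also that in this plain form a single zero frequency drives the whole score to zero.

```python
import math

# Nonzero R1b frequencies from rows 7-20 of Table 1: marker -> {repeat: freq}.
# DYS390 is left out because its nonzero rows are not among those rows.
R1B_FREQ = {
    "DYS393": {12: 0.020, 13: 0.954, 14: 0.025},
    "DYS019": {13: 0.005, 14: 0.932, 15: 0.057, 16: 0.004, 17: 0.001},
    "DYS391": {9: 0.004, 10: 0.318, 11: 0.628, 12: 0.049, 13: 0.001},
    "DYS385a": {7: 0.001, 9: 0.001, 10: 0.028, 11: 0.897, 12: 0.059,
                13: 0.005, 14: 0.006, 15: 0.003},
    "DYS385b": {11: 0.016, 12: 0.017, 13: 0.087, 14: 0.692, 15: 0.165,
                16: 0.022, 17: 0.001},
    "DYS426": {10: 0.001, 11: 0.005, 12: 0.980, 13: 0.010, 14: 0.004},
    "DYS388": {11: 0.003, 12: 0.984, 13: 0.011, 14: 0.002},
    "DYS439": {10: 0.003, 11: 0.146, 12: 0.741, 13: 0.095, 14: 0.013,
               15: 0.001, 16: 0.001, 19: 0.001},
    "DYS389a": {10: 0.001, 11: 0.004, 12: 0.036, 13: 0.858, 14: 0.098,
                15: 0.003},
    "DYS392": {12: 0.006, 13: 0.902, 14: 0.088, 15: 0.005},
    "DYS389b": {13: 0.001, 15: 0.050, 16: 0.793, 17: 0.138, 18: 0.016,
                19: 0.002},
}

# Test haplotype from the text, restricted to the markers above.
TEST = {"DYS393": 13, "DYS019": 14, "DYS391": 10, "DYS385a": 11,
        "DYS385b": 14, "DYS426": 12, "DYS388": 12, "DYS439": 13,
        "DYS389a": 13, "DYS392": 13, "DYS389b": 16}


def goodness_of_fit(freq_table, haplotype):
    """Geometric mean of f(observed) / f(modal) over the markers entered."""
    ratios = []
    for marker, repeat in haplotype.items():
        column = freq_table[marker]
        ratios.append(column.get(repeat, 0.0) / max(column.values()))
    return math.prod(ratios) ** (1.0 / len(ratios))


score = goodness_of_fit(R1B_FREQ, TEST)
print(f"{score:.2f}")  # about 0.78 for these 11 markers
```

Only DYS391 and DYS439 depart from the R1b modal values in this haplotype, so the other nine ratios are 1 and the score is set by those two markers.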



 

 

Table 1

Allele Frequencies for Haplogroup R1b

           DYS Marker Number
Repeat    393    390    019    391   385a   385b    426    388    439   389a    392   389b
     7   0.0%   0.0%   0.0%   0.0%   0.1%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%
     8   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%
     9   0.0%   0.0%   0.0%   0.4%   0.1%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%
    10   0.0%   0.0%   0.0%  31.8%   2.8%   0.0%   0.1%   0.0%   0.3%   0.1%   0.0%   0.0%
    11   0.0%   0.0%   0.0%  62.8%  89.7%   1.6%   0.5%   0.3%  14.6%   0.4%   0.0%   0.0%
    12   2.0%   0.0%   0.0%   4.9%   5.9%   1.7%  98.0%  98.4%  74.1%   3.6%   0.6%   0.0%
    13  95.4%   0.0%   0.5%   0.1%   0.5%   8.7%   1.0%   1.1%   9.5%  85.8%  90.2%   0.1%
    14   2.5%   0.0%  93.2%   0.0%   0.6%  69.2%   0.4%   0.2%   1.3%   9.8%   8.8%   0.0%
    15   0.0%   0.0%   5.7%   0.0%   0.3%  16.5%   0.0%   0.0%   0.1%   0.3%   0.5%   5.0%
    16   0.0%   0.0%   0.4%   0.0%   0.0%   2.2%   0.0%   0.0%   0.1%   0.0%   0.0%  79.3%
    17   0.0%   0.0%   0.1%   0.0%   0.0%   0.1%   0.0%   0.0%   0.0%   0.0%   0.0%  13.8%
    18   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   1.6%
    19   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.1%   0.0%   0.0%   0.2%
    20   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%

21

0.0%