Diagnostic Y-STR Markers in Haplogroup G
Phillip G. Goff and T.
Abstract
Y-Chromosome Haplogroup G reaches
its highest frequency in the
______________________________________________________________________________________
Received
Address for correspondence: philgoff@comcast.net
___________________________________________________________________________
Introduction
Interest in Y-chromosome
testing for genealogical research into paternal ancestry has steadily increased
since 2000.[1] As of December 2005, about 40,000
genealogically relevant haplotypes were available through various online databases.[2]
Many who have been tested for their Y-STR
haplotype want to know their predicted haplogroup with some level of certainty
before taking a SNP-test. The
identification of diagnostic Y-STR markers will help to fill this demand.
Members of Y Haplogroup G
have repeat values on several Y-STR markers that are distinctively different
from those of other haplogroups. These
markers include DYS425, DYS452, DYS446, and DYF399S1. In this article we review the available data
from public databases on these markers for Haplogroup G.
DYS425 is currently offered by two DNA testing companies: Oxford Ancestors (OA), as
part of its 10-marker Y-STR product, and DNA-Fingerprint (DNAFP), which has offered it since
December 2005 as the T-associated allele of the four-copy marker DYF371. DNA Heritage, another DNA testing company,
offered DYS425 from about October 2003 through March 2004.
The study of DYS425 has
been limited because few DNA testing companies test this marker. In addition, a review
of comments on the Rootsweb Genealogy-DNA List
reveals a widely held view that DYS425 is of little diagnostic value because of the
perception that it always has a repeat value of 12.
Information on the markers
DYS452 and DYS446 is available primarily from the Sorenson Molecular Genetics
Foundation (SMGF) database and to a lesser extent from Y-Base and Y-Search. DYF399S1 is an unusual three-copy marker that
was described by Henson (2005) and recently offered commercially by DNAFP. Information on this marker is available from
Y-Match and personal communications from persons who have tested their own
samples at DNAFP.
Nomenclature
Some Y-STR markers are
reported differently by different companies and by different researchers. In the
Oxford Ancestors (OA) uses
a non-ISFG/NIST-standard nomenclature for the marker DYS389i. DYS389I, used in the OA database, is equal to
ISFG/NIST-standard DYS389I minus three. DYS389b is reported as ISFG/NIST-standard DYS389II
minus ISFG/NIST-standard DYS389I. The
non-standard nomenclature must be used when searching the OA database, but the standard
nomenclature will be used in discussing DYS389I in this article since it is
more familiar.
DYS425 is part of a larger
marker called DYF371. DYF371 has four
alleles, three of which have a C base at a particular location adjacent to the
repeat structure, while the fourth has a T base at that location. DYS425 is defined as the T-associated allele
of DYF371. For example, DNAFP might
report the results for DYF371 as “10c-12t-13c-13c”, from which the DYS425 value
is read as 12.
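The reading of DYS425 from a DYF371 result can be illustrated with a short sketch. It assumes the hyphen-separated "repeats plus base letter" format of the example above; the function name is hypothetical and is not part of any laboratory's software.

```python
def dys425_from_dyf371(result):
    """Return the repeat count of the T-associated allele of DYF371.

    Assumes a hyphen-separated string such as "10c-12t-13c-13c", where each
    allele is a repeat count followed by the base letter (c or t).
    Returns None when no T-associated allele is present (a "missing" DYS425).
    """
    t_alleles = []
    for allele in result.split("-"):
        allele = allele.strip().lower()
        if not allele:
            continue
        repeats, base = allele[:-1], allele[-1]
        if base == "t":
            t_alleles.append(int(repeats))
    if not t_alleles:
        return None
    if len(t_alleles) > 1:
        # Assumption of this sketch: at most one T-associated allele is expected.
        raise ValueError("more than one T-associated allele in " + result)
    return t_alleles[0]


print(dys425_from_dyf371("10c-12t-13c-13c"))  # 12
```

A result with no T-associated allele returns None, which corresponds to the "missing" category used later for DYS425.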
The repeat value for
DYS452 is reported differently by various companies. The marker consists of one continuous
run of the repeating unit TATAC, about 12 repeats long, plus 19 additional contiguous
units made up of CATAC, TGTAC, or TATAC.
These 19 units are normally invariant.
Some companies (DNA Heritage [DNAH] and Relative Genetics [RG]) report only the main (variable) repeat
value of TATAC (12 in the above example), while others (SMGF, DNAFP) add the
other 19 units as well, for a total of 31; this is also the ISFG/NIST-standard
nomenclature. We will use the latter
notation here.
There are apparently no
differences in nomenclature on DYS446 used by any of the labs or databases that
include this marker.
DYF399S1 has three similar
alleles that are based upon a repeat unit of “AAAG.” Within the sequence containing each allele
there are several extra bases that are not part of an “AAAN” motif (where N
represents any base), and the number of these extra bases, usually 10 or 11, is
placed after the number of full repeats as a decimal quantity. For example, if there were 24 full repeats
plus 11 extra bases, the value of the allele would be reported as 24.11. DNA Fingerprint, the company that developed
the test for this marker, has followed the ISFG/NIST guidelines but has adopted
a shorter notation for convenience, subtracting 10 from the number of extra
bases. Under this convention, the value
24.11 is reported as 24.1 by the company.
Normally, only
the overall PCR length is used in routine tests of Y-STR markers, and the known
structure allows an unambiguous value to be inferred from that overall
length. However, for all members of
Haplogroup G so far tested, the overall PCR length for the shortest allele of
DYF399S1 has been such that there must be either 8 or 12 extra bases. This causes an ambiguity in interpretation because
an allele with a value of 17.12 has exactly the same PCR product length as an
allele with a value of 18.8. Only direct
sequencing of the PCR product can distinguish these possibilities and this has
not yet been done. Tentatively, the
convention has been adopted that the extra bases total 12 instead of 8, so that
the allele values 17.12/18.8 are reported, for example, as 17.12 (or 17.2 in
the short form notation).
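The arithmetic behind the notation and the ambiguity can be sketched as follows. This is an illustration only: it assumes a four-base repeat unit (AAAN) and counts only the bases contributed by the repeats and the extra bases, not the flanking sequence of the PCR product, which is the same for both interpretations. The function names are hypothetical.

```python
REPEAT_UNIT_LENGTH = 4  # bases in one AAAN repeat unit


def variable_length(full_repeats, extra_bases):
    """Bases contributed by the full repeats plus the extra bases."""
    return full_repeats * REPEAT_UNIT_LENGTH + extra_bases


def long_form(full_repeats, extra_bases):
    """Notation with the extra-base count written out, e.g. 24.11."""
    return f"{full_repeats}.{extra_bases}"


def short_form(full_repeats, extra_bases):
    """Shorter notation: 10 is subtracted from the extra-base count."""
    if extra_bases < 10:
        raise ValueError("short form is only defined for 10 or more extra bases")
    return f"{full_repeats}.{extra_bases - 10}"


print(long_form(24, 11), short_form(24, 11))  # 24.11 24.1

# 17 repeats with 12 extra bases and 18 repeats with 8 extra bases
# give the same length, so fragment sizing cannot separate them.
print(variable_length(17, 12), variable_length(18, 8))  # 80 80
```

Because one repeat unit is four bases long, moving four bases from the "extra" count into one more full repeat leaves the product length unchanged, which is why only sequencing can resolve 17.12 versus 18.8.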
Methods
To test the diagnostic
value of DYS425, the public repositories were searched for haplotypes with
DYS425. This search included
ysearch.com, ybase.com, the Rootsweb Genealogy-DNA
List and websites of private surname DNA studies. In addition, academic papers were reviewed to
find examples of DYS425=14. This initial
review revealed multiple examples of DYS425=14 in haplotypes predicted as Haplogroup
G.
To determine the degree of
correlation between DYS425=14 and Haplogroup G, an effort was made to identify
all Haplogroup G haplotypes in the OA database. To ensure completeness, the number of Haplogroup
G results in the OA database was estimated.
First, the SMGF, Y-Search and Y-Base databases were searched to
determine the frequency of 9-marker modal haplotypes for Haplogroups E3b, G,
I1a and R1b.[3] Next, the OA database was searched for counts
of these same 9-marker modal haplotypes plus DYS425 at each of its possible
values (10 through 15, plus M*, which designates a missing T-associated allele). The counts in OA were divided by the weighted
average frequencies in the other public databases to develop four estimates of
the total records in the OA database. The
average of these four estimates suggests that the OA database
contained about 4,108 records (November 2005).
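The estimation procedure just described can be sketched in a few lines. All numbers below are hypothetical placeholders, not the counts used in this study, and weighting each reference database by its size is an assumption of the sketch.

```python
from statistics import mean


def weighted_average_frequency(observations):
    """Pooled frequency from (matching_haplotypes, database_size) pairs.

    Pooling the counts weights each reference database by its size
    (an assumption of this sketch).
    """
    total_matches = sum(matches for matches, _ in observations)
    total_records = sum(size for _, size in observations)
    return total_matches / total_records


def estimate_total_records(count_in_target, reference_frequency):
    """Size of the target database implied by one modal haplotype."""
    return count_in_target / reference_frequency


# Hypothetical (matches, database size) pairs for three reference databases.
reference = {
    "E3b": [(12, 3000), (20, 5000), (8, 2000)],
    "G": [(6, 3000), (9, 5000), (5, 2000)],
    "I1a": [(45, 3000), (80, 5000), (25, 2000)],
    "R1b": [(150, 3000), (240, 5000), (110, 2000)],
}
# Hypothetical counts of the same modal haplotypes in the target database.
target_counts = {"E3b": 16, "G": 9, "I1a": 62, "R1b": 205}

estimates = [
    estimate_total_records(target_counts[hg], weighted_average_frequency(obs))
    for hg, obs in reference.items()
]
print(round(mean(estimates)))
```

Each modal haplotype yields one estimate of the total, and averaging the four reduces the influence of any one haplogroup being over- or under-represented in the target database.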
Y-Search and Y-Base
estimate that 1.6% and 1.0% of their records (in November 2005), respectively,
are in Haplogroup G. If the OA database
contains the same proportion of Haplogroup G results, it would be expected to hold
between 41 and 68 Haplogroup G records. The OA database was interrogated with
SNP-tested Haplogroup G 9-marker haplotypes (DYS19, DYS388, DYS390, DYS391, DYS392,
DYS393, DYS389i, DYS389ii-i and DYS426) from the initial Internet search and academic
papers (Butler et al 2002; Behar et al 2004). DYS425 was varied from 10 through 15, plus
M*, searching for exact matches. This
resulted in 97 estimated Haplogroup G records.
The OA database was also interrogated for DYS425 repeats in SNP-tested haplogroups
other than G. Approximately 20% of the estimated number of haplotypes in the OA
database were captured. For those
haplotypes that did not match SNP-tested results, haplogroups were assigned using
the Y-Haplogroup Predictor (Athey, 2005; see Electronic Database Information). In cases of multiple SNP-tested haplogroup
designations or ambiguous results from the Y-Haplogroup Predictor, other steps
were taken to determine the haplogroup, such as the origin of the family in the
OA record.
Until recently, the
markers DYS452 and DYS446 were tested only by Sorenson Genetics and its
resellers DNAH and RG. Now these markers
are also available from DNAFP. Since the
SMGF database covers both of these markers, it was used as the primary source
of information on these markers.
Table 1. Allele Frequencies for DYS425 by Haplogroup

| Repeats | E3a | E3b | F* | G | G1a | G2 | H | I1a | I-P37 (pka I1b) |
|---|---|---|---|---|---|---|---|---|---|
| 10 | | | | | | | | | 0.500 |
| 11 | | | | | | | | | |
| 12 | 1.00 | 0.028 | 1.00 | 0.115 | 1.00 | 0.072 | | 0.974 | |
| 13 | | | | | | 0.139 | 1.00 | | |
| 14 | | | | 0.885 | | 0.841 | | 0.007 | |
| Missing | | 0.972 | | | | 0.058 | | 0.020 | 0.500 |
| N | 7 | 36 | 2 | 26 | 1 | 69 | 1 | 151 | 2 |

| Repeats | I-M223 (pka I1c) | J | J2 | K | | N | Q | R1a | R1b |
|---|---|---|---|---|---|---|---|---|---|
| 10 | | 0.100 | | | | | | | |
| 11 | | | | | | | | | 0.002 |
| 12 | 0.875 | 0.900 | 1.00 | 1.00 | 1.00 | 1.00 | 0.812 | 1.00 | 0.981 |
| 13 | | | | | | | | | 0.013 |
| 14 | | | | | | | | | |
| Missing | 0.125 | | | | | | 0.187 | | 0.004 |
| N | 8 | 10 | 1 | 1 | 6 | 3 | 16 | 15 | 474 |
Candidate Haplogroup G
haplotypes were extracted from the SMGF database using somewhat different
search criteria[4]
from those used for the OA database.
Candidate haplotypes were tested using the Haplogroup Predictor Program
(Athey 2005) and only those with a score exceeding 50 for Haplogroup G were
used. Multiple haplotypes with the same
surname listed were deleted, retaining only one haplotype per surname (except
where the haplotypes were clearly unrelated).
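The two filtering steps can be sketched as below. The record fields (`surname`, `g_score`) and the sample records are hypothetical, and the exception for clearly unrelated haplotypes sharing a surname was a judgment made by inspection, so it is not reproduced here.

```python
def select_candidates(records, min_score=50):
    """Keep haplotypes scoring above min_score for Haplogroup G,
    retaining only the first haplotype seen for each surname.

    Each record is a dict with hypothetical keys 'surname' and 'g_score'.
    """
    seen_surnames = set()
    kept = []
    for record in records:
        if record["g_score"] <= min_score:
            continue  # the score must exceed the threshold
        key = record["surname"].strip().lower()
        if key in seen_surnames:
            continue  # a haplotype with this surname was already retained
        seen_surnames.add(key)
        kept.append(record)
    return kept


# Hypothetical example records.
sample = [
    {"surname": "Alpha", "g_score": 92},
    {"surname": "alpha", "g_score": 88},
    {"surname": "Beta", "g_score": 50},
    {"surname": "Gamma", "g_score": 71},
]
print([r["surname"] for r in select_candidates(sample)])  # ['Alpha', 'Gamma']
```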
The marker DYF399S1 is available only from DNAFP, and none of the
public databases (except DNAFP’s own Y-Match) currently
accepts data on this marker. Therefore,
all of the data for members of Haplogroup G were sent to the authors in private
communications (n=5), commissioned for the present study (n=1), or
found in Y-Match (n=1; some of the results we received in private
communications are now also in Y-Match).
Results
DYS425
DYS425=14 was found to be strongly,
but not exclusively, associated with Haplogroup G in the OA database (Table 1).
About 88% of the OA Haplogroup G results
had 14 repeats at DYS425. Outside of Haplogroup
G, 14 repeats at DYS425 were observed in one of 152 estimated Haplogroup I1a
records in the OA database and in two of 69 results in Haplogroup Q in an
academic study (Seielstad et al 2004). While the present study focused on Haplogroup
G, the results indicate that DYS425 may also have diagnostic value in Haplogroups
E3b, H, I1b, and J.
In Table 1, the
columns labeled G1a and G2 had SNP information that confirmed those
designations. The column labeled simply
G did not have SNP information but was predicted to be in G using the
Haplogroup Predictor program.
One G2-P15 subject was tested for
DYF371 to assess whether the value of 14 on DYS425 was
present from the beginning of Haplogroup G2.
The subject was from a tribal area of
Therefore, it appears that the two repeats
were added in a G2 individual at some early time after the founding of G2. Consequently, we would not normally expect to
find DYS425=14 in a member of G1 or G*, and this conclusion is supported by the
single example in Table 1 of a Haplogroup G1a individual, plus the single
example of a GxG2 individual reported to us in a private communication.
DYS452
DYS452 is a complex marker with several
sets of repeats on the main pentabase motif, TATAC,
the longest of which contains about 11-14 repeats. Here is an example of the sequence for one of
the YCC samples, YCC33, which is a member of Haplogroup E3a:
. . . . . GGTGTTCTGATGAGGATAATT/TATAC/TATAC/TGTAC/TGTAC/TATAC/TATAC/TATAC/TATAC/TATAC/TATAC/TATAC/TATAC/TATAC/TATAC/TATAC/TATAC/TATAC/TATAC/CATAC/TATAC/CATAC/TATAC/TATAC/TATAC/CATAC/CATAC/TATAC/TATAC/TATAC/CATAC/TATAC/TATAC/TATAC/AACCAATTAATTAGCTGAGTATAATAA . . . . .
From the sequence, we see
that this example has the following repeat structure (Redd
2002):
(TATAC)2(TGTAC)2(TATAC)14(CATAC)1(TATAC)1(CATAC)1(TATAC)3(CATAC)2
(TATAC)3(CATAC)1(TATAC)3
Some commercial labs (e.g., DNA Heritage,
Relative Genetics) report just the main repeat section, which would give a
value of 14 in the above example. Normally,
it is only this part of the marker that is variable. However, the guidelines of the International
Society for Forensic Genetics (ISFG) and also the guidelines of the U. S.
National Institute of Standards and Technology (NIST) suggest that all of the
similar penta-base repeats in this marker should be
counted, resulting in a value of 33 for YCC33, and this is how it is reported
by DNA Fingerprint and SMGF (their reported values for DYS452 are 19 repeats
greater than those reported by DNAH and RG).
By fortunate coincidence, one of the
sequences for DYS452 that was reported by Redd (2002)
is for YCC24, a member of Haplogroup G2a1-P18.
The published PCR sequence actually shows the deletion. Here is the repeat structure shown by Redd for YCC24:
(TATAC)2(TGTAC)2(TATAC)14(CATAC)1(TATAC)1
(CATAC)1(TATAC)3……….(TATAC)1(CATAC)1(TATAC)3
Here we see that 20 bases
of the form
(CATAC)2(TATAC)2
have been deleted. Since the deletion occurred in a normally invariant
part of the marker, it should be considered a Unique Event Polymorphism (UEP). Interestingly, the companies that report only what
they believe to be the main repeat section of this marker would report a value
of 10 for YCC24, whereas the actual number of repeats in that
section is 14. For members of Haplogroup G2, they are reporting
a value that is four repeats fewer than what is actually present. This is a good reason for using the ISFG/NIST
standard nomenclature.
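The counts discussed above can be checked with a short sketch that parses the two repeat structures as written in this article. The function names are hypothetical, and the "main section only" value is modeled, as an assumption, as the total minus the 19 units that such labs treat as invariant.

```python
import re

UNIT_PATTERN = re.compile(r"\(([ACGT]+)\)(\d+)")
ASSUMED_INVARIANT_UNITS = 19  # units assumed constant by labs reporting only the main section

# Repeat structures as given in the text (the YCC24 deletion gap is omitted).
YCC33 = ("(TATAC)2(TGTAC)2(TATAC)14(CATAC)1(TATAC)1(CATAC)1(TATAC)3"
         "(CATAC)2(TATAC)3(CATAC)1(TATAC)3")
YCC24 = ("(TATAC)2(TGTAC)2(TATAC)14(CATAC)1(TATAC)1(CATAC)1(TATAC)3"
         "(TATAC)1(CATAC)1(TATAC)3")


def parse_structure(structure):
    """Return a list of (unit, count) blocks from a string like '(TATAC)2(TGTAC)2'."""
    return [(unit, int(count)) for unit, count in UNIT_PATTERN.findall(structure)]


def total_units(structure):
    """Value under the convention that counts every penta-base unit."""
    return sum(count for _, count in parse_structure(structure))


def main_section_reported(structure):
    """Value a lab would report if it subtracts the assumed invariant units."""
    return total_units(structure) - ASSUMED_INVARIANT_UNITS


def longest_tatac_block(structure):
    """Actual length of the longest uninterrupted TATAC block."""
    return max(count for unit, count in parse_structure(structure) if unit == "TATAC")


for name, structure in (("YCC33", YCC33), ("YCC24", YCC24)):
    print(name, total_units(structure), main_section_reported(structure),
          longest_tatac_block(structure))
# YCC33 33 14 14
# YCC24 29 10 14
```

For YCC24 the subtraction gives 10 even though the main TATAC block still contains 14 repeats, which is the discrepancy of four units caused by the deletion.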
Allele frequencies on DYS452 for the most common European haplogroups are shown in Table 2.