The Subclades of mtDNA Haplogroup J and Proposed Motifs for Assigning Control-Region Sequences into These Clades
Jim Logan
Abstract
This paper presents
a study of the phylogeny of mtDNA Haplogroup J using full genome sequence data
publicly available through GenBank. It
presents a broad history of previous research relative to this haplogroup and
the development of motifs for classification of its clades. It then presents a new phylogeny and a set of
new motifs for classification where only control region data is available. Finally, it evaluates these motifs relative to
full sequence classification and uses them to assess the classic motifs still
in use in some projects.
Address for correspondence: T. Jim Logan, jjlnv@comcast.net
Received:
Introduction
Just over
a quarter century ago, it was shown that the mitochondrial
Over the past decade, the technology of analysis and the nomenclature for describing the
analysis of mtDNA have changed dramatically.
Thus, to facilitate historical review, it is appropriate to introduce
some of that nomenclature. The long
string of
The
potential for the use of mtDNA in anthropology (and thus genetic genealogy) was
demonstrated in a study that concluded that all mitochondrial DNAs stem from one woman who is estimated to have lived
about 200,000 years ago, probably in
One of
the earliest of these studies used the technique of restriction fragment length
polymorphism (RFLP) to analyze blood samples from 167 Native American subjects
from five widely dispersed populations–three in North America, one in Central
America, and one in South America (Torroni,
1992). By applying 14 specific
restriction e
This
study was extended by adding 321 individuals from 17 additional Native American
populations (Torroni, 1993a). For 36 of the samples, they also sequenced
341 nucleotides from the displacement loop (D-loop), also known as the control
region, and found that their clustering correlated strongly with the four
haplogroups defined by the restriction analysis.
Finally,
the Torroni team applied their technology and
experience to a study of 411 aboriginal Siberian subjects (Torroni,
1993b). They found similar clusters, but
also differences from the Native Americans.
Details of their analysis support the theory that the Native American
population was genetically derived from early Asian populations. This work also led to the beginning of a
formal mtDNA haplogroup system using letters for names and definitions in terms
of defined restriction sites based on RFLPs.
In a concurrent study, Horai et al. (1993) explored the
concept of race using 72 Native American samples from 16 broadly scattered
populations throughout North, Central, and South America. As distinct from the Torroni studies that used restriction sites scattered over
the entire mitochondrial genome, the Horai study
relied entirely on the sequencing of a 482-bp segment within the D-loop. They also found four clusters of Native
Americans. By comparing the haplotypes
with those of worldwide populations, including Africans, Europeans, and Asians,
they concluded that peopling of the
The mtDNA
Haplogroup J (Hg J) was first distinguished from Eurasian Haplogroups H, I and
K through the use of the RFLP technique in an analysis of 175 Caucasians
residing primarily in the United States but including 28 French Canadians (Torroni, 1994). Hg J
was defined by the RFLP predecessors of nucleotide polymorphisms at rCRS positions 13708 and 16069.
With this
background, there was rapid identification of other haplogroups and
identification of broad interrelations between them. Haplogroups T, U, V, W, and X were identified
in a study using 134 samples from three European populations of Finns, Swedes,
and Tuscans (Torroni, 1996). This study found that 99% of mtDNAs fell within the ten haplogroups of H, I, J, K, M, T,
U, V, W, and X “suggesting that the identified haplogroups could encompass
virtually all European mtDNAs.” This study was carried out in the RFLP
tradition, but the results were also compared with control region sequences for
the Tuscan examples as determined in a separate study (Francalacci
1996). For groups of haplotypes in each
haplogroup identified through RFLP analysis, they were able to find identifying
concordant nucleotide polymorphisms in the D-loop (control region) that were
indicative of that haplogroup. The
defining polymorphisms for Hg J were found to be at positions 16069 and 16126
in HVR1 and 295 in HVR2, respectively.
A
concurrent, but independent, study that involved sequencing the first
hypervariable region (HVR1) of the mtDNA, showed how this technique can be used
not only to group haplotypes, but can also use their geographic distribution to
infer origin and use their variability to infer age (Richards, 1996). Using 821 widely dispersed test subjects
throughout
A study
of 37 Italian patients with Leber’s Hereditary Optic Neuropathy
(LHON) disease and 90 matched control subjects found that subjects with the
disease were five times more likely to be of Hg J (35.1%) than were the members
of the control group (7.1%) (Torroni,
1997). By contrast, there were
relatively fewer LHON patients in Haplogroup U than among the
controls. The associated phylogenetic
analysis found four polymorphic sites to be of particular significance for
Hg J: 4216 and 13708 together define J itself;
15257 in turn defines the J2 subgroup, with its absence defining J1; and finally
15812 within 15257 defines (using their notation) J2.2, with its absence
defining J2.1. On the other hand, the
mutations most commonly associated with LHON (3460, 11778, and 14484) appear to
be independent mutational events and are not definitive of any clade. Hg J apparently provides a genetic
background that supports mutations associated with the disease.
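The nested definitions reported in that 1997 study amount to a small decision tree. The following is a minimal sketch of that logic only; the function name and the representation of a sequence as a set of polymorphic site numbers are assumptions made for illustration, and the clade names follow the 1997 notation rather than later nomenclature.

```python
def torroni_1997_clade(sites):
    """Apply the nested site definitions described for the 1997 LHON study.

    `sites` is an iterable of site numbers at which a sequence differs from
    the reference (a hypothetical representation used only in this sketch).
    Returns None when the sequence does not carry the Hg J sites.
    """
    sites = set(sites)
    if not {4216, 13708} <= sites:   # both sites are needed to define J
        return None
    if 15257 not in sites:           # absence of 15257 defines J1
        return "J1"
    # Within 15257 (J2), 15812 separates J2.2 from J2.1.
    return "J2.2" if 15812 in sites else "J2.1"


print(torroni_1997_clade({4216, 13708}))                # J1
print(torroni_1997_clade({4216, 13708, 15257}))         # J2.1
print(torroni_1997_clade({4216, 13708, 15257, 15812}))  # J2.2
```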
Recognizing
that a number of distinct mtDNA classification schemes had arisen due to
differences in technology of testing and the use of “imperfect phylogenetic
analyses and datasets,” a team of researchers, centered on Oxford, proposed a
new flexible (i.e., expandable and changeable) nomenclature scheme (Richards,
1998). They adopted the same capital
letters for names of the major mtDNA clusters already in use but then suggested
a set theoretic approach such that nomenclature could be systematically
expanded to accommodate naming discrete subsets as they were recognized and
defined. They also developed rules for inserting
nomenclature to represent new groupings relative to previously defined
sets. Following their recommendations,
they applied this scheme to their previous work; for example, the cluster 2 and
its two subclusters 2A and 2B were renamed as
haplogroup cluster JT and Haplogroups J and T, respectively. They went further and partitioned several of
the haplogroups and gave them names and defined HVR1 assignment motifs for
them. For example, nested subsets of Hg
J included J1, J1a, J1b, J1b1, and J2.
The motifs for classification of haplogroups based on HVR1 sequence data
were given as sites where differences from the
Another
study, centered in
The team
then turned to researching founder effects, computation of ages of the various
clades, and assessing possible geographic regions of the clade origins
(Richards, 2000). To provide a better
estimate of the Paleolithic and Neolithic contribution to European diversity
their research brought together over 4000 samples from various projects,
carefully chosen as representative of various regions throughout
There
followed a number of studies that analyzed the coding region and reported on
some aspect of Hg J. In the process,
some studies showed that the coding region of the mtDNA was a much better
source of data for analysis than the control region (Ingman,
2000; Finnila, 2001; Kivisild,
2004). Some were regional studies (Finnila, 2001), whereas others looked at the relationship
between haplogroups and the early human expansion (Maca-Meyer,
2001; Richards, 2002 and 2003), linguistics (Forster, 2004), and even longevity
of Hg J centenarians (Rose, 2001). There
were, of course, also studies that emphasized the technical aspects, such as
network analysis (Herrnstadt, 2002, Coble, 2004, 2006;
The first
known study devoted exclusively to Hg J was a Master of Science thesis by Serk (2004), where she compared populations distributed
across
In a
study designed to resolve uncertainties in the relationships between Indian and
western Eurasian mtDNA pools through the study of the phylogeny of mtDNA macrohaplogroup N, a “reappraisal of the Western Eurasian
mtDNA Phylogeny,” was conducted by Palanichamy et al. (2004). Upon reviewing the works of Finnila (2001), Rose (2001), and Herrnstadt
(2002, 2003), it was concluded that “the former J1a . . . is proven to be one subbranch of J2 on the basis of coding-region
sequences.” This study further
recommended that this subbranch be renamed “J2a,”
retaining the “a,” and that the old J1a name be retired from further use. Using complete mtDNA sequences of 75 Indian
samples, supplemented by 25 complete sequences taken from the literature, they
reconciled “conflicts among published western Eurasian data sets,” refined the
basal phylogeny, and presented it in four parts covering respectively N, pre-HV
and JT, U, and the Indian autochthonous R.
This phylogeny uses both coding and control region polymorphisms in
their definitions. In defining the
structure of Hg J, they define J1 in terms of 462 and 3010 and J2 in terms of
7476 and 15257. There is no J1a on their
chart, but there is a J2a that subsumes the previous HVR1-only motif for J1a.
A more
detailed synthesis of the mtDNA phylogeny in the form of “A human mitochondrial
genome database,” called MITOMAP, is maintained at the
The
largest standardized human mtDNA database to date has been assembled through
the public participation side of the Genographic Project and includes 78,590
genotypes (Behar, 2007). The population
is self-selecting and each kit purchased is analyzed for either a 12-marker Y-
Goals of the Current Study
Building
upon all these prior studies, the aim of the present study is to gather full genome
sequence data that can be classified as belonging specifically to mtDNA Hg J
and to analyze that data to develop a new phylogenetic tree for that
haplogroup. A second aim is to use that
tree as a basis for development of consistent classification motifs for use
when the only data available is the HVR1 sequence or when there is both HVR1
and HVR2.
Sources and Methods
Data for
the present analysis comes from a variety of world-wide sources previously
deposited in GenBank maintained by the




One form
of validation of this data set is to show that the sequences fit properly in the general
mtDNA phylogeny. It is generally agreed
that the most recent common ancestor of Hg J and the rCRS
is Haplogroup R. Each of the sequences
selected would be expected to have each of the mutations back to that
point. This path contains 295, 489,
10398, 12612, 13708, and 16069 from J back to JT and 4216, 11251, 15452 and
16126 from JT back to Haplogroup R.
Ignoring 16069, since this was the original search criterion, this
logic defines the expected content of 999 cells of the matrix (111 sequences by 9 sites). A check of the matrix found only three cells
that differed from expectation: three sequences, each differing at a different
polymorphic site. This relatively low
rate of differences could be explained by back mutation or even errors in the
sequencing process. A second set of expected
differences from the rCRS consists of the artifacts generated
by the fact that the rCRS is not a direct
ancestor of Hg J. The polymorphic sites from
Haplogroup R to Haplogroup H are 73, 2706, 7028, 11719, and 14766, and
similarly from H to the rCRS itself the sites are 263,
750, 1438, 4769, 8860, and 15326. Only
five differences were found for the 1221 cells so defined (111 sequences by 11 sites).
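The cell counts in this validation step follow directly from the site lists given in the text. The sketch below reproduces that arithmetic and shows one way such a check might be run. The dictionary layout of the matrix, the function name, and the toy sequence identifiers are assumptions for illustration, not the data structure actually used in the study.

```python
# Sites expected in every Hg J sequence, as listed in the text.
J_TO_JT = [295, 489, 10398, 12612, 13708, 16069]
JT_TO_R = [4216, 11251, 15452, 16126]
R_TO_H = [73, 2706, 7028, 11719, 14766]
H_TO_RCRS = [263, 750, 1438, 4769, 8860, 15326]

N_SEQUENCES = 111

# 16069 was the search criterion, so it is excluded from the check.
ancestral_sites = [s for s in J_TO_JT + JT_TO_R if s != 16069]
rcrs_branch_sites = R_TO_H + H_TO_RCRS

ancestral_cells = N_SEQUENCES * len(ancestral_sites)
rcrs_cells = N_SEQUENCES * len(rcrs_branch_sites)


def unexpected_cells(matrix, expected_sites):
    """Return (sequence_id, site) pairs where an expected site is absent.

    `matrix` maps a sequence id to the set of sites at which that sequence
    differs from the rCRS (hypothetical layout, for illustration only).
    """
    return [(seq_id, site)
            for seq_id, sites in sorted(matrix.items())
            for site in expected_sites
            if site not in sites]


# Toy example with made-up sequence ids, not GenBank records.
toy = {
    "seq_a": set(ancestral_sites),
    "seq_b": set(ancestral_sites) - {489},
}
print(ancestral_cells, rcrs_cells)             # 999 1221
print(unexpected_cells(toy, ancestral_sites))  # [('seq_b', 489)]
```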
Finally,
a check was made to see if any sequences were missing. A separate search using 295 instead of 16069 as the criterion found five additional candidates,
but further analysis showed that one of these belonged to Haplogroup I, another to
Haplogroup R1, and three to Haplogroup K1.
Thus, the set of 111 sequences extracted from GenBank was judged to be
complete.
These 111
sequences cannot be considered strictly representative of Hg J since they were
selected simply as all available, rather than being stratified by design. However, since the current goal is the
development of the cladistic structure of the
haplogroup, as distinct from a description of the geographic distribution or
other characteristics, this dataset appears to be an adequate sample for the
purpose. It certainly contains the
greatest number of full sequence Hg J records, representing the widest
geographic distribution, of any data set that has been assembled to date. Table 1 provides references to the
research papers describing studies that produced the Hg J sequences in the
GenBank database, the ethnicity or locality that the research covered, and the
accession identifiers for those records that were extracted.
A general
analysis of the matrix was conducted by counting the number of entries in each
column (i.e., for each polymorphism site found in the data) and analyzing the
counts within columns. Of the 333
polymorphic sites found in the matrix, 192 were found to be singletons (i.e.,
they occurred in only one sequence), 50 were doubletons, and 12 occurred only
three times. Since the goal is to
develop a basal structure for Hg J, polymorphic sites that occurred fewer than
four times were not included in the analysis.
Eliminating the 21 sites that are outside the Hg J phylogenetic
structure reduced the working matrix to 58 polymorphic sites. Insertions and deletions were initially
considered but their distribution was such that no pattern could be discerned
that would be useful in defining a clade or subclades of Hg J.
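The column-count filtering described above can be sketched in a few lines. This is only an illustration under stated assumptions: the matrix is represented as a dictionary from sequence id to a set of site numbers, and the function names and toy data are hypothetical. The arithmetic at the top simply restates the counts reported in the text.

```python
from collections import Counter

# Counts reported in the text.
TOTAL_SITES = 333
RARE_SITES = 192 + 50 + 12            # singletons, doubletons, tripletons
remaining = TOTAL_SITES - RARE_SITES  # sites seen four or more times
working = remaining - 21              # minus sites outside the Hg J structure


def site_counts(matrix):
    """Count in how many sequences each polymorphic site occurs."""
    counts = Counter()
    for sites in matrix.values():
        counts.update(set(sites))
    return counts


def working_sites(matrix, excluded_sites, min_count=4):
    """Sites occurring at least `min_count` times, minus excluded sites."""
    counts = site_counts(matrix)
    return sorted(site for site, n in counts.items()
                  if n >= min_count and site not in excluded_sites)


# Toy matrix with made-up sequence ids.
toy = {
    "s1": {462, 3010, 16069, 150},
    "s2": {462, 3010, 16069},
    "s3": {462, 3010, 16069},
    "s4": {462, 3010, 16069, 150},
}
print(remaining, working)                          # 79 58
print(working_sites(toy, excluded_sites={16069}))  # [462, 3010]
```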
Results
The Phylogenetic Tree
In
satisfaction of the first goal of the current study, Table 2 presents a
phylogenetic tree in table form. It is followed
by Table 3, the portion of the matrix from which the tree relationships
were extracted. In partial satisfaction
of the second goal, this tree includes not only identification of the
polymorphic sites that identify the branches of the tree, but it also shows
sites that might be helpful in predicting a clade or subclades when less than
full sequence data is available. Since this tree was ultimately derived using
only transition sites, the nucleotide designators have been dropped and only
site location relative to the rCRS is shown.
The
development of this tree employed a parsimony criterion, where the rows and
columns have been rearranged to show the relationships more clearly. Once rearranged, the deepest structure of the
tree is fairly obvious. First, 99 of the
111 sequences had both the control region 462 polymorphism site and the coding
region 3010, and none of the remaining 12 had either of these mutations. On the other hand, those 12 all had both
7476 and 15257, and neither of these polymorphism sites appeared in the first
99 sequences. Consistent with Herrnstadt (2002), Palanichamy
(2004), Carelli (2006), Ruiz-Pesini
(2007), and others, these two clades were designated as J1 and J2,
respectively. Note that this bifurcation
is complete: there were no sequences left over to be classified as any other
clade or as J*. This is in stark
contrast with the use of the classic motifs currently still in use for
classifying HVR1 only sequence data, as will be discussed further below.
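The J1/J2 bifurcation can be expressed as a simple rule over the four sites named above. The following is a minimal sketch, assuming each sequence is represented as a set of polymorphic site numbers; the function name and the "J*" fallback for mixed or missing evidence are illustrative choices, not part of the published method.

```python
J1_SITES = frozenset({462, 3010})
J2_SITES = frozenset({7476, 15257})


def top_level_clade(sites):
    """Assign J1 or J2 from the two pairs of defining sites.

    Returns "J*" when neither pair is fully present or when the evidence
    is mixed; in the reference data set this case did not occur.
    """
    sites = set(sites)
    if J1_SITES <= sites and not (J2_SITES & sites):
        return "J1"
    if J2_SITES <= sites and not (J1_SITES & sites):
        return "J2"
    return "J*"


print(top_level_clade({16069, 462, 3010}))    # J1
print(top_level_clade({16069, 7476, 15257}))  # J2
print(top_level_clade({16069}))               # J*
```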
Within J1,
indicators for three subclades were found.
Polymorphic site 8269 is present in 21 sequences and none other; 14798 is present in 71 sequences and none other; and 7963 is
present in 5 sequences and none other.
Following the recommendation of Palanichamy
(2004) and others, these three clades were designated as J1b, J1c, and J1d,
respectively. These strict criteria
account for 97 of the 99 sequences designated as J1, leaving two unassigned. A closer look, however, reveals that 16222
also occurs in all but two of the J1b sequences as assigned (and does not occur
elsewhere in the dataset), and that the combination of 16222 and 16261 occurs
primarily in J1b sequences and in very few others. Since one of the unassigned J1 sequences,
AF381987, has 16222,
16261, and 16145, all characteristic of J1b, the parsimonious
approach suggested its inclusion in the J1b clade, even without 8269. Possible explanations include a back mutation
at 8269 or a processing error.
Similarly, the other unassigned J1 sequence, EU007859, can be assigned
to J1c on the basis that it has both the 185 and 228
sites, one or both of which occur in all but two of the already assigned J1c
sequences and none other. Again, the
possible explanation is a back mutation at 14798. Thus, for the purposes of further analysis,
all J1 sequences have been assigned to either J1b, J1c, or J1d, with none left
over to be called J1*.
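The assignment of J1 sequences to subclades, including the two fallback cases, can be sketched as follows. This is an illustration only: the function name, the "ambiguous" label, and the order in which the fallbacks are tried are assumptions, and the fallback rules use only the supporting sites named in the text (16222 for J1b; 185 or 228 for J1c).

```python
# Coding-region sites treated as strict subclade indicators in the text.
DEFINING = {8269: "J1b", 14798: "J1c", 7963: "J1d"}


def j1_subclade(sites):
    """Assign a J1 sequence to J1b, J1c, or J1d.

    Strict indicators are tried first. If none is present, the supporting
    sites described in the text are used as a fallback.
    """
    sites = set(sites)
    hits = [clade for site, clade in DEFINING.items() if site in sites]
    if len(hits) == 1:
        return hits[0]
    if hits:
        return "ambiguous"    # more than one strict indicator present
    if 16222 in sites:        # reported only in J1b sequences
        return "J1b"
    if 185 in sites or 228 in sites:
        return "J1c"
    return "J1*"


print(j1_subclade({462, 3010, 8269}))                  # J1b
print(j1_subclade({462, 3010, 16222, 16261, 16145}))   # J1b
print(j1_subclade({462, 3010, 185, 228}))              # J1c
```

The second and third calls mirror the reasoning applied to AF381987 and EU007859, although the site sets shown are abbreviated stand-ins rather than the actual records.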
The
development process of selecting defining characteristics and assigning names
to these subclades continued in like manner through the entire matrix. Note that in the tree thus produced, the
locations in parentheses are informative but are not definitive by themselves
due to homoplasy (that is, these mutations appear in more than one subclade),
whereas those without parentheses are definitive. It is important to note that every clade
shown has at least one such absolute defining mutation, several of which are from the HVR2
region, but a few are from the coding region only.
In
defining Hg J itself, 16069 is a good indicator for classifying a sequence, but
HVR1 data alone is grossly inadequate in classifying sequences to its clades
and subclades—there are no HVR1 only criteria for distinguishing J1 from J2 at
their root. However, where HVR2 results
are available, site 462 is a good indicator for J1 and its absence is a good
indicator for J2. Thus while any attempt
at identifying the clades of Hg J using only HVR1 data will produce major
errors and using both HVR1 and HVR2 will work much better, coding region
indicators are required for definitive classification of all subclades of both
J1 and J2.
Several homoplasies were observed, mostly within the second
hypervariable region. As concluded by others (e.g., Helgason
2000, Behar 2007), sites 16311 and 16519 are too variable across the entire
phylogeny to be useful for classification. Sites 16145, 16193, and 16261,
three of the six polymorphisms used in the classic HVR1 motifs for predicting
the clades of Hg J (Macaulay 2000, and see next section), were also found to be
homoplasic. Nevertheless, these three sites
were found to be well defined within the phylogenetic structure of Hg J, are
informative, and are thus shown in both the phylogeny presented here and in the
associated HVR1 + HVR2 classification motifs. Sites 152 (in HVR2) and
7789 (in the coding region) were also found to be homoplasic,
but well structured and so were similarly included.
Classification Motifs
In
satisfaction of the second goal of the current study, a new set of
classification motifs was developed for use when only control region data is
available. Two sets of motif criteria
are presented in Table 4. The
first represents the “classic” motifs as originally presented by Richards et
al. (2000), based on HVR1 sequence data only.
Although significantly flawed, as pointed out above, it is still in use,
as in the Genographic project (Behar, 2007).
The second is a proposed motif chart that includes classification
criteria for use when both HVR1 and HVR2 data are available. These criteria cannot provide the same detail
as a full genome sequence and they can produce errors in classification, but they
are a significant improvement over the classic approach. To use these motifs, go to whichever chart
matches the data you have (HVR1 only, or HVR1 and HVR2), work your way down
from the first entry until you satisfy all the criteria in that entry. At that point stop and read off the clade
classification from the first column.
Note that even though one of the original goals called for a set of new
motifs for use with HVR1 only sequences, none has been presented here. As described above, any such attempt would be
seriously flawed. Instead, comments on
possible modification of the classic set of motifs are provided below, but the
decision was made not to create a new, but seriously flawed, HVR1-only
alternative.
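The procedure for reading a motif chart (work down from the first entry and stop at the first entry whose criteria are all satisfied) is a first-match lookup over an ordered list. The sketch below illustrates that procedure only. The three-entry chart is a hypothetical stand-in built from indicators mentioned in the text (16069 for Hg J, 462 for J1 versus J2, 16222 for J1b); it is not Table 4, and the names are assumptions.

```python
# Each entry: (clade, sites that must be present, sites that must be absent).
# More specific entries come first, since the first match wins.
# Hypothetical chart for illustration; it does not reproduce Table 4.
ILLUSTRATIVE_CHART = [
    ("J1b", {16069, 462, 16222}, set()),
    ("J1", {16069, 462}, set()),
    ("J2", {16069}, {462}),
]


def classify(sites, chart=ILLUSTRATIVE_CHART):
    """Return the clade of the first chart entry whose criteria all hold."""
    sites = set(sites)
    for clade, present, absent in chart:
        if present <= sites and not (absent & sites):
            return clade
    return None   # no entry matched, so the sequence is not classified


print(classify({16069, 16126, 462, 16222}))  # J1b
print(classify({16069, 16126, 462}))         # J1
print(classify({16069, 16126}))              # J2
```

Ordering matters in a first-match design: if the general J1 entry were listed before J1b, no sequence would ever reach the more specific entry.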

The
usefulness of any proposed set of prediction motifs is dependent on both the
completeness and accuracy of predictions.
Using the reference sequences as the source, Table 5 shows a
comparison of predictions provided by the new motifs to those produced by
analysis of the full sequence. Of the
111 predictions, all sequences were correctly placed in either subclade J1 or
J2, and within J1 and J2, only one sequence was placed in a subclade where it
did not belong (one sequence classified as J1b1a by full genome sequence (FGS) was assigned to J1b). Full sequence data, of course, provided more
precision for the lower level subclades.

Table 6 shows
comparable results when using the classic motifs. An application of the classic motifs failed
to allocate over 65% of the reference sequences to a subclade of Hg J (J1 or
J2). In addition, nine of the 111
sequences were placed in an inappropriate subclade. The inability to make assignments is
primarily due to the fact that there is no indicator for the J1c clade within the HVR1 sequence
and the fact that J1c makes up nearly 64% of the reference data set. The inaccuracy stems from the 1998 attempt to
develop a phylogeny and associated motifs based solely on HVR1 data (See
Richards, 1998 and 2000). Without HVR2
data J1c simply cannot be recognized.

Unfortunately,
no other full genome sequence data set is currently available for use in formal
validation of either the phylogeny or the prediction motifs. However, the effectiveness of these motifs can
be demonstrated by their application to records from MitoSearch that met the
criterion of having been sequenced for both HVR1 and HVR2. Table 7 shows the results. Whereas 61% of the 568 Hg J records from
MitoSearch had not been assigned to a subclade, after
applying the new motifs to reclassify the data, all were assigned to J1 or J2
or one of their subclades. Of the 348
previously unassigned (J*), 93% were assigned to J1c or one of its subclades.

It should
be pointed out that the reference dataset and that from MitoSearch represent
different populations. The reference
dataset derives primarily from world-wide academic research projects and is
considered to be the most diverse haplogroup dataset available at this
time. By contrast, the MitoSearch population
is that of genetic genealogy and for economic and cultural reasons is probably
heavily biased toward genetic origins in
With
respect to Hg J, it is suggested that the research community would be well
served if projects and testing companies were to acknowledge the problems
described in this report and inform their clients and researchers
accordingly. Unfortunately, because of
the lack of indicators in the HVR1 sequence, no change to the classic motifs can
correct for the inability to allocate a large percentage of Hg J tests
to subclades. On the other hand, the
large inaccuracy in assignment to what has been referred to as the J1a clade
can be corrected by reassigning the 16145-16231-16261 motif to J2a, or even just
J2, in consonance with the academic
research community as described above.
This is
currently an open-ended study. Not only
will these results be refined as new data warrants, but also analysis has begun
to establish age estimates for each clade.
In furtherance of this study and as a service to the community, a
discussion group has been established at http://tech.groups.yahoo.com/group/J-mtDNA/. To help control spam, membership is required
for posting and access to the archives, but membership is free.
Acknowledgements
Special
acknowledgement goes to
Web Resources
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=search&db=nuccore
Entrez Nucleotide, portal for GenBank, etc.
Mitosearch mtDNA Database
References
Andrews RM, Hubacka I, Chinnery PF, Lightowlers RN,
Turnbull DM, Howell N (1999)
Reanalysis and revision of the
Argus
Biosciences, “
Cann RL, Stoneking M, Wilson AC
(1987) Mitochondrial
DNA and human evolution. Nature,
325:31-36.
Carter RW (2007) Mitochondrial diversity within modern human populations. Nucl
Acids Res, 35:3039-3045. http://nar.oxfordjournals.org/cgi/content/abstract/35/9/3039?maxtoshow=&HITS=10&hits=10&RESULTFORMAT=1&title=mitochondrial+diversity&andorexacttitle=and&andorexacttitleabs=and&andorexactfulltext=and&searchid=1&FIRSTINDEX=0&sortspec=relevance&resourcetype=H
Detjen
KA, Tinschert S,
Greenspan
B (2007) Direct
submission of Family Tree
Herrnstadt C, Elson JL, Fahy E,
Preston G, Turnbull DM, Anderson C, Ghosh SS, Olefsky JM, Beal MF, Davis RE, Howell N (2002) Reduced-median-network
analysis of complete mitochondrial DNA coding-region sequences for the major
African, Asian, and European haplogroups.
Am J Hum Genet, 70:1152-1171. See also Elson (2007) for an update of the
phylogeny.
Logan
I (2007a) Mitochondrial
Logan I (2007b) A suggested genome for ‘Mitochondrial
Eve.’ J Genet Geneal, 3:72-77.
Macaulay V (2001) “Supplementary data from Richards et al.
(2000),” available at http://www.stats.gla.ac.uk/~vincent/founder2000/index.html.
Parsons
TJ (2005) Singular
nucleotide polymorphisms over the entire mtDNA genome that increase the
forensic discrimination of common HV1/HV2 types in ‘Hispanics.’ Unpublished.
Richards
M, Corte-Real H, Forster P, Macaulay V, Wilkinson-Herbots
H, Demaine A, Papiha S,
Hedges R, Bandelt HJ, Sykes B (1996) Paleolithic and Neolithic lineages in the
European mitochondrial gene pool. Am J Hum Genet, 59:185-203. See also the critique by L. L. Cavalli-Sforza and E. Minch
(1997) in
61:247-251 and the authors’
reply in 61:251-254.
Richards
M (2003) The
Neolithic invasion of Europe. Annu Rev Anthropol,
32:135-162.