A Comprehensive Analysis of mtDNA Haplogroup J
Abstract
To further the understanding of human genetic origins and migration
history, the Federal GenBank database was mined for all Haplogroup J full-genome
mtDNA sequences plus additional sequences that are complete for the coding
region. These data were used to develop a phylogeny for Haplogroup J, using a
matrix that shows the polymorphisms of each sequence organized within clades of
the haplogroup. The diversity within clades was then used to compute estimates
of the age of each clade. In the process, polymorphisms were analyzed to show
their relationship to various genes as well as to selected medical conditions
reported in the literature. Finally, the literature was reviewed for relevant
phylogeographic data toward the ultimate development of a comprehensive history
of human mtDNA Haplogroup J.
Introduction
The
analysis of mitochondrial
Until very recently, sequencing was typically limited to the control region
(displacement loop) of the mtDNA genome, which contains two hypervariable
regions (i.e., regions with significantly higher mutation rates than the coding
region) that provide relatively more information for a given length of
sequence. It was soon discovered, however, that these hypervariable regions
have significantly higher rates of back mutation and homoplasy (i.e., the
occurrence of a given polymorphism in more than one haplogroup, or even in more
than one clade of the same haplogroup), leading to ambiguity and uncertainty in
haplogroup assignment. Some studies then turned to selected markers from the
coding region for broad classification into haplogroups and used the results of
sequencing the hypervariable regions to develop the clade structure within each
haplogroup. This has been successful for many purposes, but can lead to errors
in specific haplogroups.
For example, although polymorphisms at nucleotide positions (nps) 16126
and 16069 are adequate for identifying Haplogroup J, and sequencing the
complete control region can provide some substructure for J1, there are no
polymorphisms within the hypervariable region 1 (HVR1) to cleanly differentiate
the J2 clades from those in J1 (Logan, 2008).
Whether the purpose was to study a geographic region, a specific disease, or
something else, scientists have now published a sufficient number of
full-genome sequences to permit the construction of a multi-level phylogeny for
Haplogroup J and the estimation of the ages of its various clades.
Methods
All mtDNA
sequence data used in the current analysis was extracted from the GenBank
database maintained by the
Table 1
Studies Cited and the Geographic Locations of the Haplogroup J Sequences Used
in the Present Study

Of the
156 sequences selected, 111 are the same sequences used in the previous study (
Each of
these sequences was parsed and a matrix was developed to include a column for
each sequence and a row for each polymorphism identified. This matrix is the reference for both a
detailed analysis of the polymorphisms (including a survey of medical
relationships) and the refinement of the Haplogroup J phylogeny. These data were used to compute the average
number of polymorphisms in each branch of the phylogeny and to estimate the age
of the clades in the phylogeny.
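As a minimal sketch of the matrix described above, the following Python fragment builds a table with one row per polymorphism and one column per sequence, then counts how often each polymorphism occurs. The sequence identifiers and polymorphism lists are illustrative placeholders rather than entries from the reference database, and the function names are hypothetical.

```python
def position(label):
    """Numeric rCRS position of a label such as 'T150C' or '309.1C'."""
    digits = "".join(ch for ch in label.split(".")[0] if ch.isdigit())
    return int(digits)


def build_matrix(calls):
    """Return (sequence ids, {polymorphism: [0/1 per sequence]}).

    calls maps a sequence id to the polymorphisms found in that sequence.
    Rows are ordered by genome position, columns by sequence id.
    """
    seq_ids = sorted(calls)
    call_sets = {s: set(calls[s]) for s in seq_ids}
    all_polys = {p for ps in call_sets.values() for p in ps}
    ordered = sorted(all_polys, key=lambda p: (position(p), p))
    matrix = {p: [1 if p in call_sets[s] else 0 for s in seq_ids]
              for p in ordered}
    return seq_ids, matrix


def occurrence_counts(matrix):
    """Number of sequences in which each polymorphism was observed."""
    return {p: sum(row) for p, row in matrix.items()}


# Illustrative input only; not taken from the reference database.
example_calls = {
    "SEQ_A": ["C16069T", "T16126C", "G5460A"],
    "SEQ_B": ["C16069T", "T16126C", "T150C"],
    "SEQ_C": ["C16069T", "T16126C", "T150C", "309.1C"],
}

seq_ids, matrix = build_matrix(example_calls)
counts = occurrence_counts(matrix)
singletons = [p for p, n in counts.items() if n == 1]
```

Row sums of such a matrix give the occurrence counts from which singletons and doubletons can be tallied, and the per-sequence columns give the raw material for branch lengths.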
However,
certain limitations of this matrix and its origins should be noted. First, although the ethnic origins of donors
were generally from European populations or from those located in the western
or southern regions of
Second,
there was no uniformity in the
Third, there are no data on either the age or the gender of the donors. As
shown in the section on Medical Implications below, both factors are
significant in the analysis of certain diseases and in longevity studies.
Analysis of the Polymorphisms
The
development of a phylogeographic analysis is dependent on both good geographic
data and characterization of the
Analysis
of 156 sequences of Haplogroup J identified 411 distinct polymorphisms, of
which 106 were observed three or more times.
The 243 singletons and 62 doubletons, representing almost three-fourths
of the total polymorphisms observed, are apparently rare within J. Although these rare polymorphisms are not
significant in defining the basal phylogeny, they are useful for inferring the
ages of the clades of that phylogeny, and as additional
The 16569
base-pair length of the Cambridge Reference Sequence (
Polymorphisms can be arranged into two major categories: those that involve
point substitution (transitions, transversions, and heteroplasmies) and those
that affect the length of the sequence (insertions and deletions, or indels).
If a mutation of the first category occurs within a gene, it has the potential
to change the protein that the gene encodes and thus to affect the phenotype.
However, because of redundancy in the genetic code, many of these mutations
(referred to as synonymous mutations) do not result in an amino acid
substitution and thus produce no change in the protein. A mutation of the
second category occurring within a protein-coding gene generally results in a
shift in the reading frame (unless its length is a multiple of three), which
can cause a complete failure of production of the prospective protein. On the
other hand, with the
possible exception of interfering with replication of the
A catalog
of polymorphisms was developed from the results of comparing each sequence in
the reference database with the revised Cambridge Reference Sequence (Andrews,
1999). The sequences in the reference
database, as identified in Table 1, were extracted from GenBank (Benson
et al., 2007) and the polymorphisms were identified through the use of
Greasemonkey (
A summary
of the type of polymorphism versus its locus class is provided in Table 2. Note, however, that polymorphisms in the
control region are probably under-reported since some of the sequences in the
reference database were not complete in that region. Of the 411 polymorphisms
detected, only 20 (about 5%) were insertions or deletions (indels), and these
occurred primarily in the non-coding region, with a few occurring within the
region that codes for the ribosomal RNA. The fact that none occurred in either
the protein-coding genes or the transfer RNAs is probably due to the
deleterious effects that would result, so that such mutations would not be
passed along in the germ line.
Table 2
Distributions of Types of Polymorphisms Across the Mitochondrial Genome

Of the 20 indels detected, one deletion and three insertions occurred within
regions defining ribosomal RNA. Each of these is associated with a tandem
repeat within that RNA, and thus the impact would be expected to be minimal.
For example, at positions 2141 through 2149 of the revised Cambridge Reference
Sequence (rCRS) there is a pattern of four AG repeats. The insertion shown as
2149.1A and 2149.2G simply extends this repeat sequence to five repeats. All
remaining indels occurred in non-coding regions, and all but three of these are
also associated with repeat sequences. For example, at positions 514 through
523 of the rCRS there is a pattern of five CA repeats, CACACACACA. There are
eight instances in which C522 and A523 are deleted, reducing the length to four
repeats, but there are also two instances of 523.1C and 523.2A, extending the
length to six, and one instance of 523.1C and 523.2C. (See Hurst (2007) for
further discussion of length heteroplasmies.)
Most of the insertions observed were associated with repeats of a single
nucleotide type, most commonly a C. For example, the 309.1C insertion was
observed 48 times in the sample set of 118 full-genome sequences. This
insertion relates to the well-known sequence from 303 through 315 of the rCRS,
which consists of seven C repeats followed by a T and then five C repeats,
CCCCCCCTCCCCC. The 309.1C notation indicates the insertion of a C after
position 309, that is, somewhere before the T in the above sequence. Associated
with the same sequence there were also insertions 309.2C, 310.1T and 315.1C.
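The insertion notation can be illustrated with a short Python sketch that applies labels such as 309.1C to the 303-315 segment quoted above. The helper name is hypothetical, and the sketch handles only insertion labels of the form position.orderBASE; it is not the procedure used to build the catalog.

```python
RCRS_303_315 = "CCCCCCCTCCCCC"  # rCRS positions 303-315 as quoted in the text
SEGMENT_START = 303


def apply_insertions(segment, start, insertions):
    """Insert bases after the positions named by labels such as '309.1C'.

    '309.1C' means the first inserted base after position 309 is a C;
    '309.2C' means the second inserted base after 309 is also a C.
    """
    by_position = {}
    for label in insertions:
        anchor, suffix = label.split(".")
        order, base = int(suffix[:-1]), suffix[-1]
        by_position.setdefault(int(anchor), []).append((order, base))

    result = []
    for offset, nucleotide in enumerate(segment):
        result.append(nucleotide)
        # Emit any inserted bases that follow this position, in order.
        for _, base in sorted(by_position.get(start + offset, [])):
            result.append(base)
    return "".join(result)


single = apply_insertions(RCRS_303_315, SEGMENT_START, ["309.1C"])
multiple = apply_insertions(
    RCRS_303_315, SEGMENT_START, ["309.1C", "309.2C", "315.1C"]
)
```

With 309.1C alone the first C tract grows from seven to eight bases; adding 309.2C and 315.1C gives nine C bases before the T and six after it.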
Of the substitutions, the vast majority (89% of the total) were simple
transitions, where a purine was substituted for a purine or a pyrimidine was
substituted for a pyrimidine. A little over 4% of the substitutions, however,
were transversions (mostly singletons), where a purine was substituted for a
pyrimidine or vice versa.
Less than 2% were heteroplasmies – a polymorphism within a single
organism where the state at a given locus in some
Table 3 shows how these polymorphism types were distributed throughout the
various segments of the mitochondrial genome. Note that, due to several small
overlaps in segment definitions, the lengths of the segments add to slightly
more than the 16569 base-pair length of the rCRS genome. As an indication of
the variability of polymorphisms across the genome, the table also shows the
polymorphism density, defined as the number of polymorphisms within a gene or
region divided by the length of that sequence. Note that, considering the small
numbers involved, the density of polymorphisms throughout the protein-coding
genes is fairly uniform, with an average of 2.1% compared to 8.6% for the
control region. This four-to-one ratio is no doubt low because of the
incompleteness of some of the available sequences, as described above. The
density of polymorphisms in the genes for ribosomal RNA is somewhat lower at
1.0%. The control region, which accounts for less than 7% of the mtDNA genome,
produced over 23% of the polymorphisms.
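The relationship between the density and share figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes that the control region spans rCRS positions 16024-16569 and 1-576 (1122 base pairs); that span is not stated in this section, and the implied count is a back-calculation from the quoted 8.6% density rather than a value read from Table 3.

```python
GENOME_BP = 16569            # length of the rCRS, from the text
TOTAL_POLYMORPHISMS = 411    # distinct polymorphisms, from the text
CONTROL_DENSITY = 0.086      # control-region density, from the text

# Assumption: control region = positions 16024-16569 plus 1-576.
CONTROL_REGION_BP = (16569 - 16024 + 1) + 576


def polymorphism_density(count, length_bp):
    """Polymorphisms per base pair for a gene or region."""
    return count / length_bp


# Fraction of the genome occupied by the control region.
length_share = CONTROL_REGION_BP / GENOME_BP

# Number of control-region polymorphisms implied by the quoted density.
implied_count = round(CONTROL_DENSITY * CONTROL_REGION_BP)

# Fraction of all polymorphisms that fall in the control region.
polymorphism_share = implied_count / TOTAL_POLYMORPHISMS
```

Under these assumptions the control region is a little under 7% of the genome and carries a little over 23% of the polymorphisms, which agrees with the figures in the text.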
Table 3
Statistical Distribution of Polymorphisms for Various
Regions of the Mitochondrial Genome

Medical Implications
A single nucleotide change within a sequence can cause deleterious or
advantageous changes in the performance of mitochondrial-coded products (e.g.,
proteins). Such changes can be inherited through the germ line from mother to
child, or they may occur somatically within selected tissues of the individual.
Several recent studies have shown correlations between the frequency of
selected mutations and a variety of diseases, as well as longevity itself. Such
correlation, however, does not necessarily imply a cause-and-effect
relationship. There are very complex relationships between
the workings of mitochondrial
Aging
and longevity, as complex traits having a significant genetic component, likely
depend on many nuclear gene variants interacting with mtDNA variability, both
inherited and somatic. We also surmise
that what we hypothesize for aging and longevity could have more general relevance
and be extended to other complex traits, such as age-related diseases like
cardiovascular diseases and diabetes . . .
and both
Alzheimer’s Disease and Parkinson’s Disease.
The description of such nuclear and mitochondrial
Table 4
Polymorphisms Observed in the Haplogroup J Reference Database that Have Been
Reported as Associated with mtDNA-Related Diseases

In a study of the relationships between mtDNA polymorphisms and aging, De
Benedictis et al. (1999) found that 23% of a group of centenarians in northern
In a similar study of an Irish population (Ross et al., 2001), Haplogroup J was
singled out for special study of longevity. No significant association was
found when considering that haplogroup as a whole. However, when they separated
the samples into two categories based on restriction fragment analysis, they
found that one category had a much higher frequency of centenarians than the
control group whereas the other had a much lower frequency. Then, in a later
paper (Ross et al., 2003), using the same population, they looked specifically
at Parkinson’s disease. They found that, of the 12% of the population that was
diseased, 2% were in one J group whereas 10% were in the other J group. They
called the first group J1 and the second J2 but, unfortunately, their
subdivision cannot be correlated with the subclades of J found in the present
study, since the polymorphic restriction sites have not been identified or
shown to correspond to any polymorphisms found in the reference database.
In a
related study of the control region only, Zhang et al. (2003) looked at 207
subjects from Northern, Central, and
The somatic event(s) at or near position 150 transition
may be part of a general remodeling of the mtDNA replication machinery,
probably nuclearly controlled. This remodeling could accelerate mtDNA
replication and compensate for the oxidative damage of mtDNA and its functional
deterioration occurring in old age.
The
current study found that T150C occurred exclusively in the J2 subclade of
Haplogroup J and is thus a strong indicator of that subclade, although not
definitive. The reason for this
phenomenon has not been determined.
The latest available study to look at the relationship between longevity and
Haplogroup J found no significant association among Ashkenazi Jewish
centenarians relative to their control group (Shlush et al., 2008). Although
they referenced the study by Zhang et al. (2003), which pointed out the
possible significance of the polymorphism 150C, they missed an opportunity for
follow-up testing in their well-defined and well-understood study population.
Unfortunately, position 150 is not within the narrow range of the control
region they sequenced (16024-16300). Similarly, they would need to acquire
additional test data to assess the possible broader relationship between
longevity and the J2 clade, for which 150C is an indicator.
The disease most commonly associated with mtDNA Haplogroup J is Leber’s
Hereditary Optic Neuropathy (LHON), also known as Leber Optic Atrophy (LOA).
This disease occurs about five times more frequently in Haplogroup J than in
the general population (Torroni et al., 1997). LHON is a maternally inherited
disease that presents in adolescence or adulthood and can lead to partial or
total blindness (Wallace, 1988). Although some twenty-five mtDNA variants have
been reported as related to LHON, the primary mutations are G3460A, G11778A,
and T14484C (Brown et al., 2002). One or another of these mutations is found in
ninety percent of the families with LHON, although they rarely occur together
(Johns Hopkins, 2008). Of the 156 sequences in the reference database, G11778A
occurred four times (twice in J1c4 and twice in J1d), T14484C occurred twice
(once in J1d and once in J2b1), and G3460A occurred once in J1c5. MitoMap
(Ruiz-Pesini et al., 2007) also listed two reports of progressive dystonia
associated with LHON, specifically with G11778A. The insulin resistance
associated with T4216C may simply reflect that position being a defining
mutation for the super-haplogroup JT.
Within the Haplogroup J population, the polymorphism most commonly associated
with either Parkinson’s or Alzheimer’s disease is G5460A, which, incidentally,
is one of the two definitive coding-region markers that define subclade J1b1.
In addition, both Parkinson’s and Alzheimer’s are highly correlated with
deterioration of mitochondrial performance, brought on by an increasing
frequency of polymorphisms, many or most of which are in heteroplasmic form.
MitoMap
showed a relationship between the T11084C polymorphism and the disease MELAS
(mitochondrial myopathy, encephalopathy, lactic
acidosis, and stroke-like episodes). A
search of the associated bibliography showed only a weak statistical
association and that the most common polymorphism for the disease is at
position 3243, which was not observed in the reference database. Finally, T16189C has been reported as being
associated with various diseases including type 2 diabetes,
cardiomyopathy, and e
There is
a major study currently underway in
A Refined Phylogeny
An
initial phylogeny for mtDNA haplogroup J was presented in an earlier paper (
As
described in the earlier paper, this phylogeny was developed using a maximum
parsimony approach ignoring insertions and deletions (see Analysis of The
Polymorphisms above). In addition, the
polymorphisms located at sites 16311 and 16519 were excluded from the analysis
as being too variable to be useful.
However, Hagelberg (2003) has suggested that
16311, and possibly 16519, could be the result of ancient recombination. No recent study has been found to support
this hypothesis. Future research may
ultimately show utility of these polymorphisms.
The refined phylogeny is presented in graphic form in Figure 1. The supporting
data are available in the supplementary files. Note that this chart includes
polymorphisms that are in parentheses or are underlined to indicate special
conditions. For example, the 185 and 228 shown as markers for J1c are both in
parentheses because they appear to be subject to back mutation: neither appears
in all samples of the J1c clade, and neither defines a proper subclade of J1c.
However, of the 74 full-genome sequences classified as J1c, all but two include
one or both of these markers, and there is only one occurrence outside the J1c
subclades. Similarly, the polymorphisms at 152 and 16193, shown in conjunction
with subclades of J1c, appear to have originated more than once within the
haplogroup. These and similar special markers are included as classification
aids for cases that are not full-genome sequences but do have sequences from
the control region.
Age of the Clades
One of
the first uses of molecular biology to determine the age of the human species
was just over 40 years ago. Sarich and Wilson (1967) looked at the variations of serum
albumins (a blood protein) in humans and non-human primates and concluded that
the split between Homo, chimpanzee, and gorilla was approximately 5 to 8
million years ago. For calibration, they
used the assumption that hominoids in general separated from the Old World
monkeys 30 million years ago. Within a
decade of that study, techniques were sufficiently developed to analyze the
Before
another decade was complete, excitement was aroused in the press and
anthropology community when Cann et al. (1987) used
mtDNA variations to propose that the current human population “stems from one
woman who is postulated to have lived about 200,000 years ago, probably in
Subsequently, Mishmar et al. (2003) used the 53 sequences of Ingman and
Gyllensten, but added 48 from African, Asian, European, Siberian and North
American populations, to conclude that there are significant differences
between geographic populations caused by natural selection brought on by
differences in climate and diet. Comparing the ratio of non-synonymous to
synonymous mutations within the various genes, they found significant
differences between tropical, temperate, and arctic-based populations. Based on
estimated coalescence dates for various haplogroups, they estimated the mtDNA
evolution rate to be 1.26 x 10^-8 substitutions per nucleotide per year.
An
alternate basis for calibration of substitution rates was demonstrated by Stoneking et al. (1992; 2005) by capitalizing on a founding
event to analyze the population of
The studies described above estimated mutation rates based on evolutionary
models, with calibration typically based on an assumed date of separation
between humans and chimpanzees. Attempts have also been made to compute
mutation rates directly from pedigree data. Early divergence estimates were
typically obtained using family data developed for disease studies and
consisting of very small sample sizes relative to the rates being estimated.
Nevertheless, the general conclusion was that divergence rates for pedigree
data were approximately an order of magnitude higher than evolutionary rates
(e.g., Howell et al., 2003). However, as described by
This is a
good point to note the imprecision of terminology between mutation rates and
substitution rates. Mutation rate has to
do with the actual change in a
The
problem of calibration and the variability of mutation rates across the
mitochondrial genome have been studied in some detail by Endicott and Ho
(2008). Eventually we will be able to
account for more of the variability in our analysis. In the meantime, the present work takes a
very straightforward but simplified approach for computing the ages of clades
of mtDNA Haplogroup J. A substitution
rate of 1.7 x 10^-8 substitutions/site/year for the coding region was
chosen as representative of the literature.
Using 15447 for the number of base-pairs in the coding region, this
converts to 3808 years per substitution.
For each clade the mean length of the branches (i.e., the average
number of substitutions observed back to the defining polymorphisms) is
multiplied by this factor. The result is
an estimate of the coalescence time, or Time to the Most Recent Common Ancestor
(TMRCA) of the members of that clade.
The result of these computations is given in Table 5 and shown on
a time-scaled phylogeny in Figure 2.
It should be noted that the standard deviation of length, and consequently
the range of age estimates, reflects the variability of the data; it is
not a confidence interval for the estimated age.
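The age computation described above can be sketched in Python as follows. The rate and coding-region length are the values given in the text; the branch lengths are illustrative only, the function names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev

SUBSTITUTION_RATE = 1.7e-8   # substitutions/site/year, coding region (from text)
CODING_REGION_BP = 15447     # base pairs in the coding region (from text)


def years_per_substitution(rate=SUBSTITUTION_RATE, sites=CODING_REGION_BP):
    """Expected years for one substitution to accumulate in the coding region."""
    return 1.0 / (rate * sites)


def clade_age(branch_lengths):
    """Return (TMRCA estimate, spread) in years for one clade.

    branch_lengths holds, for each sequence in the clade, the number of
    coding-region substitutions back to the clade-defining polymorphisms.
    The spread is the sample standard deviation scaled to years (assumed).
    """
    factor = years_per_substitution()
    average = mean(branch_lengths)
    spread = stdev(branch_lengths) if len(branch_lengths) > 1 else 0.0
    return average * factor, spread * factor


# Illustrative branch lengths only; not taken from Table 5.
example_lengths = [3, 4, 5, 4]
age, spread = clade_age(example_lengths)
```

With the stated rate and length the conversion factor rounds to 3808 years per substitution, so a clade whose branches average four substitutions would be dated to roughly 15,200 years under this simplified model.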
Table 5
Estimated ages of the clades of mtDNA Haplogroup J


Figure 2. Estimated ages of the clades of mtDNA Haplogroup J
These ages should be taken as indicating the approximate relative ages of the
clades. The astute reader will notice anomalies within these ages. For
example, mechanistic computation produced ages for J2 and J2a that are
somewhat older than that of J as the complete clade.
This is an artifact of the ra
After
describing caveats in their extensive review of status of mutation rates,
Bandelt et al. (2006) concluded that the
. . . extreme form of weighting that only accepts the
coding region but rejects the entire control region is at best provisional and
certainly not recommended in the long run.
An informed strategy would use rules to decode on a site-by-site basis and
contrast synonymous with non-synonymous mutations.
The
technology and data should be available to do such a study in the next few
years. For example, data collected in
association with the Genographic Project has been used to develop substitution
rates for a few selected polymorphisms within the coding region (Rosset et al.,
2008).
Origins and Migrations
There is
general agreement that there have been three major movements in the peopling of
One approach to developing such details is the use of genetics and founder
analysis to identify populations, date them using substitution rates for
calibration, and analyze the associated geographic data (Stoneking et al.,
1992). Phylogeographic analysis, that is, the geographic profile of clusters of
haplotypes, can provide the basis for inferring the geographic origins of
selected populations, and probably their migration paths. Such inferences take
on additional importance in anthropology and population genetics when they are
supported by studies from archaeology, climatology, ecology, and linguistics.
One of the earliest uses of the founder analysis approach was the work of
Torroni et al. (1992), which concluded that the Amerind and Nadene populations
of Native Americans were primarily from two independent migrations that
probably occurred several thousand years apart. However, using the more recent
technique of Bayesian skyline plot analysis (Drummond et al., 2005), Mulligan
et al. (2008) developed a three-stage model for the peopling of the Americas as
one long migration sequence with three identifiable stages: (1) divergence of
the Amerind ancestor from the Asian gene pool, (2) a prolonged period of
isolation, and (3) rapid expansion into the Americas with a large population
increase.
Comas et
al. (1997) demonstrated the potential of mtDNA founder analysis when they
analyzed data from nine distinct European and West Asian populations and
performed analyses to identify statistical similarities between them. Each population came from published samples
from a different research team that focused on a specific geographic area,
including a Basque, British, Sardinian, Swiss, Tuscan, Bulgarian, two different
Turkish, and a Middle Eastern region.
Although differences appeared to be quite low when compared to other
world populations (e.g.,
A
large-scale phylogeographic study of mtDNA in
Using a
much expanded study group, Richards et al. (2000) “formalized the procedure for
founder analysis, investigated the extent of confounding recurrent gene flow
between the putative source and derived populations, and developed criteria
that take into account the effects of both gene flow and recurrent
mutations.” Among their results, they
refined the overall age of Haplogroup J to 42,400-53,700 years as determined
from the Near East samples and to 23,000-27,400 years as determined from
European samples. The corresponding ages
for Haplogroup T are 41,900-52,000 and 33,100-40,200 respectively. Although these two clades were apparently
contemporary in the
In an
attempt to identify and describe the effects on mtDNA of “demographic phenomena
dating back to the Paleolithic, the Mesolithic, or the Neolithic” periods,
Simoni et al. (2000) collected 2619 mtDNA sequences for HVR1 distributed over
36 regions of Europe. Although the
sample size was relatively small in some regions, they developed an overall table
of frequencies for the major haplogroups in each of the regions. No occurrences of Haplogroup J were
identified in several regions such as
There is
not yet available a comprehensive founder analysis for Haplogroups J or T
throughout
The
origin of Haplogroups J and T in the
Malyarchuk
and his associates did a series of studies of Eastern European populations
relating to the origin of the Slavs:
Russians and Ukrainians (Malyarchuk and Derenko, 2001), Poles and
Russians (Malyarchuk et al., 2002), Bosnians and Slovenians (Malyarchuk et al.,
2003), and Czechs (Malyarchuk et al., 2006).
In each of these studies they found that most of the mtDNA sampled belonged to
western haplogroups (H, HV, J, T, U, N1, W, and X). Within this broad
similarity, they did find heterogeneity between regions, with a very broad
north-south correlation between their test populations and the corresponding
regions to the west. The overall frequencies of Haplogroups J and T found in
each region are shown in Table 6.
Table 6
Frequency of Haplogroups J and T within

Malyarchuk
and his associates also investigated the origin of the Roma (Gypsies) in
Technology
of extraction and analysis of mtDNA has progressed to the point where studies
of ancient
However, a recent study was conducted to provide “a more complete
characterization of the mitochondrial genome variability of the Basques”
(Alfonso-Sanchez et al., 2008). They sequenced HVR1 and HVR2 of 55 healthy men
selected to be unrelated based on three-generation pedigree charts. The most
interesting result from that study was the high frequency of J, especially J1c
and J2a, with frequencies of 10.9% and 3.6% respectively. This 14.5% total J is
in sharp contrast to the 2.4% commonly referenced for the Basques.
On the other hand, it is in line both with the results from ancient
Richards
et al. (2000) were cited above as the team that formalized founder analysis of
populations using mtDNA data.
Thirty-five team members were represented as co-authors of that paper
and the supplementary data they produced deserves a more detailed review. Their database (Macaulay, 2001) includes
results of HVR1 analysis of 4100 samples from 24 widely distributed regions of
the Near East and
For the
present study sample sizes and counts for Haplogroups J and T were extracted
from the Macaulay database for each geographical region, and the frequencies of
the corresponding haplogroups were computed.
The results are shown in Table 7.
The average length for each of the haplogroups is also shown.
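The frequency computation is simple enough to sketch in a few lines of Python. The region names and counts below are illustrative placeholders, not values from the Macaulay database, and the function name is hypothetical. As a consistency check, the sketch also shows that the Basque percentages quoted earlier (10.9% and 3.6% of 55 men) correspond to 6 and 2 individuals; those counts are back-calculated assumptions, not figures reported in this paper.

```python
def haplogroup_frequency(count, sample_size):
    """Frequency of a haplogroup in a regional sample, in percent."""
    if sample_size <= 0:
        raise ValueError("sample size must be positive")
    return 100.0 * count / sample_size


# Illustrative placeholder data: sample size and haplogroup counts per region.
example_regions = {
    "Region A": {"n": 200, "J": 18, "T": 22},
    "Region B": {"n": 50, "J": 3, "T": 6},
}

frequencies = {
    region: {hg: haplogroup_frequency(data[hg], data["n"]) for hg in ("J", "T")}
    for region, data in example_regions.items()
}

# Consistency check on the Basque figures, assuming 6 J1c and 2 J2a among 55.
basque_j1c = round(haplogroup_frequency(6, 55), 1)
basque_j2a = round(haplogroup_frequency(2, 55), 1)
basque_total_j = round(haplogroup_frequency(6 + 2, 55), 1)
```

Small regional samples make such percentages coarse: in a sample of 55, each additional individual moves the frequency by almost two percentage points.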
Table 7
Summary of Geographic Analysis of Haplogroup Data Extracted from Macaulay (2001)

Maps showing the frequency by region for Haplogroups J and T are presented in
Figures 3 and 4, respectively. However, caution must be exercised to avoid
reading too much into these maps. In addition to dealing with relatively small
numbers for some regions, many other factors must be taken into consideration
before actual migration paths can be drawn. More specifically, the current
frequency in any given region is affected by many factors: population movements
do not necessarily produce smooth gradients, but may instead cover relatively
long distances in an irregular manner; there are back migrations; a population
may be decimated by natural disasters or diseases; etc.
For example, a casual glance at the map for J might suggest that it
originated in what is now

Figure 3. Relative frequency (in percent) of Haplogroup J as derived from Macaulay (2001)

Figure 4. Relative frequency (in percent) of Haplogroup T as derived from Macaulay (2001)
This review of the literature
concerning the origins of the clades is representative but is certainly not
exhaustive. More work is required to
integrate results, but more importantly, new research is required to provide
more data and more complete data. There
are several reasons for this.
First, studies have not kept up
with the technology. For some
geographical areas, the only results available are from RFLP analysis. In other studies the sequence data was limited
to HVR1, sometimes complemented with RFLP typing and selective sequencing. Very few results are available for the entire
mitochondrial genome.
Second, knowledge of the general phylogeny of Haplogroup J is still evolving
and consensus has not yet been reached. The HVR1 motifs used by most of the
available studies are not adequate for high-resolution classification of
Haplogroup J, only for identifying the haplogroup as a whole. Most studies do
not even distinguish J2 from J1. Furthermore, errors have been identified, but
not all later studies have recognized these errors, or at least have not taken
them into consideration consistently.
Third, most basic research is
geographically very limited in scope, but then comparisons are made with data
from studies of other geographic areas--studies that may be inconsistent in
purpose.
Fourth, global databases (such as
GenBank) are a great asset for comparing sequences, but are not structured to
capture context data beyond literature citations. Founder analysis, for example, requires
location. Supplementary databases are
needed to cross reference each
With time, the improvements will be made, but of course, the technology will
have moved on. Nevertheless, the author, for one, expects to continue to review
the literature for data relevant to a better understanding of Haplogroup J and
to perform analyses toward that understanding, including refinement of the
classification structure, development of expanded databases, and integration of
the pieces into a global anthropology.
Conclusions
The work described in this paper
is a work-in-progress. It provides a broad
review of available data concerning mtDNA Haplogroup J and tries to contribute
to the evolving knowledge by developing a phylogeny and associated age
estimates. It must be noted, however,
that the quality of the product is limited by the techniques employed. For example, as stated above, a single
mutation rate cannot adequately represent the entire genome. Future analyses should consider both the
differences across the various types of gene (e.g., coding for protein versus
RNA) and even specific genes. For genes
that encode proteins, the analysis should differentiate those polymorphisms
that affect amino acid sequences from those that do not. Currently, neither the
size of the database nor knowledge of the various mutation rates is adequate
to take these issues into consideration.
Furthermore, as illustrated by the
discussion of Origins and Migrations, just the tip of the iceberg has been
addressed. Much work is needed to bring
together and integrate the many ongoing relevant studies. For example, no attempt has yet been made to
analyze population size growth for Haplogroup J. The potential for such analysis can be seen
in the study by Atkinson et al. (2008).
They employed the Bayesian skyline plot (BSP) with simulation (Drummond
et al., 2005) to “simultaneously estimate a posterior probability distribution
for the ancestral genealogy, branch lengths, substitution model parameters, and
population parameters through time.” Such
analyses can then be integrated with the archaeological record, legend, and
recorded history to develop a more complete story of Haplogroup J.
New studies are required, and their
data needs must be developed and integrated.
A single project that is both focused on Haplogroup J (and T) and of
broad geographic scope may not be feasible at this time. However, it is hoped that a consortium might
develop to permit multiple researchers to contribute to an appropriately
designed comprehensive project. The
author is currently administering a public discussion group and associated file
exchange to further the cause.
Interested persons may join through the link to the mtDNA Haplogroup J
Project shown under Web Resources, below.
Supplementary Material
Supplementary data is available
at:
http://www.jogg.info/42/logansuppl.xls
Web Resources
mtDNA Haplogroup J Project: http://tech.groups.yahoo.com/group/J-mtDNA/
Human Mitochondrial Genome Database
MtDB: Human Mitochondrial Genome Database
References
Andrews RM, Hubacka I, Chinnery PF, Lightowlers RN,
Turnbull DM, Howell N (1999)
Reanalysis and revision of the
Bandelt HJ, Kong QP, Richards
M, Macaulay V (2006)
Estimates of mutation rates and coalescence times. In: Bandelt HJ,
Macaulay V, Richards M (Eds.)
Nucleic Acids and Molecular Biology, Vol. 18,
Springer-Verlag.
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2007) GenBank.
Nucleic Acids Res, 35:D21-D25 (Database Issue). The
database is available at
http://www.ncbi.nlm.nih.gov/sites/entrez?db=nuccore.
Cann RL, Stoneking M, Wilson AC
(1987) Mitochondrial
DNA and human evolution. Nature,
325:31-36.
Detjen KA, Greenspan B (2007) Direct submission
of Family Tree
Hartmann A, Thieme M, Nanduri LK, Stempfl T,
Moehle C, Kivisild T, Oefner PJ (2008)
Validation of microarray-based sequencing of
93 worldwide mitochondrial genomes. Unpublished.
Herrnstadt C, Elson JL, Fahy E,
Preston G, Turnbull DM, Anderson C, Ghosh SS, Olefsky JM, Beal MF, Davis RE, Howell N (2002) Reduced-median-network
analysis of complete mitochondrial DNA coding-region sequences for the major
African, Asian, and European haplogroups.
Am J Hum Genet, 70:1152-1171.
See also Elson (2007) for an update of the phylogeny.
Ingman
M, Gyllensten U (2006) MtDB: Human
Mitochondrial Genome Database, a resource for population genetics and medical
sciences. Nucleic Acids Res,
34:D749-D751. The database is available
at http://www.genpat.uu.se/mtDB/.
Jobling MA, Hurles ME, Tyler-Smith C (2004) Human Evolutionary Genetics,
Logan Ian (2007) Mitochondrial
Macaulay V (2001) “Supplementary data from Richards et al. (2000),”
available at http://www.stats.gla.ac.uk/~vincent/founder2000/index.html.
Mitomap – A human mitochondrial genome database (2008), http://www.mitomap.org/
Parsons TJ (2005)
Single nucleotide polymorphisms over the entire mtDNA genome
that increase the forensic discrimination of common HV1/HV2 types in
‘Hispanics.’ Unpublished.
Rand DM (2001) The units of selection of mitochondrial
Richards
M, Corte-Real H, Forster P, Macaulay V, Wilkinson-Herbots
H, Demaine A, Papiha S,
Hedges R, Bandelt HJ, Sykes B (1996) Paleolithic and Neolithic lineages in the
European mitochondrial gene pool. Am J Hum Genet, 59:185-203. See also the critique by L. L. Cavalli-Sforza and E. Minch
(1997) in
61:247-251 and the authors’
reply in 61:251-254.
Richards
M (2003) The
Neolithic invasion of Europe. Annu Rev Anthropol,
32:135-162.
Sarich VM, Wilson AC (1967) Immunological
time scale for hominid evolution. Science,
158:1200-1203.
Stoneking M, Bhatia
K, Wilson AC (1986) Rate of sequence
divergence estimated from restriction maps of mitochondrial DNAs
from
Wills C (1995)
When did Eve live? An evolutionary detective story. Evolution, 49:593-607.