Mitochondrial DNA Control-Region Mutations at Positions 514-524 in Haplogroup K and Beyond

 

William R. Hurst

 

 

Abstract

 

Long neglected by scientists and mostly excluded from their phylogenetic trees, the variants at positions 00514-00524 in mitochondrial DNA were investigated to determine their usefulness within mtDNA haplogroup K and in the full mtDNA tree. The complex and diverse nomenclature for these variants had to be collected. The percentages of these heteroplasmic variants in the haplogroup K subclades were determined.  An attempt was made to establish what, if any, inheritance patterns could be found for these variants in K. How they differ from other mtDNA mutations and how they compare with Y-DNA mutations was investigated. The primary databases used were the mtDNA Haplogroup K Project and the federal GenBank. The few scientific papers on the variants were examined. A less detailed study was made of the variants as they appear in other mtDNA haplogroups. Rules which the variants appear to be following in K were matched against the conclusions of the scientific papers and the observations from the other haplogroups. Finally, areas for further research concerning these variants and other mtDNA mutations were presented.

 

 

 

 

Address for correspondence: wrhurst_17@msn.com.  W. R. Hurst is the Administrator of the Haplogroup K Project.

 

Received:  July 12, 2007; accepted:  August 30, 2007.


 

 

 

Introduction

 

This study began as an investigation of the variants at mitochondrial DNA positions 514 through 524 in sequences from Haplogroup K.  The Cambridge Reference Sequence (CRS) variant of these positions consists of alternating cytosine and adenine bases: CACACACACAC.  An early observation was that the incidence of variants containing insertions with respect to the CRS (the insertion variants) was significantly higher in K than in other mtDNA haplogroups, while the incidence of the CRS variant in K was somewhat lower and that of the deletion variant was significantly lower.  Later, using the few scientific papers on the subject, these positions in the other mtDNA haplogroups were investigated.  The insertions and deletions at these positions have not been well studied in the past for several reasons.  Early mtDNA papers focused on the first hypervariable region, HVR1.  Sometimes certain coding-region mutations were investigated, or sometimes HVR2 was included.  However, positions 514-524 are in the old HVR3, which has received even less attention.  These positions were considered unstable and too variable to be of help in defining subclades.  Due to their nature, they, along with certain other less stable mutations, may cause reticulations in phylogenetic trees, so they were usually excluded.  Scientists and testing companies could not even agree on what to call them.

 

The goal in the present study is to rectify the past neglect of these interesting mutations by (1) studying the added resolution that they bring to one mtDNA haplogroup, Haplogroup K, (2) looking at the few scientific papers that focused on them, (3) looking at their role in the mtDNA tree in general, and (4) summarizing what has been learned.  Suggestions for future research will follow.

 

Nomenclature

 

The first large hurdle that must be dealt with is nomenclature.  The CRS is the standard against which all mtDNA sequences are measured.  The current version is the Revised Cambridge Reference Sequence, or rCRS, but the common initials CRS will be used here (Anderson et al. 1981; Andrews et al. 1999).  The CRS has the sequence CACACACACAC from HVR (hypervariable region) positions 514 to 524.  Depending on where one starts counting, this 11-base sequence is composed of five CA (cytosine and adenine) dinucleotide pairs followed by a C, or of a C followed by five AC pairs.  Mutations occur at these positions when one or more pairs of bases are inserted or deleted.  In accordance with common practice, insertions and deletions are always measured in reference to the five CA pairs found in the CRS sequence (the CRS variant).
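The counting convention just described can be sketched in a few lines of Python.  This is only an illustration: the function names are hypothetical, and the input is assumed to be the bare 514-524 segment written as CA pairs followed by a final C, as in the CRS.

```python
import re

CRS_REPEATS = 5  # the CRS carries five CA pairs at positions 514-524


def count_ca_repeats(segment: str) -> int:
    """Count the CA pairs in a segment such as 'CACACACACAC'."""
    match = re.fullmatch(r"((?:CA)+)C", segment.upper())
    if match is None:
        raise ValueError(f"not a (CA)n string followed by C: {segment!r}")
    return len(match.group(1)) // 2


def describe_variant(repeats: int) -> str:
    """Describe a repeat count relative to the five CA pairs of the CRS."""
    diff = repeats - CRS_REPEATS
    if diff == 0:
        return "CRS variant"
    kind = "insertions" if diff > 0 else "deletions"
    pairs = abs(diff)
    return f"{pairs} pair{'s' if pairs != 1 else ''} of {kind}"


crs = "CACACACACAC"
print(count_ca_repeats(crs), describe_variant(count_ca_repeats(crs)))
```

A segment with one extra CA pair, CACACACACACAC, would be counted as six repeats and described as one pair of insertions.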

 

Family Tree DNA (FTDNA), whose sequences are used most often in this paper, has recently labeled the insertion of one CA pair as 524.1C, 524.2A and the deletion of one CA pair as 522-, 523-.  Additional insertions are shown as 524.3C, 524.4A, etc.  No second pair of deletions has been observed in the FTDNA databases.  However, in some older FTDNA test results indels (insertions or deletions) are shown as 524.1A, 524.2C, etc. or 523-, 524-.  The Sorenson Molecular Genealogy Foundation (SMGF) uses the latter set of designations.  Other DNA testing companies use different systems of reporting these.  Relative Genetics uses 523.1C, 523.2A for the insertions.  Argus BioSciences uses 524insA, 524insC and 522delC, 523delA, or even 524insAC for a pair of insertions.  Wilson et al. (2002b), representing the Federal Bureau of Investigation forensic unit, recommended 524.1A, 524.2C and 523D, 524D.  Ian Logan’s mtDNA database commonly uses 523.C, 523.A for the insertions and “C522., A523.” for the deletions; more often, however, they are not listed for each separate sequence but under “variable changes” at the beginning of a page of sequences.  Kivisild et al. (2006), which contains the most recent detailed mtDNA tree, uses 523+CA and 523+2(CA) for one and two pairs of insertions and 522-523d for the deletions.  The scientific papers discussed below use a different approach; they simply report the number of repeats.  So the CRS variant is “allele 5” or “(CA)5”, with one pair of deletions as allele 4 or (CA)4, and one pair of insertions as allele 6 or (CA)6, etc.  The Mitomap database lists other scientific papers which refer to insertions and deletions at almost every position between 514 and 524.  The main point to remember is that all of these systems are describing exactly the same things.  Here the terms CRS variant, one or more pairs of insertions, and one or more pairs of deletions, will be used.
Also, the term “position 524” or just “524” will be used instead of 514-524, because there is no way to determine in a string of (CA)n, exactly where any CA insertion or deletion has occurred.  “Variants,” “insertions” and “deletions” will refer to the position 524 variants, unless otherwise specified.  Another point is that the insertions or deletions of C or A never occur individually, but always in CA pairs–except in the rare case of a point mutation (single base change) occurring at one of the positions. 
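Because all of these systems describe the same repeat counts, one notation can be generated from another.  The sketch below converts a repeat count into the recent FTDNA labels and the “(CA)n” form quoted above.  The function names are hypothetical, and the handling of more than one deleted pair is an assumption (the text notes that none has been observed in the FTDNA databases, so the sketch simply refuses that case).

```python
CRS_REPEATS = 5  # five CA pairs in the CRS


def repeat_label(repeats: int) -> str:
    """Repeat-count notation used in the scientific papers, e.g. '(CA)6'."""
    return f"(CA){repeats}"


def ftdna_label(repeats: int) -> str:
    """Recent FTDNA-style label for a given number of CA pairs."""
    diff = repeats - CRS_REPEATS
    if diff == 0:
        return "CRS"
    if diff == -1:
        return "522-, 523-"
    if diff < -1:
        # Assumption: no label is quoted in the text for a second deleted pair.
        raise ValueError("no FTDNA label known for more than one deleted pair")
    # Each inserted pair adds a C and then an A after position 524.
    parts = []
    for i in range(1, 2 * diff + 1):
        base = "C" if i % 2 == 1 else "A"
        parts.append(f"524.{i}{base}")
    return ", ".join(parts)


for n in (4, 5, 6, 7):
    print(repeat_label(n), "=", ftdna_label(n))
```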

 

Another nomenclature factor is that the sequence 514-524 is part of the original HVR3 (aka HVS-III) section of the hypervariable region or hypervariable segment (HVS), which runs from positions 438 to 534.  All of the HVR regions together, plus a few other locations, are also known collectively as the displacement loop or D-loop or control region.  FTDNA includes HVR3 as part of its HVR2 test.  Argus Biosciences includes HVR3 as part of its HVR package.  Relative Genetics offers HVR3 separately (Relative Genetics is being acquired by Ancestry.com, effective by the end of 2007).

 

References will be made below to sequences in FTDNA’s MitoSearch database, the mtDNA Haplogroup K Project (which included 321 high-resolution HVR1+HVR2 sequences as of July 23, 2007), and the federal GenBank database.  In MitoSearch, sequences are always labeled as just K, while in the K Project about 10% of the total sequences or 14% of the high-resolution sequences have confirmed subclade (also called subhaplogroup) designations based on full-sequence tests.  GenBank sequences vary in how they are labeled, based on their origin.  Most subclade designations are those from Behar et al. (2006, Fig. 1), referred to below as the “Behar K tree.” Subclade designations of sequences not confirmed by full-sequence tests are as predicted by the author.  Additional provisional subclade designations used in this article are those of the author and may change when a new authoritative K tree is published.

 

Definitions

 

Mutations

 

Mitochondrial DNA mutations are most commonly single nucleotide polymorphisms (SNPs), in which one of the four bases or basic units of DNA, cytosine (C), guanine (G), adenine (A), or thymine (T), is replaced by one of the others.  The most common replacements (greater than 95%), C to T and vice versa, and A to G and vice versa, are called transitions; all other replacements are called transversions.  Mutations may also consist of a base being inserted or deleted (indels).  Those types of de novo mutations are similar to those in nuclear DNA, including Y-chromosome DNA; in fact, SNP mutations are exactly the same in mtDNA as in Y-DNA.  Indels at mtDNA locations 514-524 are similar to the short tandem repeats (STRs) in Y-DNA (Chung et al. 2005).  Only one copy of Y-DNA is transmitted from father to son, since there is only one copy of the Y chromosome in each cell.  For mtDNA, each cell may contain hundreds or even thousands of mitochondria and therefore hundreds or thousands of copies of mtDNA, so that multiple copies of mtDNA are transmitted from mother to child.  However, the number of copies transmitted is limited by bottlenecks in egg development (Shoubridge et al. 2007).  A de novo mutation may occur in just one of the many mtDNA copies, lie undetected for generations, and then by random chance later become the dominant variant.  Turner (2006, Fig. 1) has a diagram showing how a variant can go from being undetectable to being either the dominant variant or disappearing completely.
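The distinction between transitions and transversions can be expressed compactly.  The following sketch uses the standard grouping of bases into purines (A, G) and pyrimidines (C, T), which the text does not name but which yields exactly the C/T and A/G pairs listed above; the function name is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base replacement."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a valid base replacement: {ref!r} -> {alt!r}")
    # A transition stays within one chemical class (A<->G or C<->T).
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(classify_substitution("T", "C"))  # the 16093 T-to-C change is a transition
print(classify_substitution("A", "C"))
```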

 

Heteroplasmy

 

The phenomenon of different mtDNA variants being found in different mitochondria or in different cells in the same person is known as heteroplasmy.  Point heteroplasmy (or structural heteroplasmy) is the term used when different SNP variants are found in a cell.  Length heteroplasmy is the occurrence of any mixture of a CRS variant and insertions or deletions in a given region (Scientific Working Group on DNA Analysis, 2003).  Insertions and deletions are often found where there are strings of the same base; most commonly these are poly-cytosine stretches, sequences with several C bases together (Carter, 2007, pp 3-4).  The suggested mechanism for this type of mutation is the same as that for the common short tandem repeats (STRs) in Y-DNA: replication slippage, where the DNA replication system loses count of the number of repeats of the same base or combination of bases (Howell, 2000, p. 1596).  The multi-CA string of bases at mtDNA position 524, being composed of repeats of two different bases, is the most similar to that of Y-DNA STRs.  Not coincidentally, as with Y-DNA, the average mutation rate for mtDNA indels may be higher than that for SNPs, although there are exceptions in haplogroup K.  See Dupuy et al. (2004) for an extensive discussion of the effect of repeat counts or allele length, and the number of nucleotides in each repeat, on Y-DNA STR mutation rates.

 

Strictly speaking, when the term heteroplasmic mutation is used, or when heteroplasmy is used as a noun, what is usually meant is a situation where two or more variants for the same position are detected by an mtDNA test.  Where the heteroplasmy is due to SNP variants, a set of IUPAC (International Union of Pure and Applied Chemistry) codes is available to report it; 16093Y, for example, would mean that both the mutated version 16093C and the CRS variant 16093T were detected in a sample (Scientific Working Group on DNA Analysis, 2003).  FTDNA and most other companies performing mtDNA tests for genetic genealogy do not use the codes; apparently, they simply report the variant with the highest percentage.  Reportedly, for full-sequence tests FTDNA reports both bases when heteroplasmy is present; but those results are not normally available to anyone except the test subject.  Relative Genetics does use the codes.  However, there are no IUPAC codes for length heteroplasmies such as those that occur at position 524.
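As an illustration of how the IUPAC codes work for point heteroplasmy, the sketch below maps a pair of detected bases to the standard two-base ambiguity code.  The function name is hypothetical; only the two-base codes are included, which is enough for the 16093Y example.

```python
# Standard IUPAC ambiguity codes for two-base mixtures.
IUPAC_TWO_BASE = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def point_heteroplasmy_label(position: int, bases: str) -> str:
    """Label a two-base point heteroplasmy, e.g. (16093, 'CT') -> '16093Y'."""
    key = frozenset(base.upper() for base in bases)
    if key not in IUPAC_TWO_BASE:
        raise ValueError(f"expected two different bases from ACGT, got {bases!r}")
    return f"{position}{IUPAC_TWO_BASE[key]}"


print(point_heteroplasmy_label(16093, "CT"))
```

No comparable table can be written for position 524, since the IUPAC scheme has no symbol for a mixture of lengths.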

 

Heteroplasmy and heteroplasmic mutation are often used more loosely to explain why certain mutations, SNPs or indels, occur by the inheritance of different variants between generations.  Apparently, even if the mutated variant is not detected in the mother, by the normal random processes of cell division and replication, the mutated version may be passed to the child and sometimes become dominant in the child or a later descendant.  Thus, a mutation caused by heteroplasmy may appear without there having been either a recent actual replacement of one base with another or a replication slippage.  The child simply inherits a different dominant variant from that dominant in the mother (Turner 2006, Fig. 1).  The heteroplasmic mutations in the tree appear to be following their own hidden, seemingly mysterious, inheritance patterns.  One may think of them as underground rivers, occasionally popping to the surface and then receding – or a parallel system or a second layer of mutations, or mutations lurking below the radar. Pick your favorite analogy.
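The “now visible, now hidden” behavior described here can be imitated with a toy simulation.  The sketch below is not taken from any of the cited papers: it assumes a very simple model in which each generation draws a fixed number of mtDNA copies at random from the mother's pool (a hypothetical `bottleneck_size`), and it tracks the fraction of copies carrying the minority variant.

```python
import random


def simulate_lineage(start_fraction, bottleneck_size, generations, seed=0):
    """Track the fraction of variant copies down a maternal line.

    Each generation, bottleneck_size copies are sampled at random from the
    mother's pool; the child's variant fraction is the sampled proportion.
    """
    rng = random.Random(seed)
    fraction = start_fraction
    history = [fraction]
    for _ in range(generations):
        variant_copies = sum(
            rng.random() < fraction for _ in range(bottleneck_size)
        )
        fraction = variant_copies / bottleneck_size
        history.append(fraction)
    return history


# A variant starting at 5% of copies, followed for 30 generations.
for generation, fraction in enumerate(simulate_lineage(0.05, 20, 30, seed=1)):
    print(generation, f"{fraction:.2f}")
```

Under this model a variant that reaches a fraction of 0 or 1 stays there, since no new mutation is introduced; different seeds give different fates for the same starting fraction.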

 

Perhaps there was an intermediate step where, using the strict definition of a heteroplasmic mutation, both variants were detectable.  The key word here is “detectable,” since those heteroplasmies that are not detectable by the direct sequencing method commonly used by testing companies – which would require perhaps 20% for the minority variant to be observed –  may be detectable at 5% by other methods (Tully et al. 2000).  In fact, detection of heteroplasmies as low as 1-2% has a special name: microheteroplasmy (Smigrodzki and Khan, 2005).

 

The Behar K tree demonstrates the problem which the effects of undetectable heteroplasmy cause with trees created with software such as Fluxus-Engineering’s Network program.  To prevent reticulations caused by heteroplasmic and other recurrent mutations, Behar excluded our 524 insertions as well as the positions 309 and 315 insertions and certain other HVR and coding-region mutations.  And yet, there are patterns involving position 524 in the K subclades.  The 524 insertions are found in certain subclades, but not in others; and likewise the deletions.  These patterns will be discussed in detail for each subclade below.  Even adding them back to the data used for the Fluxus diagram does not always explain the appearances of the 524 indels.  Turner (2006) expressed the situation well in the title of an article in this Journal: “Now You See It, Now You Don’t: Heteroplasmy in Mitochondrial DNA.”  We will see below how this system works in the K subclades for the position 524 variants.

 

We see that a mutation reported for a person may have occurred in two general ways: (1) by a de novo mutation similar to a nuclear DNA mutation, either by a base replacement (SNP) or by replication slippage, or (2) by inheritance of a heteroplasmic variant.  Often it is not obvious by which method a mutation has occurred.

 

In the context of heteroplasmy, the term “fixed” means that only one heteroplasmic variant is inherited by the founder of a subclade.  If a different variant appears later in that subclade or a lower subclade, it may be assumed that there has been a de novo mutation.  “Fixed out” means that a particular variant is missing from the group of inherited variants.  If that variant later appears in that subclade or one of its descendant subclades, it again may be assumed that there has been a de novo mutation.  Tully et al. (2000) discuss the term “fixed.”  A related term is “resolved.”  If a woman with a strict heteroplasmy (two or more variants detectable) has a descendant with only one variant detectable, the position is said to be resolved at that variant.  A progression over many generations might be (1) a woman with only the T or CRS variant detectable at position 16093, (2) a heteroplasmy such as 16093Y – both C and T variants detectable, (3) a descendant with the position resolved at 16093C – the T variant no longer detectable, (4) a descendant with the position fixed at 16093C – with the T variant nonexistent for all practical purposes.  Sigurðardóttir et al. (2000, p 1606) stated “Furthermore, the processes by which heteroplasmy is resolved – and, hence, the likely long-term fate, in descendants, at the site that is heteroplasmic – does not seem well understood.”  The difficulty we face is that, for example, when FTDNA says that a person has 16093C, it is not obvious whether the variant is fixed or resolved, or whether they have just picked the majority variant of a heteroplasmy.
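The relationship between “heteroplasmic,” “resolved” and “fixed” can be summarized as a function of the minority variant's share and the sensitivity of the test.  The sketch below is a simplification with a hypothetical function name; the default threshold of 0.20 is the “perhaps 20%” suggested above for direct sequencing, and 0.05 stands for the more sensitive methods.  In practice the true minority fraction is unknown, which is exactly why a reported result cannot distinguish the last two states.

```python
def reported_state(minority_fraction: float, detection_threshold: float = 0.20) -> str:
    """Classify a position given the true share of the minority variant."""
    if not 0.0 <= minority_fraction <= 0.5:
        raise ValueError("minority fraction must be between 0 and 0.5")
    if minority_fraction == 0.0:
        return "fixed"          # the other variant is absent altogether
    if minority_fraction < detection_threshold:
        return "resolved"       # present, but below what the test can see
    return "heteroplasmic"      # both variants detectable


# The same 10% minority looks different to tests of different sensitivity.
print(reported_state(0.10, detection_threshold=0.20))
print(reported_state(0.10, detection_threshold=0.05))
```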

 

Haplogroup Notation

 

For any mtDNA haplogroup, there are often several levels of subclades or subhaplogroups.  For this article, the major or high-level K subclades are K1, K1a, K1b, K1c and K2.  All others will be called “lower” subclades.  An example of the full list of subclades down one branch of the K tree is K, K1, K1a, K1a1, K1a1b, K1a1b1 and K1a1b1a.  Except when specified below, subclade counts do not include that of their lower subclades.  The analogy to a tree trunk with smaller and smaller branches and twigs is not perfect; in the above example K1a1b1a happens to be larger than most of its parent subclades when their lower subclades are not included.

 

Points of Conundrum

 

For this article the term points of conundrum will be used for certain branching points on the K phylogenetic tree which are clearly defined by coding-region or HVR mutations, but which also appear to originate the length-heteroplasmic variants at position 524, or to pass them between generations and nodes on the tree, through the only occasionally visible heteroplasmic system.  The reason for using the new term is not that a new method of heredity has been discovered, just that the effects of undetected heteroplasmic mutations have not been widely discussed.  Typically, a subclade which has haplotypes with more than one variant divides into two or more lower subclades with different combinations of the variants.  Table 1 shows the percentages of each type of variant (deletions, CRS and insertions) in Haplogroup K as a whole, along with the same information from the Sorenson Molecular Genealogy Foundation (SMGF), representing all haplogroups.

 

In Table 1, the percentages of the position 524 variants for the members of the mtDNA Haplogroup K Project are those of the Family Tree DNA high-resolution (HVR1+HVR2) members as of July 23, 2007.  The SMGF (Sorenson Molecular Genealogy Foundation) percentages are from their Top 50 Mutations list as of July 10, 2007.  The position 524 insertion variants are 4.6 times more likely in the K Project than in the SMGF database.  The deletions are 7.6 times less likely in K than in SMGF.  The CRS variant percentage is roughly the same for the two databases, with that for the SMGF database slightly higher.  Both databases are probably over-weighted toward USA and Northern Europe samples, so the worldwide percentage of insertions is probably lower than that shown and the percentage of deletions may be higher.
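The two ratios quoted above follow directly from the Table 1 percentages, as this short check shows (the dictionary layout is only a convenience for the example):

```python
# Percentages of position 524 variants as given in Table 1.
table1 = {
    "K Project": {"deletions": 2.2, "CRS": 68.4, "insertions": 29.4},
    "SMGF": {"deletions": 16.8, "CRS": 76.8, "insertions": 6.4},
}

# Insertions: how much more frequent in the K Project than in SMGF.
insertion_ratio = table1["K Project"]["insertions"] / table1["SMGF"]["insertions"]

# Deletions: how much less frequent in the K Project than in SMGF.
deletion_ratio = table1["SMGF"]["deletions"] / table1["K Project"]["deletions"]

print(round(insertion_ratio, 1))  # 4.6
print(round(deletion_ratio, 1))   # 7.6
```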

 

 

Table 1.  524 Variants in Haplogroup K

 

 

| | Deletions % | CRS % | Insertions % |
|---|---|---|---|
| K Project | 2.2 | 68.4 | 29.4 |
| SMGF | 16.8 | 76.8 | 6.4 |

 

 

 

Table 2 illustrates the percentages of each variant in most K subclades.  The subclades listed include those from the Behar K tree which have examples in the K Project confirmed by full-sequence tests or known examples in GenBank, plus provisional subclades used by the author: K1a10, K1a11, Pre-K1a9 and Pre-K1a10.  Those with plus signs, K1a+, K1b+, K1c+ and K2+, include not only samples which have been assigned high-level subclade designations after full-sequence tests, but also samples from the K Project that have not been tested adequately to determine their possible membership in a lower subclade.  These may eventually move into one of the more specific lower subclades listed.  The Counts column lists the number of examples of each subclade from the K Project (KP) and GenBank (GB).  The GenBank examples include the 121 full sequences used in the Behar K tree except for those marked “H” (for Herrnstadt) which, until recently, were not in GenBank.  Even now the published Herrnstadt sequences do not include HVR mutations.  Added are several other K examples listed on Ian Logan’s website.  A very few lower subclades on the Behar K tree do not have confirmed examples in the K Project or known examples in GenBank–with HVR mutations–and so are not listed.  The next six columns are percentages for the 524 variants.  The last column is the combined percentage for the insertion pairs.  Deletion variant counts are marked in tan, with yellow used for the CRS variant and blue used for the insertion variants.  Sequences from FTDNA’s MitoSearch database were also examined, but for the sake of consistency and avoidance of duplications, only the K Project sequences were counted in Table 2.  Also, although MitoSearch has more K entries than the K Project, none of those are labeled with subclade designations.
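How the percentage columns and the combined insertion column relate to the underlying samples can be sketched as follows.  The function name and the three-sample input are hypothetical; the input is a list with one CA repeat count per sequence.

```python
from collections import Counter

CRS_REPEATS = 5  # five CA pairs in the CRS


def variant_percentages(repeat_counts):
    """Return ({repeats: percent}, total insertion percent) for a subclade."""
    total = len(repeat_counts)
    if total == 0:
        raise ValueError("at least one sequence is required")
    tally = Counter(repeat_counts)
    per_allele = {
        repeats: round(100 * n / total) for repeats, n in sorted(tally.items())
    }
    # Every allele longer than the CRS counts toward the combined insertions.
    inserted = sum(n for repeats, n in tally.items() if repeats > CRS_REPEATS)
    return per_allele, round(100 * inserted / total)


# Hypothetical subclade: one sequence with a deleted pair, two with the CRS variant.
print(variant_percentages([4, 5, 5]))
```

With only three sequences a single sample moves a column by 33 points, which is why the percentages for the small subclades should be read with caution.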

 

 

 

 


Table 2.  Percentages of Position 524 Heteroplasmic Variants in Haplogroup K Subclades

 

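| Subclade | Counts | 522-,523- % | CRS % | 524.1,524.2 % | 524.3,524.4 % | 524.5,524.6 % | 524.7,524.8 % | 524 Total Inserts % |
|---|---|---|---|---|---|---|---|---|
| Repeats | | 4 | 5 | 6 | 7 | 8 | 9 | |
| K2+ | 11-KP | | 100 | | | | | 0 |
| K2a | 34-KP,19-GB | 6 | 94 | | | | | 0 |
| K2a1a | 1-GB | | 100 | | | | | 0 |
| K2a2 | 1-GB | | 100 | | | | | 0 |
| K2a2a | 8-KP,2-GB | | 100 | | | | | 0 |
| K2a3 | 2-GB | | 100 | | | | | 0 |
| K2a4 | 1-GB | | 100 | | | | | 0 |
| K2c | 1-GB | | 100 | | | | | 0 |
| K1 | 1-KP | | 100 | | | | | 0 |
| K1c+ | 14-KP | 21 | 79 | | | | | 0 |
| K1c1 | 1-KP,8-GB | | 100 | | | | | 0 |
| K1c1a | 1-GB | 100 | | | | | | 0 |
| K1c1b | 4-GB | | 100 | | | | | 0 |
| K1c2 | 26-KP,1-GB | | 96 | 4 | | | | 4 |
| K1a+ | 67-KP | 1 | 70 | 24 | 4 | 1 | | 29 |
| K1a1 | 1-KP,1-GB | | 100 | | | | | 0 |
| K1a1a | 1-KP | | 100 | | | | | 0 |
| K1a1b | 1-KP,1-GB | | 100 | | | | | 0 |
| K1a1b1 | 2-KP,1-GB | 33 | 67 | | | | | 0 |
| K1a1b1a | 30-KP,7-GB | | 97 | 3 | | | | 3 |
| K1a6 | 2-GB | | 100 | | | | | 0 |
| K1a7 | 1-GB | | 100 | | | | | 0 |
| K1a8 | 3-GB | | 100 | | | | | 0 |
| K1a11 | 8-KP | | 100 | | | | | 0 |
| K1a3 | 1-GB | | 100 | | | | | 0 |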