Patterns of Haplogroup R1b in the
Kevin D. Campbell
The recent availability of Y-
Address for correspondence: Campbell@alum.mit.edu
The lack of reliable data can be attributed to a number of causes. These include: inconsistent use of markers and nomenclature, cost involved with extensive panels of markers, and a number of other issues that are familiar to most academic and amateur genetic researchers. However, it is suggested that there are two root causes that have significantly hindered population analysis – (1) the lack of uniformly collected, independently verified data sets and (2) the tendency of researchers to shield and obfuscate their analysis.
The major factor contributing to the first is the nature of the submission process at the largest databases, which results in non-validated and unsubstantiated data. The sad truth is that many of this field’s largest databases rely on data and geographic inputs provided by enthusiastic, but uninformed individuals. While transcription and upload errors for YSearch and SMGF may be very low, many errors creep in related to marker translation and geographic speculation. Though some might dismiss marker translation as insignificant, errors in geographic data are not. In fact the essence of population analysis is the deduction and inferences that can be made about how the Y-STR data relates to specific geographic areas or population migrations.
As an example, in the
The aforementioned statistics call into question the reliability of information provided by individual participants and suggest that organized, structured data-gathering studies might provide better sources of data. However, some respected researchers have published popular texts with new theories while providing insufficient information or omitting critical linkages that might facilitate a formal and critical review.
For example, though Capelli’s
Though Bryan Sykes has made the data available that he used in his recent popular book “Blood of the Isles”, he also leaves out critical linkages between the particular haplotypes and conclusions that he draws in the text. This is particularly frustrating since while the haplogroups in his study are reasonably distinguishable, the assertions and theories that he makes about specific subclades are not easily reviewed in the context of the supporting data. The present article will fill in some of this missing information.
Similarly, most people would consider the recent book, The Origins of the British, by Stephen Oppenheimer (2006) to be an authoritative work in this rapidly evolving field. While a substantial portion of Oppenheimer’s Y-STR study also uses the Capelli data, he assigns new haplotype labels to the data – “16 distinct types of R1b” – without ever providing the detail necessary to link his haplotypes to the underlying data. As is the case with Sykes’s work, this makes any real academic or juried review of his conclusions impossible and lessens the usefulness of his work to other researchers. In some cases, like the aforementioned Pict test, researchers have been quick to partner with testing companies to make money from their private theories.
The purpose of this paper is to examine the underlying Sykes R1b data and see whether it can be linked to his founder haplotypes and the conclusions of his analysis. The goal is to provide additional insight into the work of these researchers and to make it more useful to the individual genetic genealogists who look to their data as a link to the past.
This study focuses on Haplogroup R1b, which comprises the vast majority of the
Step 1 - Coding of the OGAP Data
For data collection, Oxford Genetic Atlas Project (OGAP) data was downloaded from Bryan Sykes’s web site and converted from PDF to Microsoft Excel format.
2,322 samples were then coded by haplogroup using
Sykes included the following description of the data in the supplementary data file:
Y-chromosome DNA (yDNA) - Samples collected early in OGAP were amplified across the following seven markers: DYS 19, DYS 389i, DYS389ii, DYS 390, DYS 391, DYS 392, DYS 393 using conditions described by de Knijff et al (International Journal of Legal Medicine, 110: 134-140, 1997). Later samples were typed for these and three additional markers: DYS 388, DYS 425 and DYS 426, using the two-stage multiplex conditions described by Thomas et al. (Human Genetics, 105: 577-581, 1999). Alleles are reported as the number of repeat units. For reasons of continuity within OGAP, DYS 389i is reported as three repeats lower than the allele size produced by the ABI 3100. DYS 398ii-i reports the difference between 389i and 389ii, the reason being that the repeat size at 389ii is not independent of 389i whereas the difference between them is. Although Y-chromosomes were assigned to clades, largely by RFLP [Author - restriction fragment length polymorphism] analysis, these assignments are not reported here as they do not necessarily correspond to the SNP-based system recommended by the Chromosome Consortium (Genome Research 12:339-348, 2002).
Geographical distribution - Genetic data are assigned to geographical regions based on the birthplace/residence of the paternal grandfather. This was done to minimize the effect of very recent migration. The regional boundaries are shown on a map which precedes the Prologue in Blood of the Isles. These data are copyrighted and must not be reproduced without permission. Other formats and additional details may be available for academic collaborations.
Several things are worth noting about the data. First, the Sykes data only uses 10 markers (DYS19, DYS389i, DYS389ii, DYS390, DYS391, DYS392, DYS393, DYS388, DYS425 and DYS426). In addition, only approximately 64% of the data are complete, with 36% of the data missing four markers: DYS439, DYS388, DYS425 and DYS426. While the missing markers appear to be a serious shortcoming, 94% of all the DYS425 and DYS439 markers and 73% and 74% of the DYS426 and DYS388 markers in Sykes’s full data set, respectively, have a value of 12. This means that these markers do not have sufficient spread and variability and are, in general, of limited use in discriminating between haplotype patterns within this set of data.
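The percentages quoted above can be checked against the allele counts given in the article's note on the full data set. The following is a minimal sketch of that arithmetic; the variable and function names are illustrative and are not part of the original analysis.

```python
# Counts of samples reporting an allele value of 12, out of the 1496
# samples for which these markers were reported (figures from the
# article's note on the full data set).
REPORTED = 1496
ALLELE_12_COUNTS = {
    "DYS425": 1412,
    "DYS439": 1411,
    "DYS426": 1088,
    "DYS388": 1108,
}


def modal_share(count: int, total: int) -> float:
    """Fraction of reported samples that carry the given allele."""
    return count / total


shares = {m: modal_share(c, REPORTED) for m, c in ALLELE_12_COUNTS.items()}

for marker, share in shares.items():
    print(f"{marker}: {share:.0%} of reported samples have allele 12")
```

A marker whose most common allele covers this large a share of the samples contributes little to distinguishing one haplotype from another, which is the point made in the text.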
Another interesting fact is the haplogroup distribution of the data. Because the original data lack haplogroup designations, Athey’s haplogroup calculator was used as a proxy to classify each haplotype. After several haplotypes were removed because of missing data, Table 1 shows a comparison of the results of the haplogroup calculator with Sykes’s published “Clans.”
In this table, percentages shown in the middle column reflect the output of Athey’s calculator while the percentages in the far right column correspond to the breakdown published by Sykes in Appendix C of his book.
It is clear from this comparison that Athey’s calculator classifies the data in proportions similar to Sykes’s Clans, and thus one may infer the underlying meaning of Sykes’s clan nomenclature.
In addition, it is important to note that Sykes’s Clan categorizations were not based primarily on single nucleotide polymorphism (SNP) testing. As stated in the italicized quote above, “Y-chromosomes were assigned to clades, largely by RFLP analysis, these assignments are not reported here as they do not necessarily correspond to the SNP-based system recommended by the Chromosome Consortium.”
Finally, one should understand the regional borders that Sykes uses in his study. Since the purpose of this analysis is to draw geographic inferences, we are limited in our insight by the definition of the geographic areas from which the data is collected. Figure 1 shows the regional borders that are coded in the OGAP data.
Table 1. Calculated Haplogroups vs. Sykes Clans
Step 2 – Extracting R1b (Oisin) Data
The 1625 haplotypes identified as R1b in the previous step were extracted from the data set. These included all of the R1b haplotypes shown in Sykes Table 1, plus those for
Figure 1. Regional Borders Used in the OGAP Analysis to Classify Individuals
The first thing that was done to better understand the data was to identify modal haplotypes for each region as a descriptive view of the R1b data set. However, examination of the modal haplotypes for the individual regions was not informative because all regions and the full data set matched the standard Atlantic Modal Haplotype.
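To illustrate why regional modal haplotypes can fail to discriminate, the sketch below computes a modal haplotype by taking the most common allele at each marker independently. The region names and allele values are hypothetical toy data, not OGAP samples, and the per-marker definition of "modal" is an assumption about how the modes were derived.

```python
from collections import Counter

MARKERS = ("DYS19", "DYS390", "DYS391", "DYS392", "DYS393")

# Hypothetical toy samples: (region, allele values in MARKERS order).
SAMPLES = [
    ("RegionA", (14, 24, 11, 13, 13)),
    ("RegionA", (14, 25, 11, 13, 13)),
    ("RegionA", (14, 24, 10, 13, 13)),
    ("RegionB", (14, 24, 11, 13, 13)),
    ("RegionB", (14, 23, 11, 13, 13)),
    ("RegionB", (15, 24, 11, 13, 13)),
]


def modal_haplotype(haplotypes):
    """Most common allele at each marker position, taken independently."""
    columns = zip(*haplotypes)
    return tuple(Counter(col).most_common(1)[0][0] for col in columns)


def modal_by_region(samples):
    """Group samples by region and compute each region's modal haplotype."""
    by_region = {}
    for region, hap in samples:
        by_region.setdefault(region, []).append(hap)
    return {region: modal_haplotype(haps) for region, haps in by_region.items()}


modes = modal_by_region(SAMPLES)
for region, mode in modes.items():
    print(region, dict(zip(MARKERS, mode)))
```

In this toy data the two regions contain different minority haplotypes, yet both collapse to the same modal values, which mirrors the "averaging out" described in the text.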
A view of the R1b data in the form of a connected graph – as shown in Figure 2 – shows a high degree of “cubism.” By this I mean a high degree of nodal interconnectivity among the data points that results in opposite vertices “washing out” differences in the data.
Clearly, data analysis based upon unique combinations of markers (i.e. haplotypes) instead of individual markers would be necessary.
Figure 2. Network Analysis of the Top Twenty R1b Haplotypes
Step 3 – Analysis of Haplotypes
Since descriptive statistics tended to “average out” differences in the data, other methods were needed to identify patterns and analyze the data. To do this, the most common haplotypes were identified and two methods of analysis were performed. Appendix A shows the haplotypes in the OGAP data. The OGAP haplotypes roughly follow the frequency distributions seen in YSearch and in McEwan’s (2007) groups, if one takes into account that the OGAP data is light on Irish samples in comparison with these other sources.
The OGAP designations in this table were assigned sequentially in decreasing frequency of occurrence. The OGAP numbers from this table will be referenced throughout the remainder of this report.
Two types of analysis were conducted to identify patterns – affinity analysis and network analysis.
For affinity analysis of the haplotypes, an Excel spreadsheet was developed to look for patterns and anomalies in the data. An algorithm was created that took as an input the 10 marker values for a haplotype or signature to be reviewed and then compared that haplotype to the R1b subset of the database. The algorithm calculated the genetic distance and reported back the number of perfect (i.e., zero distance) matches by OGAP region. To account for the differing level of sampling in each region (e.g., small for
Table 2. Example of Identifying OGAP8 Affinities
For example, haplotype OGAP8, which is generally considered the quintessential Irish haplotype, has 34 perfect matches in the database. Because
This analysis was repeated for the top 20 haplotypes in the OGAP data. These haplotypes, which cover 60% of all OGAP samples, appear sufficient to identify major regional affinities. Analysis of additional haplotypes would be increasingly subject to sampling error.
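The affinity calculation described above was carried out in an Excel spreadsheet. The sketch below restates the same idea in Python under stated assumptions: genetic distance is taken as the sum of absolute allele differences, and regional sampling is adjusted for by subtracting the number of matches expected from each region's share of the samples. The article's exact weighting formula is not recoverable from the text, so `regional_excess` is one plausible reading, and the sample data are hypothetical.

```python
from collections import Counter


def genetic_distance(a, b):
    """Sum of absolute allele differences across markers (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def perfect_matches_by_region(query, samples):
    """Count zero-distance matches to `query` in each region."""
    hits = Counter()
    for region, hap in samples:
        if genetic_distance(query, hap) == 0:
            hits[region] += 1
    return hits


def regional_excess(query, samples):
    """Observed matches minus the count expected from regional sample size."""
    sizes = Counter(region for region, _ in samples)
    hits = perfect_matches_by_region(query, samples)
    total_hits = sum(hits.values())
    total = len(samples)
    return {
        region: hits[region] - total_hits * size / total
        for region, size in sizes.items()
    }


# Hypothetical three-marker samples: (region, haplotype).
QUERY = (14, 24, 11)
TOY_SAMPLES = (
    [("North", QUERY)] * 3
    + [("North", (14, 25, 11))]
    + [("South", QUERY)]
    + [("South", (14, 23, 11))] * 5
)

excess = regional_excess(QUERY, TOY_SAMPLES)
print(excess)
```

A positive value marks a region where the haplotype occurs more often than sampling alone would predict; a negative value marks the opposite.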
The results of the analysis of the top 20 haplotypes are shown in Table 3. It should be noted that in this table negative values were removed to reduce clutter, and significant geographic anomalies were color coded to aid in identifying tendencies. Finally, especially interesting results were boxed to help in later discussion in this paper.
License was also taken in reordering the rows and columns of the table in an attempt to group similar haplotypes and neighboring regions. While such analysis is called “affinity analysis” and can be conducted mathematically, this analysis was done manually to better allow for subjective considerations of the data.
The second type of analysis that was performed was network analysis. This analysis, which is common in the genetic sciences, was conducted using the Fluxus Networking program version 126.96.36.199. Figure 3 shows the results of the network analysis for the top twenty OGAP haplotypes.
It should be noted in Figure 3 that nodes have been relocated and line length changed for increased readability. Nodes have also been colored to reflect the regional affinities identified in Table 3.
Analysis of the Oxford Genetic Atlas Project data has yielded interesting results. The combination of the geographic affinity results shown in Table 3 and the network analysis results shown in Figure 3 is synthesized in Figure 4. In this graphic, key haplotypes with strong regional affinities were placed in their rough geographic perspective. No attempt was made to force every haplotype somewhere on the map, as it is obvious that at this level of analysis some haplotypes are pervasive and ubiquitous and not easily generalized to a single geographic region.
Table 3 – Haplotype Affinity by OGAP Region
Once located, haplotypes that differed by a single mutation were connected with lines. Figure 4 reflects the general interconnectivity resulting from the network analysis of Figure 3. The lines in Figure 4 should be thought of as one possible path of migrations – not necessarily the only path. The interconnections shown in Figure 4 are not based on any individual mutation rates; rather, they are based upon the occurrence of mutations, the principle of parsimony, and the general south-to-north flow of R1b discussed by Sykes and Oppenheimer. Parsimony, in this case, reflects the generally acknowledged flow from higher concentrations of haplotypes to lower, more diffused concentrations.
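The rule of connecting haplotypes that differ by a single mutation can be sketched as follows. It is assumed here that "a single mutation" means a one-repeat change at exactly one marker; the haplotype names and values are hypothetical and do not correspond to the OGAP designations.

```python
from itertools import combinations


def is_single_step(a, b):
    """True if exactly one marker differs, and by exactly one repeat."""
    diffs = [abs(x - y) for x, y in zip(a, b) if x != y]
    return diffs == [1]


def single_step_edges(haplotypes):
    """All pairs of named haplotypes that are one mutation apart."""
    return [
        (n1, n2)
        for (n1, h1), (n2, h2) in combinations(haplotypes.items(), 2)
        if is_single_step(h1, h2)
    ]


# Hypothetical three-marker haplotypes.
TOY_HAPLOTYPES = {
    "H1": (14, 24, 11),
    "H2": (14, 25, 11),
    "H3": (14, 25, 10),
    "H4": (15, 23, 11),
}

edges = single_step_edges(TOY_HAPLOTYPES)
print(edges)
```

The resulting edge list only records adjacency. Choosing a direction for each line, as the text notes, depends on additional considerations such as parsimony and the assumed direction of migration.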
Figure 4. Geographic Patterns of R1b in the
Some of the observations and conclusions of this analysis are as follows.
1. The methodology clearly identified and quantified what has been previously called the Irish subclade. Whether called the Irish Modal Haplotype or the “Ui Neill haplotype” as in the
2. Similarly OGAP10 which shares the
3. Interestingly, OGAP5 is a very prevalent haplotype that also shows up predominantly in
4. OGAP19 is interesting in that it shows an extreme correlation with both
5. OGAP4 is particularly intriguing. It is ubiquitous across all
6. OGAP6 is prominent in Argyll and
7. OGAP9 and OGAP11 both show an affinity for both the Northern Isles and the Borders regions. This affinity is distinctive, but the author is unqualified to venture a theory that might explain this geographic discontinuity.
8. OGAP13 and OGAP17 both show a clear affinity for
9. OGAP14, OGAP16, and OGAP20 all show a common regional affinity. Though their presence in Tayside is very slightly higher than in
10. OGAP7 seems to be most prevalent
11. The core haplotypes for the full
Through the analysis of Sykes’ OGAP data, this study has provided a means of linking DNA results to the haplotypes and conclusions in Sykes’ book, “Blood of the Isles.” The study has confirmed Sykes’ interpretation of the data and, hopefully, provided a means for other researchers to further validate and extend his work. The study confirmed some subclades identified by Sykes and also identified some new subclades worthy of further research. Key subclades that the study posits and which are defined by Sykes include those of the Picts and the Dal Riada Celts.
It is asserted that OGAP4 best represents the Pictish ancestry of
Dal Riada Celts – When considered in a narrow genetic sense, the Gaels of Ireland, as identified by the
Several interesting clusters were identified that show geographic affinities but discontinuities. Scottish clusters OGAP9 and OGAP11 have a strong presence in the Borders as well as the Northern Isles. English clusters OGAP14, OGAP16, and OGAP20 show a predisposition to both
Capelli, C. et al. 2003 Data Set: http://freepages.genealogy.rootsweb.com/~gallgaedhil/Capelli.htm
John McEwan’s R1b Haplotypes:
Whit Athey’s Haplogroup Predictor:
Capelli C, Redhead N, Abernethy JK, Gratrix F, Wilson JF, Moen T, Hervig T, Richards M, Stumpf MP, Underhill PA, Bradshaw P, Shaha A, Thomas MG, Bradman N, Goldstein DB (2003) A Y chromosome census of the British Isles. Curr Biol 13:979–984.
Oppenheimer S (2006) The Origins of the British - A Genetic Detective Story, Constable and Robinson,
Sykes B (2006) Blood of the Isles: Exploring The Genetic Roots of Our Tribal History. Bantam Books. Published in the
Appendix A – OGAP Haplotypes
The 1625 data points that comprise the R1b data set include 291 separate haplotypes. However, 50% of the data can be accounted for with only 10 haplotypes; 60% by 20 haplotypes; and 68% by 30 haplotypes. In addition, haplotypes beyond the top 30 have only single-digit frequencies compared to the most frequent – the Atlantic Modal Haplotype (AMH), which occurs 262 times in the data.
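The coverage figures above (share of the data accounted for by the top N haplotypes) follow from a simple frequency count. The sketch below shows the calculation on hypothetical toy labels; only the final line uses figures from the text (262 occurrences of the AMH among 1625 R1b samples).

```python
from collections import Counter


def cumulative_coverage(haplotypes, top_n):
    """Fraction of all samples accounted for by the top_n most common haplotypes."""
    counts = Counter(haplotypes)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(haplotypes)


# Hypothetical toy data: ten samples across four haplotype labels.
TOY = ["A"] * 5 + ["B"] * 3 + ["C"] + ["D"]

for n in (1, 2, 4):
    print(f"top {n}: {cumulative_coverage(TOY, n):.0%}")

# Share of the R1b data set taken by the AMH, using the counts in the text.
amh_share = 262 / 1625
print(f"AMH share: {amh_share:.1%}")
```

Applied to the full list of 1625 haplotypes, the same function would reproduce the 50%, 60% and 68% thresholds reported for the top 10, 20 and 30 haplotypes.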
Below are the top 50 haplotypes in order of descending frequency. These are numbered OGAP1 through OGAP55 for reference in this study and represent all haplotypes that occur more than 5 times. The distribution of these haplotypes in Sykes’ OGAP data and in YSearch (www.ysearch.org, as of December 2006) is shown on the right hand side of the chart.
Also, a mapping of John McEwan’s (2007) R1b subclades is shown on the left hand side of the chart. The letters designating the McEwan group refer to the groups described in Appendix B that were assigned when McEwan individual haplotypes were grouped together to reflect the much smaller number of markers in the OGAP data.
Appendix B – McEwan’s R1b Haplotypes Reduced
By mining YSearch and collecting similar 37-marker haplotypes into clusters, McEwan (2007) has identified a large number of R1b “types” that comprise the world-wide scope of this data. It is interesting to consider how the Sykes OGAP data relates to McEwan’s haplotypes. However, to compare the data, McEwan’s modal haplotypes had to be collapsed into smaller groups to reflect the smaller number of markers in the OGAP haplotypes (refer to Appendix A).
The following is the reduction of the McEwan haplotypes to 10-marker haplotypes. In each case, letter designations have been included here for the purpose of mapping the full set of McEwan haplotypes (designated R1bSTR##) to the reduced set used in this analysis (Letter Groups). While this exercise necessitates the loss of considerable resolution in the McEwan haplotypes, the exercise is included here to provide traceability to the analysis included in Appendix A.
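The reduction described above amounts to projecting each longer haplotype onto the smaller marker set and grouping the types that become identical. A minimal sketch follows, assuming each type is stored as a mapping from marker name to allele value; the type names, marker subset and values are hypothetical and are not McEwan's actual modal haplotypes.

```python
from collections import defaultdict


def project(haplotype, markers):
    """Keep only the alleles for the given markers, in a fixed order."""
    return tuple(haplotype[m] for m in markers)


def collapse(types, markers):
    """Group type names whose haplotypes coincide on the reduced marker set."""
    groups = defaultdict(list)
    for name, haplotype in types.items():
        groups[project(haplotype, markers)].append(name)
    return dict(groups)


# Hypothetical types defined on three markers; DYS385a stands in for a
# marker that is absent from the reduced panel.
TOY_TYPES = {
    "TypeX": {"DYS19": 14, "DYS390": 24, "DYS385a": 11},
    "TypeY": {"DYS19": 14, "DYS390": 24, "DYS385a": 12},
    "TypeZ": {"DYS19": 14, "DYS390": 25, "DYS385a": 11},
}
REDUCED_MARKERS = ("DYS19", "DYS390")

groups = collapse(TOY_TYPES, REDUCED_MARKERS)
print(groups)
```

Here TypeX and TypeY, which differ only on the dropped marker, fall into the same group. This is the loss of resolution the text refers to, and each such group would correspond to one letter designation.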
 In the case of academic researchers, many keep tight control of their data. It is hoped that in the future scientific journal editors will require the submission of supporting data, which they might hold for a period of time even after publication of analytic articles. Such submissions would ensure that important research is fully documented even if it is years later, when the article is no longer at the forefront of everyone’s mind.
 For the Campbell Project, the distribution of the birth year of the oldest proven ancestor of the participants is as follows: 4% earlier than the 1600s, 7% in the 1600s, 56% in the 1700s, and 33% in the 1800s. There is no reason to expect this to be anything but representative, and in fact, one could be convinced that some of these participants are outliers with longer than usual paper genealogies.
 For those markers that have been reported in the full data set, the counts of samples with an allele value of 12 are: DYS425 (1412/1496), DYS439 (1411/1496), DYS426 (1088/1496), DYS388 (1108/1496).
 The results column of Table 2 has the same relative weighting as if samples observed were divided by sample size (e.g. Ireland = 2/17) but this is just an alternative formula.
 OGAP haplotypes below OGAP30 have single digit sample sizes.
 Mark McDonald writes, “The Dal Riada (Dalriads) leadership who came from Ireland in circa 500AD into what is now Argyll spoke a language akin to what is now called Erse (Irish Gaelic to the Scots) and introduced that language into Scotland - the root of modern Scots Gaelic. They were called 'Scoti' by the Romans it is said - a word for 'raider' used in those days, and it is the root of the name
 When writing about Argyll, Sykes writes, “However, the genetic signal, as far as I can judge, points to a substantial, and by the look of it, hostile replacement of Pictish males by Dalriadian Celts, most of whom relied on Pictish rather than Irish women to propagate their genes.” Sykes, Blood of the Isles, page 210.
 Other researchers suggest that this haplotype might be Dal Riadic Celt (see:
However, the ubiquitous presence of
 Again, this analysis confirms the statement on page 239 of Sykes’s book, “The Atlantis Chromosome, the prevalent Y chromosome in the Clan, is very frequent
 Sykes, Blood of the Isles, page 282. It should be noted that Oppenheimer writes very little about the Picts in his book, Origins of the British. The reason may be that Oppenheimer’s analysis, based on the Capelli data, relies on only 6 markers instead of 10. The lack of markers DYS439, DYS389-1, and DYS389-2 causes the Pict haplotypes (OGAP4) to be grouped with, and masked by, OGAP2 and OGAP12 in the Capelli data.
 Sykes, Blood of the Isles, page 214 – “So far, we have four possible influences on the genetic structure of the people of Scotland, first the Picts, then the Gaels of Ireland, synonymous with the Celts, the Vikings, and in the south of Scotland particularly, the Anglo-Normans.”