Patterns of R1b in the
Kevin D. Campbell
Stephen Oppenheimer’s book, The Origins of the British: A Genetic Detective Story, uses a clan nomenclature that is neither explicitly defined in the text nor linked to the underlying data. This paper attempts to understand Oppenheimer’s analysis while incorporating results from subsequent clan testing to hypothesize the haplotype definitions for Oppenheimer’s R1b sub-clans.
Address for correspondence: Campbell@alum.mit.edu.
Received: July 12, 2007; accepted: August 30, 2007
The essence of genetic genealogy is to understand where we’ve come from. However, many recent papers and books have missed opportunities to provide strong links relating genetics to regional locations. Specifically, two recent
In the case of Sykes’ work, Blood of the Isles was targeted to the general population and written in the manner of a popular work of non-fiction. In contrast, Stephen Oppenheimer’s book The Origins of the British synthesizes historical, anthropological, archaeological, linguistic, and genetic evidence into a cohesive set of conclusions.
While Oppenheimer chose to include the term “genetics” in the title of his book, it comprises only a part of his overall analysis. However, by leading with genetics, Oppenheimer owes a certain level of traceability to the reader to allow a thorough and detailed review of his analysis.
To his credit, Sykes has provided the samples underlying Blood of the Isles for examination, but he did not provide a detailed analysis of those samples for the reader. For example, Sykes does not describe his “clan” system in adequate detail. A “clan” is a group of individuals with closely matching Y-
analysis of Sykes’ clans, which included the determination of the probable clan definitions, was published in the last issue of this journal (
In this article I will attempt to achieve the same general result for Oppenheimer’s book. In particular, the present article will attempt to provide some insight into the probable definitions of Oppenheimer’s clans.
Sykes’ and Oppenheimer’s analyses have both similarities and differences that affect how one might approach reverse-engineering each. Both authors chose to coin “clan names” as shorthand monikers for genetic groups derived from their analysis. While these clan names provide convenient shorthand for mass market books, serious researchers want to see the genetic definitions of these groups.
Sykes’ work is based upon original data collection (primarily via blood samples), and he has published his full dataset on his web site for use by other researchers.
Oppenheimer’s study is not based upon new genetic data, but rather upon a re-analysis of previously published information. Oppenheimer uses five key sources for his British data: the studies of Capelli (2003), Wilson et al. (2001), Weale et al. (2002), and Hill et al. (2000), and data provided by D. Faux and J. Wilson related to the Orkney and Shetland Islands. Of these sources, one (Capelli’s) is available on the Internet. Collectively from these sources, Oppenheimer compiled a composite dataset containing 3,084 samples, “though by far the largest body of data in the composite
Since the vast majority of Oppenheimer’s data came from Capelli, the first step in the method was to understand the nature and limitations of Capelli’s dataset.
A second part of my method resulted in the identification of the 16 R1b clans that Oppenheimer uses in his analysis. Like Sykes, Oppenheimer does not completely disclose the Y-
Reviewing Capelli (2003)
Capelli’s study, “A Y chromosome census of the British Isles,” collected 1,772 Y chromosome samples from 25 predominantly small urban locations in the
Figure 1. Locations Used for Capelli’s Original Data Collection
understand Capelli’s published dataset, the 1,772
The haplogroup mapping results from the calculator and key counts of Capelli’s data used in Oppenheimer’s study are shown in Table 1. The haplogroup results are shown as rows while the column counts are derived from the individual datasets.
Table 1. Summary of Capelli and Oppenheimer Datasets
It can be seen from this table that the vast majority of Oppenheimer’s data comes from the Capelli dataset: 71% of his overall data and 85% of that from the
can be traced back directly to Capelli. Oppenheimer has acknowledged this heavy reliance in his book.
Given this reliance, it is clear that Oppenheimer’s genetic analysis is based upon the six microsatellites included in the Capelli data. 67 unique R1b haplotypes were extracted from the dataset and are included in Appendix A. These 67 haplotypes subsume the 1,301 R1b Capelli samples shown in Table 1, and this data represents 76% of all the data used by Oppenheimer in his analysis of R1b migration patterns.
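The extraction of unique haplotypes described above can be sketched in a few lines. This is a minimal illustration only: the location names and marker values below are invented placeholders, not rows from Capelli's dataset, and the function name `unique_haplotypes` is an assumption introduced for this sketch.

```python
from collections import Counter

# Hypothetical records: (location, six marker repeat counts).
# These values are invented for illustration and are NOT Capelli's data.
samples = [
    ("Town A", (14, 12, 24, 11, 13, 13)),
    ("Town A", (14, 12, 24, 11, 13, 13)),
    ("Town B", (14, 12, 23, 11, 13, 13)),
    ("Town B", (14, 12, 24, 11, 13, 13)),
    ("Town C", (14, 12, 24, 10, 13, 13)),
]

def unique_haplotypes(records):
    """Count how many samples share each distinct six-marker haplotype."""
    return Counter(haplotype for _location, haplotype in records)

counts = unique_haplotypes(samples)
for haplotype, n in counts.most_common():
    # Print each distinct haplotype once, followed by its sample count.
    print("-".join(map(str, haplotype)), n)
```

Applied to the real dataset, the number of keys in `counts` would correspond to the number of unique haplotypes, and the sum of the counts to the number of samples they subsume.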
the same approach as for Sykes’ data in
The missing piece in Oppenheimer’s study is the definition of these clusters in terms of the underlying microsatellites. Nowhere in his book are these clusters fully specified.
Though one approach to determining the cluster/clan definitions could be a bottom-up analysis of the data listed in Appendix A, essentially reverse-engineering Oppenheimer’s work, another strategy was selected. Simply put, since Oppenheimer’s genetic clans are apparently based primarily on six microsatellites, it was decided to look at how he typed specific participants to attempt to deduce the R1b cluster definitions from their results.
attempting to collect information on the Oppenheimer clan definitions, another
of the results of Oppenheimer’s genotyping has been illuminating. When Oppenheimer Clans are viewed on
2 shows the
Oppenheimer Clan results from
Table 2. Oppenheimer Genotypes with Associated Capelli Microsatellites and Intra-Clan Differences Highlighted
It should be noted that because there are fewer Oppenheimer clusters than haplotypes in the Capelli dataset (and fewer than the possible number of combinations of six markers, by necessity), an Oppenheimer cluster must span more than one unique combination of six markers. Or stated another way, since Oppenheimer partitions R1b into only 16 groups, some groups must contain more than one of the 67 haplotypes listed in Appendix A.
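The counting argument can be made slightly more specific. As a small worked check, assuming that every one of the 67 haplotypes is assigned to exactly one of the 16 clans, the pigeonhole principle gives a lower bound on the size of the largest clan:

```python
import math

haplotypes = 67   # unique R1b haplotypes listed in Appendix A
clans = 16        # R1b clans in Oppenheimer's scheme

# Pigeonhole bound: if each haplotype falls in exactly one clan (an assumption),
# the most populous clan must hold at least ceil(67 / 16) haplotypes.
min_in_largest_clan = math.ceil(haplotypes / clans)
print(min_in_largest_clan)  # 5
```

Under that assumption, at least one clan must span five or more distinct six-marker haplotypes.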
When looking at Oppenheimer’s empiric results in light of Capelli’s underlying markers, several conclusions are evident from Table 2.
First, no six-marker haplotype is split among two or more Clans; that is, each haplotype maps to one and only one clan designation. This supports the hypothesis that Clan designations are based primarily on these six markers.
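A consistency check of this kind is straightforward to automate. The sketch below is illustrative only: the sample numbers, marker values, and clan labels are hypothetical placeholders rather than entries from Table 2, and the helper names are assumptions of this example.

```python
from collections import defaultdict

# Hypothetical typed participants: (sample id, six-marker haplotype, clan label).
# Values and labels are invented; the last row is a deliberate conflict.
typed = [
    (1, (14, 12, 24, 11, 13, 13), "clan-X"),
    (2, (14, 12, 24, 11, 13, 13), "clan-X"),
    (3, (14, 12, 23, 11, 13, 13), "clan-Y"),
    (4, (14, 12, 23, 11, 13, 13), "clan-Z"),
]

def clans_per_haplotype(observations):
    """Collect the set of clan labels reported for each haplotype."""
    clans = defaultdict(set)
    for _sample_id, haplotype, clan in observations:
        clans[haplotype].add(clan)
    return dict(clans)

def conflicts(observations):
    """Return only the haplotypes that were typed into more than one clan."""
    return {
        haplotype: labels
        for haplotype, labels in clans_per_haplotype(observations).items()
        if len(labels) > 1
    }

print(conflicts(typed))
```

An empty result would mean that every observed haplotype maps to a single clan; any entry returned marks a haplotype that deserves a second look.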
Second, Oppenheimer clan families (e.g., R1b-8 and 8a; R1b-14 a/b/c, R1b-15 a/b/c, etc.) seem to be generally separated by a single step mutation of a single marker. For example, R1b-8 and R1b-8a seem to be differentiated by
even with the small sample size collected by
A summary of Oppenheimer’s R1b Clan Tree is reprinted as Figure 2. In this figure, the estimated time of branching, the standard deviation, and the number of samples of each clan that were present in his dataset are extracted from various footnotes throughout Oppenheimer’s book. The corresponding six-marker haplotypes are also included where possible.
Analysis, the Atlantic Modal Haplotype (i.e., Ruy or R1b-10) splits from R1b-9
When viewed in the context of the aforementioned microsatellites, one can also see how Oppenheimer might draw this conclusion. Table 3 shows this specific progression of R1b Clans proposed by Oppenheimer in his book.
Table 3. Haplotype Progression Suggested by Oppenheimer’s Analysis
The haplotype progression shown in Table 3 further reinforces the conclusion that Oppenheimer’s analysis used these microsatellites. The haplotype sequence shown in Table 3 is logical and follows Oppenheimer’s sequence. The haplotype sequence does not support other progressions such as R1b-9 → R1b-8 → R1b-10 or R1b-10 → R1b-9 → R1b-8 that would contradict Oppenheimer’s conclusions.
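One way to test whether a proposed ordering is plausible is to measure the number of repeat steps between successive haplotypes. The sketch below assumes a simple stepwise mutation model in which each link in a progression differs by one repeat on one marker; that model, the function names, and the marker values are assumptions of this example and are not taken from Table 3.

```python
def step_distance(a, b):
    """Total number of single-repeat steps separating two haplotypes."""
    if len(a) != len(b):
        raise ValueError("haplotypes must have the same number of markers")
    return sum(abs(x - y) for x, y in zip(a, b))

def is_single_step_chain(haplotypes):
    """True if every consecutive pair differs by exactly one repeat step."""
    return all(
        step_distance(a, b) == 1
        for a, b in zip(haplotypes, haplotypes[1:])
    )

# Hypothetical haplotypes, invented for illustration only.
first = (14, 12, 23, 11, 13, 13)
middle = (14, 12, 24, 11, 13, 13)
last = (14, 12, 25, 11, 13, 13)

print(is_single_step_chain([first, middle, last]))   # True
print(is_single_step_chain([middle, first, last]))   # False
```

In this toy case only one ordering forms an unbroken chain of single steps, which is the kind of reasoning that lets a marker table favor one progression over its alternatives.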
As a final check, the author attempted to recreate several of Oppenheimer’s Clan maps included in his book. In Figures 3a and 3b, the data for the hypothesized clans R1b-15c and R1b-9 were plotted on Capelli’s map of the
Figure 3a. Comparison of Capelli Samples with Oppenheimer Clan R1b-15c
(Size indicates the relative number of observed samples in the Capelli dataset)
Figure 3b. Comparison of Capelli Samples with Oppenheimer Clan R1b-9
(Numbers indicate number of observed samples in the Capelli dataset)
There are some limitations to this analysis. For example, (1) the known ex post facto clan samples shown in Table 2 are very small, (2) not all of Capelli’s known haplotypes have been genotyped into Oppenheimer Clans, and (3) while significant, only 85% of Oppenheimer’s R1b British Isles data is attributable to Capelli’s underlying dataset in the first place.
These caveats notwithstanding, the author believes that Figure 3 further reinforces the assertion that insights into Oppenheimer’s clan nomenclature can be deduced when ex post facto results are compared to Capelli’s dataset.
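The tallies behind a map comparison of this kind can be sketched as follows. This is a hedged illustration, not the author's actual procedure: the lookup table `clan_of`, the location names, the marker values, and the clan labels are all hypothetical.

```python
from collections import Counter

# Hypothetical lookup from six-marker haplotype to hypothesized clan label.
clan_of = {
    (14, 12, 24, 11, 13, 13): "clan-X",
    (14, 12, 23, 11, 13, 13): "clan-Y",
}

# Hypothetical (location, haplotype) records, invented for illustration.
samples = [
    ("Town A", (14, 12, 24, 11, 13, 13)),
    ("Town A", (14, 12, 23, 11, 13, 13)),
    ("Town B", (14, 12, 24, 11, 13, 13)),
    ("Town B", (14, 12, 24, 11, 13, 13)),
    ("Town C", (14, 12, 24, 10, 13, 13)),  # haplotype not yet typed into a clan
]

def clan_counts_by_location(records, lookup, clan):
    """Count the samples at each location whose haplotype maps to `clan`."""
    return Counter(
        location
        for location, haplotype in records
        if lookup.get(haplotype) == clan
    )

counts = clan_counts_by_location(samples, clan_of, "clan-X")
print(dict(counts))
```

Haplotypes missing from the lookup are simply skipped, which mirrors the limitation that not every haplotype has been typed into a clan.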
The analysis presented in this paper tends to confirm the hypothesis that Oppenheimer's Clan system is based upon the six microsatellites presented in the data of
Capelli C, et al. (2003) dataset:
Oxford Genetic Atlas Project (OGAP), data from Sykes (2006):
Capelli C, Redhead N, Abernethy JK, Gratrix F, Wilson JF, Moen T, Hervig T, Richards M, Stumpf MP, Underhill PA, Bradshaw P, Shaha A, Thomas MG, Bradman N, Goldstein DB (2003) A Y chromosome census of the British Isles. Curr Biol, 13:979–984.
Oppenheimer S (2006) The Origins of the British: A Genetic Detective Story. Constable
Sykes B (2006) Blood of the Isles: Exploring the Genetic Roots of Our Tribal History.
R1b Haplotypes Present in Capelli’s Study
 Oppenheimer (2006), Chapter 3, footnote 41.
 Capelli (2003) dataset located at: http://freepages.genealogy.rootsweb.com/~gallgaedhil/Capelli.htm
 Oppenheimer (2006), p. 123.
 While this statement was true for a long time, a recent empiric posting provides one contradiction among the 48 observations included in Table 2: Samples #36 and #40 are typed as R1b-14a and R1b-14c, respectively, but contain the same six-marker haplotype. The author suspects that this discontinuity is due to lab error, but the discrepancy is noted so that readers can weigh the anomaly accordingly.
 1,642 samples are accounted for out of the full R1b data set of 1,947 samples (i.e., 1,511 plus 436; see Table 1).