Geographic Patterns of Haplogroup R1b in the
Kevin D. Campbell
Abstract
The recent availability of Y-STR databases has provided the
opportunity to further explore geographic and subclade patterns of Haplogroup R1b
in the
Address for correspondence: Campbell@alum.mit.edu
Received:
Introduction
While DNA testing has evolved rapidly, there is a dearth
of reliable Y-STR DNA data for serious analysis. This absence is particularly surprising given
the heavy concentration of the R1b haplogroup in
The lack of reliable data can be attributed to a number of
causes. These include: inconsistent use
of markers and nomenclature, cost involved with extensive panels of markers,
and a number of other issues that are familiar to most academic and amateur
genetic researchers. However, it is
suggested that there are two root causes that have significantly hindered
population analysis – (1) the lack of uniformly collected, independently
verified data sets and (2) the tendency of researchers to shield and obfuscate[1]
their analysis.
The major factor contributing to the first is the nature
of the submission process at the largest databases, which results in
non-validated and unsubstantiated data.
The sad truth is that many of this field’s largest databases rely on
data and geographic inputs provided by enthusiastic but uninformed
individuals. While transcription and
upload errors for YSearch and SMGF may be very low, many errors creep in
through marker translation and geographic speculation. Though some might dismiss marker translation
errors as insignificant, errors in geographic data are not. In fact, the essence of population analysis is
the deductions and inferences that can be made about how the Y-STR data relate
to specific geographic areas or population migrations.
As an example, in the
The aforementioned statistics call into question the
reliability of information provided by individual participants and suggest
that organized and structured data-gathering studies might provide
better sources of data. However, some
respected researchers have published popular texts with new theories while
providing insufficient information or omitting critical linkages that might
facilitate a formal and critical review.
For example, though Capelli’s study of
Though Bryan Sykes has made available the data that he
used in his recent popular book “Blood of the Isles”, he leaves out
critical linkages between the particular haplotypes and the conclusions that he
draws in the text. This is particularly
frustrating because, while the haplogroups in his study are reasonably distinguishable,
the assertions and theories that he makes about specific subclades are not
easily reviewed in the context of the supporting data. The present article will fill in some of this
missing information.
Similarly, most people would consider the recent book, The
Origins of the British, by Stephen Oppenheimer (2006) to be an
authoritative work in this rapidly evolving field. While a substantial portion of Oppenheimer’s
Y-STR study also uses the Capelli data, he assigns new haplotype labels to the
data (“16 distinct types of R1b”)
without ever providing the detail necessary to link his haplotypes to the
underlying data. As is the case with
Sykes’s work, this makes any real academic or juried review of his conclusions
impossible and lessens the usefulness of his work to other researchers. In some cases, like the aforementioned Pict
test, researchers have been quick to partner with testing companies to make
money from their private theories.[4]
The purpose of this paper is to take a look at the
underlying Sykes R1b data and see if it can be linked to his founder haplotypes
and the conclusions of his analysis. The
goal is to provide additional insights into the work of
these researchers and so make it more useful to the individual genetic genealogists
who look to their data as a link to the past.
Methods
This study focuses on Haplogroup R1b, which comprises the
vast majority of the
Results
Step 1 – Coding of the OGAP Data
For data collection, Oxford Genetic Atlas Project (OGAP)
data was downloaded from Bryan Sykes’s web site and converted from PDF to
Microsoft Excel format. The 2,322 samples
were then coded by haplogroup using Whit Athey’s improved Bayesian haplogroup
calculator (Athey 2005, 2006). Hereafter,
“OGAP” will refer to the R1b subset of Sykes’s data.
Sykes included the following description of the data in
the supplementary data file:
Y-chromosome DNA (yDNA) - Samples collected early in OGAP were amplified across the
following seven markers: DYS 19, DYS 389i, DYS389ii, DYS 390, DYS 391, DYS 392,
DYS 393 using conditions described by de Knijff et al (International Journal of
Legal Medicine, 110: 134-140, 1997). Later samples were typed for these and
three additional markers: DYS 388, DYS 425 and DYS 426, using the two-stage
multiplex conditions described by Thomas et al. (Human Genetics, 105: 577-581, 1999).
Alleles are reported as the number of repeat units. For reasons of continuity
within OGAP, DYS 389i is reported as three repeats lower than the allele size
produced by the ABI 3100. DYS 389ii-i reports the difference between 389i and
389ii, the reason being that the repeat size at 389ii is not independent of
389i whereas the difference between them is. Although Y-chromosomes were
assigned to clades, largely by RFLP [Author - restriction fragment
length polymorphism] analysis, these assignments are
not reported here as they do not necessarily correspond to the SNP-based system
recommended by the Chromosome Consortium (Genome Research 12:339-348, 2002).
Geographical distribution - Genetic data are assigned to geographical regions based on the
birthplace/residence of the paternal grandfather. This was done to minimize the
effect of very recent migration. The regional boundaries are shown on a map which
precedes the Prologue in Blood of the Isles.
These data are copyrighted and must not be reproduced without
permission. Other formats and additional details may be available for academic
collaborations.
Several things are worth noting about the data. First, the Sykes data use only 10 markers
(DYS19, DYS389i, DYS389ii, DYS390, DYS391, DYS392, DYS393, DYS388, DYS425 and
DYS426). In addition, only approximately
64% of the data are complete, with 36% of the data missing four markers: DYS439,
DYS388, DYS425 and DYS426. While the
missing markers appear to be a serious shortcoming, 94% of all the DYS425 and
DYS439 values and 73% and 74% of the DYS426 and DYS388 values in Sykes’s full
data set, respectively, are 12.[5] This means that these markers do not have
sufficient spread and variability and are, in general, of limited use in
discriminating between haplotype patterns within this set of data.
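The percentages quoted above can be checked directly from the counts given in footnote 5. The short Python sketch below does only that arithmetic; the variable and function names are illustrative and are not part of the original analysis, which was carried out in Excel.

```python
# Counts of samples with an allele value of 12, out of 1496 reported values,
# as listed in footnote 5 of this paper.
TOTAL_REPORTED = 1496
allele_12_counts = {
    "DYS425": 1412,
    "DYS439": 1411,
    "DYS426": 1088,
    "DYS388": 1108,
}


def share_at_value(count, total=TOTAL_REPORTED):
    """Return the percentage of reported values that equal the dominant allele."""
    return 100.0 * count / total


shares = {marker: share_at_value(n) for marker, n in allele_12_counts.items()}
for marker, pct in shares.items():
    print(f"{marker}: {pct:.1f}% of reported values equal 12")
```

A marker on which roughly three quarters or more of the samples share one value contributes little to separating haplotypes, which is the point made in the paragraph above.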
Another interesting fact is the haplogroup distribution of
the data. Due to the lack of haplogroup
designations in the original data, Athey’s haplogroup calculator was used as a
proxy to classify each haplotype. After
several haplotypes were removed because of missing data, Table 1 shows
a comparison of the results of the haplogroup calculator with Sykes’s published
“Clans.”
In this table, percentages shown in the middle column
reflect the output of Athey’s calculator while the percentages in the far right
column correspond to the breakdown published by Sykes in Appendix C of his
book.
It is clear from this comparison that Athey’s calculator
classifies the data in proportions similar to Sykes’s Clans, and thus one
may infer the underlying meaning of Sykes’s clan nomenclature.
In addition, it is important to
note that Sykes’s Clan categorizations were not based primarily on single
nucleotide polymorphism (SNP) testing.
As stated in the italicized quote above, “Y-chromosomes were assigned to clades, largely by RFLP analysis, these
assignments are not reported here as they do not necessarily correspond to the
SNP-based system recommended by the Chromosome Consortium.”
Finally, one should understand the regional borders that
Sykes uses in his study. Since the
purpose of this analysis is to draw geographic inferences, we are limited in
our insight by the definition of the geographic areas from which the data is
collected. Figure 1 shows the regional
borders that are coded in the OGAP data.
Table 1. Calculated Haplogroups vs. Sykes Clans

Step 2 – Extracting R1b (Oisin) Data
The 1625 haplotypes identified as R1b in the previous step
were extracted from the data set. These
included all of the R1b haplotypes shown in Sykes Table 1, plus those for

Figure 1. Regional Borders Used in the OGAP Analysis to Classify Individuals
The first thing that was done to better understand the
data was to identify modal haplotypes for each region as a descriptive view of
the R1b data set. However, examination
of the modal haplotypes for the individual regions was not informative because
all regions and the full data set matched the standard Atlantic Modal
Haplotype.
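A modal haplotype of this kind can be sketched as follows. The sketch assumes the modal haplotype is built marker by marker (the most common allele at each marker); the paper does not state which definition was used. The region names and allele values are invented toy data, not OGAP data.

```python
from collections import Counter, defaultdict


def modal_haplotype(haplotypes):
    """Return the per-marker mode across a list of equal-length allele tuples."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*haplotypes))


def modal_by_region(samples):
    """Group (region, haplotype) pairs by region and compute each region's modal haplotype."""
    by_region = defaultdict(list)
    for region, haplotype in samples:
        by_region[region].append(haplotype)
    return {region: modal_haplotype(haps) for region, haps in by_region.items()}


# Hypothetical toy samples over four unnamed markers.
toy_samples = [
    ("RegionA", (14, 24, 11, 13)),
    ("RegionA", (14, 24, 11, 13)),
    ("RegionA", (14, 25, 11, 13)),
    ("RegionB", (14, 24, 11, 13)),
    ("RegionB", (14, 24, 10, 13)),
    ("RegionB", (15, 24, 11, 13)),
]

modes = modal_by_region(toy_samples)
print(modes)
```

In this toy case both regions yield the same modal haplotype even though their samples differ, which mirrors why the regional modes were not informative.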
A view of the R1b data in the form of a connected graph –
as shown in Figure 2 – shows a high degree of “cubism.” By this I mean a high degree of nodal
interconnectivity among the data points that results in opposite vertices
“washing out” differences in the data.
Clearly, data analysis based upon unique combinations of
markers (i.e. haplotypes) instead of individual markers would be necessary.

Figure 2. Network Analysis of the Top Twenty R1b Haplotypes
Step 3 – Analysis of Haplotypes
Since descriptive statistics tended to “average out”
differences in the data, other methods were needed to identify patterns and
analyze the data. To do this, the most
common haplotypes were identified and two methods of analysis were
performed. Appendix A shows the
haplotypes in the OGAP data. The OGAP
haplotypes roughly follow the frequency distribution in YSearch and in McEwan’s
(2007) groups, if one takes into account that the OGAP data is light on Irish
samples in comparison with these other sources. [6]
The OGAP designations in this table were assigned
sequentially in decreasing frequency of occurrence. The OGAP numbers from this table will be
referenced throughout the remainder of this report.
Two types of analysis were conducted to identify patterns –
affinity analysis and network analysis.
For affinity analysis of the haplotypes, an Excel
spreadsheet was developed to look for patterns and anomalies in the data. An algorithm was created that took as an
input the 10 marker values for a haplotype or signature to be reviewed and then
compared that haplotype to the R1b subset of the database. The algorithm calculated the genetic distance
and reported back the number of perfect (i.e., zero distance) matches by OGAP
region. To account for the differing
level of sampling in each region (e.g., small for
Table 2. Example of Identifying OGAP8 Affinities

For example, haplotype OGAP8, which is generally considered
the quintessential Irish haplotype, has 34 perfect matches in the database. Because
This analysis was repeated for the top 20 haplotypes in
the OGAP data. These haplotypes, which
cover 60% of all OGAP samples, appear sufficient to identify major regional
affinities. Analysis of additional haplotypes would be increasingly subject to
sampling error.[8]
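The affinity procedure can be illustrated with a short Python sketch. It is not the author’s Excel spreadsheet. Two points are assumptions made for illustration only: genetic distance is taken as the sum of absolute allele differences, and the adjustment for uneven regional sampling is taken as observed matches minus the matches expected if they were spread in proportion to each region’s sample size. The regions and haplotypes are invented toy data.

```python
from collections import Counter


def genetic_distance(a, b):
    """Sum of absolute allele differences (an assumed stepwise distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def perfect_matches_by_region(query, samples):
    """Count zero-distance matches to the query haplotype in each region."""
    return Counter(
        region for region, hap in samples if genetic_distance(query, hap) == 0
    )


def affinity_scores(query, samples):
    """Observed minus expected perfect matches per region (assumed normalization)."""
    region_sizes = Counter(region for region, _ in samples)
    observed = perfect_matches_by_region(query, samples)
    total_matches = sum(observed.values())
    scores = {}
    for region, size in region_sizes.items():
        # Matches expected if they followed the region's share of all samples.
        expected = total_matches * size / len(samples)
        scores[region] = observed[region] - expected
    return scores


# Hypothetical toy data: RegionA is small but rich in the query haplotype.
query = (14, 25, 11, 13)
toy_samples = (
    [("RegionA", (14, 25, 11, 13))] * 3
    + [("RegionA", (14, 24, 11, 13))] * 1
    + [("RegionB", (14, 25, 11, 13))] * 1
    + [("RegionB", (14, 24, 11, 13))] * 5
)

scores = affinity_scores(query, toy_samples)
print(scores)
```

A positive score marks a region where the haplotype occurs more often than its sample size alone would predict; a negative score marks the opposite.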
The results of the analysis of the top 20 haplotypes are
shown in Table 3. It should be
noted that in this table negative values were removed to reduce clutter, and
significant geographic anomalies were color coded to aid in identifying
tendencies. Finally, especially
interesting results were boxed to help in later discussion in this paper.
License was also taken in reordering the rows and
columns of the table in an attempt to group similar haplotypes and neighboring
regions. While such analysis is called
“affinity analysis” and can be conducted mathematically, this analysis was
done manually to better allow for subjective considerations of the data.
The second type of analysis that was performed was network
analysis. This analysis, which is common
in the genetic sciences, was conducted using the Fluxus Networking program
version 4.2.0.0. Figure 3 shows
the results of the network analysis for the top twenty OGAP haplotypes.
It should be noted in Figure 3 that nodes have been relocated
and line length changed for increased readability. Nodes have also been colored to reflect the
regional affinities identified in Table 3.
Conclusions
Analysis of the Oxford Genetic Atlas Project data has
yielded interesting results. The combination
of the geographic affinity results shown in Table 3 and the network analysis
results shown in Figure 3 is synthesized in Figure 4. In this graphic, key haplotypes with strong
regional affinities were placed in their rough geographic perspective. No attempt was made to force every haplotype
somewhere on the map, as it is obvious that at this level of analysis some
haplotypes are pervasive and ubiquitous and not easily generalized to a single
geographic region.
Table 3 – Haplotype Affinity by OGAP Region

Once located, haplotypes that differed by a single
mutation were connected with lines. Figure
4 reflects the general interconnectivity resulting from the network analysis
of Figure 3. The lines in Figure
4 should be thought of as one possible path of migrations, not necessarily
the only path. The interconnections
shown in Figure 4 are not based on any individual mutation rates; they
are based upon the occurrence of mutations, the principle of parsimony, and the
general south-to-north flow of R1b discussed by Sykes and Oppenheimer. Parsimony, in this case, reflects the
generally acknowledged flow from higher concentrations of haplotypes to lower,
more diffuse concentrations.
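The step of connecting haplotypes that differ by a single mutation can be sketched in Python. This is a simple pairwise comparison, not the algorithm used by the Fluxus program, and it assumes that a “single mutation” means a difference of one repeat at exactly one marker. The haplotype labels and values are hypothetical.

```python
from itertools import combinations


def is_one_step(a, b):
    """True when two haplotypes differ at exactly one marker, by one repeat."""
    differences = [abs(x - y) for x, y in zip(a, b) if x != y]
    return differences == [1]


def one_step_edges(haplotypes):
    """Return all pairs of haplotype names separated by a single one-step mutation."""
    return [
        (name1, name2)
        for (name1, hap1), (name2, hap2) in combinations(haplotypes.items(), 2)
        if is_one_step(hap1, hap2)
    ]


# Hypothetical haplotypes over four unnamed markers.
toy_haplotypes = {
    "H1": (14, 24, 11, 13),
    "H2": (14, 25, 11, 13),
    "H3": (14, 25, 10, 13),
    "H4": (15, 24, 11, 14),
}

edges = one_step_edges(toy_haplotypes)
print(edges)
```

The resulting edge list carries no direction; deciding which way a line is read still relies on the parsimony and south-to-north arguments described above.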

Figure 4. Geographic Patterns of R1b in the
Some of the observations and conclusions of this analysis
are as follows.
1. The methodology clearly identified and quantified what has
been previously called the Irish subclade.
Whether called the Irish Modal Haplotype or the “Ui Neill haplotype” as
in the
2. Similarly, OGAP10, which shares the DYS390=25 allele,
is clearly identified as an Irish haplotype.
This haplotype also shows up strongly in the
3. Interestingly, OGAP5 is a very prevalent haplotype
that also shows up predominantly in
4. OGAP19 is interesting in that it shows an extreme
correlation with both
5. OGAP4 is particularly intriguing. It is ubiquitous across all areas of
6. OGAP6 is prominent in Argyll and the
7. OGAP9 and OGAP11 both show an affinity for both the
Northern Isles and the Borders regions.
This affinity is distinctive, but the author is unqualified to venture a
theory that might explain this geographic discontinuity.
8. OGAP13 and OGAP17 both show a clear affinity for
9. OGAP14, OGAP16, and OGAP20 all show a common
regional affinity. Though their presence
in Tayside is very slightly higher than in
10. OGAP7 seems to be most prevalent
in
11. The core haplotypes for the full
12.
13. Like
Summary
Through the analysis of Sykes’s OGAP data, this study has
provided a means of linking DNA results to the haplotypes and conclusions in Sykes’s
book, “Blood of the Isles.” The study
has confirmed Sykes’s interpretation of the data and, hopefully, provided a
means for other researchers to further validate and extend his work. The study both confirmed some subclades
identified by Sykes and identified some new subclades worthy of further
research. Key subclades that the study
posits and which are defined by Sykes include those of the Picts and the Dal
Riada Celts.
Picts – It is asserted that OGAP4 best
represents the Pictish ancestry of
Dal Riada Celts – When considered in a narrow
genetic sense, the Gaels of Ireland, as identified by the DNA signature of
OGAP8, are as close as any group to being considered the root line and
forebears of the Celts of today.[18]
When present in
Several interesting clusters were identified that show
geographic affinities but also geographic discontinuities.
Scottish clusters OGAP9 and OGAP11 have a strong presence in the Borders
as well as the Northern Isles. English
clusters OGAP14, OGAP16, and OGAP20 show a predisposition to both
Electronic Database Information
Capelli, C. et al. 2003 Data Set: http://freepages.genealogy.rootsweb.com/~gallgaedhil/Capelli.htm
http://www.bloodoftheisles.net/results.html
John McEwan’s R1b Haplotypes: http://www.geocities.com/mcewanjc/p3modal.htm
Whit Athey’s Haplogroup Predictor: http://home.comcast.net/~hapest5/index.html
References
McEwan J (2007) Phase 3 analysis: Ysearch 37 STR modal summary and analysis tables (web site).
Oppenheimer S (2006) The Origins of the British - A Genetic Detective Story, Constable and Robinson,
Sykes B (2006) Blood of the Isles: Exploring The Genetic Roots of Our Tribal History. Bantam Books.
Published in the
Appendix A – OGAP Haplotypes
The 1625 data points that comprise the R1b data set
include 291 separate haplotypes.
However, 50% of the data can be accounted for with only 10 haplotypes;
60% by 20 haplotypes; and 68% by 30 haplotypes.
In addition, haplotypes beyond the top 30 have only single-digit frequencies
compared to the most frequent – the Atlantic Modal Haplotype (AMH), which
occurs 262 times in the data.
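Coverage figures of this kind come from ranking haplotypes by frequency and summing the counts of the top few. The Python sketch below shows the calculation on invented toy labels; only the AMH share at the end uses figures taken from this appendix (262 occurrences among 1625 data points).

```python
from collections import Counter


def coverage_of_top(haplotypes, n):
    """Fraction of all samples accounted for by the n most frequent haplotypes."""
    counts = Counter(haplotypes)
    top_total = sum(count for _, count in counts.most_common(n))
    return top_total / len(haplotypes)


# Hypothetical toy sample: 10 samples spread over 4 distinct haplotypes.
toy = ["A"] * 5 + ["B"] * 3 + ["C"] + ["D"]
print(coverage_of_top(toy, 1))
print(coverage_of_top(toy, 2))

# Share of the R1b data set taken by the AMH, from the counts stated above.
amh_share = 262 / 1625
print(f"AMH share: {amh_share:.1%}")
```

On the figures given, the AMH alone accounts for about 16% of the R1b data set.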
Below are the top 50 haplotypes in order of descending
frequency. These are numbered OGAP1
through OGAP55 for reference in this study and represent all haplotypes that
occur more than 5 times. The
distribution of these haplotypes in Sykes’s OGAP data and in YSearch (www.ysearch.org, as of December 2006) is
shown on the right hand side of the chart.
Also, a mapping of John McEwan’s (2007) R1b subclades is
shown on the left hand side of the chart.
The letters designating the McEwan group refer to the groups described
in Appendix B that were assigned when McEwan individual haplotypes were grouped
together to reflect the much smaller number of markers in the OGAP data.

Appendix B – McEwan’s R1b Haplotypes Reduced
By mining YSearch and collecting similar 37-marker
haplotypes into clusters, McEwan (2007) has identified a large number of R1b “types”
that comprise the world-wide scope of this data. It is interesting to consider how the Sykes
OGAP data relates to McEwan’s haplotypes.
However, to compare the data, McEwan’s modal haplotypes had to be collapsed
into smaller groups to reflect the smaller number of markers in the OGAP haplotypes
(refer to Appendix A).
The following is the reduction of the McEwan haplotypes to
10-marker haplotypes. In each case, letter
designations have been included here for the purpose of mapping the full set of
McEwan haplotypes (designated R1bSTR##) to the reduced set used in this
analysis (Letter Groups). While this exercise necessitates the loss of
considerable resolution in the McEwan haplotypes, the exercise is included here
to provide traceability to the analysis included in Appendix A.
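The reduction can be pictured as projecting each longer haplotype onto the markers it shares with the OGAP data and then grouping the haplotypes that become identical. The Python sketch below illustrates the idea only. The type labels (T1 to T3), the markers shown and the allele values are hypothetical; they are not McEwan’s actual R1bSTR definitions or the letter groups used in this appendix.

```python
from collections import defaultdict


def reduce_haplotype(haplotype, keep_markers):
    """Keep only the alleles for the markers shared with the smaller panel."""
    return tuple(haplotype[marker] for marker in keep_markers)


def group_by_reduced(haplotypes, keep_markers):
    """Group haplotype names whose reduced signatures are identical."""
    groups = defaultdict(list)
    for name, haplotype in haplotypes.items():
        groups[reduce_haplotype(haplotype, keep_markers)].append(name)
    return dict(groups)


# Hypothetical types; DYS385a stands in for a marker absent from the smaller panel.
toy_types = {
    "T1": {"DYS390": 24, "DYS391": 11, "DYS385a": 11},
    "T2": {"DYS390": 24, "DYS391": 11, "DYS385a": 12},
    "T3": {"DYS390": 25, "DYS391": 11, "DYS385a": 11},
}
shared_markers = ["DYS390", "DYS391"]

groups = group_by_reduced(toy_types, shared_markers)
print(groups)
```

Here T1 and T2 fall into one group because they differ only at a marker outside the smaller panel, which is the loss of resolution noted above.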
[1] In the case of academic researchers,
many keep tight control of their data.
It is hoped that in the future scientific journal editors will
require the submission of supporting data, which they might even hold for a
period of time after publication of analytic articles. Such submissions would ensure that important
research is fully documented even if it is years later when the article is no
longer at the forefront of everyone’s mind.
[2] For the Campbell Project, the
distribution of the birth year of the oldest proven ancestor of the
participants is as follows: 4% earlier
than the 1600s, 7% in the 1600s, 56% in the 1700s, and 33% in the 1800s. There is no reason to expect this to be
anything but representative, and in fact, one could be convinced that some of
these participants are outliers with longer than usual paper genealogies.
[5] For those markers that have been reported in the
full data set, the counts of alleles equal to 12
are: DYS425 (1412/1496), DYS439
(1411/1496), DYS426 (1088/1496), DYS388 (1108/1496).
[7] The results column of Table 2 has
the same relative weighting as if samples observed were divided by sample size
(e.g. Ireland = 2/17) but this is just an alternative formula.
[8] OGAP haplotypes below OGAP30 have
single digit sample sizes.
[10] http://www.m222.net/R1b1c7