Erica, thanks for sharing this heuristic article with me:
http://repository.up.ac.za/bitstream/handle/2263/32007/Greeff_Appel...
This is like watching an episode of “Peyton Place”! I’d give it an “R” rating. We need more stories like this in the United States – they might get more people interested in genealogy and history! :) LOL
And it includes a plug for one of our local companies, Applied Biosystems in Foster City, California – not too far from where I sit now. Nice!
OK so let’s look at our cast of actors:
1) Jan Cornelitz – the jilted husband who was impotent or sterile.
2) Maria Kickers – the adulterer and wife of the above.
3) Ferdinandus Appel – biological father of the first son, Theunis.
4) Frederik Botha – father of the remaining three younger sons.
5) Samuel Friedrich Bode – the other potential biological father, who ends up getting off the hook.
What are we trying to prove?
“We… show that Maria’s first son was actually fathered by Ferdinandus Appel and that roughly half the living Bothas (38,000 people) actually descend from Ferdinandus Appel while the remaining three sons all stem from the same father, presumably Frederik Botha.”
Is this some novel idea? No. This is based on known case law and so on, and all we’re trying to do is confirm what is already supposed: “several histories conclude that Ferdinandus Appel did indeed father the first son.”
Findings:
1) “The random Botha sample show two distinct STR profiles estimated to belong to the R1b haplogroup and differing at five of the 17 loci in roughly equal proportions”
2) “There was only one mutational step between the one Botha haplotype and one of the Appel males (Inserting the appropriate values into the equation gives a probability of 0.0249. It is thus 130 times more likely that the descendants of Theunis are linked by male descent to the living man with the surname Appel than Theunis being a descended from another immigrant that just happened to have the same haplotype by chance).”
3) Samuel Friedrich Bode arrived too late to have so many offspring
Let’s look at this – we have only 17 markers to work with. Compare that with the BigY and FullGenomes Y chromosome analyses, where we are working with 76 to 97% of 55,000 known SNP markers. How spoiled are we? With roughly 50,000 markers to work with, many genealogies today are a much easier task than the one the authors faced with only 17 markers. But many forensic kits and paternity tests rely on very few markers. Kind of scary when you think you could go to jail or not based on very few markers, without the benefit of 50K Y markers or, in the case of most autosomal tests, about three-quarters of a million markers! But today we’re going to use technology that is three decades old – why not???
How do we start?
1) Consent form
Always get signed informed consent!
2) Ethical review – check!
So the authors design the experiment. We have:
1) 76,125 people with the name Botha (not all descendants)
2) 10 generations
Wait a second – you mean we are going to take a random sample of everyone in South Africa with the same surname and see how many varieties we have? Yup. Isn’t that the same as above, where we have a pond full of colored fish and we want to determine the ratio of each color of fish? Yup! What do we use? Population genetics. What do we need? Random samples!
So the authors convince us that their samples are random:
“We do not have any reason to believe that any of Maria's sons' ancestors were more likely to participate than any other and we believe this to be a random sample.”
“Fifteen of these participated and we screened the ancestry information they gave to ensure that no close relatives were sampled.”
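If you want to play with the colored-fish idea yourself, here is a minimal Python sketch of estimating a share from a small random sample. The split of 8 out of 15 is a made-up number for illustration (the paper only says the two profiles occur in roughly equal proportions), and the Wilson score interval is my choice of method, not something the authors say they used.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

n_sampled = 15   # size of the random sample mentioned in the text
n_group_a = 8    # ASSUMED count in one haplotype group, for illustration only

low, high = wilson_interval(n_group_a, n_sampled)
print(f"observed share: {n_group_a / n_sampled:.2f}")
print(f"approx. 95% interval: {low:.2f} to {high:.2f}")
```

The interval comes out wide, which is the practical point: fifteen fish tell you there are two colors, but not the exact ratio.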
But we have a kicker. We have some samples from men with a known genealogical descent from each of the sons to compare these random samples with:
“In order to test if all Maria's sons had the same father or not genealogists put us into contact with living male descendants of each of her sons. One male from the random sample knew how he linked up to Maria's sons and he was included here. In this way three descendants of each of her first two sons, one of her third and two of her fourth son were typed (Figure 1).”
“By obtaining haplotypes of different patrilines of her first two sons we can have a high certainty of inferring the haplotype of her sons.”
So let’s cut to the chase. From reading the above we see that we have 3+3+1+2 = 9 samples (Figure 1) from men with a known genealogy. In Table 1, column n2, we see all nine men. And in Table 1, column n1, we see there are 7+1+6+1 = 15 randomly selected samples from men who don’t have a known pedigree.
So in Table 1 on the X axis we have the different “haplotypes” and on the Y axis we have all the different markers with names like DYS19. DYS stands for DNA Y-Chromosome (unique) Segment. These are short tandem repeats (STRs), which are short strands of DNA with a defined number of repeating pieces. And in the table we see the exact number of these repeats for each STR. For example, the first marker DYS19 for haplotype “Botha 2a” has 14 repeats. Got it?
And we have one more piece – we have a sample from a real living male Appel. “If either or both of these match the haplotype of Theunis's descendants it would indicate that Ferdinandus Appel may have been Theunis's father.”
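To make "a defined number of repeating pieces" concrete, here is a small Python sketch that counts the longest run of a repeated motif in a DNA string. The motif and the flanking letters are placeholders I made up for illustration; real kits read repeat counts from fragment lengths, not from code like this.

```python
def count_tandem_repeats(sequence, motif):
    """Return the longest uninterrupted run of `motif`, counted in repeat units."""
    if not motif:
        raise ValueError("motif must be non-empty")
    best = 0
    for start in range(len(sequence)):
        run = 0
        pos = start
        # Keep stepping one motif-length at a time while the motif repeats.
        while sequence.startswith(motif, pos):
            run += 1
            pos += len(motif)
        best = max(best, run)
    return best

# Toy sequence: made-up flanks around 14 copies of a placeholder motif.
toy_motif = "TAGA"
toy_sequence = "GGCT" + toy_motif * 14 + "CCAT"

print(count_tandem_repeats(toy_sequence, toy_motif))  # 14
```

A value of 14 in a table cell means just that: the motif sits there 14 times in a row.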
Now step back for just a second and look at these data please. We have several haplotypes consisting of 17 markers. Some of these differ by only one or two markers, so we can reason they are probably all related, and others differ from the first group by 5 of the 17 markers – is that a lot? Wow – yes, we have two colors of fish. And not only that, the second color of fish is only off by one marker from the living Appel fish – hmmm, you think they are related? Yes you do. And you didn’t need fancy software to figure this out, did you? This is common sense, right?
So they assigned these samples to the haplogroup R1b. How did they do this?
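The eyeballing can be written down in a few lines of Python. This is only a sketch: the marker names are real STR names, but every repeat value below is invented for illustration and is not taken from Table 1.

```python
def compare_haplotypes(a, b):
    """Return (markers that differ, total repeat-step difference) over shared markers."""
    shared = sorted(set(a) & set(b))
    differing = [m for m in shared if a[m] != b[m]]
    total_steps = sum(abs(a[m] - b[m]) for m in shared)
    return differing, total_steps

# INVENTED values, five markers only, to keep the example short.
hap_x = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS392": 13, "DYS393": 13}
hap_y = {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS392": 13, "DYS393": 13}
hap_z = {"DYS19": 15, "DYS390": 22, "DYS391": 10, "DYS392": 11, "DYS393": 13}

print(compare_haplotypes(hap_x, hap_y))  # (['DYS391'], 1)
print(compare_haplotypes(hap_x, hap_z))  # (['DYS19', 'DYS390', 'DYS391', 'DYS392'], 6)
```

One marker off by one step looks like a close relative; four markers off by six steps looks like a different color of fish.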
“The haplogroups for the y-chromosome STR profiles were estimated with the Whit Athey haplogroup predictor.”
Whit Athey is just one of many haplogroup predictors that you can take for a test drive on the web. For a list of others see:
http://isogg.org/wiki/Y-DNA_tools
Two points. First, we know a lot about specific values for different markers: their stability and mutation rates, their probability of occurrence in a given population, and their probability of occurrence in combination with other markers. These are not just random values void of meaning, so we can infer a lot. Second, R1b is the largest haplogroup in Europe, so determining that they are R1b is not very useful at all. In our applications for genealogy we are going to be dealing with much better data than the authors have.
“The DYS385 loci (a and b) were excluded”
Really? Well, DYS385 is tricky for older methodologies. Y-STR DYS385 consists of two duplicated copies, DYS385a and DYS385b. So when you try to type it, a single primer pair amplifies both copies simultaneously, and it is difficult to tell which result belongs to which copy. So they just tossed these data out. OK, fine.
So how do we know what weight to place on each marker?
“and the inverse of the variance was used to weight the remaining markers (Figure 2).”
Ouch! Really? With a sample size of 15 you expect to get a weight to place on each marker using the inverse variance? So if one marker changes twice and the other marker changes once, the first marker is going to be counted as half as important. But with these data we may only have one sample with the unique value. So this is more for heuristic value, and I would posit it is not really a good idea with a sample this small.
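Here is a rough Python sketch of what inverse-variance weighting does and why a tiny sample makes it jumpy. The marker names M1 to M3 and all repeat values are invented, and I am assuming the plain sample variance; the paper does not spell out the exact formula.

```python
from statistics import variance

def inverse_variance_weights(samples):
    """Weight each marker by 1 / sample variance; None if the marker never varies."""
    weights = {}
    for marker in samples[0]:
        var = variance([s[marker] for s in samples])
        weights[marker] = None if var == 0 else 1 / var
    return weights

# INVENTED data: five samples, three hypothetical markers.
toy_samples = [
    {"M1": 14, "M2": 24, "M3": 13},
    {"M1": 14, "M2": 24, "M3": 13},
    {"M1": 14, "M2": 22, "M3": 13},
    {"M1": 14, "M2": 22, "M3": 13},
    {"M1": 15, "M2": 24, "M3": 13},
]

print(inverse_variance_weights(toy_samples))
# Drop the single sample carrying M1 = 15 and M1's weight becomes undefined.
print(inverse_variance_weights(toy_samples[:4]))
```

In the toy data the whole weight for M1 hangs on one man; take him out and there is no variance left to invert. That is the small-sample worry in a nutshell.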
“To visualise the haplotypes detected, a median-joining network was constructed.”
Which is what you see in figure 2.
What does this mean? You start with the idea of a phylogenetic tree. Imagine you have an X axis and a Y Axis and you have on the X axis these specimens:
Bacteria
Shark
Fish
Lizard
Mammal
And on the Y axis you have these traits:
Single cell
Notochord
Vertebrae
Feet
Warm blood
And then you fill in the chart with (+) or (-) signs depending on whether each creature has each trait. Then you construct a tree with the least number of changes between specimens. In this case you end up with a line from single cell all the way to mammal with one change per step. The tree gets more complex as you add in more specimens and more traits but it is still fairly simple to create a tree.
As things get more complex we use graph theory, where each specimen is a point called a “vertex” and each change in a trait is a line called an “edge.” We can use Kruskal's algorithm to find a subset of the edges that forms a tree including every vertex, where the total weight of all the edges in the tree is minimized. This is called a “minimum-spanning-tree algorithm.” So we’re basically looking for the least number of changes in the state of each trait that accounts for all the traits present in each specimen. Got it? That’s Kruskal's algorithm in a nutshell, without the math. And we use the maximum parsimony concept of James S Farris to build the optimal tree, the one that minimizes the number of state changes. In other words we want to minimize convergent evolution, parallel evolution, and evolutionary reversals. Median joining adds to this by adding a parameter epsilon to adjust for the level of homoplasy (the same state change arising more than once).
Note that all of the traits above are binary, i.e. they can be + or – only. With STR data you have a range of values depending on the number of repeats recorded. The good thing about median joining is that it can deal with non-binary states like STR values. The problem is that by default each unique state of a trait is given equal weight. And as we know, with STRs a state change of 1 is more common than a state change of 2 or more. And the state naturally tends to increase in the number of copies and not decrease over time. So it isn’t perfect.
So we can see in Figure 2 the result is a nice barbell-type tree, with all the samples related to Ferdinandus Appel clustered on one side, separated by one step each, then a huge number of steps, and then all the samples related to Frederik Botha on the other end, separated by one to a few steps.
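Going back to the bacteria-to-mammal chart, here is a short Python sketch of Kruskal's algorithm on it. The +/- fill-in is my own assumption, chosen to reproduce the straight line described above: "single cell" is treated as the starting state and only the four derived traits are encoded. This is a plain minimum spanning tree, not the median-joining method used for Figure 2.

```python
from itertools import combinations

# ASSUMED fill-in: (notochord, vertebrae, feet, warm blood), 1 = "+", 0 = "-".
traits = {
    "Bacteria": (0, 0, 0, 0),
    "Shark":    (1, 0, 0, 0),
    "Fish":     (1, 1, 0, 0),
    "Lizard":   (1, 1, 1, 0),
    "Mammal":   (1, 1, 1, 1),
}

def changes(a, b):
    """Edge weight: number of traits whose state differs between two specimens."""
    return sum(x != y for x, y in zip(traits[a], traits[b]))

def kruskal(nodes, weight):
    """Return minimum-spanning-tree edges as (a, b, weight) using union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    edges = sorted((weight(a, b), a, b) for a, b in combinations(nodes, 2))
    tree = []
    for w, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:       # adding this edge does not create a cycle
            parent[root_a] = root_b
            tree.append((a, b, w))
    return tree

mst = kruskal(list(traits), changes)
for a, b, w in mst:
    print(f"{a} - {b}: {w} change(s)")
print("total changes:", sum(w for _, _, w in mst))
```

With this fill-in the tree is the single chain Bacteria, Shark, Fish, Lizard, Mammal with four changes in total; every shortcut edge costs two or more, so Kruskal never picks one.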
“It is thus 130 times more likely that the descendants of Theunis are linked by male descent to the living man with the surname Appel than Theunis being a descended from another immigrant that just happened to have the same haplotype by chance.”
That’s the money sentence, I guess, because using a formal approach we can come up with the probability of one theory being correct versus the other. But couldn’t you see this just eyeballing the chart of data? Yes you could.
And then we do a sanity check. Y-chromosome SNPs mutate at about 3.0 × 10^-8 mutations / nucleotide / generation on average, while Y-STR mutation rates are about 4 to 5 orders of magnitude higher, on the order of 10^-3 mutations / locus / generation. But some STRs are more stable than others. There can be a seven-fold variance and the authors didn’t really address this. The way these rates are measured is to look at the known timing of a population separation, for example of the Maoris and Cook Islanders, and then derive a mutation rate for different STRs. Or we can look at multiple father-son pairs. And then we can use either average squared distance or a Bayesian analysis such as BATWING, which uses a Markov chain Monte Carlo method based on coalescent theory. But the point is the mutation rate is consistent with the values measured.
Can we rule out more than two paternal ancestors here? Not really.
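A back-of-the-envelope version of that sanity check, as a Python sketch. The per-locus rate of 0.003 per generation is a round number I am assuming for illustration, not a value from the paper, and I am treating mutations along one line of descent as a simple Poisson count.

```python
import math

def expected_mutations(n_loci, generations, rate_per_locus):
    """Expected number of STR mutations along one line of descent."""
    return n_loci * generations * rate_per_locus

def prob_at_most(k, mean):
    """Poisson probability of seeing k or fewer mutations."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k + 1))

n_loci = 17        # markers in the kit
generations = 10   # generations mentioned in the text
rate = 0.003       # ASSUMED average mutations per locus per generation

mean = expected_mutations(n_loci, generations, rate)
print(f"expected mutations: {mean:.2f}")
print(f"P(0 or 1 mutation): {prob_at_most(1, mean):.2f}")
```

Under these assumptions, a patriline ten generations deep should usually still sit zero or one step from its founder, which is the pattern of tight clusters described above.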
But I think this is a useful paper because it showcases a few techniques and how to apply them.
Plus the intrigue! I guess some of you are familiar with this story but as a Californian this is all new to me. Very fun stuff!