Francoise Martinet, SM/PROG - mtDNA Group known? - what is the strength of the test?

Started by Private User on Friday, January 8, 2016


Showing 31-60 of 110 posts

Sharon, I am sure you are just trying to help build a great WFT, and I really want DNA to be introduced in the best way it can be. However, I feel that you do not understand what I am trying to explain, and that you have not looked into my posts closely enough to come to your conclusions.

Specifically, take Roland's pond example with 1000 fish. Let's say there were 1001 fish just before you decided to do the exercise, and one got away. The 'sampling' Roland referred to in his first post tells us something like: there is a 54% chance the missing fish was red, 36% blue, and 10% something else. That is altogether different from what we are 'telling' with DNA, which is that the fish is 100% red. We do this with a paper trail as our only aid (assume it is sourced), but I hope you can appreciate that this is something entirely different. Without the paper trail we would be lost, yet the paper trail introduces bias. That is where the 'strength' comes in...

There are many sources of bias... have a look here:

https://en.wikipedia.org/wiki/Bias_(statistics)

You can deduce for yourself which will be the main culprits of bias when using DNA for genealogy - but I could attempt to spell it out.

I am not sure of the level of the audience and their knowledge of sampling. When I refer to sampling 101 with a :), well, that :) can be aimed at myself.

Really - the digging-up story was not serious at all. It was perhaps just a bad example. But that would be a 100% case.

Btw, with all due respect to the researchers in the Botha/Appel case, from a statistical point of view they had very few samples, and I would like to have seen them refer to a meaningful sample size - which is what my question is all about. That would have assisted us too... my gut feel is 100 or 1000 - surely that is a question that needs to be answered. If it is that high, then I see many problems; in my heart I wish it were not, but my head tells me otherwise.

You would also see in one of the other 'sources' by Greeff that there is a lowish (0.7%) per-generation 'lack of credibility' in their study - and they refer to an empirical 1% per generation, i.e. not more than (1 - 0.01)^k where k is the number of generations - a lack of credibility not due to mutations, but just 'common sense' :)
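The (1 - 0.01)^k figure is easy to tabulate. A minimal sketch, assuming the ~1% per-generation error rate cited above holds independently at every link:

```python
# Cumulative reliability of a paper trail over k generations, assuming
# an independent ~1% error rate per generation (an assumed figure, per
# the empirical rate quoted above).

def paper_trail_reliability(k, error_rate=0.01):
    """Probability that every link in a k-generation paper trail is correct."""
    return (1 - error_rate) ** k

for k in (1, 5, 10, 15):
    print(f"{k:2d} generations: {paper_trail_reliability(k):.3f}")
```

Even at a modest 1% per link, a ten-generation trail is only about 90% reliable before any DNA evidence enters the picture.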

I get what you are saying - that DNA is only as accurate as the paper trail - and that makes sense to me. But again, that is not the same thing. I am specifically referring to cases where the DNA of a descendant differs from the 'assumed' DNA, i.e. the DNA does not match - and then, surely, you need large numbers for that exercise to be accurate about what the actual DNA is. For example, returning to the fish analogy: you could have 980 actual cases (i.e. the group of people alive) descending from a common ancestor, and 20 cases of a fish from another origin which you believe are all related to the first fish. Now, due to 'sampling' bias, say 1 case of the original fish and 4 cases of the new fish are drawn.

The problem may be that you pick a new fish type first and declare the ancestor a new fish type. Or you take the sums and aggregate. Both will give you the wrong answer, because you have drawn a wrong sample. And that is what this is about.
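The 980/20 pond above can be made exact. A sketch using the hypergeometric distribution (the sample size of 5 is an illustrative choice) shows how often even a truly random small sample misrepresents the population - and a convenience sample only makes this worse:

```python
# Exact hypergeometric probabilities for the pond example above:
# 980 "original" fish, 20 "new" fish, a random sample of 5.

from math import comb

def prob_k_new(k, sample=5, new=20, total=1000):
    """P(exactly k 'new' fish in a random sample of `sample` fish)."""
    old = total - new
    return comb(new, k) * comb(old, sample - k) / comb(total, sample)

p_any_new = 1 - prob_k_new(0)                              # at least one new fish
p_majority_new = sum(prob_k_new(k) for k in range(3, 6))   # new fish dominate

print(f"P(at least one new fish) = {p_any_new:.4f}")
print(f"P(majority are new fish) = {p_majority_new:.2e}")
```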

It can also be that the paper trail is wrong, i.e. someone made a mistake. But now you cannot tell whether it is the paper trail that is wrong, if you have first declared the paper trail right because the DNA matched.

Let me know if this is still not understandable?

Jan, you don't have to explain statistical bias or statistical significance to me - that was a significant part of one of my post grad degrees. We are telling you that you're applying them incorrectly:

The statistical likelihood that two random people in a population will share the same mtDNA, versus the likelihood that two people whose trees say they are collateral cousins would - that is the application statistical significance has to this problem. The number of people required for this bar to be met is in the single digits, not the thousands.

My question specifically referred to mtDNA and Y-DNA tests, and I said that for this sampling you need the same tree (intuitively higher accuracy - otherwise you need even larger samples). You seem to be referring to cousins, which this is not (and which would only have limited value, at most 4 generations - so yes, in theory, smaller sample sizes) - but that is not my question. (Your question leads me to think about what else would influence the sample size, e.g. locations etc.)

We, the Geni community, have a lot of work ahead of us as we make decisions as to how to best integrate genetics into Geni. Roland Baker brings some good ideas for how to accomplish this task.

Also, the Appel Botha Cornelitz case is fascinating. I am reading it and will get back to you in a bit. Thank you Erica for sharing this case.

It is important to differentiate between population genetics, which requires a large sample size, and genealogical genetics, which can be useful with quite a small sample size. A well-thought-out question and strategic testing of a few family members can reveal valuable information about our ancestors.

Erica, thanks for sharing this heuristic article with me:

http://repository.up.ac.za/bitstream/handle/2263/32007/Greeff_Appel...

This is like watching an episode of “painted place!” I’d give it an “R” rating. We need more stories like this in the United States – it might interest people more in genealogy and history! :) LOL

And it includes a plug for one of our local companies Applied Biosystems in Foster City California – not too far from where I sit now. Nice!

OK so let’s look at our cast of actors:

1) Jan Cornelitz – the jilted husband, who was impotent or sterile
2) Maria Kickers – the adulterer and wife of the above
3) Ferdinandus Appel – biological father of the first son, Theunis
4) Frederik Botha – father of the remaining three younger sons
5) Samuel Friedrich Bode – the other potential biological father, who ends up getting off the hook

What are we trying to prove?

“We… show that Maria’s first son was actually fathered by Ferdinandus Appel and that roughly half the living Bothas (38,000 people) actually descend from Ferdinandus Appel while the remaining three sons all stem from the same father, presumably Frederik Botha.”

Is this some novel idea? No, this is based on known case law etc., and all we’re trying to do is confirm what is already supposed: “several histories conclude that Ferdinandus Appel did indeed father the first son”

Findings:
1) “The random Botha sample show two distinct STR profiles estimated to belong to the R1b haplogroup and differing at five of the 17 loci in roughly equal proportions”

2) “There was only one mutational step between the one Botha haplotype and one of the Appel males (Inserting the appropriate values into the equation gives a probability of 0.0249. It is thus 130 times more likely that the descendants of Theunis are linked by male descent to the living man with the surname Appel than Theunis being a descended from another immigrant that just happened to have the same haplotype by chance).”

3) Samuel Friedrich Bode arrived too late to have so many offspring

Let’s look at this – we have only 17 markers to work with. This makes me reflect on the BigY and FullGenomes Y-chromosome analyses, where we are working with 76–97% of 55,000 known SNP markers – how spoiled are we? With roughly 50,000 markers to work with for many genealogies today, we have a much easier task than the authors had with only 17 markers. But many forensic kits and paternity tests rely on very few markers. Kind of scary when you think you could go to jail or not based on very few markers, without the benefit of 50K Y markers – or, in the case of most autosomal tests, about three-quarters of a million markers! But today we’re going to use technology that is three decades old – why not???

How do we start?
1) * consent form*
Always get signed informed consent!

2) Ethical review – check!

So the authors design the experiment. We have:
1) 76,125 people with the name Botha (not all descendants)
2) 10 generations

Wait a second – you mean we are going to take a random sample of everyone in South Africa with the same surname and see how many varieties we have? Yup! Isn’t that the same as above, where we have a pond full of colored fish and we want to determine the ratio of each color of fish? Yup! What do we use? Population genetics. What do we need? Random samples!
So the authors convince us that their samples are random:

“We do not have any reason to believe that any of Maria's sons' ancestors were more likely to participate than any other and we believe this to be a random sample.”

“Fifteen of these participated and we screened the ancestry information they gave to ensure that no close relatives were sampled.”

But we have a kicker. We have some samples from men with a known genealogical descent from each of the sons to compare these random samples with:

“In order to test if all Maria's sons had the same father or not genealogists put us into contact with living male descendants of each of her sons. One male from the random sample knew how he linked up to Maria's sons and he was included here. In this way three descendants of each of her first two sons, one of her third and two of her fourth son were typed (Figure 1).”

“By obtaining haplotypes of different patrilines of her first two sons we can have a high certainty of inferring the haplotype of her sons.”

So let’s cut to the chase. From the above we see that we have 3+3+1+2 = 9 samples (Figure 1) from men with known genealogies. In Table 1, column n2, we see all nine men. And in Table 1, column n1, we see there are 7+1+6+1 = 15 randomly selected samples from men who don’t have a known pedigree.

So in Table 1, on the X axis we have the different “haplotypes” and on the Y axis we have all the different markers, with names like DYS19. DYS stands for DNA Y-Chromosome (unique) Segment. These are short tandem repeats (STRs): short strands of DNA with a defined number of repeating pieces. In the table we see the exact number of these repeats for each STR. For example, the first marker, DYS19, for haplotype “Botha 2a” has 14 repeats. Got it?

And we have one more piece – we have a sample from a real living male Appel. “If either or both of these match the haplotype of Theunis's descendants it would indicate that Fredinandus Appel may have been Theunis's father.”

Now step back for just a second and look at these data please. We have several haplotypes consisting of 17 markers. Some of these differ by only one or two markers so we can reason they are probably all related and others differ from the first group by 5 of the 17 markers – is that a lot? Wow – yes we have two colors of fish. And not only that the second color of fish is only off by one marker from the living Appel fish – hmmm you think they are related? Yes you do. And you didn’t need fancy software to figure this out did you. This is common sense, right?
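The eyeballing above can be made concrete by counting how many loci differ between two profiles (the "genetic distance"). A sketch - the repeat counts below are invented for illustration and are not the paper's actual values:

```python
# Genetic distance between two Y-STR profiles, counted as the number of
# loci whose repeat counts differ. The profiles are invented examples,
# NOT values from the Greeff paper.

def differing_loci(profile_a, profile_b):
    """Return the loci at which two STR profiles differ."""
    assert profile_a.keys() == profile_b.keys(), "profiles must cover the same loci"
    return [locus for locus in profile_a if profile_a[locus] != profile_b[locus]]

botha_like = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS392": 13, "DYS393": 13}
appel_like = {"DYS19": 14, "DYS390": 23, "DYS391": 11, "DYS392": 13, "DYS393": 13}

diff = differing_loci(botha_like, appel_like)
print(f"{len(diff)} of {len(botha_like)} loci differ: {diff}")
```

One mismatch out of a handful of loci suggests close relation; five mismatches out of 17 puts a profile in a different cluster entirely - exactly the "two colors of fish" picture.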

So they assigned these samples to the haplogroup R1b. How did they do this?

“The haplogroups for the y-chromosome STR profiles were estimated with the Whit Athey haplogroup predictor.”

Whit Athey is just one of many haplogroup predictors that you can take for a test drive on the web. For a list of others see:

http://isogg.org/wiki/Y-DNA_tools

Two points. First, we know a lot about specific marker values – their stability, their mutation rates, their probability of occurrence in a given population, and their probability of occurring in combination with other markers. These are not just random values void of meaning, so we can infer a lot. Second, R1b is the largest haplogroup in Europe – so determining they are R1b is not very useful at all. In our genealogical applications we are going to be dealing with much better data than the authors have.

“The DYS385 loci (a and b) were excluded”

Really? Well, DYS385 is tricky for older methodologies. The Y-STR DYS385 consists of two duplicated copies – DYS385a and DYS385b. When you try to sequence these, the primer ends up amplifying both copies simultaneously, so it is difficult to tell which sequence is which. So they just tossed these data out. OK, fine.

So how do we know what weight to place on each marker:

“and the inverse of the variance was used to weight the remaining markers (Figure 2).”

Ouch! Really? With a sample size of 15, you expect to get a meaningful weight for each marker using the inverse variance? If one marker changes twice and another changes once, the first marker will be counted as half as important – yet with these data we may have only one sample carrying the unique value. So this is more of heuristic value, and I would posit not really a good idea at this (small) sample size.
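To see the fragility being criticised, here is a sketch of inverse-variance weighting on made-up repeat counts (these are illustrative, not the paper's data) - note how a single extra mutation at one locus sharply cuts that locus's weight, and a locus with no variation has no finite weight at all:

```python
# Inverse-variance weighting of STR markers, as in the quote above.
# The observed repeat counts below are invented for illustration.

from statistics import pvariance

observations = {
    "DYS19":  [14, 14, 14, 15, 14],   # one mutation -> small variance
    "DYS390": [24, 23, 24, 25, 23],   # several mutations -> larger variance
    "DYS391": [11, 11, 11, 11, 11],   # no variation at all
}

weights = {}
for locus, values in observations.items():
    var = pvariance(values)
    # zero variance means an undefined (infinite) inverse-variance weight;
    # a real analysis has to handle this case explicitly
    weights[locus] = (1 / var) if var > 0 else float("inf")

for locus, w in weights.items():
    print(locus, "raw weight:", w)
```

With a sample of 15, each weight rests on a handful of mutation events, so small-sample noise dominates the weighting.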

“To visualise the haplotypes detected, a median-joining network was constructed.”

Which is what you see in figure 2.

What does this mean? You start with the idea of a phylogenetic tree. Imagine you have an X axis and a Y Axis and you have on the X axis these specimens:
Bacteria
Shark
Fish
Lizard
Mammal

And on the Y axis you have these traits:
Single cell
Notochord
Vertebrae
Feet
Warm blood

And then you fill in the chart with (+) or (−) signs depending on whether each creature has the trait, and you construct a tree with the least number of changes between specimens. In this case you end up with a line from single cell all the way to mammal, with one change per step. The tree gets more complex as you add more specimens and more traits, but it is still fairly simple to create.

As things get more complex we use graph theory, where each specimen is a point called a “vertex” and each change in a trait is a line called an “edge.” We can use Kruskal's algorithm to find a subset of the edges that forms a tree including every vertex, where the total weight of all the edges in the tree is minimized – a “minimum spanning tree” algorithm. We are basically looking for the least number of trait-state changes needed to account for all the traits present in each specimen. Got it? That’s Kruskal's algorithm in a nutshell, without the math. We then use the maximum parsimony concept of James S. Farris to build the optimal tree, which minimizes the number of state changes – in other words, we want to minimize convergent evolution, parallel evolution, and evolutionary reversals.

Median joining adds an element, epsilon, to adjust for the level of state change (called homoplasy). Note that all of the traits above are binary, i.e. they can only be + or −, whereas STR data take a range of values depending on the number of repeats recorded. The advantage of median joining is that it can deal with non-binary states like STR values. The drawback is that, by default, each unique state of a trait is given equal weight – and as we know with STRs, a state change of 1 is more common than a state change of 2 or more, and the repeat count naturally tends to increase rather than decrease over time. So it isn’t perfect.
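For the curious, the Kruskal step sketches out in a few lines. This is a generic minimum-spanning-tree implementation with a union-find, run on a toy graph - not the authors' actual pipeline:

```python
# Kruskal's minimum-spanning-tree algorithm with a union-find, as
# described above. The toy graph below is illustrative only.

def kruskal(n, edges):
    """edges: list of (weight, u, v) with vertices 0..n-1.
    Returns (MST edges as (u, v, weight), total weight)."""
    parent = list(range(n))

    def find(x):                      # find root, with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst, total = [], 0
    for w, u, v in sorted(edges):     # consider edges lightest-first
        ru, rv = find(u), find(v)
        if ru != rv:                  # keep the edge if it joins two components
            parent[ru] = rv
            mst.append((u, v, w))
            total += w
    return mst, total

# toy example: 4 specimens, edge weight = number of trait-state changes
edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (3, 1, 3), (1, 2, 3)]
tree, cost = kruskal(4, edges)
print(tree, cost)
```

A tree on n vertices always has n − 1 edges, so the toy run keeps 3 of the 5 candidate edges.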

So we can see in Figure 2 that the result is a nice barbell-type tree: all the samples related to Ferdinandus Appel clustered on one side, separated by one step each, then a huge number of steps, and then all the samples related to Frederik Botha on the other end, separated by one to a few steps.

“It is thus 130 times more likely that the descendants of Theunis are linked by male descent to the living man with the surname Appel than Theunis being a descended from another immigrant that just happened to have the same haplotype by chance.”

That’s the money sentence I guess because using a formal approach we can come up with the probability of one theory being correct versus the other. But couldn’t you see this just eyeballing the chart of data? Yes you could.

And then we do a sanity check. Y-SNP base substitutions occur at roughly 3.0 × 10−8 mutations per nucleotide per generation, while Y-STR mutation rates are about four to five orders of magnitude higher – on the order of 10−3 per locus per generation on average. But some STRs are more stable than others; there can be a seven-fold variance, and the authors didn’t really address this. The way these rates are measured is to look at the known timing of population separations – for example the Maoris and Cook Islanders – and derive a mutation rate for different STRs, or to look at many father-son pairs. Then we can use either the average squared distance or Bayesian analysis – BATWING, for instance, uses a Markov chain Monte Carlo method based on coalescent theory. But the point is that the observed differences are consistent with the measured mutation rates.
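As a rough back-of-the-envelope version of that sanity check (the per-locus rate here is an assumed ballpark average, not the paper's figure):

```python
# Expected Y-STR mutations across a pedigree, assuming an average rate
# of ~2.5e-3 mutations per locus per generation (an assumed ballpark
# figure, not taken from the paper).

from math import exp

RATE = 2.5e-3        # mutations per locus per generation (assumption)
LOCI = 17            # markers typed in the study
GENERATIONS = 10     # roughly the depth of the Botha pedigree

expected = RATE * LOCI * GENERATIONS
p_zero = exp(-expected)          # Poisson approximation: chance of no mutation

print(f"expected mutations   : {expected:.3f}")
print(f"P(no mutation at all): {p_zero:.3f}")
```

So over ten generations and 17 markers, zero or one mutation is the typical outcome - consistent with the single-step differences seen in the paper's clusters.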

Can we rule out that there are not more than two ancestors here? Not really.

But I think this is a useful paper because it showcases a few techniques and how to apply them.

Plus the intrigue! I guess some of you are familiar with this story but as a Californian this is all new to me. Very fun stuff!

Thanks Erica I'll take a look at this too!
RE
https://www.geni.com/discussions/151927?msg=1059959

Thanks Sharon Doubell! I accepted your requests!

I have a couple of quick points:

1) In a controlled lab setting I can know the Y or mtDNA genotype of F(1) based on DNA data from F(n) if n is a small number. The only caveats are (a) that a certain number of mutations may have occurred between F(1) and F(n) – if n is small these will be de minimis, but they should not be ignored – and (b) that there were no errors in the DNA analysis. This is a scientific fact, so we use it as a primary source.

The source is: a claimed descendant of this ancestor has this genotype, via this lineage, tested at this lab, available via this repository, with contact information here, documented family tree available over there, and sequence, STR and SNP data either as presented or available over there. That’s the source. It should be presented as the genotype of the person being tested; it should not be presented as the genotype of the ancestor. In this case the hypothesis we are trying to test is the genealogy, and this is only one source like any other source. The source itself isn’t wrong, but the genealogy could be. So we allow all such primary DNA sources to be recorded in a column with a hyperlink to the source details, as outlined above, for each DNA source:

http://www.wikitree.com/wiki/Baker-17919

Like any source it has to be examined by the researcher and compared against all other sources.

So I think what Jan may be objecting to is stating that the genotype belongs to the profile of the ancestor. That has not been proven because the genealogy of the descendant has not been proven. So the genotype along with other DNA data should always belong to the person being tested along with the other information above.

2) The point made about a Y or mtDNA test being rare or not is going to be moot soon. We’re looking at millions of people having their entire genomes sequenced within a few years. The level of resolution we are going to have is like nothing we have ever seen before. I’ve already had my complete genome sequenced, and the results are very specific – in a sense, “rare.” If you’ve looked at the FullGenomes Y-chromosome results, you have about 55,000 SNPs and well over 500 known STRs, and that list keeps growing. These are as unique as a fingerprint in a genealogical sense; they have a specific location in a tree and can be pinpointed with great accuracy. That’s why I’m stating that in the near future all DNA results will yield a “rare” haplotype. Already, these Y-DNA sequencing results are being used to take the results from STR tests and predict their place on the Y tree with a great deal of accuracy. Likewise, we won’t be using markers for mtDNA much longer – all tests will be fully sequenced once the cost is cheap enough. I published my mtDNA sequence with the NIH and it is the only one like it in their database. That’s pretty rare. And likewise the resolution and number of SNPs available for autosomal comparisons is going to increase drastically. So imagine a world where every terminal haplogroup is a rare haplogroup. That’s what we have to plan for.

Rereading what I just wrote I see I didn't state that correctly sorry. Under point two please note that mtDNA does mutate very slowly. So while we'll have enhanced resolution with full sequencing - we won't have the same fine genealogical time scale gradations that we do with other types of DNA. I think that would be obvious for this crowd but I wanted to make that clear.

Re: So I think what Jan may be objecting to is stating that the genotype belongs to the profile of the ancestor. ...

I think that might be the case too, but it needs Jan to answer.

I do have to say that having the haplogroup of an ancestor (however "valid") as a reference is enormously helpful to me. At a glance I can determine - "cannot be "that" John Smith, impossible haplotype."

I've probably written more than anyone here wants to read. But I'm going to add one more thing...

Getting back to Jan’s original proposition… I think the thing we are trying to “prove” is the genealogy. We aren’t trying to “prove” the DNA test, nor are we trying to prove the haplogroup of the ancestor. The DNA test is a source. So when we ask how many samples are needed to prove the DNA haplogroup of an ancestor, I think the question is backwards: in order to prove the DNA haplogroup of the ancestor – based on any single sample, or even on samples from 100% of all living descendants – you would need to prove the genealogy is correct.

My standards of proof are pretty high. We look to the work of Robert Charles Anderson, who wrote the Great Migration series – one of the great genealogists of our time, trained as a biochemist, so he seeks evidence. But would even he claim that he has “proven” a genealogy? If we look at a lineage that goes back 400 years, what exactly counts as proof? A family Bible record? A birth certificate? A book reference? We all know all of these can be flawed. How do you put a weight on a specific genealogical lineage? Do you sum the total number of birth certificates and use the least mean square? When you think of it that way, the proposition is absurd.

No matter how many DNA samples you take, you can’t rationally put a confidence level or margin of error on the claim that an ancestor has a specific genotype, because we can never be 100% confident that even one of the DNA-tested people is actually a descendant. So what I’m saying is that there is no way to do this math. But if we weigh DNA evidence against other genealogical evidence, we have to put a lot of weight behind it. That’s why I claim the primary source is the result of the person being tested, along with their documented genealogy, etc. We can’t pick a certain number of living descendants to sample the way we can with colored fish in a pond – because while we may be able to “call” the “color” (haplotype) of the fish we sample, we can’t be 100% certain that these are really “fish” (descendants) rather than “frogs” (people with an incorrect genealogy).

Erica - you posted at the exact same time I did and I think you are spot on.

And don't forget we are digging up the dead and DNA testing them too. That's not a joke. 2015 was a banner year for ancient genomes sequencing. As the cost, speed and techniques such as inner ear DNA extraction improve we'll be collecting even more ancient genomes. So we are drilling down to the answer from both ends - the living and the dead.

Roland, you are getting close with the 2nd last post, and what you and Erica say is true - my question is on the genotype belonging to the ancestor and the certainty of that statement.

However, in 'sampling' the math actually works... even if there are millions of individuals with a common ancestor, using DNA you only need a sample of just over 1000 (and the DNA of the samples does not actually need to agree; the sample will still 'carry' you to a meaningful statistical conclusion). Yes, you also need the samples to be 'random' - and I agree that in this sense it is impossible, or unlikely, to determine the full sample frame, but statisticians got past that too - it just reduces the strength.
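Jan's "just over 1000" matches the classic sample-size formula for estimating a proportion. A sketch, assuming 95% confidence and a ±3% margin of error (Jan does not state these parameters, so they are assumptions):

```python
# Classic sample size for estimating a population proportion:
#   n = z^2 * p * (1 - p) / E^2
# Assumes 95% confidence (z = 1.96), worst-case p = 0.5, and a +/-3%
# margin of error - plausible defaults, not values stated in the post.

from math import ceil

def required_sample_size(margin, z=1.96, p=0.5):
    """Minimum n to estimate a proportion to within +/-margin."""
    return ceil(z * z * p * (1 - p) / (margin * margin))

print(required_sample_size(0.03))   # the "just over 1000" figure
print(required_sample_size(0.10))   # a looser margin needs far fewer
```

Note this applies to estimating a population frequency; it says nothing about confirming a single pedigree, which is the distinction debated throughout this thread.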

But on your recap of the Botha/Appel case - I cannot agree, in the sense that the individuals were not selected at random. They were chosen as set out under paragraph 2 - to be random, you would need to list all 70,000+ individuals and then randomise the selection from them. The researchers took a 'convenience' sample, which is the worst type one can get; it means that no scientific conclusion can be made. You can give their paper to any qualified statistician... but I agree it is exciting. (This is not my question, though - I never raised it, and am only responding here because others drew similarities... :( ) Furthermore, a sample of fewer than 20 out of 70,000+ is not reliable. You may appreciate that most of the other sources they reference actually acknowledge that they lack credibility, which could be improved with more samples. Furthermore, in this case they should have taken a stratified sample, i.e. grouped by each of the sons' descendants. That also changes the statistical conclusions, and it is not clear they even took it into consideration. Then again, they would have been very unlikely to know the stratification of all 70,000+ individuals - another reason to take a convenience sample.

If you have fewer than the 'required' number, you can actually determine the Type I error... i.e. with 100 samples you have (ignoring the other errors I pointed out) relatively high certainty, but not as high as with 1000, etc.
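Running the same formula in the other direction shows what a smaller sample buys you. A sketch, again assuming 95% confidence and worst-case p = 0.5:

```python
# Margin of error achieved by a given sample size, at 95% confidence
# and worst-case p = 0.5 (assumed parameters; the post gives none):
#   E = z * sqrt(p * (1 - p) / n)

from math import sqrt

def margin_of_error(n, z=1.96, p=0.5):
    return z * sqrt(p * (1 - p) / n)

for n in (15, 100, 1000):
    print(f"n = {n:4d}: +/-{margin_of_error(n) * 100:.1f}% margin")
```

So 100 samples still pin a population proportion to within about ±10%, while 15 samples leave a margin of roughly ±25%.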

I searched for two days and could not find any meaningful research by anyone on this aspect - specifically what would be the 'magic' number and what impacts this number. If anyone can please contribute on that front I am sure we will all appreciate it.

I never said or implied DNA is not useful, just that the conclusions drawn must be factual - we have the theory but must take care in the implementation thereof.

Interesting paper!

The conclusion is that a large number of Bothas alive today are actually biological Appels! This conclusion affects many people, so there are a number of issues I have with the paper. One is sample size: I did not read any evidence that the authors performed a power analysis. This type of analysis is used to calculate the minimum sample size needed for a statistically significant result. Maybe 15 random samples and 9 with known pedigree is good enough - we don't know without the power calculation. (And if they did one, they should have reported it in their paper.)
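For a sense of what such a power calculation might look like, here is the standard two-proportion sample-size formula under the normal approximation. The 5% significance level, 80% power target, and example effect size are illustrative assumptions, not values from the paper:

```python
# Sample size per group to detect a difference between two proportions
# (normal approximation):
#   n = (z_alpha/2 + z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2
# z_alpha = 1.96 (two-sided 5% level), z_beta = 0.84 (80% power) -
# conventional choices, assumed here for illustration.

from math import ceil

def n_per_group(p1, p2, z_alpha=1.96, z_beta=0.84):
    numerator = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    return ceil(numerator / (p1 - p2) ** 2)

# e.g. distinguishing a 50% haplotype frequency from an 80% one
print(n_per_group(0.5, 0.8))
```

Large effects (like profiles differing at 5 of 17 loci) need only small groups, which is part of why genealogical questions can get away with far fewer samples than population surveys - but the calculation should still be shown.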

However, South Africa has a number of features that make genealogy and genetics a little easier. One is that there are good genealogical records. Another is the historically low level of non-paternity, 0.8%. Maybe these factors are enough to warrant the small sample size, though we don't know without the power calculation.

Another issue to consider is the 17 STRs tested. Though the results obtained with these 17 are compelling, it would be better if more STRs were included in the testing. As I wrote above, the conclusion drawn from these data affects a lot of people, so I would prefer it to be as rigorous as possible. Most surname sites recommend testing at least 25 STRs; testing at least 67 STR markers would strengthen the claims made in this paper.

Jan advocates for a larger sample size for the Botha/Appel study. I agree with him. However, without a power calculation, it is difficult to know how large the sample size needs to be.

Big conclusions require big evidence.

I think Jan's initial post--and the discussion it raised--brings up an issue that we will need to address sooner or later:

"When in a profile it is claimed that this person's DNA (mt or Y) is <AbC>, how confident are we that this claim is accurate?"

Jan, you are a trained statistician and an actuary by profession, so I think this is right up your alley. I'd love to work with you on developing a method (or formula/protocol) to provide a solution.

Maybe we should start a project "Method for estimating the certainty of DNA results applied to ancestors". (If you can think of a better title, shoot...)

For example, it is claimed that Françoise Martinet is of mtDNA haplogroup U5b. I think we can be confident that the probability of this claim is between '0.00' and '1.00' - a 100% confidence interval. However, to be 100% confident, the interval has to be so wide as to be meaningless.

Now, can we narrow this interval to, say, 90% confidence? That is, can we come up with an interval within which we are 90% sure the "actual" value lies?

Don't you think that we could use a Bayesian model for this? Take the Françoise Martinet case: we already have as priors:

- June's mtDNA test results (with its own uncertainty)
- the SA genealogy database (with its own uncertainty)
- and more....

And the beauty of Bayesian analysis is that, as we add evidence, we can update the priors and the estimates become more accurate.
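A minimal sketch of that updating, using a Beta-binomial model. The match/mismatch counts are invented, and treating each testee's result as an independent yes/no observation is a simplifying assumption:

```python
# Minimal Bayesian-updating sketch: a Beta prior on "the haplogroup
# claim is supported", updated by match/no-match evidence. The counts
# are invented, and independence of observations is assumed.

def update(prior_a, prior_b, matches, mismatches):
    """Beta(prior_a, prior_b) prior + binomial evidence -> posterior Beta."""
    return prior_a + matches, prior_b + mismatches

def mean(a, b):
    """Posterior mean of a Beta(a, b) distribution."""
    return a / (a + b)

a, b = 1, 1                                     # uniform prior: no opinion yet
print(f"prior mean    : {mean(a, b):.2f}")

a, b = update(a, b, matches=3, mismatches=0)    # three testees match U5b
print(f"after 3 tests : {mean(a, b):.2f}")

a, b = update(a, b, matches=9, mismatches=1)    # more evidence arrives later
print(f"after 13 tests: {mean(a, b):.2f}")
```

Each new test result just adds to the Beta parameters, which is exactly the "update the priors as evidence arrives" property described above; a fuller version would also report a credible interval from the posterior.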

Wouldn't it be cool if, at GENI, users could apply (perhaps with some assistance) our method to be able to post claims like:

> "It is estimated, between 60% and 90% confidence, that 'Jane Doe, born in 1707' is of mtDNA haplogroup Wa1b"

which would reflect a relatively good piece of data. And in this other case:

> "It is estimated, between 10% and 25% confidence (...)"

which would still be data, but to the negative. And yet another case :

> "It is estimated, between 5% and 75% confidence (...)"

which basically would mean the claim is garbage and should be discarded.

Jan, we could even go further. If, in the method, we spell out each assumption clearly, it would be possible for a Geni user to determine where to invest energy (or capital) to increase the reliability of the claim.

This may be not be an easy project, but it sure looks fascinating.

(P.S.: if anyone is aware that such a method already exists, please let me know)

Jan, what do you think?

Jan

My fundamental question for your question is "my utter confusion" as to how sampling (and therefore the math that goes with it) is relevant.

If I use GEDmatch to compare my aunt's DNA test results with Roland's, it is a direct chromosome-to-chromosome comparison, not a sample. (BTW, no match, at least at default thresholds.) So how are stats relevant?

+ 1 to morel's proposal.

I would also say that I'm not sure I (personally) would find it especially useful, but that's for a specific reason - a point I've been meaning to raise.

I am not interested in DNA testing for genealogy to prove / disprove a paper trail although that is helpful.

I am interested because for (perhaps well over) 60% of my ancestry beyond 4 gens there is no paper trail and there (likely) never will be much of one. There never were records, or they're behind Iron Curtains, or they burned in wars, or names changed beyond recognition. Etc.

So I look to the collective power of the DNA test databases to construct DNA ancestors.

Jan – I’m glad we’re getting closer.

I totally agree about the sampling but didn’t state that because it was already called out in the paper by the authors and their reviewers.

So let me get on your side for a second and say yes you could construct a sampling model if you could place a level of confidence in the genealogy of each person sampled, correct? That would have to be a condition of constructing such a model.

But I would make the claim that we lack just that ability.

For example, in the Botha/Appel case: if all the court records etc. had been lost regarding the testimony of the characters involved, and we had no knowledge of the paternity issue, then examining these data we would conclude that the descendants of three of the sons were in fact also descendants of Jan Cornelitz! (This assumes we fixed the other issues in the study, such as the quantity and quality of the samples.) In that case we would be assigning a haplotype of R1b to Jan Cornelitz.

Take a more abstract example and let’s go back to fish. I have a pond in my backyard in California with three colors of fish – blue, red and green. All the blue fish descend from a blue patriarch. All the red fish descend from a red patriarch. All the green fish descend from a green patriarch. Now we want to collect samples of the fish, check their color and determine the genotype of the patriarch. However, before I can take a sample, the fish and game department shows up and dumps in an unknown quantity of identical green fish, who hail from a different green patriarch from Nevada with a different genotype. Not knowing which of the green fish is which – what sample size would I need to determine the genotype of the green patriarch from California?

False pedigrees abound. And it is true that not many people are lining up to claim to be the descendant of my farmer ancestor from New Hampshire. But on the other hand, everyone and their dog is lining up to be the descendant of my ancestor Mayflower passenger Stephen Hopkins. (I know because I am culling them right and left.) I pose this example as a joke because some of you know what I’m talking about. I think Erika called it poor sportsmanship? But aside from these examples, we all know there can be a million legitimate reasons a genealogy can be wrong.

Well Erica - you are still my paper cousin if not my DNA cousin :)

... And perhaps not on "this" discussion we could examine why "paper trail 8th cousins" are not (yet?) matching through DNA testing.

Jan has been very patient - so we should probably stay on topic. But anytime you want to discuss this Erica you know where to reach me :)

My 2 cents:

And I have read every post on this thread, so this may already have been raised..

Jan is correct in saying that 100 plus samples are needed to accept a DNA link from a Progenitor to a living person.. obviously the number of generations separating them will also be a large variable.

However if there is a paper trail, only 2 or 3 samples should now be needed to prove (or disprove) the connection.

So we seem to have gone around the block here and arrived back at the same place.

Yes, Jan is objecting to stating that the genotype belongs to the profile of the ancestor. That was never in question and was answered immediately:

"Jan, you are correct - these are projections only, based on the assumption that the paper trail represented by the genealogy line is biologically correct. (A safer bet with mtDNA than with Y DNA:-))
The assumption is that this presumption is obvious, but perhaps we should be more specific about it on the project."

"Should we point out in the Curator note that we haven't dug up her grave
>Not possible when dealing with Prog profiles which usually contain a lot of other data on a Curator note of limited characters.
Is this dishonest?
>Well I hadn't thought so, given that the position of virtually every ancestor on our tree contains the same implicit caveat: '(As far as the data shows) this is my ... grandparent.' Very few of us have the 'empirical evidence" of DNA proof from samples of even our grandparents to prove the paper trail relationship beyond doubt.
So, '(As far as the data shows) this is her mtDNA Haplogroup.' If more data becomes available to contradict this, we will update.
That doesn't seem to me to be unreasonable, or needing to be 'curbed'."

This is the point that needs discussing.

June and Daan are neither random nor biased samples. They are collateral cousins who have self selected as being carriers of Francois' mtDNA. If Daan's mtDNA results are the same as June's, there is an increased possibility that their paper trail is correct. We are testing the paper trail, not the DNA! This needs to take into account what the chances are that any two white South Africans selected randomly would have an mtDNA match anyway. I don't know how many living collateral cousins purporting to be descendants of Francois testing with the same mtDNA it would take to assume the paper trail is likely to be solid, as I'm not a statistician, and I don't know the figure for the chance that two random white SAs share mtDNA, but I'm positive it would not be thousands, and, depending on the spread of the lines represented by the collateral cousins, I'd say that the convincing number would be in the single digits.
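As a back-of-envelope illustration of the "single digits" hunch above, here is a minimal sketch (my own, not from any poster; the sharing probability used is entirely hypothetical, since nobody in the thread knows the real figure for two random white South Africans sharing mtDNA). Under an independence assumption, if k self-selected "cousins" were in fact *not* maternally related, the chance they would all still share one haplogroup by coincidence shrinks roughly as p^(k-1):

```python
# Sketch only: how fast does a coincidental match become implausible?
# p_share is a HYPOTHETICAL probability that two random people in the
# population share the same mtDNA haplogroup (real value unknown here).

def coincidence_prob(k_matching, p_share):
    """Rough chance that k self-selected testers all share one
    haplogroup purely by coincidence, assuming independence:
    each tester after the first must match the first, so p^(k-1)."""
    return p_share ** (k_matching - 1)

for k in (2, 3, 5):
    print(k, coincidence_prob(k, 0.10))
```

Even with a generous p_share of 10%, five matching collateral cousins would leave only a one-in-ten-thousand chance of pure coincidence, which is in the spirit of the "single digits" estimate, though the independence assumption and the hypothetical p_share mean this is only a rough guide.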

Sorry, Don - crossposted.

At the moment the mtDNA data is in a Curator note - because we're trying to highlight a possibility on a Progenitor profile that we know is likely to have a well developed tree on Geni - in order to get other descendants to test the validity. This is not something we're putting on all ancestors. Just the Progs, where we have a test result. Perhaps there is a better way to do it - that's why I asked Roland to come in as an advisor on the Progenitor DNA projects.

Jan, on the matter of mismatches. The minute Daan tests with a different mtDNA to June, then the line is suspect, and we'd remove the mtDNA note. Until he does, it's the best available Source that we have, and I'm proposing that we treat it like any other Source.

I cannot respond to all... please shout if I missed out something.

Roland - an interesting point raised on the pond, the fish, and the fish from another location. The number of samples to draw is mathematically actually straightforward, and the theory exists. The criterion is that it is a 'probability' sample... (if unequal, that it be quantifiable, in order to alter the output appropriately, or to adjust the number of samples to take).

In the pond example, there are 4 groups (which you know: R1, B1, G1 and G2), and assume there are N fish in total (perhaps you even have N1 and N2, if you have information on the number of G2 added..?). Then with mathematics you can derive the sample size n (which would be larger than the sample size m if you could 'categorise' beforehand) - have a look here, as it is a good summary of the different types of samples you can take.

https://en.wikipedia.org/wiki/Sampling_(statistics)

Lohr was my textbook for my studies in Sampling, it is listed as a source on wiki and is available freely on the net. In each chapter the mathematics are given, to determine n per sample type - I cannot type it out in text...

https://www.google.co.za/url?sa=t&rct=j&q=&esrc=s&s...

As you can appreciate, shortcuts are taken to reduce the complexity of the mathematics, which leads to higher cost or computation time. Always a trade-off - I will work with anyone who can contribute.

I first have to study the other responses... will take some time :)

Erica - if you know your DNA, Roland's DNA and your aunt's DNA, no sampling is necessary.

Sampling is necessary if you don't know your aunt's DNA and want to use Roland's and your DNA results to 'estimate'/'tell' your aunt's DNA (in the case where you do not have it). An aunt may be a bad example, as she is a recent relative and the sampling frame therefore very small. This one is 1600s, with many living descendants - who knows how many (in statistics the actual number does not matter if it is large).

I think you underestimate the usefulness of being able to 'predict' (which is all we will ever have in most cases) the Y-DNA and mtDNA of a common paternal/maternal ancestor... even if you don't know the paper trail, you would be able to say it is your ancestor because someone else (with the same Y-DNA or mtDNA) has the paper trail to that common ancestor... and it can be proven with high certainty. Completing your tree from the top...

Currently Geni only shows the closest common relative in any line, but I foresee that it should one day be able to show the closest paternal and maternal relatives as well...maybe realistically only in some cases where people are say 15th cousins or closer.

And it is very useful (if accurate) in all your non-paternal/maternal lines (which contain most of your ancestors) if others can do it too. Actually, the Geni paper-trail method, combined with DNA, could in future be many steps ahead of all DNA databases in this regard (and may be able to preempt collaboration).

Sharon - It seems unlikely that I should be trolling on a discussion which I started and just want to steer in the direction intended. I probably need to object to the 'intentions' attributed to me. I don't agree. But it's not important; let's stick to the discussion, as that is my only intention.

I am trying to point out the dangers of the current use, how it may impact the WFT and users' experiences, and possible actions that could be expected.

And I offer solutions - your last summary did not even include what I offered, and I feel it is not correct, nor in line with my own summary of the discussion. Particularly as you also left out the solution by Morel, which is what we should be aiming at. I don't care where DNA is put. I do care that no information is deleted, as it is statistically viable; we just need a good rule of thumb to say what its reliability is.

Remember, if samples don't match - it does not influence the strength of the relevance! And this is about the strength of the relevance.

Donovan - what is the strength of the 'proof' in the case where 2 or 3 samples are taken and they agree? It is far less than the strength of the proof when 20 or 30 samples are drawn and all but 2 of them agree... because from statistics we actually expect there to be outliers (referring to the ±1% loss per generation we expect as common sense). The problem is that the 2 or 3 samples selected may themselves be the outliers we would have seen among more samples... hence the need for a reasonable number of samples - even when the paper trail agrees.
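The outlier worry above can be put in rough numbers. This is my own sketch, not from the thread: I assume a HYPOTHETICAL ~1% chance per generation that a paper-trail link is biologically wrong, and an assumed 10 generations back to the progenitor:

```python
# Illustrative sketch: how likely is a small sample to consist
# entirely of 'outlier' (broken) lines? Numbers are hypothetical.

def p_line_broken(generations, per_gen_error=0.01):
    """Chance that a single descent line of the given length
    contains at least one false link."""
    return 1 - (1 - per_gen_error) ** generations

def p_all_samples_broken(n_samples, generations, per_gen_error=0.01):
    """Chance that EVERY line in a sample of n independent testers
    is broken - i.e. the whole sample could be outliers."""
    return p_line_broken(generations, per_gen_error) ** n_samples

g = 10  # assumed generations back to the progenitor
print(p_line_broken(g))             # ~0.096 per line
print(p_all_samples_broken(2, g))   # small sample: under 1%
print(p_all_samples_broken(20, g))  # larger sample: vanishingly small
```

Under these assumed rates each individual line has roughly a 1-in-10 chance of being broken somewhere, so a sample of only 2 or 3 agreeing testers still carries a small but real chance of being uniformly misleading, while 20-plus samples make that essentially impossible - which is the gist of the argument, even if the real per-generation rate differs.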

The solution suggested is to have more than one DNA field per individual, plus the number of descendants that match that DNA... surely not too daunting a task? It then becomes straightforwardly a numbers game, and we can say what the alpha of the samples is (with approximations from mathematical theory; hopefully someone can point towards more empirical evidence?), assuming N is very large for generations far back.

If we have done a DNA test on an individual, then the numbers do not matter and such an individual should have only one DNA field with 100% (or 99.9%) certainty.

This discussion was not about DNA fields, but about your objection to my putting the projected mtDNA in a Curator Note on a Progenitor profile, when a motherline or fatherline descendant gets a result.

-You said that I was extrapolating without empirical evidence, and that the result would only hold true if someone took her sample physically.
I said that wasn't true: Extrapolation can be done empirically without physical evidence from the grave.
No reply from you.

-You posted info about significance levels in random samples.
I pointed out that the samples are SELF SELECTED, NOT RANDOM.
No reply from you.

-You posted info about bias.
I said that didn't apply.
No reply from you.

-You said mtDNA isn't accurate beyond 4 generations.
I said that wasn't true.
No reply from you.

-You said that we needed thousands of samples. I said that wasn't true if you have collateral cousins being tested to CONFIRM A PAPER TRAIL.
No reply from you, but you keep explaining in terms of a randomly selected population.
This is not a random selection. There are no outliers - ANY discrepancies mean the paper trail is incorrect.

Since you are not acknowledging that we're testing the paper trail, not the DNA, it's difficult to figure out what test strength you're referring to?
-I take it you're not referring to the veracity of the DNA test June took?
-So then, are you talking about the reliability of extrapolating from n descendants if you know the likelihood of 2 random South Africans' sharing mtDNA? I have referred to this twice, saying "I don't know how many living collateral cousins purporting to be descendants of Francois testing with the same mtDNA it would take to assume the paper trail is likely to be solid, as I'm not a statistician, and I don't know the figure for the chance that two random white SAs share mtDNA, but I'm positive it would not be thousands, and, depending on the spread of the lines represented by the collateral cousins, I'd say that the convincing number would be in the single digits."
If you disagree, you haven't said why yet. It would be good if you did.
As to Morel's ‘Uncertainty Principle’ project :-)
I have my doubts about the possibility of accurately estimating the un/certainty of the SA genealogy database as a predictive factor, but maybe I am demonstrating too little faith :-)

Sharon... you make it easier for me to respond with the previous post. I used => in this post.

=> I took your previous post word for word, and my words come after =>, with open space before and after.

This discussion was not about DNA fields, but about your objection to my putting the projected mtDNA in a Curator Note on a Progenitor profile, when a motherline or fatherline descendant gets a result.

=> Used as an introduction - something that leads to the other conclusions.

-You said that I was extrapolating without empirical evidence, and that the result would only hold true if someone took her sample physically.
I said that wasn't true: Extrapolation can be done empirically without physical evidence from the grave.
No reply from you.

=> I responded! In basically all my posts I explained exactly why this is not so - in fact, otherwise we would not be having this discussion.

-You posted info about significance levels in random samples.
I pointed out that the samples are SELF SELECTED, NOT RANDOM.
No reply from you.

=> I responded many times. Using your analogy of Botha/Appel: the self-selection introduces bias, etc.

-You posted info about bias.
I said that didn't apply.
No reply from you.

=> See the item just above :)

-You said mtDNA isn't accurate beyond 4 generations.
I said that wasn't true.
No reply from you.

=> I never said mtDNA isn't accurate beyond 4 generations. I said it seemed that you are referring to Autosomal DNA (cousins) and that Autosomal DNA is not accurate beyond 4 generations, as you are aware.

-You said that we needed thousands of samples. I said that wasn't true if you have collateral cousins being tested to CONFIRM A PAPER TRAIL.
No reply from you, but you keep explaining in terms of a randomly selected population.
This is not a random selection. There are no outliers - ANY discrepancies mean the paper trail is incorrect.

=> I responded and others responded - There are outliers as can be expected. You don't understand.

Since you are not acknowledging that we're testing the paper trail, not the DNA, it's difficult to figure out what test strength you're referring to?

=> I made it clear many times? The test strength is the reliability of the unknown Y or mtDNA of an ancestor. We use the paper trail as additional data.

-I take it you're not referring to the veracity of the DNA test June took?

=> How can I if I do not know what you are talking about?

-So then, are you talking about the reliability of extrapolating from n descendants if you know the likelihood of 2 random South Africans' sharing mtDNA?

=> No, not at all. We are not testing likelihoods at the bottom.

I have referred to this twice, saying "I don't know how many living collateral cousins purporting to be descendants of Francois testing with the same mtDNA it would take to assume the paper trail is likely to be solid, as I'm not a statistician, and I don't know the figure for the chance that two random white SAs share mtDNA, but I'm positive it would not be thousands, and, depending on the spread of the lines represented by the collateral cousins, I'd say that the convincing number would be in the single digits."

=> This is just not on the topic... I don't need to agree or disagree?

If you disagree, you haven't said why yet. It would be good if you did.

???

As to Morel's ‘Uncertainty Principle’ project :-)
I have my doubts about the possibility of accurately estimating the un/certainty of the SA genealogy database as a predictive factor, but maybe I am demonstrating too little faith :-)

=> The test is not at all about a database being a predictive factor. I think he can do what he said he wanted to do.
