Using autosomal DNA matches to build the World Tree

Background

Biology

Every human being has, inside every one of their cells, twenty-three pairs of chromosomes. Twenty-two of these pairs are numbered from 1 to 22. The 23rd pair is special, consisting of either a pair of "X" chromosomes, or an "X" and a "Y". These determine the biological sex of the person. We will not be considering either "X" or "Y" chromosomes in this discussion.

The numbered chromosomes are organized roughly by length, with Chromosome 1 being the longest one, and Chromosomes 21 and 22 being the shortest. Places within the chromosome are described by a number, sometimes called the "base position", e.g. "13456", which represents how far along a chromosome a particular stretch of DNA begins. Because the length of the chromosomes differs, the numbers can get much bigger for Chromosome 1 than they get for Chromosome 22.

A person has exactly the same set of chromosomes in every cell of their body. For every chromosome pair, one of the pair came from the mother, and the other of the pair came from the father. The mother and father each combine every pair of chromosomes they have into one, yielding 23 single chromosomes, through a random process of rearrangement. The process of rearrangement chooses random sections from one or the other of each member of each pair of chromosomes. The length of each section may be long or it may be short. For example, the father's contribution to one of his children for Chromosome 3 might include section 1-13245 from his mother's side of the chromosome pair, section 13246-78812 from his father's side of the chromosome pair, section 78813-78816 from his mother's side, etc.
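The rearrangement described above can be sketched in a few lines of Python. This is a toy model only: the function names, the ten-position "chromosome", and the crossover positions are all hypothetical, and real recombination operates on millions of base positions with biologically constrained crossover locations.

```python
import random

def recombine(maternal_copy, paternal_copy, crossovers, start_with="maternal"):
    """Build the single chromosome a parent passes on from that parent's own pair.

    maternal_copy / paternal_copy: equal-length sequences, one entry per position
        (the copy the parent received from their mother / from their father).
    crossovers: positions at which the source switches to the other copy.
    """
    if len(maternal_copy) != len(paternal_copy):
        raise ValueError("both copies must have the same length")
    sources = {"maternal": maternal_copy, "paternal": paternal_copy}
    breakpoints = set(crossovers)
    current = start_with
    result = []
    for pos in range(len(maternal_copy)):
        if pos in breakpoints:
            # Switch to the other member of the pair at each crossover point.
            current = "paternal" if current == "maternal" else "maternal"
        result.append(sources[current][pos])
    return result

def random_crossovers(length, count, rng):
    """Pick `count` distinct crossover positions strictly inside the chromosome."""
    return sorted(rng.sample(range(1, length), count))

# Toy example: "M" marks material from the parent's mother, "P" from the parent's father.
child_copy = recombine("M" * 10, "P" * 10, crossovers=[3, 7])
print("".join(child_copy))
```

Calling `random_crossovers` with a different random generator for each child gives each child a different mix of sections.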

The random selection process is different for every child of a set of parents. That is why siblings are genetically different from one another.

There are also, very occasionally, small random changes to the genes a person gives to their child. These are called point mutations. For the purposes of this discussion, we consider these to be of no significance.

Technology

Modern technology offered by many testing companies for quite a low price can characterize a person's genome - the DNA content of all 46 individual chromosomes - by the process of checking for the existence or absence of SNPs. SNP stands for "single nucleotide polymorphism". It's a variation in a single place in a person's genome.

A testing company checks a person's genome for hundreds of thousands of individual SNPs, and from what matched (or what didn't match), they construct a list of SNPs spread out over all 46 individual chromosomes. Each SNP identified comes from a specific chromosome and a specific base position.

The next step that a testing company does is try to align the individual SNPs between the genomes of the tested individual and all the other individuals in their client list. Long sequences of identical SNPs imply a section of DNA that is considered a "match" between these individuals. Such a match for a pair of individuals may appear on any chromosome pair, and will have a beginning and an ending base position.
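As a rough illustration of what "long sequences of identical SNPs" means, the sketch below scans two people's results for one chromosome and reports runs of identical calls. It is a simplification under stated assumptions, not any testing company's actual algorithm: the function name, the one-call-per-position data layout, and the `min_snps` threshold are hypothetical, and real matching works with genotype pairs and genetic-distance thresholds.

```python
def find_matching_segments(snps_a, snps_b, min_snps=3):
    """Return (start_position, end_position, snp_count) for each run of identical calls.

    snps_a, snps_b: dict mapping base position -> SNP call, for ONE chromosome.
    min_snps: shortest run (in SNPs) that is reported as a match (assumed value).
    """
    # Only positions tested in both people can be compared.
    shared_positions = sorted(set(snps_a) & set(snps_b))
    segments = []
    run = []
    for pos in shared_positions:
        if snps_a[pos] == snps_b[pos]:
            run.append(pos)
        else:
            # A mismatch ends the current run; keep it only if it is long enough.
            if len(run) >= min_snps:
                segments.append((run[0], run[-1], len(run)))
            run = []
    if len(run) >= min_snps:
        segments.append((run[0], run[-1], len(run)))
    return segments

person_a = {100: "A", 200: "G", 300: "C", 400: "T", 500: "A", 600: "G"}
person_b = {100: "A", 200: "G", 300: "C", 400: "A", 500: "A", 600: "G"}
print(find_matching_segments(person_a, person_b))
```

Each reported tuple corresponds to the beginning and ending base position of one match on that chromosome.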

Since there are two chromosomes in each chromosome pair, it is possible for one person to match one chromosome of the pair, and a different person to match the other chromosome of the pair, both within the same range of base positions. The testing company does not know whether these constitute father's side matches or mother's side matches unless the father's or mother's DNA is also available to them. The only way to know which side is which is to compare a DNA match's genome against the DNA of relatives whose relationship to the person is already known.

Standards of evidence

Throughout this discussion, we attempt to distinguish between conclusions which are likely correct to a very high probability (beyond a reasonable doubt), those which are likely correct (preponderance of the evidence), and those which are in the realm of hypothesis (possible but further evidence needed). There is no such thing in any science as proof of something at the 100% level. Scientific standards, even for published papers in all fields, consider something "proved" if it meets a specific statistical error measurement. Even for evidence that meets such a standard, it does arise from time to time that a hypothesis was concluded to be true - incorrectly - and that leads to a retraction of the conclusion and may indeed upend the science.

It is beyond the scope of this white paper to attempt to numerically quantify the exact chances a conclusion is the correct one. The science of genealogy never relies 100% on any one aspect of evidence, anyway, but instead looks at all evidence to construct a bigger picture. Autosomal DNA evidence is thus only one tool in a genealogist's toolkit. Understanding the tool, its strengths, and its limitations, is increasingly necessary. It is not acceptable to ignore genetic evidence that has been properly examined, since that evidence does represent critical information, and would prevent the discovery of the truth. However, it is possible to challenge any one interpretation of that evidence, and this paper will attempt to describe how that may be done effectively.

It has come to the author's attention that certain societies and professional organizations do not at this time recognize autosomal DNA analysis as valid in any sense. This is not, in the author's opinion, based on the science, but rather on the professional organization's lack of understanding of the science, and perhaps on the oversimplification testing companies have done to make their results readily accessible to most people. I have no solution to this problem other than to provide a more-or-less standard analytical approach that can be defended by those who would use the conclusions as evidence.

Technique

Process

The process laid out here has been designed to leverage as much information as possible from a DNA overlap. It is, I believe, similar to the process Ancestry.com uses with its "ThruLines" feature, although ThruLines is deliberately limited to avoid some of the complexities involved with deeper matches (discussed below).

The basic idea is to recognize that any DNA overlap by definition must come from one particular set of common ancestral parents. The SNP signature of that overlap is unique; no two sets of parents in the history of humankind on earth have ever had an identical combination of parental chromosome SNPs (except for couples of matching identical twins). Any overlap, therefore, is describing a unique pair of individuals that both suppliers of DNA share. There is no realistic possibility of error here at all.

The difficulty, then, is determining who the ancestral couple is.

What can be determined is, frankly, quite limited if you are given just two individuals and the parameters of a specific chromosome overlap. Even if you painstakingly built out a genealogical tree for both individuals, the best you could hope for is finding the common ancestral couple by visual inspection. Even then you would be subject to various errors, such as failing to distinguish between more than one set of common ancestors.

However, if you have more DNA matches that share the same overlap for which you can build trees, you not only cut back severely on the sources of error, such as common ancestral ambiguity, but also you amass significant additional evidence that helps make the case. For example, if one of the DNA matches is a fourth-plus-generation Irishperson, then a common ancestor in Ireland gains additional support.

But you do need to do one thing to confirm that the new match you are considering is from the couple you are thinking it comes from. You need to make sure that all of the matches come from the same side of your genome - mother or father.

This is most easily done if you have access to GEDMatch for all of the DNA you are working with. The right way to do it is to do an autosomal A-to-B comparison between your two matches. If they match each other, and the region of overlap for them includes the chromosome and base position ranges for your overlap at least partially with both of them, you've proven that both matches come from the same side of your genome.
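The same-side check just described comes down to comparing three segments: your overlap with match A, your overlap with match B, and the A-to-B overlap. The sketch below applies the rule literally. The `Segment` type, the function names, and the example numbers are hypothetical; the segment values themselves would be read off a chromosome browser or comparison report by hand.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Segment:
    chromosome: int
    start: int   # start base position
    end: int     # end base position

def overlap(x: Segment, y: Segment) -> Optional[Segment]:
    """Return the region two segments have in common, or None if they share none."""
    if x.chromosome != y.chromosome:
        return None
    start, end = max(x.start, y.start), min(x.end, y.end)
    return Segment(x.chromosome, start, end) if start <= end else None

def same_side(me_vs_a: Segment, me_vs_b: Segment, a_vs_b: Segment) -> bool:
    """True if the A-to-B match overlaps, at least partially, both of my overlaps."""
    return (overlap(a_vs_b, me_vs_a) is not None
            and overlap(a_vs_b, me_vs_b) is not None)

me_vs_a = Segment(3, 1000, 5000)
me_vs_b = Segment(3, 3000, 8000)
a_vs_b = Segment(3, 2500, 6000)
print(same_side(me_vs_a, me_vs_b, a_vs_b))
```

A stricter variant would require the A-to-B segment to overlap the region that both of your overlaps have in common; which rule to use is a judgment call this sketch does not settle.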

[Note: 23andMe also directly allows A-B comparisons, but FamilyTreeDNA has a much more primitive feature, called "Matrix", which just checks whether or not your matches are themselves matches. Because it's not specific to any region of the DNA, and because there seem to be severe length limits involved, the "Matrix" feature is error-prone and is only mildly helpful. Ancestry does not give access to a chromosome browser at all.]

A properly executed analysis of a particular DNA match therefore involves the following steps:

  1. Building the tree of the match to the extent possible
  2. Finding other DNA matches that overlap the overlap
  3. Validating that the DNA matches are indeed related to each other in the same region as the original overlap
  4. Building trees for the additional matches that can be validated
  5. Searching these trees for commonalities shared by all
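The last step, searching the trees for commonalities, can be pictured as intersecting the ancestor sets of everyone involved. The sketch below assumes the trees have already been merged into a single child-to-parents mapping; that mapping, the function names, and the profile labels are hypothetical placeholders, not a Geni data format.

```python
def ancestors(parents_of, person):
    """Collect every known ancestor of `person` from a child -> parents mapping."""
    found = set()
    stack = [person]
    while stack:
        current = stack.pop()
        for parent in parents_of.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def shared_ancestors(parents_of, people):
    """Return the ancestors that appear in the tree of every listed person."""
    ancestor_sets = [ancestors(parents_of, p) for p in people]
    return set.intersection(*ancestor_sets) if ancestor_sets else set()

# Hypothetical merged tree: each key maps to the known parents of that person.
parents_of = {
    "me": ("dad", "mom"),
    "dad": ("grandfather", "grandmother"),
    "match_a": ("x", "y"),
    "x": ("grandfather", "grandmother"),
    "match_b": ("z",),
    "z": ("grandfather", "grandmother"),
}
print(shared_ancestors(parents_of, ["me", "match_a", "match_b"]))
```

If the result contains more than one couple, the ambiguity has to be resolved by other evidence, as discussed later under multiple potential paths.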

In practice, because tree building represents a significant amount of effort, not all qualifying DNA matches will be explored. Unless at least two matches are explored, however, and the commonalities found, the conclusions will be too error-prone to constitute anything close to "beyond a reasonable doubt", and may not even reach the "preponderance of the evidence" level. On the other hand, if two matches concur, and you are able to add a third concurring match, or more, the level easily reaches "beyond a reasonable doubt", EXCEPT when:

  • The DNA matches all come from the same family with common heritage
  • The common ancestor is so far back that the trees of the DNA matches mostly don't reach that point

It's also worth pointing out that close family members who have DNA samples available for analysis can help buttress your conclusion, if they too overlap the DNA area under analysis, provided they are distant enough to eliminate parts of your tree that you don't know enough about.

Utility

The technique above is helpful in the following situations:

  1. Validating a tentative relationship that doesn't have sufficient genealogical support to be certain of, e.g. a tentative maiden name;
  2. Determining the immediate general ancestry of a person who is in the right place and right time;
  3. Establishing the family of origin for immigrants from overseas, or to other lands.

The technique above does not help for the following situations:

  1. Trying to prove a general existence of a relationship when the relationship is beyond fifth cousin. Because of the random process by which DNA is chosen, there is a non-negligible chance that any person will not inherit any significant DNA from some of their more distant ancestors, at all.
  2. Building trees that don't exist for any of the DNA matches analyzed. Autosomal DNA can prove a relationship exists, and even where it must be in multiple trees, but definitely does not give details. Those must come from other sources.
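The first limitation above can be put in rough numbers. On average a child receives half of each parent's autosomal DNA, so the expected share from a single ancestor halves with every generation. The sketch below computes only that average; it does not model the random variation that lets the actual share fall to zero. The function names are hypothetical.

```python
from fractions import Fraction

def generations_to_common_ancestor(cousin_degree):
    """First cousins share grandparents (2 generations back), and so on."""
    return cousin_degree + 1

def expected_share(generations_back):
    """Average fraction of autosomal DNA inherited from ONE ancestor that far back."""
    return Fraction(1, 2 ** generations_back)

for degree in (1, 3, 5, 7):
    generations = generations_to_common_ancestor(degree)
    share = expected_share(generations)
    print(f"cousin degree {degree}: {generations} generations back, "
          f"average share per ancestor {share} ({float(share):.3%})")
```

For fifth cousins the common ancestors sit six generations back, so each contributes about 1/64 of the genome on average. Because inheritance comes in a limited number of sections per chromosome rather than evenly, the actual contribution from any one such ancestor can be larger, smaller, or nothing at all.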

Sources of error, and error characterization

Problems resulting from bad trees

As should be clear, the technique as stated is only as reliable as the quality of the trees present for the individual matches being analyzed. On one extreme, a match for whom a tree cannot be built contributes nothing to the analysis of a DNA overlap they are part of, even if all other DNA-related criteria are met. On the other extreme, a tree that is not based on rigorous standards (including DNA) will most likely fail to help reach the truth. The other more worrisome problem, misdirection, happens far less frequently in my experience, because mistakes only very rarely lead to erroneous conclusions. But erroneous conclusions become more possible the deeper the match is, and therefore extra care must be taken for matches that are seven generations or more deep.

In the real world, trees are what they are, and we have another workaround available to address uncertainties that come from incomplete or incorrect trees. When critical parts of a tree are in doubt, the option exists to build out the trees of other overlapping and qualifying matches. This reduces the chances that any one error will lead to an erroneous conclusion. It is also very helpful when dealing with the related concern of multiple lines of possible inheritance. Unless all the overlapping DNA matches share pretty much the same tree, adding even one more match tree will often make it quite clear either that the hypothesis is correct or where a tree has gone wrong.

Multiple potential paths to the ancestral couple

Often the situation arises where a family affiliation is proven to within acceptable parameters, but it is not clear which of many possible lines of descent connects the DNA match to that family. Incomplete trees are very typical when working with DNA matches at the seventh and eighth cousin level, and it is often not easy to figure out which of several lines is the correct one to pursue. It may even be the case, where people married cousins, that there is not just one path to the ancestral family, but several.

In the case of cousin marriages of this kind, it may not actually be a significant problem, as long as what's being proven isn't the existence of a cousin marriage. It may not be necessary to know which path the DNA took through a cousin marriage, if the common ancestors are known and proven. The analysis remains the same if the cousins share the same ancestor pair.

On the other hand, if the problem is trying to prove which ancestor has the relationship you know exists but isn't covered by the usual genealogical means, several things can be brought to bear to prove which it is. Adding analyses and trees for more overlapping matches may narrow down the possibilities, if those matches share one known common ancestor but not others that were initially under consideration. The location affinity of the matches may help select which relationship to pursue, even if they do not actually include the common ancestor pair in their tree, for example. Directed research, and knowledge of naming conventions, can sometimes fill in short missing stretches of a tree where there is otherwise no genealogical data. This, of course, doesn't reach a standard of proof even of the level of "preponderance of the evidence". The relationship may be known with near certainty, but the details are not. But sometimes that is enough to help locate records that are helpful, and DNA proof of a relationship means that the rest of the evidence may not have to be as robust as it would need to be without it.

Other factors to consider in evaluation

A curator already considers many things when evaluating a claim of ancestry. DNA relationships are just one tool among many. Written records, such as marriage records, shipboard manifests, land records, wills, and oral histories, all work in conjunction with DNA in proving families. Figuring out what to believe and what to discount is part of this process, and it's obviously a highly intuitive one. In order to have confidence in DNA evidence, therefore, there must be a checklist to determine if the proper process has been followed. This can inform a curator as to its worth.

The checklist must include these points:

  1. A GENI profile for the user making the DNA claim should be provided, with a complete tree.
  2. Each DNA match provided in support of the claim should also have a GENI tree that can be checked.
  3. Each DNA match provided should include the overlap with the user making the claim: chromosome number, start base position, and end base position.
  4. There should be at least three analyzed DNA matches supporting the claim. The quality of the trees of those matches should be high.
  5. The DNA matches should be shown, as best as possible, to be related to each other on the same section of DNA as the match.
  6. The common ancestral family hypothesis should be identified, also with GENI profiles.
  7. Any special considerations should be identified - e.g. naming conventions or other information that would support the claim.
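The mechanical parts of the checklist above could be recorded and checked along the following lines. This is only a sketch: the class names, field names, and profile labels are hypothetical, and the judgment calls in the checklist (whether a tree is complete, whether its quality is high) cannot be automated and remain with the curator.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MatchEvidence:
    tree_profile: str        # point 2: profile/tree of the match
    chromosome: int          # point 3: chromosome number (1-22)
    start: int               # point 3: start base position
    end: int                 # point 3: end base position
    shares_region_with_other_matches: bool  # point 5

@dataclass
class DnaClaim:
    claimant_profile: str                                        # point 1
    matches: List[MatchEvidence] = field(default_factory=list)   # points 2-5
    ancestor_profiles: List[str] = field(default_factory=list)   # point 6
    special_considerations: str = ""                             # point 7 (optional)

def checklist_problems(claim, min_matches=3):
    """Return a list of checklist points the claim fails; an empty list means none."""
    problems = []
    if not claim.claimant_profile:
        problems.append("point 1: claimant profile missing")
    for number, match in enumerate(claim.matches, start=1):
        if not match.tree_profile:
            problems.append(f"point 2: match {number} has no tree")
        if not (1 <= match.chromosome <= 22 and 0 < match.start < match.end):
            problems.append(f"point 3: match {number} has an invalid overlap")
        if not match.shares_region_with_other_matches:
            problems.append(f"point 5: match {number} not shown to share the region")
    if len(claim.matches) < min_matches:
        problems.append(f"point 4: only {len(claim.matches)} matches, need {min_matches}")
    if not claim.ancestor_profiles:
        problems.append("point 6: no common ancestral family identified")
    return problems

claim = DnaClaim(
    claimant_profile="profile-me",
    matches=[MatchEvidence("profile-a", 3, 1000, 5000, True),
             MatchEvidence("profile-b", 3, 2000, 6000, True)],
    ancestor_profiles=["profile-ancestor-1", "profile-ancestor-2"],
)
print(checklist_problems(claim))
```

An empty result only means the paperwork is in order; it says nothing about whether the underlying trees are correct.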