The mtDNA Population Database: An Integrated Software and Database Resource for Forensic Comparison by Monson, Miller, Wilson, DiZinno, and Budowle (Forensic Science Communications, April 2002)
April 2002 - Volume 4 - Number 2
Research and Technology
The mtDNA Population Database:
An Integrated Software and Database Resource
for Forensic Comparison
Keith L. Monson
Forensic Science Research Unit
Federal Bureau of Investigation
Kevin W. P. Miller
DNA Analysis Unit 2
Mark R. Wilson
Supervisory Special Agent
DNA Analysis Unit 2
Joseph A. DiZinno
Scientific Analysis Section
Bruce Budowle
Senior Biological Sciences Program Advisor
Forensic Analysis Branch
Federal Bureau of Investigation
Nucleotide sequencing of the human mitochondrial DNA (mtDNA) control region has been validated for the genetic characterization of forensic specimens (for references, see Budowle et al. 1999). Mitochondrial DNA analysis is especially useful for the analysis of teeth, bones, and hair, as well as highly degraded tissues that do not lend themselves to successful nuclear DNA analysis.
In contrast to nuclear DNA, mtDNA follows maternal clonal inheritance patterns without recombination. Therefore, with few exceptions (i.e., heteroplasmy), mtDNA types are faithfully inherited from one generation to the next through the maternal line. These characteristics facilitate collection of reference material for forensic comparison, even in cases where generations are skipped. For forensic purposes, the weight of a mtDNA match between two evidentiary items is determined by counting the number of times the profile occurs in one or more datasets of unrelated individuals. Given the level of diversity that has been observed in mtDNA, the estimate of rarity by counting mtDNA types is highly dependent on the size of the reference database and will often be overestimated (National Research Council 1996; Budowle et al. 1999).
This article describes a database of mtDNA control region nucleotide sequences and reports on software for searching the profiles. These data are made available to the forensic and research communities free of charge. The mtDNA Population Database program has two components: population data that are stored as relational tables in Microsoft® Access 2000® format, and specialized software designed to search these data. The mtDNA Population Database program (data, searching software, and user’s manual) is published in this issue of Forensic Science Communications. Technical support is limited to that supplied in the user’s manual. Updates to the population data will be published periodically. Searching functionality similar to that in the mtDNA Population Database program provided here is available to law enforcement laboratories through the CODISmt program (COmbined DNA Index System-mitochondrial DNA). CODISmt offers the additional possibilities of searching combined mitochondrial and nuclear DNA profiles and of searching open case files, particularly missing persons cases, across the CODIS network.
Nucleotide sequence data are divided into two components, forensic and public (Miller and Budowle 2001). In each category, profiles are designated as differences from the Cambridge Reference Sequence (CRS) (Anderson et al. 1981). The forensic component is used to assess the weight of mtDNA associations developed in forensic casework. It consists of anonymous population profiles contributed by collaborating laboratories. In addition to its own quality assurance measures, and as a minimum quality assurance assessment on prospective submissions, each participating laboratory must correctly type a set of control samples before any of its results are accepted for inclusion in the population database. This exercise facilitates compatibility of typing methods and of profile nomenclature. All forensic profiles include, at a minimum, a sequence region in hypervariable region I (HVI), defined by nucleotide positions 16024-16365, and a sequence region in hypervariable region II (HVII), defined by nucleotide positions 73-340. Additional contributions of population data are welcomed.
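The minimum range requirement for forensic profiles can be illustrated with a short sketch. The positions are those given above; the function names and the representation of sequenced ranges as (start, end) pairs are assumptions made for illustration and do not describe the database software itself. The sketch also assumes that a required region must lie entirely within one contiguous sequenced range.

```python
# Minimum regions required of every forensic profile, as stated in the text.
HVI = (16024, 16365)
HVII = (73, 340)


def covers(sequenced_ranges, required):
    """Return True if a single sequenced range fully contains `required`.

    Assumption: ranges are inclusive (start, end) pairs of nucleotide
    positions, and adjacent ranges are not merged before checking.
    """
    lo, hi = required
    return any(start <= lo and hi <= end for start, end in sequenced_ranges)


def meets_forensic_minimum(sequenced_ranges):
    """Check that the sequenced ranges include both HVI and HVII."""
    return covers(sequenced_ranges, HVI) and covers(sequenced_ranges, HVII)
```

A profile sequenced over 16024-16365 and 73-340 passes this check; a profile lacking HVII, or one whose HVI range stops short of position 16365, does not.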
Public data consist of mtDNA sequence data from the scientific literature and the GenBank and European Molecular Biology Laboratory (EMBL) genetic databases. These sequence data were collected, cataloged, annotated, formatted, and organized. The public data include and replace the mtDNA data from the mtDNA concordance study of Miller et al. (1996). In general, the quality assurance methods described by Miller were applied to the public data. The data were checked for uniformity of nomenclature and cross-checked with other publications by the same authors and the GenBank/EMBL genetic databases to minimize the possibility of error and duplication. Although the public data have not been subjected to the same quality standards as the forensic data, these data provide useful information on worldwide population groups not contained within the forensic dataset and can be used for investigative purposes.
The profiles in both forensic and public datasets are uniquely identified in the database by a systematic naming scheme. Each profile is denoted by a unique identifier and the respective literature citation. Where possible, each profile is indexed by the population group assigned by the contributor, as well as the continent, country, or region of specimen origin. A standard 14-character nucleotide sequence identifier is assigned to each profile, using the structure XXX.YYY.ZZZZZZ, as described by Miller and Budowle (2001). The first three characters (XXX) reflect the country of origin, using codes defined by the United Nations (1997). The second three characters (YYY) describe the group or ethnic affiliation to which a particular profile belongs. The final six characters (ZZZZZZ) are sequential acquisition numbers. For example, profile JPN.ASN.000105 designates the 105th nucleotide sequence from an individual of Asian origin from Japan. An Asian American individual would carry the same code for ethnicity, but a different code for country of origin (e.g., USA.ASN.000105). The population/ethnicity codes for indigenous peoples are numeric and arbitrarily assigned. For example, USA.008.000105 refers to an individual from the Apache tribe sampled from the United States.
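The identifier structure described above lends itself to a simple parser. The following is a minimal sketch: the names `ProfileId` and `parse_profile_id` are hypothetical, and the character classes (uppercase letters for the country code, letters or digits for the group code) are assumptions inferred from the examples in the text rather than a published specification.

```python
import re
from typing import NamedTuple


class ProfileId(NamedTuple):
    country: str      # three-character country code (XXX)
    group: str        # three-character group or ethnic code (YYY)
    acquisition: int  # sequential acquisition number (ZZZZZZ)


# Assumed pattern: 3 letters, 3 letters or digits, 6 digits, joined by periods.
_ID_PATTERN = re.compile(r"([A-Z]{3})\.([A-Z0-9]{3})\.(\d{6})")


def parse_profile_id(identifier: str) -> ProfileId:
    """Split a 14-character identifier into its three components."""
    match = _ID_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a valid profile identifier: {identifier!r}")
    country, group, number = match.groups()
    return ProfileId(country, group, int(number))
```

Under these assumptions, `parse_profile_id("JPN.ASN.000105")` yields country `JPN`, group `ASN`, and acquisition number 105, and the numeric group code in `USA.008.000105` is kept as the string `008`.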
The central function of the software is to facilitate searching of a mtDNA nucleotide sequence developed from an evidentiary sample against one or more sequence datasets. The software offers various search parameters, provides options for report details, and provides tools for exploration of the datasets themselves. Two types of searching are supported:
- A comparison of a single profile against the dataset
- A pairwise search in which every profile is compared to every other profile in the selected dataset(s)
Details of program operation are described in a user’s manual that accompanies the download of data and searching software from the Internet.
Figure 1. The Search screen is used to enter a target profile and to select and search dataset(s) for sequences matching that profile.
The Search screen (Figure 1) is invoked as the default upon starting the program. It can also be accessed by selecting Mode and then Search on the menu bar. The user specifies the Sequenced Regions (which contain, but are not limited to, HVI and HVII), the profile in terms of Differences from Anderson (i.e., the Cambridge Reference Sequence [CRS]), and which data (forensic or public) are to be searched. To broaden or limit the search within the dataset chosen, individual groups (e.g., African-American, Caucasian, Thai) may be selected. The user can also specify a number of search parameters. These include the following:
- Whether partial overlaps (where some, but not all, of the ranges sequenced in the search and database profiles are in common) are to be searched,
- Whether insertions are to be considered, and
- The degree of output detail.
Upon completion of the search, Search output lists all search parameters and tabulates counts and frequency of matches (i.e., count/number of sequences searched) in various groupings of the chosen database (total combined, by major group and by individual groups). If the option is specified, the sequence of every matching profile and the number of sites differing from the target profile are also listed. Figures 2 and 3 illustrate representative portions of the search output. For example, the number of profiles that are identical to the search profile can be indicated, as well as the frequency of the profile relative to the number of profiles in the dataset used in the search (Figure 2). In addition, individual profiles, with their respective identifiers, that are obtained as search results can be displayed (Figure 3).
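The counting and frequency tabulation described above can be sketched in a few lines. This is an illustration only, not the algorithm used by the program: the function name, the record layout, and the toy records are invented, a profile is represented as a set of differences from the CRS, and a match is taken to mean an identical set of differences. Partial overlaps and the handling of insertions are ignored.

```python
from collections import Counter


def search(target, database):
    """Count exact matches to `target`, in total and per group.

    `database` is an iterable of (identifier, group, differences) tuples.
    Returns (total, by_group); each entry is (matches, searched, frequency),
    where frequency = matches / number of sequences searched.
    """
    target = frozenset(target)
    searched, matched = Counter(), Counter()
    for _identifier, group, differences in database:
        searched[group] += 1
        if frozenset(differences) == target:
            matched[group] += 1
    by_group = {g: (matched[g], n, matched[g] / n) for g, n in searched.items()}
    n_total = sum(searched.values())
    m_total = sum(matched.values())
    total = (m_total, n_total, m_total / n_total if n_total else 0.0)
    return total, by_group


# Invented toy records for illustration; not taken from the real database.
toy_database = [
    ("AAA.GR1.000001", "Group 1", {"16129A", "263G"}),
    ("AAA.GR1.000002", "Group 1", {"263G"}),
    ("BBB.GR2.000001", "Group 2", {"16129A", "263G"}),
    ("BBB.GR2.000002", "Group 2", {"16311C", "263G"}),
    ("BBB.GR2.000003", "Group 2", {"73G", "263G"}),
]
total, by_group = search({"16129A", "263G"}, toy_database)
```

With these toy records the target matches two of five profiles overall, one of two in the first group and one of three in the second.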
A pairwise search may be helpful in determining the general relationships between datasets. Pairwise comparisons are performed using the same algorithms as used for a single profile search but are invoked by selecting Mode, then Pairwise from the menu bar (Figure 4). In addition, the user can specify either the specific sequence ranges to be searched or limit the comparison to the regions common to every profile in the selected dataset(s). Note that two regions may be defined (HVI and HVII) but that no gaps in either are permitted in the search sequence. Figure 5 displays excerpted results of several intragroup and intergroup comparisons. It includes the number of matches and number of comparisons performed within the dataset, the quotient of the two, and the mean number of nucleotide differences per comparison. The mean number of differences tends to be similar for different populations within a major group. Also provided (but not illustrated) are counts of mtDNA types observed within each group and estimates of the random match probability (Stoneking et al. 1991) and genetic diversity (Tajima 1989).
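The intragroup quantities reported by a pairwise search can be sketched as follows. The formulas are assumptions: the random match probability is taken as the sum of squared type frequencies and the genetic diversity as n(1 - sum of squared frequencies)/(n - 1), forms commonly used with the cited references, and the program's own implementation is not documented here. Profiles are represented, hypothetically, as dictionaries mapping a position to the observed base, restricted to regions common to all profiles.

```python
from collections import Counter
from itertools import combinations


def n_differences(a, b):
    """Number of sites at which two profiles differ."""
    return sum(1 for site in a.keys() | b.keys() if a.get(site) != b.get(site))


def pairwise_summary(profiles):
    """Compare every profile with every other profile in one group."""
    n = len(profiles)
    if n < 2:
        raise ValueError("at least two profiles are required")
    diffs = [n_differences(a, b) for a, b in combinations(profiles, 2)]
    matches = sum(1 for d in diffs if d == 0)
    # Group identical profiles into mtDNA types and count each type.
    type_counts = Counter(frozenset(p.items()) for p in profiles)
    sum_sq = sum((count / n) ** 2 for count in type_counts.values())
    return {
        "comparisons": len(diffs),
        "matches": matches,
        "match_fraction": matches / len(diffs),
        "mean_differences": sum(diffs) / len(diffs),
        "types": len(type_counts),
        "random_match_probability": sum_sq,           # assumed form
        "genetic_diversity": n / (n - 1) * (1 - sum_sq),  # assumed form
    }


# Invented toy profiles for illustration only.
toy_group = [{"16129": "A"}, {"16129": "A"}, {"263": "G"}, {}]
summary = pairwise_summary(toy_group)
```

For the four toy profiles there are six comparisons, one of which is a match, and three distinct types. Under the assumed formulas the fraction of matching pairs equals one minus the genetic diversity.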
Federal Bureau of Investigation
DNA Analysis Unit 2
2501 Investigation Parkway
Quantico, Virginia 22135
Access the Release Notes, published in this issue of Forensic Science Communications.
The database is available for installation on your hard drive. It will download only if you are using Internet Explorer. If you experience difficulties with the database, please call 202-324-4354.
This version of the mtDNA Population Database was revised as of June 2004. Changes made to the database can be found in the most recent copy of the Release Notes.
Mitochondrial DNA (mtDNA) is a small, circular piece of DNA. It is found outside the nucleus in most cells and is generally involved with the production of energy for the body. Like other types of DNA, a portion of mtDNA does not encode proteins and has no known function. This region, called the control region, is used in forensic DNA analysis because it is highly variable from person to person. The mtDNA population database is a compilation of differences in mtDNA control regions from a random collection of unrelated individuals of various ethnic backgrounds. A variety of forensic, law enforcement, and academic institutions have contributed to the mtDNA population database. The database is intended for use by the scientific community to establish the relative level of occurrence of a particular genetic type in a particular group of individuals. Identifying information specific to any particular person, such as age, sex, or disease status, is not available.
The following institutions contributed nucleotide sequence data to the mtDNA population database:
Armed Forces DNA Identification Laboratory, Rockville, Maryland
Illinois State Police, Springfield, Illinois
Institute of Legal Medicine, University of Innsbruck, Innsbruck, Austria
University of California at Berkeley, Berkeley, California
This work was supported in part by FBI contract #J-FBI-98-090, with contributions from the National Institute of Justice.
Anderson, S., Bankier, A. T., Barrell, B. G., de Bruijn, M. H. L., Coulson, A. R., Drouin, J., Eperon, I. C., Nierlich, D. P., Roe, B. A., Sanger, F., Schreier, P. H., Smith, A. J. H., Staden, R., and Young, I. G. Sequence and organization of the human mitochondrial genome, Nature (1981) 290:457-465.
Budowle, B., Wilson, M. R., DiZinno, J. A., Stauffer, C., Fasano, M. A., Holland, M. M., and Monson, K. L. Mitochondrial DNA regions HVI and HVII population data, Forensic Science International (1999) 103:23-35.
Miller, K. W. P. and Budowle, B. A compendium of human mitochondrial DNA control region: Development of an international standard forensic database, Croatian Medical Journal (2001) 42(3):315-327.
Miller, K. W. P., Dawson, J. L., and Hagelberg, E. A concordance of nucleotide substitutions in the first and second hypervariable segments of the human mtDNA control region, International Journal of Legal Medicine (1996) 109:107-113.
National Research Council. NRC Report II: The Evaluation of Forensic DNA Evidence. National Academy Press, Washington, DC, 1996, p.159.
Stoneking, M., Hedgecock, D., Higuchi, R. G., Vigilant, L., Erlich, H. A., Arnheim, N., and Wilson, L. A. Population variation of human mtDNA control region sequences detected by enzymatic amplification and sequence-specific oligonucleotide probes, American Journal of Human Genetics (1991) 48:370-382.
Tajima, F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism, Genetics (1989) 123:585-595.
United Nations. Terminology Bulletin No.347/Rev.1: Country Names. United Nations Office of Conference and Support Services, New York, 1997, pp.1-50.