Data


General


FamiLinx is a scientific resource of curated genealogical, demographic, and basic phenotypic data from tens of millions of people, mostly from the last 500 years. Unlike traditional studies, this resource is the product of an ultra crowd-sourcing approach and is based on the collaborative work of genealogy enthusiasts around the world who documented and shared their family stories.

The starting point of FamiLinx was the public information on Geni.com, a genealogy-driven social network that is operated by MyHeritage. Geni.com allows genealogists to enter their family trees into the website and to create profiles of family members with basic demographic information such as sex, birth date, marital status, and location. The genealogists decide whether they want the profiles in their trees to be public or private. New or modified family tree profiles are constantly compared to all existing profiles, and if there is high similarity to existing ones, the website offers the users the option to merge the profiles and connect the trees.

With permission from MyHeritage, we downloaded only the public profiles of individuals from Geni.com for future scientific studies. We used graph algorithms to clean the data and organize the pedigrees into formats that can be accessed quickly. We also employed natural language processing to tokenize the birth, residence, death, and burial locations of individuals and converted this information into longitude and latitude coordinates. The FamiLinx data is distributed as an SQL database, and users can create their own local copy from the download package. We also provide a Python API for querying the database, with advanced functions for pedigree analysis.

For privacy purposes, the resource does not contain any names and any attempt to re-identify the users is strictly prohibited.

The main advantage of FamiLinx is its ultra-large pedigrees. The largest pedigree has 13 million individuals. To the best of our knowledge, this is the largest pedigree compiled for scientific studies.


Examples


An example of a (small) FamiLinx pedigree of 6,000 people that spans 7 generations:

Green nodes denote individuals and red nodes denote marriages.



Migration patterns detected in our data:



The Database


The database has 43,589,566 individuals whose records are organized in the 6 tables described below:
  • Relationship: This table describes the child-parent relationships in the database and serves as the interface to the pedigree structure. To extract a pedigree for an individual, start with the individual's Id and recursively retrieve parents or children from this table.
  • Gender: Self-reported gender of individuals.
  • Founders: This table describes the founders and their statistical properties. A founder is an individual who has no parents in the database. For each founder, we report the number of leaves (descendants with no further children), and the maximum, minimum and median number of generations in the family subtree induced by each founder.
  • Location: This table reports the longitude and latitude of the birth, residence, death, and burial locations of individuals. We used Yahoo! Geoparser to convert the free text into geographic coordinates. We also provide the country (based on current political borders) and the continent. Data is available for 5,404,864 individuals.
  • Years: This table contains birth and death years for 4,351,044 individuals. To enhance the quality of the data, we only included individuals whose original record in Geni had an exact birth date or death date at single-day resolution. We do not report the exact birth date or death date for privacy reasons.
  • Age: This table contains the age at death for 1,039,321 individuals. The information was calculated by subtracting the year of birth from the year of death. We only derived this information for individuals whose exact (i.e. single-day resolution) date of birth and date of death were both known.
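
The pedigree extraction described for the Relationship table can be sketched as follows. This is a minimal illustration, not the FamiLinx Python API: the table name relationship, the column names child_id and parent_id, and the toy rows are all assumptions made for the example, and an in-memory SQLite database stands in for the real one. The traversal uses an explicit worklist rather than recursive function calls, which gives the same result while avoiding recursion limits on deep pedigrees.

```python
import sqlite3


def build_demo_db():
    """Create a toy in-memory database (hypothetical schema, not the real one)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE relationship (child_id INTEGER, parent_id INTEGER)")
    # Toy pedigree: 1 and 2 are the parents of 3 and 4; 3 and 5 are the
    # parents of 6; 6 is a parent of 7.
    conn.executemany(
        "INSERT INTO relationship VALUES (?, ?)",
        [(3, 1), (3, 2), (4, 1), (4, 2), (6, 3), (6, 5), (7, 6)],
    )
    return conn


def _walk(conn, start_id, query):
    """Follow child-parent links repeatedly until no new individuals appear."""
    found = set()
    frontier = [start_id]
    while frontier:
        current = frontier.pop()
        for (relative_id,) in conn.execute(query, (current,)):
            if relative_id not in found:
                found.add(relative_id)
                frontier.append(relative_id)
    return found


def ancestors(conn, person_id):
    """Return the ids of all ancestors of person_id."""
    return _walk(
        conn, person_id,
        "SELECT parent_id FROM relationship WHERE child_id = ?",
    )


def descendants(conn, person_id):
    """Return the ids of all descendants of person_id."""
    return _walk(
        conn, person_id,
        "SELECT child_id FROM relationship WHERE parent_id = ?",
    )


if __name__ == "__main__":
    conn = build_demo_db()
    print(sorted(ancestors(conn, 7)))
    print(sorted(descendants(conn, 1)))
```

In this toy data, an individual with an empty ancestor set corresponds to a founder as defined for the Founders table.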

Scripts


We also provide a Python interface to access the database and calculate identity-by-descent (IBD) and Jacquard's nine condensed coefficients of identity between pairs of individuals. The interface integrates the database SQL queries with Mark Abney's IdCoeff program (Abney, Bioinformatics, 2009). It is included in the download package and can also be accessed on GitHub.
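
To give a feel for pedigree-based relatedness calculations, the sketch below computes the kinship coefficient between two individuals with the standard textbook recursion over parents. It is not IdCoeff and does not compute the nine condensed coefficients; the kinship coefficient is a simpler summary quantity, used here only as an illustration. The function name make_kinship and the parents dictionary layout are hypothetical and are not part of the FamiLinx interface.

```python
from functools import lru_cache


def make_kinship(parents):
    """Build a kinship function for a pedigree.

    parents maps an individual id to a (father_id, mother_id) tuple.
    Founders are simply absent from the mapping; an unknown parent is None.
    """
    depth_cache = {}

    def depth(person):
        # Longest chain of known ancestors; founders have depth 0.
        if person not in depth_cache:
            known = [p for p in parents.get(person, (None, None)) if p is not None]
            depth_cache[person] = 1 + max(map(depth, known)) if known else 0
        return depth_cache[person]

    @lru_cache(maxsize=None)
    def kinship(a, b):
        if a is None or b is None:
            return 0.0  # unknown parents are treated as unrelated
        if a == b:
            father, mother = parents.get(a, (None, None))
            return 0.5 * (1.0 + kinship(father, mother))
        # Expand the deeper individual, who cannot be an ancestor of the other.
        if depth(a) < depth(b):
            a, b = b, a
        father, mother = parents.get(a, (None, None))
        return 0.5 * (kinship(father, b) + kinship(mother, b))

    return kinship


if __name__ == "__main__":
    # Toy pedigree: 3 and 4 are full siblings; 6 is the child of 3 and 5.
    toy_parents = {3: (1, 2), 4: (1, 2), 6: (3, 5)}
    kinship = make_kinship(toy_parents)
    print(kinship(3, 4))  # full siblings
    print(kinship(6, 4))  # niece/nephew and aunt/uncle
```

Because the sketch uses plain recursion, very deep pedigrees may exceed Python's default recursion limit; an iterative formulation or a dedicated tool would be a better fit at that scale.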