Short Tandem Repeats (microsatellites) are a highly polymorphic class of genetic variations with repetitive elements of 2-6 nucleotides (nt). These variations are implicated in the etiology of dozens of rare genetic disorders such as Huntington's Disease and Fragile-X Syndrome. We are interested in uncovering the role of these elements in complex human traits. To that end, we developed an algorithm, called lobSTR, to profile STR variations in high-throughput sequencing data.
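
To make the definition concrete, the following minimal Python sketch scans a DNA string for perfect tandem repeats with a unit of 2-6 nt. It is only an illustration of what an STR looks like in sequence; it is not the lobSTR algorithm, and the function name find_strs and its parameters (min_unit, max_unit, min_copies) are hypothetical names chosen for this example.

```python
import re

def find_strs(seq, min_unit=2, max_unit=6, min_copies=3):
    """Return sorted (start, motif, copies) tuples for perfect tandem repeats.

    Toy illustration only: real STR profilers must handle sequencing
    errors, imperfect repeats, and alignment, none of which is done here.
    """
    hits = []
    for k in range(min_unit, max_unit + 1):
        # A k-mer followed by at least (min_copies - 1) exact copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "ACAC" is really the dinucleotide "AC", "AAA" is a homopolymer).
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)

# A CAG trinucleotide repeated five times, flanked by non-repetitive bases.
example = "TTGA" + "CAG" * 5 + "TT"
print(find_strs(example))
```

The number of copies at a given locus is the quantity that varies between individuals, which is why the sketch reports it alongside the motif and position.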

Public sharing of sequencing datasets without identifiers has become a common practice in genomics to protect the privacy of research participants. We found that surnames can be recovered from personal genomes by profiling short tandem repeats on the Y chromosome (Y-STRs) and querying recreational genetic genealogy databases. We further demonstrated that this information, together with other types of metadata, such as age and state, can be used to triangulate the identity of the target. We are interested in investigating strategies to protect genetic privacy while promoting data sharing.

DNA Sudoku is a sequencing strategy to find rare genetic variations in large cohorts. It is based on pooling the specimens according to combinatorial patterns as a means of multiplexing. This is reminiscent of solving a Sudoku puzzle: every specimen is like a cell in the puzzle, and the sequencing results are like the numbers of the Sudoku. By finding the sequencing results of a single specimen, we can propagate the information to the other specimens in the pool, solve their sequencing results, and repeat the process over and over again, until the entire set is solved. This is specifically suitable for 'needle in a haystack' sequencing scenarios - identifying a small number of specimens that carry rare variations out of a large "wildtype" group.
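
The pooling idea can be sketched in a few lines of Python. The design below is an assumption made for illustration, since the text above does not specify the combinatorial pattern: each specimen is placed in one pool per "window", chosen by its index modulo a set of pairwise coprime window sizes. The decoder is also a simplification, not the iterative propagation described above: it just keeps every specimen whose pools all tested positive. The names pool_ids, run_pools, decode, and the moduli (5, 7, 8) are hypothetical.

```python
def pool_ids(specimen, moduli):
    """Pools that contain this specimen: one (window, residue) pair per modulus."""
    return [(w, specimen % w) for w in moduli]

def run_pools(carriers, moduli):
    """Simulate sequencing: a pool is positive if it contains any carrier."""
    positive = set()
    for s in carriers:
        positive.update(pool_ids(s, moduli))
    return positive

def decode(positive, n, moduli):
    """Candidate carriers: specimens whose every pool tested positive."""
    return [s for s in range(n)
            if all(p in positive for p in pool_ids(s, moduli))]

# 40 specimens are covered by 5 + 7 + 8 = 20 pools instead of 40 separate runs.
moduli = (5, 7, 8)   # assumed pairwise coprime window sizes
n = 40
positive = run_pools({17}, moduli)
print(decode(positive, n, moduli))
```

With a single carrier, pairwise coprime window sizes whose product exceeds the cohort size pin down that carrier exactly. As the number of carriers grows, this simple decoder can return false candidates, which is why such designs suit settings where carriers are rare.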

Understanding the genetic architecture of complex traits is one of the top missions of human genetics. Emerging lines of study have highlighted the entangled etiologies of these traits, including epistasis, parent-of-origin effects, sex and age interactions, and environmental risk factors. To conduct robust genetic epidemiological analysis, statistical models require sampling substantial amounts of data from large families. However, recruiting large cohorts of extended kinships is both logistically challenging and cost-prohibitive. We are working on a strategy to harness existing, free, and massive Web 2.0 social network resources to trace the aggregation of complex traits in millions of people and extremely large families.