Please see my Google Scholar page for more details.

Thesis Research

Detection of mosaic single nucleotide variants and implications for congenital heart disease

Mosaicism, or genetic mutations arising after oocyte fertilization, has been implicated in developmental disorders such as overgrowth syndromes and structural brain malformations, but its role in congenital heart disease (CHD) is not yet well understood. Further, recent studies give inconsistent estimates of the frequency of mosaic mutations, a basic genetic question, owing to differences in sequencing depth and in the accuracy and power of variant-calling methods. To address these issues, we developed a new computational method to accurately detect mosaic mutations from exome or genome sequencing data. We applied the method to exome sequencing data from 2530 CHD proband-parent trios to estimate the number of mosaic mutations detectable in blood samples and to characterize the contribution of mosaicism to CHD.

Our main findings are summarized below:

  1. We developed a new method (EM-Mosaic) that jointly estimates the overall frequency of mosaic mutations using an Expectation-Maximization approach and identifies mosaic mutations from the data using a pseudo-Bayesian framework, with a 90% validation rate (among the highest of all recent major publications on mosaics).
  2. We estimate that each case carries about 0.14 protein-coding mosaic mutations in the blood with allele fraction above 10%, representing about a tenth of new mutations per generation.
  3. In CHD cases, likely-damaging mosaics have higher allele fraction than benign mutations, strongly supporting a role of mosaics in the disease.
  4. Analysis of a limited number of subjects (n=66) with matched blood and heart tissue available supports the notion that mosaic mutations in blood samples with relatively high allele fraction are more likely to also be found in heart tissues.
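To illustrate the Expectation-Maximization idea in finding 1, here is a minimal sketch. It is not the EM-Mosaic implementation: it assumes a deliberately simplified two-component binomial mixture in which each candidate site is either a germline heterozygous variant (expected allele fraction 0.5) or a mosaic variant sharing one unknown allele fraction `theta`. The function names, starting values, and the toy simulator are all hypothetical and exist only for illustration.

```python
import math
import random


def log_binom_pmf(k, n, p):
    """Log binomial probability of k alternate reads out of n at allele fraction p."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))


def em_mosaic(sites, pi=0.1, theta=0.2, max_iter=500, tol=1e-9):
    """Fit a toy mixture to (alt_reads, depth) pairs.

    Returns (pi, theta, post): the estimated mosaic fraction, the estimated
    mosaic allele fraction, and each site's posterior probability of being
    mosaic from the last E-step. Simplified model, assumed for illustration.
    """
    eps = 1e-6
    post = []
    for _ in range(max_iter):
        # E-step: posterior probability that each site is mosaic.
        post = []
        for k, n in sites:
            log_m = math.log(pi) + log_binom_pmf(k, n, theta)
            log_g = math.log(1.0 - pi) + log_binom_pmf(k, n, 0.5)
            top = max(log_m, log_g)  # subtract the max for numerical stability
            w_m = math.exp(log_m - top)
            w_g = math.exp(log_g - top)
            post.append(w_m / (w_m + w_g))
        # M-step: update the mixing fraction and the mosaic allele fraction.
        new_pi = min(max(sum(post) / len(sites), eps), 1.0 - eps)
        weighted_depth = sum(w * n for w, (_, n) in zip(post, sites))
        if weighted_depth > 0:
            new_theta = sum(w * k for w, (k, _) in zip(post, sites)) / weighted_depth
        else:
            new_theta = theta
        new_theta = min(max(new_theta, eps), 0.5 - eps)
        converged = abs(new_pi - pi) < tol and abs(new_theta - theta) < tol
        pi, theta = new_pi, new_theta
        if converged:
            break
    return pi, theta, post


def simulate_sites(n_germline, n_mosaic, depth, mosaic_af, seed=0):
    """Toy read counts: germline sites first, then mosaic sites (hypothetical data)."""
    rng = random.Random(seed)

    def draw(p):
        return sum(rng.random() < p for _ in range(depth))

    germline = [(draw(0.5), depth) for _ in range(n_germline)]
    mosaic = [(draw(mosaic_af), depth) for _ in range(n_mosaic)]
    return germline + mosaic
```

For example, `em_mosaic(simulate_sites(900, 100, 100, 0.15))` fits the toy mixture to 1000 simulated sites at depth 100. Real data are noisier than this simulation, so the sketch conveys only the joint-estimation idea, not the calling procedure or its validation rate.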

Please see our GitHub page and our bioRxiv preprint for more information.

(Research Advisors: Dr. Yufeng Shen, Dr. Wendy Chung)

Past Experience

Predicting neurodevelopmental disorder risk in congenital heart disease patients

January 2015 - November 2015

Rates of neurodevelopmental disorder (NDD) are disproportionately high in congenital heart disease (CHD) patients compared to the general population (up to 10-fold higher prevalence), presumably due to disruption of pleiotropic genes central to many key developmental pathways. We sought to answer the question: can we predict which CHD patients will develop NDD and which will not?

We applied several statistical and machine learning approaches (logistic regression, random forest, SVM, boosting) to a cohort of 2530 CHD patients. Our outcome variable was a binary NDD diagnosis, and we used a combination of genetic and clinical features as predictors. Models were evaluated using 10-fold cross-validation. Although model performance was middling, we found the following:

  1. Damaging de novo mutations showed the strongest association with NDD, with loss-of-function and damaging-missense mutations in published NDD risk genes contributing the most information.
  2. Complex CHD cases, defined as having a CHD diagnosis and other extracardiac manifestations, were at higher risk of NDD than isolated CHD cases (no extracardiac manifestations).
  3. Male CHD cases were enriched for NDD compared to females in our cohort.
  4. Neurological, skeletal, and heart-morphology-related extracardiac diagnoses had the strongest association with NDD.
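The evaluation scheme described above (a binary outcome, a feature matrix, 10-fold cross-validation) can be sketched as follows. This is an illustrative stand-in rather than the code used in the project: it uses only the Python standard library, a plain gradient-descent logistic regression, and a hypothetical toy cohort whose two numeric features merely stand in for genetic and clinical predictors.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=200):
    """Batch gradient descent for logistic regression; w[0] is the intercept."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w


def predict(w, xi):
    """Predict class 1 when the linear score is positive (probability above 0.5)."""
    return 1 if w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)) > 0 else 0


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(X, y, k=10, seed=0):
    """Mean held-out accuracy over k folds."""
    accuracies = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        w = fit_logistic([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(w, X[i]) == y[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)


def make_toy_cohort(n=200, seed=0):
    """Hypothetical data: the outcome depends only on the first feature."""
    rng = random.Random(seed)
    X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    y = [1 if xi[0] > 0 else 0 for xi in X]
    return X, y
```

Calling `cross_validate(*make_toy_cohort())` returns the mean accuracy across the ten held-out folds. Each sample is held out exactly once, so the estimate reflects performance on data the model did not see during fitting.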

(Rotation Advisor: Dr. Yufeng Shen)

Studying the effect of mRNA methylation on ribosome translation efficiency

September 2014 - January 2015

Using ribosomal profiling and methylation data, I wrote a tool to compare mRNA methylation sites against mRNA sites bound by ribosomes to investigate the effect of methylation on translation efficiency. (Rotation Advisor: Dr. Peter Sims)

Bioinformatics: Competitive Analysis Group

October 2013 - July 2014

I analyzed next-generation sequencing QC data from competing platforms to compare their performance metrics (alignment, assembly, variant calling) against those of Illumina's platforms. I also reviewed and evaluated both in-house and external applications submitted to BaseSpace.

Genomic analysis of variants in Kawasaki Disease patients and families

June 2013 - July 2014

Using whole genome sequencing data of families (trios) affected by Kawasaki Disease, I developed an analysis pipeline to discover risk variants. (Project under supervision of Jihoon Kim and Dr. Jane Burns)

Phenotype Finder IN Data Resources (PFINDR)

August 2011 - June 2013

Using phenotype datasets from dbGaP, I developed a tool called DIVER to extract and format demographic information (later integrated into PhenDisco). I also worked on PhenDisco, a system that fits existing dbGaP data to an information model so that users can search dbGaP for studies of interest with free-text queries. (Project under supervision of Dr. Hyeoneui Kim)

Analysis of unidentified carbohydrate complexes in the PDB

June 2011 - August 2011

As part of an NSF summer REU, I worked with a team of four other students to search for, categorize, and annotate unidentified carbohydrate complexes in the PDB.