Agnes Martine Nielsen: Application of Machine Learning on a Genome-Wide Association Study Dataset

Time: Tue 2015-06-30 10.00

Location: Room 3424, Lindstedtsvägen 25, 4th floor, Department of Mathematics, KTH

Subject area: Scientific Computing

Doctoral student: Agnes Martine Nielsen

Supervisor: Line Clemmensen, DTU

The number of individuals affected by Type 2 Diabetes is increasing rapidly. The goal of this thesis is to investigate whether Type 2 Diabetes can be predicted more accurately from genome-wide association data using machine learning methods than using traditional statistical methods. A variable selection process using random forest has been performed, and the genomic variables, called Single Nucleotide Polymorphisms (SNPs), that show the highest importance for predicting Type 2 Diabetes have been identified. The thesis then considers whether including these SNPs in the models improves performance compared with using only clinical variables or SNPs previously identified by univariate analysis. Furthermore, the possible improvement from using random forest rather than logistic regression has been considered.

The analysis has identified, through the SNPs, genes linked to biological functions that are related to Type 2 Diabetes, including some that have not been directly associated with the disease. These are interesting for future study. However, the results show little to no improvement in prediction performance over models using only clinical variables, suggesting that the signal for Type 2 Diabetes in the genome-wide association dataset is weak. Similarly, the final models show no improvement from using random forest over logistic regression, suggesting that the linear signal in the genome data is much stronger than any non-linear signal.
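
The kind of comparison described in the abstract can be illustrated with a minimal sketch. It is not the thesis code: the data are synthetic, the variable names, sample sizes, effect sizes, the choice of ten selected SNPs, and the use of test-set AUC as the performance measure are all assumptions made for illustration, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the thesis dataset): 600 individuals,
# 2 clinical variables and 200 SNPs coded as 0/1/2 minor-allele counts.
n, n_snps = 600, 200
clinical = rng.normal(size=(n, 2))
snps = rng.integers(0, 3, size=(n, n_snps)).astype(float)

# Assumed outcome model: strong clinical effect, weak effect from 3 SNPs.
logit = 1.5 * clinical[:, 0] + 0.3 * (snps[:, :3].sum(axis=1) - 3.0)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

idx_train, idx_test = train_test_split(
    np.arange(n), test_size=0.3, random_state=0, stratify=y
)

# Step 1: rank SNPs by random forest importance, using training data only.
selector = RandomForestClassifier(n_estimators=200, random_state=0)
selector.fit(snps[idx_train], y[idx_train])
top_snps = np.argsort(selector.feature_importances_)[::-1][:10]

# Step 2: compare feature sets and model types by test-set AUC.
feature_sets = {
    "clinical": clinical,
    "clinical+snps": np.hstack([clinical, snps[:, top_snps]]),
}
models = {
    "logistic": lambda: LogisticRegression(max_iter=1000),
    "forest": lambda: RandomForestClassifier(n_estimators=200, random_state=0),
}
auc = {}
for fname, X in feature_sets.items():
    for mname, make in models.items():
        model = make().fit(X[idx_train], y[idx_train])
        scores = model.predict_proba(X[idx_test])[:, 1]
        auc[(fname, mname)] = roc_auc_score(y[idx_test], scores)

for key, value in sorted(auc.items()):
    print(key, round(value, 3))
```

Selecting SNPs on the training split only keeps the held-out AUC from being inflated by the selection step. On data simulated with a weak genetic effect, as here, the four AUC values would be expected to lie close together, which is the pattern the abstract reports for the real dataset.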
