Multivariate Methods for Heterogeneous High-Dimensional Data in Genome Biology.
Technological advances have transformed the scientific landscape by enabling comprehensive quantitative measurements, thereby increasingly facilitating data-driven research. This includes genome biology, where many data sets nowadays comprise a collection of heterogeneous high-dimensional data modalities, collected from different assays, tissues, organisms, time points or conditions. An important example is multi-omics data, i.e. data combining measurements from multiple biological layers. Jointly, such data promise to provide a better and more comprehensive understanding of biological processes and complex traits. A critical step towards realizing this promise is the development of statistical and computational methods that facilitate moving from the data to sound conclusions and biological insights. For this purpose, an integrative analysis that combines information from different data modalities is essential.
In this thesis, we propose novel methods that provide a multivariate approach to data integration, and we apply them in the context of multi-omics studies in precision medicine and single cell biology. Given a collection of different data modalities on a set of samples, we aim to address two main questions. First, how can we obtain an (unbiased) overview of the main structures that are present in the data, both within and across data modalities? And second, how can we use all data to predict a response of interest and identify relevant features, whilst taking the heterogeneity of the features into account? The first question is important in all exploratory data analysis and leads us to unsupervised methods for data integration. Finding hidden structures in the data can give important insights into biological and technical sources of variation and yield an informative low-dimensional data representation. To this end, we introduce multi-table methods and latent factor models that can capture main axes of variation and co-variation in the data. Based on this, we present a novel factor method, multi-omics factor analysis (MOFA), to integrate information from different data modalities. Through sparsity assumptions on the factor loadings, MOFA decomposes variation into axes present in all, some, or single modalities and promotes interpretable factors with a direct link to molecular drivers. MOFA combines a statistical model that accommodates different data types and missing data with a scalable inference algorithm, thereby ensuring broad applicability. Once learnt, the factors enable a range of downstream analyses, including identification of sample subgroups, outlier detection and data imputation. We demonstrate MOFA's flexibility and potential to generate biological insight by applying it to a multi-omics study on chronic lymphocytic leukaemia as well as a multi-omics single cell data set.
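The idea of decomposing variation into axes that are shared across modalities or specific to one of them can be illustrated with a small simulation. The following is only a rough sketch under stated assumptions: the two modalities, their sizes, the factor names and the loadings are all hypothetical, and a plain truncated SVD of the concatenated data is used as a stand-in. It is not MOFA's statistical model or inference algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # number of samples shared by both (hypothetical) modalities

# Hypothetical latent factors: one active in both modalities, one in "a" only.
z_shared = rng.normal(size=n)
z_private = rng.normal(size=n)

# Loadings with disjoint support, loosely mimicking sparse factor loadings.
w_shared_a = np.r_[np.ones(15), np.zeros(15)]
w_private_a = np.r_[np.zeros(15), np.ones(15)]
w_shared_b = np.ones(20)

views = {
    "a": np.outer(z_shared, w_shared_a) + np.outer(z_private, w_private_a),
    "b": np.outer(z_shared, w_shared_b),
}
# Add a little noise, then center every feature.
views = {name: y + 0.1 * rng.normal(size=y.shape) for name, y in views.items()}
views = {name: y - y.mean(axis=0) for name, y in views.items()}

# Stand-in for model fitting (assumption): truncated SVD of the
# column-wise concatenation of both modalities.
y_all = np.hstack([views["a"], views["b"]])
u, s, vt = np.linalg.svd(y_all, full_matrices=False)
k = 2
factors = u[:, :k] * s[:k]                       # samples x factors
loadings = {"a": vt[:k, :30], "b": vt[:k, 30:]}  # factors x features, per view

# Share of each view's total sum of squares captured by each factor.
# Because the factor columns are orthogonal, this equals the usual
# "1 - residual / total" variance explained of the rank-one term.
r2 = {
    name: [
        float(np.sum(np.outer(factors[:, j], loadings[name][j]) ** 2) / np.sum(y ** 2))
        for j in range(k)
    ]
    for name, y in views.items()
}
for name, values in r2.items():
    print(name, [round(v, 3) for v in values])
```

In this toy setting the first factor should account for variation in both modalities, while the second should matter almost exclusively in modality "a". A per-modality, per-factor table of this kind is one simple way to read off whether an axis of variation is shared or modality-specific.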
The second question leads us to supervised methods that enable building predictive models and selecting features relevant for a response of interest. Reliable methods for this purpose would have far-reaching consequences in many…
Advisors/Committee Members: Bühlmann, Peter; Huber, Wolfgang; Stegle, Oliver.
APA (6th Edition):
Velten, B. (2019). Multivariate Methods for Heterogeneous High-Dimensional Data in Genome Biology. (Doctoral Dissertation). ETH Zürich. Retrieved from http://hdl.handle.net/20.500.11850/333437
Chicago Manual of Style (16th Edition):
Velten, Britta. “Multivariate Methods for Heterogeneous High-Dimensional Data in Genome Biology.” 2019. Doctoral Dissertation, ETH Zürich. Accessed July 17, 2019.
MLA Handbook (7th Edition):
Velten, Britta. “Multivariate Methods for Heterogeneous High-Dimensional Data in Genome Biology.” 2019. Web. 17 Jul 2019.
Vancouver:
Velten B. Multivariate Methods for Heterogeneous High-Dimensional Data in Genome Biology. [Internet] [Doctoral dissertation]. ETH Zürich; 2019. [cited 2019 Jul 17].
Available from: http://hdl.handle.net/20.500.11850/333437.
Council of Science Editors:
Velten B. Multivariate Methods for Heterogeneous High-Dimensional Data in Genome Biology. [Doctoral Dissertation]. ETH Zürich; 2019. Available from: http://hdl.handle.net/20.500.11850/333437