Full Record

Title New Clustering and Feature Selection Procedures with Applications to Gene Microarray Data
Publication Date
Degree PhD
Discipline/Department Statistics
Degree Level doctoral
University/Publisher Case Western Reserve University School of Graduate Studies
Abstract Statistical data mining is one of the most active research areas. In this thesis we develop two new data mining procedures and explore their applications to genetic data.
The first procedure, PfCluster (Profile Cluster Analysis), is a clustering method designed for profiled genetic data. PfCluster is efficient and flexible in uncovering clusters determined by a new class of biologically meaningful distance metrics. A new internal quality measure for clusters, the coherence index, is developed to find coherent clusters. An efficient mechanism for choosing the threshold for coherent clusters is also derived and implemented. The threshold is based on first- and second-order approximations to the true threshold under a null distribution for parallel clusters. PfCluster has been applied to simulated data and to two real data examples: a biomarker LOH dataset and a microarray gene expression dataset. PfCluster is competitive with correlation-based clustering procedures.
The second procedure, RPselection (Resampling-based Partitioning Selection), is a feature selection algorithm designed for microarray studies. It selects a subset of genes that maximizes a fitness score. The fitness score measures the relevance between the partition labels from a clustering result and an external class label derived from clinical outcomes. The score is computed using a resampling procedure. The RPselection algorithm has been applied to simulated data and to a real uveal melanoma gene expression dataset. RPselection outperforms gene-by-gene, test-based feature selection procedures.
Software development is an integral part of modern statistical research. Two software packages, pfclust and rpselect, are developed in this thesis based on the PfCluster method and the RPselection algorithm. Both packages are implemented in the R object-oriented programming framework and can be easily customized and extended by users.
The ideas in the two procedures can be generalized and applied to other data mining tasks. This thesis concludes with a discussion of the connections between the two methods and related future research.
Subjects/Keywords Statistics; Bioinformatics; coherence index; data mining; feature selection; gene expression pathway; gene profiling; informative gene; microarray data; profile cluster analysis; partitioning; regulatory network; statistical pattern recognition
Contributors Sun, Jiayang (Advisor)
Language en
Rights unrestricted ; This thesis or dissertation is protected by copyright: all rights reserved. It may not be copied or redistributed beyond the terms of applicable copyright laws.
Country of Publication us
Format application/pdf
Record ID oai:etd.ohiolink.edu:case1196144281
Repository ohiolink
Date Retrieved
Date Indexed 2021-01-29
Grantor Case Western Reserve University School of Graduate Studies
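
Illustrative note on the abstract: the fitness score described for RPselection (agreement between clustering partition labels and an external class label, averaged over resamples) can be sketched roughly as below. This is not the thesis's algorithm or the rpselect package (which is written in R); it is a minimal Python sketch under stated assumptions. The function names (`rand_index`, `two_means`, `fitness`), the use of 2-means clustering, the Rand index as the agreement measure, subsampling without replacement, and the toy data are all hypothetical choices made for illustration, since the abstract does not specify them.

```python
import random
from itertools import combinations


def rand_index(a, b):
    """Fraction of sample pairs that two labelings treat the same way."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def two_means(rows, iters=20):
    """Tiny 2-means clustering (assumed stand-in for the clustering step)."""
    def d2(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    # Deterministic start: the first row and the row farthest from it.
    centers = [list(rows[0]), list(max(rows, key=lambda r: d2(r, rows[0])))]
    labels = [0] * len(rows)
    for _ in range(iters):
        labels = [0 if d2(r, centers[0]) <= d2(r, centers[1]) else 1
                  for r in rows]
        for k in (0, 1):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:  # keep the old center if a cluster empties out
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def fitness(data, classes, genes, n_resamples=50, frac=0.8, seed=0):
    """Mean Rand index between cluster labels and class labels over subsamples.

    data    : list of samples, each a list of gene expression values
    classes : external class label per sample
    genes   : column indices of the candidate gene subset
    """
    rng = random.Random(seed)
    n = len(data)
    m = max(2, int(frac * n))
    scores = []
    for _ in range(n_resamples):
        idx = rng.sample(range(n), m)  # subsample without replacement (assumption)
        rows = [[data[i][g] for g in genes] for i in idx]
        labels = two_means(rows)
        scores.append(rand_index(labels, [classes[i] for i in idx]))
    return sum(scores) / len(scores)


# Toy data (made up): gene 0 separates the two classes, gene 1 does not.
data = [[0.0, 5.0], [0.2, 1.0], [0.1, 3.0], [0.3, 0.0],
        [5.0, 4.9], [5.2, 1.1], [5.1, 3.1], [4.9, 0.1]]
classes = [0, 0, 0, 0, 1, 1, 1, 1]

informative = fitness(data, classes, genes=[0])
noisy = fitness(data, classes, genes=[1])
```

On this toy data the subset containing only gene 0 scores higher than the subset containing only gene 1. A selection procedure in this spirit would search over gene subsets for the one with the highest score; how the thesis performs that search is not stated in the abstract.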
