You searched for subject: (High dimensional data).
Showing records 1 – 30 of 264 total matches.
1.
Freyaldenhoven, Simon.
Essays on Factor Models and Latent Variables in Economics.
Degree: Department of Economics, 2018, Brown University
URL: https://repository.library.brown.edu/studio/item/bdr:792643/
This dissertation examines the modeling of latent variables in economics in a variety of settings. The first two chapters contribute to the growing body of work on how best to find meaning in high-dimensional datasets. In Chapter 1, I extend the theory of factor models by incorporating "local" factors into the model. Local factors affect a decreasing fraction of the observed variables. This implies a continuum of eigenvalues of the covariance matrix, as is commonly observed in applications. I find that the factor strength at which the principal component estimator gives consistent factor estimates coincides with the factor strength at which factors are economically important in many economic models. I further propose a novel class of estimators for the number of those factors. Unlike estimators that have been proposed in the past, my estimators use information in the eigenvectors as well as in the eigenvalues. Monte Carlo evidence suggests significant finite-sample gains over existing estimators. In an empirical application, I find evidence of local factors in a large panel of US macroeconomic indicators. In Chapter 2, I establish that Sparse Principal Components can consistently recover local factors. I further develop a unifying framework that encompasses both factor-augmented regressions and high-dimensional sparse linear regression models. I argue that factor-augmented regressions with local factors partially fill the gap between those approaches. Chapter 3 considers a linear panel event-study design in which a latent factor may affect both the outcome of interest and the timing of the event. This scenario would invalidate a traditional difference-in-differences approach. However, this chapter presents a novel method that nevertheless allows a practitioner to identify the causal effect of the event on the outcome of interest. A covariate related to the latent factor, but unaffected by the event, is used to achieve identification via a GMM representation. This approach permits causal inference in the presence of pre-event trends in the outcome.
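Freyaldenhoven's estimators use information in the eigenvectors as well as the eigenvalues and are not reproduced here. As a point of reference, a widely used eigenvalue-only approach, the eigenvalue-ratio estimator, picks the number of factors where the ratio of consecutive eigenvalues of the sample covariance matrix peaks. A minimal numpy sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 200, 100, 3  # observations, variables, true number of factors

# Simulate an approximate factor model X = F @ L.T + noise.
F = rng.standard_normal((n, r))
L = rng.standard_normal((p, r))
X = F @ L.T + rng.standard_normal((n, p))

# Eigenvalues of the sample covariance matrix, largest first.
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

# Eigenvalue-ratio rule: the estimated number of factors maximizes the
# ratio of consecutive eigenvalues (eigenvalues only; illustration only).
kmax = 10
ratios = eigvals[:kmax] / eigvals[1:kmax + 1]
print("estimated number of factors:", int(np.argmax(ratios)) + 1)
```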
Advisors/Committee Members: McCloskey, Adam (Advisor), Shapiro, Jesse (Reader), Renault, Eric (Reader), Kleibergen, Frank (Reader).
Subjects/Keywords: high dimensional data

University of Illinois – Urbana-Champaign
2.
Wang, Runmin.
Statistical inference for high-dimensional data via U-statistics.
Degree: PhD, Statistics, 2020, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/108476
Owing to advances in science and technology, there is a surge of interest in high-dimensional data. Many methods developed in the low- or fixed-dimensional setting may not be theoretically valid in this new setting, and are sometimes not even applicable when the dimensionality is larger than the sample size. To circumvent the difficulties brought by high dimensionality, we consider U-statistic-based methods. In this thesis, we investigate the theoretical properties of U-statistics in the high-dimensional setting and develop novel U-statistic-based methods for three problems.

In the first chapter, we propose a new formulation of self-normalization for inference about the mean of high-dimensional stationary processes using a U-statistic-based approach. Self-normalization has attracted considerable attention in the recent time series literature, but its scope of applicability has been limited to low-/fixed-dimensional parameters for low-dimensional time series. Our original test statistic is a U-statistic with a trimming parameter to remove the bias caused by weak dependence. Under the framework of nonlinear causal processes, we show the asymptotic normality of our U-statistic, with the convergence rate depending on the order of the Frobenius norm of the long-run covariance matrix. The self-normalized test statistic is then constructed on the basis of recursive subsampled U-statistics, and its limiting null distribution is shown to be a functional of time-changed Brownian motion, which differs from the pivotal limit used in the low-dimensional setting. An interesting phenomenon associated with self-normalization is that it works in the high-dimensional context even if the convergence rate of the original test statistic is unknown. We also present applications to testing for bandedness of the covariance matrix and testing for white noise in high-dimensional stationary time series, and compare the finite-sample performance with existing methods in simulation studies. At the root of our theoretical arguments, we extend the martingale approximation to the high-dimensional setting, which could be of independent theoretical interest.

In the second chapter, we consider change point testing and estimation for high-dimensional data. In the case of testing for a mean shift, we propose a new test which is based on U-statistics and utilizes the self-normalization principle. Our test targets dense alternatives in the high-dimensional setting and involves no tuning parameters. The weak convergence of a sequential U-statistic-based process is shown as an important theoretical contribution. Extensions to testing for multiple unknown change points in the mean, and testing for changes in the covariance matrix, are also presented with rigorous asymptotic theory and encouraging simulation results. Additionally, we illustrate how our approach can be used in combination with wild binary segmentation to estimate the number and location of multiple unknown change points.

In the third chapter, we…
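The trimmed, self-normalized statistics developed in the thesis are more elaborate than can be shown here, but their basic building block, a second-order U-statistic that estimates the squared mean norm without inverting a covariance matrix, can be sketched as follows (a simplified illustration, not the thesis's test):

```python
import numpy as np

def mean_ustat(X):
    """U-statistic (1 / (n (n - 1))) * sum_{i != j} X_i' X_j.

    Unbiased for ||mu||^2 and free of covariance-matrix inversion,
    so it remains usable when the dimension exceeds the sample size.
    """
    n = X.shape[0]
    total = X.sum(axis=0)
    # sum_{i != j} X_i'X_j = ||sum_i X_i||^2 - sum_i ||X_i||^2
    cross = total @ total - np.einsum("ij,ij->", X, X)
    return cross / (n * (n - 1))

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 500))   # n = 50 observations, p = 500 dimensions
print(mean_ustat(X))                 # near 0 under the null of zero mean
print(mean_ustat(X + 0.2))           # inflated under a dense mean shift
```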
Advisors/Committee Members: Shao, Xiaofeng (advisor), Shao, Xiaofeng (Committee Chair), Chen, Xiaohui (committee member), Fellouris, Georgios (committee member), Simpson, Douglas G (committee member).
Subjects/Keywords: High-dimensional data; U-statistics

University of Alberta
3.
Fedoruk, John P.
Dimensionality Reduction via the Johnson and Lindenstrauss Lemma: Mathematical and Computational Improvements.
Degree: MS, Department of Mathematical and Statistical Sciences, 2016, University of Alberta
URL: https://era.library.ualberta.ca/files/cm039k5065
In an increasingly data-driven society, there is a growing need to simplify high-dimensional data sets. Over the course of the past three decades, the Johnson and Lindenstrauss (JL) lemma has evolved from a highly abstract mathematical result into a useful tool for dealing with data sets of immense dimensionality. The lemma asserts that a set of high-dimensional points can be projected into lower dimensions while approximately preserving the pairwise distance structure. The JL lemma has been revisited many times, with improvements to both its sharpness (i.e., the bound on the reduced dimensionality) and its simplicity (i.e., the mathematical derivation). In 2008, Matousek provided generalizations of the JL lemma that lacked the sharpness of earlier approaches. The current investigation seeks to strengthen Matousek's results by maintaining generality while improving sharpness. First, Matousek's results are reproved with more detailed mathematics and, second, computational solutions are obtained on simulated data in Matlab. The reproofs result in a more specific bound than suggested by Matousek while maintaining his level of generality. However, the reproofs lack the sharpness suggested by earlier, less general approaches to the JL lemma. The computational solutions suggest the existence of a result that maintains Matousek's generality while attaining the sharpness suggested by his predecessors. The collective results of the current investigation support the notion that computational solutions play a critical role in the development of mathematical theory.
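For readers unfamiliar with the mechanism behind the lemma, the simplest JL-style construction, a scaled Gaussian random projection, can be demonstrated in a few lines of numpy; this illustrates the lemma only and is unrelated to the generalized constructions studied in the thesis:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
n, d, k = 100, 10_000, 500  # points, original dimension, reduced dimension

X = rng.standard_normal((n, d))
R = rng.standard_normal((d, k)) / np.sqrt(k)  # random projection matrix
Y = X @ R

# Ratios of projected to original pairwise distances concentrate near 1.
distortion = pdist(Y) / pdist(X)
print(distortion.min(), distortion.max())
```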
Subjects/Keywords: Dimensionality Reduction; High Dimensional Data; Johnson Lindenstrauss

University of Michigan
4.
Qian, Cheng.
Some Advances on Modeling High-Dimensional Data with Complex Structures.
Degree: PhD, Statistics, 2017, University of Michigan
URL: http://hdl.handle.net/2027.42/140828
Recent advances in technology have created an abundance of high-dimensional data and made its analysis possible. These data require new, computationally efficient methodology and new kinds of asymptotic analysis. This thesis consists of four projects that deal with high-dimensional data with complex structures.

The first project focuses on the graph estimation problem for Gaussian graphical models. Graphical models are commonly used to represent conditional independence between random variables, and learning the conditional independence structure from data has attracted much attention in recent years. However, almost all commonly used graph learning methods rely on the assumption that the observations share the same mean vector. In the first project, we extend the Gaussian graphical model to the setting where the observations are connected by a network and the mean vector can differ across observations. We propose an efficient estimation method for the model, and under the assumption of network cohesion, we show that our method can accurately estimate the inverse covariance matrix as well as the corresponding graph structure, both from the theoretical perspective and through numerical studies. To further demonstrate the effectiveness of the proposed method, we also analyze statisticians' coauthorship network data to learn term dependencies in statistics publications.

The second project addresses the directed acyclic graph (DAG) estimation problem. Estimating the DAG structure is often challenging because the computational complexity scales exponentially in the graph size when the total ordering of the DAG is unknown. To reduce the computational cost, and with the aim of improving estimation accuracy via the bias-variance trade-off, we propose a two-step approach for estimating the DAG when data are generated from a linear structural equation model. In the first step, we infer the moral graph of the DAG via estimation of the inverse covariance matrix, which reduces the parameter space over which one searches for the DAG. In the second step, we apply a penalized likelihood method for estimating the DAG restricted to the reduced space. Numerical studies indicate that the proposed method compares favorably with the one-step method in terms of both computational cost and estimation accuracy.

The third and fourth projects investigate supervised learning problems. Specifically, in the third project, we study the cointegration problem for multivariate time series data and propose a method for identifying cointegrating vectors with simultaneously group and elementwise sparse structures. Such a sparsity structure enables the elimination of certain coordinates of the original multivariate series from all cointegrated series, leading to parsimonious and potentially more interpretable cointegrating vectors. We formulate an optimization problem based on the profile likelihood and propose an iterative algorithm for solving it. The proposed…
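The first project's network-linked extension cannot be reconstructed from the abstract, but the baseline it builds on, sparse inverse covariance estimation with the graphical lasso, is available in scikit-learn:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso
from sklearn.datasets import make_sparse_spd_matrix

rng = np.random.default_rng(3)
precision = make_sparse_spd_matrix(20, alpha=0.9, random_state=3)  # sparse truth
X = rng.multivariate_normal(np.zeros(20), np.linalg.inv(precision), size=500)

model = GraphicalLasso(alpha=0.05).fit(X)

# Nonzero off-diagonal entries of the estimated precision matrix are the
# edges of the estimated conditional-independence graph.
edges = np.abs(model.precision_) > 1e-4
np.fill_diagonal(edges, False)
print("estimated number of edges:", edges.sum() // 2)
```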
Advisors/Committee Members: Zhu, Ji (committee member), Jin, Judy (committee member), Levina, Elizaveta (committee member), Shedden, Kerby A (committee member).
Subjects/Keywords: High-Dimensional; Statistics and Numeric Data; Science

Delft University of Technology
5.
Grisel, Bastiaan (author).
The analysis of three-dimensional embeddings in Virtual Reality.
Degree: 2018, Delft University of Technology
URL: http://resolver.tudelft.nl/uuid:afad36f5-64c7-4969-9615-93d89b43e65f
Dimensionality reduction algorithms transform high-dimensional datasets with many attributes per observation into lower-dimensional representations (called embeddings) such that the structure of the dataset is maintained as well as possible. In this research, the use of Virtual Reality (VR) to analyse these embeddings has been evaluated and compared to analysis on a desktop computer. The rationale for using VR is two-fold: three-dimensional embeddings generally preserve the structure of a high-dimensional dataset better than two-dimensional embeddings, and the analysis of three-dimensional embeddings is difficult on desktop monitors. A user study (n=29) was conducted in which participants performed the common analysis task of cluster identification. The task was performed using a two-dimensional embedding on a desktop computer, a three-dimensional embedding on a desktop computer, and a three-dimensional embedding in Virtual Reality. On average, participants who had used VR at least once before could identify clusters better and more consistently in the VR experiments than with the other methods. Participants found it easier to analyse a three-dimensional embedding in VR than on a desktop computer.
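The abstract does not say which algorithm produced the embeddings; as a generic illustration, t-SNE in scikit-learn can emit either a two- or a three-dimensional embedding of the same high-dimensional data:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)  # 64-dimensional observations
X = X[:500]                          # subsample to keep the example fast

emb2 = TSNE(n_components=2, random_state=0).fit_transform(X)
emb3 = TSNE(n_components=3, random_state=0).fit_transform(X)
print(emb2.shape, emb3.shape)        # (500, 2) (500, 3)
```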
Computer Science | Data Science and Technology
Advisors/Committee Members: Eisemann, Elmar (mentor), Vilanova Bartroli, Anna (graduation committee), Brinkman, Willem-Paul (graduation committee), Delft University of Technology (degree granting institution).
Subjects/Keywords: virtual; reality; embedding; visualisation; data; high-dimensional

University of Minnesota
6.
Ye, Changqing.
Network selection, information filtering and scalable computation.
Degree: PhD, Statistics, 2014, University of Minnesota
URL: http://hdl.handle.net/11299/172631
This dissertation explores two application scenarios of sparsity pursuit methods on large-scale data sets. The first scenario is classification and regression for high-dimensional structured data, where predictors correspond to nodes of a given directed graph. This arises, for instance, in the identification of disease genes for Parkinson's disease from a network of candidate genes. In such a situation, the directed graph describes dependencies among the genes, where directions of edges represent certain causal effects. Key to high-dimensional structured classification and regression is how to utilize the dependencies among predictors specified by the directions of the graph. In this dissertation, we develop a novel method that fully takes such dependencies into account, formulated through certain nonlinear constraints. We apply the proposed method to two applications: feature selection in large-margin binary classification and in linear regression. We implement the proposed method through difference convex programming for the cost function and constraints. Finally, theoretical and numerical analyses suggest that the proposed method achieves the desired objectives. An application to disease gene identification is presented.

The second application scenario is personalized information filtering, which extracts the information specifically relevant to a user, predicting his/her preference over a large number of items based on the opinions of like-minded users or on content. This problem is cast into the framework of regression and classification, where we introduce novel partial latent models to integrate additional user-specific and content-specific predictors for higher predictive accuracy. In particular, we factorize a user-over-item preference matrix into a product of two matrices, each representing a user's preference and an item preference by users. Then we propose a likelihood method to seek the sparsest latent factorization from a class of over-complete factorizations, possibly with a high percentage of missing values. This promotes additional sparsity beyond rank reduction. Computationally, we design methods based on a "decomposition and combination" strategy to break large-scale optimization into many small subproblems that are solved in a recursive and parallel manner. On this basis, we implement the proposed methods through multi-platform shared-memory parallel programming, and through Mahout, a library for scalable machine learning and data mining, for MapReduce computation. For example, our methods scale to a dataset consisting of three billion observations on a single machine with sufficient memory, with good timings. Both theoretical and numerical investigations show that the proposed methods exhibit significant improvement in accuracy over state-of-the-art scalable methods.
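The sparsest-factorization likelihood method and the "decomposition and combination" solvers are the dissertation's contributions and are not sketched here; the primitive they refine, factorizing a partially observed user-over-item preference matrix into two low-rank matrices, can be illustrated with plain gradient descent in numpy:

```python
import numpy as np

rng = np.random.default_rng(4)
n_users, n_items, rank = 30, 40, 3

# Ground-truth low-rank preference matrix; only 30% of entries observed.
P = rng.standard_normal((n_users, rank)) @ rng.standard_normal((rank, n_items))
observed = rng.random(P.shape) < 0.3

U = 0.1 * rng.standard_normal((n_users, rank))
V = 0.1 * rng.standard_normal((n_items, rank))
lr = 0.01
for _ in range(5000):
    E = observed * (P - U @ V.T)  # reconstruction error on observed entries
    U += lr * E @ V               # gradient steps on the squared observed error
    V += lr * E.T @ U

print("observed-entry error:", np.abs((P - U @ V.T)[observed]).mean())
# With enough observed entries, the held-out entries are recovered too.
print("held-out error:", np.abs((P - U @ V.T)[~observed]).mean())
```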
Subjects/Keywords: High dimensional data; Machine learning; Recommendation; Statistics

Massey University
7.
Ullah, Insha.
Contributions to high-dimensional data analysis : some applications of the regularized covariance matrices : a thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Statistics at Massey University, Albany, New Zealand.
Degree: 2015, Massey University
URL: http://hdl.handle.net/10179/6608
High-dimensional data sets, particularly those where the number of variables exceeds the number of observations, are now common in many subject areas, including genetics, ecology, and statistical pattern recognition, to name but a few. The sample covariance matrix becomes rank deficient and is not invertible when the number of variables is greater than the number of observations. This poses a serious problem for many classical multivariate techniques that rely on an inverse of a covariance matrix. Recently, regularized alternatives to the sample covariance have been proposed, which are not only guaranteed to be positive definite but also provide reliable estimates. In this thesis, we bring together some of the important recent regularized estimators of the covariance matrix and explore their performance in high-dimensional scenarios via numerical simulations. We make use of these regularized estimators and attempt to improve the performance of three classical multivariate techniques in high-dimensional settings.

In multivariate random effects models, estimating the between-group covariance is a well-known problem. Its classical estimator involves the difference of two mean square matrices and often results in negative elements on the main diagonal. We use a lasso-regularized estimate of the between-group mean square and propose a new approach to estimating the between-group covariance based on the EM algorithm. Using simulation, the procedure is shown to be quite effective, and the estimate obtained is always positive definite.

Multivariate analysis of variance (MANOVA) faces serious challenges due to the undesirable properties of the sample covariance in high-dimensional problems. First, it suffers from low power and does not maintain an accurate type-I error rate when the dimension is large compared to the sample size. Second, MANOVA relies on the inverse of a covariance matrix and fails to work when the number of variables exceeds the number of observations. We use an approach based on lasso regularization and present a comparative study of the existing approaches, including our proposal. The lasso approach is shown to be an improvement in some cases, in terms of the power of the test, over the existing high-dimensional methods.

Another problem addressed in the thesis is how to detect unusual future observations when the dimension is large. The Hotelling T2 control chart has traditionally been used for this purpose. The charting statistic in the control chart relies on the inverse of a covariance matrix and is not reliable in high-dimensional problems. To get a reliable estimate of the covariance matrix, we use a distribution-free shrinkage estimator. We make use of an available baseline set of data and propose a procedure to estimate the control limits for monitoring individual future observations. The procedure does not assume multivariate normality and seems robust to violations of multivariate normality. The simulation study shows that the new method performs better than…
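The abstract does not name the distribution-free shrinkage estimator used for the control chart; the Ledoit-Wolf estimator in scikit-learn is a standard example of a regularized covariance estimate that stays positive definite, and hence invertible, when variables outnumber observations:

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(5)
X = rng.standard_normal((50, 200))  # n = 50 observations, p = 200 variables

S = np.cov(X, rowvar=False)         # sample covariance: singular when p > n
lw = LedoitWolf().fit(X)            # shrinkage estimate: positive definite

print(np.linalg.matrix_rank(S))                      # at most n - 1 = 49
print(np.linalg.eigvalsh(lw.covariance_).min() > 0)  # True
```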
Subjects/Keywords: Multivariate analysis; High-dimensional data; Covariance

Virginia Tech
8.
Blake, Patrick Michael.
Biclustering and Visualization of High Dimensional Data using VIsual Statistical Data Analyzer.
Degree: MS, Electrical Engineering, 2019, Virginia Tech
URL: http://hdl.handle.net/10919/87392
Many data sets have too many features for conventional pattern recognition techniques to work properly. This thesis investigates techniques that alleviate these difficulties. One such technique, biclustering, clusters data in both dimensions and is inherently resistant to the challenges posed by having too many features. However, the algorithms that implement biclustering have limitations in that the user must know at least the structure of the data and how many biclusters to expect. This is where the VIsual Statistical Data Analyzer, or VISDA, can help. It is a visualization tool that successively and progressively explores the structure of the data, identifying clusters along the way. This thesis proposes coupling VISDA with biclustering to overcome some of the challenges of data sets with too many features. Further, to increase performance, usability, and maintainability as well as reduce costs, VISDA was translated from Matlab to a Python version called VISDApy. Both VISDApy and the overall process were demonstrated with real and synthetic data sets. The results of this work have the potential to improve analysts' understanding of the relationships within complex data sets and their ability to make informed decisions from such data.
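VISDA and VISDApy are project-specific tools; as a generic stand-in, scikit-learn's SpectralBiclustering shows what biclustering does, clustering the rows and columns of a data matrix simultaneously:

```python
import numpy as np
from sklearn.cluster import SpectralBiclustering
from sklearn.datasets import make_checkerboard

# Synthetic matrix with a hidden 3 x 3 checkerboard of biclusters.
data, _, _ = make_checkerboard(shape=(300, 60), n_clusters=(3, 3),
                               noise=10, random_state=6)
model = SpectralBiclustering(n_clusters=(3, 3), random_state=6).fit(data)

# Every row and every column is assigned to a cluster in its own dimension.
print(np.bincount(model.row_labels_))
print(np.bincount(model.column_labels_))
```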
Advisors/Committee Members: Wang, Yue J. (committeechair), Xuan, Jianhua (committee member), Yu, Guoqiang (committee member).
Subjects/Keywords: high-dimensional data; biclustering; VISDA; VISDApy

University of Minnesota
9.
Datta, Abhirup.
Statistical Methods for Large Complex Datasets.
Degree: PhD, Biostatistics, 2016, University of Minnesota
URL: http://hdl.handle.net/11299/199089
Modern technological advancements have enabled massive-scale collection, processing, and storage of information, triggering the onset of the 'big data' era, in which every two days we now create as much data as we did in the entire twentieth century. This thesis aims at developing novel statistical methods that can efficiently analyze a variety of large complex datasets. Underlying the umbrella theme of big data modeling, we present statistical methods for two different classes of large complex datasets. The first half of the thesis focuses on the 'large n' problem for large spatial or spatio-temporal datasets, where observations exhibit strong dependencies across space and time. In the second half of this thesis, we present methods for high-dimensional regression in the 'large p, small n' setting for datasets that contain measurement errors or change points.
Subjects/Keywords: Big data; High dimensional data; Large spatial data

University of Arizona
10.
Washburn, Ammon.
High-Confidence Learning from Uncertain Data with High Dimensionality.
Degree: 2018, University of Arizona
URL: http://hdl.handle.net/10150/631476
Some of the most challenging issues in big data are size, scalability, and reliability. Big data, such as pictures, videos, and text, have innate structure that does not fit into a normal data table. Often the sources come from the internet or other domains where accuracy is not possible. When data are drawn from these sources, it is likely that important information is missing or cannot be measured. This leads to situations where identifying the important part of the data would lead to good solutions, but with all the data the tasks become ill-posed. In other cases, all the data are useful, but there is some important and/or hidden structure of which classical methods are not equipped to take advantage. However, many methods have been developed to deal with either data uncertainty or ill-posed problems.

Data uncertainty can come from missing or distributional data. Data imputation combined with uncertainty quantification can allow regular statistical and machine learning methods to be applied and then verified. Other methods combine the steps in a robust way to directly inform the model. This last type of method is common in chance-constrained, robust, or distributionally robust programs from the mathematical optimization community.

Well-posed problems have a solution that is unique and changes slowly and continuously with the initial conditions. For standard machine learning models, a data set with many irrelevant features gives rise to ill-posed problems. Regularization and feature selection are two possible ways to deal with these problems. Both regularization and feature selection techniques have been around for a long time. Regularization approaches can include Lp norms or the matrix trace, which confer certain properties. Feature selection has been achieved in many ways, from preprocessing steps that rank and select features and the use of stepwise regression to modern techniques such as the LASSO.

For many applications, there are both uncertainties and a high-dimensional component of the data. By combining methods that deal with both of these issues and then deriving fast computational algorithms, we can formulate robust, highly generalizable machine learning models that achieve very good results.

Two of our classification models handle samples of points to be classified as one. Traditional machine learning models in classification expect to classify one point, but in an interesting data set from karyometry, several hundred points must be consolidated into one classification. One of the algorithms can also take advantage of a certain nested structure in this data set to gain further information useful to doctors. The third model deals with data and label uncertainty in classification. We do this in a data-driven, distributionally robust way that gives us confidence intervals on our classification.

A large part of this dissertation also deals with the algorithms used to solve these optimization formulations. We advance the solution path algorithms to…
Advisors/Committee Members: Fan, Neng (advisor), Zhang, Helen Hao (advisor), Kennedy, Thomas G. (committeemember), Missoum, Samy (committeemember).
Subjects/Keywords: data classification; data uncertainty; high dimensional data; machine learning; optimization

University of California – Riverside
11.
Zakaria, Jesin.
Developing Efficient Algorithms for Data Mining Large Scale High Dimensional Data.
Degree: Computer Science, 2013, University of California – Riverside
URL: http://www.escholarship.org/uc/item/660316zp
Data mining and knowledge discovery have attracted a great deal of attention in information technology in recent years. The rapid progress of computer hardware technology in the past three decades has greatly enhanced the database and information industry. The size and complexity of real-world data are increasing dramatically with the growth of hardware technology. Although new efficient algorithms to deal with such data are constantly being proposed, the mining of large-scale high-dimensional data still presents many challenges. In this dissertation, several novel algorithms are proposed to handle such challenges. These algorithms are applied to domains as diverse as electrocardiography (ECG), stock market data, geospatial data, power supply data, audio data, and image data. This dissertation contributes to the data mining community in the following three ways.

First, we propose a novel algorithm for clustering time series data efficiently in the presence of noise or extraneous data. Most existing methods for time series clustering rely on distances calculated from the entire raw data. As a consequence, most work on time series clustering considers only the clustering of individual time series "behaviors," e.g., individual heartbeats, and contrives the time series in some way to make them all equal in length. However, for any real-world problem, formatting the data in such a way is often a harder task than the clustering itself. To remove these unrealistic assumptions, we have developed a new primitive called the unsupervised shapelet, or u-shapelet, and shown its utility for clustering time series.

Second, to speed up the discovery of u-shapelets and make it scalable, we propose two optimization techniques, each of which can accelerate unsupervised shapelet discovery independently of the other. Combining the two optimization procedures results in a superlinear speedup. In addition, our u-shapelet discovery algorithm can be cast as an anytime algorithm.

Third, we have developed a novel and robust algorithm for mining mice vocalizations with a symbolized representation. Our algorithm processes large-scale, high-dimensional, noisy mice vocalizations via dimensionality reduction and cardinality reduction, making the data suitable for knowledge discovery tasks such as classification, clustering, similarity search, motif discovery, and contrast set mining.
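The scoring, pruning, and anytime machinery around u-shapelets are the dissertation's contributions and are not reproduced here; the primitive they build on, the minimum z-normalized distance from a candidate subsequence to any sliding window of a series, can be sketched as:

```python
import numpy as np

def znorm(x):
    s = x.std()
    return (x - x.mean()) / s if s > 0 else x - x.mean()

def subsequence_distance(shapelet, series):
    """Minimum length-normalized, z-normalized Euclidean distance from
    `shapelet` to any equal-length sliding window of `series`."""
    m = len(shapelet)
    s = znorm(np.asarray(shapelet, dtype=float))
    best = np.inf
    for i in range(len(series) - m + 1):
        w = znorm(np.asarray(series[i:i + m], dtype=float))
        best = min(best, np.sqrt(np.sum((s - w) ** 2) / m))
    return best

t = np.sin(np.linspace(0, 8 * np.pi, 400))             # a sine wave
print(subsequence_distance(t[100:150], t))             # ~0: pattern occurs in t
print(subsequence_distance(np.linspace(0, 1, 50), t))  # larger: ramp vs. sine
```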
Subjects/Keywords: Computer science; Clustering; Data Mining; High Dimensional Data; Scalable; Time Series

Tulane University
12.
Qu, Zhe.
High-dimensional statistical data integration.
Degree: 2019, Tulane University
URL: https://digitallibrary.tulane.edu/islandora/object/tulane:106916
Modern biomedical studies often collect multiple types of high-dimensional data on a common set of objects. A representative model for the integrative analysis of multiple data types decomposes each data matrix into a low-rank common-source matrix generated by latent factors shared across all data types, a low-rank distinctive-source matrix corresponding to each data type, and an additive noise matrix. We propose a novel decomposition method, called decomposition-based generalized canonical correlation analysis, which appropriately defines those matrices by imposing a desirable orthogonality constraint on the distinctive latent factors so that the common latent factors are sufficiently captured. To further delineate the common and distinctive patterns between two data types, we propose another new decomposition method, called common and distinctive pattern analysis. This method takes into account the common and distinctive information between the coefficient matrices of the common latent factors. We develop consistent estimation approaches for both proposed decompositions under high-dimensional settings and demonstrate their finite-sample performance via extensive simulations. We illustrate the superiority of the proposed methods over the state of the art using real-world data examples obtained from The Cancer Genome Atlas and the Human Connectome Project.
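Both proposed methods generalize canonical correlation analysis; plain two-view CCA, the baseline being extended, is available in scikit-learn:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(7)
n = 300
shared = rng.standard_normal((n, 2))  # common latent factors

# Two data types driven by the same latent factors plus noise.
X = shared @ rng.standard_normal((2, 20)) + 0.5 * rng.standard_normal((n, 20))
Y = shared @ rng.standard_normal((2, 30)) + 0.5 * rng.standard_normal((n, 30))

cca = CCA(n_components=2).fit(X, Y)
Xc, Yc = cca.transform(X, Y)
# Canonical correlations are high because of the shared latent factors.
print([float(np.corrcoef(Xc[:, k], Yc[:, k])[0, 1]) for k in range(2)])
```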
Advisors/Committee Members: Hyman, James (Thesis advisor), School of Science & Engineering Mathematics (Degree granting institution).
Subjects/Keywords: High-dimensional data analysis; Data integration; Canonical correlation analysis

University of Adelaide
13.
Conway, Annie.
Clustering of proteomics imaging mass spectrometry data.
Degree: 2016, University of Adelaide
URL: http://hdl.handle.net/2440/112036
This thesis presents a toolbox for the exploratory analysis of multivariate data, in particular proteomics imaging mass spectrometry data. Typically such data consist of 15,000–20,000 spectra with a spatial component, and for each spectrum ion intensities are recorded at specific masses. Clustering is a focus of this thesis, with discussion of k-means clustering and clustering with principal component analysis (PCA). Theoretical results relating PCA and clustering are given based on Ding and He (2004), and detailed and corrected proofs of the authors' results are presented. The benefits of transformations prior to clustering of the data are explored. Transformations include normalisation, peak intensity correction (PIC), and binary and log transformations. A number of techniques for comparing different clustering results are also discussed, including set-based comparisons with the Jaccard distance, an information-based criterion (variation of information), point-pair comparisons (Rand index), and a modified version of the prediction strength of Tibshirani and Walther (2005). These exploratory analyses are applied to imaging mass spectrometry data taken from patients with ovarian cancer. The data are taken from slices of cancerous tissue. The analyses in this thesis are primarily focused on data from one patient, with some techniques demonstrated on other patients for comparison.
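As a generic illustration of the pipeline the thesis studies, dimension reduction by PCA followed by k-means, with one of the discussed criteria for comparing partitions, here is a scikit-learn sketch on toy data rather than mass spectrometry spectra:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

X, y = load_digits(return_X_y=True)

scores = PCA(n_components=10).fit_transform(X)  # reduce before clustering
labels = KMeans(n_clusters=10, n_init=10, random_state=8).fit_predict(scores)

# One point-pair comparison criterion (the thesis also discusses the
# Jaccard distance, variation of information, and prediction strength).
print(adjusted_rand_score(y, labels))
```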
Advisors/Committee Members: Koch, Inge (advisor), School of Mathematical Sciences (school).
Subjects/Keywords: clustering; proteomics; multivariate data analysis; high-dimensional data analysis; machine learning

University of Minnesota
14.
O'Connell, Michael.
Integrative Analyses for Multi-source Data with Multiple Shared Dimensions.
Degree: PhD, Biostatistics, 2018, University of Minnesota
URL: http://hdl.handle.net/11299/200286
High-dimensional data consist of matrices with a large number of features and are common across many fields of study, including genetics, imaging, and toxicology. This type of data is challenging to analyze because of its size, and many traditional methods are difficult to implement or interpret with such data. One way of handling high-dimensional data is dimension reduction, which aims to reduce high-rank, high-dimensional data sets into low-rank approximations that maintain important components of the structures of the matrices but are easier to use in models. The most common method for dimension reduction of a single matrix is principal components analysis (PCA).

Multi-source data are high-dimensional data in which multiple data sources share a dimension. When two or more data sets share a feature set, this is called horizontal integration; when two or more data sets share a sample set, this is called vertical integration. Traditionally, there are two ways to approach such a data set: either analyze each data source separately or treat them as one data set. However, these analyses may miss important features that are unique to each data source or miss important relationships between the data sources. A number of recent methods have been developed for analyzing multi-source data that are either vertically or horizontally integrated. One such method is Joint and Individual Variation Explained (JIVE), which decomposes the variation in multi-source data sets into structure that is shared between data sources (called joint structure) and structure that is unique to each of the data sources (called individual structure) (Lock et al. 2013). We have created an R package, r.jive, that implements the JIVE algorithm and provides visualization tools for multi-source data, making multi-source methods more accessible.

While there are several methods for data sets with horizontal or vertical integration, there have been no previous methods for data sets with simultaneous horizontal and vertical integration (which we call bidimensional integration). We introduce a method called Linked Matrix Factorization that allows for simultaneous decomposition of multi-source data sets with bidimensional integration. We also introduce a method for bidimensionally integrated data that are not normally distributed, called Generalized Linked Matrix Factorization, which is based on generalized linear models rather than ordinary least squares.
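A one-pass caricature of the JIVE idea (the actual algorithm implemented in r.jive iterates and selects the ranks) estimates joint structure from a low-rank fit to the concatenated sources and individual structure from each source's residual:

```python
import numpy as np

def svd_approx(A, rank):
    """Best rank-`rank` approximation of A via the SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

rng = np.random.default_rng(9)
n = 100
joint = rng.standard_normal((n, 2))  # scores shared by both sources
X1 = joint @ rng.standard_normal((2, 50)) + rng.standard_normal((n, 50))
X2 = joint @ rng.standard_normal((2, 80)) + rng.standard_normal((n, 80))

# Joint structure: low-rank fit to the column-concatenated sources.
J = svd_approx(np.hstack([X1, X2]), rank=2)
J1, J2 = J[:, :50], J[:, 50:]

# Individual structure: low-rank fit to each source's residual.
I1 = svd_approx(X1 - J1, rank=2)
I2 = svd_approx(X2 - J2, rank=2)
print(np.linalg.norm(X1 - J1 - I1), np.linalg.norm(X2 - J2 - I2))
```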
Subjects/Keywords: data integration; high-dimensional data; matrix decomposition; multi-source
APA (6th Edition):
O'Connell, M. (2018). Integrative Analyses for Multi-source Data with Multiple Shared Dimensions. (Doctoral Dissertation). University of Minnesota. Retrieved from http://hdl.handle.net/11299/200286
Chicago Manual of Style (16th Edition):
O'Connell, Michael. “Integrative Analyses for Multi-source Data with Multiple Shared Dimensions.” 2018. Doctoral Dissertation, University of Minnesota. Accessed March 02, 2021.
http://hdl.handle.net/11299/200286.
MLA Handbook (7th Edition):
O'Connell, Michael. “Integrative Analyses for Multi-source Data with Multiple Shared Dimensions.” 2018. Web. 02 Mar 2021.
Vancouver:
O'Connell M. Integrative Analyses for Multi-source Data with Multiple Shared Dimensions. [Internet] [Doctoral dissertation]. University of Minnesota; 2018. [cited 2021 Mar 02].
Available from: http://hdl.handle.net/11299/200286.
Council of Science Editors:
O'Connell M. Integrative Analyses for Multi-source Data with Multiple Shared Dimensions. [Doctoral Dissertation]. University of Minnesota; 2018. Available from: http://hdl.handle.net/11299/200286

University of Southern California
15.
Ren, Jie.
Robust feature selection with penalized regression in
imbalanced high dimensional data.
Degree: PhD, Statistical Genetics and Genetic
Epidemiology, 2014, University of Southern California
URL: http://digitallibrary.usc.edu/cdm/compoundobject/collection/p15799coll3/id/443080/rec/5620
▼ This work is motivated by an ongoing USC/Illumina
study of prostate cancer recurrence after radical prostatectomy.
The study generated gene expression
data for nearly thirty thousand
probes from 187 tumor samples, of which 33 came from patients with
recurrent prostate cancer and 154 came from patients with
non‐recurrent prostate cancer after years of follow‐up. Our goal
was to use penalized logistic regression and stability selection to
find a “gene signature” of recurrence that can improve upon PSA and
Gleason‐score, which are well‐established but poor predictors. For
interpretability and future clinical use, the gene signature should
ideally involve a small proportion of probes to predict recurrence
in new patients. Due to the skewness in the
data, the model
selected by tuning the LASSO penalty parameter based on the average
misclassification rate in cross validation did not have a balanced
performance, i.e. it predicted non‐recurrent cancer with
high
accuracy but predicted recurrent cancer with very low accuracy. In
addition, standard penalized regression with cross validation
appeared to select many noise features. In my simulation study in
Chapter 2, I evaluated the performance of models selected by
different metrics in imbalanced
data. I concluded that G‐mean‐thr
(G‐mean with an alternative cutoff) and area under the ROC curve
(AUC‐ROC) were the most robust metrics to class imbalance. In
Chapter 3, I examined the performance of stability selection
(SS‐thr) in simulation studies and found that its feature selection
capability (a) depended on the stability cutoff chosen and (b) was
conservative as a result of stringent error control. To address
these problems, I proposed new feature selectors based on stability
selection, including SS‐test, an essentially parameter‐free
test‐based outlier‐detection approach, and SS‐rank and SS‐ranktest,
parameter‐free rank‐based methods. I demonstrated their advantage
over SS‐thr, ULR, and LASSO with cross validation in extensive
simulation studies and also found that all these stability
selection based methods were robust to class imbalance. These newly
developed methods and procedures were applied to the prostate cancer
recurrence
data. I used a variety of metrics to do model selection
within the penalized logistic regression framework using imbalanced
prostate cancer recurrence
data and demonstrated that G‐mean with
the case‐proportion cutoff selected the model with the most
balanced prediction accuracy in cross validation. In addition, I
also showed the importance of using an appropriate cutoff to
evaluate models when models were built from skewed
data. I also
applied stability selection based methods including SS‐thr,
SS‐test, SS‐rank and SS‐ranktest to select important genes from the
same
data. Three genes, ABCC1, NKX2‐1 and ZYG11A were identified by
all of the methods and also appeared to stand out from other
features in the stability path plot. I fit a logistic regression
model using these genes and clinical features, which has
significantly higher prediction accuracy…
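A minimal sketch of the kind of model described here, L1-penalized logistic regression on imbalanced data tuned by AUC-ROC rather than the average misclassification rate, might look like the following with scikit-learn. The simulated data, sizes, and penalty grid are hypothetical, and the stability-selection variants (SS‐thr, SS‐test, SS‐rank, SS‐ranktest) are not reproduced here.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(1)
    n, p = 187, 1000                       # wide, imbalanced data (hypothetical sizes)
    X = rng.normal(size=(n, p))
    beta = np.zeros(p)
    beta[:5] = 2.0                         # a few truly informative features
    y = (X @ beta + rng.normal(size=n) > 4.0).astype(int)   # minority positive class

    lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", max_iter=5000)
    # Tuning by AUC-ROC instead of accuracy is more robust to class imbalance.
    search = GridSearchCV(lasso_logit, {"C": np.logspace(-2, 1, 10)},
                          scoring="roc_auc", cv=5).fit(X, y)
    print("selected features:", np.flatnonzero(search.best_estimator_.coef_))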
Advisors/Committee Members: Lewinger, Juan Pablo, Watanabe, Richard M. (Committee Chair), Siegmund, Kimberly D. (Committee Member), Stern, Mariana C. (Committee Member), Lv, Jinchi (Committee Member).
Subjects/Keywords: feature selection; penalized regression; imbalanced data; high dimensional data; stability selection
APA (6th Edition):
Ren, J. (2014). Robust feature selection with penalized regression in
imbalanced high dimensional data. (Doctoral Dissertation). University of Southern California. Retrieved from http://digitallibrary.usc.edu/cdm/compoundobject/collection/p15799coll3/id/443080/rec/5620
Chicago Manual of Style (16th Edition):
Ren, Jie. “Robust feature selection with penalized regression in
imbalanced high dimensional data.” 2014. Doctoral Dissertation, University of Southern California. Accessed March 02, 2021.
http://digitallibrary.usc.edu/cdm/compoundobject/collection/p15799coll3/id/443080/rec/5620.
MLA Handbook (7th Edition):
Ren, Jie. “Robust feature selection with penalized regression in
imbalanced high dimensional data.” 2014. Web. 02 Mar 2021.
Vancouver:
Ren J. Robust feature selection with penalized regression in
imbalanced high dimensional data. [Internet] [Doctoral dissertation]. University of Southern California; 2014. [cited 2021 Mar 02].
Available from: http://digitallibrary.usc.edu/cdm/compoundobject/collection/p15799coll3/id/443080/rec/5620.
Council of Science Editors:
Ren J. Robust feature selection with penalized regression in
imbalanced high dimensional data. [Doctoral Dissertation]. University of Southern California; 2014. Available from: http://digitallibrary.usc.edu/cdm/compoundobject/collection/p15799coll3/id/443080/rec/5620
16.
Waddell, Adrian.
Interactive Visualization and Exploration of High-Dimensional Data.
Degree: 2016, University of Waterloo
URL: http://hdl.handle.net/10012/10188
▼ Visualizing data is an essential part of good statistical practice. Plots are useful for revealing structure in the data, checking model assumptions, detecting outliers and finding unanticipated patterns. Post-analysis visualization is commonly used to communicate the results of statistical analyses. The availability of good statistical visualization software is key in effectively performing data analysis and in exploring and developing new methods for data visualization. Compared to static visualization, interactive visualization adds natural and powerful ways to explore the data. With interactive visualization an analyst can dive into the data and quickly react to visual cues by, for example, re-focusing and creating interactive queries of the data. Further, linking visual attributes of the data points such as color and size allows the analyst to compare different visual representations of the data such as histograms and scatterplots.
In this thesis, we explore and develop new interactive data visualization and exploration tools for high-dimensional data. The original focus of our research was a software implementation of navigation graphs. Navigation graphs are navigational infrastructures for controlled exploration of high-dimensional data. As part of this thesis, we developed the first interactive implementation of these navigation graphs called RnavGraph. With RnavGraph we explored various features for enhancing the usability of navigation graphs. We concluded that a powerful interactive scatterplot display and methods to deal with large graphs were two areas that would add great value to the navigation graph framework.
RnavGraph's scatterplot display proved to be particularly useful for data analysis and we continued our research with the design and implementation of a general-purpose interactive visualization toolkit called loon. The core contributions of loon are as follows. loon implements a general design for interactive statistical graphic displays that supports layering of visual information such as point objects, lines and polygons. These displays further support zooming, panning and selection, and modification and deactivation of plot elements and layers. Interactions with plots are provided with mouse and keyboard gestures as well as via command line control and with inspectors. These inspectors provide graphical user interfaces for modifying and overseeing the plots. loon also implements a novel dynamic linking mechanism that can be used to assign the plots that are to be linked and the linking rules at run time. Additionally, loon's design provides several different types of event bindings to add and customize functionality of loon's displays. In this thesis, we discuss loon's design and framework by giving concrete examples that show how these design choices can be used to efficiently explore and visualize data interactively. These examples revolve around loon's statistical interactive displays such as histograms, scatterplots and graph displays. We also illustrate how loon's design can be…
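loon itself is an R/Tcl toolkit, so no Python code can stand in for it, but the linking idea, one shared selection state driving several displays, can be illustrated statically. In the sketch below a single boolean mask plays the role of the linked selection and colors both a scatterplot and a histogram; all data and names are illustrative.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    x, y = rng.normal(size=200), rng.normal(size=200)
    selected = x > 1.0                    # the shared, "linked" selection state

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.scatter(x[~selected], y[~selected], color="grey")
    ax1.scatter(x[selected], y[selected], color="red")       # highlighted points
    ax2.hist([y[~selected], y[selected]], stacked=True,
             color=["grey", "red"])       # the same selection in another display
    plt.show()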
Subjects/Keywords: Interactive Data Visualization; High-dimensional Data; Statistical Visualization
APA (6th Edition):
Waddell, A. (2016). Interactive Visualization and Exploration of High-Dimensional Data. (Thesis). University of Waterloo. Retrieved from http://hdl.handle.net/10012/10188
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Waddell, Adrian. “Interactive Visualization and Exploration of High-Dimensional Data.” 2016. Thesis, University of Waterloo. Accessed March 02, 2021.
http://hdl.handle.net/10012/10188.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Waddell, Adrian. “Interactive Visualization and Exploration of High-Dimensional Data.” 2016. Web. 02 Mar 2021.
Vancouver:
Waddell A. Interactive Visualization and Exploration of High-Dimensional Data. [Internet] [Thesis]. University of Waterloo; 2016. [cited 2021 Mar 02].
Available from: http://hdl.handle.net/10012/10188.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Waddell A. Interactive Visualization and Exploration of High-Dimensional Data. [Thesis]. University of Waterloo; 2016. Available from: http://hdl.handle.net/10012/10188
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

Temple University
17.
Lou, Qiang.
LEARNING FROM INCOMPLETE HIGH-DIMENSIONAL DATA.
Degree: PhD, 2013, Temple University
URL: http://digital.library.temple.edu/u?/p245801coll10,214785
▼ Computer and Information Science
Data sets with irrelevant and redundant features and a large fraction of missing values are common in real-life applications. Learning from such data usually requires some preprocessing, such as selecting informative features and imputing missing values based on the observed data. These processes can provide more accurate and more efficient prediction as well as a better understanding of the data distribution. In my dissertation I describe my work on both of these aspects, along with my follow-up work on feature selection in incomplete datasets without imputing missing values. In the last part of my dissertation, I present my current work on the more challenging situation where the high-dimensional data evolves in time. The first two parts of my dissertation consist of methods that handle such data in a straightforward way: imputing missing values first, and then applying a traditional feature selection method to select informative features. We proposed two novel methods, one for imputing missing values and the other for selecting informative features. The imputation method fills in missing attributes by exploiting temporal correlation of attributes, correlations among multiple attributes collected at the same time and space, and spatial correlations among attributes from multiple sources. The proposed feature selection method aims to find a minimum subset of the most informative variables for classification/regression by efficiently approximating the Markov Blanket, which is a set of variables that can shield a certain variable from the target. In the third part, I present how to perform feature selection in incomplete high-dimensional data without imputation, since imputation methods only work well when data is missing completely at random, when the fraction of missing values is small, or when there is prior knowledge about the data distribution. We define the objective function of the uncertainty margin-based feature selection method to maximize each instance's uncertainty margin in its own relevant subspace. In the optimization, we take into account the uncertainty of each instance due to the missing values. The experimental results on synthetic and 6 benchmark data sets with few missing values (less than 25%) provide evidence that our method can select features as accurate as those chosen by the alternative methods that apply an imputation method first. However, when there is a large fraction of missing values (more than 25%) in the data, our feature selection method outperforms the alternatives, which impute missing values first. In the fourth part, I introduce my method for handling the more challenging situation where the high-dimensional data varies in time. The existing way to handle such data is to flatten the temporal data into a single static data matrix and then apply a traditional feature selection method. In order to keep the dynamics in the time series data, our method avoids flattening the data in advance. We propose a way to measure the distance between multivariate temporal data from…
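The "straightforward way" described above, impute first and then select, corresponds to a simple pipeline. As a hedged sketch with scikit-learn on synthetic data (mean imputation and a mutual-information filter stand in for the author's more elaborate methods):

    import numpy as np
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 50))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    X[rng.random(X.shape) < 0.2] = np.nan         # 20% missing completely at random

    pipe = make_pipeline(
        SimpleImputer(strategy="mean"),           # step 1: impute missing values
        SelectKBest(mutual_info_classif, k=10),   # step 2: keep 10 informative features
    )
    X_reduced = pipe.fit_transform(X, y)
    print(X_reduced.shape)                        # (200, 10)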
Advisors/Committee Members: Obradovic, Zoran, Vucetic, Slobodan, Latecki, Longin, Davey, Adam.
Subjects/Keywords: Computer science; data mining; feature selection; high dimensional data; incomplete data; machine learning
APA (6th Edition):
Lou, Q. (2013). LEARNING FROM INCOMPLETE HIGH-DIMENSIONAL DATA. (Doctoral Dissertation). Temple University. Retrieved from http://digital.library.temple.edu/u?/p245801coll10,214785
Chicago Manual of Style (16th Edition):
Lou, Qiang. “LEARNING FROM INCOMPLETE HIGH-DIMENSIONAL DATA.” 2013. Doctoral Dissertation, Temple University. Accessed March 02, 2021.
http://digital.library.temple.edu/u?/p245801coll10,214785.
MLA Handbook (7th Edition):
Lou, Qiang. “LEARNING FROM INCOMPLETE HIGH-DIMENSIONAL DATA.” 2013. Web. 02 Mar 2021.
Vancouver:
Lou Q. LEARNING FROM INCOMPLETE HIGH-DIMENSIONAL DATA. [Internet] [Doctoral dissertation]. Temple University; 2013. [cited 2021 Mar 02].
Available from: http://digital.library.temple.edu/u?/p245801coll10,214785.
Council of Science Editors:
Lou Q. LEARNING FROM INCOMPLETE HIGH-DIMENSIONAL DATA. [Doctoral Dissertation]. Temple University; 2013. Available from: http://digital.library.temple.edu/u?/p245801coll10,214785
18.
Shou, Haochang.
Statistical Methods for Structured Multilevel Functional Data: Estimation and Reliability.
Degree: 2014, Johns Hopkins University
URL: http://jhir.library.jhu.edu/handle/1774.2/37867
▼ The thesis investigates a specific type of functional
data with multilevel structures induced by complex experimental designs. Novel statistical methods based on principal component analysis that account for different layers of correlations in the
data are introduced. A robust metric is proposed to evaluate the reproducibility of replicated functional and imaging studies. Shrinkage-based methods are extended to functional and imaging
data with no or few replicates, and studies with low reliability. The proposed estimator is shown to correct for measurement error and improve prediction at the
subject level by borrowing strength from the population average. Methods have been motivated by and applied to
high-throughput physical activity measurements and several brain imaging studies based on different modalities including functional magnetic resonance imaging (fMRI), voxel-based morphometry, and diffusion tensor imaging (DTI). Fast algorithms are developed to expand the
applicability of the methods proposed to ultra-
high dimensional data.
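The shrinkage idea, pulling a noisy subject-level estimate toward the population average, can be shown in a toy numpy sketch. The reliability-style weight below is a generic choice with the variance components treated as known; it is not the thesis's estimator, and all sizes are illustrative.

    import numpy as np

    rng = np.random.default_rng(4)
    n_subj, n_time = 30, 50
    pop_mean = np.sin(np.linspace(0, np.pi, n_time))          # population curve
    truth = pop_mean + 0.5 * rng.normal(size=(n_subj, 1))     # subject-specific shifts
    obs = truth + rng.normal(size=(n_subj, n_time))           # one noisy curve each

    sigma2_b = 0.25        # between-subject variance (known in this toy)
    sigma2_e = 1.00        # within-subject noise variance
    w = sigma2_b / (sigma2_b + sigma2_e)

    # Borrow strength: shrink each observed curve toward the population average.
    shrunk = w * obs + (1 - w) * pop_mean
    print("MSE raw:   ", np.mean((obs - truth) ** 2))
    print("MSE shrunk:", np.mean((shrunk - truth) ** 2))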
Advisors/Committee Members: Calabresi, Peter A (advisor).
Subjects/Keywords: functional data analysis; multilevel and structured data; high-dimensional data; imaging reproducibility; shrinkage estimation
APA (6th Edition):
Shou, H. (2014). Statistical Methods for Structured Multilevel Functional Data: Estimation and Reliability. (Thesis). Johns Hopkins University. Retrieved from http://jhir.library.jhu.edu/handle/1774.2/37867
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Shou, Haochang. “Statistical Methods for Structured Multilevel Functional Data: Estimation and Reliability.” 2014. Thesis, Johns Hopkins University. Accessed March 02, 2021.
http://jhir.library.jhu.edu/handle/1774.2/37867.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Shou, Haochang. “Statistical Methods for Structured Multilevel Functional Data: Estimation and Reliability.” 2014. Web. 02 Mar 2021.
Vancouver:
Shou H. Statistical Methods for Structured Multilevel Functional Data: Estimation and Reliability. [Internet] [Thesis]. Johns Hopkins University; 2014. [cited 2021 Mar 02].
Available from: http://jhir.library.jhu.edu/handle/1774.2/37867.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Shou H. Statistical Methods for Structured Multilevel Functional Data: Estimation and Reliability. [Thesis]. Johns Hopkins University; 2014. Available from: http://jhir.library.jhu.edu/handle/1774.2/37867
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

NSYSU
19.
Tai, Chiech-an.
An Automatic Data Clustering Algorithm based on Differential Evolution.
Degree: Master, Computer Science and Engineering, 2013, NSYSU
URL: http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0730113-152814
▼ As one of the traditional optimization problems, clustering still plays a vital role in research, both theoretically and practically. Although many successful clustering algorithms have been presented, most (if not all) need to be given the number of clusters before the clustering procedure is invoked. A novel differential evolution based clustering algorithm is presented in this paper to solve the problem of automatically determining the number of clusters. The proposed algorithm, called enhanced differential evolution for automatic clustering (EDEAC), leverages the strengths of two technologies: a novel histogram-based analysis technique for finding the approximate number of clusters and a heuristic search algorithm for fine-tuning the automatic clustering results. The experimental results show that the proposed algorithm can not only determine the approximate number of clusters automatically, but can also provide an accurate number of clusters rapidly, even for high dimensional datasets, compared to other existing automatic clustering algorithms.
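The histogram-based step can be given a flavor in a few lines: project the data onto one coordinate, histogram it, and count runs of well-populated bins as approximate clusters. This is only a loose illustration of the idea on easy synthetic data; the actual EDEAC analysis and its differential-evolution fine-tuning are not reproduced here, and the bin count and threshold are arbitrary.

    import numpy as np

    rng = np.random.default_rng(5)
    # Three well-separated Gaussian clusters along the first axis (toy data).
    X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0, 3, 6)])

    counts, _ = np.histogram(X[:, 0], bins=30)
    occupied = counts > 0.25 * counts.max()        # bins with substantial mass
    # Each run of consecutive occupied bins is taken as one cluster.
    n_clusters = int(occupied[0]) + int(np.sum(occupied[1:] & ~occupied[:-1]))
    print("approximate number of clusters:", n_clusters)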
Advisors/Committee Members: Chun-Wei Tsai (chair), Ming-Chao Chiang (committee member), Chu-Sing Yang (chair), Tzung-Pei Hong (chair).
Subjects/Keywords: automatic clustering; data clustering; high-dimensional dataset; histogram analysis; differential evolution
APA (6th Edition):
Tai, C. (2013). An Automatic Data Clustering Algorithm based on Differential Evolution. (Thesis). NSYSU. Retrieved from http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0730113-152814
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Tai, Chiech-an. “An Automatic Data Clustering Algorithm based on Differential Evolution.” 2013. Thesis, NSYSU. Accessed March 02, 2021.
http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0730113-152814.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Tai, Chiech-an. “An Automatic Data Clustering Algorithm based on Differential Evolution.” 2013. Web. 02 Mar 2021.
Vancouver:
Tai C. An Automatic Data Clustering Algorithm based on Differential Evolution. [Internet] [Thesis]. NSYSU; 2013. [cited 2021 Mar 02].
Available from: http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0730113-152814.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Tai C. An Automatic Data Clustering Algorithm based on Differential Evolution. [Thesis]. NSYSU; 2013. Available from: http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0730113-152814
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

Tulane University
20.
Xu, Chao.
Hypothesis Testing for High-Dimensional Regression Under Extreme Phenotype Sampling of Continuous Traits.
Degree: 2018, Tulane University
URL: https://digitallibrary.tulane.edu/islandora/object/tulane:78817
▼ Extreme phenotype sampling (EPS) is a broadly used design to identify candidate genetic factors contributing to the variation of quantitative traits. By enriching the signals in the extreme phenotypic samples within the top and bottom percentiles, EPS can boost the study power compared with random sampling of the same sample size. The existing statistical methods for EPS data test the variants/regions individually. However, many disorders are caused by multiple genetic factors. Therefore, it is critical to simultaneously model the effects of genetic factors, which may increase the power of current genetic studies and identify novel disease-associated genetic factors in EPS. The challenge of the simultaneous analysis of genetic data is that the number (p ~10,000) of genetic factors is typically greater than the sample size (n ~1,000) in a single study. The standard linear model would be inappropriate for this p>n problem due to the rank deficiency of the design matrix. An alternative solution is to apply a penalized regression method – the least absolute shrinkage and selection operator (LASSO).
LASSO can deal with this high-dimensional (p>n) problem by forcing certain regression coefficients to be zero. Although the application of LASSO in genetic studies under random sampling has been widely studied, its statistical inference and testing under EPS remain unknown. We propose a novel sparse model (EPS-LASSO) with a hypothesis test for high-dimensional regression under EPS, based on a decorrelated score function, to investigate genetic associations, including gene expression and rare variant analyses. Comprehensive simulations show that EPS-LASSO outperforms existing methods, with superior power when the effects are large and with stable type I error and FDR control. Together with a real data analysis of a genetic study of obesity, our results indicate that EPS-LASSO is an effective method for EPS data analysis, which can account for correlated predictors.
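The sampling design itself is easy to emulate: keep only the phenotypic extremes, then fit a penalized regression to the retained samples. The sketch below uses an ordinary lasso on synthetic data purely for illustration; the EPS-LASSO decorrelated-score test is not implemented here, and all sizes and thresholds are arbitrary.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(6)
    n, p = 1000, 2000
    X = rng.normal(size=(n, p))
    y = X[:, :3] @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

    # EPS design: retain only samples in the top and bottom 10% of the phenotype.
    lo, hi = np.quantile(y, [0.10, 0.90])
    keep = (y <= lo) | (y >= hi)

    fit = Lasso(alpha=0.1).fit(X[keep], y[keep])
    print("nonzero coefficients:", np.flatnonzero(fit.coef_))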
Advisors/Committee Members: Deng, Hong-Wen (Thesis advisor), School of Public Health & Tropical Medicine Biostatistics and Bioinformatics (Degree granting institution).
Subjects/Keywords: extreme sampling; high-dimensional regression; genetic data analysis
APA (6th Edition):
Xu, C. (2018). Hypothesis Testing for High-Dimensional Regression Under Extreme Phenotype Sampling of Continuous Traits. (Thesis). Tulane University. Retrieved from https://digitallibrary.tulane.edu/islandora/object/tulane:78817
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Xu, Chao. “Hypothesis Testing for High-Dimensional Regression Under Extreme Phenotype Sampling of Continuous Traits.” 2018. Thesis, Tulane University. Accessed March 02, 2021.
https://digitallibrary.tulane.edu/islandora/object/tulane:78817.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Xu, Chao. “Hypothesis Testing for High-Dimensional Regression Under Extreme Phenotype Sampling of Continuous Traits.” 2018. Web. 02 Mar 2021.
Vancouver:
Xu C. Hypothesis Testing for High-Dimensional Regression Under Extreme Phenotype Sampling of Continuous Traits. [Internet] [Thesis]. Tulane University; 2018. [cited 2021 Mar 02].
Available from: https://digitallibrary.tulane.edu/islandora/object/tulane:78817.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Xu C. Hypothesis Testing for High-Dimensional Regression Under Extreme Phenotype Sampling of Continuous Traits. [Thesis]. Tulane University; 2018. Available from: https://digitallibrary.tulane.edu/islandora/object/tulane:78817
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
21.
Hwang, Sung Jin.
Geometric Representations of High Dimensional Random Data.
Degree: PhD, Electrical Engineering-Systems, 2012, University of Michigan
URL: http://hdl.handle.net/2027.42/96097
▼ This thesis introduces geometric representations relevant to the analysis of datasets of random vectors in
high dimension. These representations are used to study the behavior of near-neighbor clusters in the dataset, shortest paths through the dataset, and evolution of multivariate probability distributions over the dataset. The results in this thesis have wide applicability to machine learning problems and are illustrated for problems including: spectral clustering; dimensionality reduction; activity recognition; and video indexing and retrieval.
This thesis makes several contributions. The first contribution is the shortest path over random points in a Riemannian manifold. More precisely, we establish complete convergence results of power-weighted shortest path lengths in compact Riemannian manifolds to conformal deformation distances. These shortest path results are used to interpret and extend Coifman's anisotropic diffusion maps for clustering and dimensionality reduction. The second contribution is statistical manifolds that describe differences between curves evolving over a space of probability measures. A statistical manifold is a space of probability measures induced by the Fisher-Riemann metric. We propose to compare smoothly evolving probability distributions in a statistical manifold by the surface area of the region between a pair of curves. The surface area measure is applied to activity classification for human movements. The third contribution proposes a dimensionality reduction and cluster analysis framework that uses a quantum mechanical model. This model leads to a generalization of geometric clustering methods such as k-means and Laplacian eigenmap in which the logical equivalence relation "two points are in the same cluster" is relaxed to a probabilistic equivalence relation.
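Power-weighted shortest path lengths of the kind studied in the first contribution are straightforward to compute on a sample: build a k-nearest-neighbor graph whose edge weights are Euclidean distances raised to a power p > 1, and run a standard shortest-path routine on it. The data, k, and p below are arbitrary choices for illustration.

    import numpy as np
    from scipy.spatial.distance import cdist
    from scipy.sparse.csgraph import shortest_path

    rng = np.random.default_rng(7)
    X = rng.normal(size=(200, 5))        # sample from a hypothetical manifold
    k, p = 10, 2.0                       # neighbors per point, path-length power

    D = cdist(X, X)                      # pairwise Euclidean distances
    W = np.full_like(D, np.inf)          # inf marks "no edge" for csgraph
    idx = np.argsort(D, axis=1)[:, 1:k + 1]
    rows = np.repeat(np.arange(len(X)), k)
    W[rows, idx.ravel()] = D[rows, idx.ravel()] ** p    # power-weighted kNN edges

    L = shortest_path(W, method="D", directed=False)    # Dijkstra over the graph
    print("power-weighted path length between points 0 and 1:", L[0, 1])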
Advisors/Committee Members: Damelin, Steven B. (committee member), Hero Iii, Alfred O. (committee member), Gilbert, Anna Catherine (committee member), Nadakuditi, Rajesh Rao (committee member), Scott, Clayton D. (committee member).
Subjects/Keywords: High Dimensional Data; Engineering
APA (6th Edition):
Hwang, S. J. (2012). Geometric Representations of High Dimensional Random Data. (Doctoral Dissertation). University of Michigan. Retrieved from http://hdl.handle.net/2027.42/96097
Chicago Manual of Style (16th Edition):
Hwang, Sung Jin. “Geometric Representations of High Dimensional Random Data.” 2012. Doctoral Dissertation, University of Michigan. Accessed March 02, 2021.
http://hdl.handle.net/2027.42/96097.
MLA Handbook (7th Edition):
Hwang, Sung Jin. “Geometric Representations of High Dimensional Random Data.” 2012. Web. 02 Mar 2021.
Vancouver:
Hwang SJ. Geometric Representations of High Dimensional Random Data. [Internet] [Doctoral dissertation]. University of Michigan; 2012. [cited 2021 Mar 02].
Available from: http://hdl.handle.net/2027.42/96097.
Council of Science Editors:
Hwang SJ. Geometric Representations of High Dimensional Random Data. [Doctoral Dissertation]. University of Michigan; 2012. Available from: http://hdl.handle.net/2027.42/96097

University of Illinois – Urbana-Champaign
22.
Ouyang, Yunbo.
Scalable sparsity structure learning using Bayesian methods.
Degree: PhD, Statistics, 2018, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/101264
▼ Learning sparsity patterns in high dimensions is a great challenge in both implementation and theory. In this thesis we develop scalable Bayesian algorithms based on the EM algorithm and variational inference to learn sparsity structure in various models. Estimation consistency and selection consistency of our methods are established. First, a nonparametric Bayes estimator is proposed for the problem of estimating a sparse sequence based on Gaussian random variables. We adopt the popular two-group prior with one component being a point mass at zero, and the other component being a mixture of Gaussian distributions. Although the Gaussian family has been shown to be suboptimal for this problem, we find that Gaussian mixtures, with a proper choice of the means and mixing weights, have the desired asymptotic behavior, e.g., the corresponding posterior concentrates on balls with the desired minimax rate. Second, the above estimator can be applied directly to high dimensional linear classification. In theory, we not only build a bridge connecting the estimation error of the mean difference and the classification error in different scenarios, but also provide sufficient conditions for sub-optimal and optimal classifiers. Third, we study adaptive ridge regression for linear models. Adaptive ridge regression is closely related to the Bayesian variable selection problem with a Gaussian mixture spike-and-slab prior, because it resembles the EM algorithm developed in Wang et al. (2016) for that problem. The output of adaptive ridge regression can be used to construct a distribution estimator that approximates the posterior. We show that the approximate posterior has the desired concentration property and that the adaptive ridge regression estimator has the desired predictive error. Last, we propose a Bayesian approach to sparse principal components analysis (PCA). We show that our algorithm, which is based on variational approximation, achieves Bayesian selection consistency. Empirical studies have demonstrated the competitive performance of the proposed algorithm.
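The adaptive ridge recursion mentioned above can be written down compactly: refit a ridge regression with coordinate-wise penalty weights that grow as a coefficient shrinks, so that small coefficients are driven toward zero. The weight update below is a generic 1/(beta^2 + eps) choice used only for illustration; it is not claimed to be the exact update of Wang et al. (2016).

    import numpy as np

    rng = np.random.default_rng(8)
    n, p = 100, 200
    X = rng.normal(size=(n, p))
    beta_true = np.zeros(p)
    beta_true[:4] = 3.0
    y = X @ beta_true + rng.normal(size=n)

    lam, eps = 1.0, 1e-4
    w = np.ones(p)                                   # coordinate-wise penalty weights
    for _ in range(50):
        # Weighted ridge solve: (X'X + lam * diag(w)) beta = X'y
        beta = np.linalg.solve(X.T @ X + lam * np.diag(w), X.T @ y)
        w = 1.0 / (beta ** 2 + eps)                  # small beta -> heavier penalty
    print("surviving coefficients:", np.flatnonzero(np.abs(beta) > 0.5))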
Advisors/Committee Members: Liang, Feng (advisor), Liang, Feng (Committee Chair), Qu, Annie (committee member), Narisetty, Naveen N (committee member), Zhu, Ruoqing (committee member).
Subjects/Keywords: Bayesian statistics; high-dimensional data analysis; variable selection
APA (6th Edition):
Ouyang, Y. (2018). Scalable sparsity structure learning using Bayesian methods. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/101264
Chicago Manual of Style (16th Edition):
Ouyang, Yunbo. “Scalable sparsity structure learning using Bayesian methods.” 2018. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed March 02, 2021.
http://hdl.handle.net/2142/101264.
MLA Handbook (7th Edition):
Ouyang, Yunbo. “Scalable sparsity structure learning using Bayesian methods.” 2018. Web. 02 Mar 2021.
Vancouver:
Ouyang Y. Scalable sparsity structure learning using Bayesian methods. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2018. [cited 2021 Mar 02].
Available from: http://hdl.handle.net/2142/101264.
Council of Science Editors:
Ouyang Y. Scalable sparsity structure learning using Bayesian methods. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2018. Available from: http://hdl.handle.net/2142/101264

Texas A&M University
23.
Song, Qifan.
Variable Selection for Ultra High Dimensional Data.
Degree: PhD, Statistics, 2014, Texas A&M University
URL: http://hdl.handle.net/1969.1/153224
▼ Variable selection plays an important role in high dimensional data analysis. In this work, we first propose a Bayesian variable selection approach for ultra-
high dimensional linear regression based on the strategy of split-and-merge. The proposed approach consists of two stages: (i) split the ultra-
high dimensional data set into a number of lower
dimensional subsets and select relevant variables from each of the subsets, and (ii) aggregate the variables selected from each subset and then select relevant variables from the aggregated
data set. Since the proposed approach has an embarrassingly parallel structure, it can be easily implemented in a parallel architecture and applied to big
data problems with millions or more explanatory variables. Under mild conditions, we show that the proposed approach is consistent. That is, asymptotically, the true explanatory variables will be correctly identified by the proposed approach as the sample size becomes large. Extensive comparisons of the proposed approach have been made with the penalized likelihood approaches,
such as Lasso, elastic net, SIS and ISIS. The numerical results show that the proposed approach generally outperforms the penalized likelihood approaches. The models selected by the proposed approach tend to be more sparse and closer to the true model.
In the frequentist realm, penalized likelihood methods have been widely used in variable selection problems, where the penalty functions are typically symmetric about 0, continuous and nondecreasing in (0,∞). The second contribution of this work is that we propose a new penalized likelihood method, reciprocal Lasso (or in short, rLasso), based on a new class of penalty functions which are decreasing in (0,∞), discontinuous at 0, and converge to infinity when the coefficients approach zero. The new penalty functions give nearly zero coefficients infinite penalties; in contrast, the conventional penalty functions give nearly zero coefficients nearly zero penalties (e.g., Lasso and SCAD) or constant penalties (e.g., L0 penalty). This distinguishing feature makes rLasso very attractive for variable selection: it can effectively avoid selecting overly dense models. We establish the consistency of the rLasso for variable selection and coefficient estimation under both the low and
high dimensional settings. Since the rLasso penalty functions induce an objective function with multiple local minima, we also propose an efficient Monte Carlo optimization algorithm to solve the minimization problem. Our simulation results show that the rLasso outperforms other popular penalized likelihood methods, such as Lasso, SCAD, MCP, SIS, ISIS and EBIC. It can produce sparser and more accurate coefficient estimates, and has a higher probability of catching the true model.
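The split-and-merge strategy of the first contribution parallelizes naturally across feature blocks. A serial toy version is sketched below with an ordinary lasso standing in for the Bayesian subset-level selector used in the thesis; data sizes, block count, and penalty are arbitrary.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(9)
    n, p = 200, 10_000
    X = rng.normal(size=(n, p))
    y = X[:, :5] @ (2.0 * np.ones(5)) + rng.normal(size=n)

    # Stage (i): split the features into blocks and select within each block.
    candidates = []
    for block in np.array_split(np.arange(p), 20):
        fit = Lasso(alpha=0.2).fit(X[:, block], y)
        candidates.extend(block[np.flatnonzero(fit.coef_)])

    # Stage (ii): aggregate the survivors and select again on the merged set.
    cand = np.array(sorted(set(candidates)))
    final = cand[np.flatnonzero(Lasso(alpha=0.2).fit(X[:, cand], y).coef_)]
    print("selected variables:", final)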
Advisors/Committee Members: Liang, Faming (advisor), Carroll, Raymond (committee member), Johnson, Valen (committee member), Lahiri, Soumendra (committee member), Zhou, Jianxin (committee member).
Subjects/Keywords: High Dimensional Variable Selection; Big Data; Penalized Likelihood Approach; Posterior Consistency
APA (6th Edition):
Song, Q. (2014). Variable Selection for Ultra High Dimensional Data. (Doctoral Dissertation). Texas A&M University. Retrieved from http://hdl.handle.net/1969.1/153224
Chicago Manual of Style (16th Edition):
Song, Qifan. “Variable Selection for Ultra High Dimensional Data.” 2014. Doctoral Dissertation, Texas A&M University. Accessed March 02, 2021.
http://hdl.handle.net/1969.1/153224.
MLA Handbook (7th Edition):
Song, Qifan. “Variable Selection for Ultra High Dimensional Data.” 2014. Web. 02 Mar 2021.
Vancouver:
Song Q. Variable Selection for Ultra High Dimensional Data. [Internet] [Doctoral dissertation]. Texas A&M University; 2014. [cited 2021 Mar 02].
Available from: http://hdl.handle.net/1969.1/153224.
Council of Science Editors:
Song Q. Variable Selection for Ultra High Dimensional Data. [Doctoral Dissertation]. Texas A&M University; 2014. Available from: http://hdl.handle.net/1969.1/153224

Penn State University
24.
Guha Thakurta, Abhradeep.
Differentially Private Convex Optimization For Empirical Risk Minimization And High-dimensional Regression.
Degree: 2012, Penn State University
URL: https://submit-etda.libraries.psu.edu/catalog/16390
▼ Learning systems are the backbone of most web-scale advertisement and recommendation systems. Such systems rely on past inputs from users to decide on a particular advertisement or recommendation to be displayed to a new user. The way these learning systems work is by first recording past user responses (collectively called the training
data) and then learning a prediction model which decides on a relevant advertisement or recommendation for a new user. Often the training
data sets contain sensitive information about users (e.g., sexual orientation or marital status). Recent research has shown that these large-scale learning systems can inadvertently leak sensitive information about individual users in the training
data. This leads us to think about designing learning algorithms which do not leak "too much" information about any individual (user) in the training
data.
In this dissertation we design learning algorithms with rigorous privacy guarantees. We adhere to a formal, well-accepted notion of privacy called differential privacy. Differential privacy guarantees that an algorithm's output does not depend too much on the
data of any individual in the
data set. This is crucial in fields that handle sensitive
data, such as genomics, collaborative filtering, and economics.
In our work we design differentially private algorithms for the following two sets of learning problems: i) convex optimization for empirical risk minimization (the most common use of convex optimization in machine learning), and ii) sparse regression (a broad class of
high-
dimensional problems). For both these problems, we design differentially private algorithms that are provably (almost) as accurate as the best non-private algorithm.
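The definition of differential privacy is easier to see on a toy query than on the thesis's private ERM algorithms: the classic Laplace mechanism below releases a mean with noise calibrated to the query's sensitivity, so the output distribution barely changes when any single record changes. Bounds, epsilon, and data are illustrative.

    import numpy as np

    def private_mean(x, lo, hi, eps, rng):
        """Laplace mechanism: an eps-differentially private mean of values in [lo, hi]."""
        x = np.clip(x, lo, hi)
        sensitivity = (hi - lo) / len(x)      # max effect of changing one record
        return x.mean() + rng.laplace(scale=sensitivity / eps)

    rng = np.random.default_rng(10)
    ages = rng.integers(18, 90, size=10_000)  # hypothetical sensitive attribute
    print(private_mean(ages, lo=18, hi=90, eps=0.5, rng=rng))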
Advisors/Committee Members: Adam Davison Smith, Dissertation Advisor/Co-Advisor, Adam Davison Smith, Committee Chair/Co-Chair, Daniel Kifer, Committee Member, Aleksandra B Slavkovic, Committee Member, Sofya Raskhodnikova, Committee Member.
Subjects/Keywords: Data Privacy; Differential Privacy; Machine Learning; High-dimensional Statistics; Sparse Regression
APA (6th Edition):
Guha Thakurta, A. (2012). Differentially Private Convex Optimization For Empirical Risk Minimization And High-dimensional Regression. (Thesis). Penn State University. Retrieved from https://submit-etda.libraries.psu.edu/catalog/16390
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Guha Thakurta, Abhradeep. “Differentially Private Convex Optimization For Empirical Risk Minimization And High-dimensional Regression.” 2012. Thesis, Penn State University. Accessed March 02, 2021.
https://submit-etda.libraries.psu.edu/catalog/16390.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Guha Thakurta, Abhradeep. “Differentially Private Convex Optimization For Empirical Risk Minimization And High-dimensional Regression.” 2012. Web. 02 Mar 2021.
Vancouver:
Guha Thakurta A. Differentially Private Convex Optimization For Empirical Risk Minimization And High-dimensional Regression. [Internet] [Thesis]. Penn State University; 2012. [cited 2021 Mar 02].
Available from: https://submit-etda.libraries.psu.edu/catalog/16390.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Guha Thakurta A. Differentially Private Convex Optimization For Empirical Risk Minimization And High-dimensional Regression. [Thesis]. Penn State University; 2012. Available from: https://submit-etda.libraries.psu.edu/catalog/16390
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

Penn State University
25.
Chu, Wanghuan.
Feature Screening For Ultra-high Dimensional Longitudinal Data.
Degree: 2016, Penn State University
URL: https://submit-etda.libraries.psu.edu/catalog/3197xm04j
▼ High and ultrahigh
dimensional data analysis is now receiving more and more attention in many scientific fields. Various variable selection methods have been proposed for
high dimensional data where feature dimension p increases with sample size n at polynomial rates. In ultrahigh
dimensional setting, p is allowed to grow with n at an exponential rate. Instead of jointly selecting active covariates, a more effective approach is to incorporate a screening rule that aims at filtering out unimportant covariates through marginal regression techniques. This thesis is concerned with feature screening methods for ultrahigh
dimensional longitudinal
data. Such
data occur frequently in longitudinal genetic studies, where phenotypes and some covariates are measured repeatedly over a certain time period. Along with the genetic measurements, longitudinal genetic studies provide valuable resources for exploring primary genetic and environmental factors that influence complex phenotypes over time. The proposed statistical methods in this work allow us not only to identify genetic determinants of common complex disease, but also to understand at which stage of human life do the genetic determinants become important.
In Chapter 3, we propose a new feature screening procedure for ultrahigh
dimensional time-varying coefficient models. We present an effective screening rule based on marginal B-spline regression that incorporates time-varying variance and within-
subject correlations. We show that under certain conditions, this procedure possesses the sure screening property, and the false selection rates can be controlled. We demonstrate through Monte Carlo simulation studies how within-subject variability can be harnessed to increase screening accuracy. Furthermore, we illustrate the proposed screening rule via an empirical analysis of the Childhood Asthma Management Program (CAMP)
data. Our empirical analysis clearly shows that the proposed approach is especially useful for such studies as children change quite extensively over a four-year period with highly nonlinear patterns.
In Chapter 4, we study screening rules for ultrahigh
dimensional covariates that are potentially associated with random effects. Mixed effects models are popular for taking into account the dependence structure of longitudinal
data, as
subject-specific random effects can explicitly account for within-
subject correlation. We propose a two-step screening procedure for generalized varying-coefficient mixed effects models. The two-step procedure screens fixed effects first and then random effects. We conduct simulation studies to assess the finite sample performance of this two-step screening approach for continuous response with linear regression, binary response with logistic regression, count response with Poisson regression, and ordinal response with a proportional-odds cumulative logit model. In a real
data application, we apply this procedure to
data from the Framingham Heart Study (FHS), and explore the genetic and environmental effects on body mass index (BMI), obesity…
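Stripped of the B-spline and within-subject-correlation machinery, the basic marginal screening rule underlying such procedures is short: rank the features by marginal association with the response and keep the top d. A cross-sectional sketch on synthetic data, with the screening size d a conventional n / log n choice:

    import numpy as np

    rng = np.random.default_rng(11)
    n, p = 150, 5000
    X = rng.normal(size=(n, p))
    y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

    # Marginal screening: score each feature by |correlation| with the response.
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    score = np.abs(Xc.T @ (y - y.mean())) / n
    d = int(n / np.log(n))                     # screening size, d = n / log n
    keep = np.argsort(score)[::-1][:d]
    print("screened set contains the true features:", {0, 1} <= set(keep.tolist()))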
Advisors/Committee Members: Runze Li, Dissertation Advisor/Co-Advisor, Runze Li, Committee Chair/Co-Chair, Matthew Reimherr, Committee Member, Lingzhou Xue, Committee Member, Donna Coffman, Outside Member.
Subjects/Keywords: Feature screening; ultra-high dimensional data; longitudinal genetic study
APA (6th Edition):
Chu, W. (2016). Feature Screening For Ultra-high Dimensional Longitudinal Data. (Thesis). Penn State University. Retrieved from https://submit-etda.libraries.psu.edu/catalog/3197xm04j
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Chu, Wanghuan. “Feature Screening For Ultra-high Dimensional Longitudinal Data.” 2016. Thesis, Penn State University. Accessed March 02, 2021.
https://submit-etda.libraries.psu.edu/catalog/3197xm04j.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Chu, Wanghuan. “Feature Screening For Ultra-high Dimensional Longitudinal Data.” 2016. Web. 02 Mar 2021.
Vancouver:
Chu W. Feature Screening For Ultra-high Dimensional Longitudinal Data. [Internet] [Thesis]. Penn State University; 2016. [cited 2021 Mar 02].
Available from: https://submit-etda.libraries.psu.edu/catalog/3197xm04j.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Chu W. Feature Screening For Ultra-high Dimensional Longitudinal Data. [Thesis]. Penn State University; 2016. Available from: https://submit-etda.libraries.psu.edu/catalog/3197xm04j
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

Penn State University
26.
Li, Jiahan.
THE BAYESIAN LASSO, BAYESIAN SCAD AND BAYESIAN GROUP LASSO WITH APPLICATIONS TO GENOME-WIDE ASSOCIATION STUDIES
.
Degree: 2011, Penn State University
URL: https://submit-etda.libraries.psu.edu/catalog/12143
▼ Recently, genome-wide association studies (GWAS) have successfully identified genes that may affect complex traits or diseases. However, the standard statistical tests for each single-nucleotide polymorphism (SNP) separately are too simple to elucidate a comprehensive picture of the genetic architecture of phenotypes. A simultaneous analysis of a large number of SNPs, although statistically challenging, especially with a small number of samples, is crucial for genetic modeling. This is a variable selection problem for
high-
dimensional data, with SNPs as the predictors and phenotypes as the responses in our statistical model.
In genome-wide association studies, phenotypical values are either collected at a single time point for each
subject, or collected repeatedly over a period at
subject-specific time points. When the response variable is univariate, we present two-stage procedures designed for the problems where the number of predictors greatly exceeds the number of observations. At the first stage, we preprocess the
data such that the variable selection procedure can proceed in an accurate and efficient manner. At the second stage, variable selection techniques based on penalized linear regression are applied to the preprocessed
data.
When the longitudinal phenotype of interest is measured at irregularly spaced time points, we develop a Bayesian regularized estimation procedure for the variable selection of nonparametric varying-coefficient models. Our method can simultaneously select important predictors and estimate their time-varying effects. We approximate time-varying effects by Legendre polynomials, and present a Bayesian hierarchical model with group lasso penalties that encourage sparse solutions at the group level.
In both scenarios, our models obviate the choice of the tuning parameters by imposing diffuse hyperpriors on them and estimating them along with other parameters, and provide not only point estimates but also interval estimates of all parameters. Markov chain Monte Carlo (MCMC) algorithms are developed to simulate the parameters from their posterior distributions. The proposed methods are illustrated with numerical examples and a real
data set from the Framingham Heart Study.
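The Bayesian lasso rests on the correspondence between an L1 penalty and a Laplace (double-exponential) prior: the posterior mode under that prior is exactly the lasso solution. A quick numerical check of this correspondence (a generic optimizer versus scikit-learn's lasso, not the MCMC samplers of the thesis; the penalty is matched to sklearn's objective as lambda = n * alpha, and the data are synthetic):

    import numpy as np
    from scipy.optimize import minimize
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(12)
    n, p = 80, 10
    X = rng.normal(size=(n, p))
    y = X[:, 0] - X[:, 1] + rng.normal(size=n)
    alpha = 0.1

    # Negative log-posterior: Gaussian likelihood plus Laplace prior on beta.
    def neg_log_post(b):
        return 0.5 * np.sum((y - X @ b) ** 2) + n * alpha * np.sum(np.abs(b))

    map_est = minimize(neg_log_post, np.zeros(p), method="Powell").x
    lasso_est = Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_
    print(np.round(map_est, 2))
    print(np.round(lasso_est, 2))      # the posterior mode tracks the lasso solution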
Advisors/Committee Members: Rongling Wu, Dissertation Advisor/Co-Advisor, Rongling Wu, Committee Chair/Co-Chair, Runze Li, Committee Chair/Co-Chair, Bruce G Lindsay, Committee Member, Tao Yao, Committee Member.
Subjects/Keywords: lasso; variable selection; Bayesian approach; high-dimensional data
APA (6th Edition):
Li, J. (2011). THE BAYESIAN LASSO, BAYESIAN SCAD AND BAYESIAN GROUP LASSO WITH APPLICATIONS TO GENOME-WIDE ASSOCIATION STUDIES
. (Thesis). Penn State University. Retrieved from https://submit-etda.libraries.psu.edu/catalog/12143
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Li, Jiahan. “THE BAYESIAN LASSO, BAYESIAN SCAD AND BAYESIAN GROUP LASSO WITH APPLICATIONS TO GENOME-WIDE ASSOCIATION STUDIES
.” 2011. Thesis, Penn State University. Accessed March 02, 2021.
https://submit-etda.libraries.psu.edu/catalog/12143.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Li, Jiahan. “THE BAYESIAN LASSO, BAYESIAN SCAD AND BAYESIAN GROUP LASSO WITH APPLICATIONS TO GENOME-WIDE ASSOCIATION STUDIES
.” 2011. Web. 02 Mar 2021.
Vancouver:
Li J. THE BAYESIAN LASSO, BAYESIAN SCAD AND BAYESIAN GROUP LASSO WITH APPLICATIONS TO GENOME-WIDE ASSOCIATION STUDIES
. [Internet] [Thesis]. Penn State University; 2011. [cited 2021 Mar 02].
Available from: https://submit-etda.libraries.psu.edu/catalog/12143.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Li J. THE BAYESIAN LASSO, BAYESIAN SCAD AND BAYESIAN GROUP LASSO WITH APPLICATIONS TO GENOME-WIDE ASSOCIATION STUDIES
. [Thesis]. Penn State University; 2011. Available from: https://submit-etda.libraries.psu.edu/catalog/12143
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

University of California – San Diego
27.
Hou, Jue.
Modern Statistical Methods for Complex Survival Data.
Degree: Mathematics, 2019, University of California – San Diego
URL: http://www.escholarship.org/uc/item/2qj8m7vs
▼ With the booming of big complex data, various statistical methods and data science techniques have been developed to retrieve valuable information from them. Progress has been slower with survival data due to the additional difficulty of censoring and truncation. Except for a few straightforward extensions, most modern learning methods have been absent from survival analysis for years since their invention. The theory on the survival versions of those methods falls further behind still. There is strong demand for computationally efficient and theoretically reliable methods for big complex data with time-to-event outcomes in the various health-related fields into which immense resources have been poured. This thesis is devoted to incorporating censoring and truncation into state-of-the-art statistical methodology and theory, to promote the evolution of survival analysis and support medical research with up-to-date tools. In Chapter 1, I study the mixture cure-rate model with left truncation and right-censoring. We propose a Nonparametric Maximum Likelihood Estimation (NPMLE) approach to effectively handle the truncation issue. We adopt an efficient and stable EM algorithm. We are able to give a closed-form variance estimator giving rise to valid inference. In Chapter 2, I study estimation and inference for the Fine-Gray competing risks model with high-dimensional covariates. We develop confidence intervals based on a one-step bias-correction to an initial regularized estimator. We lay down a methodological and theoretical framework for the one-step bias-corrected estimator with the partial likelihood. In Chapter 3, I study inference on the treatment effect with a censored time-to-event outcome while adjusting for high-dimensional covariates. We propose an orthogonal score method to construct honest confidence intervals for the treatment effect. With a slight modification, we obtain a doubly robust estimator extremely tolerant to both estimation inconsistency and volatility. All the methods in the aforementioned chapters are tested through extensive numerical experiments and applied to real data of genuine medical interest.
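All of the methods above must contend with right-censoring, the basic feature that distinguishes survival data. As a far simpler illustration than the NPMLE or Fine-Gray machinery of the thesis, the Kaplan-Meier survival estimator fits in a few lines of numpy on simulated censored data:

    import numpy as np

    rng = np.random.default_rng(13)
    t_event = rng.exponential(5.0, size=300)      # hypothetical event times
    t_cens = rng.exponential(8.0, size=300)       # independent censoring times
    time = np.minimum(t_event, t_cens)
    event = (t_event <= t_cens).astype(float)     # 1 = event observed, 0 = censored

    # Kaplan-Meier: product over ordered times of (1 - d_i / n_i).
    order = np.argsort(time)
    time, event = time[order], event[order]
    at_risk = np.arange(len(time), 0, -1)         # risk-set size at each ordered time
    surv = np.cumprod(1.0 - event / at_risk)
    print("estimated S(5):", surv[np.searchsorted(time, 5.0) - 1])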
Subjects/Keywords: Mathematics; Statistics; Average treatment effect; High-dimensional data; Inference; Left-truncation
APA (6th Edition):
Hou, J. (2019). Modern Statistical Methods for Complex Survival Data. (Thesis). University of California – San Diego. Retrieved from http://www.escholarship.org/uc/item/2qj8m7vs
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Hou, Jue. “Modern Statistical Methods for Complex Survival Data.” 2019. Thesis, University of California – San Diego. Accessed March 02, 2021.
http://www.escholarship.org/uc/item/2qj8m7vs.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Hou, Jue. “Modern Statistical Methods for Complex Survival Data.” 2019. Web. 02 Mar 2021.
Vancouver:
Hou J. Modern Statistical Methods for Complex Survival Data. [Internet] [Thesis]. University of California – San Diego; 2019. [cited 2021 Mar 02].
Available from: http://www.escholarship.org/uc/item/2qj8m7vs.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Hou J. Modern Statistical Methods for Complex Survival Data. [Thesis]. University of California – San Diego; 2019. Available from: http://www.escholarship.org/uc/item/2qj8m7vs
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

Victoria University of Wellington
28.
Tran, Binh Ngan.
Evolutionary Computation for Feature Manipulation in Classification on High-dimensional Data.
Degree: 2018, Victoria University of Wellington
URL: http://hdl.handle.net/10063/7078
High-dimensional data appears more and more often in machine learning, especially in classification tasks. With thousands of features, these datasets challenge learning algorithms not only through the curse of dimensionality but also through the presence of many irrelevant and redundant features. Feature selection and feature construction (feature manipulation, for short) are therefore essential techniques for preprocessing such datasets. While feature selection aims to select relevant features, feature construction builds high-level features from the original ones to better represent the target concept. Both can reduce dimensionality and improve the performance of learning algorithms in terms of classification accuracy and computation time.
Although feature manipulation has been studied for decades, the task remains challenging on high-dimensional data because of the huge search space. Existing methods usually stagnate in local optima and/or require high computation time. Evolutionary computation techniques are well known for their global search. Particle swarm optimisation (PSO) and genetic programming (GP) have shown promise in feature selection and feature construction, respectively; however, applying these techniques to high-dimensional data usually demands substantial memory and computation time.
The overall goal of this thesis is to investigate new approaches to using PSO for feature selection and GP for feature construction on high-dimensional classification problems. The thesis focuses on incorporating a variety of strategies into the evolutionary process and on developing new PSO and GP representations to improve the effectiveness and efficiency of PSO and GP for feature manipulation on high-dimensional data.
This thesis proposes a new PSO-based feature selection approach for high-dimensional data that incorporates a new local search to balance the global and local search of PSO. A hybrid wrapper-filter evaluation method, which can be sped up within the local search, helps PSO achieve better performance, scalability, and robustness on high-dimensional data. The results show that the proposed method significantly outperforms the compared methods in 80% of the cases, with an increase of up to 16% in average accuracy, while reducing the number of features by one to two orders of magnitude.
This thesis develops the first PSO-based feature selection via discretisation method, which performs multivariate discretisation and feature selection in a single stage and achieves better solutions than applying the two techniques separately in two stages. Two new PSO representations are proposed to evolve cut-points for multiple features simultaneously. The results show that the proposed method selects fewer than 4.6% of the features in all cases and improves classification performance by 5% to 23% in most cases.
This thesis proposes the first clustering-based feature construction method to improve the performance of single-tree GP on high-dimensional data. A new feature clustering…
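For intuition about the wrapper approach above, here is a minimal sketch of a continuous-encoding PSO for feature selection with a cross-validated k-NN wrapper fitness. The constants, the size penalty, and the synthetic data are our own illustrative assumptions, not the thesis's design (which adds local search and hybrid wrapper-filter evaluation):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=50, n_informative=8,
                           random_state=0)

def fitness(mask):
    # wrapper evaluation: cross-validated k-NN accuracy,
    # lightly penalising large feature subsets
    if not mask.any():
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(5), X[:, mask], y, cv=3).mean()
    return acc - 0.01 * mask.mean()

n_particles, n_feat, iters = 20, X.shape[1], 30
pos = rng.random((n_particles, n_feat))   # position in [0,1]^d; feature j is
vel = np.zeros_like(pos)                  # selected when pos[:, j] > 0.5
pbest = pos.copy()
pbest_fit = np.array([fitness(p > 0.5) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(iters):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 0.0, 1.0)
    fit = np.array([fitness(p > 0.5) for p in pos])
    better = fit > pbest_fit
    pbest[better], pbest_fit[better] = pos[better], fit[better]
    gbest = pbest[pbest_fit.argmax()].copy()

print("selected features:", np.flatnonzero(gbest > 0.5))

Each particle encodes a point in [0,1]^d, decoded to a feature subset by thresholding at 0.5; the small size penalty in the fitness nudges the swarm toward smaller subsets, a much cruder mechanism than the dimensionality reduction the thesis reports.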
Advisors/Committee Members: Zhang, Mengjie, Xue, Bing.
Subjects/Keywords: Evolutionary Computation; Feature selection; Feature construction; Classification; High-dimensional data
APA (6th Edition):
Tran, B. N. (2018). Evolutionary Computation for Feature Manipulation in Classification on High-dimensional Data. (Doctoral Dissertation). Victoria University of Wellington. Retrieved from http://hdl.handle.net/10063/7078
Chicago Manual of Style (16th Edition):
Tran, Binh Ngan. “Evolutionary Computation for Feature Manipulation in Classification on High-dimensional Data.” 2018. Doctoral Dissertation, Victoria University of Wellington. Accessed March 02, 2021.
http://hdl.handle.net/10063/7078.
MLA Handbook (7th Edition):
Tran, Binh Ngan. “Evolutionary Computation for Feature Manipulation in Classification on High-dimensional Data.” 2018. Web. 02 Mar 2021.
Vancouver:
Tran BN. Evolutionary Computation for Feature Manipulation in Classification on High-dimensional Data. [Internet] [Doctoral dissertation]. Victoria University of Wellington; 2018. [cited 2021 Mar 02].
Available from: http://hdl.handle.net/10063/7078.
Council of Science Editors:
Tran BN. Evolutionary Computation for Feature Manipulation in Classification on High-dimensional Data. [Doctoral Dissertation]. Victoria University of Wellington; 2018. Available from: http://hdl.handle.net/10063/7078

Harvard University
29.
Minnier, Jessica.
Inference and Prediction for High Dimensional Data via Penalized Regression and Kernel Machine Methods.
Degree: PhD, Biostatistics, 2012, Harvard University
URL: http://nrs.harvard.edu/urn-3:HUL.InstRepos:9367010
Analysis of high dimensional data often seeks to identify a subset of important features and assess their effects on the outcome. Furthermore, the ultimate goal is often to build a prediction model with these features that accurately assesses risk for future subjects. Such statistical challenges arise in the study of genetic associations with health outcomes. However, accurate inference and prediction with genetic information remain challenging, in part due to the complexity of the genetic architecture of human health and disease. A valuable approach for improving prediction models with a large number of potential predictors is to build a parsimonious model that includes only important variables. Regularized regression methods are useful, though they often pose challenges for inference due to nonstandard limiting distributions or finite-sample distributions that are difficult to approximate. In Chapter 1 we propose and theoretically justify a perturbation-resampling method to derive confidence regions and covariance estimates for marker effects estimated by regularized procedures with a general class of objective functions and concave penalties. Our methods outperform their asymptotic-based counterparts, even when effects are estimated as zero. In Chapters 2 and 3 we focus on genetic risk prediction. The difficulty of accurate risk assessment in genetic studies can in part be attributed to several potential obstacles: sparsity in marker effects, a large number of weak signals, and non-linear effects. Single-marker analyses often lack power to select informative markers and typically do not account for non-linearity. One approach to gaining predictive power and efficiency is to group markers based on biological knowledge such as genetic pathways or gene structure. In Chapter 2 we propose and theoretically justify a multi-stage method for risk assessment that imposes a naive Bayes kernel machine (KM) model to estimate gene-set-specific risk models, and then aggregates information across all gene-sets by adaptively estimating gene-set weights via a regularization procedure. In Chapter 3 we extend these methods to meta-analyses by introducing sampling-based weights in the KM model, permitting risk prediction models to be built from multiple studies with heterogeneous sampling schemes.
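The perturbation-resampling idea in Chapter 1 can be sketched in a simplified form (our illustration, not the author's exact procedure): refit the regularized estimator many times under i.i.d. positive random weights on the observations and read confidence limits off the percentiles of the perturbed coefficients. For a lasso with squared loss, weighting observation i by w_i is equivalent to scaling its row by sqrt(w_i):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
# synthetic sparse regression problem (intercept-free, so row scaling is exact)
X, y = make_regression(n_samples=120, n_features=30, n_informative=5,
                       noise=5.0, random_state=1)

def perturbed_lasso(w):
    # weights enter the squared loss; scaling rows by sqrt(w) is equivalent
    s = np.sqrt(w)
    return Lasso(alpha=1.0, fit_intercept=False).fit(X * s[:, None], y * s).coef_

draws = np.array([perturbed_lasso(rng.exponential(1.0, len(y)))
                  for _ in range(500)])
lo, hi = np.percentile(draws, [2.5, 97.5], axis=0)
print("95% interval for coefficient 0:", (round(lo[0], 2), round(hi[0], 2)))

A real implementation would also need to treat the penalty and the intercept appropriately; the point here is only the mechanics of resampling by reweighting rather than by redrawing bootstrap samples.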
Advisors/Committee Members: Cai, Tianxi (advisor).
Subjects/Keywords: biostatistics; high dimensional data; kernel machine learning; prediction; statistical genetics; statistics
APA (6th Edition):
Minnier, J. (2012). Inference and Prediction for High Dimensional Data via Penalized Regression and Kernel Machine Methods. (Doctoral Dissertation). Harvard University. Retrieved from http://nrs.harvard.edu/urn-3:HUL.InstRepos:9367010
Chicago Manual of Style (16th Edition):
Minnier, Jessica. “Inference and Prediction for High Dimensional Data via Penalized Regression and Kernel Machine Methods.” 2012. Doctoral Dissertation, Harvard University. Accessed March 02, 2021.
http://nrs.harvard.edu/urn-3:HUL.InstRepos:9367010.
MLA Handbook (7th Edition):
Minnier, Jessica. “Inference and Prediction for High Dimensional Data via Penalized Regression and Kernel Machine Methods.” 2012. Web. 02 Mar 2021.
Vancouver:
Minnier J. Inference and Prediction for High Dimensional Data via Penalized Regression and Kernel Machine Methods. [Internet] [Doctoral dissertation]. Harvard University; 2012. [cited 2021 Mar 02].
Available from: http://nrs.harvard.edu/urn-3:HUL.InstRepos:9367010.
Council of Science Editors:
Minnier J. Inference and Prediction for High Dimensional Data via Penalized Regression and Kernel Machine Methods. [Doctoral Dissertation]. Harvard University; 2012. Available from: http://nrs.harvard.edu/urn-3:HUL.InstRepos:9367010

Harvard University
30.
Sinnott, Jennifer Anne.
Kernel Machine Methods for Risk Prediction with High Dimensional Data.
Degree: PhD, Biostatistics, 2012, Harvard University
URL: http://nrs.harvard.edu/urn-3:HUL.InstRepos:9793867
Understanding the relationship between genomic markers and complex disease could have a profound impact on medicine, but the large number of potential markers can make it hard to differentiate true biological signal from noise and false positive associations. A standard approach for relating genetic markers to complex disease is to test each marker for its association with disease outcome by comparing disease cases to healthy controls. It would be cost-effective to use control groups across studies of many different diseases; however, this can be problematic when the controls are genotyped on a platform different from the one used for cases. Since different platforms genotype different SNPs, imputation is needed to provide full genomic coverage, but it introduces differential measurement error. In Chapter 1, we consider the effects of this differential error on association tests. We quantify the inflation in Type I error by comparing two healthy control groups drawn from the same cohort study but genotyped on different platforms, and assess several methods for mitigating this error. Analyzing genomic data one marker at a time can effectively identify associations, but the resulting lists of significant SNPs or differentially expressed genes can be hard to interpret. Integrating prior biological knowledge into risk prediction with such data by grouping genomic features into pathways reduces the dimensionality of the problem and could improve models by making them more biologically grounded and interpretable. The kernel machine framework has been proposed to model pathway effects because it allows nonlinear associations between the genes in a pathway and disease risk. In Chapter 2, we propose kernel machine regression under the accelerated failure time model. We derive a pseudo-score statistic for testing and a risk score for prediction using genes in a single pathway, and we propose omnibus procedures that alleviate the need to prespecify the kernel and allow the data to drive the complexity of the resulting model. In Chapter 3, we extend the single-pathway risk prediction methods to multiple pathways via a multiple kernel learning approach that selects important pathways and efficiently combines information across them.
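To fix ideas on the kernel machine framework, the following minimal sketch builds one kernel per pathway and combines them with fixed nonnegative weights; it is our illustration with hypothetical pathway matrices and a simple ridge outcome model, whereas the thesis works with censored outcomes under the accelerated failure time model, and multiple kernel learning would estimate the combination weights rather than fix them:

import numpy as np

rng = np.random.default_rng(2)
n, p1, p2 = 100, 10, 15
G1 = rng.normal(size=(n, p1))   # hypothetical pathway 1 expression matrix
G2 = rng.normal(size=(n, p2))   # hypothetical pathway 2 expression matrix
y = np.sin(G1[:, 0]) + 0.5 * G2[:, 1] + 0.1 * rng.normal(size=n)

def rbf(A, B, gamma):
    # Gaussian kernel, allowing nonlinear gene effects within a pathway
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

# one kernel per pathway, combined with fixed nonnegative weights
K = 0.7 * rbf(G1, G1, 0.1) + 0.3 * rbf(G2, G2, 0.1)
alpha = np.linalg.solve(K + 1.0 * np.eye(n), y)  # kernel ridge: (K + lam*I) a = y
risk_score = K @ alpha                           # fitted risk scores
print("in-sample correlation with outcome:",
      round(np.corrcoef(risk_score, y)[0, 1], 3))

The choice of kernel encodes how genes within a pathway may interact; the omnibus procedures of Chapter 2 avoid committing to a single kernel in advance.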
Advisors/Committee Members: Cai, Tianxi (advisor), Kraft, Peter (committee member), Mucci, Lorelei (committee member).
Subjects/Keywords: high dimensional data; kernel machines; pathways; risk prediction; biostatistics
APA (6th Edition):
Sinnott, J. A. (2012). Kernel Machine Methods for Risk Prediction with High Dimensional Data. (Doctoral Dissertation). Harvard University. Retrieved from http://nrs.harvard.edu/urn-3:HUL.InstRepos:9793867
Chicago Manual of Style (16th Edition):
Sinnott, Jennifer Anne. “Kernel Machine Methods for Risk Prediction with High Dimensional Data.” 2012. Doctoral Dissertation, Harvard University. Accessed March 02, 2021.
http://nrs.harvard.edu/urn-3:HUL.InstRepos:9793867.
MLA Handbook (7th Edition):
Sinnott, Jennifer Anne. “Kernel Machine Methods for Risk Prediction with High Dimensional Data.” 2012. Web. 02 Mar 2021.
Vancouver:
Sinnott JA. Kernel Machine Methods for Risk Prediction with High Dimensional Data. [Internet] [Doctoral dissertation]. Harvard University; 2012. [cited 2021 Mar 02].
Available from: http://nrs.harvard.edu/urn-3:HUL.InstRepos:9793867.
Council of Science Editors:
Sinnott JA. Kernel Machine Methods for Risk Prediction with High Dimensional Data. [Doctoral Dissertation]. Harvard University; 2012. Available from: http://nrs.harvard.edu/urn-3:HUL.InstRepos:9793867
◁ [1] [2] [3] [4] [5] [6] [7] [8] [9] ▶