You searched for +publisher:"University of Illinois – Urbana-Champaign" +contributor:("Yu, Cong"). Showing records 1 – 2 of 2 total matches.

University of Illinois – Urbana-Champaign

1. Gui, Huan. Low-rank estimation and embedding learning: theory and applications.

Degree: PhD, Computer Science, 2017, University of Illinois – Urbana-Champaign

In many real-world applications of data mining, datasets can be represented as matrices, where rows correspond to objects (or data instances) and columns to features (or attributes). Often these datasets live in a high-dimensional feature space. For example, in the vector space model of text data, the feature dimension is the vocabulary size; when a social network is represented by an adjacency matrix, the feature dimension is the number of objects in the network. Many other datasets, such as genetic, image, and medical datasets, also fall into this category. Even though the feature dimension is enormous, a common observation is that high-dimensional datasets may (approximately) lie in a subspace of much smaller dimensionality, due to dependency or correlation among features. This thesis studies the problem of automatically identifying the low-dimensional space that high-dimensional datasets (approximately) lie in, based on two families of dimension reduction models: low-rank estimation models and embedding learning models. For data matrices, low-rank estimation recovers an underlying data matrix subject to the constraint that the matrix has reduced rank; the analysis is also generalized to high-dimensional higher-order tensor data. Embedding learning models, in contrast, directly project the observed data into a low-dimensional vector space. In the first part, the theoretical analysis of low-rank estimation models is established in the regime of high-dimensional statistics. For matrices, the low-rank structure corresponds to sparsity of the singular values; for tensors, the low-rank model can be defined via the low-rankness of the tensor's unfolding matrices. To achieve low-rank solutions, two categories of regularization are imposed. First, the problem of robust tensor decomposition with gross corruption is considered. To recover the underlying true tensor and corruption of large magnitude, structural assumptions of low-rankness and sparsity are imposed on the tensor and the corruption, respectively, and the Schatten-1 norm is applied as a convex regularizer for the low-rank structure. Second, the problem of matrix estimation with a nonconvex penalty is considered. Compared with convex regularization, a nonconvex penalty shrinks the large singular values less, which leads to a faster statistical convergence rate and an oracle property under a mild condition on the magnitude of the singular values. For both problems, efficient optimization algorithms are proposed, and extensive numerical experiments corroborate the efficacy of the proposed algorithms and the theoretical analysis. In the second part, embedding learning models for real-world applications are presented. The high-dimensional data is projected into a low-dimensional vector space while preserving the proximity among objects; each object is represented by a low-dimensional vector, called an embedding or distributed representation. In the first application, the… Advisors/Committee Members: Han, Jiawei (advisor), Han, Jiawei (Committee Chair), Peng, Jian (committee member), Zhai, Chengxiang (committee member), Yu, Cong (committee member).
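
To make the Schatten-1 (nuclear norm) regularization mentioned in the abstract concrete, here is a minimal Python sketch of singular value soft-thresholding, the proximal operator of the nuclear norm, which underlies many low-rank estimation algorithms of this kind. The matrix M, the weight lam, and the toy data are illustrative assumptions, not the dissertation's code or experiments.

    import numpy as np

    def svt(M, lam):
        # Singular value thresholding: the proximal operator of the
        # Schatten-1 (nuclear) norm penalty lam * ||X||_*.
        # Soft-thresholding the spectrum zeroes out small singular
        # values, producing a low-rank estimate of M.
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        s_shrunk = np.maximum(s - lam, 0.0)
        return U @ np.diag(s_shrunk) @ Vt

    # Illustrative usage: denoise a noisy observation of a rank-2 matrix.
    rng = np.random.default_rng(0)
    L_true = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 40))
    M = L_true + 0.1 * rng.standard_normal((50, 40))
    L_hat = svt(M, lam=1.0)
    print(np.linalg.matrix_rank(L_hat, tol=1e-6))  # small rank, near 2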

Subjects/Keywords: Low-rank model; Embedding learning; Nonconvex

APA (6th Edition):

Gui, H. (2017). Low-rank estimation and embedding learning: theory and applications. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/98280

Chicago Manual of Style (16th Edition):

Gui, Huan. “Low-rank estimation and embedding learning: theory and applications.” 2017. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed August 03, 2020. http://hdl.handle.net/2142/98280.

MLA Handbook (7th Edition):

Gui, Huan. “Low-rank estimation and embedding learning: theory and applications.” 2017. Web. 03 Aug 2020.

Vancouver:

Gui H. Low-rank estimation and embedding learning: theory and applications. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2017. [cited 2020 Aug 03]. Available from: http://hdl.handle.net/2142/98280.

Council of Science Editors:

Gui H. Low-rank estimation and embedding learning: theory and applications. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2017. Available from: http://hdl.handle.net/2142/98280

2. Liu, Jialu. Constructing and modeling text-rich information networks: a phrase mining-based approach.

Degree: PhD, Computer Science, 2016, University of Illinois – Urbana-Champaign

A great deal of digital ink has been spilled on "big data" over the past few years, a phenomenon often characterized by an explosion of information. Most of this surge originates from unstructured data in the wild, such as text, images, and video, as opposed to the structured information stored in fielded form in databases. The proliferation of text-heavy data is particularly overwhelming, reflected in everyone's daily life in the form of web documents, business reviews, news, social posts, etc. Meanwhile, textual data and structured entities often come intertwined, such as authors/posters, document categories and tags, and document-associated geographic locations. Against this background, a core research challenge presents itself: how to turn massive, (semi-)unstructured data into structured knowledge. One promising paradigm studied in this dissertation is to integrate structured and unstructured data, construct an organized heterogeneous information network, and develop powerful modeling mechanisms on such a network. We name it a text-rich information network, since it is an integrated representation of both structured and unstructured textual data. To thoroughly develop this construction and modeling paradigm, the dissertation focuses on forming a scalable data-driven framework and proposes a new line of techniques relying on phrase mining to bridge textual documents and structured entities. We first introduce the phrase mining method SegPhrase+, which globally discovers semantically meaningful phrases from massive textual data, providing a high-quality dictionary for text structuralization. Clearly distinct from previous works that mostly relied on raw string-matching statistics, SegPhrase+ looks into the phrase context and effectively rectifies those statistics to significantly boost performance. Next, a novel algorithm based on latent keyphrases is developed and adopted to largely eliminate irregularities in massive text by providing a consistent and interpretable document representation. As a critical process in constructing the network, it uses the quality phrases generated in the previous step as candidates; from these candidates, a set of keyphrases is extracted to represent each document, with strengths inferred through a statistical model. After this step, documents become more structured and are consistently represented in the form of a bipartite network connecting documents with quality keyphrases. A more heterogeneous text-rich information network can be constructed by incorporating different types of document-associated entities as additional nodes. Lastly, a general and scalable framework, Tensor2vec, is added to traditional data mining mechanisms, which cannot readily solve the problem when the organized heterogeneous network has nodes of different types. Tensor2vec is expected to elegantly handle relevance search, entity classification, summarization, and recommendation problems by making use of higher-order link information and projecting multi-typed… Advisors/Committee Members: Han, Jiawei (advisor), Han, Jiawei (Committee Chair), Zhai, Chengxiang (committee member), Parameswaran, Aditya (committee member), Yu, Cong (committee member).
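
As a rough illustration of the document–keyphrase bipartite network described in the abstract, here is a minimal Python sketch. The phrase scores, documents, and plain substring matching are simplified stand-ins; the dissertation's SegPhrase+ mining and statistical strength inference are not reproduced here.

    from collections import defaultdict

    # Hypothetical quality phrases with mined scores; in the dissertation
    # these would come from the SegPhrase+ phrase-mining step.
    quality_phrases = {"information network": 0.9,
                       "phrase mining": 0.85,
                       "text data": 0.6}

    docs = [
        "phrase mining turns raw text data into structured knowledge",
        "a heterogeneous information network links documents and entities",
    ]

    def bipartite_edges(docs, quality_phrases):
        # Connect each document to the quality phrases it contains,
        # weighted by the phrase score. Substring matching here is a
        # simplification of the statistical keyphrase-strength inference
        # described in the abstract.
        edges = defaultdict(dict)
        for doc_id, text in enumerate(docs):
            for phrase, score in quality_phrases.items():
                if phrase in text:
                    edges[doc_id][phrase] = score
        return dict(edges)

    print(bipartite_edges(docs, quality_phrases))
    # {0: {'phrase mining': 0.85, 'text data': 0.6},
    #  1: {'information network': 0.9}}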

Subjects/Keywords: Text-Rich Information Networks; Phrase Mining; Heterogeneous Information Network; Network Embedding; Keyphrase Extraction

APA (6th Edition):

Liu, J. (2016). Constructing and modeling text-rich information networks: a phrase mining-based approach. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/92777

Chicago Manual of Style (16th Edition):

Liu, Jialu. “Constructing and modeling text-rich information networks: a phrase mining-based approach.” 2016. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed August 03, 2020. http://hdl.handle.net/2142/92777.

MLA Handbook (7th Edition):

Liu, Jialu. “Constructing and modeling text-rich information networks: a phrase mining-based approach.” 2016. Web. 03 Aug 2020.

Vancouver:

Liu J. Constructing and modeling text-rich information networks: a phrase mining-based approach. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2016. [cited 2020 Aug 03]. Available from: http://hdl.handle.net/2142/92777.

Council of Science Editors:

Liu J. Constructing and modeling text-rich information networks: a phrase mining-based approach. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2016. Available from: http://hdl.handle.net/2142/92777
