Full Record

Author
Title Constructing and modeling text-rich information networks: a phrase mining-based approach
URL
Publication Date
Date Accessioned
Degree PhD
Discipline/Department Computer Science
Degree Level doctoral
University/Publisher University of Illinois – Urbana-Champaign
Abstract A lot of digital ink has been spilled on "big data" over the past few years, which is often characterized by an explosion of information. Most of this surge owes its origin to unstructured data in the wild, such as words, images and video, as compared to the structured information stored in fielded form in databases. The proliferation of text-heavy data is particularly overwhelming, reflected in everyone's daily life in the form of web documents, business reviews, news, social posts, etc. Meanwhile, textual data and structured entities often come intertwined, such as authors/posters, document categories and tags, and document-associated geolocations. Against this background, a core research challenge is how to turn massive, (semi-)unstructured data into structured knowledge. One promising paradigm studied in this dissertation is to integrate structured and unstructured data, constructing an organized heterogeneous information network, and developing powerful modeling mechanisms on such an organized network. We name it a text-rich information network, since it is an integrated representation of both structured and unstructured textual data. To thoroughly develop the construction and modeling paradigm, this dissertation will focus on forming a scalable data-driven framework and propose a new line of techniques relying on the idea of phrase mining to bridge textual documents and structured entities. We will first introduce the phrase mining method named SegPhrase+ to globally discover semantically meaningful phrases from massive textual data, providing a high-quality dictionary for text structuralization. Clearly distinct from previous works that mostly focused on raw statistics of string matching, SegPhrase+ looks into the phrase context and effectively rectifies raw statistics to significantly boost performance.
Next, a novel algorithm based on latent keyphrases is developed and adopted to largely eliminate irregularities in massive text by providing a consistent and interpretable document representation. As a critical process in constructing the network, it uses the quality phrases generated in the previous step as candidates. From them, a set of keyphrases is extracted to represent a particular document, with strength inferred through a statistical model. After this step, documents become more structured and are consistently represented in the form of a bipartite network connecting documents with quality keyphrases. A more heterogeneous text-rich information network can be constructed by incorporating different types of document-associated entities as additional nodes. Lastly, a general and scalable framework, Tensor2vec, is to be added to traditional data mining mechanisms, as the latter cannot readily solve the problem when the organized heterogeneous network has nodes of different types. Tensor2vec is expected to elegantly handle relevance search, entity classification, summarization and recommendation problems, by making use of higher-order link information and projecting multi-typed…
Subjects/Keywords Text-Rich Information Networks; Phrase Mining; Heterogeneous Information Network; Network Embedding; Keyphrase Extraction
Contributors Han, Jiawei (advisor); Han, Jiawei (Committee Chair); Zhai, Chengxiang (committee member); Parameswaran, Aditya (committee member); Yu, Cong (committee member)
Language en
Rights Copyright 2016 Jialu Liu
Country of Publication us
Record ID handle:2142/92777
Repository uiuc
Date Indexed 2020-03-09
Grantor University of Illinois at Urbana-Champaign
Issued Date 2016-07-11 00:00:00

Sample Search Hits

…search and multi-aspect mining. We achieve this goal by solving a fundamental problem shared by all these tasks. That is, Problem 1.3: Text-Rich Information Network Embedding. Given a text-rich information network, the problem of network embedding is…

…classification, clustering, recommender system, and link prediction. 1.3 Framework. This section presents a coherent framework for constructing a text-rich information network using phrase mining-based techniques and modeling it using the embedding method. It is…

…Network Embedding By representing documents as a collection of keyphrases, one can naturally view those keyphrases as structured units and build a bipartite network between them and documents. Moreover, a gigantic text-rich information network can be…
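
The document-keyphrase bipartite network described in the snippet above can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function name `build_bipartite_network`, the input layout (a mapping from document IDs to keyphrase strengths), and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict


def build_bipartite_network(doc_keyphrases):
    """Build both sides of a document-keyphrase bipartite network.

    doc_keyphrases: hypothetical input, a dict mapping each document ID
    to a dict of {keyphrase: strength}. The strengths stand in for the
    scores a keyphrase extraction step would produce.
    """
    doc_to_phrase = {}
    phrase_to_doc = defaultdict(dict)
    for doc_id, phrases in doc_keyphrases.items():
        # Edges from the document side, copied so the input is not aliased.
        doc_to_phrase[doc_id] = dict(phrases)
        # The same weighted edges, indexed from the keyphrase side.
        for phrase, strength in phrases.items():
            phrase_to_doc[phrase][doc_id] = strength
    return doc_to_phrase, dict(phrase_to_doc)


# Illustrative data only; these documents and strengths are made up.
docs = {
    "d1": {"phrase mining": 0.9, "network embedding": 0.4},
    "d2": {"network embedding": 0.8},
}
doc_to_phrase, phrase_to_doc = build_bipartite_network(docs)
```

Indexing the edges from both sides makes it cheap to look up every document that shares a keyphrase, and additional node types (for example authors or tags) could be added as further adjacency maps in the same style.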

…learning the embeddings of words and/or documents through (deep) neural networks. Observing the trend in text embedding, researchers have proposed work to embed large-scale networked data. [100] and [80] utilize the network link…

…Chapter 2 Literature Review; 2.1 Constructing Text-Rich Information Network; 2.2 Modeling Text-Rich Information Network; 2.3 Phrase Mining…

…2.4 Document Representations; 2.5 Keyphrase Extraction; 2.6 Network Embedding

…Chapter 5 Tensor-Based Large-Scale Network Embedding; 5.1…

…Network Schema and Entity Proximity; 5.2 Tensor2vec: The Network Embedding Framework; 5.2.1 Entity2vec; 5.2.2 Relation2vec; 5.2.3 Multiple…
