You searched for +publisher:"University of Cincinnati" +contributor:("Wilson, Shomir"). Showing records 1 – 2 of 2 total matches.

University of Cincinnati

1. Mysore Gopinath, Abhijith Athreya. Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning.

Degree: MS, Engineering and Applied Science: Computer Science, 2018, University of Cincinnati

Web documents are one of the most important sources of obtaining publicly available information, and researchers in need of textual data often scour the web for information. Most web documents organize the textual content into different sections based on the topicality of the text. Each section contains two distinguishable parts: (1) the title, which consists of a summary/title of the text which follows it, and (2) the text which follows the title, also known as prose text. Apart from the aesthetic appeal, this organization could be helpful in natural language processing (NLP) tasks such as question answering, information extraction, text summarization and text classification. The section title acts as an index or a quick summary of the prose content that follows it. Just like searching for information using a table of contents in a book, these indexes can be used to focus on content relevant to a search. Each section is lexically cohesive, and at the same time, it is cohesively different from other sections.

Current methods of web text extraction are agnostic of these textual demarcations, as they cannot identify titles and prose text. One reason is the inherent difficulty in determining sections, since two documents with the same appearance can be structured in many different ways, and a rule-based method may not work well on various websites. Also, the complex nesting of HTML tags and the copious presence of unrelated data complicate processing. Through this thesis, we solve the problem of automatic identification of section titles and prose text.

We developed two methods: one an unsupervised domain-independent approach and the other a supervised domain-dependent approach. In the domain-independent approach, we make use of lexical and morphological features of text to perform k-means clustering to identify title labels. Then, further techniques are used to determine corresponding prose text for the titles. In the domain-dependent approach, we train a neural network classifier on the dense word embeddings of title and prose text collected from a domain. The system produces a simplified output of the original HTML page which can be machine-read using simple rules. Along with these novel methods, we also have created a corpus of web documents containing privacy policies, terms of service agreements and miscellaneous web documents. This corpus includes both the original version and the simplified output of all HTML documents.

To test our assumptions and methods, we used online privacy policies, terms of service agreements and miscellaneous web documents. We evaluated the models on two fronts: (1) the traditional precision, recall and F-1 scores for segment identification, and (2) a metric we name coverage, which measures the amount of the original legitimate text reproduced in the final output. The domain-independent approach achieved an overall precision of 0.82, recall of 0.98 and coverage of 0.97. The domain-dependent model returned with an accuracy of 0.99, recall of 0.75 and coverage of 0.93. These results…

Advisors/Committee Members: Wilson, Shomir (Committee Chair).

Subjects/Keywords: Computer Science; HTML Structure Analysis; Natural Language Processing; Topicality Detection in HTML; Machine Learning; Privacy Policies
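The evaluation measures named in the abstract above can be illustrated with a minimal sketch. The abstract does not give the thesis's exact formulas, so everything here is an assumption: the function names are hypothetical, precision, recall and F-1 are computed over sets of segment identifiers, and coverage is approximated as the fraction of whitespace-separated tokens of the original text that reappear in the output.

```python
from collections import Counter


def precision_recall_f1(predicted, gold):
    """Standard precision, recall and F1 over two collections of segment ids."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def coverage(original_text, output_text):
    """Assumed definition: share of original tokens reproduced in the output.

    Tokens are whitespace-separated and compared as multisets, so a token
    that occurs twice in the original must occur twice in the output to
    count twice.
    """
    original = Counter(original_text.split())
    output = Counter(output_text.split())
    total = sum(original.values())
    if total == 0:
        return 0.0
    reproduced = sum((original & output).values())
    return reproduced / total
```

For instance, `precision_recall_f1({"t1", "t2", "t3", "t4"}, {"t1", "t2", "t5"})` scores four predicted titles against three gold titles, and `coverage(page_text, simplified_text)` compares a page's text with its simplified version (both variable names are placeholders).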

APA (6th Edition):

Mysore Gopinath, A. A. (2018). Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning. (Masters Thesis). University of Cincinnati. Retrieved from http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535371714338677

Chicago Manual of Style (16th Edition):

Mysore Gopinath, Abhijith Athreya. “Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning.” 2018. Masters Thesis, University of Cincinnati. Accessed September 22, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535371714338677.

MLA Handbook (7th Edition):

Mysore Gopinath, Abhijith Athreya. “Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning.” 2018. Web. 22 Sep 2020.

Vancouver:

Mysore Gopinath AA. Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning. [Internet] [Masters thesis]. University of Cincinnati; 2018. [cited 2020 Sep 22]. Available from: http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535371714338677.

Council of Science Editors:

Mysore Gopinath AA. Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning. [Masters Thesis]. University of Cincinnati; 2018. Available from: http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535371714338677

2. Aryasomayajula, Naga Srinivasa Baradwaj. Machine Learning Models for Categorizing Privacy Policy Text.

Degree: MS, Engineering and Applied Science: Computer Science, 2018, University of Cincinnati

A privacy policy is a legal document that discloses the privacy practices of a company to its customers and contains information on how the company collects, uses and manages their data. The privacy policies of many companies on the web are written in natural language. The vocabulary employed in these documents is often sophisticated, and the policy documents themselves are lengthy. This complex nature of privacy policy documents leads end users to skip reading them or not perceive vital information, thus resulting in users not making informed decisions about whether to allow the company to collect their personal information. There is a need to address this issue by making privacy policies more user-friendly. In order to address these issues, this thesis makes use of a privacy policy corpus called OPP-115, which contains 115 privacy policies annotated with different data practices.

In this thesis, privacy policy text from the First Party Collection/Use category of the OPP-115 corpus is used for the analysis. The methods used here are a combination of linguistic and machine learning techniques applied to the corpus. A set of features, which includes noun phrases, verb phrases, and the relative positions of text, is derived in this thesis after observing the behavior of the text fragments in the corpus. These features are used in various supervised learning algorithms. Using the bag of words on the text as a base model, the performance of these algorithms with the extracted features is compared using various statistical measures. It is observed that the supervised learning methods with the features extracted in this thesis outperform the baseline methods.

Advisors/Committee Members: Wilson, Shomir (Committee Chair).

Subjects/Keywords: Computer Science; machine learning; privacy policy text; classification; text spans
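The bag-of-words baseline and the relative-position feature mentioned in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: the function names, the lowercase whitespace tokenization, and the definition of relative position as a segment's index scaled to the range 0 to 1 are all hypothetical.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct lowercase token to a column index (sorted order)."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def bag_of_words(text, vocabulary):
    """Count vector for one text; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for tok, count in Counter(text.lower().split()).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = count
    return vector


def relative_position(index, total):
    """Assumed feature: position of a segment in its policy, scaled to [0, 1]."""
    return index / (total - 1) if total > 1 else 0.0


def featurize(segments, vocabulary):
    """Bag-of-words counts with the relative position appended as a last column."""
    return [
        bag_of_words(segment, vocabulary) + [relative_position(i, len(segments))]
        for i, segment in enumerate(segments)
    ]
```

The rows returned by `featurize` could then be passed to any supervised classifier; comparing a model trained on the count columns alone with one that also sees the extra column mirrors the baseline-versus-features comparison the abstract describes.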

APA (6th Edition):

Aryasomayajula, N. S. B. (2018). Machine Learning Models for Categorizing Privacy Policy Text. (Masters Thesis). University of Cincinnati. Retrieved from http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535633397362514

Chicago Manual of Style (16th Edition):

Aryasomayajula, Naga Srinivasa Baradwaj. “Machine Learning Models for Categorizing Privacy Policy Text.” 2018. Masters Thesis, University of Cincinnati. Accessed September 22, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535633397362514.

MLA Handbook (7th Edition):

Aryasomayajula, Naga Srinivasa Baradwaj. “Machine Learning Models for Categorizing Privacy Policy Text.” 2018. Web. 22 Sep 2020.

Vancouver:

Aryasomayajula NSB. Machine Learning Models for Categorizing Privacy Policy Text. [Internet] [Masters thesis]. University of Cincinnati; 2018. [cited 2020 Sep 22]. Available from: http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535633397362514.

Council of Science Editors:

Aryasomayajula NSB. Machine Learning Models for Categorizing Privacy Policy Text. [Masters Thesis]. University of Cincinnati; 2018. Available from: http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535633397362514