
You searched for subject:(HTML Structure Analysis). Showing records 1 – 2 of 2 total matches.



University of Cincinnati

1. Mysore Gopinath, Abhijith Athreya. Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning.

Degree: MS, Engineering and Applied Science: Computer Science, 2018, University of Cincinnati

Web documents are one of the most important sources of obtaining publicly available information, and researchers in need of textual data often scour the web for information. Most web documents organize the textual content into different sections based on the topicality of the text. Each section contains two distinguishable parts: (1) the title, which consists of a summary/title of the text which follows it, and (2) the text which follows the title, also known as prose text. Apart from the aesthetic appeal, this organization could be helpful in natural language processing (NLP) tasks such as question answering, information extraction, text summarization and text classification. The section title acts as an index or a quick summary of the prose content that follows it. Just like searching for information using a table of contents in a book, these indexes can be used to focus on content relevant to a search. Each section is lexically cohesive, and at the same time, it is cohesively different from other sections.

Current methods of web text extraction are agnostic of these textual demarcations, as they cannot identify titles and prose text. One reason is the inherent difficulty in determining sections, since two documents with the same appearance can be structured in many different ways, and a rule-based method may not work well on various websites. Also, the complex nesting of HTML tags and the copious presence of unrelated data complicate processing. Through this thesis, we solve the problem of automatic identification of section titles and prose text.

We developed two methods: one an unsupervised domain-independent approach and the other a supervised domain-dependent approach. In the domain-independent approach, we make use of lexical and morphological features of text to perform k-means clustering to identify title labels. Then, further techniques are used to determine corresponding prose text for the titles. In the domain-dependent approach, we train a neural network classifier on the dense word embeddings of title and prose text collected from a domain. The system produces a simplified output of the original HTML page which can be machine-read using simple rules. Along with these novel methods, we also have created a corpus of web documents containing privacy policies, terms of service agreements and miscellaneous web documents. This corpus includes both the original version and the simplified output of all HTML documents.

To test our assumptions and methods, we used online privacy policies, terms of service agreements and miscellaneous web documents. We evaluated the models on two fronts: (1) the traditional precision, recall and F-1 scores for segment identification, and (2) a metric we name coverage, which measures the amount of the original legitimate text reproduced in the final output. The domain-independent approach achieved an overall precision of 0.82, recall of 0.98 and coverage of 0.97. The domain-dependent model returned with an accuracy of 0.99, recall of 0.75 and coverage of 0.93. These results…

Advisors/Committee Members: Wilson, Shomir (Committee Chair).

Subjects/Keywords: Computer Science; HTML Structure Analysis; Natural Language Processing; Topicality Detection in HTML; Machine Learning; Privacy Policies
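
The evaluation described in the abstract above (precision, recall, F-1, and a coverage measure) can be illustrated with a short sketch. This is not the thesis's implementation: the function names, the whitespace tokenization, and the multiset-overlap definition of coverage are all assumptions made for illustration. The abstract only states that coverage measures the amount of original legitimate text reproduced in the final output.

```python
from collections import Counter


def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F-1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def coverage(original_text, output_text):
    """Assumed definition: share of the original's tokens that reappear in the output.

    Tokens are lowercased whitespace-separated words, and counts are compared
    as multisets, so a word repeated in the original must be repeated in the
    output to be fully counted.
    """
    original = Counter(original_text.lower().split())
    output = Counter(output_text.lower().split())
    total = sum(original.values())
    if total == 0:
        return 1.0  # nothing to reproduce
    kept = sum((original & output).values())  # multiset intersection
    return kept / total


if __name__ == "__main__":
    # Hypothetical counts for title detection on one document.
    p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=0)
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

    # Hypothetical original page text and simplified output.
    print(coverage("We collect your data We share your data",
                   "We collect your data"))
```

Under this assumed definition, a coverage of 1.0 means every token of the original text survives in the simplified output, while precision and recall describe how accurately segments are labeled as titles or prose.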


APA (6th Edition):

Mysore Gopinath, A. A. (2018). Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning. (Masters Thesis). University of Cincinnati. Retrieved from http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535371714338677

Chicago Manual of Style (16th Edition):

Mysore Gopinath, Abhijith Athreya. “Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning.” 2018. Masters Thesis, University of Cincinnati. Accessed September 26, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535371714338677.

MLA Handbook (7th Edition):

Mysore Gopinath, Abhijith Athreya. “Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning.” 2018. Web. 26 Sep 2020.

Vancouver:

Mysore Gopinath AA. Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning. [Internet] [Masters thesis]. University of Cincinnati; 2018. [cited 2020 Sep 26]. Available from: http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535371714338677.

Council of Science Editors:

Mysore Gopinath AA. Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning. [Masters Thesis]. University of Cincinnati; 2018. Available from: http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535371714338677


Brno University of Technology

2. Blahušiaková, Barbora. Návrh a tvorba datové struktury a webové stránky pro Mystery Shopping agenturu: Data Structure Proposal and Website Design for Mystery Shopping Agency.

Degree: 2019, Brno University of Technology

The main objective of this thesis is a data structure proposal that serves as the basis for a website and information system for a mystery shopping company. The data structure and website will minimize employees' mistakes, simplify their work, and shorten response times in communication between employees and the company's clients. The theoretical part of the bachelor's thesis describes the resources and methods used to create the data structure and the company's website. The analytical part defines and describes the company's requirements as a basis for the data structure proposal and the implementation of the company's website. The thesis also contains a concrete design of the data structure and website.

Advisors/Committee Members: Klusák, Aleš (advisor), Lát, Radek (referee).

Subjects/Keywords: Dátová štruktúra; databáza; webová stránka; informačný systém; HTML; CSS; PHP; MySQL; SWOT analýza; Data structure; database; website; information system; HTML; CSS; PHP; MySQL; SWOT analysis


APA (6th Edition):

Blahušiaková, B. (2019). Návrh a tvorba datové struktury a webové stránky pro Mystery Shopping agenturu: Data Structure Proposal and Website Design for Mystery Shopping Agency. (Thesis). Brno University of Technology. Retrieved from http://hdl.handle.net/11012/61168

Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

Chicago Manual of Style (16th Edition):

Blahušiaková, Barbora. “Návrh a tvorba datové struktury a webové stránky pro Mystery Shopping agenturu: Data Structure Proposal and Website Design for Mystery Shopping Agency.” 2019. Thesis, Brno University of Technology. Accessed September 26, 2020. http://hdl.handle.net/11012/61168.

Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

MLA Handbook (7th Edition):

Blahušiaková, Barbora. “Návrh a tvorba datové struktury a webové stránky pro Mystery Shopping agenturu: Data Structure Proposal and Website Design for Mystery Shopping Agency.” 2019. Web. 26 Sep 2020.

Vancouver:

Blahušiaková B. Návrh a tvorba datové struktury a webové stránky pro Mystery Shopping agenturu: Data Structure Proposal and Website Design for Mystery Shopping Agency. [Internet] [Thesis]. Brno University of Technology; 2019. [cited 2020 Sep 26]. Available from: http://hdl.handle.net/11012/61168.

Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

Council of Science Editors:

Blahušiaková B. Návrh a tvorba datové struktury a webové stránky pro Mystery Shopping agenturu: Data Structure Proposal and Website Design for Mystery Shopping Agency. [Thesis]. Brno University of Technology; 2019. Available from: http://hdl.handle.net/11012/61168

Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
