You searched for subject:(data quality).
Showing records 1 – 30 of 935 total matches.

San Jose State University
1.
Desai, Khushali Yashodhar.
Big Data Quality Modeling And Validation.
Degree: MS, Computer Engineering, 2018, San Jose State University
URL: https://doi.org/10.31979/etd.c68w-98uf ; https://scholarworks.sjsu.edu/etd_theses/4898
The chief purpose of this study is to characterize various big data quality models and to validate each with an example. As the volume of data increases at an exponential rate in the era of the broadband Internet, the success of a product or decision largely depends on selecting the highest-quality raw material, namely data, for use in production. However, working with data of high volume, fast velocity, and varied formats can be fraught with problems. Therefore, software industries need a quality check, especially for data generated by software or by sensors. This study explores various big data quality parameters and their definitions and proposes a quality model for each parameter. Using water-quality data from the U.S. Geological Survey (USGS) for San Francisco Bay, an example is given for each of the proposed big data quality models. To calculate composite data quality, prevalent methods such as Monte Carlo simulation and neural networks were used. This thesis proposes eight big data quality parameters in total. Six of the eight models were coded and turned into a final-year project by a group of Master's students at SJSU. A case study is carried out using linear regression analysis, and all the big data quality parameters are validated with positive results.
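The abstract mentions computing a composite data quality score from individual parameters using methods such as Monte Carlo. As a rough, hypothetical illustration of that kind of aggregation (the parameter names, scores, and weighting scheme below are invented, not the thesis's models), a Python sketch:

```python
# Hypothetical sketch: aggregate per-parameter quality scores (0..1) into a
# composite score by Monte Carlo sampling of weights. Parameter names and
# scores are invented for illustration only.
import random

def composite_quality(scores, trials=10_000, seed=42):
    """Return mean and spread of a weighted composite over random weightings."""
    rng = random.Random(seed)
    params = list(scores)
    samples = []
    for _ in range(trials):
        raw = [rng.random() for _ in params]       # draw unnormalised weights
        total = sum(raw)
        weights = [r / total for r in raw]         # normalise to sum to 1
        samples.append(sum(w * scores[p] for w, p in zip(weights, params)))
    mean = sum(samples) / len(samples)
    return mean, (min(samples), max(samples))

example = {"completeness": 0.92, "accuracy": 0.85, "timeliness": 0.78}
print(composite_quality(example))
```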
Subjects/Keywords: Big Data; Big Data Quality
APA (6th Edition):
Desai, K. Y. (2018). Big Data Quality Modeling And Validation. (Masters Thesis). San Jose State University. Retrieved from https://doi.org/10.31979/etd.c68w-98uf ; https://scholarworks.sjsu.edu/etd_theses/4898
Chicago Manual of Style (16th Edition):
Desai, Khushali Yashodhar. “Big Data Quality Modeling And Validation.” 2018. Masters Thesis, San Jose State University. Accessed January 21, 2021.
https://doi.org/10.31979/etd.c68w-98uf ; https://scholarworks.sjsu.edu/etd_theses/4898.
MLA Handbook (7th Edition):
Desai, Khushali Yashodhar. “Big Data Quality Modeling And Validation.” 2018. Web. 21 Jan 2021.
Vancouver:
Desai KY. Big Data Quality Modeling And Validation. [Internet] [Masters thesis]. San Jose State University; 2018. [cited 2021 Jan 21].
Available from: https://doi.org/10.31979/etd.c68w-98uf ; https://scholarworks.sjsu.edu/etd_theses/4898.
Council of Science Editors:
Desai KY. Big Data Quality Modeling And Validation. [Masters Thesis]. San Jose State University; 2018. Available from: https://doi.org/10.31979/etd.c68w-98uf ; https://scholarworks.sjsu.edu/etd_theses/4898

University of Waterloo
2.
Chu, Xu.
Scalable and Holistic Qualitative Data Cleaning.
Degree: 2017, University of Waterloo
URL: http://hdl.handle.net/10012/12138
Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate analytics results and wrong business decisions. Poor data across businesses and the government costs the U.S. economy $3.1 trillion a year, according to a report by InsightSquared in 2012. Data scientists reportedly spend 60% of their time cleaning and organizing data, according to a survey published in Forbes in 2016. Therefore, we need effective and efficient techniques to reduce the human effort spent on data cleaning.
Data cleaning activities usually consist of two phases: error detection and error repair. Error detection techniques can generally be classified as either quantitative or qualitative. Quantitative error detection techniques often involve statistical and machine learning methods to identify abnormal behaviors and errors; they have been studied mostly in the context of outlier detection. Qualitative error detection techniques, on the other hand, rely on descriptive approaches to specify the patterns or constraints of a legal data instance. One common way of specifying those patterns or constraints is by using data quality rules expressed in some integrity constraint language, with errors captured by identifying violations of the specified rules. This dissertation focuses on tackling the challenges associated with detecting and repairing qualitative errors.
To clean a dirty dataset using rule-based qualitative data cleaning techniques, we first need to design data quality rules that reflect the semantics of the data. Since obtaining data quality rules by consulting domain experts is usually a time-consuming process, we need automatic techniques to discover them. We show how to mine data quality rules expressed in the formalism of denial constraints (DCs). We choose DCs as the formal integrity constraint language for capturing data quality rules because they can express many real-life data quality rules while still admitting an efficient discovery algorithm.
Since error detection often requires pairwise comparison of tuples, a quadratic-complexity operation that is expensive for large datasets, we present a strategy that distributes the error detection workload across a cluster of machines in a parallel shared-nothing computing environment. Our proposed distribution strategy aims at minimizing, across all machines, the maximum computation cost and the maximum communication cost, which are the two main types of cost one needs to consider in a shared-nothing environment.
In repairing qualitative errors, we propose a holistic data cleaning technique, which accumulates evidence from a broad spectrum of data quality rules and suggests possible data updates in a holistic manner. Compared with previous piecemeal data repairing approaches, the holistic approach produces data updates with higher accuracy because it captures the interactions between different errors in one representation, and aims at generating data…
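As context for the rule-based error detection described above, here is a minimal, hypothetical sketch of checking a single denial constraint by naive pairwise tuple comparison (the quadratic step the abstract refers to). The constraint and the data are invented for illustration; this is not the thesis's discovery or distributed detection algorithm.

```python
# Minimal sketch of denial-constraint (DC) violation detection by naive
# pairwise comparison. The example DC is invented: no two employees in the
# same state may have the lower-paid one taxed at a higher rate.
from itertools import combinations

employees = [
    {"id": 1, "state": "NY", "salary": 90_000, "tax_rate": 0.30},
    {"id": 2, "state": "NY", "salary": 60_000, "tax_rate": 0.35},  # violates with id 1
    {"id": 3, "state": "CA", "salary": 70_000, "tax_rate": 0.25},
]

def violates(t1, t2):
    """DC predicate: same state, one tuple earns less but is taxed more."""
    return (t1["state"] == t2["state"]
            and ((t1["salary"] < t2["salary"] and t1["tax_rate"] > t2["tax_rate"])
                 or (t2["salary"] < t1["salary"] and t2["tax_rate"] > t1["tax_rate"])))

violations = [(a["id"], b["id"]) for a, b in combinations(employees, 2) if violates(a, b)]
print(violations)   # [(1, 2)]
```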
Subjects/Keywords: Data Quality; Data Cleaning; Databases
APA (6th Edition):
Chu, X. (2017). Scalable and Holistic Qualitative Data Cleaning. (Thesis). University of Waterloo. Retrieved from http://hdl.handle.net/10012/12138
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Chu, Xu. “Scalable and Holistic Qualitative Data Cleaning.” 2017. Thesis, University of Waterloo. Accessed January 21, 2021.
http://hdl.handle.net/10012/12138.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Chu, Xu. “Scalable and Holistic Qualitative Data Cleaning.” 2017. Web. 21 Jan 2021.
Vancouver:
Chu X. Scalable and Holistic Qualitative Data Cleaning. [Internet] [Thesis]. University of Waterloo; 2017. [cited 2021 Jan 21].
Available from: http://hdl.handle.net/10012/12138.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Chu X. Scalable and Holistic Qualitative Data Cleaning. [Thesis]. University of Waterloo; 2017. Available from: http://hdl.handle.net/10012/12138
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

Virginia Tech
3.
Gupta, Ragini.
A Framework for Data Quality for Synthetic Information.
Degree: MS, Industrial and Systems Engineering, 2014, Virginia Tech
URL: http://hdl.handle.net/10919/49675
Data quality has been an area of increasing interest for researchers in recent years due to the rapid emergence of 'big data' processes and applications. In this work, the data quality problem is viewed from the standpoint of synthetic information. Based on the structure and complexity of synthetic data, a need for a data quality framework specific to it was identified. This thesis presents this framework along with implementation details and results for a large synthetic dataset to which the developed testing framework is applied. A formal conceptual framework was designed for assessing the data quality of synthetic information. This framework involves developing analytical methods and software for assessing data quality for synthetic information. It includes dimensions of data quality that check the inherent properties of the data as well as evaluate it in the context of its use. The framework developed here is a software framework designed with attention to concerns such as scalability, generality, integrability, and modularity. A data abstraction layer has been introduced between the synthetic data and the tests. This abstraction layer has multiple benefits over direct access to the data by the tests: it decouples the tests from the data so that the details of storage and implementation are kept hidden from the user. We have implemented data quality measures for several quality dimensions: accuracy and precision, reliability, completeness, consistency, and validity. The particular tests and quality measures implemented span a range from low-level syntactic checks to high-level semantic quality measures. In each case, in addition to the results of the quality measure itself, we also present results on the computational performance (scalability) of the measure.
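To make the idea of a data abstraction layer between the data and the quality tests concrete, here is a small hypothetical sketch (class names, the test, and the data are invented; this is not the thesis's framework): tests are written against a row-iterator interface, so storage details stay hidden from them.

```python
# Hypothetical sketch of an abstraction layer that decouples quality tests
# from how the data is stored. Names and data are invented.
from abc import ABC, abstractmethod
from typing import Any, Iterable, Mapping

class DataSource(ABC):
    """Abstraction layer: tests see rows, not files or database details."""
    @abstractmethod
    def rows(self) -> Iterable[Mapping[str, Any]]: ...

class InMemorySource(DataSource):
    def __init__(self, records):
        self._records = records
    def rows(self):
        return iter(self._records)

def completeness_test(source: DataSource, field: str) -> float:
    """Low-level syntactic check: fraction of rows where `field` is present."""
    rows = list(source.rows())
    return sum(1 for r in rows if r.get(field) not in (None, "")) / max(len(rows), 1)

src = InMemorySource([{"age": 34}, {"age": None}, {"age": 51}])
print(completeness_test(src, "age"))   # 2 of 3 rows have a non-missing age
```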
Advisors/Committee Members: Bish, Douglas R. (committeechair), Swarup, Samarth (committee member), Marathe, Madhav Vishnu (committee member), Fraticelli, Barbara M. P. (committee member).
Subjects/Keywords: Data quality; Synthetic data; Testing
APA (6th Edition):
Gupta, R. (2014). A Framework for Data Quality for Synthetic Information. (Masters Thesis). Virginia Tech. Retrieved from http://hdl.handle.net/10919/49675
Chicago Manual of Style (16th Edition):
Gupta, Ragini. “A Framework for Data Quality for Synthetic Information.” 2014. Masters Thesis, Virginia Tech. Accessed January 21, 2021.
http://hdl.handle.net/10919/49675.
MLA Handbook (7th Edition):
Gupta, Ragini. “A Framework for Data Quality for Synthetic Information.” 2014. Web. 21 Jan 2021.
Vancouver:
Gupta R. A Framework for Data Quality for Synthetic Information. [Internet] [Masters thesis]. Virginia Tech; 2014. [cited 2021 Jan 21].
Available from: http://hdl.handle.net/10919/49675.
Council of Science Editors:
Gupta R. A Framework for Data Quality for Synthetic Information. [Masters Thesis]. Virginia Tech; 2014. Available from: http://hdl.handle.net/10919/49675
4.
BATISTA, Maria da Conceição Moraes.
Schema quality analysis in a data integration system.
Degree: 2008, Universidade Federal de Pernambuco
URL: http://repositorio.ufpe.br/handle/123456789/1335
Information Quality (IQ) has become a critical issue in organizations and in information systems research. Poor-quality information can have negative impacts on an organization's effectiveness. The growing use of data warehouses and the direct access of managers and users to information obtained from multiple sources have contributed to the growing need for quality in corporate information. The notion of IQ in information systems has emerged in recent years and has attracted increasing interest. There is still no common agreement on a definition of IQ, only a consensus that it is a concept of "fitness for use": information is considered appropriate for use from the perspective of a user's requirements and needs, that is, the quality of information depends on its usefulness.
Integrated access to information spread across multiple heterogeneous, distributed, and autonomous data sources is an important problem to be solved in many application domains. Typically there are several ways to obtain answers to global queries over data in different sources with different combinations; however, it is quite costly to obtain all possible answers. While much research has been done on query processing and on plan selection with cost criteria, little is known about the problem of incorporating IQ aspects into the global schemas of data integration systems.
In this work, we propose the analysis of IQ in a data integration system, more specifically the quality of the system's schemas. Our main goal is to improve the quality of query execution. Our proposal is based on the hypothesis that one way to optimize query processing is to build schemas with high IQ scores.
Thus, the focus of this work is on developing IQ analysis mechanisms aimed at data integration schemas, especially the global schema. Initially, we built a list of IQ criteria and related these criteria to the elements present in data integration systems. We then focused on the integrated schema and formally specified schema quality criteria: minimality, schema completeness, and type consistency. We also specified an adjustment algorithm to improve minimality, as well as algorithms to measure type consistency in the schemas. With these experiments we were able to show that the execution time of a query in a data integration system can decrease if the query is submitted to a schema with high minimality and type-consistency scores.
Advisors/Committee Members: SALGADO, Ana Carolina Brandão (advisor).
Subjects/Keywords: Information Quality; Data Quality; Data Integration
APA (6th Edition):
BATISTA, M. d. C. M. (2008). Schema quality analysis in a data integration system. (Thesis). Universidade Federal de Pernambuco. Retrieved from http://repositorio.ufpe.br/handle/123456789/1335
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
BATISTA, Maria da Conceição Moraes. “Schema quality analysis in a data integration system.” 2008. Thesis, Universidade Federal de Pernambuco. Accessed January 21, 2021.
http://repositorio.ufpe.br/handle/123456789/1335.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
BATISTA, Maria da Conceição Moraes. “Schema quality analysis in a data integration system.” 2008. Web. 21 Jan 2021.
Vancouver:
BATISTA MdCM. Schema quality analysis in a data integration system. [Internet] [Thesis]. Universidade Federal de Pernambuco; 2008. [cited 2021 Jan 21].
Available from: http://repositorio.ufpe.br/handle/123456789/1335.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
BATISTA MdCM. Schema quality analysis in a data integration system. [Thesis]. Universidade Federal de Pernambuco; 2008. Available from: http://repositorio.ufpe.br/handle/123456789/1335
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

Nelson Mandela Metropolitan University
5.
Saiod, Abdul Kader.
Data quality issues in electronic health records for large-scale databases.
Degree: 2019, Nelson Mandela Metropolitan University
URL: http://hdl.handle.net/10948/44577
Data Quality (DQ) in Electronic Health Records (EHRs) is one of the core functions that play a decisive role in improving the quality of healthcare services. DQ issues in EHRs are an increasingly prominent motivation for introducing an adaptive framework for interoperability and standards in Large-Scale Database (LSDB) management systems. Large-scale data communication is therefore challenging for traditional approaches to satisfy consumers' needs, as data is often not captured directly into the Database Management System (DBMS) in a timely enough fashion to enable its subsequent use. In addition, large data holds a great deal of value for all fields represented in the DBMS. EHR technology provides portfolio management systems that allow HealthCare Organisations (HCOs) to deliver a higher quality of care to their patients than is possible with paper-based records. EHRs are in high demand as HCOs run their daily services on ever-increasing numbers of huge datasets. Efficient EHR systems reduce data redundancy as well as system and application failures, and increase the possibility of drawing all necessary reports. However, one of the main challenges in developing efficient EHR systems is the inherent difficulty of coherently managing data from diverse heterogeneous sources. It is practically challenging to integrate diverse data into a global schema that satisfies users' needs. The efficient management of EHR systems using an existing DBMS presents challenges because of the incompatibility and sometimes inconsistency of data structures. As a result, no common methodological approach currently exists to effectively solve every data integration problem. The challenges of the DQ issue raised the need to find an efficient way to integrate large EHRs from diverse heterogeneous sources. To handle and align a large dataset efficiently, a hybrid method logically combining Fuzzy-Ontology techniques with a large-scale EHR analysis platform has shown improved accuracy. This study investigated and addressed the DQ issues raised and the interventions needed to overcome these barriers and challenges, including the provision of EHRs as they pertain to DQ, and combined features to search, extract, filter, clean, and integrate data so that users can coherently create new, consistent data sets. The study designed a hybrid method based on Fuzzy-Ontology, with mathematical simulations based on a Markov Chain probability model. Similarity measurement based on a dynamic Hungarian algorithm was developed following the Design Science Research (DSR) methodology, which will increase the quality of service across HCOs in adaptive frameworks.
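The abstract mentions similarity measurement based on a dynamic Hungarian algorithm for aligning records from heterogeneous sources. As a loose, hypothetical sketch of that general idea (not the thesis's dynamic variant), the classical Hungarian algorithm from SciPy can align two small record lists by maximising pairwise string similarity; the records and the similarity function are invented.

```python
# Sketch only: align records from two sources with the Hungarian algorithm
# (scipy's linear_sum_assignment), maximising total pairwise similarity.
import numpy as np
from difflib import SequenceMatcher
from scipy.optimize import linear_sum_assignment

source_a = ["hypertension", "type 2 diabetes", "asthma"]
source_b = ["diabetes mellitus type 2", "asthma bronchiale", "essential hypertension"]

sim = np.array([[SequenceMatcher(None, a, b).ratio() for b in source_b] for a in source_a])
rows, cols = linear_sum_assignment(-sim)        # negate to maximise similarity
for r, c in zip(rows, cols):
    print(f"{source_a[r]!r} <-> {source_b[c]!r}  (sim={sim[r, c]:.2f})")
```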
Subjects/Keywords: Healthcare – Data quality
APA (6th Edition):
Saiod, A. K. (2019). Data quality issues in electronic health records for large-scale databases. (Thesis). Nelson Mandela Metropolitan University. Retrieved from http://hdl.handle.net/10948/44577
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Saiod, Abdul Kader. “Data quality issues in electronic health records for large-scale databases.” 2019. Thesis, Nelson Mandela Metropolitan University. Accessed January 21, 2021.
http://hdl.handle.net/10948/44577.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Saiod, Abdul Kader. “Data quality issues in electronic health records for large-scale databases.” 2019. Web. 21 Jan 2021.
Vancouver:
Saiod AK. Data quality issues in electronic health records for large-scale databases. [Internet] [Thesis]. Nelson Mandela Metropolitan University; 2019. [cited 2021 Jan 21].
Available from: http://hdl.handle.net/10948/44577.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Saiod AK. Data quality issues in electronic health records for large-scale databases. [Thesis]. Nelson Mandela Metropolitan University; 2019. Available from: http://hdl.handle.net/10948/44577
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

McMaster University
6.
Al-janabi, Samir.
An Integrated Approach to Improve Data Quality.
Degree: PhD, 2016, McMaster University
URL: http://hdl.handle.net/11375/19161
A huge quantity of data is created and saved every day in databases from different types of data sources, including financial data, web log data, sensor data, and human input. Information technology enables organizations to collect and store large amounts of data in databases, and organizations worldwide use these data to support their activities through various applications. Data quality issues such as duplicate records, inaccurate data, violations of integrity constraints, and outdated data are common, so data in databases are often unclean. Such quality issues may cost billions of dollars annually and may have severe consequences for critical tasks such as analysis, decision making, and planning. Data cleaning processes are required to detect and correct errors in the unclean data. Despite the fact that there are multiple quality issues, current data cleaning techniques generally deal with only one or two aspects of quality. The techniques assume the availability of master data or training data, or the involvement of users in data cleaning. For instance, users might manually assign confidence scores that represent the correctness of data values, or they may be consulted about the repairs. In addition, the techniques may depend on high-quality master data or pre-labeled training data to fix errors. However, relying on human effort to correct errors is expensive, and master data or training data are not always available. These factors make it challenging to discover which values have issues, thereby making it difficult to fix the data (e.g., merging several duplicate records into a single representative record). To address these problems, we propose algorithms that integrate multiple data quality issues in the cleaning. In this thesis, we apply this approach in the context of multiple data quality issues where errors in data are introduced from multiple causes. The issues include duplicate records, violations of integrity constraints, inaccurate data, and outdated data. We fix these issues holistically, without the need for manual human interaction, master data, or training data. We propose an algorithm to tackle the problem of data cleaning, concentrating on duplicate records, violations of integrity constraints, and inaccurate data. We utilize the density information embedded in the data to eliminate duplicates, packing together tuples that are close to each other. Density information enables us to reduce manual user interaction in the deduplication process, as well as the dependency on master data or training data. To resolve inconsistency in duplicate records, we present a weight model that automatically assigns confidence scores based on the density of the data. We consider inconsistent data in terms of violations with respect to a set of functional dependencies (FDs), and present a cost model for data repair that is based on the weight model. To resolve inaccurate data in duplicate…
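For readers unfamiliar with the constraint side of this, a minimal sketch of detecting violations of a functional dependency (FD) X → Y by grouping tuples on X is shown below; the FD (zip → city) and the tuples are invented examples, not the thesis's repair algorithm.

```python
# Minimal sketch of FD violation detection: group on the left-hand side and
# flag groups with more than one right-hand-side value. Data is invented.
from collections import defaultdict

tuples = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},             # conflicts with the tuple above
    {"zip": "94105", "city": "San Francisco"},
]

def fd_violations(rows, lhs, rhs):
    groups = defaultdict(set)
    for row in rows:
        groups[row[lhs]].add(row[rhs])
    return {key: values for key, values in groups.items() if len(values) > 1}

print(fd_violations(tuples, "zip", "city"))      # flags zip 10001: two city values
```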
Advisors/Committee Members: Janicki, Ryszard, Computing and Software.
Subjects/Keywords: data management; data quality; data mining
APA (6th Edition):
Al-janabi, S. (2016). An Integrated Approach to Improve Data Quality. (Doctoral Dissertation). McMaster University. Retrieved from http://hdl.handle.net/11375/19161
Chicago Manual of Style (16th Edition):
Al-janabi, Samir. “An Integrated Approach to Improve Data Quality.” 2016. Doctoral Dissertation, McMaster University. Accessed January 21, 2021.
http://hdl.handle.net/11375/19161.
MLA Handbook (7th Edition):
Al-janabi, Samir. “An Integrated Approach to Improve Data Quality.” 2016. Web. 21 Jan 2021.
Vancouver:
Al-janabi S. An Integrated Approach to Improve Data Quality. [Internet] [Doctoral dissertation]. McMaster University; 2016. [cited 2021 Jan 21].
Available from: http://hdl.handle.net/11375/19161.
Council of Science Editors:
Al-janabi S. An Integrated Approach to Improve Data Quality. [Doctoral Dissertation]. McMaster University; 2016. Available from: http://hdl.handle.net/11375/19161

University of Edinburgh
7.
Ma, Shuai.
Extending dependencies for improving data quality.
Degree: PhD, 2011, University of Edinburgh
URL: http://hdl.handle.net/1842/5045
This doctoral thesis presents the results of my work on extending dependencies for improving data quality, both in a centralized environment with a single database and in a data exchange and integration environment with multiple databases. The first part of the thesis proposes five classes of data dependencies, referred to as CINDs, eCFDs, CFDcs, CFDps and CINDps, to capture data inconsistencies commonly found in practice in a centralized environment. For each class of these dependencies, we investigate two central problems: the satisfiability problem and the implication problem. The satisfiability problem is to determine, given a set Σ of dependencies defined on a database schema R, whether or not there exists a nonempty database D of R that satisfies Σ. The implication problem is to determine whether or not a set Σ of dependencies defined on a database schema R entails another dependency φ on R; that is, whether every database D of R that satisfies Σ must satisfy φ as well. These problems are important for the validation and optimization of data-cleaning processes. We establish complexity results for the satisfiability problem and the implication problem for all five classes of dependencies, both in the absence of finite-domain attributes and in the general setting with finite-domain attributes. Moreover, SQL-based techniques are developed to detect data inconsistencies for each class of the proposed dependencies, which can be easily implemented on top of current database management systems. The second part of the thesis studies three important topics for data cleaning in a data exchange and integration environment with multiple databases. One is the dependency propagation problem, which is to determine, given a view defined on data sources and a set of dependencies on the sources, whether another dependency is guaranteed to hold on the view. We investigate dependency propagation for views defined in various fragments of relational algebra, with conditional functional dependencies (CFDs) [FGJK08] as view dependencies, and with source dependencies given as either CFDs or traditional functional dependencies (FDs). We establish lower and upper bounds, all matching, ranging from PTIME to undecidable. These not only provide the first results for CFD propagation, but also extend the classical work on FD propagation by giving new complexity bounds in the presence of finite domains. We also provide the first algorithm for computing a minimal cover of all CFDs propagated via SPC views. The algorithm has the same complexity as one of the most efficient algorithms for computing a cover of FDs propagated via a projection view, despite the increased expressive power of CFDs and SPC views. Another topic is matching records from unreliable data sources. A class of matching dependencies (MDs) is introduced for specifying the semantics of unreliable data. As opposed to static constraints for schema design such as FDs, MDs are developed for record matching, and are defined in terms of similarity metrics and a…
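The abstract notes that SQL-based detection can run on top of an ordinary DBMS. As a hedged illustration of that general idea (the conditional functional dependency, table, and query below are invented examples, not taken from the thesis), a violation-detection query for the CFD "for country = 'UK', zip → city" can be issued through SQLite:

```python
# Sketch of SQL-based CFD violation detection executed on an ordinary DBMS.
# The CFD, the table, and the data are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cust (country TEXT, zip TEXT, city TEXT)")
conn.executemany("INSERT INTO cust VALUES (?, ?, ?)", [
    ("UK", "EH8 9AB", "Edinburgh"),
    ("UK", "EH8 9AB", "Edinburg"),    # violates: same UK zip, different city
    ("US", "10001",   "New York"),    # pattern (country = 'UK') does not apply
])

# Violations of the CFD are UK zips associated with more than one city.
violations = conn.execute("""
    SELECT zip, COUNT(DISTINCT city) AS cities
    FROM cust
    WHERE country = 'UK'
    GROUP BY zip
    HAVING COUNT(DISTINCT city) > 1
""").fetchall()
print(violations)   # [('EH8 9AB', 2)]
```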
Subjects/Keywords: 005.3; data quality; data repairing; data dependencies
APA (6th Edition):
Ma, S. (2011). Extending dependencies for improving data quality. (Doctoral Dissertation). University of Edinburgh. Retrieved from http://hdl.handle.net/1842/5045
Chicago Manual of Style (16th Edition):
Ma, Shuai. “Extending dependencies for improving data quality.” 2011. Doctoral Dissertation, University of Edinburgh. Accessed January 21, 2021.
http://hdl.handle.net/1842/5045.
MLA Handbook (7th Edition):
Ma, Shuai. “Extending dependencies for improving data quality.” 2011. Web. 21 Jan 2021.
Vancouver:
Ma S. Extending dependencies for improving data quality. [Internet] [Doctoral dissertation]. University of Edinburgh; 2011. [cited 2021 Jan 21].
Available from: http://hdl.handle.net/1842/5045.
Council of Science Editors:
Ma S. Extending dependencies for improving data quality. [Doctoral Dissertation]. University of Edinburgh; 2011. Available from: http://hdl.handle.net/1842/5045

McMaster University
8.
Huang, Yu.
Relational Data Curation by Deduplication, Anonymization, and Diversification.
Degree: PhD, 2020, McMaster University
URL: http://hdl.handle.net/11375/26009
Enterprises acquire large amounts of data from a variety of sources with the goal of extracting valuable insights and enabling informed analysis. Unfortunately, organizations continue to be hindered by poor data quality as they wrangle with their data to extract value, since real datasets are rarely error-free. Poor data quality is a pervasive problem that spans all industries, causing unreliable data analysis and costing billions of dollars. The large number of datasets, the pace of data acquisition, and the heterogeneity of data sources pose challenges to achieving high-quality data. These challenges are further exacerbated by data privacy and data diversity requirements. In this thesis, we study and propose solutions to address data duplication, managing the trade-off between data cleaning and data privacy, and computing diverse data instances.
In the first part of this thesis, we address the data duplication problem. We propose a duplication detection framework that combines word embeddings with constraints among attributes to improve the accuracy of deduplication, and we propose a set of constraint-based statistical features to capture the semantic relationships among attributes. We show that our techniques achieve comparable accuracy on real datasets. In the second part, we study the interplay of data privacy and data cleaning, and present a Privacy-Aware data Cleaning-As-a-Service (PACAS) framework to protect privacy during the cleaning process. Our evaluation shows that PACAS safeguards semantically related sensitive values and provides lower repair errors than existing privacy-aware cleaning techniques. In the third part, we study the problem of finding a diverse anonymized data instance, where diversity is measured via a set of diversity constraints, and propose an algorithm that seeks a k-anonymous relation with value suppression while satisfying the given diversity constraints. We conduct extensive experiments using real and synthetic data, showing the effectiveness of our techniques and improvement over existing baselines.
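As a rough sketch of how textual similarity and constraint-style signals between attributes might be combined into a duplicate score (the features, weights, and threshold are invented, and difflib stands in for the learned word embeddings mentioned in the abstract):

```python
# Hypothetical duplicate-pair scoring: mix a textual similarity feature with a
# constraint-style agreement feature. Weights and threshold are invented.
from difflib import SequenceMatcher

def pair_features(r1, r2):
    # textual similarity of names (a crude stand-in for embedding similarity)
    name_sim = SequenceMatcher(None, r1["name"], r2["name"]).ratio()
    # constraint-style signal: matching phone numbers strongly suggest a duplicate
    phone_agrees = 1.0 if r1["phone"] == r2["phone"] else 0.0
    return {"name_sim": name_sim, "phone_agrees": phone_agrees}

def duplicate_score(features, weights=None):
    weights = weights or {"name_sim": 0.7, "phone_agrees": 0.3}   # invented weights
    return sum(weights[k] * v for k, v in features.items())

a = {"name": "Jon Smith",  "phone": "555-0101"}
b = {"name": "John Smith", "phone": "555-0101"}
print(duplicate_score(pair_features(a, b)) > 0.8)   # high score -> likely duplicates
```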
Advisors/Committee Members: Chiang, Fei, Computing and Software.
Subjects/Keywords: data quality; data cleaning; data privacy
APA (6th Edition):
Huang, Y. (2020). Relational Data Curation by Deduplication, Anonymization, and Diversification. (Doctoral Dissertation). McMaster University. Retrieved from http://hdl.handle.net/11375/26009
Chicago Manual of Style (16th Edition):
Huang, Yu. “Relational Data Curation by Deduplication, Anonymization, and Diversification.” 2020. Doctoral Dissertation, McMaster University. Accessed January 21, 2021.
http://hdl.handle.net/11375/26009.
MLA Handbook (7th Edition):
Huang, Yu. “Relational Data Curation by Deduplication, Anonymization, and Diversification.” 2020. Web. 21 Jan 2021.
Vancouver:
Huang Y. Relational Data Curation by Deduplication, Anonymization, and Diversification. [Internet] [Doctoral dissertation]. McMaster University; 2020. [cited 2021 Jan 21].
Available from: http://hdl.handle.net/11375/26009.
Council of Science Editors:
Huang Y. Relational Data Curation by Deduplication, Anonymization, and Diversification. [Doctoral Dissertation]. McMaster University; 2020. Available from: http://hdl.handle.net/11375/26009
9.
변, 정현.
Development of standardized stepwise evaluation and integrated management model for Clinical Big data Quality improvement.
Degree: 2020, Ajou University
URL: http://repository.ajou.ac.kr/handle/201003/19212 ; http://dcoll.ajou.ac.kr:9080/dcollection/jsp/common/DcLoOrgPer.jsp?sItemId=000000030235
Abstract not disclosed. Table of contents:
I. Introduction
A. Background
B. Purpose of study
II. Methods
A. Selection of Data Quality Assessment concepts in literature
B. Integration of Data Quality checks
C. Process design of Data Quality Assessment
D. Design of Data Quality Assessment model
E. Data Quality Score Definition for Data Quality Assessment
F. Selecting a weight for comprehensive data quality index
G. Test data for Data Quality Assessment
H. Performance evaluation of Data Quality Assessment model
I. Development of a system to utilize Data Quality Assessment model
III. Result
A. Integration result of Data Quality checks
B. Test data result of Data Quality Assessment
C. Comparison with Data Quality Assessment Tool
D. Verification of Data Quality Assessment
E. Development system of DQUEEN
IV. Discussion
A. Consideration of the study
B. Limitation of this study
V. Conclusion
References
Doctor
Advisors/Committee Members: 대학원 의학과, 201324378, 변, 정현.
Subjects/Keywords: Data Quality; Data Quality Assessment; EHR Data Quality; Data Quality management; Data Quality Assessment model; OMOP-CDM; Measure the Data Quality; Common Data Model; Distributed Research Network
APA (6th Edition):
변, . (2020). Development of standardized stepwise evaluation and integrated management model for Clinical Big data Quality improvement. (Thesis). Ajou University. Retrieved from http://repository.ajou.ac.kr/handle/201003/19212 ; http://dcoll.ajou.ac.kr:9080/dcollection/jsp/common/DcLoOrgPer.jsp?sItemId=000000030235
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
변, 정현. “Development of standardized stepwise evaluation and integrated management model for Clinical Big data Quality improvement.” 2020. Thesis, Ajou University. Accessed January 21, 2021.
http://repository.ajou.ac.kr/handle/201003/19212 ; http://dcoll.ajou.ac.kr:9080/dcollection/jsp/common/DcLoOrgPer.jsp?sItemId=000000030235.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
변, 정현. “Development of standardized stepwise evaluation and integrated management model for Clinical Big data Quality improvement.” 2020. Web. 21 Jan 2021.
Vancouver:
변 . Development of standardized stepwise evaluation and integrated management model for Clinical Big data Quality improvement. [Internet] [Thesis]. Ajou University; 2020. [cited 2021 Jan 21].
Available from: http://repository.ajou.ac.kr/handle/201003/19212 ; http://dcoll.ajou.ac.kr:9080/dcollection/jsp/common/DcLoOrgPer.jsp?sItemId=000000030235.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
변 . Development of standardized stepwise evaluation and integrated management model for Clinical Big data Quality improvement. [Thesis]. Ajou University; 2020. Available from: http://repository.ajou.ac.kr/handle/201003/19212 ; http://dcoll.ajou.ac.kr:9080/dcollection/jsp/common/DcLoOrgPer.jsp?sItemId=000000030235
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

University of the Western Cape
10.
Hattas, Mogamat Mahier.
A quality Assurance framework for digital household survey processes in South Africa.
Degree: 2019, University of the Western Cape
URL: http://hdl.handle.net/11394/7172
Official household-based survey statistics are predominantly collected using the paper-and-pen data collection (PAPDC) method. In recent times, the world has seen a global rise in the use of digital technology, especially mobile handheld devices, for the collection of survey data in various fields of statistical collection. Various sectors of the population require data for a multitude of purposes, from planning and monitoring to the evaluation of projects and programmes. The pressure to obtain data often requires data or information producers to gather more data or information, more frequently, with improved quality, efficiency, and accuracy.
The quality of the data or information collected remains uncertain as more surveys enter the global arena, and overall survey quality needs to improve continuously. The data used may not be trustworthy, and users should be aware of this: there should be a continuous, holistic assessment of the validity and reliability of data before these are used (T. Chen, Raeside, & Khan, 2014). Digital data collection (DDC) offers national statistical organisations (NSOs) in Africa possible, albeit partial, solutions to several current quality, performance, and cost-efficiency concerns. Potential benefits found in the literature for DDC methods over PAPDC methods include, inter alia: increased speed of data collection, increased data accuracy, timeous availability of data, higher data quality, effective data security, and lower costs for data-collection processes. Most NSOs in Africa, including South Africa, currently rely on manual, paper-based data collection methods for continuous official household survey collection. Paper-based methods tend to be slower, rely on manual reporting, and involve more survey-intensive resources. With the rise of handheld mobile Global Positioning System (GPS) enabled devices, official household surveys can be monitored spatially and in real time, and the information can be securely synchronised to a central secure database to allow for immediate post-processing and data analysis.
Advisors/Committee Members: Breytenbach, Johan (advisor).
Subjects/Keywords: data collector; enumerator; Digital Household; Quality assurance; Data collection; Data quality management; Quality-assessment framework
APA (6th Edition):
Hattas, M. M. (2019). A quality Assurance framework for digital household survey processes in South Africa. (Thesis). University of the Western Cape. Retrieved from http://hdl.handle.net/11394/7172
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Hattas, Mogamat Mahier. “A quality Assurance framework for digital household survey processes in South Africa.” 2019. Thesis, University of the Western Cape. Accessed January 21, 2021.
http://hdl.handle.net/11394/7172.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Hattas, Mogamat Mahier. “A quality Assurance framework for digital household survey processes in South Africa.” 2019. Web. 21 Jan 2021.
Vancouver:
Hattas MM. A quality Assurance framework for digital household survey processes in South Africa. [Internet] [Thesis]. University of the Western Cape; 2019. [cited 2021 Jan 21].
Available from: http://hdl.handle.net/11394/7172.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Hattas MM. A quality Assurance framework for digital household survey processes in South Africa. [Thesis]. University of the Western Cape; 2019. Available from: http://hdl.handle.net/11394/7172
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

Vanderbilt University
11.
Parr, Sharidan Kristen.
Automated Mapping of Laboratory Tests to LOINC Codes using Noisy Labels in a National Electronic Health Record System Database.
Degree: MS, Biomedical Informatics, 2018, Vanderbilt University
URL: http://hdl.handle.net/1803/13005
Standards such as the Logical Observation Identifiers Names and Codes (LOINC®) are critical for interoperability and for integrating data into common data models, but are inconsistently used. Without consistent mapping to standards, clinical data cannot be harmonized, shared, or interpreted in a meaningful context. We sought to develop an automated machine learning pipeline that leverages noisy labels to map laboratory data to LOINC codes. Across 130 sites in the Department of Veterans Affairs Corporate Data Warehouse, we selected the 150 most commonly used laboratory tests with numeric results per site from 2000 through 2016. Using source data text and numeric fields, we developed a machine learning model and manually validated random samples from both labeled and unlabeled datasets. The raw laboratory data consisted of >6.5 billion test results, with 2,215 distinct LOINC codes. The model predicted the correct LOINC code in 85% of the unlabeled data and 96% of the labeled data by test frequency. In the subset of labeled data where the original and model-predicted LOINC codes disagreed, the model-predicted LOINC code was correct in 83% of the data by test frequency. Using a completely automated process, we are able to assign LOINC codes to unlabeled data with high accuracy. When the model-predicted LOINC code differed from the original LOINC code, the model prediction was correct in the vast majority of cases. This scalable, automated algorithm may improve data quality and interoperability, while substantially reducing the manual effort currently needed to accurately map laboratory data.
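A minimal sketch of the general shape of such a pipeline, assuming scikit-learn and pandas (the field names, the tiny training table, the placeholder LOINC labels, and the model choice are all invented; this is not the pipeline developed in the thesis):

```python
# Hypothetical sketch: learn a mapping from local test descriptions plus a
# numeric summary feature to LOINC labels. LOINC labels below are placeholders
# for illustration only.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

train = pd.DataFrame({
    "test_name":  ["glucose serum", "sodium serum", "glucose blood", "potassium serum"],
    "mean_value": [105.0, 139.0, 98.0, 4.2],
    "loinc":      ["2345-7", "2951-2", "2345-7", "2823-3"],
})

features = ColumnTransformer([
    ("text", TfidfVectorizer(), "test_name"),      # free-text local test names
    ("num", "passthrough", ["mean_value"]),        # numeric result summaries
])
model = Pipeline([("features", features),
                  ("clf", RandomForestClassifier(random_state=0))])
model.fit(train[["test_name", "mean_value"]], train["loinc"])

# predicted label for an unseen local test name
print(model.predict(pd.DataFrame({"test_name": ["serum glucose"], "mean_value": [110.0]})))
```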
Advisors/Committee Members: Thomas Lasko (committee member), Matthew Shotwell (committee member), Michael Matheny (Committee Chair).
Subjects/Keywords: Laboratory; Data Quality; Machine Learning
APA (6th Edition):
Parr, S. K. (2018). Automated Mapping of Laboratory Tests to LOINC Codes using Noisy Labels in a National Electronic Health Record System Database. (Thesis). Vanderbilt University. Retrieved from http://hdl.handle.net/1803/13005
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Parr, Sharidan Kristen. “Automated Mapping of Laboratory Tests to LOINC Codes using Noisy Labels in a National Electronic Health Record System Database.” 2018. Thesis, Vanderbilt University. Accessed January 21, 2021.
http://hdl.handle.net/1803/13005.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Parr, Sharidan Kristen. “Automated Mapping of Laboratory Tests to LOINC Codes using Noisy Labels in a National Electronic Health Record System Database.” 2018. Web. 21 Jan 2021.
Vancouver:
Parr SK. Automated Mapping of Laboratory Tests to LOINC Codes using Noisy Labels in a National Electronic Health Record System Database. [Internet] [Thesis]. Vanderbilt University; 2018. [cited 2021 Jan 21].
Available from: http://hdl.handle.net/1803/13005.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Parr SK. Automated Mapping of Laboratory Tests to LOINC Codes using Noisy Labels in a National Electronic Health Record System Database. [Thesis]. Vanderbilt University; 2018. Available from: http://hdl.handle.net/1803/13005
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

University of Waterloo
12.
Copes, Nicholas.
A Planning based Evaluation of Spatial Data Quality of OpenStreetMap Building Footprints in Canada.
Degree: 2019, University of Waterloo
URL: http://hdl.handle.net/10012/14712
OpenStreetMap (OSM) is an editable world map where users can create and retrieve data. Building footprints are an OSM dataset of particular interest, as this data has many useful applications for planners and academic professionals. Measuring the spatial data quality of OSM building footprints remains a challenge, as there are numerous quality measures that can be used, and existing studies have focused on other OSM datasets or on a single quality measure. The study performed in this thesis developed a set of ArcGIS models to test numerous spatial data quality measures for OSM building footprints in a sample of mid-sized Canadian municipalities and to gain a comprehensive understanding of spatial data quality. The models performed tests by comparing against municipal datasets as well as by determining other quality measures without a reference dataset. The study found that the overall spatial data quality of OSM building footprints varies across mid-sized municipalities in Canada, and that there is no link between a municipality's location or perceived importance and its level of spatial data quality. The study also found that commercial areas have a higher level of completeness than residential areas. While the models worked well for testing numerous spatial data quality measures for building footprints and can be used by others on other building footprint datasets, some limitations exist. Certain tests that identify potential building footprint errors need to be checked to confirm that they are indeed errors, and the models were not able to measure any aspect of shape metrics. Suggestions for further studies include measuring shape metrics of building footprints from OSM, as well as encouraging and subsequently monitoring OSM contributions in a particular area.
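As a toy illustration of one completeness-style measure of the sort compared in the thesis (total OSM footprint area versus a municipal reference dataset), with hand-made polygons standing in for real building footprints, which would normally be handled in ArcGIS or a similar GIS rather than like this:

```python
# Toy sketch of an area-based completeness measure. Polygons are invented;
# a real comparison would use projected coordinates and spatial joins.
from shapely.geometry import box

osm_footprints = [box(0, 0, 10, 10), box(20, 0, 28, 8)]                     # 100 + 64
municipal_footprints = [box(0, 0, 10, 10), box(20, 0, 30, 10), box(40, 40, 45, 45)]

osm_area = sum(p.area for p in osm_footprints)
ref_area = sum(p.area for p in municipal_footprints)
print(f"area completeness: {osm_area / ref_area:.2%}")                      # 164 / 225 ≈ 72.89%
```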
Subjects/Keywords: OpenStreetMap; spatial data quality; VGI
APA (6th Edition):
Copes, N. (2019). A Planning based Evaluation of Spatial Data Quality of OpenStreetMap Building Footprints in Canada. (Thesis). University of Waterloo. Retrieved from http://hdl.handle.net/10012/14712
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Copes, Nicholas. “A Planning based Evaluation of Spatial Data Quality of OpenStreetMap Building Footprints in Canada.” 2019. Thesis, University of Waterloo. Accessed January 21, 2021.
http://hdl.handle.net/10012/14712.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Copes, Nicholas. “A Planning based Evaluation of Spatial Data Quality of OpenStreetMap Building Footprints in Canada.” 2019. Web. 21 Jan 2021.
Vancouver:
Copes N. A Planning based Evaluation of Spatial Data Quality of OpenStreetMap Building Footprints in Canada. [Internet] [Thesis]. University of Waterloo; 2019. [cited 2021 Jan 21].
Available from: http://hdl.handle.net/10012/14712.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Copes N. A Planning based Evaluation of Spatial Data Quality of OpenStreetMap Building Footprints in Canada. [Thesis]. University of Waterloo; 2019. Available from: http://hdl.handle.net/10012/14712
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

University of Waterloo
13.
Farid, Mina.
Extracting and Cleaning RDF Data.
Degree: 2020, University of Waterloo
URL: http://hdl.handle.net/10012/15934
The RDF data model has become a prevalent format for representing heterogeneous data because of its versatility. The capability of dismantling information from its native formats and representing it in triple format offers a simple yet powerful way of modelling data obtained from multiple sources. In addition, the triple format and schema constraints of the RDF model make RDF data easy to process as labeled, directed graphs.
This graph representation of RDF data supports higher-level analytics by enabling querying with different techniques and query languages, e.g., SPARQL. Analytics that require structured data are supported by transforming the graph data on the fly to populate the target schema needed for downstream analysis. These target schemas are defined by downstream applications according to their information needs.
The flexibility of RDF data brings two main challenges. First, the extraction of RDF data is a complex task that may involve domain expertise about the information to be extracted for different applications. Another significant aspect of analyzing RDF data is its quality, which depends on multiple factors including the reliability of data sources and the accuracy of the extraction systems. The quality of the analysis depends mainly on the quality of the underlying data. Therefore, evaluating and improving the quality of RDF data has a direct effect on the correctness of downstream analytics.
This work presents multiple approaches related to the extraction and quality evaluation of RDF data. To cope with the large amounts of data that need to be extracted, we present DSTLR, a scalable framework to extract RDF triples from semi-structured and unstructured data sources. For rare entities that fall on the long tail of information, there may not be enough signals to support high-confidence extraction; to address this, we present an approach to estimate property values for long-tail entities. We also present multiple algorithms and approaches that focus on the quality of RDF data, including discovering quality constraints from RDF data and utilizing machine learning techniques to repair errors in RDF data.
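A small sketch of what one such quality check on RDF data can look like with rdflib (the tiny graph, the ex: namespace, and the missing-name rule are invented; this is unrelated to the DSTLR framework or the thesis's constraint-discovery algorithms):

```python
# Sketch of a completeness-style check on RDF data: find subjects typed as
# Person that lack a name triple. Graph contents are invented.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.alice, RDF.type, EX.Person))
g.add((EX.alice, EX.name, Literal("Alice")))
g.add((EX.bob, RDF.type, EX.Person))           # bob has no ex:name triple

missing_name = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?s WHERE {
        ?s a ex:Person .
        FILTER NOT EXISTS { ?s ex:name ?n }
    }
""")
print([str(row.s) for row in missing_name])    # ['http://example.org/bob']
```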
Subjects/Keywords: rdf; data quality; information extraction
APA (6th Edition):
Farid, M. (2020). Extracting and Cleaning RDF Data. (Thesis). University of Waterloo. Retrieved from http://hdl.handle.net/10012/15934
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Farid, Mina. “Extracting and Cleaning RDF Data.” 2020. Thesis, University of Waterloo. Accessed January 21, 2021.
http://hdl.handle.net/10012/15934.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Farid, Mina. “Extracting and Cleaning RDF Data.” 2020. Web. 21 Jan 2021.
Vancouver:
Farid M. Extracting and Cleaning RDF Data. [Internet] [Thesis]. University of Waterloo; 2020. [cited 2021 Jan 21].
Available from: http://hdl.handle.net/10012/15934.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Farid M. Extracting and Cleaning RDF Data. [Thesis]. University of Waterloo; 2020. Available from: http://hdl.handle.net/10012/15934
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

University of Manchester
14.
Emran, Nurul Akmar Binti.
DEFINITION AND ANALYSIS OF POPULATION-BASED DATA COMPLETENESS MEASUREMENT.
Degree: 2011, University of Manchester
URL: http://www.manchester.ac.uk/escholar/uk-ac-man-scw:132588
Poor quality data such as data with errors or missing values cause negative consequences in many application domains. An important aspect of data quality is completeness. One problem in data completeness is the problem of missing individuals in data sets. Within a data set, the individuals refer to the real world entities whose information is recorded. So far, in completeness studies however, there has been little discussion about how missing individuals are assessed. In this thesis, we propose the notion of population-based completeness (PBC) that deals with the missing individuals problem, with the aim of investigating what is required to measure PBC and to identify what is needed to support PBC measurements in practice. To achieve these aims, we analyse the elements of PBC and the requirements for PBC measurement, resulting in a definition of the PBC elements and PBC measurement formula. We propose an architecture for PBC measurement systems and determine the technical requirements of PBC systems in terms of software and hardware components. An analysis of the technical issues that arise in implementing PBC makes a contribution to an understanding of the feasibility of PBC measurements to provide accurate measurement results. Further exploration of a particular issue that was discovered in the analysis showed that when measuring PBC across multiple databases, data from those databases need to be integrated and materialised. Unfortunately, this requirement may lead to a large internal store for the PBC system that is impractical to maintain. We propose an approach to test the hypothesis that the available storage space can be optimised by materialising only partial information from the contributing databases, while retaining accuracy of the PBC measurements. Our approach involves substituting some of the attributes from the contributing databases with smaller alternatives, by exploiting the approximate functional dependencies (AFDs) that can be discovered within each local database. An analysis of the space-accuracy trade-offs of the approach leads to the development of an algorithm to assess candidate alternative attributes in terms of space-saving and accuracy (of PBC measurement). The result of several case studies conducted for proxy assessment contributes to an understanding of the space-accuracy trade-offs offered by the proxies. A better understanding of dealing with the completeness problem has been achieved through the proposal and the investigation of PBC, in terms of the requirements to measure and to support PBC in practice.
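The abstract refers to a PBC measurement formula without stating it; assuming PBC is read as the fraction of individuals in a reference population that actually appear in the data set, a minimal sketch of such a ratio could look like this (identifiers and data are hypothetical, not the thesis's definition):
```python
# Minimal sketch of a population-based completeness (PBC) style ratio.
# The actual PBC formula is defined in the thesis; here we simply assume
# PBC = |individuals represented in the data set| / |reference population|.
def pbc(dataset_ids, reference_population_ids):
    """Fraction of the reference population's individuals present in the data set."""
    population = set(reference_population_ids)
    if not population:
        raise ValueError("reference population must be non-empty")
    present = set(dataset_ids) & population
    return len(present) / len(population)

# Hypothetical example: a local registry versus a reference population register.
registry = ["p1", "p2", "p2", "p5"]           # duplicates counted once
population = ["p1", "p2", "p3", "p4", "p5"]
print(pbc(registry, population))              # 0.6 -> 3 of 5 individuals recorded
```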
Advisors/Committee Members: Embury, Suzanne, Missier, Paolo.
Subjects/Keywords: completeness measurement; data quality
APA (6th Edition):
Emran, N. A. B. (2011). DEFINITION AND ANALYSIS OF POPULATION-BASED DATA COMPLETENESS MEASUREMENT. (Doctoral Dissertation). University of Manchester. Retrieved from http://www.manchester.ac.uk/escholar/uk-ac-man-scw:132588
Chicago Manual of Style (16th Edition):
Emran, Nurul Akmar Binti. “DEFINITION AND ANALYSIS OF POPULATION-BASED DATA COMPLETENESS MEASUREMENT.” 2011. Doctoral Dissertation, University of Manchester. Accessed January 21, 2021.
http://www.manchester.ac.uk/escholar/uk-ac-man-scw:132588.
MLA Handbook (7th Edition):
Emran, Nurul Akmar Binti. “DEFINITION AND ANALYSIS OF POPULATION-BASED DATA COMPLETENESS MEASUREMENT.” 2011. Web. 21 Jan 2021.
Vancouver:
Emran NAB. DEFINITION AND ANALYSIS OF POPULATION-BASED DATA COMPLETENESS MEASUREMENT. [Internet] [Doctoral dissertation]. University of Manchester; 2011. [cited 2021 Jan 21].
Available from: http://www.manchester.ac.uk/escholar/uk-ac-man-scw:132588.
Council of Science Editors:
Emran NAB. DEFINITION AND ANALYSIS OF POPULATION-BASED DATA COMPLETENESS MEASUREMENT. [Doctoral Dissertation]. University of Manchester; 2011. Available from: http://www.manchester.ac.uk/escholar/uk-ac-man-scw:132588

University of Manchester
15.
Emran, Nurul Akmar Binti.
Definition and analysis of population-based data completeness measurement.
Degree: PhD, 2011, University of Manchester
URL: https://www.research.manchester.ac.uk/portal/en/theses/definition-and-analysis-of-populationbased-data-completeness-measurement(bcf137fc-1550-4e26-89e5-2c605734da12).html
;
http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.542761
Poor quality data such as data with errors or missing values cause negative consequences in many application domains. An important aspect of data quality is completeness. One problem in data completeness is the problem of missing individuals in data sets. Within a data set, the individuals refer to the real world entities whose information is recorded. So far, in completeness studies however, there has been little discussion about how missing individuals are assessed. In this thesis, we propose the notion of population-based completeness (PBC) that deals with the missing individuals problem, with the aim of investigating what is required to measure PBC and to identify what is needed to support PBC measurements in practice. To achieve these aims, we analyse the elements of PBC and the requirements for PBC measurement, resulting in a definition of the PBC elements and PBC measurement formula. We propose an architecture for PBC measurement systems and determine the technical requirements of PBC systems in terms of software and hardware components. An analysis of the technical issues that arise in implementing PBC makes a contribution to an understanding of the feasibility of PBC measurements to provide accurate measurement results. Further exploration of a particular issue that was discovered in the analysis showed that when measuring PBC across multiple databases, data from those databases need to be integrated and materialised. Unfortunately, this requirement may lead to a large internal store for the PBC system that is impractical to maintain. We propose an approach to test the hypothesis that the available storage space can be optimised by materialising only partial information from the contributing databases, while retaining accuracy of the PBC measurements. Our approach involves substituting some of the attributes from the contributing databases with smaller alternatives, by exploiting the approximate functional dependencies (AFDs) that can be discovered within each local database. An analysis of the space-accuracy trade-offs of the approach leads to the development of an algorithm to assess candidate alternative attributes in terms of space-saving and accuracy (of PBC measurement). The result of several case studies conducted for proxy assessment contributes to an understanding of the space-accuracy trade-offs offered by the proxies. A better understanding of dealing with the completeness problem has been achieved through the proposal and the investigation of PBC, in terms of the requirements to measure and to support PBC in practice.
Subjects/Keywords: 004; completeness measurement; data quality
APA (6th Edition):
Emran, N. A. B. (2011). Definition and analysis of population-based data completeness measurement. (Doctoral Dissertation). University of Manchester. Retrieved from https://www.research.manchester.ac.uk/portal/en/theses/definition-and-analysis-of-populationbased-data-completeness-measurement(bcf137fc-1550-4e26-89e5-2c605734da12).html ; http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.542761
Chicago Manual of Style (16th Edition):
Emran, Nurul Akmar Binti. “Definition and analysis of population-based data completeness measurement.” 2011. Doctoral Dissertation, University of Manchester. Accessed January 21, 2021.
https://www.research.manchester.ac.uk/portal/en/theses/definition-and-analysis-of-populationbased-data-completeness-measurement(bcf137fc-1550-4e26-89e5-2c605734da12).html ; http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.542761.
MLA Handbook (7th Edition):
Emran, Nurul Akmar Binti. “Definition and analysis of population-based data completeness measurement.” 2011. Web. 21 Jan 2021.
Vancouver:
Emran NAB. Definition and analysis of population-based data completeness measurement. [Internet] [Doctoral dissertation]. University of Manchester; 2011. [cited 2021 Jan 21].
Available from: https://www.research.manchester.ac.uk/portal/en/theses/definition-and-analysis-of-populationbased-data-completeness-measurement(bcf137fc-1550-4e26-89e5-2c605734da12).html ; http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.542761.
Council of Science Editors:
Emran NAB. Definition and analysis of population-based data completeness measurement. [Doctoral Dissertation]. University of Manchester; 2011. Available from: https://www.research.manchester.ac.uk/portal/en/theses/definition-and-analysis-of-populationbased-data-completeness-measurement(bcf137fc-1550-4e26-89e5-2c605734da12).html ; http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.542761
16.
Rankins, Brooke Anne.
DQTunePipe: a Set of Python Tools for LIGO Detector Characterization.
Degree: M.S. in Physics, Physics, 2011, University of Mississippi
URL: https://egrove.olemiss.edu/etd/239
When LIGO's interferometers are in operation, many auxiliary data channels monitor and record the state of the instruments and surrounding environmental conditions. Analyzing these channels allows LIGO scientists to evaluate the quality of the data collected and veto data segments of poor quality. A set of scripts was built up in an ad hoc fashion, sometimes with limited documentation, to assist in this analysis. In this thesis, we present DQTunePipe, a set of Python modules to replace these scripts and aid in the detector characterization of the LIGO instruments. The use of Python makes the analysis method more compatible with existing LIGO tools. DQTunePipe improves data quality analysis by allowing users to select specific detector characterization tasks as well as providing a maintainable framework upon which additional modules may be built. The nature of the Python DQTunePipe code allows the addition of new features with great simplicity. This thesis details the structure of DQTunePipe, serves as its documentation at the time of this writing, and outlines the procedures for incorporating new features.
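DQTunePipe itself is documented in the thesis; purely as a sketch of the kind of pluggable-task design the abstract describes, where users select specific detector-characterization tasks and new modules can be registered with little effort, one might write (all names hypothetical, not the actual DQTunePipe API):
```python
# Generic sketch of a pluggable task pipeline (hypothetical names, not the
# actual DQTunePipe API): each task is a callable registered under a name,
# and users choose which tasks to run on a segment of channel data.
TASKS = {}

def register_task(name):
    def wrap(func):
        TASKS[name] = func
        return func
    return wrap

@register_task("range_check")
def range_check(samples, low=-10.0, high=10.0):
    """Flag indices of samples outside a plausible physical range."""
    return [i for i, x in enumerate(samples) if not (low <= x <= high)]

@register_task("flatline_check")
def flatline_check(samples):
    """Flag a segment whose samples never change (a stuck sensor)."""
    return len(set(samples)) <= 1

def run(selected, samples):
    """Run only the tasks the user selected, returning results by task name."""
    return {name: TASKS[name](samples) for name in selected}

print(run(["range_check", "flatline_check"], [0.1, 0.2, 99.0, 0.1]))
```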
Advisors/Committee Members: Marco Cavaglia, Emanuele Berti, Lucien Cremaldi.
Subjects/Keywords: Data Quality; Detchar; LIGO; Physics
APA (6th Edition):
Rankins, B. A. (2011). DQTunePipe: a Set of Python Tools for LIGO Detector Characterization. (Thesis). University of Mississippi. Retrieved from https://egrove.olemiss.edu/etd/239
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Rankins, Brooke Anne. “DQTunePipe: a Set of Python Tools for LIGO Detector Characterization.” 2011. Thesis, University of Mississippi. Accessed January 21, 2021.
https://egrove.olemiss.edu/etd/239.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Rankins, Brooke Anne. “DQTunePipe: a Set of Python Tools for LIGO Detector Characterization.” 2011. Web. 21 Jan 2021.
Vancouver:
Rankins BA. DQTunePipe: a Set of Python Tools for LIGO Detector Characterization. [Internet] [Thesis]. University of Mississippi; 2011. [cited 2021 Jan 21].
Available from: https://egrove.olemiss.edu/etd/239.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Rankins BA. DQTunePipe: a Set of Python Tools for LIGO Detector Characterization. [Thesis]. University of Mississippi; 2011. Available from: https://egrove.olemiss.edu/etd/239
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

University of Minnesota
17.
Johnson, Steven.
A Data Quality Framework for the Secondary Use of Electronic Health Information.
Degree: PhD, Health Informatics, 2016, University of Minnesota
URL: http://hdl.handle.net/11299/188958
Electronic health record (EHR) systems are designed to replace paper charts and facilitate the delivery of care. Since EHR data is now readily available in electronic form, it is increasingly used for other purposes. This is expected to improve health outcomes for patients; however, the benefits will only be realized if the data captured in the EHR is of sufficient quality to support these secondary uses. This research demonstrated that a healthcare data quality framework can be developed that produces metrics characterizing underlying EHR data quality, and that it can be used to quantify the impact of data quality issues on the correctness of the intended use of the data. The framework described in this research defined a Data Quality (DQ) Ontology and implemented an assessment method. The DQ Ontology was developed by mining the healthcare data quality literature for important terms used to discuss data quality concepts, and these terms were harmonized into an ontology. Four high-level data quality dimensions (CorrectnessMeasure, ConsistencyMeasure, CompletenessMeasure and CurrencyMeasure) categorize 19 lower-level Measures. The ontology serves as an unambiguous vocabulary and allows more precision when discussing healthcare data quality, and it is expressed with sufficient rigor that it can be used for logical inference and computation. The data quality framework was used to characterize the data quality of an EHR for 10 data quality Measures. The results demonstrate that data quality can be quantified and that Metrics can track data quality trends over time and for specific domain concepts. The DQ framework produces scalar quantities which can be computed on individual domain concepts and meaningfully aggregated at different levels of an information model. The data quality assessment process was also used to quantify the impact of data quality issues on a task: the EHR data was systematically degraded and a measure of the impact on the correctness of the CMS178 eMeasure (Urinary Catheter Removal after Surgery) was computed. This information can help healthcare organizations prioritize data quality improvement efforts to focus on the areas that are most important and determine whether the data can support its intended use.
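As a toy illustration of the kind of scalar measure such a framework produces (not the thesis's DQ Ontology or its assessment method), the sketch below computes a completeness score per domain concept and a simple aggregate over an EHR-like record set:
```python
# Sketch (not the thesis's DQ Ontology): compute a scalar completeness
# measure per domain concept (field) and aggregate it over a record set.
records = [
    {"patient_id": "1", "blood_pressure": "120/80", "discharge_date": "2016-01-02"},
    {"patient_id": "2", "blood_pressure": None,     "discharge_date": "2016-01-05"},
    {"patient_id": "3", "blood_pressure": "110/70", "discharge_date": None},
]

def completeness(records, concept):
    """Fraction of records with a non-missing value for one domain concept."""
    filled = sum(1 for r in records if r.get(concept) not in (None, ""))
    return filled / len(records)

concepts = ["patient_id", "blood_pressure", "discharge_date"]
per_concept = {c: completeness(records, c) for c in concepts}
overall = sum(per_concept.values()) / len(per_concept)   # simple aggregate
print(per_concept)        # per-concept completeness scores
print(round(overall, 2))  # aggregated score for the record set
```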
Subjects/Keywords: data quality; healthcare; informatics; ontology
APA (6th Edition):
Johnson, S. (2016). A Data Quality Framework for the Secondary Use of Electronic Health Information. (Doctoral Dissertation). University of Minnesota. Retrieved from http://hdl.handle.net/11299/188958
Chicago Manual of Style (16th Edition):
Johnson, Steven. “A Data Quality Framework for the Secondary Use of Electronic Health Information.” 2016. Doctoral Dissertation, University of Minnesota. Accessed January 21, 2021.
http://hdl.handle.net/11299/188958.
MLA Handbook (7th Edition):
Johnson, Steven. “A Data Quality Framework for the Secondary Use of Electronic Health Information.” 2016. Web. 21 Jan 2021.
Vancouver:
Johnson S. A Data Quality Framework for the Secondary Use of Electronic Health Information. [Internet] [Doctoral dissertation]. University of Minnesota; 2016. [cited 2021 Jan 21].
Available from: http://hdl.handle.net/11299/188958.
Council of Science Editors:
Johnson S. A Data Quality Framework for the Secondary Use of Electronic Health Information. [Doctoral Dissertation]. University of Minnesota; 2016. Available from: http://hdl.handle.net/11299/188958

University of Ontario Institute of Technology
18.
Keller, Alexander.
Data curation with ontology functional dependences.
Degree: 2017, University of Ontario Institute of Technology
URL: http://hdl.handle.net/10155/792
Poor data quality has become a pervasive issue due to the increasing complexity and size of modern datasets. Functional dependencies have been used in existing cleaning solutions to model syntactic equivalence; they are not able to model semantic equivalence, however. We advance the state of data quality constraints by defining, discovering, and cleaning Ontology Functional Dependencies. We define their theoretical foundations, including sound and complete axioms and a linear inference procedure. We develop algorithms for data verification, constraint discovery, data cleaning, ontology-versus-data inconsistency identification, and optimizations to each. Our experimental evaluation shows the scalability and accuracy of our algorithms. We show that ontology FDs are useful for capturing domain attribute relationships and can significantly reduce the number of false-positive errors in data cleaning techniques that rely on traditional FDs.
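A toy example of the idea, not the thesis's algorithms: a traditional FD check flags two tuples that disagree only because of synonymous values, whereas treating ontologically equivalent values as equal removes the false positive (the synonym table and data are hypothetical):
```python
# Toy sketch of the intuition behind ontology functional dependencies: the
# FD country -> capital flags the two tuples below as inconsistent, while
# treating ontologically equivalent values (synonyms) as equal does not.
SYNONYMS = {"Washington DC": "Washington, D.C.", "NYC": "New York City"}

def canon(value):
    """Map a value to a canonical representative of its synonym class."""
    return SYNONYMS.get(value, value)

tuples = [
    {"country": "USA", "capital": "Washington, D.C."},
    {"country": "USA", "capital": "Washington DC"},   # syntactically different, same entity
]

def fd_violations(tuples, lhs, rhs, use_ontology=False):
    norm = canon if use_ontology else (lambda v: v)
    seen, bad = {}, []
    for t in tuples:
        key, val = t[lhs], norm(t[rhs])
        if key in seen and seen[key] != val:
            bad.append(t)
        seen.setdefault(key, val)
    return bad

print(fd_violations(tuples, "country", "capital"))                     # one false-positive violation
print(fd_violations(tuples, "country", "capital", use_ontology=True))  # [] -- synonym resolved
```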
Advisors/Committee Members: Szlichta, Jaroslaw.
Subjects/Keywords: Constraints; Data; Quality; Cleaning; Discovery
APA (6th Edition):
Keller, A. (2017). Data curation with ontology functional dependences. (Thesis). University of Ontario Institute of Technology. Retrieved from http://hdl.handle.net/10155/792
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Keller, Alexander. “Data curation with ontology functional dependences.” 2017. Thesis, University of Ontario Institute of Technology. Accessed January 21, 2021.
http://hdl.handle.net/10155/792.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Keller, Alexander. “Data curation with ontology functional dependences.” 2017. Web. 21 Jan 2021.
Vancouver:
Keller A. Data curation with ontology functional dependences. [Internet] [Thesis]. University of Ontario Institute of Technology; 2017. [cited 2021 Jan 21].
Available from: http://hdl.handle.net/10155/792.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Keller A. Data curation with ontology functional dependences. [Thesis]. University of Ontario Institute of Technology; 2017. Available from: http://hdl.handle.net/10155/792
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

University of Illinois – Urbana-Champaign
19.
Zhi, Shi.
Integrating multiple conflicting sources by truth discovery and source quality estimation.
Degree: MS, 0112, 2014, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/50493
Multiple descriptions of the same entity from different sources will inevitably result in data or information inconsistency. Among conflicting pieces of information, which one is the most trustworthy? How can the fraudulence of a rumor be detected? Obviously, it is unrealistic to curate and validate the trustworthiness of every piece of information because of the high cost of human labeling and the lack of experts. To find the truth for each entity, much research has shown that considering the quality of information providers can improve the performance of data integration. Because the quality of data sources differs, it is hard to find a general solution that works for every case. Therefore, we start from a general setting of truth analysis and narrow down to two basic problems in data integration. We first propose a general framework for numerical data with the flexibility of defining a loss function; source quality is represented by a vector that models the source's credibility in different error intervals. We then propose a new method called the No Truth Truth Model (NTTM) to deal with the truth-existence problem in low-quality data. Preliminary experiments on real stock data and slot-filling data show promising results.
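As a minimal sketch of truth discovery with source-quality estimation (a generic iterative scheme, not NTTM or the thesis's framework), the snippet below alternates between estimating each entity's value as a weighted average of source claims and re-weighting sources by their average error:
```python
# Minimal truth-discovery sketch for numerical claims (not the thesis's NTTM):
# (1) estimate each entity's value as a weighted average of source claims,
# (2) re-weight sources by the inverse of their average absolute error.
claims = {   # entity -> {source: claimed value}, hypothetical stock prices
    "AAPL": {"s1": 101.0, "s2": 100.5, "s3": 120.0},
    "GOOG": {"s1": 50.2,  "s2": 50.0,  "s3": 49.8},
}

weights = {s: 1.0 for s in ("s1", "s2", "s3")}
for _ in range(10):
    # step 1: weighted-average truth estimate per entity
    truths = {}
    for entity, by_source in claims.items():
        total_w = sum(weights[s] for s in by_source)
        truths[entity] = sum(weights[s] * v for s, v in by_source.items()) / total_w
    # step 2: update source weights from their average absolute error
    for s in weights:
        errs = [abs(v - truths[e]) for e, by in claims.items()
                for src, v in by.items() if src == s]
        weights[s] = 1.0 / (sum(errs) / len(errs) + 1e-6)

print({e: round(t, 2) for e, t in truths.items()})   # estimated truths
print({s: round(w, 3) for s, w in weights.items()})  # s3 (the outlier) gets a low weight
```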
Advisors/Committee Members: Han, Jiawei (advisor).
Subjects/Keywords: Truth Discovery; Data Integration; Data Quality
APA (6th Edition):
Zhi, S. (2014). Integrating multiple conflicting sources by truth discovery and source quality estimation. (Thesis). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/50493
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Zhi, Shi. “Integrating multiple conflicting sources by truth discovery and source quality estimation.” 2014. Thesis, University of Illinois – Urbana-Champaign. Accessed January 21, 2021.
http://hdl.handle.net/2142/50493.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Zhi, Shi. “Integrating multiple conflicting sources by truth discovery and source quality estimation.” 2014. Web. 21 Jan 2021.
Vancouver:
Zhi S. Integrating multiple conflicting sources by truth discovery and source quality estimation. [Internet] [Thesis]. University of Illinois – Urbana-Champaign; 2014. [cited 2021 Jan 21].
Available from: http://hdl.handle.net/2142/50493.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Zhi S. Integrating multiple conflicting sources by truth discovery and source quality estimation. [Thesis]. University of Illinois – Urbana-Champaign; 2014. Available from: http://hdl.handle.net/2142/50493
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

University of the Western Cape
20.
Zimri, Irma Selina.
The complexities and possibilities of health data utilization in the West Coast District.
Degree: 2018, University of the Western Cape
URL: http://hdl.handle.net/11394/6175
In an ideal public health arena, scientific evidence should be incorporated into the health information practices of making management decisions, developing policies, and implementing programs. However, much effort has been spent on developing health information practices that focus mainly on data collection, data quality and processing, with relatively little development on the utilization side of the information spectrum. Although the South Africa Health National Indicator Dataset of 2013 routinely collects and reports on more than two hundred elements, the degree to which this information is being used is not empirically known. The overall aim of the study was to explore the dynamics of routine primary healthcare information utilization in the West Coast district while identifying specific interventions that could ultimately lead to the improved use of data to better inform decision making, the ultimate goal being to enable managers to better utilize their routine health information for effective decision making.
Advisors/Committee Members: Njenga, James K (advisor).
Subjects/Keywords: Data; Health information; Data quality; Data use; Health data
APA (6th Edition):
Zimri, I. S. (2018). The complexities and possibilities of health data utilization in the West Coast District. (Thesis). University of the Western Cape. Retrieved from http://hdl.handle.net/11394/6175
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Zimri, Irma Selina. “The complexities and possibilities of health data utilization in the West Coast District.” 2018. Thesis, University of the Western Cape. Accessed January 21, 2021.
http://hdl.handle.net/11394/6175.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Zimri, Irma Selina. “The complexities and possibilities of health data utilization in the West Coast District.” 2018. Web. 21 Jan 2021.
Vancouver:
Zimri IS. The complexities and possibilities of health data utilization in the West Coast District. [Internet] [Thesis]. University of the Western Cape; 2018. [cited 2021 Jan 21].
Available from: http://hdl.handle.net/11394/6175.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Zimri IS. The complexities and possibilities of health data utilization in the West Coast District. [Thesis]. University of the Western Cape; 2018. Available from: http://hdl.handle.net/11394/6175
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

Universidade Nova
21.
Adeboye, Adeyemi Adebayo.
Towards reinventing the statistical system of the central bank of nigeria for enhanced knowledge creation.
Degree: 2018, Universidade Nova
URL: https://www.rcaap.pt/detail.jsp?id=oai:run.unl.pt:10362/37699
The Central Bank of Nigeria (CBN) produces statistics that meet some of the data needs of monetary policy and other uses. How well this is fulfilled by the CBN depends on the quality of its statistical system, which has a direct implication for knowledge creation through the processed mass of statistical information. Questionnaires based on the IMF Data Quality Assessment Frameworks (DQAFs) for BOP & IIP statistics and monetary statistics are applied to evaluate the quality of the CBN statistical system. Extant sound practices and deficiencies of the statistical system are identified, while improvement measures and statistical innovations are suggested. Enabled by relevant organic laws, the CBN compiles statistics in a supportive environment with commensurate human and work-tool resources that meet the needs of statistical programs. Statistics production is carried out impartially and professionally, in broad conformity with IMF statistics manuals and compilation guides regarding concepts, scope, classification and sectorization, and in compliance with e-GDDS periodicity and timeliness for dissemination. Other observed sound statistical practices include valuation of transactions and positions using market prices or appropriate proxies, and recording, generally, of flows and stocks on an accrual basis, while compiled statistics are consistent within datasets and reconcilable over a time period. Some of the generic weaknesses are the absence of a statistics procedural guide; lack of routine evaluation and monitoring of statistical processes; inadequate branding to distinctively identify the bank’s statistical products; non-disclosure of changes in statistical practices; non-conduct of revision studies; and metadata concerns. The BOP & IIP statistics weaknesses comprise coverage inadequacies, sectorization/classification issues, lack of routine assessment of source data, and inadequate assessment and validation of intermediate data and statistical output, while for monetary statistics non-compilation of the OFCS is identified apart from the generic weaknesses. Recommendations include broadening source data, developing user-oriented statistical quality manuals, establishing comprehensive manuals of procedures and their corresponding statistical compilation techniques, integrating statistical auditing into the statistical system, enhancing metadata and conducting revision studies, among others.
Advisors/Committee Members: Lima, Susana Filipa de Moura, Agostinho, António.
Subjects/Keywords: Statistical System; Data Quality; Data Quality Assessment Framework; Statistical Auditing
APA (6th Edition):
Adeboye, A. A. (2018). Towards reinventing the statistical system of the central bank of nigeria for enhanced knowledge creation. (Thesis). Universidade Nova. Retrieved from https://www.rcaap.pt/detail.jsp?id=oai:run.unl.pt:10362/37699
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Adeboye, Adeyemi Adebayo. “Towards reinventing the statistical system of the central bank of nigeria for enhanced knowledge creation.” 2018. Thesis, Universidade Nova. Accessed January 21, 2021.
https://www.rcaap.pt/detail.jsp?id=oai:run.unl.pt:10362/37699.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Adeboye, Adeyemi Adebayo. “Towards reinventing the statistical system of the central bank of nigeria for enhanced knowledge creation.” 2018. Web. 21 Jan 2021.
Vancouver:
Adeboye AA. Towards reinventing the statistical system of the central bank of nigeria for enhanced knowledge creation. [Internet] [Thesis]. Universidade Nova; 2018. [cited 2021 Jan 21].
Available from: https://www.rcaap.pt/detail.jsp?id=oai:run.unl.pt:10362/37699.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Adeboye AA. Towards reinventing the statistical system of the central bank of nigeria for enhanced knowledge creation. [Thesis]. Universidade Nova; 2018. Available from: https://www.rcaap.pt/detail.jsp?id=oai:run.unl.pt:10362/37699
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
22.
Sarwar, Muhammad Azeem.
Assessing Data Quality of ERP and CRM Systems.
Degree: 2014, , Faculty of Technology and Society (TS)
URL: http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20137
Data Quality confirms the correct and meaningful representation of real-world information. Researchers have proposed frameworks to measure and analyze Data Quality; still, modern organizations find it very challenging to state the level of enterprise Data Quality maturity. This study aims to define the Data Quality of a system and to examine Data Quality Assessment practices. A definition of Data Quality is suggested with the help of a systematic literature review, which also provided a list of dimensions and initiatives for Data Quality Assessment. A survey was conducted to examine these aggregated aspects of Data Quality in an organization actively using ERP and CRM systems. The survey was aimed at collecting organizational awareness of Data Quality and at studying the practices followed to ensure Data Quality in ERP and CRM systems. The survey results identified data validity, accuracy and security as the main areas of interest for Data Quality. The results also indicate that, due to the audit requirements of ERP systems, ERP systems have a higher demand for Data Quality than CRM systems.
Subjects/Keywords: Data Quality; Data Quality Management; Quality Assessment; ERP; CRM; Engineering and Technology; Teknik och teknologier
APA (6th Edition):
Sarwar, M. A. (2014). Assessing Data Quality of ERP and CRM Systems. (Thesis). , Faculty of Technology and Society (TS). Retrieved from http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20137
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Sarwar, Muhammad Azeem. “Assessing Data Quality of ERP and CRM Systems.” 2014. Thesis, , Faculty of Technology and Society (TS). Accessed January 21, 2021.
http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20137.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Sarwar, Muhammad Azeem. “Assessing Data Quality of ERP and CRM Systems.” 2014. Web. 21 Jan 2021.
Vancouver:
Sarwar MA. Assessing Data Quality of ERP and CRM Systems. [Internet] [Thesis]. , Faculty of Technology and Society (TS); 2014. [cited 2021 Jan 21].
Available from: http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20137.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Sarwar MA. Assessing Data Quality of ERP and CRM Systems. [Thesis]. , Faculty of Technology and Society (TS); 2014. Available from: http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20137
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

University of Tennessee – Knoxville
23.
Juriga, David.
Improving Manufacturing Data Quality with Data Fusion and Advanced Algorithms for Improved Total Data Quality Management.
Degree: MS, Forestry, 2019, University of Tennessee – Knoxville
URL: https://trace.tennessee.edu/utk_gradthes/5492
Data mining and predictive analytics in the sustainable-biomaterials industries are currently not feasible given the lack of organization and management of the database structures. The advent of artificial intelligence, data mining, robotics, etc., has become a standard for successful business endeavors and is known as the ‘Fourth Industrial Revolution’ or ‘Industry 4.0’ in Europe. Data quality improvement through real-time multi-layer data fusion across interconnected networks and statistical quality assessment may improve the usefulness of databases maintained by these industries. Relational databases with a high degree of quality may be the gateway to predictive modeling and enhanced business analytics. Data quality is a key issue in the sustainable bio-materials industry. Untreated data from multiple databases (e.g., sensor data and destructive test data) are generally not in the right structure to perform advanced analytics. Some inherent problems of data from sensors that are stored in data warehouses at millisecond intervals include missing values, duplicate records, sensor failure data (data out of the feasible range), outliers, etc. These inherent problems of the untreated data represent information loss and mute predictive analytics. The goal of this data-science-focused research was to create a continuous real-time software algorithm for data cleaning that automatically aligns, fuses, and assesses data quality for missing fields and potential outliers. The program automatically reduces the variable size, imputes missing values, and predicts the destructive test data for every record in a database. Improved data quality was assessed using 10-fold cross-validation and the normalized root mean square error of prediction (NRMSEP) statistic. The impact of outliers and missing data was tested on a simulated dataset with 201 variations of outlier percentages ranging from 0-90% and missing data percentages ranging from 0-90%. The software program was also validated on a real dataset from the wood composites industry. One result of the research was that the number of sensors needed for accurate predictions is highly dependent on the correlation between independent variables and dependent variables. Overall, the data cleaning software program significantly decreased the NRMSEP of quality control variables for key destructive test values (e.g., internal bond, water absorption and modulus of rupture), from 64% to 12%.
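Purely as a sketch of two of the cleaning steps the abstract mentions, and of an NRMSEP-style statistic (here RMSE normalized by the observed range, one common convention that may differ from the thesis's exact definition), consider:
```python
import math

# Sketch of two cleaning steps (range filtering and mean imputation of sensor
# readings) plus an NRMSEP-style statistic; not the thesis's software.
def clean(values, low, high):
    """Replace out-of-range or missing sensor readings with the mean of the valid ones."""
    valid = [v for v in values if v is not None and low <= v <= high]
    mean = sum(valid) / len(valid)
    return [v if (v is not None and low <= v <= high) else mean for v in values]

def nrmsep(actual, predicted):
    """Root mean square error of prediction, normalized by the observed range."""
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
    return rmse / (max(actual) - min(actual))

raw = [12.1, None, 11.8, 999.0, 12.4]          # None = missing, 999.0 = sensor failure
print(clean(raw, low=0.0, high=100.0))         # failures imputed with the mean of valid readings
print(round(nrmsep([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]), 3))
```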
Advisors/Committee Members: Timothy Young, Alexander Petutschnig, Bogdan Bichescu, Terry Liles.
Subjects/Keywords: Big data; Industry 4.0; machine learning; quality control; data quality improvement; data fusion
APA (6th Edition):
Juriga, D. (2019). Improving Manufacturing Data Quality with Data Fusion and Advanced Algorithms for Improved Total Data Quality Management. (Thesis). University of Tennessee – Knoxville. Retrieved from https://trace.tennessee.edu/utk_gradthes/5492
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Juriga, David. “Improving Manufacturing Data Quality with Data Fusion and Advanced Algorithms for Improved Total Data Quality Management.” 2019. Thesis, University of Tennessee – Knoxville. Accessed January 21, 2021.
https://trace.tennessee.edu/utk_gradthes/5492.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Juriga, David. “Improving Manufacturing Data Quality with Data Fusion and Advanced Algorithms for Improved Total Data Quality Management.” 2019. Web. 21 Jan 2021.
Vancouver:
Juriga D. Improving Manufacturing Data Quality with Data Fusion and Advanced Algorithms for Improved Total Data Quality Management. [Internet] [Thesis]. University of Tennessee – Knoxville; 2019. [cited 2021 Jan 21].
Available from: https://trace.tennessee.edu/utk_gradthes/5492.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Juriga D. Improving Manufacturing Data Quality with Data Fusion and Advanced Algorithms for Improved Total Data Quality Management. [Thesis]. University of Tennessee – Knoxville; 2019. Available from: https://trace.tennessee.edu/utk_gradthes/5492
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

Delft University of Technology
24.
Fardani Haryadi, A. (author).
Requirements on and Antecedents of Big Data Quality: An Empirical Examination to Improve Big Data Quality in Financial Service Organizations.
Degree: 2016, Delft University of Technology
URL: http://resolver.tudelft.nl/uuid:d41b297e-3194-4d36-97bb-603a949f97f4
Big data has been widely known for its enormous potential in various industries, including finance. However, more than half of financial service organizations report that big data has not yet delivered its expected value. Data quality is one of the issues behind this phenomenon, and it is therefore the focus of this study. The objective of this research project is to develop key requirements for a reference architecture that improves big data quality in financial institutions. A joint approach combining Requirements Engineering and Data Quality Management frameworks is used to derive the desired requirements. Data collection through three case studies, seven content analyses, and a literature review is performed to obtain the most comprehensive result. Overall, the findings indicate that antecedents of big data quality consist of data, technology, people, process and procedure, organization, and external aspects. They also encompass discovery of big data value, accessibility of data, and operationality of big data projects. Furthermore, there are six Information System requirements and two Human and Organizational requirements to improve big data quality.
Technology, Policy and Management
Multi-Actor Systems
Advisors/Committee Members: Janssen, M.F.W.H.A. (mentor), Hulstijn, J. (mentor), van der Voort, H.G. (mentor).
Subjects/Keywords: big data; data quality; big data quality; antecedents; requirements; reference architecture; finance
APA (6th Edition):
Fardani Haryadi, A. (. (2016). Requirements on and Antecedents of Big Data Quality: An Empirical Examination to Improve Big Data Quality in Financial Service Organizations. (Masters Thesis). Delft University of Technology. Retrieved from http://resolver.tudelft.nl/uuid:d41b297e-3194-4d36-97bb-603a949f97f4
Chicago Manual of Style (16th Edition):
Fardani Haryadi, A (author). “Requirements on and Antecedents of Big Data Quality: An Empirical Examination to Improve Big Data Quality in Financial Service Organizations.” 2016. Masters Thesis, Delft University of Technology. Accessed January 21, 2021.
http://resolver.tudelft.nl/uuid:d41b297e-3194-4d36-97bb-603a949f97f4.
MLA Handbook (7th Edition):
Fardani Haryadi, A (author). “Requirements on and Antecedents of Big Data Quality: An Empirical Examination to Improve Big Data Quality in Financial Service Organizations.” 2016. Web. 21 Jan 2021.
Vancouver:
Fardani Haryadi A(. Requirements on and Antecedents of Big Data Quality: An Empirical Examination to Improve Big Data Quality in Financial Service Organizations. [Internet] [Masters thesis]. Delft University of Technology; 2016. [cited 2021 Jan 21].
Available from: http://resolver.tudelft.nl/uuid:d41b297e-3194-4d36-97bb-603a949f97f4.
Council of Science Editors:
Fardani Haryadi A(. Requirements on and Antecedents of Big Data Quality: An Empirical Examination to Improve Big Data Quality in Financial Service Organizations. [Masters Thesis]. Delft University of Technology; 2016. Available from: http://resolver.tudelft.nl/uuid:d41b297e-3194-4d36-97bb-603a949f97f4

Texas State University – San Marcos
25.
Parr, David A.
The Production of Volunteered Geographic Information: A Study of OpenStreetMap in the United States.
Degree: PhD, Geographic Information Science, 2015, Texas State University – San Marcos
URL: https://digital.library.txstate.edu/handle/10877/5776
The arrival of the World Wide Web, smartphones, tablets and GPS units has increased the use, availability, and amount of digital geospatial information present on the Internet. Users can view maps, follow routes, find addresses, or share their locations in applications including Google Maps, Facebook, Foursquare, Waze and Twitter. These applications use digital geospatial information and rely on data sources of street networks and address listings. Previously, these data sources were mostly governmental or corporate, and much of the data was proprietary. Frustrated with the limited availability of free digital geospatial data, Steve Coast created the OpenStreetMap project in 2004 to collect a free, open, and global digital geospatial dataset. Now with over one million contributors from around the world, and a growing user base, the OpenStreetMap project has grown into a viable alternative source for digital geospatial information. The growth of the dataset relies on the contributions of volunteers who have been labeled ‘neogeographers’ because of their perceived lack of training in geography and cartography (Goodchild 2009b; Warf and Sui 2010; Connors, Lei, and Kelly 2012). This has raised many questions about the nature, quality, and use of OpenStreetMap data and contributors (Neis and Zielstra 2014; Neis and Zipf 2012; Estima and Painho 2013; Fan et al. 2014; Haklay and Weber 2008; Corcoran and Mooney 2013; Helbich et al. 2010; Mooney and Corcoran 2012b; Haklay 2010b; Budhathoki and Haythornthwaite 2013; Mooney, Corcoran, and Winstanley 2010; Mooney and Corcoran 2011; Haklay et al. 2010; Mooney, Corcoran, and Ciepluch 2013; Stephens 2013). This study aims to complement and contribute to this body of research on Volunteered Geographic Information in general and OpenStreetMap in particular by analyzing three aspects of OpenStreetMap geographic data. The first aspect considers the contributors to OSM by building a typology of contributors and analyzing contribution quality through the lens of this typology. This part of the study develops the Activity-Context-Geography model of VGI contribution, which uses three dimensions of VGI contributions: Activity (the amount and frequency of content creation, modification and deletion); Context (the technological and social circumstances that support a contribution); and Geography (the spatial dimensions of a contributor’s pattern). Using the complete OpenStreetMap dataset from 2005 to 2013 for the forty-eight contiguous United States and the District of Columbia, the study creates twenty clusters of contributors and examines the differences in positional accuracy of the contributors against two datasets of public school locations in Texas and California. The second part of the study considers the question of where mapping occurs by evaluating the spatial variability of OSM contributions and comparing mapping activity against population and socioeconomic variables in the US. The third part of the study considers the choices that OSM contributors make through the types of…
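As a small sketch of the positional-accuracy comparison described above (not the study's actual analysis), one could compute the great-circle offset between a contributed point and the reference location it is matched to:
```python
import math

# Sketch of a positional-accuracy check: great-circle distance between a
# contributed point and its reference location, e.g. an OSM school node
# versus an official school record. Coordinates below are hypothetical.
def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

osm_point = (30.2672, -97.7431)   # contributed point
reference = (30.2675, -97.7435)   # matched reference record
print(round(haversine_m(*osm_point, *reference), 1), "m offset")
```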
Advisors/Committee Members: Lu, Yongmei (advisor), Hagelman, Ronald (committee member), Chow, T. Edwin (committee member), Mark, David (committee member).
Subjects/Keywords: VGI; OpenStreetMap; Data quality; Digital mapping; Geospatial data; Geographic information systems
APA (6th Edition):
Parr, D. A. (2015). The Production of Volunteered Geographic Information: A Study of OpenStreetMap in the United States. (Doctoral Dissertation). Texas State University – San Marcos. Retrieved from https://digital.library.txstate.edu/handle/10877/5776
Chicago Manual of Style (16th Edition):
Parr, David A. “The Production of Volunteered Geographic Information: A Study of OpenStreetMap in the United States.” 2015. Doctoral Dissertation, Texas State University – San Marcos. Accessed January 21, 2021.
https://digital.library.txstate.edu/handle/10877/5776.
MLA Handbook (7th Edition):
Parr, David A. “The Production of Volunteered Geographic Information: A Study of OpenStreetMap in the United States.” 2015. Web. 21 Jan 2021.
Vancouver:
Parr DA. The Production of Volunteered Geographic Information: A Study of OpenStreetMap in the United States. [Internet] [Doctoral dissertation]. Texas State University – San Marcos; 2015. [cited 2021 Jan 21].
Available from: https://digital.library.txstate.edu/handle/10877/5776.
Council of Science Editors:
Parr DA. The Production of Volunteered Geographic Information: A Study of OpenStreetMap in the United States. [Doctoral Dissertation]. Texas State University – San Marcos; 2015. Available from: https://digital.library.txstate.edu/handle/10877/5776

University of Edinburgh
26.
Yu, Wenyuan.
Improving data quality : data consistency, deduplication, currency and accuracy.
Degree: PhD, 2013, University of Edinburgh
URL: http://hdl.handle.net/1842/8899
Data quality is one of the key problems in data management. An unprecedented amount of data has been accumulated and has become a valuable asset of an organization. The value of the data relies greatly on its quality. However, data is often dirty in real life. It may be inconsistent, duplicated, stale, inaccurate or incomplete, which can reduce its usability and increase the cost of businesses. Consequently, the need for improving data quality arises; it comprises five central issues: data consistency, data deduplication, data currency, data accuracy and information completeness. This thesis presents the results of our work on the first four issues, with regard to data consistency, deduplication, currency and accuracy. The first part of the thesis investigates incremental verification of data consistency in distributed data. Given a distributed database D, a set S of conditional functional dependencies (CFDs), the set V of violations of the CFDs in D, and updates ΔD to D, the task is to find, with minimum data shipment, changes ΔV to V in response to ΔD. Although the problems are intractable, we show that they are bounded: there exist algorithms to detect errors such that their computational cost and data shipment are both linear in the size of ΔD and ΔV, independent of the size of the database D. Such incremental algorithms are provided for both vertically and horizontally partitioned data, and we show that the algorithms are optimal. The second part of the thesis studies the interaction between record matching and data repairing. Record matching, the main technique underlying data deduplication, aims to identify tuples that refer to the same real-world object, and repairing is to make a database consistent by fixing errors in the data using constraints. These are treated as separate processes in most data cleaning systems, based on heuristic solutions. However, our studies show that repairing can effectively help us identify matches, and vice versa. To capture the interaction, a uniform framework that seamlessly unifies repairing and matching operations is proposed to clean a database based on integrity constraints, matching rules and master data. The third part of the thesis presents our study of finding certain fixes that are absolutely correct for data repairing. Data repairing methods based on integrity constraints are normally heuristic, and they may not find certain fixes. Worse still, they may even introduce new errors when attempting to repair the data, which may not work well when repairing critical data such as medical records, in which a seemingly minor error often has disastrous consequences. We propose a framework and an algorithm to find certain fixes, based on master data, a class of editing rules and user interactions. A prototype system is also developed. The fourth part of the thesis introduces inferring data currency and consistency for conflict resolution, where data currency aims to identify the current values of entities, and conflict resolution is to…
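To make the notion of a violation set V concrete, here is a toy, non-incremental check of a single conditional functional dependency over in-memory tuples (the data and the CFD itself are hypothetical; the thesis's incremental, distributed algorithms are not reproduced):
```python
# Toy check of one conditional functional dependency (CFD): for tuples whose
# country code is "44", zip -> city must hold (same zip implies same city).
tuples = [
    {"cc": "44", "zip": "EH8 9AB", "city": "Edinburgh"},
    {"cc": "44", "zip": "EH8 9AB", "city": "Edinburg"},   # violation
    {"cc": "01", "zip": "EH8 9AB", "city": "Anything"},   # condition cc="44" not met
]

def cfd_violations(tuples, condition, lhs, rhs):
    """Return pairs of tuples that satisfy `condition` but disagree on rhs for equal lhs."""
    matching = [t for t in tuples if condition(t)]
    bad = []
    for i, t1 in enumerate(matching):
        for t2 in matching[i + 1:]:
            if t1[lhs] == t2[lhs] and t1[rhs] != t2[rhs]:
                bad.append((t1, t2))
    return bad

print(cfd_violations(tuples, lambda t: t["cc"] == "44", "zip", "city"))
```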
Subjects/Keywords: 005.7; Data quality; Data consistency; Deduplication; Data currency; Data accuracy
APA (6th Edition):
Yu, W. (2013). Improving data quality : data consistency, deduplication, currency and accuracy. (Doctoral Dissertation). University of Edinburgh. Retrieved from http://hdl.handle.net/1842/8899
Chicago Manual of Style (16th Edition):
Yu, Wenyuan. “Improving data quality : data consistency, deduplication, currency and accuracy.” 2013. Doctoral Dissertation, University of Edinburgh. Accessed January 21, 2021.
http://hdl.handle.net/1842/8899.
MLA Handbook (7th Edition):
Yu, Wenyuan. “Improving data quality : data consistency, deduplication, currency and accuracy.” 2013. Web. 21 Jan 2021.
Vancouver:
Yu W. Improving data quality : data consistency, deduplication, currency and accuracy. [Internet] [Doctoral dissertation]. University of Edinburgh; 2013. [cited 2021 Jan 21].
Available from: http://hdl.handle.net/1842/8899.
Council of Science Editors:
Yu W. Improving data quality : data consistency, deduplication, currency and accuracy. [Doctoral Dissertation]. University of Edinburgh; 2013. Available from: http://hdl.handle.net/1842/8899

Delft University of Technology
27.
Pronk, Martijn (author).
Policy recommendations to improve data quality in the electricity chain.
Degree: 2017, Delft University of Technology
URL: http://resolver.tudelft.nl/uuid:2ffb81d3-7027-4e64-b192-ed465b261da4
Data quality is essential in the modern world: the higher the quality of the data, the more information, knowledge and thereby wisdom can be retrieved. Besides better wisdom, high-quality data has another advantage: it can save costs, since errors can be avoided. Low-quality data costs companies a great deal of money and leads to information with more noise. The energy sector is also an industry where data is of great importance; data quality matters in the electricity chain since many parties depend on it. The greatest challenge emerging from the energy transition is the balance between decentralized electricity production and constant electricity consumption. However, the data in the electricity chain is not always of high quality, which leads to problems. The main research question is: Which measures can be taken to improve data quality in the electricity data chain? This is explored through a case study.
Complex Systems Engineering and Management (CoSEM)
Advisors/Committee Members: Janssen, Marijn (graduation committee), Enserink, Bert (mentor), Zuiderwijk-van Eijk, Anneke (mentor), Ubacht, Jolien (mentor), op de Weegh, Jikke (mentor), Delft University of Technology (degree granting institution).
Subjects/Keywords: data quality; data chain; data dimension; energy sector
APA (6th Edition):
Pronk, M. (. (2017). Policy recommendations to improve data quality in the electricity chain. (Masters Thesis). Delft University of Technology. Retrieved from http://resolver.tudelft.nl/uuid:2ffb81d3-7027-4e64-b192-ed465b261da4
Chicago Manual of Style (16th Edition):
Pronk, Martijn (author). “Policy recommendations to improve data quality in the electricity chain.” 2017. Masters Thesis, Delft University of Technology. Accessed January 21, 2021.
http://resolver.tudelft.nl/uuid:2ffb81d3-7027-4e64-b192-ed465b261da4.
MLA Handbook (7th Edition):
Pronk, Martijn (author). “Policy recommendations to improve data quality in the electricity chain.” 2017. Web. 21 Jan 2021.
Vancouver:
Pronk M(. Policy recommendations to improve data quality in the electricity chain. [Internet] [Masters thesis]. Delft University of Technology; 2017. [cited 2021 Jan 21].
Available from: http://resolver.tudelft.nl/uuid:2ffb81d3-7027-4e64-b192-ed465b261da4.
Council of Science Editors:
Pronk M(. Policy recommendations to improve data quality in the electricity chain. [Masters Thesis]. Delft University of Technology; 2017. Available from: http://resolver.tudelft.nl/uuid:2ffb81d3-7027-4e64-b192-ed465b261da4

Curtin University of Technology
28.
Mohd Shaharanee, Izwan Nizal.
Quality and interestingness of association rules derived from data mining of relational and semi-structured data.
Degree: 2012, Curtin University of Technology
URL: http://hdl.handle.net/20.500.11937/1643
▼ Deriving useful and interesting rules from a data mining system is an essential and important task. Problems such as the discovery of random and coincidental patterns, or patterns with no significant value, and the generation of a large volume of rules from a database commonly occur. Work on sustaining the interestingness of rules generated by data mining algorithms is actively and constantly being examined and developed. Because data mining techniques are data-driven, it is beneficial to affirm the rules using a statistical approach, and it is important to establish how existing statistical measures and constraint parameters can be effectively utilized and in what sequence. In this thesis, a systematic way is presented to evaluate association rules discovered by frequent, closed and maximal itemset mining algorithms, and by frequent subtree mining algorithms, including rules based on induced, embedded and disconnected subtrees. With reference to frequent subtree mining, a new direction is also explored based on the DSM approach, which preserves all information from a tree-structured database in a flat data format and consequently enables the direct application of a wider range of data mining analyses and techniques to tree-structured data. The implications of this approach were investigated, and it was found that basing rules on disconnected subtrees can be useful for increasing the accuracy and the coverage rate of the rule set. A strategy is developed that combines data mining with statistical measurement techniques, such as sampling, redundancy and contradiction checks, and correlation and regression analysis, to evaluate the rules. This framework is then applied to real-world datasets with diverse characteristics of data/items. Empirical results show that, with a proper combination of data mining and statistical analysis, the proposed framework is capable of eliminating a large number of non-significant, redundant and contradictory rules while preserving relatively valuable, high-accuracy rules. Moreover, the results reveal important characteristics of, and differences between, mining frequent, closed or maximal itemsets and mining frequent subtrees (including rules based on induced, embedded and disconnected subtrees), as well as the impact of the confidence measure on prediction and classification tasks.
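To make the combination of rule mining and statistical checks described above concrete, here is a small, self-contained Python sketch (an editorial illustration, not the framework from the thesis): it derives one-antecedent association rules from a toy transaction set and discards rules whose antecedent/consequent association fails a chi-square test on the 2x2 contingency table, a simple stand-in for the kind of significance checks the abstract mentions. The items, the support and confidence thresholds, and the chi-square cutoff are all assumptions.

    # Illustrative sketch: mine simple association rules, then keep only those
    # whose antecedent/consequent co-occurrence is statistically significant.
    # Items, support/confidence thresholds and the chi-square cutoff are assumptions.
    from itertools import combinations

    transactions = [
        {"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"},
        {"bread", "milk", "butter"}, {"bread", "milk"}, {"milk"},
        {"bread", "butter"}, {"bread", "milk", "butter"},
    ]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    def chi_square(a_set, c_set):
        """Chi-square statistic of the 2x2 contingency table for antecedent vs consequent."""
        a = sum(1 for t in transactions if a_set <= t and c_set <= t)       # both
        b = sum(1 for t in transactions if a_set <= t and not c_set <= t)   # antecedent only
        c = sum(1 for t in transactions if not a_set <= t and c_set <= t)   # consequent only
        d = n - a - b - c                                                   # neither
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

    items = sorted(set().union(*transactions))
    MIN_SUPPORT, MIN_CONFIDENCE, CHI2_CRITICAL = 0.3, 0.6, 3.84  # 3.84 ~ p < 0.05 at 1 d.o.f.

    for x, y in combinations(items, 2):
        for ante, cons in (({x}, {y}), ({y}, {x})):
            supp = support(ante | cons)
            conf = supp / support(ante)
            if supp >= MIN_SUPPORT and conf >= MIN_CONFIDENCE:
                significant = chi_square(ante, cons) >= CHI2_CRITICAL
                print(f"{ante} -> {cons}: support={supp:.2f}, confidence={conf:.2f}, "
                      f"{'kept' if significant else 'dropped (not significant)'}")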
Subjects/Keywords: data mining; relational data; semi-structured data; interestingness; quality; association rules
APA (6th Edition):
Mohd Shaharanee, I. N. (2012). Quality and interestingness of association rules derived from data mining of relational and semi-structured data. (Thesis). Curtin University of Technology. Retrieved from http://hdl.handle.net/20.500.11937/1643
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Mohd Shaharanee, Izwan Nizal. “Quality and interestingness of association rules derived from data mining of relational and semi-structured data.” 2012. Thesis, Curtin University of Technology. Accessed January 21, 2021.
http://hdl.handle.net/20.500.11937/1643.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Mohd Shaharanee, Izwan Nizal. “Quality and interestingness of association rules derived from data mining of relational and semi-structured data.” 2012. Web. 21 Jan 2021.
Vancouver:
Mohd Shaharanee IN. Quality and interestingness of association rules derived from data mining of relational and semi-structured data. [Internet] [Thesis]. Curtin University of Technology; 2012. [cited 2021 Jan 21].
Available from: http://hdl.handle.net/20.500.11937/1643.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Mohd Shaharanee IN. Quality and interestingness of association rules derived from data mining of relational and semi-structured data. [Thesis]. Curtin University of Technology; 2012. Available from: http://hdl.handle.net/20.500.11937/1643
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

University of Manchester
29.
Cisneros Cabrera, Sonia.
Experimenting with a Big Data Framework for Scaling a Data Quality Query System.
Degree: 2016, University of Manchester
URL: http://www.manchester.ac.uk/escholar/uk-ac-man-scw:306252
▼ The work presented in this thesis comprises the design, implementation and evaluation of extensions made to the Data Quality Query System (DQ2S), a state-of-the-art data quality aware query processing framework and query language, towards testing and improving its scalability when working with increasing amounts of data. The purpose of the evaluation is to assess to what extent a big data framework, such as Apache Spark, can offer significant gains in performance, including runtime, required amount of memory, processing capacity, and resource utilisation, when running over different environments. DQ2S enables assessing and improving data quality within information management by facilitating profiling of the data in use and supporting data cleansing tasks, which represent an important step in the big data life-cycle. Despite this, DQ2S, like the majority of data quality management systems, is not designed to process very large amounts of data. This research describes how the data quality extensions were designed, implemented and tested, taking an earlier implementation that processed two datasets of 50,000 rows each in 397 seconds to a big data solution capable of processing 105,000,000 rows in 145 seconds. The thesis provides a detailed account of the experimental journey followed to extend DQ2S towards exploring the capabilities of a popular big data framework (Apache Spark), including the experiments used to measure the scalability and usefulness of the approach. The study also provides a roadmap for researchers interested in re-purposing and porting existing information management systems and tools to explore the capabilities provided by big data frameworks, which is particularly useful given that re-purposing and re-writing existing software to work with big data frameworks is less costly and risky than greenfield engineering of information management systems and tools.
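As a rough illustration of what porting a data-quality query to a big data framework can look like, below is a minimal PySpark sketch (an editorial example, not code from DQ2S or from the thesis) that profiles a dataset for per-column completeness and row-level timeliness, the kind of profiling DQ2S supports, expressed as Spark DataFrame operations so that the same query can scale out across a cluster. The input path, the column names, and the 30-day freshness window are assumptions.

    # Minimal PySpark sketch of a data-quality profiling query.
    # The input path, column names and the 30-day freshness window are assumptions.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("dq-profiling-sketch").getOrCreate()

    # Read a large dataset; in a DQ2S-style setting this would be one of the
    # datasets whose quality is being queried.
    df = spark.read.csv("hdfs:///data/customer_records.csv", header=True, inferSchema=True)

    # Completeness: fraction of non-null values per column.
    total = df.count()
    completeness = df.agg(*[
        (F.count(F.col(c)) / F.lit(total)).alias(c)   # count() ignores nulls
        for c in df.columns
    ])
    completeness.show(truncate=False)

    # Timeliness: share of rows updated within the last 30 days,
    # assuming a "last_updated" timestamp column exists.
    timeliness = df.agg(
        F.avg(
            F.when(F.datediff(F.current_date(), F.col("last_updated")) <= 30, 1).otherwise(0)
        ).alias("timeliness_30d")
    )
    timeliness.show()

    spark.stop()

On a single machine the same logic could be written with pandas; the point of the Spark formulation is that identical DataFrame expressions distribute across a cluster as the row count grows, which mirrors the scaling exercise the abstract describes.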
Advisors/Committee Members: SAMPAIO, PEDRO PRF, Sampaio, Sandra, Sampaio, Pedro.
Subjects/Keywords: Big Data; Data Quality; Data Profiling; Empirical Evaluation; Scalability
APA (6th Edition):
Cisneros Cabrera, S. (2016). Experimenting with a Big Data Framework for Scaling a Data Quality Query System. (Doctoral Dissertation). University of Manchester. Retrieved from http://www.manchester.ac.uk/escholar/uk-ac-man-scw:306252
Chicago Manual of Style (16th Edition):
Cisneros Cabrera, Sonia. “Experimenting with a Big Data Framework for Scaling a Data Quality Query System.” 2016. Doctoral Dissertation, University of Manchester. Accessed January 21, 2021.
http://www.manchester.ac.uk/escholar/uk-ac-man-scw:306252.
MLA Handbook (7th Edition):
Cisneros Cabrera, Sonia. “Experimenting with a Big Data Framework for Scaling a Data Quality Query System.” 2016. Web. 21 Jan 2021.
Vancouver:
Cisneros Cabrera S. Experimenting with a Big Data Framework for Scaling a Data Quality Query System. [Internet] [Doctoral dissertation]. University of Manchester; 2016. [cited 2021 Jan 21].
Available from: http://www.manchester.ac.uk/escholar/uk-ac-man-scw:306252.
Council of Science Editors:
Cisneros Cabrera S. Experimenting with a Big Data Framework for Scaling a Data Quality Query System. [Doctoral Dissertation]. University of Manchester; 2016. Available from: http://www.manchester.ac.uk/escholar/uk-ac-man-scw:306252

University of California – Irvine
30.
Altowim, Yasser Abdulaziz.
Progressive Approach To Entity Resolution.
Degree: Information and Computer Science, 2015, University of California – Irvine
URL: http://www.escholarship.org/uc/item/1w97p3k1
▼ Data-driven technologies such as decision support, analysis, and scientific discovery tools have become a critical component of many organizations and businesses. The effectiveness of such technologies, however, is closely tied to the quality of the data on which they are applied. That is why organizations today spend a substantial percentage of their budgets on cleaning tasks such as removing duplicates, correcting errors, and filling in missing values to improve data quality before pushing data through the analysis pipeline. Entity resolution (ER), the process of identifying which entities in a dataset refer to the same real-world object, is a well-known data cleaning challenge. This process, however, is traditionally performed as an offline step before the data is made available for analysis. Such an offline strategy is simply unsuitable for many emerging analytical applications that require low-latency responses (and thus cannot tolerate delays caused by cleaning the entire dataset), and also for situations where the underlying resources are constrained or costly to use. To overcome these limitations, we study in this thesis a new paradigm for ER: progressive entity resolution. Progressive ER aims to resolve the dataset in a way that maximizes the rate at which data quality improves. This approach can substantially reduce resolution cost, since the ER process can be terminated early whenever a satisfactory level of quality is achieved. In this thesis, we explore two aspects of the ER problem and propose a progressive approach to each. In particular, we first propose a progressive approach to relational ER, wherein the input dataset consists of multiple entity-sets and the relationships among them. We then propose a parallel approach to entity resolution using the popular MapReduce (MR) framework. A comprehensive empirical evaluation of the two proposed approaches demonstrates that they achieve high-quality results at limited resolution cost.
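To illustrate the progressive idea sketched in the abstract, here is a small, self-contained Python example (an editorial sketch, not the algorithms proposed in the thesis): candidate pairs are grouped into blocks, the blocks are ordered by a cheap estimate of how likely they are to yield matches, and pairwise matching then proceeds under a comparison budget so the process can be stopped early with the most valuable work done first. The records, the blocking key, the similarity measure, and the thresholds are assumptions.

    # Sketch of budgeted, progressive entity resolution:
    # block -> rank blocks by estimated benefit -> resolve until the budget runs out.
    # Records, the blocking key (surname prefix) and the thresholds are assumptions.
    from collections import defaultdict
    from itertools import combinations

    records = [
        {"id": 1, "name": "john smith",  "city": "dublin"},
        {"id": 2, "name": "jon smith",   "city": "dublin"},
        {"id": 3, "name": "mary jones",  "city": "cork"},
        {"id": 4, "name": "maria jones", "city": "cork"},
        {"id": 5, "name": "john smyth",  "city": "dublin"},
    ]

    def jaccard(a, b):
        """Token-set Jaccard similarity between two strings."""
        sa, sb = set(a.split()), set(b.split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

    # 1. Blocking: group records by a cheap key to avoid comparing everything with everything.
    blocks = defaultdict(list)
    for r in records:
        blocks[r["name"].split()[-1][:3]].append(r)   # first 3 letters of the surname token

    # 2. Progressive ordering: small blocks with a shared city value first,
    #    as a crude proxy for "matches likely per comparison spent".
    def estimated_benefit(block):
        pairs = len(block) * (len(block) - 1) / 2
        same_city = len({r["city"] for r in block}) == 1
        return (2.0 if same_city else 1.0) / max(pairs, 1)

    ordered = sorted((b for b in blocks.values() if len(b) > 1),
                     key=estimated_benefit, reverse=True)

    # 3. Resolve under a budget of pairwise comparisons, stopping early.
    BUDGET, used, matches = 4, 0, []
    for block in ordered:
        for r1, r2 in combinations(block, 2):
            if used >= BUDGET:
                break
            used += 1
            if jaccard(r1["name"], r2["name"]) >= 0.3:
                matches.append((r1["id"], r2["id"]))

    print(f"comparisons used: {used}, matches found: {matches}")

In a MapReduce-style parallelization of the same idea, the blocking step would typically be the map phase (emit a blocking key with each record) and the within-block comparisons the reduce phase; the progressive ordering then decides which blocks are worth scheduling first.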
Subjects/Keywords: Computer science; Big Data; Data Cleaning; Data Management; Data Quality; Entity Resolution; MapReduce
APA (6th Edition):
Altowim, Y. A. (2015). Progressive Approach To Entity Resolution. (Thesis). University of California – Irvine. Retrieved from http://www.escholarship.org/uc/item/1w97p3k1
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Altowim, Yasser Abdulaziz. “Progressive Approach To Entity Resolution.” 2015. Thesis, University of California – Irvine. Accessed January 21, 2021.
http://www.escholarship.org/uc/item/1w97p3k1.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Altowim, Yasser Abdulaziz. “Progressive Approach To Entity Resolution.” 2015. Web. 21 Jan 2021.
Vancouver:
Altowim YA. Progressive Approach To Entity Resolution. [Internet] [Thesis]. University of California – Irvine; 2015. [cited 2021 Jan 21].
Available from: http://www.escholarship.org/uc/item/1w97p3k1.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Altowim YA. Progressive Approach To Entity Resolution. [Thesis]. University of California – Irvine; 2015. Available from: http://www.escholarship.org/uc/item/1w97p3k1
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
◁ [1] [2] [3] [4] [5] … [32] ▶