You searched for subject:(Checkpoint restart).
Showing records 1 – 15 of 15 total matches.
No search limiters apply to these results.

Rice University
1.
Vrvilo, Nick.
Implementing Asynchronous Checkpoint/Restart for the Concurrent Collections Model.
Degree: MS, Engineering, 2014, Rice University
URL: http://hdl.handle.net/1911/88191
It has been claimed that what simplifies parallelism can also simplify resilience. Based on that assertion, we present the Concurrent Collections programming model (CnC) as an ideal target for a simple yet powerful resilience system for parallel computations. Specifically, we claim that the same attributes that simplify reasoning about parallel applications written in CnC will similarly simplify the implementation of a checkpoint/restart system within the CnC runtime. We define these properties of CnC in the context of a model built in K. To demonstrate how these simplifying properties of CnC help to simplify resilience, we have implemented a simple checkpoint/restart system within Rice’s Habanero C implementation of the CnC runtime. We show how the CnC runtime can fully encapsulate the checkpointing and restarting processes, allowing application programmers to gain all the benefits of resilience without any added effort beyond implementing the application in CnC, while avoiding the synchronization overheads present in traditional techniques.
Advisors/Committee Members: Sarkar, Vivek (advisor), Mellor-Crummey, John (committee member), Chaudhuri, Swarat (committee member).
Subjects/Keywords: Concurrent Collections; Resilience; Checkpoint/Restart
APA (6th Edition):
Vrvilo, N. (2014). Implementing Asynchronous Checkpoint/Restart for the Concurrent Collections Model. (Masters Thesis). Rice University. Retrieved from http://hdl.handle.net/1911/88191

University of New Mexico
2.
Ferreira, Kurt.
Keeping checkpoint/restart viable for exascale systems.
Degree: Department of Computer Science, 2011, University of New Mexico
URL: http://hdl.handle.net/1928/17473
Next-generation exascale systems, those capable of performing a quintillion operations per second, are expected to be delivered in the next 8-10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current techniques to ensure progress across faults, like checkpoint/restart, the dominant fault tolerance mechanism for the last 25 years, are increasingly problematic at the scales of future systems due to their excessive overheads. In this work, we evaluate a number of techniques to decrease the overhead of checkpoint/restart and keep this method viable for future exascale systems. More specifically, this work evaluates state-machine replication to dramatically increase the checkpoint interval (the time between successive checkpoints) and hash-based, probabilistic incremental checkpointing using graphics processing units to decrease the checkpoint commit time (the time to save one checkpoint). Using a combination of empirical analysis, modeling, and simulation, we study the costs and benefits of these approaches on a wide range of parameters. These results, which cover a number of high-performance computing capability workloads, different failure distributions, hardware mean times to failure, and I/O bandwidths, show the potential benefits of these techniques for meeting the reliability demands of future exascale platforms.
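To make the idea of hash-based incremental checkpointing concrete, the following is a minimal CPU-only sketch in Python. It is not the dissertation's GPU implementation: the block size, the function names, and the choice of SHA-256 are all assumptions made for illustration, and the state is assumed to have a fixed size. The scheme is "probabilistic" in the sense that a changed block is only detected if its digest changes, so a hash collision would cause a modified block to be missed.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block granularity in bytes (assumption)


def block_hashes(memory, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of `memory`."""
    return [
        hashlib.sha256(memory[i:i + block_size]).digest()
        for i in range(0, len(memory), block_size)
    ]


def incremental_checkpoint(memory, previous_hashes, block_size=BLOCK_SIZE):
    """Return (changed_blocks, new_hashes).

    changed_blocks maps a block index to that block's contents for every
    block whose digest differs from the previous checkpoint. With an empty
    `previous_hashes`, every block is saved (a full checkpoint).
    """
    new_hashes = block_hashes(memory, block_size)
    changed = {}
    for idx, digest in enumerate(new_hashes):
        if idx >= len(previous_hashes) or previous_hashes[idx] != digest:
            start = idx * block_size
            changed[idx] = memory[start:start + block_size]
    return changed, new_hashes


def restore(base, deltas, block_size=BLOCK_SIZE):
    """Rebuild the state by applying incremental checkpoints, oldest first.

    Assumes the state keeps the same length as `base`.
    """
    state = bytearray(base)
    for delta in deltas:
        for idx, block in delta.items():
            start = idx * block_size
            state[start:start + len(block)] = block
    return bytes(state)
```

Only the changed blocks need to be written at each checkpoint, which is how an incremental scheme can shorten the commit time when most of the state is unchanged between checkpoints.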
Advisors/Committee Members: Bridges, Patrick, Arnold, Dorian, Taufer, Michela, Crandall, Jed.
Subjects/Keywords: Checkpoint/ Restart; Reliability; Exascale; State-Machine Replication
APA (6th Edition):
Ferreira, K. (2011). Keeping checkpoint/restart viable for exascale systems. (Doctoral Dissertation). University of New Mexico. Retrieved from http://hdl.handle.net/1928/17473
3.
Popov, Mihail.
Décomposition automatique des programmes parallèles pour l'optimisation et la prédiction de performance. : Automatic decomposition of parallel programs for optimization and performance prediction.
Degree: Docteur es, Informatique, 2016, Université Paris-Saclay (ComUE)
URL: http://www.theses.fr/2016SACLV087
In high-performance computing, many benchmark programs are used to measure the efficiency of machines, compilers, and performance optimizations. Reference benchmarks often bundle computational programs drawn from industry and can run for a very long time. Benchmarking a new architecture or an optimization is therefore costly. Most benchmarks consist of a set of independent computational kernels. Often the benchmarker is interested in only a subset of these kernels, so it would be useful to be able to run them separately. Local optimizations can then be applied to the benchmarks more easily and quickly. Moreover, benchmarks contain many redundant computational kernels. Some operations, although measured several times, yield no additional information about the system under study. Detecting similarities among kernels and eliminating the redundant ones lowers the cost of benchmarks without loss of information.
This thesis proposes a method for automatically decomposing an application into a set of performance kernels, which we call codelets. The proposed method allows codelets to be replayed in isolation under different experimental conditions in order to benchmark their performance. The thesis studies the extent to which decomposition into kernels reduces the cost of benchmarking and optimization. It also evaluates the advantage of local optimizations over a global approach.
Much work has been done to improve the benchmarking process, notably using machine learning or sampling techniques. The idea of decomposing benchmarks into independent pieces is not new. The concept has been applied successfully to sequential programs, and we bring it to maturity for parallel programs. Evaluating new micro-architectures or scalability is 25× faster with codelets than with full application runs. Codelets predict execution time with 94% accuracy and make it possible to find local optimizations up to 1.06× more efficient than the best global approach.
Advisors/Committee Members: Jalby, William (thesis director).
Subjects/Keywords: Performance prediction; Parallelism; Compilation; Optimization; Checkpoint restart; 004.35
APA (6th Edition):
Popov, M. (2016). Décomposition automatique des programmes parallèles pour l'optimisation et la prédiction de performance. : Automatic decomposition of parallel programs for optimization and performance prediction. (Doctoral Dissertation). Université Paris-Saclay (ComUE). Retrieved from http://www.theses.fr/2016SACLV087

University of Toronto
4.
Siniavine, Maxim.
Seamless Kernel Updates.
Degree: 2012, University of Toronto
URL: http://hdl.handle.net/1807/33532
Kernel patches are frequently released to fix security vulnerabilities and bugs. However, users and system administrators often delay installing these updates because they require a system reboot, which results in disruption of service and the loss of application state. Unfortunately, the longer an out-of-date system remains operational, the higher the likelihood of the system being exploited.
Approaches, such as dynamic patching and hot swapping, have been proposed for updating the kernel. All of them either limit the types of updates that are supported, or require significant programming effort to manage.
We have designed a system that checkpoints application-visible state, updates the kernel, and restores the application state. By checkpointing high-level state, our system no longer depends on the precise implementation of a patch and can apply all backward compatible patches. The results show that updates to major kernel releases can be applied with minimal changes.
MAST
Advisors/Committee Members: Goel, Ashvin, Electrical and Computer Engineering.
Subjects/Keywords: Updates; Operating Systems; Linux; Security; checkpoint; restart; 0984
APA (6th Edition):
Siniavine, M. (2012). Seamless Kernel Updates. (Masters Thesis). University of Toronto. Retrieved from http://hdl.handle.net/1807/33532

Northeastern University
5.
Cao, Jiajun.
Transparent checkpointing over RDMA-based networks.
Degree: PhD, Computer Science Program, 2017, Northeastern University
URL: http://hdl.handle.net/2047/D20290419
Fault tolerance for large-scale applications has long been an area of active research, as the size of the computation keeps growing. One of the components of a fault-tolerance strategy is checkpointing. However, no explicit checkpoint-restart solution has been available for applications running over RDMA-based networks. RDMA-based networks are the primary network used in high-performance computing, and many researchers believe that RDMA networks will be widely deployed in the Cloud as the costs decrease. Existing approaches often rely on a solution that is specific to the particular MPI implementation or other parallel model in order to disconnect the network at checkpoint time, and to reconnect the network at restart time. Such schemes are difficult to incorporate for new parallel programming models, and also imply higher checkpoint overhead.
In this dissertation, we present the first transparent, system-initiated checkpoint-restart solution that directly supports RDMA networks. This new approach does not depend on a specific parallel programming model, and does not require any modification to the operating system. In addition, network connections remain active during checkpointing, thus making checkpointing more efficient.
Conceptually, this dissertation can be divided into three parts. First, we introduce a new, generic model for RDMA networks, by extracting the key components for checkpointing an RDMA network. These components are the essential states that need to be saved, in order to restore the network connection on restart. This model is then applied to two distinct RDMA networks: InfiniBand, and Intel Omni-Path. This work demonstrates the generality of the model, and it also describes variations needed to adapt to InfiniBand or Omni-Path.
Second, we demonstrate the performance of the proposed approach. Moving from a medium-sized academic computer cluster to a petascale supercomputer, we show what issues are exposed as the application scales up, and how these issues are addressed. In particular, different strategies to drain the network at checkpoint time are investigated, based on the underlying network protocol. As a result, failure-free overhead is reduced to below 1%, even at the largest scale demonstrated: 32,752 processes.
Third, we show how to retrofit transparent checkpointing into the Cloud, as RDMA networks are also becoming more popular in the Cloud. A Checkpointing as a Service approach is presented, which employs checkpointing to provide fault tolerance as a service in the Cloud, and enables application migration in heterogeneous cloud environments.
Subjects/Keywords: checkpoint-restart; cloud computing; MPI; RDMA; supercomputing; virtualization
APA (6th Edition):
Cao, J. (2017). Transparent checkpointing over RDMA-based networks. (Doctoral Dissertation). Northeastern University. Retrieved from http://hdl.handle.net/2047/D20290419

University of California – Irvine
6.
POURGHASSEMI, BEHNAM.
cudaCR: An In-kernel Application-level Checkpoint/Restart Scheme for CUDA Applications.
Degree: Electrical and Computer Engineering, 2017, University of California – Irvine
URL: http://www.escholarship.org/uc/item/7nc05406
Fault-tolerance is becoming increasingly important as we enter the era of exascale computing. Increasing the number of cores results in a smaller mean time between failures, and consequently, higher probability of errors. Among the different software fault tolerance techniques, checkpoint/restart is the most commonly used method in supercomputers, the de-facto standard for large-scale systems. Although there exist several checkpoint/restart implementations for CPUs, only a handful have been proposed for GPUs even though more than 60 supercomputers in the TOP 500 list are heterogeneous CPU-GPU systems. In this work, we propose a scalable application-level checkpoint/restart scheme, called cudaCR, for long-running kernels on NVIDIA GPUs. Our proposed scheme is able to capture GPU state inside the kernel and roll back to the previous state within the same kernel, unlike state-of-the-art approaches. This thesis presents the cudaCR implementation in detail and evaluates its first version on application benchmarks with different characteristics such as dense matrix multiply, stencil computation, and k-means clustering on a Tesla K40 GPU. We observe that cudaCR can fully restore state with low overheads in both time (less than 10% in best case) and memory requirements after applying a number of different optimizations (storage gain: 54% for dense matrix multiply, 31% for k-means, and 4% for stencil computation). Looking forward, we identify new optimizations to further reduce the overhead to make cudaCR highly scalable.
Subjects/Keywords: Computer engineering; Computer science; checkpoint/restart; Fault tolerance; GPU; soft-errors; supercomputer
APA (6th Edition):
POURGHASSEMI, B. (2017). cudaCR: An In-kernel Application-level Checkpoint/Restart Scheme for CUDA Applications. (Thesis). University of California – Irvine. Retrieved from http://www.escholarship.org/uc/item/7nc05406

North Carolina State University
7.
Wang, Chao.
Transparent Fault Tolerance for Job Healing in HPC Environments.
Degree: PhD, Computer Science, 2009, North Carolina State University
URL: http://www.lib.ncsu.edu/resolver/1840.16/4437
As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace, causing losses in intermediate results of HPC jobs. Furthermore, storage systems providing job input data have been shown to consistently rank as the primary source of system failures leading to data unavailability and job resubmissions.
This dissertation presents a combination of multiple fault tolerance techniques that realize significant advances in fault resilience of HPC jobs. The efforts encompass two broad areas.
First, at the job level, novel, scalable mechanisms are built in support of proactive FT and to significantly enhance reactive FT. The contributions of this dissertation in this area are (1) a transparent job pause mechanism, which allows a job to pause when a process fails and prevents it from having to re-enter the job queue; (2) a proactive fault-tolerant approach that combines process-level live migration with health monitoring to complement reactive with proactive FT and to reduce the number of checkpoints when a majority of the faults can be handled proactively; (3) a novel back migration approach to eliminate load imbalance or bottlenecks caused by migrated tasks; and (4) an incremental checkpointing mechanism, which is combined with full checkpoints to explore the potential of reducing the overhead of checkpointing by performing fewer full checkpoints interspersed with multiple smaller incremental checkpoints.
Second, for the job input data, transparent techniques are provided to improve the reliability, availability and performance of HPC I/O systems. In this area, the dissertation contributes (1) a mechanism for offline job input data reconstruction to ensure availability of job input data and to improve center-wide performance at no cost to job owners; (2) an approach to automatically recover job input data at run-time during failures by recovering staged data from an original source; and (3) “just in time” replication of job input data so as to maximize the use of supercomputer cycles.
Experimental results demonstrate the value of these advanced fault tolerance techniques to increase fault resilience in HPC environments.
Advisors/Committee Members: Dr. Frank Mueller, Committee Chair (advisor), Dr. Xiaosong Ma, Committee Member (advisor), Dr. Yan Solihin, Committee Member (advisor), Dr. Nagiza Samatova, Committee Member (advisor).
Subjects/Keywords: job input data; fault tolerance; high-performance computing; fault resilience; checkpoint/restart
APA (6th Edition):
Wang, C. (2009). Transparent Fault Tolerance for Job Healing in HPC Environments. (Doctoral Dissertation). North Carolina State University. Retrieved from http://www.lib.ncsu.edu/resolver/1840.16/4437
8.
Tao, Dingwen.
Fault Tolerance for Iterative Methods in High-Performance Computing.
Degree: Computer Science, 2018, University of California – Riverside
URL: http://www.escholarship.org/uc/item/4fc474t2
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fundamental operations for many modern scientific simulations. When the large-scale iterative methods are running with a large number of ranks in parallel, they are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems and fail-stop errors in the entire system, considering large component counts and lower power margins of emerging high-performance computing (HPC) platforms.
To protect iterative methods from soft errors, we propose an online algorithm-based fault tolerance (ABFT) approach to efficiently detect and recover from soft errors for iterative methods. We design a novel checksum-based encoding scheme for matrix-vector multiplication that is resilient to both arithmetic and memory errors. Our design decouples the checksum updating process from the actual computation and allows adaptive checksum overhead control. Building on this new encoding mechanism, we propose two online ABFT designs that can effectively recover from errors when combined with a checkpoint/rollback scheme. These designs are capable of addressing scenarios under different error rates. Our ABFT approaches apply to a wide range of iterative solvers that primarily rely on matrix-vector multiplication and vector linear operations. We evaluate our designs through comprehensive analytical and empirical analysis. Experimental evaluation on the Stampede supercomputer demonstrates the low performance overheads incurred by our two ABFT schemes for preconditioned CG (0.4% and 2.2%) and preconditioned BiCGSTAB (1.0% and 4.0%) for the largest SPD matrix from UFL Sparse Matrix Collection. The evaluation also demonstrates the flexibility and effectiveness of our proposed designs for detecting and recovering various types of soft errors in iterative methods.
Iterative methods have to checkpoint the dynamic variables periodically in case of unavoidable fail-stop errors, requiring fast I/O systems and large storage space. Thus, significantly reducing the data to be checkpointed is critical to improving the overall performance of iterative methods. Lossy compression allowing user-controlled data loss can significantly reduce the I/O burden. To this end, we design a new error-controlled lossy compression algorithm for large-scale scientific data. We significantly improve the prediction accuracy for each data point based on its nearby data values along multiple dimensions. We derive a series of multilayer prediction formulas and their unified formula in the context of data compression. One serious challenge is that the data prediction has to be performed based on the preceding decompressed values during the compression in order to guarantee the error bounds, which may degrade the prediction accuracy in turn. We explore the best layer for the prediction by considering the impact of compression errors on the prediction accuracy and propose an adaptive error-controlled quantization encoder, which can further improve the prediction…
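The checksum idea for matrix-vector multiplication mentioned in this abstract can be illustrated with the classic column-checksum identity: for y = A·x, the sum of the entries of y must equal (1ᵀA)·x. The Python sketch below shows only this textbook detection check, not the thesis's own encoding scheme; the function names and the tolerance are assumptions chosen for illustration.

```python
def matvec(A, x):
    """Plain dense matrix-vector product y = A x (A is a list of rows)."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def column_checksum(A):
    """Checksum row c = 1^T A: the sum of each column, computed once per matrix."""
    return [sum(col) for col in zip(*A)]


def verify(y, checksum, x, tol=1e-9):
    """Return True if sum(y) matches checksum . x within a relative tolerance.

    A mismatch signals that y (or an operand) was corrupted; the caller
    could then recompute or roll back to a checkpoint.
    """
    expected = sum(c * v for c, v in zip(checksum, x))
    actual = sum(y)
    return abs(expected - actual) <= tol * max(1.0, abs(expected))
```

Because the checksum row depends only on A, it can be reused across every iteration of a solver that multiplies by the same matrix, so each check costs one dot product and one sum.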
Subjects/Keywords: Computer science; Checkpoint/Restart; Fault Tolerance; High Performance Computing; Iterative Methods; Performance; Resilience
APA (6th Edition):
Tao, D. (2018). Fault Tolerance for Iterative Methods in High-Performance Computing. (Thesis). University of California – Riverside. Retrieved from http://www.escholarship.org/uc/item/4fc474t2

University of Sydney
9.
Egwutuoha, Ifeanyi Paulinus.
A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud.
Degree: 2013, University of Sydney
URL: http://hdl.handle.net/2123/11484
▼ High Performance Computing (HPC) systems have been widely used by scientists and researchers in both industry and university laboratories to solve advanced computation problems. Most advanced computation problems are either data-intensive or computation-intensive. They may take hours, days or even weeks to complete execution. For example, some of the traditional HPC systems computations run on 100,000 processors for weeks. Consequently traditional HPC systems often require huge capital investments. As a result, scientists and researchers sometimes have to wait in long queues to access shared, expensive HPC systems. Cloud computing, on the other hand, offers new computing paradigms, capacity, and flexible solutions for both business and HPC applications. Some of the computation-intensive applications that are usually executed in traditional HPC systems can now be executed in the cloud. Cloud computing price model eliminates huge capital investments. However, even for cloud-based HPC systems, fault tolerance is still an issue of growing concern. The large number of virtual machines and electronic components, as well as software complexity and overall system reliability, availability and serviceability (RAS), are factors with which HPC systems in the cloud must contend. The reactive fault tolerance approach of checkpoint/restart, which is commonly used in HPC systems, does not scale well in the cloud due to resource sharing and distributed systems networks. Hence, the need for reliable fault tolerant HPC systems is even greater in a cloud environment. In this thesis we present a proactive fault tolerance approach to HPC systems in the cloud to reduce the wall-clock execution time, as well as dollar cost, in the presence of hardware failure. We have developed a generic fault tolerance algorithm for HPC systems in the cloud. We have further developed a cost model for executing computation-intensive applications on HPC systems in the cloud. 
Our experimental results obtained from a real cloud execution environment show that the wall-clock execution time and cost of running computation-intensive applications in the cloud can be considerably reduced compared to checkpoint and redundancy techniques used in traditional HPC systems.
Subjects/Keywords: HPC systems in the cloud; Fault tolerance; Cloud computing; Checkpoint/restart; HaaS
APA (6th Edition):
Egwutuoha, I. P. (2013). A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud. (Thesis). University of Sydney. Retrieved from http://hdl.handle.net/2123/11484
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Egwutuoha, Ifeanyi Paulinus. “A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud.” 2013. Thesis, University of Sydney. Accessed April 14, 2021.
http://hdl.handle.net/2123/11484.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Egwutuoha, Ifeanyi Paulinus. “A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud.” 2013. Web. 14 Apr 2021.
Vancouver:
Egwutuoha IP. A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud. [Internet] [Thesis]. University of Sydney; 2013. [cited 2021 Apr 14].
Available from: http://hdl.handle.net/2123/11484.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Egwutuoha IP. A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud. [Thesis]. University of Sydney; 2013. Available from: http://hdl.handle.net/2123/11484
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

Université de Grenoble
10.
Bouguerra, Mohamed Slim.
Tolérance aux pannes dans des environnements de calcul parallèle et distribué : optimisation des stratégies de sauvegarde/reprise et ordonnancement : Fault tolerance in the parallel and distributed environments : optimizing the checkpoint restart strategy and scheduling.
Degree: Docteur es, Informatique, 2012, Université de Grenoble
URL: http://www.theses.fr/2012GRENM023
▼ Scaling up new parallel and distributed computing platforms raises many scientific challenges. Eventually, applications composed of a billion processes running on systems with a million cores can be expected to appear. This dramatic increase in the number of processors poses an unavoidable resilience challenge, since such applications would face several failures per day. To ensure proper execution in this context, which is heavily disrupted by interruptions, many fault tolerance techniques such as the checkpoint/restart approach have been devised and studied. However, integrating these fault tolerance approaches into the pair formed by the application and the execution platform raises optimization problems in determining the trade-off between the overhead induced by the fault tolerance mechanism on the one hand and the impact of failures on the execution on the other. In the first part of this thesis we design two stochastic performance models (minimizing the impact of failures and of checkpoint overhead on the expected completion time of the execution, as a function of the failure inter-arrival distribution). In the first variant the objective is to minimize the expected completion time, assuming that the application is preemptive. In this case we first derive an analytical expression for the optimal checkpoint period when the failure rate and the checkpoint overhead are constant. When the failure rate or the checkpoint overheads are arbitrary, on the other hand, we present a numerical approach for computing the optimal checkpoint schedule.
In the second variant, the objective is to minimize the expected total amount of time lost before the first failure, considering non-preemptive applications. In this case we first prove that if the checkpoint overheads are arbitrary, then the problem of finding the best checkpoint schedule is NP-complete. We then present a dynamic programming scheme for computing an optimal schedule. In the second part of this thesis we focus on designing fault-tolerant scheduling strategies that optimize both the completion time of the last task and the probability of success of the application. We show in this case that, depending on the nature of the failure distribution, the two objectives to be optimized are sometimes conflicting and sometimes aligned. Then, depending on the nature of the failure distribution, we give scheduling approaches with guaranteed performance ratios with respect to both objectives.
The parallel computing platforms available today are increasingly larger. Typically the emerging parallel platforms will be…
Advisors/Committee Members: Trystram, Denis (thesis director), Gautier, Thierry (thesis director).
Subjects/Keywords: Tolérance aux pannes; Sauvegarde et reprise; Ordonnancement multi-objectifs; Grille de calcul; Fiabilité; Fault tolerance; Checkpoint restart; Multi-objective scheduling; HPC
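The abstract above mentions an analytical expression for the optimal checkpoint period under a constant failure rate and constant checkpoint cost, but the listing does not reproduce it. As a hedged illustration only, the sketch below uses the classical first-order approximation usually attributed to Young, period = sqrt(2 × checkpoint cost × mean time between failures), which assumes the checkpoint cost is small compared with the mean time between failures. The function names and the simple waste model are assumptions made for this example and are not taken from the thesis.

```python
import math


def young_period(checkpoint_cost, mtbf):
    """First-order optimal checkpoint period (Young's approximation).

    Both arguments are in the same time unit; mtbf is the mean time
    between failures, i.e. 1 / (constant failure rate).
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def waste_fraction(period, checkpoint_cost, mtbf):
    """Approximate fraction of time lost for a given period.

    Assumed model: checkpoint overhead (cost / period) plus expected
    rework after a failure (half a period lost every mtbf).
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


# Example: 60 s checkpoints on a platform failing about once a day.
best = young_period(60.0, 86400.0)
overhead = waste_fraction(best, 60.0, 86400.0)
```

Setting the derivative of the assumed waste model to zero gives the square-root expression, which is why checkpointing either more or less often than `best` increases the estimated waste in this simplified model.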
APA (6th Edition):
Bouguerra, M. S. (2012). Tolérance aux pannes dans des environnements de calcul parallèle et distribué : optimisation des stratégies de sauvegarde/reprise et ordonnancement : Fault tolerance in the parallel and distributed environments : optimizing the checkpoint restart strategy and scheduling. (Doctoral Dissertation). Université de Grenoble. Retrieved from http://www.theses.fr/2012GRENM023
Chicago Manual of Style (16th Edition):
Bouguerra, Mohamed Slim. “Tolérance aux pannes dans des environnements de calcul parallèle et distribué : optimisation des stratégies de sauvegarde/reprise et ordonnancement : Fault tolerance in the parallel and distributed environments : optimizing the checkpoint restart strategy and scheduling.” 2012. Doctoral Dissertation, Université de Grenoble. Accessed April 14, 2021.
http://www.theses.fr/2012GRENM023.
MLA Handbook (7th Edition):
Bouguerra, Mohamed Slim. “Tolérance aux pannes dans des environnements de calcul parallèle et distribué : optimisation des stratégies de sauvegarde/reprise et ordonnancement : Fault tolerance in the parallel and distributed environments : optimizing the checkpoint restart strategy and scheduling.” 2012. Web. 14 Apr 2021.
Vancouver:
Bouguerra MS. Tolérance aux pannes dans des environnements de calcul parallèle et distribué : optimisation des stratégies de sauvegarde/reprise et ordonnancement : Fault tolerance in the parallel and distributed environments : optimizing the checkpoint restart strategy and scheduling. [Internet] [Doctoral dissertation]. Université de Grenoble; 2012. [cited 2021 Apr 14].
Available from: http://www.theses.fr/2012GRENM023.
Council of Science Editors:
Bouguerra MS. Tolérance aux pannes dans des environnements de calcul parallèle et distribué : optimisation des stratégies de sauvegarde/reprise et ordonnancement : Fault tolerance in the parallel and distributed environments : optimizing the checkpoint restart strategy and scheduling. [Doctoral Dissertation]. Université de Grenoble; 2012. Available from: http://www.theses.fr/2012GRENM023
11.
Hamouda, Sara S.
Resilience in high-level parallel programming languages.
Degree: 2019, Australian National University
URL: http://hdl.handle.net/1885/164137
▼ The consistent trends of increasing core counts and decreasing mean-time-to-failure in supercomputers make supporting task parallelism and resilience a necessity in HPC programming models. Given the complexity of managing multi-threaded distributed execution in the presence of failures, there is a critical need for task-parallel abstractions that simplify writing efficient, modular, and understandable fault-tolerant applications. MPI User-Level Failure Mitigation (MPI-ULFM) is an emerging fault-tolerant specification of MPI. It supports failure detection by returning special error codes and provides new interfaces for failure mitigation. Unfortunately, the unstructured form of failure reporting provided by MPI-ULFM hinders the composability and the clarity of the fault-tolerant programs. The low-level programming model of MPI and the simplistic failure reporting mechanism adopted by MPI-ULFM make MPI-ULFM more suitable as a low-level communication layer for resilient high-level languages, rather than a direct programming model for application development. The asynchronous partitioned global address space model is a high-level programming model designed to improve the productivity of developing large-scale applications. It represents a computation as a global control flow of nested parallel tasks that use global data partitioned among processes. Recent advances in the APGAS model supported control flow recovery by adding failure awareness to the nested parallelism model – async-finish – and by providing structured failure reporting through exceptions. Unfortunately, the current implementation of the resilient async-finish model results in a high performance overhead that can restrict the scalability of applications. Moreover, the lack of data resilience support limits the productivity of the model as it shifts the challenges of handling data availability and atomicity under failure to the programmer. 
In this thesis, we demonstrate that resilient APGAS languages can achieve scalable performance under failure by exploiting fault tolerance features in emerging communication libraries such as MPI-ULFM. We propose multi-resolution resilience, in which high-level resilient constructs are composed from efficient lower-level resilient constructs, as an approach for bridging the gap between the efficiency of user-level fault tolerance and the productivity of system-level fault tolerance. To address the limited resilience efficiency of the async-finish model, we propose 'optimistic finish' – a message-optimal resilient termination detection protocol for the finish construct. To improve programmer productivity, we augment the APGAS model with resilient data stores that can simplify preserving critical application data in the presence of failure. In addition, we propose the 'transactional finish' construct as a productive mechanism for handling atomic updates on resilient data. Finally, we demonstrate the multi-resolution resilience approach by designing high-level resilient application frameworks based on the async-finish…
Subjects/Keywords: APGAS; Resilience; Fault Tolerance; X10; MPI-ULFM; Transactional Memory; Checkpoint-Restart; Async-Finish; Task-Based Runtime Systems; Termination Detection; Taxonomy of Resilient Programming Models
APA (6th Edition):
Hamouda, S. S. (2019). Resilience in high-level parallel programming languages. (Thesis). Australian National University. Retrieved from http://hdl.handle.net/1885/164137
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Hamouda, Sara S. “Resilience in high-level parallel programming languages.” 2019. Thesis, Australian National University. Accessed April 14, 2021.
http://hdl.handle.net/1885/164137.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Hamouda, Sara S. “Resilience in high-level parallel programming languages.” 2019. Web. 14 Apr 2021.
Vancouver:
Hamouda SS. Resilience in high-level parallel programming languages. [Internet] [Thesis]. Australian National University; 2019. [cited 2021 Apr 14].
Available from: http://hdl.handle.net/1885/164137.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Hamouda SS. Resilience in high-level parallel programming languages. [Thesis]. Australian National University; 2019. Available from: http://hdl.handle.net/1885/164137
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

Northeastern University
12.
Arya, Kapil.
User-space process virtualization in the context of checkpoint-restart and virtual machines.
Degree: PhD, Department of Computer Science, 2015, Northeastern University
URL: http://hdl.handle.net/2047/d20005096
▼ Checkpoint-Restart is the ability to save a set of running processes to a checkpoint image on disk, and to later restart them from the disk. In addition to its traditional use in fault tolerance, recovering from a system failure, it has numerous other uses, such as for application debugging and save/restore of the workspace of an interactive problem-solving environment. Transparent checkpointing operates without modifying the underlying application program, but it implicitly relies on a "Closed World Assumption" – the world (including file system, network, etc.) will look the same upon restart as it did at the time of checkpoint. This is not valid for more complex programs. Until now, checkpoint-restart packages have adopted ad hoc solutions for each case where the environment changes upon restart. This dissertation presents user-space process virtualization to decouple application processes from the external subsystems. A thin virtualization layer is introduced between the application and each external subsystem. It provides the application with a consistent view of the external world and allows for checkpoint-restart to succeed. The ever-growing number of external subsystems makes it harder to deploy and maintain virtualization layers in a monolithic checkpoint-restart system. To address this, an adaptive plugin-based approach is used to implement the virtualization layers that allow the checkpoint-restart system to grow organically. The principle of decoupling the external subsystem through process virtualization is also applied in the context of virtual machines for providing a solution to the long-standing double-paging problem. Double-paging occurs when the guest attempts to page out memory that has previously been swapped out by the hypervisor and leads to long delays for the guest as the contents are read back into machine memory only to be written out again.
The performance rapidly drops as a result of significant lengthening of the time to complete the guest I/O request.
Subjects/Keywords: Checkpoint-restart; Distributed computing; Fault-tolerance; Paging; Virtualization; Virtual machines; Computer Sciences; Virtual computer systems; Programming; Data recovery (Computer science); Application software; Programming; Electronic data processing; Distributed processing
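The abstract above describes a thin virtualization layer that gives the application a consistent view of external resources across a restart. One way to picture this is a translation table between stable virtual identifiers, which the application holds, and real identifiers, which change after restart. The sketch below is purely illustrative: the class `VirtualIdTable` and its methods are hypothetical and are not the API of the dissertation's system or of any real checkpointing package.

```python
class VirtualIdTable:
    """Maps stable virtual ids (seen by the application) to real ids.

    Hypothetical illustration: real ids stand in for anything the
    environment reassigns on restart, such as process ids.
    """

    def __init__(self):
        self._real_by_virtual = {}
        self._next_virtual = 1

    def register(self, real_id):
        """Record a new real id and hand the application a virtual one."""
        virtual_id = self._next_virtual
        self._next_virtual += 1
        self._real_by_virtual[virtual_id] = real_id
        return virtual_id

    def to_real(self, virtual_id):
        """Translate on every call that crosses into the real subsystem."""
        return self._real_by_virtual[virtual_id]

    def checkpoint(self):
        """Only virtual ids are saved; real ids are not assumed to survive."""
        return sorted(self._real_by_virtual)

    def restart(self, saved_virtual_ids, new_real_ids):
        """Rebind each saved virtual id to the id obtained after restart."""
        if len(saved_virtual_ids) != len(new_real_ids):
            raise ValueError("one new real id is required per virtual id")
        self._real_by_virtual = dict(zip(saved_virtual_ids, new_real_ids))
        self._next_virtual = max(saved_virtual_ids, default=0) + 1


table = VirtualIdTable()
vid = table.register(4242)           # application keeps using vid
saved = table.checkpoint()
restored = VirtualIdTable()
restored.restart(saved, [9001])      # environment assigned a new real id
```

In this sketch the application never stores a real identifier, so the identifier it holds remains valid after restart even though the underlying resource was recreated with a different real id.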
APA (6th Edition):
Arya, K. (2015). User-space process virtualization in the context of checkpoint-restart and virtual machines. (Doctoral Dissertation). Northeastern University. Retrieved from http://hdl.handle.net/2047/d20005096
Chicago Manual of Style (16th Edition):
Arya, Kapil. “User-space process virtualization in the context of checkpoint-restart and virtual machines.” 2015. Doctoral Dissertation, Northeastern University. Accessed April 14, 2021.
http://hdl.handle.net/2047/d20005096.
MLA Handbook (7th Edition):
Arya, Kapil. “User-space process virtualization in the context of checkpoint-restart and virtual machines.” 2015. Web. 14 Apr 2021.
Vancouver:
Arya K. User-space process virtualization in the context of checkpoint-restart and virtual machines. [Internet] [Doctoral dissertation]. Northeastern University; 2015. [cited 2021 Apr 14].
Available from: http://hdl.handle.net/2047/d20005096.
Council of Science Editors:
Arya K. User-space process virtualization in the context of checkpoint-restart and virtual machines. [Doctoral Dissertation]. Northeastern University; 2015. Available from: http://hdl.handle.net/2047/d20005096
13.
Abeyratne, Sandunmalee.
Studies in Exascale Computer Architecture: Interconnect, Resiliency, and Checkpointing.
Degree: PhD, Computer Science & Engineering, 2017, University of Michigan
URL: http://hdl.handle.net/2027.42/137096
▼ Today’s supercomputers are built from the state-of-the-art components to extract as much performance as possible to solve the most computationally intensive problems in the world. Building the next generation of exascale supercomputers, however, would require re-architecting many of these components to extract over 50x more performance than the current fastest supercomputer in the United States. To contribute towards this goal, two aspects of the compute node architecture were examined in this thesis: the on-chip interconnect topology and the memory and storage checkpointing platforms.
As a first step, a skeleton exascale system was modeled to meet 1 exaflop of performance along with 100 petabytes of main memory. The model revealed that large kilo-core processors would be necessary to meet the exaflop performance goal; existing topologies, however, would not scale to those levels. To address this new challenge, we investigated and proposed asymmetric high-radix topologies that decoupled local and global communications and used different radix routers for switching network traffic at each level. The proposed topologies scaled more readily to higher numbers of cores with better latency and energy consumption than before.
The vast number of components that the model revealed would be needed in these exascale systems cautioned towards better fault tolerance mechanisms. To address this challenge, we showed that local checkpoints within the compute node can be saved to a hybrid DRAM and SSD platform in order to write them faster without wearing out the SSD or consuming a lot of energy. A hybrid checkpointing platform allowed more frequent checkpoints to be made without sacrificing performance. Subsequently, we proposed switching to a DIMM-based SSD in order to perform fine-grained I/O operations that would be integral in interleaving checkpointing and computation while still providing persistence guarantees. Two more techniques that consolidate and overlap checkpointing were designed to better hide the checkpointing latency to the SSD.
Advisors/Committee Members: Dreslinski Jr, Ronald (committee member), Mudge, Trevor N (committee member), Blaauw, David (committee member), Chakrabarti, Chaitali (committee member), Das, Reetuparna (committee member).
Subjects/Keywords: exascale supercomputer architecture; kilo-core on-chip interconnect topology; checkpoint/restart fault tolerance; Computer Science; Engineering
…3.1 3.2 Fault Tolerance in High Performance Computing; Checkpoint/Restart…
…compute node’s storage. Checkpoint/restart is a key ingredient in attaining resilience, but it…
…information of fault tolerance, checkpoint/restart, non-volatile memories and flash. Chapter III…
…background into fault tolerance, checkpoint/restart, and flash memory. 2.1 Fault Tolerance in…
…be higher. 2.2 Checkpoint/Restart The most common approach to fault tolerance in high…
APA (6th Edition):
Abeyratne, S. (2017). Studies in Exascale Computer Architecture: Interconnect, Resiliency, and Checkpointing. (Doctoral Dissertation). University of Michigan. Retrieved from http://hdl.handle.net/2027.42/137096
Chicago Manual of Style (16th Edition):
Abeyratne, Sandunmalee. “Studies in Exascale Computer Architecture: Interconnect, Resiliency, and Checkpointing.” 2017. Doctoral Dissertation, University of Michigan. Accessed April 14, 2021.
http://hdl.handle.net/2027.42/137096.
MLA Handbook (7th Edition):
Abeyratne, Sandunmalee. “Studies in Exascale Computer Architecture: Interconnect, Resiliency, and Checkpointing.” 2017. Web. 14 Apr 2021.
Vancouver:
Abeyratne S. Studies in Exascale Computer Architecture: Interconnect, Resiliency, and Checkpointing. [Internet] [Doctoral dissertation]. University of Michigan; 2017. [cited 2021 Apr 14].
Available from: http://hdl.handle.net/2027.42/137096.
Council of Science Editors:
Abeyratne S. Studies in Exascale Computer Architecture: Interconnect, Resiliency, and Checkpointing. [Doctoral Dissertation]. University of Michigan; 2017. Available from: http://hdl.handle.net/2027.42/137096
14.
Bentria, Dounia.
Combining checkpointing and other resilience mechanisms for exascale systems : L'utilisation conjointe de mécanismes de sauvegarde de points de reprise (checkpoints) et d'autres mécanismes de résilience pour les systèmes exascales.
Degree: Docteur es, Informatique, 2014, Lyon, École normale supérieure
URL: http://www.theses.fr/2014ENSL0971
▼ In this thesis, we are interested in scheduling and optimization problems in probabilistic contexts. The contributions of this thesis come in two parts. The first part is dedicated to the optimization of different fault-tolerance mechanisms for very large scale machines that are subject to a probability of failure. The second part is devoted to optimizing the execution cost of trees of Boolean operators over data streams. In the first part, we address resilience problems for next-generation machines known as "exascale" (platforms able to perform 10^18 operations per second). In the first chapter, we present the state of the art of the most widely used fault tolerance mechanisms and general results related to resilience. In the second chapter, we study a model for evaluating checkpoint and restart protocols. The proposed model is generic enough to cover the extreme situations: coordinated checkpointing on one side, and a whole family of uncoordinated strategies on the other. We provide a detailed analysis of several scenarios, including some of the most powerful existing computing platforms, as well as projections for future exascale platforms. In the third, fourth and fifth chapters, we study the joint use of different fault tolerance mechanisms (replication, failure prediction and silent error detection) with the traditional checkpoint and restart mechanism. We evaluated several models by means of simulations.
Our results show that these models are beneficial for a range of application models in the context of future exascale platforms. In the second part of the thesis, we study the problem of minimizing the cost of data retrieval by applications when processing a query expressed as trees of Boolean operators applied to predicates over sensor data streams. The problem is to determine the order in which the predicates should be evaluated so as to minimize the expected cost of processing the query. In the sixth chapter, we present the state of the art for the second part, and in the seventh chapter, we study the problem for queries expressed in disjunctive normal form. We consider the more general case where each stream may appear in several predicates, and we study two models: the model where each predicate may access a single stream and the model where each predicate may access several streams.
In this thesis, we are interested in scheduling and optimization problems in probabilistic contexts. The contributions of this thesis come in two parts. The first part is dedicated to the optimization of different fault-tolerance mechanisms for very large scale machines that are subject to a probability…
Advisors/Committee Members: Vivien, Frédéric (thesis director).
Subjects/Keywords: Tolérance aux pannes; Exascale; Optimisation; Ordonnancement; Sauvegarde de points de reprise (checkpoints) et de redémarrage; Réplication; Prédiction de fautes; Erreurs silencieuses; Traitement de requêtes; Opérateurs booléens; Énergie; Algorithme glouton; Partage de données; Algorithmique probabiliste; Fault tolerance; Exascale; Optimization; Scheduling; Checkpoint/restart; Replication; Fault prediction; Silent errors; Query processing; Boolean operators; Energy; Greedy algorithm; Data sharing
APA (6th Edition):
Bentria, D. (2014). Combining checkpointing and other resilience mechanisms for exascale systems : L'utilisation conjointe de mécanismes de sauvegarde de points de reprise (checkpoints) et d'autres mécanismes de résilience pour les systèmes exascales. (Doctoral Dissertation). Lyon, École normale supérieure. Retrieved from http://www.theses.fr/2014ENSL0971
Chicago Manual of Style (16th Edition):
Bentria, Dounia. “Combining checkpointing and other resilience mechanisms for exascale systems : L'utilisation conjointe de mécanismes de sauvegarde de points de reprise (checkpoints) et d'autres mécanismes de résilience pour les systèmes exascales.” 2014. Doctoral Dissertation, Lyon, École normale supérieure. Accessed April 14, 2021.
http://www.theses.fr/2014ENSL0971.
MLA Handbook (7th Edition):
Bentria, Dounia. “Combining checkpointing and other resilience mechanisms for exascale systems : L'utilisation conjointe de mécanismes de sauvegarde de points de reprise (checkpoints) et d'autres mécanismes de résilience pour les systèmes exascales.” 2014. Web. 14 Apr 2021.
Vancouver:
Bentria D. Combining checkpointing and other resilience mechanisms for exascale systems : L'utilisation conjointe de mécanismes de sauvegarde de points de reprise (checkpoints) et d'autres mécanismes de résilience pour les systèmes exascales. [Internet] [Doctoral dissertation]. Lyon, École normale supérieure; 2014. [cited 2021 Apr 14].
Available from: http://www.theses.fr/2014ENSL0971.
Council of Science Editors:
Bentria D. Combining checkpointing and other resilience mechanisms for exascale systems : L'utilisation conjointe de mécanismes de sauvegarde de points de reprise (checkpoints) et d'autres mécanismes de résilience pour les systèmes exascales. [Doctoral Dissertation]. Lyon, École normale supérieure; 2014. Available from: http://www.theses.fr/2014ENSL0971
15.
Calhoun, Jon Cameron.
From detection to optimization: impact of soft errors on high-performance computing applications.
Degree: PhD, Computer Science, 2017, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/98379
▼ As high-performance computing (HPC) continues to progress, constraints on HPC system design force the handling of errors to higher levels in the software stack. Of the types of errors facing HPC, soft errors that silently corrupt system or application state are among the most severe. Understanding the behavior of HPC applications in the presence of soft errors is critical to using HPC systems effectively. This understanding can guide the development of algorithm-based error detection informed by application characteristics from fault injection and error propagation studies. Furthermore, the realization that applications are tolerant to small errors allows optimizations such as lossy compression on high-cost data transfers. Lossy compression adds small, user-controllable amounts of error when compressing data, reducing data size before expensive data transfers and thereby saving time. This dissertation investigates and improves the resiliency of HPC applications to soft errors, and explores lossy compression as a new form of optimization for expensive, time-consuming data transfers.
Advisors/Committee Members: Snir, Marc (advisor), Olson, Luke N (committee chair), Gropp, William (committee member), Cappello, Franck (committee member).
Subjects/Keywords: High-performance computing; Fault tolerance; Silent data corruption; Soft errors; Error detection; Error recovery; Fault injection; Error propagation; Lossy compression; Checkpoint-restart
…checkpoint-restart.
Checkpoint-restart
HPC checkpoint-restart relies on a short detection latency… …]. System-level checkpoint-restart [58, 87, 45] offers the ability to recover… …globally coordinated checkpoint-restart limits application performance due to coordination and I… …asynchronous checkpoint-restart schemes have been developed [83, 94, 42], but are… …logged communication.
Although application-based checkpoint-restart and asynchronous checkpoint…
APA (6th Edition):
Calhoun, J. C. (2017). From detection to optimization: impact of soft errors on high-performance computing applications. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/98379
Chicago Manual of Style (16th Edition):
Calhoun, Jon Cameron. “From detection to optimization: impact of soft errors on high-performance computing applications.” 2017. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed April 14, 2021.
http://hdl.handle.net/2142/98379.
MLA Handbook (7th Edition):
Calhoun, Jon Cameron. “From detection to optimization: impact of soft errors on high-performance computing applications.” 2017. Web. 14 Apr 2021.
Vancouver:
Calhoun JC. From detection to optimization: impact of soft errors on high-performance computing applications. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2017. [cited 2021 Apr 14].
Available from: http://hdl.handle.net/2142/98379.
Council of Science Editors:
Calhoun JC. From detection to optimization: impact of soft errors on high-performance computing applications. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2017. Available from: http://hdl.handle.net/2142/98379