You searched for +publisher:"University of Michigan" +contributor:("Davidson, Edward S.").
Showing records 1 – 20 of 20 total matches.
No search limiters apply to these results.

University of Michigan
1.
Boyd, Eric Logan.
Performance evaluation and improvement of parallel applications on high performance architectures.
Degree: PhD, Computer Science and Engineering, 1995, University of Michigan
URL: http://hdl.handle.net/2027.42/104748
An effective methodology of performance evaluation and improvement enables application developers to quickly and efficiently tune their applications to achieve good performance. One efficient approach to improving performance for scientific applications is to bound the best achievable performance that a machine could deliver on a particular application code and then try to approach this bound in delivered performance. A useful abstraction is to approximate components of the application, architecture, and compiler as independent contributors to the total application runtime. A hierarchy of such bounds equations can be constructed by successively taking into account actual rather than idealized additional components of machine runtime (and hence performance) and creating a new bounds equation for each such increasingly detailed abstraction. The MACS12*B bounds hierarchy that we develop models most of the common factors contributing to run time, including limits posed by peak performance, essential operations in the high-level code, nonessential instructions inserted by the compiler and scheduler, misses in the first-level cache, misses in the second-level cache which generate communication events, congestion effects leading to long miss latencies, and load imbalance effects. Given this hierarchy, it is possible to focus application performance improvement efforts on the causes of the largest gaps between successive performance bounds in the hierarchy. We develop a methodology of performance improvement in which we analyze performance, target performance bottleneck(s), select a standard heuristic applicable to the targeted bottleneck, restructure or change the high-level application, and iterate until acceptable performance has been achieved. A case study in which two of the NAS parallel benchmarks are tuned for the Kendall Square Research KSR parallel computer illustrates the power of this approach. These applications are analyzed with the MACS12*B bounds hierarchy and their performance bottlenecks are identified. The original applications are progressively restructured and/or changed using the iterative methodology. The performance of the resulting codes is found to compare favorably to the KSR-supplied hand-tuned versions.
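The gap analysis over a bounds hierarchy that this abstract describes can be sketched as follows. The bound labels and runtimes below are hypothetical stand-ins, not values from the thesis; the point is only that the largest gap between successive bounds (or between the last bound and measured runtime) identifies the bottleneck to attack first.

```python
def largest_gap(bounds, measured):
    """bounds: list of (label, runtime_bound) pairs ordered from the most
    idealized level (smallest runtime) to the most detailed; measured is
    the actual runtime. Returns the adjacent pair of levels with the
    largest runtime gap, i.e. the prime tuning target."""
    levels = bounds + [("measured", measured)]
    gaps = [
        (levels[i][0], levels[i + 1][0], levels[i + 1][1] - levels[i][1])
        for i in range(len(levels) - 1)
    ]
    return max(gaps, key=lambda g: g[2])

# Hypothetical bound runtimes (seconds) for one application:
hierarchy = [
    ("M: peak machine", 1.0),                 # limits of peak performance
    ("A: + essential operations", 1.4),
    ("C: + compiler-inserted operations", 2.1),
    ("S: + scheduling effects", 2.4),
]
print(largest_gap(hierarchy, measured=3.5))
```

With these made-up numbers the largest gap lies between the most detailed bound and the measured runtime, so tuning effort would go to effects not yet captured by the model (e.g. cache misses or load imbalance in the real hierarchy).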
Advisors/Committee Members: Davidson, Edward S. (advisor).
Subjects/Keywords: Engineering, Electronics and Electrical; Computer Science

2.
Abandah, Gheith Ali.
Reducing communication cost in scalable shared memory systems.
Degree: PhD, Computer science, 1998, University of Michigan
URL: http://hdl.handle.net/2027.42/130888
Distributed shared-memory systems provide scalable performance and a convenient model for parallel programming. However, their non-uniform memory latency often makes it difficult to develop efficient parallel applications. Future systems should reduce communication cost to achieve better programmability and performance. We have developed a methodology, and implemented a suite of tools, to guide the search for improved codes and systems. As the result of one such search, we recommend a remote data caching technique that significantly reduces communication cost. We analyze applications by instrumenting their assembly-code sources. During execution, an instrumented application pipes a detailed trace to configuration-independent (CIAT) and configuration-dependent (CDAT) analysis tools. CIAT characterizes inherent application characteristics that do not change from one configuration to another, including working sets, concurrency, sharing behavior, communication patterns, variation over time, slack, and locality. CDAT simulates the trace on a particular hardware model and is easily retargeted to new systems. CIAT is faster than detailed simulation; however, CDAT directly provides more information about a specific configuration. The combination of the two tools constitutes a comprehensive and efficient methodology. We calibrate existing systems using carefully designed microbenchmarks that characterize local and remote memory, producer-consumer communication involving two or more processors, and contention when multiple processors utilize memory and interconnect. This dissertation describes these tools; illustrates their use by characterizing a wide range of applications and assessing the effects of architectural and technological advances on the performance of HP/Convex Exemplar systems; evaluates strengths and weaknesses of current system approaches; and recommends solutions.
CDAT analysis of three CC-NUMA system approaches shows that current systems reduce communication cost by minimizing either remote latency or remote communication frequency. We describe four architecturally varied systems that are technologically similar to the low remote latency SGI Origin 2000, but incorporate additional techniques for reducing the number of remote communications. Using CCAT, a CDAT-like simulator that models communication contention, we evaluate the worthiness of these techniques. Superior performance is reached when processors supply cached clean data, or when a remote data cache is introduced that participates in the local bus protocol.
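The remote-data-cache recommendation above can be illustrated with a toy trace replay in the spirit of a CDAT-style configuration-dependent analysis. The address trace, home assignment, and cache geometry below are all hypothetical; the sketch only shows how a small direct-mapped remote data cache filters repeated remote fetches.

```python
def remote_refs(trace, local_lines, rdc_lines=0, line=64):
    """Count references that must go remote. A reference is satisfied
    locally if its cache line is homed locally or, when a remote data
    cache (RDC) is configured, if the line currently sits in the
    direct-mapped RDC."""
    rdc = [None] * rdc_lines
    remote = 0
    for addr in trace:
        ln = addr // line
        if ln in local_lines:
            continue                      # homed locally: no traffic
        if rdc_lines and rdc[ln % rdc_lines] == ln:
            continue                      # RDC hit: no remote traffic
        remote += 1                       # remote fetch required
        if rdc_lines:
            rdc[ln % rdc_lines] = ln      # install the fetched line
    return remote

trace = [0, 4096, 0, 4096, 4160, 4096, 0]    # hypothetical byte addresses
home = {0}                                   # only line 0 is homed locally
print(remote_refs(trace, home))              # every remote reference fetches
print(remote_refs(trace, home, rdc_lines=4)) # the RDC filters the repeats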
Advisors/Committee Members: Davidson, Edward S. (advisor).
Subjects/Keywords: Communication; Cost; Microbenchmarking; Reducing; Scalable; Shared Memory; Systems

3.
Meleis, Waleed M.
Optimal instruction scheduling and register allocation for multiple-issue processors.
Degree: PhD, Computer Science and Engineering, 1996, University of Michigan
URL: http://hdl.handle.net/2027.42/105122
As processors make use of wider instruction issue and deeper pipelines, the number of instructions in flight, and consequently the number of simultaneously live values per cycle, increases. Fine-grain scheduling and register allocation algorithms need to be closely coupled to make the best use of the available parallelism. In my research, I have developed a group of optimal scheduling and register allocation algorithms for statically scheduled processors that can issue more than one operation per cycle. This class of processors includes vector supercomputers, VLIW computers such as the Cydra-5 and Multiflow Trace, and the Kendall Square Research (KSR) machine. My contributions include:
- An integer linear programming formulation that computes the shortest schedule for a general multiple-issue processor. This formulation inserts spills where needed, puts no restrictions on the program dependence graph, and admits a wide range of processors with specifiable issue widths, instruction latencies, and register set sizes.
- A quadratic-time algorithm that optimally schedules a binary dependence tree on a dual-issue processor under the classical restriction that the operations all have unit latency. By showing that at least one optimal solution is in contiguous form and that all k-spill contiguous-form schedules have the same cost, I am able to eliminate all but one k-spill candidate schedule from consideration.
- A dynamic programming algorithm to find an optimal register allocation that minimizes spill cost for a given dual-issue instruction schedule. Value exclusions and implicit and explicit pruning rules are used to substantially reduce the size of the search space.
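The machine model underlying this work (dual issue, unit-latency operations, precedence given by a dependence graph) can be made concrete with a small scheduler. The greedy list scheduler below is NOT the optimal algorithm the dissertation develops; it is only a sketch of the shortest-schedule problem being solved, with a hypothetical dependence tree.

```python
def list_schedule(deps, issue_width=2):
    """deps: {op: set of predecessor ops}. Each op takes one cycle
    (unit latency). Returns a list of cycles, each the list of ops
    issued in that cycle."""
    remaining = {op: set(pre) for op, pre in deps.items()}
    done, cycles = set(), []
    while remaining:
        # Ops whose predecessors have all completed are ready to issue.
        ready = sorted(op for op, pre in remaining.items() if pre <= done)
        slot = ready[:issue_width]          # issue at most issue_width ops
        for op in slot:
            del remaining[op]
        done |= set(slot)
        cycles.append(slot)
    return cycles

# Hypothetical binary dependence tree: c consumes a and b; e consumes c and d.
tree = {"a": set(), "b": set(), "c": {"a", "b"}, "d": set(), "e": {"c", "d"}}
print(list_schedule(tree))
```

An optimal scheduler would additionally weigh register pressure and spill insertion, which is exactly where the integer-programming and dynamic-programming formulations described above come in.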
Advisors/Committee Members: Davidson, Edward S. (advisor).
Subjects/Keywords: Computer Science

4.
Annavaram, Murali Mohan Kumar.
Prefetch mechanisms that acquire and exploit application specific knowledge.
Degree: PhD, Electrical engineering, 2001, University of Michigan
URL: http://hdl.handle.net/2027.42/124668
The large number of cache misses of current applications, coupled with the increasing cache miss latencies in current processor designs, causes significant performance degradation, even in aggressive out-of-order processors. This dissertation explores two novel prefetching techniques to reduce I-cache and D-cache misses by acquiring and exploiting application-specific knowledge. The first part of this dissertation focuses on reducing I-cache misses of database systems using Call Graph Prefetching (CGP). CGP can be implemented either in software or in hardware. Both implementations are based on the insight that the sequence of function calls in a DBMS is highly predictable. CGP leverages this predictability by analyzing the call graph of a DBMS to prefetch a function that is likely to be called next. We evaluate the performance of CGP on sets of Wisconsin and TPC-H queries running on today's typical out-of-order superscalar processor models and show that CGP reduces the I-cache miss stall time by nearly 50%. The second part of this dissertation addresses data prefetching. Initially we developed Ancestor Driven Prefetching (ADP). ADP builds ancestor graphs for frequently executed load/store instructions; it exploits the correlation between values defined by an ancestor of a group of memory instructions and the subsequent addresses they access to issue prefetches for the group. ADP suffered accuracy and timeliness problems that limited its success, but the insights gained from it led us to precompute prefetch addresses instead of predicting them. Dependence Graph Precomputation (DGP) takes a novel approach to data prefetching. Once an instruction is fetched from the I-cache into the Instruction Fetch Queue (IFQ), its dependences are determined and stored as pointers with the instruction in the IFQ. When a load/store instruction that is deemed likely to cause a cache miss enters the IFQ, a Dependence Graph Generator follows the dependence pointers still within the IFQ to generate the dependence graph of those instructions yet to be executed that will determine the address of the load/store instruction. A separate precomputation engine executes these graphs to generate the load/store addresses early enough for accurate prefetching. Our results show DGP reduces the D-cache miss stall time by 47%. Thus the techniques presented in this dissertation, CGP and DGP, take us about half way from an already highly tuned baseline system toward performance with perfect I-cache and perfect D-cache, respectively.
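The core insight behind CGP — that the next callee in a DBMS is highly predictable from the call graph — can be sketched in a few lines. The function names and call trace below are hypothetical; a real implementation would operate on instruction addresses and prefetch cache lines of the predicted callee's code.

```python
from collections import Counter, defaultdict

def build_call_graph(call_trace):
    """From an observed sequence of function calls, record for each
    function the callee that most often follows it."""
    nxt = defaultdict(Counter)
    for caller, callee in zip(call_trace, call_trace[1:]):
        nxt[caller][callee] += 1
    return {f: c.most_common(1)[0][0] for f, c in nxt.items()}

def prefetch_for(func, graph):
    """The function whose code a CGP-style scheme would prefetch on
    entry to `func` (None if nothing has been observed yet)."""
    return graph.get(func)

# Hypothetical DBMS call trace: "lock" almost always follows "scan".
trace = ["scan", "lock", "fetch", "scan", "lock", "fetch", "scan", "sort"]
graph = build_call_graph(trace)
print(prefetch_for("scan", graph))
```
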
Advisors/Committee Members: Davidson, Edward S. (advisor).
Subjects/Keywords: Acquire; Application-specific Knowledge; Call Graph Prefetching; Database Performance; Exploit; Mechanisms; Prefetch

5.
Shih, Tien-Pao.
Goal-directed performance tuning for scientific applications.
Degree: PhD, Computer Science and Engineering, 1996, University of Michigan
URL: http://hdl.handle.net/2027.42/105149
Performance tuning, as carried out by compiler designers and application programmers to close the gap between achievable peak and delivered performance, becomes increasingly important and challenging as microprocessor speeds and system sizes increase. However, although performance tuning on scientific codes usually deals with relatively small program regions, it is not generally known how to establish a reasonable performance objective and how to efficiently achieve this objective. We suggest a goal-directed approach and develop such an approach for each of three major system performance components: central processing unit (CPU) computation, memory accessing, and communication. For the CPU, we suggest using a machine-application performance model that characterizes workloads on four key function units (memory, floating-point, issue, and a virtual "dependence unit") to produce an upper-bound performance objective, and derive a mechanism to approach this objective. A case study shows an average 1.79x speedup achieved by using this approach for the Livermore Fortran Kernels 1-12 running on the IBM RS/6000. For memory, as compulsory and capacity misses are relatively easy to characterize, we derive a method for building application-specific cache behavior models that report the number of misses for all three types of conflict misses: self, cross, and ping-pong. The method uses averaging concepts to determine the expected number of cache misses instead of attempting to count them exactly in each instance, which provides a more rapid, yet realistic assessment of expected cache behavior. For each type of conflict miss, we propose a reduction method that uses one or a combination of three techniques based on modifying or exploiting data layout: array padding, initial address adjustment, and access resequencing.
A case study using a blocked matrix multiply program as an example shows that the model is within 11% of the simulation results, and that each type of conflict miss can be effectively reduced or completely eliminated. For communication in shared memory parallel systems, we derive an array grouping mechanism and related loop transformations to reduce communication caused by the problematic case of nonconsecutive references to shared arrays and prove several theorems that determine when and where to apply this technique. The experimental results show a 15% reduction in communication, a 40% reduction in data subcache misses, and an 18% reduction in maximum user time for a finite element application on a 56 processor KSR1 parallel computer.
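The conflict-miss phenomenon and the array-padding remedy described above can be demonstrated with a tiny direct-mapped cache simulation. The cache geometry, array bases, and access pattern are hypothetical, and the simulation counts misses exactly rather than using the thesis's averaging model; it only shows why padding a base address by one cache line breaks a ping-pong conflict.

```python
def misses(refs, sets=64, line=64):
    """Miss count for a direct-mapped cache: `sets` lines of `line` bytes."""
    cache, m = {}, 0
    for addr in refs:
        s = (addr // line) % sets
        tag = addr // (line * sets)
        if cache.get(s) != tag:       # wrong tag resident: a miss
            m += 1
            cache[s] = tag
    return m

def interleaved(base_a, base_b, n, stride=8):
    """Alternate accesses to two arrays, as a blocked kernel might."""
    refs = []
    for i in range(n):
        refs += [base_a + i * stride, base_b + i * stride]
    return refs

conflicting = misses(interleaved(0, 4096, 64))   # bases map to the same sets
padded = misses(interleaved(0, 4096 + 64, 64))   # pad by one cache line
print(conflicting, padded)                       # conflicts drop sharply
```
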
Advisors/Committee Members: Davidson, Edward S. (advisor).
Subjects/Keywords: Computer Science

6.
Chang, Chuan-Hua.
Performance optimization of pipeline circuits with latches and wave pipelining.
Degree: PhD, Computer Science and Engineering, 1996, University of Michigan
URL: http://hdl.handle.net/2027.42/104927
Level-sensitive latches and wave pipelining are two aggressive techniques for increasing the performance of pipelined circuits. However, they present difficult analysis and design problems that heretofore have prevented active use and full exploration of these techniques. To tackle these barriers, this dissertation solves three related problems: finding the optimal single-phase clock schedule for latch-based closed pipelines, designing stopping and restarting mechanisms for wave-pipelined circuits, and achieving a desired clock rate by inserting latches to balance the delays of a wave-pipelined circuit. The optimal single-phase clocking problem for latch-based closed pipelines has been difficult to solve because the general latch timing model generates a nonconvex solution space. The best prior approach uses a linear program to solve an overconstrained problem. We present an efficient (cubic-complexity) algorithm, Gpipe, that solves the general problem by exploiting the geometric characteristics of the full nonconvex solution space to determine the maximum single-phase clocking rate for a closed pipeline with a specified degree of wave pipelining. The long-path and short-path timing constraints are examined to prove that a further clock speedup may be obtained by permanently enabling some latches and increasing the degree of wave pipelining. Sufficient conditions are also found to identify which latches can be removed in this fashion while guaranteeing no decrease, and permitting a possible increase, in the clock rate. Wave pipelining is often avoided in design due, in part, to the difficulty of stopping and restarting a wave-pipelined circuit without losing data. The ability to stop and start a circuit is essential for pipeline stalls and reduced-rate testing; however, this problem had not previously been addressed. We have solved this problem by developing a redundant latch mechanism to store the wave-pipelined data.
A formal characterization of stoppability and startability indicates where these redundant latches may be inserted within the pipeline and how to sequence the shut down and restart of the clock. Two improved stopping and restarting mechanisms, Pitcher and Catcher, have been developed which place a FIFO queue alongside the beginning or end of each wave-pipelined stage. Chains of delay elements are typically used to increase performance by balancing the min/max path difference of wave-pipelined stages. However, better performance and cost economy can be achieved by inserting latches instead, resulting in an efficient blend of physical stages and wave pipelining to achieve the desired clock rate. A mathematical programming model and efficient heuristics are developed to insert latches with single phase, two phase, or random phase clocks. Experimental results clearly show the performance and cost advantages of using latches.
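The long-path/short-path tension at the heart of the clocking problem can be sketched with a deliberately simplified two-constraint check. This omits most of the general latch timing model the thesis actually optimizes over (latch transparency windows, setup/hold margins, clock skew); the stage delays and clock parameters below are hypothetical.

```python
def clock_ok(stages, period, degree=1):
    """stages: (min_delay, max_delay) per stage; degree is the number of
    waves concurrently in flight per stage.
    Long path: the slowest data must settle within `degree` periods.
    Short path: the fastest data must not arrive before the previous
    wave is captured, i.e. not within (degree - 1) periods."""
    for dmin, dmax in stages:
        if dmax > degree * period:         # long-path (setup) violation
            return False
        if dmin < (degree - 1) * period:   # short-path (overrun) violation
            return False
    return True

stages = [(6.0, 9.0), (7.0, 9.5)]   # hypothetical min/max stage delays (ns)
print(clock_ok(stages, period=10.0))           # conventional operation
print(clock_ok(stages, period=5.0))            # too fast for one wave
print(clock_ok(stages, period=5.0, degree=2))  # feasible with two waves
```

The third call shows the wave-pipelining payoff: the same logic sustains twice the clock rate once two waves are allowed in flight, provided the min/max delays are balanced enough — which is exactly what the latch-insertion techniques above achieve.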
Advisors/Committee Members: Davidson, Edward S. (advisor).
Subjects/Keywords: Engineering, Electronics and Electrical; Computer Science

7.
Hung, Shih-Hao.
Optimizing parallel applications.
Degree: PhD, Electrical engineering, 1998, University of Michigan
URL: http://hdl.handle.net/2027.42/131229
While parallel computing offers an attractive perspective for the future, developing efficient parallel applications today is a labor-intensive process that requires an intimate knowledge of the machines, the applications, and many subtle machine-application interactions. Optimizing applications so that they can achieve their full potential on parallel machines is often beyond the programmer's or the compiler's ability; furthermore, its complexity will not be reduced with the increasingly complex computer architectures of the foreseeable future. In this dissertation, we discuss how application performance can be optimized systematically. We show how insights regarding machine-application pairs and the weaknesses in their delivered performance can be derived by characterizing the machine, the application, and the machine-application interactions. We describe a general performance tuning scheme that can be used for selecting and applying a broad range of performance tuning actions to solve major performance problems in a structured sequence of steps, and discuss the interrelationships among performance problems and performance tuning actions. To guide programmers in performance tuning, we developed a goal-directed performance tuning methodology that employs hierarchical performance bounds to characterize the delivered performance quantitatively and explain where potential performance is lost. To reduce the complexity of performance tuning, we developed an innovative performance modeling scheme to quickly derive machine-application interactions from abstract representations of the machine and application of interest. Collectively, this dissertation unifies a range of research work done within the Parallel Performance Project at the University of Michigan over the past seven years and significantly improves the state of the art in parallel application development environments.
Advisors/Committee Members: Davidson, Edward S. (advisor).
Subjects/Keywords: High-performance Computing; Optimizing; Parallel Applications; Performance Optimization

8.
Vlaovic, Steven Alexander.
Modeling and analysis of x86-based front-end architectures.
Degree: PhD, Electrical engineering, 2002, University of Michigan
URL: http://hdl.handle.net/2027.42/130766
This dissertation analyzes x86 processor models in order to better understand the impact that the x86 instruction set architecture (ISA) has on the front end of high-performance x86 processors. In order to design better processors, it is important to understand existing processors. Real-world parameters often dictate that a specific ISA must be supported, due to the amount of legacy software available for that ISA. This is the hallmark of modern platforms that support the x86 ISA. For more than 20 years, this architecture has been in existence, and architects have had to design for both improved performance and backward compatibility. To aid in our evaluation of x86 processors, we have developed an x86 simulation infrastructure, called Trace Analysis for X86 Interpretation (TAXI), which provides a platform for modeling x86 processors. Not only does TAXI provide detailed performance results, it also has special facilities to do basic branch prediction, instruction cache, and Branch Target Buffer (BTB) studies without running the entire cycle-accurate processor model. For our BTB studies, we developed a specialized branch filtering mechanism, DLL filtering, that reduced contention in the BTB by removing Dynamically Linked Library calls from the BTB. Using the TAXI performance model, we have been able to evaluate and improve the performance of two processor models, Model III and Model 4, patterned after the Pentium III and the Pentium 4, respectively. In our analysis of Model III, we decompose front-end performance into seven orthogonal components. By improving the branch predictor and the instruction cache, we obtained 20% higher performance across our desktop application suite. We use TAXI to compare Model III with Model 4 across our application suite. Model 4 has a pipeline that is almost three times as long, yet requires only 30% more cycles per instruction (CPI) than Model III.
By developing a new classification for Model 4 trace cache accesses, we discovered that nonhead misses caused the greatest performance loss. We have proposed nonhead miss speculation, a hardware technique, which achieves about 10% speedup over our eight desktop applications by removing most of the penalty associated with nonhead misses.
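The DLL-filtering idea above — keep identified DLL calls out of the BTB so they stop evicting hotter entries — can be sketched with a toy direct-mapped BTB. The branch stream, PCs, and BTB size are hypothetical; a real mechanism would identify DLL calls by address range rather than an explicit set.

```python
def btb_misses(branches, entries=4, filtered=frozenset()):
    """branches: list of (pc, target) lookups against a direct-mapped
    BTB. PCs in `filtered` (e.g. identified DLL calls) are never
    installed, so they cannot evict other entries."""
    btb, misses = {}, 0
    for pc, target in branches:
        if pc in filtered:
            continue                        # DLL call: bypasses the BTB
        idx = pc % entries
        if btb.get(idx) != (pc, target):
            misses += 1                     # missing or stale target
        btb[idx] = (pc, target)
    return misses

# Hypothetical stream: a hot loop branch (PC 16) and a DLL call (PC 48)
# that map to the same BTB entry (16 % 4 == 48 % 4 == 0).
stream = [(16, 100), (48, 900)] * 8
print(btb_misses(stream))                   # the two evict each other
print(btb_misses(stream, filtered={48}))    # the loop branch now sticks
```
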
Advisors/Committee Members: Davidson, Edward S. (advisor).
Subjects/Keywords: Analysis; Front-end Architectures; Instruction Set; Processor Modeling; X86-based
APA (6th Edition):
Vlaovic, S. A. (2002). Modeling and analysis of x86-based front -end architectures. (Doctoral Dissertation). University of Michigan. Retrieved from http://hdl.handle.net/2027.42/130766

University of Michigan
9.
Wellman, John-David.
Processor modeling and evaluation techniques for early design stage performance comparison.
Degree: PhD, Electrical engineering, 1996, University of Michigan
URL: http://hdl.handle.net/2027.42/130159
► This thesis develops two techniques and a design space search hierarchy that can be used to examine a large space of processor designs early in…
(more)
▼ This thesis develops two techniques and a design space search hierarchy that can be used to examine a large space of processor designs early in the design cycle, before an expensive investment has been made in hardware design, allowing designers to select an initial processor design that will satisfy more of the processor's design goals, reducing the number and cost of design iterations. They also allow the designers to investigate the impact of trade-offs in the processor design, better preparing the project for later changes required for the processor to meet its area, power, or other design constraints. The first technique is the resource conflict methodology (RCM), a full execution trace driven simulation that allows specification of the processor organization, design parameters, and even the instruction set architecture model, while estimating performance as accurately as a traditional (cycle-by-cycle) simulator for the same processor model. The resource conflict methodology therefore gives the user the flexibility to examine a large number of different processor designs, but simulation time can be limiting because RCM is driven by a full execution trace. The second technique, called reduced trace analysis (RTA), addresses the simulation time problem by reducing the redundant calculation done while generating a performance estimate. RTA analyzes the trace to develop a weighted control-flow graph representation of the execution of the workload, where each node represents a sequentially executed code block and each weighted, directed link represents the number of transfers of control between the linked blocks during execution. RTA uses a trace simulator to derive estimates for each code block and each interface between connected blocks, then weights and assembles these estimates to determine the full performance estimate. This method significantly reduces the simulation time for each processor model, yet still produces estimates within a few percent of the RCM estimates. Thus, RTA allows the user to investigate a large number of designs much more quickly than full trace simulation methods.
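RTA's assembly step can be sketched as a weighted sum over a control-flow graph. The block names, execution counts, and per-block and per-edge cycle estimates below are invented for illustration; RTA's actual estimator is the trace simulator described in the abstract.

```python
# Sketch of reduced-trace-analysis assembly: per-block cycle estimates and
# per-edge interface penalties, weighted by execution counts recovered from
# the trace, are summed into a whole-program cycle estimate.

# node -> (execution count, estimated cycles per execution)
blocks = {"A": (1, 10), "B": (100, 25), "C": (100, 40), "D": (1, 5)}
# (src, dst) -> (transition count, estimated interface penalty in cycles)
edges = {("A", "B"): (1, 2), ("B", "C"): (100, 3),
         ("C", "B"): (99, 3), ("C", "D"): (1, 2)}

def rta_estimate(blocks, edges):
    block_cycles = sum(cnt * cyc for cnt, cyc in blocks.values())
    edge_cycles = sum(cnt * pen for cnt, pen in edges.values())
    return block_cycles + edge_cycles

print(rta_estimate(blocks, edges))
```

The per-block estimates are computed once per block rather than once per execution, which is where the simulation-time savings over full-trace simulation comes from.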
Advisors/Committee Members: Davidson, Edward S. (advisor), Bose, Pradip (advisor).
Subjects/Keywords: Comparison; Design; Early; Evaluation; Modeling; Performance; Processor; Reduced Trace Analysis; Resource Conflict Methodology; Stage; Techniques
APA (6th Edition):
Wellman, J. (1996). Processor modeling and evaluation techniques for early design stage performance comparison. (Doctoral Dissertation). University of Michigan. Retrieved from http://hdl.handle.net/2027.42/130159
10.
Ernst, Elizabeth Ann.
Optimal Combinational Multi-Level Logic Synthesis.
Degree: PhD, Computer Science & Engineering, 2009, University of Michigan
URL: http://hdl.handle.net/2027.42/62373
► Within the field of automated logic design, the optimal synthesis of combinational logic has remained one of the most basic design objectives. However, the computational…
(more)
▼ Within the field of automated logic design, the optimal synthesis of combinational logic has remained one of the most basic design objectives. However, the computational complexity of this optimization problem has limited the practical application of optimal synthesis for large circuits. Since much of the exact synthesis literature predates many advances in both computer hardware as well as reasoning and search techniques, it was our objective to revisit optimal synthesis. Through this investigation we hoped to complete optimal synthesis on more complex functions.
In this dissertation, we provide a general formulation of logic synthesis as an expanding search problem and describe BESS, an optimal multi-level branch-and-bound synthesis algorithm for combinational circuits. The formulation of synthesis as an expanding search problem provides insights into the difficulty of optimal synthesis. The generality of the formulation provides flexibility in the options under which synthesis can be completed. Because BESS was created from this formulation, it completes optimal synthesis under a variety of options while guaranteeing that an optimal network will be produced.
In this dissertation, we provide a comprehensive evaluation of BESS. First we describe an extensive study of the search strategies for BESS, including both empirical and theoretical arguments which explain why these strategies are able to provide a more efficient search. Then through an analysis of the algorithm, we provide proofs of both completeness and convergence as well as an analysis of the search space.
Empirically, we discuss the findings yielded by BESS. We give a database of optimal circuits: optimal implementations of all 2-, 3-, and 4-input functions, including, for the first time, the optimal implementation of the 4-input xor function. We then extend this database of known optimal circuits to include 4,745 5-input functions. We also provide optimal networks and cost formulas for n-input versions of a variety of common functions, based on an analysis of optimal network structures. Finally, by modifying the bounding technique in BESS, we provide networks for larger functions that are within a known distance from optimal. Networks with as many as 17 inputs and 16 outputs are completed.
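BESS itself is a branch-and-bound search; as a much smaller illustration of what "optimal synthesis" means, one can exhaustively find minimum NAND-gate circuits for every 2-input Boolean function by breadth-first search over sets of computed signals. This toy search is not BESS's algorithm, only an instance of the same exactness guarantee at tiny scale.

```python
# Exhaustive search for minimum NAND-gate circuits over all 2-input Boolean
# functions. A truth table is a 4-bit mask (bit i = value on input row i);
# sharing is allowed, so a state is the set of signals computed so far.

MASK = 0xF
X0, X1 = 0b1010, 0b1100          # truth tables of the inputs x0 and x1

def min_nand_gates(max_gates=5):
    best = {X0: 0, X1: 0}                  # function -> fewest gates seen
    frontier = {frozenset((X0, X1))}
    for gates in range(1, max_gates + 1):  # breadth-first by gate count
        nxt = set()
        for sigs in frontier:
            for f in sigs:
                for g in sigs:
                    h = ~(f & g) & MASK    # add one NAND gate
                    if h not in sigs:
                        best.setdefault(h, gates)
                        nxt.add(sigs | {h})
        frontier = nxt
    return best

best = min_nand_gates()
print(best[0b0111])   # NAND itself: 1 gate
print(best[0b1000])   # AND = NAND + inverter: 2 gates
print(best[0b0110])   # XOR: the classic 4-NAND minimum
```

Because the search proceeds in order of gate count, the first circuit found for each function is provably minimal — the same completeness property the dissertation proves for BESS, but over a search space small enough to enumerate directly.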
Advisors/Committee Members: Sakallah, Karem A. (committee member), Davidson, Edward S. (committee member), Hayes, John Patrick (committee member), Lafortune, Stephane (committee member), Papaefthymiou, Marios C. (committee member).
Subjects/Keywords: Optimal Logic Synthesis; Computer Science; Engineering
APA (6th Edition):
Ernst, E. A. (2009). Optimal Combinational Multi-Level Logic Synthesis. (Doctoral Dissertation). University of Michigan. Retrieved from http://hdl.handle.net/2027.42/62373

University of Michigan
11.
Srinivasan, Vijayalakshmi.
Hardware solutions to reduce effective memory access time.
Degree: PhD, Electrical engineering, 2001, University of Michigan
URL: http://hdl.handle.net/2027.42/124069
► In this dissertation, we provide hardware solutions to increase the efficiency of the cache hierarchy, thereby reducing the effective memory access time. Specifically, we focus…
(more)
▼ In this dissertation, we provide hardware solutions to increase the efficiency of the cache hierarchy, thereby reducing the effective memory access time. Specifically, we focus on two approaches to reduce the effective memory access time, namely prefetching and novel cache designs. In the first part of this dissertation we show that traditional metrics like coverage and accuracy may be inadequate for evaluating the effectiveness of a prefetch algorithm. Our main contribution is the development of a prefetch traffic and miss taxonomy (PTMT) that provides a complete classification of all the prefetches; in particular, the PTMT classification precisely quantifies the direct and indirect effect of each prefetch on traffic and misses. We show that while most instruction prefetch algorithms do achieve a substantial reduction in misses, they fail to issue the prefetches in a timely fashion. Our branch history guided hardware prefetch algorithm (BHGP) improves the timeliness of instruction prefetches. Our results show that BHGP on average eliminates 66% of the I-cache misses for some important commercial and Windows-NT applications and some applications from the CPU2000 suite that have high I-cache misses. In addition, BHGP improves IPC by 14 to 18% for the CPU2000 applications studied. In the second part of this dissertation, we explore novel cache designs to reduce the effective L1 cache access time in the light of current technology trends. We show that the straightforward approach of adding one more level of memory hierarchy, an L0 cache between the processor and the L1 cache, does not always reduce the effective cache access time because of the high miss rate of the L0 cache and the small difference in access latency between L0 and L1. We develop a split latency cache system (SpliCS), which is an enhanced version of the traditional (L0 + L1) system and uses 2 primary caches: a small fast cache (A) and a larger, slower cache (B). 
Our experiments show that, relative to a similarly configured L1 cache alone, SpliCS achieves an 8% or 18% reduction in CPI with a cache B latency of 3 or 5 cycles, respectively. Moreover, SpliCS achieves an average 15% improvement in CPI relative to a traditional (L0 + L1) hierarchy. (Abstract shortened by UMI.)
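The effective-access-time trade-off this work targets can be seen with a simple average-memory-access-time (AMAT) calculation. All latencies and hit rates below are invented for illustration, not the dissertation's measurements.

```python
# Average memory access time (AMAT) under a serial-lookup model: each level is
# probed in turn, so a miss at one level adds the next level's latency.

def amat_one_level(lat, hit, lat_mem):
    return lat + (1 - hit) * lat_mem

def amat_two_level(lat_a, hit_a, lat_b, hit_b_local, lat_mem):
    """hit_b_local: hit rate of the larger cache among the small cache's misses."""
    return lat_a + (1 - hit_a) * (lat_b + (1 - hit_b_local) * lat_mem)

# A leaky L0 with only a small latency advantage over L1 can lose to L1 alone,
# which is the effect the abstract describes:
l1_alone = amat_one_level(3, 0.95, 20)               # ~4.0 cycles
l0_plus_l1 = amat_two_level(2, 0.60, 3, 0.875, 20)   # ~4.2 cycles
print(l1_alone, l0_plus_l1)
```

With these numbers the L0's high miss rate more than cancels its one-cycle advantage, motivating designs like SpliCS that manage which cache each block lands in rather than stacking levels blindly.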
Advisors/Committee Members: Davidson, Edward S. (advisor), Tyson, Gary S. (advisor).
Subjects/Keywords: Branch History Guided; Branch History-guided; Effective; Hardware Prefetch; Memory Access Time; Reduce; Solutions; Split Latency Cache
APA (6th Edition):
Srinivasan, V. (2001). Hardware solutions to reduce effective memory access time. (Doctoral Dissertation). University of Michigan. Retrieved from http://hdl.handle.net/2027.42/124069

University of Michigan
12.
Mangione-Smith, William Henry.
Performance bounds and buffer space requirements for concurrent processors.
Degree: PhD, Electrical engineering, 1992, University of Michigan
URL: http://hdl.handle.net/2027.42/128895
► Scientific programs are typically characterized as floating-point intensive loop-dominated tasks with large amounts of exploitable parallelism. A wide range of concurrent processor architectures have been…
(more)
▼ Scientific programs are typically characterized as floating-point intensive, loop-dominated tasks with large amounts of exploitable parallelism. A wide range of concurrent processor architectures have been proposed to capitalize on this parallelism, e.g., vector, multi-threaded, superscalar, and VLIW. Even though these architectures all attempt to exploit the same performance potential, the wide range of hardware structures has to date discouraged a unified performance evaluation technique. This thesis develops and uses application-specific upper bounds on performance to evaluate a wide range of concurrent machines. The method focuses on the bandwidth of critical machine units, abstracting away many details of the underlying architectures. Three machines are evaluated as examples: the RISC DECstation 3100, the superscalar IBM RS/6000, and the Decoupled Access-Execute Astronautics ZS-1. When compared to measured performance, performance bounds reveal the overall efficiency of execution and the effective utilization of the critical machine units. These bounds have proven useful for evaluating specific machines, comparing two or more machines, and improving the performance of application codes on these computers. Achieving high performance on machines with long latency operations demands increased buffer space, usually in the form of registers. The dependence on buffer space for high performance is highlighted by the three example machines mentioned above, each of which uses a fundamentally different structure for its buffer space. A method is presented for calculating minimal buffer space requirements for an application running at its optimum performance as machine latencies are varied. This method allows an architect to examine fundamental consequences of various buffer space and latency design choices. One such issue concerns the number of registers required in a vector machine, versus their length.
The results presented here suggest that the historic trend toward longer vector registers as latencies have increased is the wrong approach and that greater resources should be spent on increasing the number of available registers even if their length is greatly reduced.
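The bandwidth-bound idea can be sketched numerically: run time is bounded below by the time the busiest critical unit needs to issue the application's operations on it. The units, operation counts, bandwidths, and measured time below are invented for illustration.

```python
# Application-specific performance bound from unit bandwidths: run time is at
# least max over units of (operations issued on unit / unit bandwidth).

def perf_bound(op_counts, bandwidths):
    """op_counts: ops each unit must process; bandwidths: ops/cycle per unit.
    Returns (bounding unit, minimum run time in cycles)."""
    unit = max(op_counts, key=lambda u: op_counts[u] / bandwidths[u])
    return unit, op_counts[unit] / bandwidths[unit]

ops = {"fp_add": 4_000_000, "fp_mul": 3_000_000, "memory": 9_000_000}
bw = {"fp_add": 1.0, "fp_mul": 1.0, "memory": 2.0}   # ops per cycle

unit, cycles = perf_bound(ops, bw)
print(unit, cycles)           # the memory port is the bottleneck here
measured = 7_500_000          # hypothetical measured run time in cycles
print(f"efficiency vs. bound: {cycles / measured:.0%}")
```

Comparing the bound against measured cycles exposes how far delivered performance is from what the critical unit's bandwidth permits, which is exactly how such bounds guide tuning effort.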
Advisors/Committee Members: Abraham, Santosh G. (advisor), Davidson, Edward S. (advisor).
Subjects/Keywords: Bounds; Buffer; Concurrent; Performance; Processors; Requirements; Space
APA (6th Edition):
Mangione-Smith, W. H. (1992). Performance bounds and buffer space requirements for concurrent processors. (Doctoral Dissertation). University of Michigan. Retrieved from http://hdl.handle.net/2027.42/128895

University of Michigan
13.
Rivers, Jude A.
Performance aspects of high-bandwidth multi-lateral cache organizations.
Degree: PhD, Computer science, 1998, University of Michigan
URL: http://hdl.handle.net/2027.42/131086
► As the issue widths of processors continue to increase, efficient data supply will become ever more critical. Unfortunately, with processor speeds increasing faster than memory…
(more)
▼ As the issue widths of processors continue to increase, efficient data supply will become ever more critical. Unfortunately, with processor speeds increasing faster than memory speeds, supplying data efficiently will continue to be more and more difficult. Attempts to address this issue have included reducing the effective latency of memory accesses and increasing the available access bandwidth to the primary data cache. However, each of these two techniques is often proposed and evaluated in the absence of the other. This dissertation proposes and evaluates solutions for the latency and the bandwidth aspects of data supply, and a cache structure that incorporates both solutions. To solve the latency problem, we use the multi-lateral caching paradigm for active data placement and management in the L1 cache. The multi-lateral paradigm, which advocates the partitioning of the L1 cache into multiple subcaches, emerges from the fundamental belief that several classes of data elements have distinct reference or behavior characteristics that should be exploited differentially. However, some criterion must be found upon which to partition the reference stream into appropriate classes for selective caching. We demonstrate the value of temporality-based caching as an appropriate criterion with the introduction of the Non-Temporal Streaming (NTS) Cache, which performs better than larger single-structure caches. The NTS Cache utilizes data reuse information for intelligent data placement and active management of its 2-unit multi-lateral structure. To solve the bandwidth problem, we analyze the scalability of current multiporting approaches as data access parallelism increases. Currently, multibanking and multiporting through replication are the two popular and implementable approaches to providing multiple ports. However, neither approach appears to be scalable with increasing data access parallelism. 
Whereas the multibanking technique suffers performance degradation because of bank conflicts, the replication technique is limited by die area and degraded by the need to broadcast stores for coherence. Multibanking is economical in terms of die area requirements and design complexity. Analysis of the SPEC95 memory reference streams reveals that the majority of all bank conflicts are due to nearby references that map into the same cache line of the same cache bank. Our solution to the bank conflict problem is the Locality-Based Interleaved Cache (LBIC), which is built on traditional multibanking but exploits same-line spatial locality to obtain performance similar to true multiporting at far less cost. Finally, this dissertation explores the effectiveness of reducing the average data access time via active management of a cache space in conjunction with high-bandwidth techniques. Experimental results show that adding multi-lateral caching to an LBIC results in a cache structure capable of performing as well as or better than traditionally managed single-structure LBIC caches of nearly twice the size.
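The same-line observation behind the LBIC can be illustrated with a toy dual-issue model that classifies pairs of simultaneous accesses. The line size, bank count, and access pattern are invented, and the classification is only a sketch of the idea, not the LBIC design itself.

```python
# Classify pairs of simultaneous accesses under line-interleaved multibanking:
# a pair hitting the same bank is a conflict, unless both accesses fall in the
# *same line* of that bank, which an LBIC-style design can serve together.

LINE = 32      # bytes per cache line
BANKS = 4      # line-interleaved banks

def classify(pairs):
    conflicts = same_line = 0
    for a, b in pairs:
        la, lb = a // LINE, b // LINE
        if la % BANKS == lb % BANKS:     # same bank
            if la == lb:
                same_line += 1           # one bank access serves both
            else:
                conflicts += 1           # true conflict: must serialize
    return conflicts, same_line

# Nearby references (e.g. consecutive array elements) issued together:
pairs = [(i * 8, i * 8 + 8) for i in range(16)]
pairs.append((0, 4 * LINE))   # same bank, different lines: a true conflict
print(classify(pairs))
```

In this stream almost every "conflict" under plain multibanking is a same-line pair, mirroring the SPEC95 observation that motivates exploiting same-line spatial locality instead of adding ports.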
Advisors/Committee Members: Davidson, Edward S. (advisor), Tyson, Gary S. (advisor).
Subjects/Keywords: Aspects; Bandwidth; Cache; High; Lateral; Management; Multi; Organizations; Performance
APA (6th Edition):
Rivers, J. A. (1998). Performance aspects of high-bandwidth multi-lateral cache organizations. (Doctoral Dissertation). University of Michigan. Retrieved from http://hdl.handle.net/2027.42/131086

University of Michigan
14.
Tam, Edward S.
Improving cache performance via active management.
Degree: PhD, Electrical engineering, 1999, University of Michigan
URL: http://hdl.handle.net/2027.42/132017
► This dissertation analyzes a way to improve cache performance via active management of a target cache space. As microprocessor speeds continue to grow faster than…
(more)
▼ This dissertation analyzes a way to improve cache performance via active management of a target cache space. As microprocessor speeds continue to grow faster than memory subsystem speeds, minimizing the average data access time grows in importance. As current data caches are often poorly and inefficiently managed, a good management technique can improve the average data access time. Cache management involves two main processes: block allocation decisions and block replacement decisions. Active block allocation can be performed most efficiently in multilateral caches (several distinct data stores with disjoint contents placed in parallel within L1), where blocks exhibiting particular characteristics can be placed in the appropriate store. To aid in our evaluation of different active block management schemes, we have developed a multilateral cache simulator, mlcache, which provides a platform whereby different cache schemes can easily be specified, and produces evaluation statistics that can help explain their performance. Using mlcache, we have been able to evaluate the performance of proposed multilateral cache schemes and to derive new, better performing schemes. Our results show that multilateral schemes outperform traditional caches of similar size and often rival the performance of traditional caches nearly twice as large. However, the performance difference between previously-proposed implementable schemes and a multilateral configuration that uses a non-implementable near-optimal replacement policy is large. This disparity is due mainly to the simple prediction strategies presently used in the implementable schemes, along with their limited management of blocks while resident in the L1 cache structure. 
We introduce a new multilateral allocation scheme, Allocation By Conflict (ABC), which outperforms all previously proposed reuse-based multilateral configurations and performs comparably to multilateral schemes that have significantly greater hardware requirements (particularly Victim, which requires a data path between its A and B caches). The ABC scheme incurs the lowest hardware cost of any of the proposed multilateral schemes, yet it performs best and is the most easily implementable: it requires only a single additional bit per block in cache A and a very simple logic circuit for making the allocation decisions. The ABC scheme's performance advantage also scales well as the sizes of the caches are increased and as the associativity of the A cache is increased.
Advisors/Committee Members: Davidson, Edward S. (advisor), Tyson, Gary S. (advisor).
Subjects/Keywords: Active Management; Allocation By Conflict; Cache; Improving; Performance; Via
APA (6th Edition):
Tam, E. S. (1999). Improving cache performance via active management. (Doctoral Dissertation). University of Michigan. Retrieved from http://hdl.handle.net/2027.42/132017

University of Michigan
15.
Chaar, Jarir Kamel.
A methodology for developing real-time control software for efficient and dependable manufacturing systems.
Degree: PhD, Computer Science and Engineering, 1990, University of Michigan
URL: http://hdl.handle.net/2027.42/103284
► Designing efficient and dependable manufacturing systems has always been a major goal of modern computer-integrated manufacturing. The dissertation proposes a methodology for developing the control…
(more)
▼ Designing efficient and dependable manufacturing systems has always been a major goal of modern computer-integrated manufacturing. The dissertation proposes a methodology for developing the control software of these systems, and develops a set of software tools that enhance the applicability of this methodology. These tools aid in planning the set of operations of a manufacturing job, generating a cyclic schedule for processing a batch of jobs, and monitoring the operations of the system while this batch is being processed. The syntax and semantics of a component-oriented rule-based language for specifying the formal models of manufacturing systems is presented. A model captures the state of a component of the system in a set of first-order logic predicates, and it captures the semantics of the operations performed by this component in a set of rules that determine the preconditions and postconditions of an operation. These models are used in planning the sequence of operations of each class of jobs to be manufactured by these systems. To achieve efficiency, the reservation table technique is used to create optimum cyclic job-shop schedules for processing a batch of identical jobs or a mix of jobs from several classes on these systems. A reservation table is derived from the plan of a job. This table is then used to determine the theoretical maximum job initiation rate and the set of all possible initiation strategies for the batch. In some cases, this theoretical maximum rate is achieved by increasing the flow time of the job. The above technique inherently allows multiple devices to be reserved concurrently, it can deal with transport time explicitly, and it achieves higher initiation rates by including cycles that involve multiple job initiations. To achieve dependability, a plan-oriented fault detection and correction strategy is proposed. 
This strategy can automatically handle any combination of faults that may occur when monitoring the operations of manufacturing systems. A fault-tree is consulted prior to executing the scheduled operations of a plan, and the faults that affect the execution of these operations are handled subsequently. Resuming the original cyclic schedule is attempted, whenever feasible.
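The reservation-table technique referenced above descends from classic pipeline-scheduling theory: a latency between two job initiations is forbidden if it would make both jobs demand the same device at the same time step. A small sketch, with an invented three-device job:

```python
# Forbidden-latency analysis of a reservation table (device -> time steps it
# is reserved by one job). A latency d between two initiations collides if
# some device is demanded at the same time step by both jobs.

table = {"robot": {0, 4}, "mill": {1, 2}, "buffer": {3}}   # invented job

def forbidden_latencies(table):
    return {t2 - t1
            for uses in table.values()
            for t1 in uses for t2 in uses if t2 > t1}

def min_constant_latency(table):
    """Smallest constant initiation latency d whose multiples avoid every
    forbidden latency (a simple strategy; optimal cyclic schedules may mix
    latencies and achieve higher initiation rates, as the abstract notes)."""
    forb = forbidden_latencies(table)
    horizon = max(forb, default=0)
    d = 1
    while any(m in forb for m in range(d, horizon + 1, d)):
        d += 1
    return d

print(forbidden_latencies(table))   # {1, 4}
print(min_constant_latency(table))  # 3: initiate a job every 3 time steps
```

Here the busiest device is used twice per job, so no schedule can initiate faster than one job per 2 steps; the constant-latency strategy achieves one per 3, illustrating the gap that cycles with multiple job initiations can close.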
Advisors/Committee Members: Volz, Richard A. (advisor), Davidson, Edward S. (advisor).
Subjects/Keywords: Engineering, Industrial; Computer Science
APA (6th Edition):
Chaar, J. K. (1990). A methodology for developing real-time control software for efficient and dependable manufacturing systems. (Doctoral Dissertation). University of Michigan. Retrieved from http://hdl.handle.net/2027.42/103284

University of Michigan
16.
Eichenberger, Alexandre Edouard.
Modulo scheduling, machine representations, and register-sensitive algorithms.
Degree: PhD, Electrical engineering, 1996, University of Michigan
URL: http://hdl.handle.net/2027.42/129979
► High performance compilers increasingly rely on accurate modeling of the machine resources to efficiently exploit the instruction level parallelism of an application. In this dissertation,…
(more)
▼ High performance compilers increasingly rely on accurate modeling of the machine resources to efficiently exploit the instruction level parallelism of an application. In this dissertation, we first propose a reduced machine description that results in significantly faster detection of resource contentions while preserving the scheduling constraints present in the original machine description. This approach reduces a machine description in an automated, error-free, and efficient fashion. Moreover, it fully supports the elaborate scheduling techniques that are used by high-performance compilers, such as scheduling an operation earlier than others that are already scheduled, unscheduling operations due to resource conflicts, and efficient handling of periodic resource requirements found in software pipelined schedules. Reduced machine descriptions resulted in processing the queries to the resource model in 58.1% of the original time for a benchmark of 1327 loops scheduled by a state-of-the-art modulo scheduler for the Cydra 5 machine. Scheduling techniques such as Modulo Scheduling are also increasingly successful at efficiently exploiting the instruction level parallelism up to the resource limit of the machine, resulting in high performance but increased register requirements. In this dissertation, we propose an optimal register-sensitive algorithm for modulo scheduling that schedules loop operations to achieve the minimum register requirements for the smallest possible initiation interval between successive iterations. The proposed approach supports machines with finite resources and complex reservation tables, such as the Cydra 5. It also uses a well structured formulation that enables us to find schedules within a reasonable time for more loops (920 of the 1327 loops) and larger loops (up to 41 operations). 
We also propose an alternative approach, stage scheduling, which reduces the register requirements of a given modulo schedule by shifting its operations by multiples of II cycles. In addition to achieving a significant fraction of the possible decrease in register requirements (e.g., the average register requirements decrease by 22.2% for the optimal modulo scheduler, by 19.9% for the optimal stage scheduler, and by 19.8% for our best stage scheduling heuristic, compared to a register-insensitive modulo scheduler on a benchmark of 920 loops), the stage scheduling approach also preserves the steady-state performance of the initial schedules. By bounding the schedule length of the stage schedule, we may preserve the transient performance of the original as well. Thus, by coupling efficient stage scheduling heuristics with a register-insensitive high-performance modulo scheduler, we may very quickly obtain high-performance schedules with low register requirements.
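The stage-scheduling effect described above can be illustrated with a toy register-pressure model: a value live for cycles d..u occupies one register at each modulo cycle per wrapped iteration, and shifting a consumer earlier by a whole multiple of II shortens the lifetime without changing the kernel's steady-state timing. The issue times below are hypothetical, not taken from the dissertation's benchmarks.

```python
# Toy MaxLive estimate for a modulo schedule with initiation interval II.

def maxlive(lifetimes, II):
    """lifetimes: list of (def_cycle, last_use_cycle) per value.
    Each live cycle c contributes one register at modulo cycle c % II,
    because lifetimes of overlapping iterations wrap around the kernel."""
    pressure = [0] * II
    for d, u in lifetimes:
        for c in range(d, u + 1):
            pressure[c % II] += 1
    return max(pressure)

II = 2
# Value defined at cycle 0, last used at cycle 5: three live copies overlap.
print(maxlive([(0, 5)], II))   # -> 3
# Stage-shift the consumer one stage (II cycles) earlier: only two copies.
print(maxlive([(0, 3)], II))   # -> 2
```

The shift leaves every operation in the same kernel slot (same cycle modulo II), which is why steady-state performance is unaffected.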
Advisors/Committee Members: Abraham, Santosh G. (advisor), Davidson, Edward S. (advisor).
Subjects/Keywords: Algorithms; Machine; Modulo; Register; Representations; Scheduling; Sensitive
APA (6th Edition):
Eichenberger, A. E. (1996). Modulo scheduling, machine representations, and register-sensitive algorithms. (Doctoral Dissertation). University of Michigan. Retrieved from http://hdl.handle.net/2027.42/129979

University of Michigan
17.
Smelyanskiy, Mikhail.
Hardware/software mechanisms for increasing resource utilization on VLIW/EPIC processors.
Degree: C.S.E., Computer science, 2004, University of Michigan
URL: http://hdl.handle.net/2027.42/124185
VLIW/EPIC (Very Long Instruction Word/Explicitly Parallel Instruction Computing) processors are increasingly used in signal processing, embedded, and general-purpose applications. To achieve efficient instruction schedules that meet the high performance demands of these applications, these processors rely on an optimizing compiler that uses aggressive optimizations, such as predication and software pipelining, to expose and exploit instruction level parallelism. Capitalizing fully on the parallelism offered by these optimizations requires increasing critical processor resources, such as function units, register and memory ports, and architected registers, which is costly in terms of cycle time, power, and area. To this end, this dissertation proposes three novel schemes for achieving higher processor performance through more efficient utilization of the existing processor resources in the context of predication and software pipelining. We developed deterministic predicate-aware scheduling (DPAS), which can combine operations with mutually exclusive predicates to share the same resource in the same cycle. To support DPAS, the processor pipeline is adapted to read predicates early and discard the operations guarded under False predicates. Mutual exclusivity guarantees that runtime conflicts will never occur. The overall effect of DPAS is to use the limited existing resources more efficiently, thereby increasing the performance of the applications studied by an average of 10% when resource constraints are a bottleneck. To increase resource utilization further, we developed a powerful generalization of DPAS, called probabilistic predicate-aware scheduling (PPAS), which can assign arbitrary predicated operations to share the same resource in the same cycle. Unlike DPAS, PPAS can result in runtime conflicts, as it allows more than one predicate of a set of combined operations to be True in the same runtime cycle.
Assignment is performed in a probabilistic manner using a combination of predicate profile information and predicate analysis aimed at maximizing the benefits of sharing in view of the expected degree of conflict. The processor pipeline is further modified to detect and recover from such conflicts. By allowing more flexibility in resource sharing than DPAS, PPAS achieved an average 19% performance gain for the resource-constrained instruction schedules. Finally, to effectively deal with the architected register pressure and code size problems in software-pipelined loops, we have developed a hardware/software mechanism called Register Queues. By decoupling an existing register space into a small set of architected registers and a large set of physical registers, register queues enable efficient software-pipeline schedules for high operation latencies with almost no increase in either architected registers or code size.
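The safety condition behind deterministic predicate-aware sharing can be sketched with a toy predicate model: two predicated operations may occupy one resource slot only if their guards can never both be True. Predicates here are modeled as conjunctions of (condition, polarity) literals; this is an illustrative stand-in for the dissertation's predicate analysis, not its actual representation.

```python
# Toy mutual-exclusion test for deterministic predicate-aware scheduling.
# A predicate like "if p" is {("p", True)}; "if !p" is {("p", False)}.

def mutually_exclusive(pred_a, pred_b):
    """True if some condition appears with opposite polarity in the two
    guards, so the predicates cannot both be True in the same cycle."""
    polarity_a = dict(pred_a)
    return any(var in polarity_a and polarity_a[var] != pol
               for var, pol in pred_b)

def can_share_slot(op_a, op_b):
    # op = (name, guard predicate); DPAS shares only under mutual exclusion.
    return mutually_exclusive(op_a[1], op_b[1])

add_t = ("add", frozenset({("p", True)}))    # executes when p is True
sub_f = ("sub", frozenset({("p", False)}))   # executes when p is False
mul_q = ("mul", frozenset({("q", True)}))    # unrelated condition q

print(can_share_slot(add_t, sub_f))  # -> True  (p vs !p: provably disjoint)
print(can_share_slot(add_t, mul_q))  # -> False (p and q may both be True)
```

PPAS relaxes exactly this check: it would admit the second pairing when profile data says p and q are rarely True together, accepting occasional runtime conflicts in exchange for denser schedules.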
Advisors/Committee Members: Davidson, Edward S. (advisor), Mahlke, Scott A. (advisor).
Subjects/Keywords: Hardware/software; Increasing; Mechanisms; Processors; Resource Utilization; Vliw/epic
APA (6th Edition):
Smelyanskiy, M. (2004). Hardware/software mechanisms for increasing resource utilization on VLIW/EPIC processors. (Thesis). University of Michigan. Retrieved from http://hdl.handle.net/2027.42/124185

University of Michigan
18.
Chappell, Robert Sommer.
Simultaneous subordinate microthreading.
Degree: PhD, Electrical engineering, 2004, University of Michigan
URL: http://hdl.handle.net/2027.42/124027
Tomorrow's ultra-wide microprocessors will be unable to supply enough work from single-threaded programs to take advantage of all available execution resources. Instruction processing bottlenecks, such as branch mispredictions and data cache misses, will combine to limit performance to a fraction of the potential. Simultaneous Subordinate Microthreading (SSMT) is a new paradigm that takes advantage of spare execution resources to improve single-threaded performance. Additional useful work is supplied in the form of microthreads, small routines of instructions that execute simultaneously with the running program, called the primary thread. Microthreads improve the performance of the primary thread by interacting with the microarchitecture to reduce instruction processing bottlenecks. This dissertation presents the SSMT paradigm and demonstrates two applications of SSMT that improve performance by correcting branch mispredictions. The first application, Microthread Hybrid Branch Prediction, uses microthreads as the specialized components of a hybrid branch predictor. The resulting microthread hybrid improves accuracy while bypassing the limitations of a hardware hybrid implementation. The second application of SSMT, Microthread Partial Branch Precomputation, dynamically constructs microthreads to speculatively precompute branch outcomes along frequently mispredicted paths. These microthreads achieve nearly perfect prediction accuracy and also provide some prefetching benefit. Using either of these two applications of SSMT, the performance of the primary thread can be substantially increased. This dissertation also describes the software and hardware changes necessary to support SSMT on a futuristic microprocessor. A set of microthread constraints is specified to simplify the requirements. Using these microthread constraints, an implementation can be achieved that limits additional complexity and has virtually no impact on the primary pipeline.
A sensitivity analysis demonstrates that reasonable design points are possible for support of SSMT. Finally, this dissertation examines two inherent problems with the successful use of SSMT: microthread overhead and microthread latency. A number of techniques for limiting overhead and tolerating latency are discussed, including the use of special-purpose microinstructions and dynamic feedback. When applied to Microthread Hybrid Branch Prediction and Microthread Partial Branch Precomputation, these techniques for reducing overhead and latency can sometimes make the difference between performance loss and performance gain.
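The "spare execution resources" premise of SSMT can be made concrete with a toy per-cycle slot model: the primary thread issues first, and microthread instructions fill only whatever issue slots remain, so microthread work never steals primary-thread bandwidth. The width and per-cycle demands below are hypothetical, not drawn from the dissertation's machine model.

```python
# Toy per-cycle issue-slot allocation under the SSMT premise.

ISSUE_WIDTH = 8  # hypothetical ultra-wide issue width

def issue(primary_demand, microthread_pending):
    """Allocate one cycle's issue slots: primary thread has priority,
    microthreads consume only the leftover slots."""
    prim = min(primary_demand, ISSUE_WIDTH)
    micro = min(microthread_pending, ISSUE_WIDTH - prim)
    return prim, micro

# A stalled cycle (e.g. after a misprediction) leaves slots for microthreads.
print(issue(2, 10))  # -> (2, 6)
# A full cycle leaves none; the microthread makes no progress.
print(issue(8, 10))  # -> (8, 0)
```

The dissertation's overhead and latency problems fall out of this model directly: a microthread only helps if its slots are genuinely spare and its result arrives before the primary thread needs it.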
Advisors/Committee Members: Patt, Yale N. (advisor), Davidson, Edward S. (advisor).
Subjects/Keywords: Microthread; Multithreading; Simultaneous Subordinate Microthreading
APA (6th Edition):
Chappell, R. S. (2004). Simultaneous subordinate microthreading. (Doctoral Dissertation). University of Michigan. Retrieved from http://hdl.handle.net/2027.42/124027

University of Michigan
19.
Racunas, Paul Brian.
Reducing load latency through memory instruction characterization.
Degree: PhD, Electrical engineering, 2003, University of Michigan
URL: http://hdl.handle.net/2027.42/123938
Processor performance is directly impacted by the latency of the memory system. As processor core cycle times decrease, the disparity between the latency of an arithmetic instruction and the average latency of a load instruction will continue to increase. A wide-issue superscalar machine requires a memory system highly optimized for latency. This dissertation analyzes the patterns of data sharing between memory instructions and the address calculation chains leading up to each load instruction. The analysis of memory instruction data sharing patterns shows that the dynamic address stream can be broken into several independent streams. This observation is used to segment the first level of the memory hierarchy, including the memory disambiguation logic, into several independent partitions. A partitioned cache with eight partitions can be accessed in half the time of an equivalently sized unpartitioned cache. An aggressive processor implementing a partitioned first-level cache outperformed the same processor implementing an equivalently sized conventional cache by 4.5% on the SPECint00 benchmark suite. The analysis of address calculation chains demonstrates that a relatively small number of unique functions are used in the calculation of memory data addresses within an application. A method of dynamically identifying these functions and reproducing them in hardware is developed. This technique allows the results of complex address calculations to be generated independently of the program instruction stream and without executing the instructions involved in the calculation. A processor utilizing this scheme outperformed a processor implementing conventional address prediction by 5.5% on the SPECint00 benchmark suite.
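The partitioned-cache idea can be sketched with a toy partition-selection function: each address maps to one of several small, independent partitions, each of which can be faster than one large cache. The dissertation derives its streams from instruction data-sharing patterns; the simple line-address hash below merely illustrates the mechanism, with hypothetical parameters (8 partitions, 64-byte lines).

```python
# Toy partition selection for a segmented first-level cache.

NUM_PARTITIONS = 8   # hypothetical, matching the eight-partition example
LINE_BYTES = 64      # hypothetical cache line size

def partition_of(addr):
    """Pick a partition from the line address, so every byte of one
    cache line lands in the same partition."""
    line = addr // LINE_BYTES
    return line % NUM_PARTITIONS

# Two addresses on the same line reach the same partition...
print(partition_of(0x1000), partition_of(0x103F))  # -> 0 0
# ...while consecutive lines spread evenly across the eight partitions.
print([partition_of(line * LINE_BYTES) for line in range(8)])
```

Because each load consults only its own partition, the memory disambiguation logic in each partition also needs to compare against a fraction of the in-flight memory instructions.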
Advisors/Committee Members: Patt, Yale N. (advisor), Davidson, Edward S. (advisor).
Subjects/Keywords: Characterization; Load Latency; Memory Instruction; Partitioning; Reducing; Superscalar
APA (6th Edition):
Racunas, P. B. (2003). Reducing load latency through memory instruction characterization. (Doctoral Dissertation). University of Michigan. Retrieved from http://hdl.handle.net/2027.42/123938

University of Michigan
20.
Tomko, Karen Arnold.
Domain decomposition, irregular applications, and parallel computers.
Degree: PhD, Computer Science and Engineering, 1995, University of Michigan
URL: http://hdl.handle.net/2027.42/104873
Many large-scale computational problems are based on irregular (unstructured) domains. Some examples are finite element methods in structural analysis, finite volume methods in fluid dynamics, and circuit simulation for VLSI design. Domain decomposition is a common technique for distributing the data and work of irregular scientific applications across a distributed memory parallel machine. To obtain efficiency, subdomains must be constructed such that the work is divided with a reasonable balance among the processors while the communication-causing subdomain boundary is kept small. Application- and machine-specific information can be used in conjunction with domain decomposition to achieve a level of performance not possible with traditional domain decomposition methods. Application profiling characterizes the performance of an application on a specific machine. We present a method that uses curve-fitting of application profile data to calculate vertex and edge weights for use with weighted graph decomposition algorithms. We demonstrate its potential on two routines from a production finite element application running on the IBM SP2. Our method combined with a multilevel spectral algorithm reduced load imbalance from 52% to less than 10% for one routine in our study. Many irregular applications have several phases that must be load balanced individually to achieve high overall application performance. We propose finding one decomposition that can be used effectively for each phase of the application, and introduce a decomposition algorithm that load balances according to two vertex weight sets for use on two-phase applications. We show that this dual-weight algorithm can be as successful at load balancing two individual routines together as the traditional single-weight algorithm is at load balancing each routine independently.
Domain decomposition algorithms take a simplistic view of multiprocessor communication. Higher performance can be achieved by considering the communication characteristics of the target multiprocessor in conjunction with decomposition techniques. We provide a methodology for tuning an application for a shared-address space multiprocessor by using intelligent layout of the application data to reduce coherence traffic and employing latency hiding mechanisms to overlap communication with useful work. These techniques have been applied to a finite element radar application running on the Kendall Square KSR1.
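The dual-weight balancing problem above can be sketched with a naive greedy assignment: each vertex carries one weight per phase, and a single decomposition must keep both phase loads balanced at once. This toy heuristic is purely illustrative, nothing like the weighted graph decomposition algorithms (e.g. multilevel spectral) used in the dissertation, and it ignores the edge weights that model communication.

```python
# Toy greedy dual-weight partitioner: each vertex is (w_phase1, w_phase2).

def dual_weight_partition(vertices, nprocs):
    """Returns (per-vertex processor ids, per-processor [load1, load2])."""
    loads = [[0.0, 0.0] for _ in range(nprocs)]
    assign = [None] * len(vertices)
    # Heaviest vertices first, so large items do not unbalance a full bin.
    order = sorted(range(len(vertices)), key=lambda i: -sum(vertices[i]))
    for i in order:
        w1, w2 = vertices[i]
        # Place each vertex where the worse of its two resulting per-phase
        # loads stays smallest, keeping both phases balanced at once.
        p = min(range(nprocs), key=lambda q: max(loads[q][0] + w1,
                                                 loads[q][1] + w2))
        loads[p][0] += w1
        loads[p][1] += w2
        assign[i] = p
    return assign, loads

# Vertices that are heavy in one phase but light in the other.
assign, loads = dual_weight_partition([(4, 1), (1, 4), (4, 1), (1, 4)], 2)
print(loads)  # each processor ends with equal work in *both* phases
```

Balancing only the combined weight would let one processor hoard the phase-1-heavy vertices; minimizing the per-phase maximum is what keeps each phase individually balanced.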
Advisors/Committee Members: Abraham, Santosh G. (advisor), Davidson, Edward S. (advisor).
Subjects/Keywords: Computer Science
APA (6th Edition):
Tomko, K. A. (1995). Domain decomposition, irregular applications, and parallel computers. (Doctoral Dissertation). University of Michigan. Retrieved from http://hdl.handle.net/2027.42/104873