You searched for +publisher:"University of Texas – Austin" +contributor:("Burger, Douglas C.").
Showing records 1 – 16 of 16 total matches. No search limiters apply to these results.
1.
Li, Dong, active 21st century.
Orchestrating thread scheduling and cache management to improve memory system throughput in throughput processors.
Degree: PhD, Computer Science, 2014, University of Texas – Austin
URL: http://hdl.handle.net/2152/25098
Throughput processors such as GPUs continue to provide higher peak arithmetic capability. Designing a high-throughput memory system to keep the computational units busy is very challenging. Future throughput processors must continue to exploit data locality and utilize the on-chip and off-chip resources in the memory system more effectively to further improve memory system throughput. This dissertation advocates orchestrating the thread scheduler with the cache management algorithms to alleviate GPU cache thrashing and pollution, avoid bandwidth saturation, and maximize GPU memory system throughput. Based on this principle, this thesis work proposes three mechanisms to improve cache efficiency and memory throughput.

This thesis work enhances the thread throttling mechanism with the Priority-based Cache Allocation mechanism (PCAL). By estimating the cache miss ratio with a variable number of cache-feeding threads and monitoring the usage of key memory system resources, PCAL determines the number of threads that share the cache and the minimum number of cache-bypassing threads that saturate memory system resources. This approach reduces cache thrashing and effectively employs chip resources that would otherwise go unused by a pure thread-throttling approach. We observe a 67% improvement over the original as-is benchmarks and an 18% improvement over a better-tuned warp-throttling baseline.

This work proposes the AgeLRU and Dynamic-AgeLRU mechanisms to address the inter-thread cache thrashing problem. AgeLRU prioritizes cache blocks at replacement based on the scheduling priority of their fetching warp. Dynamic-AgeLRU selects adaptively between the AgeLRU and LRU algorithms to avoid degrading the performance of non-thrashing applications. There are three variants of the AgeLRU algorithm: (1) replacement-only, (2) bypassing, and (3) bypassing with traffic optimization. Compared to the LRU algorithm, these three variants improve performance by 4%, 8%, and 28%, respectively, across a set of cache-sensitive benchmarks.

This thesis work develops the Reuse-Prediction-based cache Replacement scheme (RPR) for the GPU L1 data cache to address the intra-thread cache pollution problem. By combining the GPU thread scheduling priority with the fetching Program Counter (PC) to generate a signature that indexes the prediction table, RPR identifies and prioritizes the near-reuse blocks and high-reuse blocks to maximize cache efficiency. Compared to the AgeLRU algorithm, the experimental results show that RPR yields a throughput improvement of 5% on average for regular applications, and a speedup of 3.2% on average across a set of cache-sensitive benchmarks.

The techniques proposed in this dissertation effectively alleviate the cache thrashing, cache pollution, and resource saturation problems. We believe when these techniques are combined, they will synergistically further improve GPU cache efficiency and the…
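The abstract does not give AgeLRU's exact hardware mechanism, but its core idea (ranking replacement victims by the scheduling priority of the warp that fetched each block) can be sketched in a few lines. The toy set-associative model below is an illustrative reconstruction, not the dissertation's implementation:

```python
class Block:
    def __init__(self, tag, warp_age):
        self.tag = tag                # block address tag
        self.warp_age = warp_age      # scheduling age of the fetching warp (0 = oldest, highest priority)

class AgeLRUSet:
    """One cache set. On a miss, evict the block fetched by the youngest
    (lowest-priority) warp, breaking ties in LRU order."""
    def __init__(self, ways):
        self.ways = ways
        self.blocks = []              # index 0 = LRU end, index -1 = MRU end

    def access(self, tag, warp_age):
        for blk in self.blocks:
            if blk.tag == tag:        # hit: move to MRU position
                self.blocks.remove(blk)
                self.blocks.append(blk)
                return True
        if len(self.blocks) >= self.ways:   # miss with full set: AgeLRU victim choice
            victim = max(self.blocks, key=lambda b: b.warp_age)  # first max = LRU-most tie-break
            self.blocks.remove(victim)
        self.blocks.append(Block(tag, warp_age))
        return False

s = AgeLRUSet(ways=2)
s.access(0xA, warp_age=0); s.access(0xB, warp_age=3)
s.access(0xC, warp_age=1)            # evicts 0xB (youngest warp), not LRU block 0xA
print([b.tag for b in s.blocks])     # [10, 12]
```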
Advisors/Committee Members: Fussell, Donald S., 1951- (advisor), Burger, Douglas C., Ph. D. (advisor).
Subjects/Keywords: Throughput processors; GPU; Architecture
2.
Nagarajan, Ramadass, 1977-.
Design and evaluation of a technology-scalable architecture for instruction-level parallelism.
Degree: PhD, Computer Sciences, 2007, University of Texas – Austin
URL: http://hdl.handle.net/2152/3534
Future performance improvements must come from the exploitation of concurrency at all levels. Recent approaches that focus on thread-level and data-level concurrency are a natural fit for certain application domains, but it is unclear whether they can be adapted efficiently to eliminate serial bottlenecks. Conventional superscalar hardware that instead focuses on instruction-level parallelism (ILP) is limited by power inefficiency, on-chip wire latency, and design complexity. Ultimately, poor single-thread performance and Amdahl's law will inhibit overall performance growth even on parallel workloads. To address this problem, we undertook the challenge of designing a scalable, wide-issue, large-window processor that mitigates complexity, reduces power overheads, and exploits ILP to improve single-thread performance at future wire-delay-dominated technologies.

This dissertation describes the design and evaluation of the TRIPS architecture for exploiting ILP. The TRIPS architecture belongs to a new class of instruction set architectures called Explicit Data Graph Execution (EDGE) architectures, which use large dataflow graphs of computation and explicit producer-consumer communication to express concurrency to the hardware. We describe how these architectures match the characteristics of future sub-45 nm CMOS technologies to mitigate complexity and improve concurrency at reduced overheads. We describe the architectural and microarchitectural principles of the TRIPS architecture, which exploits ILP by issuing instructions widely, in dynamic dataflow fashion, from a large distributed window of instructions.

We then describe our specific contributions to the development of the TRIPS prototype chip, which was implemented in a 130 nm ASIC technology and consists of more than 170 million transistors. In particular, we describe the implementation of the distributed control protocols that offer various services for executing a single program in the hardware. Finally, we describe a detailed evaluation of the TRIPS architecture and identify the key determinants of its performance. In particular, we describe the development of the infrastructure required for a detailed analysis, including a validated performance model, a highly optimized suite of benchmarks, and critical path models that identify various architectural and microarchitectural bottlenecks at a fine level of granularity.

On a set of highly optimized benchmark kernels, the manufactured TRIPS parts outperform a conventional superscalar processor by a factor of 3× on average. We find that the automatically compiled versions of the same kernels are yet to reap the benefits of the high-ILP TRIPS core, but exceed the performance of the superscalar processor in many cases. Our results indicate that the overhead of the various control protocols that manage overall execution in the processor has only a modest effect on performance. However, operand communication between various components in the distributed microarchitecture…
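For readers unfamiliar with EDGE execution, the toy interpreter below illustrates the block-dataflow idea the abstract describes: instructions name their consumers explicitly and fire as soon as all operands arrive, with no centralized register renaming. This is a conceptual sketch, not the TRIPS ISA; the opcodes and encoding are invented for illustration.

```python
from collections import deque

def run_block(instrs, inputs):
    """instrs: id -> (opcode, operand_count, [consumer ids]);
    inputs: id -> list of initial operand values (block inputs)."""
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b,
           "neg": lambda a: -a}
    operands = {i: list(vals) for i, vals in inputs.items()}
    ready = deque(i for i, (op, n, _) in instrs.items()
                  if len(operands.get(i, [])) == n)
    results = {}
    while ready:
        i = ready.popleft()
        op, n, consumers = instrs[i]
        results[i] = ops[op](*operands[i])
        for c in consumers:                       # explicit producer->consumer edges
            operands.setdefault(c, []).append(results[i])
            if len(operands[c]) == instrs[c][1]:  # consumer has all operands: fire
                ready.append(c)
    return results

# (2 + 3) * (-4), driven purely by operand arrival; a real EDGE ISA also
# encodes which operand slot each edge targets.
block = {0: ("add", 2, [2]), 1: ("neg", 1, [2]), 2: ("mul", 2, [])}
print(run_block(block, {0: [2, 3], 1: [4]}))      # {0: 5, 1: -4, 2: -20}
```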
Advisors/Committee Members: Burger, Douglas C., Ph. D. (advisor).
Subjects/Keywords: Computer architecture – Design; Computer architecture – Evaluation; High performance processors – Design and construction; High performance processors – Evaluation; Parallel processing (Electronic computers); Threads (Computer programs)
3.
St Amant, Renee Marie.
Enabling high-performance, mixed-signal approximate computing.
Degree: PhD, Computer Science, 2014, University of Texas – Austin
URL: http://hdl.handle.net/2152/25025
For decades, the semiconductor industry enjoyed exponential improvements in microprocessor power and performance with the device scaling of successive technology generations. Scaling limitations at sub-micron technologies, however, have ceased to provide these historical performance improvements within a limited power budget. While device scaling provides a larger number of transistors per chip, for the same chip area, a growing percentage of the chip will have to be powered off at any given time due to power constraints. As such, the architecture community has focused on energy-efficient designs and is looking to specialized hardware to provide gains in performance. A focus on energy efficiency, along with increasingly less reliable transistors due to device scaling, has led to research in the area of approximate computing, where accuracy is traded for energy efficiency when precise computation is not required. There is a growing body of approximation-tolerant applications that, for example, compute on noisy or incomplete data, such as real-world sensor inputs, or make approximations to decrease the computational load in the analysis of cumbersome data sets. These approximation-tolerant applications span application domains such as machine learning, image processing, robotics, and financial analysis, among others.

Since the advent of the modern processor, computing models have largely presumed the attribute of accuracy. A willingness to relax accuracy requirements, however, with the goal of gaining energy efficiency, warrants the re-investigation of the potential of analog computing. Analog hardware offers the opportunity for fast and low-power computation; however, it presents challenges in the form of accuracy. Where analog compute blocks have been applied to solve fixed-function problems, general-purpose computing has relied on digital hardware implementations that provide generality and programmability. The work presented in this thesis aims to answer the following questions: Can analog circuits be successfully integrated into general-purpose computing to provide performance and energy savings? And what is required to address the historical analog challenges of inaccuracy, programmability, and a lack of generality to enable such an approach?

This thesis work investigates a neural approach as a means to address the historical analog challenges of inaccuracy, programmability, and generality, and to enable the use of analog circuits in general-purpose, high-performance computing. The first piece of this thesis work investigates the use of analog circuits at the microarchitecture level in the form of an analog neural branch predictor. The task of branch prediction can tolerate imprecision, as roll-back mechanisms correct for branch mispredictions, and application-level accuracy remains unaffected. We show that analog circuits enable the implementation of a highly accurate neural-prediction algorithm that is infeasible to implement in the digital domain. The second piece of this thesis work presents a neural…
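The neural prediction algorithm referenced here is, in its digital formulation, a perceptron-style branch predictor; the dissertation's contribution is evaluating its dot product in analog circuitry. For reference, a standard digital perceptron predictor looks roughly like the sketch below (table size, history length, and training threshold are illustrative, not the dissertation's parameters):

```python
class PerceptronPredictor:
    def __init__(self, n_entries=256, hist_len=16, theta=None):
        self.hist_len = hist_len
        self.theta = theta if theta is not None else int(1.93 * hist_len + 14)
        self.weights = [[0] * (hist_len + 1) for _ in range(n_entries)]
        self.history = [1] * hist_len              # +1 = taken, -1 = not taken

    def predict(self, pc):
        w = self.weights[pc % len(self.weights)]
        y = w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))
        return y, y >= 0                           # raw output, predicted direction

    def update(self, pc, taken):
        y, pred = self.predict(pc)
        t = 1 if taken else -1
        w = self.weights[pc % len(self.weights)]
        if pred != taken or abs(y) <= self.theta:  # train on mispredict or low confidence
            w[0] += t
            for j, h in enumerate(self.history):
                w[j + 1] += t * h
        self.history = self.history[1:] + [t]      # shift in the actual outcome
```

The dot product in predict() is the operation an analog implementation computes as a sum of currents, which is what makes large history lengths feasible at low latency and energy.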
Advisors/Committee Members: Lin, Yun Calvin (advisor), Burger, Douglas C., Ph. D. (advisor).
Subjects/Keywords: Approximate computing; Neural branch prediction; Neural accelerator; General purpose computing
4.
Robatmili, Behnam.
Efficient execution of sequential applications on multicore systems.
Degree: PhD, Computer Science, 2011, University of Texas – Austin
URL: http://hdl.handle.net/2152/ETD-UT-2011-08-3987
Conventional CMOS scaling has been the engine of the technology revolution in most application domains. This trend has changed: in each technology generation, transistor densities continue to increase, while, due to the limits on threshold voltage scaling, per-transistor energy consumption decreases much more slowly than in the past. These power scaling issues will restrict the adaptability of designs to operate in different power and performance regimes. Consequently, future systems must employ more efficient architectures for optimizing every thread in the program across different power and performance regimes, rather than architectures that simply utilize more transistors. One solution is composable or dynamic multicore architectures that can span a wide range of energy/performance operating points by enabling multiple simple cores to compose to form a larger and more powerful core.

Explicit Data Graph Execution (EDGE) architectures represent a highly scalable class of composable processors that exploit predicated dataflow block execution and distributed microarchitectures. However, prior EDGE architectures suffer from several energy and performance bottlenecks, including expensive intra-block operand communication due to fine-grain instruction distribution among cores, the compiler-generated fanout trees built for high-fanout operand delivery, poor next-block prediction accuracy, and low speculation rates due to predicates and expensive refills after pipeline flushes. To design an energy-efficient and flexible dynamic multicore, this dissertation employs a systematic methodology that detects inefficiencies and then designs and evaluates solutions that maximize power and performance efficiency across different power and performance regimes. The innovations and optimization techniques include:
(a) Deep Block Mapping, which extracts more coarse-grained parallelism and reduces cross-core operand network traffic by mapping each block of instructions into the instruction queue of one core instead of distributing blocks across all composed cores as done in previous EDGE designs;
(b) the Iterative Path Predictor (IPP), which reduces branch and predication overheads by unifying multi-exit block target prediction and predicate path prediction while providing improved accuracy for each;
(c) Register Bypassing, which reduces cross-core register communication delays by bypassing register values predicted to be critical directly from producing to consuming cores;
(d) Block Reissue, which reduces pipeline flush penalties by reissuing instructions in previously executed instances of blocks while they are still in the instruction queue; and
(e) Exposed Operand Broadcasts (EOBs), which reduce wide-fanout instruction overheads by extending the ISA to employ architecturally exposed low-overhead broadcasts combined with dataflow for efficient operand delivery for both high- and low-fanout instructions.
These components form the basis for a third-generation EDGE microarchitecture called T3. T3 improves energy efficiency by about 2× and performance by 47% compared…
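Deep Block Mapping is the most self-contained of these ideas: keeping a block's instructions on one core turns intra-block operand traffic into core-local traffic. The sketch below contrasts it with fine-grained striping; both mapping functions are simplified assumptions for illustration, not the hardware's actual placement logic.

```python
def striped_mapping(block_id, inst_id, n_cores):
    """Earlier EDGE designs: one block's instructions spread across all
    composed cores, so most producer->consumer operands cross the network."""
    return inst_id % n_cores

def deep_block_mapping(block_id, inst_id, n_cores):
    """Deep Block Mapping: the whole block lands in one core's instruction
    queue, so intra-block operands stay core-local; successive blocks
    rotate across cores for coarse-grained parallelism."""
    return block_id % n_cores

def cross_core_operands(mapping, edges, block_id, n_cores):
    """Count producer->consumer edges that traverse the operand network."""
    return sum(mapping(block_id, p, n_cores) != mapping(block_id, c, n_cores)
               for p, c in edges)

edges = [(0, 1), (1, 2), (2, 3)]   # a small intra-block dependence chain
print(cross_core_operands(striped_mapping, edges, 0, 4))     # 3 network crossings
print(cross_core_operands(deep_block_mapping, edges, 0, 4))  # 0 network crossings
```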
Advisors/Committee Members: McKinley, Kathryn S. (advisor), Burger, Douglas C., Ph. D. (advisor), Keckler, Stephen W. (committee member), Lin, Calvin (committee member), Reinhardt, Steve (committee member).
Subjects/Keywords: Microarchitecture; EDGE; Multicore; Single-thread performance; Dataflow; Block-atomic execution; Power efficiency; Composable cores
5.
Grot, Boris.
Network-on-chip architectures for scalability and service guarantees.
Degree: PhD, Computer Sciences, 2011, University of Texas – Austin
URL: http://hdl.handle.net/2152/ETD-UT-2011-08-3960
Rapidly increasing transistor densities have led to the emergence of richly-integrated substrates in the form of chip multiprocessors and systems-on-a-chip. These devices integrate a variety of discrete resources, such as processing cores and cache memories, on a single die with the degree of integration growing in accordance with Moore's law. In this dissertation, we address challenges of scalability and quality-of-service (QOS) in network architectures of highly-integrated chips. The proposed techniques address the principal sources of inefficiency in networks-on-chip (NOCs) in the form of performance, area, and energy overheads. We also present a comprehensive network architecture capable of interconnecting over a thousand discrete resources with high efficiency and strong guarantees.
We first show that mesh networks, commonly employed in existing chips, fall significantly short of achieving their performance potential due to transient congestion effects that diminish network performance. Adaptive routing has the potential to improve performance through better load distribution. However, we find that existing approaches are myopic in that they only consider local congestion indicators and fail to take global network state into account. Our approach, called Regional Congestion Awareness (RCA), improves network visibility in adaptive routers via a light-weight mechanism for propagating and integrating congestion information. By leveraging both local and non-local congestion indicators, RCA improves network load balance and boosts throughput. Under a set of parallel workloads running on a 49-node substrate, RCA reduces on-chip network latency by 16%, on average, compared to a locally-adaptive router.
Next, we target NOC latency and energy efficiency through a novel point-to-multipoint topology. Ring and mesh networks, favored in existing on-chip interconnects, often require packets to go through a number of intermediate routers between source and destination nodes, resulting in significant latency and energy overheads. Topologies that improve connectivity, such as fat tree and flattened butterfly, eliminate much of the router overhead, but require non-minimal channel lengths or large channel count, reducing energy-efficiency and/or performance as a result. We propose a new topology, called Multidrop Express Channels (MECS), that augments minimally-routed express channels with multi-drop capability. The resulting richly-connected NOC enjoys a low hop count with favorable delay and energy characteristics, while improving wire utilization over prior proposals.
Applications such as virtualized servers-on-a-chip and real-time systems require chip-level quality-of-service (QOS) support to provide fairness, service differentiation, and guarantees. Existing network QOS approaches suffer from considerable performance and area overheads that limit their usefulness in a resource-limited on-die network. In this dissertation, we propose a new QOS scheme called Preemptive Virtual Clock (PVC). PVC uses a preemptive…
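The essence of RCA is a cost function that mixes local and propagated congestion when picking among admissible output ports. A minimal sketch follows, with an assumed 50/50 weighting and invented field names; the real mechanism aggregates and propagates congestion estimates hop by hop:

```python
def select_port(candidates, local_occupancy, regional_estimate, alpha=0.5):
    """Pick the admissible output port with the lowest combined congestion.
    alpha = 1.0 degenerates to a purely locally-adaptive router."""
    def cost(port):
        return alpha * local_occupancy[port] + (1 - alpha) * regional_estimate[port]
    return min(candidates, key=cost)

# Local state alone would pick 'east'; congestion propagated from the east
# region steers the packet north instead.
local = {"east": 2, "north": 3}
regional = {"east": 9, "north": 1}
print(select_port(["east", "north"], local, regional))    # north
```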
Advisors/Committee Members: Keckler, Stephen W. (advisor), Burger, Douglas C. (committee member), Mutlu, Onur (committee member), Witchel, Emmett (committee member), Zhang, Yin (committee member).
Subjects/Keywords: Network-on-chip; NOC; Interconnection network; Quality-of-service; QOS; Topology; Routing; Flow control
6.
Smith, Aaron Lee, 1977-.
Explicit data graph compilation.
Degree: PhD, Computer Sciences, 2009, University of Texas – Austin
URL: http://hdl.handle.net/2152/ETD-UT-2009-12-626
Technology trends such as growing wire delays, power consumption limits, and diminishing clock rate improvements present conventional instruction set architectures such as RISC, CISC, and VLIW with difficult challenges. To show continued performance growth, future microprocessors must exploit concurrency power-efficiently. An important question for any future system is the division of responsibilities between programmer, compiler, and hardware to discover and exploit concurrency.
In this research we develop the first compiler for an Explicit Data Graph Execution (EDGE) architecture and show how to solve the new challenge of compiling to a block-based architecture. In EDGE architectures, the compiler is responsible for partitioning the program into a sequence of structured blocks that logically execute atomically. The EDGE ISA defines the structure of, and the restrictions on, these blocks. The TRIPS prototype processor is an EDGE architecture that employs four restrictions on blocks intended to strike a balance between software and hardware complexity. They are: (1) fixed block sizes (maximum of 128 instructions), (2) restricted number of loads and stores (no more than 32 may issue per block), (3) restricted register accesses (no more than eight reads and eight writes to each of four banks per block), and (4) constant number of block outputs (each block must always generate a constant number of register writes and stores, plus exactly one branch).
The challenges addressed in this thesis are twofold. First, we develop the algorithms and internal representations necessary to support the new structural constraints imposed by the block-based EDGE execution model. This first step provides correct execution and demonstrates the feasibility of EDGE compilers.
Next, we show how to optimize blocks using a dataflow predication model and provide results showing how the compiler is meeting this challenge on the SPEC2000 benchmarks. Using basic blocks as the baseline performance, we show that optimizations utilizing the dataflow predication model achieve up to 64% speedup on SPEC2000 with an average speedup of 31%.
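The four block restrictions quoted above are concrete enough to express as a static legality check, which is roughly what such a compiler must enforce when forming blocks. The instruction representation below is hypothetical; the limits come directly from the abstract. Note that the real constraint (4) applies to every dynamic execution path through a predicated block, which a static count only approximates:

```python
def block_is_legal(instrs, declared_outputs):
    """instrs: list of dicts like {'op': 'load'} or {'op': 'read', 'bank': 2}.
    declared_outputs: the constant number of register writes plus stores
    the block is declared to produce."""
    if len(instrs) > 128:                                  # (1) fixed max block size
        return False
    if sum(i["op"] in ("load", "store") for i in instrs) > 32:
        return False                                       # (2) load/store issue limit
    reads, writes = [0] * 4, [0] * 4
    for i in instrs:
        if i["op"] == "read":
            reads[i["bank"]] += 1
        elif i["op"] == "write":
            writes[i["bank"]] += 1
    if any(r > 8 for r in reads) or any(w > 8 for w in writes):
        return False                                       # (3) 8 reads / 8 writes per bank
    outputs = sum(i["op"] in ("write", "store") for i in instrs)
    branches = sum(i["op"] == "branch" for i in instrs)
    return outputs == declared_outputs and branches == 1   # (4) constant outputs, one branch
```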
Advisors/Committee Members: Burger, Douglas C., Ph. D. (advisor), John, Lizy K. (committee member), Keckler, Stephen W. (committee member), Lin, Calvin (committee member), McKinley, Kathryn S. (committee member).
Subjects/Keywords: EDGE; Computer architecture; Compilers
7.
Gebhart, Mark Alan.
Energy-efficient mechanisms for managing on-chip storage in throughput processors.
Degree: PhD, Computer Science, 2012, University of Texas – Austin
URL: http://hdl.handle.net/2152/ETD-UT-2012-05-5141
Modern computer systems are power or energy limited. While the number of transistors per chip continues to increase, classic Dennard voltage scaling has come to an end. Therefore, architects must improve a design's energy efficiency to continue to increase performance at historical rates while staying within a system's power limit. Throughput processors, which use a large number of threads to tolerate memory latency, have emerged as an energy-efficient platform for achieving high performance on diverse workloads and are found in systems ranging from cell phones to supercomputers. This work focuses on graphics processing units (GPUs), which contain thousands of threads per chip.

In this dissertation, I redesign the on-chip storage system of a modern GPU to improve energy efficiency. Modern GPUs contain very large register files that consume between 15% and 20% of the processor's dynamic energy. Most values written into the register file are read only a single time, often within a few instructions of being produced. To optimize for these patterns, we explore various designs for register file hierarchies. We study both a hardware-managed register file cache and a software-managed operand register file, and we evaluate the energy tradeoffs in varying the number of levels and the capacity of each level in the hierarchy. Our most efficient design reduces register file energy by 54%.

Beyond the register file, GPUs also contain on-chip scratchpad memories and caches. Traditional systems have a fixed partitioning between these three structures. Applications have diverse requirements, and often a single resource is most critical to performance. We propose to unify the register file, primary data cache, and scratchpad memory into a single structure that is dynamically partitioned on a per-kernel basis to match the application's needs.

The techniques proposed in this dissertation improve the utilization of on-chip memory, a scarce resource for systems with a large number of hardware threads. Making more efficient use of on-chip memory both improves performance and reduces energy. Future efficient systems will be achieved by combining several such techniques that improve energy efficiency.
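A small two-level register file hierarchy saves energy whenever the hit energy of the small structure is far below that of the main register file. The back-of-the-envelope model below illustrates the arithmetic only; the per-access energies and hit rate are invented placeholders, not the dissertation's measurements:

```python
def rf_energy(accesses, rfc_hit_rate, e_rfc=1.0, e_mrf=5.0):
    """Average energy for a two-level hierarchy: hits pay only the small
    register file cache (RFC); misses pay the RFC probe plus the main
    register file (MRF). Energies are in arbitrary units."""
    hits = accesses * rfc_hit_rate
    misses = accesses - hits
    return hits * e_rfc + misses * (e_rfc + e_mrf)

flat = 10_000 * 5.0                          # every access hits the large MRF
hier = rf_energy(10_000, rfc_hit_rate=0.6)   # 6,000 cheap hits, 4,000 misses
print(f"reduction: {1 - hier / flat:.0%}")   # 40% with these assumed numbers
```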
Advisors/Committee Members: Keckler, Stephen W. (advisor), Burger, Douglas C. (committee member), Erez, Mattan (committee member), Fussell, Donald S. (committee member), Lin, Calvin (committee member), McKinley, Kathryn S. (committee member).
Subjects/Keywords: Energy efficiency; Multi-threading; Register file organization; Throughput computing
8.
Kim, Changkyu.
A technology-scalable composable architecture.
Degree: PhD, Computer Sciences, 2007, University of Texas – Austin
URL: http://hdl.handle.net/2152/3279
Clock rate scaling can no longer sustain computer system performance scaling due to power and thermal constraints and the diminishing performance returns of deep pipelining. Future performance improvements must therefore come from mining concurrency from applications. However, increasing global on-chip wire delays will limit the amount of state available in a single cycle, thereby hampering the ability to mine concurrency with conventional approaches.

To address these technology challenges, the processor industry has migrated to chip multiprocessors (CMPs). The disadvantage of conventional CMP architectures, however, is their relative inflexibility to meet the wide range of application demands and operating targets that now exist. The granularity (e.g., issue width), the number of processors in a chip, and the memory hierarchies are fixed at design time based on the target workload mix, which results in suboptimal operation as the workload mix and operating targets change over time.

In this dissertation, we explore the concept of composability to address both the increasing wire delay problem and the inflexibility of conventional CMP architectures. The basic concept of composability is the ability to dynamically adapt to diverse applications and operating targets, both in terms of granularity and functionality, by aggregating fine-grained processing units or memory units.

First, we propose a composable on-chip memory substrate, called Non-Uniform Access Cache Architecture (NUCA), to address increasing on-chip wire delay for future large caches. The NUCA substrate breaks large on-chip memories into many fine-grained memory banks that are independently accessible, with a switched network embedded in the cache. Lines can be mapped into this array of memory banks with fixed mappings or dynamic mappings, where cache lines can move around within the cache to further reduce the average cache hit latency.

Second, we evaluate a range of strategies to build a composable processor. Composable processors provide the flexibility to adapt the granularity of processors to various application demands and operating targets, and thus to choose the hardware configuration best suited to any given point. A composable processor consists of a large number of low-power, fine-grained processor cores that can be aggregated dynamically to form more powerful logical processors. We present architectural innovations to support composability in a power- and area-efficient manner.
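A dynamic NUCA can be illustrated with a one-dimensional toy: banks sit at increasing distance from the core, lookups probe near-to-far, and hits migrate a line one bank closer so hot lines gravitate toward the fast banks. Bank count and latencies below are illustrative assumptions, not the dissertation's design parameters:

```python
class DNuca:
    def __init__(self, n_banks=8, latency_per_bank=2):
        self.banks = [set() for _ in range(n_banks)]   # bank 0 is closest to the core
        self.latency_per_bank = latency_per_bank

    def access(self, line):
        for d, bank in enumerate(self.banks):          # probe near-to-far
            if line in bank:
                if d > 0:                              # gradual promotion on a hit
                    bank.remove(line)
                    self.banks[d - 1].add(line)
                return (d + 1) * self.latency_per_bank  # hit latency grows with distance
        self.banks[-1].add(line)                       # fill into the farthest bank
        return None                                    # miss

cache = DNuca()
cache.access("A")                                      # miss, installed far away
print([cache.access("A") for _ in range(4)])           # [16, 14, 12, 10]: latency shrinks
```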
Advisors/Committee Members: Burger, Douglas C., Ph. D. (advisor).
Subjects/Keywords: Computer architecture; Computer storage devices; Memory management (Computer science); Multiprocessors
9.
Ranganathan, Nitya.
Control flow speculation for distributed architectures.
Degree: PhD, Computer Sciences, 2009, University of Texas – Austin
URL: http://hdl.handle.net/2152/6586
As transistor counts, power dissipation, and wire delays increase, the microprocessor industry is transitioning from chips containing large monolithic processors to multi-core architectures. The granularity of cores determines the mechanisms for branch prediction, instruction fetch and map, data supply, instruction execution, and completion. Accurate control flow prediction is essential for high performance processors with large instruction windows and high-bandwidth execution. This dissertation considers cores with very large granularity, such as TRIPS, as well as cores with extremely small granularity, such as TFlex, and explores control flow speculation issues in such processors. Both TRIPS and TFlex are distributed block-based architectures and require control speculation mechanisms that can work in a distributed environment while supporting efficient block-level prediction, misprediction detection, and recovery.

This dissertation aims to provide efficient control flow prediction techniques for distributed block-based processors. First, we discuss simple exit predictors inspired by branch predictors and describe the design of the TRIPS prototype block predictor. Area and timing trade-offs in the predictor implementation are presented, and we report the predictor misprediction rates from the prototype chip for the SPEC benchmark suite. Next, we look at the performance bottlenecks in the prototype predictor and present a detailed analysis of exit and target predictors using basic prediction components inspired by branch predictors. This study helps in understanding what types of predictors are effective for exit and target prediction. Using the results of our prediction analysis, we propose novel hardware techniques to improve the accuracy of block prediction. To understand whether exit prediction is inherently more difficult than branch prediction, we measure the correlation among branches in basic blocks and hyperblocks and examine the loss in correlation due to hyperblock construction. Finally, we propose block predictors for TFlex, a fully distributed architecture that uses composable lightweight processors. We describe various possible designs for distributed block predictors and a classification scheme for such predictors, and we present results for predictors from each of the design points for distributed prediction.
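An exit predictor differs from a branch predictor in that it predicts which of a block's several exit branches will fire, as a small integer, rather than a single taken/not-taken bit. A minimal sketch follows; the table size, history encoding, and index hash are assumptions, not the TRIPS prototype design:

```python
class ExitPredictor:
    def __init__(self, n_entries=1024, hist_len=4, exit_bits=3):
        self.table = [0] * n_entries        # predicted exit id (0..7) per entry
        self.history = 0                    # recent exit ids, exit_bits each
        self.hist_mask = (1 << (hist_len * exit_bits)) - 1
        self.exit_bits = exit_bits

    def _index(self, block_addr):
        return (block_addr ^ self.history) % len(self.table)

    def predict(self, block_addr):
        return self.table[self._index(block_addr)]

    def update(self, block_addr, actual_exit):
        self.table[self._index(block_addr)] = actual_exit   # last-exit training
        self.history = ((self.history << self.exit_bits) | actual_exit) & self.hist_mask
```

In a real design, the predicted exit id would then index a target predictor to produce the next block address.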
Advisors/Committee Members: Burger, Douglas C., Ph. D. (advisor).
Subjects/Keywords: Distributed architectures; Control flow prediction
10.
Sethumadhavan, Lakshminarasimhan, 1978-.
Scalable hardware memory disambiguation.
Degree: PhD, Computer Sciences, 2007, University of Texas – Austin
URL: http://hdl.handle.net/2152/3682
This dissertation deals with one of the long-standing problems in computer architecture: the problem of memory disambiguation. Microprocessors typically reorder memory instructions during execution to improve concurrency. Such microprocessors use hardware memory structures for memory disambiguation, known as Load-Store Queues (LSQs), to ensure that memory instruction dependences are satisfied even when the memory instructions execute out of order. A typical LSQ implementation (circa 2006) holds all in-flight memory instructions in a physically centralized LSQ and performs a fully associative search on all buffered instructions to ensure that memory dependences are satisfied. These LSQ implementations do not scale because they use large, fully associative structures, which are known to be slow and power hungry. The increasing trend towards distributed microarchitectures further exacerbates these problems: as on-chip wire delays increase and high-performance processors become necessarily distributed, centralized structures such as the LSQ can limit scalability.

This dissertation describes techniques to create scalable LSQs in both centralized and distributed microarchitectures. The problems and solutions described in this thesis are motivated and validated by real system designs. The dissertation starts with a description of the partitioned primary memory system of the TRIPS processor, of which the LSQ is an important component, and then, through a series of optimizations, describes how the power, area, and centralization problems of the LSQ can be solved with minor performance losses (if at all) even for large numbers of in-flight memory instructions. The four solutions described in this dissertation — partitioning, filtering, late binding, and efficient overflow management — enable power- and area-efficient, distributed, scalable LSQs, which in turn enable aggressive large-window processors capable of simultaneously executing thousands of instructions.

To mitigate the power problem, we replaced the power-hungry, fully associative search with a power-efficient hash table lookup using a simple address-based Bloom filter. Bloom filters are probabilistic data structures used for testing set membership and can be used to quickly check whether an instruction with the same data address is likely to be found in the LSQ without performing the associative search. Bloom filters typically eliminate more than 80% of the associative searches, and they are highly effective because in most programs it is uncommon for loads and stores to have the same data address and be in execution simultaneously.

To rectify the area problem, we observe that only a small fraction of all memory instructions are dependent, that only such dependent instructions need to be buffered in the LSQ, and that these instructions need to be in the LSQ only for certain parts of the pipelined execution. We propose two mechanisms to exploit these observations. The first mechanism, area filtering, is a hardware mechanism that…
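The Bloom-filter optimization is straightforward to sketch: a small bit vector is checked before the fully associative scan, and a miss in the filter proves the scan unnecessary. The hash functions and sizing below are illustrative; a real LSQ filter also needs a way to clear or decrement entries as instructions commit:

```python
class LSQBloomFilter:
    def __init__(self, n_bits=1024):
        self.bits = [False] * n_bits
        self.n_bits = n_bits

    def _hashes(self, addr):
        return (addr % self.n_bits,
                (addr * 2654435761 >> 7) % self.n_bits)   # two cheap hash functions

    def insert(self, addr):                               # on memory instruction issue
        for h in self._hashes(addr):
            self.bits[h] = True

    def may_match(self, addr):
        """False: address definitely absent, the associative LSQ scan can be
        skipped. True: possible match (or false positive), do the full scan."""
        return all(self.bits[h] for h in self._hashes(addr))

filt = LSQBloomFilter()
filt.insert(0x1000)
print(filt.may_match(0x1000))   # True: full associative search required
print(filt.may_match(0x2A48))   # False: search safely skipped
```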
Advisors/Committee Members: Burger, Douglas C., Ph. D. (advisor).
Subjects/Keywords: Memory management (Computer science); Computer storage devices; Microprocessors – Design and construction; Computer architecture
11.
Desikan, Rajagopalan.
Distributed selective re-execution for EDGE architectures.
Degree: PhD, Electrical and Computer Engineering, 2005, University of Texas – Austin
URL: http://hdl.handle.net/2152/2414
Subjects/Keywords: Computer architecture; High performance computing
12.
Liu, Haiming.
Hardware techniques to improve cache efficiency.
Degree: PhD, Computer Sciences, 2009, University of Texas – Austin
URL: http://hdl.handle.net/2152/6566
Modern microprocessors devote a large portion of their chip area to caches in order to bridge the speed and bandwidth gap between the core and main memory. One known problem with caches is that they are usually used with low efficiency; only a small fraction of the cache stores data that will be used before getting evicted. As the focus of microprocessor design shifts towards achieving higher performance-per-watt, cache efficiency is becoming increasingly important. This dissertation proposes techniques to improve both data cache efficiency in general and instruction cache efficiency for Explicit Data Graph Execution (EDGE) architectures.

To improve the efficiency of data caches and L2 caches, dead blocks (blocks that will not be referenced again before their eviction from the cache) should be identified and evicted early. Prior schemes predict the death of a block immediately after it is accessed, based on the individual reference history of the block. Such schemes result in lower prediction accuracy and coverage. We delay the prediction to achieve better prediction accuracy and coverage. For the L1 cache, we propose a new class of dead-block prediction schemes that predict dead blocks based on cache bursts. A cache burst begins when a block moves into the MRU position and ends when it moves out of the MRU position. Cache burst history is more predictable than individual reference history and results in better dead-block prediction accuracy and coverage. Experimental results show that predicting the death of a block at the end of a burst gives the best tradeoff between timeliness and prediction accuracy/coverage. We also propose mechanisms to improve counting-based dead-block predictors, which work best at the L2 cache. These mechanisms handle reference-count variations, which cause problems for existing counting-based dead-block predictors.

The new schemes can identify the majority of the dead blocks with approximately 90% or higher accuracy. For a 64KB, two-way L1 D-cache, 96% of the dead blocks can be identified with 96% accuracy, halfway into a block's dead time. For a 64KB, four-way L1 cache, the prediction accuracy and coverage are 92% and 91%, respectively. At any moment, the average fraction of the dead blocks that have been correctly detected for a two-way or four-way L1 cache is approximately 49% or 67%, respectively. For a 1MB, 16-way set-associative L2 cache, 66% of the dead blocks can be identified with 89% accuracy, 1/16th of the way into a block's dead time. At any moment, 63% of the dead blocks in such an L2 cache, on average, have been correctly identified by the dead-block predictor. The ability to accurately identify the majority of the dead blocks in the cache long before their eviction can lead not only to higher cache efficiency, but also to reduced power consumption or higher reliability.

In this dissertation, we use the dead-block information to improve cache efficiency and performance via three techniques: replacement optimization, cache bypassing, and prefetching into…
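The burst-based idea can be sketched compactly: count how many MRU bursts a block sees, learn the typical count per fill signature (keyed here by the filling PC, which is an assumption about the signature), and declare a block dead when its burst count reaches the learned value. A simplified model, not the dissertation's exact predictor:

```python
class BurstDeadBlockPredictor:
    def __init__(self):
        self.learned = {}                 # fill PC -> burst count seen in past generations

    def on_burst_end(self, block):
        """Called when the block leaves the MRU position (its burst ends)."""
        block["bursts"] += 1
        expected = self.learned.get(block["pc"])
        return expected is not None and block["bursts"] >= expected   # predicted dead?

    def on_evict(self, block):
        """Train on the block's actual lifetime at eviction."""
        self.learned[block["pc"]] = block["bursts"]

pred = BurstDeadBlockPredictor()
pred.on_evict({"pc": 0x400, "bursts": 2})   # a previous generation saw 2 bursts
blk = {"pc": 0x400, "bursts": 0}
print(pred.on_burst_end(blk))               # False: only 1 burst so far
print(pred.on_burst_end(blk))               # True: 2nd burst ends, block predicted dead
```

Predicting at burst end rather than at each access is exactly what trades a little timeliness for the accuracy and coverage gains the abstract reports.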
Advisors/Committee Members: Burger, Douglas C., Ph. D. (advisor).
Subjects/Keywords: Data cache efficiency; Instruction cache efficiency; Explicit Data Graph Execution; Dead-block prediction
13.
Huh, Jaehyuk.
Hardware techniques to reduce communication costs in multiprocessors.
Degree: PhD, Computer Sciences, 2006, University of Texas – Austin
URL: http://hdl.handle.net/2152/2533
This dissertation explores techniques for reducing the costs of inter-processor communication in shared memory multiprocessors (MPs). We seek to improve MP performance by enhancing three aspects of multiprocessor cache designs: miss reduction, low communication latency, and high coherence bandwidth. In this dissertation, we propose three techniques to enhance these factors: shared non-uniform cache architecture, coherence decoupling, and subspace snooping.

As a miss reduction technique, we investigate shared cache designs for future chip multiprocessors (CMPs). Cache sharing can reduce cache misses by eliminating unnecessary data duplication and by reallocating the cache capacity dynamically. We propose a reconfigurable shared non-uniform cache architecture and evaluate the trade-offs of cache sharing with varied sharing degrees. Although shared caches can improve caching efficiency, their most significant disadvantage is the increase in cache hit latencies. To mitigate the effect of the long latencies, we evaluate two latency management techniques: dynamic block migration and L1 prefetching.

However, improving caching efficiency does not reduce the cache misses induced by MP communication. For such communication misses, the latencies of cache coherence should be either reduced or hidden, and the coherence bandwidth should scale with the number of processors. To mitigate long communication latencies, coherence decoupling uses speculation for communication data: it allows processors to run speculatively past communication misses with predicted values. Our prediction mechanism, called the Speculative Cache Lookup (SCL) protocol, uses stale values in the local caches. We show that the SCL read component can hide false sharing and silent store misses effectively. We also investigate the SCL update component, which hides the latencies of truly shared misses by updating invalid blocks speculatively.

To improve the coherence bandwidth, we propose subspace snooping, which improves the snooping bandwidth of future large-scale shared-memory machines. Even with huge optical bus bandwidth, traditional snooping protocols may not scale to hundreds of processors, since all processors must respond to every bus access. Subspace snooping allows only a subset of processors to be snooped for a bus access, thus increasing the effective snoop tag bandwidth. We evaluate subspace snooping with the large broadcast bandwidth provided by optical interconnects.
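Coherence decoupling's control flow (run ahead on the stale local value, then verify against the coherent copy) can be shown in a few lines. This is a conceptual sketch only; real hardware overlaps the coherent fetch with speculative execution and squashes via the pipeline's misspeculation machinery:

```python
def scl_read(stale_value, fetch_coherent, consume):
    """stale_value: value still readable from the locally invalid line;
    fetch_coherent: obtains the up-to-date value (the coherence miss);
    consume: the dependent computation, run speculatively."""
    speculative = consume(stale_value)   # run ahead instead of stalling
    actual = fetch_coherent()            # coherent copy arrives later
    if actual == stale_value:            # e.g., false sharing or a silent store
        return speculative               # speculation correct: miss latency hidden
    return consume(actual)               # mispeculation: squash and re-execute

result = scl_read(stale_value=7,
                  fetch_coherent=lambda: 7,   # silent store left the value unchanged
                  consume=lambda v: v * 2)
print(result)                                 # 14, available before verification completes
```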
Advisors/Committee Members: Burger, Douglas C., Ph. D. (advisor).
Subjects/Keywords: Multiprocessors – Design and construction; Cache memory
Record Details
Similar Records
Cite
Share »
Record Details
Similar Records
Cite
« Share





❌
APA ·
Chicago ·
MLA ·
Vancouver ·
CSE |
Export
to Zotero / EndNote / Reference
Manager
APA (6th Edition):
Huh, J. (2006). Hardware techniques to reduce communication costs in multiprocessors. (Doctoral Dissertation). University of Texas – Austin. Retrieved from http://hdl.handle.net/2152/2533
Chicago Manual of Style (16th Edition):
Huh, Jaehyuk. “Hardware techniques to reduce communication costs in multiprocessors.” 2006. Doctoral Dissertation, University of Texas – Austin. Accessed February 26, 2021. http://hdl.handle.net/2152/2533.
MLA Handbook (7th Edition):
Huh, Jaehyuk. “Hardware techniques to reduce communication costs in multiprocessors.” 2006. Web. 26 Feb 2021.
Vancouver:
Huh J. Hardware techniques to reduce communication costs in multiprocessors. [Internet] [Doctoral dissertation]. University of Texas – Austin; 2006. [cited 2021 Feb 26]. Available from: http://hdl.handle.net/2152/2533.
Council of Science Editors:
Huh J. Hardware techniques to reduce communication costs in multiprocessors. [Doctoral Dissertation]. University of Texas – Austin; 2006. Available from: http://hdl.handle.net/2152/2533
14.
Govindan, Madhu Sarava.
E³ : energy-efficient EDGE architectures.
Degree: PhD, Computer Sciences, 2010, University of Texas – Austin
URL: http://hdl.handle.net/2152/ETD-UT-2010-08-1934
► Increasing power dissipation is one of the most serious challenges facing designers in the microprocessor industry. Power dissipation, increasing wire delays, and increasing design complexity…
(more)
▼ Increasing power dissipation is one of the most serious challenges facing designers in the microprocessor industry. Power dissipation, increasing wire delays, and increasing design complexity have forced industry to embrace multi-core architectures or chip multiprocessors (CMPs). While CMPs mitigate wire delays and design complexity, they do not directly address single-threaded performance. Additionally, programs must be parallelized, either manually or automatically, to fully exploit the performance of CMPs. Researchers have recently proposed an architecture called Explicit Data Graph Execution (EDGE) as an alternative to conventional CMPs. EDGE architectures are designed to be technology-scalable and to provide good single-threaded performance as well as exploit other types of parallelism, including data-level and thread-level parallelism. In this dissertation, we examine the energy efficiency of a specific EDGE instruction set, the TRIPS Instruction Set Architecture (ISA), and two microarchitectures, TRIPS and TFlex, that implement it. The TRIPS microarchitecture is a first-generation design that demonstrates the feasibility of the TRIPS ISA and distributed tiled microarchitectures. The second-generation TFlex microarchitecture addresses key inefficiencies of the TRIPS microarchitecture by matching the resource needs of applications to a composable hardware substrate. First, we perform a thorough power analysis of the TRIPS microarchitecture. We describe how we develop architectural power models for TRIPS. We then improve power-modeling accuracy using hardware power measurements on the TRIPS prototype combined with detailed Register Transfer Level (RTL) power models from the TRIPS design. Using these refined architectural power models and normalized power-modeling methodologies, we perform a detailed performance and power comparison of the TRIPS microarchitecture with two different processors: 1) a low-end processor designed for power efficiency (ARM/XScale) and 2) a high-end superscalar processor designed for high performance (a variant of Power4). This detailed power analysis provides key insights into the advantages and disadvantages of the TRIPS ISA and microarchitecture compared to processors on either end of the performance-power spectrum. Our results indicate that the TRIPS microarchitecture achieves 11.7 times better energy efficiency than ARM, and approximately 12% better energy efficiency than Power4, in terms of the Energy-Delay-Squared (ED²) metric. Second, we evaluate the energy efficiency of the TFlex microarchitecture in comparison to TRIPS, ARM, and Power4. TFlex belongs to a class of microarchitectures called Composable Lightweight Processors (CLPs). CLPs are distributed microarchitectures designed with simple cores and are highly configurable at runtime to adapt to the resource needs of applications. We develop power models for the TFlex microarchitecture based on the validated TRIPS power models. Our quantitative results indicate that by better matching execution resources to the needs of…
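For reference, the ED² metric weighs delay quadratically, so a design may spend up to four times the energy for a twofold speedup and still break even. A small worked comparison, with made-up energy and delay values chosen purely to illustrate the metric:

    # ED^2 = energy * delay^2 (lower is better). The numbers here are
    # invented for illustration, not taken from the dissertation.
    def ed2(energy_joules, delay_seconds):
        return energy_joules * delay_seconds ** 2

    baseline = ed2(energy_joules=1.0, delay_seconds=1.0)   # 1.000
    faster = ed2(energy_joules=3.0, delay_seconds=0.5)     # 0.750
    print(f"relative ED^2 efficiency: {baseline / faster:.2f}x")  # 1.33x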
Advisors/Committee Members: Keckler, Stephen W. (advisor), Burger, Douglas C. (committee member), McKinley, Kathryn S. (committee member), Chiou, Derek (committee member), Hunt, Jr., Warren A. (committee member), Brooks, David (committee member).
Subjects/Keywords: Energy efficiency; EDGE architectures; Power efficiency; Composability; DVFS; Power management; Dynamic voltage and frequency scaling; Explicit Data Graph Execution architectures
APA (6th Edition):
Govindan, M. S. (2010). E³ : energy-efficient EDGE architectures. (Doctoral Dissertation). University of Texas – Austin. Retrieved from http://hdl.handle.net/2152/ETD-UT-2010-08-1934
Chicago Manual of Style (16th Edition):
Govindan, Madhu Sarava. “E³ : energy-efficient EDGE architectures.” 2010. Doctoral Dissertation, University of Texas – Austin. Accessed February 26, 2021. http://hdl.handle.net/2152/ETD-UT-2010-08-1934.
MLA Handbook (7th Edition):
Govindan, Madhu Sarava. “E³ : energy-efficient EDGE architectures.” 2010. Web. 26 Feb 2021.
Vancouver:
Govindan MS. E³ : energy-efficient EDGE architectures. [Internet] [Doctoral dissertation]. University of Texas – Austin; 2010. [cited 2021 Feb 26]. Available from: http://hdl.handle.net/2152/ETD-UT-2010-08-1934.
Council of Science Editors:
Govindan MS. E³ : energy-efficient EDGE architectures. [Doctoral Dissertation]. University of Texas – Austin; 2010. Available from: http://hdl.handle.net/2152/ETD-UT-2010-08-1934

University of Texas – Austin
15.
Murukkathampoondi, Hrishikesh Sathyavasu.
Design of wide-issue high-frequency processors in wire delay dominated technologies.
Degree: PhD, Electrical and Computer Engineering, 2004, University of Texas – Austin
URL: http://hdl.handle.net/2152/1279
Subjects/Keywords: Microprocessors – Design and construction; Computer architecture
APA (6th Edition):
Murukkathampoondi, H. S. (2004). Design of wide-issue high-frequency processors in wire delay dominated technologies. (Doctoral Dissertation). University of Texas – Austin. Retrieved from http://hdl.handle.net/2152/1279
Chicago Manual of Style (16th Edition):
Murukkathampoondi, Hrishikesh Sathyavasu. “Design of wide-issue high-frequency processors in wire delay dominated technologies.” 2004. Doctoral Dissertation, University of Texas – Austin. Accessed February 26, 2021. http://hdl.handle.net/2152/1279.
MLA Handbook (7th Edition):
Murukkathampoondi, Hrishikesh Sathyavasu. “Design of wide-issue high-frequency processors in wire delay dominated technologies.” 2004. Web. 26 Feb 2021.
Vancouver:
Murukkathampoondi HS. Design of wide-issue high-frequency processors in wire delay dominated technologies. [Internet] [Doctoral dissertation]. University of Texas – Austin; 2004. [cited 2021 Feb 26]. Available from: http://hdl.handle.net/2152/1279.
Council of Science Editors:
Murukkathampoondi HS. Design of wide-issue high-frequency processors in wire delay dominated technologies. [Doctoral Dissertation]. University of Texas – Austin; 2004. Available from: http://hdl.handle.net/2152/1279

University of Texas – Austin
16.
Maher, Bertrand Allen.
Atomic block formation for explicit data graph execution architectures.
Degree: PhD, Computer Sciences, 2010, University of Texas – Austin
URL: http://hdl.handle.net/2152/ETD-UT-2010-08-1904
► Limits on power consumption, complexity, and on-chip latency have focused computer architects on power-efficient designs that exploit parallelism. One approach divides programs into atomic blocks…
(more)
▼ Limits on power consumption, complexity, and on-chip latency have
focused computer architects on power-efficient designs that exploit
parallelism. One approach divides programs into atomic blocks of
operations that execute semi-independently, which efficiently creates
a large window of potentially concurrent operations. This
dissertation studies the intertwined roles of the compiler,
architecture, and microarchitecture in achieving efficiency and high
performance with a block-atomic architecture.
For such an architecture to achieve high performance, the compiler must form blocks effectively. It must create large blocks of instructions to amortize the per-block overhead, but control-flow and content restrictions limit its options. Block formation should weigh factors such as execution frequency and block size, for example by selecting control-flow paths that are frequently executed, and should exploit the locality of computations to reduce communication overheads.
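As a rough illustration of these constraints, a greedy block former might grow a block along the hottest control-flow edges until an architectural size limit is reached. The CFG and profile interfaces and the 128-instruction limit below are assumptions made for the sketch (TRIPS-like ISAs bound block size and contents):

    # Greedy atomic-block formation sketch: follow the most frequent
    # successor edge until the fixed block-size limit is hit. The
    # wasted slots at the cut point are the fixed-size overhead the
    # abstract describes.
    MAX_BLOCK_INSTRUCTIONS = 128  # assumed architectural limit

    def form_block(cfg, seed, edge_freq):
        """Grow a block from basic block `seed`; cfg[b] exposes
        .instructions and .successors, and edge_freq maps (src, dst)
        to an execution count (all assumed interfaces)."""
        block, size = [seed], len(cfg[seed].instructions)
        current = seed
        while cfg[current].successors:
            nxt = max(cfg[current].successors,
                      key=lambda s: edge_freq[(current, s)])
            if size + len(cfg[nxt].instructions) > MAX_BLOCK_INSTRUCTIONS:
                break  # block ends here; remaining slots go unused
            block.append(nxt)
            size += len(cfg[nxt].instructions)
            current = nxt
        return block

Non-empty basic blocks are assumed, so repeated visits along a loop (effectively unrolling it into the block) still terminate at the size limit.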
This dissertation determines which characteristics of programs influence block formation and proposes techniques to generate effective blocks. The first contribution is a method for solving the phase-ordering problems inherent in block formation, mitigating the tension between block-enlarging optimizations (if-conversion, tail duplication, loop unrolling, and loop peeling) and scalar optimizations. Given these optimizations, analysis shows that the remaining obstacles to creating larger blocks are inherent in the control-flow structure of applications, and furthermore that any fixed block size entails a sizable amount of wasted space. To eliminate this overhead, this dissertation proposes an architectural implementation of variable-size blocks that allows the compiler to dramatically improve block efficiency.
We use these mechanisms to develop block-formation policies that achieve high performance on a range of applications and processor configurations. We find that the best policy depends strongly on the number of participating cores. Using machine learning, we derive generalized policies for particular hardware configurations and find that the best policy varies across applications and with the number of parallel resources available in the microarchitecture. These results show that effective and efficient block-atomic execution is possible when the compiler and microarchitecture are designed cooperatively.
Advisors/Committee Members: McKinley, Kathryn S. (advisor), Burger, Douglas C., Ph. D. (advisor), Keckler, Stephen W. (committee member), Mahlke, Scott A. (committee member), Pingali, Keshav (committee member).
Subjects/Keywords: Computer architecture; Compilers; Block formation
APA (6th Edition):
Maher, B. A. (2010). Atomic block formation for explicit data graph execution architectures. (Doctoral Dissertation). University of Texas – Austin. Retrieved from http://hdl.handle.net/2152/ETD-UT-2010-08-1904
Chicago Manual of Style (16th Edition):
Maher, Bertrand Allen. “Atomic block formation for explicit data graph execution architectures.” 2010. Doctoral Dissertation, University of Texas – Austin. Accessed February 26, 2021. http://hdl.handle.net/2152/ETD-UT-2010-08-1904.
MLA Handbook (7th Edition):
Maher, Bertrand Allen. “Atomic block formation for explicit data graph execution architectures.” 2010. Web. 26 Feb 2021.
Vancouver:
Maher BA. Atomic block formation for explicit data graph execution architectures. [Internet] [Doctoral dissertation]. University of Texas – Austin; 2010. [cited 2021 Feb 26]. Available from: http://hdl.handle.net/2152/ETD-UT-2010-08-1904.
Council of Science Editors:
Maher BA. Atomic block formation for explicit data graph execution architectures. [Doctoral Dissertation]. University of Texas – Austin; 2010. Available from: http://hdl.handle.net/2152/ETD-UT-2010-08-1904