You searched for +publisher:"University of Illinois – Urbana-Champaign" +contributor:("Hwu, Wen-Mei").
Showing records 1 – 30 of 61 total matches.

University of Illinois – Urbana-Champaign
1.
Ross, Gregory.
High performance histogramming on massively parallel processors.
Degree: MS, 1200, 2014, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/50625
Histogramming is a technique by which input datasets are mined to extract features and patterns. Histograms have a wide range of uses in computer vision, machine learning, database processing, and quality control for manufacturing, and many applications benefit from advance knowledge about the distribution of data.
Computing a histogram is, essentially, the antithesis of parallel processing. In an input-driven method, contributing data to a histogram bin without slow atomic operations or serial execution would likely introduce inaccuracies into the resulting output. An output-driven method would eliminate the need for atomic operations but would amplify read bandwidth requirements, reduce overall throughput, and result in zero or negative performance gain.
We introduce a method that packs multiple bins into a memory word with the goal of better utilizing GPU resources. This method improves GPU occupancy relative to earlier histogram kernel implementations and increases the number of working threads to better hide the latency of atomic operations and collisions, while maintaining reasonable throughput. The technique is demonstrated to improve the performance of histograms of various sizes on a variety of inputs, including a study of a particular application. While the results depend heavily on input data patterns, the conclusions gathered in this thesis show that the packed-atomics histogramming kernel can, and usually does, outperform other implementations in all but a few exceptional cases.
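To illustrate the packing idea, here is a minimal CUDA sketch (not the thesis's kernel): four 8-bit sub-counters share one 32-bit word, so a single atomicAdd with a shifted increment bumps one bin without disturbing its neighbors. The kernel name, the 8-bit sub-counter width, and the grid-stride loop are illustrative assumptions; the thesis's packed-atomics kernel additionally handles sub-counter overflow and privatization, which this sketch omits.

// Illustrative packed-bin histogram: 4 sub-counters of 8 bits per 32-bit word.
// For 8-bit input values, packedBins must hold 256/4 = 64 words, zeroed before launch.
__global__ void packedHistogram(const unsigned char *data, int n,
                                unsigned int *packedBins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (; i < n; i += stride) {
        unsigned int bin  = data[i];        // bin index in [0, 256)
        unsigned int word = bin >> 2;       // which 32-bit word holds the bin
        unsigned int slot = bin & 3;        // which byte within that word
        atomicAdd(&packedBins[word], 1u << (8 * slot));  // bump one sub-counter
    }
}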
Advisors/Committee Members: Hwu, Wen-Mei W. (advisor).
Subjects/Keywords: General Purpose Computing on Graphics Processing Units (GPGPU); Compute Unified Device Architecture (CUDA); Histogram; Image Processing; Parallelism
APA (6th Edition):
Ross, G. (2014). High performance histogramming on massively parallel processors. (Thesis). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/50625
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Ross, Gregory. “High performance histogramming on massively parallel processors.” 2014. Thesis, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/50625.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Ross, Gregory. “High performance histogramming on massively parallel processors.” 2014. Web. 13 Apr 2021.
Vancouver:
Ross G. High performance histogramming on massively parallel processors. [Internet] [Thesis]. University of Illinois – Urbana-Champaign; 2014. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/50625.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Ross G. High performance histogramming on massively parallel processors. [Thesis]. University of Illinois – Urbana-Champaign; 2014. Available from: http://hdl.handle.net/2142/50625
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

University of Illinois – Urbana-Champaign
2.
Obeid, Nady M.
Compact binning for parallel processing of limited-range functions.
Degree: MS, 1200, 2011, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/18310
Limited-range functions are domain-level optimizations to a class of applications in which all input elements contribute to all output elements, based on the distance between two given elements. When the contribution of an input element to the output is inversely proportional to the distance, a limited range can be applied, which approximates the contribution as zero beyond a certain cutoff distance. Introducing a limited-range function to the application reduces the computational complexity from O(N²) to O(N).
Processing multiple input elements of a limited-range function in parallel can lead to data races unless expensive synchronization is used. That is why a preferred approach is an output-driven one, where multiple output elements are processed in parallel, all sharing the input data set for reads. Typically the input data set is unstructured, which, without binning, would force every output element in the output-driven approach to read all of the input elements to determine which ones fall within its cutoff. Binning is a preconditioning step that sorts the input elements into predetermined bins that are easily accessible by the output, allowing each output element to access only the bins relevant to its computation.
Traditionally, bins were created with uniform size and capacity to enable easy access; however, making the bins regular can have severe side effects on the memory required to maintain them. We propose a technique that allows the bins to vary in capacity in order to reduce the memory overhead, at the cost of added access overhead. In this work, we compare regular binning with our approach, compact binning, and demonstrate that compact bins can in fact improve the execution performance of limited-range functions executed in parallel.
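As a rough illustration of the output-driven pattern described above (not the thesis's compact-binning code), the CUDA sketch below assumes 1D points pre-sorted into uniform bins laid out CSR-style (binStart[b]..binStart[b+1] index into binnedX/binnedVal) and a simple 1/r contribution; each thread computes one output sample and visits only the bins that can contain points within the cutoff.

__global__ void limitedRangeGather(const int *binStart, const float *binnedX,
                                   const float *binnedVal, const float *outX,
                                   float *out, int numOut, int numBins,
                                   float binWidth, float cutoff)
{
    int o = blockIdx.x * blockDim.x + threadIdx.x;
    if (o >= numOut) return;
    float x = outX[o];
    int lo = max(0, (int)((x - cutoff) / binWidth));           // first candidate bin
    int hi = min(numBins - 1, (int)((x + cutoff) / binWidth));  // last candidate bin
    float acc = 0.0f;
    for (int b = lo; b <= hi; ++b)
        for (int k = binStart[b]; k < binStart[b + 1]; ++k) {
            float d = fabsf(binnedX[k] - x);
            if (d < cutoff) acc += binnedVal[k] / (d + 1e-6f);  // 1/r-style falloff
        }
    out[o] = acc;
}

The thesis's contribution lies precisely in replacing the uniform, fixed-capacity bins assumed here with variable-capacity compact bins to cut the memory overhead.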
Advisors/Committee Members: Hwu, Wen-Mei W. (advisor).
Subjects/Keywords: irregular binning; compact binning; graphics processing units (GPUs); graphics processors; parallel processing; limited-range functions; gridding; cutoff distance
APA (6th Edition):
Obeid, N. M. (2011). Compact binning for parallel processing of limited-range functions. (Thesis). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/18310
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Obeid, Nady M. “Compact binning for parallel processing of limited-range functions.” 2011. Thesis, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/18310.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Obeid, Nady M. “Compact binning for parallel processing of limited-range functions.” 2011. Web. 13 Apr 2021.
Vancouver:
Obeid NM. Compact binning for parallel processing of limited-range functions. [Internet] [Thesis]. University of Illinois – Urbana-Champaign; 2011. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/18310.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Obeid NM. Compact binning for parallel processing of limited-range functions. [Thesis]. University of Illinois – Urbana-Champaign; 2011. Available from: http://hdl.handle.net/2142/18310
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

University of Illinois – Urbana-Champaign
3.
Chang, Li-Wen.
Scalable parallel tridiagonal algorithms with diagonal pivoting and their optimization for many-core architectures.
Degree: MS, 1200, 2014, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/50588
Tridiagonal solvers are important building blocks for a wide range of scientific applications that are commonly performance-sensitive. Recently, many-core architectures, such as GPUs, have become ubiquitous targets for these applications. Therefore, a high-performance, general-purpose GPU tridiagonal solver becomes critical. However, no existing GPU tridiagonal solver provides quality of solutions comparable to the most common general-purpose CPU tridiagonal solvers, such as Matlab or Intel MKL, because they do not perform pivoting. Meanwhile, conventional pivoting algorithms are sequential and not applicable to GPUs.
In this thesis, we propose three scalable tridiagonal algorithms with diagonal pivoting that provide better quality of solutions than state-of-the-art GPU tridiagonal solvers. A SPIKE-Diagonal Pivoting algorithm efficiently partitions the workload of a tridiagonal solver and provides pivoting within each partition. A Parallel Diagonal Pivoting algorithm transforms the conventional diagonal pivoting algorithm into a parallelizable form that can be solved by high-performance parallel linear recurrence solvers. An Adaptive R-Cyclic Reduction algorithm introduces pivoting into the conventional R-Cyclic Reduction family, which commonly suffers from limited solution quality because no pivoting can be applied. Our proposed algorithms provide quality of solutions comparable to CPU tridiagonal solvers, such as Matlab or Intel MKL, without compromising the high throughput GPUs provide.
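For context, the sketch below shows the standard serial Thomas algorithm, the no-pivoting baseline whose weakness motivates this work: the forward sweep divides by d[i] - a[i]*c'[i-1], which can be arbitrarily small for ill-conditioned systems, and diagonal pivoting avoids exactly that. This is a generic reference, not one of the three proposed solvers.

#include <vector>
// a = sub-diagonal (a[0] unused), d = diagonal, c = super-diagonal
// (c[n-1] unused), b = right-hand side (copied by value), x = solution.
void thomasSolve(const std::vector<double> &a, const std::vector<double> &d,
                 const std::vector<double> &c, std::vector<double> b,
                 std::vector<double> &x)
{
    int n = (int)d.size();
    std::vector<double> cp(n, 0.0);
    cp[0] = c[0] / d[0];
    b[0] /= d[0];
    for (int i = 1; i < n; ++i) {
        double piv = d[i] - a[i] * cp[i - 1];   // may be near zero without pivoting
        if (i < n - 1) cp[i] = c[i] / piv;
        b[i] = (b[i] - a[i] * b[i - 1]) / piv;
    }
    x.assign(n, 0.0);
    x[n - 1] = b[n - 1];
    for (int i = n - 2; i >= 0; --i)
        x[i] = b[i] - cp[i] * x[i + 1];
}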
Advisors/Committee Members: Hwu, Wen-Mei W. (advisor).
Subjects/Keywords: Tridiagonal Solver; SPIKE algorithm; Linear Recurrence; Cyclic Reduction; Diagonal Pivoting; Graphics Processing Unit (GPU) Computing; General Purpose computation on Graphics Processing Units (GPGPU); Many-core
APA (6th Edition):
Chang, L. (2014). Scalable parallel tridiagonal algorithms with diagonal pivoting and their optimization for many-core architectures. (Thesis). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/50588
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Chang, Li-Wen. “Scalable parallel tridiagonal algorithms with diagonal pivoting and their optimization for many-core architectures.” 2014. Thesis, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/50588.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Chang, Li-Wen. “Scalable parallel tridiagonal algorithms with diagonal pivoting and their optimization for many-core architectures.” 2014. Web. 13 Apr 2021.
Vancouver:
Chang L. Scalable parallel tridiagonal algorithms with diagonal pivoting and their optimization for many-core architectures. [Internet] [Thesis]. University of Illinois – Urbana-Champaign; 2014. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/50588.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Chang L. Scalable parallel tridiagonal algorithms with diagonal pivoting and their optimization for many-core architectures. [Thesis]. University of Illinois – Urbana-Champaign; 2014. Available from: http://hdl.handle.net/2142/50588
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

University of Illinois – Urbana-Champaign
4.
Lv, Jie.
Parallel merge for many-core architectures.
Degree: MS, Electrical & Computer Engr, 2016, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/90824
This thesis proposes a novel GPU implementation for merging two sorted arrays.
We consider the problem of merging two arrays A and B into a single array C. Each element in the arrays has a key, and an ordering relation is defined on the keys. Arrays A and B have m and n elements, respectively, where m and n do not have to be equal, and both are sorted according to the ordering relation. The task is to produce the output array C of size m + n, which consists of all the input elements from A and B and is itself sorted by the ordering relation.
We applied several GPU-specific optimizations to a parallel merge algorithm, including coordinating the memory access pattern, making full use of shared memory, and reducing thread divergence. Our implementation achieves up to 10x and 40x speedup on a Titan Z and a GTX 980 GPU, respectively, compared to the Thrust merge implementation.
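The abstract does not spell out the partitioning scheme, so as an illustrative assumption the sketch below uses the common merge-path (co-rank) search: for an output position k, a binary search finds how many of the first k merged elements come from A, letting each thread or block merge an independent output range. Shared-memory tiling and the other optimizations mentioned above are omitted.

// Returns i such that the first k elements of the merged output consist of
// A[0..i) and B[0..k-i), for sorted arrays A (length m) and B (length n).
__host__ __device__ int coRank(int k, const int *A, int m, const int *B, int n)
{
    int iLow  = (k > n) ? k - n : 0;   // smallest legal contribution from A
    int iHigh = (k < m) ? k : m;       // largest legal contribution from A
    while (iLow < iHigh) {
        int i = (iLow + iHigh) / 2;    // candidate split in A
        int j = k - i;                 // matching split in B
        if (i > 0 && j < n && A[i - 1] > B[j])
            iHigh = i;                 // took too many from A
        else if (j > 0 && i < m && B[j - 1] >= A[i])
            iLow = i + 1;              // took too few from A
        else
            return i;                  // balanced split found
    }
    return iLow;
}

Each thread t can then call coRank(t * elemsPerThread, A, m, B, n) and coRank((t + 1) * elemsPerThread, A, m, B, n) to obtain its private ranges of A and B and merge them sequentially.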
Advisors/Committee Members: Hwu, Wen-Mei W. (advisor).
Subjects/Keywords: Graphics processing unit (GPU); Parallel Merge
APA (6th Edition):
Lv, J. (2016). Parallel merge for many-core architectures. (Thesis). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/90824
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Lv, Jie. “Parallel merge for many-core architectures.” 2016. Thesis, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/90824.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Lv, Jie. “Parallel merge for many-core architectures.” 2016. Web. 13 Apr 2021.
Vancouver:
Lv J. Parallel merge for many-core architectures. [Internet] [Thesis]. University of Illinois – Urbana-Champaign; 2016. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/90824.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Lv J. Parallel merge for many-core architectures. [Thesis]. University of Illinois – Urbana-Champaign; 2016. Available from: http://hdl.handle.net/2142/90824
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

University of Illinois – Urbana-Champaign
5.
Huang, Sitao.
Hardware acceleration of the pair HMM algorithm for DNA variant calling.
Degree: MS, Electrical & Computer Engr, 2017, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/97496
With the advent of several accurate and sophisticated statistical algorithms and pipelines for DNA sequence analysis, it is becoming increasingly possible to translate raw sequencing data into biologically meaningful information for further clinical analysis and processing. However, given the large volume of the data involved, even modestly complex algorithms would require a prohibitively long time to complete. Hence it is urgent to explore non-conventional implementation platforms to accelerate genomics research.
In this thesis, we present a Field-Programmable Gate Array (FPGA) accelerated implementation of the Pair Hidden Markov Model (Pair HMM) forward algorithm, the performance bottleneck in the HaplotypeCaller, a critical function in the popular Genome Analysis Toolkit (GATK) variant calling tool. We introduce the PE ring structure which, thanks to the fine-grained parallelism allowed by the FPGA, can be built into various configurations striking a trade-off between Instruction-Level Parallelism (ILP) and data parallelism. We investigate the resource utilization and performance of different configurations. Our solution can achieve a speed-up of up to 487x compared to the C++ baseline implementation on CPU and 1.56x compared to the previous best hardware implementation.
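To make the data dependences concrete, here is a schematic CPU reference of a pair-HMM forward recurrence of the kind the PE ring pipelines along anti-diagonals. The transition probabilities (tMM, tIM, ...) and the match/mismatch priors are placeholders; in GATK they are derived from per-base quality scores, and the initialization and termination conventions shown here are only approximate.

#include <string>
#include <vector>
double pairHmmForward(const std::string &read, const std::string &hap,
                      double tMM, double tIM, double tDM,   // transitions into Match
                      double tMI, double tII,               // transitions into Insert
                      double tMD, double tDD,               // transitions into Delete
                      double pMatch, double pMismatch)
{
    size_t R = read.size(), H = hap.size();
    std::vector<std::vector<double>> M(R + 1, std::vector<double>(H + 1, 0.0));
    std::vector<std::vector<double>> I = M, D = M;
    for (size_t j = 0; j <= H; ++j) D[0][j] = 1.0 / H;   // free start anywhere on the haplotype
    for (size_t i = 1; i <= R; ++i)
        for (size_t j = 1; j <= H; ++j) {
            double prior = (read[i - 1] == hap[j - 1]) ? pMatch : pMismatch;
            M[i][j] = prior * (tMM * M[i-1][j-1] + tIM * I[i-1][j-1] + tDM * D[i-1][j-1]);
            I[i][j] = tMI * M[i-1][j] + tII * I[i-1][j];   // consumes a read base only
            D[i][j] = tMD * M[i][j-1] + tDD * D[i][j-1];   // consumes a haplotype base only
        }
    double likelihood = 0.0;
    for (size_t j = 1; j <= H; ++j) likelihood += M[R][j] + I[R][j];
    return likelihood;
}

Because cell (i, j) depends only on cells (i-1, j-1), (i-1, j), and (i, j-1), all cells on an anti-diagonal are independent, which is the parallelism a PE ring can exploit.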
Advisors/Committee Members: Chen, Deming (advisor), Hwu, Wen-Mei M. (advisor).
Subjects/Keywords: Hardware acceleration; Field-programmable gate array (FPGA); Forward algorithm; Pair hidden Markov model (HMM); Computational genomics; Processing element (PE) ring
APA (6th Edition):
Huang, S. (2017). Hardware acceleration of the pair HMM algorithm for DNA variant calling. (Thesis). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/97496
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Huang, Sitao. “Hardware acceleration of the pair HMM algorithm for DNA variant calling.” 2017. Thesis, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/97496.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Huang, Sitao. “Hardware acceleration of the pair HMM algorithm for DNA variant calling.” 2017. Web. 13 Apr 2021.
Vancouver:
Huang S. Hardware acceleration of the pair HMM algorithm for DNA variant calling. [Internet] [Thesis]. University of Illinois – Urbana-Champaign; 2017. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/97496.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Huang S. Hardware acceleration of the pair HMM algorithm for DNA variant calling. [Thesis]. University of Illinois – Urbana-Champaign; 2017. Available from: http://hdl.handle.net/2142/97496
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

University of Illinois – Urbana-Champaign
6.
AlMasri, Mohammad.
On implementing sparse matrix-vector multiplication on intel platform.
Degree: MS, Electrical & Computer Engr, 2018, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/101729
Sparse matrix-vector multiplication, SpMV, can be a performance bottleneck in iterative solvers and algebraic eigenvalue problems. In this thesis, we present our sparse matrix compressed chunk storage format (CCF) and an SpMV CCF kernel that realizes high performance on Intel Xeon multicore and Phi processors for unstructured matrices. The CCF kernel exploits the properties of CCF to enhance load balancing and SIMD efficiency. Moreover, we present the CCF auto-tuner, which selects the most effective parameters and SpMV kernel to achieve the highest possible performance that CCF can attain on a target architecture. Using 151 unstructured matrices from 38 application areas, we compare the performance of the CCF kernel to that of MKL 2018u1 SpMV CSR, MKL 2018u2 Inspector-executor SpMV CSR, and Compressed Vectorization-oriented sparse Row (CVR) SpMV. We execute the kernels on a dual 24-core Skylake Xeon Platinum 8160 and a 68-core KNL Xeon Phi 7250. Executing on the dual 24-core Skylake Xeon Platinum 8160, and compared to MKL SpMV CSR, our kernel achieves superior execution throughput for 135 matrices (89%), with an average speed improvement of 2.3x and a maximum of 27.5x. Our kernel outperforms MKL Inspector-executor SpMV CSR for 109 matrices (73%), with an average speed improvement of 1.5x and a maximum of 3.0x. Moreover, SpMV CCF outperforms SpMV CVR for 81% of the matrices, with an average speed improvement of 1.8x and a maximum of 4.2x. Executing on the 68-core KNL Xeon Phi 7250, CCF achieves high average and maximum speed improvements compared to the other three kernels, but for slightly smaller percentages of matrices. Lastly, we show that auto-tuning CCF parameters improves performance for more than 50 matrices compared to the default CCF on Skylake and KNL, with an average speed improvement of 1.2x.
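For reference, the kernels being compared all compute y = A·x for a sparse matrix A; below is a plain OpenMP CSR loop of the kind the MKL CSR routines implement conceptually. It is a generic baseline for illustration, not the thesis's CCF kernel, whose chunked format and auto-tuned parameters are what deliver the reported speedups.

#include <omp.h>
// CSR sparse matrix-vector multiply: y = A * x.
// rowPtr has numRows+1 entries; colIdx/val hold the nonzeros row by row.
void spmvCsr(int numRows, const int *rowPtr, const int *colIdx,
             const double *val, const double *x, double *y)
{
    #pragma omp parallel for schedule(dynamic, 64)   // rows vary widely in length
    for (int r = 0; r < numRows; ++r) {
        double sum = 0.0;
        for (int k = rowPtr[r]; k < rowPtr[r + 1]; ++k)
            sum += val[k] * x[colIdx[k]];
        y[r] = sum;
    }
}

The load imbalance across rows and the irregular, gather-style access to x visible here are the load-balancing and SIMD-efficiency issues that CCF is designed to improve.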
Advisors/Committee Members: Hwu, Wen-Mei W. (advisor), Abu-Sufah, Walid (advisor).
Subjects/Keywords: SpMV; SIMD; CCF; CSR; I-e; MKL; OpenMP; Skylake; KNL
APA (6th Edition):
AlMasri, M. (2018). On implementing sparse matrix-vector multiplication on intel platform. (Thesis). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/101729
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
AlMasri, Mohammad. “On implementing sparse matrix-vector multiplication on intel platform.” 2018. Thesis, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/101729.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
AlMasri, Mohammad. “On implementing sparse matrix-vector multiplication on intel platform.” 2018. Web. 13 Apr 2021.
Vancouver:
AlMasri M. On implementing sparse matrix-vector multiplication on intel platform. [Internet] [Thesis]. University of Illinois – Urbana-Champaign; 2018. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/101729.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
AlMasri M. On implementing sparse matrix-vector multiplication on intel platform. [Thesis]. University of Illinois – Urbana-Champaign; 2018. Available from: http://hdl.handle.net/2142/101729
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

University of Illinois – Urbana-Champaign
7.
Li, Cheng.
Performance benchmarking, analysis, and optimization of deep learning inference.
Degree: PhD, Computer Science, 2020, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/108505
The world sees a proliferation of deep learning (DL) models and their wide adoption in different application domains. This has made the performance benchmarking, understanding, and optimization of DL inference an increasingly pressing task for both hardware designers and system providers, as they would like to offer the best possible computing system to serve DL models with the desired latency, throughput, and energy requirements while maximizing resource utilization. However, DL faces the following challenges in performance engineering.
Benchmarking — While there have been significant efforts to develop benchmark suites that evaluate widely used DL models, developing, maintaining, and running benchmarks takes a non-trivial amount of effort, and DL benchmarking has been hampered in part by the lack of representative and up-to-date benchmark suites.
Performance Understanding — Understanding the performance of DL workloads is challenging because their characteristics depend on the interplay between the models, frameworks, system libraries, and hardware (the HW/SW stack). Existing profiling tools are disjoint, however, and focus only on profiling within a particular level of the stack. This largely limits the types of analysis that can be performed on model execution.
Optimization Advising — The current DL optimization process is manual and ad hoc, requiring a great deal of effort and expertise. Existing tools lack the highly desired abilities to characterize ideal performance, identify sources of inefficiency, and quantify the benefits of potential optimizations. Such deficiencies have led to slow DL characterization/optimization cycles that cannot keep up with the fast pace at which new DL innovations are introduced.
Evaluation and Comparison — The current DL landscape is fast-paced and rife with non-uniform models and hardware/software (HW/SW) stacks, yet it lacks a DL benchmarking platform to facilitate the evaluation and comparison of DL innovations, be they models, frameworks, libraries, or hardware. Because of this gap, the current practice of evaluating the benefits of proposed DL innovations is both arduous and error-prone, stifling the adoption of the innovations.
This thesis addresses the above challenges in DL performance engineering. First, we introduce DLBricks, a composable benchmark generation design that reduces the effort of developing, maintaining, and running DL benchmarks. DLBricks decomposes DL models into a set of unique runnable networks and constructs the original model’s performance using the performance of the generated benchmarks. Then, we present XSP, an across-stack profiling design that correlates profiles from different sources to obtain a holistic and hierarchical view of DL model execution. XSP innovatively leverages distributed tracing and accurately captures the profiles at each level of the HW/SW stack in spite of profiling overhead. Next, we propose Benanza, a systematic DL benchmarking and analysis design that guides researchers to potential optimization…
Advisors/Committee Members: Hwu, Wen-mei (advisor), Hwu, Wen-mei (Committee Chair), Fletcher, Christopher (committee member), Padua, David (committee member), Tan, Wei (committee member).
Subjects/Keywords: deep learning; machine learning; performance analysis; benchmarking; optimization
APA (6th Edition):
Li, C. (2020). Performance benchmarking, analysis, and optimization of deep learning inference. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/108505
Chicago Manual of Style (16th Edition):
Li, Cheng. “Performance benchmarking, analysis, and optimization of deep learning inference.” 2020. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/108505.
MLA Handbook (7th Edition):
Li, Cheng. “Performance benchmarking, analysis, and optimization of deep learning inference.” 2020. Web. 13 Apr 2021.
Vancouver:
Li C. Performance benchmarking, analysis, and optimization of deep learning inference. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2020. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/108505.
Council of Science Editors:
Li C. Performance benchmarking, analysis, and optimization of deep learning inference. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2020. Available from: http://hdl.handle.net/2142/108505

University of Illinois – Urbana-Champaign
8.
Kim, Hee-Seok.
Compiler and runtime techniques for bulk-synchronous programming models on CPU architectures.
Degree: PhD, Electrical & Computer Engineering, 2015, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/88987
The rising pressure to simultaneously improve performance and reduce power consumption is driving more heterogeneity into all aspects of computing devices. However, wide adoption of specialized computing devices such as GPUs and Xeon Phis comes with a programming challenge: a carefully optimized program that is well matched to the target hardware can run many times faster and more energy efficiently than one that is not. Ideally, programmers would write their code using a single programming model, and the compiler would transform the program to run optimally on the target architecture. In practice, however, programmers must expend great effort to translate the performance enjoyed on one platform to another. As such, single-source code-based portability has gained substantial momentum, and OpenCL, a bulk-synchronous programming language, has become a popular choice, among others, to fulfill the need for portability. The computing model assumed by such languages is inevitably loosely coupled with the underlying architecture, obligating a combined compiler and runtime to find an efficient execution mapping from the input program onto the architecture that best exploits the hardware for performance.
In this dissertation, I argue and demonstrate that obtaining high performance from executing OpenCL programs on CPUs is feasible. To achieve this goal, I present compiler and runtime techniques for executing OpenCL programs on CPU architectures.
First, I propose a compiler technique in which the execution of fine-grained parallel threads, called work-items, is collectively analyzed to consider the impact of scheduling them with respect to data locality. By analyzing the memory addresses accessed in a kernel, the technique can make better decisions on how to schedule work-items so as to construct better memory access patterns, thereby improving performance. The approach achieves geomean speedups of 3.32x over AMD's and 1.71x over Intel's state-of-the-art implementations on the Parboil and Rodinia benchmarks.
Second, I propose a runtime that allows the compiler to deposit differently optimized kernels, mitigating the pressure on the compiler to derive the single most optimal code. The runtime systematically deploys the candidate kernels on a small portion of the actual data to determine which one achieves the best performance for the hardware-data combination. It exploits the fact that OpenCL programs typically come with a large number of independent work-groups, a feature that amortizes the cost of profiling the execution of a few work-items, and the overhead is further reduced by retaining the profiling execution result as part of the final execution output. The proposed runtime incurs an average overhead of 3% in execution time compared to an ideal/oracular runtime.
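A rough sketch of the underlying idea (my illustration, not the dissertation's compiler output): to run a bulk-synchronous kernel on a CPU, a work-group's work-items are serialized inside loops, and the order chosen for those loops determines the memory access pattern. Iterating the x-id innermost gives consecutive work-items consecutive addresses for the row-major kernel body assumed below; barriers would require splitting the loops, which is not shown.

// One CPU thread executes an entire work-group by looping over its work-items.
void runWorkGroup(int groupX, int groupY, int localDimX, int localDimY,
                  const float *in, float *out, int width)
{
    for (int ly = 0; ly < localDimY; ++ly)        // work-item loops replace
        for (int lx = 0; lx < localDimX; ++lx) {  // hardware thread scheduling
            int gx = groupX * localDimX + lx;     // reconstructed global ids
            int gy = groupY * localDimY + ly;
            out[gy * width + gx] = 2.0f * in[gy * width + gx];  // the "kernel body"
        }
}

Choosing such a loop order (or a tiling of it) per kernel is the kind of decision the address analysis described above informs.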
Advisors/Committee Members: Hwu, Wen-mei (advisor), Hwu, Wen-mei (Committee Chair), Boppart, Stephen (committee member), Chen, Deming (committee member), Lumetta, Steven (committee member).
Subjects/Keywords: OpenCL; compiler; Central Processing Unit (CPU); performance portability; data locality
APA (6th Edition):
Kim, H. (2015). Compiler and runtime techniques for bulk-synchronous programming models on CPU architectures. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/88987
Chicago Manual of Style (16th Edition):
Kim, Hee-Seok. “Compiler and runtime techniques for bulk-synchronous programming models on CPU architectures.” 2015. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/88987.
MLA Handbook (7th Edition):
Kim, Hee-Seok. “Compiler and runtime techniques for bulk-synchronous programming models on CPU architectures.” 2015. Web. 13 Apr 2021.
Vancouver:
Kim H. Compiler and runtime techniques for bulk-synchronous programming models on CPU architectures. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2015. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/88987.
Council of Science Editors:
Kim H. Compiler and runtime techniques for bulk-synchronous programming models on CPU architectures. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2015. Available from: http://hdl.handle.net/2142/88987
9.
Pearson, Carl.
Heterogeneous system and application communication modeling.
Degree: MS, Electrical & Computer Engr, 2018, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/101501
With the end of Dennard scaling, high-performance computing increasingly relies on heterogeneous systems with specialized hardware to improve application performance. This trend has driven up the complexity of high-performance software development, as developers must manage multiple programming systems and develop system-tuned code to utilize specialized hardware. It has also exacerbated existing challenges of data placement, as the specialized hardware often has local memories to fuel its computational demands. In addition to using appropriate software resources to target application computation at the best hardware for the job, application developers now must manage data movement and placement within their application, which also must be specifically tuned to the target system. Instead of relying on the application developer to have specialized knowledge of system characteristics and specialized expertise in multiple programming systems, this work proposes a heterogeneous system communication library that automatically chooses data location and data movement for high-performance application development and execution on heterogeneous systems. This work presents the foundational components of that library: a systematic approach for characterization of system communication links and application communication demands.
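One concrete piece of the link characterization described above can be sketched as follows (an illustrative micro-benchmark, not the thesis's tool): time a host-to-device copy with CUDA events and report the effective bandwidth. The real characterization spans many link types (NUMA, PCIe, NVLink, peer-to-peer) and a sweep of transfer sizes.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ull << 20;           // one 256 MiB transfer
    char *host = nullptr, *dev = nullptr;
    cudaMallocHost((void **)&host, bytes);       // pinned host buffer
    cudaMalloc((void **)&dev, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host-to-device bandwidth: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}

Repeating this across transfer sizes, sockets, and devices yields the kind of link characterization the thesis describes.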
Advisors/Committee Members: Hwu, Wen-Mei (advisor).
Subjects/Keywords: GPGPU; CUDA; NVLink; Benchmark; PCIe; NUMA
APA (6th Edition):
Pearson, C. (2018). Heterogeneous system and application communication modeling. (Thesis). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/101501
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Pearson, Carl. “Heterogeneous system and application communication modeling.” 2018. Thesis, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/101501.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Pearson, Carl. “Heterogeneous system and application communication modeling.” 2018. Web. 13 Apr 2021.
Vancouver:
Pearson C. Heterogeneous system and application communication modeling. [Internet] [Thesis]. University of Illinois – Urbana-Champaign; 2018. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/101501.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Pearson C. Heterogeneous system and application communication modeling. [Thesis]. University of Illinois – Urbana-Champaign; 2018. Available from: http://hdl.handle.net/2142/101501
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
10.
Zhu, Xinrui.
MLModelScope website development with react.
Degree: MS, Electrical & Computer Engr, 2019, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/105716
With the rapid growth of the MLModelScope project, there is an urgent need for a user interface that shows users the available resources and functionalities and facilitates their experiments. This project set out to develop a website for MLModelScope with React, providing an interactive user interface that easily demonstrates its functionalities. The platform also allows other users to run experiments without setting up the whole system.
In this thesis, we first give an introduction to MLModelScope and React and state the goal of this project. A discussion follows that describes the whole design and development process and the problems we faced at each stage. Next, we dive into the details of the technologies we used and how we dealt with some technical challenges. Finally, we present our opinions about this project, our thoughts on developing in React, and plans for further development.
Advisors/Committee Members: Hwu, Wen-Mei (advisor).
Subjects/Keywords: React; User Interface; Machine Learning; MLModelScope
APA (6th Edition):
Zhu, X. (2019). MLModelScope website development with react. (Thesis). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/105716
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Zhu, Xinrui. “MLModelScope website development with react.” 2019. Thesis, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/105716.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Zhu, Xinrui. “MLModelScope website development with react.” 2019. Web. 13 Apr 2021.
Vancouver:
Zhu X. MLModelScope website development with react. [Internet] [Thesis]. University of Illinois – Urbana-Champaign; 2019. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/105716.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Zhu X. MLModelScope website development with react. [Thesis]. University of Illinois – Urbana-Champaign; 2019. Available from: http://hdl.handle.net/2142/105716
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
11.
Srivastava, Abhishek.
Performance evaluation of deep learning on smartphones.
Degree: MS, Computer Science, 2019, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/106260
Deep learning powers a variety of applications, from self-driving cars and autonomous robotics to web search and voice assistants. It is fair to say that it is omnipresent and here to stay. It is deployed in all sorts of devices, ranging from consumer electronics to the Internet of Things (IoT); such a deployment is categorized as inference at the edge. This thesis focuses on deep learning on one such edge device: the mobile phone. The thesis surveys the space of deep learning deployment on mobile devices and identifies three key problems: (a) lack of a common programming interface, (b) dearth of benchmarking systems, and (c) shortage of in-depth performance evaluation. It then provides a solution to each of them by (a) providing a common interface derived from MLModelScope, referred to as the mobile Predictor (mPredictor), (b) providing a benchmarking application, and (c) using the aforementioned mPredictor and benchmarking application to perform a detailed evaluation. This work has been developed to assist mobile developers in integrating deep learning services into their applications.
Advisors/Committee Members: Hwu, Wen-Mei (advisor).
Subjects/Keywords: Deep Learning; Benchmarking; Performance Evaluation; Mobile Devices
APA (6th Edition):
Srivastava, A. (2019). Performance evaluation of deep learning on smartphones. (Thesis). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/106260
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Srivastava, Abhishek. “Performance evaluation of deep learning on smartphones.” 2019. Thesis, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/106260.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Srivastava, Abhishek. “Performance evaluation of deep learning on smartphones.” 2019. Web. 13 Apr 2021.
Vancouver:
Srivastava A. Performance evaluation of deep learning on smartphones. [Internet] [Thesis]. University of Illinois – Urbana-Champaign; 2019. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/106260.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Srivastava A. Performance evaluation of deep learning on smartphones. [Thesis]. University of Illinois – Urbana-Champaign; 2019. Available from: http://hdl.handle.net/2142/106260
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation

University of Illinois – Urbana-Champaign
12.
Wu, Xiao-Long.
Tiger: tiled iterative genome assembler and approximate multi-genome aligner.
Degree: PhD, 1200, 2013, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/45618
Sequence assembly and alignment are two important stepping stones for comparative genomics. With the fast adoption of next-generation sequencing (NGS) technologies and the coming of third-generation sequencing (TGS) technologies, genomics has provided us with an unprecedented opportunity to answer fundamental questions in biology and elucidate human diseases. However, most de novo assemblers require an enormous amount of computational resources, which are not readily available to most research groups and medical personnel. Moreover, there has been little progress on sequence assembly quality, especially for genomes with high repetition. As more affordable raw data and assembled genomes become accessible to the community, there is an emerging demand for genome searches among the large number of divergent genomes in gene banks. The genomes can be in the form of raw reads, unfinished/low-quality assemblies, or completed genomes, for which traditional multi-sequence alignment tools may not be suitable for similarity searches. Yet there are few research studies aiming to meet this demand.
We have developed a novel de novo assembly framework, called the Tiger assembler, which adapts to available computing resources by iteratively decomposing the assembly problem into sub-problems. Our method can flexibly embed different assemblers for various types of target genomes. Using the sequence data from a human chromosome, our results show that Tiger can achieve much better NG50s, better genome coverage, and slightly higher errors, as compared to Velvet and SOAPdenovo, using a modest amount of memory that is available in commodity computers today. We also experimented with a real de novo assembly, i.e., the E. mexicana genome, and demonstrated the strength of our work. The N50s of our contigs and scaffolds by Tiger were 7 and 57 times longer than those by SOAPdenovo. On the other hand, the assembly produced by ALLPATHS-LG was only one-third of the genome size.
We also developed a multi-genome sequence aligner, called the Tiger aligner, which can perform fast similarity checks among multiple genomes with distant biological relationships and low-quality raw data. Practical applications of our tool are demonstrated through experiments. The performance of the Tiger aligner on traditional multi-sequence alignments is also compared against existing works, MUMmer and SOAPaligner. The results show the practicality and strengths of our tool.
Most state-of-the-art assemblers that can achieve relatively high assembly quality need an excessive amount of computing resources (in particular, memory) that is not readily available to most researchers. The Tiger assembler provides the only known viable path to utilizing NGS de novo assemblers that require more memory than is present in available computers. Evaluation results demonstrate the feasibility of getting better-quality results with a low memory footprint and the scalability of using distributed commodity computers.
The quantity explosion of genomes makes existing multi-sequence aligners…
Advisors/Committee Members: Hwu, Wen-Mei W. (advisor), Hwu, Wen-Mei W. (Committee Chair), Ma, Jian (committee member), Chen, Deming (committee member), Liang, Zhi-Pei (committee member), Robinson, Gene E. (committee member).
Subjects/Keywords: De novo genome assembly; next-generation sequencing; third-generation sequencing; iterative genome assembler; read partitioning; Multiple sequence alignment; multiple genome alignment
APA (6th Edition):
Wu, X. (2013). Tiger: tiled iterative genome assembler and approximate multi-genome aligner. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/45618
Chicago Manual of Style (16th Edition):
Wu, Xiao-Long. “Tiger: tiled iterative genome assembler and approximate multi-genome aligner.” 2013. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/45618.
MLA Handbook (7th Edition):
Wu, Xiao-Long. “Tiger: tiled iterative genome assembler and approximate multi-genome aligner.” 2013. Web. 13 Apr 2021.
Vancouver:
Wu X. Tiger: tiled iterative genome assembler and approximate multi-genome aligner. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2013. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/45618.
Council of Science Editors:
Wu X. Tiger: tiled iterative genome assembler and approximate multi-genome aligner. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2013. Available from: http://hdl.handle.net/2142/45618

University of Illinois – Urbana-Champaign
13.
Chang, Li-Wen.
Toward performance portability for CPUS and GPUS through algorithmic compositions.
Degree: PhD, Electrical & Computer Engr, 2017, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/98331
The diversity of microarchitecture designs in heterogeneous computing systems allows programs to achieve high performance and energy efficiency, but it results in substantial software redevelopment cost for each type or generation of hardware. To mitigate this cost, a performance-portable programming system is required.
This work presents my solution to the performance portability problem. I argue that a new language is required to replace current programming practices and achieve practical performance portability. To support my argument, I first demonstrate the limited performance portability of current practices with quantitative and qualitative evidence, and I identify the main limiting issues of conventional programming languages. To overcome these issues, I propose a new modular, composition-based programming language that can effectively express an algorithmic design space with functional polymorphism, together with a compiler that can effectively explore the design space and facilitate many high-level optimization techniques. The proposed approach achieves no less than 70% of the performance of highly optimized vendor libraries such as Intel MKL and NVIDIA CUBLAS/CUSPARSE on an Intel i7-3820 Sandy Bridge CPU, an NVIDIA C2050 Fermi GPU, and an NVIDIA K20c Kepler GPU.
Advisors/Committee Members: Hwu, Wen-mei W. (advisor), Hwu, Wen-mei W. (Committee Chair), Chen, Deming (committee member), Kim, Nam Sung (committee member), Lumetta, Steven S. (committee member).
Subjects/Keywords: Performance portability; Algorithmic composition; Parallel programming; TANGRAM; Programming language; Compiler; Graphics processing units (GPUs); Central processing units (CPUs); Open Computing Language (OpenCL); Open Multi-Processing (OpenMP); Open Accelerators (OpenACC); C++ Accelerated Massive Parallelism (C++AMP)
APA (6th Edition):
Chang, L. (2017). Toward performance portability for CPUS and GPUS through algorithmic compositions. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/98331
Chicago Manual of Style (16th Edition):
Chang, Li-Wen. “Toward performance portability for CPUS and GPUS through algorithmic compositions.” 2017. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/98331.
MLA Handbook (7th Edition):
Chang, Li-Wen. “Toward performance portability for CPUS and GPUS through algorithmic compositions.” 2017. Web. 13 Apr 2021.
Vancouver:
Chang L. Toward performance portability for CPUS and GPUS through algorithmic compositions. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2017. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/98331.
Council of Science Editors:
Chang L. Toward performance portability for CPUS and GPUS through algorithmic compositions. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2017. Available from: http://hdl.handle.net/2142/98331

University of Illinois – Urbana-Champaign
14.
Yu, Jiahui.
Towards efficient, on-demand and automated deep learning.
Degree: PhD, Electrical & Computer Engr, 2020, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/107845
In the past decade, deep learning has achieved great breakthroughs on tasks in computer vision, speech, language, control, and many other areas. Advanced, dedicated computing chips, like the Nvidia GPU and Google TPU, have largely contributed to and broadened this success. However, the requirement of large computing power impedes the deployment of deep learning methods in many real scenarios where cost, time, and energy efficiency are critical – for example, self-driving cars, AR/VR kits, internet-of-things devices, and mobile phones. This thesis presents a series of in-depth research efforts towards efficient, on-demand and automated deep learning.
Advisors/Committee Members: Huang, Thomas S. (advisor), Huang, Thomas S. (Committee Chair), Liang, Zhi-Pei (committee member), Hwu, Wen-Mei (committee member), Lin, Zhe (committee member).
Subjects/Keywords: efficient; on-demand; automated; deep learning; automl
APA (6th Edition):
Yu, J. (2020). Towards efficient, on-demand and automated deep learning. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/107845
Chicago Manual of Style (16th Edition):
Yu, Jiahui. “Towards efficient, on-demand and automated deep learning.” 2020. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/107845.
MLA Handbook (7th Edition):
Yu, Jiahui. “Towards efficient, on-demand and automated deep learning.” 2020. Web. 13 Apr 2021.
Vancouver:
Yu J. Towards efficient, on-demand and automated deep learning. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2020. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/107845.
Council of Science Editors:
Yu J. Towards efficient, on-demand and automated deep learning. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2020. Available from: http://hdl.handle.net/2142/107845

University of Illinois – Urbana-Champaign
15.
Papakonstantinou, Alexandros.
High-level automation of custom hardware design for high-performance computing.
Degree: PhD, 1200, 2013, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/42137
This dissertation focuses on efficient generation of custom processors from high-level language descriptions. Our work exploits compiler-based optimizations and transformations in tandem with high-level synthesis (HLS) to build high-performance custom processors. The goal is to offer a common multiplatform high-abstraction programming interface for heterogeneous compute systems where the benefits of custom reconfigurable (or fixed) processors can be exploited by application developers.
The research presented in this dissertation supports the following thesis: In an increasingly heterogeneous compute environment it is important to leverage the compute capabilities of each heterogeneous processor efficiently. In the case of FPGA and ASIC accelerators this can be achieved through HLS-based flows that (i) extract parallelism at coarser than basic block granularities, (ii) leverage common high-level parallel programming languages, and (iii) employ high-level source-to-source transformations to generate high-throughput custom processors.
First, we propose a novel HLS flow that extracts instruction level parallelism beyond the boundary of basic blocks from C code. Subsequently, we describe FCUDA, an HLS-based framework for mapping fine-grained and coarse-grained parallelism from parallel CUDA kernels onto spatial parallelism. FCUDA provides a common programming model for acceleration on heterogeneous devices (i.e. GPUs and FPGAs). Moreover, the FCUDA framework balances multilevel granularity parallelism synthesis using efficient techniques that leverage fast and accurate estimation models (i.e. do not rely on lengthy physical implementation tools). Finally, we describe an advanced source-to-source transformation framework for throughput-driven parallelism synthesis (TDPS), which appropriately restructures CUDA kernel code to maximize throughput on FPGA devices. We have integrated the TDPS framework into the FCUDA flow to enable automatic performance porting of CUDA kernels designed for the GPU architecture onto the FPGA architecture.
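The FCUDA idea summarized above is a source-to-source restructuring of CUDA kernels into C code whose loops expose thread-block parallelism to an HLS tool. The sketch below is not FCUDA output; it is a minimal, hedged illustration of the "thread-loop" concept for a vector-add kernel, and the HLS-style pragma shown is illustrative rather than taken from the dissertation.

```cpp
#include <vector>
#include <cstdio>

// Original CUDA kernel (conceptually):
//   __global__ void vadd(const float* a, const float* b, float* c)
//   { int i = blockIdx.x * blockDim.x + threadIdx.x; c[i] = a[i] + b[i]; }
//
// An FCUDA-style restructuring replaces the implicit thread index with an
// explicit "thread loop" that an HLS tool can unroll or pipeline.
void vadd_hls(const float* a, const float* b, float* c,
              int blockDim, int gridDim) {
  for (int blk = 0; blk < gridDim; ++blk) {       // one iteration per thread block
    for (int tid = 0; tid < blockDim; ++tid) {    // explicit thread loop
      // #pragma HLS pipeline  (illustrative directive; ignored by g++/clang)
      int i = blk * blockDim + tid;
      c[i] = a[i] + b[i];
    }
  }
}

int main() {
  const int blockDim = 4, gridDim = 2, n = blockDim * gridDim;
  std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
  vadd_hls(a.data(), b.data(), c.data(), blockDim, gridDim);
  std::printf("c[0] = %.1f\n", c[0]);  // expect 3.0
  return 0;
}
```

The actual FCUDA flow additionally partitions arrays, inserts data-transfer tasks, and tunes unrolling degrees; none of that is shown here.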
Advisors/Committee Members: Chen, Deming (advisor),
Chen, Deming (Committee Chair),
Cong, Jason (committee member),
Hwu, Wen-Mei W. (committee member),
Wong, Martin D.F. (committee member).
Subjects/Keywords: High-level synthesis; Field-Programmable Gate Array (FPGA); CUDA; parallel programming; High Performance Computing (HPC)
APA (6th Edition):
Papakonstantinou, A. (2013). High-level automation of custom hardware design for high-performance computing. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/42137
Chicago Manual of Style (16th Edition):
Papakonstantinou, Alexandros. “High-level automation of custom hardware design for high-performance computing.” 2013. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/42137.
MLA Handbook (7th Edition):
Papakonstantinou, Alexandros. “High-level automation of custom hardware design for high-performance computing.” 2013. Web. 13 Apr 2021.
Vancouver:
Papakonstantinou A. High-level automation of custom hardware design for high-performance computing. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2013. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/42137.
Council of Science Editors:
Papakonstantinou A. High-level automation of custom hardware design for high-performance computing. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2013. Available from: http://hdl.handle.net/2142/42137

University of Illinois – Urbana-Champaign
16.
Lin, Chen-Hsuan.
Design automation for circuit reliability and energy efficiency.
Degree: PhD, Electrical & Computer Engr, 2017, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/99237
This dissertation presents approaches to improve circuit reliability and energy efficiency from different angles, such as verification, logic synthesis, and functional unit design. A variety of algorithmic methods and heuristics are used in our approaches, such as SAT solving, data mining, logic restructuring, and applied mathematics. Furthermore, the scalability of our approaches was taken into account while we developed our solutions.
Experimental results show that our approaches offer the following advantages: 1) SAT-BAG can generate concise assertions that always achieve 100% input space coverage. 2) C-Mine-DCT, compared to a recent publication, can achieve comparable performance with an additional 8% energy saving and a 54x speedup for larger benchmarks on average. 3) C-Mine-APR can achieve up to 13% more energy saving than C-Mine-DCT when confronting designs with more common cases. 4) CSL can achieve 6.5% NBTI delay reduction with merely 2.5% area overhead on average. 5) Our modulo functional units, compared to a previous approach, achieve a 12.5% reduction in area and a 47.1% reduction in delay for a 32-bit mod-3 reducer. For modulo-15 and above, all of our modulo functional units have better area and delay than their previous counterparts.
Advisors/Committee Members: Chen, Deming (advisor),
Chen, Deming (Committee Chair),
Hwu, Wen-Mei (committee member),
Rutenbar, Rob A. (committee member),
Wong, Martin D. F. (committee member).
Subjects/Keywords: Electronic design automation; Reliability; Energy efficiency; Data mining; Satisfiability (SAT) solving; Logic restructuring; Assertion; Negative bias temperature instability (NBTI) effect; Modulo arithmetic; Shadow datapath
APA (6th Edition):
Lin, C. (2017). Design automation for circuit reliability and energy efficiency. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/99237
Chicago Manual of Style (16th Edition):
Lin, Chen-Hsuan. “Design automation for circuit reliability and energy efficiency.” 2017. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/99237.
MLA Handbook (7th Edition):
Lin, Chen-Hsuan. “Design automation for circuit reliability and energy efficiency.” 2017. Web. 13 Apr 2021.
Vancouver:
Lin C. Design automation for circuit reliability and energy efficiency. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2017. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/99237.
Council of Science Editors:
Lin C. Design automation for circuit reliability and energy efficiency. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2017. Available from: http://hdl.handle.net/2142/99237

University of Illinois – Urbana-Champaign
17.
Hwang, Leslie K.
Thermal designs, models and optimization for three-dimensional integrated circuits.
Degree: PhD, Electrical & Computer Engr, 2018, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/102951
Three-dimensional integrated circuits (3D ICs), a novel packaging technology, are heavily studied to enable improved performance with denser packaging and reduced interconnects. Despite numerous advantages, thermal management is the biggest bottleneck to expanding the applications of this device-stacking technology. In addition to implementing the thermal-aware designs of existing methodologies, it is necessary to implement new features to dissipate heat efficiently.
This work presents two main aspects of thermal designs: on-chip level and package level. First, we propose a novel thermal-aware physical design on chip between devices. We aim to mitigate localized hotspots to ensure functionality by adding thermal fin geometry to the existing thermal through-silicon via (TTSV). We analyze design requirements of the thermal fin for single-TTSV as well as TTSV-cluster designs with the goal of maximizing heat dissipation while minimizing interference with routing and area consumption. An analytical model of the three-dimensional system and its thermal resistance circuit is built for accurate and runtime-efficient thermal analysis.
In terms of high-performance computing systems in 3D ICs, thermal bottlenecks are much more challenging with merely on-chip design solutions. Inter-tier liquid cooling microchannel layers have been introduced into 3D ICs as an integrated cooling mechanism to tackle the thermal degradation. Many existing research works optimize microchannel designs based on runtime-intensive numerical methods or inaccurate thermo-fluid models. Hence, we propose an accurate but compact closed-form model of the tapered microchannel to capture the relationship between the channel geometry and heat transfer performance. To improve the accuracy, our correlations are based on the developing flow model and derived from numerical simulation data on a subset of multiple channel parameters. Our model achieves 57% less error in Nusselt number and 45% less error in pressure drop for channels with inlet width 100-400 μm compared to a commonly used approximate model on fully developed flow.
Next, we present the correlations for diverging channels as well as complete correlations that extend to any linearly tapering channel model, including diverging, uniformly rectangular, and converging shapes. The complete models provide the flexibility to analyze and optimize any arbitrary geometry based on the piecewise-linear channel wall assumption.
Finally, we demonstrate optimized channel designs using the derived correlations. Tapered channel models provide the flexibility to incorporate arbitrary shapes and to explore advanced geometries during the optimization. The microchannel is divided into small segments in the axial direction from inlet to outlet and piecewise optimized. The simulated annealing method is applied in our optimization, and the channel width at one randomly chosen segment interface is altered to evaluate the design at each iteration. The objective is to minimize the overall thermal…
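As a rough illustration of the segment-wise simulated annealing described above, the sketch below perturbs the width of one randomly chosen channel segment per iteration and accepts or rejects the move with a Metropolis criterion. The cost function and all numeric bounds are placeholders, not the dissertation's thermo-fluid correlations.

```cpp
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Placeholder cost: penalizes narrow segments (a stand-in for poor heat
// transfer) and abrupt width changes between neighbors. NOT the thesis model.
double cost(const std::vector<double>& w) {
  double c = 0.0;
  for (std::size_t i = 0; i < w.size(); ++i) {
    c += 1.0 / w[i];
    if (i > 0) c += 0.5 * std::fabs(w[i] - w[i - 1]);
  }
  return c;
}

int main() {
  std::mt19937 rng(42);
  std::uniform_real_distribution<double> unit(0.0, 1.0);
  std::uniform_int_distribution<int> pick(0, 9);         // which segment to alter
  std::normal_distribution<double> step(0.0, 10.0);      // width perturbation (um)

  std::vector<double> width(10, 200.0);                  // 10 axial segments (um)
  double cur = cost(width);
  for (double T = 50.0; T > 0.1; T *= 0.95) {            // cooling schedule
    for (int iter = 0; iter < 200; ++iter) {
      int s = pick(rng);
      double old = width[s];
      width[s] = std::fmin(400.0, std::fmax(100.0, old + step(rng)));
      double next = cost(width);
      // Metropolis rule: accept improvements, and worse moves with prob e^{-dC/T}.
      if (next < cur || unit(rng) < std::exp((cur - next) / T)) cur = next;
      else width[s] = old;                               // reject: restore width
    }
  }
  std::printf("final cost %.4f, inlet width %.1f um\n", cur, width.front());
  return 0;
}
```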
Advisors/Committee Members: Wong, Martin (advisor),
Wong, Martin (Committee Chair),
Chen, Deming (committee member),
Hwu, Wen-Mei (committee member),
Miljkovic, Nenad (committee member).
Subjects/Keywords: 3D IC; thermal design; TTSV; thermal fin; microchannel; optimization
APA (6th Edition):
Hwang, L. K. (2018). Thermal designs, models and optimization for three-dimensional integrated circuits. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/102951
Chicago Manual of Style (16th Edition):
Hwang, Leslie K. “Thermal designs, models and optimization for three-dimensional integrated circuits.” 2018. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/102951.
MLA Handbook (7th Edition):
Hwang, Leslie K. “Thermal designs, models and optimization for three-dimensional integrated circuits.” 2018. Web. 13 Apr 2021.
Vancouver:
Hwang LK. Thermal designs, models and optimization for three-dimensional integrated circuits. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2018. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/102951.
Council of Science Editors:
Hwang LK. Thermal designs, models and optimization for three-dimensional integrated circuits. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2018. Available from: http://hdl.handle.net/2142/102951

University of Illinois – Urbana-Champaign
18.
Lin, Chun-Xun.
Advances in parallel programming for electronic design automation.
Degree: PhD, Electrical & Computer Engr, 2020, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/108425
The continued miniaturization of the technology node increases not only chip capacity but also circuit design complexity. How does one efficiently design a chip with millions or billions of transistors? This has become a challenging problem in the integrated circuit (IC) design industry, especially for the developers of electronic design automation (EDA) tools. To boost the performance of EDA tools, one promising direction is parallel computing. In this dissertation, we explore different parallel computing approaches, from CPU to GPU to distributed computing, for EDA applications.
Nowadays multi-core processors are prevalent from mobile devices to laptops to desktops, and it is natural for software developers to utilize the available cores to maximize the performance of their applications. Therefore, in this dissertation we first focus on multi-threaded programming. We begin by reviewing a C++ parallel programming library called Cpp-Taskflow. Cpp-Taskflow is designed to facilitate the programming of parallel applications, and has been successfully applied to an EDA timing analysis tool. We demonstrate Cpp-Taskflow's programming model and interface, software architecture, and execution flow. Then, we improve Cpp-Taskflow in several aspects. First, we enhance Cpp-Taskflow's usability by restructuring the software architecture. Second, we introduce task graph composition to support composability and modularity, which makes it easier for users to construct large and complex parallel patterns. Third, we add a new task type in Cpp-Taskflow to let users control the graph execution flow; this feature empowers the graph model with the ability to describe complex control flow. Aside from the above enhancements, we have designed a new scheduler that adaptively manages threads based on the available parallelism. The new scheduler uses a simple and effective strategy that not only prevents resources from being underutilized but also mitigates resource over-subscription. We have evaluated the new scheduler on both micro-benchmarks and a very-large-scale integration (VLSI) application, and the results show that it achieves good performance and is very energy-efficient.
Next we study the applicability of heterogeneous computing, specifically the graphics processing unit (GPU), to EDA. We demonstrate how to use GPUs to accelerate VLSI placement, and show that they can bring substantial performance gains. Finally, as design sizes keep increasing, a more scalable solution is distributed computing. We introduce a distributed power grid analysis framework built on top of DtCraft. This framework allows users to flexibly partition the design and automatically deploy the computations across several machines. In addition, we propose a job scheduler that can efficiently utilize cluster resources to improve the framework's performance.
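Cpp-Taskflow (maintained today as Taskflow) expresses a program as a graph of tasks with explicit dependencies, which an executor then schedules across worker threads. The sketch below follows the present-day open-source Taskflow API, which may differ in detail from the Cpp-Taskflow version described in this dissertation; the task bodies are trivial placeholders.

```cpp
#include <taskflow/taskflow.hpp>  // https://github.com/taskflow/taskflow
#include <iostream>

int main() {
  tf::Executor executor;   // thread pool that runs task graphs
  tf::Taskflow taskflow;   // the task dependency graph

  auto [A, B, C, D] = taskflow.emplace(
    [] { std::cout << "A: load design\n"; },
    [] { std::cout << "B: forward timing propagation\n"; },
    [] { std::cout << "C: backward timing propagation\n"; },
    [] { std::cout << "D: report\n"; }
  );

  A.precede(B, C);   // A runs before B and C (B and C may run in parallel)
  D.succeed(B, C);   // D runs after both B and C finish

  executor.run(taskflow).wait();
  return 0;
}
```

The task-graph composition mentioned above corresponds, in current Taskflow, to embedding one taskflow inside another (e.g., via composed_of), so large parallel patterns can be assembled from smaller modules.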
Advisors/Committee Members: Wong, Martin D F (advisor),
Wong, Martin D F (Committee Chair),
Hwu, Wen-Mei (committee member),
Chen, Deming (committee member),
Xiong, Jinjun (committee member).
Subjects/Keywords: Electronic design automation; Parallel programming
APA (6th Edition):
Lin, C. (2020). Advances in parallel programming for electronic design automation. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/108425
Chicago Manual of Style (16th Edition):
Lin, Chun-Xun. “Advances in parallel programming for electronic design automation.” 2020. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/108425.
MLA Handbook (7th Edition):
Lin, Chun-Xun. “Advances in parallel programming for electronic design automation.” 2020. Web. 13 Apr 2021.
Vancouver:
Lin C. Advances in parallel programming for electronic design automation. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2020. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/108425.
Council of Science Editors:
Lin C. Advances in parallel programming for electronic design automation. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2020. Available from: http://hdl.handle.net/2142/108425

University of Illinois – Urbana-Champaign
19.
Alian, Mohammad.
A cross-stack, network-centric architectural design for next-generation datacenters.
Degree: PhD, Electrical & Computer Engr, 2020, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/108449
This thesis proposes a full-stack, cross-layer datacenter architecture based on the in-network computing and near-memory processing paradigms. The proposed datacenter architecture is built atop two principles: (1) utilizing commodity, off-the-shelf hardware (i.e., processors, DRAM, and network devices) with minimal changes to their architecture, and (2) providing a standard interface to programmers for using the novel hardware. More specifically, the proposed datacenter architecture enables a smart network adapter to collectively compress/decompress the data exchanged between distributed DNN training nodes and to assist the operating system in performing aggressive processor power management. It also deploys specialized memory modules in the servers that are capable of performing general-purpose computation and providing network connectivity.
This thesis unlocks the potential of hardware and operating system co-design in architecting application-transparent, near-data processing hardware for improving datacenter performance, energy efficiency, and scalability. We evaluate the proposed datacenter architecture using a combination of full-system simulation, FPGA prototyping, and real-system experiments.
Advisors/Committee Members: Kim, Nam Sung (advisor),
Kim, Nam Sung (Committee Chair),
Hwu, Wen-mei (committee member),
Torrellas, Josep (committee member),
Kumar, Rakesh (committee member),
Snir, Marc (committee member).
Subjects/Keywords: datacenter architecture; near-data processing; near-memory processing; in-network computing; distributed simulation; datacenter network architecture; scale-out processing
APA (6th Edition):
Alian, M. (2020). A cross-stack, network-centric architectural design for next-generation datacenters. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/108449
Chicago Manual of Style (16th Edition):
Alian, Mohammad. “A cross-stack, network-centric architectural design for next-generation datacenters.” 2020. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/108449.
MLA Handbook (7th Edition):
Alian, Mohammad. “A cross-stack, network-centric architectural design for next-generation datacenters.” 2020. Web. 13 Apr 2021.
Vancouver:
Alian M. A cross-stack, network-centric architectural design for next-generation datacenters. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2020. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/108449.
Council of Science Editors:
Alian M. A cross-stack, network-centric architectural design for next-generation datacenters. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2020. Available from: http://hdl.handle.net/2142/108449

University of Illinois – Urbana-Champaign
20.
Karpuzcu, Rahmet.
Novel many-core architectures for energy-efficiency.
Degree: PhD, 1200, 2012, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/34560
Ideal CMOS device scaling relies on scaling voltages down with lithographic dimensions at every technology generation. This gives rise to faster circuits due to higher frequency and smaller silicon area for the same functionality. The dynamic power density - equivalently, dynamic power, if the chip area is fixed - stays constant. Static power density, on the other hand, increases. In early generations, however, since the share of static power was practically negligible, dynamic power density staying constant translated to total power density staying constant.
This picture has changed recently. To keep the growth of static power under control, the decrease in the threshold voltage has practically stopped. This, in turn, has prevented the supply voltage from scaling. The end effect is an increasing power density over generations, giving rise to the power wall: processor chips can include more cores and accelerators than can be active at any given time - and the situation is getting worse. This effect, known as the utilization wall or dark silicon and induced by the power wall, presents a fundamental challenge that is transforming the many-core architecture landscape.
This dissertation attempts to address the key implication of the power wall problem, dark silicon, in two novel and promising ways: by (1) trading off processor service life for power and performance - the BubbleWrap many-core, and (2) exploring near-threshold voltage operation from an architectural perspective - the Polyomino many-core.
The BubbleWrap many-core assumes as many cores on chip as CMOS transistor density scaling trends suggest, and exploits the resulting implicit redundancy - as not all of the cores can be powered on simultaneously - to extract maximum performance by trading off power and service life on a per-core basis. To achieve this, BubbleWrap continuously tunes the supply voltage over the course of each core's service life, leveraging any aging-induced guard-band instantaneously left, rendering one of the following regimes of operation: minimize power at the same performance level and processor service life; attain the highest performance for the same service life while respecting the given power budget; or attain even higher performance for a shorter service life while respecting the given power budget. Effectively, BubbleWrap runs each core at a closer-to-optimal operating point by always aggressively using up all the aging-induced guard-band that the designers have included - preventing any waste of it.
Another way to dim dark silicon is reducing the supply voltage to a value only slightly higher than the threshold voltage. This regime is called near-threshold voltage (NTV) computing (NTC), as opposed to conventional super-threshold voltage (STV) computing (STC). A major drawback of NTC is a higher susceptibility to parametric variations, namely the deviation of device parameters from their nominal values. To address parametric variations in present and future NTV designs,…
Advisors/Committee Members: Torrellas, Josep (advisor),
Torrellas, Josep (Committee Chair),
Hwu, Wen-Mei W. (committee member),
Patel, Sanjay J. (committee member),
Shanbhag, Naresh R. (committee member),
Kim, Nam Sung (committee member),
Wilkerson, Chris (committee member).
Subjects/Keywords: Power constraints; Dark silicon; Near-threshold voltage; Many-core architectures; Process variations; Static random-access memory (SRAM) fault models; Wear-Out
APA (6th Edition):
Karpuzcu, R. (2012). Novel many-core architectures for energy-efficiency. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/34560
Chicago Manual of Style (16th Edition):
Karpuzcu, Rahmet. “Novel many-core architectures for energy-efficiency.” 2012. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/34560.
MLA Handbook (7th Edition):
Karpuzcu, Rahmet. “Novel many-core architectures for energy-efficiency.” 2012. Web. 13 Apr 2021.
Vancouver:
Karpuzcu R. Novel many-core architectures for energy-efficiency. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2012. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/34560.
Council of Science Editors:
Karpuzcu R. Novel many-core architectures for energy-efficiency. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2012. Available from: http://hdl.handle.net/2142/34560

University of Illinois – Urbana-Champaign
21.
Crago, Neal.
Energy-efficient latency tolerance for 1000-core data parallel processors with decoupled strands.
Degree: PhD, 1200, 2012, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/34589
This dissertation presents a novel decoupled latency tolerance technique for 1000-core data parallel processors. The approach focuses on developing instruction latency tolerance to improve performance for a single thread. The main idea behind the approach is to leverage the compiler to split the original thread into separate memory-accessing and memory-consuming instruction streams. The goal is to provide latency tolerance similar to high-performance techniques such as out-of-order execution while keeping hardware complexity close to that of an in-order execution core.
The research in this dissertation supports the following thesis: Pipeline stalls due to long exposed instruction latency are the main performance limiter for cached 1000-core data parallel processors. Leveraging natural decoupling of memory-access and memory-consumption, a serial thread of execution can be partitioned into strands providing energy-efficient latency tolerance.
This dissertation motivates the need for latency tolerance in 1000-core data parallel processors and presents decoupled core architectures as an alternative to currently used techniques. This dissertation discusses the limitations of prior decoupled architectures, and proposes techniques to improve both latency tolerance and energy-efficiency. Finally, the success of the proposed decoupled architecture is demonstrated against other approaches by performing an exhaustive design space exploration of energy, area, and performance using high-fidelity performance and physical design models.
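The strand partitioning described above follows the classic decoupled access/execute pattern: an access strand issues the (potentially long-latency) loads and streams operand values into a queue, while an execute strand consumes them. The sketch below is a hedged, software-only illustration of that split; in hardware the two strands run concurrently and the queue is an architectural FIFO, and this is not the dissertation's compiler output.

```cpp
#include <cstdio>
#include <deque>
#include <vector>

int main() {
  // Irregular reduction: sum += table[index[i]]
  std::vector<int> table(1000);
  for (int i = 0; i < 1000; ++i) table[i] = i;
  std::vector<int> index = {5, 42, 999, 7, 123, 0, 512, 64};

  std::deque<int> fifo;  // stand-in for the architectural queue between strands

  // Access strand: performs the memory accesses and enqueues operand values.
  for (int idx : index) fifo.push_back(table[idx]);

  // Execute strand: consumes queued values and never stalls on the loads itself.
  long long sum = 0;
  while (!fifo.empty()) { sum += fifo.front(); fifo.pop_front(); }

  std::printf("sum = %lld\n", sum);
  return 0;
}
```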
Advisors/Committee Members: Patel, Sanjay J. (advisor),
Patel, Sanjay J. (Committee Chair),
Hwu, Wen-Mei W. (committee member),
Lumetta, Steven S. (committee member),
Chen, Deming (committee member).
Subjects/Keywords: Parallel Processing; Data-parallel; Graphics processing unit (GPU); General-purpose computing on graphics processing units (GPGPU); manycore; latency tolerance; decoupled architecture; compiler technique; energy-efficiency; power-efficiency; high-performance; low power; low energy
APA (6th Edition):
Crago, N. (2012). Energy-efficient latency tolerance for 1000-core data parallel processors with decoupled strands. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/34589
Chicago Manual of Style (16th Edition):
Crago, Neal. “Energy-efficient latency tolerance for 1000-core data parallel processors with decoupled strands.” 2012. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/34589.
MLA Handbook (7th Edition):
Crago, Neal. “Energy-efficient latency tolerance for 1000-core data parallel processors with decoupled strands.” 2012. Web. 13 Apr 2021.
Vancouver:
Crago N. Energy-efficient latency tolerance for 1000-core data parallel processors with decoupled strands. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2012. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/34589.
Council of Science Editors:
Crago N. Energy-efficient latency tolerance for 1000-core data parallel processors with decoupled strands. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2012. Available from: http://hdl.handle.net/2142/34589

University of Illinois – Urbana-Champaign
22.
Johnson, Daniel.
Multithreaded architectures for manycore throughput processors.
Degree: PhD, 1200, 2013, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/44751
This dissertation describes work on the architecture of throughput-oriented accelerator processors.
First, we examine the limitations of current accelerator processors and identify an opportunity to enable high throughput while also providing a more general-purpose programming model. To address this opportunity, we present Rigel, a single-chip accelerator architecture with 1024 independent processing cores targeted at a broad class of data- and task-parallel computation. Enabled by the feasibility of large die sizes combined with increasing transistor densities, we show that such an aggressive design can be implemented in today's process technology within acceptable area and power limits. We discuss our motivation for such a design and evaluate the performance scalability as well as power and area requirements. We also describe the Rigel memory system, including the Task Centric Memory Model software coherence protocol, the Cohesion hybrid memory model, and lazy atomic operations.
We describe the Rigel toolflow, a set of tools we have developed for evaluating manycore accelerator architectures. The Rigel toolset includes an architectural simulator, LLVM-based compiler, parallel benchmarks, RTL models, and associated infrastructure scripts and toolflows. We have prepared an open-source release of portions of the resulting toolset for the use of the broader research community. Such a release will enable others to perform further work in the area of accelerator design.
We present multi-level scheduling, a technique developed for throughput-oriented graphics processing units (GPUs) and designed to reduce complexity and energy consumption. Modern GPUs employ a large number of hardware threads to hide both long and short latencies. Supporting tens of thousands of hardware threads requires a complex scheduler and a large register file that is expensive to access in terms of energy and latency. With multi-level scheduling, we divide threads into a smaller set of active threads that hide short latencies and a larger set of pending threads that hide long latencies to main memory. By reducing the number of concurrently active threads, we enable more efficient scheduler and register file structures.
Finally, we describe opportunities for employing similar hierarchical multithreading techniques in MIMD accelerator designs such as Rigel. We extend the original Rigel architecture with a new multithreaded microarchitecture. We propose a novel multithreading paradigm that gives the architect a flexible way to scale the number of threads to match the requirements of targeted workloads. We show that this new multithreaded architecture can be implemented efficiently while providing more flexibility to the architect.
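A hedged sketch of the two-level thread pool idea described above: a small active set hides short latencies, and a thread that hits a long-latency miss is parked in a larger pending set until another thread can be promoted into its slot. The structure, sizes, and miss pattern are illustrative only, not Rigel's actual microarchitecture.

```cpp
#include <cstdio>
#include <deque>

int main() {
  const int kActiveSlots = 4;           // small set: cheap scheduler + register file
  std::deque<int> active, pending;      // thread IDs
  for (int t = 0; t < 16; ++t)          // 16 hardware threads total
    (t < kActiveSlots ? active : pending).push_back(t);

  for (int cycle = 0; cycle < 8; ++cycle) {
    int t = active.front(); active.pop_front();
    bool long_latency_miss = (cycle % 3 == 0);   // stand-in miss pattern
    if (long_latency_miss) {
      // Park the stalled thread and promote a waiting one into the active set.
      pending.push_back(t);
      active.push_back(pending.front()); pending.pop_front();
      std::printf("cycle %d: thread %d parked on miss\n", cycle, t);
    } else {
      active.push_back(t);              // round-robin among active threads
    }
  }
  std::printf("active=%zu pending=%zu\n", active.size(), pending.size());
  return 0;
}
```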
Advisors/Committee Members: Patel, Sanjay J. (advisor),
Patel, Sanjay J. (Committee Chair),
Lumetta, Steven S. (committee member),
Hwu, Wen-Mei W. (committee member),
Shanbhag, Naresh R. (committee member).
Subjects/Keywords: Processors; Multiprocessors; Computer Architecture; Parallel processing; Multithreading; High Performance Computing (HPC); visual computing; Accelerator processor; coprocessor; cache coherence; parallel computing; computer systems; graphics processors
APA (6th Edition):
Johnson, D. (2013). Multithreaded architectures for manycore throughput processors. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/44751
Chicago Manual of Style (16th Edition):
Johnson, Daniel. “Multithreaded architectures for manycore throughput processors.” 2013. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/44751.
MLA Handbook (7th Edition):
Johnson, Daniel. “Multithreaded architectures for manycore throughput processors.” 2013. Web. 13 Apr 2021.
Vancouver:
Johnson D. Multithreaded architectures for manycore throughput processors. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2013. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/44751.
Council of Science Editors:
Johnson D. Multithreaded architectures for manycore throughput processors. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2013. Available from: http://hdl.handle.net/2142/44751

University of Illinois – Urbana-Champaign
23.
Wu, Pei-Ci.
New methods for electronic design automation problems.
Degree: PhD, Electrical & Computer Engr, 2015, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/78447
As the semiconductor technology marches towards the 14nm node and beyond, EDA (electronic design automation) has rapidly increased in importance with ever more complicated modern integrated circuit (IC) designs. This presents many new issues for EDA, including design, manufacturing, and packaging. Challenging EDA problems in these three domains are studied in this dissertation.
Timing closure, which aims to satisfy the timing constraints, is always a key problem in the physical design flow. The challenges of timing closure for IC designs keep increasing as the technology advances. During the timing optimization process, buffers can be used to speed up the circuit or serve as delay elements. In this dissertation, we study the hold-violation removal problem for a circuit-level design. Considering the challenges of industrial designs, discrete buffer sizes, accurate timing models/analysis, and complex timing constraints make the problem difficult and time-consuming to solve. A linear programming-based methodology is presented; in our experiments, this approach is tested on industrial designs and is incorporated into the state-of-the-art industrial optimization flow.
While buffers can help fix hold-time violations, they also increase the difficulty of routability and the utilization of a design. The larger area of cells also contributes larger leakage power, and power is an increasing challenge as the technology advances. Therefore, in Chapter 3, we study the buffer insertion problem, which is to determine which buffers to insert in order to meet the timing constraints while minimizing the total area of inserted buffers. Several approaches are presented. We test the proposed approaches on industrial designs, and the machine learning based approach shows better results in terms of quality and runtime.
Aerial image simulation is a fundamental problem in the regular lithography-related process. Since it requires a huge amount of mathematical computation, an efficient yet accurate implementation becomes a necessity. In the literature, GPUs and FPGAs have successfully demonstrated their potential, with detailed tuning, for accelerating aerial image simulation. However, the advantages of the GPU or FPGA over the CPU are not solid enough, given that careful tuning of the CPU-based method is missing in the previous works, while recent CPU architectures have gained significant high-performance computing capabilities. In this dissertation, we present and discuss several algorithms for aerial image simulation on multi-core SIMD CPUs. Our experimental results show that the performance on the multi-core SIMD CPU is promising, and careful CPU tuning is necessary in order to exploit its computing capabilities.
Since the constantly evolving technology continues to push the complexity of package and printed circuit board (PCB) design to a higher level, nowadays a modern package can contain thousands of pins. On the other hand, the size of a package is still…
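One common way to cast hold-violation removal as a linear program, sketched below purely for orientation (the dissertation's actual formulation, with discrete buffer sizes, industrial timing models, and complex constraints, is more involved): choose the extra delay d_i inserted at each candidate location i so that every hold-violating path gains enough slack, without creating setup violations, while minimizing the total inserted delay as a proxy for buffer area.

```latex
\[
\begin{aligned}
\min \quad & \sum_{i} c_i\, d_i
  && \text{(total cost of inserted delay, e.g.\ buffer area)} \\
\text{s.t.} \quad
  & \sum_{i \in \mathrm{path}(p)} d_i \;\ge\; -\,s^{\mathrm{hold}}_{p}
  && \forall\, p \ \text{with hold slack}\ s^{\mathrm{hold}}_{p} < 0, \\
  & \sum_{i \in \mathrm{path}(q)} d_i \;\le\; s^{\mathrm{setup}}_{q}
  && \forall\, q \ \text{(do not create setup violations)}, \\
  & d_i \;\ge\; 0 .
\end{aligned}
\]
```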
Advisors/Committee Members: Wong, Martin D.F. (advisor),
Wong, Martin D.F. (Committee Chair),
Chen, Deming (committee member),
Hwu, Wen-Mei W. (committee member),
Rutenbar, Robin A. (committee member).
Subjects/Keywords: Electronic Design Automation (EDA); Timing Closure; Buffer Insertion; Aerial Image Simulation; Escape Routing; Bus Planner
APA (6th Edition):
Wu, P. (2015). New methods for electronic design automation problems. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/78447
Chicago Manual of Style (16th Edition):
Wu, Pei-Ci. “New methods for electronic design automation problems.” 2015. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/78447.
MLA Handbook (7th Edition):
Wu, Pei-Ci. “New methods for electronic design automation problems.” 2015. Web. 13 Apr 2021.
Vancouver:
Wu P. New methods for electronic design automation problems. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2015. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/78447.
Council of Science Editors:
Wu P. New methods for electronic design automation problems. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2015. Available from: http://hdl.handle.net/2142/78447

University of Illinois – Urbana-Champaign
24.
Wang, Haichuan.
Compiler and runtime techniques for optimizing dynamic scripting languages.
Degree: PhD, Computer Science, 2015, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/78638
This thesis studies compilation and runtime techniques to improve the performance of dynamic scripting languages, using the R programming language as a test case.
The R programming language is a convenient system for statistical computing. In this era of big data, R is becoming increasingly popular as a powerful data analytics tool, but its performance limits its usage in a broader context. The thesis introduces a classification of R programming styles into Looping over data (Type I), Vector programming (Type II), and Glue code (Type III), and identifies that the most serious overhead of R is mostly manifested in Type I code. It then proposes techniques to improve the performance of R. First, it uses interpreter-level specialization to perform object allocation removal and path length reduction, and evaluates its effectiveness for the GNU R VM. The approach uses profiling to translate R byte-code into specialized byte-code to improve running speed, and uses data representation specialization to reduce memory allocation and usage. Second, it proposes a lightweight approach that reduces the interpretation overhead of R through vectorization of the widely used Apply class of operations in R. The approach combines data transformation and function vectorization to transform looping-over-data execution into code with mostly vector operations, which can significantly speed up the execution of Apply operations in R without any native code generation and while still using only a single thread of execution. Third, the Apply vectorization technique is integrated into SparkR, a widely used distributed R computing system, and has successfully improved its performance. Furthermore, an R benchmark suite has been developed; it includes a collection of different types of R applications and a flexible benchmarking environment for conducting performance research for R. All of these techniques could be applied to other dynamic scripting languages.
The techniques proposed in this thesis use a pure interpretation approach (the resulting system does not generate native code) to improve the performance of R. This strategy has the advantage of maintaining the portability and compatibility of the VM and simplifying the implementation, and it also demonstrates the performance potential of a pure interpreter.
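The looping-over-data versus vector-programming distinction above amounts to replacing per-element work, each step paying interpreter overhead in R, with a single bulk operation over a whole vector. The C++ fragment below is only a loose analogue of that transformation shape (R's interpretation overhead has no direct C++ counterpart); it contrasts an element-at-a-time "Apply" style with delegating the loop to a bulk primitive.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
  std::vector<double> x = {1, 2, 3, 4, 5};
  std::vector<double> y(x.size());

  // "Type I" style: loop over data, applying the function one element at a time
  // (in R, every iteration would pay interpreter and allocation overhead).
  for (std::size_t i = 0; i < x.size(); ++i) y[i] = x[i] * 2.0 + 1.0;

  // "Type II" style: one bulk operation over the whole vector, the shape the
  // Apply-vectorization transformation aims for.
  std::transform(x.begin(), x.end(), y.begin(),
                 [](double v) { return v * 2.0 + 1.0; });

  std::printf("y[4] = %.1f\n", y[4]);  // 11.0
  return 0;
}
```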
Advisors/Committee Members: Padua, David A. (advisor),
Padua, David A. (Committee Chair),
Adve, Vikram S. (committee member),
Hwu, Wen-Mei W. (committee member),
Wu, Peng (committee member).
Subjects/Keywords: R Programming Language; Dynamic Scripting Language; Compiler; Performance; Specialization; Vectorization
APA (6th Edition):
Wang, H. (2015). Compiler and runtime techniques for optimizing dynamic scripting languages. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/78638
Chicago Manual of Style (16th Edition):
Wang, Haichuan. “Compiler and runtime techniques for optimizing dynamic scripting languages.” 2015. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/78638.
MLA Handbook (7th Edition):
Wang, Haichuan. “Compiler and runtime techniques for optimizing dynamic scripting languages.” 2015. Web. 13 Apr 2021.
Vancouver:
Wang H. Compiler and runtime techniques for optimizing dynamic scripting languages. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2015. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/78638.
Council of Science Editors:
Wang H. Compiler and runtime techniques for optimizing dynamic scripting languages. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2015. Available from: http://hdl.handle.net/2142/78638

University of Illinois – Urbana-Champaign
25.
Ahn, Daniel.
Software and architecture support for the bulk multicore.
Degree: PhD, 0112, 2012, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/32076
Research on transactional memory began as a tool to improve the experience of programmers working on parallel code. Just as with transactions in databases, it is the job of the runtime to detect any conflicts between parallel transactions and roll back the ones that need to be re-executed, leaving programmers blissfully unaware of the communication and synchronization that needs to happen. The programmer only needs to enclose the sections of code that might generate conflicts in transactions, or atomic regions.
More recently, new uses for transactional execution were proposed where not only are user-specified sections of code executed transactionally, but the entire program is executed using transactions. In this environment, the hardware is in charge of generating the transactions, also called chunks, unbeknownst to the user. This simple idea led to many improvements in programmability, such as providing a sequentially consistent (SC) memory model, aiding atomicity violation detection, enabling deterministic replay, and even enabling deterministic execution. However, the implications of this chunking hardware for the compiler layer have not been studied before; they are the subject of this thesis.
This thesis makes three contributions. First, it describes the modifications to the compiler necessary to enable the benefits in programmability, specifically the SC memory model, to percolate up from the hardware level to the programming language level. The already-present hardware support for chunked execution is exposed to the compiler and used extensively for this purpose. Surprisingly, the ability to speculate using chunks leads to speedups over traditional compilers in many cases. Second, it describes how to expose hardware signatures, present in chunking hardware for the purposes of conflict detection and memory disambiguation, to the compiler to enable further novel optimizations. An example is given where hardware signatures are used to summarize the side effects of functions to perform function memoization at a large scale. Third, it describes how to use atomic regions and conflict detection hardware to improve alias analysis for general compiler optimizations using speculation. Loop invariant code motion, a widely used traditional compiler pass, is run as an example client pass to test the potential of the new alias analysis.
Advisors/Committee Members: Torrellas, Josep (advisor),
Torrellas, Josep (Committee Chair),
Cascaval, Calin (committee member),
Midkiff, Samuel (committee member),
Hwu, Wen-Mei W. (committee member),
Adve, Vikram S. (committee member),
King, Samuel T. (committee member).
Subjects/Keywords: Computer Architecture; Compiler; Transactional Memory; Transactional Execution; Speculative Optimization; Bloomfilter; Signature; Memory Model; Sequential Consistency; Function Memoization; Alias Analysis
APA (6th Edition):
Ahn, D. (2012). Software and architecture support for the bulk multicore. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/32076
Chicago Manual of Style (16th Edition):
Ahn, Daniel. “Software and architecture support for the bulk multicore.” 2012. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/32076.
MLA Handbook (7th Edition):
Ahn, Daniel. “Software and architecture support for the bulk multicore.” 2012. Web. 13 Apr 2021.
Vancouver:
Ahn D. Software and architecture support for the bulk multicore. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2012. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/32076.
Council of Science Editors:
Ahn D. Software and architecture support for the bulk multicore. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2012. Available from: http://hdl.handle.net/2142/32076

University of Illinois – Urbana-Champaign
26.
Huang, Tsung-Wei.
Distributed timing analysis.
Degree: PhD, Electrical & Computer Engr, 2017, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/99302
As design complexities continue to grow, the need to efficiently analyze circuit timing with billions of transistors across multiple modes and corners is quickly becoming the major bottleneck to the overall chip design closure process. To alleviate the long runtimes, recent trends are driving the need for distributed timing analysis (DTA) in electronic design automation (EDA) tools. However, DTA has received little research attention so far and remains a critical problem. In this thesis, we introduce several methods to approach DTA problems. We present a near-optimal algorithm to speed up path-based timing analysis in Chapter 1. Path-based timing analysis is a key step in the overall timing flow to reduce unwanted pessimism, for example, common path pessimism removal (CPPR). In Chapter 2, we introduce a MapReduce-based distributed path-based timing analysis framework that can scale up to hundreds of machines. In Chapter 3, we introduce our standalone timer, OpenTimer, an open-source high-performance timing analysis tool for very large scale integration (VLSI) systems. OpenTimer efficiently supports (1) both block-based and path-based timing propagations, (2) CPPR, and (3) incremental timing. OpenTimer works on industry formats (e.g., .v, .spef, .lib, .sdc) and is designed to be parallel and portable. To further facilitate integration between timing and timing-driven optimizations, OpenTimer provides a user-friendly application programming interface (API) for interactive analysis. Experimental results on industry benchmarks released from the TAU 2015 timing analysis contest have demonstrated remarkable results achieved by OpenTimer, especially its order-of-magnitude speedup over existing timers.
In Chapter 4 we present a DTA framework built on top of our standalone timer OpenTimer. We investigated existing cluster computing frameworks from the big data community and demonstrated that DTA is a difficult fit in terms of computation patterns and performance concerns. Our specialized DTA framework supports (1) general design partitions (logical, physical, hierarchical, etc.) stored in a distributed file system, (2) non-blocking IO with event-driven programming for effective communication and computation overlap, and (3) an efficient messaging interface between the application and network layers. The effectiveness and scalability of our framework have been evaluated on large hierarchical industry designs over a cluster with hundreds of machines.
In Chapter 5, we present our system DtCraft, a distributed execution engine for compute-intensive applications. Motivated by our DTA framework, DtCraft introduces a high-level programming model that lets users without detailed experience of distributed computing utilize cluster resources. The major goal is to simplify the coding effort of building distributed applications on our system. In contrast to existing data-parallel cluster computing frameworks, DtCraft targets high-performance or compute-intensive applications including simulations, modeling, and most…
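To make the contrast between block-based and path-based propagation in the abstract above concrete, the sketch below propagates arrival times through a small combinational timing graph in a single block-based pass: each pin's arrival time is the maximum, over its fan-in arcs, of the source arrival time plus the arc delay. The graph, delay values, and data structures are hypothetical illustrations only and are not OpenTimer code or its API.

    // Block-based arrival-time propagation over a small timing graph
    // (hypothetical illustration; not OpenTimer code).
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Arc { int from, to; double delay; };    // one timing arc and its delay

    int main() {
      const int num_pins = 5;                      // pins 0 and 1 are primary inputs; 4 is the endpoint
      // Arcs are listed so that every arc's source already has its final arrival time.
      std::vector<Arc> arcs = {
        {0, 2, 1.0}, {1, 2, 2.0},                  // two inputs feed pin 2
        {2, 3, 0.5}, {1, 3, 3.0},                  // pin 3 merges pin 2 and input 1
        {3, 4, 1.5}                                // pin 3 drives endpoint 4
      };
      std::vector<double> arrival(num_pins, 0.0);  // latest arrival time per pin

      // One pass suffices: take the max over all fan-in arcs instead of
      // enumerating individual paths, which is what makes it "block-based".
      for (const Arc& a : arcs)
        arrival[a.to] = std::max(arrival[a.to], arrival[a.from] + a.delay);

      for (int pin = 0; pin < num_pins; ++pin)
        std::printf("pin %d: arrival = %.1f\n", pin, arrival[pin]);
      return 0;                                    // endpoint 4 arrives at 4.5
    }

A path-based step would then trace the most critical paths backward from the endpoints, which is where per-path pessimism reduction such as CPPR is applied.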
Advisors/Committee Members: Wong, Martin D. F. (advisor), Wong, Martin D. F. (committee chair), Chen, Deming (committee member), Hwu, Wen-Mei (committee member), Rutenbar, Rob A. (committee member).
Subjects/Keywords: Distributed systems; Timing analysis
APA (6th Edition):
Huang, T. (2017). Distributed timing analysis. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/99302
Chicago Manual of Style (16th Edition):
Huang, Tsung-Wei. “Distributed timing analysis.” 2017. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/99302.
MLA Handbook (7th Edition):
Huang, Tsung-Wei. “Distributed timing analysis.” 2017. Web. 13 Apr 2021.
Vancouver:
Huang T. Distributed timing analysis. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2017. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/99302.
Council of Science Editors:
Huang T. Distributed timing analysis. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2017. Available from: http://hdl.handle.net/2142/99302

University of Illinois – Urbana-Champaign
27.
Estrada, Zachary.
Dynamic reliability and security monitoring: a virtual machine approach.
Degree: PhD, Electrical & Computer Engr, 2016, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/95458
While one always works to prevent attacks and failures, they are inevitable, and situational awareness is key to taking appropriate action. Monitoring plays an integral role in ensuring the reliability and security of computing systems. Infrastructure as a Service (IaaS) clouds significantly lower the barrier for obtaining scalable computing resources and allow users to focus on what is important to them. Can a similar service be offered to provide on-demand reliability and security monitoring?
Cloud computing systems are typically built using virtual machines (VMs). VM monitoring takes advantage of this and uses the hypervisor that runs VMs for robust reliability and security monitoring. The hypervisor provides an environment that is isolated from failures and attacks inside customers’ VMs. Furthermore, as a low-level manager of computing resources, the hypervisor has full access to the infrastructure running above it. Hypervisor-based VM monitoring leverages that information to observe the VMs for failures and attacks. However, existing VM monitoring techniques fall short of “as-a-service” expectations because they require a priori VM modifications and require human interaction to obtain necessary information about the underlying guest system. The research presented in this dissertation closes those gaps by providing a flexible VM monitoring framework and automated analysis to support that framework.
We have developed and tested a dynamic VM monitoring framework called Hypervisor Probes (hprobes). The hprobe framework allows us to monitor the execution of both the guest OS and applications from the hypervisor. To supplement this monitoring framework, we use dynamic analysis techniques to investigate the relationship between hardware events visible to the hypervisor and OS constructs common across OS versions. We use the results of this analysis to parametrize the hprobe-based monitors without requiring any user input. Combining the dynamic VM monitoring and analysis frameworks allows us to provide on-demand, hypervisor-based monitors for cloud VMs.
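As a loose, user-space analogy for the probe mechanism described above (a sketch under stated assumptions; the real hprobe framework traps guest execution from the hypervisor, which is not reproduced here), the code below keeps a table of probed addresses and invokes a handler whenever the monitored instruction stream reaches one of them. All addresses, names, and handlers are hypothetical.

    // User-space analogy of a dynamic probe table (illustration only; not the
    // hprobe implementation, which operates inside the hypervisor).
    #include <cstdint>
    #include <cstdio>
    #include <functional>
    #include <unordered_map>
    #include <vector>

    using Handler = std::function<void(uint64_t)>;

    struct ProbeTable {
      std::unordered_map<uint64_t, Handler> probes;        // probed address -> handler
      void insert(uint64_t addr, Handler h) { probes[addr] = std::move(h); }
      void remove(uint64_t addr) { probes.erase(addr); }   // probes can come and go at run time
      void on_exec(uint64_t addr) {                        // called as the "guest" executes addr
        auto it = probes.find(addr);
        if (it != probes.end()) it->second(addr);
      }
    };

    int main() {
      ProbeTable table;
      // Hypothetical probe on a guest entry point of interest.
      table.insert(0xffffffff81000123ULL, [](uint64_t a) {
        std::printf("probe hit at 0x%llx: run reliability/security check\n",
                    static_cast<unsigned long long>(a));
      });

      // Toy instruction trace; only probed addresses trigger the handler.
      std::vector<uint64_t> trace = {0x1000, 0xffffffff81000123ULL, 0x2000};
      for (uint64_t ip : trace) table.on_exec(ip);
      return 0;
    }

The dynamic-analysis step in the dissertation exists, roughly, to decide which guest addresses and OS constructs are worth probing without asking the user; this sketch simply takes that choice as given.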
Advisors/Committee Members: Iyer, Ravishankar K (advisor), Kalbarczyk, Zbigniew T (advisor), Iyer, Ravishankar K (committee chair), Bailey, Michael D (committee member), Campbell, Roy H (committee member), Hwu, Wen-Mei W (committee member).
Subjects/Keywords: virtualization; emulation; security; reliability; hprobes; cloud; hardware-assisted virtualization; dynamic analysis
APA (6th Edition):
Estrada, Z. (2016). Dynamic reliability and security monitoring: a virtual machine approach. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/95458
Chicago Manual of Style (16th Edition):
Estrada, Zachary. “Dynamic reliability and security monitoring: a virtual machine approach.” 2016. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/95458.
MLA Handbook (7th Edition):
Estrada, Zachary. “Dynamic reliability and security monitoring: a virtual machine approach.” 2016. Web. 13 Apr 2021.
Vancouver:
Estrada Z. Dynamic reliability and security monitoring: a virtual machine approach. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2016. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/95458.
Council of Science Editors:
Estrada Z. Dynamic reliability and security monitoring: a virtual machine approach. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2016. Available from: http://hdl.handle.net/2142/95458

University of Illinois – Urbana-Champaign
28.
Jambunathan, Revathi.
CHAOS: A multi-GPU PIC-DSMC solver for modeling gas and plasma flows.
Degree: PhD, Aerospace Engineering, 2019, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/104740
Numerical modeling of gas and plasma-surface interactions is critical to understanding the complex kinetic processes that dominate the extreme environments of planetary entry and in-space propulsion. However, simulations of these systems, which evolve over multiple length and time scales, are computationally expensive. Until recently, approximations were used to keep computational costs tenable, which, in turn, increased the uncertainty in predictions and offered limited insight into the micro-scale flow properties and electron kinetics that dominate the macroscale processes. The need to perform high-fidelity, physics-based gas and plasma simulations has led to the development of a three-dimensional, multi-GPU, particle-in-cell (PIC)-direct simulation Monte Carlo (DSMC) solver called Cuda-based Hybrid Approach for Octree Simulations (CHAOS), which is presented in this work. This computational tool has been applied to candidate PICA-like TPS materials that consist of an irregular porous network of fibers, allowing high-temperature boundary-layer gases as well as pyrolysis by-products to penetrate into and flow out of the material. Quantifying the bulk transport properties of these materials is essential for accurate prediction of the macroscopic ablation rate. The second application of CHAOS is the modeling of ion thruster plumes, which consist of fast beam ions and slow neutrals that undergo charge-exchange (CEX) reactions to produce slow ions and fast neutrals. These slow CEX ions are strongly influenced by the electric field induced between the ion plume and the thruster surface, resulting in a backflow of ions towards the critical solar panel and thruster surfaces. Three backflow quantities, namely ion flux, incidence angle, and incidence energy, affect the macroscopic sputtering rate of the solar panel surfaces over extended operational times and are predicted from the PIC-DSMC simulations.
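Among the keywords listed below, Morton encoding is the standard way to linearize octree cells into a one-dimensional ordering that preserves spatial locality, which makes sorting and partitioning GPU-friendly. The following sketch shows the usual bit-interleaving computation for a 3D Morton code; the 10-bit-per-axis layout and helper names are illustrative assumptions, not code from CHAOS.

    // 3D Morton (Z-order) code by bit interleaving (illustration; not CHAOS code).
    // Each coordinate contributes its low 10 bits, giving a 30-bit code.
    #include <cstdint>
    #include <cstdio>

    // Spread the low 10 bits of v so that two zero bits separate consecutive bits.
    static uint32_t expand_bits(uint32_t v) {
      v &= 0x000003ffu;                       // keep 10 bits
      v = (v | (v << 16)) & 0x030000ffu;
      v = (v | (v << 8))  & 0x0300f00fu;
      v = (v | (v << 4))  & 0x030c30c3u;
      v = (v | (v << 2))  & 0x09249249u;
      return v;
    }

    // Interleave the bits of x, y, z: code = ... z1 y1 x1 z0 y0 x0.
    static uint32_t morton3d(uint32_t x, uint32_t y, uint32_t z) {
      return (expand_bits(z) << 2) | (expand_bits(y) << 1) | expand_bits(x);
    }

    int main() {
      std::printf("morton(1,0,0) = %u\n", morton3d(1, 0, 0));  // 1
      std::printf("morton(0,1,0) = %u\n", morton3d(0, 1, 0));  // 2
      std::printf("morton(0,0,1) = %u\n", morton3d(0, 0, 1));  // 4
      std::printf("morton(3,3,3) = %u\n", morton3d(3, 3, 3));  // 63
      return 0;
    }

Sorting particles or octree leaves by such codes keeps spatially nearby cells close together in memory, which is one common reason Morton ordering appears alongside octree-based GPU solvers.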
Advisors/Committee Members: Levin, Deborah A (advisor), Levin, Deborah A (committee chair), Hwu, Wen-Mei (committee member), Chew, Huck Beng (committee member), Stephani, Kelly A. (committee member).
Subjects/Keywords: PIC; DSMC; Forest of Octree; GPU; CUDA; MPI; Morton encoding; linearization; permeability; porous media; plasma plume; neutralization; electron kinetics; charge-exchange collisions; ion backflow
APA (6th Edition):
Jambunathan, R. (2019). CHAOS: A multi-GPU PIC-DSMC solver for modeling gas and plasma flows. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/104740
Chicago Manual of Style (16th Edition):
Jambunathan, Revathi. “CHAOS: A multi-GPU PIC-DSMC solver for modeling gas and plasma flows.” 2019. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/104740.
MLA Handbook (7th Edition):
Jambunathan, Revathi. “CHAOS: A multi-GPU PIC-DSMC solver for modeling gas and plasma flows.” 2019. Web. 13 Apr 2021.
Vancouver:
Jambunathan R. CHAOS: A multi-GPU PIC-DSMC solver for modeling gas and plasma flows. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2019. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/104740.
Council of Science Editors:
Jambunathan R. CHAOS: A multi-GPU PIC-DSMC solver for modeling gas and plasma flows. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2019. Available from: http://hdl.handle.net/2142/104740

University of Illinois – Urbana-Champaign
29.
Gopi Reddy, Bhargava Reddy.
Energy efficient core designs for upcoming process technologies.
Degree: PhD, Computer Science, 2019, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/104832
Energy efficiency has been a first-order constraint in the design of microprocessors for the last decade. As Moore's law sunsets, new technologies are being actively explored to extend the march of increasing computational power and efficiency. It is essential for computer architects to understand the opportunities and challenges in utilizing the upcoming process technology trends in order to design the most efficient processors. In this work, we consider three process technology trends and propose core designs that are best suited for each of the technologies. The process technologies are expected to be viable over a span of timelines.
We first consider the most popular method currently available to improve energy efficiency, i.e., lowering the operating voltage. We make key observations regarding the limiting factors in scaling down the operating voltage for general-purpose high-performance processors. We then propose our novel core design, ScalCore, one that can work in a high-performance mode at nominal Vdd and in a very energy-efficient mode at low Vdd. The resulting core design can operate at much lower voltages, providing higher parallel performance while consuming lower energy.
While lowering Vdd improves energy efficiency, CMOS devices are fundamentally limited in their low-voltage operation. Therefore, we next consider an upcoming device technology – Tunneling Field-Effect Transistors (TFETs) – which are expected to supplement CMOS device technology in the near future. TFETs can attain much higher energy efficiency than CMOS at low voltages. However, their performance saturates at high voltages, so they cannot entirely replace CMOS when high performance is needed. Ideally, we desire a core that is as energy-efficient as TFET and provides as much performance as CMOS. To reach this goal, we characterize TFET device behavior for core design and judiciously integrate TFET and CMOS units in a single core. The resulting core, called HetCore, can provide very high energy efficiency while limiting the slowdown when compared to a CMOS core.
Finally, we analyze Monolithic 3D (M3D) integration technology, which is widely considered to be the only way to integrate more transistors on a chip. We present the first analysis of the architectural implications of using M3D for core design and show how to partition the core across different layers. We also address one of the key challenges in realizing the technology, namely, the performance degradation of the top layer. We propose a critical-path-based partitioning for logic stages and asymmetric bit/port partitioning for storage stages. The result is a core that performs nearly as well as a core without any top-layer slowdown. When compared to a 2D baseline design, an M3D core not only provides much higher performance but also reduces energy consumption.
In summary, this thesis addresses one of the fundamental challenges in computer architecture – overcoming the fact that CMOS is not scaling anymore. As we increase the computing…
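As a first-order reminder of why the voltage scaling explored above is so attractive (standard CMOS textbook scaling, not a result from this thesis), the dynamic switching energy of a core falls roughly quadratically with the supply voltage:

    \[
    E_{\text{dyn}} \approx \alpha \, C_{\text{eff}} \, V_{dd}^{2}
    \qquad\Longrightarrow\qquad
    \frac{E_{\text{dyn}}(V_{dd}/2)}{E_{\text{dyn}}(V_{dd})} \approx \frac{1}{4},
    \]

where \(\alpha\) is the switching activity factor and \(C_{\text{eff}}\) the effective switched capacitance. The catch, which ScalCore and HetCore address in different ways, is that the maximum operating frequency also drops sharply as \(V_{dd}\) approaches the threshold voltage, so the energy savings must either be traded against single-thread performance or recovered through added parallelism or more efficient devices such as TFETs.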
Advisors/Committee Members: Torrellas, Josep (advisor), Torrellas, Josep (committee chair), Hwu, Wen-Mei (committee member), Kim, Nam Sung (committee member), Fletcher, Christopher (committee member), Mishra, Asit (committee member).
Subjects/Keywords: Energy efficient architecture; TFET; Monolithic 3D; ScalCore; HetCore; Microarchitecture; Processor; CPU; Low Voltage
APA (6th Edition):
Gopi Reddy, B. R. (2019). Energy efficient core designs for upcoming process technologies. (Doctoral Dissertation). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/104832
Chicago Manual of Style (16th Edition):
Gopi Reddy, Bhargava Reddy. “Energy efficient core designs for upcoming process technologies.” 2019. Doctoral Dissertation, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/104832.
MLA Handbook (7th Edition):
Gopi Reddy, Bhargava Reddy. “Energy efficient core designs for upcoming process technologies.” 2019. Web. 13 Apr 2021.
Vancouver:
Gopi Reddy BR. Energy efficient core designs for upcoming process technologies. [Internet] [Doctoral dissertation]. University of Illinois – Urbana-Champaign; 2019. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/104832.
Council of Science Editors:
Gopi Reddy BR. Energy efficient core designs for upcoming process technologies. [Doctoral Dissertation]. University of Illinois – Urbana-Champaign; 2019. Available from: http://hdl.handle.net/2142/104832
30.
Agarwal, Ayush.
Memory access patterns and page promotion in hybrid memory systems.
Degree: MS, Electrical & Computer Engr, 2020, University of Illinois – Urbana-Champaign
URL: http://hdl.handle.net/2142/108002
Hybrid heterogeneous memory systems are becoming increasingly popular as traditional memory systems are hitting performance and energy walls in processing data-intensive applications, which are becoming the norm with the resurgence of machine learning, big data, graph analytics, and database management systems, especially in modern datacenters. In addition to the massive data that these applications process, they exhibit varying and non-deterministic memory access patterns, making I/O latency a prime criterion in the design considerations that go into building modern computing systems to support them.
A traditional memory system moves data by swapping pages between the faster DRAM and the slower SSD. While applications with sequential accesses generate well-behaved traffic between the DRAM and the SSD, applications with random page accesses, such as large graphs, often produce high traffic and exhibit little or no reuse of pages swapped into the DRAM.
This thesis proposes a technique to identify memory access patterns, and a scalable and distributed technique to determine when pages should be promoted from the slower memory system to the faster memory system, thereby reducing I/O traffic. The proposed page promotion design shows up to 6.74x reduction in page traffic and 1.21x increase in the total hit rate of a data-intensive application with uniformly distributed random memory accesses.
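As a concrete illustration of the kind of policy such a design can implement, the sketch below counts accesses to slow-memory pages and promotes a page once its count within a window of recent accesses crosses a threshold. The window size, threshold, and data structures are illustrative assumptions and are not the pattern-identification or distributed promotion mechanism proposed in the thesis.

    // Threshold-based page promotion sketch (illustrative assumptions only).
    // Pages in slow memory are promoted once they accumulate enough accesses
    // within a sliding window of recent references.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <deque>
    #include <unordered_map>
    #include <unordered_set>

    class Promoter {
     public:
      Promoter(std::size_t window, unsigned threshold)
          : window_(window), threshold_(threshold) {}

      // Record one access to a page; returns true if the page gets promoted now.
      bool access(uint64_t page) {
        if (fast_.count(page)) return false;            // already in fast memory
        history_.push_back(page);
        ++counts_[page];
        if (history_.size() > window_) {                // slide the window
          uint64_t old = history_.front();
          history_.pop_front();
          auto it = counts_.find(old);
          if (it != counts_.end() && --(it->second) == 0) counts_.erase(it);
        }
        if (counts_[page] >= threshold_) {              // hot enough: promote
          fast_.insert(page);
          counts_.erase(page);
          return true;
        }
        return false;
      }

     private:
      std::size_t window_;
      unsigned threshold_;
      std::deque<uint64_t> history_;                    // recent accesses, oldest first
      std::unordered_map<uint64_t, unsigned> counts_;   // per-page count inside the window
      std::unordered_set<uint64_t> fast_;               // pages resident in fast memory
    };

    int main() {
      Promoter promoter(/*window=*/8, /*threshold=*/3);
      const uint64_t trace[] = {1, 2, 1, 3, 1, 4, 2, 2, 2};
      for (uint64_t page : trace)
        if (promoter.access(page))
          std::printf("promote page %llu\n", static_cast<unsigned long long>(page));
      return 0;                                         // promotes pages 1 and 2
    }

A real policy would also have to decide what to demote and keep its bookkeeping cheap at datacenter scale, which is where the scalable, distributed design described in the thesis comes in; none of that is modeled here.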
Advisors/Committee Members: Hwu, Wen-mei W (advisor).
Subjects/Keywords: memory system; data-intensive; prefetching; hybrid memory system; page promotion; page migration
APA (6th Edition):
Agarwal, A. (2020). Memory access patterns and page promotion in hybrid memory systems. (Thesis). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/108002
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Chicago Manual of Style (16th Edition):
Agarwal, Ayush. “Memory access patterns and page promotion in hybrid memory systems.” 2020. Thesis, University of Illinois – Urbana-Champaign. Accessed April 13, 2021.
http://hdl.handle.net/2142/108002.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
MLA Handbook (7th Edition):
Agarwal, Ayush. “Memory access patterns and page promotion in hybrid memory systems.” 2020. Web. 13 Apr 2021.
Vancouver:
Agarwal A. Memory access patterns and page promotion in hybrid memory systems. [Internet] [Thesis]. University of Illinois – Urbana-Champaign; 2020. [cited 2021 Apr 13].
Available from: http://hdl.handle.net/2142/108002.
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation
Council of Science Editors:
Agarwal A. Memory access patterns and page promotion in hybrid memory systems. [Thesis]. University of Illinois – Urbana-Champaign; 2020. Available from: http://hdl.handle.net/2142/108002
Note: this citation may be lacking information needed for this citation format:
Not specified: Masters Thesis or Doctoral Dissertation