Sponsored by the IEEE Computer Society
Building on the positive feedback from the First and Second Workshops on Computer Architecture Evaluation using Commercial Workloads, this third workshop will again bring together researchers and practitioners in computer architecture and commercial workloads from industry and academia. In the course of one day, we will discuss work in progress that uses commercial workloads to evaluate computer architectures. By discussing this ongoing research, the workshop will expose participants to the characteristics of commercial workload behavior and provide an understanding of how commercial workloads exercise computer systems. We will also discuss the difficulties associated with using commercial workloads to drive new computer architecture designs and what can be done to overcome them.
The Final Program for the workshop is listed below, with an abstract for each talk. There will be plenty of time for audience participation, and a round-table panel discussion will follow the technical presentations. There will be no proceedings for the workshop, since we encourage the presentation of work in progress and research in its early stages. Copies of the foils used by the speakers will be distributed to the attendees.
Dynamic branch prediction schemes have led to processor designs in which instruction supply and execution are clearly decoupled; it is common nowadays to speak of front-end and back-end engines. A front-end, or fetch engine, is characterized by the speed at which it can deliver instructions and by the 'quality' of the instructions it delivers. The speed is directly related to the effective bandwidth and latency of the fetch unit, while the 'quality' of the delivered instructions depends on the accuracy of the branch predictor. For years, researchers have worked to keep instruction fetch latencies low and to obtain higher branch prediction accuracies; however, fetch bandwidth has become increasingly important as processor issue widths have grown. This work evaluates the performance effects that several variations of the fetch engine parameters have on the execution of Decision Support System (DSS) workloads. Our study uses Postgres 6.3 running TPC-D queries 3 and 17 over a 100 MB B-tree-indexed database. We use a user-level simulator of an out-of-order superscalar processor derived from the SimpleScalar v3.0 tool set. First, we examine the performance that can be expected from a perfect fetch engine: a 4-wide issue processor with a perfect fetch engine of the same width can achieve an IPC of 2.66, a performance increase of 59% over a basic configuration with a 32 KB cache and a 4096-entry two-bit-counter bimodal predictor.
With the evidence that better fetch may help to improve the performance of DBMS execution on current superscalars, we analyze the impact of the i-cache and of the branch prediction mechanism. On one hand, an undersized i-cache can significantly degrade the performance of our applications: a perfect i-cache with perfect branch prediction achieves an IPC of 2.39, while a 16 KB i-cache in the same situation achieves an IPC of 1.45. On the other hand, we show that the effect of the branch prediction mechanism is bounded by the i-cache size. In particular, increasing the size and quality of the branch predictor has a larger effect with larger i-caches: for a 16 KB i-cache, the relative difference between the worst predictor (a 1K two-bit-counter gshare) and a perfect branch predictor is 8.5%, while the same predictors differ by 22.6% with a perfect i-cache.
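For concreteness, the relative differences reported here are plain IPC ratios. The short sketch below is illustrative only: it assumes the percentages are computed as relative IPC improvements and reuses the two i-cache IPC values quoted above.

    /* Illustrative only: relative IPC difference between two configurations,
       assuming the percentages are computed as (better - worse) / worse.
       The two IPC values are the ones quoted in the text. */
    #include <stdio.h>

    static double rel_diff_pct(double ipc_worse, double ipc_better)
    {
        return 100.0 * (ipc_better - ipc_worse) / ipc_worse;
    }

    int main(void)
    {
        /* perfect i-cache vs. 16 KB i-cache, both with perfect branch prediction */
        printf("i-cache effect: %.1f%%\n", rel_diff_pct(1.45, 2.39));
        return 0;
    }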
Finally, we evaluate the benefits that can be expected from higher fetch bandwidth. We test a hypothetical fetch mechanism that always returns a fixed number of, possibly wrong-path, instructions per cycle. On a 4-issue processor, a 4-instruction fetch engine with these characteristics improves on a conventional mechanism that stops at the first branch encountered by 5.54%. Also, on the same 4-issue processor, fetching and decoding 8 fixed instructions per cycle improves on fetching 4 fixed instructions by 2.62%, reaching a total IPC of 2.31. In conclusion, our study reveals that a well-dimensioned fetch engine is of great importance for DBMS performance. In particular, an i-cache able to capture the working set of the application is essential. We have also found that the size of the i-cache can clearly bound the improvements obtainable from branch predictors.
To address these trends, we describe a storage system design that uses "intelligent" disks (IDISKs). An IDISK is a hard disk containing an embedded general-purpose processor, tens to hundreds of megabytes of memory, and gigabit per second network links. We analyze the potential performance benefits of an IDISK architecture using analytic models of DSS operations, such as selection and join, which are based on measurements of full-scale DSS behavior.
We find that IDISK outperforms cluster-based and SMP-based systems by up to an order of magnitude, even though these alternate systems possess faster processors and higher aggregate memory capacity. This high performance is attributable to the increased data and computational parallelism possible in an IDISK architecture, and to IDISK's ability to compensate for smaller memory capacity by using multi-pass algorithms that trade off disk I/O bandwidth for memory capacity.
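To give a flavor of such analytic models, the back-of-the-envelope sketch below models a sequential selection (table scan) as limited by either the aggregate streaming bandwidth of the disks or the aggregate filtering rate of the processors attached to them. All parameter values are hypothetical assumptions for illustration; this is not the model or the measurements used in the study.

    /* Illustrative sketch of an analytic model for a sequential selection
       (table scan).  All parameter values are hypothetical. */
    #include <stdio.h>

    /* time to scan a table spread evenly over n_disks drives, where each drive
       streams at disk_mb_s and its embedded processor filters at cpu_mb_s */
    static double scan_time_s(double table_mb, int n_disks,
                              double disk_mb_s, double cpu_mb_s)
    {
        double per_disk_rate = disk_mb_s < cpu_mb_s ? disk_mb_s : cpu_mb_s;
        return table_mb / (n_disks * per_disk_rate);
    }

    int main(void)
    {
        double table_mb = 100.0 * 1024.0;   /* hypothetical 100 GB table */

        /* 64 IDISKs, each streaming 25 MB/s and filtering 20 MB/s on-disk */
        printf("IDISK scan: %.0f s\n", scan_time_s(table_mb, 64, 25.0, 20.0));

        /* SMP fed by the same drives through a shared I/O path capped at 200 MB/s */
        printf("SMP scan  : %.0f s\n", table_mb / 200.0);
        return 0;
    }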
Our purpose is to provide a better understanding of the performance improvements that these ISA extensions bring to multimedia applications. We use AltiVec, introduced by Apple and Motorola in their newest microprocessor, the PowerPC G4. AltiVec operates on 128-bit vectors and supports both integer and single-precision floating-point vector processing.
The eight micro-kernels we use as benchmarks have been extracted from commonly used multimedia applications; we compare the original optimized C kernel with an AltiVec-optimized version. The eight micro-kernels are FIR/IIR filtering, min/max, vertex transformation and projection, normal transformation and normalization, backface culling, Phong lighting, DAXPY, and generic matrix-matrix multiplication.
Various audio filtering and speech compression schemes are based on FIR/IIR filtering, and the Viterbi algorithm used in speech recognition uses the min/max kernel. Mesa, a 3-D graphics library with an API very similar to that of OpenGL, uses 3-D geometry transformations and lighting to render 3-D polygonal scenes; vertex transformation, projection, perspective correction, normal transformation and renormalization, backface culling, and Phong lighting are the critical parts of these geometry transformations. Functions such as DAXPY (the scaled sum of two floating-point arrays) and generic matrix multiplication have also been studied because of their intensive use in numerical algorithms. All these kernels have been hand-tuned using well-known techniques such as loop unrolling and software pipelining.
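As an illustration of the kind of hand vectorization involved, the sketch below shows a scalar AXPY-style loop and a vectorized version using the AltiVec C intrinsics. It is a sketch only, not the authors' code: it assumes 16-byte-aligned arrays whose length is a multiple of four, and is written in single precision since AltiVec has no double-precision support.

    /* Illustrative sketch: a scalar AXPY-style loop and a hand-vectorized
       AltiVec version processing four single-precision elements per iteration. */
    #include <altivec.h>

    /* scalar reference: y[i] = a * x[i] + y[i] */
    void axpy_scalar(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* AltiVec version: vector loads, a fused multiply-add, and a vector store */
    void axpy_altivec(int n, float a, const float *x, float *y)
    {
        vector float va = (vector float){a, a, a, a};
        for (int i = 0; i < n; i += 4) {
            vector float vx = vec_ld(0, &x[i]);   /* load 4 floats from x */
            vector float vy = vec_ld(0, &y[i]);   /* load 4 floats from y */
            vy = vec_madd(va, vx, vy);            /* vy = va * vx + vy    */
            vec_st(vy, 0, &y[i]);                 /* store the result     */
        }
    }

Loop unrolling and software pipelining, as mentioned above, would then be applied on top of this basic vector loop.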
For our performance evaluation, we used tools provided by Apple and Motorola. The AltiVec kernels are not fully written in assembly: a set of C functions that operate on variables instead of registers and map onto AltiVec instructions allows easier programming and fine tuning. We generate an execution trace with the pitsTT6 tool by linking the kernel with the pitsTT6 library and calling its functions to start and stop tracing. The G4 simulator provides detailed statistics on the execution of the TT6 trace. All the results reported come from simulating a G4 processor executing the previously extracted TT6 execution trace.
The speedups obtained with AltiVec for these benchmarks range from two (FIR/IIR) to ten (64x64 generic matrix multiplication). The overall speedup for the 3-D geometry pipeline is 4 when using two dynamic light sources, i.e., two passes through the Phong lighting step, which is the most accelerated part of the 3-D kernel. The use of AltiVec streaming prefetch improves overall performance by a factor of 1.1 to 2.
The memory behaviour of our micro-kernels is characterized by streaming data access. Data are almost never reused, and the first-level cache size has no impact on our micro-kernels' performance above 4 KB. Most of the benchmarks become memory bound when using AltiVec, but the use of prefetch lessens this tendency. We also study the impact of increased memory bandwidth on the final performance of these benchmarks: doubling memory bandwidth at constant latency improves performance by around 20% for most of the benchmarks. For the 3-D benchmarks we also evaluated how data organisation affects the performance improvement obtained from vector processing, and found the same results as Intel reported for its SSE streaming media extensions. For all the benchmarks we explain the reasons for the performance improvement, including cases of more than 4x improvement with 4-element-wide SIMD processing.
To summarize, the floating-point benchmarks we have studied are improved by the AltiVec SIMD extensions by factors of 2 to 10. The impact of prefetch ranges from 1.1x to 2x. Improving main memory bandwidth without lowering latency has a noticeable impact on final performance thanks to streaming, software-directed prefetch.
As these systems grow in size and complexity, their internal interconnect structure becomes a bottleneck. Bus-based structures are currently used in many of these systems; bus-based topologies are well suited to applications that produce moderate internode communication, but fail to scale as the number of nodes increases. Since I/O systems need to transfer large amounts of data, more scalable interconnect structures are needed to sustain the performance of these systems.
In this paper, we evaluate a number of different internal interconnect topologies for high-performance ICDAs. We use actual commercial workloads taken from live customer environments to drive our analysis, and perform event-driven simulation to assess the scalability of different switch technologies and topologies. The target workloads include decision support, data mining, and online transaction processing.
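The skeleton below gives a rough idea of what such an event-driven simulation looks like: events are kept in a time-ordered queue, and processing one event may schedule future ones. It is a minimal sketch with hypothetical event types and latencies, not the simulator used in this work.

    /* Illustrative skeleton of an event-driven simulation of messages crossing
       switch hops; all parameters and event types are hypothetical. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct event {
        double time;             /* simulated time at which the event fires */
        int    hop;              /* switch hop the message has reached      */
        struct event *next;
    } event;

    static event *queue;         /* events kept sorted by ascending time */

    static void schedule(double time, int hop)
    {
        event *e = malloc(sizeof *e), **p = &queue;
        e->time = time;
        e->hop  = hop;
        while (*p && (*p)->time <= time)
            p = &(*p)->next;
        e->next = *p;
        *p = e;
    }

    int main(void)
    {
        const double hop_latency = 1e-6;   /* assumed 1 us per switch hop */
        schedule(0.0, 0);                  /* inject one request          */
        while (queue) {
            event *e = queue;
            queue = e->next;
            printf("t = %.6f s, hop %d\n", e->time, e->hop);
            if (e->hop < 3)                /* forward through 3 hypothetical hops */
                schedule(e->time + hop_latency, e->hop + 1);
            free(e);
        }
        return 0;
    }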
This talk presents a much more detailed comparison of the TPC-B and TPC-C workloads. First, we have applied our scaling methodology (originally used for TPC-B) to TPC-C, and show that TPC-C can also be successfully scaled down for experimentation. Second, we present results based on monitoring the performance of the two workloads on both 21164-based (in-order) and 21264-based (out-of-order) Alpha multiprocessors. Finally, we provide detailed full-system simulation results (including kernel activity) from our SimOS-Alpha environment, for both in-order and out-of-order processor models and with varying memory system parameters.
The comprehensive comparison presented in this talk allows researchers in this area to better understand the differences between TPC-B and TPC-C. Furthermore, these findings make it easier to compare results from studies that use one or the other benchmark, and in some cases allow for an approximate extrapolation of results for one benchmark based on actual results for the other benchmark.
Our presentation summarizes the I/O characteristics of two commercial server workloads, electronic mail and TPC-D-based DSS. Our study is based on the examination of disk traces from full-scale servers: our electronic mail server supports thousands of users, and our DSS system uses a 300 GB scale factor TPC-D database. We begin by describing high-level I/O characteristics, such as request size distribution, request rate distribution, relative frequency of reads and writes, and spatial locality of requests. We continue by discussing how these characteristics vary over the logical storage volumes and physical storage devices in each system, and how they vary over the duration of the traces. For the DSS workload, we relate the logical and physical storage characteristics back to the table, index and log access patterns presented by the queries. To aid in the iterative process of characterizing application behavior, we have developed a general, extensible framework for analyzing I/O traces. We describe this framework, and how it was used to complete our analyses.
Our experience indicates that it is insufficient to summarize the I/O characteristics of these applications with single figures of merit, such as overall average request size. Instead, we must look at different components of the storage system separately, and examine distributions of interesting quantities to gain an accurate understanding of application behavior. This more complete picture allows us to design I/O subsystems to more closely match the needs of these workloads.
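As a minimal illustration of the kind of per-device statistics such a framework computes, the sketch below tallies read/write mix and average request size per device. It is a sketch only: the whitespace-separated trace record format assumed here (timestamp, device number, r/w flag, size in bytes) is hypothetical and not the authors' format.

    /* Illustrative sketch: per-device read/write mix and average request size
       from an I/O trace read on stdin.  The trace format is an assumption. */
    #include <stdio.h>

    #define MAX_DEV 256

    int main(void)
    {
        double ts, bytes[MAX_DEV] = {0};
        long   reads[MAX_DEV] = {0}, writes[MAX_DEV] = {0}, size;
        int    dev;
        char   rw;

        /* each record: <timestamp> <device> <r|w> <size_bytes> */
        while (scanf("%lf %d %c %ld", &ts, &dev, &rw, &size) == 4) {
            (void)ts;                      /* timestamp unused in this summary */
            if (dev < 0 || dev >= MAX_DEV)
                continue;
            bytes[dev] += size;
            if (rw == 'r') reads[dev]++; else writes[dev]++;
        }
        for (dev = 0; dev < MAX_DEV; dev++) {
            long n = reads[dev] + writes[dev];
            if (n == 0)
                continue;
            printf("device %d: %ld requests, %.1f%% reads, avg size %.1f KB\n",
                   dev, n, 100.0 * reads[dev] / n, bytes[dev] / n / 1024.0);
        }
        return 0;
    }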
We characterize our implementation of TPC-W by running it under a full-system simulation environment in a single-tier configuration. Work is in progress to bring up our TPC-W implementation under two different instruction sets (PowerPC and SPARC) running under their respective full-system simulators (SimOS-PPC and SimICS). We discuss the challenges we faced in bringing up a complex, networked, multitier workload implemented in a new language (Java) on a full system simulator. We also present preliminary data characterizing the instruction stream, branch predictability, cache behavior, and multiprocessor data sharing patterns of this new workload.
Once mature, we plan to make our implementation of TPC-W available in the public domain to encourage widespread use of realistic, modern application benchmarks in computer architecture research. Updates will be posted to http://www.ece.wisc.edu/~mikko/tpcw.html.
Using this methodology, we run the Solaris operating system for UltraSPARC II processors, completely unmodified, including firmware. This enables us to run any application program, including large commercial and scientific applications, with confidence that the instruction stream generated will be very close to that ultimately generated on the real hardware. We will discuss our approach to generating workloads for use in simulation, including database applications.
Our most recent design effort is the SPARC64 V processor, an eight-issue, trace-cache-based machine that will ship in 2001 at a frequency of 1 GHz. This is a very aggressive and complex microarchitecture, featuring both control and data speculation, which makes simulation a demanding task. We will discuss our methodology for dealing with speculation, including the compromises made to enable us to run complex operating system code reliably. We will also discuss the functional-correctness and performance-verification methodology used to ensure that performance predictions will be achieved by real hardware.
First, branch prediction was approached using simple static branch predictors, which obtained moderate prediction accuracy. More complex static prediction heuristics and the use of profile feedback information increased the accuracy of these static predictors to 80-90%.
Next, the transistor count increase in modern processors allowed the implementation of dynamic branch predictors. Bimodal branch predictors keep information on the recent behavior of a branch, and predict that it will keep doing what it has done in the recent past. Two-level branch predictors keep extra information on the branch behavior and the outcomes of the previously executed branches, reaching 90-95% prediction accuracy.
Finally, hybrid predictors combine a bimodal and a two-level predictor, exploiting the synergy between the two predictors, reaching higher accuracy at a lower implementation cost.
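The following sketch captures the two dynamic schemes just described. It is illustrative only; the table sizes and hashing are arbitrary choices, not a specific processor's design.

    /* Illustrative sketch: a bimodal table of 2-bit saturating counters and a
       gshare-style two-level predictor that xors the PC with global history. */
    #include <stdint.h>

    #define ENTRIES 4096                /* 4096 two-bit counters            */
    static uint8_t  bimod[ENTRIES];     /* counter >= 2 means predict taken */
    static uint8_t  gshare[ENTRIES];
    static uint32_t history;            /* global branch outcome history    */

    static int predict_bimod(uint32_t pc)
    {
        return bimod[pc % ENTRIES] >= 2;
    }

    static int predict_gshare(uint32_t pc)
    {
        return gshare[(pc ^ history) % ENTRIES] >= 2;
    }

    static void bump(uint8_t *ctr, int taken)
    {
        if (taken  && *ctr < 3) (*ctr)++;   /* saturate at strongly taken     */
        if (!taken && *ctr > 0) (*ctr)--;   /* saturate at strongly not-taken */
    }

    static void train(uint32_t pc, int taken)
    {
        bump(&bimod[pc % ENTRIES], taken);
        bump(&gshare[(pc ^ history) % ENTRIES], taken);
        history = ((history << 1) | (taken != 0)) % ENTRIES;
    }

A hybrid predictor, as described above, would add a third table of counters that chooses between the bimodal and gshare predictions for each branch.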
We have shown before that most branches in a database kernel tend to behave in a fixed way, either always taken or always not taken. This is a characteristic that can be easily exploited by a static branch predictor using profile feedback information.
In this work, we focus on combining a dynamic branch predictor with a static (profile-based) predictor for a database workload. We show how this mixed software/hardware approach can be implemented with any dynamic predictor, even with combined branch predictors, at negligible hardware cost. The combination uses the dynamic predictor only for those branches that exhibit variable behavior, that is, for those branches that cannot be accurately predicted using the profile information.
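Continuing the sketch above, the static-dynamic combination can be pictured as follows. This is a sketch of the idea only, not the authors' implementation, and the hint encoding is a hypothetical example.

    /* Illustrative sketch: profile feedback marks strongly biased branches with
       a static hint; only the remaining, variable branches use (and train) the
       dynamic predictor.  Reuses predict_gshare() and train() from the sketch
       above.  The hint encoding is hypothetical. */
    typedef enum { HINT_TAKEN, HINT_NOT_TAKEN, HINT_DYNAMIC } hint_t;

    static int predict(hint_t hint, uint32_t pc)
    {
        switch (hint) {
        case HINT_TAKEN:     return 1;                  /* profile: always taken     */
        case HINT_NOT_TAKEN: return 0;                  /* profile: always not taken */
        default:             return predict_gshare(pc); /* variable: ask hardware    */
        }
    }

    static void commit_branch(hint_t hint, uint32_t pc, int taken)
    {
        if (hint == HINT_DYNAMIC)
            train(pc, taken);   /* biased branches neither pollute nor compete
                                   for dynamic predictor entries */
    }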
We examine the static-dynamic predictor combination with two different targets: achieving higher prediction accuracy, and minimizing the predictor cost without reducing the current prediction accuracy. We show which combinations work well, and which combinations do not, based on the synergy between the dynamic predictors used and the static predictor.