THIRD WORKSHOP ON COMPUTER ARCHITECTURE EVALUATION USING COMMERCIAL WORKLOADS

Toulouse, France

Sunday, January 9th, 2000

Immediately precedes the Sixth International Symposium on High Performance Computer Architecture (HPCA-6)

Sponsored by the IEEE Computer Society



Organized by:

Russell Clapp, Informix Software

rmc@informix.com

Ashwini Nanda, IBM T. J. Watson Research Center

ashwini@watson.ibm.com

Josep Torrellas, University of Illinois at Urbana-Champaign

torrella@cs.uiuc.edu

Building on the positive feedback enjoyed by the First and Second Workshops on Computer Architecture Evaluation Using Commercial Workloads, this third workshop will again bring together researchers and practitioners in computer architecture and commercial workloads from industry and academia. In the course of one day, we will discuss work in progress that uses commercial workloads to evaluate computer architectures. By discussing this ongoing research, the workshop will expose participants to the characteristics of commercial workload behavior and provide an understanding of how commercial workloads exercise computer systems. We will also discuss the difficulties associated with using commercial workloads to drive new computer architecture designs, and what can be done to overcome them.

The Final Program for the workshop is listed below, with an abstract for each talk. There will be plenty of time for audience participation. A panel with a round-table discussion will be held after the technical presentations. There will be no proceedings for the workshop since we encourage the presentation of work-in-progress and research in early stages. Copies of the foils used by the speakers will be distributed to the attendees.

Final Program

8:00 am - 8:20 am

Registration

8:20 am - 8:30 am

Introductory Comments

8:30 am - 10:00 am

Session 1: Architectural Issues

Fetch Engines and Databases

Carlos Navarro, Alex Ramirez, Josep-L. Larriba-Pey, and Mateo Valero
Departament d'Arquitectura de Computadors
Universitat Politecnica de Catalunya
Several recent studies of the performance of database management systems (DBMSs) on current processors have shown that such workloads behave differently from other codes widely used for computer architecture evaluation [May94]. DBMSs are highly structured codes that perform many procedure calls and contain plenty of control statements to handle all types of error situations [Nav98]. This causes a high level of control-flow activity in the code and larger instruction working sets than in other common integer codes. Recent work has shown that those instruction working sets can be a problem for the instruction cache sizes used in current-generation microprocessors [Ail99]. Given this evidence, it is worth studying how DBMSs exercise the different parts of the fetch engine of current processors, and how different improvements to the fetch engine can help their execution.

Dynamic branch prediction schemes have led to processor designs where instruction supply and execution are clearly decoupled, and it is now common to speak of front-end and back-end engines. A front-end, or fetch engine, is characterized by the speed at which it can deliver instructions and the 'quality' of the instructions it delivers. The speed is directly related to the effective bandwidth and latency of the fetch unit. The 'quality' of the delivered instructions, on the other hand, depends on the accuracy of the branch predictor. For years, researchers have worked to keep instruction fetch latencies low and to obtain higher branch prediction accuracies. Fetch bandwidth, however, has become important as the execution width of processors has grown. This work evaluates the performance effects that several variations of the fetch engine parameters produce on the execution of Decision Support System (DSS) workloads. Our study uses Postgres 6.3 running TPC-D queries 3 and 17 over a 100 MB B-tree-indexed database. We use a user-level simulator of an out-of-order superscalar processor derived from the SimpleScalar v3.0 tool set. First, we look at the performance that can be expected from perfect fetch. We see that a 4-wide issue processor with a perfect fetch engine of the same width can achieve an IPC of 2.66, a performance increase of 59% over a basic configuration with a 32 KB cache and a 4096-entry two-bit counter bimodal predictor.
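
To make the baseline concrete, the sketch below shows the classic bimodal scheme in C: a table of 4096 two-bit saturating counters indexed by low-order PC bits, matching the size quoted above. The function and variable names are illustrative, not taken from the authors' simulator.

    /* Bimodal predictor sketch: 4096 two-bit saturating counters,
     * indexed by low-order PC bits. Counter values 0..3; >= 2 means
     * predict taken. Names are illustrative. */
    #include <stdint.h>

    #define BIMOD_ENTRIES 4096

    static uint8_t bimod[BIMOD_ENTRIES];

    int bimod_predict(uint32_t pc) {
        return bimod[(pc >> 2) % BIMOD_ENTRIES] >= 2;
    }

    void bimod_update(uint32_t pc, int taken) {
        uint8_t *c = &bimod[(pc >> 2) % BIMOD_ENTRIES];
        if (taken && *c < 3) (*c)++;       /* saturate at strongly taken */
        if (!taken && *c > 0) (*c)--;      /* saturate at strongly not taken */
    }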

With the evidence that fetch may help improve the performance of DBMS execution on current superscalars, we analyze the impact of the i-cache and the branch prediction mechanism. On one hand, an undersized i-cache can degrade the performance of our applications significantly. For instance, a perfect i-cache with perfect branch prediction has an IPC of 2.39, while a 16 KB i-cache in the same situation has an IPC of 1.45. On the other hand, we show that the effect of the branch prediction mechanism is bounded by the i-cache size. In particular, increasing the size and quality of the branch predictor has a larger effect on larger i-caches. For instance, with a 16 KB i-cache, the percentage difference between the worst predictor (a 1K-entry two-bit counter gshare) and a perfect branch predictor is 8.5%, while the same branch predictors differ by 22.6% with a perfect i-cache.

Finally, we have evaluated the benefits that can be expected from better fetch bandwidth. We tested a hypothetical fetch mechanism that always returns a fixed number of, possibly wrong-path, instructions per cycle. For example, on a 4-issue processor, a 4-instruction fetch engine with these characteristics improves on a conventional mechanism that stops at the first branch found by 5.54%. Also, on the same 4-issue processor, fetching and decoding 8 fixed instructions per cycle improves on fetching 4 fixed instructions by 2.62%, reaching a total IPC of 2.31. In conclusion, our study reveals that a well-dimensioned fetch engine is of great importance for DBMS performance. In particular, an i-cache able to capture the working set of the application is essential. We have also found that the size of the i-cache can clearly bound the improvements from branch predictors.
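
To illustrate the difference between the two fetch policies compared here, the following sketch counts the cycles each needs to consume a trace of per-instruction branch flags. The trace format is a simplification assumed for illustration; a real fetch engine also breaks on taken branches, cache-line boundaries, and misses.

    /* Toy model of fetch bandwidth: "conventional" fetch stops at the
     * first branch in each cycle, while the hypothetical engine always
     * delivers WIDTH instructions. is_branch[i] is 1 if instruction i
     * is a branch (an assumed, simplified trace format). */
    #define WIDTH 4

    long cycles_stop_at_branch(const char *is_branch, long n) {
        long i = 0, cycles = 0;
        while (i < n) {
            int f;
            for (f = 0; f < WIDTH && i < n; f++)
                if (is_branch[i++])
                    break;           /* a branch ends this fetch cycle */
            cycles++;
        }
        return cycles;
    }

    long cycles_fixed(long n) {
        return (n + WIDTH - 1) / WIDTH;  /* always WIDTH per cycle */
    }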

References

[Ail99] Anastassia Ailamaki, David J. DeWitt, Mark D. Hill, and David A. Wood, 'DBMSs on a Modern Processor: Where Does Time Go?', Proceedings of the 25th International Conference on Very Large Data Bases, Edinburgh, Scotland, UK, pp. 266-277, Morgan Kaufmann, September 1999.

[May94] A. M. Maynard, C. M. Donnelly, and B. R. Olszewski, 'Contrasting Characteristics and Cache Performance of Technical and Multi-User Commercial Workloads', Proceedings of the 6th Intl. Conference on Architectural Support for Programming Languages and Operating Systems, pp. 145-156, 1994.

[Nav98] Carlos Navarro, 'Caracterizacion de la ejecucion de consultas sobre Postgres95' (Characterization of Query Execution on Postgres95), final degree project, Facultat d'Informatica de Barcelona, UPC, June 1998.

Exploiting Disk Intelligence for Decision Support Databases

Kimberly Keeton and David A. Patterson
University of California at Berkeley
The key challenge posed by decision support (DSS) database workloads is the explosive growth of their I/O capacity and computational requirements. Dr. Greg Papadopoulos, the Chief Technical Officer (CTO) of Sun Microsystems, estimates that the I/O capacity and associated processing demands for decision support double every 6 to 12 months [1], a rate supported by the large commercial systems summarized in the Winter Very Large Database (VLDB) surveys [2] [3]. This evidence implies that DSS growth is faster than the growth rate of disk capacity (which doubles every 18 months) and the growth rate of processor performance (which also doubles every 18 months). These phenomenal growth trends require a more scalable I/O system design for these data-intensive services.

To address these trends, we describe a storage system design that uses "intelligent" disks (IDISKs). An IDISK is a hard disk containing an embedded general-purpose processor, tens to hundreds of megabytes of memory, and gigabit per second network links. We analyze the potential performance benefits of an IDISK architecture using analytic models of DSS operations, such as selection and join, which are based on measurements of full-scale DSS behavior.

We find that IDISK outperforms cluster-based and SMP-based systems by up to an order of magnitude, even though these alternate systems possess faster processors and higher aggregate memory capacity. This high performance is attributable to the increased data and computational parallelism possible in an IDISK architecture, and to IDISK's ability to compensate for smaller memory capacity by using multi-pass algorithms that trade off disk I/O bandwidth for memory capacity.

References

[1] G. Papadopoulos. "Moore's Law ain't good enough," Keynote speech at Hot Chips X, August 1998.

[2] R. Winter and K. Auerbach. "Giants walk the earth: the 1997 VLDB survey," Database Programming and Design, volume 10, number 9, pages S2 - S9+, September 1997.

[3] R. Winter and K. Auerbach. "The big time: the 1998 VLDB survey," Database Programming and Design, volume 11, number 8, August 1998.

A Performance Evaluation of Multimedia Kernels Using AltiVec Streaming SIMD Extensions

Julien Sebot
Universite Paris XI, 91405 Orsay Cedex, France
This paper aims to provide an understanding of the performance of multimedia applications that use floating-point computations on recent general-purpose microprocessors with streaming SIMD ISA extensions. We use eight benchmarks to study the impact of these extensions on general application performance and to identify the bottlenecks they may introduce. The introduction of floating-point SIMD instructions in general-purpose microprocessors is quite recent, and there are currently few studies on the subject. Integer SIMD workloads, on the other hand, have already been well studied.

Our purpose is to provide a better understanding of the performance improvements in multimedia applications due to these ISA extensions. We have used AltiVec, introduced by Apple and Motorola in their newest microprocessor, the PowerPC G4. AltiVec works with 128-bit vectors and supports both integer and single-precision floating-point vector processing.

The eight micro-kernels we use as benchmarks have been extracted from commonly used multimedia applications. We compare the original optimized C kernels with AltiVec-optimized ones. The eight micro-kernels are FIR/IIR filtering, min/max, vertex transformation and projection, normal transformation and normalization, backface culling, Phong lighting, DAXPY, and generic matrix-matrix multiplication.

Various audio filtering and speech compression algorithms are based on FIR/IIR filters, and the Viterbi algorithm used in speech recognition uses the min/max kernel. Mesa, a 3-D graphics library with an API very similar to that of OpenGL, uses 3-D geometry transformations and lighting to render 3-D polygonal scenes. Vertex transformation, projection, perspective correction, normal transformation and renormalization, backface culling, and Phong lighting are the critical parts of the geometry transformations. Functions such as DAXPY (the scaled sum of two floating-point arrays) and generic matrix multiplication have been studied as well because of their intensive use in numerical algorithms. All these kernels have been hand-tuned using well-known techniques such as loop unrolling and software pipelining.
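
As a flavor of the kind of kernel being vectorized, here is a DAXPY-style loop in scalar C next to an AltiVec version. Since AltiVec vectors hold four single-precision floats, this is really SAXPY; vec_ld, vec_madd, and vec_st are the standard AltiVec C intrinsics, but the alignment and remainder handling a hand-tuned kernel would need are omitted here.

    /* DAXPY-style kernel (y = a*x + y), scalar and AltiVec versions.
     * Assumes x and y are 16-byte aligned and n is a multiple of 4;
     * a production kernel would special-case the remainder. */
    #include <altivec.h>

    void saxpy_scalar(float a, const float *x, float *y, int n) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    void saxpy_altivec(float a, const float *x, float *y, int n) {
        vector float va = (vector float){a, a, a, a};   /* splat a */
        for (int i = 0; i < n; i += 4) {
            vector float vx = vec_ld(4 * i, x);         /* byte offsets */
            vector float vy = vec_ld(4 * i, y);
            vec_st(vec_madd(vx, va, vy), 4 * i, y);     /* fused a*x + y */
        }
    }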

For our performance evaluation, we used tools provided by Apple and Motorola. The AltiVec kernels are not fully written in assembly: a set of C functions that operate on variables instead of registers and map onto AltiVec instructions allows easier programming and fine-tuning. We generate an execution trace with the pitsTT6 tool by linking the kernel against the pitsTT6 library and calling functions to start and stop tracing. The G4 simulator provides detailed statistics on the execution of the TT6 trace. All the results provided come from the simulation of a G4 processor executing the previously extracted TT6 execution trace.

The speedups obtained with AltiVec for these benchmarks range from two (FIR/IIR) to ten (64x64 generic matrix multiplication). The overall speedup for the 3-D geometry pipeline is 4 when using two dynamic light sources, i.e., two passes through the Phong lighting step, which is the most accelerated part of the 3-D kernel. The use of streaming AltiVec prefetch improves overall performance by a factor of 1.1 to 2.

The memory behavior of our micro-kernels is characterized by streaming data access. Data are almost never reused, and the first-level cache size has no impact on our micro-kernels' performance above 4 KB. Most of the benchmarks become memory bound when using AltiVec, but the use of prefetch lessens this tendency. We also study the impact of increasing memory bandwidth on the final performance of these benchmarks: doubling memory bandwidth at constant latency improves performance by around 20% for most of the benchmarks. For the 3-D benchmarks we also evaluated the importance of data organization on the performance improvement due to vector processing, and found the same results as Intel did for its SSE streaming media extensions. For all the benchmarks we explain the reasons for the performance improvement, including improvements above 4x with 4-element-wide SIMD processing.

To summarize, the floating-point benchmarks we have studied are improved by the use of AltiVec streaming SIMD extensions by factors of 2 to 10. The impact of prefetch ranges from 1.1x to 2x. Improving main-memory bandwidth without lowering latency has a noticeable impact on final performance thanks to streaming, software-directed prefetch.

10:00 am - 10:30 am

Coffee Break

10:30 am - 12:00 pm

Session 2: Architecture and I/O

Hot Entry: A Fine-Granularity Disk Buffering Management Scheme

Qiang Cao, Josep Torrellas, and H. V. Jagadish
Department of Computer Science, University of Illinois at Urbana-Champaign
and Department of EECS, University of Michigan
Disk I/O is recognized as the performance bottleneck for many database applications. Correspondingly, most database algorithms pay careful attention to which pages on disk are buffered in memory. In particular, buffer management has been a topic of considerable study. However, most work in this area assumes that a disk page is the unit of buffering. With the rising size of disk pages, more and more data today fits on a single disk page. The question we seek to address in this paper is whether one can benefit from buffering at the granularity of tuples rather than pages. The intuition is that if only one (or a few) tuples on a disk page are hot, then it is a waste to buffer the entire page rather than just the specific tuples. To this end, we devise a fine-granularity Hot Entry buffer management scheme that can be used in tandem with a standard page-based buffering scheme of choice. We present two basic fine-granularity schemes: Hot Index-Entry buffering and Hot Data-Entry buffering. Through experimental evaluation, we demonstrate the benefit of fine-granularity buffering. In particular, for the TPC-C benchmark, we demonstrate that the number of disk accesses is reduced by up to a factor of 2.
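
The abstract does not spell out the mechanism, but the core idea of consulting a tuple-granularity buffer before the page-level buffer pool can be sketched as follows. The structure layout, hash function, and tuple size are assumptions made purely for illustration.

    /* Hot-entry sketch: before a page-level buffer lookup (and possibly
     * a disk read), probe a small side cache keyed by (page id, slot id)
     * that holds copies of individual hot tuples. All names, sizes, and
     * the hash are illustrative; the authors' actual scheme may differ. */
    #include <stddef.h>
    #include <stdint.h>

    #define HOT_SLOTS 1024

    struct hot_entry {
        uint32_t page_id;
        uint16_t slot_id;
        uint16_t valid;
        char     tuple[128];       /* cached copy of the tuple */
    };

    static struct hot_entry hot[HOT_SLOTS];

    const char *hot_lookup(uint32_t page_id, uint16_t slot_id) {
        struct hot_entry *e = &hot[(page_id * 31u + slot_id) % HOT_SLOTS];
        if (e->valid && e->page_id == page_id && e->slot_id == slot_id)
            return e->tuple;       /* hit: no page access needed */
        return NULL;               /* miss: fall back to the page buffer */
    }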

Experiences with IBM Server Performance

Michael Ignatowski
Server Division, IBM
The field of computer architecture grew up in a time when processors were extremely expensive and the focus of most optimization work. Even though important aspects of the hardware have changed significantly since then, attitudes and priorities are sometimes slow to reflect the current state of affairs. This talk will explore some of the past and current work within the industry on computer architecture for commercial servers, illustrated with examples of issues the industry is now struggling with. Today's work is strongly influenced by the need to better understand changes in future workloads, the new ways people will use servers, and the need to live within development and product cost constraints. The talk will conclude with speculation, ranging from practical to whimsical, about future architecture priorities for commercial servers.

Scalable Cached Disk Arrays

Manpreet Singh, David R. Kaeli, and William Zahavi
Department of Electrical and Computer Engineering, Northeastern University
and EMC Corporation
The importance of high performance storage systems has grown rapidly as industry increasingly depends upon data to manage daily business. Integrated Cached Disk Arrays (ICDAs) are used to house the large databases developed to run today's businesses. Many ICDA systems have been designed to use standardized host interfaces (e.g., SCSI, UltraSCSI, ESCON, and Fibre Channel). These systems include terabytes of storage and large caches, and provide fault tolerance using redundant interconnects, non-volatile RAM, and RAID (Redundant Arrays of Inexpensive Disks). To maximize I/O throughput, write buffering and prefetching algorithms are employed. Currently, a number of manufacturers provide systems which allow multiple hosts to connect to a shared pool of storage, supporting a variety of host adapters.

As these systems grow in size and complexity, the demand placed on their internal interconnect structure becomes a bottleneck. Bus-based structures are currently used in many of these systems. Bus-based topologies are well-suited to applications that produce moderate internode communication, but fail to scale as the number of nodes increases. Since I/O systems need to transfer large amounts of data, more scalable interconnect structures are needed to sustain the performance of these systems as they grow.
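
A back-of-envelope model shows why a shared bus stops scaling: offered traffic grows linearly with the node count while bus bandwidth stays fixed. The numbers below are invented solely to illustrate the saturation effect.

    /* Toy bus-saturation model: per-node offered traffic is fixed, the
     * shared bus has fixed bandwidth, so delivered throughput flattens
     * once aggregate demand exceeds it. All numbers are illustrative. */
    #include <stdio.h>

    int main(void) {
        const double bus_bw = 1600.0;    /* MB/s, shared bus (assumed) */
        const double per_node = 200.0;   /* MB/s per node (assumed) */
        for (int nodes = 2; nodes <= 32; nodes *= 2) {
            double demand = nodes * per_node;
            double served = demand < bus_bw ? demand : bus_bw;
            printf("%2d nodes: demand %5.0f MB/s, served %5.0f MB/s\n",
                   nodes, demand, served);
        }
        return 0;
    }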

In this paper, we evaluate a number of different internal interconnect topologies for high performance ICDAs. We utilize actual commercial workloads taken from live customer environments. We use these workloads to drive our analysis, and perform event-driven simulation to assess the scalability of different switch technologies and topologies. The target workloads include decision support, data mining, and online transaction processing.

12:00 pm - 1:30 pm

Lunch

1:30 pm - 3:00 pm

Session 3: Workload Characterization

A Detailed Comparison of TPC-B Versus TPC-C

Robert Stets, Luiz Andre Barroso, Kourosh Gharachorloo, and Ben Verghese
Western Research Laboratory
Compaq Computer Corporation
A number of studies have used the old TPC-B benchmark instead of the current TPC-C benchmark as a representative workload for online transaction processing (OLTP). The primary reason is that TPC-B has much simpler setup requirements, and therefore lends itself better to monitoring and simulation experiments. In our previous work, we have shown how TPC-B can be scaled down to make it amenable to such experimentation. Furthermore, our previous performance monitoring experiments with TPC-B and TPC-C have shown that the two benchmarks have similar processor and memory system behavior, with TPC-B exhibiting somewhat worse memory system behavior than TPC-C. We have also pointed out that this latter difference may make TPC-B more representative of actual customer database applications, which are widely acknowledged to exhibit poorer performance than TPC-C.

This talk presents a much more detailed comparison of the TPC-B and TPC-C workloads. First, we have applied our scaling methodology (used for TPC-B) to TPC-C, and show that TPC-C can also be successfully scaled down for experimentation. Second, we present results based on monitoring the performance of the two workloads on both 21164-based (in-order) and 21264-based (out-of-order) Alpha multiprocessors. We also provide detailed simulation results, based on full-system simulations (including kernel activity) in our SimOS-Alpha environment. Our simulation results are also presented for both in-order and out-of-order processor models, along with varying memory system parameters.

The comprehensive comparison presented in this talk allows researchers in this area to better understand the differences between TPC-B and TPC-C. Furthermore, these findings make it easier to compare results from studies that use one or the other benchmark, and in some cases allow for an approximate extrapolation of results for one benchmark based on actual results for the other benchmark.

I/O Characterization of Commercial Workloads

Kimberly Keeton, Alistair Veitch, Doug Obal, and John Wilkes
Hewlett-Packard Laboratories
In the last five to ten years, an increasing number of studies have explored the processor and memory system behavior of commercial applications, such as online transaction processing (OLTP) and decision support (DSS) databases, web servers, and multimedia applications. The I/O characteristics of these workloads, however, have been largely ignored. Given the I/O-intensive nature of many commercial server applications, this lack of attention is unfortunate.

Our presentation summarizes the I/O characteristics of two commercial server workloads, electronic mail and TPC-D-based DSS. Our study is based on the examination of disk traces from full-scale servers: our electronic mail server supports thousands of users, and our DSS system uses a 300 GB scale factor TPC-D database. We begin by describing high-level I/O characteristics, such as request size distribution, request rate distribution, relative frequency of reads and writes, and spatial locality of requests. We continue by discussing how these characteristics vary over the logical storage volumes and physical storage devices in each system, and how they vary over the duration of the traces. For the DSS workload, we relate the logical and physical storage characteristics back to the table, index and log access patterns presented by the queries. To aid in the iterative process of characterizing application behavior, we have developed a general, extensible framework for analyzing I/O traces. We describe this framework, and how it was used to complete our analyses.
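
As a flavor of the simplest analysis such a framework performs, the sketch below computes a power-of-two request-size histogram from a disk trace. The record layout is hypothetical; real traces, and the framework described here, carry many more fields.

    /* Request-size distribution from an I/O trace. The io_record layout
     * is a hypothetical stand-in for a real disk-trace format. Bucket b
     * holds requests of size [2^(b-1) KB, 2^b KB); bucket 0 is < 1 KB. */
    #include <stdio.h>

    struct io_record {
        double             time_s;   /* request issue time */
        unsigned           dev;      /* device number */
        unsigned long long offset;   /* byte offset on the device */
        unsigned           size;     /* request size in bytes */
        char               is_write; /* 0 = read, 1 = write */
    };

    #define MAX_BUCKET 10

    void size_histogram(const struct io_record *trace, long n) {
        long hist[MAX_BUCKET + 1] = {0};
        for (long i = 0; i < n; i++) {
            unsigned kb = trace[i].size / 1024;
            int b = 0;
            while (kb > 0 && b < MAX_BUCKET) { kb >>= 1; b++; }
            hist[b]++;
        }
        for (int b = 0; b <= MAX_BUCKET; b++)
            printf("bucket %2d: %ld requests\n", b, hist[b]);
    }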

Our experience indicates that it is insufficient to summarize the I/O characteristics of these applications with single figures of merit, such as overall average request size. Instead, we must look at different components of the storage system separately, and examine distributions of interesting quantities to gain an accurate understanding of application behavior. This more complete picture allows us to design I/O subsystems to more closely match the needs of these workloads.

Characterizing a Java Implementation of TPC-W

Todd Bezenek, Trey Cain, Ross Dickson, Timothy Heil, Milo Martin, Collin McCurdy, Ravi Rajwar, Eric Weglarz, Craig Zilles, and Mikko Lipasti
Department of Electrical and Computer Engineering and Department of Computer Sciences
University of Wisconsin - Madison
The Transaction Processing Performance Council (TPC) is in the process of developing a new benchmark, TPC-W, that attempts to model the application behavior of online web merchant sites. TPC-W consists of a web serving component that models online browsing and purchasing, as well as an online transaction processing (OLTP) component that models order processing and tracking. We implement this benchmark based on the preliminary specification that TPC has made available for public review. The remote browser emulation, web merchant application logic, and web server of our version of TPC-W are implemented completely in Java, using the Jigsaw Java Web Server, the servlet interface for dynamic HTML and forms processing, and relational database connectivity through the standard JDBC (Java Database Connectivity) interface. The OLTP component is implemented using the IBM DB2 relational database management system.

We characterize our implementation of TPC-W by running it under a full-system simulation environment in a single-tier configuration. Work is in progress to bring up our TPC-W implementation under two different instruction sets (PowerPC and SPARC) running under their respective full-system simulators (SimOS-PPC and SimICS). We discuss the challenges we faced in bringing up a complex, networked, multitier workload implemented in a new language (Java) on a full system simulator. We also present preliminary data characterizing the instruction stream, branch predictability, cache behavior, and multiprocessor data sharing patterns of this new workload.

Once mature, we plan to make our implementation of TPC-W available in the public domain to encourage widespread use of realistic, modern application benchmarks in computer architecture research. Updates will be posted to http://www.ece.wisc.edu/~mikko/tpcw.html.

3:00 pm - 3:30 pm

Coffee Break

3:30 pm - 5:00 pm

Session 4: Miscellaneous

A Simulation Environment for Microprocessor Development

Chien Chen, Leslie A. Barnes and John R. Slice
HAL Computer Systems
In this talk, we discuss the simulation methodology used for performance analysis during microprocessor development at HAL Computer Systems. The system development effort is aimed at large-scale commercial and scientific applications, on systems ranging from 8 to 128 processors. Our aim is to develop a simulation methodology capable of generating realistic workloads, enabling microarchitectural tradeoff decisions to be made accurately at the microprocessor level.

Using this methodology, we run the Solaris operating system for UltraSPARC II processors completely unmodified, including firmware. This enables us to run any application program, including large commercial and scientific applications, with confidence that the instruction stream generated will be very close to that ultimately generated on the real hardware. We will discuss our approach to generating workloads for use in simulation, including database applications.

Our most recent design effort is the SPARC64 V processor, an eight-issue, trace-cache-based machine that will ship in 2001 at a frequency of 1 GHz. This is a very aggressive and complex microarchitecture, featuring both control and data speculation, which makes simulation a demanding task. We will discuss our methodology for dealing with speculation, including compromises made to enable us to run complex operating system code reliably. We will also discuss the functional correctness and performance verification methodology used to ensure that performance predictions will be achieved by real hardware.

Combining Static and Dynamic Branch Prediction

Alex Ramirez and Josep-L. Larriba-Pey
Departament d'Arquitectura de Computadors
Universitat Politecnica de Catalunya
Application performance in current superscalar processors is highly dependent on the branch prediction accuracy. In this work, we study branch prediction in commercial workloads.

First, branch prediction was approached using simple static branch predictors, which obtained moderate prediction accuracy. More complex static prediction heuristics and the use of profile feedback information increased the accuracy of these static predictors to 80-90%.

Next, the increase in transistor counts in modern processors made the implementation of dynamic branch predictors feasible. Bimodal branch predictors keep information on the recent behavior of a branch and predict that it will keep doing what it has done in the recent past. Two-level branch predictors keep additional information on the branch behavior and the outcomes of previously executed branches, reaching 90-95% prediction accuracy.
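
The two-level idea can be made concrete with a gshare-style predictor, which XORs a register of recent global branch outcomes into the counter-table index, so the same branch can train different counters in different contexts. Table size and hashing below are illustrative choices, not specific to any processor discussed here.

    /* Two-level (gshare-style) predictor sketch: a global history
     * register is XORed with PC bits to index a table of two-bit
     * saturating counters. Sizes and hashing are illustrative. */
    #include <stdint.h>

    #define GSHARE_ENTRIES 4096
    #define HIST_BITS 12

    static uint8_t  gshare[GSHARE_ENTRIES];  /* counters in 0..3 */
    static uint32_t ghist;                   /* last HIST_BITS outcomes */

    int gshare_predict(uint32_t pc) {
        return gshare[((pc >> 2) ^ ghist) % GSHARE_ENTRIES] >= 2;
    }

    void gshare_update(uint32_t pc, int taken) {
        uint8_t *c = &gshare[((pc >> 2) ^ ghist) % GSHARE_ENTRIES];
        if (taken && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        ghist = ((ghist << 1) | (taken ? 1u : 0u)) & ((1u << HIST_BITS) - 1);
    }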

Finally, hybrid predictors combine a bimodal and a two-level predictor, exploiting the synergy between the two predictors, reaching higher accuracy at a lower implementation cost.

We have shown before that most branches in a database kernel tend to behave in a fixed way, either always taken or always not taken. This is a characteristic that can be easily exploited by a static branch predictor using profile feedback information.

In this work, we focus on combining a dynamic branch predictor with a static (profile-based) predictor for a database workload. We show how this mixed software/hardware approach can be implemented with any dynamic predictor, even with combined branch predictors, at negligible hardware cost. The combination of a static and a dynamic predictor uses the dynamic predictor only for those branches that exhibit variable behavior, that is, for those branches that cannot be accurately predicted using the profile information.
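
A minimal form of this combination is sketched below: a per-branch hint, filled in from profile data, selects between a fixed profile-based prediction and an ordinary dynamic predictor, which then trains only on the variable branches. The hint encoding and the particular dynamic scheme are assumptions for illustration; a real design might carry the hint in the instruction encoding.

    /* Static/dynamic combination sketch. Branches the profile found
     * strongly biased are predicted statically; only variable branches
     * consult and train the dynamic predictor (a plain bimodal table
     * here, standing in for any dynamic scheme). Encoding is assumed. */
    #include <stdint.h>

    #define DYN_ENTRIES 4096
    static uint8_t dyn[DYN_ENTRIES];          /* two-bit counters */

    static int dyn_predict(uint32_t pc) {
        return dyn[(pc >> 2) % DYN_ENTRIES] >= 2;
    }

    static void dyn_update(uint32_t pc, int taken) {
        uint8_t *c = &dyn[(pc >> 2) % DYN_ENTRIES];
        if (taken && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
    }

    struct branch_hint {      /* from profile feedback; encoding assumed */
        uint8_t use_static;   /* 1 if the profile found the branch biased */
        uint8_t taken;        /* the fixed prediction in that case */
    };

    int predict(uint32_t pc, const struct branch_hint *h) {
        return h->use_static ? h->taken : dyn_predict(pc);
    }

    void update(uint32_t pc, const struct branch_hint *h, int taken) {
        if (!h->use_static)
            dyn_update(pc, taken);  /* trains only on variable branches */
    }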

We examine the static-dynamic predictor combination with two different targets: achieving higher prediction accuracy, and minimizing the predictor cost without reducing the current prediction accuracy. We show which combinations work well and which don't, based on the synergy between the dynamic predictors used and the static predictor.

Evaluating Commercial System Architectures: Tools and Issues

Ashwini Nanda
IBM Research
Computer systems, as well as the workloads running on them, are getting more complex. Memory hierarchies are getting deeper, and the sizes of higher-level caches are doubling almost every year. All these factors make it difficult to accurately evaluate architecture alternatives for future systems. No single tool is suitable for evaluating every component of a system. This talk will cover some of our recent experience with tools for evaluating future servers at IBM.

5:00 pm - 6:00 pm

Panel and Round Table Discussion

Topic: What Tools Do We Use to Evaluate Future Memory Systems?

Organizer: Ashwini Nanda
IBM Research