Fourth Workshop on Computer Architecture Evaluation using Commercial Workloads

Monterrey, Mexico

Sunday, January 21st, 2001

Immediately precedes the Seventh International Symposium on High Performance Computer Architecture

Sponsored by the IEEE Computer Society



Organized by:

Russell Clapp, IBM

rclapp@us.ibm.com

Kimberly Keeton, Hewlett-Packard Laboratories

kkeeton@hpl.hp.com

Ashwini Nanda, IBM TJ Watson Research Center

ashwini@watson.ibm.com

Josep Torrellas, University of Illinois at Urbana-Champaign

torrellas@cs.uiuc.edu

Building on the positive feedback enjoyed by the First, Second, and Third Workshops on Computer Architecture Evaluation using Commercial Workloads, this fourth workshop will again bring together researchers and practitioners in computer architecture and commercial workloads from industry and academia. In the course of one day, we will discuss work-in-progress that utilizes commercial workloads for the evaluation of computer architectures. By discussing this ongoing research, the workshop will expose participants to the characteristics of commercial workload behavior and provide an understanding of how commercial workloads exercise computer systems. There will be discussions on the difficulties associated with using commercial workloads to drive new computer architecture designs and what can be done to overcome them.

The Final Program for the workshop is listed below, with an abstract for each talk. There will be plenty of time for audience participation. A panel with a round-table discussion will be held after the technical presentations. There will be no proceedings for the workshop, since we encourage the presentation of work-in-progress and research in its early stages. Copies of the foils used by the speakers will be distributed to the attendees on CD-ROM only. If hardcopy is desired, attendees are encouraged to visit this site one week before the workshop to download an electronic copy of the foils for printing.

 

Final Program

8:00 am - 8:15 am

Registration

8:20 am - 8:30 am

Introductory Comments

8:30 am - 10:00 am

Session 1: Characterizing I/O Behavior

Characterizing Data-Intensive Workloads on Modern Disk Arrays

Guillermo Alvarez, Kimberly Keeton, Erik Riedel, Mustafa Uysal

Hewlett-Packard Laboratories

Characterization of I/O for TPC-C and TPC-H Workloads

Don DeSota

IBM

Iterative Development of an I/O Workload Characterization

Zachary Kurmas, Ralph Becker-Szendy, and Kimberly Keeton

HP-Labs Storage Systems Program

10:00 am - 10:30 am

Coffee Break

10:30 am - 11:30 am

Session 2: Characterization and Modeling of Web-Driven Workloads

An Internet Traffic Generator for Server Architecture Evaluation

Krishna Kant, Vijay Tewari and Ravi Iyer

Intel Corporation

Performance Impact of Multithreaded Java Server Applications

Yue Luo and Lizy K. John

Department of Electrical and Computer Engineering

The University of Texas at Austin

11:30 am - 1:00 pm

Lunch

1:00 pm - 2:30 pm

Session 3: Processor Architecture Evaluation and Simulation

Walking Four Machines By The Shore

Anastassia Ailamaki, David J. DeWitt, and Mark D. Hill

University of Wisconsin-Madison

Evaluation of TPC-H benchmark on Athlon based systems

Mike Clark, Ajaya Durg, Kevin Lienenbrugger, and Lizy John

Electrical and Computer Engineering Department

The University of Texas at Austin

Statistical Simulation of Superscalar Architectures using Commercial Workloads

Lieven Eeckhout and Koen De Bosschere

Department of Electronics and Information Systems (ELIS)

Ghent University

2:30 pm - 3:00 pm

Coffee Break

3:00 pm - 4:00 pm

Session 4: Processor/Memory Interconnect Performance for Database Workloads

Impact of Database Scaling on Realistic DSS Workload Characteristics on SMP Systems

Ramendra K. Sahoo, Krishnan Sugavanam, Ashwini K. Nanda

IBM T.J. Watson Research Center

STiNG Revisited:  Performance of Commercial Database Benchmarks on a CC-NUMA Computer System

Russell M. Clapp

IBM

4:00 pm - 4:15 pm

Short Break

4:15 pm - 5:15 pm

Invited Talk

Analyzing Business Computing Architecture Availability using Commercial Database Benchmark Workloads

Stephen de Glanville

IBM

5:15 pm

Participant Feedback

Closing Remarks

Full Text of Abstracts:


Session 1: Characterizing I/O Behavior

Characterizing Data-Intensive Workloads on Modern Disk Arrays

Guillermo Alvarez, Kimberly Keeton, Erik Riedel, Mustafa Uysal

Hewlett-Packard Laboratories

 

Most large storage systems today use disk arrays to meet capacity, reliability, availability, and performance requirements. Modern disk arrays have grown in sophistication over the years to keep up with growing application demands. One of the biggest recent architectural changes has been the addition of very large array caches: modern high-end disk arrays can contain as much as 64 GB of cache memory [HP]. The presence of big caches, and the policies that disk arrays implement in order to take advantage of them, interact with data-intensive workloads in poorly understood ways. There has been little study of how array caches affect application performance and of which application characteristics can best exploit these caches.

 

We discuss two cache-related array behaviors ("sweet spots"):  prefetch effectiveness and cache effectiveness.  We show how to quantify, for given workloads, the extent to which these behaviors improve or degrade application performance.  This is done by evaluating a set of candidate workload characteristics such as the degree to which the request streams perform sequential accesses (intra-stream and inter-streams), the footprint of the workload, and the temporal burstiness of the request streams.

 

Our evaluation approach starts by running a set of synthetic workloads on array configurations with different values of tunable parameters, in order to identify and isolate the array's sweet spots. We then use the insights gained from the synthetic workloads to propose novel workload metrics that best characterize the workload's performance on modern disk arrays. Finally, we validate our choices by examining the behavior of a set of real workloads: a 300-GB TPC-D decision support database benchmark, a mail server (OpenMail), and a file system for an engineering development environment.

 

Our findings indicate that sweet spots can affect access times by as much as two orders of magnitude. We show that a workload metric that takes into account short forward jumps and the interference from competing (possibly sequential) streams accurately predicts the effectiveness of array-level prefetching for the TPC-D and OpenMail workloads.
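The abstract does not give the metric's exact definition. As a rough illustrative sketch of the underlying idea, the fraction of requests that start within a short forward jump of the previous request's end could be computed as follows (the 64 KB threshold and the function itself are assumptions for illustration, not the authors' metric):

    # Illustrative sketch: fraction of requests that are "near-sequential",
    # i.e., that start within a short forward jump of the previous request's
    # end.  The 64 KB threshold is an assumption, not the authors' value.
    def near_sequential_fraction(requests, max_jump=64 * 1024):
        """requests: list of (offset, size) pairs in arrival order."""
        if len(requests) < 2:
            return 0.0
        hits = 0
        prev_end = requests[0][0] + requests[0][1]
        for offset, size in requests[1:]:
            jump = offset - prev_end      # forward distance from the last request
            if 0 <= jump <= max_jump:     # short forward jump: prefetch-friendly
                hits += 1
            prev_end = offset + size
        return hits / (len(requests) - 1)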

 

References:

 

[HP] The Hewlett-Packard XP 512 Disk Array.

Characterization of I/O for TPC-C and TPC-H Workloads

Don DeSota

IBM NUMA-Q

15450 SW Koll Parkway

Beaverton, Oregon 97006-6063

desotad@us.ibm.com

 

The TPC-C and TPC-H benchmarks have become industry standards for determining system-level performance for On-line Transaction Processing (OLTP) and Decision Support System (DSS) usage models on new systems.  This talk will cover I/O performance data measured on IBM NUMA-Q systems for each of these benchmarks. It will include data for properly sized and tuned TPC-C and TPC-H workloads.  The presentation will include a description of the systems the benchmarks are run on, how the measurements were made, and the measured data.  The data will include both consumed bandwidth and I/Os per second.

 

The TPC-C I/O data covers systems ranging from 4 to 16 500MHz Xeon processors in 1 to 4 quads. The systems include 4GB of memory per quad and a fibre channel disk I/O subsystem.

 

The TPC-H system is a 16-quad system with 64 700MHz Xeon II processors, 2GB of memory per quad, and a fibre channel disk subsystem. The TPC-H database uses a 300GB scale factor.

 

As expected, our results show a higher I/Os-per-second requirement for TPC-C but a higher bandwidth requirement for TPC-H. This is due to the random nature of the I/Os for TPC-C and their associated small block size; TPC-H does more sequential scanning of data with a large block size. Despite the random nature of I/Os for TPC-C, we see good scalability of I/O on the multi-quad NUMA system due to the presence of “multipath” I/O and a fibre channel switch, which largely prevents I/O DMA transfers from traversing the longer-latency second-level interconnect. Also, while TPC-C shows a steady I/O requirement related to system throughput, the TPC-H benchmark shows a highly variable requirement, as throughput fluctuates within a single query, from query to query, and over the course of the throughput test. This results in instances during the test where all practically available I/O bandwidth is consumed by the benchmark; for the system described above, that is an amount in excess of 3GB/s.

Iterative Development of an I/O Workload Characterization

Zachary Kurmas, Ralph Becker-Szendy, and Kimberly Keeton

kurmasz@cc.gatech.edu, ralphbsz@hpl.hp.com,  kkeeton@hpl.hp.com

HP-Labs Storage Systems Program

Introduction

 

Interesting workloads are a key input to experimental I/O system studies. Unfortunately, although a trace of I/O activity completely describes a workload, a trace usually has a very low ratio of information to bits. Our current research is to develop a compact representation of an I/O workload that contains all the "important" information, but does not suffer the disadvantages of a full trace.

 

Such a characterization of an I/O workload has many potential uses. For example, scientists can use it to quickly understand the high-level characteristics of a workload, to quickly compare multiple workloads, or to synthetically generate workloads. This knowledge may also be useful for developing a better method of specifying benchmarks such as TPC (which may help eliminate the problem of applications tuned to a specific workload) and for developing a model that can be used to quickly configure large storage systems.

 

The essence of the problem is not simply characterizing the workload, but ensuring that the characterization contains all of the "important" information. For example, we could characterize a workload by simply giving the mean size of the I/Os; however, such a characterization would not be very useful to most people. From the standpoint of an end user, the most important characteristic of an I/O workload is how quickly it executes on a storage system. Thus, the focus of our research is learning how to characterize a workload in enough detail that we can synthetically generate a workload with the same performance.

 

Methodology

 

In order to develop such a workload characterization, we must learn which attributes affect performance. As a starting point, we consider four basic "attributes" for each I/O request in the workload: size, interarrival time (time since the last I/O), location on disk, and type (read or write). A characterization of a basic attribute describes the distribution of values for that attribute (i.e., the probability that the attribute takes a certain value).

 

Developing our characterization is an iterative process.  We first choose a set of attributes (e.g., Request Size, Request Rate, and Location) and modify our trace analysis tool, Rubicon, to characterize those attributes.  Next, we modify our synthetic workload generation tool, Pylon, to generate a workload with those characteristics.  We then compare the performance of our synthetically generated workload to the performance of the observed workload. We compare performance by comparing the cumulative distribution functions (CDFs) of the I/O latencies for each workload; we use the root-mean-squared metric to quantify the similarity between the CDFs.  Finally, we determine how to further improve our characterization, either by adding attributes, or by adding further detail to the current characterization.
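As an illustrative sketch of the comparison step (the evaluation grid and function names are assumptions, not the authors' implementation), the root-mean-squared difference between two empirical latency CDFs might be computed as follows:

    import numpy as np

    def empirical_cdf(samples, points):
        """Empirical CDF of `samples` evaluated at each value in `points`."""
        s = np.sort(np.asarray(samples))
        return np.searchsorted(s, points, side="right") / len(s)

    def rms_cdf_difference(observed, synthetic, n_points=100):
        # Evaluate both CDFs on a shared grid spanning the combined range;
        # the grid choice is an assumption for illustration.
        lo = min(np.min(observed), np.min(synthetic))
        hi = max(np.max(observed), np.max(synthetic))
        grid = np.linspace(lo, hi, n_points)
        diff = empirical_cdf(observed, grid) - empirical_cdf(synthetic, grid)
        return np.sqrt(np.mean(diff ** 2))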

 

At first it may appear that we can do no better than make an educated guess as to how to begin each iteration. Fortunately, Pylon is designed so that we can isolate the generation of each attribute. Pylon can be configured to generate a sequence of values for an attribute that is either identical to the sequence in the observed workload or chosen randomly from the summarizing distribution. For example, we could configure Pylon so that the sequence of request sizes, interarrival times, and read/write types is identical to that of the observed workload, and only the location of each I/O is generated randomly. This gives us some insight into how the accuracy of our synthesis of location values affects the accuracy of our synthetic workload. (For example, in this case, if the latencies do not change much regardless of the method by which we generate location, we conclude that location does not have a large effect on latency.) By holding different component values "constant" (i.e., identical to the original workload), we gain insight into which attributes have the largest effect on latency, and therefore which attributes we should focus on during the next iteration.

 

Results

 

Our first iteration was to characterize each of the four basic attributes with a simple histogram. Pylon then generated I/Os by drawing a value for each attribute randomly from the distribution specified by those histograms. While this was a good start, the performance of the synthetic workload was not close to that of the observed workload (our root-mean-squared analysis showed a difference of almost 10 percent). The problem with this first iteration is that it assumes that I/Os can be generated from independent random values; in other words, such a generation method assumes there is no correlation either between or within attributes. We know this is not the case. For example, in the OpenMail workload (the only workload we have analyzed thus far; we will analyze other workloads once we have stronger results), 30 percent of the writes were made to the same location on the disk (the index of the mail system), whereas the reads tended to occur at the end of the disk (where the data was stored). Thus, there is a clear correlation between type and location.

 

Our second iteration was to characterize the reads separately from the writes. Thus, there were two histograms of location, one for reads and one for writes. This was successful in some respects, but less so in others. It corrected the differences caused by having 30 percent of the writes issued to the same location, but it did not address the fundamental problem of treating all components as independent random variables.

 

In order to further improve our characterization, we must expand our definition of "attribute" so that it also includes the correlations within and between our four basic attributes. For example, workloads tend to have "runs" of I/Os in which each I/O in the run has a location adjacent to the previous I/O. The distribution of distances between successive I/Os (we call this "jump distance") will therefore be much different from the distribution created by a simple characterization of location as an independent random variable. Our characterization must now include this new attribute, jump distance, which describes the correlation between the locations of successive I/Os.
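A minimal sketch of computing a jump-distance distribution from a sequence of I/O locations (the 4 KB bucketing and the function name are assumptions for illustration):

    from collections import Counter

    def jump_distance_histogram(locations, bucket=4096):
        """Histogram of distances between successive I/O locations.
        `locations` holds each I/O's disk address in arrival order."""
        hist = Counter()
        for prev, curr in zip(locations, locations[1:]):
            jump = curr - prev                # signed: backward jumps are negative
            hist[(jump // bucket) * bucket] += 1
        return hist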

 

Our current task is to discover how best to characterize jump distance. Our first attempt showed that the problem was more complex than we first believed; however, we have developed a method that we believe will work very well. We hope to have this method implemented and tested by Spring.

 

Workload characterization has many potential uses. We are very encouraged by our early results and expect our methodology to allow us to steadily improve the quality of our characterizations.

Session 2: Characterization and Modeling of Web-Driven Workloads

An Internet Traffic Generator for Server Architecture Evaluation

Krishna Kant, Vijay Tewari and Ravi Iyer

Intel Corporation

Beaverton, Oregon, USA

{krishna.kant | vijay.tewari | ravishankar.iyer}@intel.com

 

The work reported here was motivated by the need to understand the implications of web traffic characteristics for the architectural aspects of a web/e-commerce server. This requires generating realistic Internet traffic in a lab environment so that it is possible to take detailed measurements on server systems, including tracing of processor and I/O busses in order to evaluate the impact of varying traffic characteristics on the resources inside the web server. Towards this goal, we have developed a Generator for E-commerce and Internet Server Traffic (GEIST).

 

Generation of synthetic traffic with complex characteristics is often computationally intensive, making it difficult to ensure that a request is actually generated very close to the intended time. This aspect depends on the number of active processes/threads in operation, scheduling algorithms used by the O/S, and the amount of queuing for the CPU and other resources. We address this issue in GEIST by splitting the generation into two steps, referred to as Aggregate Trace Generation and Traffic Generation.

 

The first step (aggregate trace generation) handles all the complexities of computing the actual time and parameters of the requests. There are several important aspects of aggregate traffic that a traffic generator needs to emulate accurately. These include (a) the temporal characteristics of the arrival process, (b) the transactional composition of the traffic, and (c) the nature of and dependencies between objects retrieved by successive requests. We will touch upon how each of these aspects is dealt with in GEIST. This step can be executed off-line and the output placed in a "trace file". The second step (traffic generation) has to read the trace and actually issue the requests. The main advantage of a two-step process is that the traffic generator does not care how the trace was generated; the trace could equally have been derived from HTTP logs from a live site. The difficulty in this step is that the scheduling granularity of the threads responsible for request generation, and other O/S activities over which the programmer has no control, could introduce a significant skew (or slippage) between the intended request time and the actual request time. We have carried out a detailed set of experiments to assess the accuracy and performance of GEIST in terms of slippage. We will present the findings from this slippage study during the talk.
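A minimal sketch of the second (traffic generation) step, assuming a trace of (intended_time, request) pairs sorted by time; the issue_request callback is hypothetical, and slippage is measured as actual minus intended issue time:

    import time

    def replay_trace(trace, issue_request):
        """Replay (intended_time_in_seconds, request) pairs and record slippage.
        `issue_request` is a hypothetical callback that actually sends the
        request (e.g., an HTTP GET)."""
        slippage = []
        start = time.monotonic()
        for intended, request in trace:
            delay = intended - (time.monotonic() - start)
            if delay > 0:
                time.sleep(delay)   # sleep granularity contributes to slippage
            slippage.append((time.monotonic() - start) - intended)
            issue_request(request)
        return slippage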

 

Compared with related work, we believe that GEIST is different from, and generally more powerful than, the other traffic generators we are aware of. Most other generators (e.g., Microsoft's Web Application Stress Tool [1], SPECweb96, SPECweb99 [2]) are based on user emulation and do not support complex temporal behavior. One exception is SURGE [3], which does support self-similar processes by making each "user equivalent" an on-off process. In contrast, GEIST supports asymptotically self-similar, multifractal, and nonstationary arrival processes. GEIST also supports detailed transactional characterization of the traffic. We also believe that GEIST's two-part design makes it much more scalable and extensible than other generators.

 

We have already used GEIST in several performance studies such as overload control for web servers, architectural experiments for proxy servers and secure protocols like SSL. We will present the salient results (pertaining to GEIST) from some of these studies. We will also cover the possible future uses and extensions to GEIST.

 

References

 

[1]   ``Microsoft Web Application Stress Tool'', msdn.microsoft.com/library/periodic/period00/stresstool.htm

[2]   ``An explanation of the SPECweb96/SPECweb99 benchmark'', www.specbench.org/osg/web96  & www.specbench.org/osg/web99.

[3]   P. Barford, and M. Crovella, ``Generating Representative Web Workloads for Network and Server Performance Evaluation'', Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pp. 151-160, July 1998.

 

Performance Impact of Multithreaded Java Server Applications

Yue Luo and Lizy K. John

Laboratory for Computer Architecture

Department of Electrical and Computer Engineering

The University of Texas at Austin

Austin, TX 78712

 

In many commercial server applications, the server needs to support many concurrent client connections. In Unix, it is common to serve several clients with one server thread by using the traditional select() or poll() functions or POSIX asynchronous I/O. The server has a relatively small number of threads while it is able to serve a large number of clients. Such APIs, however, are not available on Java platforms, so the current common programming paradigm in Java is to devote one thread to each client connection. Thus, performance in the presence of a large number of threads is crucial for a commercial Java server that must support multiple clients simultaneously. In this research, we investigate the performance impact of multiple threads and multiple connections in Java applications. We study VolanoMark, an Internet chat server environment, on two of the most popular platforms: Pentium III + Windows and SPARC + Solaris. VolanoMark is characterized by long-lasting connections and high thread counts. It is based on a commercial pure-Java chat server, VolanoChat, which is used by customers in 33 countries.

 

Performance counters are used to study the Pentium III system. We run VolanoMark with the latest Sun JDK 1.1.3 for Win32. Sun's HotSpot server JVM is also studied and compared with the classic JVM. All metrics are collected separately for OS mode and user mode. Shade, a tracing tool from Sun, is used on the SPARC platform. VolanoMark is run with Sun JDK 1.2.2 for Solaris, which includes a green-thread mode JVM. Both native-thread mode and green-thread mode are studied and compared. On both platforms we synchronize our measurements with the client connections: we study only the period when there is client activity and ignore the startup and shutdown of the server.

 

Our preliminary results show that as the number of threads and connections increases, there is increased OS activity in the application. On the Pentium III platform, only 34% of the time is spent in user code at 800 connections, compared with 45% at 20 connections. We also find that the multithreaded program is actually CPU- and I-cache-friendly: we observe lower I-cache miss rates due to the sharing of code between threads, as well as better branch prediction. These result in fewer I-stalls and a lower CPI. Resource stalls increase with the number of threads, so higher performance can be expected from increasing the number of functional units in the processor. The biggest factor affecting the performance of multithreaded applications on Windows NT so far is the overhead caused by synchronization and context switching; we believe the most gain can be derived from eliminating this overhead. The study is ongoing, and we expect results from the SPARC platform to be available before the final version is submitted.

Session 3: Processor Architecture Evaluation and Simulation

Walking Four Machines By The Shore

Anastassia Ailamaki, David J. DeWitt, and Mark D. Hill

University of Wisconsin-Madison

1210 West Dayton Street

Madison, WI 53706

{natassa,dewitt,markhill}@cs.wisc.edu

 

Recent studies have shown that the hardware behavior of database workloads is suboptimal when compared to scientific workloads, and have identified the processor and memory subsystem as the true performance bottlenecks when running decision-support workloads on various commercial DBMSs. Conceptually, all of today's processors follow the same sequence of logical operations when executing a program. Nevertheless, there are internal implementation details that critically affect the processor's performance, and these vary both within and across vendors' products. To accurately assess how variation in processor and memory subsystem design affects DBMS performance, we need to identify the impact of individual microarchitectural parameters on the performance of database management systems.

 

This study compares the behavior of a prototype database system built on top of the Shore storage manager across three different processor design philosophies: the Sun UltraSPARC (using the UltraSPARC-II and UltraSPARC-IIi processors), the Intel P6 (using a PII Xeon), and the Compaq/DEC Alpha (using a 21164A). These processors vary widely in processor and memory subsystem design. The choice of prototype system is pertinent because its hardware behavior was found to be similar to that of commercial database systems when executing decision-support workloads. In order to evaluate the different design decisions and trade-offs in the execution engines and memory subsystems of the above processors, we ran several range selections and decision-support queries on a memory-resident TPC-H dataset. The insights gained indicate that, provided there are no serious hardware implementation concerns, decision-support workloads would exploit the following designs towards higher performance:

 

1. A processor design that employs (a) out-of-order execution to more aggressively overlap stalls, (b) a high-accuracy branch prediction mechanism, and (c) the ability to execute more than one load/store instruction per cycle, and

 

2. A memory hierarchy with (a) non-inclusive (at least for instructions) caches, (b) a large (> 2MB) second-level cache, and (c) a large cache block size (64-128 bytes) without sub-blocking, to exploit spatial locality.

Evaluation of TPC-H benchmark on Athlon based systems

Mike Clark, Ajaya Durg, Kevin Lienenbrugger and Lizy John

Electrical and Computer Engineering Department

The University of Texas at Austin

Austin, TX 78712

 

This paper analyzes the operation of five queries from the TPC-H benchmark using two different AMD Athlon based machines. The queries (1, 3, 6, 8, and 19) were picked to provide a representative sample of the TPC-H benchmark. The experiment had three objectives. One objective was to analyze the effect of the different L2 cache architectures of the two Athlon processors: the classic Athlon contained an external 512KB two-way set-associative inclusive cache running at one-third the processor speed, while the newer Athlon has an internal, full-speed 256KB 16-way set-associative victim cache. Both machines were configured with 256MB of memory and ran Windows 2000 Server and a DB2 database. Another objective was to analyze the resources and microarchitecture of the Athlon when running large database queries. The last objective was to analyze the queries themselves to better understand their behavior. All analysis was done through hardware-based performance monitor counters.

Preliminary results for the instruction cache showed that the shrinking of the L2 cache did not hurt its effectiveness, due to the increase in associativity. This seems like a very good design decision, since the miss rate did not increase and the latency improved by a factor of three. Another interesting result was that two of the queries (3 and 8) showed two distinct phases, so performance results for those two queries are reported in two sections. It was also observed that the processor was halted 60% of the time that a query was running, which is attributed to disk latencies. During the remaining 40% of the time, the processor was stalled half of the time. The largest contributor to this stall component was the load-store unit becoming full. Somewhat more surprising was that the second-largest contributor was the 72-entry reorder buffer: the machine was able to consistently find a lot of work behind stalls, but the buffer was still not deep enough to hide the full latency of the stall. Assuming these stalls are due to memory accesses, good software prefetching or even hardware prefetching may be able to alleviate this problem. The L1 caches and TLBs showed extremely good performance, as did the branch predictor. Some of the queries showed a higher macro-op-to-instruction ratio than would be expected, which implies that the query may be using more of the complex instructions of the x86 instruction set; unfortunately, profiling the code was beyond the scope of this paper. In conclusion, the Athlon microarchitecture seems very capable of performing well on large database queries.

Statistical Simulation of Superscalar Architectures using Commercial Workloads

Lieven Eeckhout and Koen De Bosschere

Department of Electronics and Information Systems (ELIS)

Ghent University

Sint-Pietersnieuwstraat 41

B-9000 Ghent, Belgium

{leeckhou,kdb}@elis.rug.ac.be

 

Trace- and execution-driven simulators are very important in the design of new microarchitectures. They model cycle-by-cycle processor state transitions, which makes them highly accurate in simulating a processor prototype. However, these techniques have serious practical shortcomings for efficiently culling a huge design space in an early design stage. First, due to the increasing number of instructions that must be simulated in order to measure the performance of contemporary real-life applications on a particular processor, simulation time is becoming prohibitive. A second disadvantage, which specifically holds for trace-driven simulation, is that huge traces need to be stored, which is impractical.

 

In practice, researchers have proposed several solutions to shorten simulation time: taking a contiguous part of the trace, analytical modeling and trace sampling. Only recently, statistical simulation [1,2,3] was proposed to speed up the simulation process. In statistical simulation, a statistical profile or a set of statistical program characteristics is extracted from a program execution, e.g., the instruction mix, the distribution of the dependency distance between instructions, etc. This statistical profile is then used to generate a synthetic trace that is subsequently fed into a trace-driven simulator, which will compute the attainable performance for the microarchitecture modeled. Thanks to the statistical nature of the technique, performance characteristics quickly converge to a steady state solution. So, statistical simulation is a useful technique for culling huge design spaces in limited time.
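As an illustrative sketch of the idea (not the authors' implementation), a statistical profile can be reduced to histograms of program characteristics, from which a synthetic trace is drawn at random; here only the instruction mix and dependency distance are profiled, and all names are assumptions:

    import random
    from collections import Counter

    def build_profile(trace):
        """trace: list of (opcode_class, dependency_distance) pairs.
        Returns histograms summarizing the statistical profile."""
        mix = Counter(op for op, _ in trace)
        dep = Counter(d for _, d in trace)
        return mix, dep

    def draw(counter):
        """Draw one value from a histogram, weighted by its counts."""
        values, weights = zip(*counter.items())
        return random.choices(values, weights)[0]

    def synthetic_trace(profile, length):
        mix, dep = profile
        # Each synthetic instruction gets an opcode class and a dependency
        # distance drawn independently from the profiled distributions.
        return [(draw(mix), draw(dep)) for _ in range(length)]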

 

In previous work [1,2,3], researchers used only SPEC benchmarks (www.spec.org) to evaluate the statistical simulation methodology. In this paper, we use a broader spectrum of traces from commercial and scientific workloads:

- the 8 integer control-intensive SPECint95 benchmarks;

- 5 scientific SPECfp95 benchmarks (hydro2d, su2cor, swim, tomcatv, wave5);

- the 8 IBS traces [4], which contain significant amounts of operating system activity;

- 4 multimedia applications from the MediaBench suite [5], namely g721_encode, gs, gsm_encode, mpeg2_encode;

- 4 X graphics benchmarks from the SimpleScalar distribution (www.simplescalar.org), namely DooM, POVRay, Xanim and Quake;

- 2 TPC-D traces generated by tracing Postgres 6.3 running TPC-D queries 2 and 17 over a 100MB Btree-indexed database [6].

All traces include approximately 200 million instructions. The out-of-order architectures that we considered have an issue width of 8 and 16 and an instruction window of 64 and 128 entries, respectively. We used a hybrid branch predictor and two cache configurations: a "small" configuration (8KB DM L1 I-cache, 8KB DM L1 D-cache and 64KB 2WSA L2 cache) and a "large" configuration (32KB DM L1 I-cache, 64KB 2WSA L1 D-cache and 256KB 4WSA L2 cache). To evaluate the performance prediction accuracy of the statistical simulation technique, we calculate the relative error between the IPC of the real trace and the IPC of the synthetic trace generated using the statistical profile of the real trace.

 

We conclude that statistical simulation obtains smaller performance prediction errors for applications with a higher static instruction count. More specifically, the IPC prediction error is no larger than 25% for the IBS traces, the TPC-D traces, two X graphics traces (DooM and Quake) and three SPECint traces (gcc, go and vortex). The other traces have a smaller instruction footprint and have significantly larger performance prediction errors (up to 40%). So, since commercial workloads are known to have a larger instruction footprint, we can conclude that statistical simulation is a fast simulation technique that is useful for commercial workloads.

 

[1]   R. Carl and J. E. Smith. Modeling superscalar processors via statistical simulation. In Workshop on Performance Analysis and Its Impact on Design (PAID), held in conjunction with the 25th Annual International Symposium on Computer Architecture (ISCA-25), June 1998.

[2]   L. Eeckhout, K. De Bosschere and H. Neefs. Performance analysis through synthetic trace generation. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2000), pp. 1-6, April 2000.

[3]   M. Oskin, F. T. Chong, M. Farrens. HLS: Combining statistical and symbolic simulation to guide microprocessor design. In Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA-27), pp. 71-82, June 2000.

[4]   R. Uhlig, D. Nagle, T. Mudge, S. Sechrest and J. Emer. Instruction fetching: coping with code bloat. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA-22), pp. 345-356, June 1995.

[5]   C. Lee, M. Potkonjak and W. H. Mangione-Smith. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In Proceedings of the 30th Annual International Symposium on Microarchitecture (MICRO-30), pp. 330-335, December 1997.

[6]   C. Navarro, A. Ramirez, J.-L. Larriba-Pey and M. Valero. Fetch engines and databases. In Proceedings of the Third Workshop on Computer Architecture Evaluation using Commercial Workloads, held in conjunction with the 6th International Symposium on High-Performance Computer Architecture (HPCA-6), January 2000.

 

Session 4: Processor/Memory Interconnect Performance for Database Workloads

Impact of Database Scaling on Realistic DSS Workload Characteristics on SMP Systems

Ramendra K. Sahoo, Krishnan Sugavanam, Ashwini K. Nanda

IBM T.J. Watson Research Center

Yorktown Heights, NY 10598

rsahoo@us.ibm.com

 

Studying the characteristics of large-scale database systems on shared-memory multiprocessors is considered a difficult task; hence, research has traditionally focused on simulation of scaled-down workloads. It is difficult to come up with a standard scaling plan for large databases representing realistic workloads. The primary scope of this study is to provide a preliminary analysis of database scaling and its effect on overall memory behavior for decision support workloads. All the analyses and results are based on TPC-H (Decision Support) benchmark queries for database sizes ranging from 1GB to 100GB. The study uses the MemorIES board and on-chip counters to analyze query-by-query behavior for realistic TPC-H benchmarks. A quantitative evaluation of the difference between a scaled-down database and a realistic workload is presented. The study covers various system parameters, such as miss ratio and miss rate, and the effect of database size on these parameters. Our initial observation is that the results from a scaled-down workload might characterize database/memory system behavior qualitatively, but not quantitatively, and quantitative characterization is essential for future server design.

STiNG Revisited:  Performance of Commercial Database Benchmarks on a CC-NUMA Computer System

Russell M. Clapp

IBM NUMA-Q

15450 SW Koll Parkway

Beaverton, Oregon 97006-6063

rclapp@us.ibm.com

 

In 1996, the STiNG architecture was introduced at the International Symposium on Computer Architecture. Since that time, it has been implemented with several generations of Pentium Pro processors, memory and I/O control chipsets, and ASICs that comprise the second-level SCI-based interconnect that makes the system CC-NUMA. In this talk, the speaker describes the performance of commercial database benchmarks as viewed from the event counters embedded in the hardware at various places throughout a recent implementation of the STiNG architecture. After a review of the STiNG system architecture, performance data from representative benchmarks for OLTP and Decision Support workloads will be presented. This data includes the rate of events for the workloads (e.g., cache misses, cache-to-cache transfers, invalidations, etc.), the latency to service these events, bandwidth consumption at various datapaths in the system, and other architecture-specific measurements. Where appropriate, comparisons will be made between this data and the original projections made in 1996. We conclude that, despite some deficiencies in the original simulation environment and some unanticipated differences in software behavior on real systems, there is still good overall agreement between the original simulated results and the recently measured results.

Analyzing Business Computing Architecture Availability using Commercial Database Benchmark Workloads

Stephen de Glanville

IBM NUMA-Q

15450 SW Koll Parkway

Beaverton, Oregon 97006-6063

deglanvs@us.ibm.com

The availability of applications to end users, sometimes described as business availability, is often used by both hardware and database vendors to impress potential customers with the robustness of their load-balanced, clustered, or parallel server products. Hardware vendors often brag about 'five-nines' availability (99.999%) without qualifying whether this figure is a simple calculation based on the compounded MTBFs (Mean Time Between Failures) of all system components, or whether it is TOTAL system availability to ALL end users. The latter availability figure is what most customers would like solution vendors to specify, and customers often craft lengthy service and performance level agreements with their suppliers to ensure they can get compensation for business loss caused by system failure.
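For scale, a quick illustrative calculation (not from the talk) of how much annual downtime a given availability figure permits:

    # Downtime allowed per year at a given availability level.
    for availability in (0.999, 0.9999, 0.99999):
        downtime_min = (1 - availability) * 365.25 * 24 * 60
        print(f"{availability:.5f} availability -> {downtime_min:.1f} minutes/year")
    # 99.999% ("five-nines") allows roughly 5.3 minutes of downtime per year.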

This presentation describes techniques, available through the use of commercial database benchmarks, that can be used to quantify the impact of partial or total system failure on a customer's business. The technique allows analysis of what actually happens to end-user transaction characteristics when a system suffers partial or total failure, whereas MTBF figures can only attempt to predict how often a failure might occur. An example of harness technology used in a commercially available ERP (Enterprise Resource Planning) benchmark is given to explain how parameters like TPS (Transactions Per Second) and transaction timings can be captured as a percentage of the figures from an optimal benchmark run, to ascertain the precise impact of any simulated failure on end-user communities. Hooks are built into the benchmark harnesses to simulate application server failure, client-server network-segment failure, database instance failure, and database server failure.

The model is simple and permits extrapolation techniques to be used to calculate the cost of long-term partial or total system outage. Data extracted using these techniques provides valuable information to permit the predictable design of highly available business computing architectures, and can be used to predict how a given architecture will perform should a failure occur in a production system.