SECOND WORKSHOP ON COMPUTER ARCHITECTURE EVALUATION USING COMMERCIAL WORKLOADS

Orlando, Florida

Sunday, January 10th, 1999

Immediately precedes the Fifth International Symposium on High Performance Computer Architecture (HPCA-5)

Sponsored by the IEEE Computer Society

Organized by:

Russell Clapp, Informix Software

rmc@informix.com

Ashwini Nanda, IBM TJ Watson Research Center

ashwini@watson.ibm.com

Josep Torrellas, University of Illinois at Urbana-Champaign

torrella@cs.uiuc.edu

Building on the positive feedback from the First Workshop on Computer Architecture Evaluation using Commercial Workloads, this second workshop will again bring together researchers and practitioners in computer architecture and commercial workloads from industry and academia. In the course of one day, we will discuss work in progress that uses commercial workloads to evaluate computer architectures. By discussing this ongoing research, the workshop will expose participants to the characteristics of commercial workload behavior and provide an understanding of how commercial workloads exercise computer systems. We will also discuss the difficulties associated with using commercial workloads to drive new computer architecture designs and what can be done to overcome them.

The Final Program for the workshop is listed below, followed by an abstract for each talk. A 20-minute time limit will be strictly enforced for each talk, leaving plenty of time for audience participation. A round-table discussion will be held after the technical presentations. There will be no proceedings for the workshop, since we encourage the presentation of work in progress and research in its early stages. Copies of the foils used by the speakers will be distributed to the attendees.

Final Program

8:00 am - 8:15 am

Registration

8:15 am - 8:20 am

Introductory Comments

8:20 am - 10:00 am

Session 1: Invited Talks

Behavior of Benchmarks on Commercial Server Systems - "Tales from the Trenches"

Tony Petrossian and Russell Clapp
Informix Software

Effectiveness of Remote Caches in NUMA Systems

Don DeSota and Ruth Forester
Sequent Computer Systems

Thread-based Cache Analysis of a Modified TPC-C Workload

D N Jayasimha and Akhilesh Kumar
Intel Corporation

10:00 am - 10:20 am

Coffee Break

10:20 am - 12:00 pm

Session 2: Workload Characterization and Analysis

A Trace-driven Analysis of Sharing Behavior in TPC-C

Ravi Iyer+, G. Janakiraman*, Raj Kumar* and Laxmi Bhuyan+
* Hewlett-Packard Laboratories
+ Texas A&M University

Sensitivity Analysis of System Performance Using Synthetic Workloads

Ann Harrold+, David Kaeli* and Zarka Cvetanovic+
+Compaq Computer
*Northeastern University

Detailed Performance Characterization of TPC-D on a 4-CPU Pentium Pro NT Server

Qiang Cao+, Josep Torrellas+, Josep Larriba+, Youjip Won* and Bob Knighten*
+University of Illinois
*Intel Corporation

Instruction-level Performance Characterization of Computational Physics and Multimedia Applications using Performance Counters

Yong Luo, Kirk W. Cameron and Olaf M. Lubeck
Los Alamos National Laboratory

12:00 pm - 1:30 pm

Lunch

1:30 pm - 2:20 pm

Session 3: New Techniques for Increasing Performance

Page Prefetching for Database Workloads

Rabin Sugumar+, Srikanth Srinivasan* and Partha Tirumalai+
+Sun Microsystems
*Duke University

Optimizing Instruction Fetch for Decision Support Workloads

Alex Ramirez, Josep Ll. Larriba-Pey, Carlos Navarro, Xavi Serrano, Josep Torrellas and Mateo Valero
Universitat Politecnica de Catalunya

2:20 pm - 2:30 pm

Short Break

2:30 pm - 3:45 pm

Session 4: Analyzing Web-based and Multimedia Workloads

Web Workload Characterization at a Microarchitectural Level

Ramesh Radhakrishnan and Lizy John
University of Texas, Austin

Performance Impact of Secondary Cache Size on Web Servers

Roy Saharoy
Intel Corporation

A New Framework for Industry Standard Storage Subsystem Benchmarking: The Storage Performance Council

Charles Gimarc
The Storage Performance Council

3:45 pm - 4:00 pm

Coffee Break

4:00 pm - 5:30 pm

Round Table Discussion

Cooperation between industry, academia and government to make commercial applications widely available

Organizer: Laxmi Bhuyan
CISE, National Science Foundation

Abstracts

Session 1: Invited Talks

Behavior of Benchmarks on Commercial Server Systems - "Tales from the Trenches"

Tony Petrossian and Russell Clapp
Informix Software
A fair amount of work has been done in recent years to characterize the behavior of commercial workloads using TPC benchmarks. However, many issues that arise when executing a TPC benchmark suggest that the benefits of this research have yet to be applied to commercially available systems. Furthermore, there are other examples of deficient system behavior, traceable to system software and tools, of which computer architects should be aware. Viewing the system and its problems as a whole should enable hardware architects to identify new solutions that make processors, caches, I/O devices, and interconnects more "software friendly."

One example where the hardware can better interact with software is in the area of processor caches. While it is well known that higher levels of associativity in processor caches greatly benefit the performance of commercial applications, systems with direct-mapped caches are still being shipped today. Beyond the performance loss, such a cache organization also contributes to a second-order effect: the inability to get consistent performance results between benchmark runs. This is caused by a page allocation scheme that makes no attempt to prevent address conflicts in the processor cache. While research papers have discussed the merits of conflict-avoiding allocation schemes, commercial systems vendors often do not provide them. Other examples of such "non-holistic" computer systems design will be described in our talk, including compiler optimizations that cannot be applied and the penalty for not supporting memory "pinning."
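
To illustrate the second-order effect concretely, the following sketch (with assumed parameters: a 2MB direct-mapped cache and 8KB pages; neither figure is from the talk) shows how the physical pages an OS hands out determine which cache indices a buffer occupies, and why a random allocator gives run-to-run variation that a page-coloring allocator avoids.

    # Sketch: how page allocation interacts with a direct-mapped cache.
    # Assumed parameters: 2MB direct-mapped cache, 8KB physical pages.
    import random

    CACHE_SIZE = 2 * 1024 * 1024
    PAGE_SIZE = 8 * 1024
    COLORS = CACHE_SIZE // PAGE_SIZE   # distinct cache "colors" a page can map to

    def colors_used(pages):
        """Cache colors covered by a set of physical page numbers."""
        return {p % COLORS for p in pages}

    random.seed(1)
    # Random allocation: two runs of the same benchmark get different pages,
    # so hot pages collide in the cache in one run but not in the other.
    run_a = colors_used(random.sample(range(1 << 20), 64))
    run_b = colors_used(random.sample(range(1 << 20), 64))
    print(len(run_a), len(run_b))      # typically unequal: inconsistent results

    # Page coloring: hand out pages round-robin by color, so 64 hot pages
    # always occupy 64 distinct indices and behavior repeats run to run.
    print(len(colors_used(range(64)))) # always 64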

In this talk we discuss several problems that can be characterized using TPC benchmarks and how these problems impact commercial workloads. The discussion will cover real-world problems with specific hardware features as well as deficiencies with system software and tools. We will also identify some key aspects of benchmark behavior and discuss what implications they have on the requirements for new commercial systems architectures. This discussion will include the I/O subsystem as well, which has typically received less attention than the primary processor-to-memory interconnects.

Effectiveness of Remote Caches in NUMA Systems

Don DeSota and Ruth Forester
Sequent Computer Systems
Large-scale NUMA systems require a high-bandwidth interconnect to handle the traffic for non-local references. The bandwidth and latency of the interconnect can be the limiting factor in system performance, and the complexity and technology required to design an interconnect that does not limit performance can be the critical path in getting the system to market. A cache on each node for remote data can reduce the bandwidth required of the interconnect and help a system with a longer remote latency perform at the level of a system with a lower one. This can be used to reduce the complexity of the interconnect, allowing the system to be designed faster or with fewer resources.

Our talk will cover performance data gathered from Sequent Pentium Pro and Xeon NUMA systems running commercial databases and workloads. Data has been collected on both large systems running a NUMA-optimized OS and smaller systems running a non-optimized OS, and includes measurements of two different databases under TPC-C, TPC-D and customer workloads. We will give a brief overview of the system architecture and a table showing the system size, the OS, the database used, and the benchmark. The OS and databases will not be named explicitly, but we will indicate where they differ.

We will show the hit and reference rates for the various workloads. The data, measured on actual systems, will include remote reference rates, L2 hit rates, and remote cache hit rates. The reduction in off-quad traffic from using the remote cache will be demonstrated with remote cache hit rates and bus bandwidths, and equated to the reduction in interconnect bandwidth that the remote cache provides.

We will also present data generated from our performance model comparing performance with and without a remote cache for some of these benchmarks. The model will be run for 10:1 and 2:1 remote-to-local latency ratios. We will show the remote-to-local ratio that, with no remote cache, is needed to match the performance of a system with a 10:1 ratio and a remote cache. The performance at a 2:1 ratio will be shown with and without a remote cache, to demonstrate that even with a very good ratio a remote cache can still improve performance. Lastly, our data will show what ratio is required with a remote cache to match the performance of a system with a 2:1 ratio and no remote cache.
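
To make the latency-ratio discussion concrete, here is a minimal back-of-the-envelope model of average miss latency with and without a remote cache. All rates and latencies below are illustrative assumptions, not Sequent measurements.

    # Sketch: average latency of a cache miss under a simple NUMA model.
    # All numbers are illustrative assumptions, not measured data.

    def avg_miss_latency(local_lat, ratio, remote_frac, rc_hit=0.0):
        """local_lat: local memory latency (cycles); ratio: remote/local;
        remote_frac: fraction of misses that go to remote memory;
        rc_hit: fraction of remote misses caught by the node's remote cache
        (a remote-cache hit is assumed to cost about a local access)."""
        remote_lat = ratio * local_lat
        remote_cost = rc_hit * local_lat + (1 - rc_hit) * remote_lat
        return (1 - remote_frac) * local_lat + remote_frac * remote_cost

    LOCAL = 100                        # cycles, assumed
    for ratio in (10, 2):
        no_rc = avg_miss_latency(LOCAL, ratio, remote_frac=0.4)
        rc = avg_miss_latency(LOCAL, ratio, remote_frac=0.4, rc_hit=0.7)
        print(f"{ratio}:1  no remote cache: {no_rc:.0f}  with: {rc:.0f}")
    # Even at 2:1, a 70% remote-cache hit rate lowers average miss latency,
    # echoing the point that a remote cache helps even with a good ratio.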

We will finish with our conclusions on the effectiveness of remote caches.

Thread-based Cache Analysis of a Modified TPC-C Workload

D N Jayasimha and Akhilesh Kumar
Intel Corporation
TPC-C is a multi-threaded, scalable, on-line transaction processing database benchmark. The number of warehouses and the number of clients are two key benchmark parameters. Our study is based on the collection of x86 instruction traces and their analysis using a multi-level cache simulator. A distinguishing feature of the study is that the cache analysis is based on the interaction, both constructive and destructive, among the application threads (each client spawns a thread). A trace is first annotated to identify all the unique threads and each instance of a thread. The simulator can then separate the hit and miss characteristics of a single instance of a thread, of different instances of the same thread, and of instances of different threads. Such an analysis offers insight into key platform architecture questions, among them:
  1. How significant is inter- and intra-thread interference due to private data and does it increase with clients?
  2. How significant is the sharing of data and instructions among threads?
  3. How do the misses per instruction change with performance for a fixed cache size?
  4. How does cache hot spot behavior change with associativity?
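
A minimal sketch of the thread-aware bookkeeping such an analysis requires follows; the trace format and hit categories are our assumptions for illustration, not the actual Intel simulator. Each cache line remembers the thread and thread instance that touched it last, so every reuse can be credited to the same instance, another instance of the same thread, or a different thread.

    # Sketch: crediting cache reuse to thread interactions. Assumed trace
    # record: (thread_id, instance_id, address); evictions are ignored for
    # brevity, so this models reuse in an unbounded cache.
    from collections import Counter

    LINE = 64                                  # line size in bytes, assumed

    def classify(trace):
        last = {}                              # line -> (thread, instance)
        stats = Counter()
        for thread, instance, addr in trace:
            line = addr // LINE
            prev = last.get(line)
            if prev is None:
                stats["cold miss"] += 1
            elif prev == (thread, instance):
                stats["same-instance reuse"] += 1
            elif prev[0] == thread:
                stats["intra-thread (other instance)"] += 1
            else:
                stats["inter-thread (sharing/interference)"] += 1
            last[line] = (thread, instance)
        return stats

    trace = [(0, 0, 0x1000), (0, 0, 0x1008),   # reuse within one instance
             (0, 1, 0x1010), (1, 0, 0x1000)]   # intra- and inter-thread
    print(classify(trace))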

Session 2: Workload Characterization and Analysis

A Trace-driven Analysis of Sharing Behavior in TPC-C

Ravi Iyer+, G. Janakiraman*, Raj Kumar* and Laxmi Bhuyan+
* Hewlett-Packard Laboratories
+ Texas A&M University
Commercial workloads such as OLTP, DSS and Web applications comprise the largest share of the high-end server market. To succeed in this market, high-end multiprocessor servers must be designed to deliver the highest level of performance on commercial workloads, and such aggressive designs are possible only with a thorough understanding of how these workloads behave. We have examined the memory sharing behavior of an OLTP workload (the TPC-C benchmark) in detail to gain insight into the effectiveness of system protocols and to identify potential improvements to the memory system design. We briefly describe our analysis methodology and present salient results from the analysis.

Our analysis abstracts the lifetime of a memory block as a sequence of intervals, where each interval is a run of read accesses to the block bounded by write accesses to the block. Intervals have several attributes including the identity of the processor writing the block at the start of the interval, the identity of processors that participate in the interval, the identity of the processor writing the block at the end of the interval, the number of read accesses and the duration of the interval. A block is characterized by its interval behavior: the number of intervals and the interval attributes.
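
A minimal sketch of this interval abstraction, assuming a trace of (processor, operation, block) records: each write closes the block's current interval and opens a new one, recording the producer, the participating readers, and the read count.

    # Sketch: splitting a block's lifetime into write-bounded intervals.
    # Assumed trace record: (proc, op, block) with op in {"R", "W"}.

    def intervals(trace):
        open_iv = {}   # block -> {"producer", "readers", "reads"}
        closed = []    # (block, producer, readers, reads, closing writer)
        for proc, op, block in trace:
            iv = open_iv.get(block)
            if op == "R":
                if iv is not None:     # reads before any write are skipped
                    iv["readers"].add(proc)
                    iv["reads"] += 1
            else:                      # a write ends one interval, opens next
                if iv is not None:
                    closed.append((block, iv["producer"],
                                   iv["readers"], iv["reads"], proc))
                open_iv[block] = {"producer": proc, "readers": set(), "reads": 0}
        return closed

    trace = [(0, "W", 7), (1, "R", 7), (2, "R", 7), (1, "W", 7), (3, "R", 7)]
    print(intervals(trace))
    # -> block 7: produced by P0, read by {P1, P2} (2 reads), closed by P1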

We examine the interval behavior of blocks in TPC-C at the interface between the processor caches and main memory using a trace of TPC-C running on Oracle captured on a 4-way multiprocessor system. We analyze interval counts and characteristics and producer-consumer access distributions of blocks (in addition to overall characteristics such as load/store frequencies, miss frequencies and miss types). We describe how the interval behavior varies across different blocks as well as how interval attributes vary across the intervals of a single memory block. Our analysis shows that a small fraction of the blocks have a large number of intervals and account for a majority of the misses. Also, interval attributes are consistent across the intervals of a single memory block. We explore the nature of producer-consumer interactions in the interval behavior of these blocks and identify the dominant memory sharing patterns in the workload. These workload characteristics can suggest potential improvements to the memory system for future high-end multiprocessor servers.

Sensitivity Analysis of System Performance Using Synthetic Workloads

Ann Harrold+, David Kaeli* and Zarka Cvetanovic+
+Compaq Computer
*Northeastern University
Commercial workloads are inherently difficult to set up and evaluate. Much of this stems from the underlying database, which must be installed, configured, and tuned in order to evaluate commercial workload benchmarks (e.g., TPC-C and TPC-D). To study performance using these workloads, one must first become an expert in database tuning (as well as in the analysis of the underlying operating system). Since access to source code for these database systems is seldom available, it is not possible to obtain a thorough understanding of the workload and its associated behavior characteristics.

In addition to their complexity, benchmarks such as these do not allow the user to see how a system behaves as the workload changes. Typical benchmarks provide only one point on the cost/performance curve, and cannot show us what happens when certain workload parameters change, such as the amount of data sharing, the working-set size, or the number and size of I/O accesses.

We have recently begun to use the AIM benchmark as an alternative for studying system performance and, perhaps more importantly, as a tool that can provide ``workload knobs'' for investigating how varying the workload affects system performance.

Our approach to date has been to characterize TPC workloads on a single SMP system using various monitoring tools (e.g., DCPI and IPROBE) that capture CPU, system bus, and I/O activity. We use processor performance counters to collect statistics on stall and issue times, cache misses, replay traps, TLB misses, and instruction frequencies, and bus monitoring counters to collect statistics on bus utilization, the types of bus commands, and the amount of data sharing. We then try to achieve similar characteristics by tuning the AIM workload, both at the ``workload level'' and at the source-code level. So far our approach has been successful: we are able to synthetically mimic the behavior of TPC-C on our SMP system.
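
In its simplest form, the ``workload knob'' idea is a feedback loop: measure a target statistic on the real workload, then adjust a synthetic parameter until the synthetic run matches it. The toy sketch below illustrates the loop; the knob, the measurement stand-in, and the tolerance are all hypothetical, not part of AIM.

    # Sketch: tuning one synthetic-workload knob to match a measured statistic.
    # run_synthetic() is a hypothetical stand-in for running the synthetic
    # workload with a given working-set size and reading a miss counter.

    def run_synthetic(working_set_kb):
        # Toy model: misses per instruction grow with working-set size.
        return min(0.05, working_set_kb / 100_000.0)

    def tune_knob(target_mpi, lo=64, hi=65536, tol=1e-4):
        """Bisect the working-set knob until synthetic MPI matches the target."""
        while hi - lo > 1:
            mid = (lo + hi) // 2
            mpi = run_synthetic(mid)
            if mpi < target_mpi - tol:
                lo = mid               # too few misses: grow the working set
            elif mpi > target_mpi + tol:
                hi = mid               # too many misses: shrink it
            else:
                return mid
        return lo

    print(tune_knob(target_mpi=0.02))  # knob value that mimics the real workload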

The next step in our work (which will be presented at the workshop) will be to address synthetic workload scaling in order to match actual workload characteristics while scaling TPC benchmarks. This includes some definition of a ``transaction'' using AIM. Our eventual goal is to be able to provide knobs within AIM to allow for fine-grain system performance analysis across a variety of system configurations.

This talk will present our methodology to synthesize commercial workloads. We will also provide quantitative results on how close we have been able to mimic the behavior of a number of benchmark workloads.

Detailed Performance Characterization of TPC-D on a 4-CPU Pentium Pro NT Server Running Microsoft SQL

Qiang Cao+, Josep Torrellas+, Josep Larriba+, Youjip Won* and Bob Knighten*
+University of Illinois
*Intel Corporation
Hardware counters in the processor and software instrumentation in the database kernel are two powerful tools for gaining insight into the performance of database workloads. In this paper, we use both to characterize the architectural behavior of TPC-D running on a 4-processor Pentium Pro NT Server with Microsoft SQL. In an extensive measurement-based effort, we examine, for each individual query, the CPI, the L1 and L2 cache behavior, the effectiveness of branch prediction and speculative execution, and the I/O activity.

Our results reveal several interesting issues. The amount of I/O performed in a query, together with the size of each transfer, is still a significant factor in execution time. L2 data misses are also an important factor in determining the CPI of the queries. Furthermore, there is a correlation between the frequency of branch instructions and CPI, suggesting that branches contribute significantly to the CPI.
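
These observations fit the usual counter-based CPI decomposition; the sketch below uses invented numbers purely for illustration, not measurements from the study.

    # Sketch: first-order CPI decomposition from counter-style inputs.
    # All numbers are invented for illustration, not TPC-D measurements.

    def cpi_breakdown(core_cpi, l2_mpi, l2_penalty, br_freq, mispred, penalty):
        """CPI ~= core CPI + L2-miss stall CPI + branch-misprediction CPI."""
        mem = l2_mpi * l2_penalty
        br = br_freq * mispred * penalty
        return core_cpi + mem + br, mem, br

    total, mem, br = cpi_breakdown(core_cpi=1.0,   # assumed no-stall CPI
                                   l2_mpi=0.02,    # L2 misses per instruction
                                   l2_penalty=60,  # cycles to memory
                                   br_freq=0.20,   # branches per instruction
                                   mispred=0.08,
                                   penalty=12)
    print(f"CPI={total:.2f}  L2 term={mem:.2f}  branch term={br:.2f}")
    # With these inputs the L2 term dominates, consistent with L2 data
    # misses being an important factor in the CPI of the queries.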

Overall, we find that, while individual queries exhibit different characteristics, TPC-D queries as a whole tend to have lower L1 and L2 miss rates and branch misprediction rates than TPC-C. Finally, we study the scalability of TPC-D as we move from a 1 Gbyte to a 10 Gbyte database, as well as the effects of different hardware parameters and various data layouts.

Instruction-level Performance Characterization of Computational Physics and Multimedia Applications using Performance Counters

Yong Luo, Kirk W. Cameron and Olaf M. Lubeck
Los Alamos National Laboratory
Workload characterization has proven to be an essential tool for architecture design and performance evaluation in both scientific and commercial computing. Comparing scientific and multimedia application characteristics at the instruction level is an acknowledged challenge. While some work has been done in this area, no characterization method has attempted to model the inner workings of queues within the microprocessor in order to address the processor utilization efficiency of a particular application. Traditional workload characterization techniques include FLOPS rates, cache miss ratios, and CPI (cycles per instruction) or IPC (instructions per cycle). Given the complexity of sophisticated, modern superscalar microprocessors, these traditional techniques are not powerful enough to pinpoint the performance bottleneck of an application on a specific microprocessor. They are also incapable of directly predicting the potential performance benefit of an architectural or functional improvement in a new processor design. New queue-oriented instruction-level characterizations promise to alleviate the need for tedious simulations to provide details of architectural influence on large-scale scientific and commercial computing applications. This talk applies a workload characterization technique previously published by Los Alamos National Laboratory (LANL) to the instruction-level comparison of scientific and multimedia applications using hardware performance counters.

A variety of instruction counts are collected from LANL computational physics benchmarks and commercial multimedia applications. These instruction counts can be used to form a set of abstract characteristic parameters directly related to a processor's architectural features. Based on microprocessor architectural constraints and these calculated abstract parameters, the architectural performance bottleneck for a specific application can be estimated. In particular, the analysis can provide some insight into why only a small percentage of processor peak performance is achieved even by many cache-friendly codes. Meanwhile, the bottleneck estimation can suggest viable architectural or functional improvements for certain workloads. Herein lies the difference between previous comparisons of commercial vs. scientific workloads and our work: this method provides evidence of architectural differences among applications across platforms using a straightforward analysis technique based on a new method of workload characterization.
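
As a hedged illustration of what such abstract parameters might look like (the formulation below is our guess at the flavor of the technique, not the published LANL definition), instruction-mix counts can be combined with per-unit issue limits to bound attainable IPC; the gap between the bound and measured IPC is then attributable to stalls.

    # Sketch: bounding attainable IPC from an instruction mix and per-class
    # issue limits. The parameters are illustrative, not the LANL formulation.

    counts = {"fp": 40e9, "int": 30e9, "mem": 25e9, "branch": 5e9}  # assumed
    peak_per_cycle = {"fp": 2, "int": 2, "mem": 1, "branch": 1}     # assumed caps

    total = sum(counts.values())
    # Each class needs at least counts/peak cycles; the largest such value is
    # a lower bound on cycles, hence an upper bound on IPC for this mix.
    cycles_lb = max(counts[c] / peak_per_cycle[c] for c in counts)
    print(f"IPC upper bound for this mix: {total / cycles_lb:.2f}")
    # The gap between this bound and measured IPC indicates how much
    # performance is lost to stalls (queues, memory) rather than to the mix.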

This talk describes a three-tiered analysis that applies the aforementioned characterization technique to four LANL computational physics applications (SWEEP3D, HEAT, HYDRO, and HYDRO-T) and several commercial multimedia applications (RealPlayer, an MPEG decoder, and some commonly used DSP kernels). All instruction-level characteristic data for these applications are collected on a MIPS R10000-based SGI Octane workstation using the built-in performance counters. A set of abstract parameters is derived for each application to relate its characteristics to key architectural features such as pipeline queues, functional units, branch prediction, and the memory hierarchy. The comparison attempts to explain, for a particular architecture, the performance differences attributable to the architectural features exercised by scientific and commercial applications. This is the greatest advantage of the new characterization technique: a better understanding of processor utilization efficiency and of each application's performance bottleneck, obtained by pinpointing the architectural constraints that actually cause the CPU to stall. Another contribution is a refined prediction of the application-specific performance benefit of future architectural enhancements. Overall, the comparison between scientific and multimedia applications based on this instruction-level characterization provides a picture of microprocessor utilization and efficiency across diverse applications for a particular architecture.

Session 3: New Techniques for Increasing Performance

Page Prefetching for Database Workloads

Rabin Sugumar+, Srikanth Srinivasan* and Partha Tirumalai+
+Sun Microsystems
*Duke University
Introduction

It is well known that cache misses contribute a significant fraction of the CPI of database workloads [1]. However, little published work exists on the underlying constructs responsible for the misses. This is in contrast to scientific/numerical benchmarks, which also miss heavily in the cache but whose miss-causing code constructs are well understood. In this work we analyzed a trace of an OLTP run to identify constructs that contribute a large fraction of the data cache misses. We find that a majority of data cache misses arise from accesses to database pages, which store data and index records. Further, we find that accesses to any individual page are clustered in time and that the intent to access a page is known in advance of the actual access. Based on these findings we propose a 'page-buffer', coupled with a special 'page prefetch' instruction, to reduce the cache miss penalty of database workloads.

Methodology

Our study was done on a single-CPU, user-only trace collected using Shade [2] from a commercial database engine running the TPC-C benchmark. The trace consists of 1 billion instructions and was collected on a server-only TPC-C setup with 125 warehouses. The physical memory on the system was 1 GB, of which about 800 MB was allocated to the shared address space. The trace does not include kernel instructions, and since it is a single-CPU trace it cannot be used to study coherence misses. However, since we focus on constructs that miss in single-CPU, user-only code, these will also miss in multi-CPU, user+kernel code.

Page Accesses

Database engines store data records in physical memory and on disk in the form of pages [3]. Each page holds several data or index records. The page size varies but is typically in the range of 2KB to 8KB (the run we traced used a 2KB page size). The amount of data that databases maintain in pages is huge (gigabytes), so accesses to pages tend to cause compulsory or capacity misses. As cache sizes increase, misses on other, smaller structures gradually disappear, leaving misses on page accesses as a dominant fraction of total misses. The table below illustrates this.

External cache size    External cache misses due to database page accesses
2MB, 1-way             23.17%
4MB, 1-way             29.64%
8MB, 1-way             36.10%
16MB, 1-way            46.73%
16MB, 2-way            58.12%

We also find that accesses to individual pages are clustered in time, while repeat accesses to the same page are spread far apart. That is, page accesses have good spatial locality but low temporal locality. Furthermore, we find that there are only a few locations in the application where page accesses start.

Page Prefetch

To reduce the impact of cache misses on page accesses, we propose a small fully associative buffer on the CPU chip adjacent to the level-1 data cache -- a page-buffer. The page-buffer has space to store 'n' complete database pages, where 'n' is typically a small number (probably <= 4). We also propose a new flavor of prefetch instruction to load pages into this page-buffer, typically well before first use. All load instructions check the page-buffer along with the first-level cache; if the requested data is found in the page-buffer, it is returned from the page-buffer to the pipeline.

The compiler identifies locations in the code where a series of accesses to a particular database page will start, either through profile feedback or through 'pragmas' inserted in the source code by developers. In the trace we analyzed, only two such locations exist. The compiler then inserts the new page prefetch instruction at these locations to bring the page about to be accessed into the page-buffer. During execution, when the page prefetch instruction executes, the entire page is fetched into the page-buffer. Subsequent loads that access page data find it in the page-buffer, and the data is returned to the pipeline at a significantly lower latency.
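
A minimal simulation sketch of the proposed page-buffer follows; the slot count, 2KB page size, and FIFO replacement are assumptions for illustration (the abstract does not fix a replacement policy).

    # Sketch: a small fully associative page-buffer checked alongside L1.
    # Assumed: 4 slots, 2KB pages, FIFO replacement.
    from collections import deque

    PAGE = 2048

    class PageBuffer:
        def __init__(self, slots=4):
            self.slots = slots
            self.pages = deque()            # FIFO of resident page numbers

        def prefetch(self, addr):
            """The proposed page-prefetch instruction: pull in a whole page."""
            page = addr // PAGE
            if page not in self.pages:
                if len(self.pages) == self.slots:
                    self.pages.popleft()    # evict the oldest page
                self.pages.append(page)

        def hit(self, addr):
            """Loads check the page-buffer in parallel with the L1 cache."""
            return (addr // PAGE) in self.pages

    pb = PageBuffer()
    pb.prefetch(0x10000)                    # compiler-inserted, before first use
    accesses = [0x10010, 0x10400, 0x107F8]  # clustered accesses to one page
    print(all(pb.hit(a) for a in accesses)) # True: served at low latency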

The page-buffer is somewhat similar to the stream buffers proposed by Jouppi [4], but stream buffers rely on automatic detection of strided access patterns and do not work well for the irregular access patterns we see in database workloads. The page-buffer is also similar to the P-cache on UltraSPARC-III [5] and the Assist Cache on the HP PA7200 [6]. The main difference is the fetch granularity: the page-buffer fetches blocks of 2KB or larger, while UltraSPARC-III and the PA7200 fetch 64- and 32-byte blocks, respectively. The structures on UltraSPARC-III or the PA7200 could, however, be adapted to work like the page-buffer, perhaps by issuing multiple requests.

Simulation Results

Prefetching pages into the page-buffer as outlined above has the potential to significantly reduce the cache miss penalty. However, since entire pages are fetched but only a few lines of each page are actually used, it could significantly raise bandwidth demands. It would be best if only the needed lines could be prefetched; since we know their location only to page granularity, however, the choice boils down to whether it is better to bring in the full page in advance or to suffer the miss penalties that currently result from the needed data not being in cache. To evaluate the effectiveness of our page prefetching scheme in improving application performance, we have implemented an infrastructure to insert prefetches into the database trace and to simulate a page-buffer on an UltraSPARC simulator. Early results look reasonably promising; we are consolidating our results and simulating a wide range of design options, and hope to have all the results in time for the workshop.

References

[1] L. A. Barroso, K. Gharachorloo and E. Bugnion. Memory System Characterization of Commercial Workloads. In Proceedings of the 25th Intl. Symp. on Computer Architecture, June 1998.

[2] http://jhared.east/share/perftools/shade/shade.html

[3] R. Ramakrishnan. Database Management Systems. McGraw-Hill 1997.

[4] N. P. Jouppi. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers. In Proc. of the 17th Intl. Symp. on Computer Architecture, June 1990.

[5] Gary Lauterbach. UltraSPARC-III -- Scalable high clock-rate SPARC processor. Presentation at Microprocessor Forum, 1997.

[6] K. K. Chan, C. C. Hay, J. R. Keller, G. P. Kurpanek, F. X. Schumacher and J. Zheng. Design of the HP PA7200. Hewlett-Packard Journal, Feb. 1996.

Optimizing Instruction Fetch for Decision Support Workloads

Alex Ramirez, Josep Ll. Larriba-Pey, Carlos Navarro, Xavi Serrano, Josep Torrellas and Mateo Valero
Universitat Politecnica de Catalunya
Future wide-issue superscalars are expected to demand a high instruction bandwidth to satisfy their execution requirements. This will put pressure on the fetch unit and has raised concerns that instruction fetch bandwidth may be a major limiting factor to the performance of aggressive processors. Consequently, it is crucial to develop techniques to increase the number of useful instructions per cycle provided to the processor.

This number of useful instructions per cycle provided by the fetch unit is broadly determined by three factors: the branch prediction accuracy, the cache hit rate and the number of instructions provided by the fetch unit for each access. Clearly, many things can go wrong. Branch mispredictions cause the fetch engine to provide wrong-path instructions to the processor. Instruction cache misses stall the fetch engine, interrupting the supply of instructions to the processor. Finally, the execution of non-contiguous basic blocks prevents the fetch unit from providing a full width of instructions.

Much work has been done in the past to address these problems. Branch effects have been addressed with techniques to improve the branch prediction accuracy and to predict multiple branches per cycle. Instruction cache misses have been addressed with software and hardware techniques. Software solutions include code reordering based on procedure placement or basic block mapping, either procedure oriented [PH90] or using a global scope [Hwu89,To95]. Hardware solutions include set associative caches, hardware prefetching, victim caches and other techniques. Finally, the number of instructions provided by the fetch unit each cycle can also be improved with software or hardware techniques. Software solutions include trace scheduling, and superblock scheduling. Hardware solutions include branch address caches, collapsing buffers and trace caches [Rot96].

While all these techniques have vastly improved the performance of superscalar I-fetch units, they have been largely focused and evaluated on engineering workloads. Unfortunately, there is growing evidence that popular commercial workloads provide a more challenging environment to aggressive instruction fetching.

Indeed, recent studies of database performance on current processors have provided useful insight. They show that commercial workloads do not behave like scientific and engineering codes [May94]: they execute fewer loops and have many procedure calls, leading to large instruction footprints. The analysis, however, is not detailed enough to show how to optimize these workloads for improved I-fetch engine performance.

The work in this paper focuses on this issue. We proceed in three steps. First, we study the internal structure of a relational database management system and the execution model it implements, and we characterize the locality patterns of the database kernel code, finding the frequently executed paths. The database kernel used is PostgreSQL [Sto91]. This characterization shows that only 12% of the code is ever referenced. Furthermore, our data shows significant locality, with 0.7% of the static basic blocks accumulating 90% of the references, and quite deterministic execution patterns, with 80% of the static basic blocks behaving in a fixed way.

Second, we use this information to propose an algorithm that reorders the layout of the basic blocks in the database kernel for improved I-fetch. By mapping the basic blocks in execution order, we increase the potential bandwidth of a sequential fetch unit, raising the number of instructions executed between taken branches. We also map the most frequently referenced sequences into a reserved area of the cache and place other popular sequences next to equally popular ones, avoiding many conflict misses.
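
The sketch below shows the flavor of such a reordering: a greedy chaining over profiled edge counts, in the spirit of Pettis-Hansen code positioning, that places each block's hottest successor immediately after it. It is an illustration, not the actual UPC algorithm.

    # Sketch: greedy basic-block chaining from an edge profile. Each block is
    # followed by its hottest successor, lengthening straight-line runs.

    def layout(blocks, edges):
        best = {}                        # src -> (count, hottest dst)
        for src, dst, count in edges:
            if src != dst and count > best.get(src, (0, None))[0]:
                best[src] = (count, dst)
        placed, order = set(), []
        for b in sorted(blocks):         # start a chain at each unplaced block
            while b is not None and b not in placed:
                order.append(b)
                placed.add(b)
                b = best.get(b, (0, None))[1]
        return order

    blocks = ["A", "B", "C", "D"]
    edges = [("A", "C", 90), ("A", "B", 10), ("C", "D", 80), ("B", "D", 5)]
    print(layout(blocks, edges))         # ['A', 'C', 'D', 'B']: the hot path
                                         # is sequential, with fewer taken branches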

Finally, we evaluate our scheme via simulation. Our results show a miss reduction of 60-98% for realistic instruction cache sizes and a doubling of the number of instructions executed between taken branches, to over 22. As a consequence, a 16-instruction-wide sequential fetch unit with a perfect branch predictor increases its fetch bandwidth from 5.6 to 10.6 instructions per cycle when using our proposed code layout.

The software scheme we propose also combines well with hardware schemes such as a trace cache. The fetch bandwidth of a 256-entry trace cache improves from 8.6 to 12.1 instructions per cycle when combined with our software approach. This suggests that commercial workloads may be amenable to the aggressive instruction fetch of future superscalars.

[Hwu89] Wen-mei Hwu and Pohua P. Chang, Achieving High Instruction Cache Performance with an Optimizing Compiler, Proceedings of the 16th Annual Intl. Symposium on Computer Architecture, pp 242-251, June 1989.

[May94] A. M. Maynard, C. M. Donnelly and B. R. Olszewski, Contrasting Characteristics and Cache Performance of Technical and Multi-User Commercial Workloads. Proceedings of the Sixth Intl. Conference on Architectural Support for Programming Languages and Operating Systems, pp 145-156, October 1994.

[PH90] Karl Pettis and Robert C. Hansen, Profile Guided Code Positioning, Proc. ACM SIGPLAN'90 Conf. on Programming Language Design and Implementation, pp 16-27, June 1990.

[Rot96] E. Rotenberg, S. Bennett and J. E. Smith, Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching. Proceedings of the 29th Annual ACM/IEEE International Symposium on Microarchitecture, pp 24-34, December 1996.

[Sto91] M. Stonebraker and G. Kemnitz, The POSTGRES Next Generation Database Management System. Communications of the ACM, October 1991.

[To95] Josep Torrellas, Chun Xia and Russell Daigle, Optimizing Instruction Cache Performance for Operating System Intensive Workloads, Proceedings of the 1st Intl. Conference on High Performance Computer Architecture, pp 360-369, January 1995.

Session 4: Analyzing Web-based and Multimedia Workloads

Web Workload Characterization at a Microarchitectural Level

Ramesh Radhakrishnan and Lizy John
University of Texas, Austin
Our goal is to study the behavior of modern Web servers and server application programs to understand how they interact with the underlying hardware and operating system (OS) environments under a realistic workload. We monitor system performance with hardware counters under different scenarios, and we identify bottlenecks based on analysis of the collected data. Since popular Web server benchmarks assume that all requests are for static Web pages, they do not match current real-world workloads. Our workloads include static requests, which form a major percentage of Web traffic, as well as dynamic requests in the form of CGI (Common Gateway Interface) scripts and Servlets, which are gaining popularity and hence form a significant part of today's Web traffic.

We include not only the Web server itself in our studies, but also the server applications that run under its control as CGI scripts or Java Servlets. An UltraSPARC-II is used to run the server program in the UNIX environment. Dynamic requests are observed to have a higher CPI than static requests, and the major factor we could attribute this to was external (L2) cache misses. Other factors that affect performance for the different request types, such as pipeline stalls and cache misses, are also measured.

Performance Impact of Secondary Cache Size on Web Servers

Roy Saharoy
Intel Corporation
In this paper, the impact of secondary (L2) cache size on a Pentium II Xeon processor-based web server is studied. Most web server workloads (benchmarks) focus on the throughput of the web server, in the sense that the primary metric is the number of web (HTTP) transactions per second. However, latency for a given HTTP workload also plays an important role in many web benchmarks, such as WebStone (from Silicon Graphics) and SPECweb96 (from the SPEC organization); the general observation is that decreasing the latency for a given load results in a higher maximum sustainable bandwidth. In this paper, the effect of secondary cache size is studied primarily from a latency perspective.

In a latency-focused study of web servers, the primary metric is the mean time to complete a single end-to-end transaction between a web client and the web server. The results show that as the secondary cache size is increased, the latency of a single web transaction decreases; the improvement is more pronounced for workloads with higher memory requirements on the server; and the maximum sustainable throughput increases along with the latency improvement. On the server side, the latency of a web "get" transaction includes: (1) receiving a client request, (2) getting the requested data (an "html" file) from memory, and (3) sending the data out to the network adapter using the TCP/IP protocol. Network I/O performance is directly affected by the up-front latency of the initial processing of the client request, which also involves establishing a new connection and getting the required data from memory. Given sufficient network I/O bandwidth on the web server, the latency of a web transaction thus depends mostly (excluding client "think" time) on the resulting CPI (cycles per instruction) and L2 MPI (L2 misses per instruction) on the server for a given workload.
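
That dependence can be made explicit with a simple model: server-side time per transaction scales with path length times effective CPI, and L2 size enters through the MPI term. All inputs below are illustrative assumptions, not measurements from this study.

    # Sketch: server-side latency of a "get" as a function of L2 behavior.
    # All inputs are illustrative assumptions, not SPECweb96 measurements.

    def txn_latency_us(instrs, core_cpi, l2_mpi, miss_penalty, ghz=0.4):
        """Cycles = instructions * (core CPI + L2 MPI * miss penalty)."""
        cycles = instrs * (core_cpi + l2_mpi * miss_penalty)
        return cycles / (ghz * 1e3)       # ghz * 1e3 cycles per microsecond

    INSTRS = 50_000                       # assumed path length per transaction
    for l2_kb, mpi in [(512, 0.015), (1024, 0.010), (2048, 0.006)]:
        lat = txn_latency_us(INSTRS, core_cpi=1.2, l2_mpi=mpi, miss_penalty=80)
        print(f"L2={l2_kb:4d}KB  latency={lat:.0f} us")
    # A larger L2 lowers MPI, hence CPI, hence per-transaction latency; for
    # a fixed request stream this also raises the sustainable throughput.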

The SPECweb96 workload is used in this study to analyze the L2 access characteristics of the web server, which runs the Microsoft Windows NT 5.0 Beta operating system with Microsoft Internet Information Server 5.0 providing web services to clients. The web server's behavior under the SPECweb96 workload reveals that both CPI and MPI decrease with increased L2 size. While this observation agrees with our expectation, the paper also examines the cache and processor-to-memory bus accesses at a finer granularity, in terms of access and transaction patterns, and attempts to understand the effects of MPI on CPI at various levels of load produced by the workload.

It should be noted that the results presented in this paper do not in any way represent actual SPECweb96 benchmark performance for the hardware and software combination used in this study. Although a synthetic benchmark is used, the results of the analysis should be applicable to real-life static HTTP workloads on web server platforms of similar architecture.

A New Framework for Industry Standard Storage Subsystem Benchmarking: The Storage Performance Council

Charles Gimarc
The Storage Performance Council
The Storage Performance Council (SPC) is a non-profit corporation whose objective is to define, standardize, and promote storage subsystem benchmarks, and to disseminate objective, verifiable performance data to the computer industry and its customers.

The primary goal of the SPC is to serve as a catalyst for performance improvement in storage subsystems. It works to foster the open exchange of ideas and information, and to ensure fair and vigorous competition between vendors as a means of improving the products and services available to the public.

In support of its goals the SPC is developing benchmarks focused on storage subsystems. These subsystems include storage targets, as well as the adapters, controllers, and networks that connect storage devices to the computer system. Currently the SPC has 17 member companies.

The talk will focus on four topics: first, introducing the SPC to people within the industry who may not be aware of our work; second, outlining why a storage benchmark is important to our industry; third, describing where the SPC stands in its work to develop a true storage subsystem benchmark; and fourth, soliciting industry comments and participation in our work.