Sponsored by the IEEE Computer Society
Building on the positive feedback enjoyed by the First, Second, and Third Workshops on Computer Architecture Evaluation using Commercial Workloads, this fourth workshop will again bring together researchers and practitioners in computer architecture and commercial workloads from industry and academia. In the course of one day, we will discuss work-in-progress that utilizes commercial workloads for the evaluation of computer architectures. By discussing this ongoing research, the workshop will expose participants to the characteristics of commercial workload behavior and provide an understanding of how commercial workloads exercise computer systems. There will be discussions on the difficulties associated with using commercial workloads to drive new computer architecture designs and what can be done to overcome them.
The Final Program for the workshop is listed below, with an abstract for each talk. There will be plenty of time for audience participation. A panel with a round-table discussion will be held after the technical presentations. There will be no proceedings for the workshop since we encourage the presentation of work-in-progress and research in early stages. Copies of the foils used by the speakers will be distributed to the attendees on CD-ROM only. If hardcopy is desired, attendees are encouraged to visit this site 1 week before the workshop to download an electronic copy of the foils for printing.
Session 1: Characterizing I/O Behavior
Characterizing Data-Intensive Workloads on Modern Disk Arrays
Guillermo Alvarez, Kimberly Keeton, Erik Riedel, and Mustafa Uysal
Hewlett-Packard Laboratories
Characterization of I/O for TPC-C and TPC-H Workloads
Don DeSota
IBM NUMA-Q
Iterative Development of an I/O Workload Characterization
Zachary Kurmas, Ralph Becker-Szendy, and Kimberly Keeton
HP-Labs Storage Systems Program
10:00 am - 10:30 am
Coffee Break
10:30 am - 11:30 am
Session 2: Characterization and Modeling of Web-Driven Workloads
An Internet Traffic Generator for Server Architecture Evaluation
Krishna Kant, Vijay Tewari and Ravi Iyer
Intel Corporation
Performance Impact of Multithreaded Java Server Applications
Yue Luo and Lizy K. John
Department of Electrical and Computer Engineering
The University of Texas at Austin
11:30 am - 1:00 pm
Lunch
1:00 pm - 2:30 pm
Session 3: Processor Architecture Evaluation and Simulation
Walking Four Machines By The Shore
Anastassia Ailamaki, David J. DeWitt, and Mark D. Hill
University of Wisconsin-Madison
Evaluation of TPC-H benchmark on Athlon based systems
Mike Clark, Ajaya Durg, Kevin Lienenbrugger, and Lizy John
Electrical and Computer Engineering Department
The University of Texas at Austin
Statistical Simulation of Superscalar Architectures using Commercial Workloads
Lieven Eeckhout and Koen De Bosschere
Department of Electronics and Information Systems (ELIS)
Ghent University
2:30 pm - 3:00 pm
Coffee Break
3:00 pm - 4:00 pm
Session 4: Processor/Memory Interconnect Performance for Database Workloads
Impact of Database Scaling on Realistic DSS Workload Characteristics on SMP Systems
Ramendra K. Sahoo, Krishnan Sugavanam, Ashwini K. Nanda
IBM T.J. Watson Research Center
STiNG Revisited: Performance of Commercial Database Benchmarks on a CC-NUMA Computer System
Russell M. Clapp
IBM
4:00 pm - 4:15 pm
Short Break
4:15 pm - 5:15 pm
Invited Talk
Analyzing Business Computing Architecture Availability using Commercial Database Benchmark Workloads
Stephen de Glanville
IBM
5:15 pm
Participant Feedback
Closing Remarks
Full Text of Abstracts:
Session 1: Characterizing I/O Behavior
Characterizing Data-Intensive Workloads on Modern Disk Arrays
Guillermo Alvarez, Kimberly Keeton, Erik Riedel, Mustafa Uysal
Hewlett-Packard Laboratories
Most large storage systems today
use disk arrays to meet capacity, reliability, availability, and performance
requirements. Modern disk arrays
have grown in sophistication over the years to keep up with growing application
demands. One of the biggest recent architectural changes has been the addition of very large array
caches: modern high-end disk
arrays can contain as much as 64 GB of cache memory [HP]. The presence of big
caches, and the policies that disk arrays implement in order to take advantage
of them, interact with data-intensive workloads in poorly understood ways. Little is known about how
array caches impact application performance and what characteristics of
applications can take best advantage of these caches.
We discuss two cache-related array
behaviors ("sweet spots"):
prefetch effectiveness and cache effectiveness. We show how to quantify, for given
workloads, the extent to which these behaviors improve or degrade application
performance. This is done by
evaluating a set of candidate workload characteristics such as the degree to
which the request streams perform sequential accesses (intra-stream and
inter-streams), the footprint of the workload, and the temporal burstiness of
the request streams.
Our evaluation approach starts by
running a set of synthetic workloads on array configurations with different
values of tunable parameters, in order to identify and isolate the array's
sweet spots. We then use the
insights gained from the synthetic workloads to propose novel workload metrics
that best characterize the workload's performance on modern disk arrays. Finally, we validate our choices by
examining the behavior of a set of real workloads: a 300-GB TPC-D decision support database benchmark, a mail
server (Open Mail), and a file system for an engineering development
environment.
Our findings indicate that sweet
spots can impact access times by as much as two orders of magnitude. We show that a workload metric that
takes into account short forward jumps and the interference from competing
(possibly sequential) streams accurately predicts the effectiveness of
array-level prefetching for the TPC-D and Open Mail workloads.
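The exact form of the prefetch-prediction metric is not given in this abstract; as a hedged illustration only, counting the fraction of requests that are short forward jumps from the end of the previous request might be sketched as follows (the function name and the 64KB jump threshold are assumptions, not HP's actual definition):

```python
def sequentiality_score(requests, max_jump=64 * 1024):
    """Fraction of requests that are short forward jumps.

    requests: list of (offset, size) pairs in arrival order.
    A request counts as sequential if it starts at, or a short
    forward jump past, the end of the previous request.
    """
    if len(requests) < 2:
        return 0.0
    sequential = 0
    for (prev_off, prev_size), (off, _) in zip(requests, requests[1:]):
        jump = off - (prev_off + prev_size)
        if 0 <= jump <= max_jump:
            sequential += 1
    return sequential / (len(requests) - 1)
```

A high score suggests array-level prefetching will help; interleaving competing streams lowers the score unless the streams are first separated.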
References:
[HP] The Hewlett-Packard XP512 Disk Array.
Characterization of I/O for TPC-C and TPC-H Workloads
Don DeSota
IBM NUMA-Q
15450 SW Koll Parkway
Beaverton, Oregon 97006-6063
The TPC-C and TPC-H
benchmarks have become industry standards for determining system-level
performance for On-line Transaction Processing (OLTP) and Decision Support
System (DSS) usage models on new systems.
This talk will cover I/O performance data measured on IBM NUMA-Q systems
for each of these benchmarks. It will include data for properly sized and tuned
TPC-C and TPC-H workloads. The
presentation will include a description of the systems the benchmarks are run
on, how the measurements were made, and the measured data. The data will include both consumed
bandwidth and I/Os per second.
The TPC-C I/O data is for multiple system sizes, from 4 to 16 500MHz Xeon processors in 1 to 4 quads. The systems
include 4GB memory per quad and a fibre channel disk I/O subsystem.
The TPC-H system is a 16-quad system with 64 700MHz Xeon II processors, 2GB of memory per quad, and a fibre channel disk subsystem. The TPC-H database uses a 300GB scale factor.
As expected, our results show a higher I/Os per second requirement for TPC-C but a
higher bandwidth requirement for TPC-H.
This is due to the random nature of the I/Os for TPC-C and their
associated small block size. TPC-H
does more sequential scanning of data with a large block size. Despite the random nature of I/Os for
TPC-C, we see good scalability of I/O on the multi-quad NUMA system due to the
presence of “multipath” I/O and a fibre channel switch, which largely prevents
I/O DMA transfers from traversing the longer latency second-level
interconnect. Also, while TPC-C shows a steady I/O requirement related to system throughput, the TPC-H benchmark shows a highly variable requirement: throughput fluctuates within a specific query, from query to query, and over the course of the throughput test. This results in instances during the test where all practically available I/O bandwidth is consumed by the benchmark. For the system described above, that amount is in excess of 3GB/s.
Iterative Development of an I/O Workload Characterization
Zachary Kurmas, Ralph Becker-Szendy, and Kimberly Keeton
kurmasz@cc.gatech.edu, ralphbsz@hpl.hp.com, kkeeton@hpl.hp.com
HP-Labs Storage Systems Program
Introduction
Interesting workloads are a key
input to experimental I/O system studies.
Unfortunately, although a trace of I/O activity completely describes a
workload, a trace usually has a very low ratio of information to bits. Our current research is to develop a
compact representation of an I/O workload that contains all the ``important''
information, but does not suffer the disadvantages of a full trace.
Such a characterization of an I/O
workload has many potential uses. For example, scientists can use it to quickly
understand the high-level characteristics of a workload, to quickly compare multiple
workloads, or to synthetically generate workloads. This knowledge may also be useful for developing a better
method of specifying benchmarks (e.g. TPC) (which may help eliminate the
problem of applications tuned to a specific workload); and developing a model
that can be used to quickly configure large storage systems.
The essence of the problem is not
simply characterizing the workload, but ensuring that the characterization
contains all of the ``important'' information. For example, we could characterize a workload by simply
giving the mean size of the I/Os; however, such a characterization would not be
very useful to most people. From
the standpoint of an end user, the most important characteristic of an I/O
workload is how quickly it executes on a storage system. Thus, the focus of our research is
learning how to characterize a workload with enough detail that we can
synthetically generate a workload with the same performance.
Methodology
In order to develop such a workload
characterization, we must learn what attributes affect performance. As a starting point, we consider four
basic ``attributes'' for each I/O request in the workload: size, interarrival
time (time since last I/O), location on disk, and type (read or write). A characterization of a basic attribute
describes the distribution of values for that attribute (i.e. the probability
an attribute will take a certain value.)
Developing our characterization is
an iterative process. We first
choose a set of attributes (e.g., Request Size, Request Rate, and Location) and
modify our trace analysis tool, Rubicon, to characterize those attributes. Next, we modify our synthetic workload
generation tool, Pylon, to generate a workload with those characteristics. We then compare the performance of our
synthetically generated workload to the performance of the observed workload.
We compare performance by comparing the cumulative distribution functions
(CDFs) of the I/O latencies for each workload; we use the root-mean-squared
metric to quantify the similarity between the CDFs. Finally, we determine how to further improve our
characterization, either by adding attributes, or by adding further detail to
the current characterization.
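The comparison step might be sketched as follows (sampling the two CDFs at evenly spaced points over the combined latency range is an assumption; the abstract does not specify where the root-mean-squared metric is evaluated):

```python
import bisect

def empirical_cdf(samples):
    """Return a function mapping x to the fraction of samples <= x."""
    ordered = sorted(samples)
    n = len(ordered)
    def cdf(x):
        # bisect_right counts how many sorted samples are <= x
        return bisect.bisect_right(ordered, x) / n
    return cdf

def rms_cdf_distance(latencies_a, latencies_b, num_points=100):
    """Root-mean-squared difference between two latency CDFs,
    sampled at evenly spaced points over the combined range."""
    lo = min(min(latencies_a), min(latencies_b))
    hi = max(max(latencies_a), max(latencies_b))
    cdf_a = empirical_cdf(latencies_a)
    cdf_b = empirical_cdf(latencies_b)
    xs = [lo + (hi - lo) * i / (num_points - 1) for i in range(num_points)]
    squared = [(cdf_a(x) - cdf_b(x)) ** 2 for x in xs]
    return (sum(squared) / len(squared)) ** 0.5
```

A distance of zero means the synthetic workload's latency distribution matches the observed one exactly; the 10 percent figure reported below corresponds to a distance of about 0.1 on this scale.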
At first it may appear that we can do no better than to make an educated guess as to how to begin each iteration. Fortunately, Pylon is
designed so that we can isolate the generation of each attribute. Pylon can be
configured to generate a sequence of values for an attribute that is identical
to the sequence in the observed workload, or chosen randomly from the
summarizing distribution. For
example, we could configure Pylon so that the sequence of request sizes,
interarrival times, and read/write type is identical to that of the observed
workload, and only the location of each I/O is generated randomly. This gives us some insight into how
the accuracy of our synthesis of location values affects the accuracy of our
synthetic workload. (For example,
in this case, if the latencies don't change much regardless of the method by
which we generate location, we conclude that location does not have a large effect on latency.) By holding
different component values ``constant'' (i.e. identical to the original
workload), we gain insight into which attributes have the largest effect on
latency, and therefore, which attributes we should focus on during the next
iteration.
Results
Our first iteration was to
characterize each of the four basic attributes with simply a histogram. Pylon then generated I/Os by drawing a
value for each attribute randomly from a distribution specified by those
histograms. While this was a good
start, the performance of the synthetic workload was not close to that of the
observed workload (our
root-mean-squared analysis showed a difference of almost 10 percent). The problem with this first iteration
is that it assumes that I/Os can be generated by independent random values. In
other words, such a generation method assumes there is no correlation either
between or within attributes. We
know this is not the case. For
example, in the Open Mail workload (the only workload we've analyzed thus far
--- we will analyze other workloads after we have stronger results), 30 percent
of the writes were made to the same location on the disk (the index of the mail
system), whereas the reads tended to occur at the end of the disk (where the
data was stored). Thus, there is a clear correlation between type and location.
Our second iteration was to
characterize the reads separately from the writes. Thus, there were two histograms of location, one for reads,
and one for writes. This was
successful in some respects, but less so in others. It corrected the differences caused by having 30 percent of
the reads issued to the same location; but it did not address the fundamental
problem of treating all components as independent random variables.
In order to further improve our
characterization, we must expand our definition of ``attribute'' so that it
also includes the correlations within and between our four basic attributes. For example, workloads tend to have
``runs'' of I/Os in which each I/O in the run has a location adjacent to the
previous I/O. Thus, the
distribution of distances between successive I/Os (we call this ``jump
distance'') will be much different than the distribution created by a simple
characterization of location as an independent random variable. Thus, our characterization must now
include this new attribute, jump distance, which describes the correlation between
the locations of successive I/Os.
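A minimal sketch of this derived attribute (representing the distribution as a plain counter is illustrative; Rubicon's actual representation is not described here):

```python
from collections import Counter

def jump_distance_histogram(locations):
    """Histogram of signed distances between successive I/O locations.

    A spike at zero or small positive distances reveals sequential
    ``runs'' that a per-request histogram of locations, treated as
    independent random draws, cannot capture.
    """
    jumps = [b - a for a, b in zip(locations, locations[1:])]
    return Counter(jumps)
```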
Our current task is to discover how
to best characterize jump distance.
Our first attempt showed that the problem was more complex than we first believed; however, we have developed a method that we believe will work very
well. We hope to have this method
implemented and tested by Spring.
Workload characterization has many
potential uses. We are very
encouraged by our early results and expect our methodology to allow us to
steadily improve the quality of our characterizations.
Session 2: Characterization and Modeling of Web-Driven Workloads
An Internet Traffic Generator for Server Architecture Evaluation
Krishna Kant, Vijay Tewari and Ravi Iyer
Intel Corporation
Beaverton, Oregon, USA
{krishna.kant | vijay.tewari | ravishankar.iyer}@intel.com
The work reported here was
motivated by the need to understand the implications of web traffic
characteristics on the architectural aspects of a web/e-commerce server. This requires generation of realistic
Internet traffic in a lab environment so that it is possible to do detailed
measurements on server systems. This includes tracing of processor and I/O
busses in order to evaluate the impact of varying traffic characteristics on
the resources inside the web server. Towards this goal, we have developed a
Generator for E-commerce and Internet Server Traffic (GEIST).
Generation of synthetic traffic with
complex characteristics is often computationally intensive, making it difficult
to ensure that a request is actually generated very close to the intended time.
This aspect depends on the number of active processes/threads in operation,
scheduling algorithms used by the O/S, and the amount of queuing for the CPU
and other resources. We address this issue in GEIST by splitting the generation
into two steps, referred to as Aggregate Trace Generation and Traffic
Generation.
The first step (aggregate trace generation)
handles all the complexities of computing the actual time and parameters of the
requests. There are several
important aspects of aggregate
traffic that a traffic generator needs to emulate accurately. These include (a)
temporal characteristics of the arrival process, (b) transactional composition
of the traffic, and (c) nature of and dependencies between objects retrieved by
successive requests. We will touch upon how each of these aspects is dealt with
in GEIST. This step can be executed off-line and the output placed in a ``trace
file". The second step (traffic generation) has to read the trace and
actually issue the requests. The main advantage of a two-step process is that
this traffic generator part doesn't care how the trace was generated; the trace
could have been derived from HTTP logs from a live site. The difficulty in this step is that the scheduling granularity of the threads responsible for
request generation and other O/S activities over which the programmer has no
control could introduce a significant skew (or slippage) between the intended
request time and the actual request time. We have carried out a detailed set of
experiments to assess the accuracy and performance of GEIST in terms of
slippage. We will present the findings from this slippage study during the
talk.
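The replay step and its slippage measurement might be sketched as follows (the trace format and the `issue_request` callback are hypothetical; GEIST's actual interfaces are not described in the abstract):

```python
import time

def replay_trace(trace, issue_request):
    """Replay a pre-generated trace of (intended_offset_s, request) pairs.

    Returns the slippage (actual - intended issue time, in seconds) for
    each request; O/S scheduling granularity makes this nonzero even
    though the generation step has already computed exact timestamps.
    """
    start = time.monotonic()
    slippage = []
    for intended, request in trace:
        delay = intended - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)  # wait until the intended issue time
        slippage.append((time.monotonic() - start) - intended)
        issue_request(request)
    return slippage
```

Because the trace is computed off-line, this inner loop does no expensive arithmetic, which is the point of the two-step split.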
Comparing with related work, we
believe that GEIST is different and generally more powerful than other traffic
generators that we are aware of. Most other generators (e.g., Microsoft's Web
Application Stress Tool [1], SPECweb96, SPECweb99 [2]) are user emulation based
and do not support complex temporal behavior. One exception is SURGE [3], which does support self-similar processes by making each ``user equivalent'' an on-off process. In contrast, GEIST supports asymptotically self-similar, multifractal, and nonstationary arrival processes. GEIST also supports detailed
transactional characterization of the traffic. We also believe that GEIST's
two-part design makes it much more scalable and extensible than other generators.
We have already used GEIST in
several performance studies such as overload control for web servers,
architectural experiments for proxy servers and secure protocols like SSL. We
will present the salient results (pertaining to GEIST) from some of these
studies. We will also cover the possible future uses and extensions to GEIST.
References
[1] ``Microsoft Web Application Stress Tool'', msdn.microsoft.com/library/periodic/period00/stresstool.htm
[2] ``An explanation of the SPECweb96/SPECweb99 benchmark'', www.specbench.org/osg/web96 & www.specbench.org/osg/web99.
[3] P. Barford and M. Crovella, ``Generating Representative Web Workloads for Network and Server Performance Evaluation'', Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pp. 151-160, July 1998.
Performance Impact of Multithreaded Java Server Applications
Yue Luo and Lizy K. John
Laboratory for Computer Architecture
Department of Electrical and Computer Engineering
The University of Texas at Austin
Austin, TX 78712
In many commercial server
applications, the server needs to support many concurrent client connections.
In Unix, it is common to serve several clients with one server thread by using
traditional select() or poll() functions or POSIX asynchronous I/O. The server has a relatively small
number of threads while it is able to serve a large number of clients. Such APIs, however, are not available
in Java platforms. So the current common programming paradigm in Java is to
devote one thread to one client connection. Thus, performance under the
presence of a large number of threads is crucial for a commercial Java server
to support multiple clients simultaneously. In this research, we investigate the performance impact of
multiple threads and multiple connections in Java applications. We study VolanoMark, an Internet chat
server environment, on two of the most popular platforms: Pentium III + Windows and
Sparc + Solaris. VolanoMark is characterized by long-lasting connections and
high thread counts. It is based on
a commercial pure Java chat server, VolanoChat, which is used by customers in
33 countries.
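The thread-per-connection paradigm described above can be sketched in outline (this is an illustrative Python echo server, not VolanoMark or VolanoChat code; each accepted connection gets a dedicated thread for its whole lifetime, so long-lasting connections imply high thread counts):

```python
import socket
import threading

def serve_client(conn):
    """Handle one client for its whole lifetime on a dedicated thread.
    Here the per-connection ``service'' is a simple echo loop."""
    with conn:
        while True:
            data = conn.recv(1024)
            if not data:
                break  # client closed the connection
            conn.sendall(data)

def run_server(listener, stop):
    """Accept loop: spawn one new thread per accepted connection."""
    while not stop.is_set():
        try:
            conn, _ = listener.accept()
        except OSError:
            break  # listener was closed; shut down
        threading.Thread(target=serve_client, args=(conn,), daemon=True).start()
```

With N clients connected, N service threads exist simultaneously, which is exactly the regime whose OS and microarchitectural cost this study measures.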
Performance counters are used to study the
Pentium III system. We run VolanoMark with the latest Sun JDK 1.1.3 for
Win32. Sun's HotSpot server JVM is also studied and compared with the classic JVM. All metrics are collected separately for OS mode and user mode. Shade, a trace tool from Sun, is used on the Sparc platform. VolanoMark is run with Sun JDK 1.2.2 for Solaris, which includes a green thread mode JVM. Both native thread mode and green thread mode are studied and compared. On both platforms we synchronize our measurements with the client connections. We only study the period when there is client activity and ignore the startup and shutdown process of the server.
Our preliminary results
show that with the increase of the number of threads and connections, there is
increased OS activity in the application.
On the Pentium III platform, only 34% of the time is spent in user code at 800 connections, compared with 45% at 20 connections. We also find that the multithreaded program is actually CPU- and I-cache-friendly. We have observed lower I-cache miss rates due to the sharing of code between threads. We also observe better branch prediction. These result in fewer I-stalls and a lower CPI. Resource stalls increase with the number of threads, so higher performance can be expected from increasing the number of functional units in the processor. The biggest factor affecting the performance of multithreaded applications on Windows NT so far is the overhead caused by synchronization and context switching. We believe the most gain can be derived from eliminating this overhead. The study is ongoing and we expect results from the Sparc platform to be available before the final version is submitted.
Session 3: Processor Architecture Evaluation and Simulation
Walking Four Machines By The Shore
Anastassia Ailamaki, David J. DeWitt, and Mark D. Hill
University of Wisconsin-Madison
1210 West Dayton Street
Madison, WI 53706
{natassa,dewitt,markhill}@cs.wisc.edu
Recent studies have shown that the hardware behavior
of database workloads is suboptimal when compared to scientific workloads, and
have
identified the processor and memory subsystem as the
true performance bottlenecks, when running decision-support workloads on
various commercial DBMSs. Conceptually, all of today's processors follow the
same sequence of logical operations when executing a program. Nevertheless,
there are internal implementation details that critically affect the
processor's performance, and vary both within and across computer vendor
products. To accurately identify the impact of variation in processor and
memory subsystem design on DBMS performance, we need to identify the impact of
the microarchitectural parameters on the performance of database management
systems.
This study compares the behavior of a prototype
database system built on top of the Shore storage manager across three
different processor design philosophies: the Sun UltraSparc (using processors
UltraSparc-II and UltraSparc-IIi), the Intel P6 (using an Intel PII Xeon), and
a Compaq/DEC Alpha (using a 21164A). These processors exhibit wide variation in processor and memory subsystem design. The choice of prototype system is
pertinent because the system's hardware behavior was found similar to
commercial database systems when executing decision-support workloads. In order
to evaluate the different design decisions and trade-offs in the execution
engine and memory subsystems of the above processors, we ran several range
selections and decision-support queries on a memory-resident TPC-H
dataset. The insights gained indicate that, barring serious hardware implementation concerns, decision-support workloads would exploit the following designs for higher performance:
1. A processor design that employs (a) out-of-order execution to more aggressively overlap stalls, (b) a high-accuracy branch prediction mechanism, and (c) the opportunity to execute more than one load/store instruction per cycle, and
2. A memory hierarchy with (a) non-inclusive (at least for instructions) caches, (b) a large (> 2MB) second-level cache, and (c) a large cache block size (64-128 bytes) without sub-blocking, to exploit spatial locality.
Evaluation of TPC-H benchmark on Athlon based systems
Mike Clark, Ajaya Durg, Kevin Lienenbrugger and Lizy John
Electrical and Computer Engineering Department
The University of Texas at Austin
Austin, TX 78712
This paper analyzes the operation of five queries
from the TPC-H benchmark using two different AMD Athlon based machines. The
queries were picked (1, 3, 6, 8, and 19) to provide
a representative sample of the TPC-H benchmark. The experiment had three
objectives. One objective was to analyze the effect of the different L2 cache
architectures of the two different Athlon processors. The classic Athlon processor contained an external 512K, 2-way set-associative, inclusive cache running at 1/3 the processor speed, while the newer Athlon processor had an internal, full-speed 256K, 16-way set-associative victim cache. Both machines were configured with 256M of memory and were running Win2K server and a DB2
database. Another objective of the
research was to analyze the resources and micro-architecture of the Athlon when
running large database queries. The last objective was to analyze the queries
themselves to better understand their behavior. All analysis was done through hardware based performance
monitor counters. Preliminary results for the instruction cache showed that the shrinking of the L2 cache did not hurt its effectiveness, due to the increase in associativity. This seems like a very good design decision, since the miss rate did not increase and the latency improved threefold. Another interesting result was that two of the queries (3 and 8) showed two distinct phases, so performance results for those two queries were reported in two sections. It was also observed that the processor was halted 60% of the time that the query was running. This is attributed to the
disk latencies. Of the remaining
40% of the time, the processor was stalled half of the time. The largest
contributor to this stall component was the Load-Store unit becoming full. However, slightly more surprising was
that the second largest contributor was the 72-entry reorder buffer. This implies that the machine was able to consistently find a lot of work behind stalls, but the buffer was still not deep enough to hide the full latency of the stall. Assuming these stalls are
due to memory accesses, good software prefetching or even hardware prefetching
may be able to alleviate this problem.
The L1 caches and TLBs showed extremely good performance, as did the
branch predictor. Some of the queries showed a higher macro-op to instruction
factor than would be expected. This implies that the query may be using more of
the complex instructions of the x86 instruction set. Unfortunately profiling
the code was beyond the scope of this paper. In conclusion, the Athlon
micro-architecture seems very capable of performing well on large database queries.
Statistical Simulation of Superscalar Architectures using Commercial Workloads
Lieven Eeckhout and Koen De Bosschere
Department of Electronics and Information Systems (ELIS)
Ghent University
Sint-Pietersnieuwstraat 41
B-9000 Ghent, Belgium
Trace- and execution-driven simulators are very
important in the design of new microarchitectures. They model cycle-by-cycle
processor state transitions which makes them highly accurate in simulating a
processor prototype. However, there are serious practical shortcomings with
these techniques for efficiently culling a huge design space in an early design
stage. First of all, due to the increasing number of instructions that need to
be simulated in order to measure the performance of contemporary real-life
applications on a particular processor, simulation time is getting prohibitive.
A second disadvantage, which holds specifically for trace-driven simulation, is that huge traces need to be stored, which is impractical.
In practice, researchers have proposed several
solutions to shorten simulation time: taking a contiguous part of the trace,
analytical modeling and trace sampling. Only recently, statistical simulation
[1,2,3] was proposed to speed up the simulation process. In statistical
simulation, a statistical profile or a set of statistical program
characteristics is extracted from a program execution, e.g., the instruction mix,
the distribution of the dependency distance between instructions, etc. This
statistical profile is then used to generate a synthetic trace that is
subsequently fed into a trace-driven simulator, which will compute the
attainable performance for the microarchitecture modeled. Thanks to the
statistical nature of the technique, performance characteristics quickly
converge to a steady state solution. So, statistical simulation is a useful
technique for culling huge design spaces in limited time.
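In outline, profile extraction and synthetic-trace generation might look like this (the two attributes and their independent sampling below are deliberate simplifications; the statistical profiles used in [1,2,3] are considerably richer):

```python
import random
from collections import Counter

def build_profile(trace):
    """Extract a simple statistical profile from a trace of
    (opcode_class, dependency_distance) pairs: the instruction mix
    and the distribution of dependency distances."""
    return {
        "mix": Counter(op for op, _ in trace),
        "dep": Counter(dist for _, dist in trace),
    }

def synthesize(profile, length, seed=0):
    """Generate a synthetic trace by sampling each attribute
    independently from its profiled distribution."""
    rng = random.Random(seed)
    ops, op_weights = zip(*profile["mix"].items())
    deps, dep_weights = zip(*profile["dep"].items())
    return [(rng.choices(ops, op_weights)[0],
             rng.choices(deps, dep_weights)[0])
            for _ in range(length)]
```

The synthetic trace can then be fed to an ordinary trace-driven simulator; because it is drawn from stationary distributions, its measured IPC converges quickly to a steady state.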
In previous work [1,2,3], researchers only used SPEC
benchmarks (www.spec.org) to evaluate the statistical simulation methodology.
In this paper, we use a broader spectrum of traces from commercial and scientific workloads:
· the 8 integer control-intensive SPECint95 benchmarks;
· 5 scientific SPECfp95 benchmarks (hydro2d, su2cor, swim, tomcatv, wave5);
· the 8 IBS traces [4], which contain significant amounts of operating system activity;
· 4 multimedia applications from the MediaBench suite [5], namely g721_encode, gs, gsm_encode, mpeg2_encode;
· 4 X graphics benchmarks from the SimpleScalar distribution (www.simplescalar.org), namely DooM, POVRay, Xanim and Quake;
· 2 TPC-D traces generated by tracing Postgres 6.3 running the TPC-D queries 2 and 17 over a 100MB Btree indexed database [6].
All traces include approximately 200 million
instructions. The out-of-order architectures that we considered have an issue
width of 8 and
16 and an instruction window of 64 and 128 entries,
respectively. We used a hybrid branch predictor and two cache configurations, a
`small' (8KB DM L1 I-cache, 8KB DM L1 D-cache and 64KB 2WSA L2 cache) and a
`large' cache configuration (32KB DM L1 I-cache, 64KB 2WSA L1 D-cache and 256KB
4WSA L2 cache). To evaluate the performance prediction accuracy of the
statistical simulation technique, we calculate the relative error between the
IPC of the real trace and the IPC of the synthetic trace generated using the
statistical profile of the real trace.
We conclude that statistical simulation obtains
smaller performance prediction errors for applications with a higher static
instruction
count. More specifically, the IPC prediction error
is no larger than 25% for the IBS traces, the TPC-D traces, two X graphics
traces (DooM and Quake) and three SPECint traces (gcc, go and vortex). The
other traces have a smaller instruction footprint and have significantly larger
performance prediction errors (up to 40%). So, since commercial workloads are
known to have a larger instruction footprint, we can conclude that statistical
simulation is a fast simulation technique that is useful for commercial
workloads.
[1] R. Carl and J. E. Smith. Modeling superscalar processors via statistical simulation. In Workshop on Performance Analysis and Its Impact on Design (PAID), held in conjunction with the 25th Annual International Symposium on Computer Architecture (ISCA-25), June 1998.
[2] L. Eeckhout, K. De Bosschere and H. Neefs. Performance analysis through synthetic trace generation. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2000), pp. 1-6, April 2000.
[3] M. Oskin, F. T. Chong and M. Farrens. HLS: Combining statistical and symbolic simulation to guide microprocessor design. In Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA-27), pp. 71-82, June 2000.
[4] R. Uhlig, D. Nagle, T. Mudge, S. Sechrest and J. Emer. Instruction fetching: coping with code bloat. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA-22), pp. 345-356, June 1995.
[5] C. Lee, M. Potkonjak and W. H. Mangione-Smith. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In Proceedings of the 30th Annual International Symposium on Microarchitecture (MICRO-30), pp. 330-335, December 1997.
[6] C. Navarro, A. Ramirez, J.-L. Larriba-Pey and Mateo Valero. Fetch engines and databases. In Proceedings of the Third Workshop on Computer Architecture Evaluation using Commercial Workloads, held in conjunction with the 6th International Symposium on High-Performance Computer Architecture (HPCA-6), January 2000.
Session 4: Processor/Memory
Interconnect Performance for Database Workloads
Impact
of Database Scaling on Realistic DSS Workload Characteristics on SMP Systems
Ramendra K. Sahoo, Krishnan
Sugavanam, Ashwini K. Nanda
IBM T.J. Watson Research
Center
Yorktown Heights, NY 10598
Studying the characteristics of large-scale database systems on shared-memory multiprocessors is considered a difficult task. Hence, research has traditionally focused on simulating scaled-down workloads, yet it is difficult to devise a standard scaling plan for large databases that represents realistic workloads. The primary scope of this study is to provide a preliminary analysis of database scaling and its effect on overall memory behavior for decision support workloads.
All analyses and results are based on TPC-H (Decision Support) benchmark queries for database sizes ranging from 1GB to 100GB. The study uses the MemorIES board and on-chip counters to analyze query-by-query behavior for realistic TPC-H benchmarks. A quantitative evaluation of the difference between a scaled-down database and a realistic workload is presented. The study covers various system parameters, such as miss ratio and miss rate, and the effect of database size on these parameters. Our initial observation is that results from a scaled-down workload may characterize database/memory system behavior qualitatively, but not quantitatively, and quantitative characterization is essential for future server design.
STiNG
Revisited: Performance of
Commercial Database Benchmarks on a CC-NUMA Computer System
Russell M. Clapp
IBM NUMA-Q
15450 SW Koll Parkway
Beaverton, Oregon 97006-6063
In 1996, the STiNG architecture was introduced at the International Symposium on Computer Architecture. Since that time, it has been implemented with several generations of Pentium Pro processors, memory and I/O control chipsets, and the ASICs that comprise the second-level, SCI-based CC-NUMA interconnect. In this talk, the speaker describes the performance of commercial database benchmarks as viewed from the event counters embedded in the hardware at various places throughout a recent implementation of the STiNG architecture. After a review of the STiNG system architecture, performance data for representative OLTP and Decision Support benchmarks will be presented. This data includes the rate of events for the workloads (e.g. cache misses, cache-to-cache transfers, invalidations, etc.), the latency to service these events, bandwidth consumption at various datapaths in the system, and other architecture-specific measurements. Where appropriate, comparisons will be made between this data and the original projections made in 1996. We conclude that, despite some deficiencies in the original simulation environment and some unanticipated differences in software behavior on real systems, there is still good overall agreement between the original simulated results and the recently measured results.
Analyzing Business Computing
Architecture Availability using Commercial Database Benchmark Workloads
Stephen de Glanville
IBM NUMA-Q
15450 SW Koll Parkway
Beaverton, Oregon 97006-6063
The availability of applications to end-users, sometimes described as business availability, is often used by both hardware and database vendors to impress potential customers with the robustness of their load-balanced, clustered, or parallel server products. Hardware vendors often boast of 'five-nines' availability (99.999%) without qualifying whether this figure is a simple calculation based on the compounded MTBFs (Mean Time Between Failures) of all system components, or whether it is TOTAL system availability to ALL end-users. The latter figure is what most customers would like solution vendors to specify, and customers often craft lengthy service and performance level agreements with their suppliers to ensure they can obtain compensation for business loss caused by system failure.
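The compounded-MTBF calculation alluded to above can be sketched as follows (a hypothetical illustration; the component names and MTBF/MTTR figures are assumptions, not vendor data). Each component's steady-state availability is MTBF / (MTBF + MTTR), and for components in series the system is up only when every component is up, so the compounded figure is the product of the individual availabilities:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single component, from its
    Mean Time Between Failures and Mean Time To Repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical components in series, for illustration only.
component_availabilities = [
    availability(50_000, 4),    # e.g. database server
    availability(100_000, 8),   # e.g. interconnect
    availability(25_000, 2),    # e.g. storage array
]

system_availability = 1.0
for a in component_availabilities:
    system_availability *= a

print(f"compounded system availability: {system_availability:.5%}")
```

Note that this compounded figure says nothing about how many end-users are affected by a given failure, which is the gap the presentation's benchmark-driven technique addresses.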
This presentation describes techniques, available through the use of commercial database benchmarks, that can be used to quantify the impact of partial or total system failure on a customer's business. The technique allows analysis of what actually happens to end-user transaction characteristics when a system suffers partial or total failure, whereas MTBF figures can only attempt to predict how often a failure might occur. An example of harness technology used in a commercially available ERP (Enterprise Resource Planning) benchmark is given to explain how parameters such as TPS (Transactions Per Second) and transaction timings can be captured as a percentage of the figure from an optimal benchmark run, to ascertain the precise impact of any simulated failure on end-user communities. Hooks are built into the benchmark harnesses to simulate application server failure, client-server network-segment failure, database instance failure and database server failure.
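The percentage-of-optimal capture described above can be sketched as follows (a hypothetical illustration; the throughput figures and failure modes are assumptions, not benchmark results):

```python
def availability_impact(optimal_tps: float, observed_tps: float) -> float:
    """Observed throughput as a percentage of the optimal benchmark run,
    quantifying the end-user impact of a simulated failure."""
    return 100.0 * observed_tps / optimal_tps

# Illustrative figures only.
baseline_tps = 500.0  # TPS from an optimal, failure-free run
simulated_failures = {
    "application server failure": 310.0,
    "network-segment failure": 270.0,
    "database instance failure": 0.0,
}
for mode, tps in simulated_failures.items():
    print(f"{mode}: {availability_impact(baseline_tps, tps):.1f}% of optimal")
```

Repeating the run for each failure hook yields the per-failure-mode degradation profile from which long-term outage costs can be extrapolated.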
The model is simple and
permits extrapolation techniques to be used to calculate the cost of long-term
partial or total system outage. Data extracted using these techniques provides
valuable information to permit the predictable design of highly available
business computing architectures, and can be used to predict how a given
architecture will perform should a failure occur in a production system.