The workshop will run from 8:00 am to 6:00 pm and is organized into 20-minute talks, with plenty of time for audience participation. There will be no proceedings for the workshop, since we encourage the presentation of work in progress and research in its early stages. Copies of the foils used by the speakers will be distributed to the attendees. To register for the workshop and for hotel information, please check http://www.ee.iastate.edu/~prasant/hpca. For more information, please contact any of the workshop co-chairs:
System Design Considerations for a Commercial Application Environment
Luiz Andre Barroso and Kourosh Gharachorloo
Western Research Laboratory
Digital Equipment Corporation
Exploiting Caches Under Database Workloads
Pedro Trancoso and Josep Torrellas
University of Illinois at Urbana Champaign
Optimizing UNIX for OLTP on CC-NUMA
Darrell Suggs
Data General Corporation
Tracing and Characterization of NT-based System Workloads
Jason Casmira, David Kaeli - Northeastern University
David Hunter - DEC Software Partners Engineering Group
Analysis of Commercial and Technical Workloads on AlphaServer
Platforms
Zarka Cvetanovic
Digital Equipment Corporation
Characterizing TPC-D on a MIPS R10K Architecture
Qiang Cao, Pedro Trancoso, Josep Lluis Larriba-Pey and Josep Torrellas
University of Illinois at Urbana Champaign
Performance Analysis of Shadow Directory Prefetching for TPC-C
Dan Friendly - University of Michigan
Mark Charney - IBM Research
Evaluating Branch Prediction Methods for an S390 Processor using
Traces from Commercial Application Workloads
Rolf B. Hilgendorf, IBM Entwicklung GmbH, Boeblingen, Germany
Gerald J. Heim, Wilhelm Schichard-Institut fuer Informatik, Universitaet
Tuebingen, Tuebingen, Germany
Multiprocessor Architecture Evaluation using Commercial Applications
Ashwini K. Nanda
IBM TJ Watson Research Center
Computer System Evaluations with Commercial Workloads based on SimICS
Jim Nilsson, Fredrik Dahlgren, Magnus Karlsson, Peter Magnusson* and
Per Stenstrom
Department of Computer Engineering, Chalmers University of Technology
*Swedish Institute of Computer Science
VPC and IADYN - Project Overviews
Rich Uhlig and Todd Austin
Intel Corporation
PANEL DISCUSSION:
"Do academics require access to DBMS source code in order to do effective research in the area of computer architecture for commercial workloads?" Participants - TBD.
ABSTRACTS:
Importance of Proper Configuration in Architectural Evaluations
using Database Workloads
Kimberly Keeton and David A. Patterson
University of California - Berkeley
Databases are complex systems, with many (e.g., O(10)) configuration knobs to turn in order to achieve reasonable performance. The Transaction Processing Performance Council (TPC) benchmarks are more difficult to run than other commonly used benchmarks like SPEC because of the complexity of the database server application. TPC benchmarks, such as TPC-C [1], provide a well-defined workload with quantitative criteria for scaling datasets. The inclusion of large datasets and their associated disk I/O component also complicates the benchmarking effort. Finally, the networking component of the client-server benchmarks adds further complexity. Because of resource constraints or a lack of understanding of the underlying software, researchers often improperly configure their benchmarking systems. They may underpopulate the disk subsystem, use too little memory, scale down the data set so that it fits entirely into memory, etc. Our full presentation will survey the validity of system configurations described in the literature.
All of these departures from a well-balanced system have the potential to affect the system under test in adverse ways, creating anomalies that may be unwittingly and unintentionally observed as being correct. The danger is that computer and operating systems designers might include optimizations that make no sense for properly configured systems, or might miss opportunities for improvements by not testing the machine under real load.
In this presentation, we will discuss some of the hardware and software factors that can affect the validity of a configuration. We will discuss several ways that systems can be poorly configured, what performance anomalies may result from these bad configurations, and how to detect them. Finally, we will provide some rules of thumb for configuring well-balanced systems when using database workloads as benchmarks.
1.0 Factors that Impact the Validity of Configuration
A number of hardware and software factors may impact the validity of
a configuration for measuring architectural performance for database workloads.
We assume that the number of processors will be chosen to match the size
of memory and the I/O subsystem. These additional factors may be grouped
into four categories: I/O system hardware factors, memory-related hardware
factors, database software system configuration parameters, and benchmark
configuration parameters, and are discussed in Table 1. These factors may
impact several key architectural and operating system characteristics,
such as the breakdown of user/system time, number of disk I/Os, and CPI.
Category | Example Factors |
I/O system hardware | Number and speed of disks, number and speed of I/O controllers, speed of disk I/O (e.g., SCSI) bus, speed of processor I/O (e.g., PCI) bus, amount of caching provided by I/O controllers, data layout to avoid hot spots |
Memory-related hardware | Amount of physical memory (which determines buffer cache size), memory bandwidth delivered by the system |
Database software parameters | Database buffer pool size, buffer management strategy parameters (e.g., high and low water marks for background writes of dirty pages) |
Benchmark configuration parameters | Database size (e.g., TPC-C's number of warehouses), number of simulated clients |
2.0 Detecting a Poorly Balanced Configuration
Fortunately, the TPC-C benchmark provides a quantitative measure for verifying that a system is reasonably configured. The benchmark specification requires that the ratio of transaction throughput, transactions per minute C (tpmC), to database scale factor, the number of warehouses, resides in the range from 9.0 to 12.7. tpmC:warehouses ratios outside this range indicate that the system isn't properly balanced, the data set isn't properly scaled, or both.
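The balance check described above can be sketched as a tiny classifier. The 9.0-12.7 range is the one quoted from the benchmark specification; the function name and diagnostic strings are illustrative, not part of the specification.

```python
def check_tpcc_balance(tpmC: float, warehouses: int) -> str:
    """Classify a TPC-C configuration by its tpmC:warehouses ratio.

    The acceptable 9.0-12.7 range comes from the benchmark
    specification; the diagnostic strings are illustrative.
    """
    ratio = tpmC / warehouses
    if ratio < 9.0:
        return "underperforming: unbalanced system or overly large data set"
    if ratio > 12.7:
        return "overperforming: data set likely under-scaled"
    return "within the acceptable range"

# The "low warehouses" case of Table 2 has a ratio of 27.8, e.g.
# 2780 tpmC on 100 warehouses:
print(check_tpcc_balance(2780, 100))
# -> overperforming: data set likely under-scaled
```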
In Table 2, we present a comparison of selected architectural and operating
system behavior for systems with tpmC:warehouses ratios within the acceptable
9.0 to 12.7 range, below 9.0, and above 12.7. These measurements were performed
using a commercial database running on a four-processor Pentium Pro-based
SMP running Windows NT 4.0, with 512 KB L2 processor caches, 2 GB of main
memory, and 90 database data disks.
Architectural Behavior | Properly Configured | Low Memory | Low Warehouses |
tpmC:warehouse ratio | 10.1 | 8.0 | 27.8 |
Relative transaction throughput | 100% | 79% | 115% |
User/system breakdown | 78% / 21% | 70% / 30% | 83% / 16% |
Disk reads/sec | 1550 | 2460 | 1070 |
Disk writes/sec | 1280 | 1235 | 1100 |
Database cycles per instruction (CPI) | 2.91 | 3.25 | 2.67 |
Database computation CPI | 0.96 | 0.96 | 0.95 |
Database resource stalls CPI | 0.57 | 0.74 | 0.49 |
Database instruction-related stalls CPI | 1.36 | 1.61 | 1.24 |
The "low memory" configuration yields a tpmC:warehouses ratio lower than the acceptable range. There is a loss in throughput due to the additional disk I/O required to compensate for the insufficient memory buffer pool. The increased I/O request rate results in an increased percentage of time spent in the operating system. In addition, we observe an increased database CPI, due to an increase in the resource and instruction-related stalls. Low tpmC:warehouses ratios may also occur when disk I/O takes too long, for example, when there is contention for the I/O bus, the I/O controller, or disks in the system. Alternately, this behavior could be exhibited by an extremely small dataset where there is artificial contention in data access, or when the offered transaction request load is too small because the number of client request generators is too low.
The "low warehouses" configuration yields a higher than appropriate tpmC:warehouses ratio. The throughput for this configuration is higher than it should be, because there is a reduced I/O rate due to the artificially high degree of locality in the workload. In this case, the data set isn't big enough to fully exercise the I/O system, as prescribed by the benchmark specification. This reduction in I/O request rate results in a higher percentage of time spent at user level. In addition, because more of the dataset is resident in memory, we see a lower database CPI, due to a decrease in resource and instruction-related stalls. Higher than acceptable tpmC:warehouses ratios may also occur for entirely memory-resident datasets.
From these simple examples, it is clear that the configuration of the system can impact several key architectural and operating system characteristics, such as the breakdown of user and system time, the disk I/O rates, and the cycles per instruction. To ensure that computer and operating system designers optimize for the correct behavior, performance analysts need to measure properly configured systems. In the full talk, we will conclude by articulating several rules of thumb for finding a reasonable configuration.
3.0 References
[1] Jim Gray. The Benchmark Handbook for Database and Transaction Processing
Systems. Morgan Kaufmann
Publishers, Inc., 2nd edition, 1993.
System Design Considerations for a Commercial Application Environment
Luiz Andre Barroso and Kourosh Gharachorloo
Western Research Laboratory
Digital Equipment Corporation
We have been studying numerous commercial workloads during the past
two years, with a special focus on their memory system behavior. This talk
describes our extensive experience with three important classes of workloads:
online transaction processing (OLTP), decision support systems (DSS), and
web index search. We use a popular commercial database engine for running
our OLTP and DSS workloads, and the AltaVista search engine for our web
index search workload.
Given the large scale of commercial workloads, scaling the size of
these workloads is critical for enabling a broad range of monitoring and
simulation studies. However, such scaling requires a deep knowledge of
the workloads and is rather sensitive to the set of architectural issues
under study. We briefly describe our experience in scaling these workloads
to allow for manageable architectural studies without compromising representative
memory system behavior.
Our memory system characterization effort is based on a large number of architectural experiments on Alpha multiprocessors, augmented with full system simulations (using our Alpha port of SimOS) to determine the impact of architectural trends. We have also studied the impact of processor architecture on the memory system behavior of these workloads using trace-driven simulations. Specifically, we discuss the effectiveness of dynamic scheduling and simultaneous multithreading for hiding memory latencies in such applications.
Overall, our studies have allowed us to identify and evaluate critical system design parameters that impact the performance of commercial workloads.
Exploiting Caches Under Database Workloads
Pedro Trancoso and Josep Torrellas
University of Illinois at Urbana Champaign
Since databases have traditionally been bottlenecked by I/O, the cost
functions used to select what operations and algorithms to use are dominated
by the number of I/O operations. As DRAM becomes denser and cheaper, however,
machines come equipped with more memory, and more databases are able to
keep a large fraction of their data permanently in memory. Furthermore,
memory systems are typically organized in cache hierarchies. Exploiting
these cache hierarchies well is crucial to high performance. In this talk,
we discuss a range of software and hardware supports for databases to run
efficiently in machines with caches. We design a model of cache misses
for database algorithms, and use it to build new cost functions for a cache-oriented
query optimizer. We also examine enhancing the cache use of database operations
with software prefetching, blocking algorithms, data layout restructuring,
and cache bypassing. The combination of all these optimizations has a very
high impact.
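To make the idea of a cache-oriented cost function concrete, here is a hypothetical sketch: an optimizer compares a sequential scan against an index scan by estimated cache-miss cost rather than I/O count. The miss penalty, line size, and per-operator miss models are illustrative assumptions, not the authors' actual model.

```python
MISS_PENALTY_CYCLES = 60      # assumed miss-to-memory latency
LINE_SIZE = 64                # assumed bytes per cache line

def sequential_scan_cost(rows: int, row_bytes: int) -> float:
    """Cache-aware cost of a sequential scan: roughly one miss per
    cache line touched, since a large scan gets no line reuse."""
    lines_touched = rows * row_bytes / LINE_SIZE
    return lines_touched * MISS_PENALTY_CYCLES

def index_scan_cost(rows_selected: int, row_bytes: int,
                    index_depth: int) -> float:
    """Cache-aware cost of an index scan: pessimistically, one miss
    per index level walked plus the misses to fetch the row itself."""
    misses_per_probe = index_depth + max(1, row_bytes // LINE_SIZE)
    return rows_selected * misses_per_probe * MISS_PENALTY_CYCLES

# 1M rows of 100 bytes: scanning everything vs. selecting 1% of the
# rows through a 3-level index. The optimizer picks the cheaper plan.
scan = sequential_scan_cost(1_000_000, 100)
index = index_scan_cost(10_000, 100, 3)
print("index scan" if index < scan else "sequential scan")
```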
Optimizing UNIX for OLTP on CC-NUMA
Darrell Suggs
Data General Corporation
Our product goals challenged us with the need to identify software scaling issues for commercial OLTP workloads in a UNIX environment for a CC-NUMA system under design. The lack of prototype hardware that even remotely approximated the new design required that all analysis be done in a modeling and simulation environment. Our goal was to use tracing and simulation to provide sufficient lead time to make the required software modifications. As part of the process, we evolved a technique for obtaining architecture-independent address traces for OLTP (TPC-C) workloads. We then used these traces to drive a detailed hardware simulation model. Through this technique we identified software scaling issues, advised reasonable changes, and predicted the impact of the changes. As the product became real, we were able to verify our predictive techniques against actual behavior. The results indicated that our analysis techniques are quite sound.
In the presentation, we will discuss machine architecture specifics, general software scaling issues, trace techniques, and comparative results from simulation and actual execution.
Tracing and Characterization of NT-based System Workloads
Jason Casmira, David Kaeli - Northeastern University
David Hunter - DEC Software Partners Engineering Group
Commercial applications are beginning to rely more on the services
and APIs provided by the hosting operating system. This is particularly
true in web browsers and database oriented workloads. Current trends
in computer architecture have been driven by the characteristics of benchmark
programs (e.g., SPEC, SPLASH, Bytemark). We argue that results using
these simple benchmarks may be misleading, especially since most of these
benchmarks make little use of operating system services.
In an effort to address this deficiency, we have developed the PatchWrx
Dynamic Execution Tracing Tool Suite. PatchWrx operates on a DEC
Alpha NT platform and has the ability to capture the entire execution of
commercial applications, including the NTOS kernel, DLLs and system services.
Trace capture is performed with a slowdown factor of 2x-4x. This low overhead
is only possible by using the PALcode interface provided on the Alpha architecture.
With this ability to capture complete application behavior on a system,
it is possible to more accurately project performance of commercial applications
on future architectures.
We have been able to capture the workload characteristics of several
applications run under Microsoft NT Workstation, hosted on a DEC AXP Alpha
21064 workstation. We have also compared these results to traces
of the bytemark benchmark captured in the same environment. To date we
have studied the characteristics of the BYTE magazine benchmark and have
compared these characteristics to Microsoft Internet Explorer (IE) and
Microsoft CD Player (MCD). The amount of operating system interaction
in the BYTE benchmark is less than 1% of the total execution, while for
the two commercial applications we encounter overheads of 23% for IE and
78% for MCD. We have also observed as much as a 53.6% decrease in basic
block size when considering an application running with operating system
behavior included versus the application viewed as independent of any operating
system activity. This will have a dramatic effect on a branch prediction
design.
Not only does the basic block size differ, but there is a marked difference
in the total number of load and store instructions executed on the system.
We found the total number of loads and stores to increase by as much as
72% when considering the operating overhead present in the application,
versus the execution of the application code alone. This suggests that
we may need to rethink some of our assumptions about memory reference characteristics
and memory hierarchy design.
Using the PatchWrx Dynamic Execution Tool Suite, we can not only address
questions related to the performance of commercial workloads, but we can
also identify the inherent differences between application and operating
system execution. We are currently gathering traces on Oracle databases
running TPC-D workloads. We will be reporting on the characteristics
of these workloads and their interaction with the NT kernel in our presentation.
Analysis of Commercial and Technical Workloads on AlphaServer
Platforms
Zarka Cvetanovic
Digital Equipment Corporation
This talk will identify major hardware components that have crucial
impact on the performance of commercial workloads on AlphaServer platforms.
We will contrast the characteristics and requirements of technical and
commercial workloads. The technical workloads used include the set of SPEC95,
SPEC95-parallel, Linpack, and NAS Parallel (decomposed). The commercial
workloads include: TPC-C, SPECweb96, and Laddis. The AlphaServer platforms
evaluated include: mid-range server AlphaServer 4100 (Rawhide), and high-end
server AlphaServer 8400 (TurboLaser).
We will contrast performance and SMP scaling of parallel/multistream
technical and commercial workloads. We will analyze single, dual, triple,
and quad issuing time on AlphaServer platforms. We will compare issuing
and stall time and identify major components that caused CPU stalls. We
will analyze cache misses thru several levels of cache/memory hierarchy.
Included will be data and analysis of traps, TB misses, branch mispredicts
and their effect on performance of technical/commercial workloads. The
analysis of bus traces with Read, Victim, and Shared traffic will be presented.
We will include analysis of the performance effects of Memory Barriers,
Locks, and Shared-Dirty data. A breakdown of the total stall time to different
stall components will be presented. The evaluation of performance benefits
from larger caches will be included. We will present guidelines for optimizing
future hardware architectures and implementations that will allow for high
performance in both technical and commercial applications.
We will conclude the presentation with a discussion of the effect
of I/O subsystem efficiency on commercial performance.
Characterizing TPC-D on a MIPS R10K Architecture
Qiang Cao, Pedro Trancoso, and Josep Torrellas
University of Illinois at Urbana Champaign
In this work, we use the MIPS R10K counters to measure statistics on
the instruction execution and cache performance of the TPC-D benchmark
queries running on an SGI Origin 200 machine. The database is memory resident.
We classify the different queries from the TPC-D benchmark according to
their absolute number of instruction and data misses and relative weight
of instruction to data misses. This classification allows us to identify
queries with similar behavior and therefore select the queries that are
representative of the whole benchmark. For those representative queries
we study the behavior of each of their operations. In addition we also
study the impact of the access methods to the data, namely the use of index
or sequential scan. Finally, we study the scalability of the benchmark
by running different data set sizes.
The results show that the time lost to cache misses accounts for a
significant portion of the query execution time. In addition, instruction
misses have a high weight. Finally, additional index structures do
not seem to help in improving the scan performance for some of the queries.
Characterization, Tracing, and Optimization of Commercial I/O
Workloads
H. Huang, M. Teshome, J. Casmira and D.R. Kaeli
Northeastern University Computer Architecture Research Laboratory
Boston, MA
Brian Garrett and William Zahavi
EMC Corporation
Hopkinton, MA
RAID systems have become the industry standard for providing efficient,
fault-tolerant mass storage [1,2]. Many of these RAID systems provide
large (multi-gigabyte), integrated cache storage. To provide scalable I/O
storage access, improved storage system caching strategies must be developed.
This research attempts to improve the efficiency of the disk array subsystem
by characterizing the reference patterns generated by commercial workloads.
Then new cache management algorithms are explored to improve the performance
of these commercial systems. This work is in conjunction with researchers
at EMC Corporation, the leading manufacturer of cached disk array subsystems.
Our research is targeted at storage intensive commercial applications
such as On-Line Transaction Processing (OLTP). This paper presents
our work on the capture, characterization, synthesis and use of these workloads,
detailing new prefetching and organizational issues related to providing
scalable storage performance. We also present a new hardware mechanism
which limits the effects of errant prefetching called a Prefetch Buffer
Filter (PBF) [3]. For OLTP workloads the PBF can increase the effective
disk cache hit ratio by as much as 52%, while reducing the amount of disk
traffic by 49%.
References:
[1] R.H. Katz, G.A. Gibson and D.A. Patterson,
``Disk System Architectures for High Performance Computing,''
Proc. of the IEEE, Vol. 77, pp. 1942-1958, Dec. 1989.
[2] P.M. Chen and E.K. Lee,
``Striping in a RAID Level 5 Disk Array,''
in Proc. of Sigmetrics '95, Vol. 23, No. 1, pp. 136-145, May
1995.
[3] J. Casmira and D.R. Kaeli, ``Modeling Cache Pollution,''
to appear in the Journal of Modeling and Simulation,
Vol. 19, No. 2, May 1998.
Trace-driven Performance Exploration of a PowerPC 601 OLTP Workload
on Wide Superscalar Processors
J.H. Moreno, M. Moudgill, J.D. Wellman, P. Bose, L. Trevillyan
IBM T.J. Watson Research Center
We describe an investigation of the potential performance of PowerPC-based
wide superscalar processors on a standard on-line transaction processing
(OLTP) benchmark, using a PowerPC 601 instruction and data reference trace
containing 170 million instructions. In this study, we have explored instruction-level
parallelism as a function of the policy for issuing instructions, the processor
width, the size of the cache memory, and the branch predictor. We summarize
the characteristics of our exploration approach, describe the features
of the processor model used in the exploration, the configurations explored,
and the results from the experimentation. Results include the average cycles
per instruction (CPI) obtained in each configuration, details on degradation
factors that limit performance, and sensitivity to some selected microarchitecture
features.
The simulation results validate common wisdom regarding the degrading effects of the memory subsystem on workloads such as the one considered, a behavior encountered regardless of processor width; the results also give insight into the utilization of the various resources in the processor. For the trace and processor organizations considered, the simulation data show that increasing the processor issue width to eight operations per cycle is advantageous, whereas wider organizations provide diminishing performance improvement. For example, scaling up a four-issue out-of-order processor by doubling the number of units and the various widths, while preserving cache sizes and prediction accuracy, produces about a 20% overall performance improvement. Also doubling the size of the caches in the same configuration produces an additional 10% improvement. In addition, the simulation results show that wide-issue out-of-order superscalar processors are potentially capable of continuing to deliver performance, albeit at the cost of complexity, thereby helping lay the groundwork for evaluating the corresponding trade-offs in the design of such systems.
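The kind of exploration described above can be illustrated with a toy trace-driven model: instructions retire in order, at most `width` per cycle, and each must wait for its register sources. The unit-latency dependence model and the trace encoding are simplifying assumptions, far from the authors' detailed processor model.

```python
def cycles_for(trace, width):
    """Toy in-order issue model: trace is a list of
    (dest_reg, src_regs); at most `width` instructions issue per
    cycle, results are ready one cycle after issue (unit latency).
    CPI for the trace is cycles_for(trace, width) / len(trace)."""
    ready = {}                      # register -> cycle its value is ready
    cycle, slots = 1, width
    for dest, srcs in trace:
        if slots == 0:              # current issue group is full
            cycle, slots = cycle + 1, width
        stall_until = max((ready.get(r, 0) for r in srcs), default=0)
        if stall_until > cycle:     # stall waiting for source operands
            cycle, slots = stall_until, width
        ready[dest] = cycle + 1
        slots -= 1
    return cycle

# Eight independent instructions: a 4-wide machine needs 2 cycles,
# an 8-wide machine 1 -- the width pays off only with enough ILP.
indep = [(i, []) for i in range(8)]
print(cycles_for(indep, 4), cycles_for(indep, 8))  # -> 2 1
```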
Performance Analysis of Shadow Directory Prefetching for TPC-C
Dan Friendly - University of Michigan
Mark Charney - IBM Research
It is a widely accepted attribute of commercial workloads that they
place greater demands on the memory subsystem than the standard benchmark
suites. In this work we take a close look at the impact servicing memory
requests has upon performance. To mitigate the effects of the larger memory
footprint of commercial code we suggest an aggressive prefetching technique
for both instruction and data caches.
Shadow directory prefetching is a recently proposed hardware prefetching
mechanism that attempts to maintain an accurate history of the pattern
of level 1 cache misses. By mapping an L2 cache request to a previously
requested line, shadow directory prefetching can often anticipate subsequent
requests of the data. When a line is requested, the L2 responds with the
demand miss and also provides a number of prefetches by following the mappings
in the shadow directory. In doing so it is able to reduce the average
latency of fetch requests to the memory subsystem.
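The mechanism described above can be sketched as a toy model: the L2 records, for each miss, the line that missed next, and on a later miss to the same line it follows that chain to issue prefetches. The chain depth, table organization, and method names are simplifying assumptions, not the actual hardware design.

```python
class ShadowDirectory:
    """Toy shadow-directory prefetcher: remembers the L1 miss that
    followed each miss and replays that chain on a repeat miss."""

    def __init__(self, depth: int = 2):
        self.next_miss = {}     # line -> line that missed right after it
        self.prev = None        # most recent L1 miss seen by the L2
        self.depth = depth      # links to follow per demand miss

    def record_miss(self, line: int) -> None:
        """Update the directory with the observed miss pattern."""
        if self.prev is not None:
            self.next_miss[self.prev] = line
        self.prev = line

    def prefetches_for(self, line: int) -> list:
        """Follow the recorded chain to anticipate subsequent requests."""
        out, cur = [], line
        for _ in range(self.depth):
            cur = self.next_miss.get(cur)
            if cur is None:
                break
            out.append(cur)
        return out

sd = ShadowDirectory()
for line in [10, 11, 17, 10, 11, 17]:   # a repeating miss pattern
    sd.record_miss(line)
print(sd.prefetches_for(10))   # -> [11, 17]
```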
In this work we evaluate the effectiveness of shadow directory prefetching
on a PowerPC 604 trace of the Oracle TPC-C code. The prefetching scheme
is assessed through a number of metrics -- CPI, coverage, accuracy, latency,
bus traffic and bus utilization. Variations of the shadow directory prefetching
algorithm are examined including the use of confirmation and altering the
mapping scheme.
Our preliminary results show that the utilization of the shared L2
to L1s reply bus places a significant limitation on the effectiveness of
the prefetching algorithm. As shadow directory prefetching is initiated
at the L2 cache it becomes imperative that the L2 maintain inclusion information
so that the prefetch engine can inhibit the sending of redundant prefetches.
Furthermore, we have found that inhibiting the bus transfers of prefetches
when demand requests are being processed increases the effectiveness of
the prefetching. Using both techniques, shadow directory prefetching is
shown to improve performance by close to 10%. This represents a significant
portion of the performance lost to the effects of having a finite memory
subsystem.
Evaluating Branch Prediction Methods for an S390 Processor using
Traces from Commercial Application Workloads
Rolf B. Hilgendorf, IBM Entwicklung GmbH, Boeblingen, Germany
Gerald J. Heim, Wilhelm Schichard-Institut fuer Informatik, Universitaet
Tuebingen, Tuebingen, Germany
For modern superscalar processors, branch prediction is a must. There
has been significant progress in this field during recent years, but it
is not so clear which if any of the currently advocated schemes is superior.
The quality of branch prediction algorithms has typically been measured using the SPECmark test suite. Different programs from this suite show different behavior with respect to branch prediction, so some publications take the average prediction rate across the members of the suite, while others use only selected programs to stress certain aspects.
For the IBM System/390 environment there exists a set of traces representing different areas of commercial workloads; they include operating-system interactions as well. We used four of these traces to evaluate a wide variety of branch prediction algorithms in order to assist with the design trade-offs to be made. The traces used differ significantly from traces extracted from the SPECmark test suite, mostly with respect to the
Multiprocessor Architecture Evaluation using Commercial Applications
Ashwini K. Nanda
IBM TJ Watson Research Center
In this talk I will discuss (1) the design of a new simulation environment called COMPASS, and (2) how COMPASS is used to evaluate the memory behavior of three important commercial applications, namely, TPCC (transaction processing), TPCD (decision support) and SPECWeb (web server) on multiprocessor systems.
Shared memory multiprocessors have become the de facto choice as server platforms for commercial applications such as transaction processing, decision support, and web servers. Therefore, it has become desirable to study the performance impact of these commercial applications on shared memory server platforms in order to make the right design decisions. However, most of the contemporary simulators used to evaluate shared memory multiprocessors, both in industry and academia, are not suitable for running commercial parallel programs. As a result, most architecture studies are confined to scientific applications such as the Stanford SPLASH2 benchmarks.
Scientific applications on shared memory machines usually spend very little time in the operating system. Therefore, not simulating any OS activity does not result in any significant loss of accuracy for these applications. However, many commercial applications depend heavily on operating system services, and some of them spend a significant portion of their CPU time in the operating system. This is because commercial applications tend to use the sophisticated inter-process communication mechanisms and I/O functions that operating systems provide. For example, our profiling data show that on both Unix and Windows NT systems, web servers spend 70-85% of their CPU time in the OS kernel. On-Line Transaction Processing (OLTP) applications such as TPCC and decision support applications such as TPCD spend about 20% of their time in the operating system. These applications, moreover, generate a significant amount of I/O activity. Therefore, in order to study commercial application behavior with reasonable accuracy, one has to simulate the OS functions where these applications spend a significant amount of their execution time.
COMPASS (COMmercial PArallel Shared Memory Simulator) was developed keeping these requirements in mind. COMPASS uses the basic instrumentation and execution driven simulation mechanism in a PowerPC version of the Augmint simulator. We carefully designed mechanisms to simulate only important OS functions that affect the performance of our target applications, yet to support virtually all of the OS functions that are potentially used by these applications. One can use the COMPASS environment to study the behavior of future shared memory multiprocessor systems and optimize them for commercial applications, as well as to optimize these applications for future server platforms.
Computer System Evaluations with Commercial Workloads based on SimICS
Jim Nilsson, Fredrik Dahlgren, Magnus Karlsson, Peter Magnusson* and
Per Stenstrom
Department of Computer Engineering, Chalmers University of Technology
*Swedish Institute of Computer Science
INTRODUCTION
We recognize an increasing interest in identifying, and finding remedies
for, performance bottlenecks for large commercial applications in future
computer systems. This requires an experimental platform that is able to
support the exploration of hardware/software interaction that includes
system software as well as the I/O system. Furthermore, such a platform
should be able to take new architectural designs into consideration as
well as support the adaptation of software to take full advantage of the
target system. The SimICS/Sun4m simulation environment is such a platform.
The simulator platform is able to boot and run unmodified operating systems,
such as Linux 2.0.30 and Solaris 2.6. The environment provides the means for
examining the underlying principles of contemporary and future multiprocessor
computer architectures, and their interaction with operating systems and
applications.
SIMULATOR PLATFORM
The platform consists of three components: the SimICS instruction-set
simulator, the Sun4m kernel architecture simulator, and a set of
memory system simulators.
SimICS makes use of threaded code and/or native code generation, and
runs approximately 25 times slower than the host platform when running
SPECint95 programs and gathering only a minimum of statistics,
and approximately 80 times slower when gathering detailed information,
including TLB misses, data cache and instruction cache misses, and
executed instructions, all at the granularity of individual instructions.
SimICS supports multiple Sparc V8 processors, multiple physical address spaces,
and system-level code.
We have developed a set of devices to model
the Sun4m kernel architecture, which includes the Sparcstation 5, 10, and
20. A kernel architecture defines a family of machines on which the same
set of operating system binaries can run. These devices include a SCSI
chipset with disk, console, ethernet, interrupt chips, I/O memory management
unit (I/O MMU), on-chip MMU, timers and counters, DMA, and the boot prom.
Properly configured with these devices, the platform today boots either Linux 2.0.30
or Solaris 2.6, and runs as a virtual workstation on the local network.
The kernel architecture model is hardware-specific in order to allow
us to run completely unmodified operating systems and application binaries, but
the memory organization and timing can be changed. This allows
us to study the impact of architectural alternatives on the performance
of the application program. SimICS supports a simple timing model where
instructions take one ``cycle'', but memory operations are allowed to stall
the processor, thus allowing realistic interleaving of memory
operations in a multiprocessor.
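The timing model sketched above can be illustrated in a few lines. This is not SimICS code; all names and the fixed memory latency are invented for illustration only.

```python
# Illustrative sketch (not SimICS code) of a simple timing model: every
# instruction costs one base cycle, and memory operations may additionally
# stall the processor for a latency returned by a pluggable memory-system
# model. The class/function names and the miss penalty are assumptions.

class FlatMemoryModel:
    """Hypothetical memory-system model: fixed extra latency per access."""
    def __init__(self, miss_penalty=0):
        self.miss_penalty = miss_penalty

    def access_latency(self, address):
        # Extra stall cycles beyond the base cycle for this access.
        return self.miss_penalty

def run(instructions, memory_model):
    """Advance a cycle counter over a toy instruction trace.

    Each instruction is a (kind, address) pair, where kind is 'alu',
    'load', or 'store'. Returns the total cycle count.
    """
    cycles = 0
    for kind, address in instructions:
        cycles += 1  # every instruction takes one base "cycle"
        if kind in ('load', 'store'):
            cycles += memory_model.access_latency(address)  # memory stall
    return cycles

trace = [('alu', None), ('load', 0x1000), ('store', 0x2000)]
print(run(trace, FlatMemoryModel(miss_penalty=10)))  # prints 23
```

Swapping in a different memory-system model changes only the stall cycles, which is what allows realistic interleaving of memory operations to emerge without a detailed pipeline model.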
EXAMPLE OF USAGE
The capability to model arbitrary memory organizations allows for
flexible analyses of current as well as future architectures. Simulations
are deterministic, so that interesting events during an
execution can be studied in more detail in a follow-up execution with
more instrumentation during a specific interval. With this simulation
environment, we have explored different design alternatives for database
workloads,
such as TPC-C and TPC-D on the PostgreSQL database handler.
Moreover, since the simulator is program driven, the effects of modifications
of both kernel and application software can be explored, which is important
in two different scenarios. The first is performance tuning of an application,
in which case the simulation platform provides more information about
performance bottlenecks than can be retrieved from a real system. The second
scenario is the evaluation of performance improvements involving modifications
of the HW/SW interface, such as prefetching or bulk data transfer. In such
cases, the simulation platform provides a powerful opportunity not only
to estimate the execution time reductions through hardware support but
also to evaluate real modifications of the application as well as system
software.
VPC and IADYN - Project Overviews
Rich Uhlig and Todd Austin
Intel Corporation
Despite their importance in the commercial marketplace, systems based
on PC hardware architecture and operating systems remain, for the most
part, unstudied in the research literature. Part of the reason has
been the lack of good, publicly available analysis tools that work in the
PC realm; many excellent tools have long existed for RISC-based systems
(SimOS, Embra, Shade, ATOM, Qpt, EEL, Pixie, PatchWrx, etc.), but only
recently have similar tools become freely available for x86-based systems,
and even these new tools are limited in their ability to consider complete
system behavior. In this talk we will describe on-going work in Intel's
Microcomputer Research Lab (MRL) to build new simulation and performance-analysis
tools for PC-based systems. We will focus, in particular, on characteristics
of PC systems that make them inherently harder to study than RISC-based
systems.
The first part of the talk will discuss the VPC project, an effort to
build a fast functional PC platform simulator. The VPC design emphasizes
completeness and extensibility over raw simulation speed, and we'll discuss
the factors we considered in making this design tradeoff. Completeness
means that VPC is able to simulate the full execution of unmodified OS
and application binaries, including both their kernel-mode and user-mode
components. Extensibility means that the VPC simulator kernel has been
designed to ease the development of new platform device models, and
to make it possible for similar device models (e.g., IDE and SCSI disk
interfaces) to share common simulation functions. Our VPC prototype
is able to simulate a complete boot of unmodified NT system binaries in
under 10 minutes. We will discuss work-in-progress that applies VPC to
the analysis of popular office productivity applications (e.g., Office97),
server workloads (e.g., Database, Web and File Servers), and games.
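The extensibility goal described above, letting similar device models (such as IDE and SCSI disk interfaces) share common simulation functions, can be sketched as follows. This is a hypothetical illustration, not the VPC interface; all class and method names are invented.

```python
# Hypothetical sketch (not the VPC API) of device-model extensibility:
# two distinct disk-interface models reuse a common block-device core
# rather than each reimplementing sector storage.

class BlockDeviceModel:
    """Shared simulation functions for disk-like devices."""
    SECTOR_SIZE = 512

    def __init__(self, image):
        self.image = image  # backing store: sector number -> bytes

    def read_sector(self, sector):
        return self.image.get(sector, b'\x00' * self.SECTOR_SIZE)

    def write_sector(self, sector, data):
        self.image[sector] = data

class IDEDiskModel(BlockDeviceModel):
    """Would model IDE register-level behavior; reuses the shared core."""
    def handle_read_command(self, sector):
        return self.read_sector(sector)

class SCSIDiskModel(BlockDeviceModel):
    """Would model SCSI command behavior; reuses the same shared core."""
    def handle_read10(self, sector):
        return self.read_sector(sector)

disk = {0: b'boot'}
ide, scsi = IDEDiskModel(disk), SCSIDiskModel(disk)
print(ide.handle_read_command(0) == scsi.handle_read10(0))  # prints True
```

The interface-specific classes stay small because everything below the command layer is shared, which is the kind of reuse the VPC design aims for.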
The second part of the talk will discuss the IADYN project, an ongoing
effort to build a family of iA32 ISA simulation components. With over
15,000 semantically distinct variants of more than 700 instructions, the
iA32 ISA presents a unique challenge in ISA simulator design. At the same
time, varied user requirements, ranging from fast functional simulation
for workload positioning and light tracing to detailed functional
simulation with support for micro-operation generation and arbitrary
speculation, serve to further amplify complexity. To help manage this
development challenge we've built DEFIANT, a simulator development
framework based on a formal model of the iA32 ISA. During this part of
the talk, we will highlight the components being constructed as part of
the IADYN project, lend some insights into the complexity of the iA32
ISA, and briefly describe our DEFIANT simulator development environment.