The Final Program for the workshop is listed below, followed by an abstract for each talk. A 20-minute time limit will be strictly enforced for each talk, and there will be plenty of time for audience participation. A round-table discussion session will be held after the technical presentations. There will be no proceedings for the workshop, since we encourage the presentation of work in progress and research in its early stages. Copies of the foils used by the speakers will be distributed to the attendees.
One example where the hardware can better interact with software is in the area of processor caches. While it is well known that higher levels of associativity in processor caches greatly benefit the performance of commercial applications, systems are still being shipped today with direct-mapped caches. Beyond the direct performance loss, such a cache organization also contributes to a second-order effect: the inability to get consistent performance results between benchmark runs. This is caused by an inconsistent page allocation scheme that does not attempt to prevent address conflicts in the processor cache. While research papers have discussed the merits of such schemes, they are often not provided by commercial systems vendors. Our talk will describe other examples of such "non-holistic" computer systems design, including the effects of compiler optimizations that cannot be applied and the penalty for not supporting memory "pinning."
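The conflict-avoiding page allocation alluded to above is usually called page coloring. A minimal sketch of the idea follows; the cache sizes and the 4KB page size are illustrative assumptions, not parameters of any system discussed in the talk.

```python
# Sketch of cache page coloring: an OS allocator can avoid processor-cache
# conflicts by balancing physical pages across cache "colors". Two pages
# of the same color map to the same region of the cache and conflict.
# All sizes here are illustrative assumptions.

PAGE_SIZE = 4096  # bytes

def color(phys_addr, cache_size=4 * 1024 * 1024, assoc=1):
    """Return the cache color of the page containing phys_addr.

    For a direct-mapped cache (assoc=1) there is one color per
    page-sized slice of the cache; associativity divides the count.
    """
    colors = cache_size // assoc // PAGE_SIZE
    return (phys_addr // PAGE_SIZE) % colors

# Two pages exactly one cache-size apart collide in a direct-mapped cache:
assert color(0x0000_0000) == color(0x0040_0000)
# With 2-way associativity the number of colors halves:
assert color(0x0000_0000, assoc=2) == color(0x0020_0000, assoc=2)
```

An allocator that hands consecutive virtual pages physical frames of rotating colors makes benchmark runs repeatable, because the cache-conflict pattern no longer depends on which free frames happened to be available.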
In this talk we discuss several problems that can be characterized using TPC benchmarks and how these problems impact commercial workloads. The discussion will cover real-world problems with specific hardware features as well as deficiencies with system software and tools. We will also identify some key aspects of benchmark behavior and discuss what implications they have on the requirements for new commercial systems architectures. This discussion will include the I/O subsystem as well, which has typically received less attention than the primary processor-to-memory interconnects.
Our talk will cover performance data gathered from Sequent PentiumPro systems and Xeon NUMA systems running commercial databases and workloads. Data has been collected both on large systems running a NUMA-optimized OS and on smaller systems running a non-optimized OS. The data includes measurements of two different databases and covers TPC-C, TPC-D and customer workloads. We will give a brief overview of the system architecture and a table showing the system size, the OS, the database used, and the benchmark. The OS and databases will not be named explicitly, but we will indicate where they differ.
We will show the hit and reference rates for the various workloads. The data, measured on actual systems, will include remote reference rates, L2 hit rates, and remote cache hit rates. The reduction in off-quad traffic achieved through the remote cache will be demonstrated with remote cache hit rates and bus bandwidths, and related to the reduction in interconnect bandwidth that results from using the remote cache.
We will also present data generated from our performance model that compares the performance with and without a remote cache for some of these benchmarks. The model will be run for 10:1 and 2:1 remote-to-local latency ratios. We will show the remote-to-local ratio with no remote cache needed to get equivalent performance to a system with a 10:1 ratio and a remote cache. The performance for a ratio of 2:1 will be shown, with and without remote cache, to demonstrate that even with a very good ratio a remote cache can still improve performance. Lastly our data will show what ratio is required with a remote cache to get equivalent performance to the system with the 2:1 ratio and no remote cache.
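The trade-offs the model explores can be summarized with a back-of-envelope average-latency calculation. The sketch below is our own illustration, not the presenters' model; the reference mix and hit rates are assumed numbers.

```python
# Illustrative sketch of why a remote cache helps on a NUMA system:
# average memory-reference latency with and without a remote cache that
# intercepts a fraction of off-quad references. All parameters are
# assumptions for illustration, not Sequent measurements.

def avg_latency(local_frac, ratio, rc_hit_rate=0.0, local_lat=1.0):
    """Average reference latency in units of the local latency.

    local_frac  -- fraction of references satisfied locally
    ratio       -- remote-to-local latency ratio (e.g. 10 or 2)
    rc_hit_rate -- fraction of remote references hitting the remote
                   cache, which we assume is serviced at local latency
    """
    remote_frac = 1.0 - local_frac
    remote_lat = ratio * local_lat
    # Remote-cache hits avoid the interconnect round trip.
    effective_remote = rc_hit_rate * local_lat + (1 - rc_hit_rate) * remote_lat
    return local_frac * local_lat + remote_frac * effective_remote

# 10:1 ratio, 30% remote references, 70% remote-cache hit rate:
print(avg_latency(0.7, 10))                  # no remote cache
print(avg_latency(0.7, 10, rc_hit_rate=0.7)) # with remote cache
print(avg_latency(0.7, 2))                   # 2:1 ratio, no remote cache
```

Under these assumed numbers the remote cache at a 10:1 ratio comes close to the latency of a 2:1 system without one, which is the flavor of equivalence the talk's model quantifies with measured inputs.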
We will finish with our conclusions on the effectiveness of remote cache.
Our analysis abstracts the lifetime of a memory block as a sequence of intervals, where each interval is a run of read accesses to the block bounded by write accesses to the block. Intervals have several attributes including the identity of the processor writing the block at the start of the interval, the identity of processors that participate in the interval, the identity of the processor writing the block at the end of the interval, the number of read accesses and the duration of the interval. A block is characterized by its interval behavior: the number of intervals and the interval attributes.
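The interval abstraction described above can be made concrete with a short sketch. The record layout and trace format below are our own illustrative choices; the analysis in the talk may track additional attributes such as interval duration.

```python
# Sketch of the interval abstraction: split each memory block's access
# stream into runs of reads bounded by writes, recording the writer that
# opens each interval, the readers that participate, and the read count.
# Trace format and field names are illustrative assumptions.

from collections import defaultdict

def intervals(trace):
    """trace: iterable of (cpu, op, block) with op in {'R', 'W'}.

    Returns {block: [interval, ...]} in trace order, where each
    interval is a dict with 'writer', 'readers', 'reads', 'closed_by'.
    """
    current = {}                 # block -> currently open interval
    result = defaultdict(list)
    for cpu, op, block in trace:
        if op == 'W':
            # A write closes the current interval and opens a new one.
            if block in current:
                current[block]['closed_by'] = cpu
                result[block].append(current[block])
            current[block] = {'writer': cpu, 'readers': set(), 'reads': 0}
        else:
            # A read participates in the open interval (opening one with
            # an unknown writer if the block was never written in-trace).
            iv = current.setdefault(
                block, {'writer': None, 'readers': set(), 'reads': 0})
            iv['readers'].add(cpu)
            iv['reads'] += 1
    for block, iv in current.items():
        iv['closed_by'] = None   # trace ended with the interval open
        result[block].append(iv)
    return dict(result)
```

Characterizing a block then reduces to summarizing this list: its interval count, and the distribution of writer/reader identities and read counts across its intervals.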
We examine the interval behavior of blocks in TPC-C at the interface between the processor caches and main memory using a trace of TPC-C running on Oracle captured on a 4-way multiprocessor system. We analyze interval counts and characteristics and producer-consumer access distributions of blocks (in addition to overall characteristics such as load/store frequencies, miss frequencies and miss types). We describe how the interval behavior varies across different blocks as well as how interval attributes vary across the intervals of a single memory block. Our analysis shows that a small fraction of the blocks have a large number of intervals and account for a majority of the misses. Also, interval attributes are consistent across the intervals of a single memory block. We explore the nature of producer-consumer interactions in the interval behavior of these blocks and identify the dominant memory sharing patterns in the workload. These workload characteristics can suggest potential improvements to the memory system for future high-end multiprocessor servers.
In addition to their complexity, benchmarks such as these do not let the user see how a system behaves as the workload changes. Typical benchmarks provide only one point on the cost/performance curve, and cannot show what happens when workload parameters are changed, such as the amount of data sharing, the working set size, or the number and size of I/O accesses.
We have recently begun to use the AIM benchmark as an alternative for studying system performance and, perhaps more importantly, as a tool that can provide ``workload knobs'' for investigating how varying the workload affects system performance.
Our approach to date has been to characterize TPC workloads on a single SMP system using various monitoring tools (e.g., DCPI and IPROBE) which capture CPU, system bus, and I/O activity. We use processor performance counters to collect statistics on stall and issuing times, cache misses, replay traps, TLB misses, and instruction frequencies. We also use bus monitoring counters to collect statistics on bus utilization, types of bus commands, and the amount of data sharing. We then try to achieve similar characteristics by tuning the AIM workload, both at the ``workload level'' and at the source code level. So far our approach has been successful, and we are able to synthetically mimic the behavior of TPC-C on our SMP system.
The next step in our work (which will be presented at the workshop) will be to address synthetic workload scaling in order to match actual workload characteristics while scaling TPC benchmarks. This includes some definition of a ``transaction'' using AIM. Our eventual goal is to be able to provide knobs within AIM to allow for fine-grain system performance analysis across a variety of system configurations.
This talk will present our methodology to synthesize commercial workloads. We will also provide quantitative results on how close we have been able to mimic the behavior of a number of benchmark workloads.
Our results show many interesting issues. The amount of I/O performed in a query, together with the size of each transfer, remains a significant factor in the execution time. L2 data misses are also an important factor in determining the CPI of the queries. Furthermore, there is a correlation between the frequency of branch instructions and CPI, suggesting that branches contribute significantly to the CPI.
Overall, we find that, while individual queries exhibit different characteristics, TPC-D queries as a whole tend to have lower L1 and L2 miss rates and branch misprediction rates than TPC-C. Finally, we study the scalability of TPC-D as we move from a 1 Gbyte to a 10 Gbyte database, as well as the effects of different hardware parameters and various data layouts.
A variety of instruction counts are collected from some LANL computational physics application benchmarks and commercial multimedia applications. These instruction counts can be used to form a set of abstract characteristic parameters directly related to a processor's architectural features. Based on microprocessor architectural constraints and these calculated abstract parameters, the architectural performance bottleneck for a specific application can be estimated. In particular, the results can provide some insight into why only a small percentage of processor peak performance is achieved even for many cache-friendly codes. The bottleneck estimation can also suggest viable architectural/functional improvements for certain workloads. Herein lies the difference between previous comparisons of commercial and scientific workloads and our work: this method provides evidence of architectural differences among applications across platforms using a straightforward analysis technique based on a new method of workload characterization.
This talk describes a three-tiered analysis utilizing the aforementioned characterization technique on four LANL computational physics applications (SWEEP3D, HEAT, HYDRO, and HYDRO-T) and several commercial multimedia applications (RealPlayer, MPEG decoder, and some commonly used DSP kernels). All instruction-level characteristic data for these applications are collected on a MIPS R10000-based SGI Octane workstation using built-in performance counters. A set of abstract parameters is derived for each application to construct the application characteristics related to some key architectural features, such as pipeline queues, functional units, branch prediction, and memory hierarchy. The comparison provided in this talk attempts to draw conclusions for a particular architecture describing the performance differences attributable to architectural features utilized by scientific and commercial applications. This touches upon the greatest advantage of this new characterization technique: a better understanding of processor utilization efficiency and performance bottleneck for each application by pinpointing architectural constraints actually causing the CPU to stall. Another contribution of this technique is refined prediction of application-specific performance benefit from future architectural enhancements. Overall, the comparison between scientific and multimedia applications based on this new instruction-level characterization provides a picture of microprocessor performance utilization and efficiency across diverse applications for a particular architecture.
It is well known that cache misses contribute a significant fraction of CPI on database workloads [1]. However, little published work exists on the underlying constructs responsible for the misses. This is in contrast to scientific/numerical benchmarks, which also miss heavily in the cache but whose miss-causing code constructs are well understood. In this work we analyzed a trace of an OLTP run to identify constructs that contribute a large fraction of data cache misses. We find that a majority of data cache misses arise from accesses to database pages, which store data and index records. Further, we find that accesses to any individual page are clustered in time and that the intent to access a page is known in advance of the actual access. Based on our findings we propose a 'page-buffer' coupled with a special 'page prefetch' instruction to reduce the cache miss penalty on database workloads.
Methodology
Our study was done on a single-CPU, user-only trace collected using Shade [2] from a commercial database engine running the TPC-C benchmark. The trace consists of 1 billion instructions. The trace was collected on a server-only TPC-C setup with 125 warehouses. The physical memory on the system was 1 GB, of which about 800 MB was allocated for the shared address space. The trace does not include kernel instructions, and since it is a single-CPU trace it cannot be used to study coherence misses. However, since we focus on constructs that miss in single-CPU, user-only code, they will be misses in multi-CPU, user+kernel code as well.
Page Accesses
Database engines store data records in physical memory and on disk in the form of pages [3]. Each page holds several data or index records. The page size varies but is typically in the range of 2KB to 8KB (in the run we traced, a page size of 2KB is used). The amount of data that databases maintain in pages is huge (gigabytes). Accesses to pages therefore tend to cause compulsory or capacity misses. As cache sizes increase, misses on other, smaller structures gradually disappear, leaving misses on page accesses as a dominant fraction of total misses. The table below presents data illustrating this fact.
Cache configuration    Fraction of total misses on page accesses
2MB  1-way             23.17%
4MB  1-way             29.64%
8MB  1-way             36.10%
16MB 1-way             46.73%
16MB 2-way             58.12%
Page Prefetch
To reduce the impact of cache misses on page accesses we propose a small fully associative buffer on the CPU chip adjacent to the Level 1 data cache -- a page-buffer. The page-buffer has space to store 'n' complete database pages where typically 'n' is a small number (probably <= 4). We also propose a new flavor of prefetch instruction to load pages into this page-buffer typically well before first use. All load instructions check the page-buffer along with the first level cache. If the requested data is found in the page-buffer it is returned from the page-buffer to the pipeline.
The compiler identifies locations in the code where a series of accesses to a particular database page will start either through profile feedback or through 'pragmas' inserted in the source code by developers. In the trace we analyzed only two such locations exist. The compiler then inserts the new page prefetch instruction at these locations to bring in the page about to be accessed into the page-buffer. During execution, when the page prefetch instruction is executed the entire page is fetched into the page-buffer. Subsequent loads that access page data find the data in the page-buffer and data is returned to the pipeline at a significantly lower latency.
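The lookup path just described can be sketched in a few lines. This is a simplified model for illustration: the buffer capacity and the FIFO replacement policy are our assumptions, and a real implementation would operate on cache lines within the page.

```python
# Minimal behavioral sketch of the proposed page-buffer: a small fully
# associative buffer of whole 2KB database pages, filled by an explicit
# 'page prefetch' instruction and probed by every load in parallel with
# the L1 data cache. Capacity and FIFO replacement are our assumptions.

from collections import OrderedDict

PAGE_SIZE = 2048   # bytes, as in the traced run
N_PAGES = 4        # 'n' complete pages; the abstract suggests n <= 4

class PageBuffer:
    def __init__(self, n_pages=N_PAGES):
        self.pages = OrderedDict()       # page base address -> present
        self.n_pages = n_pages

    def prefetch(self, addr):
        """The proposed page prefetch instruction: bring the whole page
        containing addr into the buffer, evicting the oldest page."""
        base = addr - (addr % PAGE_SIZE)
        if base not in self.pages and len(self.pages) == self.n_pages:
            self.pages.popitem(last=False)   # FIFO eviction
        self.pages[base] = True

    def hit(self, addr):
        """Probe issued by each load alongside the L1 data cache."""
        return addr - (addr % PAGE_SIZE) in self.pages

pb = PageBuffer()
pb.prefetch(0x10_0800)         # fetch the page covering 0x100800
assert pb.hit(0x10_0FF0)       # later load to the same page hits
assert not pb.hit(0x10_1000)   # the next page was not prefetched
```

The compiler-inserted prefetch at the two identified locations maps directly onto `prefetch()` here; every subsequent load to that page returns from the buffer at the lower latency.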
The page-buffer is somewhat similar to stream buffers proposed by Jouppi [4]. But stream buffers rely on automatic detection of strided access patterns, and do not work well for the irregular access patterns we see in database workloads. The page-buffer is also similar to the P-cache on UltraSPARC-III [5] and the Assist Cache on HP-PA7200 [6]. The main difference is the fetch granularity. In the page-buffer 2KB or larger blocks are fetched; in UltraSPARC-III and HP-PA7200 64 and 32 byte blocks (respectively) are fetched. But the structures on US-III or PA7200 may be adapted to work like the page-buffer, perhaps by issuing multiple requests.
Simulation Results
Prefetching pages into the page-buffer as outlined above has the potential to significantly reduce cache miss penalty. However, since entire pages are fetched in but only a few lines in each page are actually used, it could significantly raise bandwidth demands. It would be best if prefetch could bring in only the needed lines; however, since we know the location of the needed lines only at page granularity, the choice boils down to whether it is better to bring in the full page in advance or to suffer the miss penalties that currently result from the needed data not being in cache. To evaluate the effectiveness of our page prefetching scheme in improving application performance, we have implemented an infrastructure to insert prefetches into the database trace and to simulate a page-buffer on an UltraSPARC simulator. Early results look reasonably promising; we are in the process of consolidating our results and simulating a wide range of design options. We hope to have all the results in time for the workshop.
References
[1] L.A. Barroso, K. Gharachorloo and E.D. Bugnion. Memory System Characterization of Commercial Workloads. In Proceedings of the 25th Intl. Symp. on Computer Architecture.
[2] R.F. Cmelik and D. Keppel. Shade: A Fast Instruction-Set Simulator for Execution Profiling. In Proceedings of SIGMETRICS '94.
[3] R. Ramakrishnan. Database Management Systems. McGraw-Hill 1997.
[4] N.P. Jouppi. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers. In Proceedings of the 17th Intl. Symp. on Computer Architecture.
[5] Gary Lauterbach. UltraSPARC-III -- Scalable high clock-rate SPARC processor. Presentation at Microprocessor Forum, 1997.
[6] K.K. Chan, C.C. Hay, J.R. Keller, G.P. Kurpanek, F.X. Schumacher and J. Zheng. Design of the HP PA7200. Hewlett-Packard Journal, Feb '96.
The number of useful instructions per cycle provided by the fetch unit is broadly determined by three factors: the branch prediction accuracy, the instruction cache hit rate, and the number of instructions provided by the fetch unit on each access. Clearly, many things can go wrong. Branch mispredictions cause the fetch engine to provide wrong-path instructions to the processor. Instruction cache misses stall the fetch engine, interrupting the supply of instructions to the processor. Finally, the execution of non-contiguous basic blocks prevents the fetch unit from providing a full width of instructions.
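The interaction of the three factors can be illustrated with a back-of-envelope model. The formula below is our own sketch, not a model from the paper, and all the rates and penalties plugged in are illustrative.

```python
# Back-of-envelope sketch of useful fetch bandwidth as limited by the
# three factors above: fetch width vs. run length between taken
# branches, I-cache misses, and branch mispredictions. The formula and
# all parameter values are illustrative assumptions.

def fetch_ipc(width, run_length, miss_rate, miss_penalty,
              mispred_rate, mispred_penalty):
    """Useful instructions fetched per cycle.

    width / run_length  -- fetch width and avg instructions between
                           taken branches (the smaller one limits fetch)
    miss_rate, mispred_rate -- events per fetched instruction
    penalties               -- stall cycles per event
    """
    per_access = min(width, run_length)   # taken branches cap each fetch
    cycles_per_inst = (1.0 / per_access
                       + miss_rate * miss_penalty         # I-cache stalls
                       + mispred_rate * mispred_penalty)  # wrong-path loss
    return 1.0 / cycles_per_inst

# A 16-wide fetch unit limited by 11-instruction runs between taken
# branches, with modest miss and misprediction rates:
print(fetch_ipc(16, 11, 0.01, 20, 0.005, 10))
```

Even this crude model shows why code layout matters: doubling the run length between taken branches raises the ceiling on every other improvement, which is the lever the paper's reordering scheme pulls.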
Much work has been done in the past to address these problems. Branch effects have been addressed with techniques to improve the branch prediction accuracy and to predict multiple branches per cycle. Instruction cache misses have been addressed with software and hardware techniques. Software solutions include code reordering based on procedure placement or basic block mapping, either procedure oriented [PH90] or using a global scope [Hwu89,To95]. Hardware solutions include set associative caches, hardware prefetching, victim caches and other techniques. Finally, the number of instructions provided by the fetch unit each cycle can also be improved with software or hardware techniques. Software solutions include trace scheduling, and superblock scheduling. Hardware solutions include branch address caches, collapsing buffers and trace caches [Rot96].
While all these techniques have vastly improved the performance of superscalar I-fetch units, they have largely been focused on, and evaluated with, engineering workloads. Unfortunately, there is growing evidence that popular commercial workloads present a more challenging environment for aggressive instruction fetching.
Indeed, recent studies of database performance on current processors have yielded useful insights. These studies show that commercial workloads do not behave like scientific and engineering codes [May94]: they execute fewer loops and have many procedure calls, which leads to large instruction footprints. The analyses, however, are not detailed enough to show how to optimize these workloads for improved I-fetch engine performance.
The work in this paper focuses on this issue. We proceed in three steps. First, we study the internal structure of a Relational Database Management System and the execution model implemented. Then, we characterize the locality patterns of database kernel code and find frequently executed paths. The database kernel used is PostgreSQL [Sto91]. This characterization shows that only 12% of the code is ever referenced. Furthermore, our data shows that there is significant locality, as 0.7% of the static basic blocks accumulate 90% of the references, and that the execution patterns are quite deterministic, with 80% of the static basic blocks behaving in a fixed way.
Second, we use this information to propose an algorithm that reorders the layout of the basic blocks in the database kernel for improved I-fetch. By mapping the basic blocks in execution order, we increase the potential bandwidth of a sequential fetch unit, increasing the number of instructions executed between taken branches. We also map the most frequently referenced sequences into a reserved area of the cache space, and place other popular sequences next to equally popular ones, avoiding many conflict misses.
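The chaining step of such a layout algorithm can be sketched with the classic profile-guided heuristic in the style of Pettis and Hansen [PH90]; the paper's actual algorithm may differ, and the edge-profile format below is an illustrative assumption.

```python
# Greedy sketch of profile-guided basic-block chaining (Pettis-Hansen
# style, simplified): process branch edges hottest-first and glue blocks
# into chains so the hot successor becomes the fall-through path.
# The edge-profile format is an illustrative assumption.

def layout(edges):
    """edges: {(src_block, dst_block): taken_count}.

    Returns a list of chains; blocks inside a chain are laid out
    contiguously in memory, in chain order.
    """
    chains = {}                      # block -> the chain containing it
    for (src, dst), _ in sorted(edges.items(), key=lambda e: -e[1]):
        a = chains.setdefault(src, [src])
        b = chains.setdefault(dst, [dst])
        # Merge only when src ends one chain and dst begins another, so
        # this hot edge becomes a straight-line fall-through.
        if a is not b and a[-1] == src and b[0] == dst:
            a.extend(b)
            for blk in b:
                chains[blk] = a
    # Collect the distinct chains, preserving discovery order.
    seen, result = set(), []
    for c in chains.values():
        if id(c) not in seen:
            seen.add(id(c))
            result.append(c)
    return result
```

Emitting the hottest chains first then packs the most referenced sequences together at the front of the text segment, the placement the paragraph above reserves a cache region for.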
Finally, we evaluate our scheme via simulations. Our results show a miss reduction of 60-98% for realistic instruction cache sizes and a doubling of the number of instructions executed between taken branches to over 22. As a consequence, a 16 instruction wide sequential fetch unit using a perfect branch predictor increases the fetch bandwidth from 5.6 to 10.6 instructions per cycle when using our proposed code layout.
The software scheme that we propose combines well with hardware schemes like a Trace Cache. The fetch bandwidth for a 256 entry Trace Cache improves from 8.6 to 12.1 when combined with our software approach. This suggests that commercial workloads may be amenable to the aggressive instruction fetch of future superscalars.
[Hwu89] Wen-mei Hwu and Pohua P. Chang, Achieving High Instruction Cache Performance with an Optimizing Compiler, Proceedings of the 16th Annual Intl. Symposium on Computer Architecture, pp 242-251, June 1989.
[May94] A.M. Maynard, C.M. Donnelly and B.R. Olszewski, Contrasting Characteristics and Cache Performance of Technical and Multi-User Commercial Workloads. Proceedings of the Sixth Intl. Conference on Architectural Support for Programming Languages and Operating Systems, pp 145-156, October 1994.
[PH90] Karl Pettis and Robert C. Hansen, Profile Guided Code Positioning, Proc. ACM SIGPLAN'90 Conf. on Programming Language Design and Implementation, pp 16-27, June 1990.
[Rot96] E. Rotenberg, S. Bennett and J.E. Smith, Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching. Proceedings of the 29th Annual ACM/IEEE International Symposium on Microarchitecture, pp 24-34, December 1996.
[Sto91] M. Stonebraker and G. Kemnitz, The POSTGRES Next Generation Database Management System. Communications of the ACM, October 1991.
[To95] Josep Torrellas, Chun Xia and Russell Daigle, Optimizing Instruction Cache Performance for Operating System Intensive Workloads, Proceedings of the 1st Intl. Conference on High Performance Computer Architecture, pp 360-369, January 1995.
We include in our studies not only the Web server itself, but also the server applications that run under its control as CGI scripts or Java Servlets. An UltraSPARC-II is used to run the server program in the UNIX environment. Dynamic requests are observed to have a higher CPI than static requests, with external (L2) cache misses the major factor we could attribute this to. We also measure the other factors that affect performance for the different request types, such as pipeline stalls and cache misses.
In a latency-focused study of web servers, the primary metric is the mean time to complete a single end-to-end transaction between a web client and the web server. The results show that as the secondary cache size is increased, the latency of a single web transaction decreases; this improvement comes faster as the server's memory requirement for a given workload grows, and the maximum sustainable throughput increases along with latency. On the server side, the latency of a web "get" transaction includes: 1. receiving a client request, 2. getting the requested data ("html" file) from memory, and 3. sending the data out to the network adapter using the TCP/IP protocol. The network I/O performance is directly affected by the up-front latency involved in the initial processing of the client request, which includes establishing a new connection and getting the required data from memory. Given sufficient network I/O bandwidth on the web server, the latency of a web transaction thus mostly depends (excluding the "think" time required by clients) on the resulting CPI (Cycles per Instruction) and the L2 MPI (L2 Misses per Instruction) on the server for a given workload.
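The claimed dependence of transaction latency on CPI and L2 MPI can be made concrete with a first-order sketch. All parameter values below are illustrative assumptions, not SPECweb96 measurements from the study.

```python
# First-order sketch of server-side web transaction latency as a
# function of CPI and L2 MPI: CPI decomposes into a base CPI plus an
# L2-miss stall component, and latency is path length times CPI over
# clock rate. All numbers are illustrative assumptions.

def transaction_latency_us(path_length, base_cpi, l2_mpi, l2_penalty,
                           clock_mhz):
    """Server-side latency of one 'get', ignoring client think time.

    path_length -- instructions executed per transaction
    base_cpi    -- CPI assuming a perfect L2
    l2_mpi      -- L2 misses per instruction
    l2_penalty  -- stall cycles per L2 miss
    """
    cpi = base_cpi + l2_mpi * l2_penalty
    return path_length * cpi / clock_mhz   # cycles / MHz -> microseconds

# Halving the L2 MPI (e.g. by growing the L2) shrinks per-transaction
# latency, which in turn raises the sustainable throughput:
print(transaction_latency_us(50_000, 1.0, 0.02, 60, 300))
print(transaction_latency_us(50_000, 1.0, 0.01, 60, 300))
```

This is why the study tracks MPI and CPI together across L2 sizes: the MPI term is the lever the cache hierarchy controls, while path length and base CPI are fixed by the software.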
The SPECweb96 workload is used in this study to analyze the L2 access characteristics in the web server - the web server running the Microsoft Windows NT 5.0 Beta operating system with Microsoft Internet Information Server 5.0 for providing web services to clients. The web server behavior under the SPECweb96 workload reveals that both CPI and MPI decrease with increased L2 size. While this observation agrees with our expectation, this paper also looks at the cache and processor-to-memory bus accesses at a finer granularity, in terms of access and transaction patterns, and attempts to understand the effects of MPI on CPI at various levels of load produced by the workload.
It should be noted that the results presented in this paper do not in any way represent actual SPECweb96 benchmark performance for the hardware and software combination used in this study. Although a synthetic benchmark is used, the results of the analysis should be applicable to real-life static HTTP workloads on web server platforms of similar architecture.
The primary goal of the SPC is to serve as a catalyst for performance improvement in storage subsystems. It works to foster the open exchange of ideas and information, and to ensure fair and vigorous competition between vendors as a means of improving the products and services available to the public.
In support of its goals the SPC is developing benchmarks focused on storage subsystems. These subsystems include storage targets, as well as the adapters, controllers, and networks that connect storage devices to the computer system. Currently the SPC has 17 member companies.
The talk will focus on four topics: first, to introduce the SPC to people within the industry who may not be aware of our work; second, to outline why a storage benchmark is important to our industry; third, to describe where the SPC stands in its work to develop a true storage subsystem benchmark; and fourth, to solicit industry comments and participation in our work.