Introduction

Overview

Shark is a tool for performance understanding and optimization. Why is it called “Shark?” Performance tuning requires a hunter’s mentality, and no animal is as pure in this quest as a shark. A shark is also an expert in his field — one who uses all potential resources to achieve his goals. The name “Shark” embodies the spirit and emotion you should have when tuning your code.

To help you analyze the performance of your code, Shark allows you to profile the entire system (kernel and drivers as well as applications). At the simplest level, Shark profiles the system while your code is running to see where time is being spent. It can also produce profiles of hardware and software performance events such as cache misses, virtual memory activity, memory allocations, function calls, or instruction dependency stalls. This information is an invaluable first step in your performance tuning effort so you can see which parts of your code or the system are the bottlenecks.

In addition to showing you where time is being spent, Shark can give you advice on how to improve your code. Shark is capable of identifying many common performance pitfalls and visually presents the costs of these problems to you.

Philosophy

The first and most important step when optimizing your code is to determine what to optimize. In a program of moderate complexity, there can be thousands of different code paths. Optimizing all of them is normally impractical due to deadlines and limited programmer resources. There are also more subtle tradeoffs between optimized code and portability and maintenance that limit candidates for optimization.

Here are a few general guidelines for finding a good candidate for optimization:

  1. It should be time-critical. This is generally any operation that is perceptibly slow; the user has to wait for the computer to finish doing something before continuing. Optimizing functionality that is already faster than the user can perceive is usually unnecessary.

  2. It must be relevant. Optimizing functionality that is rarely used is usually counter-productive.

  3. It shows up as a hot spot in a time profile. If there is no obvious hot spot in your code or you are spending a lot of time in system libraries, performance is more likely to improve through high-level improvements (architectural changes).

Low-level optimizations typically focus on a single segment of code and make it a better match to the hardware and software systems it is being run on. Examples of low-level optimizations include using vector or cache hint instructions. High-level optimizations include algorithmic or other architectural changes to your program. Examples of high-level optimizations include data structure choice (for example, switching from a linked list to a hash-table) or replacing calls to computationally expensive functions with a cache or lookup table.

Remember, it is critical to profile before investing your time and effort in optimization. Sadly, many programmers invest prodigious amounts of effort optimizing what their intuition tells them is the performance-critical section of code only to realize no performance improvement. Profiling quickly reveals that bottlenecks often lie far from where programmers might assume they are. Using Shark, you can focus your optimization efforts on both algorithmic changes and tuning performance-critical code. Often, even small changes to a critical piece of code can yield large overall performance improvements.

By default, Shark creates a profile of execution behavior by periodically interrupting each processor in the system and sampling the currently running process, thread, and instruction address as well as the function callstack. Along with this contextual information, Shark can record the values of hardware and software performance counters. Each counter is capable of counting a wide variety of performance events. In the case of processor and memory controller counters, these include detailed, low-level information that is otherwise impossible to know without a simulator. The overhead for sampling with Shark is extremely low because all sample collection takes place in the kernel and is based on hardware interrupts. A typical sample incurs an overhead on the order of 20μs. This overhead can be significantly larger if callstack recording is enabled and a virtual memory fault is incurred while saving the callstack. Time profiles generated by Shark are statistical in nature; they give a representative view of what was running on the system during a sampling session . Samples can include all of the processes running on the system from both user and supervisor code, or samples can be limited to a specific process or execution state. Shark’s sampling period can be an arbitrary time interval (timer sampling). Shark also has the ability to use a performance event as the sampling trigger (event sampling). Using event sampling, it is possible to associate performance events such as cache misses or instruction stalls with the code that caused them. Additionally, Shark can generate exact profiles for specific function calls or memory allocations.

Organization of This Document

This manual is organized into four major sections, each consisting of two or three chapters, plus several appendices. Here is a brief “roadmap” to help you orient yourself: